updates to wording, source code page, signature

- docs/data/presse.parquet.sh +0 −2
- docs/index.md +25 −4
- docs/source.md +25 −0
- observablehq.config.ts +4 −2
docs/data/presse.parquet.sh CHANGED

```diff
@@ -1,5 +1,3 @@
-# file_id, ocr, title, date, author, page_count, word_count, character_count
-
 echo """
 CREATE TABLE presse AS (
   SELECT title
```
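The diff only shows the first lines of this loader. As a rough sketch of the pattern it follows (generate SQL in the shell, then pipe it into the `duckdb` CLI): the input glob, column list, and output options below are assumptions for illustration, not the real script.

```shell
# Hypothetical sketch of a shell data loader: build a SQL script as a
# string, which would then be piped into DuckDB. Paths and columns are
# assumed, not taken from the actual presse.parquet.sh.
sql=$(cat <<'EOF'
CREATE TABLE presse AS (
  SELECT title, date
  FROM read_parquet('French-PD-Newspapers/*.parquet')
);
COPY presse TO 'presse.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
EOF
)
# The real loader would run something like: printf '%s\n' "$sql" | duckdb
printf '%s\n' "$sql"
```

The point of the `COPY … (FORMAT 'parquet')` step is that DuckDB writes a single compact parquet file, which the framework then serves as the page's data.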
docs/index.md CHANGED

````diff
@@ -1,10 +1,14 @@
-#
+# French public domain newspapers
 
-A
+## A quick glance at 3 million periodicals
 
-
+<p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>
 
-
+This fascinating new dataset just dropped on Hugging Face: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗. It references about **3 million newspapers and periodicals**, with their full text OCR’ed and some metadata.
+
+The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (about 200 GB altogether) and combines a minimal subset of their metadata (most importantly the title and year of publication, without the text contents) into a single, highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
+
+The resulting file is small enough (about 8 MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -39,3 +43,20 @@ Plot.plot({
 ],
 });
 ```
+
+<style>
+
+.signature a[href] {
+  color: var(--theme-foreground)
+}
+
+.signature {
+  text-align: right;
+  font-size: small;
+}
+
+.signature::before {
+  content: "◼︎ ";
+}
+
+</style>
````
docs/source.md ADDED

````diff
@@ -0,0 +1,25 @@
+# Source code
+
+This project relies on a **data loader** that reads all the source files and outputs a single summary file, minimized to contain only a subset of the source information:
+
+```js
+import hljs from "npm:highlight.js";
+```
+
+`data/presse.parquet.sh`
+
+```js
+const pre = display(document.createElement("pre"));
+FileAttachment("data/presse.parquet.sh")
+  .text()
+  .then(
+    (text) => (pre.innerHTML = hljs.highlight(text, { language: "bash" }).value)
+  );
+```
+
+This is the file that the other pages load with DuckDB, as a FileAttachment:
+
+```js echo run=false
+import { DuckDBClient } from "npm:@observablehq/duckdb";
+const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
+```
````
observablehq.config.ts CHANGED

```diff
@@ -1,3 +1,5 @@
 export default {
-
-
+  title: "FPDN",
+  footer: `<script type="module" src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.9/iframeResizer.contentWindow.min.js"></script>`,
+  pager: false,
+};
```