updates to wording, source code page, signature

- docs/data/presse.parquet.sh +0 −2
- docs/index.md +25 −4
- docs/source.md +25 −0
- observablehq.config.ts +4 −2
docs/data/presse.parquet.sh CHANGED

```diff
@@ -1,5 +1,3 @@
-# file_id, ocr, title, date, author, page_count, word_count, character_count
-
 echo """
 CREATE TABLE presse AS (
   SELECT title
```
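The diff only shows the first lines of this loader. As a rough sketch of the pattern it follows (generate SQL in the shell, then pipe it into the `duckdb` CLI): the input glob, column list, and output options below are assumptions for illustration, not the real script.

```shell
# Hypothetical sketch of a shell data loader: build a SQL script as a
# string, which would then be piped into DuckDB. Paths and columns are
# assumed, not taken from the actual presse.parquet.sh.
sql=$(cat <<'EOF'
CREATE TABLE presse AS (
  SELECT title, date
  FROM read_parquet('French-PD-Newspapers/*.parquet')
);
COPY presse TO 'presse.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
EOF
)
# The real loader would run something like: printf '%s\n' "$sql" | duckdb
printf '%s\n' "$sql"
```

The point of the `COPY … (FORMAT 'parquet')` step is that DuckDB writes a single compact parquet file, which the framework then serves as the page's data.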
docs/index.md CHANGED

````diff
@@ -1,10 +1,14 @@
-#
+# French public domain newspapers
 
-A
+## A quick glance at 3 million periodicals
 
-
+<p class=signature>by <a href="https://observablehq.com/@fil">Fil</a>
 
-
+This fascinating new dataset just dropped on Hugging Face: [French public domain newspapers](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) 🤗. It references about **3 million newspapers and periodicals**, with their full text OCR’ed and some metadata.
+
+The data is stored in 320 large parquet files. The data loader for this [Observable framework](https://observablehq.com/framework) project uses [DuckDB](https://duckdb.org/) to read these files (about 200 GB altogether) and combines a minimal subset of their metadata (most importantly the title and year of publication, without the text contents) into a single, highly optimized parquet file. This takes only about 1 minute to run in a Hugging Face Space.
+
+The resulting file is small enough (about 8 MB) that we can load it in the browser and create “live” charts with [Observable Plot](https://observablehq.com/plot).
 
 In this project, I’m exploring two aspects of the dataset:
 
@@ -39,3 +43,20 @@ Plot.plot({
 ],
 });
 ```
+
+<style>
+
+.signature a[href] {
+  color: var(--theme-foreground)
+}
+
+.signature {
+  text-align: right;
+  font-size: small;
+}
+
+.signature::before {
+  content: "◼︎ ";
+}
+
+</style>
````
docs/source.md ADDED

````diff
@@ -0,0 +1,25 @@
+# Source code
+
+This project relies on a **data loader** that reads all the source files and outputs a single summary file, minimized to contain only a subset of the source information:
+
+```js
+import hljs from "npm:highlight.js";
+```
+
+`data/presse.parquet.sh`
+
+```js
+const pre = display(document.createElement("pre"));
+FileAttachment("data/presse.parquet.sh")
+  .text()
+  .then(
+    (text) => (pre.innerHTML = hljs.highlight(text, { language: "bash" }).value)
+  );
+```
+
+This is the file that the other pages load with DuckDB, as a FileAttachment:
+
+```js echo run=false
+import { DuckDBClient } from "npm:@observablehq/duckdb";
+const db = DuckDBClient.of({ presse: FileAttachment("data/presse.parquet") });
+```
````
observablehq.config.ts CHANGED

```diff
@@ -1,3 +1,5 @@
 export default {
-
-
+  title: "FPDN",
+  footer: `<script type="module" src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.9/iframeResizer.contentWindow.min.js"></script>`,
+  pager: false,
+};
```