19c RoBERTa v5 Newspaper Language Model (In Development)

This repository will host the upcoming RoBERTa v5 language model, which will be pretrained on a newly reconstructed, high-resolution 19th-century American newspaper corpus.

The Core Strategy: Replacing Noisy OCR

The existing Tesseract OCR from Chronicling America (loc.gov) suffers from error rates of 12–40% due to degraded microfilms, mixed case, long-s (ſ) ligatures, and poor line segmentation.

To build a clean corpus for RoBERTa v5, we bypass the noisy original OCR and re-transcribe the original page images with a custom pipeline:

  1. High-Resolution Harvest: Digitized pages are downloaded at 50% resolution (doubling linear pixel density for small body text).
  2. YOLO Column Segmentation: A fine-tuned YOLO detector segments the newspaper pages into clean column crops.
  3. Strip Tiling: Columns are sliced into overlapping horizontal strips to keep text density within the visual attention limits of the vision-language model (VLM).
  4. Local VLM Transcription: Strips are transcribed with Gemma MoE via LM Studio APIs (achieving 90–95% transcription accuracy).
  5. Deduplication & Merge: Strip transcripts are reassembled using sequence matching and de-hyphenation.
  6. Quality Gating via RoBERTa v4:
    • Perplexity (PPL) Filtering: Assembled column texts are evaluated by the RoBERTa v4 language model. High-perplexity columns (indicating gibberish, heavy distortion, or transcription failure) are automatically discarded.
    • Looping & Hallucination Mitigation: Microfilm defects or visual artifacts can cause the VLM to repeat paragraphs (looping), output text in blank areas, or generate plausible but hallucinated paragraphs. RoBERTa v4's perplexity scores easily flag and filter out these repeated blocks, empty segments, and anomalies.
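
Steps 3 and 5 can be illustrated with a minimal sketch using only the Python standard library. It is not the pipeline's actual code: the function names, the strip height, the overlap, and the matching thresholds are all hypothetical values chosen for illustration, and `difflib.SequenceMatcher` is assumed as one plausible way to do the "sequence matching" mentioned above.

```python
import re
from difflib import SequenceMatcher


def strip_spans(height, strip_h=512, overlap=96):
    """Return (top, bottom) pixel spans that cover a column of `height` rows.

    Consecutive spans share `overlap` rows, so a text line cut by one strip
    boundary can appear whole in the neighbouring strip.
    strip_h and overlap are illustrative defaults, not the pipeline's settings.
    """
    if not 0 <= overlap < strip_h:
        raise ValueError("overlap must be non-negative and smaller than strip_h")
    spans, top = [], 0
    while True:
        bottom = min(top + strip_h, height)
        spans.append((top, bottom))
        if bottom >= height:
            return spans
        top = bottom - overlap


def merge_transcripts(prev, nxt, window=200, min_match=12):
    """Join two strip transcripts, dropping text duplicated in the overlap.

    Only the last `window` characters of `prev` and the first `window`
    characters of `nxt` are compared. If no shared run of at least
    `min_match` characters is found, the texts are simply concatenated.
    """
    tail, head = prev[-window:], nxt[:window]
    matcher = SequenceMatcher(None, tail, head, autojunk=False)
    m = matcher.find_longest_match(0, len(tail), 0, len(head))
    if m.size < min_match:
        return prev.rstrip() + "\n" + nxt.lstrip()
    # Position in `prev` where the shared text begins.
    cut = len(prev) - len(tail) + m.a
    # Keep `prev` up to the shared text, then `nxt` from the shared text on.
    return prev[:cut] + nxt[m.b:]


def dehyphenate(text):
    """Rejoin words split by a hyphen at a line break.

    Caveat: this also joins genuine compounds that happen to break at the
    hyphen, so a real pipeline would likely need a dictionary check.
    """
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)
```

For a 1,000-pixel column with these defaults, `strip_spans(1000)` yields three strips with a 96-row overlap between neighbours. Matching on a bounded window keeps the comparison cheap and avoids accidentally aligning on a phrase that recurs elsewhere in the column.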
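
The perplexity gate in step 6 can be sketched as follows. The source does not say how RoBERTa v4 scores a column, so this example assumes per-token log-probabilities are already available (for a masked LM, typically obtained by masking each token in turn, which gives a pseudo-perplexity). The threshold `max_ppl` and the `min_words` floor are hypothetical placeholders that would need tuning on real data.

```python
import math


def perplexity(token_logprobs):
    """Exponential of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token score")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_column(text, token_logprobs, max_ppl=40.0, min_words=5):
    """Return True if a column transcript passes the quality gate.

    max_ppl and min_words are illustrative values, not the pipeline's settings.
    Near-empty columns are rejected outright; the rest are kept only when
    their perplexity stays at or below the threshold.
    """
    if len(text.split()) < min_words or not token_logprobs:
        return False
    return perplexity(token_logprobs) <= max_ppl
```

A column whose tokens each receive probability 0.5 has a perplexity of 2 and passes, while one whose tokens each receive probability 0.001 has a perplexity of 1,000 and is discarded.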

Pretraining Corpus: The 19c Newspaper Harvest

The source corpus has been harvested and banked in ambrosfitz/19c_newspapers_images_alto, which contains 1,361 Parquet shards (~68,050 images).

Decade-by-Decade Banked Image Counters

| Decade | Banked / Target Images (Redis) | Shards on HF | Est. Images | Status |
|--------|--------------------------------|--------------|-------------|--------|
| 1800s | 1,115 / 1,040 | 28 | 1,400 | Satisfied ✓ |
| 1810s | 1,178 / 1,040 | 28 | 1,400 | Satisfied ✓ |
| 1820s | 1,651 / 1,560 | 41 | 2,050 | Satisfied ✓ |
| 1830s | 2,647 / 2,600 | 25 | 1,250 | Satisfied ✓ |
| 1840s | 5,850 / 4,160 | 39 | 1,950 | Satisfied ✓ |
| 1850s | 7,889 / 7,800 | 93 | 4,650 | Satisfied ✓ |
| 1860s | 10,486 / 10,400 | 169 | 8,450 | Satisfied ✓ |
| 1870s | 7,919 / 7,800 | 172 | 8,600 | Satisfied ✓ |
| 1880s | 10,519 / 10,400 | 215 | 10,750 | Satisfied ✓ |
| 1890s | 10,421 / 10,400 | 551 | 27,550 | Satisfied ✓ |
| Total | 68,050 Banked | 1,361 | 68,050 | ALL COMPLETED |
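
The "Est. Images" column appears to be the shard count multiplied by a fixed shard size. The short check below reproduces the totals under that assumption. The figure of 50 images per shard is inferred from the table rows rather than stated anywhere in the source, and the variable names are illustrative.

```python
# Assumption inferred from the table: each row's estimate equals 50 x shards.
IMAGES_PER_SHARD = 50

# Shard counts per decade, copied from the "Shards on HF" column.
shards_by_decade = {
    "1800s": 28, "1810s": 28, "1820s": 41, "1830s": 25, "1840s": 39,
    "1850s": 93, "1860s": 169, "1870s": 172, "1880s": 215, "1890s": 551,
}


def estimated_images(shards, per_shard=IMAGES_PER_SHARD):
    """Estimate the image count from a shard count."""
    return shards * per_shard


total_shards = sum(shards_by_decade.values())
total_images = estimated_images(total_shards)
print(total_shards, total_images)  # 1361 68050
```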

How to Load (Once Released)

Once training is complete, the model will be loadable using the Hugging Face transformers library:

from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "ambrosfitz/19c_roberta_v5_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
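
Once the checkpoint exists, a masked-word query could look like the sketch below. It uses the standard `fill-mask` pipeline from transformers. The helper names are hypothetical, and the second function will fail until the model is actually published under the given `repo_id`.

```python
def mask_word(sentence, word, mask_token):
    """Replace the first occurrence of `word` with the tokenizer's mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(repo_id, sentence, word, top_k=5):
    """Return (token, score) pairs the model proposes for the masked word."""
    # Imported lazily so mask_word can be used without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=repo_id)
    # Read the mask token from the tokenizer instead of hard-coding it.
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=top_k)]
```

After release, a call such as `top_predictions("ambrosfitz/19c_roberta_v5_newspaper", "The steamer arrived at the port.", "port")` would return the model's five most likely fillers with their scores.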