---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.jsonl.gz
tags:
- pdf
- ocr
- chandra
- chandra-ocr-2
- markdown
- html
- hf-jobs
- uv-script
---
# PDF OCR with Chandra OCR 2

This output bundle stores OCR results for PDFs referenced by a supplied URL list, produced with `datalab-to/chandra-ocr-2`.
## Summary

- Output bucket: `hf://buckets/sroecker/pdf-chandra-ocr`
- Source PDF URLs in input list: 111
- Processed inputs recorded in `state/processed_inputs.txt`: 111
- Successes: 111
- Partial successes: 0
- Errors: 0
- Next shard index: 12
- Updated at: 2026-04-15T18:56:12.558178+00:00
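The same counters are persisted in `state/summary.json` (described under Files below). A hypothetical shape for that file, with field names assumed rather than taken from the script, might be:

```json
{
  "processed_inputs": 111,
  "successes": 111,
  "partial_successes": 0,
  "errors": 0,
  "next_shard_index": 12,
  "updated_at": "2026-04-15T18:56:12.558178+00:00"
}
```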
## Files

- `data/part-*.jsonl.gz`: OCR result shards, one JSON object per PDF
- `state/processed_inputs.txt`: completed PDF URLs used for resume (see the sketch below)
- `state/summary.json`: aggregate counters and bookkeeping
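A minimal sketch of how a resume pass could consume `state/processed_inputs.txt`, assuming one completed URL per line (the variable names and example URLs are illustrative, not from the script):

```python
from pathlib import Path

# Stand-in for the supplied URL list; the real job reads it from --pdf-urls-url.
all_urls = [
    "https://example.org/paper-a.pdf",
    "https://example.org/paper-b.pdf",
]

# One completed PDF URL per line; anything in this set is skipped on resume.
processed = set(Path("state/processed_inputs.txt").read_text().splitlines())

remaining = [url for url in all_urls if url not in processed]
print(f"{len(remaining)} of {len(all_urls)} URLs still need OCR")
```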
Each record includes:
- `num_pages`: total number of pages in the source PDF
- `num_pages_processed`: number of pages actually sent to OCR
- `pdf_exceeds_page_limit`: whether the PDF had more pages than the configured OCR cap
- `max_pages_per_paper`: configured OCR page cap for the run
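Putting these together, a record might look like the following (values and the exact field set are illustrative; `source_id`, `pdf_url`, and `markdown` appear in the loading example below):

```json
{
  "source_id": "paper-a",
  "pdf_url": "https://example.org/paper-a.pdf",
  "markdown": "# Title\n\nBody text...",
  "num_pages": 34,
  "num_pages_processed": 30,
  "pdf_exceeds_page_limit": true,
  "max_pages_per_paper": 30
}
```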
## Load the results

```python
from datasets import load_dataset

dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["source_id"])
print(dataset[0]["pdf_url"])
print(dataset[0]["markdown"][:1000])
```
## Job config

- Prompt type: `ocr_layout`
- Page batch size: 28
- Max output tokens: 12384
- Max model length: 18000
- GPU memory utilization: 0.85
- Minimum download request interval: 0.0 seconds
- Max pages per paper sent to OCR: 30
- Bucket backend: hf-cli
- Paginate output: False
- Include headers/footers: False
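Several of these values correspond to standard vLLM engine and sampling parameters (the job runs on a vLLM image, per the reproduction command below). A minimal sketch of how they might be wired up, assuming the script drives vLLM's Python API directly:

```python
from vllm import LLM, SamplingParams

# Engine settings taken from the job config above.
llm = LLM(
    model="datalab-to/chandra-ocr-2",
    max_model_len=18000,           # "Max model length"
    gpu_memory_utilization=0.85,   # "GPU memory utilization"
)

# "Max output tokens" caps the generated markdown/html per page batch.
sampling = SamplingParams(max_tokens=12384)
```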
## Reproduction

```bash
hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset hf://buckets/sroecker/pdf-chandra-ocr \
  --output-bucket hf://buckets/sroecker/pdf-chandra-ocr \
  --pdf-urls-url https://.../pdf_urls.txt
```
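Once submitted, the job can be checked with the `hf jobs` CLI (a sketch; the job id is printed at submission time):

```bash
hf jobs ps               # list your running jobs
hf jobs logs <job-id>    # follow the OCR job's logs
```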