	---
	configs:
	- config_name: default
	  data_files:
	  - split: train
	    path: data/*.jsonl.gz
	tags:
	- pdf
	- ocr
	- chandra
	- chandra-ocr-2
	- markdown
	- html
	- hf-jobs
	- uv-script
	---

	# PDF OCR with Chandra OCR 2

	This output bundle stores OCR results for the PDFs listed in a supplied URL list. The OCR was produced with [datalab-to/chandra-ocr-2](https://huggingface.co/datalab-to/chandra-ocr-2).

	## Summary

	- Output bucket: `hf://buckets/sroecker/pdf-chandra-ocr`
	- Source PDF URLs in input list: 111
	- Processed inputs recorded in `state/processed_inputs.txt`: 111
	- Successes: 111
	- Partial successes: 0
	- Errors: 0
	- Next shard index: 12
	- Updated at: 2026-04-15T18:56:12.558178+00:00

	## Files

	- `data/part-*.jsonl.gz`: OCR result shards, one JSON object per PDF
	- `state/processed_inputs.txt`: URLs of completed PDFs, used to resume an interrupted run
	- `state/summary.json`: aggregate counters and bookkeeping
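The resume file makes it possible to skip PDFs that an earlier run already finished. The following is a minimal sketch of that idea, not the script's actual logic: it assumes `state/processed_inputs.txt` holds one URL per line (the format is not specified here), and `pending_urls` is a hypothetical helper name. The URLs are made up.

```python
from pathlib import Path
import tempfile


def pending_urls(all_urls, processed_path):
    """Return the URLs from all_urls that are not listed in the resume file."""
    path = Path(processed_path)
    if not path.exists():
        return list(all_urls)  # first run: nothing has been processed yet
    # Assumption: one completed URL per line; blank lines are ignored.
    done = {line.strip() for line in path.read_text().splitlines() if line.strip()}
    return [url for url in all_urls if url not in done]


# Usage with made-up URLs and a temporary resume file.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "processed_inputs.txt"
    state_file.write_text("https://example.org/a.pdf\n")
    remaining = pending_urls(
        ["https://example.org/a.pdf", "https://example.org/b.pdf"], state_file
    )
    print(remaining)
```

Keeping the completed URLs in a set preserves the order of the input list while making each membership check cheap.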

	Each record includes the following page-count fields, among others:

	- `num_pages`: total number of pages in the source PDF
	- `num_pages_processed`: number of pages actually sent to OCR
	- `pdf_exceeds_page_limit`: whether the PDF had more pages than the configured OCR cap
	- `max_pages_per_paper`: configured OCR page cap for the run
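The four fields above are related, and a small sketch can make the relationship concrete. It assumes the number of processed pages is simply the page count clipped to the cap, which is how the field descriptions read but is not confirmed by the script itself. `page_accounting` is a hypothetical helper.

```python
def page_accounting(num_pages, max_pages_per_paper):
    """Derive the page-related record fields for one PDF (illustrative only)."""
    # Assumption: pages beyond the cap are skipped, so at most
    # max_pages_per_paper pages are sent to OCR.
    processed = min(num_pages, max_pages_per_paper)
    return {
        "num_pages": num_pages,
        "num_pages_processed": processed,
        "pdf_exceeds_page_limit": num_pages > max_pages_per_paper,
        "max_pages_per_paper": max_pages_per_paper,
    }


short_pdf = page_accounting(12, 30)  # under the cap
long_pdf = page_accounting(45, 30)   # over the cap
print(short_pdf["num_pages_processed"], short_pdf["pdf_exceeds_page_limit"])
print(long_pdf["num_pages_processed"], long_pdf["pdf_exceeds_page_limit"])
```

Under this assumption, a record with `pdf_exceeds_page_limit` set to true contains OCR output for only part of the source PDF.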

	## Load the results

	```python
	from datasets import load_dataset

	dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")
	print(dataset[0]["source_id"])
	print(dataset[0]["pdf_url"])
	print(dataset[0]["markdown"][:1000])
	```
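If the `datasets` library is not available, a downloaded shard can also be read with the standard library alone, since each shard is gzip-compressed JSON Lines. This is a minimal sketch: `read_shard` is a hypothetical helper, and the demo writes a tiny stand-in shard with made-up values rather than touching the real data.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_shard(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo on a stand-in shard; real shards match data/part-*.jsonl.gz.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "part-demo.jsonl.gz"
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        handle.write(json.dumps({"source_id": "demo", "markdown": "# Title"}) + "\n")
    records = list(read_shard(shard))
    print(records[0]["source_id"])
```

Reading line by line keeps memory use bounded to a single record at a time, which matters when the `markdown` field of a long PDF is large.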

	## Job config

	- Prompt type: `ocr_layout`
	- Page batch size: 28
	- Max output tokens: 12384
	- Max model length: 18000
	- GPU memory utilization: 0.85
	- Minimum download request interval: 0.0 seconds
	- Max pages per paper sent to OCR: 30
	- Bucket backend: hf-cli
	- Paginate output: False
	- Include headers/footers: False
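The two token limits above imply an input budget per request. The sketch below works through the arithmetic. It assumes the max model length bounds prompt and generated tokens together, as vLLM's `max_model_len` does; how the script actually splits that budget is not stated here.

```python
# Values copied from the job config above.
MAX_MODEL_LEN = 18000      # total context: prompt (text + image tokens) + output
MAX_OUTPUT_TOKENS = 12384  # upper bound on generated tokens per request

# Assumption: whatever is not reserved for output remains for the prompt.
input_budget = MAX_MODEL_LEN - MAX_OUTPUT_TOKENS
print(input_budget)  # 5616
```

Under that assumption, the prompt and page image for a request have to fit in the remaining 5616 tokens if the full output allowance is to stay available.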

	## Reproduction

	```bash
	hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
	  -s HF_TOKEN --timeout 2d \
	  ./chandra2-arxiv-ocr.py --output-dataset hf://buckets/sroecker/pdf-chandra-ocr \
	  --output-bucket hf://buckets/sroecker/pdf-chandra-ocr \
	  --pdf-urls-url https://.../pdf_urls.txt
	```
