---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.jsonl.gz
tags:
- ocr
- chandra
- chandra-ocr-2
- markdown
- html
- hf-jobs
- uv-script
---

# PDF OCR with Chandra OCR 2

This output bundle stores OCR results, produced with [datalab-to/chandra-ocr-2](https://huggingface.co/datalab-to/chandra-ocr-2), for the PDFs referenced by a supplied URL list.

## Summary

- Output bucket: `hf://buckets/sroecker/pdf-chandra-ocr`
- Source PDF URLs in input list: 111
- Processed inputs recorded in `state/processed_inputs.txt`: 111
- Successes: 111
- Partial successes: 0
- Errors: 0
- Next shard index: 12
- Updated at: 2026-04-15T18:56:12.558178+00:00

## Files

- `data/part-*.jsonl.gz`: OCR result shards, one JSON object per PDF
- `state/processed_inputs.txt`: completed PDF URLs used for resume
- `state/summary.json`: aggregate counters and bookkeeping
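
The resume behaviour implied by `state/processed_inputs.txt` can be sketched as follows. This is a minimal illustration, not the job script's actual code: it assumes the file holds one completed URL per line, and the helper names `load_processed` and `pending_urls` are hypothetical.

```python
from pathlib import Path


def load_processed(path: str) -> set[str]:
    """Return the URLs already recorded as completed (assumed: one URL per line)."""
    state_file = Path(path)
    if not state_file.exists():
        # First run: nothing has been processed yet.
        return set()
    lines = state_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_urls(all_urls: list[str], processed: set[str]) -> list[str]:
    """Keep the input order and drop every URL that is already processed."""
    return [url for url in all_urls if url not in processed]
```

With the counters reported in the summary (111 input URLs, 111 processed), a resumed run of this kind would find nothing left to do.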

Each record includes the following page-count fields:

- `num_pages`: total number of pages in the source PDF
- `num_pages_processed`: number of pages actually sent to OCR
- `pdf_exceeds_page_limit`: whether the PDF had more pages than the configured OCR cap
- `max_pages_per_paper`: configured OCR page cap for the run
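
These fields make it possible to tell whether a record covers the whole PDF. The sketch below assumes the counts are stored as integers; the helper names and the sample record are made up for illustration and are not taken from the bucket.

```python
def is_truncated(record: dict) -> bool:
    """True when fewer pages were sent to OCR than the PDF contains."""
    return record["num_pages_processed"] < record["num_pages"]


def page_coverage(record: dict) -> float:
    """Fraction of the source PDF's pages that were sent to OCR."""
    total = record["num_pages"]
    if total == 0:
        return 0.0
    return record["num_pages_processed"] / total


# Hypothetical record: a 42-page PDF processed under a 30-page cap.
example = {
    "num_pages": 42,
    "num_pages_processed": 30,
    "pdf_exceeds_page_limit": True,
    "max_pages_per_paper": 30,
}
print(is_truncated(example), round(page_coverage(example), 3))
```

Filtering on `pdf_exceeds_page_limit` alone should identify the same PDFs when the page cap is the only reason pages were skipped.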

## Load the results

```python
from datasets import load_dataset

# Replace "<dataset-id>" with the repository that holds the shards.
dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")

# Inspect the first record: identifier, source URL, and the start of the OCR markdown.
print(dataset[0]["source_id"])
print(dataset[0]["pdf_url"])
print(dataset[0]["markdown"][:1000])
```
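
A shard that has already been downloaded can also be read with the standard library alone, since each shard is gzip-compressed JSON Lines with one object per PDF. This is a sketch under that assumption; `iter_records` is a hypothetical helper and the shard path in the usage note is a placeholder.

```python
import gzip
import json
from typing import Iterator


def iter_records(shard_path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a .jsonl.gz shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `for record in iter_records("<local-shard>.jsonl.gz"): print(record["pdf_url"])` would list the source URL of every PDF in that shard.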

## Job config

- Prompt type: `ocr_layout`
- Page batch size: 28
- Max output tokens: 12384
- Max model length: 18000
- GPU memory utilization: 0.85
- Minimum download request interval: 0.0 seconds
- Max pages per paper sent to OCR: 30
- Bucket backend: hf-cli
- Paginate output: False
- Include headers/footers: False
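
The two token settings above interact. Assuming the usual vLLM convention that the maximum model length bounds prompt tokens plus generated tokens (an assumption about how the job script passes these values, not something stated here), the room left for the prompt and page image can be worked out as follows. The function name is hypothetical.

```python
MAX_MODEL_LEN = 18000      # "Max model length" from the job config
MAX_OUTPUT_TOKENS = 12384  # "Max output tokens" from the job config


def prompt_token_budget(max_model_len: int, max_output_tokens: int) -> int:
    """Tokens left for the input once the full output allowance is reserved."""
    if max_output_tokens > max_model_len:
        raise ValueError("output allowance exceeds the model length")
    return max_model_len - max_output_tokens


print(prompt_token_budget(MAX_MODEL_LEN, MAX_OUTPUT_TOKENS))  # 5616
```

Under that assumption, a page whose prompt and image tokens exceed 5616 could not also receive the full 12384-token output allowance.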

## Reproduction

```bash
hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset hf://buckets/sroecker/pdf-chandra-ocr \
  --output-bucket hf://buckets/sroecker/pdf-chandra-ocr \
  --pdf-urls-url https://.../pdf_urls.txt
```