---
language:
- en
license: mit
library_name: anthropic
tags:
- pdf
- document-parsing
- ocr
- multimodal
- equations
- table-extraction
- agent
- claude
- information-extraction
- scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
- algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  with Claude (claude-opus-4-6), aiming for extraction without
  hallucination. Agent-ready.
---
# PDF Atomic Parser
Powered by **claude-opus-4-6** (Anthropic).
## Description
A single-file Python tool for extracting structured content from complex
academic and technical PDFs. Works on documents containing:
- Mathematical equations (extracted as LaTeX)
- Data tables (extracted as Markdown + JSON)
- Algorithms and pseudocode (verbatim with language detection)
- Figures, charts, graphs, and drawings (semantic descriptions)
- Multi-column layouts, footnotes, margin notes
- 100+ page documents via automatic chunking
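
To illustrate what automatic chunking of a long document can look like, here is a minimal sketch that splits a page count into consecutive ranges. The helper name `page_chunks` and its signature are hypothetical, not taken from `pdf_atomic_parser.py`; the default of 20 pages is an assumption based on the per-call limit listed under Model Details.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, max_pages: int = 20) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (start, end) page ranges of at most max_pages.

    Hypothetical helper: the real tool may chunk differently.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(1, total_pages + 1, max_pages):
        # The last chunk is clipped to the end of the document.
        yield start, min(start + max_pages - 1, total_pages)


# A 45-page document becomes three requests.
print(list(page_chunks(45)))  # [(1, 20), (21, 40), (41, 45)]
```

Each range could then be sent as one request and the per-chunk results merged in page order.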
## Usage
```bash
pip install anthropic PyMuPDF rich tqdm
export ANTHROPIC_API_KEY="sk-ant-..."
python pdf_atomic_parser.py parse document.pdf
python pdf_atomic_parser.py atomic document.pdf --output ./results/
python pdf_atomic_parser.py extract-equations document.pdf
python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
```
## Agent Integration
```python
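from pdf_atomic_parser import AgentPDFInterface

# Create the agent-facing interface with the Opus model.
agent = AgentPDFInterface(model="opus")
# Parse the whole document.
result = agent.parse("paper.pdf")
# Pull out specific content types.
equations = agent.get_equations("paper.pdf")
tables = agent.get_tables("paper.pdf")
# Ask a free-form question about the document.
answer = agent.ask("paper.pdf", "What datasets were used?")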
```
## Model Details
| Property | Value |
|---|---|
| Underlying model | claude-opus-4-6 (Anthropic) |
| Parsing modes | native PDF, page-as-image (300 DPI) |
| Max pages per call | 20 (configurable) |
| Cache | SQLite, keyed by SHA-256 + page + model + mode |
| Output formats | JSON, Markdown, plain text |
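
The cache row above can be pictured as a SQLite table whose primary key combines the file's SHA-256 digest with the page number, model, and parsing mode. The sketch below is one possible layout under that assumption; the class `PageCache`, the table name `page_cache`, and the column names are all hypothetical and are not taken from the tool's source.

```python
import hashlib
import sqlite3
from typing import Optional


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw PDF bytes."""
    return hashlib.sha256(data).hexdigest()


class PageCache:
    """Hypothetical per-page result cache keyed by (sha256, page, model, mode)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS page_cache ("
            " sha256 TEXT NOT NULL,"
            " page INTEGER NOT NULL,"
            " model TEXT NOT NULL,"
            " mode TEXT NOT NULL,"
            " result TEXT NOT NULL,"
            " PRIMARY KEY (sha256, page, model, mode))"
        )

    def get(self, sha256: str, page: int, model: str, mode: str) -> Optional[str]:
        """Return the cached result, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT result FROM page_cache"
            " WHERE sha256 = ? AND page = ? AND model = ? AND mode = ?",
            (sha256, page, model, mode),
        ).fetchone()
        return row[0] if row else None

    def put(self, sha256: str, page: int, model: str, mode: str, result: str) -> None:
        """Store or overwrite the result for this key."""
        self.conn.execute(
            "INSERT OR REPLACE INTO page_cache VALUES (?, ?, ?, ?, ?)",
            (sha256, page, model, mode, result),
        )
        self.conn.commit()


cache = PageCache()
digest = sha256_of_bytes(b"%PDF-1.7 example bytes")
cache.put(digest, 1, "claude-opus-4-6", "native", '{"equations": []}')
print(cache.get(digest, 1, "claude-opus-4-6", "native"))  # {"equations": []}
print(cache.get(digest, 1, "claude-opus-4-6", "image"))   # None
```

Keying on the content hash rather than the file name means a renamed copy of the same PDF still hits the cache, while a change of model or parsing mode produces a miss and a fresh call.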
## Citation
```bibtex
@software{algorembrant2025pdfparser,
author = {algorembrant},
title = {PDF Atomic Parser},
year = {2025},
url = {https://github.com/algorembrant/pdf-atomic-parser}
}
```