---
language:
- en
license: mit
library_name: anthropic
tags:
- pdf
- document-parsing
- ocr
- multimodal
- equations
- table-extraction
- agent
- claude
- information-extraction
- scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
- algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  using Claude (claude-opus-4-6) without hallucination. Agent-ready.
---
|
|
Powered by **claude-opus-4-6** (Anthropic).
|
|
|
|
A single-file Python tool for extracting structured content from complex
academic and technical PDFs. It works on documents containing:

- Mathematical equations (extracted as LaTeX)
- Data tables (extracted as Markdown + JSON)
- Algorithms and pseudocode (extracted verbatim, with language detection)
- Figures, charts, graphs, and drawings (semantic descriptions)
- Multi-column layouts, footnotes, margin notes
- 100+ page documents via automatic chunking
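
The chunking of long documents can be pictured with a minimal sketch. It assumes chunks are consecutive page ranges capped at the 20-pages-per-call default listed in the properties table; the tool's real chunking logic is not shown in this card, and `page_chunks` is a hypothetical helper, not part of `pdf_atomic_parser`.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, max_pages: int = 20) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) page ranges of at most max_pages pages.

    Hypothetical helper for illustration; the parser's own chunking may differ.
    """
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(1, total_pages + 1, max_pages):
        # The last chunk is clipped to the end of the document.
        yield start, min(start + max_pages - 1, total_pages)


# A 130-page document splits into six full chunks and one 10-page remainder.
chunks = list(page_chunks(total_pages=130))
print(len(chunks), chunks[0], chunks[-1])
```

Each range could then be sent as one API call, which keeps every request under the per-call page limit regardless of document length.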
|
|
|
|
```bash
# Install dependencies and set your Anthropic API key
pip install anthropic PyMuPDF rich tqdm
export ANTHROPIC_API_KEY="sk-ant-..."

# Run the parser's subcommands on a PDF
python pdf_atomic_parser.py parse document.pdf
python pdf_atomic_parser.py atomic document.pdf --output ./results/
python pdf_atomic_parser.py extract-equations document.pdf
python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
```
|
|
|
|
```python
from pdf_atomic_parser import AgentPDFInterface

# Create the agent interface, then parse, extract from, or query a PDF
agent = AgentPDFInterface(model="opus")
result = agent.parse("paper.pdf")
equations = agent.get_equations("paper.pdf")
tables = agent.get_tables("paper.pdf")
answer = agent.ask("paper.pdf", "What datasets were used?")
```
|
|
|
|
| Property | Value |
|---|---|
| Underlying model | claude-opus-4-6 (Anthropic) |
| Parsing modes | native PDF, page-as-image (300 DPI) |
| Max pages per call | 20 (configurable) |
| Cache | SQLite, keyed by SHA-256 + page + model + mode |
| Output formats | JSON, Markdown, plain text |
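
To illustrate the cache row, here is a minimal sketch of a SQLite cache with a composite key of SHA-256 digest, page, model, and mode. It assumes the digest is taken over the PDF's bytes and that results are stored as JSON text. The table name, column names, and helper functions are hypothetical and are not taken from `pdf_atomic_parser`.

```python
import hashlib
import json
import sqlite3
from typing import Optional


def file_sha256(data: bytes) -> str:
    """Hex SHA-256 digest of the PDF bytes (assumed to identify the file)."""
    return hashlib.sha256(data).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache and create the (hypothetical) table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_cache (
            pdf_sha256 TEXT NOT NULL,
            page INTEGER NOT NULL,
            model TEXT NOT NULL,
            mode TEXT NOT NULL,
            result TEXT NOT NULL,
            PRIMARY KEY (pdf_sha256, page, model, mode)
        )
        """
    )
    return conn


def cache_put(conn: sqlite3.Connection, pdf_sha256: str, page: int,
              model: str, mode: str, result: dict) -> None:
    """Store one page result, overwriting any entry with the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO page_cache VALUES (?, ?, ?, ?, ?)",
        (pdf_sha256, page, model, mode, json.dumps(result)),
    )
    conn.commit()


def cache_get(conn: sqlite3.Connection, pdf_sha256: str, page: int,
              model: str, mode: str) -> Optional[dict]:
    """Return the cached result for this exact key, or None on a miss."""
    row = conn.execute(
        "SELECT result FROM page_cache "
        "WHERE pdf_sha256 = ? AND page = ? AND model = ? AND mode = ?",
        (pdf_sha256, page, model, mode),
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_cache()
digest = file_sha256(b"%PDF-1.7 example bytes")
cache_put(conn, digest, 1, "claude-opus-4-6", "native", {"text": "page one"})
hit = cache_get(conn, digest, 1, "claude-opus-4-6", "native")
miss = cache_get(conn, digest, 1, "claude-opus-4-6", "image")
```

Including the model and mode in the key means that switching parsing mode or model results in a cache miss, so stale results from a different configuration are never returned.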
|
|
|
|
```bibtex
@software{algorembrant2025pdfparser,
  author = {algorembrant},
  title = {PDF Atomic Parser},
  year = {2025},
  url = {https://github.com/algorembrant/pdf-atomic-parser}
}
```
|
|