---
language:
- en
license: mit
library_name: anthropic
tags:
- pdf
- document-parsing
- ocr
- multimodal
- equations
- table-extraction
- agent
- claude
- information-extraction
- scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
- algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  using Claude claude-opus-4-6 without hallucination. Agent-ready.
---

Powered by **claude-opus-4-6** (Anthropic).

A single-file Python tool for extracting structured content from complex
academic and technical PDFs. Works on documents containing:

- Mathematical equations (extracted as LaTeX)
- Data tables (extracted as Markdown + JSON)
- Algorithms and pseudocode (verbatim with language detection)
- Figures, charts, graphs, and drawings (semantic descriptions)
- Multi-column layouts, footnotes, margin notes
- 100+ page documents via automatic chunking

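
The automatic chunking mentioned in the list above can be pictured as splitting a document into consecutive page ranges no larger than a per-call limit. The sketch below is illustrative only: `chunk_page_ranges` is a hypothetical helper, not a function from `pdf_atomic_parser.py`, and the default of 20 pages is an assumption for the example.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, max_pages: int = 20) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive, inclusive (start, end) ranges.

    Hypothetical helper for illustration; the real tool may chunk differently.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        # Each chunk starts at `start` and is clipped to the last page.
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(chunk_page_ranges(45))  # [(1, 20), (21, 40), (41, 45)]
```

Each range could then be sent as its own request and the per-chunk results merged in page order.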

```bash
pip install anthropic PyMuPDF rich tqdm
export ANTHROPIC_API_KEY="sk-ant-..."

python pdf_atomic_parser.py parse document.pdf
python pdf_atomic_parser.py atomic document.pdf --output ./results/
python pdf_atomic_parser.py extract-equations document.pdf
python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
```


```python
from pdf_atomic_parser import AgentPDFInterface

agent = AgentPDFInterface(model="opus")
result = agent.parse("paper.pdf")
equations = agent.get_equations("paper.pdf")
tables = agent.get_tables("paper.pdf")
answer = agent.ask("paper.pdf", "What datasets were used?")
```


| Property | Value |
|---|---|
| Underlying model | claude-opus-4-6 (Anthropic) |
| Parsing modes | native PDF, page-as-image (300 DPI) |
| Max pages per call | 20 (configurable) |
| Cache | SQLite, keyed by SHA-256 + page + model + mode |
| Output formats | JSON, Markdown, plain text |

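
To make the cache row in the table above concrete, here is a minimal sketch of a SQLite cache keyed by file SHA-256, page number, model, and mode. Everything in it is an assumption for illustration: the `PageCache` class, the `page_cache` table layout, and the mode strings `"native"` and `"image"` are hypothetical and may not match what `pdf_atomic_parser.py` actually does.

```python
import hashlib
import json
import sqlite3
from typing import Optional


def file_sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


class PageCache:
    """Hypothetical per-page result cache; schema is assumed, not the tool's."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        # The composite primary key mirrors "SHA-256 + page + model + mode".
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS page_cache ("
            "sha256 TEXT, page INTEGER, model TEXT, mode TEXT, result TEXT, "
            "PRIMARY KEY (sha256, page, model, mode))"
        )

    def get(self, sha256: str, page: int, model: str, mode: str) -> Optional[dict]:
        """Return the cached result for this key, or None on a miss."""
        row = self.conn.execute(
            "SELECT result FROM page_cache "
            "WHERE sha256 = ? AND page = ? AND model = ? AND mode = ?",
            (sha256, page, model, mode),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, sha256: str, page: int, model: str, mode: str, result: dict) -> None:
        """Store (or overwrite) the result for this key as JSON text."""
        self.conn.execute(
            "INSERT OR REPLACE INTO page_cache VALUES (?, ?, ?, ?, ?)",
            (sha256, page, model, mode, json.dumps(result)),
        )
        self.conn.commit()


cache = PageCache()
digest = file_sha256(b"example PDF bytes")
cache.put(digest, 1, "claude-opus-4-6", "native", {"text": "page one"})
print(cache.get(digest, 1, "claude-opus-4-6", "native"))  # cache hit
print(cache.get(digest, 1, "claude-opus-4-6", "image"))   # miss: different mode
```

Including the model and mode in the key means a page parsed in one mode is not served from the cache when the same page is requested in another mode or with another model.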

```bibtex
@software{algorembrant2025pdfparser,
  author = {algorembrant},
  title = {PDF Atomic Parser},
  year = {2025},
  url = {https://github.com/algorembrant/pdf-atomic-parser}
}
```