---
language:
  - en
license: mit
library_name: anthropic
tags:
  - pdf
  - document-parsing
  - ocr
  - multimodal
  - equations
  - table-extraction
  - agent
  - claude
  - information-extraction
  - scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
  - algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  using Claude (claude-opus-4-6) without hallucination. Agent-ready.
---

# PDF Atomic Parser

Powered by **claude-opus-4-6** (Anthropic).

## Description

A single-file Python tool for extracting structured content from complex
academic and technical PDFs. It handles:

- Mathematical equations (extracted as LaTeX)
- Data tables (extracted as Markdown + JSON)
- Algorithms and pseudocode (verbatim with language detection)
- Figures, charts, graphs, and drawings (semantic descriptions)
- Multi-column layouts, footnotes, margin notes
- 100+ page documents via automatic chunking
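
The automatic chunking mentioned in the last item can be pictured roughly as follows. This is a minimal sketch, not the tool's actual implementation: the function name `chunk_page_ranges` is hypothetical, and the default of 20 pages per chunk is taken from the "Max pages per call" value in the Model Details table.

```python
def chunk_page_ranges(total_pages: int, max_pages: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive (start, end) ranges.

    Hypothetical helper for illustration; the real parser may chunk differently.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    # A 105-page document yields five full chunks and one 5-page remainder.
    for start, end in chunk_page_ranges(105):
        print(f"pages {start}-{end}")
```

Each range could then be sent as one request, so a long document never exceeds the per-call page limit.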

## Usage

```bash
pip install anthropic PyMuPDF rich tqdm
export ANTHROPIC_API_KEY="sk-ant-..."

python pdf_atomic_parser.py parse document.pdf
python pdf_atomic_parser.py atomic document.pdf --output ./results/
python pdf_atomic_parser.py extract-equations document.pdf
python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
```

## Agent Integration

```python
from pdf_atomic_parser import AgentPDFInterface

agent = AgentPDFInterface(model="opus")
result    = agent.parse("paper.pdf")
equations = agent.get_equations("paper.pdf")
tables    = agent.get_tables("paper.pdf")
answer    = agent.ask("paper.pdf", "What datasets were used?")
```

## Model Details

| Property | Value |
|---|---|
| Underlying model | claude-opus-4-6 (Anthropic) |
| Parsing modes | native PDF, page-as-image (300 DPI) |
| Max pages per call | 20 (configurable) |
| Cache | SQLite, keyed by SHA-256 + page + model + mode |
| Output formats | JSON, Markdown, plain text |
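
The cache row above suggests a lookup keyed on the document hash plus page, model, and mode. The sketch below shows one way such a cache might look using only the standard library. The table name `page_cache`, the key layout, and all function names are assumptions made for illustration, not the schema used by `pdf_atomic_parser.py`.

```python
import hashlib
import sqlite3
from typing import Optional


def cache_key(pdf_bytes: bytes, page: int, model: str, mode: str) -> str:
    """Build a key from the SHA-256 of the file plus page, model, and mode."""
    doc_hash = hashlib.sha256(pdf_bytes).hexdigest()
    return f"{doc_hash}:{page}:{model}:{mode}"


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_cache ("
        "key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def cache_get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Return the cached result for key, or None on a miss."""
    row = conn.execute(
        "SELECT result FROM page_cache WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def cache_put(conn: sqlite3.Connection, key: str, result: str) -> None:
    """Store or overwrite the result for key."""
    conn.execute(
        "INSERT OR REPLACE INTO page_cache (key, result) VALUES (?, ?)",
        (key, result),
    )
    conn.commit()
```

Including the model and mode in the key means a page parsed in native PDF mode is never confused with the same page parsed as an image, and switching models does not return stale results.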

## Citation

```bibtex
@software{algorembrant2025pdfparser,
  author    = {algorembrant},
  title     = {PDF Atomic Parser},
  year      = {2025},
  url       = {https://github.com/algorembrant/pdf-atomic-parser}
}
```