---
language:
- en
license: mit
library_name: anthropic
tags:
- pdf
- document-parsing
- ocr
- multimodal
- equations
- table-extraction
- agent
- claude
- information-extraction
- scientific-documents
pipeline_tag: document-question-answering
model_name: PDF Atomic Parser
authors:
- algorembrant
sdk: other
sdk_version: "1.0.0"
app_file: pdf_atomic_parser.py
short_description: >
  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
  using Claude claude-opus-4-6 without hallucination. Agent-ready.
---

# PDF Atomic Parser

Powered by **claude-opus-4-6** (Anthropic).

## Description

A single-file Python tool for extracting structured content from complex
academic and technical PDFs. Works on documents containing:

- Mathematical equations (extracted as LaTeX)
- Data tables (extracted as Markdown + JSON)
- Algorithms and pseudocode (verbatim with language detection)
- Figures, charts, graphs, and drawings (semantic descriptions)
- Multi-column layouts, footnotes, margin notes
- 100+ page documents via automatic chunking

## Usage

```bash
pip install anthropic PyMuPDF rich tqdm
export ANTHROPIC_API_KEY="sk-ant-..."

python pdf_atomic_parser.py parse document.pdf
python pdf_atomic_parser.py atomic document.pdf --output ./results/
python pdf_atomic_parser.py extract-equations document.pdf
python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
```

## Agent Integration

```python
from pdf_atomic_parser import AgentPDFInterface

agent = AgentPDFInterface(model="opus")

result    = agent.parse("paper.pdf")
equations = agent.get_equations("paper.pdf")
tables    = agent.get_tables("paper.pdf")
answer    = agent.ask("paper.pdf", "What datasets were used?")
```

## Model Details

| Property | Value |
|---|---|
| Underlying model | claude-opus-4-6 (Anthropic) |
| Parsing modes | native PDF, page-as-image (300 DPI) |
| Max pages per call | 20 (configurable) |
| Cache | SQLite, keyed by SHA-256 + page + model + mode |
| Output formats | JSON, Markdown, plain text |

## Citation

```bibtex
@software{algorembrant2025pdfparser,
  author = {algorembrant},
  title  = {PDF Atomic Parser},
  year   = {2025},
  url    = {https://github.com/algorembrant/pdf-atomic-parser}
}
```
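
As a note on the Model Details table: the 20-page chunking and the cache key (SHA-256 + page + model + mode) might look roughly like the sketch below. This is an illustration only, not the code in `pdf_atomic_parser.py`. The function names (`page_chunks`, `cache_key`, `open_cache`, `cache_put`, `cache_get`), the `page_cache` table, and the colon-separated key layout are all assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import Optional


def page_chunks(total_pages: int, max_pages: int = 20) -> list[tuple[int, int]]:
    """Split 1-based page numbers into inclusive (start, end) ranges.

    Hypothetical helper: max_pages mirrors the "Max pages per call" row.
    """
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


def cache_key(pdf_bytes: bytes, page: int, model: str, mode: str) -> str:
    """Build a key from the file's SHA-256 digest plus page, model, and mode.

    The colon-separated layout is an assumption, not the tool's real format.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return f"{digest}:{page}:{model}:{mode}"


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a SQLite cache with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_cache ("
        "key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def cache_put(conn: sqlite3.Connection, key: str, result: str) -> None:
    """Store or overwrite the parsed result for one page."""
    conn.execute(
        "INSERT OR REPLACE INTO page_cache (key, result) VALUES (?, ?)",
        (key, result),
    )
    conn.commit()


def cache_get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Return the cached result for a key, or None on a cache miss."""
    row = conn.execute(
        "SELECT result FROM page_cache WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

Including the model and mode in the key means that switching from native PDF parsing to page-as-image, or changing the model, would miss the cache instead of returning a stale result produced under different settings.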