algorembrant/anthropic-pdf-parser

	---
	language:
	- en
	license: mit
	library_name: anthropic
	tags:
	- pdf
	- document-parsing
	- ocr
	- multimodal
	- equations
	- table-extraction
	- agent
	- claude
	- information-extraction
	- scientific-documents
	pipeline_tag: document-question-answering
	model_name: PDF Atomic Parser
	authors:
	- algorembrant
	sdk: other
	sdk_version: "1.0.0"
	app_file: pdf_atomic_parser.py
	short_description: >
	  Atomically parse complex PDFs (equations, graphs, algorithms, tables)
	  using claude-opus-4-6, aiming to avoid hallucinated content. Agent-ready.
	---

	# PDF Atomic Parser

	Powered by claude-opus-4-6 (Anthropic).

	## Description

	A single-file Python tool for extracting structured content from complex
	academic and technical PDFs. Works on documents containing:

	- Mathematical equations (extracted as LaTeX)
	- Data tables (extracted as Markdown + JSON)
	- Algorithms and pseudocode (verbatim with language detection)
	- Figures, charts, graphs, and drawings (semantic descriptions)
	- Multi-column layouts, footnotes, margin notes
	- 100+ page documents via automatic chunking
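
The automatic chunking mentioned in the last item can be pictured with a minimal sketch. It assumes the document is split into consecutive page ranges capped at the per-call page limit (20 by default, per the Model Details table). The function name `page_chunks` and its parameters are hypothetical and are not taken from `pdf_atomic_parser.py`, whose actual splitting strategy may differ.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, max_pages: int = 20) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges.

    Hypothetical helper: each range covers at most `max_pages` pages, and
    the ranges together cover pages 1..total_pages with no gaps or overlaps.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for first in range(1, total_pages + 1, max_pages):
        # The last chunk may be shorter than max_pages.
        yield first, min(first + max_pages - 1, total_pages)


if __name__ == "__main__":
    # A 45-page document split with the default limit of 20 pages per call.
    for first, last in page_chunks(total_pages=45):
        print(f"pages {first}-{last}")
```

Each range would then be sent as one request, so a 100+ page document becomes a handful of independent calls whose results can be merged in page order.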

	## Usage

	```bash
	pip install anthropic PyMuPDF rich tqdm
	export ANTHROPIC_API_KEY="sk-ant-..."

	python pdf_atomic_parser.py parse document.pdf
	python pdf_atomic_parser.py atomic document.pdf --output ./results/
	python pdf_atomic_parser.py extract-equations document.pdf
	python pdf_atomic_parser.py query document.pdf "What is the main theorem?"
	```

	## Agent Integration

	```python
	from pdf_atomic_parser import AgentPDFInterface

	agent = AgentPDFInterface(model="opus")
	result = agent.parse("paper.pdf")
	equations = agent.get_equations("paper.pdf")
	tables = agent.get_tables("paper.pdf")
	answer = agent.ask("paper.pdf", "What datasets were used?")
	```

	## Model Details

	| Property | Value |
	|---|---|
	| Underlying model | claude-opus-4-6 (Anthropic) |
	| Parsing modes | native PDF, page-as-image (300 DPI) |
	| Max pages per call | 20 (configurable) |
	| Cache | SQLite, keyed by SHA-256 + page + model + mode |
	| Output formats | JSON, Markdown, plain text |
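
To make the cache row of the table concrete, here is a minimal sketch of a SQLite cache keyed by SHA-256 + page + model + mode. It assumes the SHA-256 is computed over the PDF bytes and that the four values form a composite primary key. The table name, column names, helper functions, and the mode labels `"native"` and `"image"` are all hypothetical; the actual schema in `pdf_atomic_parser.py` may be different.

```python
import hashlib
import sqlite3
from typing import Optional


def open_cache(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with a hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_cache (
            doc_sha256 TEXT    NOT NULL,  -- SHA-256 hex digest of the PDF bytes
            page       INTEGER NOT NULL,  -- 1-based page number
            model      TEXT    NOT NULL,  -- e.g. "claude-opus-4-6"
            mode       TEXT    NOT NULL,  -- assumed labels: "native" or "image"
            result     TEXT    NOT NULL,  -- serialized parse result
            PRIMARY KEY (doc_sha256, page, model, mode)
        )
        """
    )
    return conn


def cache_put(conn: sqlite3.Connection, doc_sha256: str, page: int,
              model: str, mode: str, result: str) -> None:
    """Store a result, overwriting any earlier entry with the same key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO page_cache VALUES (?, ?, ?, ?, ?)",
            (doc_sha256, page, model, mode, result),
        )


def cache_get(conn: sqlite3.Connection, doc_sha256: str, page: int,
              model: str, mode: str) -> Optional[str]:
    """Return the cached result for this exact key, or None on a miss."""
    row = conn.execute(
        "SELECT result FROM page_cache "
        "WHERE doc_sha256 = ? AND page = ? AND model = ? AND mode = ?",
        (doc_sha256, page, model, mode),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_cache()
    digest = hashlib.sha256(b"%PDF-1.7 demo bytes").hexdigest()
    cache_put(conn, digest, 1, "claude-opus-4-6", "native", '{"text": "..."}')
    print(cache_get(conn, digest, 1, "claude-opus-4-6", "native"))  # hit
    print(cache_get(conn, digest, 1, "claude-opus-4-6", "image"))   # miss: None
```

Keying on the content hash rather than the file name means a renamed copy of the same PDF still hits the cache, while switching the model or the parsing mode forces a fresh call.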

	## Citation

	```bibtex
	@software{algorembrant2025pdfparser,
	  author = {algorembrant},
	  title  = {PDF Atomic Parser},
	  year   = {2025},
	  url    = {https://github.com/algorembrant/pdf-atomic-parser}
	}
	```