# pdfsys-parser-mupdf
Text-ok extraction backend. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. `ocr_prob < threshold`).
## What it does
1. Opens the PDF with PyMuPDF.
2. Iterates every page, calling `page.get_text("blocks", sort=True)`.
3. Filters to text blocks (drops image blocks).
4. Normalizes each block's bbox to [0, 1] coordinates.
5. Produces one `Segment` per block, joined into an `ExtractedDoc` with merged Markdown.
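Steps 3 to 5 can be sketched without PyMuPDF by working directly on block tuples in the shape that `page.get_text("blocks")` returns: `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` is 0 for text and 1 for images. This is a minimal illustration, not the package's actual code: the `Segment` fields, the helper names `blocks_to_segments` and `merge_markdown`, the clamping step, and the blank-line join are all assumptions.

```python
from dataclasses import dataclass

TEXT_BLOCK = 0  # PyMuPDF block_type: 0 = text, 1 = image


@dataclass
class Segment:
    """Hypothetical stand-in for the package's Segment type."""
    page: int
    bbox: tuple  # (x0, y0, x1, y1), each normalized to [0, 1]
    text: str


def blocks_to_segments(blocks, page_width, page_height, page_index=0):
    """Drop image blocks and normalize each text block's bbox to [0, 1]."""
    segments = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != TEXT_BLOCK:
            continue  # step 3: keep text blocks only
        # Step 4: divide by the page size; clamping is an assumption here,
        # guarding against blocks that spill slightly past the page edge.
        raw = (x0 / page_width, y0 / page_height,
               x1 / page_width, y1 / page_height)
        bbox = tuple(min(1.0, max(0.0, v)) for v in raw)
        segments.append(Segment(page_index, bbox, text.strip()))
    return segments


def merge_markdown(segments):
    """Step 5 (assumed form): join segment texts with blank lines."""
    return "\n\n".join(s.text for s in segments)


# Example blocks for a 612 x 792 pt page (US Letter), made up for illustration.
blocks = [
    (72.0, 72.0, 540.0, 100.0, "Title\n", 0, 0),
    (72.0, 120.0, 300.0, 400.0, "<image>", 1, 1),
    (72.0, 420.0, 540.0, 700.0, "Body text.\n", 2, 0),
]
segments = blocks_to_segments(blocks, 612.0, 792.0)
print(merge_markdown(segments))
```

With PyMuPDF itself, the inputs would come from `page.get_text("blocks", sort=True)` together with `page.rect.width` and `page.rect.height`.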
## Usage
```python
from pdfsys_parser_mupdf import extract_doc
doc = extract_doc("path/to/clean.pdf")
print(doc.markdown[:500])
print(f"{doc.segment_count} segments, {doc.char_count} chars")
```
## Scope
This backend intentionally does NOT:
- Run OCR (that's what parser-pipeline and parser-vlm are for)
- Use a layout model (not needed for text-ok PDFs)
- Extract images or tables (image-heavy PDFs should be routed elsewhere)
It is the simplest possible extraction: unwrap PyMuPDF blocks into structured output.