pdfsys-parser-mupdf

Extraction backend for text-ok PDFs. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. ocr_prob < threshold), so no OCR is needed.
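As a rough illustration of the routing condition above, the check could look like the sketch below. The function name `choose_path` and the returned labels are hypothetical, not part of this package, and the threshold value is left to the caller because this README does not state it.

```python
def choose_path(ocr_prob: float, threshold: float) -> str:
    """Hypothetical helper: pick an extraction path from the router's score.

    `ocr_prob` is the router's estimate that the PDF needs OCR.
    `threshold` is an assumed, caller-supplied cutoff.
    """
    # Strictly below the threshold means the embedded text layer is trusted.
    if ocr_prob < threshold:
        return "text-ok"
    return "needs-ocr"
```

Only PDFs on the "text-ok" side of this check would reach this backend.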

What it does

Opens the PDF with PyMuPDF.
Iterates every page, calling page.get_text("blocks", sort=True).
Filters to text blocks (drops image blocks).
Normalizes each block's bbox to [0, 1] coordinates.
Produces one Segment per block, joined into an ExtractedDoc with merged Markdown.
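The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `SegmentSketch`, `normalize_bbox`, `blocks_to_segments`, `extract_segments`, and `to_markdown` are hypothetical names, and clamping to [0, 1] and skipping empty blocks are assumptions. The only PyMuPDF behavior relied on is that `page.get_text("blocks", sort=True)` returns tuples of `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` is 0 for text and 1 for images.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

TEXT_BLOCK = 0  # PyMuPDF block_type: 0 = text, 1 = image

BBox = Tuple[float, float, float, float]


@dataclass
class SegmentSketch:
    """Hypothetical stand-in for the package's Segment type."""
    page_index: int
    bbox: BBox  # normalized to [0, 1]
    text: str


def normalize_bbox(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Scale a bbox from page points to [0, 1]; clamping is an assumption."""
    def clamp(value: float) -> float:
        return min(1.0, max(0.0, value))

    x0, y0, x1, y1 = bbox
    return (
        clamp(x0 / page_width),
        clamp(y0 / page_height),
        clamp(x1 / page_width),
        clamp(y1 / page_height),
    )


def blocks_to_segments(
    blocks: Iterable[tuple], page_index: int, page_width: float, page_height: float
) -> List[SegmentSketch]:
    """Keep text blocks only and emit one segment per block."""
    segments = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != TEXT_BLOCK:
            continue  # drop image blocks
        text = text.strip()
        if not text:
            continue  # assumption: empty blocks are not worth a segment
        bbox = normalize_bbox((x0, y0, x1, y1), page_width, page_height)
        segments.append(SegmentSketch(page_index, bbox, text))
    return segments


def extract_segments(path: str) -> List[SegmentSketch]:
    """Open the PDF with PyMuPDF and collect segments from every page."""
    try:
        import pymupdf  # recent PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older releases use the legacy module name

    segments = []
    with pymupdf.open(path) as doc:
        for page_index, page in enumerate(doc):
            blocks = page.get_text("blocks", sort=True)
            segments.extend(
                blocks_to_segments(blocks, page_index, page.rect.width, page.rect.height)
            )
    return segments


def to_markdown(segments: List[SegmentSketch]) -> str:
    """Join block texts with blank lines as a simple merged Markdown body."""
    return "\n\n".join(segment.text for segment in segments)
```

Keeping `blocks_to_segments` free of any PyMuPDF import makes the filtering and normalization easy to check with hand-written block tuples.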

Usage

from pdfsys_parser_mupdf import extract_doc

doc = extract_doc("path/to/clean.pdf")
print(doc.markdown[:500])
print(f"{doc.segment_count} segments, {doc.char_count} chars")

Scope

This backend intentionally does NOT:

Run OCR (that's what parser-pipeline and parser-vlm are for)
Use a layout model (not needed for text-ok PDFs)
Extract images or tables (image-heavy PDFs should be routed elsewhere)

It is the simplest possible extraction: unwrap PyMuPDF blocks into structured output.