	# pdfsys-parser-mupdf

	Text-ok extraction backend. This is the fast path for PDFs that the router classifies as having a clean embedded text layer (i.e. `ocr_prob < threshold`).
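The routing condition above can be pictured roughly as follows. This is only a sketch: the function name `choose_backend`, the returned labels, and the default threshold of `0.5` are hypothetical and are not taken from the router's actual code.

```python
def choose_backend(ocr_prob: float, threshold: float = 0.5) -> str:
    """Pick an extraction path from the router's OCR probability.

    Hypothetical helper: the name, labels, and default threshold are
    illustrative assumptions, not the real router API.
    """
    if ocr_prob < threshold:
        # Clean embedded text layer: take the fast PyMuPDF path.
        return "mupdf"
    # Otherwise hand the document to an OCR-capable backend.
    return "ocr"
```

A document scored at `ocr_prob = 0.1` would take the fast path under these assumptions, while one scored at `0.9` would not.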

	## What it does

	1. Opens the PDF with PyMuPDF.
	2. Iterates every page, calling `page.get_text("blocks", sort=True)`.
	3. Filters to text blocks (drops image blocks).
	4. Normalizes each block's bbox to [0, 1] coordinates.
	5. Produces one `Segment` per block, joined into an `ExtractedDoc` with merged Markdown.
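
The steps above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the `Segment` dataclass here is a hypothetical stand-in, the helper names are invented for the example, and normalization is assumed to mean dividing by the page width and height. The only PyMuPDF facts relied on are that `get_text("blocks")` returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)` and that `block_type` is `0` for text and `1` for images.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]


@dataclass
class Segment:
    """Hypothetical stand-in for the package's Segment type."""
    page: int
    bbox: BBox  # normalized to [0, 1]
    text: str


def normalize_bbox(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Scale a bbox in page units to [0, 1], clamping small overshoots."""
    def clamp(v: float) -> float:
        return min(max(v, 0.0), 1.0)

    x0, y0, x1, y1 = bbox
    return (clamp(x0 / page_width), clamp(y0 / page_height),
            clamp(x1 / page_width), clamp(y1 / page_height))


def blocks_to_segments(blocks, page_index: int,
                       page_width: float, page_height: float) -> List[Segment]:
    """Turn PyMuPDF block tuples into Segments, dropping image blocks."""
    segments = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        bbox = normalize_bbox((x0, y0, x1, y1), page_width, page_height)
        segments.append(Segment(page_index, bbox, text.strip()))
    return segments


def to_markdown(segments: List[Segment]) -> str:
    """Join segment texts with blank lines (an assumed merge strategy)."""
    return "\n\n".join(s.text for s in segments)


def extract_segments(pdf_path: str) -> List[Segment]:
    """Open a PDF and collect one Segment per text block on every page."""
    import fitz  # PyMuPDF, imported lazily so the helpers above run without it

    segments: List[Segment] = []
    with fitz.open(pdf_path) as pdf:
        for page_index, page in enumerate(pdf):
            blocks = page.get_text("blocks", sort=True)
            segments.extend(blocks_to_segments(
                blocks, page_index, page.rect.width, page.rect.height))
    return segments
```

Keeping the filtering and normalization in plain functions that take block tuples makes them easy to check without a PDF on hand; only `extract_segments` touches PyMuPDF.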

	## Usage

	```python
	from pdfsys_parser_mupdf import extract_doc

	doc = extract_doc("path/to/clean.pdf")
	print(doc.markdown[:500])
	print(f"{doc.segment_count} segments, {doc.char_count} chars")
	```

	## Scope

	This backend intentionally does NOT:
	- Run OCR (that's what parser-pipeline and parser-vlm are for)
	- Use a layout model (not needed for text-ok PDFs)
	- Extract images or tables (image-heavy PDFs should be routed elsewhere)

	It is the simplest possible extraction path: it unwraps PyMuPDF blocks into structured output.