---
title: "PDFSystem: PB-Scale PDF Processing Pipeline"
emoji: 📄
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: "PDF to Markdown pipeline with ML-powered routing"
---

# PDFSystem for MNBVC

<p align="center">
  <strong>PB-scale PDF → Pretraining Data Pipeline</strong><br>
  <em>FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs</em>
</p>

<p align="center">
  <a href="https://huggingface.co/spaces/roger1024/DocPipe">
    <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow" alt="Hugging Face Spaces">
  </a>
  <a href="https://github.com/MIracleyin/pdfsystem_mnbvc">
    <img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub">
  </a>
  <img src="https://img.shields.io/badge/Python-3.11-blue?logo=python" alt="Python 3.11">
  <img src="https://img.shields.io/badge/Gradio-6.12.0-green" alt="Gradio">
  <img src="https://img.shields.io/badge/License-Apache%202.0-orange" alt="License">
</p>

---

## 🔗 Quick Links

| Platform | Link | Description |
|----------|------|-------------|
| **Live Demo** | [🤗 HF Spaces](https://huggingface.co/spaces/roger1024/DocPipe) | Upload a PDF and try the pipeline instantly |
| **Source Code** | [GitHub](https://github.com/MIracleyin/pdfsystem_mnbvc) | Full source code and documentation |

---

## ✨ Features

- **🧠 ML-Powered Routing**: An XGBoost classifier (124 features) routes each PDF to the best-suited backend
- **⚡ Fast Path**: PyMuPDF extraction for documents with a usable text layer (~10 ms/page)
- **📊 Quality Scoring**: ModernBERT-large OCR quality assessment on a [0, 3] scale
- **🔍 Visual Debug**: Page preview with extracted bbox overlays
- **📦 Modular Design**: Stateless, backend-agnostic pipeline components

---

## 🎯 Current Status

| Component | Status | Description |
|-----------|--------|-------------|
| **Stage-A Router** | ✅ Ready | XGBoost binary classifier with 124 PyMuPDF features |
| **MuPDF Parser** | ✅ Ready | Fast extraction for clean-text PDFs |
| **OCR Quality Scorer** | ✅ Ready | ModernBERT-large regression model |
| **Stage-B Router** | 🚧 Planned | Layout-based complexity routing |
| **Pipeline Parser** | 🚧 Planned | Region-level OCR for simple layouts |
| **VLM Parser** | 🚧 Planned | Vision-Language model for complex layouts |

---

## 🚀 Quick Start

### Option 1: Online Demo (Fastest)

Visit [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF; no installation is required.

### Option 2: Local Development

```bash
# 1. Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repository and install dependencies
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
cd pdfsystem_mnbvc
uv sync

# 3. Download router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run the interactive demo
python app.py
# Open http://localhost:7860
```

### Option 3: Batch Processing

```bash
python -m pdfsys_bench \
    --pdf-dir /path/to/pdfs \
    --out results.jsonl \
    --markdown-dir ./extracted
```
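Once a batch run has written `results.jsonl`, a short script can summarise how documents were routed. The sketch below is only illustrative: it assumes one JSON object per line carrying `backend` and `ocr_prob` keys, which mirror the `RouterDecision` fields but are not confirmed as the actual output schema, and `summarize_results` is a hypothetical helper name. Call it as `summarize_results("results.jsonl")`.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def summarize_results(path):
    """Count records per backend and average ocr_prob per backend.

    Assumes one JSON object per line with "backend" and "ocr_prob" keys;
    these key names are hypothetical and may differ from the real output.
    """
    counts = Counter()
    prob_sums = defaultdict(float)
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            backend = record.get("backend", "unknown")
            counts[backend] += 1
            prob_sums[backend] += float(record.get("ocr_prob", 0.0))
    # Build a per-backend summary: record count and mean OCR probability.
    return {
        backend: {"count": n, "avg_ocr_prob": prob_sums[backend] / n}
        for backend, n in counts.items()
    }
```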

---

## 🏗️ Architecture

```
                ┌─────────────────┐
 PDF Input ───► │ Stage-A Router  │  XGBoost (124 features)
                │  (Implemented)  │  ~10ms per PDF
                └────────┬────────┘
                         │ ocr_prob
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
  ┌─────────┐      ┌───────────┐      ┌─────────┐
  │  MUPDF  │      │ PIPELINE  │      │   VLM   │
  │ (Fast)  │      │   (OCR)   │      │(Complex)│
  └────┬────┘      └───────────┘      └─────────┘
       │
       ▼
  ┌─────────────────────────────────────┐
  │  ExtractedDoc: Markdown + Segments  │
  └─────────────────────────────────────┘
       │
       ▼
  ┌─────────────────────────────────────┐
  │  Quality Scorer (ModernBERT-large)  │
  │  Score: [0, 3]                      │
  └─────────────────────────────────────┘
```
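The `ocr_prob` edge in the diagram can be read as a thresholded decision. Here is a minimal sketch, assuming a single cut-off of 0.5 and a two-way MUPDF/PIPELINE split. The threshold, the `route` helper, and the enum string values are illustrative assumptions, not the project's actual routing logic. Under these assumptions, `route(0.034, num_pages=3)` selects `Backend.MUPDF`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Backend(Enum):
    # Member names follow the README; the string values are assumed.
    MUPDF = "mupdf"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


@dataclass
class RouterDecision:
    backend: Backend
    ocr_prob: float
    num_pages: int
    is_form: bool
    features: dict = field(default_factory=dict)


OCR_THRESHOLD = 0.5  # hypothetical cut-off; the README does not state one


def route(ocr_prob, num_pages, is_form=False, threshold=OCR_THRESHOLD):
    """Pick a backend from the router's OCR probability (illustrative only)."""
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError("ocr_prob must lie in [0, 1]")
    # Low OCR probability: the embedded text layer is trusted, use the fast path.
    backend = Backend.MUPDF if ocr_prob < threshold else Backend.PIPELINE
    return RouterDecision(backend, ocr_prob, num_pages, is_form)
```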

---

## 📦 Workspace Packages

| Package | Purpose | Dependencies |
|---------|---------|--------------|
| `pdfsys-core` | Shared types, schemas, layout cache | stdlib only |
| `pdfsys-router` | Stage-A/Stage-B routing decisions | pymupdf, xgboost, pandas, scikit-learn |
| `pdfsys-parser-mupdf` | Fast PyMuPDF extraction | pymupdf |
| `pdfsys-bench` | Evaluation harness + quality scorer | torch, transformers |
| `pdfsys-layout-analyser` | Layout model runner | 🚧 Planned |
| `pdfsys-parser-pipeline` | OCR backend | 🚧 Planned |
| `pdfsys-parser-vlm` | VLM backend | 🚧 Planned |

---

## 📊 Benchmark Results

**OmniDocBench-100 Dataset:**

```
Backend Split:  mupdf=70     pipeline=30
Avg OCR Prob:   mupdf=0.034  pipeline=0.634
Extraction:     70 success   0 errors
Quality Score:  avg=1.71     min=0.39     max=2.73
Timing:         router=49ms  extract=7ms  quality=3.6s
```

---

## 🎨 Demo Interface

The Gradio demo provides:

- **📤 PDF Upload**: Drag-and-drop or click to upload
- **📊 Routing Info**: OCR probability, selected backend, page count
- **🖼️ Page Preview**: First page with colored bbox overlays
- **📝 Markdown Output**: Extracted text content
- **📋 Segment Table**: Block-level extraction details
- **🔧 Feature View**: Selected router features
- **📄 Raw JSON**: Complete pipeline output
- **⭐ Quality Score**: Optional ModernBERT scoring

---

## 📚 Documentation

| Document | Description |
|----------|-------------|
| [`docs/PRD.md`](docs/PRD.md) | Product Requirements & Architecture Rationale |
| [`docs/ROADMAP.md`](docs/ROADMAP.md) | Implementation Timeline & Milestones |
| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Development Guidelines & Commit Conventions |
| [`demo/README.md`](demo/README.md) | Demo-specific Documentation |

---

## 💻 Development

### Data Structures

**Router Output:**
```python
@dataclass
class RouterDecision:
    backend: Backend   # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float    # P(needs OCR), in [0, 1]
    num_pages: int
    is_form: bool
    features: dict     # 124-dim feature vector
```

**Parser Output:**
```python
@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict
```
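To make the frozen `ExtractedDoc` shape concrete, here is a self-contained sketch of how such a record might be assembled. The `Segment` fields (`page`, `text`), the `build_doc` helper, the `stats` keys, and the choice to hash the raw PDF bytes are all assumptions made for illustration; the real definitions live in `pdfsys-core` and may differ.

```python
import hashlib
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum


class Backend(Enum):
    MUPDF = "mupdf"  # only the member needed for this sketch; value assumed


@dataclass(frozen=True)
class Segment:
    # Hypothetical fields; the real Segment type is not shown in this README.
    page: int
    text: str


@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict


def build_doc(pdf_bytes: bytes, segments: list[Segment]) -> ExtractedDoc:
    """Assemble an ExtractedDoc from raw PDF bytes and extracted segments."""
    # Join segment texts with blank lines to form a simple Markdown body.
    markdown = "\n\n".join(s.text for s in segments)
    return ExtractedDoc(
        sha256=hashlib.sha256(pdf_bytes).hexdigest(),  # content hash as identity
        backend=Backend.MUPDF,
        segments=tuple(segments),  # a tuple keeps the record immutable
        markdown=markdown,
        stats={"num_segments": len(segments), "num_chars": len(markdown)},
    )
```

Because the dataclass is frozen, assigning to a field such as `doc.markdown` raises `dataclasses.FrozenInstanceError`, which suits stateless pipeline components that pass records along without mutating them.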

### CLI Reference

```bash
# Download router weights
python -m pdfsys_router.download_weights

# Run benchmark
python -m pdfsys_bench \
    --pdf-dir PATH \
    --out results.jsonl \
    --no-quality  # Skip quality scoring
```

---

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).

---

<p align="center">
  Built with ❤️ for the <a href="https://github.com/esbatmop/MNBVC">MNBVC</a> corpus project
</p>