DocPipe / README.md
jieluo1024
fix: add back HF Spaces YAML frontmatter
d80f375
---
title: "PDFSystem: PB-Scale PDF Processing Pipeline"
emoji: πŸš€
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: "PDF to Markdown pipeline with ML-powered routing"
---
# PDFSystem for MNBVC
<p align="center">
<strong>PB-scale PDF β†’ Pretraining Data Pipeline</strong><br>
<em>FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs</em>
</p>
<p align="center">
<a href="https://huggingface.co/spaces/roger1024/DocPipe">
<img src="https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Spaces-yellow" alt="Hugging Face Spaces">
</a>
<a href="https://github.com/MIracleyin/pdfsystem_mnbvc">
<img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub">
</a>
<img src="https://img.shields.io/badge/Python-3.11-blue?logo=python" alt="Python 3.11">
<img src="https://img.shields.io/badge/Gradio-6.12.0-green" alt="Gradio">
<img src="https://img.shields.io/badge/License-Apache%202.0-orange" alt="License">
</p>
---
## πŸš€ Quick Links
| Platform | Link | Description |
|----------|------|-------------|
| **Live Demo** | [πŸ€— HF Spaces](https://huggingface.co/spaces/roger1024/DocPipe) | Upload PDF and try the pipeline instantly |
| **Source Code** | [GitHub](https://github.com/MIracleyin/pdfsystem_mnbvc) | Full source code and documentation |
---
## ✨ Features
- **🧠 ML-Powered Routing**: XGBoost classifier (124 features) routes PDFs to optimal backend
- **⚑ Fast Path**: PyMuPDF extraction for text-ok documents (~10ms/page)
- **πŸ“Š Quality Scoring**: ModernBERT-large OCR quality assessment [0-3 scale]
- **πŸ” Visual Debug**: Page preview with extracted bbox overlays
- **πŸ“¦ Modular Design**: Stateless, backend-agnostic pipeline components
---
## 🎯 Current Status
| Component | Status | Description |
|-----------|--------|-------------|
| **Stage-A Router** | βœ… Ready | XGBoost binary classifier with 124 PyMuPDF features |
| **MuPDF Parser** | βœ… Ready | Fast extraction for clean-text PDFs |
| **OCR Quality Scorer** | βœ… Ready | ModernBERT-large regression model |
| **Stage-B Router** | 🚧 Planned | Layout-based complexity routing |
| **Pipeline Parser** | 🚧 Planned | Region-level OCR for simple layouts |
| **VLM Parser** | 🚧 Planned | Vision-Language model for complex layouts |
---
## πŸƒ Quick Start
### Option 1: Online Demo (Fastest)
Visit [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF β€” no installation required.
### Option 2: Local Development
```bash
# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and setup
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
cd pdfsystem_mnbvc
uv sync
# 3. Download router weights (257 KB, one-time)
python -m pdfsys_router.download_weights
# 4. Run interactive demo
python app.py
# Open http://localhost:7860
```
### Option 3: Batch Processing
```bash
python -m pdfsys_bench \
--pdf-dir /path/to/pdfs \
--out results.jsonl \
--markdown-dir ./extracted
```
---
## πŸ—οΈ Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
PDF Input ───► β”‚ Stage-A Router β”‚ XGBoost (124 features)
β”‚ (Implemented) β”‚ ~10ms per PDF
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ ocr_prob
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ MUPDF β”‚ β”‚ PIPELINE β”‚ β”‚ VLM β”‚
β”‚ (Fast) β”‚ β”‚ (OCR) β”‚ β”‚(Complex)β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ExtractedDoc: Markdown + Segments β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Quality Scorer (ModernBERT-large) β”‚
β”‚ Score: [0, 3] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## πŸ“¦ Workspace Packages
| Package | Purpose | Dependencies |
|---------|---------|--------------|
| `pdfsys-core` | Shared types, schemas, layout cache | stdlib only |
| `pdfsys-router` | Stage-A/Stage-B routing decisions | pymupdf, xgboost, pandas, sklearn |
| `pdfsys-parser-mupdf` | Fast PyMuPDF extraction | pymupdf |
| `pdfsys-bench` | Evaluation harness + quality scorer | torch, transformers |
| `pdfsys-layout-analyser` | Layout model runner | 🚧 Planned |
| `pdfsys-parser-pipeline` | OCR backend | 🚧 Planned |
| `pdfsys-parser-vlm` | VLM backend | 🚧 Planned |
---
## πŸ“Š Benchmark Results
**OmniDocBench-100 Dataset:**
```
Backend Split: mupdf=70 pipeline=30
Avg OCR Prob: mupdf=0.034 pipeline=0.634
Extraction: 70 success 0 errors
Quality Score: avg=1.71 min=0.39 max=2.73
Timing: router=49ms extract=7ms quality=3.6s
```
---
## 🎨 Demo Interface
The Gradio demo provides:
- **πŸ“€ PDF Upload**: Drag-and-drop or click to upload
- **πŸ“ˆ Routing Info**: OCR probability, selected backend, page count
- **πŸ–ΌοΈ Page Preview**: First page with colored bbox overlays
- **πŸ“ Markdown Output**: Extracted text content
- **πŸ“‹ Segment Table**: Block-level extraction details
- **πŸ”§ Feature View**: Selected router features
- **πŸ“„ Raw JSON**: Complete pipeline output
- **⭐ Quality Score**: Optional ModernBERT scoring
---
## πŸ“š Documentation
| Document | Description |
|----------|-------------|
| [`docs/PRD.md`](docs/PRD.md) | Product Requirements & Architecture Rationale |
| [`docs/ROADMAP.md`](docs/ROADMAP.md) | Implementation Timeline & Milestones |
| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Development Guidelines & Commit Conventions |
| [`demo/README.md`](demo/README.md) | Demo-specific Documentation |
---
## πŸ’» Development
### Data Structures
**Router Output:**
```python
@dataclass
class RouterDecision:
backend: Backend # MUPDF | PIPELINE | VLM | DEFERRED
ocr_prob: float # P(needs OCR) [0, 1]
num_pages: int
is_form: bool
features: dict # 124-dim feature vector
```
**Parser Output:**
```python
@dataclass(frozen=True)
class ExtractedDoc:
sha256: str
backend: Backend
segments: tuple[Segment, ...]
markdown: str
stats: dict
```
### CLI Reference
```bash
# Download router weights
python -m pdfsys_router.download_weights
# Run benchmark
python -m pdfsys_bench \
--pdf-dir PATH \
--out results.jsonl \
--no-quality # Skip quality scoring
```
---
## 🀝 Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
---
## πŸ“„ License
This project is licensed under the [Apache License 2.0](LICENSE).
---
<p align="center">
Built with ❀️ for the <a href="https://github.com/esbatmop/MNBVC">MNBVC</a> corpus project
</p>