---
title: "PDFSystem: PB-Scale PDF Processing Pipeline"
emoji: 🚀
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 6.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: "PDF to Markdown pipeline with ML-powered routing"
---

# PDFSystem for MNBVC

<p align="center">
  <strong>PB-scale PDF → Pretraining Data Pipeline</strong><br>
  <em>FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs</em>
</p>

<p align="center">
  <a href="https://huggingface.co/spaces/roger1024/DocPipe">
    <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow" alt="Hugging Face Spaces">
  </a>
  <a href="https://github.com/MIracleyin/pdfsystem_mnbvc">
    <img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub">
  </a>
  <img src="https://img.shields.io/badge/Python-3.11-blue?logo=python" alt="Python 3.11">
  <img src="https://img.shields.io/badge/Gradio-6.12.0-green" alt="Gradio">
  <img src="https://img.shields.io/badge/License-Apache%202.0-orange" alt="License">
</p>

---

## 🚀 Quick Links

| Platform | Link | Description |
|----------|------|-------------|
| **Live Demo** | [🤗 HF Spaces](https://huggingface.co/spaces/roger1024/DocPipe) | Upload a PDF and try the pipeline instantly |
| **Source Code** | [GitHub](https://github.com/MIracleyin/pdfsystem_mnbvc) | Full source code and documentation |

---

## ✨ Features

- **🧠 ML-Powered Routing**: An XGBoost classifier (124 features) routes each PDF to the most suitable backend
- **⚡ Fast Path**: PyMuPDF extraction for documents whose embedded text is usable (~10 ms/page)
- **📊 Quality Scoring**: ModernBERT-large OCR quality assessment on a 0-3 scale
- **🔍 Visual Debug**: Page preview with extracted bbox overlays
- **📦 Modular Design**: Stateless, backend-agnostic pipeline components

---

## 🎯 Current Status

| Component | Status | Description |
|-----------|--------|-------------|
| **Stage-A Router** | ✅ Ready | XGBoost binary classifier with 124 PyMuPDF features |
| **MuPDF Parser** | ✅ Ready | Fast extraction for clean-text PDFs |
| **OCR Quality Scorer** | ✅ Ready | ModernBERT-large regression model |
| **Stage-B Router** | 🚧 Planned | Layout-based complexity routing |
| **Pipeline Parser** | 🚧 Planned | Region-level OCR for simple layouts |
| **VLM Parser** | 🚧 Planned | Vision-Language model for complex layouts |

---

## 🏃 Quick Start

### Option 1: Online Demo (Fastest)

Visit [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF; no installation is required.

### Option 2: Local Development

```bash
# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
cd pdfsystem_mnbvc
uv sync

# 3. Download router weights (257 KB, one-time)
python -m pdfsys_router.download_weights

# 4. Run interactive demo
python app.py
# Open http://localhost:7860
```

### Option 3: Batch Processing

```bash
python -m pdfsys_bench \
  --pdf-dir /path/to/pdfs \
  --out results.jsonl \
  --markdown-dir ./extracted
```
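Once a batch run finishes, the JSONL output can be summarized with a few lines of Python. The sketch below is illustrative only: `summarize_results` is a hypothetical helper, and the `backend` field name is an assumption about the record schema, so inspect one line of your own `results.jsonl` and adjust `key` if needed.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(path: str, key: str = "backend") -> Counter:
    """Count JSONL records by the value stored under `key`.

    The default field name "backend" is an assumption, not a documented
    part of the pdfsys_bench output schema.
    """
    counts: Counter = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without the key are grouped under "<missing>".
            counts[str(record.get(key, "<missing>"))] += 1
    return counts
```

Calling `summarize_results("results.jsonl")` returns a `Counter` keyed by backend name, which is a quick way to check how a corpus was split between backends.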

---

## 🏗️ Architecture

```
                    ┌─────────────────┐
   PDF Input  ───►  │  Stage-A Router │  XGBoost (124 features)
                    │  (Implemented)  │  ~10ms per PDF
                    └────────┬────────┘
                             │ ocr_prob
           ┌─────────────────┼─────────────────┐
           ▼                 ▼                 ▼
      ┌─────────┐      ┌──────────┐      ┌─────────┐
      │  MUPDF  │      │ PIPELINE │      │   VLM   │
      │  (Fast) │      │  (OCR)   │      │(Complex)│
      └────┬────┘      └──────────┘      └─────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  ExtractedDoc: Markdown + Segments  │
   └─────────────────────────────────────┘
           │
           ▼
   ┌─────────────────────────────────────┐
   │  Quality Scorer (ModernBERT-large)  │
   │  Score: [0, 3]                      │
   └─────────────────────────────────────┘
```
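The Stage-A decision in the diagram can be pictured as a threshold on `ocr_prob`. The following is a minimal sketch under stated assumptions, not the project's router: the `0.5` cut-off, the enum string values, and the `route_stage_a` helper are all hypothetical, and only the backend names come from this README.

```python
from enum import Enum


class Backend(Enum):
    # Names follow the README; the string values are assumptions.
    MUPDF = "mupdf"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


# Hypothetical cut-off; the real threshold is not stated in this README.
OCR_THRESHOLD = 0.5


def route_stage_a(ocr_prob: float, threshold: float = OCR_THRESHOLD) -> Backend:
    """Send low-OCR-probability PDFs to the fast path, the rest to OCR."""
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError(f"ocr_prob must be in [0, 1], got {ocr_prob}")
    return Backend.MUPDF if ocr_prob < threshold else Backend.PIPELINE
```

With this sketch, a document scoring `0.034` would take the MuPDF fast path and one scoring `0.634` would go to the OCR pipeline. Choosing between PIPELINE and VLM is the job of the planned Stage-B router and is not modeled here.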

---

## 📦 Workspace Packages

| Package | Purpose | Dependencies |
|---------|---------|--------------|
| `pdfsys-core` | Shared types, schemas, layout cache | stdlib only |
| `pdfsys-router` | Stage-A/Stage-B routing decisions | pymupdf, xgboost, pandas, scikit-learn |
| `pdfsys-parser-mupdf` | Fast PyMuPDF extraction | pymupdf |
| `pdfsys-bench` | Evaluation harness + quality scorer | torch, transformers |
| `pdfsys-layout-analyser` | Layout model runner | 🚧 Planned |
| `pdfsys-parser-pipeline` | OCR backend | 🚧 Planned |
| `pdfsys-parser-vlm` | VLM backend | 🚧 Planned |

---

## 📊 Benchmark Results

**OmniDocBench-100 Dataset:**

```
Backend Split:    mupdf=70    pipeline=30
Avg OCR Prob:     mupdf=0.034  pipeline=0.634
Extraction:       70 success   0 errors
Quality Score:    avg=1.71     min=0.39   max=2.73
Timing:           router=49ms  extract=7ms  quality=3.6s
```
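Two derived figures help when reading this summary: the corpus-wide mean OCR probability and the share of time spent in each stage. The sketch below only redoes arithmetic on the numbers printed above. It assumes the timings are per-document averages, which the summary does not state, and the helper names are hypothetical.

```python
# Figures copied from the OmniDocBench-100 summary above.
DOCS = {"mupdf": 70, "pipeline": 30}
AVG_OCR_PROB = {"mupdf": 0.034, "pipeline": 0.634}
# Assumption: each timing is a per-document average, in milliseconds.
STAGE_MS = {"router": 49.0, "extract": 7.0, "quality": 3600.0}


def overall_ocr_prob(docs: dict[str, int], probs: dict[str, float]) -> float:
    """Document-weighted mean of the per-backend average OCR probability."""
    total = sum(docs.values())
    return sum(docs[name] * probs[name] for name in docs) / total


def stage_share(stage_ms: dict[str, float]) -> dict[str, float]:
    """Fraction of the summed stage time attributable to each stage."""
    total = sum(stage_ms.values())
    return {name: ms / total for name, ms in stage_ms.items()}
```

Under these assumptions the weighted mean OCR probability is about 0.214, and quality scoring accounts for more than 98% of the summed stage time, which is why skipping it (see `--no-quality` in the CLI reference) matters for large batches.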

---

## 🎨 Demo Interface

The Gradio demo provides:

- **📤 PDF Upload**: Drag-and-drop or click to upload
- **📈 Routing Info**: OCR probability, selected backend, page count
- **🖼️ Page Preview**: First page with colored bbox overlays
- **📝 Markdown Output**: Extracted text content
- **📋 Segment Table**: Block-level extraction details
- **🔧 Feature View**: Selected router features
- **📄 Raw JSON**: Complete pipeline output
- **⭐ Quality Score**: Optional ModernBERT scoring

---

## 📚 Documentation

| Document | Description |
|----------|-------------|
| [`docs/PRD.md`](docs/PRD.md) | Product Requirements & Architecture Rationale |
| [`docs/ROADMAP.md`](docs/ROADMAP.md) | Implementation Timeline & Milestones |
| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Development Guidelines & Commit Conventions |
| [`demo/README.md`](demo/README.md) | Demo-specific Documentation |

---

## 💻 Development

### Data Structures

**Router Output:**
```python
@dataclass
class RouterDecision:
    backend: Backend          # MUPDF | PIPELINE | VLM | DEFERRED
    ocr_prob: float           # P(needs OCR) [0, 1]
    num_pages: int
    is_form: bool
    features: dict            # 124-dim feature vector
```

**Parser Output:**
```python
@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict
```
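To see how these structures fit together, here is a self-contained sketch that builds an `ExtractedDoc` from a list of segments. It is illustrative only: the `Segment` fields, the enum string values, and the `build_doc` helper are hypothetical, since the README lists just the `ExtractedDoc` and `RouterDecision` fields.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    # Names follow the README; the string values are assumptions.
    MUPDF = "mupdf"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


@dataclass(frozen=True)
class Segment:
    # Hypothetical fields; the real Segment schema lives in pdfsys-core.
    page: int
    text: str
    bbox: tuple[float, float, float, float]


@dataclass(frozen=True)
class ExtractedDoc:
    sha256: str
    backend: Backend
    segments: tuple[Segment, ...]
    markdown: str
    stats: dict


def build_doc(pdf_bytes: bytes, segments: list[Segment]) -> ExtractedDoc:
    """Assemble an immutable result keyed by the SHA-256 of the PDF bytes."""
    return ExtractedDoc(
        sha256=hashlib.sha256(pdf_bytes).hexdigest(),
        backend=Backend.MUPDF,
        segments=tuple(segments),  # a tuple keeps the container immutable
        markdown="\n\n".join(seg.text for seg in segments),
        stats={"num_segments": len(segments)},
    )
```

Because the dataclass is frozen, assigning to a field after construction raises `dataclasses.FrozenInstanceError`, which fits the stateless pipeline design described above. Note that the `stats` dict itself remains mutable.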

### CLI Reference

```bash
# Download router weights
python -m pdfsys_router.download_weights

# Run benchmark
python -m pdfsys_bench \
  --pdf-dir PATH \
  --out results.jsonl \
  --no-quality          # Skip quality scoring
```

---

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).

---

<p align="center">
  Built with ❤️ for the <a href="https://github.com/esbatmop/MNBVC">MNBVC</a> corpus project
</p>