jieluo1024 committed on
Commit
c540108
·
1 Parent(s): 0bc9210

docs: professional redesign of README.md

- Add badges and shields for professional appearance
- Restructure content with clear hierarchy
- Add feature table and quick links section
- Improve architecture diagram formatting
- Add centered header and footer
- Update HF Spaces YAML with better title/description

Files changed (1)
  1. README.md +162 -257
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
- title: PDFSystem MNBVC Demo
3
- emoji: 📄
4
  colorFrom: green
5
  colorTo: purple
6
  sdk: gradio
@@ -8,329 +8,234 @@ sdk_version: 6.12.0
8
  app_file: app.py
9
  pinned: false
10
  license: apache-2.0
11
- short_description: FinePDFs-style PDF pipeline demo for MNBVC
12
  ---
13
 
14
- # pdfsys-mnbvc
15
 
16
- PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
17
- FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
 
18
 
19
- > **Try it:**
20
- > - 🚀 **Online demo**: [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) - upload a PDF to try the full pipeline
21
- > - 💻 **Run locally**: `python app.py` - see [Quick start](#quick-start) below
22
- >
23
- > Deploying to Hugging Face Spaces takes one click; the YAML header is the entire configuration. See [`demo/README.md`](demo/README.md)
24
 
25
- ## Current status: MVP closed loop ✅
26
 
27
- The first end-to-end path — **Router → MuPDF parser → OCR quality scorer** — is working on the OmniDocBench-100 evaluation set. PDFs that need OCR are routed to `PIPELINE` but not yet extracted (that backend is not implemented yet).
28
 
29
- ## Quick start
30
 
31
- ### Option 1: Online demo (fastest)
32
 
33
- Visit the [Hugging Face Spaces Demo](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF to try it out; no local setup is required.
34
 
35
- ### Option 2: Run locally
 
 
36
 
37
  ```bash
38
- # 1. Install uv (>= 0.4)
39
  curl -LsSf https://astral.sh/uv/install.sh | sh
40
 
41
- # 2. Clone the repo and sync all workspace packages
42
  git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
43
  cd pdfsystem_mnbvc
44
  uv sync
45
 
46
- # 3. Fetch the XGBoost router weights (257 KB, one-time)
47
  python -m pdfsys_router.download_weights
48
 
49
- # 4. Run Gradio demo
50
  python app.py
51
- # Open http://localhost:7860
 
52
 
53
- # 5. Or run the MVP closed loop on the bench dataset
 
 
54
  python -m pdfsys_bench \
55
- --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
56
- --out out/bench_omnidoc100.jsonl \
57
- --markdown-dir out/bench_omnidoc100_md
58
  ```
59
 
60
- > **Note:** The first run downloads the ModernBERT-large quality scorer
61
- > (~800 MB) from HuggingFace Hub. Set `HF_HOME` to control where it's
62
- > cached. If you don't need quality scoring, add `--no-quality` to skip it.
63
 
64
- > **Note:** The bench dataset (omnidocbench_100) is NOT committed to the repo.
65
- > You need to obtain it separately and place it under
66
- > `packages/pdfsys-bench/omnidocbench_100/`.
67
 
68
- ## Architecture
69
 
70
- ```
71
- ┌──────────────┐
72
- PDF ──► │ pdfsys-router│ stage A: XGBoost (124 PyMuPDF features)
73
- └──────┬───────┘
74
-
75
- text-ok ◄──┴──► needs-ocr
76
- │ │
77
- ▼ ▼
78
- parser-mupdf pdfsys-layout-analyser (runs once, caches LayoutDocument)
79
-
80
-
81
- stage B decision
82
-
83
- no-complex ◄───┴───► complex (tables / formulas)
84
- │ │
85
- ▼ ▼
86
- parser-pipeline parser-vlm
87
- ```
88
 
89
- ### What's implemented
90
 
91
- | Stage | Status | Description |
92
- |-------|--------|-------------|
93
- | **Stage-A router** | ✅ | XGBoost binary classifier, ported from FinePDFs. 124 features (4 doc-level + 15 page-level × 8 sampled pages). Routes to `MUPDF` (text-ok) or `PIPELINE` (needs-ocr). |
94
- | **MuPDF parser** | ✅ | `page.get_text("blocks", sort=True)` → `ExtractedDoc` with normalized bbox and merged Markdown. Fast path for clean-text PDFs. |
95
- | **OCR quality scorer** | ✅ | ModernBERT-large regression head (`HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn`). Scores extracted text on a [0, 3] scale. |
96
- | **Bench CLI** | ✅ | `python -m pdfsys_bench` — drives the full loop, emits per-doc JSONL + summary JSON. |
97
- | Stage-B router | ❌ | Pending layout-analyser and LayoutCache integration. |
98
- | Layout analyser | ❌ | PP-DocLayoutV3 / docling-layout-heron runner — not started. |
99
- | Pipeline parser | ❌ | Region-level OCR (RapidOCR / PaddleOCR) — not started. |
100
- | VLM parser | ❌ | MinerU 2.5 / PaddleOCR-VL on complex regions — not started. |
101
 
102
- ### MVP benchmark results (OmniDocBench-100)
103
 
104
  ```
105
- Backend split: mupdf=70 pipeline=30
106
- Avg ocr_prob: mupdf=0.034 pipeline=0.634
107
- Extracted: 70 Errors: 0
108
- Quality: avg=1.71 min=0.39 max=2.73
109
- Per-doc time: router=49ms extract=7ms quality=3.6s
110
  ```
111
 
112
- ## Workspace packages
113
 
114
- | Package | Role | Dependencies |
115
- |---------|------|-------------|
116
- | `pdfsys-core` | Shared dataclasses, enums, layout cache, serde. No PDF/ML deps. | stdlib only |
117
- | `pdfsys-router` | Stage-A XGBoost classifier + Stage-B layout decision (stub). | pymupdf, xgboost, pandas, numpy, scikit-learn |
118
- | `pdfsys-layout-analyser` | Page layout model runner. Stub only. | — |
119
- | `pdfsys-parser-mupdf` | Text-ok backend: PyMuPDF block extraction → Markdown. | pymupdf |
120
- | `pdfsys-parser-pipeline` | OCR backend for simple layouts. Stub only. | — |
121
- | `pdfsys-parser-vlm` | VLM backend for complex layouts. Stub only. | — |
122
- | `pdfsys-bench` | Closed-loop evaluation harness + quality scorer. | torch, transformers, pdfsys-router, pdfsys-parser-mupdf |
123
 
124
- ### Package dependency graph
125
 
126
- ```
127
- pdfsys-core ◄── pdfsys-router
128
- ◄── pdfsys-parser-mupdf
129
- ◄── pdfsys-parser-pipeline (stub)
130
- ◄── pdfsys-parser-vlm (stub)
131
- ◄── pdfsys-layout-analyser (stub)
132
-
133
- pdfsys-router ◄── pdfsys-bench
134
- pdfsys-parser-mupdf ◄── pdfsys-bench
135
- ```
136
 
137
- `pdfsys-core` is the root dependency: every other package imports it, and it has zero external deps beyond the Python stdlib.
138
 
139
- ## Key data structures
140
 
141
- ### Router output (`RouterDecision`)
142
 
143
  ```python
144
  @dataclass
145
  class RouterDecision:
146
      backend: Backend  # MUPDF | PIPELINE | VLM | DEFERRED
147
-     ocr_prob: float  # P(needs OCR) from XGBoost, [0, 1]
148
      num_pages: int
149
      is_form: bool
150
-     garbled_text_ratio: float
151
-     is_encrypted: bool
152
-     needs_password: bool
153
-     features: dict  # full 124-feature vector for debugging
154
-     error: str | None
155
  ```
156
 
157
- ### Parser output (`ExtractedDoc`)
158
-
159
  ```python
160
  @dataclass(frozen=True)
161
  class ExtractedDoc:
162
      sha256: str
163
      backend: Backend
164
-     segments: tuple[Segment, ...]  # ordered block-level units
165
-     markdown: str  # segments merged with \n\n
166
      stats: dict
167
  ```
168
 
169
- Each `Segment` carries `page_index`, `RegionType` (TEXT/IMAGE/TABLE/FORMULA), `content` (Markdown / HTML / LaTeX), and a normalized `BBox` in [0, 1].
170
-
171
- ### Quality score
172
-
173
- ```python
174
- @dataclass
175
- class QualityScore:
176
-     score: float  # [0, 3]: 0=garbage, 1=format issues, 2=minor, 3=clean
177
-     num_chars: int
178
-     num_tokens: int
179
-     model: str
180
- ```
181
-
182
- ## Design principles
183
-
184
- 1. **Stateless processing.** No manifest, no central DB. Every PDF produces self-contained output. Following FinePDFs' datatrove-style design.
185
- 2. **Content-addressable caching.** LayoutCache shards by `sha256 + model_tag`. Bumping the model tag lazily invalidates old entries.
186
- 3. **Atomic writes.** All file outputs use `tmp + os.replace()` for crash safety.
187
- 4. **Normalized coordinates.** BBox is always `[0, 1]` with origin top-left; backends convert to pixels/points on demand.
188
- 5. **Backend-agnostic output.** All three parser backends emit the same `ExtractedDoc` / `Segment` schema, so downstream stages don't need to know which backend produced a document.
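
Principles 2 and 3 can be illustrated with a short sketch. It is not taken from the repository: `cache_path`, `atomic_write_text`, the on-disk layout, and the `layout-v1` tag used in the usage note are hypothetical names chosen for illustration. Only the `sha256 + model_tag` key and the `tmp + os.replace()` pattern come from the list above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_path(root: Path, pdf_bytes: bytes, model_tag: str) -> Path:
    """Build a content-addressed path from the PDF hash and a model tag.

    The directory layout (tag / two-character shard / digest) is an assumption.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return root / model_tag / digest[:2] / f"{digest}.json"


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the target directory, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem so os.replace is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file so a crash leaves no debris behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Calling `atomic_write_text(cache_path(root, pdf_bytes, "layout-v1"), payload)` leaves either the complete file or nothing at the final path. Changing the tag yields a different path, so entries written under the old tag are simply never read again.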
189
-
190
- ## CLI reference
191
-
192
- ### `python -m pdfsys_bench`
193
-
194
- ```
195
- usage: pdfsys-bench [-h] --pdf-dir PDF_DIR --out OUT [--limit N]
196
- [--no-quality] [--quality-model MODEL]
197
- [--router-weights PATH] [--markdown-dir DIR]
198
- [--ocr-threshold FLOAT]
199
-
200
- Run the MVP pdfsys closed loop.
201
-
202
- options:
203
- --pdf-dir PATH Directory of PDFs to process (recursive).
204
- --out PATH Output JSONL path (one line per PDF).
205
- --limit N Cap the number of PDFs processed.
206
- --no-quality Skip the ModernBERT quality scorer.
207
- --quality-model ID HuggingFace model for quality scoring.
208
- --router-weights P Path to xgb_classifier.ubj.
209
- --markdown-dir DIR Dump per-PDF extracted markdown here.
210
- --ocr-threshold F P(ocr) threshold (default: 0.5).
211
- ```
212
-
213
- ### `python -m pdfsys_router.download_weights`
214
-
215
- Downloads the XGBoost router weights (~257 KB) from the FinePDFs Git LFS.
216
 
217
  ```bash
218
- python -m pdfsys_router.download_weights # first time
219
- python -m pdfsys_router.download_weights --force # re-download
220
- ```
221
-
222
- ## Output format
223
-
224
- The JSONL output (`--out`) has one JSON object per PDF:
225
-
226
- ```json
227
- {
228
- "pdf_path": "packages/pdfsys-bench/omnidocbench_100/pdfs/example.pdf",
229
- "sha256": "a53b50cb0d3d...",
230
- "backend": "mupdf",
231
- "ocr_prob": 0.003,
232
- "num_pages": 1,
233
- "is_form": false,
234
- "garbled_text_ratio": 0.0,
235
- "router_error": null,
236
- "extract_stats": {"page_count": 1, "pages_extracted": 1, "segment_count": 5, "char_count": 5734},
237
- "extract_error": null,
238
- "quality_score": 2.45,
239
- "quality_num_chars": 5734,
240
- "quality_num_tokens": 512,
241
- "quality_model": "HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn",
242
- "markdown_chars": 5734,
243
- "wall_ms_router": 42.1,
244
- "wall_ms_extract": 6.3,
245
- "wall_ms_quality": 3421.0
246
- }
247
- ```
248
-
249
- A companion `.summary.json` file is also written with aggregate statistics.
250
-
251
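- ## Demo features
252
-
253
- The online demo shows the complete MVP flow and includes the following features:
254
-
255
- | Feature | Description |
256
- |------|------|
257
- | **PDF upload** | Drag and drop or click to upload a PDF file |
258
- | **Routing decision** | Shows the XGBoost router's OCR probability and the selected backend in real time |
259
- | **Page preview** | Renders the first page with the bounding boxes of the extracted text blocks overlaid (color indicates the backend) |
260
- | **Markdown output** | Text content extracted by PyMuPDF |
261
- | **Segments table** | Detailed block-level extraction info (type, coordinates, character count, etc.) |
262
- | **Router Features** | A curated subset of the 124-dimensional feature vector |
263
- | **Raw JSON** | Complete pipeline output data |
264
- | **OCR quality score** | Optional ModernBERT quality scoring (off by default, about 3-5 seconds) |
265
-
266
- ### Demo tech stack
267
- - **Frontend**: Gradio 6.12.0
268
- - **Backend**: Python 3.11 + PyMuPDF + XGBoost
269
- - **Deployment**: Hugging Face Spaces (CPU)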
270
-
271
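- ## Documentation index
272
-
273
- | Document | Contents |
274
- |------|------|
275
- | [`docs/PRD.md`](docs/PRD.md) | Full product requirements document, including resource budget and architecture rationale |
276
- | [`docs/ROADMAP.md`](docs/ROADMAP.md) | Prioritized implementation plan, effort estimates, and acceptance criteria |
277
- | [`CONTRIBUTING.md`](CONTRIBUTING.md) | Naming conventions, consistency rules, commit format |
278
- | [`demo/README.md`](demo/README.md) | Gradio demo details + Hugging Face Spaces deployment guide |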
279
-
280
- ## Collaborating with Cursor
281
-
282
- This repo ships a full set of [Cursor project rules](https://docs.cursor.com/context/rules) under `.cursor/rules/`. They give the AI agent the same mental model senior contributors have — including the non-obvious bits (FinePDFs feature parity, `pdfsys-core` zero-dep rule, Gradio UI/logic separation) that a new collaborator would otherwise step on.
283
-
284
- ### Quick start
285
 
286
- ```bash
287
- # One-shot bootstrap: checks python/uv, syncs workspace, downloads router weights.
288
- bash scripts/setup_cursor.sh
 
 
289
  ```
290
 
291
- Then open the repo in Cursor (≥ 0.50, which supports `.cursor/rules/*.mdc`). The always-on rules activate immediately; file-specific rules attach as you open matching files.
292
-
293
- ### Active rules
294
-
295
- | Rule | Scope | What it enforces |
296
- |------|-------|------------------|
297
- | `00-project-context.mdc` | always | Project goals, tech stack, must-read docs, explicit non-goals. |
298
- | `01-architecture-invariants.mdc` | always | 7 load-bearing invariants (zero-dep core, stateless processing, normalized bbox, etc.). |
299
- | `02-commit-workflow.mdc` | always | Conventional commits with package-scoped names; pre-commit checklist. |
300
- | `03-doc-sync.mdc` | always | Doc-sync mapping table: which code change forces which doc update. Cursor proactively scans after edits. |
301
- | `10-python-standards.mdc` | `**/*.py` | Type hints, frozen dataclass, lazy imports for heavy deps. |
302
- | `20-core-contracts.mdc` | `packages/pdfsys-core/**` | Zero external deps; no I/O; schema change ripple rules. |
303
- | `21-router-parity.mdc` | `packages/pdfsys-router/**` | FinePDFs 124-feature parity is sacred; how to verify. |
304
- | `22-parser-backends.mdc` | `packages/pdfsys-parser-*/**` | All three backends must emit identical `ExtractedDoc`. |
305
- | `23-bench-scorer.mdc` | `packages/pdfsys-bench/**` | torch/transformers lazy load; bf16 default; loop never raises. |
306
- | `30-gradio-demo.mdc` | `demo/**,app.py` | UI layer has no business logic; callbacks never raise; lazy singletons. |
307
-
308
- ### Recommended Cursor workflow
309
-
310
- 1. **Before touching `pdfsys-core`** — read `20-core-contracts.mdc`. The AI will refuse to add third-party deps here and surface schema-ripple questions.
311
- 2. **Before touching `feature_extractor.py`** — `21-router-parity.mdc` kicks in; the AI will suggest running the parity check before you commit.
312
- 3. **When building a new parser backend** — `22-parser-backends.mdc` walks through the 6-step addition procedure and refuses partial impls.
313
- 4. **When writing demo UI** — `30-gradio-demo.mdc` rejects `import pymupdf` in `demo/app.py` (belongs in `demo/pipeline.py`).
314
 
315
- ### Authoring new rules
316
 
317
- Rules live in `.cursor/rules/*.mdc`. Format:
318
 
319
- ```yaml
320
  ---
321
- description: Short description shown in the rule picker
322
- globs: packages/<pkg>/**/*.py # omit for always-on rules
323
- alwaysApply: false # true = always loaded
324
- ---
325
-
326
- # Rule Title
327
 
328
- - Bullet rule 1 (with ✅/❌ example)
329
- - Bullet rule 2
330
- ```
331
 
332
- Keep each rule under 100 lines, one concern per file. See existing rules for patterns.
333
 
334
- ## License
335
 
336
- Apache-2.0
 
 
 
1
  ---
2
+ title: "PDFSystem: PB-Scale PDF Processing Pipeline"
3
+ emoji: 🚀
4
  colorFrom: green
5
  colorTo: purple
6
  sdk: gradio
 
8
  app_file: app.py
9
  pinned: false
10
  license: apache-2.0
11
+ short_description: "PDF to Markdown pipeline with ML-powered routing"
12
  ---
13
 
14
+ # PDFSystem for MNBVC
15
+
16
+ <p align="center">
17
+ <strong>PB-scale PDF → Pretraining Data Pipeline</strong><br>
18
+ <em>FinePDFs-inspired architecture for Chinese-heavy, mixed-quality PDFs</em>
19
+ </p>
20
+
21
+ <p align="center">
22
+ <a href="https://huggingface.co/spaces/roger1024/DocPipe">
23
+ <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow" alt="Hugging Face Spaces">
24
+ </a>
25
+ <a href="https://github.com/MIracleyin/pdfsystem_mnbvc">
26
+ <img src="https://img.shields.io/badge/GitHub-Repository-blue?logo=github" alt="GitHub">
27
+ </a>
28
+ <img src="https://img.shields.io/badge/Python-3.11-blue?logo=python" alt="Python 3.11">
29
+ <img src="https://img.shields.io/badge/Gradio-6.12.0-green" alt="Gradio">
30
+ <img src="https://img.shields.io/badge/License-Apache%202.0-orange" alt="License">
31
+ </p>
32
 
33
+ ---
34
+
35
+ ## 🚀 Quick Links
36
 
37
+ | Platform | Link | Description |
38
+ |----------|------|-------------|
39
+ | **Live Demo** | [🤗 HF Spaces](https://huggingface.co/spaces/roger1024/DocPipe) | Upload PDF and try the pipeline instantly |
40
+ | **Source Code** | [GitHub](https://github.com/MIracleyin/pdfsystem_mnbvc) | Full source code and documentation |
 
41
 
42
+ ---
43
 
44
+ ## Features
45
 
46
+ - **🧠 ML-Powered Routing**: XGBoost classifier (124 features) routes PDFs to optimal backend
47
+ - **⚡ Fast Path**: PyMuPDF extraction for text-ok documents (~10ms/page)
48
+ - **📊 Quality Scoring**: ModernBERT-large OCR quality assessment [0-3 scale]
49
+ - **🔍 Visual Debug**: Page preview with extracted bbox overlays
50
+ - **📦 Modular Design**: Stateless, backend-agnostic pipeline components
51
+
52
+ ---
53
+
54
+ ## 🎯 Current Status
55
+
56
+ | Component | Status | Description |
57
+ |-----------|--------|-------------|
58
+ | **Stage-A Router** | ✅ Ready | XGBoost binary classifier with 124 PyMuPDF features |
59
+ | **MuPDF Parser** | ✅ Ready | Fast extraction for clean-text PDFs |
60
+ | **OCR Quality Scorer** | ✅ Ready | ModernBERT-large regression model |
61
+ | **Stage-B Router** | 🚧 Planned | Layout-based complexity routing |
62
+ | **Pipeline Parser** | 🚧 Planned | Region-level OCR for simple layouts |
63
+ | **VLM Parser** | 🚧 Planned | Vision-Language model for complex layouts |
64
+
65
+ ---
66
 
67
+ ## 🏃 Quick Start
68
 
69
+ ### Option 1: Online Demo (Fastest)
70
 
71
+ Visit [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF — no installation required.
72
+
73
+ ### Option 2: Local Development
74
 
75
  ```bash
76
+ # 1. Install uv package manager
77
  curl -LsSf https://astral.sh/uv/install.sh | sh
78
 
79
+ # 2. Clone and setup
80
  git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
81
  cd pdfsystem_mnbvc
82
  uv sync
83
 
84
+ # 3. Download router weights (257 KB, one-time)
85
  python -m pdfsys_router.download_weights
86
 
87
+ # 4. Run interactive demo
88
  python app.py
89
+ # Open http://localhost:7860
90
+ ```
91
 
92
+ ### Option 3: Batch Processing
93
+
94
+ ```bash
95
  python -m pdfsys_bench \
96
+ --pdf-dir /path/to/pdfs \
97
+ --out results.jsonl \
98
+ --markdown-dir ./extracted
99
  ```
100
 
101
+ ---
 
 
102
 
103
+ ## 🏗️ Architecture
104
+
105
+ ```
106
+                             ┌─────────────────┐
107
+             PDF Input ───►  │ Stage-A Router  │  XGBoost (124 features)
108
+                             │  (Implemented)  │  ~10ms per PDF
109
+                             └────────┬────────┘
110
+                                      │ ocr_prob
111
+                    ┌─────────────────┼─────────────────┐
112
+                    ▼                 ▼                 ▼
113
+               ┌─────────┐      ┌──────────┐       ┌─────────┐
114
+               │  MUPDF  │      │ PIPELINE │       │   VLM   │
115
+               │ (Fast)  │      │  (OCR)   │       │(Complex)│
116
+               └────┬────┘      └──────────┘       └─────────┘
117
+
118
+
119
+ ┌─────────────────────────────────────┐
120
+ │  ExtractedDoc: Markdown + Segments  │
121
+ └─────────────────────────────────────┘
122
+
123
+
124
+ ┌─────────────────────────────────────┐
125
+ │  Quality Scorer (ModernBERT-large)  │
126
+ │            Score: [0, 3]            │
127
+ └─────────────────────────────────────┘
128
+ ```
129
 
130
+ ---
131
 
132
+ ## 📦 Workspace Packages
133
+
134
+ | Package | Purpose | Dependencies |
135
+ |---------|---------|--------------|
136
+ | `pdfsys-core` | Shared types, schemas, layout cache | stdlib only |
137
+ | `pdfsys-router` | Stage-A/Stage-B routing decisions | pymupdf, xgboost, pandas, sklearn |
138
+ | `pdfsys-parser-mupdf` | Fast PyMuPDF extraction | pymupdf |
139
+ | `pdfsys-bench` | Evaluation harness + quality scorer | torch, transformers |
140
+ | `pdfsys-layout-analyser` | Layout model runner | 🚧 Planned |
141
+ | `pdfsys-parser-pipeline` | OCR backend | 🚧 Planned |
142
+ | `pdfsys-parser-vlm` | VLM backend | 🚧 Planned |
143
 
144
+ ---
145
 
146
+ ## 📊 Benchmark Results
147
 
148
+ **OmniDocBench-100 Dataset:**
149
 
150
  ```
151
+ Backend Split: mupdf=70 pipeline=30
152
+ Avg OCR Prob: mupdf=0.034 pipeline=0.634
153
+ Extraction: 70 success 0 errors
154
+ Quality Score: avg=1.71 min=0.39 max=2.73
155
+ Timing: router=49ms extract=7ms quality=3.6s
156
  ```
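
A summary in this shape could be derived from the per-PDF JSONL records roughly as follows. This is a sketch, not the project's own summary code: `summarize` is a hypothetical helper, and the field names (`backend`, `ocr_prob`, `quality_score`, `extract_error`) are assumed to match the per-PDF record described in the previous README's output-format section.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(jsonl_text: str) -> dict:
    """Aggregate per-PDF records into a small summary dict (hypothetical helper)."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

    # Group router probabilities by the backend each PDF was routed to.
    ocr_by_backend = defaultdict(list)
    for rec in records:
        ocr_by_backend[rec["backend"]].append(rec["ocr_prob"])

    # Quality scores may be missing when scoring is skipped or nothing was extracted.
    scores = [rec["quality_score"] for rec in records if rec.get("quality_score") is not None]

    return {
        "backend_split": {name: len(vals) for name, vals in ocr_by_backend.items()},
        "avg_ocr_prob": {name: mean(vals) for name, vals in ocr_by_backend.items()},
        "errors": sum(1 for rec in records if rec.get("extract_error")),
        "quality": (
            {"avg": mean(scores), "min": min(scores), "max": max(scores)} if scores else None
        ),
    }
```

Passing the contents of the `--out` file to `summarize` gives the backend split, the per-backend average OCR probability, the error count, and the quality range in one dictionary.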
157
 
158
+ ---
159
 
160
+ ## 🎨 Demo Interface
161
 
162
+ The Gradio demo provides:
163
 
164
+ - **📤 PDF Upload**: Drag-and-drop or click to upload
165
+ - **📈 Routing Info**: OCR probability, selected backend, page count
166
+ - **🖼️ Page Preview**: First page with colored bbox overlays
167
+ - **📝 Markdown Output**: Extracted text content
168
+ - **📋 Segment Table**: Block-level extraction details
169
+ - **🔧 Feature View**: Selected router features
170
+ - **📄 Raw JSON**: Complete pipeline output
171
+ - **⭐ Quality Score**: Optional ModernBERT scoring
172
+
173
+ ---
174
 
175
+ ## 📚 Documentation
176
 
177
+ | Document | Description |
178
+ |----------|-------------|
179
+ | [`docs/PRD.md`](docs/PRD.md) | Product Requirements & Architecture Rationale |
180
+ | [`docs/ROADMAP.md`](docs/ROADMAP.md) | Implementation Timeline & Milestones |
181
+ | [`CONTRIBUTING.md`](CONTRIBUTING.md) | Development Guidelines & Commit Conventions |
182
+ | [`demo/README.md`](demo/README.md) | Demo-specific Documentation |
183
 
184
+ ---
185
 
186
+ ## 💻 Development
187
+
188
+ ### Data Structures
189
+
190
+ **Router Output:**
191
  ```python
192
  @dataclass
193
  class RouterDecision:
194
      backend: Backend  # MUPDF | PIPELINE | VLM | DEFERRED
195
+     ocr_prob: float  # P(needs OCR) [0, 1]
196
      num_pages: int
197
      is_form: bool
198
+     features: dict  # 124-dim feature vector
199
  ```
200
 
201
+ **Parser Output:**
 
202
  ```python
203
  @dataclass(frozen=True)
204
  class ExtractedDoc:
205
      sha256: str
206
      backend: Backend
207
+     segments: tuple[Segment, ...]
208
+     markdown: str
209
      stats: dict
210
  ```
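
To make the relationship between `ocr_prob` and `backend` concrete, here is a minimal sketch of a threshold-based Stage-A decision. It is illustrative only: `decide` is a hypothetical helper, the enum string values other than `"mupdf"` are guesses, and treating a probability equal to the threshold as "needs OCR" is an assumption. The 0.5 default mirrors the `--ocr-threshold` default listed in the previous README's CLI reference.

```python
from dataclasses import dataclass, field
from enum import Enum


class Backend(Enum):
    MUPDF = "mupdf"
    PIPELINE = "pipeline"
    VLM = "vlm"
    DEFERRED = "deferred"


@dataclass
class RouterDecision:
    backend: Backend
    ocr_prob: float  # P(needs OCR), in [0, 1]
    num_pages: int
    is_form: bool
    features: dict = field(default_factory=dict)


def decide(ocr_prob: float, num_pages: int, is_form: bool = False,
           threshold: float = 0.5) -> RouterDecision:
    """Pick MUPDF below the threshold and PIPELINE at or above it (assumed rule)."""
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError("ocr_prob must be in [0, 1]")
    backend = Backend.PIPELINE if ocr_prob >= threshold else Backend.MUPDF
    return RouterDecision(backend, ocr_prob, num_pages, is_form)
```

With the benchmark averages reported above, `decide(0.034, num_pages=1)` would select `Backend.MUPDF` and `decide(0.634, num_pages=1)` would select `Backend.PIPELINE`.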
211
 
212
+ ### CLI Reference
213
 
214
  ```bash
215
+ # Download router weights
216
+ python -m pdfsys_router.download_weights
217
 
218
+ # Run benchmark
219
+ python -m pdfsys_bench \
220
+ --pdf-dir PATH \
221
+ --out results.jsonl \
222
+ --no-quality # Skip quality scoring
223
  ```
224
 
225
+ ---
226
 
227
+ ## 🤝 Contributing
228
 
229
+ We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
230
 
 
231
  ---
232
 
233
+ ## 📄 License
 
 
234
 
235
+ This project is licensed under the [Apache License 2.0](LICENSE).
236
 
237
+ ---
238
 
239
+ <p align="center">
240
+ Built with ❤️ for the <a href="https://github.com/esbatmop/MNBVC">MNBVC</a> corpus project
241
+ </p>