jieluo1024 committed on
Commit 0bc9210 · 1 Parent(s): e88d082

docs: update README with demo link and documentation info

- Add Hugging Face Spaces demo link (https://huggingface.co/spaces/roger1024/DocPipe)
- Add demo features table and tech stack info
- Reorganize Quick start with online/local options
- Add document index table for better navigation

Files changed (1)
  1. README.md +45 -11
README.md CHANGED
@@ -16,9 +16,11 @@ short_description: FinePDFs-style PDF pipeline demo for MNBVC
  PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
  FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.

- > **Try it:** `python app.py` locally, or deploy to Hugging Face Spaces with one click
- > the YAML header above is all the Space config needed. See [`demo/README.md`](demo/README.md)
- > for both paths.
+ > **Try it:**
+ > - 🚀 **Online demo**: [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) - upload a PDF to try the full pipeline
+ > - 💻 **Run locally**: `python app.py` - see [Quick start](#quick-start) below
+ >
+ > Deploying to Hugging Face Spaces takes one click; the YAML header is all the configuration needed. See [`demo/README.md`](demo/README.md)

  ## Current status: MVP closed loop ✅

@@ -26,19 +28,29 @@ The first end-to-end path — **Router → MuPDF parser → OCR quality scorer**

  ## Quick start

+ ### Option 1: Try it online (fastest)
+
+ Visit the [Hugging Face Spaces Demo](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF to try it; no local setup is required.
+
+ ### Option 2: Run locally
+
  ```bash
  # 1. Install uv (>= 0.4)
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # 2. Clone the repo and sync all workspace packages
- git clone <this-repo-url>
+ git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
  cd pdfsystem_mnbvc
  uv sync

  # 3. Fetch the XGBoost router weights (257 KB, one-time)
  python -m pdfsys_router.download_weights

- # 4. Run the MVP closed loop on the bench dataset
+ # 4. Run the Gradio demo
+ python app.py
+ # Open http://localhost:7860
+
+ # 5. Or run the MVP closed loop on the bench dataset
  python -m pdfsys_bench \
  --pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
  --out out/bench_omnidoc100.jsonl \
@@ -236,12 +248,34 @@ The JSONL output (`--out`) has one JSON object per PDF:

  A companion `.summary.json` file is also written with aggregate statistics.

- ## Docs
-
- - [`docs/PRD.md`](docs/PRD.md) full PRD with resource budgets and architectural rationale (the "what & why").
- - [`docs/ROADMAP.md`](docs/ROADMAP.md) — prioritised implementation plan with work-estimates and acceptance criteria (the "how & when").
- - [`CONTRIBUTING.md`](CONTRIBUTING.md) naming, parity rules, commit scopes.
- - [`demo/README.md`](demo/README.md) — Gradio demo + Hugging Face Spaces deploy guide.
+ ## Demo features
+
+ The online demo shows the complete MVP pipeline and includes the following features:
+
+ | Feature | Description |
+ |------|------|
+ | **PDF upload** | Drag and drop or click to upload a PDF file |
+ | **Routing decision** | Shows the XGBoost router's OCR probability and the selected backend in real time |
+ | **Page preview** | Renders the first page with the bounding boxes of the extracted text blocks overlaid (color indicates the backend) |
+ | **Markdown output** | Text content extracted by PyMuPDF |
+ | **Segments table** | Detailed block-level extraction info (type, coordinates, character count, etc.) |
+ | **Router Features** | Shows a curated subset of the 124-dimensional router features |
+ | **Raw JSON** | Complete pipeline output data |
+ | **OCR quality score** | Optional ModernBERT quality scoring (off by default, takes about 3-5 seconds) |
+
+ ### Demo tech stack
+ - **Frontend**: Gradio 6.12.0
+ - **Backend**: Python 3.11 + PyMuPDF + XGBoost
+ - **Deployment**: Hugging Face Spaces (CPU)
+
+ ## Documentation index
+
+ | Document | Contents |
+ |------|------|
+ | [`docs/PRD.md`](docs/PRD.md) | Full product requirements document, including resource budgets and architectural rationale |
+ | [`docs/ROADMAP.md`](docs/ROADMAP.md) | Prioritised implementation plan with work estimates and acceptance criteria |
+ | [`CONTRIBUTING.md`](CONTRIBUTING.md) | Naming conventions, parity rules, commit format |
+ | [`demo/README.md`](demo/README.md) | Gradio demo details + Hugging Face Spaces deployment guide |

  ## Collaborating with Cursor

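
The README context in this diff notes that the `--out` file holds one JSON object per PDF. As a rough illustration only, such a file could be read as sketched below. The helper name `load_jsonl` is hypothetical and not part of the repository, and the record fields are not shown in this diff, so the sketch only prints whatever keys it finds.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one dict per non-empty line of a JSONL file (hypothetical helper)."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report which line failed so a truncated run is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON: {exc}") from exc
    return records


if __name__ == "__main__":
    # Path taken from the Quick start command shown in the diff.
    bench_path = Path("out/bench_omnidoc100.jsonl")
    if bench_path.exists():
        rows = load_jsonl(bench_path)
        print(f"{len(rows)} PDFs in {bench_path}")
        if rows:
            # Field names are not listed in this diff, so only the keys are shown.
            print("keys in first record:", sorted(rows[0]))
```

Reading line by line keeps the helper usable on large bench runs and makes a malformed record fail with its line number rather than silently dropping it.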