jieluo1024 committed on
Commit ·
0bc9210
1
Parent(s): e88d082
docs: update README with demo link and documentation info
- Add Hugging Face Spaces demo link (https://huggingface.co/spaces/roger1024/DocPipe)
- Add demo features table and tech stack info
- Reorganize Quick start with online/local options
- Add document index table for better navigation
README.md
CHANGED
|
@@ -16,9 +16,11 @@ short_description: FinePDFs-style PDF pipeline demo for MNBVC
|
|
| 16 |
PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
|
| 17 |
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
|
| 18 |
|
| 19 |
-
> **Try it:**
|
| 20 |
-
>
|
| 21 |
-
>
|
| 22 |
|
| 23 |
## Current status: MVP closed loop ✅
|
| 24 |
|
|
@@ -26,19 +28,29 @@ The first end-to-end path — **Router → MuPDF parser → OCR quality scorer**
|
|
| 26 |
|
| 27 |
## Quick start
|
| 28 |
|
| 29 |
```bash
|
| 30 |
# 1. Install uv (>= 0.4)
|
| 31 |
curl -LsSf https://astral.sh/uv/install.sh | sh
|
| 32 |
|
| 33 |
# 2. Clone the repo and sync all workspace packages
|
| 34 |
-
git clone
|
| 35 |
cd pdfsystem_mnbvc
|
| 36 |
uv sync
|
| 37 |
|
| 38 |
# 3. Fetch the XGBoost router weights (257 KB, one-time)
|
| 39 |
python -m pdfsys_router.download_weights
|
| 40 |
|
| 41 |
-
# 4. Run
|
| 42 |
python -m pdfsys_bench \
|
| 43 |
--pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
|
| 44 |
--out out/bench_omnidoc100.jsonl \
|
|
@@ -236,12 +248,34 @@ The JSONL output (`--out`) has one JSON object per PDF:
|
|
| 236 |
|
| 237 |
A companion `.summary.json` file is also written with aggregate statistics.
|
| 238 |
|
| 239 |
-
##
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
-
|
| 245 |
|
| 246 |
## Collaborating with Cursor
|
| 247 |
|
|
|
|
| 16 |
PB-scale PDF → pretraining-data pipeline for the [MNBVC](https://github.com/esbatmop/MNBVC) corpus project.
|
| 17 |
FinePDFs-inspired architecture adapted for Chinese-heavy, mixed-quality input.
|
| 18 |
|
| 19 |
+
> **Try it:**
|
| 20 |
+
> - 🚀 **Online demo**: [Hugging Face Spaces](https://huggingface.co/spaces/roger1024/DocPipe) - upload a PDF to try the full pipeline
|
| 21 |
+
> - 💻 **Run locally**: `python app.py` - see [Quick start](#quick-start) below
|
| 22 |
+
>
|
| 23 |
+
> Deploying to Hugging Face Spaces takes one click; the YAML header is the entire configuration. See [`demo/README.md`](demo/README.md)
|
| 24 |
|
| 25 |
## Current status: MVP closed loop ✅
|
| 26 |
|
|
|
|
| 28 |
|
| 29 |
## Quick start
|
| 30 |
|
| 31 |
+
### Option 1: Try it online (fastest)
|
| 32 |
+
|
| 33 |
+
Open the [Hugging Face Spaces Demo](https://huggingface.co/spaces/roger1024/DocPipe) and upload a PDF to try it; no local setup is required.
|
| 34 |
+
|
| 35 |
+
### Option 2: Run locally
|
| 36 |
+
|
| 37 |
```bash
|
| 38 |
# 1. Install uv (>= 0.4)
|
| 39 |
curl -LsSf https://astral.sh/uv/install.sh | sh
|
| 40 |
|
| 41 |
# 2. Clone the repo and sync all workspace packages
|
| 42 |
+
git clone https://github.com/MIracleyin/pdfsystem_mnbvc.git
|
| 43 |
cd pdfsystem_mnbvc
|
| 44 |
uv sync
|
| 45 |
|
| 46 |
# 3. Fetch the XGBoost router weights (257 KB, one-time)
|
| 47 |
python -m pdfsys_router.download_weights
|
| 48 |
|
| 49 |
+
# 4. Run Gradio demo
|
| 50 |
+
python app.py
|
| 51 |
+
# Open http://localhost:7860
|
| 52 |
+
|
| 53 |
+
# 5. Or run the MVP closed loop on the bench dataset
|
| 54 |
python -m pdfsys_bench \
|
| 55 |
--pdf-dir packages/pdfsys-bench/omnidocbench_100/pdfs \
|
| 56 |
--out out/bench_omnidoc100.jsonl \
|
|
|
|
| 248 |
|
| 249 |
A companion `.summary.json` file is also written with aggregate statistics.
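
Since the `--out` file holds one JSON object per PDF, it can be inspected with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names `parse_jsonl`, `load_jsonl`, and `field_names` are hypothetical, and the record fields are not assumed, so the script only counts records and lists whichever top-level keys it finds.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path: str | Path) -> list[dict]:
    """Read a JSONL file from disk (hypothetical helper, not a pdfsys API)."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_jsonl(fh)


def field_names(records: list[dict]) -> list[str]:
    """Sorted union of the top-level keys seen across all records."""
    return sorted({key for rec in records for key in rec})


if __name__ == "__main__":
    # Path taken from the Quick start command; adjust to your own --out value.
    out_path = Path("out/bench_omnidoc100.jsonl")
    if out_path.exists():
        records = load_jsonl(out_path)
        print(f"{len(records)} PDFs processed")
        print("fields:", field_names(records))
```

Listing the keys first is a cheap way to discover the actual schema before writing any field-specific analysis.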
|
| 250 |
|
| 251 |
+
## Demo features
|
| 252 |
+
|
| 253 |
+
The online demo shows the complete MVP pipeline and includes the following features:
|
| 254 |
+
|
| 255 |
+
| Feature | Description |
|
| 256 |
+
|------|------|
|
| 257 |
+
| **PDF upload** | Drag and drop, or click, to upload a PDF file |
|
| 258 |
+
| **Routing decision** | Shows the XGBoost router's OCR probability and the selected backend in real time |
|
| 259 |
+
| **Page preview** | Renders the first page and overlays the bounding boxes of the extracted text blocks (color indicates the backend) |
|
| 260 |
+
| **Markdown output** | Text content extracted by PyMuPDF |
|
| 261 |
+
| **Segments table** | Detailed block-level extraction info (type, coordinates, character count, etc.) |
|
| 262 |
+
| **Router Features** | Shows a curated subset of the 124-dimensional router features |
|
| 263 |
+
| **Raw JSON** | Complete pipeline output data |
|
| 264 |
+
| **OCR quality score** | Optional ModernBERT quality scoring (off by default, takes about 3-5 seconds) |
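
The routing decision row pairs an OCR probability with a chosen backend. As a rough illustration of that idea only, here is a minimal sketch that assumes the backend is picked by thresholding the probability. The `route` function, the 0.5 threshold, and the backend labels `"ocr"` and `"mupdf"` are all hypothetical and are not taken from the `pdfsys_router` package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    ocr_probability: float  # router's estimate that the PDF needs OCR
    backend: str            # hypothetical label: "ocr" or "mupdf"


def route(ocr_probability: float, threshold: float = 0.5) -> RoutingDecision:
    """Pick a backend by thresholding the OCR probability (assumed rule)."""
    if not 0.0 <= ocr_probability <= 1.0:
        raise ValueError("ocr_probability must be within [0, 1]")
    backend = "ocr" if ocr_probability >= threshold else "mupdf"
    return RoutingDecision(ocr_probability, backend)


if __name__ == "__main__":
    for p in (0.08, 0.5, 0.93):
        print(route(p))
```

A real router may use a different cutoff or more than two backends; the sketch only shows how a probability and a backend label can travel together.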
|
| 265 |
+
|
| 266 |
+
### Demo tech stack
|
| 267 |
+
- **Frontend**: Gradio 6.12.0
|
| 268 |
+
- **Backend**: Python 3.11 + PyMuPDF + XGBoost
|
| 269 |
+
- **Deployment**: Hugging Face Spaces (CPU)
|
| 270 |
+
|
| 271 |
+
## Documentation index
|
| 272 |
+
|
| 273 |
+
| Document | Contents |
|
| 274 |
+
|------|------|
|
| 275 |
+
| [`docs/PRD.md`](docs/PRD.md) | Full product requirements document, including the resource budget and architecture rationale |
|
| 276 |
+
| [`docs/ROADMAP.md`](docs/ROADMAP.md) | Prioritized implementation plan, effort estimates, and acceptance criteria |
|
| 277 |
+
| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Naming conventions, consistency rules, commit format |
|
| 278 |
+
| [`demo/README.md`](demo/README.md) | Gradio demo details and Hugging Face Spaces deployment guide |
|
| 279 |
|
| 280 |
## Collaborating with Cursor
|
| 281 |
|