sidoutcome committed on
Commit 6162371 · 0 parent(s)

feat: support both API_TOKEN and API_DEV_TOKEN

Files changed (6)
  1. .gitignore +37 -0
  2. CLAUDE.md +238 -0
  3. Dockerfile +118 -0
  4. README.md +537 -0
  5. app.py +1226 -0
  6. requirements.txt +19 -0
.gitignore ADDED
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
.venv/
*.egg-info/
dist/
build/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Testing
.pytest_cache/
.coverage
htmlcov/

# Temp files
*.tmp
*.temp
temp/
tmp/

# Model cache
.cache/
models/

# OS
.DS_Store
Thumbs.db
CLAUDE.md ADDED
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MD Parser - A Hugging Face Spaces API service that deploys MinerU for PDF/document parsing. Transforms complex documents (PDFs, images) into LLM-ready markdown/JSON formats. API endpoints are protected by Bearer token authentication.

## Architecture

```
hf_md_parser/
├── app.py             # FastAPI application with parsing endpoints
├── Dockerfile         # HF Spaces Docker configuration (GPU-enabled, VLM models pre-downloaded)
├── requirements.txt   # Python dependencies
├── README.md          # HF Spaces metadata and API documentation
├── CLAUDE.md          # Claude Code development guide
└── .gitignore         # Git ignore patterns
```

## Common Commands

```bash
# Local development
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Test the API locally
curl -X POST "http://localhost:7860/parse" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Test deployed API (health check - no auth needed)
curl https://outcomelabs-md-parser.hf.space/

# Test deployed API (requires API_TOKEN)
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Build and test Docker locally
docker build -t hf-mineru .
docker run --gpus all --shm-size 32g -p 7860:7860 hf-mineru
```

## Deploying to Hugging Face Spaces

**Space URL:** https://huggingface.co/spaces/outcomelabs/md-parser
**API URL:** https://outcomelabs-md-parser.hf.space

### First-time Setup (already done)

```bash
hf auth login
hf repo create md-parser --repo-type space --space_sdk docker
git init
git remote add hf https://huggingface.co/spaces/outcomelabs/md-parser
```

### Push New Code

```bash
git add .
git commit -m "feat: description of changes"
git push hf main
```

### Force Push (if needed)

```bash
git push hf main --force
```

### Settings (configure in HF web UI)

- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr)
- **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
- **Secrets:** `API_TOKEN` (required for API authentication; `API_DEV_TOKEN` is also accepted)

## API Endpoints

| Endpoint     | Method | Description                                   |
| ------------ | ------ | --------------------------------------------- |
| `/`          | GET    | Health check and API info                     |
| `/parse`     | POST   | Parse a document (PDF/image) to markdown/JSON |
| `/parse/url` | POST   | Parse a document from URL                     |
| `/docs`      | GET    | OpenAPI documentation (Swagger UI)            |

## Key Dependencies

- **mineru[all]**: Core document parsing library
- **fastapi**: API framework
- **python-multipart**: File upload handling
- **uvicorn**: ASGI server
- **httpx**: HTTP client for URL parsing
- **pydantic**: Request/response validation

## Docker Base Image

Uses `vllm/vllm-openai:v0.14.1` as the base image. It ships vLLM with security patches (CVE-2025-66448/CVE-2025-30165), CUDA dependencies, and PyTorch pre-configured, and supports the Ampere, Ada Lovelace, and Hopper GPU architectures. `nvidia-cudnn-cu12` is installed separately to provide cuDNN 9 compatibility with MinerU's torch.

## Environment Variables

| Variable                      | Description                                         | Default                         |
| ----------------------------- | --------------------------------------------------- | ------------------------------- |
| `API_TOKEN`                   | **Required.** Secret token for API authentication   | (set in HF Secrets)             |
| `API_DEV_TOKEN`               | Alternative token, also accepted for authentication | (optional, set in HF Secrets)   |
| `MINERU_BACKEND`              | Parsing backend (pipeline, hybrid-auto-engine, vlm) | `pipeline`                      |
| `MINERU_LANG`                 | Default OCR language                                | `en`                            |
| `MAX_FILE_SIZE_MB`            | Maximum upload file size in MB                      | `1024`                          |
| `MINERU_MODEL_SOURCE`         | Model source (local = use pre-downloaded models)    | `local`                         |
| `HF_HOME`                     | HuggingFace model cache directory                   | `/home/user/.cache/huggingface` |
| `TORCH_HOME`                  | PyTorch model cache directory                       | `/home/user/.cache/torch`       |
| `MODELSCOPE_CACHE`            | ModelScope model cache directory                    | `/home/user/.cache/modelscope`  |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only)      | `0.4`                           |

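The configuration pattern these variables imply can be sketched as follows. This is an illustration of reading the table's defaults at startup, not the exact code in app.py:

```python
import os

# Defaults mirror the table above; the exact variable names app.py uses
# internally are an assumption.
MINERU_BACKEND = os.getenv("MINERU_BACKEND", "pipeline")
MINERU_LANG = os.getenv("MINERU_LANG", "en")
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
VLLM_GPU_MEMORY_UTILIZATION = float(os.getenv("VLLM_GPU_MEMORY_UTILIZATION", "0.4"))

print(MINERU_BACKEND, MAX_FILE_SIZE_MB)
```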
## Authentication

All `/parse` endpoints require a Bearer token in the Authorization header.

```bash
# Set API_TOKEN in HF Space Settings > Secrets
# Then call API with:
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

## HuggingFace Spaces Configuration

The `README.md` contains YAML frontmatter for HF Spaces:

- `sdk: docker` - Uses Docker SDK
- `app_port: 7860` - Standard Gradio/FastAPI port
- `suggested_hardware: a100-large` - Nvidia A100 GPU (80GB VRAM)

## Logging & Monitoring

The API provides comprehensive logging:

- **Request IDs**: Each request gets a unique 8-char ID (e.g., `[a1b2c3d4]`)
- **Startup logs**: Model cache status, MinerU version, configuration
- **Request logs**: File size, type, page range, processing time, pages/sec speed
- **MinerU output**: stdout/stderr from parsing commands
- **Error tracking**: Full exception details with context

View logs in HuggingFace Space → Logs tab.

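An 8-character hex ID like `[a1b2c3d4]` is typically derived from a UUID; a minimal sketch of that pattern (the function name and exact implementation in app.py are assumptions):

```python
import uuid

def new_request_id() -> str:
    # 8-char hex ID, logged as a prefix like [a1b2c3d4]
    return uuid.uuid4().hex[:8]

rid = new_request_id()
print(f"[{rid}] request received")
```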
## Performance Target

**A100 GPU (80GB):** ~100 pages/minute (1.5-2 pages/second)

## MinerU Backends

- **pipeline** (default): General purpose, 6GB VRAM minimum
- **hybrid-auto-engine**: Best accuracy + speed balance, 8-10GB VRAM minimum
- **vlm**: Vision-language model based, highest accuracy for complex docs

## Testing

The app exposes HTTP endpoints - there is no `parse_document` function to call directly. Instead, POST to the `/parse` endpoint while the server is running.

### Start the server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

### Test with curl

```bash
# Test /parse endpoint with a sample PDF (multipart form upload)
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown"

# Test /parse endpoint with images included
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown" \
  -F "include_images=true"

# Test /parse/url endpoint
curl -X POST "http://localhost:7860/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'

# Test /parse/url endpoint with images included
curl -X POST "http://localhost:7860/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown", "include_images": true}'
```

### Test with Python httpx

```python
import base64
import io
import zipfile

import httpx

API_URL = "http://localhost:7860"
API_TOKEN = "your_api_token"

# Test /parse endpoint with file upload
with open("sample.pdf", "rb") as f:
    response = httpx.post(
        f"{API_URL}/parse",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": ("sample.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )
print(response.json())

# Test /parse endpoint with images included
with open("sample.pdf", "rb") as f:
    response = httpx.post(
        f"{API_URL}/parse",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": ("sample.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )
result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")
    # Decode and extract images from zip
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for name in zf.namelist():
            print(f"  - {name}")
            # zf.read(name) returns the image bytes
```
Dockerfile ADDED
# Hugging Face Spaces Dockerfile for MinerU Document Parser API
# Based on official MinerU Docker deployment
# Optimized for L40S GPU (Ada Lovelace architecture, 48GB VRAM)
# Build: v1.4.0 - Using mineru[core] for full backend support

# Use official vLLM image as base (includes CUDA, PyTorch, vLLM properly configured)
# v0.14.1 includes security patches (CVE-2025-66448/CVE-2025-30165) and memory leak fixes
# Supports Ampere, Ada Lovelace, Hopper architectures (L40S is Ada Lovelace)
FROM vllm/vllm-openai:v0.14.1

USER root

RUN echo "========== BUILD STARTED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

# Install system dependencies (fonts required by MinerU, curl for health checks)
RUN echo "========== STEP 1: Installing system dependencies ==========" && \
    apt-get update && apt-get install -y --no-install-recommends \
        fonts-noto-core \
        fonts-noto-cjk \
        fontconfig \
        libgl1 \
        curl \
        poppler-utils \
    && fc-cache -fv && \
    rm -rf /var/lib/apt/lists/* && \
    echo "========== System dependencies installed =========="

# Create non-root user for HF Spaces (required by HuggingFace)
RUN useradd -m -u 1000 user

# Set environment variables (MINERU_MODEL_SOURCE set later after download)
# LD_LIBRARY_PATH includes pip nvidia packages for cuDNN (libcudnn.so.9)
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    MINERU_BACKEND=pipeline \
    MINERU_LANG=en \
    MAX_FILE_SIZE_MB=1024 \
    HF_HOME=/home/user/.cache/huggingface \
    TORCH_HOME=/home/user/.cache/torch \
    MODELSCOPE_CACHE=/home/user/.cache/modelscope \
    XDG_CACHE_HOME=/home/user/.cache \
    HOME=/home/user \
    PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH \
    LD_LIBRARY_PATH=/home/user/.local/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH \
    VLLM_GPU_MEMORY_UTILIZATION=0.4

# Create cache directories with correct ownership
RUN mkdir -p /home/user/.cache/huggingface \
    /home/user/.cache/torch \
    /home/user/.cache/modelscope \
    /home/user/app && \
    chown -R user:user /home/user

# Switch to non-root user
USER user
WORKDIR /home/user/app

# Copy requirements first for better caching
COPY --chown=user:user requirements.txt .

# Install Python dependencies
# Note: nvidia-cudnn-cu12 provides libcudnn.so.9 required by torch
RUN echo "========== STEP 2: Installing Python dependencies ==========" && \
    pip install --user --upgrade pip && \
    pip install --user nvidia-cudnn-cu12 && \
    pip install --user -r requirements.txt && \
    echo "Reinstalling modelscope in user space for torch compatibility..." && \
    pip install --user --force-reinstall modelscope && \
    echo "Installed packages:" && \
    pip list --user | grep -E "(mineru|fastapi|uvicorn|httpx|pydantic|modelscope|torch|cudnn|doclayout)" && \
    echo "========== Python dependencies installed =========="

# Create MinerU config file (required BEFORE downloading models)
# The mineru-models-download command reads ~/mineru.json to know where to store models
RUN echo "========== STEP 3a: Creating MinerU config ==========" && \
    mkdir -p /home/user/.cache/mineru/models && \
    echo '{"models-dir": {"pipeline": "/home/user/.cache/mineru/models", "vlm": "/home/user/.cache/mineru/models"}, "config_version": "1.3.1"}' > /home/user/mineru.json && \
    cat /home/user/mineru.json && \
    echo "========== MinerU config created =========="

# Download MinerU models using official tool
RUN echo "========== STEP 3b: Downloading MinerU models ==========" && \
    echo "This downloads all required models (~4-5GB)..." && \
    echo "Cache directories before download:" && \
    ls -la /home/user/.cache/ && \
    echo "Downloading all models from huggingface..." && \
    mineru-models-download --source huggingface --model_type all && \
    echo "" && \
    echo "========== Model cache summary ==========" && \
    echo "MinerU models cache:" && \
    du -sh /home/user/.cache/mineru 2>/dev/null || echo "  (empty)" && \
    ls -la /home/user/.cache/mineru/models 2>/dev/null || echo "  (no files)" && \
    find /home/user/.cache/mineru -type f 2>/dev/null | head -20 || echo "  (no files found)" && \
    echo "HuggingFace cache:" && \
    du -sh /home/user/.cache/huggingface 2>/dev/null || echo "  (empty)" && \
    echo "Total cache size:" && \
    du -sh /home/user/.cache 2>/dev/null || echo "  (empty)" && \
    echo "========== Models downloaded =========="

# Set model source to local AFTER downloading (prevents re-download at runtime)
ENV MINERU_MODEL_SOURCE=local

# Copy application code
COPY --chown=user:user . .

RUN echo "Files in app directory:" && ls -la /home/user/app/ && \
    echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

# Expose the port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=5 \
    CMD curl -f http://localhost:7860/ || exit 1

# Override vLLM entrypoint and run our FastAPI server
ENTRYPOINT []
CMD ["/usr/bin/python3", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "300"]
README.md ADDED
---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

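Per the commit title, either `API_TOKEN` or `API_DEV_TOKEN` is accepted. A minimal sketch of how such a check can work (the function name and exact logic in app.py are assumptions):

```python
import os

def is_authorized(authorization_header) -> bool:
    """Accept a Bearer token matching API_TOKEN or API_DEV_TOKEN."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return False
    token = authorization_header.removeprefix("Bearer ")
    valid = {os.getenv("API_TOKEN"), os.getenv("API_DEV_TOKEN")}
    valid.discard(None)  # ignore unset secrets
    return token in valid

os.environ["API_TOKEN"] = "secret123"
print(is_authorized("Bearer secret123"))  # True
print(is_authorized("Bearer wrong"))      # False
```

Note that if neither secret is set, every request is rejected rather than allowed through.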
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = path.join('./extracted_images', name);
      // Entry names may include subfolders (e.g. images/), so create parents first
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |

### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

#### Extracting Images (Python)

```python
import base64
import io
import zipfile

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.

## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, reports generated digitally
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing = less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override per-request with the `backend` parameter, or set the `MINERU_BACKEND` env var.

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output

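The detection and splitting steps boil down to simple page-range math, sketched here as an illustration (not the code in app.py):

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 10, threshold: int = 20):
    """Return (start_page, end_page) ranges, 0-indexed and end-exclusive.

    Documents at or below the threshold are processed as a single chunk.
    """
    if total_pages <= threshold:
        return [(0, total_pages)]
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]

print(chunk_page_ranges(30))  # [(0, 10), (10, 20), (20, 30)]
print(chunk_page_ranges(15))  # [(0, 15)] - below the 20-page threshold
```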
+ ### Performance Impact
+
+ | Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
+ | ------------- | ---------------- | ------------------------- | ------- |
+ | 30 pages | ~80 seconds | ~30 seconds | ~2.7x |
+ | 60 pages | ~160 seconds | ~55 seconds | ~2.9x |
+ | 100 pages | Timeout (>600s) | ~100 seconds | N/A |
+
+ ### OOM Protection
+
+ If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.
+
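The fallback decision boils down to scanning failed chunks' error output for known OOM signatures. The patterns below are the ones the service greps for (see `VLLM_MEMORY_ERROR_PATTERNS` in `app.py`); the function name is illustrative:

```python
# GPU memory error signatures that trigger the sequential retry
OOM_PATTERNS = (
    "Free memory on device cuda",
    "Decrease GPU memory utilization",
    "CUDA out of memory",
    "OutOfMemoryError",
)

def needs_sequential_retry(errors: list) -> bool:
    """True if any chunk's error output matches a known GPU OOM pattern."""
    return any(p in err for err in errors for p in OOM_PATTERNS)
```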
+ ### Notes
+
+ - Chunking only applies to PDF files (images are always processed as single units)
+ - Each chunk maintains context for tables and formulas within its page range
+ - Chunk boundaries are marked with HTML comments in markdown output for transparency
+ - If any chunk fails, partial results are still returned with an error message
+ - Requested backend is used for chunked processing (with OOM auto-fallback to sequential)
+
+ ## Supported File Types
+
+ - PDF (.pdf)
+ - Images (.png, .jpg, .jpeg, .tiff, .bmp)
+
+ Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
+
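Client-side, it is worth rejecting unsupported files before uploading. A small pre-flight check matching the limits above; the helper is illustrative, not part of the API:

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # the server default

def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the API."""
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {filename}")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise ValueError(f"File exceeds {MAX_FILE_SIZE_MB} MB limit")
```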
+ ## Configuration
+
+ | Environment Variable | Description | Default |
+ | ----------------------------- | -------------------------------------------------------------- | ---------- |
+ | `API_TOKEN` | Primary API authentication token (at least one token required) | - |
+ | `API_DEV_TOKEN` | Optional secondary token; requests may authenticate with either | - |
+ | `MINERU_BACKEND` | Default parsing backend | `pipeline` |
+ | `MINERU_LANG` | Default OCR language | `en` |
+ | `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
+ | `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4` |
+ | `CHUNK_SIZE` | Pages per chunk for chunked processing | `10` |
+ | `CHUNKING_THRESHOLD` | Minimum pages to trigger chunking | `20` |
+ | `MAX_WORKERS` | Parallel workers for chunk processing | `3` |
+
+ ### GPU Memory & Automatic Fallback
+
+ The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and still returns results (check `backend_used` in the response).
+
+ To force a specific backend or tune memory:
+
+ 1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
+ 2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
+
+ ## Performance
+
+ **Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)
+
+ | Backend | Speed | 15-page PDF | 31-page PDF |
+ | -------------------- | --------------- | ----------- | ----------- |
+ | `pipeline` | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
+ | `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |
+
+ **Trade-off:** Hybrid is about 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.
+
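Those throughput numbers make rough latency estimates easy. The speeds below come from the table above; chunking overlap, small-document overhead, and cold starts are not modeled:

```python
PAGES_PER_SEC = {"pipeline": 0.77, "hybrid-auto-engine": 0.39}

def estimate_seconds(pages: int, backend: str = "pipeline") -> float:
    """Rough wall-clock estimate for a warm, non-chunked run."""
    return pages / PAGES_PER_SEC[backend]
```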
+ **Sleep behavior:** Space sleeps after 60 minutes idle. First request after sleep takes ~30-60 seconds for cold start.
+
+ ## Deployment
+
+ - **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
+ - **API:** https://outcomelabs-md-parser.hf.space
+ - **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)
+
+ ### Deploy Updates
+
+ ```bash
+ git add .
+ git commit -m "feat: description"
+ git push hf main
+ ```
+
+ ## Logging
+
+ View logs in HuggingFace Space > Logs tab:
+
+ ```
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
+ ```
+
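The log lines above are plain `logging` output with a per-request ID prefixed to each message. A minimal reproduction of the formatter (matching the configuration in this repo's `app.py`; the demo logger name is arbitrary):

```python
import io
import logging

def configure(stream) -> logging.Logger:
    """Logger producing 'YYYY-MM-DD HH:MM:SS | LEVEL | message' lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s | %(levelname)-8s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    log = logging.getLogger("md-parser-demo")
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    log.propagate = False
    return log

buffer = io.StringIO()
log = configure(buffer)
log.info("[a1b2c3d4] New parse request received")
```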
+ ## Changelog
+
+ ### v1.4.0 (Breaking Change)
+
+ **Images now returned as a ZIP file instead of a dictionary:**
+
+ - `images` field removed
+ - `images_zip` field added (base64-encoded ZIP containing all images)
+ - `image_count` field added (number of images in the ZIP)
+
+ **Migration from v1.3.0:**
+
+ ```python
+ import base64
+ import io
+ import zipfile
+
+ # OLD (v1.3.0)
+ if result["images"]:
+     for filename, b64_data in result["images"].items():
+         img_bytes = base64.b64decode(b64_data)
+
+ # NEW (v1.4.0)
+ if result["images_zip"]:
+     zip_bytes = base64.b64decode(result["images_zip"])
+     with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
+         for filename in zf.namelist():
+             img_bytes = zf.read(filename)
+ ```
+
+ **Benefits:**
+
+ - Smaller payload size due to ZIP compression
+ - Single field instead of a large dictionary
+ - Easier to save/extract as a file
+
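To get the images back on disk, decode the `images_zip` field and extract the archive. A helper along these lines (the function name is illustrative):

```python
import base64
import io
import zipfile
from pathlib import Path

def extract_images(images_zip_b64: str, out_dir: str) -> list:
    """Decode the base64 images_zip field, extract it to out_dir, return archive names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = base64.b64decode(images_zip_b64)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(out)
        return zf.namelist()
```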
+ ### v1.3.0
+
+ - Added `include_images` parameter for optional image extraction
+ - Added parallel chunking for large PDFs (>20 pages)
+ - Added automatic OOM fallback to sequential processing
+
+ ## Credits
+
+ Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.
app.py ADDED
@@ -0,0 +1,1226 @@
+ """
2
+ MinerU Document Parser API
3
+
4
+ A FastAPI service that wraps MinerU for parsing PDFs and images
5
+ into LLM-ready markdown/JSON formats.
6
+
7
+ Features:
8
+ - Automatic chunking for large PDFs (10 pages per chunk)
9
+ - Parallel processing of chunks for faster throughput
10
+ - Automatic fallback to pipeline backend on GPU memory errors
11
+ """
12
+
13
+ import asyncio
14
+ import base64
15
+ import io
16
+ import ipaddress
17
+ import json
18
+ import logging
19
+ import os
20
+ import re
21
+ import secrets
22
+ import shutil
23
+ import socket
24
+ import subprocess
25
+ import tempfile
26
+ import time
27
+ import zipfile
28
+ from concurrent.futures import ThreadPoolExecutor, as_completed
29
+ from pathlib import Path
30
+ from typing import BinaryIO, Optional, Union
31
+ from urllib.parse import urlparse
32
+ from uuid import uuid4
33
+
34
+ import httpx
35
+ from fastapi import Depends, FastAPI, File, Form, HTTPException, Request, UploadFile
36
+ from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
37
+ from pydantic import BaseModel
38
+
39
+ # Configure logging
40
+ logging.basicConfig(
41
+ level=logging.INFO,
42
+ format="%(asctime)s | %(levelname)-8s | %(message)s",
43
+ datefmt="%Y-%m-%d %H:%M:%S",
44
+ )
45
+ logger = logging.getLogger("md-parser")
46
+
47
+ # Security
48
+ API_TOKEN = os.getenv("API_TOKEN")
49
+ API_DEV_TOKEN = os.getenv("API_DEV_TOKEN")
50
+ security = HTTPBearer()
51
+
52
+
53
+ def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)) -> str:
54
+ """Verify the API token from Authorization header."""
55
+ if not API_TOKEN and not API_DEV_TOKEN:
56
+ raise HTTPException(
57
+ status_code=500,
58
+ detail="No API tokens configured on server",
59
+ )
60
+
61
+ token = credentials.credentials
62
+
63
+ # Check against both tokens
64
+ token_valid = False
65
+ if API_TOKEN and secrets.compare_digest(token, API_TOKEN):
66
+ token_valid = True
67
+ if API_DEV_TOKEN and secrets.compare_digest(token, API_DEV_TOKEN):
68
+ token_valid = True
69
+
70
+ if not token_valid:
71
+ raise HTTPException(
72
+ status_code=401,
73
+ detail="Invalid API token",
74
+ )
75
+ return token
76
+
77
+ from contextlib import asynccontextmanager
+
+
+ def _check_model_cache() -> dict:
+     """Check model cache status and return cache info."""
+     cache_info = {}
+     cache_dirs = [
+         ("HuggingFace", os.environ.get("HF_HOME", "/home/user/.cache/huggingface")),
+         ("Torch", os.environ.get("TORCH_HOME", "/home/user/.cache/torch")),
+         ("ModelScope", os.environ.get("MODELSCOPE_CACHE", "/home/user/.cache/modelscope")),
+     ]
+
+     for name, path in cache_dirs:
+         if os.path.exists(path):
+             try:
+                 # Get directory size
+                 total_size = 0
+                 file_count = 0
+                 for dirpath, dirnames, filenames in os.walk(path):
+                     for f in filenames:
+                         fp = os.path.join(dirpath, f)
+                         total_size += os.path.getsize(fp)
+                         file_count += 1
+                 size_mb = total_size / (1024 * 1024)
+                 cache_info[name] = {"size_mb": round(size_mb, 2), "files": file_count, "status": "cached"}
+             except Exception as e:
+                 cache_info[name] = {"status": f"error: {e}"}
+         else:
+             cache_info[name] = {"status": "not found"}
+
+     return cache_info
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Startup: verify MinerU is available and check model cache."""
+     logger.info("=" * 60)
+     logger.info("Starting MD Parser API v1.4.0...")
+     logger.info(f"Backend: {MINERU_BACKEND}")
+     logger.info(f"Default language: {MINERU_LANG}")
+     logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
+     logger.info(f"Chunking: {CHUNK_SIZE} pages/chunk, threshold {CHUNKING_THRESHOLD} pages, {MAX_WORKERS} workers")
+
+     try:
+         # Verify mineru CLI is available
+         result = subprocess.run(["mineru", "--version"], capture_output=True, text=True)
+         logger.info(f"MinerU version: {result.stdout.strip()}")
+     except Exception as e:
+         logger.warning(f"MinerU check failed: {e}")
+
+     # Check model cache status
+     logger.info("-" * 40)
+     logger.info("Model cache status:")
+     cache_info = _check_model_cache()
+     for name, info in cache_info.items():
+         if info.get("status") == "cached":
+             logger.info(f"  {name}: {info['size_mb']:.2f} MB ({info['files']} files) - CACHED")
+         else:
+             logger.warning(f"  {name}: {info.get('status', 'unknown')}")
+
+     total_cached = sum(info.get("size_mb", 0) for info in cache_info.values() if info.get("status") == "cached")
+     if total_cached > 0:
+         logger.info(f"  Total cached: {total_cached:.2f} MB")
+         logger.info("  Models are pre-loaded - no download needed at runtime")
+     else:
+         logger.warning("  No cached models found - first request may be slow")
+
+     logger.info("=" * 60)
+     logger.info("MD Parser API ready to accept requests")
+     logger.info("=" * 60)
+     yield
+     logger.info("Shutting down MD Parser API...")
+
+
+ app = FastAPI(
+     title="MD Parser API",
+     description="Transform PDFs and images into markdown/JSON using MinerU",
+     version="1.4.0",
+     lifespan=lifespan,
+ )
+
+ # Configuration from environment (optimized for A100 GPU)
+ MINERU_BACKEND = os.getenv("MINERU_BACKEND", "pipeline")
+ MINERU_LANG = os.getenv("MINERU_LANG", "en")
+ MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
+ MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
+
+ # Chunking configuration
+ CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "10"))  # Pages per chunk
+ # MAX_WORKERS: Number of parallel workers for chunk processing
+ # - Default 3 for faster processing on A100 (80GB VRAM)
+ # - If OOM occurs, automatically falls back to sequential (1 worker)
+ MAX_WORKERS = int(os.getenv("MAX_WORKERS", "3"))
+ CHUNKING_THRESHOLD = int(os.getenv("CHUNKING_THRESHOLD", "20"))  # Min pages to enable chunking
+
+ # Enable torch.compile for ~15% speedup if available
+ if os.getenv("TORCH_COMPILE_ENABLED", "0") == "1":
+     try:
+         import torch
+         torch.set_float32_matmul_precision('high')
+     except Exception:
+         pass
+
+ # Blocked hostnames for SSRF protection
+ BLOCKED_HOSTNAMES = {
+     "localhost",
+     "metadata",
+     "metadata.google.internal",
+     "metadata.google",
+     "169.254.169.254",  # AWS/GCP/Azure metadata service
+     "fd00:ec2::254",  # AWS IPv6 metadata
+ }
+
+
+ def _validate_url(url: str) -> None:
+     """
+     Validate URL to prevent SSRF attacks.
+
+     Raises HTTPException if URL is invalid or points to internal/private resources.
+     """
+     try:
+         parsed = urlparse(url)
+     except Exception as e:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid URL format: {str(e)}",
+         )
+
+     # Check scheme
+     if parsed.scheme not in ("http", "https"):
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid URL scheme '{parsed.scheme}'. Only http and https are allowed.",
+         )
+
+     # Check hostname exists
+     hostname = parsed.hostname
+     if not hostname:
+         raise HTTPException(
+             status_code=400,
+             detail="Invalid URL: missing hostname.",
+         )
+
+     # Check against blocked hostnames
+     hostname_lower = hostname.lower()
+     if hostname_lower in BLOCKED_HOSTNAMES:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to internal/metadata services is not allowed.",
+         )
+
+     # Block hostnames containing suspicious patterns
+     blocked_patterns = ["metadata", "internal", "localhost", "127.0.0.1", "::1"]
+     for pattern in blocked_patterns:
+         if pattern in hostname_lower:
+             raise HTTPException(
+                 status_code=400,
+                 detail="Access to internal/metadata services is not allowed.",
+             )
+
+     # Resolve hostname and check IP address
+     try:
+         ip_str = socket.gethostbyname(hostname)
+         ip = ipaddress.ip_address(ip_str)
+     except socket.gaierror:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Could not resolve hostname: {hostname}",
+         )
+     except ValueError as e:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid IP address resolved: {str(e)}",
+         )
+
+     # Block private, loopback, link-local, and reserved IP ranges
+     if ip.is_private:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to private IP addresses is not allowed.",
+         )
+     if ip.is_loopback:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to loopback addresses is not allowed.",
+         )
+     if ip.is_link_local:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to link-local addresses is not allowed.",
+         )
+     if ip.is_reserved:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to reserved IP addresses is not allowed.",
+         )
+     if ip.is_multicast:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to multicast addresses is not allowed.",
+         )
+
+
+ def _save_uploaded_file(input_path: Path, file_obj: BinaryIO) -> None:
+     """Sync helper to save uploaded file to disk (runs in thread)."""
+     with open(input_path, "wb") as f:
+         shutil.copyfileobj(file_obj, f)
+
+
+ def _save_downloaded_content(input_path: Path, content: bytes) -> None:
+     """Sync helper to save downloaded content to disk (runs in thread)."""
+     with open(input_path, "wb") as f:
+         f.write(content)
+
+
+ def _extract_images_as_zip(output_dir: Path, prefix: str = "") -> tuple[bytes, int]:
+     """
+     Extract all images from output directory and return as zip file bytes.
+
+     Args:
+         output_dir: Directory containing images (MinerU puts them in images/ subfolder)
+         prefix: Optional prefix for image paths in the zip (e.g., "chunk_0/")
+
+     Returns:
+         Tuple of (zip_bytes, image_count)
+     """
+     image_extensions = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}
+
+     zip_buffer = io.BytesIO()
+     image_count = 0
+
+     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
+         for img_path in output_dir.glob("**/*"):
+             if img_path.is_file() and img_path.suffix.lower() in image_extensions:
+                 try:
+                     # Use relative path from output_dir as path in zip
+                     relative_path = img_path.relative_to(output_dir)
+                     zip_path = f"{prefix}{relative_path}" if prefix else str(relative_path)
+                     zf.write(img_path, zip_path)
+                     image_count += 1
+                 except Exception as e:
+                     logger.warning(f"Failed to add image {img_path} to zip: {e}")
+
+     return zip_buffer.getvalue(), image_count
+
+
+ def _create_images_zip_base64(output_dir: Path, prefix: str = "") -> tuple[Optional[str], int]:
+     """
+     Extract images and return as base64-encoded zip.
+
+     Returns:
+         Tuple of (base64_zip_string or None if no images, image_count)
+     """
+     zip_bytes, image_count = _extract_images_as_zip(output_dir, prefix)
+
+     if image_count == 0:
+         return None, 0
+
+     return base64.b64encode(zip_bytes).decode("utf-8"), image_count
+
+
+ class ParseResponse(BaseModel):
+     """Response model for document parsing."""
+
+     success: bool
+     markdown: Optional[str] = None
+     json_content: Optional[Union[dict, list]] = None  # Can be dict (single) or list (chunked)
+     images_zip: Optional[str] = None  # Base64-encoded zip file containing all images
+     image_count: int = 0  # Number of images in the zip
+     error: Optional[str] = None
+     pages_processed: int = 0
+     backend_used: Optional[str] = None  # Actual backend used (may differ if fallback occurred)
+
+
+ # vLLM GPU memory error patterns that trigger fallback to pipeline
+ VLLM_MEMORY_ERROR_PATTERNS = [
+     "Free memory on device cuda",
+     "Decrease GPU memory utilization",
+     "CUDA out of memory",
+     "OutOfMemoryError",
+ ]
+
+
+ def _has_gpu_memory_error(output: str) -> bool:
+     """Check if output contains GPU memory error patterns."""
+     for pattern in VLLM_MEMORY_ERROR_PATTERNS:
+         if pattern in output:
+             return True
+     return False
+
+
+ def _run_mineru(
+     input_path: Path,
+     output_dir: Path,
+     backend: str,
+     lang: str,
+     start_page: int,
+     end_page: Optional[int],
+     request_id: str,
+ ) -> tuple[subprocess.CompletedProcess, str]:
+     """
+     Run MinerU with the specified backend.
+
+     Returns tuple of (process result, backend actually used).
+     If a GPU memory error occurs with the hybrid backend, automatically retries with pipeline.
+     """
+     def build_cmd(use_backend: str) -> list[str]:
+         cmd = [
+             "mineru",
+             "-p", str(input_path),
+             "-o", str(output_dir),
+             "-b", use_backend,
+             "-l", lang,
+         ]
+         if start_page > 0:
+             cmd.extend(["-s", str(start_page)])
+         if end_page is not None:
+             cmd.extend(["-e", str(end_page)])
+         return cmd
+
+     # First attempt with requested backend
+     cmd = build_cmd(backend)
+     logger.info(f"[{request_id}] Starting MinerU processing...")
+     logger.info(f"[{request_id}] Command: {' '.join(cmd)}")
+     logger.info(f"[{request_id}] Backend: {backend}")
+
+     parse_start = time.time()
+     proc = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
+     parse_duration = time.time() - parse_start
+
+     logger.info(f"[{request_id}] MinerU completed in {parse_duration:.2f}s")
+     logger.info(f"[{request_id}] Return code: {proc.returncode}")
+
+     if proc.stdout:
+         for line in proc.stdout.strip().split('\n')[-10:]:
+             logger.info(f"[{request_id}] [stdout] {line}")
+
+     if proc.stderr:
+         for line in proc.stderr.strip().split('\n')[-10:]:
+             logger.warning(f"[{request_id}] [stderr] {line}")
+
+     combined_output = (proc.stdout or "") + (proc.stderr or "")
+
+     # Check for GPU memory errors and fallback to pipeline if needed
+     if backend != "pipeline" and _has_gpu_memory_error(combined_output):
+         logger.warning(f"[{request_id}] GPU memory error detected with {backend}, falling back to pipeline...")
+
+         # Clear output directory for retry
+         for f in output_dir.glob("*"):
+             if f.is_file():
+                 f.unlink()
+             elif f.is_dir():
+                 shutil.rmtree(f)
+
+         # Retry with pipeline backend
+         fallback_cmd = build_cmd("pipeline")
+         logger.info(f"[{request_id}] Retrying with pipeline backend...")
+         logger.info(f"[{request_id}] Command: {' '.join(fallback_cmd)}")
+
+         parse_start = time.time()
+         proc = subprocess.run(fallback_cmd, capture_output=True, text=True, timeout=600)
+         parse_duration = time.time() - parse_start
+
+         logger.info(f"[{request_id}] MinerU (pipeline fallback) completed in {parse_duration:.2f}s")
+         logger.info(f"[{request_id}] Return code: {proc.returncode}")
+
+         if proc.stdout:
+             for line in proc.stdout.strip().split('\n')[-10:]:
+                 logger.info(f"[{request_id}] [stdout] {line}")
+
+         return proc, "pipeline"
+
+     return proc, backend
+
+
+ def _get_pdf_page_count(input_path: Path) -> int:
+     """Get the total number of pages in a PDF using pdfinfo."""
+     try:
+         result = subprocess.run(
+             ["pdfinfo", str(input_path)],
+             capture_output=True,
+             text=True,
+             timeout=30
+         )
+         if result.returncode == 0:
+             for line in result.stdout.split('\n'):
+                 if line.startswith('Pages:'):
+                     return int(line.split(':')[1].strip())
+     except Exception as e:
+         logger.warning(f"Failed to get PDF page count: {e}")
+     return 0
+
+
+ def _process_single_chunk(
+     chunk_id: int,
+     input_path: Path,
+     chunk_output_dir: Path,
+     backend: str,
+     lang: str,
+     start_page: int,
+     end_page: int,
+     request_id: str,
+     include_images: bool = False,
+ ) -> dict:
+     """Process a single chunk of pages. Returns dict with chunk results."""
+     chunk_request_id = f"{request_id}-c{chunk_id}"
+     logger.info(f"[{chunk_request_id}] Processing chunk {chunk_id}: pages {start_page}-{end_page}")
+
+     try:
+         chunk_output_dir.mkdir(parents=True, exist_ok=True)
+
+         proc, backend_used = _run_mineru(
+             input_path=input_path,
+             output_dir=chunk_output_dir,
+             backend=backend,
+             lang=lang,
+             start_page=start_page,
+             end_page=end_page,
+             request_id=chunk_request_id,
+         )
+
+         if proc.returncode != 0:
+             logger.error(f"[{chunk_request_id}] Chunk {chunk_id} failed with code {proc.returncode}")
+             return {
+                 "chunk_id": chunk_id,
+                 "success": False,
+                 "error": f"MinerU failed (code {proc.returncode}): {proc.stderr[:500] if proc.stderr else 'No stderr'}",
+                 "backend_used": backend_used,
+                 "pages": end_page - start_page + 1,
+             }
+
+         # Read chunk output - list all files for debugging
+         all_files = list(chunk_output_dir.glob("**/*"))
+         logger.info(f"[{chunk_request_id}] Output files: {[str(f) for f in all_files[:20]]}")
+
+         md_files = list(chunk_output_dir.glob("**/*.md"))
+         markdown_content = ""
+         if md_files:
+             markdown_content = md_files[0].read_text(encoding="utf-8")
+             logger.info(f"[{chunk_request_id}] Found markdown: {md_files[0]}")
+
+         json_content = None
+         json_files = [f for f in chunk_output_dir.glob("**/*.json") if "_content_list" not in f.name]
+         if json_files:
+             try:
+                 json_content = json.loads(json_files[0].read_text(encoding="utf-8"))
+             except json.JSONDecodeError:
+                 pass
+
+         # Extract images from chunk output (only if requested)
+         chunk_images_zip = None
+         chunk_image_count = 0
+         if include_images:
+             zip_bytes, chunk_image_count = _extract_images_as_zip(chunk_output_dir)
+             # Only keep zip bytes if we actually have images
+             if chunk_image_count > 0:
+                 chunk_images_zip = zip_bytes
+
+         logger.info(f"[{chunk_request_id}] Chunk {chunk_id} completed: {len(markdown_content)} chars markdown, json={'yes' if json_content else 'no'}, images={chunk_image_count}")
+
+         # Check if we got any content - empty output might indicate a problem
+         has_content = bool(markdown_content.strip()) or bool(json_content)
+         if not has_content:
+             logger.warning(f"[{chunk_request_id}] Chunk {chunk_id} produced no content (pages {start_page}-{end_page})")
+
+         return {
+             "chunk_id": chunk_id,
+             "success": True,  # MinerU succeeded, even if content is empty (e.g., blank pages)
+             "markdown": markdown_content,
+             "json_content": json_content,
+             "images_zip_bytes": chunk_images_zip,
+             "image_count": chunk_image_count,
+             "backend_used": backend_used,
+             "pages": end_page - start_page + 1,
+             "start_page": start_page,
+             "end_page": end_page,
+             "has_content": has_content,
+         }
+
+     except Exception as e:
+         logger.error(f"[{chunk_request_id}] Chunk {chunk_id} exception: {e}")
+         return {
+             "chunk_id": chunk_id,
+             "success": False,
+             "error": str(e),
+             "backend_used": backend,
+             "pages": 0,
+         }
+
+
+ def _has_oom_error_in_results(chunk_results: list) -> bool:
+     """Check if any chunk failed due to OOM error."""
+     for r in chunk_results:
+         if not r["success"]:
+             error_msg = r.get("error", "")
+             if any(pattern in error_msg for pattern in VLLM_MEMORY_ERROR_PATTERNS):
+                 return True
+     return False
+
+
+ def _process_chunks_with_workers(
+     chunks: list,
+     input_path: Path,
+     base_output_dir: Path,
+     chunk_backend: str,
+     lang: str,
+     request_id: str,
+     num_workers: int,
+     include_images: bool = False,
+ ) -> list:
+     """Process chunks with specified number of workers."""
+     chunk_results = []
+     with ThreadPoolExecutor(max_workers=num_workers) as executor:
+         futures = {}
+         for cid, cstart, cend in chunks:
+             chunk_output_dir = base_output_dir / f"chunk_{cid}"
+             # Clean up any previous attempt
+             if chunk_output_dir.exists():
+                 shutil.rmtree(chunk_output_dir)
+             future = executor.submit(
+                 _process_single_chunk,
+                 cid,
+                 input_path,
+                 chunk_output_dir,
+                 chunk_backend,
+                 lang,
+                 cstart,
+                 cend,
+                 request_id,
+                 include_images,
+             )
+             futures[future] = cid
+
+         for future in as_completed(futures):
+             result = future.result()
+             chunk_results.append(result)
+     return chunk_results
+
+
+ def _process_chunked(
619
+ input_path: Path,
620
+ base_output_dir: Path,
621
+ backend: str,
622
+ lang: str,
623
+ start_page: int,
624
+ end_page: Optional[int],
625
+ total_pages: int,
626
+ request_id: str,
627
+ output_format: str,
628
+ include_images: bool = False,
629
+ ) -> ParseResponse:
630
+ """Process a PDF in parallel chunks and combine results.
631
+
632
+ Automatically falls back to sequential processing if OOM errors are detected.
633
+ """
634
+ # Calculate actual end page
635
+ actual_end = end_page if end_page is not None else total_pages - 1
636
+
637
+ # Generate chunk ranges
638
+ chunks = []
639
+ current_start = start_page
640
+ chunk_id = 0
641
+ while current_start <= actual_end:
642
+ chunk_end = min(current_start + CHUNK_SIZE - 1, actual_end)
643
+ chunks.append((chunk_id, current_start, chunk_end))
644
+ current_start = chunk_end + 1
645
+ chunk_id += 1
646
+
647
+ # Use requested backend for chunked processing
648
+ # OOM protection will automatically fall back to sequential if needed
649
+ chunk_backend = backend
650
+
651
+ logger.info(f"[{request_id}] Splitting into {len(chunks)} chunks of up to {CHUNK_SIZE} pages each")
652
+ logger.info(f"[{request_id}] Backend: {chunk_backend}, workers: {MAX_WORKERS}")
653
+
+    # Process chunks - start with configured workers, fall back to sequential on OOM
+    current_workers = MAX_WORKERS
+    chunk_results = _process_chunks_with_workers(
+        chunks, input_path, base_output_dir, chunk_backend, lang, request_id, current_workers, include_images
+    )
+
+    # Check for OOM errors and retry with fewer workers if needed
+    if _has_oom_error_in_results(chunk_results) and current_workers > 1:
+        logger.warning(f"[{request_id}] OOM detected with {current_workers} workers, retrying sequentially (1 worker)")
+        # Clean up and retry with sequential processing
+        for cid, _, _ in chunks:
+            chunk_dir = base_output_dir / f"chunk_{cid}"
+            if chunk_dir.exists():
+                shutil.rmtree(chunk_dir)
+
+        chunk_results = _process_chunks_with_workers(
+            chunks, input_path, base_output_dir, chunk_backend, lang, request_id, 1, include_images
+        )
+
+    # Sort by chunk_id to maintain page order
+    chunk_results.sort(key=lambda x: x["chunk_id"])
+
+    # Check for failures and empty chunks
+    failed_chunks = [r for r in chunk_results if not r["success"]]
+    if failed_chunks:
+        errors = "; ".join([f"Chunk {r['chunk_id']}: {r.get('error', 'Unknown')}" for r in failed_chunks])
+        logger.error(f"[{request_id}] {len(failed_chunks)} chunks failed: {errors}")
+
+    empty_chunks = [r for r in chunk_results if r["success"] and not r.get("has_content", True)]
+    if empty_chunks:
+        empty_ranges = [f"pages {r['start_page']}-{r['end_page']}" for r in empty_chunks]
+        logger.warning(f"[{request_id}] {len(empty_chunks)} chunks had no content: {', '.join(empty_ranges)}")
+
+    # Combine results
+    total_pages_processed = sum(r.get("pages", 0) for r in chunk_results if r["success"])
+    backends_used = list(set(r.get("backend_used", backend) for r in chunk_results if r["success"]))
+    backend_used = backends_used[0] if len(backends_used) == 1 else ",".join(backends_used)
+
+    # Combine images from all chunks into a single zip (with chunk prefixes to avoid collisions)
+    combined_zip_buffer = io.BytesIO()
+    total_image_count = 0
+
+    with zipfile.ZipFile(combined_zip_buffer, 'w', zipfile.ZIP_DEFLATED) as combined_zf:
+        for r in chunk_results:
+            if r["success"] and r.get("images_zip_bytes"):
+                chunk_zip_bytes = r["images_zip_bytes"]
+                chunk_id = r["chunk_id"]
+
+                # Extract from chunk zip and add to combined zip with chunk prefix
+                with zipfile.ZipFile(io.BytesIO(chunk_zip_bytes), 'r') as chunk_zf:
+                    for name in chunk_zf.namelist():
+                        prefixed_name = f"chunk_{chunk_id}/{name}"
+                        combined_zf.writestr(prefixed_name, chunk_zf.read(name))
+                        total_image_count += 1
+
+    combined_images_zip = None
+    if total_image_count > 0:
+        combined_images_zip = base64.b64encode(combined_zip_buffer.getvalue()).decode("utf-8")
+        logger.info(f"[{request_id}] Combined {total_image_count} images from all chunks into zip")
+
+    if output_format == "json":
+        # Combine JSON content (merge arrays, wrap single results)
+        combined_json = []
+        for r in chunk_results:
+            if r["success"] and r.get("json_content"):
+                jc = r["json_content"]
+                if isinstance(jc, list):
+                    combined_json.extend(jc)
+                else:
+                    combined_json.append(jc)
+
+        if failed_chunks and not combined_json:
+            return ParseResponse(
+                success=False,
+                error=f"All chunks failed: {errors}",
+                pages_processed=0,
+                backend_used=backend_used,
+            )
+
+        return ParseResponse(
+            success=True,
+            json_content=combined_json if combined_json else None,
+            images_zip=combined_images_zip,
+            image_count=total_image_count,
+            pages_processed=total_pages_processed,
+            backend_used=backend_used,
+            error=f"{len(failed_chunks)} chunks failed" if failed_chunks else None,
+        )
+    else:
+        # Combine markdown content
+        combined_markdown = []
+        for r in chunk_results:
+            if r["success"] and r.get("markdown"):
+                # Insert a chunk separator comment (with page range) between chunks for clarity
+                if combined_markdown:
+                    combined_markdown.append(f"\n\n<!-- Chunk {r['chunk_id']} (pages {r['start_page']}-{r['end_page']}) -->\n\n")
+                combined_markdown.append(r["markdown"])
+
+        if failed_chunks and not combined_markdown:
+            return ParseResponse(
+                success=False,
+                error=f"All chunks failed: {errors}",
+                pages_processed=0,
+                backend_used=backend_used,
+            )
+
+        return ParseResponse(
+            success=True,
+            markdown="".join(combined_markdown) if combined_markdown else None,
+            images_zip=combined_images_zip,
+            image_count=total_image_count,
+            pages_processed=total_pages_processed,
+            backend_used=backend_used,
+            error=f"{len(failed_chunks)} chunks failed" if failed_chunks else None,
+        )
+
+
+class HealthResponse(BaseModel):
+    """Health check response."""
+
+    status: str
+    version: str
+    backend: str
+    chunk_size: int
+    chunking_threshold: int
+    max_workers: int
+
+
+class URLParseRequest(BaseModel):
+    """Request model for URL-based parsing."""
+
+    url: str
+    output_format: str = "markdown"
+    lang: str = MINERU_LANG
+    backend: Optional[str] = None  # Override backend: pipeline, hybrid-auto-engine
+    start_page: int = 0
+    end_page: Optional[int] = None
+    include_images: bool = False  # Include base64-encoded images in response
+
+
+@app.get("/", response_model=HealthResponse)
+async def health_check() -> HealthResponse:
+    """Health check endpoint."""
+    return HealthResponse(
+        status="healthy",
+        version="1.4.0",
+        backend=MINERU_BACKEND,
+        chunk_size=CHUNK_SIZE,
+        chunking_threshold=CHUNKING_THRESHOLD,
+        max_workers=MAX_WORKERS,
+    )
+
+
+@app.post("/parse", response_model=ParseResponse)
+async def parse_document(
+    file: UploadFile = File(..., description="PDF or image file to parse"),
+    output_format: str = Form(
+        default="markdown", description="Output format: markdown or json"
+    ),
+    lang: str = Form(default=MINERU_LANG, description="OCR language code"),
+    start_page: int = Form(default=0, description="Starting page (0-indexed)"),
+    end_page: Optional[int] = Form(default=None, description="Ending page (None=all)"),
+    backend: Optional[str] = Form(default=None, description="Override backend: pipeline, hybrid-auto-engine"),
+    include_images: bool = Form(default=False, description="Include base64-encoded images in response"),
+    _token: str = Depends(verify_token),
+) -> ParseResponse:
+    """
+    Parse a document file (PDF or image) and return extracted content.
+
+    Supports:
+    - PDF files (.pdf)
+    - Images (.png, .jpg, .jpeg, .tiff, .bmp)
+    """
+    request_id = str(uuid4())[:8]
+    start_time = time.time()
+
+    logger.info(f"[{request_id}] {'='*50}")
+    logger.info(f"[{request_id}] New parse request received")
+    logger.info(f"[{request_id}] Filename: {file.filename}")
+    logger.info(f"[{request_id}] Output format: {output_format}")
+    logger.info(f"[{request_id}] Language: {lang}")
+    logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")
+
+    # Validate file size (seek to end to measure, then rewind for reading)
+    file.file.seek(0, 2)
+    file_size = file.file.tell()
+    file.file.seek(0)
+
+    file_size_mb = file_size / (1024 * 1024)
+    logger.info(f"[{request_id}] File size: {file_size_mb:.2f} MB")
+
+    if file_size > MAX_FILE_SIZE_BYTES:
+        logger.error(f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB")
+        raise HTTPException(
+            status_code=413,
+            detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
+        )
+
+    # Validate file type
+    allowed_extensions = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
+    file_ext = Path(file.filename).suffix.lower() if file.filename else ""
+    if file_ext not in allowed_extensions:
+        logger.error(f"[{request_id}] Unsupported file type: {file_ext}")
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}",
+        )
+
+    logger.info(f"[{request_id}] File type: {file_ext}")
+
+    # Create temp directory for processing
+    temp_dir = tempfile.mkdtemp()
+    logger.info(f"[{request_id}] Created temp directory: {temp_dir}")
+
+    try:
+        # Save uploaded file (run blocking I/O in thread)
+        input_path = Path(temp_dir) / f"input{file_ext}"
+        await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
+        logger.info(f"[{request_id}] Saved file to: {input_path}")
+
+        # Create output directory
+        output_dir = Path(temp_dir) / "output"
+        output_dir.mkdir(exist_ok=True)
+
+        use_backend = backend if backend else MINERU_BACKEND
+
+        # Check if chunking should be used (PDF only, sufficient pages)
+        total_pages = 0
+        use_chunking = False
+        if file_ext == ".pdf":
+            total_pages = _get_pdf_page_count(input_path)
+            logger.info(f"[{request_id}] PDF has {total_pages} pages")
+
+            # Calculate effective page range
+            effective_end = end_page if end_page is not None else total_pages - 1
+            effective_pages = effective_end - start_page + 1
+
+            if effective_pages > CHUNKING_THRESHOLD:
+                use_chunking = True
+                logger.info(f"[{request_id}] Chunking enabled: {effective_pages} pages > {CHUNKING_THRESHOLD} threshold")
+
+        if use_chunking:
+            # Process in parallel chunks
+            parse_result = _process_chunked(
+                input_path=input_path,
+                base_output_dir=output_dir,
+                backend=use_backend,
+                lang=lang,
+                start_page=start_page,
+                end_page=end_page,
+                total_pages=total_pages,
+                request_id=request_id,
+                output_format=output_format,
+                include_images=include_images,
+            )
909
+ else:
910
+ # Process normally (single pass)
911
+ logger.info(f"[{request_id}] Processing without chunking")
912
+ proc, backend_used = _run_mineru(
913
+ input_path=input_path,
914
+ output_dir=output_dir,
915
+ backend=use_backend,
916
+ lang=lang,
917
+ start_page=start_page,
918
+ end_page=end_page,
919
+ request_id=request_id,
920
+ )
921
+
922
+ if proc.returncode != 0:
923
+ logger.error(f"[{request_id}] MinerU failed with code {proc.returncode}")
924
+ if proc.stderr:
925
+ for line in proc.stderr.strip().split('\n'):
926
+ logger.error(f"[{request_id}] [stderr] {line}")
927
+ raise RuntimeError(f"MinerU failed (code {proc.returncode}): {proc.stderr}")
928
+
929
+ # Read output
930
+ logger.info(f"[{request_id}] Reading output files...")
931
+ parse_result = _read_parse_output(output_dir, output_format, proc.stdout, proc.stderr, request_id, include_images)
932
+ parse_result.backend_used = backend_used
933
+
934
+ if backend_used != use_backend:
935
+ logger.info(f"[{request_id}] Note: Fell back from {use_backend} to {backend_used} due to GPU memory constraints")
936
+
937
+ total_duration = time.time() - start_time
938
+ logger.info(f"[{request_id}] {'='*50}")
939
+ logger.info(f"[{request_id}] Request completed successfully")
940
+ logger.info(f"[{request_id}] Pages processed: {parse_result.pages_processed}")
941
+ logger.info(f"[{request_id}] Total time: {total_duration:.2f}s")
942
+ if parse_result.pages_processed > 0:
943
+ logger.info(f"[{request_id}] Speed: {parse_result.pages_processed / total_duration:.2f} pages/sec")
944
+ logger.info(f"[{request_id}] {'='*50}")
945
+
946
+ return parse_result
947
+
948
+ except Exception as e:
949
+ total_duration = time.time() - start_time
950
+ logger.error(f"[{request_id}] {'='*50}")
951
+ logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
952
+ logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
953
+ logger.error(f"[{request_id}] {'='*50}")
954
+ return ParseResponse(
955
+ success=False,
956
+ error=f"{type(e).__name__}: {str(e)}",
957
+ )
958
+ finally:
959
+ # Cleanup temp directory
960
+ shutil.rmtree(temp_dir, ignore_errors=True)
961
+ logger.info(f"[{request_id}] Cleaned up temp directory")
962
+
+
+
+@app.post("/parse/url", response_model=ParseResponse)
+async def parse_document_from_url(
+    request: URLParseRequest,
+    _token: str = Depends(verify_token),
+) -> ParseResponse:
+    """
+    Parse a document from a URL.
+
+    Downloads the file and processes it through MinerU.
+    """
+    request_id = str(uuid4())[:8]
+    start_time = time.time()
+
+    logger.info(f"[{request_id}] {'='*50}")
+    logger.info(f"[{request_id}] New URL parse request received")
+    logger.info(f"[{request_id}] URL: {request.url}")
+    logger.info(f"[{request_id}] Output format: {request.output_format}")
+    logger.info(f"[{request_id}] Language: {request.lang}")
+    logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")
+
+    # Validate URL to prevent SSRF attacks
+    logger.info(f"[{request_id}] Validating URL...")
+    _validate_url(request.url)
+    logger.info(f"[{request_id}] URL validation passed")
+
+    temp_dir = tempfile.mkdtemp()
+    logger.info(f"[{request_id}] Created temp directory: {temp_dir}")
+
+    try:
+        # Download file from URL
+        logger.info(f"[{request_id}] Downloading file from URL...")
+        download_start = time.time()
+        async with httpx.AsyncClient(timeout=60.0, follow_redirects=True) as client:
+            response = await client.get(request.url)
+            response.raise_for_status()
+        download_duration = time.time() - download_start
+
+        file_size_mb = len(response.content) / (1024 * 1024)
+        logger.info(f"[{request_id}] Download completed in {download_duration:.2f}s")
+        logger.info(f"[{request_id}] File size: {file_size_mb:.2f} MB")
+
+        # Determine file extension from the URL path (defaults to .pdf if none)
+        url_path = Path(request.url.split("?")[0])
+        file_ext = url_path.suffix.lower() or ".pdf"
+
+        # Validate file type
+        allowed_extensions = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
+        if file_ext not in allowed_extensions:
+            logger.error(f"[{request_id}] Unsupported file type: {file_ext}")
+            raise HTTPException(
+                status_code=400,
+                detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}",
+            )
+
+        logger.info(f"[{request_id}] File type: {file_ext}")
+
+        # Check file size
+        if len(response.content) > MAX_FILE_SIZE_BYTES:
+            logger.error(f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB")
+            raise HTTPException(
+                status_code=413,
+                detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
+            )
+
+        # Save downloaded file (run blocking I/O in thread)
+        input_path = Path(temp_dir) / f"input{file_ext}"
+        await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
+        logger.info(f"[{request_id}] Saved file to: {input_path}")
+
+        # Create output directory
+        output_dir = Path(temp_dir) / "output"
+        output_dir.mkdir(exist_ok=True)
+
+        use_backend = request.backend if request.backend else MINERU_BACKEND
+
+        # Check if chunking should be used (PDF only, sufficient pages)
+        total_pages = 0
+        use_chunking = False
+        if file_ext == ".pdf":
+            total_pages = _get_pdf_page_count(input_path)
+            logger.info(f"[{request_id}] PDF has {total_pages} pages")
+
+            # Calculate effective page range
+            effective_end = request.end_page if request.end_page is not None else total_pages - 1
+            effective_pages = effective_end - request.start_page + 1
+
+            if effective_pages > CHUNKING_THRESHOLD:
+                use_chunking = True
+                logger.info(f"[{request_id}] Chunking enabled: {effective_pages} pages > {CHUNKING_THRESHOLD} threshold")
+
+        if use_chunking:
+            # Process in parallel chunks
+            parse_result = _process_chunked(
+                input_path=input_path,
+                base_output_dir=output_dir,
+                backend=use_backend,
+                lang=request.lang,
+                start_page=request.start_page,
+                end_page=request.end_page,
+                total_pages=total_pages,
+                request_id=request_id,
+                output_format=request.output_format,
+                include_images=request.include_images,
+            )
+        else:
+            # Process normally (single pass)
+            logger.info(f"[{request_id}] Processing without chunking")
+            proc, backend_used = _run_mineru(
+                input_path=input_path,
+                output_dir=output_dir,
+                backend=use_backend,
+                lang=request.lang,
+                start_page=request.start_page,
+                end_page=request.end_page,
+                request_id=request_id,
+            )
+
+            if proc.returncode != 0:
+                logger.error(f"[{request_id}] MinerU failed with code {proc.returncode}")
+                if proc.stderr:
+                    for line in proc.stderr.strip().split('\n'):
+                        logger.error(f"[{request_id}] [stderr] {line}")
+                raise RuntimeError(f"MinerU failed (code {proc.returncode}): {proc.stderr}")
+
+            # Read output
+            logger.info(f"[{request_id}] Reading output files...")
+            parse_result = _read_parse_output(output_dir, request.output_format, proc.stdout, proc.stderr, request_id, request.include_images)
+            parse_result.backend_used = backend_used
+
+            if backend_used != use_backend:
+                logger.info(f"[{request_id}] Note: Fell back from {use_backend} to {backend_used} due to GPU memory constraints")
+
+        total_duration = time.time() - start_time
+        logger.info(f"[{request_id}] {'='*50}")
+        logger.info(f"[{request_id}] Request completed successfully")
+        logger.info(f"[{request_id}] Pages processed: {parse_result.pages_processed}")
+        logger.info(f"[{request_id}] Total time: {total_duration:.2f}s")
+        if parse_result.pages_processed > 0:
+            logger.info(f"[{request_id}] Speed: {parse_result.pages_processed / total_duration:.2f} pages/sec")
+        logger.info(f"[{request_id}] {'='*50}")
+
+        return parse_result
+
+    except HTTPException:
+        # Re-raise validation errors (400/413) so they reach the client as proper
+        # HTTP status codes instead of being swallowed by the generic handler below
+        raise
+    except httpx.HTTPError as e:
+        total_duration = time.time() - start_time
+        logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
+        return ParseResponse(
+            success=False,
+            error=f"Failed to download file from URL: {str(e)}",
+        )
+    except Exception as e:
+        total_duration = time.time() - start_time
+        logger.error(f"[{request_id}] {'='*50}")
+        logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
+        logger.error(f"[{request_id}] {'='*50}")
+        return ParseResponse(
+            success=False,
+            error=str(e),
+        )
+    finally:
+        # Cleanup temp directory
+        shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.info(f"[{request_id}] Cleaned up temp directory")
+
+
+def _read_parse_output(output_dir: Path, output_format: str, stdout: str = "", stderr: str = "", request_id: str = "", include_images: bool = False) -> ParseResponse:
+    """Read the parsed output from MinerU output directory."""
+    log_prefix = f"[{request_id}] " if request_id else ""
+
+    # List all files in output directory for debugging
+    all_files = []
+    for root, dirs, files in os.walk(output_dir):
+        for f in files:
+            all_files.append(os.path.join(root, f))
+
+    logger.info(f"{log_prefix}Output directory contents: {len(all_files)} files")
+    for f in all_files:
+        logger.info(f"{log_prefix} - {f}")
+
+    # Find markdown files recursively in output directory
+    md_files = list(output_dir.glob("**/*.md"))
+    json_files_all = list(output_dir.glob("**/*.json"))
+
+    logger.info(f"{log_prefix}Found {len(md_files)} markdown files, {len(json_files_all)} JSON files")
+
+    if not md_files and not json_files_all:
+        logger.error(f"{log_prefix}No output files found!")
+        return ParseResponse(
+            success=False,
+            error=f"No output files found. All files: {all_files}. Stdout: {stdout[:500]}. Stderr: {stderr[:500]}",
+        )
+
+    # Read markdown output
+    markdown_content = None
+    if md_files:
+        markdown_content = md_files[0].read_text(encoding="utf-8")
+        logger.info(f"{log_prefix}Markdown content length: {len(markdown_content)} chars")
+
+    # Read JSON output (prefer non-content-list files)
+    json_content = None
+    main_json_files = [f for f in json_files_all if "_content_list" not in f.name]
+    if main_json_files:
+        try:
+            json_content = json.loads(main_json_files[0].read_text(encoding="utf-8"))
+            logger.info(f"{log_prefix}JSON content loaded from: {main_json_files[0].name}")
+        except json.JSONDecodeError as e:
+            logger.warning(f"{log_prefix}Failed to parse JSON: {e}")
+
+    # Count pages from content list if available
+    pages_processed = 0
+    content_list_files = [f for f in json_files_all if "_content_list" in f.name]
+    if content_list_files:
+        try:
+            content_list = json.loads(
+                content_list_files[0].read_text(encoding="utf-8")
+            )
+            if isinstance(content_list, list):
+                pages_processed = len(
+                    set(item.get("page_idx", 0) for item in content_list)
+                )
+            logger.info(f"{log_prefix}Pages processed: {pages_processed}")
+        except (json.JSONDecodeError, KeyError) as e:
+            logger.warning(f"{log_prefix}Failed to count pages: {e}")
+
+    # Extract images from output directory (only if requested)
+    images_zip = None
+    image_count = 0
+    if include_images:
+        images_zip, image_count = _create_images_zip_base64(output_dir)
+        if image_count > 0:
+            logger.info(f"{log_prefix}Extracted {image_count} images into zip")
+
+    if output_format == "json" and json_content:
+        logger.info(f"{log_prefix}Returning JSON output")
+        return ParseResponse(
+            success=True,
+            json_content=json_content,
+            images_zip=images_zip,
+            image_count=image_count,
+            pages_processed=pages_processed,
+        )
+    elif markdown_content:
+        logger.info(f"{log_prefix}Returning markdown output")
+        return ParseResponse(
+            success=True,
+            markdown=markdown_content,
+            images_zip=images_zip,
+            image_count=image_count,
+            pages_processed=pages_processed,
+        )
+    else:
+        logger.error(f"{log_prefix}No usable output generated")
+        return ParseResponse(
+            success=False,
+            error=f"No output generated. MD files: {[str(f) for f in md_files]}. JSON files: {[str(f) for f in json_files_all]}. Stderr: {stderr[:500]}",
+        )
+
+
+if __name__ == "__main__":
+    import uvicorn
+
+    uvicorn.run(app, host="0.0.0.0", port=7860)
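The `/parse/url` endpoint above accepts a JSON body matching `URLParseRequest` and requires Bearer token authentication. A minimal client sketch of how such a request would be assembled (the base URL and token are placeholders; an actual call would POST `body` to `endpoint` with `httpx.post`):

```python
import json

# Hypothetical values - substitute your Space URL and API token.
BASE_URL = "https://example-space.hf.space"
API_TOKEN = "hf_example_token"


def build_url_parse_request(url: str, output_format: str = "markdown",
                            include_images: bool = False) -> dict:
    """Build the endpoint, headers, and JSON body for a POST to /parse/url."""
    return {
        "endpoint": f"{BASE_URL}/parse/url",
        "headers": {
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        "body": {
            "url": url,
            "output_format": output_format,
            "include_images": include_images,
        },
    }


req = build_url_parse_request("https://example.com/sample.pdf", output_format="json")
print(json.dumps(req["body"], indent=2))
```

The body fields mirror the `URLParseRequest` model; omitted fields (`lang`, `backend`, `start_page`, `end_page`) fall back to the server-side defaults.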
requirements.txt ADDED
@@ -0,0 +1,19 @@
+# MinerU Document Parser API Dependencies
+# Pins match vLLM v0.14.1 base image to avoid pip backtracking
+
+# MinerU with core extra (pipeline + vlm + api + gradio)
+# Avoids [all], which adds platform-specific vllm pinning that conflicts with the base image
+mineru[core]>=1.0.0
+
+# Web framework
+fastapi>=0.115.0
+uvicorn[standard]>=0.32.0
+
+# File upload handling
+python-multipart>=0.0.9
+
+# HTTP client for URL parsing
+httpx>=0.27.0
+
+# Request/response models and data validation
+pydantic>=2.0.0
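When `include_images` is set, responses carry extracted images as a base64-encoded zip in the `images_zip` field; for chunked documents, entries are prefixed `chunk_<id>/` to avoid name collisions (see `_process_chunked` in app.py above). A sketch of client-side decoding (the file names here are illustrative, not actual MinerU output paths):

```python
import base64
import io
import zipfile


def extract_images_zip(images_zip_b64: str) -> dict:
    """Decode the base64 `images_zip` field and return {archive path: bytes}."""
    raw = base64.b64decode(images_zip_b64)
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Simulate a server response: two chunk-prefixed images, named the way the
# chunked path in app.py prefixes them.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("chunk_0/images/fig1.png", b"\x89PNG...")
    zf.writestr("chunk_1/images/fig2.png", b"\x89PNG...")
images_zip_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

images = extract_images_zip(images_zip_b64)
print(sorted(images))  # ['chunk_0/images/fig1.png', 'chunk_1/images/fig2.png']
```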