sidoutcome committed on
Commit 6162371 · 0 parent(s)

feat: support both API_TOKEN and API_DEV_TOKEN

Files changed (6)
  1. .gitignore +37 -0
  2. CLAUDE.md +238 -0
  3. Dockerfile +118 -0
  4. README.md +537 -0
  5. app.py +1226 -0
  6. requirements.txt +19 -0
.gitignore ADDED
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
.venv/
*.egg-info/
dist/
build/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Testing
.pytest_cache/
.coverage
htmlcov/

# Temp files
*.tmp
*.temp
temp/
tmp/

# Model cache
.cache/
models/

# OS
.DS_Store
Thumbs.db
CLAUDE.md ADDED
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MD Parser - A Hugging Face Spaces API service that deploys MinerU for PDF/document parsing. Transforms complex documents (PDFs, images) into LLM-ready markdown/JSON formats. API endpoints are protected by Bearer token authentication.

## Architecture

```
hf_md_parser/
├── app.py             # FastAPI application with parsing endpoints
├── Dockerfile         # HF Spaces Docker configuration (GPU-enabled, VLM models pre-downloaded)
├── requirements.txt   # Python dependencies
├── README.md          # HF Spaces metadata and API documentation
├── CLAUDE.md          # Claude Code development guide
└── .gitignore         # Git ignore patterns
```

## Common Commands

```bash
# Local development
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Test the API locally
curl -X POST "http://localhost:7860/parse" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Test deployed API (health check - no auth needed)
curl https://outcomelabs-md-parser.hf.space/

# Test deployed API (requires API_TOKEN)
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Build and test Docker locally
docker build -t hf-mineru .
docker run --gpus all --shm-size 32g -p 7860:7860 hf-mineru
```

## Deploying to Hugging Face Spaces

**Space URL:** https://huggingface.co/spaces/outcomelabs/md-parser
**API URL:** https://outcomelabs-md-parser.hf.space

### First-time Setup (already done)

```bash
hf auth login
hf repo create md-parser --repo-type space --space_sdk docker
git init
git remote add hf https://huggingface.co/spaces/outcomelabs/md-parser
```

### Push New Code

```bash
git add .
git commit -m "feat: description of changes"
git push hf main
```

### Force Push (if needed)

```bash
git push hf main --force
```

### Settings (configure in HF web UI)

- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr)
- **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
- **Secrets:** `API_TOKEN` (required for API authentication; `API_DEV_TOKEN` is also accepted)

## API Endpoints

| Endpoint     | Method | Description                                   |
| ------------ | ------ | --------------------------------------------- |
| `/`          | GET    | Health check and API info                     |
| `/parse`     | POST   | Parse a document (PDF/image) to markdown/JSON |
| `/parse/url` | POST   | Parse a document from URL                     |
| `/docs`      | GET    | OpenAPI documentation (Swagger UI)            |

## Key Dependencies

- **mineru[all]**: Core document parsing library
- **fastapi**: API framework
- **python-multipart**: File upload handling
- **uvicorn**: ASGI server
- **httpx**: HTTP client for URL parsing
- **pydantic**: Request/response validation

## Docker Base Image

Uses `vllm/vllm-openai:v0.14.1` as the base image. It ships vLLM with security patches (CVE-2025-66448/CVE-2025-30165), CUDA dependencies, and PyTorch pre-configured, and supports the Ampere, Ada Lovelace, and Hopper GPU architectures. `nvidia-cudnn-cu12` is installed separately to provide cuDNN 9 compatibility with MinerU's torch.

## Environment Variables

| Variable                      | Description                                         | Default                         |
| ----------------------------- | --------------------------------------------------- | ------------------------------- |
| `API_TOKEN`                   | **Required.** Secret token for API authentication   | (set in HF Secrets)             |
| `API_DEV_TOKEN`               | Alternative token, also accepted for authentication | (optional, set in HF Secrets)   |
| `MINERU_BACKEND`              | Parsing backend (pipeline, hybrid-auto-engine, vlm) | `pipeline`                      |
| `MINERU_LANG`                 | Default OCR language                                | `en`                            |
| `MAX_FILE_SIZE_MB`            | Maximum upload file size in MB                      | `1024`                          |
| `MINERU_MODEL_SOURCE`         | Model source (local = use pre-downloaded models)    | `local`                         |
| `HF_HOME`                     | HuggingFace model cache directory                   | `/home/user/.cache/huggingface` |
| `TORCH_HOME`                  | PyTorch model cache directory                       | `/home/user/.cache/torch`       |
| `MODELSCOPE_CACHE`            | ModelScope model cache directory                    | `/home/user/.cache/modelscope`  |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only)      | `0.4`                           |

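The configuration pattern these variables imply can be sketched as follows. This is an illustration of reading the table's defaults at startup, not the exact code in app.py:

```python
import os

# Defaults mirror the table above; the exact variable names app.py uses
# internally are an assumption.
MINERU_BACKEND = os.getenv("MINERU_BACKEND", "pipeline")
MINERU_LANG = os.getenv("MINERU_LANG", "en")
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
VLLM_GPU_MEMORY_UTILIZATION = float(os.getenv("VLLM_GPU_MEMORY_UTILIZATION", "0.4"))

print(MINERU_BACKEND, MAX_FILE_SIZE_MB)
```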
## Authentication

All `/parse` endpoints require a Bearer token in the Authorization header.

```bash
# Set API_TOKEN in HF Space Settings > Secrets
# Then call API with:
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

## HuggingFace Spaces Configuration

The `README.md` contains YAML frontmatter for HF Spaces:

- `sdk: docker` - Uses Docker SDK
- `app_port: 7860` - Standard Gradio/FastAPI port
- `suggested_hardware: a100-large` - Nvidia A100 GPU (80GB VRAM)

## Logging & Monitoring

The API provides comprehensive logging:

- **Request IDs**: Each request gets a unique 8-char ID (e.g., `[a1b2c3d4]`)
- **Startup logs**: Model cache status, MinerU version, configuration
- **Request logs**: File size, type, page range, processing time, pages/sec speed
- **MinerU output**: stdout/stderr from parsing commands
- **Error tracking**: Full exception details with context

View logs in HuggingFace Space → Logs tab.

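An 8-character hex ID like `[a1b2c3d4]` is typically derived from a UUID; a minimal sketch of that pattern (the function name and exact implementation in app.py are assumptions):

```python
import uuid

def new_request_id() -> str:
    # 8-char hex ID, logged as a prefix like [a1b2c3d4]
    return uuid.uuid4().hex[:8]

rid = new_request_id()
print(f"[{rid}] request received")
```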
## Performance Target

**A100 GPU (80GB):** ~100 pages/minute (1.5-2 pages/second)

## MinerU Backends

- **pipeline** (default): General purpose, 6GB VRAM minimum
- **hybrid-auto-engine**: Best accuracy + speed balance, 8-10GB VRAM minimum
- **vlm**: Vision-language model based, highest accuracy for complex docs

## Testing

The app exposes HTTP endpoints - there is no `parse_document` function to call directly. Instead, POST to the `/parse` endpoint while the server is running.

### Start the server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
```

### Test with curl

```bash
# Test /parse endpoint with a sample PDF (multipart form upload)
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown"

# Test /parse endpoint with images included
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown" \
  -F "include_images=true"

# Test /parse/url endpoint
curl -X POST "http://localhost:7860/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'

# Test /parse/url endpoint with images included
curl -X POST "http://localhost:7860/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown", "include_images": true}'
```

### Test with Python httpx

```python
import base64
import io
import zipfile

import httpx

API_URL = "http://localhost:7860"
API_TOKEN = "your_api_token"

# Test /parse endpoint with file upload
with open("sample.pdf", "rb") as f:
    response = httpx.post(
        f"{API_URL}/parse",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": ("sample.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )
print(response.json())

# Test /parse endpoint with images included
with open("sample.pdf", "rb") as f:
    response = httpx.post(
        f"{API_URL}/parse",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": ("sample.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )
result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")
    # Decode and extract images from zip
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for name in zf.namelist():
            print(f"  - {name}")
            # zf.read(name) returns the image bytes
```
Dockerfile ADDED
# Hugging Face Spaces Dockerfile for MinerU Document Parser API
# Based on official MinerU Docker deployment
# Optimized for L40S GPU (Ada Lovelace architecture, 48GB VRAM)
# Build: v1.4.0 - Using mineru[core] for full backend support

# Use official vLLM image as base (includes CUDA, PyTorch, vLLM properly configured)
# v0.14.1 includes security patches (CVE-2025-66448/CVE-2025-30165) and memory leak fixes
# Supports Ampere, Ada Lovelace, Hopper architectures (L40S is Ada Lovelace)
FROM vllm/vllm-openai:v0.14.1

USER root

RUN echo "========== BUILD STARTED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

# Install system dependencies (fonts required by MinerU, curl for health checks)
RUN echo "========== STEP 1: Installing system dependencies ==========" && \
    apt-get update && apt-get install -y --no-install-recommends \
        fonts-noto-core \
        fonts-noto-cjk \
        fontconfig \
        libgl1 \
        curl \
        poppler-utils \
    && fc-cache -fv && \
    rm -rf /var/lib/apt/lists/* && \
    echo "========== System dependencies installed =========="

# Create non-root user for HF Spaces (required by HuggingFace)
RUN useradd -m -u 1000 user

# Set environment variables (MINERU_MODEL_SOURCE set later after download)
# LD_LIBRARY_PATH includes pip nvidia packages for cuDNN (libcudnn.so.9)
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    MINERU_BACKEND=pipeline \
    MINERU_LANG=en \
    MAX_FILE_SIZE_MB=1024 \
    HF_HOME=/home/user/.cache/huggingface \
    TORCH_HOME=/home/user/.cache/torch \
    MODELSCOPE_CACHE=/home/user/.cache/modelscope \
    XDG_CACHE_HOME=/home/user/.cache \
    HOME=/home/user \
    PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH \
    LD_LIBRARY_PATH=/home/user/.local/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH \
    VLLM_GPU_MEMORY_UTILIZATION=0.4

# Create cache directories with correct ownership
RUN mkdir -p /home/user/.cache/huggingface \
    /home/user/.cache/torch \
    /home/user/.cache/modelscope \
    /home/user/app && \
    chown -R user:user /home/user

# Switch to non-root user
USER user
WORKDIR /home/user/app

# Copy requirements first for better caching
COPY --chown=user:user requirements.txt .

# Install Python dependencies
# Note: nvidia-cudnn-cu12 provides libcudnn.so.9 required by torch
RUN echo "========== STEP 2: Installing Python dependencies ==========" && \
    pip install --user --upgrade pip && \
    pip install --user nvidia-cudnn-cu12 && \
    pip install --user -r requirements.txt && \
    echo "Reinstalling modelscope in user space for torch compatibility..." && \
    pip install --user --force-reinstall modelscope && \
    echo "Installed packages:" && \
    pip list --user | grep -E "(mineru|fastapi|uvicorn|httpx|pydantic|modelscope|torch|cudnn|doclayout)" && \
    echo "========== Python dependencies installed =========="

# Create MinerU config file (required BEFORE downloading models)
# The mineru-models-download command reads ~/mineru.json to know where to store models
RUN echo "========== STEP 3a: Creating MinerU config ==========" && \
    mkdir -p /home/user/.cache/mineru/models && \
    echo '{"models-dir": {"pipeline": "/home/user/.cache/mineru/models", "vlm": "/home/user/.cache/mineru/models"}, "config_version": "1.3.1"}' > /home/user/mineru.json && \
    cat /home/user/mineru.json && \
    echo "========== MinerU config created =========="

# Download MinerU models using official tool
RUN echo "========== STEP 3b: Downloading MinerU models ==========" && \
    echo "This downloads all required models (~4-5GB)..." && \
    echo "Cache directories before download:" && \
    ls -la /home/user/.cache/ && \
    echo "Downloading all models from huggingface..." && \
    mineru-models-download --source huggingface --model_type all && \
    echo "" && \
    echo "========== Model cache summary ==========" && \
    echo "MinerU models cache:" && \
    du -sh /home/user/.cache/mineru 2>/dev/null || echo "  (empty)" && \
    ls -la /home/user/.cache/mineru/models 2>/dev/null || echo "  (no files)" && \
    find /home/user/.cache/mineru -type f 2>/dev/null | head -20 || echo "  (no files found)" && \
    echo "HuggingFace cache:" && \
    du -sh /home/user/.cache/huggingface 2>/dev/null || echo "  (empty)" && \
    echo "Total cache size:" && \
    du -sh /home/user/.cache 2>/dev/null || echo "  (empty)" && \
    echo "========== Models downloaded =========="

# Set model source to local AFTER downloading (prevents re-download at runtime)
ENV MINERU_MODEL_SOURCE=local

# Copy application code
COPY --chown=user:user . .

RUN echo "Files in app directory:" && ls -la /home/user/app/ && \
    echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

# Expose the port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=5 \
    CMD curl -f http://localhost:7860/ || exit 1

# Override vLLM entrypoint and run our FastAPI server
ENTRYPOINT []
CMD ["/usr/bin/python3", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "300"]
README.md ADDED
---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

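Per the commit title, either `API_TOKEN` or `API_DEV_TOKEN` is accepted. A minimal sketch of how such a check can work (the function name and exact logic in app.py are assumptions):

```python
import os

def is_authorized(authorization_header) -> bool:
    """Accept a Bearer token matching API_TOKEN or API_DEV_TOKEN."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return False
    token = authorization_header.removeprefix("Bearer ")
    valid = {os.getenv("API_TOKEN"), os.getenv("API_DEV_TOKEN")}
    valid.discard(None)  # ignore unset secrets
    return token in valid

os.environ["API_TOKEN"] = "secret123"
print(is_authorized("Bearer secret123"))  # True
print(is_authorized("Bearer wrong"))      # False
```

Note that if neither secret is set, every request is rejected rather than allowed through.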
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = path.join('./extracted_images', name);
      // Entry names may include subfolders (e.g. images/), so create parents first
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |

### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

#### Extracting Images (Python)

```python
import base64
import io
import zipfile

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.

## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, reports generated digitally
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing = less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override per-request with the `backend` parameter, or set the `MINERU_BACKEND` env var.

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output

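The detection and splitting steps boil down to simple page-range math, sketched here as an illustration (not the code in app.py):

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 10, threshold: int = 20):
    """Return (start_page, end_page) ranges, 0-indexed and end-exclusive.

    Documents at or below the threshold are processed as a single chunk.
    """
    if total_pages <= threshold:
        return [(0, total_pages)]
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]

print(chunk_page_ranges(30))  # [(0, 10), (10, 20), (20, 30)]
print(chunk_page_ranges(15))  # [(0, 15)] - below the 20-page threshold
```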
+ ### Performance Impact
+
+ | Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
+ | ------------- | ---------------- | ------------------------- | ------- |
+ | 30 pages | ~80 seconds | ~30 seconds | ~2.7x |
+ | 60 pages | ~160 seconds | ~55 seconds | ~2.9x |
+ | 100 pages | Timeout (>600s) | ~100 seconds | N/A |
+
+ ### OOM Protection
+
+ If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.
+
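The fallback decision boils down to scanning failed chunks' error output for known OOM signatures. The patterns below are the ones the service greps for (see `VLLM_MEMORY_ERROR_PATTERNS` in `app.py`); the function name is illustrative:

```python
# GPU memory error signatures that trigger the sequential retry
OOM_PATTERNS = (
    "Free memory on device cuda",
    "Decrease GPU memory utilization",
    "CUDA out of memory",
    "OutOfMemoryError",
)

def needs_sequential_retry(errors: list) -> bool:
    """True if any chunk's error output matches a known GPU OOM pattern."""
    return any(p in err for err in errors for p in OOM_PATTERNS)
```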
+ ### Notes
+
+ - Chunking only applies to PDF files (images are always processed as single units)
+ - Each chunk maintains context for tables and formulas within its page range
+ - Chunk boundaries are marked with HTML comments in markdown output for transparency
+ - If any chunk fails, partial results are still returned with an error message
+ - Requested backend is used for chunked processing (with OOM auto-fallback to sequential)
+
+ ## Supported File Types
+
+ - PDF (.pdf)
+ - Images (.png, .jpg, .jpeg, .tiff, .bmp)
+
+ Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
+
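Client-side, it is worth rejecting unsupported files before uploading. A small pre-flight check matching the limits above; the helper is illustrative, not part of the API:

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # the server default

def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the API."""
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {filename}")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise ValueError(f"File exceeds {MAX_FILE_SIZE_MB} MB limit")
```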
+ ## Configuration
+
+ | Environment Variable | Description | Default |
+ | ----------------------------- | -------------------------------------------------------------- | ---------- |
+ | `API_TOKEN` | Primary API authentication token (at least one token required) | - |
+ | `API_DEV_TOKEN` | Optional secondary token; requests may authenticate with either | - |
+ | `MINERU_BACKEND` | Default parsing backend | `pipeline` |
+ | `MINERU_LANG` | Default OCR language | `en` |
+ | `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
+ | `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4` |
+ | `CHUNK_SIZE` | Pages per chunk for chunked processing | `10` |
+ | `CHUNKING_THRESHOLD` | Minimum pages to trigger chunking | `20` |
+ | `MAX_WORKERS` | Parallel workers for chunk processing | `3` |
+
+ ### GPU Memory & Automatic Fallback
+
+ The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and still returns results (check `backend_used` in the response).
+
+ To force a specific backend or tune memory:
+
+ 1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
+ 2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
+
+ ## Performance
+
+ **Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)
+
+ | Backend | Speed | 15-page PDF | 31-page PDF |
+ | -------------------- | --------------- | ----------- | ----------- |
+ | `pipeline` | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
+ | `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |
+
+ **Trade-off:** Hybrid is about 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.
+
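Those throughput numbers make rough latency estimates easy. The speeds below come from the table above; chunking overlap, small-document overhead, and cold starts are not modeled:

```python
PAGES_PER_SEC = {"pipeline": 0.77, "hybrid-auto-engine": 0.39}

def estimate_seconds(pages: int, backend: str = "pipeline") -> float:
    """Rough wall-clock estimate for a warm, non-chunked run."""
    return pages / PAGES_PER_SEC[backend]
```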
+ **Sleep behavior:** Space sleeps after 60 minutes idle. First request after sleep takes ~30-60 seconds for cold start.
+
+ ## Deployment
+
+ - **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
+ - **API:** https://outcomelabs-md-parser.hf.space
+ - **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)
+
+ ### Deploy Updates
+
+ ```bash
+ git add .
+ git commit -m "feat: description"
+ git push hf main
+ ```
+
+ ## Logging
+
+ View logs in HuggingFace Space > Logs tab:
+
+ ```
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
+ 2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
+ 2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
+ ```
+
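The log lines above are plain `logging` output with a per-request ID prefixed to each message. A minimal reproduction of the formatter (matching the configuration in this repo's `app.py`; the demo logger name is arbitrary):

```python
import io
import logging

def configure(stream) -> logging.Logger:
    """Logger producing 'YYYY-MM-DD HH:MM:SS | LEVEL | message' lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s | %(levelname)-8s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    log = logging.getLogger("md-parser-demo")
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    log.propagate = False
    return log

buffer = io.StringIO()
log = configure(buffer)
log.info("[a1b2c3d4] New parse request received")
```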
+ ## Changelog
+
+ ### v1.4.0 (Breaking Change)
+
+ **Images now returned as a ZIP file instead of a dictionary:**
+
+ - `images` field removed
+ - `images_zip` field added (base64-encoded ZIP containing all images)
+ - `image_count` field added (number of images in the ZIP)
+
+ **Migration from v1.3.0:**
+
+ ```python
+ import base64
+ import io
+ import zipfile
+
+ # OLD (v1.3.0)
+ if result["images"]:
+     for filename, b64_data in result["images"].items():
+         img_bytes = base64.b64decode(b64_data)
+
+ # NEW (v1.4.0)
+ if result["images_zip"]:
+     zip_bytes = base64.b64decode(result["images_zip"])
+     with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
+         for filename in zf.namelist():
+             img_bytes = zf.read(filename)
+ ```
+
+ **Benefits:**
+
+ - Smaller payload size due to ZIP compression
+ - Single field instead of a large dictionary
+ - Easier to save/extract as a file
+
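To get the images back on disk, decode the `images_zip` field and extract the archive. A helper along these lines (the function name is illustrative):

```python
import base64
import io
import zipfile
from pathlib import Path

def extract_images(images_zip_b64: str, out_dir: str) -> list:
    """Decode the base64 images_zip field, extract it to out_dir, return archive names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = base64.b64decode(images_zip_b64)
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(out)
        return zf.namelist()
```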
+ ### v1.3.0
+
+ - Added `include_images` parameter for optional image extraction
+ - Added parallel chunking for large PDFs (>20 pages)
+ - Added automatic OOM fallback to sequential processing
+
+ ## Credits
+
+ Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.
app.py ADDED
@@ -0,0 +1,1226 @@
+ """
2
+ MinerU Document Parser API
3
+
4
+ A FastAPI service that wraps MinerU for parsing PDFs and images
5
+ into LLM-ready markdown/JSON formats.
6
+
7
+ Features:
8
+ - Automatic chunking for large PDFs (10 pages per chunk)
9
+ - Parallel processing of chunks for faster throughput
10
+ - Automatic fallback to pipeline backend on GPU memory errors
11
+ """
12
+
13
+ import asyncio
14
+ import base64
15
+ import io
16
+ import ipaddress
17
+ import json
18
+ import logging
19
+ import os
20
+ import re
21
+ import secrets
22
+ import shutil
23
+ import socket
24
+ import subprocess
25
+ import tempfile
26
+ import time
27
+ import zipfile
28
+ from concurrent.futures import ThreadPoolExecutor, as_completed
29
+ from pathlib import Path
30
+ from typing import BinaryIO, Optional, Union
31
+ from urllib.parse import urlparse
32
+ from uuid import uuid4
33
+
34
+ import httpx
35
+ from fastapi import Depends, FastAPI, File, Form, HTTPException, Request, UploadFile
36
+ from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
37
+ from pydantic import BaseModel
38
+
39
+ # Configure logging
40
+ logging.basicConfig(
41
+ level=logging.INFO,
42
+ format="%(asctime)s | %(levelname)-8s | %(message)s",
43
+ datefmt="%Y-%m-%d %H:%M:%S",
44
+ )
45
+ logger = logging.getLogger("md-parser")
46
+
47
+ # Security
48
+ API_TOKEN = os.getenv("API_TOKEN")
49
+ API_DEV_TOKEN = os.getenv("API_DEV_TOKEN")
50
+ security = HTTPBearer()
51
+
52
+
53
+ def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)) -> str:
54
+ """Verify the API token from Authorization header."""
55
+ if not API_TOKEN and not API_DEV_TOKEN:
56
+ raise HTTPException(
57
+ status_code=500,
58
+ detail="No API tokens configured on server",
59
+ )
60
+
61
+ token = credentials.credentials
62
+
63
+ # Check against both tokens
64
+ token_valid = False
65
+ if API_TOKEN and secrets.compare_digest(token, API_TOKEN):
66
+ token_valid = True
67
+ if API_DEV_TOKEN and secrets.compare_digest(token, API_DEV_TOKEN):
68
+ token_valid = True
69
+
70
+ if not token_valid:
71
+ raise HTTPException(
72
+ status_code=401,
73
+ detail="Invalid API token",
74
+ )
75
+ return token
76
+
77
+ from contextlib import asynccontextmanager
+
+
+ def _check_model_cache() -> dict:
+     """Check model cache status and return cache info."""
+     cache_info = {}
+     cache_dirs = [
+         ("HuggingFace", os.environ.get("HF_HOME", "/home/user/.cache/huggingface")),
+         ("Torch", os.environ.get("TORCH_HOME", "/home/user/.cache/torch")),
+         ("ModelScope", os.environ.get("MODELSCOPE_CACHE", "/home/user/.cache/modelscope")),
+     ]
+
+     for name, path in cache_dirs:
+         if os.path.exists(path):
+             try:
+                 # Get directory size
+                 total_size = 0
+                 file_count = 0
+                 for dirpath, dirnames, filenames in os.walk(path):
+                     for f in filenames:
+                         fp = os.path.join(dirpath, f)
+                         total_size += os.path.getsize(fp)
+                         file_count += 1
+                 size_mb = total_size / (1024 * 1024)
+                 cache_info[name] = {"size_mb": round(size_mb, 2), "files": file_count, "status": "cached"}
+             except Exception as e:
+                 cache_info[name] = {"status": f"error: {e}"}
+         else:
+             cache_info[name] = {"status": "not found"}
+
+     return cache_info
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Startup: verify MinerU is available and check model cache."""
+     logger.info("=" * 60)
+     logger.info("Starting MD Parser API v1.4.0...")
+     logger.info(f"Backend: {MINERU_BACKEND}")
+     logger.info(f"Default language: {MINERU_LANG}")
+     logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
+     logger.info(f"Chunking: {CHUNK_SIZE} pages/chunk, threshold {CHUNKING_THRESHOLD} pages, {MAX_WORKERS} workers")
+
+     try:
+         # Verify mineru CLI is available
+         result = subprocess.run(["mineru", "--version"], capture_output=True, text=True)
+         logger.info(f"MinerU version: {result.stdout.strip()}")
+     except Exception as e:
+         logger.warning(f"MinerU check failed: {e}")
+
+     # Check model cache status
+     logger.info("-" * 40)
+     logger.info("Model cache status:")
+     cache_info = _check_model_cache()
+     for name, info in cache_info.items():
+         if info.get("status") == "cached":
+             logger.info(f"  {name}: {info['size_mb']:.2f} MB ({info['files']} files) - CACHED")
+         else:
+             logger.warning(f"  {name}: {info.get('status', 'unknown')}")
+
+     total_cached = sum(info.get("size_mb", 0) for info in cache_info.values() if info.get("status") == "cached")
+     if total_cached > 0:
+         logger.info(f"  Total cached: {total_cached:.2f} MB")
+         logger.info("  Models are pre-loaded - no download needed at runtime")
+     else:
+         logger.warning("  No cached models found - first request may be slow")
+
+     logger.info("=" * 60)
+     logger.info("MD Parser API ready to accept requests")
+     logger.info("=" * 60)
+     yield
+     logger.info("Shutting down MD Parser API...")
+
+
+ app = FastAPI(
+     title="MD Parser API",
+     description="Transform PDFs and images into markdown/JSON using MinerU",
+     version="1.4.0",
+     lifespan=lifespan,
+ )
+
+ # Configuration from environment (optimized for A100 GPU)
+ MINERU_BACKEND = os.getenv("MINERU_BACKEND", "pipeline")
+ MINERU_LANG = os.getenv("MINERU_LANG", "en")
+ MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
+ MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
+
+ # Chunking configuration
+ CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "10"))  # Pages per chunk
+ # MAX_WORKERS: Number of parallel workers for chunk processing
+ # - Default 3 for faster processing on A100 (80GB VRAM)
+ # - If OOM occurs, automatically falls back to sequential (1 worker)
+ MAX_WORKERS = int(os.getenv("MAX_WORKERS", "3"))
+ CHUNKING_THRESHOLD = int(os.getenv("CHUNKING_THRESHOLD", "20"))  # Min pages to enable chunking
+
+ # Enable torch.compile for ~15% speedup if available
+ if os.getenv("TORCH_COMPILE_ENABLED", "0") == "1":
+     try:
+         import torch
+         torch.set_float32_matmul_precision('high')
+     except Exception:
+         pass
+
+ # Blocked hostnames for SSRF protection
+ BLOCKED_HOSTNAMES = {
+     "localhost",
+     "metadata",
+     "metadata.google.internal",
+     "metadata.google",
+     "169.254.169.254",  # AWS/GCP/Azure metadata service
+     "fd00:ec2::254",  # AWS IPv6 metadata
+ }
+
+
+ def _validate_url(url: str) -> None:
+     """
+     Validate URL to prevent SSRF attacks.
+
+     Raises HTTPException if URL is invalid or points to internal/private resources.
+     """
+     try:
+         parsed = urlparse(url)
+     except Exception as e:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid URL format: {str(e)}",
+         )
+
+     # Check scheme
+     if parsed.scheme not in ("http", "https"):
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid URL scheme '{parsed.scheme}'. Only http and https are allowed.",
+         )
+
+     # Check hostname exists
+     hostname = parsed.hostname
+     if not hostname:
+         raise HTTPException(
+             status_code=400,
+             detail="Invalid URL: missing hostname.",
+         )
+
+     # Check against blocked hostnames
+     hostname_lower = hostname.lower()
+     if hostname_lower in BLOCKED_HOSTNAMES:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to internal/metadata services is not allowed.",
+         )
+
+     # Block hostnames containing suspicious patterns
+     blocked_patterns = ["metadata", "internal", "localhost", "127.0.0.1", "::1"]
+     for pattern in blocked_patterns:
+         if pattern in hostname_lower:
+             raise HTTPException(
+                 status_code=400,
+                 detail="Access to internal/metadata services is not allowed.",
+             )
+
+     # Resolve hostname and check IP address
+     try:
+         ip_str = socket.gethostbyname(hostname)
+         ip = ipaddress.ip_address(ip_str)
+     except socket.gaierror:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Could not resolve hostname: {hostname}",
+         )
+     except ValueError as e:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Invalid IP address resolved: {str(e)}",
+         )
+
+     # Block private, loopback, link-local, and reserved IP ranges
+     if ip.is_private:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to private IP addresses is not allowed.",
+         )
+     if ip.is_loopback:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to loopback addresses is not allowed.",
+         )
+     if ip.is_link_local:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to link-local addresses is not allowed.",
+         )
+     if ip.is_reserved:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to reserved IP addresses is not allowed.",
+         )
+     if ip.is_multicast:
+         raise HTTPException(
+             status_code=400,
+             detail="Access to multicast addresses is not allowed.",
+         )
+
+
+ def _save_uploaded_file(input_path: Path, file_obj: BinaryIO) -> None:
+     """Sync helper to save uploaded file to disk (runs in thread)."""
+     with open(input_path, "wb") as f:
+         shutil.copyfileobj(file_obj, f)
+
+
+ def _save_downloaded_content(input_path: Path, content: bytes) -> None:
+     """Sync helper to save downloaded content to disk (runs in thread)."""
+     with open(input_path, "wb") as f:
+         f.write(content)
+
+
+ def _extract_images_as_zip(output_dir: Path, prefix: str = "") -> tuple[bytes, int]:
+     """
+     Extract all images from output directory and return as zip file bytes.
+
+     Args:
+         output_dir: Directory containing images (MinerU puts them in images/ subfolder)
+         prefix: Optional prefix for image paths in the zip (e.g., "chunk_0/")
+
+     Returns:
+         Tuple of (zip_bytes, image_count)
+     """
+     image_extensions = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}
+
+     zip_buffer = io.BytesIO()
+     image_count = 0
+
+     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zf:
+         for img_path in output_dir.glob("**/*"):
+             if img_path.is_file() and img_path.suffix.lower() in image_extensions:
+                 try:
+                     # Use relative path from output_dir as path in zip
+                     relative_path = img_path.relative_to(output_dir)
+                     zip_path = f"{prefix}{relative_path}" if prefix else str(relative_path)
+                     zf.write(img_path, zip_path)
+                     image_count += 1
+                 except Exception as e:
+                     logger.warning(f"Failed to add image {img_path} to zip: {e}")
+
+     return zip_buffer.getvalue(), image_count
+
+
+ def _create_images_zip_base64(output_dir: Path, prefix: str = "") -> tuple[Optional[str], int]:
+     """
+     Extract images and return as base64-encoded zip.
+
+     Returns:
+         Tuple of (base64_zip_string or None if no images, image_count)
+     """
+     zip_bytes, image_count = _extract_images_as_zip(output_dir, prefix)
+
+     if image_count == 0:
+         return None, 0
+
+     return base64.b64encode(zip_bytes).decode("utf-8"), image_count
+
+
+ class ParseResponse(BaseModel):
+     """Response model for document parsing."""
+
+     success: bool
+     markdown: Optional[str] = None
+     json_content: Optional[Union[dict, list]] = None  # Can be dict (single) or list (chunked)
+     images_zip: Optional[str] = None  # Base64-encoded zip file containing all images
+     image_count: int = 0  # Number of images in the zip
+     error: Optional[str] = None
+     pages_processed: int = 0
+     backend_used: Optional[str] = None  # Actual backend used (may differ if fallback occurred)
+
+
+ # vLLM GPU memory error patterns that trigger fallback to pipeline
+ VLLM_MEMORY_ERROR_PATTERNS = [
+     "Free memory on device cuda",
+     "Decrease GPU memory utilization",
+     "CUDA out of memory",
+     "OutOfMemoryError",
+ ]
+
+
+ def _has_gpu_memory_error(output: str) -> bool:
+     """Check if output contains GPU memory error patterns."""
+     for pattern in VLLM_MEMORY_ERROR_PATTERNS:
+         if pattern in output:
+             return True
+     return False
+
+
+ def _run_mineru(
+     input_path: Path,
+     output_dir: Path,
+     backend: str,
+     lang: str,
+     start_page: int,
+     end_page: Optional[int],
+     request_id: str,
+ ) -> tuple[subprocess.CompletedProcess, str]:
+     """
+     Run MinerU with the specified backend.
+
+     Returns tuple of (process result, backend actually used).
+     If a GPU memory error occurs with the hybrid backend, automatically retries with pipeline.
+     """
+     def build_cmd(use_backend: str) -> list[str]:
+         cmd = [
+             "mineru",
+             "-p", str(input_path),
+             "-o", str(output_dir),
+             "-b", use_backend,
+             "-l", lang,
+         ]
+         if start_page > 0:
+             cmd.extend(["-s", str(start_page)])
+         if end_page is not None:
+             cmd.extend(["-e", str(end_page)])
+         return cmd
+
+     # First attempt with requested backend
+     cmd = build_cmd(backend)
+     logger.info(f"[{request_id}] Starting MinerU processing...")
+     logger.info(f"[{request_id}] Command: {' '.join(cmd)}")
+     logger.info(f"[{request_id}] Backend: {backend}")
+
+     parse_start = time.time()
+     proc = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
+     parse_duration = time.time() - parse_start
+
+     logger.info(f"[{request_id}] MinerU completed in {parse_duration:.2f}s")
+     logger.info(f"[{request_id}] Return code: {proc.returncode}")
+
+     if proc.stdout:
+         for line in proc.stdout.strip().split('\n')[-10:]:
+             logger.info(f"[{request_id}] [stdout] {line}")
+
+     if proc.stderr:
+         for line in proc.stderr.strip().split('\n')[-10:]:
+             logger.warning(f"[{request_id}] [stderr] {line}")
+
+     combined_output = (proc.stdout or "") + (proc.stderr or "")
+
+     # Check for GPU memory errors and fallback to pipeline if needed
+     if backend != "pipeline" and _has_gpu_memory_error(combined_output):
+         logger.warning(f"[{request_id}] GPU memory error detected with {backend}, falling back to pipeline...")
+
+         # Clear output directory for retry
+         for f in output_dir.glob("*"):
+             if f.is_file():
+                 f.unlink()
+             elif f.is_dir():
+                 shutil.rmtree(f)
+
+         # Retry with pipeline backend
+         fallback_cmd = build_cmd("pipeline")
+         logger.info(f"[{request_id}] Retrying with pipeline backend...")
+         logger.info(f"[{request_id}] Command: {' '.join(fallback_cmd)}")
+
+         parse_start = time.time()
+         proc = subprocess.run(fallback_cmd, capture_output=True, text=True, timeout=600)
+         parse_duration = time.time() - parse_start
+
+         logger.info(f"[{request_id}] MinerU (pipeline fallback) completed in {parse_duration:.2f}s")
+         logger.info(f"[{request_id}] Return code: {proc.returncode}")
+
+         if proc.stdout:
+             for line in proc.stdout.strip().split('\n')[-10:]:
+                 logger.info(f"[{request_id}] [stdout] {line}")
+
+         return proc, "pipeline"
+
+     return proc, backend
+
+
+ def _get_pdf_page_count(input_path: Path) -> int:
+     """Get the total number of pages in a PDF using pdfinfo."""
+     try:
+         result = subprocess.run(
+             ["pdfinfo", str(input_path)],
+             capture_output=True,
+             text=True,
+             timeout=30
+         )
+         if result.returncode == 0:
+             for line in result.stdout.split('\n'):
+                 if line.startswith('Pages:'):
+                     return int(line.split(':')[1].strip())
+     except Exception as e:
+         logger.warning(f"Failed to get PDF page count: {e}")
+     return 0
+
+
+ def _process_single_chunk(
+     chunk_id: int,
+     input_path: Path,
+     chunk_output_dir: Path,
+     backend: str,
+     lang: str,
+     start_page: int,
+     end_page: int,
+     request_id: str,
+     include_images: bool = False,
+ ) -> dict:
+     """Process a single chunk of pages. Returns dict with chunk results."""
+     chunk_request_id = f"{request_id}-c{chunk_id}"
+     logger.info(f"[{chunk_request_id}] Processing chunk {chunk_id}: pages {start_page}-{end_page}")
+
+     try:
+         chunk_output_dir.mkdir(parents=True, exist_ok=True)
+
+         proc, backend_used = _run_mineru(
+             input_path=input_path,
+             output_dir=chunk_output_dir,
+             backend=backend,
+             lang=lang,
+             start_page=start_page,
+             end_page=end_page,
+             request_id=chunk_request_id,
+         )
+
+         if proc.returncode != 0:
+             logger.error(f"[{chunk_request_id}] Chunk {chunk_id} failed with code {proc.returncode}")
+             return {
+                 "chunk_id": chunk_id,
+                 "success": False,
+                 "error": f"MinerU failed (code {proc.returncode}): {proc.stderr[:500] if proc.stderr else 'No stderr'}",
+                 "backend_used": backend_used,
+                 "pages": end_page - start_page + 1,
+             }
+
+         # Read chunk output - list all files for debugging
+         all_files = list(chunk_output_dir.glob("**/*"))
+         logger.info(f"[{chunk_request_id}] Output files: {[str(f) for f in all_files[:20]]}")
+
+         md_files = list(chunk_output_dir.glob("**/*.md"))
+         markdown_content = ""
+         if md_files:
+             markdown_content = md_files[0].read_text(encoding="utf-8")
+             logger.info(f"[{chunk_request_id}] Found markdown: {md_files[0]}")
+
+         json_content = None
+         json_files = [f for f in chunk_output_dir.glob("**/*.json") if "_content_list" not in f.name]
+         if json_files:
+             try:
+                 json_content = json.loads(json_files[0].read_text(encoding="utf-8"))
+             except json.JSONDecodeError:
+                 pass
+
+         # Extract images from chunk output (only if requested)
+         chunk_images_zip = None
+         chunk_image_count = 0
+         if include_images:
+             zip_bytes, chunk_image_count = _extract_images_as_zip(chunk_output_dir)
+             # Only keep zip bytes if we actually have images
+             if chunk_image_count > 0:
+                 chunk_images_zip = zip_bytes
+
+         logger.info(f"[{chunk_request_id}] Chunk {chunk_id} completed: {len(markdown_content)} chars markdown, json={'yes' if json_content else 'no'}, images={chunk_image_count}")
+
+         # Check if we got any content - empty output might indicate a problem
+         has_content = bool(markdown_content.strip()) or bool(json_content)
+         if not has_content:
+             logger.warning(f"[{chunk_request_id}] Chunk {chunk_id} produced no content (pages {start_page}-{end_page})")
+
+         return {
+             "chunk_id": chunk_id,
+             "success": True,  # MinerU succeeded, even if content is empty (e.g., blank pages)
+             "markdown": markdown_content,
+             "json_content": json_content,
+             "images_zip_bytes": chunk_images_zip,
+             "image_count": chunk_image_count,
+             "backend_used": backend_used,
+             "pages": end_page - start_page + 1,
+             "start_page": start_page,
+             "end_page": end_page,
+             "has_content": has_content,
+         }
+
+     except Exception as e:
+         logger.error(f"[{chunk_request_id}] Chunk {chunk_id} exception: {e}")
+         return {
+             "chunk_id": chunk_id,
+             "success": False,
+             "error": str(e),
+             "backend_used": backend,
+             "pages": 0,
+         }
+
+
+ def _has_oom_error_in_results(chunk_results: list) -> bool:
+     """Check if any chunk failed due to OOM error."""
+     for r in chunk_results:
+         if not r["success"]:
+             error_msg = r.get("error", "")
+             if any(pattern in error_msg for pattern in VLLM_MEMORY_ERROR_PATTERNS):
+                 return True
+     return False
+
+
+ def _process_chunks_with_workers(
+     chunks: list,
+     input_path: Path,
+     base_output_dir: Path,
+     chunk_backend: str,
+     lang: str,
+     request_id: str,
+     num_workers: int,
+     include_images: bool = False,
+ ) -> list:
+     """Process chunks with specified number of workers."""
+     chunk_results = []
+     with ThreadPoolExecutor(max_workers=num_workers) as executor:
+         futures = {}
+         for cid, cstart, cend in chunks:
+             chunk_output_dir = base_output_dir / f"chunk_{cid}"
+             # Clean up any previous attempt
+             if chunk_output_dir.exists():
+                 shutil.rmtree(chunk_output_dir)
+             future = executor.submit(
+                 _process_single_chunk,
+                 cid,
+                 input_path,
+                 chunk_output_dir,
+                 chunk_backend,
+                 lang,
+                 cstart,
+                 cend,
+                 request_id,
+                 include_images,
+             )
+             futures[future] = cid
+
+         for future in as_completed(futures):
+             result = future.result()
+             chunk_results.append(result)
+     return chunk_results
+
+
+ def _process_chunked(
619
+ input_path: Path,
620
+ base_output_dir: Path,
621
+ backend: str,
622
+ lang: str,
623
+ start_page: int,
624
+ end_page: Optional[int],
625
+ total_pages: int,
626
+ request_id: str,
627
+ output_format: str,
628
+ include_images: bool = False,
629
+ ) -> ParseResponse:
630
+ """Process a PDF in parallel chunks and combine results.
631
+
632
+ Automatically falls back to sequential processing if OOM errors are detected.
633
+ """
634
+ # Calculate actual end page
635
+ actual_end = end_page if end_page is not None else total_pages - 1
636
+
637
+ # Generate chunk ranges
638
+ chunks = []
639
+ current_start = start_page
640
+ chunk_id = 0
641
+ while current_start <= actual_end:
642
+ chunk_end = min(current_start + CHUNK_SIZE - 1, actual_end)
643
+ chunks.append((chunk_id, current_start, chunk_end))
644
+ current_start = chunk_end + 1
645
+ chunk_id += 1
646
+
647
+ # Use requested backend for chunked processing
648
+ # OOM protection will automatically fall back to sequential if needed
649
+ chunk_backend = backend
650
+
651
+ logger.info(f"[{request_id}] Splitting into {len(chunks)} chunks of up to {CHUNK_SIZE} pages each")
652
+ logger.info(f"[{request_id}] Backend: {chunk_backend}, workers: {MAX_WORKERS}")
653
+
+    # Process chunks - start with configured workers, fall back to sequential on OOM
+    current_workers = MAX_WORKERS
+    chunk_results = _process_chunks_with_workers(
+        chunks, input_path, base_output_dir, chunk_backend, lang, request_id, current_workers, include_images
+    )
+
+    # Check for OOM errors and retry with fewer workers if needed
+    if _has_oom_error_in_results(chunk_results) and current_workers > 1:
+        logger.warning(f"[{request_id}] OOM detected with {current_workers} workers, retrying sequentially (1 worker)")
+        # Clean up and retry with sequential processing
+        for cid, _, _ in chunks:
+            chunk_dir = base_output_dir / f"chunk_{cid}"
+            if chunk_dir.exists():
+                shutil.rmtree(chunk_dir)
+
+        chunk_results = _process_chunks_with_workers(
+            chunks, input_path, base_output_dir, chunk_backend, lang, request_id, 1, include_images
+        )
+
+    # Sort by chunk_id to maintain page order
+    chunk_results.sort(key=lambda x: x["chunk_id"])
+
+    # Check for failures and empty chunks
+    failed_chunks = [r for r in chunk_results if not r["success"]]
+    if failed_chunks:
+        errors = "; ".join([f"Chunk {r['chunk_id']}: {r.get('error', 'Unknown')}" for r in failed_chunks])
+        logger.error(f"[{request_id}] {len(failed_chunks)} chunks failed: {errors}")
+
+    empty_chunks = [r for r in chunk_results if r["success"] and not r.get("has_content", True)]
+    if empty_chunks:
+        empty_ranges = [f"pages {r['start_page']}-{r['end_page']}" for r in empty_chunks]
+        logger.warning(f"[{request_id}] {len(empty_chunks)} chunks had no content: {', '.join(empty_ranges)}")
+
+    # Combine results
+    total_pages_processed = sum(r.get("pages", 0) for r in chunk_results if r["success"])
+    backends_used = list(set(r.get("backend_used", backend) for r in chunk_results if r["success"]))
+    backend_used = backends_used[0] if len(backends_used) == 1 else ",".join(backends_used)
+
+    # Combine images from all chunks into a single zip (with chunk prefixes to avoid collisions)
+    combined_zip_buffer = io.BytesIO()
+    total_image_count = 0
+
+    with zipfile.ZipFile(combined_zip_buffer, 'w', zipfile.ZIP_DEFLATED) as combined_zf:
+        for r in chunk_results:
+            if r["success"] and r.get("images_zip_bytes"):
+                chunk_zip_bytes = r["images_zip_bytes"]
+                chunk_id = r["chunk_id"]
+
+                # Extract from chunk zip and add to combined zip with chunk prefix
+                with zipfile.ZipFile(io.BytesIO(chunk_zip_bytes), 'r') as chunk_zf:
+                    for name in chunk_zf.namelist():
+                        prefixed_name = f"chunk_{chunk_id}/{name}"
+                        combined_zf.writestr(prefixed_name, chunk_zf.read(name))
+                        total_image_count += 1
+
+    combined_images_zip = None
+    if total_image_count > 0:
+        combined_images_zip = base64.b64encode(combined_zip_buffer.getvalue()).decode("utf-8")
+        logger.info(f"[{request_id}] Combined {total_image_count} images from all chunks into zip")
+
+    if output_format == "json":
+        # Combine JSON content (merge arrays, wrap single results)
+        combined_json = []
+        for r in chunk_results:
+            if r["success"] and r.get("json_content"):
+                jc = r["json_content"]
+                if isinstance(jc, list):
+                    combined_json.extend(jc)
+                else:
+                    combined_json.append(jc)
+
+        if failed_chunks and not combined_json:
+            return ParseResponse(
+                success=False,
+                error=f"All chunks failed: {errors}",
+                pages_processed=0,
+                backend_used=backend_used,
+            )
+
+        return ParseResponse(
+            success=True,
+            json_content=combined_json if combined_json else None,
+            images_zip=combined_images_zip,
+            image_count=total_image_count,
+            pages_processed=total_pages_processed,
+            backend_used=backend_used,
+            error=f"{len(failed_chunks)} chunks failed" if failed_chunks else None,
+        )
+    else:
+        # Combine markdown content
+        combined_markdown = []
+        for r in chunk_results:
+            if r["success"] and r.get("markdown"):
+                # Insert a chunk separator comment (with page range) between chunks for clarity
+                if combined_markdown:
+                    combined_markdown.append(f"\n\n<!-- Chunk {r['chunk_id']} (pages {r['start_page']}-{r['end_page']}) -->\n\n")
+                combined_markdown.append(r["markdown"])
+
+        if failed_chunks and not combined_markdown:
+            return ParseResponse(
+                success=False,
+                error=f"All chunks failed: {errors}",
+                pages_processed=0,
+                backend_used=backend_used,
+            )
+
+        return ParseResponse(
+            success=True,
+            markdown="".join(combined_markdown) if combined_markdown else None,
+            images_zip=combined_images_zip,
+            image_count=total_image_count,
+            pages_processed=total_pages_processed,
+            backend_used=backend_used,
+            error=f"{len(failed_chunks)} chunks failed" if failed_chunks else None,
+        )
+
+
+class HealthResponse(BaseModel):
+    """Health check response."""
+
+    status: str
+    version: str
+    backend: str
+    chunk_size: int
+    chunking_threshold: int
+    max_workers: int
+
+
+class URLParseRequest(BaseModel):
+    """Request model for URL-based parsing."""
+
+    url: str
+    output_format: str = "markdown"
+    lang: str = MINERU_LANG
+    backend: Optional[str] = None  # Override backend: pipeline, hybrid-auto-engine
+    start_page: int = 0
+    end_page: Optional[int] = None
+    include_images: bool = False  # Include base64-encoded images in response
+
+
+@app.get("/", response_model=HealthResponse)
+async def health_check() -> HealthResponse:
+    """Health check endpoint."""
+    return HealthResponse(
+        status="healthy",
+        version="1.4.0",
+        backend=MINERU_BACKEND,
+        chunk_size=CHUNK_SIZE,
+        chunking_threshold=CHUNKING_THRESHOLD,
+        max_workers=MAX_WORKERS,
+    )
+
+
+@app.post("/parse", response_model=ParseResponse)
+async def parse_document(
+    file: UploadFile = File(..., description="PDF or image file to parse"),
+    output_format: str = Form(
+        default="markdown", description="Output format: markdown or json"
+    ),
+    lang: str = Form(default=MINERU_LANG, description="OCR language code"),
+    start_page: int = Form(default=0, description="Starting page (0-indexed)"),
+    end_page: Optional[int] = Form(default=None, description="Ending page (None=all)"),
+    backend: Optional[str] = Form(default=None, description="Override backend: pipeline, hybrid-auto-engine"),
+    include_images: bool = Form(default=False, description="Include base64-encoded images in response"),
+    _token: str = Depends(verify_token),
+) -> ParseResponse:
+    """
+    Parse a document file (PDF or image) and return extracted content.
+
+    Supports:
+    - PDF files (.pdf)
+    - Images (.png, .jpg, .jpeg, .tiff, .bmp)
+    """
+    request_id = str(uuid4())[:8]
+    start_time = time.time()
+
+    logger.info(f"[{request_id}] {'='*50}")
+    logger.info(f"[{request_id}] New parse request received")
+    logger.info(f"[{request_id}] Filename: {file.filename}")
+    logger.info(f"[{request_id}] Output format: {output_format}")
+    logger.info(f"[{request_id}] Language: {lang}")
+    logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")
+
+    # Validate file size (seek to end to measure, then rewind for reading)
+    file.file.seek(0, 2)
+    file_size = file.file.tell()
+    file.file.seek(0)
+
+    file_size_mb = file_size / (1024 * 1024)
+    logger.info(f"[{request_id}] File size: {file_size_mb:.2f} MB")
+
+    if file_size > MAX_FILE_SIZE_BYTES:
+        logger.error(f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB")
+        raise HTTPException(
+            status_code=413,
+            detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
+        )
+
+    # Validate file type
+    allowed_extensions = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
+    file_ext = Path(file.filename).suffix.lower() if file.filename else ""
+    if file_ext not in allowed_extensions:
+        logger.error(f"[{request_id}] Unsupported file type: {file_ext}")
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}",
+        )
+
+    logger.info(f"[{request_id}] File type: {file_ext}")
+
+    # Create temp directory for processing
+    temp_dir = tempfile.mkdtemp()
+    logger.info(f"[{request_id}] Created temp directory: {temp_dir}")
+
+    try:
+        # Save uploaded file (run blocking I/O in thread)
+        input_path = Path(temp_dir) / f"input{file_ext}"
+        await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
+        logger.info(f"[{request_id}] Saved file to: {input_path}")
+
+        # Create output directory
+        output_dir = Path(temp_dir) / "output"
+        output_dir.mkdir(exist_ok=True)
+
+        use_backend = backend if backend else MINERU_BACKEND
+
+        # Check if chunking should be used (PDF only, sufficient pages)
+        total_pages = 0
+        use_chunking = False
+        if file_ext == ".pdf":
+            total_pages = _get_pdf_page_count(input_path)
+            logger.info(f"[{request_id}] PDF has {total_pages} pages")
+
+            # Calculate effective page range
+            effective_end = end_page if end_page is not None else total_pages - 1
+            effective_pages = effective_end - start_page + 1
+
+            if effective_pages > CHUNKING_THRESHOLD:
+                use_chunking = True
+                logger.info(f"[{request_id}] Chunking enabled: {effective_pages} pages > {CHUNKING_THRESHOLD} threshold")
+
+        if use_chunking:
+            # Process in parallel chunks
+            parse_result = _process_chunked(
+                input_path=input_path,
+                base_output_dir=output_dir,
+                backend=use_backend,
+                lang=lang,
+                start_page=start_page,
+                end_page=end_page,
+                total_pages=total_pages,
+                request_id=request_id,
+                output_format=output_format,
+                include_images=include_images,
+            )
909
+ else:
910
+ # Process normally (single pass)
911
+ logger.info(f"[{request_id}] Processing without chunking")
912
+ proc, backend_used = _run_mineru(
913
+ input_path=input_path,
914
+ output_dir=output_dir,
915
+ backend=use_backend,
916
+ lang=lang,
917
+ start_page=start_page,
918
+ end_page=end_page,
919
+ request_id=request_id,
920
+ )
921
+
922
+ if proc.returncode != 0:
923
+ logger.error(f"[{request_id}] MinerU failed with code {proc.returncode}")
924
+ if proc.stderr:
925
+ for line in proc.stderr.strip().split('\n'):
926
+ logger.error(f"[{request_id}] [stderr] {line}")
927
+ raise RuntimeError(f"MinerU failed (code {proc.returncode}): {proc.stderr}")
928
+
929
+ # Read output
930
+ logger.info(f"[{request_id}] Reading output files...")
931
+ parse_result = _read_parse_output(output_dir, output_format, proc.stdout, proc.stderr, request_id, include_images)
932
+ parse_result.backend_used = backend_used
933
+
934
+ if backend_used != use_backend:
935
+ logger.info(f"[{request_id}] Note: Fell back from {use_backend} to {backend_used} due to GPU memory constraints")
936
+
937
+ total_duration = time.time() - start_time
938
+ logger.info(f"[{request_id}] {'='*50}")
939
+ logger.info(f"[{request_id}] Request completed successfully")
940
+ logger.info(f"[{request_id}] Pages processed: {parse_result.pages_processed}")
941
+ logger.info(f"[{request_id}] Total time: {total_duration:.2f}s")
942
+ if parse_result.pages_processed > 0:
943
+ logger.info(f"[{request_id}] Speed: {parse_result.pages_processed / total_duration:.2f} pages/sec")
944
+ logger.info(f"[{request_id}] {'='*50}")
945
+
946
+ return parse_result
947
+
948
+ except Exception as e:
949
+ total_duration = time.time() - start_time
950
+ logger.error(f"[{request_id}] {'='*50}")
951
+ logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
952
+ logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
953
+ logger.error(f"[{request_id}] {'='*50}")
954
+ return ParseResponse(
955
+ success=False,
956
+ error=f"{type(e).__name__}: {str(e)}",
957
+ )
958
+ finally:
959
+ # Cleanup temp directory
960
+ shutil.rmtree(temp_dir, ignore_errors=True)
961
+ logger.info(f"[{request_id}] Cleaned up temp directory")
962
+
+
+
+@app.post("/parse/url", response_model=ParseResponse)
+async def parse_document_from_url(
+    request: URLParseRequest,
+    _token: str = Depends(verify_token),
+) -> ParseResponse:
+    """
+    Parse a document from a URL.
+
+    Downloads the file and processes it through MinerU.
+    """
+    request_id = str(uuid4())[:8]
+    start_time = time.time()
+
+    logger.info(f"[{request_id}] {'='*50}")
+    logger.info(f"[{request_id}] New URL parse request received")
+    logger.info(f"[{request_id}] URL: {request.url}")
+    logger.info(f"[{request_id}] Output format: {request.output_format}")
+    logger.info(f"[{request_id}] Language: {request.lang}")
+    logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")
+
+    # Validate URL to prevent SSRF attacks
+    logger.info(f"[{request_id}] Validating URL...")
+    _validate_url(request.url)
+    logger.info(f"[{request_id}] URL validation passed")
+
+    temp_dir = tempfile.mkdtemp()
+    logger.info(f"[{request_id}] Created temp directory: {temp_dir}")
+
+    try:
+        # Download file from URL
+        logger.info(f"[{request_id}] Downloading file from URL...")
+        download_start = time.time()
+        async with httpx.AsyncClient(timeout=60.0, follow_redirects=True) as client:
+            response = await client.get(request.url)
+            response.raise_for_status()
+        download_duration = time.time() - download_start
+
+        file_size_mb = len(response.content) / (1024 * 1024)
+        logger.info(f"[{request_id}] Download completed in {download_duration:.2f}s")
+        logger.info(f"[{request_id}] File size: {file_size_mb:.2f} MB")
+
+        # Determine file extension from the URL path (defaults to .pdf if none)
+        url_path = Path(request.url.split("?")[0])
+        file_ext = url_path.suffix.lower() or ".pdf"
+
+        # Validate file type
+        allowed_extensions = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
+        if file_ext not in allowed_extensions:
+            logger.error(f"[{request_id}] Unsupported file type: {file_ext}")
+            raise HTTPException(
+                status_code=400,
+                detail=f"Unsupported file type. Allowed: {', '.join(allowed_extensions)}",
+            )
+
+        logger.info(f"[{request_id}] File type: {file_ext}")
+
+        # Check file size
+        if len(response.content) > MAX_FILE_SIZE_BYTES:
+            logger.error(f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB")
+            raise HTTPException(
+                status_code=413,
+                detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
+            )
+
+        # Save downloaded file (run blocking I/O in thread)
+        input_path = Path(temp_dir) / f"input{file_ext}"
+        await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
+        logger.info(f"[{request_id}] Saved file to: {input_path}")
+
+        # Create output directory
+        output_dir = Path(temp_dir) / "output"
+        output_dir.mkdir(exist_ok=True)
+
+        use_backend = request.backend if request.backend else MINERU_BACKEND
+
+        # Check if chunking should be used (PDF only, sufficient pages)
+        total_pages = 0
+        use_chunking = False
+        if file_ext == ".pdf":
+            total_pages = _get_pdf_page_count(input_path)
+            logger.info(f"[{request_id}] PDF has {total_pages} pages")
+
+            # Calculate effective page range
+            effective_end = request.end_page if request.end_page is not None else total_pages - 1
+            effective_pages = effective_end - request.start_page + 1
+
+            if effective_pages > CHUNKING_THRESHOLD:
+                use_chunking = True
+                logger.info(f"[{request_id}] Chunking enabled: {effective_pages} pages > {CHUNKING_THRESHOLD} threshold")
+
+        if use_chunking:
+            # Process in parallel chunks
+            parse_result = _process_chunked(
+                input_path=input_path,
+                base_output_dir=output_dir,
+                backend=use_backend,
+                lang=request.lang,
+                start_page=request.start_page,
+                end_page=request.end_page,
+                total_pages=total_pages,
+                request_id=request_id,
+                output_format=request.output_format,
+                include_images=request.include_images,
+            )
+        else:
+            # Process normally (single pass)
+            logger.info(f"[{request_id}] Processing without chunking")
+            proc, backend_used = _run_mineru(
+                input_path=input_path,
+                output_dir=output_dir,
+                backend=use_backend,
+                lang=request.lang,
+                start_page=request.start_page,
+                end_page=request.end_page,
+                request_id=request_id,
+            )
+
+            if proc.returncode != 0:
+                logger.error(f"[{request_id}] MinerU failed with code {proc.returncode}")
+                if proc.stderr:
+                    for line in proc.stderr.strip().split('\n'):
+                        logger.error(f"[{request_id}] [stderr] {line}")
+                raise RuntimeError(f"MinerU failed (code {proc.returncode}): {proc.stderr}")
+
+            # Read output
+            logger.info(f"[{request_id}] Reading output files...")
+            parse_result = _read_parse_output(output_dir, request.output_format, proc.stdout, proc.stderr, request_id, request.include_images)
+            parse_result.backend_used = backend_used
+
+            if backend_used != use_backend:
+                logger.info(f"[{request_id}] Note: Fell back from {use_backend} to {backend_used} due to GPU memory constraints")
+
+        total_duration = time.time() - start_time
+        logger.info(f"[{request_id}] {'='*50}")
+        logger.info(f"[{request_id}] Request completed successfully")
+        logger.info(f"[{request_id}] Pages processed: {parse_result.pages_processed}")
+        logger.info(f"[{request_id}] Total time: {total_duration:.2f}s")
+        if parse_result.pages_processed > 0:
+            logger.info(f"[{request_id}] Speed: {parse_result.pages_processed / total_duration:.2f} pages/sec")
+        logger.info(f"[{request_id}] {'='*50}")
+
+        return parse_result
+
+    except HTTPException:
+        # Re-raise validation errors (400/413) so they reach the client as proper
+        # HTTP status codes instead of being swallowed by the generic handler below
+        raise
+    except httpx.HTTPError as e:
+        total_duration = time.time() - start_time
+        logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
+        return ParseResponse(
+            success=False,
+            error=f"Failed to download file from URL: {str(e)}",
+        )
+    except Exception as e:
+        total_duration = time.time() - start_time
+        logger.error(f"[{request_id}] {'='*50}")
+        logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
+        logger.error(f"[{request_id}] {'='*50}")
+        return ParseResponse(
+            success=False,
+            error=str(e),
+        )
+    finally:
+        # Cleanup temp directory
+        shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.info(f"[{request_id}] Cleaned up temp directory")
+
+
+def _read_parse_output(output_dir: Path, output_format: str, stdout: str = "", stderr: str = "", request_id: str = "", include_images: bool = False) -> ParseResponse:
+    """Read the parsed output from MinerU output directory."""
+    log_prefix = f"[{request_id}] " if request_id else ""
+
+    # List all files in output directory for debugging
+    all_files = []
+    for root, dirs, files in os.walk(output_dir):
+        for f in files:
+            all_files.append(os.path.join(root, f))
+
+    logger.info(f"{log_prefix}Output directory contents: {len(all_files)} files")
+    for f in all_files:
+        logger.info(f"{log_prefix} - {f}")
+
+    # Find markdown files recursively in output directory
+    md_files = list(output_dir.glob("**/*.md"))
+    json_files_all = list(output_dir.glob("**/*.json"))
+
+    logger.info(f"{log_prefix}Found {len(md_files)} markdown files, {len(json_files_all)} JSON files")
+
+    if not md_files and not json_files_all:
+        logger.error(f"{log_prefix}No output files found!")
+        return ParseResponse(
+            success=False,
+            error=f"No output files found. All files: {all_files}. Stdout: {stdout[:500]}. Stderr: {stderr[:500]}",
+        )
+
+    # Read markdown output
+    markdown_content = None
+    if md_files:
+        markdown_content = md_files[0].read_text(encoding="utf-8")
+        logger.info(f"{log_prefix}Markdown content length: {len(markdown_content)} chars")
+
+    # Read JSON output (prefer non-content-list files)
+    json_content = None
+    main_json_files = [f for f in json_files_all if "_content_list" not in f.name]
+    if main_json_files:
+        try:
+            json_content = json.loads(main_json_files[0].read_text(encoding="utf-8"))
+            logger.info(f"{log_prefix}JSON content loaded from: {main_json_files[0].name}")
+        except json.JSONDecodeError as e:
+            logger.warning(f"{log_prefix}Failed to parse JSON: {e}")
+
+    # Count pages from content list if available
+    pages_processed = 0
+    content_list_files = [f for f in json_files_all if "_content_list" in f.name]
+    if content_list_files:
+        try:
+            content_list = json.loads(
+                content_list_files[0].read_text(encoding="utf-8")
+            )
+            if isinstance(content_list, list):
+                pages_processed = len(
+                    set(item.get("page_idx", 0) for item in content_list)
+                )
+            logger.info(f"{log_prefix}Pages processed: {pages_processed}")
+        except (json.JSONDecodeError, KeyError) as e:
+            logger.warning(f"{log_prefix}Failed to count pages: {e}")
+
+    # Extract images from output directory (only if requested)
+    images_zip = None
+    image_count = 0
+    if include_images:
+        images_zip, image_count = _create_images_zip_base64(output_dir)
+        if image_count > 0:
+            logger.info(f"{log_prefix}Extracted {image_count} images into zip")
+
+    if output_format == "json" and json_content:
+        logger.info(f"{log_prefix}Returning JSON output")
+        return ParseResponse(
+            success=True,
+            json_content=json_content,
+            images_zip=images_zip,
+            image_count=image_count,
+            pages_processed=pages_processed,
+        )
+    elif markdown_content:
+        logger.info(f"{log_prefix}Returning markdown output")
+        return ParseResponse(
+            success=True,
+            markdown=markdown_content,
+            images_zip=images_zip,
+            image_count=image_count,
+            pages_processed=pages_processed,
+        )
+    else:
+        logger.error(f"{log_prefix}No usable output generated")
+        return ParseResponse(
+            success=False,
+            error=f"No output generated. MD files: {[str(f) for f in md_files]}. JSON files: {[str(f) for f in json_files_all]}. Stderr: {stderr[:500]}",
+        )
+
+
+if __name__ == "__main__":
+    import uvicorn
+
+    uvicorn.run(app, host="0.0.0.0", port=7860)
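The `/parse/url` endpoint above accepts a JSON body matching `URLParseRequest` and requires Bearer token authentication. A minimal client sketch of how such a request would be assembled (the base URL and token are placeholders; an actual call would POST `body` to `endpoint` with `httpx.post`):

```python
import json

# Hypothetical values - substitute your Space URL and API token.
BASE_URL = "https://example-space.hf.space"
API_TOKEN = "hf_example_token"


def build_url_parse_request(url: str, output_format: str = "markdown",
                            include_images: bool = False) -> dict:
    """Build the endpoint, headers, and JSON body for a POST to /parse/url."""
    return {
        "endpoint": f"{BASE_URL}/parse/url",
        "headers": {
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        "body": {
            "url": url,
            "output_format": output_format,
            "include_images": include_images,
        },
    }


req = build_url_parse_request("https://example.com/sample.pdf", output_format="json")
print(json.dumps(req["body"], indent=2))
```

The body fields mirror the `URLParseRequest` model; omitted fields (`lang`, `backend`, `start_page`, `end_page`) fall back to the server-side defaults.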
requirements.txt ADDED
@@ -0,0 +1,19 @@
+# MinerU Document Parser API Dependencies
+# Pins match vLLM v0.14.1 base image to avoid pip backtracking
+
+# MinerU with core extra (pipeline + vlm + api + gradio)
+# Avoids [all], which adds platform-specific vllm pinning that conflicts with the base image
+mineru[core]>=1.0.0
+
+# Web framework
+fastapi>=0.115.0
+uvicorn[standard]>=0.32.0
+
+# File upload handling
+python-multipart>=0.0.9
+
+# HTTP client for URL parsing
+httpx>=0.27.0
+
+# Request/response models and data validation
+pydantic>=2.0.0
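When `include_images` is set, responses carry extracted images as a base64-encoded zip in the `images_zip` field; for chunked documents, entries are prefixed `chunk_<id>/` to avoid name collisions (see `_process_chunked` in app.py above). A sketch of client-side decoding (the file names here are illustrative, not actual MinerU output paths):

```python
import base64
import io
import zipfile


def extract_images_zip(images_zip_b64: str) -> dict:
    """Decode the base64 `images_zip` field and return {archive path: bytes}."""
    raw = base64.b64decode(images_zip_b64)
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Simulate a server response: two chunk-prefixed images, named the way the
# chunked path in app.py prefixes them.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("chunk_0/images/fig1.png", b"\x89PNG...")
    zf.writestr("chunk_1/images/fig2.png", b"\x89PNG...")
images_zip_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

images = extract_images_zip(images_zip_b64)
print(sorted(images))  # ['chunk_0/images/fig1.png', 'chunk_1/images/fig2.png']
```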