---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---
# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: OCR support for 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel
## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |
## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```
### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = `./extracted_images/${name}`;
      // Entry names can contain subdirectories (e.g. "images/fig1.png"),
      // so create the parent directory before writing
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```
## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```
## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |
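The parameter defaults above can be captured in a small client-side helper. This is purely illustrative — the server applies the same defaults when fields are omitted, so sending only `url` is equally valid:

```python
def build_url_payload(url, output_format="markdown", lang="en",
                      backend="pipeline", start_page=0, end_page=None,
                      include_images=False):
    """Build a JSON body for POST /parse/url using the documented defaults."""
    return {
        "url": url,
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,  # None maps to JSON null = "parse to the last page"
        "include_images": include_images,
    }
```

Pass the result as the `json=` argument of `requests.post`, as in the Quick Start examples.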
## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |
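Because `backend_used` can differ from the backend you asked for, callers that care about accuracy may want to detect a silent fallback. A minimal sketch, using only the response fields documented above:

```python
def detect_fallback(requested_backend: str, result: dict) -> bool:
    """Return True when a successful response was produced by a different
    backend than the one requested (e.g. hybrid-auto-engine fell back
    to pipeline after a GPU out-of-memory error)."""
    return bool(result.get("success")) and result.get("backend_used") != requested_backend

# Example: a hybrid-auto-engine request that came back as pipeline
result = {"success": True, "backend_used": "pipeline", "pages_processed": 20}
if detect_fallback("hybrid-auto-engine", result):
    print("Warning: server fell back to the pipeline backend")
```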
### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```
#### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```
#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
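Since ZIP entry names follow these two shapes, a client that wants flat filenames can normalize them. A small sketch — keeping the chunk index in the flattened name is our own choice here, to avoid collisions when different chunks contain images with the same name:

```python
import re

def flatten_image_path(name: str) -> str:
    """Map a ZIP entry name to a flat filename.

    'images/fig1.png'         -> 'fig1.png'
    'chunk_2/images/fig1.png' -> 'chunk_2_fig1.png'
    """
    m = re.match(r"chunk_(\d+)/images/(.+)", name)
    if m:
        # Keep the chunk index to disambiguate same-named images
        return f"chunk_{m.group(1)}_{m.group(2)}"
    return name.removeprefix("images/")
```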
## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column pages, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - The VLM can interpret degraded or noisy images
- **Legal documents** - Leases and contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override the backend per request with the `backend` parameter, or set the `MINERU_BACKEND` env var to change the default.
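The guidance above can be condensed into a small client-side heuristic. This is entirely illustrative — the document traits and their weighting are our assumptions, not part of the API:

```python
def choose_backend(is_scanned: bool, has_complex_layout: bool,
                   high_volume: bool) -> str:
    """Pick a backend following the rules of thumb above: scanned or
    complex-layout documents benefit from the VLM-based hybrid backend,
    while bulk jobs favor the faster, cheaper pipeline."""
    if high_volume:
        return "pipeline"  # speed and GPU cost win for bulk processing
    if is_scanned or has_complex_layout:
        return "hybrid-auto-engine"
    return "pipeline"  # native, well-structured PDFs parse identically
```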
## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
| ------------- | ---------------- | ------------------------- | ------- |
| 30 pages      | ~80 seconds      | ~30 seconds               | ~2.7x   |
| 60 pages      | ~160 seconds     | ~55 seconds               | ~2.9x   |
| 100 pages     | Timeout (>600s)  | ~100 seconds              | N/A     |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in the markdown output for transparency
- If any chunk fails, partial results are still returned along with an error message
- The requested backend is used for chunked processing (with automatic OOM fallback to sequential)
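The detection and splitting steps above can be sketched as a pure function. This mirrors the documented `CHUNKING_THRESHOLD`/`CHUNK_SIZE` behavior but is not the service's actual code:

```python
def chunk_ranges(total_pages: int, chunk_size: int = 10,
                 threshold: int = 20) -> list[tuple[int, int]]:
    """Return (start_page, end_page) half-open page ranges per chunk.

    Documents at or below the threshold are processed as a single unit;
    larger ones are split into chunk_size-page chunks."""
    if total_pages <= threshold:
        return [(0, total_pages)]
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]
```

With the defaults, a 30-page PDF yields three 10-page chunks, which the service then processes on up to `MAX_WORKERS` workers.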
## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
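A client can mirror these limits to fail fast before uploading. A sketch only — the server performs its own validation, and the 1GB figure is just the `MAX_FILE_SIZE_MB` default:

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # matches the documented server default

def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Check a file's extension and size against the documented limits."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {ext or '(none)'}"
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        return False, "file exceeds the 1GB limit"
    return True, "ok"
```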
## Configuration

| Environment Variable          | Description                                    | Default    |
| ----------------------------- | ---------------------------------------------- | ---------- |
| `API_TOKEN`                   | **Required.** API authentication token         | -          |
| `MINERU_BACKEND`              | Default parsing backend                        | `pipeline` |
| `MINERU_LANG`                 | Default OCR language                           | `en`       |
| `MAX_FILE_SIZE_MB`            | Maximum upload size in MB                      | `1024`     |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4`      |
| `CHUNK_SIZE`                  | Pages per chunk for chunked processing         | `10`       |
| `CHUNKING_THRESHOLD`          | Minimum pages to trigger chunking              | `20`       |
| `MAX_WORKERS`                 | Parallel workers for chunk processing          | `3`        |

### GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and returns results (check `backend_used` in the response).

To force a specific backend or tune memory:

1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
## Performance

**Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend              | Speed           | 15-page PDF | 31-page PDF |
| -------------------- | --------------- | ----------- | ----------- |
| `pipeline`           | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off:** Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior:** The Space sleeps after 60 minutes idle. The first request after sleep takes ~30-60 seconds for cold start.
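Because of the cold start, a client may want a simple retry loop around its first request. A sketch with an injectable request function — the attempt count and delay are arbitrary choices, not part of the API:

```python
import time

def request_with_retry(send_request, max_attempts=3, delay_seconds=20):
    """Call send_request() until it succeeds or attempts run out.

    send_request should raise (e.g. a timeout) while the Space is still
    waking up, and return the parsed response once it is responsive."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception as exc:  # e.g. requests.Timeout during cold start
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

In practice `send_request` would wrap one of the `requests.post` calls from the Quick Start section with a generous `timeout=`.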
## Deployment

- **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
- **API:** https://outcomelabs-md-parser.hf.space
- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```
## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Changelog

### v1.4.0 (Breaking Change)

**Images are now returned as a ZIP file instead of a dictionary:**

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

**Migration from v1.3.0:**

```python
# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

**Benefits:**

- Smaller payload size due to ZIP compression
- A single field instead of a large dictionary
- Easier to save or extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.