---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
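The token check and header shape can be wrapped in a small client helper. A minimal stdlib-only sketch (the helper names, the use of `urllib` rather than `requests`, and the empty-token check are illustrative additions, not part of the API):

```python
import json
import urllib.request

# Base URL from this README; helper names below are illustrative.
API_URL = "https://outcomelabs-md-parser.hf.space"


def auth_headers(token: str) -> dict:
    """Build the Authorization header required by the /parse endpoints."""
    if not token:
        raise ValueError("API token must not be empty")
    return {"Authorization": f"Bearer {token}"}


def build_parse_url_request(document_url: str, token: str) -> urllib.request.Request:
    """Prepare a POST request for /parse/url; send it with urllib.request.urlopen."""
    body = json.dumps({"url": document_url, "output_format": "markdown"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/parse/url",
        data=body,
        headers={**auth_headers(token), "Content-Type": "application/json"},
        method="POST",
    )
```

The returned `Request` can be sent with `urllib.request.urlopen(req)`; the examples below use `requests` and `fetch` for the same calls.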
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const dest = `./extracted_images/${name}`;
      // Entries like "images/fig1.png" need their parent dirs created first
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, await file.async('nodebuffer'));
      console.log(`  Saved: ${name}`);
    }
  }
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                             |
| --------------- | ------- | ----------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                               |
| markdown        | string  | Extracted markdown (if `output_format=markdown`)                        |
| json_content    | object  | Extracted JSON (if `output_format=json`)                                |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if `include_images=true`) |
| image_count     | int     | Number of images in the ZIP file                                        |
| error           | string  | Error message if parsing failed                                         |
| pages_processed | int     | Number of pages processed                                               |
| backend_used    | string  | Actual backend used (may differ from the request if a fallback occurred) |

### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

#### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c => c.charCodeAt(0));
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.

## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually.
Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override the backend per request with the `backend` parameter, or set the `MINERU_BACKEND` env var.

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
| ------------- | ---------------- | ------------------------- | ------- |
| 30 pages      | ~80 seconds      | ~30 seconds               | ~2.7x   |
| 60 pages      | ~160 seconds     | ~55 seconds               | ~2.9x   |
| 100 pages     | Timeout (>600s)  | ~100 seconds              | N/A     |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in the markdown output for transparency
- If any chunk fails, partial results are still returned along with an error message
- The requested backend is used for chunked processing (with OOM auto-fallback to sequential)

## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)

## Configuration

| Environment Variable          | Description                                    | Default    |
| ----------------------------- | ---------------------------------------------- | ---------- |
| `API_TOKEN`                   | **Required.** API authentication token         | -          |
| `MINERU_BACKEND`              | Default parsing backend                        | `pipeline` |
| `MINERU_LANG`                 | Default OCR language                           | `en`       |
| `MAX_FILE_SIZE_MB`            | Maximum upload size in MB                      | `1024`     |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4`      |
| `CHUNK_SIZE`                  | Pages per chunk for chunked processing         | `10`       |
| `CHUNKING_THRESHOLD`          | Minimum pages to trigger chunking              | `20`       |
| `MAX_WORKERS`                 | Parallel workers for chunk processing          | `3`        |

### GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory.
**If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and still returns results (check `backend_used` in the response). To force a specific backend or tune memory:

1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)

## Performance

**Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend              | Speed           | 15-page PDF | 31-page PDF |
| -------------------- | --------------- | ----------- | ----------- |
| `pipeline`           | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off:** Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior:** The Space sleeps after 60 minutes of idle time. The first request after sleep takes ~30-60 seconds for a cold start.

## Deployment

- **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
- **API:** https://outcomelabs-md-parser.hf.space
- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```

## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```

## Changelog

### v1.4.0 (Breaking Change)

**Images are now returned as a ZIP file instead of a dictionary:**

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

**Migration from v1.3.0:**

```python
# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

**Benefits:**

- Smaller payload size due to ZIP compression
- Single field instead of a large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.
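As an offline sanity check of the v1.4.0 `images_zip` handling, the decode-and-extract path can be exercised against a locally fabricated payload (the file names and bytes below are made up and no API call is involved; a real response would carry actual PNG data):

```python
import base64
import io
import zipfile

# Fabricate a response payload locally (stand-in for a real API response).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("images/fig1.png", b"\x89PNG fake bytes")
    zf.writestr("images/fig2.png", b"\x89PNG more fake bytes")
result = {
    "images_zip": base64.b64encode(buf.getvalue()).decode("ascii"),
    "image_count": 2,
}

# Decode and extract exactly as a client would on a real response.
zip_bytes = base64.b64decode(result["images_zip"])
with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
    names = zf.namelist()
    images = {name: zf.read(name) for name in names}

print(names)  # ['images/fig1.png', 'images/fig2.png']
```

The decode/extract half of this snippet works unchanged on a real `images_zip` value from the API.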