---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---
# MD Parser API
A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using MinerU.
## Features
- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |
## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```
### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);
  fs.mkdirSync('./extracted_images', { recursive: true }); // ensure output dir exists
  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      fs.writeFileSync(`./extracted_images/${name}`, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```
## Postman Setup

### File Upload (POST /parse)

- Method: `POST`
- URL: `https://outcomelabs-md-parser.hf.space/parse`
- Authorization tab: Type = Bearer Token, Token = `your_api_token`
- Body tab: Select `form-data`

| Key | Type | Value |
|---|---|---|
| file | File | Select your PDF/image |
| output_format | Text | `markdown` or `json` |
| lang | Text | `en` (optional) |
| backend | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page | Text | `0` (optional) |
| end_page | Text | `10` (optional) |
| include_images | Text | `true` or `false` (optional) |

### URL Parsing (POST /parse/url)

- Method: `POST`
- URL: `https://outcomelabs-md-parser.hf.space/parse/url`
- Authorization tab: Type = Bearer Token, Token = `your_api_token`
- Headers tab: Add `Content-Type: application/json`
- Body tab: Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```
## Request Parameters

### File Upload (/parse)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file | File | Yes | - | PDF or image file |
| output_format | string | No | markdown | `markdown` or `json` |
| lang | string | No | en | OCR language code |
| backend | string | No | pipeline | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page | int | No | 0 | Starting page (0-indexed) |
| end_page | int | No | null | Ending page (null = all pages) |
| include_images | bool | No | false | Include base64-encoded images in response |

### URL Parsing (/parse/url)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | URL to PDF or image |
| output_format | string | No | markdown | `markdown` or `json` |
| lang | string | No | en | OCR language code |
| backend | string | No | pipeline | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page | int | No | 0 | Starting page (0-indexed) |
| end_page | int | No | null | Ending page (null = all pages) |
| include_images | bool | No | false | Include base64-encoded images in response |
## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether parsing succeeded |
| markdown | string | Extracted markdown (if output_format=markdown) |
| json_content | object | Extracted JSON (if output_format=json) |
| images_zip | string | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count | int | Number of images in the ZIP file |
| error | string | Error message if failed |
| pages_processed | int | Number of pages processed |
| backend_used | string | Actual backend used (may differ from requested if fallback occurred) |
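Handling these fields follows a common pattern: fail on `success=false`, then read whichever content field the requested `output_format` populated. The sketch below is illustrative client-side code, not part of the API; the `extract_content` helper name is ours.

```python
def extract_content(result: dict):
    """Return the parsed content from a /parse response dict.

    Raises RuntimeError on failure. Returns the markdown string or the
    json_content object, whichever the server populated (only one is
    set, depending on the requested output_format).
    """
    if not result.get("success"):
        raise RuntimeError(f"Parse failed: {result.get('error')}")
    if result.get("markdown") is not None:
        return result["markdown"]
    return result["json_content"]
```

Usage: `content = extract_content(response.json())`.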
## Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```
### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```
### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```
### Image Path Structure

- Non-chunked documents: `images/filename.png`
- Chunked documents (>20 pages): `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
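To merge a chunked response's images into a single directory, the `chunk_N/` prefix can be stripped from each path. A minimal sketch (the helper name is ours, not part of the API):

```python
import re

def flatten_image_path(name: str) -> str:
    """Map 'chunk_3/images/fig1.png' -> 'images/fig1.png'.

    Non-chunked paths like 'images/fig1.png' pass through unchanged.
    Note: images from different chunks may share filenames, so check
    for collisions before writing flattened paths to disk.
    """
    return re.sub(r"^chunk_\d+/", "", name)
```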
## Backends

| Backend | Speed | Accuracy | Best For |
|---|---|---|---|
| `pipeline` (default) | ~0.77 pages/sec | Good | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms |
### When to Use pipeline (Default)

The `pipeline` backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time
### When to Use hybrid-auto-engine

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents
### Real-World Comparison

| Document Type | Pipeline Output | Hybrid Output |
|---|---|---|
| arXiv paper (15 pages) | 42KB, clean extraction | 42KB, identical |
| IRS Form 1040 | 825 bytes, mostly images | 15KB, full form structure |
| Scanned lease (31 pg) | 104KB, OCR errors | 105KB, cleaner OCR |

**OCR Accuracy Example** (scanned lease):

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override per-request with the `backend` parameter, or set the `MINERU_BACKEND` env var.
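The guidance above can be condensed into a simple client-side heuristic for picking the `backend` parameter. This is purely illustrative and not part of the API; the function name and input flags are our own, and the rules should be tuned to your corpus:

```python
def choose_backend(is_scanned: bool, has_forms: bool, complex_layout: bool) -> str:
    """Pick a backend per the guidance above: pay for the slower
    hybrid-auto-engine only when its accuracy gains matter."""
    if is_scanned or has_forms or complex_layout:
        return "hybrid-auto-engine"
    # Fast default for native, well-structured PDFs (identical output there)
    return "pipeline"
```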
## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

- **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
- **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
- **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
- **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output
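The splitting step can be approximated as follows. This mirrors the documented defaults (threshold 20, chunk size 10) but is a client-side sketch for reasoning about chunk counts, not the server's actual implementation:

```python
def chunk_ranges(total_pages: int, threshold: int = 20, chunk_size: int = 10):
    """Return inclusive (start, end) 0-indexed page ranges.

    Documents at or under the threshold are processed as a single unit;
    larger documents are cut into chunk_size-page slices.
    """
    if total_pages <= threshold:
        return [(0, total_pages - 1)]
    return [
        (start, min(start + chunk_size - 1, total_pages - 1))
        for start in range(0, total_pages, chunk_size)
    ]
```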
### Performance Impact
| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
|---|---|---|---|
| 30 pages | ~80 seconds | ~30 seconds | ~2.7x |
| 60 pages | ~160 seconds | ~55 seconds | ~2.9x |
| 100 pages | Timeout (>600s) | ~100 seconds | N/A |
### OOM Protection
If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.
### Notes
- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in markdown output for transparency
- If any chunk fails, partial results are still returned with an error message
- Requested backend is used for chunked processing (with OOM auto-fallback to sequential)
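If you want to post-process the combined markdown without the chunk-boundary markers, note that the exact marker text is server-defined and not documented here; the sketch below simply drops any line that is a standalone HTML comment, which is a blunt but safe approximation:

```python
import re

def strip_html_comments(markdown: str) -> str:
    """Remove standalone HTML-comment lines (e.g. chunk-boundary
    markers) from markdown output. Keeps all other lines intact."""
    kept = [
        line for line in markdown.splitlines()
        if not re.fullmatch(r"\s*<!--.*?-->\s*", line)
    ]
    return "\n".join(kept)
```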
## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
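A client can fail fast by checking these constraints before uploading. This pre-flight check is our own suggestion, mirroring the documented limits (the server's actual limit depends on its `MAX_FILE_SIZE_MB` setting):

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # documented server default

def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the API:
    unsupported extension or size over the documented 1GB default."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise ValueError(f"File exceeds {MAX_FILE_SIZE_MB} MB limit")
```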
## Configuration

| Environment Variable | Description | Default |
|---|---|---|
| `API_TOKEN` | **Required.** API authentication token | - |
| `MINERU_BACKEND` | Default parsing backend | `pipeline` |
| `MINERU_LANG` | Default OCR language | `en` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4` |
| `CHUNK_SIZE` | Pages per chunk for chunked processing | `10` |
| `CHUNKING_THRESHOLD` | Minimum pages to trigger chunking | `20` |
| `MAX_WORKERS` | Parallel workers for chunk processing | `3` |
## GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend and returns results (check `backend_used` in the response).

To force a specific backend or tune memory:

- **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
- **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
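Because the fallback is silent apart from the `backend_used` field, clients that need hybrid accuracy should compare it against the backend they requested. A minimal sketch (the helper name is ours):

```python
def detect_fallback(requested: str, result: dict) -> bool:
    """Return True if the server used a different backend than the one
    requested (e.g. hybrid-auto-engine -> pipeline on GPU OOM).
    Callers may log a warning or retry later when this fires.
    """
    backend_used = result.get("backend_used")
    return backend_used is not None and backend_used != requested
```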
## Performance

Hardware: Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend | Speed | 15-page PDF | 31-page PDF |
|---|---|---|---|
| `pipeline` | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off**: Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior**: The Space sleeps after 60 minutes idle. The first request after sleep takes ~30-60 seconds for cold start.
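For capacity planning, the documented throughput figures can be turned into a rough time estimate. This sketch is ours, not an API guarantee; it ignores parallel chunking (so large PDFs will finish faster) and uses the worst-case 60s cold-start penalty:

```python
# Documented approximate throughput (pages/sec) per backend.
SPEED = {"pipeline": 0.77, "hybrid-auto-engine": 0.39}

def estimate_seconds(pages: int, backend: str = "pipeline",
                     cold_start: bool = False) -> float:
    """Rough processing-time estimate from the documented speeds.
    Adds 60s when the Space was asleep before the request."""
    seconds = pages / SPEED[backend]
    if cold_start:
        seconds += 60.0
    return round(seconds, 1)
```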
## Deployment

- Space: https://huggingface.co/spaces/outcomelabs/md-parser
- API: https://outcomelabs-md-parser.hf.space
- Hardware: Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```
## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Changelog

### v1.4.0 (Breaking Change)

Images are now returned as a ZIP file instead of a dictionary:

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

Migration from v1.3.0:

```python
import base64
import zipfile
import io

# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

Benefits:

- Smaller payload size due to ZIP compression
- Single field instead of a large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing
## Credits
Built with MinerU by OpenDataLab.