---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---
# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: OCR support for 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel
## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |
## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```
### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = `./extracted_images/${name}`;
      // Entry names can contain subdirectories (e.g. "images/fig1.png"),
      // so create the parent directory before writing
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```
## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```
## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |
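The parameter defaults above can be captured in a small client-side helper. This is purely illustrative — the server applies the same defaults when fields are omitted, so sending only `url` is equally valid:

```python
def build_url_payload(url, output_format="markdown", lang="en",
                      backend="pipeline", start_page=0, end_page=None,
                      include_images=False):
    """Build a JSON body for POST /parse/url using the documented defaults."""
    return {
        "url": url,
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,  # None maps to JSON null = "parse to the last page"
        "include_images": include_images,
    }
```

Pass the result as the `json=` argument of `requests.post`, as in the Quick Start examples.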
## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |
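Because `backend_used` can differ from the backend you asked for, callers that care about accuracy may want to detect a silent fallback. A minimal sketch, using only the response fields documented above:

```python
def detect_fallback(requested_backend: str, result: dict) -> bool:
    """Return True when a successful response was produced by a different
    backend than the one requested (e.g. hybrid-auto-engine fell back
    to pipeline after a GPU out-of-memory error)."""
    return bool(result.get("success")) and result.get("backend_used") != requested_backend

# Example: a hybrid-auto-engine request that came back as pipeline
result = {"success": True, "backend_used": "pipeline", "pages_processed": 20}
if detect_fallback("hybrid-auto-engine", result):
    print("Warning: server fell back to the pipeline backend")
```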
### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```
#### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```
#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);
  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
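Since ZIP entry names follow these two shapes, a client that wants flat filenames can normalize them. A small sketch — keeping the chunk index in the flattened name is our own choice here, to avoid collisions when different chunks contain images with the same name:

```python
import re

def flatten_image_path(name: str) -> str:
    """Map a ZIP entry name to a flat filename.

    'images/fig1.png'         -> 'fig1.png'
    'chunk_2/images/fig1.png' -> 'chunk_2_fig1.png'
    """
    m = re.match(r"chunk_(\d+)/images/(.+)", name)
    if m:
        # Keep the chunk index to disambiguate same-named images
        return f"chunk_{m.group(1)}_{m.group(2)}"
    return name.removeprefix("images/")
```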
## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column pages, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - The VLM can interpret degraded or noisy images
- **Legal documents** - Leases and contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override the backend per request with the `backend` parameter, or set the `MINERU_BACKEND` env var to change the default.
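The guidance above can be condensed into a small client-side heuristic. This is entirely illustrative — the document traits and their weighting are our assumptions, not part of the API:

```python
def choose_backend(is_scanned: bool, has_complex_layout: bool,
                   high_volume: bool) -> str:
    """Pick a backend following the rules of thumb above: scanned or
    complex-layout documents benefit from the VLM-based hybrid backend,
    while bulk jobs favor the faster, cheaper pipeline."""
    if high_volume:
        return "pipeline"  # speed and GPU cost win for bulk processing
    if is_scanned or has_complex_layout:
        return "hybrid-auto-engine"
    return "pipeline"  # native, well-structured PDFs parse identically
```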
## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
| ------------- | ---------------- | ------------------------- | ------- |
| 30 pages      | ~80 seconds      | ~30 seconds               | ~2.7x   |
| 60 pages      | ~160 seconds     | ~55 seconds               | ~2.9x   |
| 100 pages     | Timeout (>600s)  | ~100 seconds              | N/A     |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in the markdown output for transparency
- If any chunk fails, partial results are still returned along with an error message
- The requested backend is used for chunked processing (with automatic OOM fallback to sequential)
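The detection and splitting steps above can be sketched as a pure function. This mirrors the documented `CHUNKING_THRESHOLD`/`CHUNK_SIZE` behavior but is not the service's actual code:

```python
def chunk_ranges(total_pages: int, chunk_size: int = 10,
                 threshold: int = 20) -> list[tuple[int, int]]:
    """Return (start_page, end_page) half-open page ranges per chunk.

    Documents at or below the threshold are processed as a single unit;
    larger ones are split into chunk_size-page chunks."""
    if total_pages <= threshold:
        return [(0, total_pages)]
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]
```

With the defaults, a 30-page PDF yields three 10-page chunks, which the service then processes on up to `MAX_WORKERS` workers.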
## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
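A client can mirror these limits to fail fast before uploading. A sketch only — the server performs its own validation, and the 1GB figure is just the `MAX_FILE_SIZE_MB` default:

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # matches the documented server default

def validate_upload(filename: str, size_bytes: int) -> tuple[bool, str]:
    """Check a file's extension and size against the documented limits."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {ext or '(none)'}"
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        return False, "file exceeds the 1GB limit"
    return True, "ok"
```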
## Configuration

| Environment Variable          | Description                                    | Default    |
| ----------------------------- | ---------------------------------------------- | ---------- |
| `API_TOKEN`                   | **Required.** API authentication token         | -          |
| `MINERU_BACKEND`              | Default parsing backend                        | `pipeline` |
| `MINERU_LANG`                 | Default OCR language                           | `en`       |
| `MAX_FILE_SIZE_MB`            | Maximum upload size in MB                      | `1024`     |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4`      |
| `CHUNK_SIZE`                  | Pages per chunk for chunked processing         | `10`       |
| `CHUNKING_THRESHOLD`          | Minimum pages to trigger chunking              | `20`       |
| `MAX_WORKERS`                 | Parallel workers for chunk processing          | `3`        |

### GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and returns results (check `backend_used` in the response).

To force a specific backend or tune memory:

1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
## Performance

**Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend              | Speed           | 15-page PDF | 31-page PDF |
| -------------------- | --------------- | ----------- | ----------- |
| `pipeline`           | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off:** Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior:** The Space sleeps after 60 minutes idle. The first request after sleep takes ~30-60 seconds for cold start.
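Because of the cold start, a client may want a simple retry loop around its first request. A sketch with an injectable request function — the attempt count and delay are arbitrary choices, not part of the API:

```python
import time

def request_with_retry(send_request, max_attempts=3, delay_seconds=20):
    """Call send_request() until it succeeds or attempts run out.

    send_request should raise (e.g. a timeout) while the Space is still
    waking up, and return the parsed response once it is responsive."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception as exc:  # e.g. requests.Timeout during cold start
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(delay_seconds)
    raise last_error
```

In practice `send_request` would wrap one of the `requests.post` calls from the Quick Start section with a generous `timeout=`.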
## Deployment

- **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
- **API:** https://outcomelabs-md-parser.hf.space
- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```
## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Changelog

### v1.4.0 (Breaking Change)

**Images are now returned as a ZIP file instead of a dictionary:**

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

**Migration from v1.3.0:**

```python
# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

**Benefits:**

- Smaller payload size due to ZIP compression
- A single field instead of a large dictionary
- Easier to save or extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.