---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using MinerU.

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: OCR support for 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80 GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check |
| `/parse` | POST | Parse an uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse a document from a URL (JSON body) |

## Authentication

All `/parse` endpoints require Bearer token authentication:

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
            print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = path.join('./extracted_images', name);
      // Create the target directory first (ZIP entries are nested, e.g. images/)
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method**: POST
2. **URL**: `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab**: Type = Bearer Token, Token = `your_api_token`
4. **Body tab**: Select **form-data**

| Key | Type | Value |
|-----|------|-------|
| file | File | Select your PDF/image |
| output_format | Text | `markdown` or `json` |
| lang | Text | `en` (optional) |
| backend | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page | Text | `0` (optional) |
| end_page | Text | `10` (optional) |
| include_images | Text | `true` or `false` (optional) |

### URL Parsing (POST /parse/url)

1. **Method**: POST
2. **URL**: `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab**: Type = Bearer Token, Token = `your_api_token`
4. **Headers tab**: Add `Content-Type: application/json`
5. **Body tab**: Select **raw** and **JSON**

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `file` | File | Yes | - | PDF or image file |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `lang` | string | No | `en` | OCR language code |
| `backend` | string | No | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include base64-encoded images in the response |

### URL Parsing (/parse/url)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | URL to a PDF or image |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `lang` | string | No | `en` | OCR language code |
| `backend` | string | No | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include base64-encoded images in the response |
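The parameter tables above can be captured in a small client-side helper that fills in the documented defaults and rejects obviously invalid values before a request is sent. This is a sketch: the field names and defaults come from the tables, while the validation rules are this example's own.

```python
# Sketch: build a /parse/url request body using the documented defaults.
# The validation here is purely client-side and illustrative; the API
# performs its own checks.
VALID_FORMATS = {"markdown", "json"}
VALID_BACKENDS = {"pipeline", "hybrid-auto-engine"}

def build_parse_url_body(url, output_format="markdown", lang="en",
                         backend="pipeline", start_page=0, end_page=None,
                         include_images=False):
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(VALID_BACKENDS)}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and must be >= 0")
    return {
        "url": url,
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,          # None serializes to null = all pages
        "include_images": include_images,
    }
```

The returned dict can be passed directly as the `json=` argument of `requests.post`.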

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `success` | boolean | Whether parsing succeeded |
| `markdown` | string | Extracted markdown (if `output_format=markdown`) |
| `json_content` | object | Extracted JSON (if `output_format=json`) |
| `images_zip` | string | Base64-encoded ZIP file containing all images (if `include_images=true`) |
| `image_count` | int | Number of images in the ZIP file |
| `error` | string | Error message if parsing failed |
| `pages_processed` | int | Number of pages processed |
| `backend_used` | string | Actual backend used (may differ from the requested one if a fallback occurred) |
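Because `backend_used` can differ from the backend you requested, it is worth checking explicitly when handling a response. A minimal sketch over the documented fields (the summary strings themselves are this example's own):

```python
def summarize_result(result, requested_backend="pipeline"):
    """Summarize a parse response dict using the documented fields,
    flagging when the API fell back to a different backend."""
    if not result["success"]:
        return f"failed: {result['error']}"
    summary = (f"parsed {result['pages_processed']} pages "
               f"with {result['backend_used']}")
    # backend_used may differ from the request if a fallback occurred
    if result["backend_used"] != requested_backend:
        summary += f" (fell back from {requested_backend})"
    return summary
```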

## Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

### Extracting Images (Python)

```python
import base64
import io
import zipfile

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from the ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

### Image Path Structure

- Non-chunked documents: `images/filename.png`
- Chunked documents (>20 pages): `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
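If you want chunked and non-chunked responses to yield the same layout, the `chunk_N/` prefix can be split off while keeping the chunk index around. A small sketch of that normalization (the tuple shape is this example's own choice):

```python
import re

# Matches the documented chunked layout: "chunk_<N>/<relative path>"
CHUNK_RE = re.compile(r"^chunk_(\d+)/(.+)$")

def split_image_path(name):
    """Return (chunk_index, relative_path) for a ZIP entry name.
    chunk_index is None for non-chunked documents."""
    m = CHUNK_RE.match(name)
    if m:
        return int(m.group(1)), m.group(2)
    return None, name
```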

## Backends

| Backend | Speed | Accuracy | Best For |
|---------|-------|----------|----------|
| `pipeline` (default) | ~0.77 pages/sec | Good | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms |

### When to Use pipeline (Default)

The `pipeline` backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time

### When to Use hybrid-auto-engine

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Complex layouts** - Multi-column pages, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - The VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents
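The guidance above can be condensed into a simple client-side rule of thumb. This is entirely this example's own heuristic, not API behavior: it just prefers the hybrid backend whenever accuracy-sensitive traits are present.

```python
def choose_backend(is_scanned=False, has_forms=False, complex_layout=False):
    """Rule-of-thumb backend choice distilled from the guidance above:
    prefer hybrid when accuracy matters, otherwise the fast default."""
    if is_scanned or has_forms or complex_layout:
        return "hybrid-auto-engine"
    return "pipeline"
```

The return value can be passed as the `backend` parameter of either endpoint.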

### Real-World Comparison

| Document Type | Pipeline Output | Hybrid Output |
|---------------|-----------------|---------------|
| arXiv paper (15 pages) | 42 KB, clean extraction | 42 KB, identical |
| IRS Form 1040 | 825 bytes, mostly images | 15 KB, full form structure |
| Scanned lease (31 pages) | 104 KB, OCR errors | 105 KB, cleaner OCR |

**OCR accuracy example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override per request with the `backend` parameter, or set the `MINERU_BACKEND` env var.

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output
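The splitting step can be sketched as follows, using the documented defaults. This mirrors the described behavior but is not the actual server code:

```python
CHUNKING_THRESHOLD = 20  # documented default: chunk PDFs above this page count
CHUNK_SIZE = 10          # documented default: pages per chunk

def plan_chunks(total_pages, threshold=CHUNKING_THRESHOLD, size=CHUNK_SIZE):
    """Return a list of inclusive (start, end) page ranges.
    Documents at or below the threshold stay as a single range."""
    if total_pages <= threshold:
        return [(0, total_pages - 1)]
    return [(start, min(start + size, total_pages) - 1)
            for start in range(0, total_pages, size)]
```

For a 25-page PDF this yields three ranges, the last one shorter: `[(0, 9), (10, 19), (20, 24)]`.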

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
|---------------|------------------|---------------------------|---------|
| 30 pages | ~80 seconds | ~30 seconds | ~2.7x |
| 60 pages | ~160 seconds | ~55 seconds | ~2.9x |
| 100 pages | Timeout (>600s) | ~100 seconds | N/A |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in the markdown output for transparency
- If any chunk fails, partial results are still returned along with an error message
- The requested backend is used for chunked processing (with automatic OOM fallback to sequential)

## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1 GB (configurable via `MAX_FILE_SIZE_MB`)
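A client can check the limit before uploading and fail fast instead of sending a request that will be rejected. A minimal sketch (the error message is this example's own; the server enforces its own limit):

```python
import os

MAX_FILE_SIZE_MB = 1024  # documented default (1 GB)

def check_upload_size(path, max_mb=MAX_FILE_SIZE_MB):
    """Return the file size in MB, raising if it exceeds the limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f"{path} is {size_mb:.1f} MB, over the {max_mb} MB limit")
    return size_mb
```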

## Configuration

| Environment Variable | Description | Default |
|----------------------|-------------|---------|
| `API_TOKEN` | **Required.** API authentication token | - |
| `MINERU_BACKEND` | Default parsing backend | `pipeline` |
| `MINERU_LANG` | Default OCR language | `en` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4` |
| `CHUNK_SIZE` | Pages per chunk for chunked processing | `10` |
| `CHUNKING_THRESHOLD` | Minimum pages to trigger chunking | `20` |
| `MAX_WORKERS` | Parallel workers for chunk processing | `3` |
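A service like this would typically read these variables once at startup, falling back to the documented defaults. A sketch of that pattern (not the actual app code; `load_config` is this example's own helper):

```python
import os

def load_config(env=None):
    """Read the documented settings from the environment with their
    documented defaults. API_TOKEN has no default and may be None."""
    env = os.environ if env is None else env
    return {
        "api_token": env.get("API_TOKEN"),  # required, no default
        "backend": env.get("MINERU_BACKEND", "pipeline"),
        "lang": env.get("MINERU_LANG", "en"),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
        "chunk_size": int(env.get("CHUNK_SIZE", "10")),
        "chunking_threshold": int(env.get("CHUNKING_THRESHOLD", "20")),
        "max_workers": int(env.get("MAX_WORKERS", "3")),
    }
```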

## GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend and still returns results (check `backend_used` in the response).

To force a specific backend or tune memory:

1. **Use the pipeline backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)

## Performance

Hardware: Nvidia A100 Large (80 GB VRAM, 12 vCPU, 142 GB RAM)

| Backend | Speed | 15-page PDF | 31-page PDF |
|---------|-------|-------------|-------------|
| `pipeline` | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off**: Hybrid is ~2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior**: The Space sleeps after 60 minutes idle. The first request after sleep takes ~30-60 seconds for a cold start.
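Because of the cold start, a client may want to wrap its first call in a retry with a wait between attempts. A generic sketch (the attempt count and delay are this example's own choices, not API guarantees):

```python
import time

def with_cold_start_retry(call, attempts=3, delay=20):
    """Call a zero-argument request function, retrying with a pause so
    a sleeping Space has time to wake up before we give up."""
    last_exc = None
    for i in range(attempts):
        try:
            return call()
        except Exception as exc:  # e.g. a timeout while the Space wakes up
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc
```

Usage: `with_cold_start_retry(lambda: requests.post(..., timeout=120))`.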

## Deployment

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```

### Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
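Each line carries a timestamp, a level, and a per-request ID in brackets, so a request's lines are easy to group programmatically. A sketch based on the format shown above:

```python
import re

# "<timestamp> | <LEVEL> | [<request id>] <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \| (?P<level>\w+) \| \[(?P<req>\w+)\] (?P<msg>.*)$"
)

def parse_log_line(line):
    """Split one log line into timestamp, level, request ID, and message.
    Returns None for lines that don't match the format."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```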

## Changelog

### v1.4.0 (Breaking Change)

Images are now returned as a ZIP file instead of a dictionary:

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

Migration from v1.3.0:

```python
import base64
import io
import zipfile

# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

Benefits:

- Smaller payload due to ZIP compression
- Single field instead of a large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with MinerU by OpenDataLab.