---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using MinerU.

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: OCR support for 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80 GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check |
| `/parse` | POST | Parse an uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse a document from a URL (JSON body) |

## Authentication

All `/parse` endpoints require Bearer token authentication:

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
            print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';
import path from 'path';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const dest = path.join('./extracted_images', name);
      // Create the target directory first (ZIP entries are nested, e.g. images/)
      fs.mkdirSync(path.dirname(dest), { recursive: true });
      fs.writeFileSync(dest, content);
      console.log(`  Saved: ${name}`);
    }
  }
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method**: POST
2. **URL**: `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab**: Type = Bearer Token, Token = `your_api_token`
4. **Body tab**: Select **form-data**

| Key | Type | Value |
|-----|------|-------|
| file | File | Select your PDF/image |
| output_format | Text | `markdown` or `json` |
| lang | Text | `en` (optional) |
| backend | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page | Text | `0` (optional) |
| end_page | Text | `10` (optional) |
| include_images | Text | `true` or `false` (optional) |

### URL Parsing (POST /parse/url)

1. **Method**: POST
2. **URL**: `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab**: Type = Bearer Token, Token = `your_api_token`
4. **Headers tab**: Add `Content-Type: application/json`
5. **Body tab**: Select **raw** and **JSON**

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `file` | File | Yes | - | PDF or image file |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `lang` | string | No | `en` | OCR language code |
| `backend` | string | No | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include base64-encoded images in the response |

### URL Parsing (/parse/url)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | URL to a PDF or image |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `lang` | string | No | `en` | OCR language code |
| `backend` | string | No | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include base64-encoded images in the response |
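The parameter tables above can be captured in a small client-side helper that fills in the documented defaults and rejects obviously invalid values before a request is sent. This is a sketch: the field names and defaults come from the tables, while the validation rules are this example's own.

```python
# Sketch: build a /parse/url request body using the documented defaults.
# The validation here is purely client-side and illustrative; the API
# performs its own checks.
VALID_FORMATS = {"markdown", "json"}
VALID_BACKENDS = {"pipeline", "hybrid-auto-engine"}

def build_parse_url_body(url, output_format="markdown", lang="en",
                         backend="pipeline", start_page=0, end_page=None,
                         include_images=False):
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(VALID_BACKENDS)}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and must be >= 0")
    return {
        "url": url,
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,          # None serializes to null = all pages
        "include_images": include_images,
    }
```

The returned dict can be passed directly as the `json=` argument of `requests.post`.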

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `success` | boolean | Whether parsing succeeded |
| `markdown` | string | Extracted markdown (if `output_format=markdown`) |
| `json_content` | object | Extracted JSON (if `output_format=json`) |
| `images_zip` | string | Base64-encoded ZIP file containing all images (if `include_images=true`) |
| `image_count` | int | Number of images in the ZIP file |
| `error` | string | Error message if parsing failed |
| `pages_processed` | int | Number of pages processed |
| `backend_used` | string | Actual backend used (may differ from the requested one if a fallback occurred) |
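Because `backend_used` can differ from the backend you requested, it is worth checking explicitly when handling a response. A minimal sketch over the documented fields (the summary strings themselves are this example's own):

```python
def summarize_result(result, requested_backend="pipeline"):
    """Summarize a parse response dict using the documented fields,
    flagging when the API fell back to a different backend."""
    if not result["success"]:
        return f"failed: {result['error']}"
    summary = (f"parsed {result['pages_processed']} pages "
               f"with {result['backend_used']}")
    # backend_used may differ from the request if a fallback occurred
    if result["backend_used"] != requested_backend:
        summary += f" (fell back from {requested_backend})"
    return summary
```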

## Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

### Extracting Images (Python)

```python
import base64
import io
import zipfile

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from the ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

### Image Path Structure

- Non-chunked documents: `images/filename.png`
- Chunked documents (>20 pages): `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
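If you want chunked and non-chunked responses to yield the same layout, the `chunk_N/` prefix can be split off while keeping the chunk index around. A small sketch of that normalization (the tuple shape is this example's own choice):

```python
import re

# Matches the documented chunked layout: "chunk_<N>/<relative path>"
CHUNK_RE = re.compile(r"^chunk_(\d+)/(.+)$")

def split_image_path(name):
    """Return (chunk_index, relative_path) for a ZIP entry name.
    chunk_index is None for non-chunked documents."""
    m = CHUNK_RE.match(name)
    if m:
        return int(m.group(1)), m.group(2)
    return None, name
```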

## Backends

| Backend | Speed | Accuracy | Best For |
|---------|-------|----------|----------|
| `pipeline` (default) | ~0.77 pages/sec | Good | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms |

### When to Use pipeline (Default)

The `pipeline` backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, digitally generated reports
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column, text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing means less GPU time

### When to Use hybrid-auto-engine

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Complex layouts** - Multi-column pages, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - The VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents
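The guidance above can be condensed into a simple client-side rule of thumb. This is entirely this example's own heuristic, not API behavior: it just prefers the hybrid backend whenever accuracy-sensitive traits are present.

```python
def choose_backend(is_scanned=False, has_forms=False, complex_layout=False):
    """Rule-of-thumb backend choice distilled from the guidance above:
    prefer hybrid when accuracy matters, otherwise the fast default."""
    if is_scanned or has_forms or complex_layout:
        return "hybrid-auto-engine"
    return "pipeline"
```

The return value can be passed as the `backend` parameter of either endpoint.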

### Real-World Comparison

| Document Type | Pipeline Output | Hybrid Output |
|---------------|-----------------|---------------|
| arXiv paper (15 pages) | 42 KB, clean extraction | 42 KB, identical |
| IRS Form 1040 | 825 bytes, mostly images | 15 KB, full form structure |
| Scanned lease (31 pages) | 104 KB, OCR errors | 105 KB, cleaner OCR |

**OCR accuracy example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of the typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override per request with the `backend` parameter, or set the `MINERU_BACKEND` env var.

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: The document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in the markdown output
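The splitting step can be sketched as follows, using the documented defaults. This mirrors the described behavior but is not the actual server code:

```python
CHUNKING_THRESHOLD = 20  # documented default: chunk PDFs above this page count
CHUNK_SIZE = 10          # documented default: pages per chunk

def plan_chunks(total_pages, threshold=CHUNKING_THRESHOLD, size=CHUNK_SIZE):
    """Return a list of inclusive (start, end) page ranges.
    Documents at or below the threshold stay as a single range."""
    if total_pages <= threshold:
        return [(0, total_pages - 1)]
    return [(start, min(start + size, total_pages) - 1)
            for start in range(0, total_pages, size)]
```

For a 25-page PDF this yields three ranges, the last one shorter: `[(0, 9), (10, 19), (20, 24)]`.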

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
|---------------|------------------|---------------------------|---------|
| 30 pages | ~80 seconds | ~30 seconds | ~2.7x |
| 60 pages | ~160 seconds | ~55 seconds | ~2.9x |
| 100 pages | Timeout (>600s) | ~100 seconds | N/A |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in the markdown output for transparency
- If any chunk fails, partial results are still returned along with an error message
- The requested backend is used for chunked processing (with automatic OOM fallback to sequential)

## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1 GB (configurable via `MAX_FILE_SIZE_MB`)
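A client can check the limit before uploading and fail fast instead of sending a request that will be rejected. A minimal sketch (the error message is this example's own; the server enforces its own limit):

```python
import os

MAX_FILE_SIZE_MB = 1024  # documented default (1 GB)

def check_upload_size(path, max_mb=MAX_FILE_SIZE_MB):
    """Return the file size in MB, raising if it exceeds the limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(
            f"{path} is {size_mb:.1f} MB, over the {max_mb} MB limit")
    return size_mb
```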

## Configuration

| Environment Variable | Description | Default |
|----------------------|-------------|---------|
| `API_TOKEN` | **Required.** API authentication token | - |
| `MINERU_BACKEND` | Default parsing backend | `pipeline` |
| `MINERU_LANG` | Default OCR language | `en` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4` |
| `CHUNK_SIZE` | Pages per chunk for chunked processing | `10` |
| `CHUNKING_THRESHOLD` | Minimum pages to trigger chunking | `20` |
| `MAX_WORKERS` | Parallel workers for chunk processing | `3` |
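A service like this would typically read these variables once at startup, falling back to the documented defaults. A sketch of that pattern (not the actual app code; `load_config` is this example's own helper):

```python
import os

def load_config(env=None):
    """Read the documented settings from the environment with their
    documented defaults. API_TOKEN has no default and may be None."""
    env = os.environ if env is None else env
    return {
        "api_token": env.get("API_TOKEN"),  # required, no default
        "backend": env.get("MINERU_BACKEND", "pipeline"),
        "lang": env.get("MINERU_LANG", "en"),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
        "chunk_size": int(env.get("CHUNK_SIZE", "10")),
        "chunking_threshold": int(env.get("CHUNKING_THRESHOLD", "20")),
        "max_workers": int(env.get("MAX_WORKERS", "3")),
    }
```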

## GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend and still returns results (check `backend_used` in the response).

To force a specific backend or tune memory:

1. **Use the pipeline backend** - Add `backend=pipeline` to your request (doesn't use vLLM; faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)

## Performance

Hardware: Nvidia A100 Large (80 GB VRAM, 12 vCPU, 142 GB RAM)

| Backend | Speed | 15-page PDF | 31-page PDF |
|---------|-------|-------------|-------------|
| `pipeline` | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off**: Hybrid is ~2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior**: The Space sleeps after 60 minutes idle. The first request after sleep takes ~30-60 seconds for a cold start.
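Because of the cold start, a client may want to wrap its first call in a retry with a wait between attempts. A generic sketch (the attempt count and delay are this example's own choices, not API guarantees):

```python
import time

def with_cold_start_retry(call, attempts=3, delay=20):
    """Call a zero-argument request function, retrying with a pause so
    a sleeping Space has time to wake up before we give up."""
    last_exc = None
    for i in range(attempts):
        try:
            return call()
        except Exception as exc:  # e.g. a timeout while the Space wakes up
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc
```

Usage: `with_cold_start_retry(lambda: requests.post(..., timeout=120))`.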

## Deployment

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```

### Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
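Each line carries a timestamp, a level, and a per-request ID in brackets, so a request's lines are easy to group programmatically. A sketch based on the format shown above:

```python
import re

# "<timestamp> | <LEVEL> | [<request id>] <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \| (?P<level>\w+) \| \[(?P<req>\w+)\] (?P<msg>.*)$"
)

def parse_log_line(line):
    """Split one log line into timestamp, level, request ID, and message.
    Returns None for lines that don't match the format."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```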

## Changelog

### v1.4.0 (Breaking Change)

Images are now returned as a ZIP file instead of a dictionary:

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in the ZIP)

Migration from v1.3.0:

```python
import base64
import io
import zipfile

# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

Benefits:

- Smaller payload due to ZIP compression
- Single field instead of a large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with MinerU by OpenDataLab.