---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on an A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
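
On the client side, the token can be kept out of source code. The sketch below is one way to build the header; the helper name `build_auth_headers` and the `MD_PARSER_API_TOKEN` environment variable are assumptions made for this example and are not defined by the service.

```python
import os


def build_auth_headers(token=None):
    """Return the Authorization header expected by the /parse endpoints.

    Falls back to the MD_PARSER_API_TOKEN environment variable, a name
    chosen for this example only (the service does not define it).
    """
    token = token or os.environ.get("MD_PARSER_API_TOKEN", "")
    if not token:
        raise ValueError("No API token provided")
    return {"Authorization": f"Bearer {token}"}
```

The returned dict can be passed as `headers=` to `requests.post` in the examples that follow.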

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
            print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

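  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const target = `./extracted_images/${name}`;
      // Entries are nested (e.g. "images/fig1.png"), so create parent folders first
      fs.mkdirSync(target.slice(0, target.lastIndexOf('/')), { recursive: true });
      fs.writeFileSync(target, content);
      console.log(`  Saved: ${name}`);
    }
  }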
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include images as a base64-encoded ZIP in response   |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include images as a base64-encoded ZIP in response   |
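
The parameters in the two tables can be validated on the client before a request is sent. The following is a minimal sketch, assuming the documented names and defaults; `build_parse_options` and `to_form_fields` are hypothetical helpers, and because the tables do not say whether `end_page` is inclusive, the check only rejects values below `start_page`.

```python
def build_parse_options(output_format="markdown", lang="en", backend="pipeline",
                        start_page=0, end_page=None, include_images=False):
    """Validate options and return a JSON-ready dict for POST /parse/url."""
    if output_format not in ("markdown", "json"):
        raise ValueError(f"Unsupported output_format: {output_format}")
    if backend not in ("pipeline", "hybrid-auto-engine"):
        raise ValueError(f"Unsupported backend: {backend}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and cannot be negative")
    if end_page is not None and end_page < start_page:
        raise ValueError("end_page cannot be smaller than start_page")
    return {
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,
        "include_images": include_images,
    }


def to_form_fields(options):
    """Convert the options to string fields for multipart POST /parse."""
    fields = {}
    for key, value in options.items():
        if value is None:
            continue  # omit unset fields so the documented default applies
        if isinstance(value, bool):
            fields[key] = "true" if value else "false"
        else:
            fields[key] = str(value)
    return fields
```

For `/parse/url`, add the `url` key to the dict and send it as `json=`. For `/parse`, pass `to_form_fields(options)` as `data=` alongside `files=`.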

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |
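
A small helper can turn this response into either content or an exception. This is a sketch based only on the fields listed above; `extract_content` is a hypothetical name, and it assumes that a failed request has `success` set to `false`.

```python
def extract_content(result):
    """Return the parsed payload from a response dict, or raise on failure."""
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "Parsing failed")
    # Which field is populated depends on the requested output_format.
    if result.get("markdown") is not None:
        return result["markdown"]
    return result.get("json_content")
```

Calling `extract_content(response.json())` returns a string for `output_format=markdown` and the JSON object for `output_format=json`.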

### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

#### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (file.dir) continue; // skip directory entries
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
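
Because chunked documents prefix each entry with `chunk_N/`, it can help to group ZIP entries by chunk before saving them. A minimal sketch, assuming the path layout described above (`group_images_by_chunk` is a hypothetical helper):

```python
from collections import defaultdict


def group_images_by_chunk(names):
    """Group ZIP entry names by chunk index.

    Entries from non-chunked documents are collected under the key None.
    """
    groups = defaultdict(list)
    for name in names:
        first = name.split("/", 1)[0]
        suffix = first[len("chunk_"):]
        if first.startswith("chunk_") and suffix.isdigit():
            groups[int(suffix)].append(name)
        else:
            groups[None].append(name)
    return dict(groups)
```

Pass it `zf.namelist()` from the extraction example to see which images came from which chunk.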

## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, reports generated digitally
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing = less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents
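
The guidance in these two lists can be folded into a simple client-side rule. The function below is an illustrative heuristic only: `choose_backend` and its flags are hypothetical, and the caller has to decide how to determine each trait.

```python
def choose_backend(is_scanned=False, has_forms=False, complex_layout=False,
                   has_handwriting=False):
    """Pick a backend name from coarse document traits."""
    if is_scanned or has_forms or complex_layout or has_handwriting:
        # Layout-heavy or image-based input: prefer the VLM-based backend.
        return "hybrid-auto-engine"
    # Native, well-structured PDFs: the faster default is usually enough.
    return "pipeline"
```

The result can be passed directly as the `backend` request parameter.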

### Real-World Comparison

| Document Type          | Pipeline Output          | Hybrid Output                 |
| ---------------------- | ------------------------ | ----------------------------- |
| arXiv paper (15 pages) | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040          | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pg)  | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of typo)
- Hybrid: "Illinois" (21 correct occurrences)

Choose a backend per request with the `backend` parameter, or change the default by setting the `MINERU_BACKEND` environment variable.

## Parallel Chunking

For large PDFs, the API automatically splits the document into chunks and processes them in parallel to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: Document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in markdown output
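
The steps above can be sketched as follows. This illustrates the documented behaviour and is not the service's actual code: `plan_chunks` and `process_in_parallel` are hypothetical names, the thread pool is an assumption, and `parse_chunk` stands in for whatever parses one page range.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(page_count, chunk_size=10, threshold=20):
    """Return 0-indexed, end-exclusive (start, end) page ranges."""
    if page_count <= threshold:
        # At or below the threshold: process the document as one unit.
        return [(0, page_count)]
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def process_in_parallel(ranges, parse_chunk, max_workers=3):
    """Run parse_chunk on each range with at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so output stays in page order.
        return list(pool.map(parse_chunk, ranges))
```

For a 25-page PDF, `plan_chunks(25)` returns `[(0, 10), (10, 20), (20, 25)]`, while a 20-page PDF stays in one piece as `[(0, 20)]`.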

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
| ------------- | ---------------- | ------------------------- | ------- |
| 30 pages      | ~80 seconds      | ~30 seconds               | ~2.7x   |
| 60 pages      | ~160 seconds     | ~55 seconds               | ~2.9x   |
| 100 pages     | Timeout (>600s)  | ~100 seconds              | N/A     |

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.
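
The fallback described above follows a common pattern: attempt the parallel run, and if an out-of-memory error surfaces, retry everything one chunk at a time. A minimal sketch under stated assumptions: `GpuOutOfMemory` is a stand-in for whatever error the GPU runtime raises, and `run_with_oom_fallback` and `parse_chunk` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class GpuOutOfMemory(Exception):
    """Stand-in for the error raised by the GPU runtime on OOM."""


def run_with_oom_fallback(chunks, parse_chunk, max_workers=3):
    """Process chunks in parallel; on OOM, retry all chunks sequentially."""
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(parse_chunk, chunks))
    except GpuOutOfMemory:
        # Discard partial results and redo every chunk with a single worker.
        return [parse_chunk(chunk) for chunk in chunks]
```

Retrying every chunk, rather than only the one that failed, matches the description above and keeps the merged output consistent.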

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in markdown output for transparency
- If any chunk fails, partial results are still returned with an error message
- Requested backend is used for chunked processing (with OOM auto-fallback to sequential)

## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
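
A client can check both constraints before uploading. The sketch below assumes the extension list and the default limit stated above, and treats 1 MB as 1024 × 1024 bytes, which is an assumption; `check_upload` is a hypothetical helper.

```python
import os

SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
DEFAULT_MAX_FILE_SIZE_MB = 1024  # documented default of MAX_FILE_SIZE_MB


def check_upload(filename, size_bytes, max_mb=DEFAULT_MAX_FILE_SIZE_MB):
    """Return None if the file looks acceptable, else a reason string."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return f"Unsupported file type: {ext or '(none)'}"
    if size_bytes > max_mb * 1024 * 1024:
        return f"File exceeds {max_mb} MB"
    return None
```

Use `os.path.getsize(path)` to obtain `size_bytes` for a local file.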

## Configuration

| Environment Variable          | Description                                    | Default    |
| ----------------------------- | ---------------------------------------------- | ---------- |
| `API_TOKEN`                   | **Required.** API authentication token         | -          |
| `MINERU_BACKEND`              | Default parsing backend                        | `pipeline` |
| `MINERU_LANG`                 | Default OCR language                           | `en`       |
| `MAX_FILE_SIZE_MB`            | Maximum upload size in MB                      | `1024`     |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4`      |
| `CHUNK_SIZE`                  | Pages per chunk for chunked processing         | `10`       |
| `CHUNKING_THRESHOLD`          | Page count above which chunking is triggered   | `20`       |
| `MAX_WORKERS`                 | Parallel workers for chunk processing          | `3`        |
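
To make the defaults concrete, here is one way such settings could be read from the environment. This is a sketch, not the service's actual loader: `load_config` and the keys of the returned dict are hypothetical, while the variable names and defaults come from the table.

```python
import os

DEFAULTS = {
    "MINERU_BACKEND": "pipeline",
    "MINERU_LANG": "en",
    "MAX_FILE_SIZE_MB": "1024",
    "VLLM_GPU_MEMORY_UTILIZATION": "0.4",
    "CHUNK_SIZE": "10",
    "CHUNKING_THRESHOLD": "20",
    "MAX_WORKERS": "3",
}


def load_config(env=None):
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    if not env.get("API_TOKEN"):
        raise RuntimeError("API_TOKEN is required")
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "api_token": env["API_TOKEN"],
        "backend": raw["MINERU_BACKEND"],
        "lang": raw["MINERU_LANG"],
        "max_file_size_mb": int(raw["MAX_FILE_SIZE_MB"]),
        "gpu_memory_utilization": float(raw["VLLM_GPU_MEMORY_UTILIZATION"]),
        "chunk_size": int(raw["CHUNK_SIZE"]),
        "chunking_threshold": int(raw["CHUNKING_THRESHOLD"]),
        "max_workers": int(raw["MAX_WORKERS"]),
    }
```

Passing a plain dict, for example `load_config({"API_TOKEN": "t", "MAX_WORKERS": "2"})`, makes it easy to try overrides without touching the real environment.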

### GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to `pipeline` backend** and returns results (check `backend_used` in response).

To force a specific backend or tune memory:

1. **Use `pipeline` backend** - Add `backend=pipeline` to your request (doesn't use vLLM, faster but less accurate for scanned docs)
2. **Lower GPU memory** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)
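
Since a fallback is reported only through `backend_used`, a client that depends on hybrid-quality output may want to check for it explicitly. A minimal sketch (`backend_fell_back` is a hypothetical helper):

```python
def backend_fell_back(requested, result):
    """Return True if the response reports a different backend than requested."""
    used = result.get("backend_used")
    return used is not None and used != requested
```

For example, `backend_fell_back("hybrid-auto-engine", response.json())` is `True` when the service fell back to `pipeline`, at which point the caller could retry later or accept the faster result.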

## Performance

**Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend              | Speed           | 15-page PDF | 31-page PDF |
| -------------------- | --------------- | ----------- | ----------- |
| `pipeline`           | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off:** Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior:** The Space sleeps after 60 minutes of inactivity. The first request after a sleep incurs a cold start of roughly 30-60 seconds.
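
These figures give a rough way to size client timeouts. The sketch below divides the page count by the measured speed and optionally adds the upper end of the cold-start range. `estimate_seconds` is a hypothetical helper, and the estimate ignores parallel chunking, which shortens the wall-clock time for PDFs over 20 pages.

```python
# Approximate speeds from the table above (A100 Large).
PAGES_PER_SECOND = {"pipeline": 0.77, "hybrid-auto-engine": 0.39}


def estimate_seconds(pages, backend="pipeline", cold_start=False):
    """Estimate unchunked processing time in seconds."""
    seconds = pages / PAGES_PER_SECOND[backend]
    if cold_start:
        seconds += 60  # upper end of the 30-60 second cold start
    return seconds
```

A 15-page PDF on `pipeline` comes out at about 19.5 seconds, in line with the ~20 seconds in the table.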

## Deployment

- **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
- **API:** https://outcomelabs-md-parser.hf.space
- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```

## Logging

View logs in the Hugging Face Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
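
Lines in this format are straightforward to parse for monitoring. The sketch below assumes every line follows the `timestamp | LEVEL | [id] message` layout of the sample, and that the bracketed value identifies the request; `parse_log_line` is a hypothetical helper.

```python
import re

LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>[A-Z]+) \| "
    r"\[(?P<request_id>[^\]]+)\] (?P<message>.*)$"
)


def parse_log_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None
```

Grouping the parsed records by `request_id` reconstructs the timeline of a single parse request.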

## Changelog

### v1.4.0 (Breaking Change)

**Images now returned as ZIP file instead of dictionary:**

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in ZIP)

**Migration from v1.3.0:**

```python
import base64
import io
import zipfile

# OLD (v1.3.0)
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0)
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)
```

**Benefits:**

- Smaller payload size due to ZIP compression
- Single field instead of large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.