---
title: MD Parser API
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: agpl-3.0
suggested_hardware: a100-large
---

# MD Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [MinerU](https://github.com/opendatalab/MinerU).

## Features

- **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- **Image OCR**: Process scanned documents and images
- **Multiple Formats**: Output as markdown or JSON
- **109 Languages**: Supports OCR in 109 languages
- **GPU Accelerated**: Uses CUDA for fast processing on A100 GPU (80GB VRAM)
- **Two Backends**: Fast `pipeline` (default) or accurate `hybrid-auto-engine`
- **Parallel Chunking**: Large PDFs (>20 pages) are automatically split into 10-page chunks and processed in parallel

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check                              |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://outcomelabs-md-parser.hf.space/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL (this replaces `response` from Option 1; use one or the other)
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://outcomelabs-md-parser.hf.space"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
            print("Images saved to ./extracted_images/")
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse from URL
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
  }),
});

const result = await response.json();
console.log(result.markdown);
```

### JavaScript/Node.js with Images

```javascript
import JSZip from 'jszip';
import fs from 'fs';

const API_URL = 'https://outcomelabs-md-parser.hf.space';
const API_TOKEN = 'your_api_token';

// Parse with images
const response = await fetch(`${API_URL}/parse/url`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${API_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/document.pdf',
    output_format: 'markdown',
    include_images: true,
  }),
});

const result = await response.json();
console.log(result.markdown);

// Extract images from ZIP
if (result.images_zip) {
  console.log(`Extracting ${result.image_count} images...`);
  const zipData = Buffer.from(result.images_zip, 'base64');
  const zip = await JSZip.loadAsync(zipData);

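  for (const [name, file] of Object.entries(zip.files)) {
    if (!file.dir) {
      const content = await file.async('nodebuffer');
      const target = `./extracted_images/${name}`;
      // Entries can be nested (e.g. "images/fig1.png"), so create parent folders first
      fs.mkdirSync(target.slice(0, target.lastIndexOf('/')), { recursive: true });
      fs.writeFileSync(target, content);
      console.log(`  Saved: ${name}`);
    }
  }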
}
```

## Postman Setup

### File Upload (POST /parse)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Body tab:** Select `form-data`

| Key            | Type | Value                                         |
| -------------- | ---- | --------------------------------------------- |
| file           | File | Select your PDF/image                         |
| output_format  | Text | `markdown` or `json`                          |
| lang           | Text | `en` (optional)                               |
| backend        | Text | `pipeline` or `hybrid-auto-engine` (optional) |
| start_page     | Text | `0` (optional)                                |
| end_page       | Text | `10` (optional)                               |
| include_images | Text | `true` or `false` (optional)                  |

### URL Parsing (POST /parse/url)

1. **Method:** `POST`
2. **URL:** `https://outcomelabs-md-parser.hf.space/parse/url`
3. **Authorization tab:** Type = Bearer Token, Token = `your_api_token`
4. **Headers tab:** Add `Content-Type: application/json`
5. **Body tab:** Select `raw` and `JSON`

```json
{
  "url": "https://example.com/document.pdf",
  "output_format": "markdown",
  "lang": "en",
  "start_page": 0,
  "end_page": null,
  "include_images": false
}
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                                          |
| -------------- | ------ | -------- | ---------- | ---------------------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                                  |
| output_format  | string | No       | `markdown` | `markdown` or `json`                                 |
| lang           | string | No       | `en`       | OCR language code                                    |
| backend        | string | No       | `pipeline` | `pipeline` (fast) or `hybrid-auto-engine` (accurate) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                            |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)                       |
| include_images | bool   | No       | `false`    | Include base64-encoded images in response            |
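
As a rough illustration of the table above, the sketch below assembles and checks a JSON body for `/parse/url` on the client side before sending it. The `build_url_request` helper and its validation rules are assumptions made for this example and are not part of the API; the service remains the authority on what it accepts.

```python
from typing import Any, Dict, Optional

VALID_FORMATS = {"markdown", "json"}
VALID_BACKENDS = {"pipeline", "hybrid-auto-engine"}


def build_url_request(
    url: str,
    output_format: str = "markdown",
    lang: str = "en",
    backend: str = "pipeline",
    start_page: int = 0,
    end_page: Optional[int] = None,
    include_images: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: build a /parse/url body using the documented defaults."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(VALID_BACKENDS)}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and cannot be negative")
    if end_page is not None and end_page < start_page:
        raise ValueError("end_page must not be smaller than start_page")
    return {
        "url": url,
        "output_format": output_format,
        "lang": lang,
        "backend": backend,
        "start_page": start_page,
        "end_page": end_page,  # None is serialized as JSON null (all pages)
        "include_images": include_images,
    }
```

The returned dictionary can be passed directly as `json=` to `requests.post`.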

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "backend_used": "pipeline"
}
```

| Field           | Type    | Description                                                            |
| --------------- | ------- | ---------------------------------------------------------------------- |
| success         | boolean | Whether parsing succeeded                                              |
| markdown        | string  | Extracted markdown (if output_format=markdown)                         |
| json_content    | object  | Extracted JSON (if output_format=json)                                 |
| images_zip      | string  | Base64-encoded ZIP file containing all images (if include_images=true) |
| image_count     | int     | Number of images in the ZIP file                                       |
| error           | string  | Error message if failed                                                |
| pages_processed | int     | Number of pages processed                                              |
| backend_used    | string  | Actual backend used (may differ from requested if fallback occurred)   |
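
A typed wrapper can make these fields easier to work with on the client side. The sketch below is one possible shape, not something shipped with the API: `ParseResult`, `from_response`, and `fell_back` are hypothetical names, and the defaults for missing fields are an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ParseResult:
    success: bool = False
    markdown: Optional[str] = None
    json_content: Optional[Any] = None
    images_zip: Optional[str] = None
    image_count: int = 0
    error: Optional[str] = None
    pages_processed: int = 0
    backend_used: Optional[str] = None

    @classmethod
    def from_response(cls, payload: Dict[str, Any]) -> "ParseResult":
        """Keep only the documented fields and ignore anything else."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})

    def fell_back(self, requested_backend: str) -> bool:
        """True when the service reports a different backend than requested."""
        return self.backend_used is not None and self.backend_used != requested_backend
```

For example, `ParseResult.from_response(response.json()).fell_back("hybrid-auto-engine")` indicates whether a hybrid request was actually served by `pipeline`.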

### Images Response

When `include_images=true`, the `images_zip` field contains a base64-encoded ZIP file with all extracted images:

```json
{
  "images_zip": "UEsDBBQAAAAIAGJ...",
  "image_count": 3
}
```

#### Extracting Images (Python)

```python
import base64
import zipfile
import io

result = response.json()
if result["images_zip"]:
    print(f"Extracted {result['image_count']} images")

    # Decode the base64 ZIP
    zip_bytes = base64.b64decode(result["images_zip"])

    # Extract images from ZIP
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for name in zf.namelist():
            print(f"  - {name}")  # e.g., "images/fig1.png"
            img_bytes = zf.read(name)
            # Save or process img_bytes as needed
```

#### Extracting Images (JavaScript)

```javascript
import JSZip from 'jszip';

const result = await response.json();
if (result.images_zip) {
  console.log(`Extracted ${result.image_count} images`);

  // Decode base64 and unzip
  const zipData = Uint8Array.from(atob(result.images_zip), c =>
    c.charCodeAt(0)
  );
  const zip = await JSZip.loadAsync(zipData);

  for (const [name, file] of Object.entries(zip.files)) {
    if (file.dir) continue; // skip directory entries
    console.log(`  - ${name}`); // e.g., "images/fig1.png"
    const imgBlob = await file.async('blob');
    // Use imgBlob as needed
  }
}
```

#### Image Path Structure

- **Non-chunked documents**: `images/filename.png`
- **Chunked documents (>20 pages)**: `chunk_0/images/filename.png`, `chunk_1/images/filename.png`, etc.
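
Because chunked documents prefix each path with `chunk_N/`, images can be grouped back by chunk after extraction. A minimal sketch, assuming member names follow exactly the two patterns listed above (`split_image_path` is a hypothetical helper):

```python
import re
from typing import Optional, Tuple

_CHUNKED = re.compile(r"^chunk_(\d+)/images/(.+)$")
_PLAIN = re.compile(r"^images/(.+)$")


def split_image_path(member: str) -> Tuple[Optional[int], str]:
    """Return (chunk_index, filename); chunk_index is None for non-chunked documents."""
    match = _CHUNKED.match(member)
    if match:
        return int(match.group(1)), match.group(2)
    match = _PLAIN.match(member)
    if match:
        return None, match.group(1)
    raise ValueError(f"Unrecognized image path: {member}")
```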

## Backends

| Backend              | Speed           | Accuracy         | Best For                                      |
| -------------------- | --------------- | ---------------- | --------------------------------------------- |
| `pipeline` (default) | ~0.77 pages/sec | Good             | Native PDFs, text-heavy docs, fast processing |
| `hybrid-auto-engine` | ~0.39 pages/sec | Excellent (90%+) | Complex layouts, scanned docs, forms          |

### When to Use `pipeline` (Default)

The pipeline backend uses traditional ML models for faster processing. Use it for:

- **Native PDFs with text layers** - Academic papers, eBooks, reports generated digitally
- **High-volume processing** - When speed matters more than perfect accuracy (2x faster)
- **Well-structured documents** - Clean, single-column text-heavy documents
- **arXiv papers** - Both backends produce identical output for well-structured PDFs
- **Cost optimization** - Faster processing = less GPU time

### When to Use `hybrid-auto-engine`

The hybrid backend uses a Vision-Language Model (VLM) to understand document layouts visually. Use it for:

- **Scanned documents** - Better OCR accuracy, fewer typos
- **Forms and applications** - Extracts 18x more content from complex form layouts (tested on IRS Form 1040)
- **Documents with complex layouts** - Multi-column, mixed text/images, tables with merged cells
- **Handwritten content** - Better recognition of cursive and handwriting
- **Low-quality scans** - VLM can interpret degraded or noisy images
- **Legal documents** - Leases, contracts with signatures and stamps
- **Historical documents** - Older typewritten or faded documents

### Real-World Comparison

| Document Type            | Pipeline Output          | Hybrid Output                 |
| ------------------------ | ------------------------ | ----------------------------- |
| arXiv paper (15 pages)   | 42KB, clean extraction   | 42KB, identical               |
| IRS Form 1040            | 825 bytes, mostly images | **15KB, full form structure** |
| Scanned lease (31 pages) | 104KB, OCR errors        | **105KB, cleaner OCR**        |

**OCR Accuracy Example (scanned lease):**

- Pipeline: "Ilinois" (9 occurrences of typo)
- Hybrid: "Illinois" (21 correct occurrences)

Override the backend per request with the `backend` parameter, or change the default with the `MINERU_BACKEND` environment variable.
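
The guidance in this section can be condensed into a small client-side rule of thumb. The function below is only a sketch of that guidance: the flag names are hypothetical, and the caller is assumed to know these properties of the document in advance.

```python
def choose_backend(
    scanned: bool = False,
    complex_layout: bool = False,
    handwritten: bool = False,
    is_form: bool = False,
) -> str:
    """Pick a value for the `backend` parameter from coarse document traits."""
    if scanned or complex_layout or handwritten or is_form:
        # Slower, but the cases where the hybrid backend is recommended
        return "hybrid-auto-engine"
    # Native, well-structured PDFs: the faster default is usually enough
    return "pipeline"
```

The result can be passed as the `backend` field of a `/parse` or `/parse/url` request.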

## Parallel Chunking

For large PDFs, the API automatically splits processing into parallel chunks to avoid timeouts and improve throughput.

### How It Works

1. **Detection**: PDFs with more than 20 pages (configurable via `CHUNKING_THRESHOLD`) trigger chunking
2. **Splitting**: Document is split into 10-page chunks (configurable via `CHUNK_SIZE`)
3. **Parallel Processing**: Up to 3 chunks (configurable via `MAX_WORKERS`) are processed simultaneously
4. **Combining**: Results are merged in page order, with chunk boundaries marked in markdown output
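
The four steps above can be sketched in a few lines of Python. This is an illustration of the idea rather than the service's actual code: `plan_chunks`, `process_chunks`, and the `parse_chunk` callback are hypothetical names, page ranges are assumed to be 0-indexed with an exclusive end, and the boundary comment text is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

PageRange = Tuple[int, int]  # 0-indexed start, exclusive end


def plan_chunks(total_pages: int, chunk_size: int = 10, threshold: int = 20) -> List[PageRange]:
    """Steps 1-2: decide whether to chunk, then split into page ranges."""
    if total_pages <= threshold:
        return [(0, total_pages)]  # small document: processed as one unit
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def process_chunks(
    ranges: List[PageRange],
    parse_chunk: Callable[[PageRange], str],
    max_workers: int = 3,
) -> str:
    """Steps 3-4: parse chunks concurrently, then merge in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so page order is preserved
        parts = list(pool.map(parse_chunk, ranges))
    if len(parts) == 1:
        return parts[0]
    # Hypothetical boundary marker; the real comment text may differ
    return "\n\n".join(f"<!-- chunk {i} -->\n{part}" for i, part in enumerate(parts))
```

With the defaults, a 25-page document yields the ranges `(0, 10)`, `(10, 20)`, and `(20, 25)`, while a 20-page document stays in a single range.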

### Performance Impact

| Document Size | Without Chunking | With Chunking (3 workers) | Speedup |
| ------------- | ---------------- | ------------------------- | ------- |
| 30 pages      | ~80 seconds      | ~30 seconds               | ~2.7x   |
| 60 pages      | ~160 seconds     | ~55 seconds               | ~2.9x   |
| 100 pages     | Timeout (>600s)  | ~100 seconds              | N/A     |
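
These timings are close to what a simple "waves of workers" model predicts. The sketch below is a back-of-the-envelope estimate, not a description of the service's scheduler: it assumes a constant per-page rate taken from the 30-page unchunked row (30 pages in 80 seconds), treats every wave as taking as long as a full chunk, and ignores splitting and merging overhead.

```python
import math


def estimate_chunked_seconds(
    total_pages: int,
    pages_per_sec: float = 30 / 80,  # assumption: rate implied by the unchunked 30-page row
    chunk_size: int = 10,
    max_workers: int = 3,
) -> float:
    """Estimate wall-clock time when chunks run in waves of `max_workers`."""
    chunks = math.ceil(total_pages / chunk_size)
    waves = math.ceil(chunks / max_workers)
    return waves * chunk_size / pages_per_sec


for pages in (30, 60, 100):
    print(pages, round(estimate_chunked_seconds(pages)))
# 30 -> 27, 60 -> 53, 100 -> 107 (seconds, rounded)
```

The estimates land in the same range as the measured ~30, ~55, and ~100 seconds, which suggests the number of waves is the main driver of total time.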

### OOM Protection

If GPU out-of-memory errors are detected during parallel processing, the system automatically falls back to sequential processing (1 worker) and retries all chunks.
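
One way to picture this fallback is a parallel attempt wrapped in a sequential retry. The code below is a sketch under stated assumptions: `GpuOutOfMemory` is a hypothetical stand-in for whatever error the backend raises, and `worker` is any function that parses a single chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")
R = TypeVar("R")


class GpuOutOfMemory(RuntimeError):
    """Hypothetical stand-in for the backend's out-of-memory error."""


def run_with_oom_fallback(
    chunks: Sequence[C],
    worker: Callable[[C], R],
    max_workers: int = 3,
) -> List[R]:
    """Try chunks in parallel; on OOM, retry all of them one at a time."""
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(worker, chunks))
    except GpuOutOfMemory:
        # Sequential retry: only one chunk occupies the GPU at any moment
        return [worker(chunk) for chunk in chunks]
```

Retrying every chunk, rather than only the one that failed, keeps the merged output consistent at the cost of repeating some work.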

### Notes

- Chunking only applies to PDF files (images are always processed as single units)
- Each chunk maintains context for tables and formulas within its page range
- Chunk boundaries are marked with HTML comments in markdown output for transparency
- If any chunk fails, partial results are still returned with an error message
- Requested backend is used for chunked processing (with OOM auto-fallback to sequential)

## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
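
Checking a file locally before uploading avoids sending a request that is likely to be rejected. A minimal client-side sketch, assuming extensions are matched case-insensitively and that 1 MB means 1024 × 1024 bytes (`check_upload` is a hypothetical helper, not part of the API):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def check_upload(filename: str, size_bytes: int, max_file_size_mb: int = 1024) -> None:
    """Raise ValueError if the file is unlikely to be accepted by the service."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    if size_bytes > max_file_size_mb * 1024 * 1024:
        raise ValueError(f"File exceeds the {max_file_size_mb} MB limit")
```

For a file on disk, `check_upload(path, os.path.getsize(path))` performs both checks.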

## Configuration

| Environment Variable          | Description                                    | Default    |
| ----------------------------- | ---------------------------------------------- | ---------- |
| `API_TOKEN`                   | **Required.** API authentication token         | -          |
| `MINERU_BACKEND`              | Default parsing backend                        | `pipeline` |
| `MINERU_LANG`                 | Default OCR language                           | `en`       |
| `MAX_FILE_SIZE_MB`            | Maximum upload size in MB                      | `1024`     |
| `VLLM_GPU_MEMORY_UTILIZATION` | vLLM GPU memory fraction (hybrid backend only) | `0.4`      |
| `CHUNK_SIZE`                  | Pages per chunk for chunked processing         | `10`       |
| `CHUNKING_THRESHOLD`          | Page count above which chunking is triggered   | `20`       |
| `MAX_WORKERS`                 | Parallel workers for chunk processing          | `3`        |
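
For readers wiring these variables into their own tooling, the sketch below shows one way to load them with the documented defaults. It is an assumption about how such loading could look, not the service's actual startup code; `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_token: str
    mineru_backend: str = "pipeline"
    mineru_lang: str = "en"
    max_file_size_mb: int = 1024
    vllm_gpu_memory_utilization: float = 0.4
    chunk_size: int = 10
    chunking_threshold: int = 20
    max_workers: int = 3


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default) and apply defaults."""
    env = os.environ if env is None else env
    token = env.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is required")
    return Settings(
        api_token=token,
        mineru_backend=env.get("MINERU_BACKEND", "pipeline"),
        mineru_lang=env.get("MINERU_LANG", "en"),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "1024")),
        vllm_gpu_memory_utilization=float(env.get("VLLM_GPU_MEMORY_UTILIZATION", "0.4")),
        chunk_size=int(env.get("CHUNK_SIZE", "10")),
        chunking_threshold=int(env.get("CHUNKING_THRESHOLD", "20")),
        max_workers=int(env.get("MAX_WORKERS", "3")),
    )
```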

### GPU Memory & Automatic Fallback

The `hybrid-auto-engine` backend uses vLLM internally, which requires GPU memory. **If GPU memory is insufficient, the API automatically falls back to the `pipeline` backend** and still returns results. Check `backend_used` in the response to see which backend actually ran.

To force a specific backend or tune memory:

1. **Use the `pipeline` backend** - Add `backend=pipeline` to your request (it does not use vLLM; faster but less accurate for scanned docs)
2. **Lower the vLLM memory fraction** - Set `VLLM_GPU_MEMORY_UTILIZATION` to a lower value (e.g., `0.3`)

## Performance

**Hardware:** Nvidia A100 Large (80GB VRAM, 12 vCPU, 142GB RAM)

| Backend              | Speed           | 15-page PDF | 31-page PDF |
| -------------------- | --------------- | ----------- | ----------- |
| `pipeline`           | ~0.77 pages/sec | ~20 seconds | ~40 seconds |
| `hybrid-auto-engine` | ~0.39 pages/sec | ~40 seconds | ~80 seconds |

**Trade-off:** Hybrid is 2x slower but produces significantly better results for scanned/complex documents. For native PDFs, both produce identical output.

**Sleep behavior:** The Space sleeps after 60 minutes of inactivity. The first request after sleep takes ~30-60 seconds while the Space cold-starts.
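
Clients can absorb a cold start by retrying with a growing delay. The helper below is a generic sketch, not something provided by the API: `call_with_retry` is a hypothetical name, the delays are arbitrary, and it retries on `OSError`, which covers Python's built-in connection and timeout errors (and, in `requests`, the `RequestException` family, which derives from `IOError`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    attempts: int = 4,
    first_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, retrying with exponential backoff while the Space wakes up."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2
    raise ValueError("attempts must be at least 1")
```

A typical call wraps the request in a lambda, for example `call_with_retry(lambda: requests.post(url, headers=headers, json=body, timeout=120))`.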

## Deployment

- **Space:** https://huggingface.co/spaces/outcomelabs/md-parser
- **API:** https://outcomelabs-md-parser.hf.space
- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr, stops billing when sleeping)

### Deploy Updates

```bash
git add .
git commit -m "feat: description"
git push hf main
```

## Logging

View logs in HuggingFace Space > Logs tab:

```
2026-01-26 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-01-26 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-01-26 10:30:00 | INFO | [a1b2c3d4] Backend: pipeline
2026-01-26 10:30:27 | INFO | [a1b2c3d4] MinerU completed in 27.23s
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
2026-01-26 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
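
Since every line carries a request ID in brackets, logs copied from the Space can be grouped per request. A small sketch, assuming all lines follow the `timestamp | LEVEL | [id] message` layout shown above and that IDs are lowercase hexadecimal (`parse_log_line` is a hypothetical helper):

```python
import re
from typing import Dict, Optional

LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>[A-Z]+) \| \[(?P<request_id>[0-9a-f]+)\] (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Split one log line into its parts, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None
```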

## Changelog

### v1.4.0 (Breaking Change)

**Images now returned as ZIP file instead of dictionary:**

- `images` field removed
- `images_zip` field added (base64-encoded ZIP containing all images)
- `image_count` field added (number of images in ZIP)

**Migration from v1.3.0:**

```python
import base64
import io
import zipfile

# OLD (v1.3.0): one base64 string per image, keyed by filename
if result["images"]:
    for filename, b64_data in result["images"].items():
        img_bytes = base64.b64decode(b64_data)

# NEW (v1.4.0): a single base64-encoded ZIP holding all images
if result["images_zip"]:
    zip_bytes = base64.b64decode(result["images_zip"])
    with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
        for filename in zf.namelist():
            img_bytes = zf.read(filename)

**Benefits:**

- Smaller payload size due to ZIP compression
- Single field instead of large dictionary
- Easier to save/extract as a file

### v1.3.0

- Added `include_images` parameter for optional image extraction
- Added parallel chunking for large PDFs (>20 pages)
- Added automatic OOM fallback to sequential processing

## Credits

Built with [MinerU](https://github.com/opendatalab/MinerU) by OpenDataLab.