Spaces:

outcomelabs
/

docling-parser

Running on T4

File size: 10,963 Bytes

---
title: Docling VLM Parser API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---

# Docling VLM Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-30B-A3B](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.

## Features

- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                  (vllm/vllm-openai:v0.14.1)                 │
│                                                             │
│  ┌─────────────────────┐    ┌────────────────────────────┐  │
│  │  vLLM Server :8000  │    │   FastAPI App :7860        │  │
│  │  Qwen3-VL-30B-A3B        │◄───│                            │  │
│  │  (GPU inference)     │    │   Pass 1: Docling Standard │  │
│  └─────────────────────┘    │   - DocLayNet layout       │  │
│                              │   - TableFormer ACCURATE   │  │
│                              │   - RapidOCR baseline      │  │
│                              │                            │  │
│                              │   Pass 2: VLM OCR          │  │
│                              │   - Page images → Qwen3-VL │  │
│                              │   - OpenCV preprocessing   │  │
│                              │                            │  │
│                              │   Merge:                   │  │
│                              │   - VLM text (primary)     │  │
│                              │   - TableFormer tables     │  │
│                              └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## API Endpoints

| Endpoint     | Method | Description                               |
| ------------ | ------ | ----------------------------------------- |
| `/`          | GET    | Health check (includes vLLM status)       |
| `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse document from URL (JSON body)       |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), 'r') as zf:
            zf.extractall("./extracted_images")
            print(f"Images saved to ./extracted_images/")
```

## Request Parameters

### File Upload (/parse)

| Parameter      | Type   | Required | Default    | Description                              |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                        |
| output_format  | string | No       | `markdown` | `markdown` or `json`                     |
| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
| include_images | bool   | No       | `false`    | Include extracted images in response     |

### URL Parsing (/parse/url)

| Parameter      | Type   | Required | Default    | Description                              |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| url            | string | Yes      | -          | URL to PDF or image                      |
| output_format  | string | No       | `markdown` | `markdown` or `json`                     |
| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
| include_images | bool   | No       | `false`    | Include extracted images in response     |

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cuda",
  "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```

| Field           | Type    | Description                                    |
| --------------- | ------- | ---------------------------------------------- |
| success         | boolean | Whether parsing succeeded                      |
| markdown        | string  | Extracted markdown (if output_format=markdown) |
| json_content    | object  | Extracted JSON (if output_format=json)         |
| images_zip      | string  | Base64-encoded ZIP file containing all images  |
| image_count     | int     | Number of images in the ZIP file               |
| error           | string  | Error message if failed                        |
| pages_processed | int     | Number of pages processed                      |
| device_used     | string  | Device used for processing (cuda, mps, or cpu) |
| vlm_model       | string  | VLM model used for OCR (e.g. Qwen3-VL-30B-A3B)      |

## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)

## Configuration

| Environment Variable         | Description                            | Default                     |
| ---------------------------- | -------------------------------------- | --------------------------- |
| `API_TOKEN`                  | **Required.** API authentication token | -                           |
| `VLM_MODEL`                  | VLM model for OCR                      | `Qwen/Qwen3-VL-30B-A3B-Instruct`  |
| `VLM_HOST`                   | vLLM server host                       | `127.0.0.1`                 |
| `VLM_PORT`                   | vLLM server port                       | `8000`                      |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM           | `0.85`                      |
| `VLM_MAX_MODEL_LEN`          | Max context length for VLM             | `8192`                      |
| `IMAGES_SCALE`               | Default image resolution scale         | `2.0`                       |
| `MAX_FILE_SIZE_MB`           | Maximum upload size in MB              | `1024`                      |

## Logging

View logs in HuggingFace Space > Logs tab:

```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```

## Credits

Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) by Qwen Team, and [vLLM](https://github.com/vllm-project/vllm).