---
title: Docling VLM Parser API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---
# Docling VLM Parser API
A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-30B-A3B](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.
## Features
- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Docker Container                       │
│                 (vllm/vllm-openai:v0.14.1)                  │
│                                                             │
│  ┌─────────────────────┐    ┌────────────────────────────┐  │
│  │ vLLM Server :8000   │    │ FastAPI App :7860          │  │
│  │ Qwen3-VL-30B-A3B    │◄───┤                            │  │
│  │ (GPU inference)     │    │ Pass 1: Docling Standard   │  │
│  └─────────────────────┘    │  - DocLayNet layout        │  │
│                             │  - TableFormer ACCURATE    │  │
│                             │  - RapidOCR baseline       │  │
│                             │                            │  │
│                             │ Pass 2: VLM OCR            │  │
│                             │  - Page images → Qwen3-VL  │  │
│                             │  - OpenCV preprocessing    │  │
│                             │                            │  │
│                             │ Merge:                     │  │
│                             │  - VLM text (primary)      │  │
│                             │  - TableFormer tables      │  │
│                             └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
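The merge step in the diagram can be sketched as plain orchestration: Pass 2's VLM text is taken as the primary page content, while TableFormer tables from Pass 1 are carried over. The helper below is a simplified illustration with hypothetical stand-in functions (`run_docling_pass`, `run_vlm_ocr`), not the service's actual internals.

```python
def hybrid_convert(pages, run_docling_pass, run_vlm_ocr):
    """Sketch of the two-pass merge: VLM text is primary, TableFormer
    tables come from the Docling pass. Both pass functions are
    hypothetical stand-ins for the real pipelines."""
    layout = run_docling_pass(pages)   # Pass 1: layout + table structure
    vlm_text = run_vlm_ocr(pages)      # Pass 2: per-page VLM OCR text
    merged = []
    for i, _page in enumerate(pages):
        merged.append({
            "text": vlm_text[i],                    # VLM text (primary)
            "tables": layout[i].get("tables", []),  # TableFormer tables
        })
    return merged
```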
## API Endpoints
| Endpoint | Method | Description |
| ------------ | ------ | ----------------------------------------- |
| `/` | GET | Health check (includes vLLM status) |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |
## Authentication
All `/parse` endpoints require Bearer token authentication.
```
Authorization: Bearer YOUR_API_TOKEN
```
Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start
### cURL - File Upload
```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-F "file=@document.pdf" \
-F "output_format=markdown"
```
### cURL - Parse from URL
```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python
```python
import requests
API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images
```python
import requests
import base64
import zipfile
import io
API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from the base64-encoded ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
## Request Parameters
### File Upload (/parse)
| Parameter | Type | Required | Default | Description |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| file | File | Yes | - | PDF or image file |
| output_format | string | No | `markdown` | `markdown` or `json` |
| images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
| start_page | int | No | `0` | Starting page (0-indexed) |
| end_page | int | No | `null` | Ending page (null = all pages) |
| include_images | bool | No | `false` | Include extracted images in response |
### URL Parsing (/parse/url)
| Parameter | Type | Required | Default | Description |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| url | string | Yes | - | URL to PDF or image |
| output_format | string | No | `markdown` | `markdown` or `json` |
| images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
| start_page | int | No | `0` | Starting page (0-indexed) |
| end_page | int | No | `null` | Ending page (null = all pages) |
| include_images | bool | No | `false` | Include extracted images in response |
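A client can validate these parameters before sending them as form fields. The sketch below mirrors the defaults in the tables above; the validation rules are a client-side assumption (the server performs its own checks).

```python
def build_parse_data(output_format="markdown", images_scale=2.0,
                     start_page=0, end_page=None, include_images=False):
    """Assemble the form fields for a /parse request, using the
    documented defaults. Raises on values the API would reject."""
    if output_format not in ("markdown", "json"):
        raise ValueError("output_format must be 'markdown' or 'json'")
    if end_page is not None and end_page < start_page:
        raise ValueError("end_page must be >= start_page")
    data = {
        "output_format": output_format,
        "images_scale": str(images_scale),
        "start_page": str(start_page),
        "include_images": "true" if include_images else "false",
    }
    if end_page is not None:  # null/omitted means "all pages"
        data["end_page"] = str(end_page)
    return data
```

Pass the result as the `data=` argument alongside `files=` in `requests.post`.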
## Response Format
```json
{
"success": true,
"markdown": "# Document Title\n\nExtracted content...",
"json_content": null,
"images_zip": null,
"image_count": 0,
"error": null,
"pages_processed": 20,
"device_used": "cuda",
"vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```
| Field | Type | Description |
| --------------- | ------- | ---------------------------------------------- |
| success | boolean | Whether parsing succeeded |
| markdown | string | Extracted markdown (if output_format=markdown) |
| json_content | object | Extracted JSON (if output_format=json) |
| images_zip | string | Base64-encoded ZIP file containing all images |
| image_count | int | Number of images in the ZIP file |
| error | string | Error message if failed |
| pages_processed | int | Number of pages processed |
| device_used | string | Device used for processing (cuda, mps, or cpu) |
| vlm_model | string | VLM model used for OCR (e.g. Qwen3-VL-30B-A3B) |
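A small helper can unpack this shape: check `success`, take whichever of `markdown`/`json_content` is populated, and decode `images_zip`. A sketch, assuming the field semantics in the table above:

```python
import base64
import io
import zipfile

def handle_response(result):
    """Unpack a parse response dict into (content, images).

    `content` is the markdown string or the JSON object, depending on
    the requested output_format; `images` maps filename -> raw bytes.
    """
    if not result["success"]:
        raise RuntimeError(result["error"])
    content = (result["markdown"] if result.get("markdown") is not None
               else result.get("json_content"))
    images = {}
    if result.get("images_zip"):
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            images = {name: zf.read(name) for name in zf.namelist()}
    return content, images
```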
## Supported File Types
- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)
Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
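Rejecting an oversize upload client-side saves a round trip. A minimal pre-check, assuming the 1024 MB default:

```python
import os

MAX_FILE_SIZE_MB = 1024  # matches the server default; configurable server-side

def check_upload_size(path, limit_mb=MAX_FILE_SIZE_MB):
    """Return the file size in MB, or raise if it exceeds the limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > limit_mb:
        raise ValueError(f"{path} is {size_mb:.1f} MB, over the {limit_mb} MB limit")
    return size_mb
```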
## Configuration
| Environment Variable | Description | Default |
| ---------------------------- | -------------------------------------- | --------------------------- |
| `API_TOKEN` | **Required.** API authentication token | - |
| `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Max context length for VLM | `8192` |
| `IMAGES_SCALE` | Default image resolution scale | `2.0` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
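The table maps to runtime settings roughly as follows. The names and defaults come from the table; the reading logic below is an illustrative sketch, not the service's actual startup code.

```python
import os

def load_config(env=os.environ):
    """Resolve the documented environment variables with their defaults."""
    return {
        "api_token": env.get("API_TOKEN"),  # required; no default
        "vlm_model": env.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct"),
        "vlm_host": env.get("VLM_HOST", "127.0.0.1"),
        "vlm_port": int(env.get("VLM_PORT", "8000")),
        "gpu_memory_utilization": float(env.get("VLM_GPU_MEMORY_UTILIZATION", "0.85")),
        "max_model_len": int(env.get("VLM_MAX_MODEL_LEN", "8192")),
        "images_scale": float(env.get("IMAGES_SCALE", "2.0")),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
    }
```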
## Logging
View logs in the Hugging Face Space > Logs tab:
```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Credits
Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) by Qwen Team, and [vLLM](https://github.com/vllm-project/vllm).