---
title: Docling VLM Parser API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---
# Docling VLM Parser API
A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a hybrid two-pass architecture: IBM's Docling for document structure and Qwen3-VL-30B-A3B via vLLM for enhanced text recognition.
## Features

- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)
## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                       Docker Container                       │
│                  (vllm/vllm-openai:v0.14.1)                  │
│                                                              │
│  ┌─────────────────────┐     ┌───────────────────────────┐   │
│  │ vLLM Server :8000   │     │ FastAPI App :7860         │   │
│  │ Qwen3-VL-30B-A3B    │◄────┤                           │   │
│  │ (GPU inference)     │     │ Pass 1: Docling Standard  │   │
│  └─────────────────────┘     │  - DocLayNet layout       │   │
│                              │  - TableFormer ACCURATE   │   │
│                              │  - RapidOCR baseline      │   │
│                              │                           │   │
│                              │ Pass 2: VLM OCR           │   │
│                              │  - Page images → Qwen3-VL │   │
│                              │  - OpenCV preprocessing   │   │
│                              │                           │   │
│                              │ Merge:                    │   │
│                              │  - VLM text (primary)     │   │
│                              │  - TableFormer tables     │   │
│                              └───────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```
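The two-pass flow can be sketched as the following outline. All names here are hypothetical, chosen for illustration; the service's real implementation differs.

```python
def hybrid_convert(pdf_bytes, docling_convert, vlm_ocr_page):
    """Pass 1: document structure via Docling; Pass 2: VLM OCR per page; merge."""
    # Pass 1: layout, TableFormer tables, and baseline OCR from Docling
    doc = docling_convert(pdf_bytes)

    # Pass 2: re-OCR each rendered page image with the VLM
    vlm_text = [vlm_ocr_page(img) for img in doc["page_images"]]

    # Merge: VLM text becomes primary; TableFormer tables are kept as-is
    return {"text": vlm_text, "tables": doc["tables"]}
```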
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check (includes vLLM status) |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |
## Authentication

All `/parse` endpoints require Bearer token authentication:

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python

```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from the base64-encoded ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
## Request Parameters

### File Upload (`/parse`)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | File | Yes | - | PDF or image file |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |

### URL Parsing (`/parse/url`)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | URL to PDF or image |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |
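A client can assemble these form fields before posting; the helper below is a hypothetical sketch (`build_parse_form` is not part of the API), showing how optional parameters map to the multipart form of `/parse`:

```python
def build_parse_form(output_format="markdown", images_scale=2.0,
                     start_page=0, end_page=None, include_images=False):
    """Build the multipart form fields for /parse; end_page is omitted when None."""
    form = {
        "output_format": output_format,
        "images_scale": str(images_scale),
        "start_page": str(start_page),
        "include_images": "true" if include_images else "false",
    }
    if end_page is not None:
        form["end_page"] = str(end_page)
    return form
```

Pass the result as the `data=` argument alongside `files=` in `requests.post`.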
## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cuda",
  "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```
| Field | Type | Description |
|---|---|---|
| `success` | boolean | Whether parsing succeeded |
| `markdown` | string | Extracted markdown (if `output_format=markdown`) |
| `json_content` | object | Extracted JSON (if `output_format=json`) |
| `images_zip` | string | Base64-encoded ZIP file containing all images |
| `image_count` | int | Number of images in the ZIP file |
| `error` | string | Error message if parsing failed |
| `pages_processed` | int | Number of pages processed |
| `device_used` | string | Device used for processing (`cuda`, `mps`, or `cpu`) |
| `vlm_model` | string | VLM model used for OCR (e.g. `Qwen3-VL-30B-A3B`) |
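A small client-side helper for this response shape might look like the sketch below; `handle_parse_response` is a hypothetical name, not something the service provides:

```python
import json


def handle_parse_response(result: dict) -> str:
    """Return the parsed content from a /parse response, raising on failure."""
    if not result.get("success"):
        # `error` carries the failure message when success is false
        raise RuntimeError(result.get("error") or "unknown parse error")
    if result.get("markdown") is not None:
        return result["markdown"]
    # Fall back to the JSON payload when output_format=json was requested
    return json.dumps(result["json_content"], indent=2)
```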
## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
## Configuration

| Environment Variable | Description | Default |
|---|---|---|
| `API_TOKEN` | **Required.** API authentication token | - |
| `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Max context length for VLM | `8192` |
| `IMAGES_SCALE` | Default image resolution scale | `2.0` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
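The table above maps directly to environment lookups. A minimal sketch of reading them with the documented defaults (illustrative only; the service's actual configuration code may differ):

```python
import os


def load_config(env=None):
    """Read service configuration from environment variables with documented defaults."""
    env = os.environ if env is None else env
    return {
        "api_token": env.get("API_TOKEN"),  # required; no default
        "vlm_model": env.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct"),
        "vlm_host": env.get("VLM_HOST", "127.0.0.1"),
        "vlm_port": int(env.get("VLM_PORT", "8000")),
        "gpu_mem_util": float(env.get("VLM_GPU_MEMORY_UTILIZATION", "0.85")),
        "max_model_len": int(env.get("VLM_MAX_MODEL_LEN", "8192")),
        "images_scale": float(env.get("IMAGES_SCALE", "2.0")),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
    }
```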
## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Credits

Built with Docling by IBM Research, Qwen3-VL by the Qwen Team, and vLLM.