---
title: Docling VLM Parser API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---

# Docling VLM Parser API

A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a hybrid two-pass architecture: IBM's Docling for document structure and Qwen3-VL-30B-A3B via vLLM for enhanced text recognition.

## Features

- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces the baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via the VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: Dual-process setup on an A100 80GB (vLLM + FastAPI)

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                  (vllm/vllm-openai:v0.14.1)                 │
│                                                             │
│  ┌─────────────────────┐    ┌────────────────────────────┐  │
│  │  vLLM Server :8000  │    │   FastAPI App :7860        │  │
│  │  Qwen3-VL-30B-A3B   │◄───│                            │  │
│  │  (GPU inference)    │    │   Pass 1: Docling Standard │  │
│  └─────────────────────┘    │   - DocLayNet layout       │  │
│                             │   - TableFormer ACCURATE   │  │
│                             │   - RapidOCR baseline      │  │
│                             │                            │  │
│                             │   Pass 2: VLM OCR          │  │
│                             │   - Page images → Qwen3-VL │  │
│                             │   - OpenCV preprocessing   │  │
│                             │                            │  │
│                             │   Merge:                   │  │
│                             │   - VLM text (primary)     │  │
│                             │   - TableFormer tables     │  │
│                             └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check (includes vLLM status) |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |

## Authentication

All `/parse` endpoints require Bearer token authentication:

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from the base64-encoded ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
            print("Images saved to ./extracted_images/")
```

## Request Parameters

### File Upload (`/parse`)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `file` | File | Yes | - | PDF or image file |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |

### URL Parsing (`/parse/url`)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | URL to a PDF or image |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |
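The parameter names and defaults above can be packed into the `data=` dict of a multipart request. The helper below is an illustrative sketch (the function and its validation are not part of the API; only the field names and defaults come from the tables):

```python
def build_parse_form(output_format="markdown", images_scale=2.0,
                     start_page=0, end_page=None, include_images=False):
    """Assemble form-data fields for POST /parse using the documented defaults."""
    if output_format not in ("markdown", "json"):
        raise ValueError("output_format must be 'markdown' or 'json'")
    data = {
        "output_format": output_format,
        "images_scale": str(images_scale),
        "start_page": str(start_page),
        "include_images": "true" if include_images else "false",
    }
    if end_page is not None:  # null means "all pages", so omit when unset
        data["end_page"] = str(end_page)
    return data

# e.g. parse only pages 2..5 of the document
form = build_parse_form(start_page=2, end_page=5)
```

The dict returned here would be passed as `data=form` alongside `files=` in `requests.post`, as in the Quick Start examples.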

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cuda",
  "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `success` | boolean | Whether parsing succeeded |
| `markdown` | string | Extracted markdown (if `output_format=markdown`) |
| `json_content` | object | Extracted JSON (if `output_format=json`) |
| `images_zip` | string | Base64-encoded ZIP file containing all images |
| `image_count` | int | Number of images in the ZIP file |
| `error` | string | Error message if parsing failed |
| `pages_processed` | int | Number of pages processed |
| `device_used` | string | Device used for processing (`cuda`, `mps`, or `cpu`) |
| `vlm_model` | string | VLM model used for OCR (e.g. `Qwen3-VL-30B-A3B`) |
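Only one of `markdown` / `json_content` is populated, depending on the requested `output_format`. A small client-side helper (illustrative, not part of the API) can pick whichever is set and surface `error` on failure:

```python
def extract_content(result):
    """Return the populated content field from a /parse response dict."""
    if not result["success"]:
        raise RuntimeError(f"Parse failed: {result['error']}")
    # markdown when output_format=markdown, json_content when output_format=json
    if result["markdown"] is not None:
        return result["markdown"]
    return result["json_content"]

sample = {
    "success": True, "markdown": "# Document Title", "json_content": None,
    "images_zip": None, "image_count": 0, "error": None,
    "pages_processed": 20, "device_used": "cuda",
    "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
}
content = extract_content(sample)
```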

## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1 GB (configurable via `MAX_FILE_SIZE_MB`)

## Configuration

| Environment Variable | Description | Default |
|----------------------|-------------|---------|
| `API_TOKEN` | **Required.** API authentication token | - |
| `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Max context length for the VLM | `8192` |
| `IMAGES_SCALE` | Default image resolution scale | `2.0` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
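Server-side, these variables would be read with the defaults from the table, and startup should fail fast when the required `API_TOKEN` is missing. The `load_config` helper below is an illustrative sketch, not the service's actual code:

```python
import os

def load_config(env=os.environ):
    """Read service configuration with the documented defaults."""
    token = env.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is required; set it in Space Secrets")
    return {
        "api_token": token,
        "vlm_model": env.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct"),
        "vlm_host": env.get("VLM_HOST", "127.0.0.1"),
        "vlm_port": int(env.get("VLM_PORT", "8000")),
        "gpu_memory_utilization": float(env.get("VLM_GPU_MEMORY_UTILIZATION", "0.85")),
        "max_model_len": int(env.get("VLM_MAX_MODEL_LEN", "8192")),
        "images_scale": float(env.get("IMAGES_SCALE", "2.0")),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
    }

cfg = load_config({"API_TOKEN": "secret"})
```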

## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
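The timings in the sample log are additive because the two passes run sequentially: total wall time is Pass 1 plus Pass 2, and throughput is pages divided by total seconds.

```python
pass1_s, pass2_s, pages = 15.23, 12.00, 20

total_s = pass1_s + pass2_s  # 27.23 s, matching "27.23s total" in the log
speed = pages / total_s      # rounds to 0.73 pages/sec, matching the log
```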

## Credits

Built with Docling by IBM Research, Qwen3-VL by the Qwen Team, and vLLM.