---
title: Docling VLM Parser API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---
# Docling VLM Parser API
A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a hybrid two-pass architecture: IBM's Docling for document structure and Qwen3-VL-30B-A3B via vLLM for enhanced text recognition.
## Features

- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)
## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                       Docker Container                       │
│                  (vllm/vllm-openai:v0.14.1)                  │
│                                                              │
│  ┌─────────────────────┐     ┌───────────────────────────┐   │
│  │ vLLM Server :8000   │     │ FastAPI App :7860         │   │
│  │ Qwen3-VL-30B-A3B    │◄────┤                           │   │
│  │ (GPU inference)     │     │ Pass 1: Docling Standard  │   │
│  └─────────────────────┘     │  - DocLayNet layout       │   │
│                              │  - TableFormer ACCURATE   │   │
│                              │  - RapidOCR baseline      │   │
│                              │                           │   │
│                              │ Pass 2: VLM OCR           │   │
│                              │  - Page images → Qwen3-VL │   │
│                              │  - OpenCV preprocessing   │   │
│                              │                           │   │
│                              │ Merge:                    │   │
│                              │  - VLM text (primary)     │   │
│                              │  - TableFormer tables     │   │
│                              └───────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
```
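The two-pass flow can be sketched as the following outline. All names here are hypothetical, chosen for illustration; the service's real implementation differs.

```python
def hybrid_convert(pdf_bytes, docling_convert, vlm_ocr_page):
    """Pass 1: document structure via Docling; Pass 2: VLM OCR per page; merge."""
    # Pass 1: layout, TableFormer tables, and baseline OCR from Docling
    doc = docling_convert(pdf_bytes)

    # Pass 2: re-OCR each rendered page image with the VLM
    vlm_text = [vlm_ocr_page(img) for img in doc["page_images"]]

    # Merge: VLM text becomes primary; TableFormer tables are kept as-is
    return {"text": vlm_text, "tables": doc["tables"]}
```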
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check (includes vLLM status) |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |
## Authentication

All `/parse` endpoints require Bearer token authentication:

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python

```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images

```python
import base64
import io
import zipfile

import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from the base64-encoded ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
## Request Parameters

### File Upload (`/parse`)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `file` | File | Yes | - | PDF or image file |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |

### URL Parsing (`/parse/url`)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | URL to PDF or image |
| `output_format` | string | No | `markdown` | `markdown` or `json` |
| `images_scale` | float | No | `2.0` | Image resolution scale (higher = better quality) |
| `start_page` | int | No | `0` | Starting page (0-indexed) |
| `end_page` | int | No | `null` | Ending page (`null` = all pages) |
| `include_images` | bool | No | `false` | Include extracted images in response |
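A client can assemble these form fields before posting; the helper below is a hypothetical sketch (`build_parse_form` is not part of the API), showing how optional parameters map to the multipart form of `/parse`:

```python
def build_parse_form(output_format="markdown", images_scale=2.0,
                     start_page=0, end_page=None, include_images=False):
    """Build the multipart form fields for /parse; end_page is omitted when None."""
    form = {
        "output_format": output_format,
        "images_scale": str(images_scale),
        "start_page": str(start_page),
        "include_images": "true" if include_images else "false",
    }
    if end_page is not None:
        form["end_page"] = str(end_page)
    return form
```

Pass the result as the `data=` argument alongside `files=` in `requests.post`.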
## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cuda",
  "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```
| Field | Type | Description |
|---|---|---|
| `success` | boolean | Whether parsing succeeded |
| `markdown` | string | Extracted markdown (if `output_format=markdown`) |
| `json_content` | object | Extracted JSON (if `output_format=json`) |
| `images_zip` | string | Base64-encoded ZIP file containing all images |
| `image_count` | int | Number of images in the ZIP file |
| `error` | string | Error message if parsing failed |
| `pages_processed` | int | Number of pages processed |
| `device_used` | string | Device used for processing (`cuda`, `mps`, or `cpu`) |
| `vlm_model` | string | VLM model used for OCR (e.g. `Qwen3-VL-30B-A3B`) |
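A small client-side helper for this response shape might look like the sketch below; `handle_parse_response` is a hypothetical name, not something the service provides:

```python
import json


def handle_parse_response(result: dict) -> str:
    """Return the parsed content from a /parse response, raising on failure."""
    if not result.get("success"):
        # `error` carries the failure message when success is false
        raise RuntimeError(result.get("error") or "unknown parse error")
    if result.get("markdown") is not None:
        return result["markdown"]
    # Fall back to the JSON payload when output_format=json was requested
    return json.dumps(result["json_content"], indent=2)
```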
## Supported File Types

- PDF (`.pdf`)
- Images (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`)

Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
## Configuration

| Environment Variable | Description | Default |
|---|---|---|
| `API_TOKEN` | **Required.** API authentication token | - |
| `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Max context length for VLM | `8192` |
| `IMAGES_SCALE` | Default image resolution scale | `2.0` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
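The table above maps directly to environment lookups. A minimal sketch of reading them with the documented defaults (illustrative only; the service's actual configuration code may differ):

```python
import os


def load_config(env=None):
    """Read service configuration from environment variables with documented defaults."""
    env = os.environ if env is None else env
    return {
        "api_token": env.get("API_TOKEN"),  # required; no default
        "vlm_model": env.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct"),
        "vlm_host": env.get("VLM_HOST", "127.0.0.1"),
        "vlm_port": int(env.get("VLM_PORT", "8000")),
        "gpu_mem_util": float(env.get("VLM_GPU_MEMORY_UTILIZATION", "0.85")),
        "max_model_len": int(env.get("VLM_MAX_MODEL_LEN", "8192")),
        "images_scale": float(env.get("IMAGES_SCALE", "2.0")),
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
    }
```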
## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Credits

Built with Docling by IBM Research, Qwen3-VL by the Qwen Team, and vLLM.