---
title: Docling VLM Parser API
emoji: πŸ“„
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---
# Docling VLM Parser API
A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-30B-A3B](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.
## Features
- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribe handwritten text via VLM
- **Image Extraction**: Extract and return all document images
- **Multiple Formats**: Output as markdown or JSON
- **GPU Accelerated**: vLLM and FastAPI run as separate processes on a single A100 80GB
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Docker Container                       β”‚
β”‚                  (vllm/vllm-openai:v0.14.1)                  β”‚
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ vLLM Server :8000   β”‚      β”‚    FastAPI App :7860     β”‚   β”‚
β”‚  β”‚ Qwen3-VL-30B-A3B    │◄─────│                          β”‚   β”‚
β”‚  β”‚ (GPU inference)     β”‚      β”‚ Pass 1: Docling Standard β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚ - DocLayNet layout       β”‚   β”‚
β”‚                               β”‚ - TableFormer ACCURATE   β”‚   β”‚
β”‚                               β”‚ - RapidOCR baseline      β”‚   β”‚
β”‚                               β”‚                          β”‚   β”‚
β”‚                               β”‚ Pass 2: VLM OCR          β”‚   β”‚
β”‚                               β”‚ - Page images β†’ Qwen3-VL β”‚   β”‚
β”‚                               β”‚ - OpenCV preprocessing   β”‚   β”‚
β”‚                               β”‚                          β”‚   β”‚
β”‚                               β”‚ Merge:                   β”‚   β”‚
β”‚                               β”‚ - VLM text (primary)     β”‚   β”‚
β”‚                               β”‚ - TableFormer tables     β”‚   β”‚
β”‚                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## API Endpoints
| Endpoint | Method | Description |
| ------------ | ------ | ----------------------------------------- |
| `/` | GET | Health check (includes vLLM status) |
| `/parse` | POST | Parse uploaded file (multipart/form-data) |
| `/parse/url` | POST | Parse document from URL (JSON body) |
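Before sending parse requests, you can poll the health check to confirm the service (and its vLLM backend) is up. This helper is an illustrative sketch, not part of the API client; it only inspects the HTTP status code, not the vLLM status field in the body:

```python
import requests

def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the health check endpoint responds with HTTP 200."""
    try:
        return requests.get(f"{base_url}/", timeout=timeout).status_code == 200
    except requests.RequestException:
        # Connection refused, DNS failure, timeout, etc.
        return False
```

Useful in deployment scripts that need to wait for the Space to finish cold-starting before sending work.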
## Authentication
All `/parse` endpoints require Bearer token authentication.
```
Authorization: Bearer YOUR_API_TOKEN
```
Set `API_TOKEN` in HF Space Settings > Secrets.
## Quick Start
### cURL - File Upload
```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-F "file=@document.pdf" \
-F "output_format=markdown"
```
### cURL - Parse from URL
```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```
### Python
```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown",
    },
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```
### Python with Images
```python
import base64
import io
import zipfile

import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"},
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```
## Request Parameters
### File Upload (/parse)
| Parameter | Type | Required | Default | Description |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| file | File | Yes | - | PDF or image file |
| output_format | string | No | `markdown` | `markdown` or `json` |
| images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
| start_page | int | No | `0` | Starting page (0-indexed) |
| end_page | int | No | `null` | Ending page (null = all pages) |
| include_images | bool | No | `false` | Include extracted images in response |
### URL Parsing (/parse/url)
| Parameter | Type | Required | Default | Description |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| url | string | Yes | - | URL to PDF or image |
| output_format | string | No | `markdown` | `markdown` or `json` |
| images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
| start_page | int | No | `0` | Starting page (0-indexed) |
| end_page | int | No | `null` | Ending page (null = all pages) |
| include_images | bool | No | `false` | Include extracted images in response |
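The form fields above can be assembled with a small client-side helper. The helper itself is illustrative and not part of the API; it only enforces the constraints documented in the tables:

```python
from typing import Optional

def build_parse_fields(
    output_format: str = "markdown",
    images_scale: float = 2.0,
    start_page: int = 0,
    end_page: Optional[int] = None,
    include_images: bool = False,
) -> dict:
    """Assemble the documented /parse form fields, validating their constraints."""
    if output_format not in ("markdown", "json"):
        raise ValueError("output_format must be 'markdown' or 'json'")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and must be >= 0")
    fields = {
        "output_format": output_format,
        "images_scale": str(images_scale),
        "start_page": str(start_page),
        "include_images": "true" if include_images else "false",
    }
    if end_page is not None:  # omit to process all pages
        fields["end_page"] = str(end_page)
    return fields
```

For example, `data=build_parse_fields(output_format="json", end_page=4)` restricts processing to an initial page range instead of the whole document.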
## Response Format
```json
{
"success": true,
"markdown": "# Document Title\n\nExtracted content...",
"json_content": null,
"images_zip": null,
"image_count": 0,
"error": null,
"pages_processed": 20,
"device_used": "cuda",
"vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```
| Field | Type | Description |
| --------------- | ------- | ---------------------------------------------- |
| success | boolean | Whether parsing succeeded |
| markdown | string | Extracted markdown (if output_format=markdown) |
| json_content | object | Extracted JSON (if output_format=json) |
| images_zip | string | Base64-encoded ZIP of extracted images (null unless include_images=true) |
| image_count | int | Number of images in the ZIP file |
| error | string | Error message if failed |
| pages_processed | int | Number of pages processed |
| device_used | string | Device used for processing (cuda, mps, or cpu) |
| vlm_model | string | VLM model used for OCR (e.g. Qwen3-VL-30B-A3B) |
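Only one of `markdown` and `json_content` is populated per request (the other is `null`, as in the sample response), so a small accessor can hide the difference. This is an illustrative client-side sketch, not part of the API:

```python
def extract_content(result: dict):
    """Return the parsed content from a response, whichever output_format was used."""
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "unknown parsing error")
    if result.get("markdown") is not None:
        return result["markdown"]
    return result["json_content"]
```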
## Supported File Types
- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)
Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)
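To avoid round-tripping uploads the service will reject, the type and size limits above can be checked client-side first. A sketch, assuming the default 1024 MB limit:

```python
import os

# Extensions and size limit taken from the documentation above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
MAX_FILE_SIZE_MB = 1024  # default; the server's MAX_FILE_SIZE_MB may differ

def check_upload(path: str, max_mb: int = MAX_FILE_SIZE_MB) -> None:
    """Raise ValueError if the file would be rejected by the service."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_mb:
        raise ValueError(f"File is {size_mb:.1f} MB, limit is {max_mb} MB")
```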
## Configuration
| Environment Variable | Description | Default |
| ---------------------------- | -------------------------------------- | --------------------------- |
| `API_TOKEN` | **Required.** API authentication token | - |
| `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Max context length for VLM | `8192` |
| `IMAGES_SCALE` | Default image resolution scale | `2.0` |
| `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |
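For reference, this is roughly how the documented defaults resolve at startup; the snippet is illustrative, not the service's actual code:

```python
import os

# Each variable falls back to the default from the table above.
VLM_MODEL = os.environ.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct")
VLM_HOST = os.environ.get("VLM_HOST", "127.0.0.1")
VLM_PORT = int(os.environ.get("VLM_PORT", "8000"))
VLM_GPU_MEMORY_UTILIZATION = float(os.environ.get("VLM_GPU_MEMORY_UTILIZATION", "0.85"))
VLM_MAX_MODEL_LEN = int(os.environ.get("VLM_MAX_MODEL_LEN", "8192"))
MAX_FILE_SIZE_MB = int(os.environ.get("MAX_FILE_SIZE_MB", "1024"))

# FastAPI talks to the co-located vLLM server over its OpenAI-compatible API:
VLLM_BASE_URL = f"http://{VLM_HOST}:{VLM_PORT}/v1"
```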
## Logging
View logs in the Hugging Face Space's Logs tab:
```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```
## Credits
Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) by Qwen Team, and [vLLM](https://github.com/vllm-project/vllm).