---
title: Docling VLM Parser API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---

# Docling VLM Parser API

A FastAPI service that transforms PDFs and images into LLM-ready Markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-30B-A3B](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.

## Features

- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
- **VLM-Powered OCR**: Qwen3-VL-30B-A3B via vLLM replaces baseline RapidOCR for superior text accuracy
- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
- **Handwriting Recognition**: Transcribes handwritten text via the VLM
- **Image Extraction**: Extracts and returns all document images
- **Multiple Formats**: Output as Markdown or JSON
- **GPU Accelerated**: Dual-process setup on an A100 80GB (vLLM + FastAPI)

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                (vllm/vllm-openai:v0.14.1)                   │
│                                                             │
│  ┌─────────────────────┐    ┌────────────────────────────┐  │
│  │  vLLM Server :8000  │    │     FastAPI App :7860      │  │
│  │  Qwen3-VL-30B-A3B   │◄───│                            │  │
│  │  (GPU inference)    │    │  Pass 1: Docling Standard  │  │
│  └─────────────────────┘    │   - DocLayNet layout       │  │
│                             │   - TableFormer ACCURATE   │  │
│                             │   - RapidOCR baseline      │  │
│                             │                            │  │
│                             │  Pass 2: VLM OCR           │  │
│                             │   - Page images → Qwen3-VL │  │
│                             │   - OpenCV preprocessing   │  │
│                             │                            │  │
│                             │  Merge:                    │  │
│                             │   - VLM text (primary)     │  │
│                             │   - TableFormer tables     │  │
│                             └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## API Endpoints

| Endpoint     | Method | Description                                  |
| ------------ | ------ | -------------------------------------------- |
| `/`          | GET    | Health check (includes vLLM status)          |
| `/parse`     | POST   | Parse an uploaded file (multipart/form-data) |
| `/parse/url` | POST   | Parse a document from a URL (JSON body)      |

## Authentication

All `/parse` endpoints require Bearer token authentication.

```
Authorization: Bearer YOUR_API_TOKEN
```

Set `API_TOKEN` in HF Space Settings > Secrets.

## Quick Start

### cURL - File Upload

```bash
curl -X POST "https://YOUR-SPACE-URL/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### cURL - Parse from URL

```bash
curl -X POST "https://YOUR-SPACE-URL/parse/url" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/document.pdf", "output_format": "markdown"}'
```

### Python

```python
import requests

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Option 1: Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"}
    )

# Option 2: Parse from URL
response = requests.post(
    f"{API_URL}/parse/url",
    headers=headers,
    json={
        "url": "https://example.com/document.pdf",
        "output_format": "markdown"
    }
)

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
    print(result["markdown"])
else:
    print(f"Error: {result['error']}")
```

### Python with Images

```python
import requests
import base64
import zipfile
import io

API_URL = "https://YOUR-SPACE-URL"
API_TOKEN = "your_api_token"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Request with images included
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/parse",
        headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "include_images": "true"}
    )

result = response.json()
if result["success"]:
    print(f"Parsed {result['pages_processed']} pages")
    print(result["markdown"])

    # Extract images from the ZIP
    if result["images_zip"]:
        print(f"Extracting {result['image_count']} images...")
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
            zf.extractall("./extracted_images")
        print("Images saved to ./extracted_images/")
```

## Request Parameters

### File Upload (`/parse`)

| Parameter      | Type   | Required | Default    | Description                              |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| file           | File   | Yes      | -          | PDF or image file                        |
| output_format  | string | No       | `markdown` | `markdown` or `json`                     |
| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
| include_images | bool   | No       | `false`    | Include extracted images in the response |

### URL Parsing (`/parse/url`)

| Parameter      | Type   | Required | Default    | Description                              |
| -------------- | ------ | -------- | ---------- | ---------------------------------------- |
| url            | string | Yes      | -          | URL to a PDF or image                    |
| output_format  | string | No       | `markdown` | `markdown` or `json`                     |
| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
| start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
| end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
| include_images | bool   | No       | `false`    | Include extracted images in the response |

## Response Format

```json
{
  "success": true,
  "markdown": "# Document Title\n\nExtracted content...",
  "json_content": null,
  "images_zip": null,
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
  "device_used": "cuda",
  "vlm_model": "Qwen/Qwen3-VL-30B-A3B-Instruct"
}
```
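A small client-side helper can dispatch on these response fields without hitting the network. The sketch below is illustrative, not part of the API: `handle_result` is a hypothetical name, and it assumes only the fields documented in this section.

```python
import base64
import io
import zipfile


def handle_result(result: dict) -> str:
    """Illustrative helper: extract content from a parse response dict.

    Assumes the response shape documented here: exactly one of
    `markdown` / `json_content` is populated, and `images_zip` is a
    base64-encoded ZIP when include_images=true was requested.
    """
    if not result.get("success"):
        raise RuntimeError(f"Parse failed: {result.get('error')}")

    # Pick whichever output the server produced for the requested format.
    if result.get("markdown") is not None:
        content = result["markdown"]
    else:
        content = str(result["json_content"])

    # Optionally inspect the returned images without writing to disk.
    if result.get("images_zip"):
        zip_bytes = base64.b64decode(result["images_zip"])
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            print(f"Response contains {len(zf.namelist())} image(s)")

    return content
```

This keeps the success check, format dispatch, and image decoding in one place, so the calling code only deals with a plain string.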
| Field           | Type    | Description                                      |
| --------------- | ------- | ------------------------------------------------ |
| success         | boolean | Whether parsing succeeded                        |
| markdown        | string  | Extracted Markdown (if `output_format=markdown`) |
| json_content    | object  | Extracted JSON (if `output_format=json`)         |
| images_zip      | string  | Base64-encoded ZIP file containing all images    |
| image_count     | int     | Number of images in the ZIP file                 |
| error           | string  | Error message if parsing failed                  |
| pages_processed | int     | Number of pages processed                        |
| device_used     | string  | Device used for processing (cuda, mps, or cpu)   |
| vlm_model       | string  | VLM model used for OCR (e.g. Qwen3-VL-30B-A3B)   |

## Supported File Types

- PDF (.pdf)
- Images (.png, .jpg, .jpeg, .tiff, .bmp)

Maximum file size: 1 GB (configurable via `MAX_FILE_SIZE_MB`)

## Configuration

| Environment Variable         | Description                            | Default                          |
| ---------------------------- | -------------------------------------- | -------------------------------- |
| `API_TOKEN`                  | **Required.** API authentication token | -                                |
| `VLM_MODEL`                  | VLM model for OCR                      | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST`                   | vLLM server host                       | `127.0.0.1`                      |
| `VLM_PORT`                   | vLLM server port                       | `8000`                           |
| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM           | `0.85`                           |
| `VLM_MAX_MODEL_LEN`          | Max context length for the VLM         | `8192`                           |
| `IMAGES_SCALE`               | Default image resolution scale         | `2.0`                            |
| `MAX_FILE_SIZE_MB`           | Maximum upload size in MB              | `1024`                           |

## Logging

View logs in the HuggingFace Space > Logs tab:

```
2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
```

## Credits

Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct) by the Qwen team, and [vLLM](https://github.com/vllm-project/vllm).
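## Tip: Chunking Large Documents

For very long documents, the `start_page` and `end_page` request parameters let a client split one parse job into several smaller requests. A minimal sketch of the client-side range computation follows; the helper name and chunk size are illustrative, and treating `end_page` as exclusive is an assumption to verify against the service's actual behavior.

```python
def page_chunks(total_pages: int, chunk_size: int = 10) -> list[tuple[int, int]]:
    """Split [0, total_pages) into (start_page, end_page) request ranges.

    Pages are 0-indexed, matching the API's start_page parameter.
    end_page is treated as exclusive here (an assumption) - adjust if
    the service treats it as inclusive.
    """
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]
```

Each returned pair can then be sent as `start_page`/`end_page` in its own `/parse` request, which keeps individual requests small and makes retries cheaper.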