sidoutcome committed
Commit 8c4351b · 1 Parent(s): 4848ba0

feat: hybrid VLM parser with Qwen3-VL-8B via vLLM (v2.0.0)

Files changed (6):
  1. CLAUDE.md +46 -55
  2. Dockerfile +38 -25
  3. README.md +59 -37
  4. app.py +544 -245
  5. requirements.txt +13 -4
  6. start.sh +71 -0
CLAUDE.md CHANGED
@@ -4,48 +4,43 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

  ## Project Overview

- Docling Parser - A Hugging Face Spaces API service that deploys IBM's Docling library for PDF/document parsing. Transforms complex documents (PDFs, images) into LLM-ready markdown/JSON formats. API endpoints are protected by Bearer token authentication.

  ## Architecture

  ```
  hf_docling_parser/
- ├── app.py # FastAPI application with parsing endpoints
- ├── Dockerfile # HF Spaces Docker configuration (GPU-enabled)
- ├── requirements.txt # Python dependencies
- ├── README.md # HF Spaces metadata and API documentation
  ├── CLAUDE.md # Claude Code development guide
  └── .gitignore # Git ignore patterns
  ```

  ## Common Commands

  ```bash
- # Local development
- pip install -r requirements.txt
- uvicorn app:app --host 0.0.0.0 --port 7860 --reload

  # Test the API locally
  curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
-
- # Build and test Docker locally
- docker build -t hf-docling .
- docker run --gpus all -p 7860:7860 -e API_TOKEN=test hf-docling
  ```

- ## Deploying to Hugging Face Spaces

- ### First-time Setup

- ```bash
- hf auth login
- hf repo create docling-parser --repo-type space --space_sdk docker
- git init
- git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/docling-parser
- ```

  ### Push New Code

@@ -57,7 +52,7 @@ git push hf main

  ### Settings (configure in HF web UI)

- - **Hardware:** Nvidia A10G Large (24GB) or A100 (for larger docs)
  - **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
  - **Secrets:** `API_TOKEN` (required for API authentication)

@@ -78,37 +73,29 @@ git push hf main
  - **uvicorn**: ASGI server
  - **httpx**: HTTP client for URL parsing
  - **pydantic**: Request/response validation
- - **torch**: PyTorch for GPU acceleration
-
- ## Environment Variables

- | Variable | Description | Default |
- | ------------------ | ------------------------------------------------- | ---------- |
- | `API_TOKEN` | **Required.** Secret token for API authentication | - |
- | `DO_OCR` | Enable OCR by default | `true` |
- | `TABLE_MODE` | Table detection mode (accurate, fast) | `accurate` |
- | `IMAGES_SCALE` | Image resolution scale for extraction | `3.0` |
- | `DEFAULT_LANG` | Default OCR language code | `en` |
- | `MAX_FILE_SIZE_MB` | Maximum upload file size in MB | `1024` |

- ## Docling Pipeline Options
-
- The converter supports these key options:

- - `do_ocr`: Enable/disable OCR for scanned documents
- - `do_table_structure`: Enable table structure detection
- - `table_structure_options.mode`: TableFormerMode.ACCURATE or FAST
- - `generate_picture_images`: Extract images from documents
- - `images_scale`: Resolution multiplier for extracted images
- - `accelerator_options.device`: cuda, mps, or cpu

  ## Testing

- ### Start the server
-
- ```bash
- API_TOKEN=test uvicorn app:app --host 0.0.0.0 --port 7860 --reload
- ```

  ### Test with curl

@@ -163,12 +150,16 @@ The API provides comprehensive logging:

  ## Comparison with MinerU

- | Feature | Docling | MinerU |
- | ---------------- | ---------------------- | ------------------- |
- | Maintainer | IBM Research | OpenDataLab |
- | Table Detection | TableFormer (built-in) | Multiple backends |
- | OCR | Built-in | Built-in |
- | VLM Support | Optional | Hybrid backend |
- | License | MIT | AGPL-3.0 |
- | GPU Memory | ~8-12GB | ~6-10GB (pipeline) |
- | Primary Use Case | Enterprise documents | General PDF parsing |


  ## Project Overview

+ Docling Parser (v2.0.0) - A Hugging Face Spaces API service using a hybrid two-pass VLM architecture for PDF/document parsing. Pass 1 runs Docling's standard pipeline (DocLayNet layout + TableFormer ACCURATE + RapidOCR baseline). Pass 2 sends full page images to Qwen3-VL-8B via vLLM for enhanced text recognition. The merge step preserves TableFormer tables while replacing RapidOCR text with VLM output. Includes OpenCV preprocessing (denoise, CLAHE contrast enhancement). API endpoints are protected by Bearer token authentication.
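
The merge step described above can be sketched as follows. The block structure and field names here are invented for illustration only and do not mirror Docling's actual document model: keep every TableFormer table as-is, and substitute the VLM transcription for the baseline OCR text of each page.

```python
def merge_passes(pass1_blocks, vlm_text_by_page):
    """Illustrative merge: TableFormer tables win, VLM text replaces RapidOCR text.

    pass1_blocks: list of {"kind": "table"|"text", "page": int, "markdown": str}
    vlm_text_by_page: {page_index: transcribed_text}
    (Hypothetical shapes; the real app.py types are not shown in this diff.)
    """
    merged = []
    replaced_pages = set()
    for block in pass1_blocks:
        if block["kind"] == "table":
            # Pass 1 table structure is preserved verbatim
            merged.append(block["markdown"])
        elif block["page"] not in replaced_pages:
            # All baseline OCR text for a page collapses into one VLM transcription
            merged.append(vlm_text_by_page[block["page"]])
            replaced_pages.add(block["page"])
    return "\n\n".join(merged)
```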

  ## Architecture

  ```
  hf_docling_parser/
+ ├── app.py # FastAPI + hybrid two-pass parsing (v2.0.0)
+ ├── start.sh # Startup script (vLLM + FastAPI dual-process)
+ ├── Dockerfile # vLLM base image, Qwen3-VL pre-downloaded
+ ├── requirements.txt # Python deps (docling, opencv, pdf2image, etc.)
+ ├── README.md # HF Spaces metadata
  ├── CLAUDE.md # Claude Code development guide
  └── .gitignore # Git ignore patterns
  ```

+ **Dual-process Docker architecture:** `start.sh` launches vLLM on port 8000 (GPU model serving) and FastAPI on port 7860 (API). Base image: `vllm/vllm-openai:v0.14.1`.
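
Since both processes share one container, the FastAPI side reaches vLLM over its local OpenAI-compatible HTTP API. A minimal sketch of deriving that base URL from the `VLM_HOST`/`VLM_PORT` variables (`vllm_base_url` is a hypothetical helper, not claimed to exist in `app.py`):

```python
import os


def vllm_base_url() -> str:
    """Build the OpenAI-compatible base URL for the in-container vLLM server.

    Defaults match the documented environment variables (127.0.0.1:8000).
    """
    host = os.environ.get("VLM_HOST", "127.0.0.1")
    port = os.environ.get("VLM_PORT", "8000")
    return f"http://{host}:{port}/v1"
```

An OpenAI-style client pointed at this URL (with any dummy API key) can then send page images to the local Qwen3-VL model.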
+
  ## Common Commands

  ```bash
+ # Build and test Docker locally (requires A100 GPU)
+ docker build --shm-size 32g -t hf-docling .
+ docker run --gpus all --shm-size 32g -p 7860:7860 -e API_TOKEN=test hf-docling

  # Test the API locally
  curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
  ```

+ > **Note:** Local dev without Docker is not practical: the hybrid pipeline requires vLLM + Qwen3-VL-8B running on an A100 GPU.

+ ## Deploying to Hugging Face Spaces

+ - **Space URL:** https://huggingface.co/spaces/outcomelabs/docling-parser
+ - **API URL:** https://outcomelabs-docling-parser.hf.space

  ### Push New Code


  ### Settings (configure in HF web UI)

+ - **Hardware:** Nvidia A100 Large 80GB ($2.50/hr); vLLM requires a GPU
  - **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
  - **Secrets:** `API_TOKEN` (required for API authentication)


  - **uvicorn**: ASGI server
  - **httpx**: HTTP client for URL parsing
  - **pydantic**: Request/response validation
+ - **opencv-python-headless**: Image preprocessing (denoise, CLAHE)
+ - **pdf2image**: PDF page to image conversion for VLM
+ - **numpy**: Array operations for image processing
+ - **huggingface-hub**: Model/space utilities

+ > **Note:** vLLM and PyTorch are provided by the base Docker image (`vllm/vllm-openai:v0.14.1`), not in `requirements.txt`.

+ ## Environment Variables

+ | Variable | Description | Default |
+ | ---------------------------- | ------------------------------------------------- | --------------------------- |
+ | `API_TOKEN` | **Required.** Secret token for API authentication | - |
+ | `MAX_FILE_SIZE_MB` | Maximum upload file size in MB | `1024` |
+ | `IMAGES_SCALE` | Image resolution scale for page rendering | `2.0` |
+ | `VLM_MODEL` | VLM model for text recognition pass | `Qwen/Qwen3-VL-8B-Instruct` |
+ | `VLM_HOST` | vLLM server host | `127.0.0.1` |
+ | `VLM_PORT` | vLLM server port | `8000` |
+ | `VLM_GPU_MEMORY_UTILIZATION` | Fraction of GPU memory for vLLM | `0.70` |
+ | `VLM_MAX_MODEL_LEN` | Max sequence length for vLLM | `8192` |
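
These variables follow the same `os.getenv`-with-default pattern `app.py` already uses for `MAX_FILE_SIZE_MB`; a sketch of reading the table's defaults at startup (variable names match the table, the exact code in `app.py` may differ):

```python
import os

# Numeric settings are stored as strings in the environment and coerced here.
VLM_GPU_MEMORY_UTILIZATION = float(os.getenv("VLM_GPU_MEMORY_UTILIZATION", "0.70"))
VLM_MAX_MODEL_LEN = int(os.getenv("VLM_MAX_MODEL_LEN", "8192"))
IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "2.0"))
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))

# Derived limit used to reject oversized uploads
MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
```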

  ## Testing

+ > **Note:** Testing requires an A100 GPU with vLLM running. Use the Docker container for testing.

  ### Test with curl


  ## Comparison with MinerU

+ | Feature | Docling (Hybrid VLM) | MinerU |
+ | ---------------- | -------------------------------- | ------------------- |
+ | Maintainer | IBM Research + Qwen3-VL | OpenDataLab |
+ | Table Detection | TableFormer (built-in) | Multiple backends |
+ | OCR | Qwen3-VL-8B via vLLM | Built-in |
+ | VLM Support | Hybrid (Standard + VLM two-pass) | Hybrid backend |
+ | License | MIT | AGPL-3.0 |
+ | GPU Memory | ~24GB (vLLM + Docling) | ~6-10GB (pipeline) |
+ | Primary Use Case | Enterprise documents | General PDF parsing |
+
+ ## Workflow Orchestration, Task Management & Core Principles
+
+ > **See root `CLAUDE.md`** for full Workflow Orchestration (plan mode, subagents, self-improvement, verification, elegance, bug fixing), Task Management, and Core Principles. Files: `<workspace-root>/tasks/todo.md`, `<workspace-root>/tasks/lessons.md`.
Dockerfile CHANGED
@@ -1,9 +1,9 @@
- # Hugging Face Spaces Dockerfile for Docling Document Parser API
- # Optimized for GPU-accelerated document parsing
- # Build: v1.0.0 - Using IBM's Docling library

- # Use PyTorch base image with CUDA support
- FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

  USER root

@@ -23,8 +23,6 @@ RUN echo "========== STEP 1: Installing system dependencies ==========" && \
  poppler-utils \
  # Health checks
  curl \
- # Build tools for some Python packages
- build-essential \
  && fc-cache -fv && \
  rm -rf /var/lib/apt/lists/* && \
  echo "========== System dependencies installed =========="
@@ -35,23 +33,27 @@ RUN useradd -m -u 1000 user
  # Set environment variables
  ENV PYTHONUNBUFFERED=1 \
  PYTHONDONTWRITEBYTECODE=1 \
- DO_OCR=true \
- TABLE_MODE=accurate \
- IMAGES_SCALE=3.0 \
  MAX_FILE_SIZE_MB=1024 \
- DEFAULT_LANG=en \
  HF_HOME=/home/user/.cache/huggingface \
  TORCH_HOME=/home/user/.cache/torch \
  XDG_CACHE_HOME=/home/user/.cache \
  HOME=/home/user \
- PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH

  # Create cache directories with correct ownership
- RUN mkdir -p /home/user/.cache/huggingface \
  /home/user/.cache/torch \
- /home/user/.cache/docling \
  /home/user/app && \
- chown -R user:user /home/user

  # Switch to non-root user
  USER user
@@ -61,34 +63,45 @@ WORKDIR /home/user/app
  COPY --chown=user:user requirements.txt .

  # Install Python dependencies
- RUN echo "========== STEP 2: Installing Python dependencies ==========" && \
  pip install --user --upgrade pip && \
  pip install --user -r requirements.txt && \
  echo "Installed packages:" && \
- pip list --user | grep -E "(docling|fastapi|uvicorn|httpx|pydantic|torch)" && \
  echo "========== Python dependencies installed =========="

  # Pre-download Docling models
- RUN echo "========== STEP 3: Pre-downloading Docling models ==========" && \
- python -c "from docling.document_converter import DocumentConverter; print('Initializing Docling to download models...'); converter = DocumentConverter(); print('Models downloaded successfully')" && \
  echo "Model cache summary:" && \
  du -sh /home/user/.cache/huggingface 2>/dev/null || echo " HF cache: (empty)" && \
  du -sh /home/user/.cache/torch 2>/dev/null || echo " Torch cache: (empty)" && \
  du -sh /home/user/.cache 2>/dev/null || echo " Total cache: (empty)" && \
- echo "========== Models downloaded =========="

  # Copy application code
  COPY --chown=user:user . .

- RUN echo "Files in app directory:" && ls -la /home/user/app/ && \
  echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

  # Expose the port
  EXPOSE 7860

- # Health check
- HEALTHCHECK --interval=30s --timeout=30s --start-period=300s --retries=5 \
  CMD curl -f http://localhost:7860/ || exit 1

- # Run FastAPI server
- CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1", "--timeout-keep-alive", "300"]

+ # Hugging Face Spaces Dockerfile for Docling VLM Document Parser API
+ # GPU-accelerated document parsing with Docling + Qwen3-VL-8B via vLLM
+ # Build: v2.0.0 - Docling with VLM backend for superior accuracy

+ # Use vLLM base image with CUDA, PyTorch, and vLLM pre-installed
+ FROM vllm/vllm-openai:v0.14.1

  USER root

  poppler-utils \
  # Health checks
  curl \
  && fc-cache -fv && \
  rm -rf /var/lib/apt/lists/* && \
  echo "========== System dependencies installed =========="

  # Set environment variables
  ENV PYTHONUNBUFFERED=1 \
  PYTHONDONTWRITEBYTECODE=1 \
+ VLM_MODEL=Qwen/Qwen3-VL-8B-Instruct \
+ VLM_HOST=127.0.0.1 \
+ VLM_PORT=8000 \
+ VLM_GPU_MEMORY_UTILIZATION=0.70 \
+ VLM_MAX_MODEL_LEN=8192 \
+ IMAGES_SCALE=2.0 \
  MAX_FILE_SIZE_MB=1024 \
  HF_HOME=/home/user/.cache/huggingface \
  TORCH_HOME=/home/user/.cache/torch \
  XDG_CACHE_HOME=/home/user/.cache \
  HOME=/home/user \
+ PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH \
+ LD_LIBRARY_PATH=/home/user/.local/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH

  # Create cache directories with correct ownership
+ RUN echo "========== STEP 2: Creating cache directories ==========" && \
+ mkdir -p /home/user/.cache/huggingface \
  /home/user/.cache/torch \
  /home/user/app && \
+ chown -R user:user /home/user && \
+ echo "========== Cache directories created =========="

  # Switch to non-root user
  USER user

  COPY --chown=user:user requirements.txt .

  # Install Python dependencies
+ RUN echo "========== STEP 3: Installing Python dependencies ==========" && \
  pip install --user --upgrade pip && \
+ pip install --user nvidia-cudnn-cu12 && \
  pip install --user -r requirements.txt && \
  echo "Installed packages:" && \
+ pip list --user && \
  echo "========== Python dependencies installed =========="

+ # Pre-download Qwen3-VL-8B model for vLLM
+ RUN echo "========== STEP 4: Pre-downloading Qwen3-VL-8B model ==========" && \
+ python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-VL-8B-Instruct', local_dir='/home/user/.cache/huggingface/Qwen3-VL-8B-Instruct')" && \
+ echo "Model cache summary:" && \
+ du -sh /home/user/.cache/huggingface 2>/dev/null || echo " HF cache: (empty)" && \
+ echo "========== Qwen3-VL-8B model downloaded =========="
+
  # Pre-download Docling models
+ RUN echo "========== STEP 5: Pre-downloading Docling models ==========" && \
+ python -c "from docling.document_converter import DocumentConverter; print('Downloading Docling models...'); converter = DocumentConverter(); print('Done')" && \
  echo "Model cache summary:" && \
  du -sh /home/user/.cache/huggingface 2>/dev/null || echo " HF cache: (empty)" && \
  du -sh /home/user/.cache/torch 2>/dev/null || echo " Torch cache: (empty)" && \
  du -sh /home/user/.cache 2>/dev/null || echo " Total cache: (empty)" && \
+ echo "========== Docling models downloaded =========="

  # Copy application code
  COPY --chown=user:user . .

+ RUN echo "========== STEP 6: Finalizing build ==========" && \
+ chmod +x start.sh && \
+ echo "Files in app directory:" && ls -la /home/user/app/ && \
  echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

  # Expose the port
  EXPOSE 7860

+ # Health check (longer start-period for vLLM model loading)
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=600s --retries=5 \
  CMD curl -f http://localhost:7860/ || exit 1

+ # Override vLLM entrypoint and use our startup script
+ ENTRYPOINT []
+ CMD ["bash", "start.sh"]
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Docling Parser API
  emoji: 📄
  colorFrom: blue
  colorTo: green
@@ -10,24 +10,53 @@ license: mit
  suggested_hardware: a100-large
  ---

- # Docling Parser API

- A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [IBM's Docling](https://github.com/DS4SD/docling).

  ## Features

- - **PDF Parsing**: Extract text, tables, formulas, and images from PDFs
- - **Image OCR**: Process scanned documents and images
- - **Multiple Formats**: Output as markdown or JSON
- - **Table Detection**: Accurate table structure detection with TableFormer
- - **GPU Accelerated**: Uses CUDA for fast processing
  - **Image Extraction**: Extract and return all document images

  ## API Endpoints

  | Endpoint | Method | Description |
  | ------------ | ------ | ----------------------------------------- |
- | `/` | GET | Health check |
  | `/parse` | POST | Parse uploaded file (multipart/form-data) |
  | `/parse/url` | POST | Parse document from URL (JSON body) |

@@ -91,7 +120,7 @@ response = requests.post(

  result = response.json()
  if result["success"]:
- print(f"Parsed {result['pages_processed']} pages")
  print(result["markdown"])
  else:
  print(f"Error: {result['error']}")
@@ -140,10 +169,7 @@ if result["success"]:
  | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
  | file | File | Yes | - | PDF or image file |
  | output_format | string | No | `markdown` | `markdown` or `json` |
- | lang | string | No | `en` | OCR language code |
- | do_ocr | bool | No | `true` | Enable OCR for scanned documents |
- | table_mode | string | No | `accurate` | `accurate` (slow) or `fast` |
- | images_scale | float | No | `3.0` | Image resolution scale (higher = better) |
  | start_page | int | No | `0` | Starting page (0-indexed) |
  | end_page | int | No | `null` | Ending page (null = all pages) |
  | include_images | bool | No | `false` | Include extracted images in response |
@@ -154,10 +180,7 @@ if result["success"]:
  | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
  | url | string | Yes | - | URL to PDF or image |
  | output_format | string | No | `markdown` | `markdown` or `json` |
- | lang | string | No | `en` | OCR language code |
- | do_ocr | bool | No | `true` | Enable OCR for scanned documents |
- | table_mode | string | No | `accurate` | `accurate` (slow) or `fast` |
- | images_scale | float | No | `3.0` | Image resolution scale (higher = better) |
  | start_page | int | No | `0` | Starting page (0-indexed) |
  | end_page | int | No | `null` | Ending page (null = all pages) |
  | include_images | bool | No | `false` | Include extracted images in response |
@@ -173,7 +196,8 @@ if result["success"]:
  "image_count": 0,
  "error": null,
  "pages_processed": 20,
- "device_used": "cuda"
  }
  ```

@@ -187,13 +211,7 @@ if result["success"]:
  | error | string | Error message if failed |
  | pages_processed | int | Number of pages processed |
  | device_used | string | Device used for processing (cuda, mps, or cpu) |
-
- ## Table Detection Modes
-
- | Mode | Speed | Accuracy | Best For |
- | ---------- | ------ | -------- | ------------------------------------- |
- | `accurate` | Slower | Higher | Complex tables, forms, financial docs |
- | `fast` | Faster | Good | Simple tables, high-volume processing |

  ## Supported File Types

@@ -204,14 +222,16 @@ Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)

  ## Configuration

- | Environment Variable | Description | Default |
- | -------------------- | -------------------------------------- | ---------- |
- | `API_TOKEN` | **Required.** API authentication token | - |
- | `DO_OCR` | Enable OCR by default | `true` |
- | `TABLE_MODE` | Default table detection mode | `accurate` |
- | `IMAGES_SCALE` | Default image resolution scale | `3.0` |
- | `DEFAULT_LANG` | Default OCR language code | `en` |
- | `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |

  ## Logging

@@ -221,11 +241,13 @@ View logs in HuggingFace Space > Logs tab:
  2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
  2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
  2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
- 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Docling conversion completed in 27.23s
- 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pages processed: 20
  2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
  ```

  ## Credits

- Built with [Docling](https://github.com/DS4SD/docling) by IBM Research.

  ---
+ title: Docling VLM Parser API
  emoji: 📄
  colorFrom: blue
  colorTo: green

  suggested_hardware: a100-large
  ---

+ # Docling VLM Parser API

+ A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-8B](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.

  ## Features

+ - **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
+ - **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
+ - **VLM-Powered OCR**: Qwen3-VL-8B via vLLM replaces baseline RapidOCR for superior text accuracy
+ - **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
+ - **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
+ - **Handwriting Recognition**: Transcribe handwritten text via VLM
  - **Image Extraction**: Extract and return all document images
+ - **Multiple Formats**: Output as markdown or JSON
+ - **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)
+
29
+ ## Architecture
30
+
31
+ ```
32
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
33
+ β”‚ Docker Container β”‚
34
+ β”‚ (vllm/vllm-openai:v0.14.1) β”‚
35
+ β”‚ β”‚
36
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
37
+ β”‚ β”‚ vLLM Server :8000 β”‚ β”‚ FastAPI App :7860 β”‚ β”‚
38
+ β”‚ β”‚ Qwen3-VL-8B │◄───│ β”‚ β”‚
39
+ β”‚ β”‚ (GPU inference) β”‚ β”‚ Pass 1: Docling Standard β”‚ β”‚
40
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ - DocLayNet layout β”‚ β”‚
41
+ β”‚ β”‚ - TableFormer ACCURATE β”‚ β”‚
42
+ β”‚ β”‚ - RapidOCR baseline β”‚ β”‚
43
+ β”‚ β”‚ β”‚ β”‚
44
+ β”‚ β”‚ Pass 2: VLM OCR β”‚ β”‚
45
+ β”‚ β”‚ - Page images β†’ Qwen3-VL β”‚ β”‚
46
+ β”‚ β”‚ - OpenCV preprocessing β”‚ β”‚
47
+ β”‚ β”‚ β”‚ β”‚
48
+ β”‚ β”‚ Merge: β”‚ β”‚
49
+ β”‚ β”‚ - VLM text (primary) β”‚ β”‚
50
+ β”‚ β”‚ - TableFormer tables β”‚ β”‚
51
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
52
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
53
+ ```
54
 
55
  ## API Endpoints

  | Endpoint | Method | Description |
  | ------------ | ------ | ----------------------------------------- |
+ | `/` | GET | Health check (includes vLLM status) |
  | `/parse` | POST | Parse uploaded file (multipart/form-data) |
  | `/parse/url` | POST | Parse document from URL (JSON body) |
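
Both parsing endpoints require a `Bearer` token. The core of such a check can be sketched as a constant-time comparison (illustrative only; the service's actual check is the `verify_token` dependency in `app.py`, and `check_bearer` is a hypothetical name):

```python
import hmac
import os


def check_bearer(auth_header: str) -> bool:
    """Validate an `Authorization: Bearer <token>` header against API_TOKEN.

    hmac.compare_digest avoids leaking the token length/prefix via timing.
    """
    expected = os.environ.get("API_TOKEN", "")
    scheme, _, token = auth_header.partition(" ")
    if scheme != "Bearer" or not expected:
        return False
    return hmac.compare_digest(token, expected)
```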

  result = response.json()
  if result["success"]:
+ print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
  print(result["markdown"])
  else:
  print(f"Error: {result['error']}")

  | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
  | file | File | Yes | - | PDF or image file |
  | output_format | string | No | `markdown` | `markdown` or `json` |
+ | images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
  | start_page | int | No | `0` | Starting page (0-indexed) |
  | end_page | int | No | `null` | Ending page (null = all pages) |
  | include_images | bool | No | `false` | Include extracted images in response |

  | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
  | url | string | Yes | - | URL to PDF or image |
  | output_format | string | No | `markdown` | `markdown` or `json` |
+ | images_scale | float | No | `2.0` | Image resolution scale (higher = better) |
  | start_page | int | No | `0` | Starting page (0-indexed) |
  | end_page | int | No | `null` | Ending page (null = all pages) |
  | include_images | bool | No | `false` | Include extracted images in response |

  "image_count": 0,
  "error": null,
  "pages_processed": 20,
+ "device_used": "cuda",
+ "vlm_model": "Qwen/Qwen3-VL-8B-Instruct"
  }
  ```

  | error | string | Error message if failed |
  | pages_processed | int | Number of pages processed |
  | device_used | string | Device used for processing (cuda, mps, or cpu) |
+ | vlm_model | string | VLM model used for OCR (e.g. Qwen3-VL-8B) |

  ## Supported File Types

  ## Configuration

+ | Environment Variable | Description | Default |
+ | ---------------------------- | -------------------------------------- | --------------------------- |
+ | `API_TOKEN` | **Required.** API authentication token | - |
+ | `VLM_MODEL` | VLM model for OCR | `Qwen/Qwen3-VL-8B-Instruct` |
+ | `VLM_HOST` | vLLM server host | `127.0.0.1` |
+ | `VLM_PORT` | vLLM server port | `8000` |
+ | `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM | `0.70` |
+ | `VLM_MAX_MODEL_LEN` | Max context length for VLM | `8192` |
+ | `IMAGES_SCALE` | Default image resolution scale | `2.0` |
+ | `MAX_FILE_SIZE_MB` | Maximum upload size in MB | `1024` |

  ## Logging

  2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
  2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
  2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
+ 2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
+ 2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
+ 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
+ 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
  2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
  ```

  ## Credits

+ Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) by Qwen Team, and [vLLM](https://github.com/vllm-project/vllm).
app.py CHANGED
@@ -1,12 +1,16 @@
1
  """
2
- Docling Document Parser API
3
 
4
- A FastAPI service that wraps IBM's Docling library for parsing PDFs and images
5
- into LLM-ready markdown/JSON formats.
 
 
6
 
7
  Features:
8
  - GPU-accelerated parsing with CUDA support
9
- - Table structure detection with TableFormer
 
 
10
  - Image extraction with configurable resolution
11
  - Automatic page chunking for large PDFs
12
  """
@@ -24,27 +28,31 @@ import socket
24
  import tempfile
25
  import time
26
  import zipfile
 
27
  from pathlib import Path
28
  from typing import BinaryIO, Optional, Union
29
  from urllib.parse import urlparse
30
  from uuid import uuid4
31
 
 
32
  import httpx
33
  import torch
34
  from fastapi import Depends, FastAPI, File, Form, HTTPException, UploadFile
35
  from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
 
36
  from pydantic import BaseModel
37
 
38
  # Docling imports
39
- from docling.document_converter import DocumentConverter, PdfFormatOption
40
  from docling.datamodel.base_models import InputFormat
 
41
  from docling.datamodel.pipeline_options import (
 
42
  PdfPipelineOptions,
 
43
  TableFormerMode,
44
- AcceleratorOptions,
45
  )
46
- from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
47
- from docling.datamodel.document import PictureItem
48
 
49
  # Configure logging
50
  logging.basicConfig(
@@ -85,120 +93,12 @@ def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security))
85
  return token
86
 
87
 
88
- # Global converter instance (initialized on startup)
89
- _converter: Optional[DocumentConverter] = None
90
-
91
-
92
- def _get_device() -> str:
93
- """Get the best available device for processing."""
94
- if torch.cuda.is_available():
95
- return "cuda"
96
- elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
97
- return "mps"
98
- return "cpu"
99
-
100
-
101
- def _create_converter(
102
- do_ocr: bool = True,
103
- table_mode: str = "accurate",
104
- images_scale: float = 3.0,
105
- ) -> DocumentConverter:
106
- """Create a Docling DocumentConverter with specified options."""
107
- device = _get_device()
108
-     logger.info(f"Creating converter with device: {device}")
-
-     pipeline_options = PdfPipelineOptions()
-     pipeline_options.do_ocr = do_ocr
-     pipeline_options.do_table_structure = True
-
-     # Set table structure mode
-     if table_mode == "accurate":
-         pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
-     else:
-         pipeline_options.table_structure_options.mode = TableFormerMode.FAST
-
-     # Enable image extraction
-     pipeline_options.generate_picture_images = True
-     pipeline_options.generate_page_images = True
-     pipeline_options.images_scale = images_scale
-
-     # GPU/CPU configuration
-     pipeline_options.accelerator_options = AcceleratorOptions(
-         device=device,
-         num_threads=0 if device == "cuda" else 4,
-     )
-
-     converter = DocumentConverter(
-         format_options={
-             InputFormat.PDF: PdfFormatOption(
-                 pipeline_options=pipeline_options,
-                 backend=DoclingParseV4DocumentBackend,
-             )
-         }
-     )
-
-     return converter
-
-
- def _get_converter() -> DocumentConverter:
-     """Get or create the global converter instance."""
-     global _converter
-     if _converter is None:
-         _converter = _create_converter()
-     return _converter
-
-
- from contextlib import asynccontextmanager
-
-
- @asynccontextmanager
- async def lifespan(app: FastAPI):
-     """Startup: initialize Docling converter and check GPU."""
-     logger.info("=" * 60)
-     logger.info("Starting Docling Parser API v1.0.0...")
-
-     device = _get_device()
-     logger.info(f"Device: {device}")
-
-     if device == "cuda":
-         logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
-         logger.info(f"CUDA Version: {torch.version.cuda}")
-         logger.info(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
-
-     logger.info(f"Default OCR: {DO_OCR}")
-     logger.info(f"Default table mode: {TABLE_MODE}")
-     logger.info(f"Default images scale: {IMAGES_SCALE}")
-     logger.info(f"Default language: {DEFAULT_LANG}")
-     logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
-
-     # Pre-initialize converter to load models
-     logger.info("Pre-loading Docling models...")
-     try:
-         _get_converter()
-         logger.info("Models loaded successfully")
-     except Exception as e:
-         logger.warning(f"Failed to pre-load models: {e}")
-
-     logger.info("=" * 60)
-     logger.info("Docling Parser API ready to accept requests")
-     logger.info("=" * 60)
-     yield
-     logger.info("Shutting down Docling Parser API...")
-
-
- app = FastAPI(
-     title="Docling Parser API",
-     description="Transform PDFs and images into markdown/JSON using IBM's Docling",
-     version="1.0.0",
-     lifespan=lifespan,
- )
-
- # Configuration from environment
- DO_OCR = os.getenv("DO_OCR", "true").lower() == "true"
- TABLE_MODE = os.getenv("TABLE_MODE", "accurate")  # accurate or fast
- IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "3.0"))  # High res for A100 80GB VRAM
  MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
- DEFAULT_LANG = os.getenv("DEFAULT_LANG", "en")  # OCR language code
  MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024

  # Blocked hostnames for SSRF protection
@@ -211,6 +111,18 @@ BLOCKED_HOSTNAMES = {
      "fd00:ec2::254",
  }


  def _validate_url(url: str) -> None:
      """Validate URL to prevent SSRF attacks."""
@@ -283,6 +195,11 @@ def _save_downloaded_content(input_path: Path, content: bytes) -> None:
          f.write(content)


  class ParseResponse(BaseModel):
      """Response model for document parsing."""

@@ -294,6 +211,7 @@ class ParseResponse(BaseModel):
      error: Optional[str] = None
      pages_processed: int = 0
      device_used: Optional[str] = None


  class HealthResponse(BaseModel):
@@ -303,9 +221,9 @@
      version: str
      device: str
      gpu_name: Optional[str] = None
-     do_ocr: bool
-     table_mode: str
-     images_scale: float


  class URLParseRequest(BaseModel):
@@ -313,130 +231,421 @@ class URLParseRequest(BaseModel):

      url: str
      output_format: str = "markdown"
-     lang: str = DEFAULT_LANG  # OCR language code
-     do_ocr: Optional[bool] = None
-     table_mode: Optional[str] = None
      images_scale: Optional[float] = None
      start_page: int = 0  # Starting page (0-indexed)
      end_page: Optional[int] = None  # Ending page (None = all pages)
      include_images: bool = False

  def _convert_document(
      input_path: Path,
      output_dir: Path,
-     do_ocr: bool,
-     table_mode: str,
      images_scale: float,
      include_images: bool,
      request_id: str,
      start_page: int = 0,
      end_page: Optional[int] = None,
-     lang: str = "en",
- ) -> tuple[str, Optional[list], int, int]:
      """
-     Convert a document using Docling.

-     Returns:
-         Tuple of (markdown_content, json_content, pages_processed, image_count)
      """
-     logger.info(f"[{request_id}] Creating converter with OCR={do_ocr}, table_mode={table_mode}, lang={lang}")
-     if start_page > 0 or end_page is not None:
-         logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")
-
-     # Create converter with specified options
-     converter = _create_converter(
-         do_ocr=do_ocr,
-         table_mode=table_mode,
-         images_scale=images_scale,
      )

-     logger.info(f"[{request_id}] Starting Docling conversion...")
      start_time = time.time()
-
      result = converter.convert(input_path)
      doc = result.document

-     conversion_time = time.time() - start_time
-     logger.info(f"[{request_id}] Docling conversion completed in {conversion_time:.2f}s")

-     # Build page labels map for better page numbering
-     page_labels = {}
-     for element, _ in doc.iterate_items():
-         if element.label in ["page_footer", "metadata", "text"]:
-             raw_text = getattr(element, "text", "").strip()
-             if re.fullmatch(r"([0-9]+|[ivxIVX]+)", raw_text):
-                 if element.prov and element.prov[0].page_no not in page_labels:
-                     page_labels[element.prov[0].page_no] = raw_text
-
-     # Build markdown output
-     md_body = []
-     current_page_idx = -1
-     pages_seen = set()
      image_count = 0
      image_dir = output_dir / "images"

      if include_images:
          image_dir.mkdir(parents=True, exist_ok=True)

-     for element, _level in doc.iterate_items():
-         # Track page breaks
-         if element.prov and element.prov[0].page_no != current_page_idx:
-             current_page_idx = element.prov[0].page_no
-
-             # Skip pages outside the requested range
-             if current_page_idx < start_page:
-                 continue
-             if end_page is not None and current_page_idx > end_page:
-                 continue
-
-             pages_seen.add(current_page_idx)
-             label = page_labels.get(current_page_idx, str(current_page_idx + 1))
-             md_body.append(f"\n\n<!-- Page {label} -->\n\n")
-
-         # Skip elements outside the requested page range
-         if current_page_idx < start_page:
-             continue
-         if end_page is not None and current_page_idx > end_page:
-             continue
-
-         element_text = getattr(element, "text", "").strip()

-         # Skip page number elements
-         if element_text and element_text == page_labels.get(current_page_idx):
-             continue

-         # Handle images
-         if isinstance(element, PictureItem):
-             if include_images and element.image and element.image.pil_image:
-                 image_id = element.self_ref.split("/")[-1]
-                 label = page_labels.get(current_page_idx, str(current_page_idx + 1))
-                 image_name = f"page_{label}_{image_id}.png"
-                 image_name = re.sub(r'[\\/*?:"<>|]', "", image_name)
-                 image_path = image_dir / image_name

                  try:
-                     element.image.pil_image.save(image_path, format="PNG")
-                     md_body.append(f"![{image_name}](images/{image_name})\n\n")
-                     image_count += 1
-                 except Exception as e:
-                     logger.warning(f"[{request_id}] Failed to save image {image_name}: {e}")
          else:
-             # Export element to markdown
-             try:
-                 md_body.append(element.export_to_markdown(doc=doc))
-             except Exception:
-                 if element_text:
-                     md_body.append(element_text + "\n\n")

-     markdown_content = "".join(md_body)
      pages_processed = len(pages_seen)

-     logger.info(f"[{request_id}] Extracted {len(markdown_content)} chars, {pages_processed} pages, {image_count} images")

      return markdown_content, None, pages_processed, image_count


  def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
      """Create a zip file from extracted images."""
      image_dir = output_dir / "images"
@@ -462,6 +671,75 @@ def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
      return base64.b64encode(zip_buffer.getvalue()).decode("utf-8"), image_count


  @app.get("/", response_model=HealthResponse)
  async def health_check() -> HealthResponse:
      """Health check endpoint."""
@@ -470,13 +748,22 @@ async def health_check() -> HealthResponse:
      if device == "cuda":
          gpu_name = torch.cuda.get_device_name(0)

      return HealthResponse(
          status="healthy",
-         version="1.0.0",
          device=device,
-         gpu_name=gpu_name,
-         do_ocr=DO_OCR,
-         table_mode=TABLE_MODE,
          images_scale=IMAGES_SCALE,
      )

@@ -485,10 +772,7 @@ async def health_check() -> HealthResponse:
  async def parse_document(
      file: UploadFile = File(..., description="PDF or image file to parse"),
      output_format: str = Form(default="markdown", description="Output format: markdown or json"),
-     lang: str = Form(default=DEFAULT_LANG, description="OCR language code"),
-     do_ocr: Optional[bool] = Form(default=None, description="Enable OCR (default: true)"),
-     table_mode: Optional[str] = Form(default=None, description="Table detection mode: accurate or fast"),
-     images_scale: Optional[float] = Form(default=None, description="Image resolution scale (default: 3.0)"),
      start_page: int = Form(default=0, description="Starting page (0-indexed)"),
      end_page: Optional[int] = Form(default=None, description="Ending page (None = all pages)"),
      include_images: bool = Form(default=False, description="Include extracted images in response"),
@@ -497,6 +781,11 @@ async def parse_document(
      """
      Parse a document file (PDF or image) and return extracted content.

      Supports:
      - PDF files (.pdf)
      - Images (.png, .jpg, .jpeg, .tiff, .bmp)
@@ -506,9 +795,16 @@

      logger.info(f"[{request_id}] {'='*50}")
      logger.info(f"[{request_id}] New parse request received")
-     logger.info(f"[{request_id}] Filename: {file.filename}")
      logger.info(f"[{request_id}] Output format: {output_format}")

      # Validate file size
      file.file.seek(0, 2)
      file_size = file.file.tell()
@@ -535,39 +831,34 @@
      )

      # Use defaults if not specified
-     use_ocr = do_ocr if do_ocr is not None else DO_OCR
-     use_table_mode = table_mode if table_mode else TABLE_MODE
-     use_images_scale = images_scale if images_scale else IMAGES_SCALE

-     logger.info(f"[{request_id}] OCR: {use_ocr}, Table mode: {use_table_mode}, Images scale: {use_images_scale}, Lang: {lang}")
      logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")

      temp_dir = tempfile.mkdtemp()
-     logger.info(f"[{request_id}] Created temp directory: {temp_dir}")

      try:
          # Save uploaded file
          input_path = Path(temp_dir) / f"input{file_ext}"
          await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
-         logger.info(f"[{request_id}] Saved file to: {input_path}")

          # Create output directory
          output_dir = Path(temp_dir) / "output"
          output_dir.mkdir(exist_ok=True)

-         # Convert document
          markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
              _convert_document,
              input_path,
              output_dir,
-             use_ocr,
-             use_table_mode,
              use_images_scale,
              include_images,
              request_id,
              start_page,
              end_page,
-             lang,
          )

          # Create images zip if requested
@@ -593,21 +884,22 @@
              image_count=image_count,
              pages_processed=pages_processed,
              device_used=_get_device(),
          )

      except Exception as e:
          total_duration = time.time() - start_time
          logger.error(f"[{request_id}] {'='*50}")
          logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
-         logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
          logger.error(f"[{request_id}] {'='*50}")
          return ParseResponse(
              success=False,
-             error=f"{type(e).__name__}: {str(e)}",
          )
      finally:
          shutil.rmtree(temp_dir, ignore_errors=True)
-         logger.info(f"[{request_id}] Cleaned up temp directory")


  @app.post("/parse/url", response_model=ParseResponse)
@@ -618,7 +910,10 @@ async def parse_document_from_url(
      """
      Parse a document from a URL.

-     Downloads the file and processes it through Docling.
      """
      request_id = str(uuid4())[:8]
      start_time = time.time()
@@ -628,19 +923,25 @@
      logger.info(f"[{request_id}] URL: {request.url}")
      logger.info(f"[{request_id}] Output format: {request.output_format}")

      # Validate URL
      logger.info(f"[{request_id}] Validating URL...")
      _validate_url(request.url)
      logger.info(f"[{request_id}] URL validation passed")

      temp_dir = tempfile.mkdtemp()
-     logger.info(f"[{request_id}] Created temp directory: {temp_dir}")

      try:
          # Download file
          logger.info(f"[{request_id}] Downloading file from URL...")
          download_start = time.time()
-         async with httpx.AsyncClient(timeout=60.0, follow_redirects=True) as client:
              response = await client.get(request.url)
              response.raise_for_status()
          download_duration = time.time() - download_start
@@ -662,7 +963,9 @@
          )

          if len(response.content) > MAX_FILE_SIZE_BYTES:
-             logger.error(f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB")
              raise HTTPException(
                  status_code=413,
                  detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
@@ -671,33 +974,28 @@
          # Save downloaded file
          input_path = Path(temp_dir) / f"input{file_ext}"
          await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
-         logger.info(f"[{request_id}] Saved file to: {input_path}")

          # Create output directory
          output_dir = Path(temp_dir) / "output"
          output_dir.mkdir(exist_ok=True)

          # Use defaults if not specified
-         use_ocr = request.do_ocr if request.do_ocr is not None else DO_OCR
-         use_table_mode = request.table_mode if request.table_mode else TABLE_MODE
-         use_images_scale = request.images_scale if request.images_scale else IMAGES_SCALE

-         logger.info(f"[{request_id}] OCR: {use_ocr}, Table mode: {use_table_mode}, Images scale: {use_images_scale}, Lang: {request.lang}")
          logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")

-         # Convert document
          markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
              _convert_document,
              input_path,
              output_dir,
-             use_ocr,
-             use_table_mode,
              use_images_scale,
              request.include_images,
              request_id,
              request.start_page,
              request.end_page,
-             request.lang,
          )

          # Create images zip if requested
@@ -723,6 +1021,7 @@
              image_count=image_count,
              pages_processed=pages_processed,
              device_used=_get_device(),
          )

      except httpx.HTTPError as e:
@@ -730,21 +1029,21 @@ async def parse_document_from_url(
          logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
          return ParseResponse(
              success=False,
-             error=f"Failed to download file from URL: {str(e)}",
          )
      except Exception as e:
          total_duration = time.time() - start_time
          logger.error(f"[{request_id}] {'='*50}")
          logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
-         logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
          logger.error(f"[{request_id}] {'='*50}")
          return ParseResponse(
              success=False,
-             error=str(e),
          )
      finally:
          shutil.rmtree(temp_dir, ignore_errors=True)
-         logger.info(f"[{request_id}] Cleaned up temp directory")


  if __name__ == "__main__":
  """
+ Docling VLM Parser API v2.0.0
+
+ A FastAPI service that uses a HYBRID two-pass approach for document parsing:
+     Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR) for document structure
+     Pass 2: Qwen3-VL-8B via vLLM for enhanced text recognition
+     Merge: TableFormer tables preserved, VLM text replaces RapidOCR text

  Features:
  - GPU-accelerated parsing with CUDA support
+ - TableFormer ACCURATE for table structure detection
+ - Qwen3-VL via vLLM for superior OCR accuracy
+ - OpenCV image preprocessing (denoise, CLAHE contrast enhancement)
  - Image extraction with configurable resolution
  - Automatic page chunking for large PDFs
  """
  import tempfile
  import time
  import zipfile
+ from contextlib import asynccontextmanager
  from pathlib import Path
  from typing import BinaryIO, Optional, Union
  from urllib.parse import urlparse
  from uuid import uuid4

+ import cv2
  import httpx
  import torch
  from fastapi import Depends, FastAPI, File, Form, HTTPException, UploadFile
  from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
+ from pdf2image import convert_from_path
  from pydantic import BaseModel

  # Docling imports
+ from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
  from docling.datamodel.base_models import InputFormat
+ from docling.datamodel.document import PictureItem, TableItem
  from docling.datamodel.pipeline_options import (
+     AcceleratorOptions,
      PdfPipelineOptions,
+     RapidOcrOptions,
      TableFormerMode,
  )
+ from docling.document_converter import DocumentConverter, PdfFormatOption


  # Configure logging
  logging.basicConfig(
      return token


+ # VLM Configuration
+ VLM_MODEL = os.getenv("VLM_MODEL", "Qwen/Qwen3-VL-8B-Instruct")
+ VLM_HOST = os.getenv("VLM_HOST", "127.0.0.1")
+ VLM_PORT = os.getenv("VLM_PORT", "8000")
+ IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "2.0"))
  MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
  MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024

  # Blocked hostnames for SSRF protection
      "fd00:ec2::254",
  }

+ # Global converter instance (initialized on startup)
+ _converter: Optional[DocumentConverter] = None
+
+
+ def _get_device() -> str:
+     """Get the best available device for processing."""
+     if torch.cuda.is_available():
+         return "cuda"
+     elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+         return "mps"
+     return "cpu"
+
  def _validate_url(url: str) -> None:
      """Validate URL to prevent SSRF attacks."""

          f.write(content)


+ # ---------------------------------------------------------------------------
+ # Pydantic Models
+ # ---------------------------------------------------------------------------


  class ParseResponse(BaseModel):
      """Response model for document parsing."""

      error: Optional[str] = None
      pages_processed: int = 0
      device_used: Optional[str] = None
+     vlm_model: Optional[str] = None


  class HealthResponse(BaseModel):

      version: str
      device: str
      gpu_name: Optional[str] = None
+     vlm_model: str = ""
+     vlm_status: str = "unknown"
+     images_scale: float = 2.0


  class URLParseRequest(BaseModel):

      url: str
      output_format: str = "markdown"
      images_scale: Optional[float] = None
      start_page: int = 0  # Starting page (0-indexed)
      end_page: Optional[int] = None  # Ending page (None = all pages)
      include_images: bool = False

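For reference, a `/parse/url` request body matching `URLParseRequest` and its defaults can be sketched as plain JSON; the URL below is a placeholder, and unset optionals (`images_scale`, `end_page`) fall back to the server-side defaults:

```python
import json

# Example /parse/url request body mirroring the URLParseRequest fields.
# The URL is a placeholder, not a real document.
body = {
    "url": "https://example.com/sample.pdf",
    "output_format": "markdown",
    "start_page": 0,
    "end_page": None,
    "include_images": False,
}
print(json.dumps(body))
```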
+ # ---------------------------------------------------------------------------
+ # OpenCV Image Preprocessing
+ # ---------------------------------------------------------------------------
+
+
+ def _preprocess_image_for_ocr(image_path: str) -> str:
+     """Enhance image quality for better OCR accuracy.
+
+     Applies denoising and CLAHE contrast enhancement.
+     Returns the path to the preprocessed image (same path, overwritten).
+     """
+     img = cv2.imread(image_path)
+     if img is None:
+         return image_path
+
+     # Denoise
+     img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
+
+     # CLAHE contrast enhancement on L channel
+     lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
+     l, a, b = cv2.split(lab)
+     clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
+     l = clahe.apply(l)
+     lab = cv2.merge([l, a, b])
+     img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
+
+     cv2.imwrite(image_path, img)
+     return image_path
+
+ # ---------------------------------------------------------------------------
+ # VLM OCR (Pass 2)
+ # ---------------------------------------------------------------------------
+
+
+ def _vlm_ocr_page(page_image_bytes: bytes) -> str:
+     """Send a page image to Qwen3-VL via vLLM for text extraction.
+
+     Args:
+         page_image_bytes: PNG image bytes of the page
+
+     Returns:
+         Extracted markdown text from the page
+     """
+     b64_image = base64.b64encode(page_image_bytes).decode("utf-8")
+
+     response = httpx.post(
+         f"http://{VLM_HOST}:{VLM_PORT}/v1/chat/completions",
+         json={
+             "model": VLM_MODEL,
+             "messages": [
+                 {
+                     "role": "user",
+                     "content": [
+                         {
+                             "type": "image_url",
+                             "image_url": {"url": f"data:image/png;base64,{b64_image}"},
+                         },
+                         {
+                             "type": "text",
+                             "text": (
+                                 "OCR this document page to markdown. "
+                                 "Extract ALL text exactly as written, preserving headings, lists, and paragraphs. "
+                                 "For tables, output them as markdown tables. "
+                                 "For handwritten text, transcribe as accurately as possible. "
+                                 "Return ONLY the extracted content, no explanations."
+                             ),
+                         },
+                     ],
+                 }
+             ],
+             "max_tokens": 8192,
+             "temperature": 0.1,
+             "skip_special_tokens": True,
+         },
+         timeout=120.0,
+     )
+     response.raise_for_status()
+     result = response.json()
+     choices = result.get("choices")
+     if not choices:
+         raise ValueError("vLLM returned no choices")
+     content = choices[0].get("message", {}).get("content")
+     if content is None:
+         raise ValueError("vLLM response missing content")
+     return content
+
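The request `_vlm_ocr_page` issues follows the OpenAI-compatible chat schema that vLLM serves at `/v1/chat/completions`. A minimal sketch of building (not sending) such a payload; `build_ocr_payload` is an illustrative helper, not part of app.py:

```python
import base64
import json

def build_ocr_payload(png_bytes: bytes, model: str = "Qwen/Qwen3-VL-8B-Instruct") -> dict:
    """Build an OpenAI-compatible chat payload with one inline base64 image."""
    b64 = base64.b64encode(png_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": "OCR this document page to markdown."},
                ],
            }
        ],
        "max_tokens": 8192,
        "temperature": 0.1,
    }

# Placeholder bytes stand in for a real PNG; the payload shape is what matters.
payload = build_ocr_payload(b"not-a-real-png")
print(json.dumps(payload)[:60])
```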
+ # ---------------------------------------------------------------------------
+ # Table Extraction Helper
+ # ---------------------------------------------------------------------------
+
+
+ def _extract_table_markdowns(doc) -> dict:
+     """Extract table markdown from Docling document, keyed by page number."""
+     tables_by_page: dict[int, list[str]] = {}
+     for element, _ in doc.iterate_items():
+         if isinstance(element, TableItem):
+             page_no = element.prov[0].page_no if element.prov else -1
+             table_md = element.export_to_markdown(doc=doc)
+             if page_no not in tables_by_page:
+                 tables_by_page[page_no] = []
+             tables_by_page[page_no].append(table_md)
+     return tables_by_page
+
+
+ # ---------------------------------------------------------------------------
+ # Merge: VLM Text + TableFormer Tables
+ # ---------------------------------------------------------------------------
+
+
+ def _merge_vlm_with_tables(vlm_text: str, table_markdowns: list) -> str:
+     """Replace VLM's table sections with TableFormer's more accurate tables.
+
+     Detects markdown table patterns (lines with |...|) in VLM output
+     and replaces them with TableFormer output.
+     """
+     if not table_markdowns:
+         return vlm_text
+
+     # Pattern: consecutive lines that look like markdown tables.
+     # A markdown table has lines starting and ending with |
+     table_pattern = re.compile(r"((?:^\|[^\n]+\|$\n?)+)", re.MULTILINE)
+
+     vlm_table_count = len(table_pattern.findall(vlm_text))
+     if vlm_table_count != len(table_markdowns):
+         logger.warning(
+             f"Table count mismatch: VLM={vlm_table_count}, TableFormer={len(table_markdowns)}. "
+             f"Positional replacement may be imprecise."
+         )
+
+     table_idx = 0
+
+     def replace_table(match):
+         nonlocal table_idx
+         if table_idx < len(table_markdowns):
+             replacement = table_markdowns[table_idx]
+             table_idx += 1
+             return replacement.strip() + "\n"
+         return match.group(0)
+
+     result = table_pattern.sub(replace_table, vlm_text)
+
+     # If there are remaining TableFormer tables not matched, append them
+     while table_idx < len(table_markdowns):
+         result += "\n\n" + table_markdowns[table_idx].strip() + "\n"
+         table_idx += 1
+
+     return result
+
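The positional replacement in `_merge_vlm_with_tables` can be exercised in isolation with the same regex; the invoice text and tables below are made-up samples, not app.py output:

```python
import re

# Same pattern as _merge_vlm_with_tables: a run of consecutive |...| lines.
table_pattern = re.compile(r"((?:^\|[^\n]+\|$\n?)+)", re.MULTILINE)

# Made-up VLM output containing one (slightly wrong) table...
vlm_text = (
    "# Invoice\n"
    "Billed to ACME Corp.\n"
    "| Itm | Qty |\n"
    "| --- | --- |\n"
    "| Widget | 2 |\n"
    "Thank you.\n"
)
# ...and the TableFormer rendering that should replace it.
tableformer_tables = ["| Item | Qty |\n| --- | --- |\n| Widget | 2 |"]

table_idx = 0

def replace_table(match):
    """Swap the next TableFormer table in for the matched VLM table run."""
    global table_idx
    if table_idx < len(tableformer_tables):
        out = tableformer_tables[table_idx].strip() + "\n"
        table_idx += 1
        return out
    return match.group(0)

merged = table_pattern.sub(replace_table, vlm_text)
print(merged)
```

The surrounding prose ("Billed to ACME Corp.", "Thank you.") survives untouched; only the table run is swapped.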
+ # ---------------------------------------------------------------------------
+ # PDF to Page Images
+ # ---------------------------------------------------------------------------
+
+
+ def _pdf_to_page_images(
+     input_path: Path, start_page: int = 0, end_page: Optional[int] = None
+ ) -> list:
+     """Convert PDF pages to PNG image bytes using pdf2image.
+
+     Processes one page at a time to avoid loading all pages into memory.
+     Returns list of (page_no, png_bytes) tuples.
+     """
+     page_images: list[tuple[int, bytes]] = []
+
+     try:
+         # Determine total page count first
+         from pdf2image.pdf2image import pdfinfo_from_path
+
+         info = pdfinfo_from_path(str(input_path))
+         total_pages = info["Pages"]
+         last_page = min(end_page + 1, total_pages) if end_page is not None else total_pages
+
+         for i in range(start_page, last_page):
+             # Convert one page at a time (pdf2image is 1-indexed)
+             images = convert_from_path(
+                 str(input_path), dpi=300, first_page=i + 1, last_page=i + 1
+             )
+             if not images:
+                 continue
+             img = images[0]
+             # Save to temp file for OpenCV preprocessing
+             with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+                 tmp_path = tmp.name
+                 img.save(tmp_path, format="PNG")
+             try:
+                 _preprocess_image_for_ocr(tmp_path)
+                 with open(tmp_path, "rb") as f:
+                     page_images.append((i, f.read()))
+             finally:
+                 os.unlink(tmp_path)
+     except Exception as e:
+         # Fallback: log a warning; the caller handles an empty list
+         logger.warning(f"pdf2image failed, VLM OCR may be limited: {e}")
+
+     return page_images
+
+ # ---------------------------------------------------------------------------
+ # Docling Converter (Pass 1)
+ # ---------------------------------------------------------------------------
+
+
+ def _create_converter(images_scale: float = 2.0) -> DocumentConverter:
+     """Create a Docling converter with the Standard Pipeline.
+
+     Uses DocLayNet (layout) + TableFormer ACCURATE (tables) + RapidOCR (baseline text).
+     """
+     device = _get_device()
+     logger.info(f"Creating converter with device: {device}")
+
+     pipeline_options = PdfPipelineOptions()
+     pipeline_options.do_ocr = True
+     pipeline_options.do_table_structure = True
+     pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
+     pipeline_options.table_structure_options.do_cell_matching = True
+
+     # Use RapidOCR as baseline (VLM will enhance text in pass 2)
+     pipeline_options.ocr_options = RapidOcrOptions()
+     pipeline_options.ocr_options.force_full_page_ocr = True
+
+     # Enable page image generation (needed for VLM pass)
+     pipeline_options.generate_page_images = True
+     pipeline_options.images_scale = images_scale
+
+     # Also enable picture image extraction
+     pipeline_options.generate_picture_images = True
+
+     pipeline_options.accelerator_options = AcceleratorOptions(
+         device=device,
+         num_threads=0 if device == "cuda" else 4,
+     )
+
+     converter = DocumentConverter(
+         format_options={
+             InputFormat.PDF: PdfFormatOption(
+                 pipeline_options=pipeline_options,
+                 backend=DoclingParseV4DocumentBackend,
+             )
+         }
+     )
+     return converter
+
+
+ def _get_converter() -> DocumentConverter:
+     """Get or create the global converter instance."""
+     global _converter
+     if _converter is None:
+         _converter = _create_converter(images_scale=IMAGES_SCALE)
+     return _converter
+
+
+ # ---------------------------------------------------------------------------
+ # Hybrid Conversion (Pass 1 + Pass 2 + Merge)
+ # ---------------------------------------------------------------------------
+
+
  def _convert_document(
499
  input_path: Path,
500
  output_dir: Path,
 
 
501
  images_scale: float,
502
  include_images: bool,
503
  request_id: str,
504
  start_page: int = 0,
505
  end_page: Optional[int] = None,
506
+ ) -> tuple:
 
507
  """
508
+ Hybrid conversion: TableFormer for tables + Qwen3-VL for text.
509
 
510
+ Pass 1: Docling Standard Pipeline -> document structure + tables
511
+ Pass 2: VLM OCR -> enhanced text recognition per page
512
+ Merge: TableFormer tables + VLM text
513
+
514
+ Returns: (markdown_content, json_content, pages_processed, image_count)
515
  """
516
+ # PASS 1: Docling Standard Pipeline (structure + tables)
517
+ logger.info(
518
+ f"[{request_id}] Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)"
 
 
 
 
 
 
519
  )
520
+ converter = _get_converter()
521
 
 
522
  start_time = time.time()
 
523
  result = converter.convert(input_path)
524
  doc = result.document
525
+ if doc is None:
526
+ raise ValueError(
527
+ f"Docling failed to parse document (status: {getattr(result, 'status', 'unknown')})"
528
+ )
529
+ pass1_time = time.time() - start_time
530
+ logger.info(f"[{request_id}] Pass 1 completed in {pass1_time:.2f}s")
531
+
532
+ # Extract TableFormer tables (keyed by page number)
533
+ tables_by_page = _extract_table_markdowns(doc)
534
+ total_tables = sum(len(v) for v in tables_by_page.values())
535
+ logger.info(f"[{request_id}] TableFormer detected {total_tables} tables")
536
+
537
+ # PASS 2: VLM OCR (enhanced text per page)
538
+ logger.info(f"[{request_id}] Pass 2: VLM OCR via Qwen3-VL ({VLM_MODEL})")
539
+
540
+ # Get page images for VLM
541
+ page_images = _pdf_to_page_images(input_path, start_page, end_page)
542
+
543
+ if not page_images:
544
+ # Fallback: use Docling's markdown directly if no page images
545
+ logger.warning(f"[{request_id}] No page images available, using Docling output only")
546
+ markdown_content = doc.export_to_markdown()
547
+ pages_processed = len(
548
+ set(e.prov[0].page_no for e, _ in doc.iterate_items() if e.prov)
549
+ )
550
+ return markdown_content, None, pages_processed, 0
551
+
552
+ vlm_page_texts: dict[int, Optional[str]] = {}
553
+ vlm_start = time.time()
554
+ for page_no, page_bytes in page_images:
555
+ try:
556
+ vlm_text = _vlm_ocr_page(page_bytes)
557
+ vlm_page_texts[page_no] = vlm_text
558
+ logger.info(
559
+ f"[{request_id}] VLM processed page {page_no + 1} ({len(vlm_text)} chars)"
560
+ )
561
+ except Exception as e:
562
+ logger.warning(
563
+ f"[{request_id}] VLM failed on page {page_no + 1}: {e}, using Docling text"
564
+ )
565
+ # Fallback to Docling's text for this page
566
+ vlm_page_texts[page_no] = None
567
 
568
+ vlm_time = time.time() - vlm_start
569
+ logger.info(
570
+ f"[{request_id}] Pass 2 completed in {vlm_time:.2f}s ({len(vlm_page_texts)} pages)"
571
+ )
572
 
573
+ # MERGE: VLM text + TableFormer tables
574
+ logger.info(f"[{request_id}] Merging VLM text with TableFormer tables")
575
+
576
+ md_parts: list[str] = []
577
+ pages_seen: set[int] = set()
 
 
 
 
 
 
 
 
578
  image_count = 0
579
  image_dir = output_dir / "images"
580
 
581
  if include_images:
582
  image_dir.mkdir(parents=True, exist_ok=True)
583
 
584
+ # Pre-build page-to-elements index (avoids O(N^2) on VLM fallback)
585
+ elements_by_page: dict[int, list] = {}
586
+ for element, _ in doc.iterate_items():
587
+ if element.prov:
588
+ pg = element.prov[0].page_no
589
+ elements_by_page.setdefault(pg, []).append(element)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
590
 
591
+    for page_no in sorted(vlm_page_texts.keys()):
+        pages_seen.add(page_no)
+        md_parts.append(f"\n\n<!-- Page {page_no + 1} -->\n\n")

+        vlm_text = vlm_page_texts[page_no]

+        if vlm_text is None:
+            # VLM failed -- fallback to Docling's text for this page
+            for element in elements_by_page.get(page_no, []):
                 try:
+                    md_parts.append(element.export_to_markdown(doc=doc))
+                except Exception:
+                    text = getattr(element, "text", "").strip()
+                    if text:
+                        md_parts.append(text + "\n\n")
         else:
+            # Merge VLM text with TableFormer tables for this page
+            page_tables = tables_by_page.get(page_no, [])
+            merged = _merge_vlm_with_tables(vlm_text, page_tables)
+            md_parts.append(merged)

+    # Handle images from Docling if requested
+    if include_images:
+        for element, _ in doc.iterate_items():
+            if isinstance(element, PictureItem):
+                if element.image and element.image.pil_image:
+                    page_no = element.prov[0].page_no if element.prov else 0
+                    image_id = element.self_ref.split("/")[-1]
+                    image_name = f"page_{page_no + 1}_{image_id}.png"
+                    image_name = re.sub(r'[\\/*?:"<>|]', "", image_name)
+                    image_path = image_dir / image_name
+                    try:
+                        element.image.pil_image.save(image_path, format="PNG")
+                        image_count += 1
+                    except Exception as e:
+                        logger.warning(
+                            f"[{request_id}] Failed to save image {image_name}: {e}"
+                        )
+
+    markdown_content = "".join(md_parts)
     pages_processed = len(pages_seen)

+    total_time = pass1_time + vlm_time
+    logger.info(
+        f"[{request_id}] Hybrid conversion complete: {pages_processed} pages, "
+        f"{total_tables} tables, {total_time:.2f}s total"
+    )
+    if pages_processed > 0:
+        logger.info(f"[{request_id}] Speed: {pages_processed / total_time:.2f} pages/sec")

     return markdown_content, None, pages_processed, image_count


+# ---------------------------------------------------------------------------
+# Images Zip Helper
+# ---------------------------------------------------------------------------
+
+
 def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
     """Create a zip file from extracted images."""
     image_dir = output_dir / "images"
 
     return base64.b64encode(zip_buffer.getvalue()).decode("utf-8"), image_count


+# ---------------------------------------------------------------------------
+# Application Lifespan
+# ---------------------------------------------------------------------------
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Startup: initialize Docling converter and check vLLM."""
+    logger.info("=" * 60)
+    logger.info("Starting Docling VLM Parser API v2.0.0...")
+
+    device = _get_device()
+    logger.info(f"Device: {device}")
+
+    if device == "cuda":
+        logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
+        logger.info(f"CUDA Version: {torch.version.cuda}")
+        logger.info(
+            f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB"
+        )
+
+    logger.info(f"VLM Model: {VLM_MODEL}")
+    logger.info(f"VLM Endpoint: http://{VLM_HOST}:{VLM_PORT}")
+    logger.info(f"Images scale: {IMAGES_SCALE}")
+    logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
+
+    # Verify vLLM is running
+    logger.info("Checking vLLM server...")
+    try:
+        async with httpx.AsyncClient(timeout=10) as client:
+            resp = await client.get(f"http://{VLM_HOST}:{VLM_PORT}/health")
+            resp.raise_for_status()
+        logger.info("vLLM server is healthy")
+    except Exception as e:
+        logger.error(f"vLLM server not available: {e}")
+        raise RuntimeError(f"vLLM server not available at {VLM_HOST}:{VLM_PORT}")
+
+    # Pre-initialize Docling converter
+    logger.info("Pre-loading Docling models (DocLayNet + TableFormer + RapidOCR)...")
+    try:
+        _get_converter()
+        logger.info("Docling models loaded successfully")
+    except Exception as e:
+        logger.warning(f"Failed to pre-load Docling models: {e}")
+
+    logger.info("=" * 60)
+    logger.info("Docling VLM Parser API ready (Hybrid: TableFormer + Qwen3-VL)")
+    logger.info("=" * 60)
+    yield
+    logger.info("Shutting down Docling VLM Parser API...")
+
+
+# ---------------------------------------------------------------------------
+# FastAPI App
+# ---------------------------------------------------------------------------
+
+app = FastAPI(
+    title="Docling VLM Parser API",
+    description="Hybrid document parser: TableFormer tables + Qwen3-VL OCR via vLLM",
+    version="2.0.0",
+    lifespan=lifespan,
+)
+
+
+# ---------------------------------------------------------------------------
+# Endpoints
+# ---------------------------------------------------------------------------
+
+
 @app.get("/", response_model=HealthResponse)
 async def health_check() -> HealthResponse:
     """Health check endpoint."""
 
     if device == "cuda":
         gpu_name = torch.cuda.get_device_name(0)

+    # Check vLLM status (async to avoid blocking event loop)
+    vlm_status = "unknown"
+    try:
+        async with httpx.AsyncClient(timeout=5) as client:
+            resp = await client.get(f"http://{VLM_HOST}:{VLM_PORT}/health")
+            vlm_status = "healthy" if resp.status_code == 200 else "unhealthy"
+    except Exception:
+        vlm_status = "unreachable"
+
     return HealthResponse(
         status="healthy",
+        version="2.0.0",
         device=device,
+        gpu_name=None,  # Don't leak GPU details on unauthenticated endpoint
+        vlm_model="active",  # Confirm VLM is configured without leaking model name
+        vlm_status=vlm_status,
         images_scale=IMAGES_SCALE,
     )

 async def parse_document(
     file: UploadFile = File(..., description="PDF or image file to parse"),
     output_format: str = Form(default="markdown", description="Output format: markdown or json"),
+    images_scale: Optional[float] = Form(default=None, description="Image resolution scale (default: 2.0)"),
     start_page: int = Form(default=0, description="Starting page (0-indexed)"),
     end_page: Optional[int] = Form(default=None, description="Ending page (None = all pages)"),
     include_images: bool = Form(default=False, description="Include extracted images in response"),

     """
     Parse a document file (PDF or image) and return extracted content.

+    Uses a hybrid two-pass approach:
+        Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)
+        Pass 2: Qwen3-VL via vLLM for enhanced text recognition
+        Merge: TableFormer tables preserved, VLM text replaces RapidOCR text
+
     Supports:
     - PDF files (.pdf)
     - Images (.png, .jpg, .jpeg, .tiff, .bmp)

     logger.info(f"[{request_id}] {'='*50}")
     logger.info(f"[{request_id}] New parse request received")
+    safe_filename = re.sub(r'[\r\n\t\x00-\x1f\x7f]', '_', file.filename or "")[:255]
+    logger.info(f"[{request_id}] Filename: {safe_filename}")
     logger.info(f"[{request_id}] Output format: {output_format}")

+    if output_format not in ("markdown",):
+        raise HTTPException(
+            status_code=400,
+            detail="Only 'markdown' output_format is supported in v2.0.0",
+        )
+
     # Validate file size
     file.file.seek(0, 2)
     file_size = file.file.tell()

         )

     # Use defaults if not specified
+    use_images_scale = images_scale if images_scale is not None else IMAGES_SCALE

+    logger.info(f"[{request_id}] Images scale: {use_images_scale}, VLM: {VLM_MODEL}")
     logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")

     temp_dir = tempfile.mkdtemp()
+    logger.debug(f"[{request_id}] Created temp directory: {temp_dir}")

     try:
         # Save uploaded file
         input_path = Path(temp_dir) / f"input{file_ext}"
         await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
+        logger.debug(f"[{request_id}] Saved file to: {input_path}")

         # Create output directory
         output_dir = Path(temp_dir) / "output"
         output_dir.mkdir(exist_ok=True)

+        # Convert document (hybrid two-pass)
         markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
             _convert_document,
             input_path,
             output_dir,
             use_images_scale,
             include_images,
             request_id,
             start_page,
             end_page,
         )

         # Create images zip if requested

             image_count=image_count,
             pages_processed=pages_processed,
             device_used=_get_device(),
+            vlm_model=VLM_MODEL,
         )

     except Exception as e:
         total_duration = time.time() - start_time
         logger.error(f"[{request_id}] {'='*50}")
         logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}", exc_info=True)
         logger.error(f"[{request_id}] {'='*50}")
         return ParseResponse(
             success=False,
+            error=f"Processing failed (ref: {request_id})",
         )
     finally:
         shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.debug(f"[{request_id}] Cleaned up temp directory")
 

 @app.post("/parse/url", response_model=ParseResponse)

     """
     Parse a document from a URL.

+    Downloads the file and processes it through the hybrid two-pass pipeline:
+        Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)
+        Pass 2: Qwen3-VL via vLLM for enhanced text recognition
+        Merge: TableFormer tables preserved, VLM text replaces RapidOCR text
     """
     request_id = str(uuid4())[:8]
     start_time = time.time()

     logger.info(f"[{request_id}] URL: {request.url}")
     logger.info(f"[{request_id}] Output format: {request.output_format}")

+    if request.output_format not in ("markdown",):
+        raise HTTPException(
+            status_code=400,
+            detail="Only 'markdown' output_format is supported in v2.0.0",
+        )
+
     # Validate URL
     logger.info(f"[{request_id}] Validating URL...")
     _validate_url(request.url)
     logger.info(f"[{request_id}] URL validation passed")

     temp_dir = tempfile.mkdtemp()
+    logger.debug(f"[{request_id}] Created temp directory: {temp_dir}")

     try:
         # Download file
         logger.info(f"[{request_id}] Downloading file from URL...")
         download_start = time.time()
+        async with httpx.AsyncClient(timeout=60.0, follow_redirects=False) as client:
             response = await client.get(request.url)
             response.raise_for_status()
         download_duration = time.time() - download_start

         )

         if len(response.content) > MAX_FILE_SIZE_BYTES:
+            logger.error(
+                f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB"
+            )
             raise HTTPException(
                 status_code=413,
                 detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",

         # Save downloaded file
         input_path = Path(temp_dir) / f"input{file_ext}"
         await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
+        logger.debug(f"[{request_id}] Saved file to: {input_path}")

         # Create output directory
         output_dir = Path(temp_dir) / "output"
         output_dir.mkdir(exist_ok=True)

         # Use defaults if not specified
+        use_images_scale = request.images_scale if request.images_scale is not None else IMAGES_SCALE

+        logger.info(f"[{request_id}] Images scale: {use_images_scale}, VLM: {VLM_MODEL}")
         logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")

+        # Convert document (hybrid two-pass)
         markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
             _convert_document,
             input_path,
             output_dir,
             use_images_scale,
             request.include_images,
             request_id,
             request.start_page,
             request.end_page,
         )

         # Create images zip if requested

             image_count=image_count,
             pages_processed=pages_processed,
             device_used=_get_device(),
+            vlm_model=VLM_MODEL,
         )

     except httpx.HTTPError as e:

         logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
         return ParseResponse(
             success=False,
+            error=f"Failed to download file from URL (ref: {request_id})",
         )
     except Exception as e:
         total_duration = time.time() - start_time
         logger.error(f"[{request_id}] {'='*50}")
         logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}", exc_info=True)
         logger.error(f"[{request_id}] {'='*50}")
         return ParseResponse(
             success=False,
+            error=f"Processing failed (ref: {request_id})",
         )
     finally:
         shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.debug(f"[{request_id}] Cleaned up temp directory")


 if __name__ == "__main__":
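
The app.py hunks above call `_merge_vlm_with_tables` but the diff does not show its body. As a rough sketch of what such a merge step could look like (a hypothetical helper, not the committed implementation), one option is to replace each run of markdown table lines in the VLM page text with the corresponding higher-fidelity TableFormer table, appending any tables the VLM missed:

```python
import re


def merge_vlm_with_tables(vlm_text: str, table_markdowns: list[str]) -> str:
    """Hypothetical merge: swap each markdown table the VLM emitted for the
    corresponding TableFormer table, in order; append unmatched tables."""
    # A markdown table is a run of consecutive lines starting with '|'.
    table_run = re.compile(r"(?:^\|.*\n?)+", re.MULTILINE)
    tables = iter(table_markdowns)

    def _swap(match: re.Match) -> str:
        try:
            return next(tables) + "\n"
        except StopIteration:
            # More table runs in the VLM text than TableFormer tables:
            # keep the VLM's own rendering for the extras.
            return match.group(0)

    merged = table_run.sub(_swap, vlm_text)
    leftovers = list(tables)  # tables the VLM missed entirely
    if leftovers:
        merged += "\n\n" + "\n\n".join(leftovers) + "\n"
    return merged
```

Positional pairing like this is fragile when table counts differ between passes; a production version would likely match on bounding boxes or cell-content overlap instead.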
requirements.txt CHANGED
@@ -1,7 +1,7 @@
-# Docling Document Parser API Dependencies
-# Optimized for HuggingFace Spaces deployment with GPU support
+# Docling VLM Parser API Dependencies
+# Optimized for HuggingFace Spaces with vLLM + Qwen3-VL-8B

-# Docling - IBM's document parsing library
+# Docling - IBM's document parsing library (VLM pipeline support)
 docling>=2.15.0

 # Web framework
@@ -11,8 +11,17 @@ uvicorn[standard]>=0.32.0
 # File upload handling
 python-multipart>=0.0.9

-# HTTP client for URL parsing
+# HTTP client for URL parsing and vLLM health checks
 httpx>=0.27.0

 # Type checking
 pydantic>=2.0.0
+
+# Image preprocessing for degraded documents
+opencv-python-headless>=4.10.0
+
+# PDF to image conversion for VLM OCR pass
+pdf2image>=1.17.0
+
+# HuggingFace Hub for model downloads
+huggingface-hub>=0.25.0
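
The `httpx` dependency above also serves the VLM OCR pass: `_vlm_ocr_page` (not shown in this diff) presumably posts each page image to vLLM's OpenAI-compatible chat endpoint. A minimal sketch of such a request body, assuming the standard `/v1/chat/completions` route and data-URL image format (the function name, prompt wording, and parameter values here are illustrative, not the committed code):

```python
import base64

VLM_MODEL = "Qwen/Qwen3-VL-8B-Instruct"  # assumed model id


def build_vlm_ocr_payload(page_png: bytes, model: str = VLM_MODEL) -> dict:
    """Build an OpenAI-style chat-completions body for one page image.
    POST this to http://127.0.0.1:8000/v1/chat/completions (the vLLM
    server started by start.sh)."""
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "max_tokens": 4096,
        "temperature": 0.0,  # deterministic, OCR-style output
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                    {
                        "type": "text",
                        "text": "Transcribe this page to clean markdown.",
                    },
                ],
            }
        ],
    }
```

Keeping one image per request matches the `--limit-mm-per-prompt image=1` flag passed to vLLM in start.sh.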
start.sh ADDED
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+set -e
+
+# ── Configuration ────────────────────────────────────────────────────────────
+VLLM_MODEL="/home/user/.cache/huggingface/Qwen3-VL-8B-Instruct"
+VLLM_HOST="127.0.0.1"
+VLLM_PORT="8000"
+HEALTH_URL="http://${VLLM_HOST}:${VLLM_PORT}/health"
+POLL_INTERVAL=5
+MAX_WAIT=600
+
+# ── Start vLLM server in background ─────────────────────────────────────────
+echo "[startup] Starting vLLM server with model: ${VLLM_MODEL}"
+
+# Log prefixing uses process substitution (not a pipe) so that $! below
+# captures the vLLM PID rather than the PID of the sed process.
+python -m vllm.entrypoints.openai.api_server \
+    --model "${VLLM_MODEL}" \
+    --host "${VLLM_HOST}" \
+    --port "${VLLM_PORT}" \
+    --max-num-seqs 16 \
+    --max-model-len 8192 \
+    --gpu-memory-utilization 0.70 \
+    --dtype auto \
+    --trust-remote-code \
+    --limit-mm-per-prompt image=1 \
+    > >(sed 's/^/[vLLM] /') 2>&1 &
+
+VLLM_PID=$!
+echo "[startup] vLLM server started with PID ${VLLM_PID}"
+
+# ── Poll vLLM health endpoint until ready ────────────────────────────────────
+echo "[startup] Waiting for vLLM to become healthy (polling every ${POLL_INTERVAL}s, timeout ${MAX_WAIT}s)..."
+
+elapsed=0
+while [ "${elapsed}" -lt "${MAX_WAIT}" ]; do
+    # Check if vLLM process is still alive
+    if ! kill -0 "${VLLM_PID}" 2>/dev/null; then
+        echo "[startup] ERROR: vLLM process (PID ${VLLM_PID}) died during startup"
+        exit 1
+    fi
+
+    if curl -sf "${HEALTH_URL}" > /dev/null 2>&1; then
+        echo "[startup] vLLM is healthy after ${elapsed}s"
+        break
+    fi
+
+    sleep "${POLL_INTERVAL}"
+    elapsed=$((elapsed + POLL_INTERVAL))
+done
+
+if [ "${elapsed}" -ge "${MAX_WAIT}" ]; then
+    echo "[startup] ERROR: vLLM did not become healthy within ${MAX_WAIT}s"
+    echo "[startup] Killing vLLM process (PID ${VLLM_PID})"
+    kill "${VLLM_PID}" 2>/dev/null || true
+    exit 1
+fi
+
+# ── Start FastAPI with vLLM cleanup on exit ──────────────────────────────────
+_cleanup() {
+    echo "[startup] Shutting down vLLM (PID ${VLLM_PID})"
+    kill "${VLLM_PID}" 2>/dev/null || true
+    wait "${VLLM_PID}" 2>/dev/null || true
+}
+trap _cleanup EXIT TERM INT
+
+echo "[startup] Starting FastAPI server on 0.0.0.0:7860"
+
+python -m uvicorn app:app \
+    --host 0.0.0.0 \
+    --port 7860 \
+    --workers 1 \
+    --timeout-keep-alive 300
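
The readiness loop in start.sh (probe `/health` every `POLL_INTERVAL` seconds, give up after `MAX_WAIT`) generalizes to any dependent service. A minimal Python restatement, with the probe and sleep injected so the logic is testable without a running server (a sketch, not part of the commit):

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    poll_interval: float = 5.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `max_wait` elapses.

    Mirrors the bash loop in start.sh; returns False on timeout so the
    caller can decide whether to kill the child process and exit.
    """
    elapsed = 0.0
    while elapsed < max_wait:
        if probe():
            return True
        sleep(poll_interval)
        elapsed += poll_interval
    return False
```

In the FastAPI service itself the same check runs once at startup inside `lifespan`, so a document request never races an unready vLLM backend.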