Spaces: Running on T4

Commit · 8c4351b
Parent(s): 4848ba0

feat: hybrid VLM parser with Qwen3-VL-8B via vLLM (v2.0.0)

Files changed:
- CLAUDE.md +46 -55
- Dockerfile +38 -25
- README.md +59 -37
- app.py +544 -245
- requirements.txt +13 -4
- start.sh +71 -0
CLAUDE.md
CHANGED

@@ -4,48 +4,43 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## Project Overview

-Docling Parser - A Hugging Face Spaces API service

 ## Architecture

 ```
 hf_docling_parser/
-├── app.py              # FastAPI
-├──
-├──
-├──
 ├── CLAUDE.md           # Claude Code development guide
 └── .gitignore          # Git ignore patterns
 ```

 ## Common Commands

 ```bash
-#
-
-

 # Test the API locally
 curl -X POST "http://localhost:7860/parse" \
   -H "Authorization: Bearer YOUR_API_TOKEN" \
   -F "file=@document.pdf" \
   -F "output_format=markdown"
-
-# Build and test Docker locally
-docker build -t hf-docling .
-docker run --gpus all -p 7860:7860 -e API_TOKEN=test hf-docling
 ```

-
-##
-
-hf repo create docling-parser --repo-type space --space_sdk docker
-git init
-git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/docling-parser
-

 ### Push New Code

@@ -57,7 +52,7 @@ git push hf main

 ### Settings (configure in HF web UI)

-- **Hardware:** Nvidia
 - **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
 - **Secrets:** `API_TOKEN` (required for API authentication)

@@ -78,37 +73,29 @@ git push hf main

 - **uvicorn**: ASGI server
 - **httpx**: HTTP client for URL parsing
 - **pydantic**: Request/response validation
-- **

-
-| ------------------ | ------------------------------------------------- | ---------- |
-| `API_TOKEN`        | **Required.** Secret token for API authentication | -          |
-| `DO_OCR`           | Enable OCR by default                             | `true`     |
-| `TABLE_MODE`       | Table detection mode (accurate, fast)             | `accurate` |
-| `IMAGES_SCALE`     | Image resolution scale for extraction             | `3.0`      |
-| `DEFAULT_LANG`     | Default OCR language code                         | `en`       |
-| `MAX_FILE_SIZE_MB` | Maximum upload file size in MB                    | `1024`     |

-##
-
-The converter supports these key options:
-

 ## Testing

-```bash
-API_TOKEN=test uvicorn app:app --host 0.0.0.0 --port 7860 --reload
-```

 ### Test with curl

@@ -163,12 +150,16 @@ The API provides comprehensive logging:

 ## Comparison with MinerU

-| Feature          | Docling
-| ---------------- | ---------------------- | ------------------- |
-| Maintainer       | IBM Research
-| Table Detection  | TableFormer (built-in)
-| OCR              |
-| VLM Support      |
-| License          | MIT
-| GPU Memory       | ~
-| Primary Use Case | Enterprise documents
 ## Project Overview

+Docling Parser (v2.0.0) - A Hugging Face Spaces API service using a hybrid two-pass VLM architecture for PDF/document parsing. Pass 1 runs Docling's standard pipeline (DocLayNet layout + TableFormer ACCURATE + RapidOCR baseline). Pass 2 sends full page images to Qwen3-VL-8B via vLLM for enhanced text recognition. The merge step preserves TableFormer tables while replacing RapidOCR text with VLM output. Includes OpenCV preprocessing (denoise, CLAHE contrast enhancement). API endpoints are protected by Bearer token authentication.
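The merge step described above can be sketched as follows. This is an illustrative sketch with assumed block shapes (`type`, `page`, `markdown` keys), not the actual app.py implementation:

```python
# Sketch of the Pass-1/Pass-2 merge described above (hypothetical data
# shapes, not the actual app.py code): keep TableFormer table blocks
# from Pass 1, replace per-page RapidOCR text with the VLM output.

def merge_passes(pass1_blocks, vlm_text_by_page):
    """pass1_blocks: list of {'type': 'table'|'text', 'page': int, 'markdown': str}.
    vlm_text_by_page: {page_number: vlm_markdown} from Pass 2."""
    merged, pages_replaced = [], set()
    for block in pass1_blocks:
        if block["type"] == "table":
            merged.append(block["markdown"])  # preserve TableFormer output
        elif block["page"] not in pages_replaced:
            merged.append(vlm_text_by_page[block["page"]])  # VLM text wins
            pages_replaced.add(block["page"])  # one VLM blob per page
    return "\n\n".join(merged)
```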

 ## Architecture

 ```
 hf_docling_parser/
+├── app.py              # FastAPI + hybrid two-pass parsing (v2.0.0)
+├── start.sh            # Startup script (vLLM + FastAPI dual-process)
+├── Dockerfile          # vLLM base image, Qwen3-VL pre-downloaded
+├── requirements.txt    # Python deps (docling, opencv, pdf2image, etc.)
+├── README.md           # HF Spaces metadata
 ├── CLAUDE.md           # Claude Code development guide
 └── .gitignore          # Git ignore patterns
 ```

+**Dual-process Docker architecture:** `start.sh` launches vLLM on port 8000 (GPU model serving) and FastAPI on port 7860 (API). Base image: `vllm/vllm-openai:v0.14.1`.

 ## Common Commands

 ```bash
+# Build and test Docker locally (requires A100 GPU)
+docker build --shm-size 32g -t hf-docling .
+docker run --gpus all --shm-size 32g -p 7860:7860 -e API_TOKEN=test hf-docling

 # Test the API locally
 curl -X POST "http://localhost:7860/parse" \
   -H "Authorization: Bearer YOUR_API_TOKEN" \
   -F "file=@document.pdf" \
   -F "output_format=markdown"
 ```

+> **Note:** Local dev without Docker is not practical: the hybrid pipeline requires vLLM + Qwen3-VL-8B running on an A100 GPU.

+## Deploying to Hugging Face Spaces

+- **Space URL:** https://huggingface.co/spaces/outcomelabs/docling-parser
+- **API URL:** https://outcomelabs-docling-parser.hf.space

 ### Push New Code

 ### Settings (configure in HF web UI)

+- **Hardware:** Nvidia A100 Large 80GB ($2.50/hr); vLLM requires a GPU
 - **Sleep time:** 1 hour (auto-shutdown after 60 min idle)
 - **Secrets:** `API_TOKEN` (required for API authentication)

 - **uvicorn**: ASGI server
 - **httpx**: HTTP client for URL parsing
 - **pydantic**: Request/response validation
+- **opencv-python-headless**: Image preprocessing (denoise, CLAHE)
+- **pdf2image**: PDF page to image conversion for VLM
+- **numpy**: Array operations for image processing
+- **huggingface-hub**: Model/space utilities

+> **Note:** vLLM and PyTorch are provided by the base Docker image (`vllm/vllm-openai:v0.14.1`), not in `requirements.txt`.

+## Environment Variables

+| Variable                     | Description                                       | Default                     |
+| ---------------------------- | ------------------------------------------------- | --------------------------- |
+| `API_TOKEN`                  | **Required.** Secret token for API authentication | -                           |
+| `MAX_FILE_SIZE_MB`           | Maximum upload file size in MB                    | `1024`                      |
+| `IMAGES_SCALE`               | Image resolution scale for page rendering         | `2.0`                       |
+| `VLM_MODEL`                  | VLM model for text recognition pass               | `Qwen/Qwen3-VL-8B-Instruct` |
+| `VLM_HOST`                   | vLLM server host                                  | `127.0.0.1`                 |
+| `VLM_PORT`                   | vLLM server port                                  | `8000`                      |
+| `VLM_GPU_MEMORY_UTILIZATION` | Fraction of GPU memory for vLLM                   | `0.70`                      |
+| `VLM_MAX_MODEL_LEN`          | Max sequence length for vLLM                      | `8192`                      |
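The variables in the table map to runtime configuration roughly as below. A sketch using the names and defaults from the table, illustrative rather than the exact app.py code:

```python
import os

# Sketch: how the env vars above could be read at startup. Defaults
# mirror the table; this is illustrative, not the exact app.py code.
VLM_MODEL = os.getenv("VLM_MODEL", "Qwen/Qwen3-VL-8B-Instruct")
VLM_HOST = os.getenv("VLM_HOST", "127.0.0.1")
VLM_PORT = int(os.getenv("VLM_PORT", "8000"))
VLM_GPU_MEMORY_UTILIZATION = float(os.getenv("VLM_GPU_MEMORY_UTILIZATION", "0.70"))
VLM_MAX_MODEL_LEN = int(os.getenv("VLM_MAX_MODEL_LEN", "8192"))
IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "2.0"))
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))

# vLLM exposes an OpenAI-compatible API under /v1
VLM_BASE_URL = f"http://{VLM_HOST}:{VLM_PORT}/v1"
```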

 ## Testing

+> **Note:** Testing requires an A100 GPU with vLLM running. Use the Docker container for testing.

 ### Test with curl

 ## Comparison with MinerU

+| Feature          | Docling (Hybrid VLM)             | MinerU              |
+| ---------------- | -------------------------------- | ------------------- |
+| Maintainer       | IBM Research + Qwen3-VL          | OpenDataLab         |
+| Table Detection  | TableFormer (built-in)           | Multiple backends   |
+| OCR              | Qwen3-VL-8B via vLLM             | Built-in            |
+| VLM Support      | Hybrid (Standard + VLM two-pass) | Hybrid backend      |
+| License          | MIT                              | AGPL-3.0            |
+| GPU Memory       | ~24GB (vLLM + Docling)           | ~6-10GB (pipeline)  |
+| Primary Use Case | Enterprise documents             | General PDF parsing |

+## Workflow Orchestration, Task Management & Core Principles

+> **See root `CLAUDE.md`** for full Workflow Orchestration (plan mode, subagents, self-improvement, verification, elegance, bug fixing), Task Management, and Core Principles. Files: `<workspace-root>/tasks/todo.md`, `<workspace-root>/tasks/lessons.md`.
Dockerfile
CHANGED

@@ -1,9 +1,9 @@

-# Hugging Face Spaces Dockerfile for Docling Document Parser API
-#
-# Build:

-# Use
-FROM

 USER root

@@ -23,8 +23,6 @@ RUN echo "========== STEP 1: Installing system dependencies ==========" && \

     poppler-utils \
     # Health checks
     curl \
-    # Build tools for some Python packages
-    build-essential \
     && fc-cache -fv && \
     rm -rf /var/lib/apt/lists/* && \
     echo "========== System dependencies installed =========="

@@ -35,23 +33,27 @@ RUN useradd -m -u 1000 user

 # Set environment variables
 ENV PYTHONUNBUFFERED=1 \
     PYTHONDONTWRITEBYTECODE=1 \
     MAX_FILE_SIZE_MB=1024 \
-    DEFAULT_LANG=en \
     HF_HOME=/home/user/.cache/huggingface \
     TORCH_HOME=/home/user/.cache/torch \
     XDG_CACHE_HOME=/home/user/.cache \
     HOME=/home/user \
-    PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH

 # Create cache directories with correct ownership
-RUN
     /home/user/.cache/torch \
-    /home/user/.cache/docling \
     /home/user/app && \
-    chown -R user:user /home/user

 # Switch to non-root user
 USER user

@@ -61,34 +63,45 @@ WORKDIR /home/user/app

 COPY --chown=user:user requirements.txt .

 # Install Python dependencies
-RUN echo "========== STEP
     pip install --user --upgrade pip && \
     pip install --user -r requirements.txt && \
     echo "Installed packages:" && \
-    pip list --user
     echo "========== Python dependencies installed =========="

 # Pre-download Docling models
-RUN echo "========== STEP
-    python -c "from docling.document_converter import DocumentConverter; print('
     echo "Model cache summary:" && \
     du -sh /home/user/.cache/huggingface 2>/dev/null || echo "  HF cache: (empty)" && \
     du -sh /home/user/.cache/torch 2>/dev/null || echo "  Torch cache: (empty)" && \
     du -sh /home/user/.cache 2>/dev/null || echo "  Total cache: (empty)" && \
-    echo "==========

 # Copy application code
 COPY --chown=user:user . .

-RUN echo "
     echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

 # Expose the port
 EXPOSE 7860

-# Health check
-HEALTHCHECK --interval=30s --timeout=30s --start-period=
     CMD curl -f http://localhost:7860/ || exit 1

-#
+# Hugging Face Spaces Dockerfile for Docling VLM Document Parser API
+# GPU-accelerated document parsing with Docling + Qwen3-VL-8B via vLLM
+# Build: v2.0.0 - Docling with VLM backend for superior accuracy

+# Use vLLM base image with CUDA, PyTorch, and vLLM pre-installed
+FROM vllm/vllm-openai:v0.14.1

 USER root

     poppler-utils \
     # Health checks
     curl \
     && fc-cache -fv && \
     rm -rf /var/lib/apt/lists/* && \
     echo "========== System dependencies installed =========="

 # Set environment variables
 ENV PYTHONUNBUFFERED=1 \
     PYTHONDONTWRITEBYTECODE=1 \
+    VLM_MODEL=Qwen/Qwen3-VL-8B-Instruct \
+    VLM_HOST=127.0.0.1 \
+    VLM_PORT=8000 \
+    VLM_GPU_MEMORY_UTILIZATION=0.70 \
+    VLM_MAX_MODEL_LEN=8192 \
+    IMAGES_SCALE=2.0 \
     MAX_FILE_SIZE_MB=1024 \
     HF_HOME=/home/user/.cache/huggingface \
     TORCH_HOME=/home/user/.cache/torch \
     XDG_CACHE_HOME=/home/user/.cache \
     HOME=/home/user \
+    PATH=/home/user/.local/bin:/usr/local/bin:/usr/bin:$PATH \
+    LD_LIBRARY_PATH=/home/user/.local/lib/python3.12/site-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH

 # Create cache directories with correct ownership
+RUN echo "========== STEP 2: Creating cache directories ==========" && \
+    mkdir -p /home/user/.cache/huggingface \
     /home/user/.cache/torch \
     /home/user/app && \
+    chown -R user:user /home/user && \
+    echo "========== Cache directories created =========="

 # Switch to non-root user
 USER user

 COPY --chown=user:user requirements.txt .

 # Install Python dependencies
+RUN echo "========== STEP 3: Installing Python dependencies ==========" && \
     pip install --user --upgrade pip && \
+    pip install --user nvidia-cudnn-cu12 && \
     pip install --user -r requirements.txt && \
     echo "Installed packages:" && \
+    pip list --user && \
     echo "========== Python dependencies installed =========="

+# Pre-download Qwen3-VL-8B model for vLLM
+RUN echo "========== STEP 4: Pre-downloading Qwen3-VL-8B model ==========" && \
+    python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-VL-8B-Instruct', local_dir='/home/user/.cache/huggingface/Qwen3-VL-8B-Instruct')" && \
+    echo "Model cache summary:" && \
+    du -sh /home/user/.cache/huggingface 2>/dev/null || echo "  HF cache: (empty)" && \
+    echo "========== Qwen3-VL-8B model downloaded =========="

 # Pre-download Docling models
+RUN echo "========== STEP 5: Pre-downloading Docling models ==========" && \
+    python -c "from docling.document_converter import DocumentConverter; print('Downloading Docling models...'); converter = DocumentConverter(); print('Done')" && \
     echo "Model cache summary:" && \
     du -sh /home/user/.cache/huggingface 2>/dev/null || echo "  HF cache: (empty)" && \
     du -sh /home/user/.cache/torch 2>/dev/null || echo "  Torch cache: (empty)" && \
     du -sh /home/user/.cache 2>/dev/null || echo "  Total cache: (empty)" && \
+    echo "========== Docling models downloaded =========="

 # Copy application code
 COPY --chown=user:user . .

+RUN echo "========== STEP 6: Finalizing build ==========" && \
+    chmod +x start.sh && \
+    echo "Files in app directory:" && ls -la /home/user/app/ && \
     echo "========== BUILD COMPLETED at $(date -u '+%Y-%m-%d %H:%M:%S UTC') =========="

 # Expose the port
 EXPOSE 7860

+# Health check (longer start-period for vLLM model loading)
+HEALTHCHECK --interval=30s --timeout=30s --start-period=600s --retries=5 \
     CMD curl -f http://localhost:7860/ || exit 1

+# Override vLLM entrypoint and use our startup script
+ENTRYPOINT []
+CMD ["bash", "start.sh"]
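The `start.sh` script added in this commit (+71 lines) is not shown in the diff. A minimal sketch of the dual-process startup it is described as performing, assuming vLLM's `vllm serve` CLI, its `/health` endpoint, and the env vars set above:

```shell
#!/usr/bin/env bash
# Sketch of the dual-process startup described in this commit; the real
# start.sh is not shown here, so flags and ordering are assumptions.
set -euo pipefail

# 1. Launch vLLM's OpenAI-compatible server in the background (GPU).
vllm serve "$VLM_MODEL" \
  --host "$VLM_HOST" --port "$VLM_PORT" \
  --gpu-memory-utilization "$VLM_GPU_MEMORY_UTILIZATION" \
  --max-model-len "$VLM_MAX_MODEL_LEN" &

# 2. Block until the model server answers health checks (model loading
#    can take minutes, matching the 600s HEALTHCHECK start-period).
until curl -sf "http://$VLM_HOST:$VLM_PORT/health"; do sleep 5; done

# 3. Run FastAPI in the foreground so Docker tracks its lifecycle.
exec uvicorn app:app --host 0.0.0.0 --port 7860
```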
README.md
CHANGED

@@ -1,5 +1,5 @@

 ---
-title: Docling Parser API
 emoji: π
 colorFrom: blue
 colorTo: green

@@ -10,24 +10,53 @@ license: mit

 suggested_hardware: a100-large
 ---

-# Docling Parser API

-A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using [IBM's Docling](https://github.com/DS4SD/docling).

 ## Features

-- **
-- **
-- **
-- **
-- **
 - **Image Extraction**: Extract and return all document images

 ## API Endpoints

 | Endpoint     | Method | Description                               |
 | ------------ | ------ | ----------------------------------------- |
-| `/`          | GET    | Health check
 | `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
 | `/parse/url` | POST   | Parse document from URL (JSON body)       |

@@ -91,7 +120,7 @@ response = requests.post(

 result = response.json()
 if result["success"]:
-    print(f"Parsed {result['pages_processed']} pages")
     print(result["markdown"])
 else:
     print(f"Error: {result['error']}")

@@ -140,10 +169,7 @@ if result["success"]:

 | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
 | file           | File   | Yes      | -          | PDF or image file                        |
 | output_format  | string | No       | `markdown` | `markdown` or `json`                     |
-| do_ocr         | bool   | No       | `true`     | Enable OCR for scanned documents         |
-| table_mode     | string | No       | `accurate` | `accurate` (slow) or `fast`              |
-| images_scale   | float  | No       | `3.0`      | Image resolution scale (higher = better) |
 | start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
 | end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
 | include_images | bool   | No       | `false`    | Include extracted images in response     |

@@ -154,10 +180,7 @@ if result["success"]:

 | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
 | url            | string | Yes      | -          | URL to PDF or image                      |
 | output_format  | string | No       | `markdown` | `markdown` or `json`                     |
-| do_ocr         | bool   | No       | `true`     | Enable OCR for scanned documents         |
-| table_mode     | string | No       | `accurate` | `accurate` (slow) or `fast`              |
-| images_scale   | float  | No       | `3.0`      | Image resolution scale (higher = better) |
 | start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
 | end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
 | include_images | bool   | No       | `false`    | Include extracted images in response     |

@@ -173,7 +196,8 @@ if result["success"]:

     "image_count": 0,
     "error": null,
     "pages_processed": 20,
-    "device_used": "cuda"
 }
 ```

@@ -187,13 +211,7 @@ if result["success"]:

 | error           | string | Error message if failed                        |
 | pages_processed | int    | Number of pages processed                      |
 | device_used     | string | Device used for processing (cuda, mps, or cpu) |

-## Table Detection Modes
-
-| Mode       | Speed  | Accuracy | Best For                              |
-| ---------- | ------ | -------- | ------------------------------------- |
-| `accurate` | Slower | Higher   | Complex tables, forms, financial docs |
-| `fast`     | Faster | Good     | Simple tables, high-volume processing |

 ## Supported File Types

@@ -204,14 +222,16 @@ Maximum file size: 1GB (configurable via `MAX_FILE_SIZE_MB`)

 ## Configuration

-| Environment Variable
-| -------------------- | -------------------------------------- | ---------- |
-| `API_TOKEN`
-| `
-| `
-| `
-| `
-| `

 ## Logging

@@ -221,11 +241,13 @@ View logs in HuggingFace Space > Logs tab:

 2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
 2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
 2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
-2026-02-04 10:30:
-2026-02-04 10:30:
 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
 ```

 ## Credits

-Built with [Docling](https://github.com/DS4SD/docling) by IBM Research.
 ---
+title: Docling VLM Parser API
 emoji: π
 colorFrom: blue
 colorTo: green

 suggested_hardware: a100-large
 ---

+# Docling VLM Parser API

+A FastAPI service that transforms PDFs and images into LLM-ready markdown/JSON using a **hybrid two-pass architecture**: [IBM's Docling](https://github.com/DS4SD/docling) for document structure and [Qwen3-VL-8B](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) via [vLLM](https://github.com/vllm-project/vllm) for enhanced text recognition.

 ## Features

+- **Hybrid Two-Pass Architecture**: Docling Standard Pipeline (Pass 1) + Qwen3-VL VLM OCR (Pass 2)
+- **TableFormer ACCURATE**: High-accuracy table structure detection preserved from Docling
+- **VLM-Powered OCR**: Qwen3-VL-8B via vLLM replaces baseline RapidOCR for superior text accuracy
+- **OpenCV Preprocessing**: Denoising and CLAHE contrast enhancement for better image quality
+- **32+ Language Support**: Multilingual text recognition powered by Qwen3-VL
+- **Handwriting Recognition**: Transcribe handwritten text via VLM
 - **Image Extraction**: Extract and return all document images
+- **Multiple Formats**: Output as markdown or JSON
+- **GPU Accelerated**: Dual-process on A100 80GB (vLLM + FastAPI)
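The preprocessing feature above (denoise + CLAHE) can be approximated in plain NumPy. This is a simplified stand-in: a 3x3 mean filter instead of OpenCV's denoiser, and global histogram equalization instead of tile-based CLAHE (the real pipeline uses cv2):

```python
import numpy as np

# Simplified stand-in for the OpenCV preprocessing step described above:
# mean-filter denoising plus global histogram equalization in place of
# CLAHE. Illustrative only; the real pipeline uses cv2.

def preprocess_page(gray: np.ndarray) -> np.ndarray:
    """gray: 2-D uint8 page image; returns a contrast-enhanced uint8 image."""
    # Denoise: 3x3 mean filter via shifted sums (edges handled by padding).
    padded = np.pad(gray.astype(np.float32), 1, mode="edge")
    denoised = sum(
        padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
        for dy in range(3) for dx in range(3)
    ) / 9.0
    denoised = denoised.astype(np.uint8)

    # Contrast: global histogram equalization (CLAHE does the same idea
    # per tile, with a clip limit to avoid over-amplifying noise).
    hist = np.bincount(denoised.ravel(), minlength=256)
    cdf = hist.cumsum()
    lut = (cdf - cdf.min()) * 255 / max(cdf.max() - cdf.min(), 1)
    return lut.astype(np.uint8)[denoised]
```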

+## Architecture

+```
+┌────────────────────────────────────────────────────────────┐
+│                     Docker Container                       │
+│                 (vllm/vllm-openai:v0.14.1)                 │
+│                                                            │
+│  ┌─────────────────────┐   ┌────────────────────────────┐  │
+│  │  vLLM Server :8000  │   │  FastAPI App :7860         │  │
+│  │  Qwen3-VL-8B        │◄──│                            │  │
+│  │  (GPU inference)    │   │  Pass 1: Docling Standard  │  │
+│  └─────────────────────┘   │   - DocLayNet layout       │  │
+│                            │   - TableFormer ACCURATE   │  │
+│                            │   - RapidOCR baseline      │  │
+│                            │                            │  │
+│                            │  Pass 2: VLM OCR           │  │
+│                            │   - Page images → Qwen3-VL │  │
+│                            │   - OpenCV preprocessing   │  │
+│                            │                            │  │
+│                            │  Merge:                    │  │
+│                            │   - VLM text (primary)     │  │
+│                            │   - TableFormer tables     │  │
+│                            └────────────────────────────┘  │
+└────────────────────────────────────────────────────────────┘
+```
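Pass 2 in the diagram talks to vLLM over its OpenAI-compatible chat API. A sketch of building that request for one page image (the prompt wording is an assumption, not the exact one used by app.py):

```python
import base64
import json

# Sketch: the Pass-2 request body sent to vLLM's OpenAI-compatible
# /v1/chat/completions endpoint for a single page image, using the
# OpenAI vision message format. Prompt text is illustrative.

def build_vlm_request(page_png: bytes,
                      model: str = "Qwen/Qwen3-VL-8B-Instruct") -> dict:
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Transcribe all text on this page as markdown."},
            ],
        }],
        "temperature": 0.0,  # deterministic, OCR-style output
    }
```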

 ## API Endpoints

 | Endpoint     | Method | Description                               |
 | ------------ | ------ | ----------------------------------------- |
+| `/`          | GET    | Health check (includes vLLM status)       |
 | `/parse`     | POST   | Parse uploaded file (multipart/form-data) |
 | `/parse/url` | POST   | Parse document from URL (JSON body)       |

 result = response.json()
 if result["success"]:
+    print(f"Parsed {result['pages_processed']} pages using {result['vlm_model']}")
     print(result["markdown"])
 else:
     print(f"Error: {result['error']}")

 | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
 | file           | File   | Yes      | -          | PDF or image file                        |
 | output_format  | string | No       | `markdown` | `markdown` or `json`                     |
+| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
 | start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
 | end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
 | include_images | bool   | No       | `false`    | Include extracted images in response     |

 | -------------- | ------ | -------- | ---------- | ---------------------------------------- |
 | url            | string | Yes      | -          | URL to PDF or image                      |
 | output_format  | string | No       | `markdown` | `markdown` or `json`                     |
+| images_scale   | float  | No       | `2.0`      | Image resolution scale (higher = better) |
 | start_page     | int    | No       | `0`        | Starting page (0-indexed)                |
 | end_page       | int    | No       | `null`     | Ending page (null = all pages)           |
 | include_images | bool   | No       | `false`    | Include extracted images in response     |

     "image_count": 0,
     "error": null,
     "pages_processed": 20,
+    "device_used": "cuda",
+    "vlm_model": "Qwen/Qwen3-VL-8B-Instruct"
 }
 ```

 | error           | string | Error message if failed                        |
 | pages_processed | int    | Number of pages processed                      |
 | device_used     | string | Device used for processing (cuda, mps, or cpu) |
+| vlm_model       | string | VLM model used for OCR (e.g. Qwen3-VL-8B)      |

 ## Supported File Types

 ## Configuration

+| Environment Variable         | Description                            | Default                     |
+| ---------------------------- | -------------------------------------- | --------------------------- |
+| `API_TOKEN`                  | **Required.** API authentication token | -                           |
+| `VLM_MODEL`                  | VLM model for OCR                      | `Qwen/Qwen3-VL-8B-Instruct` |
+| `VLM_HOST`                   | vLLM server host                       | `127.0.0.1`                 |
+| `VLM_PORT`                   | vLLM server port                       | `8000`                      |
+| `VLM_GPU_MEMORY_UTILIZATION` | GPU memory fraction for vLLM           | `0.70`                      |
+| `VLM_MAX_MODEL_LEN`          | Max context length for VLM             | `8192`                      |
+| `IMAGES_SCALE`               | Default image resolution scale         | `2.0`                       |
+| `MAX_FILE_SIZE_MB`           | Maximum upload size in MB              | `1024`                      |

 ## Logging

 2026-02-04 10:30:00 | INFO | [a1b2c3d4] New parse request received
 2026-02-04 10:30:00 | INFO | [a1b2c3d4] Filename: document.pdf
 2026-02-04 10:30:00 | INFO | [a1b2c3d4] File size: 2.45 MB
+2026-02-04 10:30:15 | INFO | [a1b2c3d4] Pass 1: Docling Standard Pipeline completed in 15.23s
+2026-02-04 10:30:15 | INFO | [a1b2c3d4] TableFormer detected 3 tables
+2026-02-04 10:30:27 | INFO | [a1b2c3d4] Pass 2: VLM OCR completed in 12.00s (20 pages)
+2026-02-04 10:30:27 | INFO | [a1b2c3d4] Hybrid conversion complete: 20 pages, 3 tables, 27.23s total
 2026-02-04 10:30:27 | INFO | [a1b2c3d4] Speed: 0.73 pages/sec
 ```

 ## Credits

+Built with [Docling](https://github.com/DS4SD/docling) by IBM Research, [Qwen3-VL](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) by Qwen Team, and [vLLM](https://github.com/vllm-project/vllm).
app.py
CHANGED
|
@@ -1,12 +1,16 @@
|
|
| 1 |
"""
|
| 2 |
-
Docling
|
| 3 |
|
| 4 |
-
A FastAPI service that
|
| 5 |
-
|
|
|
|
|
|
|
| 6 |
|
| 7 |
Features:
|
| 8 |
- GPU-accelerated parsing with CUDA support
|
| 9 |
-
-
|
|
|
|
|
|
|
| 10 |
- Image extraction with configurable resolution
|
| 11 |
- Automatic page chunking for large PDFs
|
| 12 |
"""
|
|
@@ -24,27 +28,31 @@ import socket
|
|
| 24 |
import tempfile
|
| 25 |
import time
|
| 26 |
import zipfile
|
|
|
|
| 27 |
from pathlib import Path
|
| 28 |
from typing import BinaryIO, Optional, Union
|
| 29 |
from urllib.parse import urlparse
|
| 30 |
from uuid import uuid4
|
| 31 |
|
|
|
|
| 32 |
import httpx
|
| 33 |
import torch
|
| 34 |
from fastapi import Depends, FastAPI, File, Form, HTTPException, UploadFile
|
| 35 |
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
|
|
|
|
| 36 |
from pydantic import BaseModel
|
| 37 |
|
| 38 |
# Docling imports
|
| 39 |
-
from docling.
|
| 40 |
from docling.datamodel.base_models import InputFormat
|
|
|
|
| 41 |
from docling.datamodel.pipeline_options import (
|
|
|
|
| 42 |
PdfPipelineOptions,
|
|
|
|
| 43 |
TableFormerMode,
|
| 44 |
-
AcceleratorOptions,
|
| 45 |
)
|
| 46 |
-
from docling.
|
| 47 |
-
from docling.datamodel.document import PictureItem
|
| 48 |
|
| 49 |
# Configure logging
|
| 50 |
logging.basicConfig(
|
|
@@ -85,120 +93,12 @@ def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security))
|
|
| 85 |
return token
|
| 86 |
|
| 87 |
|
| 88 |
-
#
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
"""Get the best available device for processing."""
|
| 94 |
-
if torch.cuda.is_available():
|
| 95 |
-
return "cuda"
|
| 96 |
-
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
|
| 97 |
-
return "mps"
|
| 98 |
-
return "cpu"
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
def _create_converter(
|
| 102 |
-
do_ocr: bool = True,
|
| 103 |
-
table_mode: str = "accurate",
|
| 104 |
-
images_scale: float = 3.0,
|
| 105 |
-
) -> DocumentConverter:
|
| 106 |
-
"""Create a Docling DocumentConverter with specified options."""
|
| 107 |
-
device = _get_device()
|
| 108 |
-
logger.info(f"Creating converter with device: {device}")
|
| 109 |
-
|
| 110 |
-
pipeline_options = PdfPipelineOptions()
|
| 111 |
-
pipeline_options.do_ocr = do_ocr
|
| 112 |
-
pipeline_options.do_table_structure = True
|
| 113 |
-
|
| 114 |
-
-    # Set table structure mode
-    if table_mode == "accurate":
-        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
-    else:
-        pipeline_options.table_structure_options.mode = TableFormerMode.FAST
-
-    # Enable image extraction
-    pipeline_options.generate_picture_images = True
-    pipeline_options.generate_page_images = True
-    pipeline_options.images_scale = images_scale
-
-    # GPU/CPU configuration
-    pipeline_options.accelerator_options = AcceleratorOptions(
-        device=device,
-        num_threads=0 if device == "cuda" else 4,
-    )
-
-    converter = DocumentConverter(
-        format_options={
-            InputFormat.PDF: PdfFormatOption(
-                pipeline_options=pipeline_options,
-                backend=DoclingParseV4DocumentBackend,
-            )
-        }
-    )
-
-    return converter
-
-
-def _get_converter() -> DocumentConverter:
-    """Get or create the global converter instance."""
-    global _converter
-    if _converter is None:
-        _converter = _create_converter()
-    return _converter
-
-
-from contextlib import asynccontextmanager
-
-
-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    """Startup: initialize Docling converter and check GPU."""
-    logger.info("=" * 60)
-    logger.info("Starting Docling Parser API v1.0.0...")
-
-    device = _get_device()
-    logger.info(f"Device: {device}")
-
-    if device == "cuda":
-        logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
-        logger.info(f"CUDA Version: {torch.version.cuda}")
-        logger.info(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
-
-    logger.info(f"Default OCR: {DO_OCR}")
-    logger.info(f"Default table mode: {TABLE_MODE}")
-    logger.info(f"Default images scale: {IMAGES_SCALE}")
-    logger.info(f"Default language: {DEFAULT_LANG}")
-    logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
-
-    # Pre-initialize converter to load models
-    logger.info("Pre-loading Docling models...")
-    try:
-        _get_converter()
-        logger.info("Models loaded successfully")
-    except Exception as e:
-        logger.warning(f"Failed to pre-load models: {e}")
-
-    logger.info("=" * 60)
-    logger.info("Docling Parser API ready to accept requests")
-    logger.info("=" * 60)
-    yield
-    logger.info("Shutting down Docling Parser API...")
-
-
-app = FastAPI(
-    title="Docling Parser API",
-    description="Transform PDFs and images into markdown/JSON using IBM's Docling",
-    version="1.0.0",
-    lifespan=lifespan,
-)
-
-# Configuration from environment
-DO_OCR = os.getenv("DO_OCR", "true").lower() == "true"
-TABLE_MODE = os.getenv("TABLE_MODE", "accurate")  # accurate or fast
-IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "3.0"))  # High res for A100 80GB VRAM
 MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
-DEFAULT_LANG = os.getenv("DEFAULT_LANG", "en")  # OCR language code
 MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024

 # Blocked hostnames for SSRF protection
@@ -211,6 +111,18 @@ BLOCKED_HOSTNAMES = {
     "fd00:ec2::254",
 }


 def _validate_url(url: str) -> None:
     """Validate URL to prevent SSRF attacks."""
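The context above ends at `_validate_url`, the SSRF guard that sits in front of the URL-download endpoint. Its body is not shown in this hunk; a minimal stdlib-only sketch of that kind of check (the blocklist and return-value convention here are illustrative, not the app's actual implementation) looks like:

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative subset of the app's BLOCKED_HOSTNAMES set.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_url_allowed(url: str) -> bool:
    """Reject URLs pointing at blocked hostnames or non-public literal IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host in BLOCKED_HOSTNAMES:
        return False
    try:
        ip = ipaddress.ip_address(host)
        # Literal IPs must be public: no private, loopback, or link-local ranges
        return not (ip.is_private or ip.is_loopback or ip.is_link_local)
    except ValueError:
        # Not a literal IP; resolving and re-checking DNS happens elsewhere
        return True


print(is_url_allowed("http://localhost/doc.pdf"))     # False
print(is_url_allowed("http://169.254.169.254/meta"))  # False (cloud metadata IP)
print(is_url_allowed("https://example.com/doc.pdf"))  # True
```

A guard like this blocks the obvious loopback/metadata targets but still trusts hostnames until DNS resolution, which is why the real service also keeps an explicit hostname blocklist.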
@@ -283,6 +195,11 @@ def _save_downloaded_content(input_path: Path, content: bytes) -> None:
         f.write(content)


 class ParseResponse(BaseModel):
     """Response model for document parsing."""
@@ -294,6 +211,7 @@ class ParseResponse(BaseModel):
     error: Optional[str] = None
     pages_processed: int = 0
     device_used: Optional[str] = None


 class HealthResponse(BaseModel):
@@ -303,9 +221,9 @@ class HealthResponse(BaseModel):
     version: str
     device: str
     gpu_name: Optional[str] = None
-    images_scale: float


 class URLParseRequest(BaseModel):
@@ -313,130 +231,421 @@ class URLParseRequest(BaseModel):

     url: str
     output_format: str = "markdown"
-    lang: str = DEFAULT_LANG  # OCR language code
-    do_ocr: Optional[bool] = None
-    table_mode: Optional[str] = None
     images_scale: Optional[float] = None
     start_page: int = 0  # Starting page (0-indexed)
     end_page: Optional[int] = None  # Ending page (None = all pages)
     include_images: bool = False

 def _convert_document(
     input_path: Path,
     output_dir: Path,
-    do_ocr: bool,
-    table_mode: str,
     images_scale: float,
     include_images: bool,
     request_id: str,
     start_page: int = 0,
     end_page: Optional[int] = None,
-) -> tuple[str, Optional[list], int, int]:
     """
     """
-    # Create converter with specified options
-    converter = _create_converter(
-        do_ocr=do_ocr,
-        table_mode=table_mode,
-        images_scale=images_scale,
     )

-    logger.info(f"[{request_id}] Starting Docling conversion...")
     start_time = time.time()
     result = converter.convert(input_path)
     doc = result.document

-    logger.info(
-            if re.fullmatch(r"([0-9]+|[ivxIVX]+)", raw_text):
-                if element.prov and element.prov[0].page_no not in page_labels:
-                    page_labels[element.prov[0].page_no] = raw_text
-
-    # Build markdown output
-    md_body = []
-    current_page_idx = -1
-    pages_seen = set()
     image_count = 0
     image_dir = output_dir / "images"

     if include_images:
         image_dir.mkdir(parents=True, exist_ok=True)

-            if current_page_idx < start_page:
-                continue
-            if end_page is not None and current_page_idx > end_page:
-                continue
-
-            pages_seen.add(current_page_idx)
-            label = page_labels.get(current_page_idx, str(current_page_idx + 1))
-            md_body.append(f"\n\n<!-- Page {label} -->\n\n")
-
-        # Skip elements outside the requested page range
-        if current_page_idx < start_page:
-            continue
-        if end_page is not None and current_page_idx > end_page:
-            continue
-
-        element_text = getattr(element, "text", "").strip()

-        if isinstance(element, PictureItem):
-            if include_images and element.image and element.image.pil_image:
-                image_id = element.self_ref.split("/")[-1]
-                label = page_labels.get(current_page_idx, str(current_page_idx + 1))
-                image_name = f"page_{label}_{image_id}.png"
-                image_name = re.sub(r'[\\/*?:"<>|]', "", image_name)
-                image_path = image_dir / image_name
                 try:
-                    element.
             else:
-                if element_text:
-                    md_body.append(element_text + "\n\n")

     pages_processed = len(pages_seen)

     return markdown_content, None, pages_processed, image_count


 def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
     """Create a zip file from extracted images."""
     image_dir = output_dir / "images"
@@ -462,6 +671,75 @@ def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
     return base64.b64encode(zip_buffer.getvalue()).decode("utf-8"), image_count


 @app.get("/", response_model=HealthResponse)
 async def health_check() -> HealthResponse:
     """Health check endpoint."""
@@ -470,13 +748,22 @@ async def health_check() -> HealthResponse:
     if device == "cuda":
         gpu_name = torch.cuda.get_device_name(0)

     return HealthResponse(
         status="healthy",
-        version="1.0.0",
         device=device,
-        gpu_name=gpu_name,
         images_scale=IMAGES_SCALE,
     )
@@ -485,10 +772,7 @@ async def health_check() -> HealthResponse:
 async def parse_document(
     file: UploadFile = File(..., description="PDF or image file to parse"),
     output_format: str = Form(default="markdown", description="Output format: markdown or json"),
-    do_ocr: Optional[bool] = Form(default=None, description="Enable OCR (default: true)"),
-    table_mode: Optional[str] = Form(default=None, description="Table detection mode: accurate or fast"),
-    images_scale: Optional[float] = Form(default=None, description="Image resolution scale (default: 3.0)"),
     start_page: int = Form(default=0, description="Starting page (0-indexed)"),
     end_page: Optional[int] = Form(default=None, description="Ending page (None = all pages)"),
     include_images: bool = Form(default=False, description="Include extracted images in response"),
@@ -497,6 +781,11 @@ async def parse_document(
     """
     Parse a document file (PDF or image) and return extracted content.

     Supports:
     - PDF files (.pdf)
     - Images (.png, .jpg, .jpeg, .tiff, .bmp)
@@ -506,9 +795,16 @@ async def parse_document(

     logger.info(f"[{request_id}] {'='*50}")
     logger.info(f"[{request_id}] New parse request received")
     logger.info(f"[{request_id}] Output format: {output_format}")

     # Validate file size
     file.file.seek(0, 2)
     file_size = file.file.tell()
@@ -535,39 +831,34 @@ async def parse_document(
         )

     # Use defaults if not specified
-    use_table_mode = table_mode if table_mode else TABLE_MODE
-    use_images_scale = images_scale if images_scale else IMAGES_SCALE

-    logger.info(f"[{request_id}]
     logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")

     temp_dir = tempfile.mkdtemp()
-    logger.

     try:
         # Save uploaded file
         input_path = Path(temp_dir) / f"input{file_ext}"
         await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
-        logger.

         # Create output directory
         output_dir = Path(temp_dir) / "output"
         output_dir.mkdir(exist_ok=True)

-        # Convert document
         markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
             _convert_document,
             input_path,
             output_dir,
-            use_ocr,
-            use_table_mode,
             use_images_scale,
             include_images,
             request_id,
             start_page,
             end_page,
-            lang,
         )

         # Create images zip if requested
@@ -593,21 +884,22 @@ async def parse_document(
             image_count=image_count,
             pages_processed=pages_processed,
             device_used=_get_device(),
         )

     except Exception as e:
         total_duration = time.time() - start_time
         logger.error(f"[{request_id}] {'='*50}")
         logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
-        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
         logger.error(f"[{request_id}] {'='*50}")
         return ParseResponse(
             success=False,
-            error=f"
         )
     finally:
         shutil.rmtree(temp_dir, ignore_errors=True)
-        logger.
@@ -618,7 +910,10 @@ async def parse_document_from_url(
     """
     Parse a document from a URL.

-    Downloads the file and processes it through
     """
     request_id = str(uuid4())[:8]
     start_time = time.time()
@@ -628,19 +923,25 @@ async def parse_document_from_url(
     logger.info(f"[{request_id}] URL: {request.url}")
     logger.info(f"[{request_id}] Output format: {request.output_format}")

     # Validate URL
     logger.info(f"[{request_id}] Validating URL...")
     _validate_url(request.url)
     logger.info(f"[{request_id}] URL validation passed")

     temp_dir = tempfile.mkdtemp()
-    logger.

     try:
         # Download file
         logger.info(f"[{request_id}] Downloading file from URL...")
         download_start = time.time()
-        async with httpx.AsyncClient(timeout=60.0, follow_redirects=
             response = await client.get(request.url)
             response.raise_for_status()
             download_duration = time.time() - download_start
@@ -662,7 +963,9 @@ async def parse_document_from_url(
         )

         if len(response.content) > MAX_FILE_SIZE_BYTES:
-            logger.error(
             raise HTTPException(
                 status_code=413,
                 detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
@@ -671,33 +974,28 @@ async def parse_document_from_url(
         # Save downloaded file
         input_path = Path(temp_dir) / f"input{file_ext}"
         await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
-        logger.

         # Create output directory
         output_dir = Path(temp_dir) / "output"
         output_dir.mkdir(exist_ok=True)

         # Use defaults if not specified
-        use_table_mode = request.table_mode if request.table_mode else TABLE_MODE
-        use_images_scale = request.images_scale if request.images_scale else IMAGES_SCALE

-        logger.info(f"[{request_id}]
         logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")

-        # Convert document
         markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
             _convert_document,
             input_path,
             output_dir,
-            use_ocr,
-            use_table_mode,
             use_images_scale,
             request.include_images,
             request_id,
             request.start_page,
             request.end_page,
-            request.lang,
         )

         # Create images zip if requested
@@ -723,6 +1021,7 @@ async def parse_document_from_url(
             image_count=image_count,
             pages_processed=pages_processed,
             device_used=_get_device(),
         )

     except httpx.HTTPError as e:
@@ -730,21 +1029,21 @@ async def parse_document_from_url(
         logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
         return ParseResponse(
             success=False,
-            error=f"Failed to download file from URL: {
         )
     except Exception as e:
         total_duration = time.time() - start_time
         logger.error(f"[{request_id}] {'='*50}")
         logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
-        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}")
         logger.error(f"[{request_id}] {'='*50}")
         return ParseResponse(
             success=False,
-            error=
         )
     finally:
         shutil.rmtree(temp_dir, ignore_errors=True)
-        logger.


 if __name__ == "__main__":
 """
+Docling VLM Parser API v2.0.0

+A FastAPI service that uses a HYBRID two-pass approach for document parsing:
+Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR) for document structure
+Pass 2: Qwen3-VL-8B via vLLM for enhanced text recognition
+Merge: TableFormer tables preserved, VLM text replaces RapidOCR text

 Features:
 - GPU-accelerated parsing with CUDA support
+- TableFormer ACCURATE for table structure detection
+- Qwen3-VL via vLLM for superior OCR accuracy
+- OpenCV image preprocessing (deskew, denoise, CLAHE)
 - Image extraction with configurable resolution
 - Automatic page chunking for large PDFs
 """
 import tempfile
 import time
 import zipfile
+from contextlib import asynccontextmanager
 from pathlib import Path
 from typing import BinaryIO, Optional, Union
 from urllib.parse import urlparse
 from uuid import uuid4

+import cv2
 import httpx
 import torch
 from fastapi import Depends, FastAPI, File, Form, HTTPException, UploadFile
 from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
+from pdf2image import convert_from_path
 from pydantic import BaseModel

 # Docling imports
+from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
 from docling.datamodel.base_models import InputFormat
+from docling.datamodel.document import PictureItem, TableItem
 from docling.datamodel.pipeline_options import (
+    AcceleratorOptions,
     PdfPipelineOptions,
+    RapidOcrOptions,
     TableFormerMode,
 )
+from docling.document_converter import DocumentConverter, PdfFormatOption

 # Configure logging
 logging.basicConfig(
     return token


+# VLM Configuration
+VLM_MODEL = os.getenv("VLM_MODEL", "Qwen/Qwen3-VL-8B-Instruct")
+VLM_HOST = os.getenv("VLM_HOST", "127.0.0.1")
+VLM_PORT = os.getenv("VLM_PORT", "8000")
+IMAGES_SCALE = float(os.getenv("IMAGES_SCALE", "2.0"))
 MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "1024"))
 MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024

 # Blocked hostnames for SSRF protection
     "fd00:ec2::254",
 }

+# Global converter instance (initialized on startup)
+_converter: Optional[DocumentConverter] = None
+
+
+def _get_device() -> str:
+    """Get the best available device for processing."""
+    if torch.cuda.is_available():
+        return "cuda"
+    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+

 def _validate_url(url: str) -> None:
     """Validate URL to prevent SSRF attacks."""
         f.write(content)


+# ---------------------------------------------------------------------------
+# Pydantic Models
+# ---------------------------------------------------------------------------
+
+
 class ParseResponse(BaseModel):
     """Response model for document parsing."""
     error: Optional[str] = None
     pages_processed: int = 0
     device_used: Optional[str] = None
+    vlm_model: Optional[str] = None


 class HealthResponse(BaseModel):
     version: str
     device: str
     gpu_name: Optional[str] = None
+    vlm_model: str = ""
+    vlm_status: str = "unknown"
+    images_scale: float = 2.0


 class URLParseRequest(BaseModel):

     url: str
     output_format: str = "markdown"
     images_scale: Optional[float] = None
     start_page: int = 0  # Starting page (0-indexed)
     end_page: Optional[int] = None  # Ending page (None = all pages)
     include_images: bool = False


+# ---------------------------------------------------------------------------
+# OpenCV Image Preprocessing
+# ---------------------------------------------------------------------------
+
+
+def _preprocess_image_for_ocr(image_path: str) -> str:
+    """Enhance image quality for better OCR accuracy.
+
+    Applies: denoising and CLAHE contrast enhancement.
+    Returns the path to the preprocessed image (same path, overwritten).
+    """
+    img = cv2.imread(image_path)
+    if img is None:
+        return image_path
+
+    # Denoise
+    img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
+
+    # CLAHE contrast enhancement on L channel
+    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
+    l, a, b = cv2.split(lab)
+    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
+    l = clahe.apply(l)
+    lab = cv2.merge([l, a, b])
+    img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
+
+    cv2.imwrite(image_path, img)
+    return image_path
+
+
+# ---------------------------------------------------------------------------
+# VLM OCR (Pass 2)
+# ---------------------------------------------------------------------------
+
+
+def _vlm_ocr_page(page_image_bytes: bytes) -> str:
+    """Send a page image to Qwen3-VL via vLLM for text extraction.
+
+    Args:
+        page_image_bytes: PNG image bytes of the page
+
+    Returns:
+        Extracted markdown text from the page
+    """
+    b64_image = base64.b64encode(page_image_bytes).decode("utf-8")
+
+    response = httpx.post(
+        f"http://{VLM_HOST}:{VLM_PORT}/v1/chat/completions",
+        json={
+            "model": VLM_MODEL,
+            "messages": [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "image_url",
+                            "image_url": {"url": f"data:image/png;base64,{b64_image}"},
+                        },
+                        {
+                            "type": "text",
+                            "text": (
+                                "OCR this document page to markdown. "
+                                "Extract ALL text exactly as written, preserving headings, lists, and paragraphs. "
+                                "For tables, output them as markdown tables. "
+                                "For handwritten text, transcribe as accurately as possible. "
+                                "Return ONLY the extracted content, no explanations."
+                            ),
+                        },
+                    ],
+                }
+            ],
+            "max_tokens": 8192,
+            "temperature": 0.1,
+            "skip_special_tokens": True,
+        },
+        timeout=120.0,
+    )
+    response.raise_for_status()
+    result = response.json()
+    choices = result.get("choices")
+    if not choices:
+        raise ValueError("vLLM returned no choices")
+    content = choices[0].get("message", {}).get("content")
+    if content is None:
+        raise ValueError("vLLM response missing content")
+    return content
+
+
+# ---------------------------------------------------------------------------
+# Table Extraction Helper
+# ---------------------------------------------------------------------------
+
+
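The request body built in `_vlm_ocr_page` follows the OpenAI-compatible chat-completions schema, with the page embedded as a base64 data URL. A stdlib-only sketch of assembling that payload (the tiny byte string here is a stand-in, not a real PNG, and `build_ocr_payload` is a hypothetical helper, not part of the app):

```python
import base64
import json


def build_ocr_payload(model: str, page_png: bytes, prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat request with an inline image."""
    b64 = base64.b64encode(page_png).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Image part: data URL instead of a fetchable http(s) URL
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    # Text part: the OCR instruction
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 8192,
        "temperature": 0.1,
    }


payload = build_ocr_payload(
    "Qwen/Qwen3-VL-8B-Instruct", b"\x89PNG-stand-in", "OCR this page to markdown."
)
print(json.dumps(payload)[:80])
```

Keeping the image inline avoids the vLLM server having to fetch anything over the network, at the cost of a roughly 4/3 size blowup from base64.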
+def _extract_table_markdowns(doc) -> dict:
+    """Extract table markdown from Docling document, keyed by page number."""
+    tables_by_page: dict[int, list[str]] = {}
+    for element, _ in doc.iterate_items():
+        if isinstance(element, TableItem):
+            page_no = element.prov[0].page_no if element.prov else -1
+            table_md = element.export_to_markdown(doc=doc)
+            if page_no not in tables_by_page:
+                tables_by_page[page_no] = []
+            tables_by_page[page_no].append(table_md)
+    return tables_by_page
+
+
+# ---------------------------------------------------------------------------
+# Merge: VLM Text + TableFormer Tables
+# ---------------------------------------------------------------------------
+
+
+def _merge_vlm_with_tables(vlm_text: str, table_markdowns: list) -> str:
+    """Replace VLM's table sections with TableFormer's more accurate tables.
+
+    Detects markdown table patterns (lines with |...|) in VLM output
+    and replaces them with TableFormer output.
+    """
+    if not table_markdowns:
+        return vlm_text
+
+    # Pattern: consecutive lines that look like markdown tables
+    # A markdown table has lines starting and ending with |
+    table_pattern = re.compile(r"((?:^\|[^\n]+\|$\n?)+)", re.MULTILINE)
+
+    vlm_table_count = len(table_pattern.findall(vlm_text))
+    if vlm_table_count != len(table_markdowns):
+        logger.warning(
+            f"Table count mismatch: VLM={vlm_table_count}, TableFormer={len(table_markdowns)}. "
+            f"Positional replacement may be imprecise."
+        )
+
+    table_idx = 0
+
+    def replace_table(match):
+        nonlocal table_idx
+        if table_idx < len(table_markdowns):
+            replacement = table_markdowns[table_idx]
+            table_idx += 1
+            return replacement.strip() + "\n"
+        return match.group(0)
+
+    result = table_pattern.sub(replace_table, vlm_text)
+
+    # If there are remaining TableFormer tables not matched, append them
+    while table_idx < len(table_markdowns):
+        result += "\n\n" + table_markdowns[table_idx].strip() + "\n"
+        table_idx += 1
+
+    return result
+
+
+# ---------------------------------------------------------------------------
+# PDF to Page Images
+# ---------------------------------------------------------------------------
+
+
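The positional table replacement in `_merge_vlm_with_tables` can be exercised in isolation. A self-contained check of the same regex-and-substitute approach, using made-up VLM output and a made-up TableFormer table (the text fixtures are illustrative only):

```python
import re

# Same table-detection pattern as the merge step: one or more
# consecutive lines that start and end with "|".
table_pattern = re.compile(r"((?:^\|[^\n]+\|$\n?)+)", re.MULTILINE)

vlm_text = "Intro paragraph.\n| a | b |\n| 1 | 2 |\nClosing paragraph.\n"
tableformer_tables = ["| A | B |\n|---|---|\n| 1 | 2 |"]

table_idx = 0


def replace_table(match: re.Match) -> str:
    """Swap the i-th VLM table block for the i-th TableFormer table."""
    global table_idx
    if table_idx < len(tableformer_tables):
        out = tableformer_tables[table_idx].strip() + "\n"
        table_idx += 1
        return out
    return match.group(0)


merged = table_pattern.sub(replace_table, vlm_text)
print(merged)
```

Because the substitution is purely positional, it relies on the VLM emitting tables in reading order; the count-mismatch warning in the real function flags the cases where that assumption is shaky.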
+def _pdf_to_page_images(
+    input_path: Path, start_page: int = 0, end_page: Optional[int] = None
+) -> list:
+    """Convert PDF pages to PNG image bytes using pdf2image.
+
+    Processes one page at a time to avoid loading all pages into memory.
+    Returns list of (page_no, png_bytes) tuples.
+    """
+    page_images: list[tuple[int, bytes]] = []
+
+    try:
+        # Determine total page count first
+        from pdf2image.pdf2image import pdfinfo_from_path
+
+        info = pdfinfo_from_path(str(input_path))
+        total_pages = info["Pages"]
+        last_page = min(end_page + 1, total_pages) if end_page is not None else total_pages
+
+        for i in range(start_page, last_page):
+            # Convert one page at a time (pdf2image is 1-indexed)
+            images = convert_from_path(
+                str(input_path), dpi=300, first_page=i + 1, last_page=i + 1
+            )
+            if not images:
+                continue
+            img = images[0]
+            # Save to temp file for OpenCV preprocessing
+            with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+                tmp_path = tmp.name
+                img.save(tmp_path, format="PNG")
+            try:
+                _preprocess_image_for_ocr(tmp_path)
+                with open(tmp_path, "rb") as f:
+                    page_images.append((i, f.read()))
+            finally:
+                os.unlink(tmp_path)
+    except Exception as e:
+        # Fallback: log warning; caller handles empty list
+        logger.warning(f"pdf2image failed, VLM OCR may be limited: {e}")
+
+    return page_images
+
+
+# ---------------------------------------------------------------------------
+# Docling Converter (Pass 1)
+# ---------------------------------------------------------------------------
+
+
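The start/end-page clamping in `_pdf_to_page_images` mixes 0-indexed API page numbers with pdf2image's 1-indexed `first_page`/`last_page`, which is easy to get off by one. A stdlib-only sketch of that same computation (the `page_range` helper is illustrative, not part of the app):

```python
def page_range(start_page: int, end_page, total_pages: int):
    """Mirror of the clamping in _pdf_to_page_images: yields the pages to
    convert, honoring an inclusive 0-indexed end_page and clamping to the
    document length. Returns (0-indexed, 1-indexed) pairs."""
    last = min(end_page + 1, total_pages) if end_page is not None else total_pages
    return [(i, i + 1) for i in range(start_page, last)]


print(page_range(0, None, 3))  # [(0, 1), (1, 2), (2, 3)]  all pages
print(page_range(1, 5, 3))     # [(1, 2), (2, 3)]          end_page past the doc
print(page_range(2, 2, 10))    # [(2, 3)]                  single-page slice
```

Treating `end_page` as inclusive (hence the `end_page + 1`) matches the API's documented "Ending page (None = all pages)" semantics while keeping the internal `range` half-open.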
+def _create_converter(images_scale: float = 2.0) -> DocumentConverter:
+    """Create a Docling converter with Standard Pipeline.
+
+    Uses DocLayNet (layout) + TableFormer ACCURATE (tables) + RapidOCR (baseline text).
+    """
+    device = _get_device()
+    logger.info(f"Creating converter with device: {device}")
+
+    pipeline_options = PdfPipelineOptions()
+    pipeline_options.do_ocr = True
+    pipeline_options.do_table_structure = True
+    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
+    pipeline_options.table_structure_options.do_cell_matching = True
+
+    # Use RapidOCR as baseline (VLM will enhance text in pass 2)
+    pipeline_options.ocr_options = RapidOcrOptions()
+    pipeline_options.ocr_options.force_full_page_ocr = True
+
+    # Enable page image generation (needed for VLM pass)
+    pipeline_options.generate_page_images = True
+    pipeline_options.images_scale = images_scale
+
+    # Also enable picture image extraction
+    pipeline_options.generate_picture_images = True
+
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        device=device,
+        num_threads=0 if device == "cuda" else 4,
+    )
+
+    converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(
+                pipeline_options=pipeline_options,
+                backend=DoclingParseV4DocumentBackend,
+            )
+        }
+    )
+    return converter
+
+
+def _get_converter() -> DocumentConverter:
+    """Get or create the global converter instance."""
+    global _converter
+    if _converter is None:
+        _converter = _create_converter(images_scale=IMAGES_SCALE)
+    return _converter
+
+
+# ---------------------------------------------------------------------------
+# Hybrid Conversion (Pass 1 + Pass 2 + Merge)
+# ---------------------------------------------------------------------------
+
+
def _convert_document(
|
| 499 |
input_path: Path,
|
| 500 |
output_dir: Path,
|
|
|
|
|
|
|
| 501 |
images_scale: float,
|
| 502 |
include_images: bool,
|
| 503 |
request_id: str,
|
| 504 |
start_page: int = 0,
|
| 505 |
end_page: Optional[int] = None,
|
| 506 |
+
) -> tuple:
|
|
|
|
| 507 |
"""
|
| 508 |
+
Hybrid conversion: TableFormer for tables + Qwen3-VL for text.
|
| 509 |
|
| 510 |
+
Pass 1: Docling Standard Pipeline -> document structure + tables
|
| 511 |
+
Pass 2: VLM OCR -> enhanced text recognition per page
|
| 512 |
+
Merge: TableFormer tables + VLM text
|
| 513 |
+
|
| 514 |
+
Returns: (markdown_content, json_content, pages_processed, image_count)
|
| 515 |
"""
|
| 516 |
+
# PASS 1: Docling Standard Pipeline (structure + tables)
|
| 517 |
+
logger.info(
|
| 518 |
+
f"[{request_id}] Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 519 |
)
|
| 520 |
+
converter = _get_converter()
|
| 521 |
|
|
|
|
| 522 |
start_time = time.time()
|
|
|
|
| 523 |
result = converter.convert(input_path)
|
| 524 |
doc = result.document
|
| 525 |
+
if doc is None:
|
| 526 |
+
raise ValueError(
|
| 527 |
+
f"Docling failed to parse document (status: {getattr(result, 'status', 'unknown')})"
|
| 528 |
+
)
|
| 529 |
+
pass1_time = time.time() - start_time
|
| 530 |
+
logger.info(f"[{request_id}] Pass 1 completed in {pass1_time:.2f}s")
|
| 531 |
+
|
| 532 |
+
# Extract TableFormer tables (keyed by page number)
|
| 533 |
+
tables_by_page = _extract_table_markdowns(doc)
|
| 534 |
+
total_tables = sum(len(v) for v in tables_by_page.values())
|
| 535 |
+
logger.info(f"[{request_id}] TableFormer detected {total_tables} tables")
|
| 536 |
+
|
| 537 |
+
# PASS 2: VLM OCR (enhanced text per page)
|
| 538 |
+
logger.info(f"[{request_id}] Pass 2: VLM OCR via Qwen3-VL ({VLM_MODEL})")
|
| 539 |
+
|
| 540 |
+
# Get page images for VLM
|
| 541 |
+
page_images = _pdf_to_page_images(input_path, start_page, end_page)
|
| 542 |
+
|
| 543 |
+
if not page_images:
|
| 544 |
+
# Fallback: use Docling's markdown directly if no page images
|
| 545 |
+
logger.warning(f"[{request_id}] No page images available, using Docling output only")
|
| 546 |
+
markdown_content = doc.export_to_markdown()
|
| 547 |
+
pages_processed = len(
|
| 548 |
+
set(e.prov[0].page_no for e, _ in doc.iterate_items() if e.prov)
|
| 549 |
+
)
|
| 550 |
+
return markdown_content, None, pages_processed, 0
|
| 551 |
+
|
| 552 |
+
vlm_page_texts: dict[int, Optional[str]] = {}
|
| 553 |
+
vlm_start = time.time()
|
| 554 |
+
for page_no, page_bytes in page_images:
|
| 555 |
+
try:
|
| 556 |
+
vlm_text = _vlm_ocr_page(page_bytes)
|
| 557 |
+
vlm_page_texts[page_no] = vlm_text
|
| 558 |
+
logger.info(
|
| 559 |
+
f"[{request_id}] VLM processed page {page_no + 1} ({len(vlm_text)} chars)"
|
| 560 |
+
)
|
| 561 |
+
except Exception as e:
|
| 562 |
+
logger.warning(
|
| 563 |
+
f"[{request_id}] VLM failed on page {page_no + 1}: {e}, using Docling text"
|
| 564 |
+
)
|
| 565 |
+
# Fallback to Docling's text for this page
|
| 566 |
+
vlm_page_texts[page_no] = None
|
| 567 |
|
| 568 |
+
vlm_time = time.time() - vlm_start
|
| 569 |
+
logger.info(
|
| 570 |
+
f"[{request_id}] Pass 2 completed in {vlm_time:.2f}s ({len(vlm_page_texts)} pages)"
|
| 571 |
+
)
|
| 572 |
|
| 573 |
+
# MERGE: VLM text + TableFormer tables
|
| 574 |
+
logger.info(f"[{request_id}] Merging VLM text with TableFormer tables")
|
| 575 |
+
|
| 576 |
+
md_parts: list[str] = []
|
| 577 |
+
pages_seen: set[int] = set()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 578 |
image_count = 0
|
| 579 |
image_dir = output_dir / "images"
|
| 580 |
|
| 581 |
if include_images:
|
| 582 |
image_dir.mkdir(parents=True, exist_ok=True)
|
| 583 |
|
| 584 |
+
# Pre-build page-to-elements index (avoids O(N^2) on VLM fallback)
|
| 585 |
+
elements_by_page: dict[int, list] = {}
|
| 586 |
+
for element, _ in doc.iterate_items():
|
| 587 |
+
if element.prov:
|
| 588 |
+
pg = element.prov[0].page_no
|
| 589 |
+
elements_by_page.setdefault(pg, []).append(element)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 590 |
|
| 591 |
+
for page_no in sorted(vlm_page_texts.keys()):
|
| 592 |
+
pages_seen.add(page_no)
|
| 593 |
+
md_parts.append(f"\n\n<!-- Page {page_no + 1} -->\n\n")
|
| 594 |
|
| 595 |
+
vlm_text = vlm_page_texts[page_no]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 596 |
|
| 597 |
+
if vlm_text is None:
|
| 598 |
+
# VLM failed -- fallback to Docling's text for this page
|
| 599 |
+
for element in elements_by_page.get(page_no, []):
|
| 600 |
try:
|
| 601 |
+
md_parts.append(element.export_to_markdown(doc=doc))
|
| 602 |
+
except Exception:
|
| 603 |
+
text = getattr(element, "text", "").strip()
|
| 604 |
+
if text:
|
| 605 |
+
md_parts.append(text + "\n\n")
|
| 606 |
else:
|
| 607 |
+
# Merge VLM text with TableFormer tables for this page
|
| 608 |
+
page_tables = tables_by_page.get(page_no, [])
|
| 609 |
+
merged = _merge_vlm_with_tables(vlm_text, page_tables)
|
| 610 |
+
md_parts.append(merged)
|
|
|
|
|
|
|
| 611 |
|
| 612 |
+
# Handle images from Docling if requested
|
| 613 |
+
if include_images:
|
| 614 |
+
for element, _ in doc.iterate_items():
|
| 615 |
+
if isinstance(element, PictureItem):
|
| 616 |
+
if element.image and element.image.pil_image:
|
| 617 |
+
page_no = element.prov[0].page_no if element.prov else 0
|
| 618 |
+
image_id = element.self_ref.split("/")[-1]
|
| 619 |
+
image_name = f"page_{page_no + 1}_{image_id}.png"
|
| 620 |
+
image_name = re.sub(r'[\\/*?:"<>|]', "", image_name)
|
| 621 |
+
image_path = image_dir / image_name
|
| 622 |
+
try:
|
| 623 |
+
element.image.pil_image.save(image_path, format="PNG")
|
| 624 |
+
image_count += 1
|
| 625 |
+
except Exception as e:
|
| 626 |
+
logger.warning(
|
| 627 |
+
f"[{request_id}] Failed to save image {image_name}: {e}"
|
| 628 |
+
)
|
| 629 |
+
|
| 630 |
+
markdown_content = "".join(md_parts)
|
| 631 |
pages_processed = len(pages_seen)
|
| 632 |
|
| 633 |
+
total_time = pass1_time + vlm_time
|
| 634 |
+
logger.info(
|
| 635 |
+
f"[{request_id}] Hybrid conversion complete: {pages_processed} pages, "
|
| 636 |
+
f"{total_tables} tables, {total_time:.2f}s total"
|
| 637 |
+
)
|
| 638 |
+
if pages_processed > 0:
|
| 639 |
+
logger.info(f"[{request_id}] Speed: {pages_processed / total_time:.2f} pages/sec")
|
| 640 |
|
| 641 |
return markdown_content, None, pages_processed, image_count
|
| 642 |
|
| 643 |
|
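The `_merge_vlm_with_tables` helper called in the merge step above is not shown in this hunk. As a rough illustration of what such a merge could look like, here is a minimal sketch: the function name, signature, and the header-based deduplication heuristic are assumptions for illustration, not the committed implementation.

```python
def merge_vlm_with_tables(vlm_text: str, page_tables: list[str]) -> str:
    """Append TableFormer markdown tables after a page's VLM text,
    skipping tables whose header row already appears verbatim in the text."""
    parts = [vlm_text.rstrip()]
    for table_md in page_tables:
        lines = table_md.strip().splitlines()
        header = lines[0] if lines else ""
        if header and header in vlm_text:
            continue  # the VLM already transcribed this table
        parts.append(table_md.strip())
    return "\n\n".join(parts) + "\n"
```

A real merge would likely also anchor each table at its position in the page rather than appending at the end.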
+# ---------------------------------------------------------------------------
+# Images Zip Helper
+# ---------------------------------------------------------------------------
+
+
def _create_images_zip(output_dir: Path) -> tuple[Optional[str], int]:
    """Create a zip file from extracted images."""
    image_dir = output_dir / "images"
@@ lines 652-670 unchanged @@
    return base64.b64encode(zip_buffer.getvalue()).decode("utf-8"), image_count


+# ---------------------------------------------------------------------------
+# Application Lifespan
+# ---------------------------------------------------------------------------
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Startup: initialize Docling converter and check vLLM."""
+    logger.info("=" * 60)
+    logger.info("Starting Docling VLM Parser API v2.0.0...")
+
+    device = _get_device()
+    logger.info(f"Device: {device}")
+
+    if device == "cuda":
+        logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
+        logger.info(f"CUDA Version: {torch.version.cuda}")
+        logger.info(
+            f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB"
+        )
+
+    logger.info(f"VLM Model: {VLM_MODEL}")
+    logger.info(f"VLM Endpoint: http://{VLM_HOST}:{VLM_PORT}")
+    logger.info(f"Images scale: {IMAGES_SCALE}")
+    logger.info(f"Max file size: {MAX_FILE_SIZE_MB}MB")
+
+    # Verify vLLM is running
+    logger.info("Checking vLLM server...")
+    try:
+        async with httpx.AsyncClient(timeout=10) as client:
+            resp = await client.get(f"http://{VLM_HOST}:{VLM_PORT}/health")
+            resp.raise_for_status()
+        logger.info("vLLM server is healthy")
+    except Exception as e:
+        logger.error(f"vLLM server not available: {e}")
+        raise RuntimeError(f"vLLM server not available at {VLM_HOST}:{VLM_PORT}")
+
+    # Pre-initialize Docling converter
+    logger.info("Pre-loading Docling models (DocLayNet + TableFormer + RapidOCR)...")
+    try:
+        _get_converter()
+        logger.info("Docling models loaded successfully")
+    except Exception as e:
+        logger.warning(f"Failed to pre-load Docling models: {e}")
+
+    logger.info("=" * 60)
+    logger.info("Docling VLM Parser API ready (Hybrid: TableFormer + Qwen3-VL)")
+    logger.info("=" * 60)
+    yield
+    logger.info("Shutting down Docling VLM Parser API...")
+
+
+# ---------------------------------------------------------------------------
+# FastAPI App
+# ---------------------------------------------------------------------------
+
+app = FastAPI(
+    title="Docling VLM Parser API",
+    description="Hybrid document parser: TableFormer tables + Qwen3-VL OCR via vLLM",
+    version="2.0.0",
+    lifespan=lifespan,
+)
+
+
+# ---------------------------------------------------------------------------
+# Endpoints
+# ---------------------------------------------------------------------------
+
+
@app.get("/", response_model=HealthResponse)
async def health_check() -> HealthResponse:
    """Health check endpoint."""
@@ lines 746-747 unchanged @@
    if device == "cuda":
        gpu_name = torch.cuda.get_device_name(0)

+    # Check vLLM status (async to avoid blocking the event loop)
+    vlm_status = "unknown"
+    try:
+        async with httpx.AsyncClient(timeout=5) as client:
+            resp = await client.get(f"http://{VLM_HOST}:{VLM_PORT}/health")
+            vlm_status = "healthy" if resp.status_code == 200 else "unhealthy"
+    except Exception:
+        vlm_status = "unreachable"
+
    return HealthResponse(
        status="healthy",
+        version="2.0.0",
        device=device,
+        gpu_name=None,  # don't leak GPU details on an unauthenticated endpoint
+        vlm_model="active",  # confirm a VLM is configured without leaking the model name
+        vlm_status=vlm_status,
        images_scale=IMAGES_SCALE,
    )
@@ lines 769-771 unchanged @@
async def parse_document(
    file: UploadFile = File(..., description="PDF or image file to parse"),
    output_format: str = Form(default="markdown", description="Output format: markdown or json"),
+    images_scale: Optional[float] = Form(default=None, description="Image resolution scale (default: 2.0)"),
    start_page: int = Form(default=0, description="Starting page (0-indexed)"),
    end_page: Optional[int] = Form(default=None, description="Ending page (None = all pages)"),
    include_images: bool = Form(default=False, description="Include extracted images in response"),
@@ lines 779-780 unchanged @@
    """
    Parse a document file (PDF or image) and return extracted content.

+    Uses a hybrid two-pass approach:
+    Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)
+    Pass 2: Qwen3-VL via vLLM for enhanced text recognition
+    Merge: TableFormer tables preserved, VLM text replaces RapidOCR text
+
    Supports:
    - PDF files (.pdf)
    - Images (.png, .jpg, .jpeg, .tiff, .bmp)
@@ lines 792-794 unchanged @@

    logger.info(f"[{request_id}] {'='*50}")
    logger.info(f"[{request_id}] New parse request received")
+    safe_filename = re.sub(r'[\r\n\t\x00-\x1f\x7f]', '_', file.filename or "")[:255]
+    logger.info(f"[{request_id}] Filename: {safe_filename}")
    logger.info(f"[{request_id}] Output format: {output_format}")

+    if output_format not in ("markdown",):
+        raise HTTPException(
+            status_code=400,
+            detail="Only 'markdown' output_format is supported in v2.0.0",
+        )
+
    # Validate file size
    file.file.seek(0, 2)
    file_size = file.file.tell()
@@ lines 811-830 unchanged @@
    )

    # Use defaults if not specified
+    use_images_scale = images_scale if images_scale is not None else IMAGES_SCALE

+    logger.info(f"[{request_id}] Images scale: {use_images_scale}, VLM: {VLM_MODEL}")
    logger.info(f"[{request_id}] Page range: {start_page} to {end_page or 'end'}")

    temp_dir = tempfile.mkdtemp()
+    logger.debug(f"[{request_id}] Created temp directory: {temp_dir}")

    try:
        # Save uploaded file
        input_path = Path(temp_dir) / f"input{file_ext}"
        await asyncio.to_thread(_save_uploaded_file, input_path, file.file)
+        logger.debug(f"[{request_id}] Saved file to: {input_path}")

        # Create output directory
        output_dir = Path(temp_dir) / "output"
        output_dir.mkdir(exist_ok=True)

+        # Convert document (hybrid two-pass)
        markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
            _convert_document,
            input_path,
            output_dir,
            use_images_scale,
            include_images,
            request_id,
            start_page,
            end_page,
        )

        # Create images zip if requested
@@ lines 865-883 unchanged @@
            image_count=image_count,
            pages_processed=pages_processed,
            device_used=_get_device(),
+            vlm_model=VLM_MODEL,
        )

    except Exception as e:
        total_duration = time.time() - start_time
        logger.error(f"[{request_id}] {'='*50}")
        logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}", exc_info=True)
        logger.error(f"[{request_id}] {'='*50}")
        return ParseResponse(
            success=False,
+            error=f"Processing failed (ref: {request_id})",
        )
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.debug(f"[{request_id}] Cleaned up temp directory")


@app.post("/parse/url", response_model=ParseResponse)
@@ lines 906-909 unchanged @@
    """
    Parse a document from a URL.

+    Downloads the file and processes it through the hybrid two-pass pipeline:
+    Pass 1: Docling Standard Pipeline (DocLayNet + TableFormer + RapidOCR)
+    Pass 2: Qwen3-VL via vLLM for enhanced text recognition
+    Merge: TableFormer tables preserved, VLM text replaces RapidOCR text
    """
    request_id = str(uuid4())[:8]
    start_time = time.time()
@@ lines 920-922 unchanged @@
    logger.info(f"[{request_id}] URL: {request.url}")
    logger.info(f"[{request_id}] Output format: {request.output_format}")

+    if request.output_format not in ("markdown",):
+        raise HTTPException(
+            status_code=400,
+            detail="Only 'markdown' output_format is supported in v2.0.0",
+        )
+
    # Validate URL
    logger.info(f"[{request_id}] Validating URL...")
    _validate_url(request.url)
    logger.info(f"[{request_id}] URL validation passed")

    temp_dir = tempfile.mkdtemp()
+    logger.debug(f"[{request_id}] Created temp directory: {temp_dir}")

    try:
        # Download file
        logger.info(f"[{request_id}] Downloading file from URL...")
        download_start = time.time()
+        async with httpx.AsyncClient(timeout=60.0, follow_redirects=False) as client:
            response = await client.get(request.url)
            response.raise_for_status()
        download_duration = time.time() - download_start
@@ lines 948-962 unchanged @@
        )

        if len(response.content) > MAX_FILE_SIZE_BYTES:
+            logger.error(
+                f"[{request_id}] File too large: {file_size_mb:.2f} MB > {MAX_FILE_SIZE_MB} MB"
+            )
            raise HTTPException(
                status_code=413,
                detail=f"File size exceeds maximum allowed size of {MAX_FILE_SIZE_MB}MB",
@@ lines 972-973 unchanged @@
        # Save downloaded file
        input_path = Path(temp_dir) / f"input{file_ext}"
        await asyncio.to_thread(_save_downloaded_content, input_path, response.content)
+        logger.debug(f"[{request_id}] Saved file to: {input_path}")

        # Create output directory
        output_dir = Path(temp_dir) / "output"
        output_dir.mkdir(exist_ok=True)

        # Use defaults if not specified
+        use_images_scale = request.images_scale if request.images_scale is not None else IMAGES_SCALE

+        logger.info(f"[{request_id}] Images scale: {use_images_scale}, VLM: {VLM_MODEL}")
        logger.info(f"[{request_id}] Page range: {request.start_page} to {request.end_page or 'end'}")

+        # Convert document (hybrid two-pass)
        markdown_content, json_content, pages_processed, image_count = await asyncio.to_thread(
            _convert_document,
            input_path,
            output_dir,
            use_images_scale,
            request.include_images,
            request_id,
            request.start_page,
            request.end_page,
        )

        # Create images zip if requested
@@ lines 1002-1020 unchanged @@
            image_count=image_count,
            pages_processed=pages_processed,
            device_used=_get_device(),
+            vlm_model=VLM_MODEL,
        )

    except httpx.HTTPError as e:
@@ line 1028 unchanged @@
        logger.error(f"[{request_id}] Download failed after {total_duration:.2f}s: {str(e)}")
        return ParseResponse(
            success=False,
+            error=f"Failed to download file from URL (ref: {request_id})",
        )
    except Exception as e:
        total_duration = time.time() - start_time
        logger.error(f"[{request_id}] {'='*50}")
        logger.error(f"[{request_id}] Request failed after {total_duration:.2f}s")
+        logger.error(f"[{request_id}] Error: {type(e).__name__}: {str(e)}", exc_info=True)
        logger.error(f"[{request_id}] {'='*50}")
        return ParseResponse(
            success=False,
+            error=f"Processing failed (ref: {request_id})",
        )
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
+        logger.debug(f"[{request_id}] Cleaned up temp directory")


if __name__ == "__main__":
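The `_vlm_ocr_page` helper called in Pass 2 is not shown in this diff. Since the app talks to vLLM's OpenAI-compatible server, its request is presumably a chat completion carrying one inline base64 page image. A sketch of what building that payload could look like (the function name, prompt wording, and sampling parameters are assumptions for illustration):

```python
import base64


def build_vlm_ocr_request(page_png: bytes, model: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload that asks a
    vision-language model to transcribe one page image to markdown."""
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Inline data URL: no shared filesystem needed between
                # the FastAPI process and the vLLM server.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Transcribe this page to markdown. Preserve reading order."},
            ],
        }],
        "max_tokens": 4096,
        "temperature": 0.0,  # deterministic transcription
    }
```

The payload would then be POSTed to `http://{VLM_HOST}:{VLM_PORT}/v1/chat/completions` and the text taken from the first choice's message content.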
requirements.txt
CHANGED
@@ -1,7 +1,7 @@
-# Docling
-# Optimized for HuggingFace Spaces
+# Docling VLM Parser API Dependencies
+# Optimized for HuggingFace Spaces with vLLM + Qwen3-VL-8B

-# Docling - IBM's document parsing library
+# Docling - IBM's document parsing library (VLM pipeline support)
 docling>=2.15.0

 # Web framework
@@ -11,8 +11,17 @@ uvicorn[standard]>=0.32.0
 # File upload handling
 python-multipart>=0.0.9

-# HTTP client for URL parsing
+# HTTP client for URL parsing and vLLM health checks
 httpx>=0.27.0

 # Type checking
 pydantic>=2.0.0
+
+# Image preprocessing for degraded documents
+opencv-python-headless>=4.10.0
+
+# PDF to image conversion for VLM OCR pass
+pdf2image>=1.17.0
+
+# HuggingFace Hub for model downloads
+huggingface-hub>=0.25.0
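The new `opencv-python-headless` dependency is pulled in for preprocessing degraded scans before OCR, where binarization is the classic first step. The committed code presumably uses `cv2.threshold` with `cv2.THRESH_OTSU`; as a dependency-free illustration of what that computes, here is Otsu's method in plain Python:

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the Otsu threshold for 8-bit grayscale pixels: the level
    that maximizes the between-class variance of background vs foreground."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    sum_bg = 0.0
    weight_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # everything is background from here on
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor)
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

Pixels at or below the returned threshold are treated as ink, the rest as paper; OpenCV's implementation does the same computation over a `uint8` image.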
start.sh
ADDED
@@ -0,0 +1,71 @@
+#!/usr/bin/env bash
+set -e
+
+# ── Configuration ────────────────────────────────────────────────────────────
+VLLM_MODEL="/home/user/.cache/huggingface/Qwen3-VL-8B-Instruct"
+VLLM_HOST="127.0.0.1"
+VLLM_PORT="8000"
+HEALTH_URL="http://${VLLM_HOST}:${VLLM_PORT}/health"
+POLL_INTERVAL=5
+MAX_WAIT=600
+
+# ── Start vLLM server in background ──────────────────────────────────────────
+echo "[startup] Starting vLLM server with model: ${VLLM_MODEL}"
+
+# Process substitution keeps vLLM itself as the background job, so $! below
+# captures the vLLM PID (with `... | sed ... &`, $! would be the sed PID and
+# the liveness checks and kill calls would target the wrong process).
+python -m vllm.entrypoints.openai.api_server \
+    --model "${VLLM_MODEL}" \
+    --host "${VLLM_HOST}" \
+    --port "${VLLM_PORT}" \
+    --max-num-seqs 16 \
+    --max-model-len 8192 \
+    --gpu-memory-utilization 0.70 \
+    --dtype auto \
+    --trust-remote-code \
+    --limit-mm-per-prompt image=1 \
+    > >(sed 's/^/[vLLM] /') 2>&1 &
+
+VLLM_PID=$!
+echo "[startup] vLLM server started with PID ${VLLM_PID}"
+
+# ── Poll vLLM health endpoint until ready ────────────────────────────────────
+echo "[startup] Waiting for vLLM to become healthy (polling every ${POLL_INTERVAL}s, timeout ${MAX_WAIT}s)..."
+
+elapsed=0
+while [ "${elapsed}" -lt "${MAX_WAIT}" ]; do
+    # Check if the vLLM process is still alive
+    if ! kill -0 "${VLLM_PID}" 2>/dev/null; then
+        echo "[startup] ERROR: vLLM process (PID ${VLLM_PID}) died during startup"
+        exit 1
+    fi
+
+    if curl -sf "${HEALTH_URL}" > /dev/null 2>&1; then
+        echo "[startup] vLLM is healthy after ${elapsed}s"
+        break
+    fi
+
+    sleep "${POLL_INTERVAL}"
+    elapsed=$((elapsed + POLL_INTERVAL))
+done
+
+if [ "${elapsed}" -ge "${MAX_WAIT}" ]; then
+    echo "[startup] ERROR: vLLM did not become healthy within ${MAX_WAIT}s"
+    echo "[startup] Killing vLLM process (PID ${VLLM_PID})"
+    kill "${VLLM_PID}" 2>/dev/null || true
+    exit 1
+fi
+
+# ── Start FastAPI with vLLM cleanup on exit ──────────────────────────────────
+_cleanup() {
+    echo "[startup] Shutting down vLLM (PID ${VLLM_PID})"
+    kill "${VLLM_PID}" 2>/dev/null
+    wait "${VLLM_PID}" 2>/dev/null
+}
+trap _cleanup EXIT TERM INT
+
+echo "[startup] Starting FastAPI server on 0.0.0.0:7860"
+
+python -m uvicorn app:app \
+    --host 0.0.0.0 \
+    --port 7860 \
+    --workers 1 \
+    --timeout-keep-alive 300
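The health-polling loop in `start.sh` is also useful outside the container, e.g. in integration tests that wait for the Space to come up. A Python equivalent using only the standard library (the function name and defaults mirror the script; this is an illustrative port, not part of the commit):

```python
import time
import urllib.request


def wait_for_health(url: str, timeout_s: float = 600, interval_s: float = 5) -> bool:
    """Poll a health URL until it returns HTTP 200 or the timeout elapses.

    Returns True as soon as the endpoint answers 200, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not up yet (connection refused, reset, or timeout)
        time.sleep(interval_s)
    return False
```

Unlike the shell loop, this version cannot detect that the server process died; a caller supervising a subprocess would additionally check `proc.poll()` between attempts.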