
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Docling Parser (v2.0.0) - A Hugging Face Spaces API service using a hybrid two-pass VLM architecture for PDF/document parsing. Pass 1 runs Docling's standard pipeline (DocLayNet layout + TableFormer ACCURATE + RapidOCR baseline). Pass 2 sends full page images to Qwen3-VL-30B-A3B via vLLM for enhanced text recognition. The merge step preserves TableFormer tables while replacing RapidOCR text with VLM output. Includes OpenCV preprocessing (denoise, CLAHE contrast enhancement). API endpoints are protected by Bearer token authentication.
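The merge step can be sketched roughly as follows. This is a hypothetical illustration with invented element shapes (`kind`, `page`, `content` keys) — the real structures in app.py will differ:

```python
# Sketch of the Pass 1 / Pass 2 merge (hypothetical structures, not the real app.py code).
# Each Pass 1 element is a dict with a "kind" ("table" or "text"), a "page", and "content".

def merge_passes(docling_elements, vlm_text_by_page):
    """Keep TableFormer tables from Pass 1; replace RapidOCR text with VLM output."""
    merged = []
    for el in docling_elements:
        if el["kind"] == "table":
            merged.append(el)  # preserve TableFormer output verbatim
        elif el["kind"] == "text":
            # Substitute the Pass 2 VLM transcription for this page, if available
            vlm = vlm_text_by_page.get(el["page"])
            merged.append({**el, "content": vlm if vlm else el["content"]})
        else:
            merged.append(el)
    return merged
```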

Architecture

hf_docling_parser/
├── app.py              # FastAPI + hybrid two-pass parsing (v2.0.0)
├── start.sh            # Startup script (vLLM + FastAPI dual-process)
├── Dockerfile          # vLLM base image, Qwen3-VL pre-downloaded
├── requirements.txt    # Python deps (docling, opencv, pdf2image, etc.)
├── README.md           # HF Spaces metadata
├── CLAUDE.md           # Claude Code development guide
└── .gitignore          # Git ignore patterns

Dual-process Docker architecture: start.sh launches vLLM on port 8000 (GPU model serving) and FastAPI on port 7860 (API). Base image: vllm/vllm-openai:v0.14.1.
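In a dual-process setup like this, the API process typically waits for the vLLM port to accept connections before serving traffic. A minimal readiness-poll sketch (hypothetical helper, not taken from start.sh or app.py):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 300.0) -> bool:
    """Poll a TCP port until it accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(1.0)
    return False

# e.g. wait_for_port("127.0.0.1", 8000) before accepting /parse requests
```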

Common Commands

# Build and test Docker locally (requires A100 GPU)
docker build --shm-size 32g -t hf-docling .
docker run --gpus all --shm-size 32g -p 7860:7860 -e API_TOKEN=test hf-docling

# Test the API locally
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Note: Local development without Docker is not practical; the hybrid pipeline requires vLLM serving Qwen3-VL-30B-A3B on an A100 GPU.

Deploying to Hugging Face Spaces

Push New Code

Since this lives in a monorepo, deploy by cloning the HF repo, copying files, and pushing:

# Clone HF Space repo to temp directory
git clone https://huggingface.co/spaces/outcomelabs/docling-parser /tmp/hf-docling-deploy

# Copy updated files from monorepo
cp apps/hf_docling_parser/{app.py,Dockerfile,README.md,requirements.txt,start.sh,.gitignore} /tmp/hf-docling-deploy/

# Commit and push
cd /tmp/hf-docling-deploy
git add -A
git commit -m "feat: description of changes"
git push

# Clean up
rm -rf /tmp/hf-docling-deploy

Requires HF CLI auth: huggingface-cli login (logged in as sidoutcome / org outcomelabs).

Settings (configure in HF web UI)

  • Hardware: Nvidia A100 Large 80GB ($2.50/hr); vLLM requires a GPU
  • Sleep time: 1 hour (auto-shutdown after 60 min idle)
  • Secrets: API_TOKEN (required for API authentication)
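The Bearer check in app.py is not reproduced here, but token comparisons of this kind are usually done in constant time. A sketch with a hypothetical `verify_token` helper:

```python
import hmac

def verify_token(authorization_header: str, expected_token: str) -> bool:
    """Return True iff the header is 'Bearer <token>' with the expected token.

    hmac.compare_digest avoids leaking how many characters matched via timing.
    """
    scheme, _, candidate = authorization_header.partition(" ")
    if scheme != "Bearer" or not candidate:
        return False
    return hmac.compare_digest(candidate, expected_token)

# e.g. verify_token(request.headers["Authorization"], os.environ["API_TOKEN"])
```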

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check and API info |
| `/parse` | POST | Parse an uploaded document (PDF/image) to markdown/JSON |
| `/parse/url` | POST | Parse a document fetched from a URL |
| `/docs` | GET | OpenAPI documentation (Swagger UI) |

Key Dependencies

  • docling: IBM's document parsing library with TableFormer
  • fastapi: API framework
  • python-multipart: File upload handling
  • uvicorn: ASGI server
  • httpx: HTTP client for URL parsing
  • pydantic: Request/response validation
  • opencv-python-headless: Image preprocessing (denoise, CLAHE)
  • pdf2image: PDF page to image conversion for VLM
  • huggingface-hub: Model/space utilities

Note: vLLM and PyTorch are provided by the base Docker image (vllm/vllm-openai:v0.14.1), not in requirements.txt.

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_TOKEN` | Required. Secret token for API authentication | - |
| `MAX_FILE_SIZE_MB` | Maximum upload file size in MB | `1024` |
| `IMAGES_SCALE` | Image resolution scale for page rendering | `2.0` |
| `VLM_MODEL` | VLM model for the text recognition pass | `Qwen/Qwen3-VL-30B-A3B-Instruct` |
| `VLM_HOST` | vLLM server host | `127.0.0.1` |
| `VLM_PORT` | vLLM server port | `8000` |
| `VLM_GPU_MEMORY_UTILIZATION` | Fraction of GPU memory allocated to vLLM | `0.85` |
| `VLM_MAX_MODEL_LEN` | Maximum sequence length for vLLM | `8192` |
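The defaults above are typically read once at startup. A minimal sketch (hypothetical `load_config` name and dict layout; app.py may structure this differently):

```python
import os

def load_config(env=os.environ):
    """Read the environment variables from the table above, applying defaults."""
    return {
        "api_token": env.get("API_TOKEN"),  # required; validate at startup
        "max_file_size_mb": int(env.get("MAX_FILE_SIZE_MB", "1024")),
        "images_scale": float(env.get("IMAGES_SCALE", "2.0")),
        "vlm_model": env.get("VLM_MODEL", "Qwen/Qwen3-VL-30B-A3B-Instruct"),
        "vlm_host": env.get("VLM_HOST", "127.0.0.1"),
        "vlm_port": int(env.get("VLM_PORT", "8000")),
        "vlm_gpu_memory_utilization": float(env.get("VLM_GPU_MEMORY_UTILIZATION", "0.85")),
        "vlm_max_model_len": int(env.get("VLM_MAX_MODEL_LEN", "8192")),
    }
```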

Testing

Bruno collection: Open docs/api-collections/ in Bruno for ready-to-use requests with local/production environments.

Note: Testing requires an A100 GPU with vLLM running. Use the Docker container for testing.

Test with curl

# Test /parse endpoint
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer test" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown"

# Test with images
curl -X POST "http://localhost:7860/parse" \
  -H "Authorization: Bearer test" \
  -F "file=@sample.pdf" \
  -F "output_format=markdown" \
  -F "include_images=true"

# Test /parse/url endpoint
curl -X POST "http://localhost:7860/parse/url" \
  -H "Authorization: Bearer test" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/pdf/2408.09869", "output_format": "markdown"}'

Test with Python

import httpx

API_URL = "http://localhost:7860"
API_TOKEN = "test"

with open("sample.pdf", "rb") as f:
    response = httpx.post(
        f"{API_URL}/parse",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": ("sample.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
    )
print(response.json())

Logging & Monitoring

The API provides comprehensive logging:

  • Request IDs: Each request gets a unique 8-char ID
  • Startup logs: Device info, GPU name, configuration
  • Request logs: File size, type, processing time, pages/sec
  • Docling output: Conversion progress and timing
  • Error tracking: Full exception details with context
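An 8-character request ID of the kind described above can be generated and attached to log lines like this (sketch only; the actual log format and logger names in app.py may differ):

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("docling-parser")

def new_request_id() -> str:
    """Unique 8-character hex ID, as in the per-request logs described above."""
    return uuid.uuid4().hex[:8]

request_id = new_request_id()
logger.info("[%s] parse start: file=sample.pdf size_mb=1.2", request_id)
```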

Comparison with MinerU

| Feature | Docling (Hybrid VLM) | MinerU |
|---------|----------------------|--------|
| Maintainer | IBM Research + Qwen3-VL | OpenDataLab |
| Table detection | TableFormer (built-in) | Multiple backends |
| OCR | Qwen3-VL-30B-A3B via vLLM | Built-in |
| VLM support | Hybrid (standard + VLM two-pass) | Hybrid backend |
| License | MIT | AGPL-3.0 |
| GPU memory | ~24 GB (vLLM + Docling) | ~6-10 GB (pipeline) |
| Primary use case | Enterprise documents | General PDF parsing |

Workflow Orchestration, Task Management & Core Principles

See root CLAUDE.md for full Workflow Orchestration (plan mode, subagents, self-improvement, verification, elegance, bug fixing), Task Management, and Core Principles. Files: <workspace-root>/tasks/todo.md, <workspace-root>/tasks/lessons.md.