# Getting Started

Docling Studio ships two Docker image variants:

| Variant | Image tag | Size | Description |
|---------|-----------|------|-------------|
| **remote** | `latest-remote` | ~270 MB | Lightweight: delegates to an external [Docling Serve](https://github.com/DS4SD/docling-serve) instance |
| **local** | `latest-local` | ~1.9 GB | Full: runs Docling in-process, CPU-only (downloads ML models on first run) |

## Docker: remote mode (fastest)

```bash
docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote
```

## Docker: local mode (self-contained)

```bash
docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local
```

> **Note:** The first analysis takes longer as Docling downloads its ML models (~400 MB). Subsequent runs are fast.

Open [http://localhost:3000](http://localhost:3000).

## Docker Compose (recommended for development)

```bash
git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio

# Local mode (default)
docker compose up --build

# Remote mode
CONVERSION_MODE=remote DOCLING_SERVE_URL=http://your-docling-serve:5001 docker compose up --build
```

## Local Development

=== "Backend (Python 3.12+)"

    ```bash
    cd document-parser
    python -m venv .venv && source .venv/bin/activate

    # Remote mode (lightweight)
    pip install -r requirements.txt

    # Local mode (with Docling)
    pip install -r requirements-local.txt

    uvicorn main:app --reload --port 8000
    ```

=== "Frontend (Node 20+)"

    ```bash
    cd frontend
    npm install
    npm run dev
    ```

The frontend runs on `http://localhost:3000` and proxies API calls to `http://localhost:8000`.

## Running Tests

=== "Backend"

    ```bash
    cd document-parser
    pip install pytest pytest-asyncio httpx
    pytest tests/ -v
    ```

=== "Frontend"

    ```bash
    cd frontend
    npm run test:run
    ```

## Pipeline Options

These options map directly to Docling's [`PdfPipelineOptions`](https://docling-project.github.io/docling/usage/).

| Option | Default | Description |
|--------|---------|-------------|
| `do_ocr` | `true` | OCR for scanned pages and embedded images |
| `do_table_structure` | `true` | Table detection and row/column reconstruction |
| `table_mode` | `accurate` | `accurate` (TableFormer) or `fast` |
| `do_code_enrichment` | `false` | Specialized OCR for code blocks |
| `do_formula_enrichment` | `false` | Math formula recognition (LaTeX output) |
| `do_picture_classification` | `false` | Classify images by type |
| `do_picture_description` | `false` | Generate image descriptions via VLM |
| `generate_picture_images` | `false` | Extract detected images as separate files |
| `generate_page_images` | `false` | Rasterize each page as an image |
| `images_scale` | `1.0` | Scale factor for generated images (0.1 to 10) |

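
As a rough illustration of how these defaults and the documented value ranges might be represented and checked in client code, here is a minimal sketch. The `PipelineOptions` class and its `validate` method are hypothetical names invented for this example; they are not part of Docling Studio or Docling. Only the field names, the defaults, and the 0.1 to 10 range come from the table above.

```python
from dataclasses import asdict, dataclass


@dataclass
class PipelineOptions:
    """Hypothetical container mirroring the option table (not a Docling class)."""

    do_ocr: bool = True
    do_table_structure: bool = True
    table_mode: str = "accurate"  # "accurate" or "fast"
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    do_picture_classification: bool = False
    do_picture_description: bool = False
    generate_picture_images: bool = False
    generate_page_images: bool = False
    images_scale: float = 1.0

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented choices."""
        if self.table_mode not in ("accurate", "fast"):
            raise ValueError(f"table_mode must be 'accurate' or 'fast', got {self.table_mode!r}")
        if not 0.1 <= self.images_scale <= 10:
            raise ValueError(f"images_scale must be between 0.1 and 10, got {self.images_scale}")


# Example: a quicker run that skips OCR and rasterizes pages at 2x scale.
options = PipelineOptions(
    do_ocr=False,
    table_mode="fast",
    generate_page_images=True,
    images_scale=2.0,
)
options.validate()
payload = asdict(options)  # plain dict, ready for JSON serialization
```

Keeping the defaults in one place like this makes it easy to send only deliberate overrides and to reject out-of-range values before a long conversion starts.
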
## Chunking Options

!!! note
    Chunking is only available in **local** mode. The chunking UI is hidden when using remote mode (Docling Serve).

After a document is analyzed, you can split the extracted content into semantic chunks. Chunking can be configured at analysis time or re-run later with different options via the **rechunk** action.

| Option | Default | Description |
|--------|---------|-------------|
| `chunker_type` | `hybrid` | `hybrid` (semantic + structural), `hierarchical` (heading-based), or `page` (one chunk per page) |
| `max_tokens` | `512` | Maximum tokens per chunk |
| `merge_peers` | `true` | Merge sibling elements under the same heading |
| `repeat_table_header` | `true` | Repeat table headers when a table is split across chunks |

Each chunk includes:

- **text**: the chunk content
- **headings**: heading hierarchy leading to the chunk
- **source_page**: the page number the chunk originates from
- **token_count**: number of tokens in the chunk
- **bboxes**: bounding boxes of the chunk's source elements (page + coordinates)

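
To make the chunk shape concrete, the sketch below models the five fields as Python dataclasses and shows two small helpers a consumer might write. The class names `Chunk` and `BBox`, the helper functions, and the left/top/right/bottom coordinate layout are assumptions made for illustration; the source only states that each bounding box carries a page and coordinates.

```python
from dataclasses import dataclass, field


@dataclass
class BBox:
    """Hypothetical bounding box: page number plus corner coordinates."""

    page: int
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class Chunk:
    """Hypothetical record holding the five fields listed above."""

    text: str
    headings: list[str]
    source_page: int
    token_count: int
    bboxes: list[BBox] = field(default_factory=list)


def breadcrumb(chunk: Chunk) -> str:
    """Join the heading hierarchy into a single context line."""
    return " > ".join(chunk.headings)


def oversized(chunks: list[Chunk], max_tokens: int = 512) -> list[Chunk]:
    """Return the chunks whose token count exceeds the given limit."""
    return [c for c in chunks if c.token_count > max_tokens]


chunks = [
    Chunk(
        text="Install with Docker.",
        headings=["Getting Started", "Docker"],
        source_page=1,
        token_count=5,
        bboxes=[BBox(page=1, left=72.0, top=100.0, right=520.0, bottom=130.0)],
    ),
    Chunk(
        text="A very long section ...",
        headings=["Getting Started", "Configuration"],
        source_page=3,
        token_count=600,
    ),
]

print(breadcrumb(chunks[0]))                       # Getting Started > Docker
print([c.source_page for c in oversized(chunks)])  # [3]
```

A check like `oversized` can help decide when to re-run chunking with a different `max_tokens` value through the rechunk action.
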
## Configuration

All configuration is done via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `CONVERSION_ENGINE` | `local` | `local` (in-process Docling) or `remote` (Docling Serve) |
| `DOCLING_SERVE_URL` | `http://localhost:5001` | Docling Serve endpoint (remote mode only) |
| `DOCLING_SERVE_API_KEY` | (none) | API key for Docling Serve (optional) |
| `CORS_ORIGINS` | `http://localhost:3000,...` | CORS allowed origins |
| `UPLOAD_DIR` | `./uploads` | File storage directory |
| `DB_PATH` | `./data/docling_studio.db` | SQLite database path |
| `CONVERSION_TIMEOUT` | `600` | Max seconds per Docling conversion |
| `MAX_CONCURRENT_ANALYSES` | `3` | Maximum parallel analysis jobs |
| `DEPLOYMENT_MODE` | `self-hosted` | `self-hosted` or `huggingface` (shows disclaimer banner) |
| `APP_VERSION` | `dev` | Application version (set automatically by CI/Docker) |

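
The following sketch shows one way environment variables like these could be read with their documented defaults. It is not the backend's actual loader: `Settings` and `load_settings` are hypothetical names, only a subset of the variables is covered, and `CORS_ORIGINS` is left out because its default is elided in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object for a subset of the variables above."""

    conversion_engine: str
    docling_serve_url: str
    docling_serve_api_key: Optional[str]
    upload_dir: str
    db_path: str
    conversion_timeout: int
    max_concurrent_analyses: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default), applying table defaults."""
    env = os.environ if env is None else env

    engine = env.get("CONVERSION_ENGINE", "local")
    if engine not in ("local", "remote"):
        raise ValueError(f"CONVERSION_ENGINE must be 'local' or 'remote', got {engine!r}")

    return Settings(
        conversion_engine=engine,
        docling_serve_url=env.get("DOCLING_SERVE_URL", "http://localhost:5001"),
        docling_serve_api_key=env.get("DOCLING_SERVE_API_KEY"),  # optional, no default
        upload_dir=env.get("UPLOAD_DIR", "./uploads"),
        db_path=env.get("DB_PATH", "./data/docling_studio.db"),
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "600")),
        max_concurrent_analyses=int(env.get("MAX_CONCURRENT_ANALYSES", "3")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = load_settings({"CONVERSION_ENGINE": "remote", "CONVERSION_TIMEOUT": "120"})
print(settings.conversion_engine, settings.conversion_timeout)  # remote 120
```

Accepting a mapping argument instead of reading `os.environ` directly makes the loader easy to exercise in tests without mutating the process environment.
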
## System Requirements

| | Remote image | Local image |
|---|---|---|
| **Image size** | ~270 MB | ~1.9 GB |
| **Memory** | 2 GB | 6 GB (recommended 8 GB+) |
| **CPUs** | 2 | 4 (recommended 8+) |

All Docker images are multi-arch (`linux/amd64` + `linux/arm64`). No GPU required.