# Getting Started
Docling Studio ships two Docker image variants:
| Variant | Image tag | Size | Description |
|---------|-----------|------|-------------|
| **remote** | `latest-remote` | ~270 MB | Lightweight – delegates to an external [Docling Serve](https://github.com/DS4SD/docling-serve) instance |
| **local** | `latest-local` | ~1.9 GB | Full – runs Docling in-process, CPU-only (downloads ML models on first run) |
## Docker – remote mode (fastest)
```bash
docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote
```
## Docker – local mode (self-contained)
```bash
docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local
```
> **Note:** The first analysis takes longer as Docling downloads its ML models (~400 MB). Subsequent runs are fast.
Open [http://localhost:3000](http://localhost:3000).
## Docker Compose (recommended for development)
```bash
git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio
# Local mode (default)
docker compose up --build
# Remote mode
CONVERSION_MODE=remote DOCLING_SERVE_URL=http://your-docling-serve:5001 docker compose up --build
```
## Local Development
=== "Backend (Python 3.12+)"

    ```bash
    cd document-parser
    python -m venv .venv && source .venv/bin/activate

    # Remote mode (lightweight)
    pip install -r requirements.txt

    # Local mode (with Docling)
    pip install -r requirements-local.txt

    uvicorn main:app --reload --port 8000
    ```
=== "Frontend (Node 20+)"

    ```bash
    cd frontend
    npm install
    npm run dev
    ```
The frontend runs on `http://localhost:3000` and proxies API calls to `http://localhost:8000`.
## Running Tests
=== "Backend"

    ```bash
    cd document-parser
    pip install pytest pytest-asyncio httpx
    pytest tests/ -v
    ```
=== "Frontend"

    ```bash
    cd frontend
    npm run test:run
    ```
## Pipeline Options
These options map directly to Docling's [`PdfPipelineOptions`](https://docling-project.github.io/docling/usage/).
| Option | Default | Description |
|--------|---------|-------------|
| `do_ocr` | `true` | OCR for scanned pages and embedded images |
| `do_table_structure` | `true` | Table detection and row/column reconstruction |
| `table_mode` | `accurate` | `accurate` (TableFormer) or `fast` |
| `do_code_enrichment` | `false` | Specialized OCR for code blocks |
| `do_formula_enrichment` | `false` | Math formula recognition (LaTeX output) |
| `do_picture_classification` | `false` | Classify images by type |
| `do_picture_description` | `false` | Generate image descriptions via VLM |
| `generate_picture_images` | `false` | Extract detected images as separate files |
| `generate_page_images` | `false` | Rasterize each page as an image |
| `images_scale` | `1.0` | Scale factor for generated images (0.1β10) |
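As an illustration, the options above can be assembled into a JSON payload before submitting an analysis. The payload shape below is a sketch built from the table, not the actual Docling Studio request format:

```python
import json

# Illustrative payload: option names and defaults come from the table above,
# but the overall request shape is an assumption, not the real API schema.
pipeline_options = {
    "do_ocr": True,               # OCR for scanned pages and embedded images
    "do_table_structure": True,   # table detection and reconstruction
    "table_mode": "accurate",     # "accurate" (TableFormer) or "fast"
    "do_formula_enrichment": False,
    "generate_page_images": False,
    "images_scale": 1.0,          # valid range: 0.1-10
}

# Reject out-of-range scale values before sending anything.
if not 0.1 <= pipeline_options["images_scale"] <= 10:
    raise ValueError("images_scale must be between 0.1 and 10")

payload = json.dumps(pipeline_options)
print(payload)
```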
## Chunking Options
!!! note
    Chunking is only available in **local** mode. The chunking UI is hidden when using remote mode (Docling Serve).
After a document is analyzed, you can split the extracted content into semantic chunks. Chunking can be configured at analysis time or re-run later with different options via the **rechunk** action.
| Option | Default | Description |
|--------|---------|-------------|
| `chunker_type` | `hybrid` | `hybrid` (semantic + structural), `hierarchical` (heading-based), or `page` (one chunk per page) |
| `max_tokens` | `512` | Maximum tokens per chunk |
| `merge_peers` | `true` | Merge sibling elements under the same heading |
| `repeat_table_header` | `true` | Repeat table headers when a table is split across chunks |
Each chunk includes:
- **text** – the chunk content
- **headings** – heading hierarchy leading to the chunk
- **source_page** – the page number the chunk originates from
- **token_count** – number of tokens in the chunk
- **bboxes** – bounding boxes of the chunk's source elements (page + coordinates)
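A minimal sketch of what one chunk record might look like, assuming only the field names listed above (the class and the sample values are illustrative, not part of the Docling Studio codebase):

```python
from dataclasses import dataclass, field

# Illustrative only: mirrors the documented chunk fields.
@dataclass
class Chunk:
    text: str                # the chunk content
    headings: list[str]      # heading hierarchy leading to the chunk
    source_page: int         # page the chunk originates from
    token_count: int         # number of tokens in the chunk
    bboxes: list[dict] = field(default_factory=list)  # page + coordinates

chunk = Chunk(
    text="Docling extracts structured content from PDFs.",
    headings=["Getting Started", "Chunking Options"],
    source_page=3,
    token_count=8,
    bboxes=[{"page": 3, "l": 72.0, "t": 540.0, "r": 520.0, "b": 560.0}],
)
print(chunk.source_page)
```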
## Configuration
All configuration is done via environment variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `CONVERSION_ENGINE` | `local` | `local` (in-process Docling) or `remote` (Docling Serve) |
| `DOCLING_SERVE_URL` | `http://localhost:5001` | Docling Serve endpoint (remote mode only) |
| `DOCLING_SERVE_API_KEY` | – | API key for Docling Serve (optional) |
| `CORS_ORIGINS` | `http://localhost:3000,...` | CORS allowed origins |
| `UPLOAD_DIR` | `./uploads` | File storage directory |
| `DB_PATH` | `./data/docling_studio.db` | SQLite database path |
| `CONVERSION_TIMEOUT` | `600` | Max seconds per Docling conversion |
| `MAX_CONCURRENT_ANALYSES` | `3` | Maximum parallel analysis jobs |
| `DEPLOYMENT_MODE` | `self-hosted` | `self-hosted` or `huggingface` (shows disclaimer banner) |
| `APP_VERSION` | `dev` | Application version (set automatically by CI/Docker) |
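The backend presumably reads these variables at startup with the defaults from the table; a minimal sketch of that pattern (the `config` dict and key names are illustrative, not the actual settings code):

```python
import os

# Illustrative sketch: read the documented variables with their table defaults.
config = {
    "conversion_engine": os.getenv("CONVERSION_ENGINE", "local"),
    "docling_serve_url": os.getenv("DOCLING_SERVE_URL", "http://localhost:5001"),
    "upload_dir": os.getenv("UPLOAD_DIR", "./uploads"),
    "db_path": os.getenv("DB_PATH", "./data/docling_studio.db"),
    # Numeric settings arrive as strings and must be converted explicitly.
    "conversion_timeout": int(os.getenv("CONVERSION_TIMEOUT", "600")),
    "max_concurrent_analyses": int(os.getenv("MAX_CONCURRENT_ANALYSES", "3")),
}
print(config["conversion_engine"])
```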
## System Requirements
| | Remote image | Local image |
|---|---|---|
| **Image size** | ~270 MB | ~1.9 GB |
| **Memory** | 2 GB | 6 GB (recommended 8 GB+) |
| **CPUs** | 2 | 4 (recommended 8+) |
All Docker images are multi-arch (`linux/amd64` + `linux/arm64`). No GPU required.