# Getting Started

Docling Studio ships two Docker image variants:

| Variant | Image tag | Size | Description |
|---------|-----------|------|-------------|
| **remote** | `latest-remote` | ~270 MB | Lightweight: delegates to an external [Docling Serve](https://github.com/DS4SD/docling-serve) instance |
| **local** | `latest-local` | ~1.9 GB | Full: runs Docling in-process, CPU-only (downloads ML models on first run) |

![Docker architecture](images/docker.png){ width="600" }

## Docker: remote mode (fastest)

```bash
docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote
```

## Docker: local mode (self-contained)

```bash
docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local
```

> **Note:** The first analysis takes longer as Docling downloads its ML models (~400 MB). Subsequent runs are fast.

In either mode, open [http://localhost:3000](http://localhost:3000) once the container is running.

## Docker Compose (recommended for development)

```bash
git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio

# Local mode (default)
docker compose up --build

# Remote mode
CONVERSION_MODE=remote DOCLING_SERVE_URL=http://your-docling-serve:5001 docker compose up --build
```

## Local Development

=== "Backend (Python 3.12+)"

    ```bash
    cd document-parser
    python -m venv .venv && source .venv/bin/activate

    # Remote mode (lightweight)
    pip install -r requirements.txt

    # Local mode (with Docling)
    pip install -r requirements-local.txt

    uvicorn main:app --reload --port 8000
    ```

=== "Frontend (Node 20+)"

    ```bash
    cd frontend
    npm install
    npm run dev
    ```

The frontend runs on `http://localhost:3000` and proxies API calls to `http://localhost:8000`.

## Running Tests

=== "Backend"

    ```bash
    cd document-parser
    pip install pytest pytest-asyncio httpx
    pytest tests/ -v
    ```

=== "Frontend"

    ```bash
    cd frontend
    npm run test:run
    ```

## Pipeline Options

These options map directly to Docling's [`PdfPipelineOptions`](https://docling-project.github.io/docling/usage/).

| Option | Default | Description |
|--------|---------|-------------|
| `do_ocr` | `true` | OCR for scanned pages and embedded images |
| `do_table_structure` | `true` | Table detection and row/column reconstruction |
| `table_mode` | `accurate` | `accurate` (TableFormer) or `fast` |
| `do_code_enrichment` | `false` | Specialized OCR for code blocks |
| `do_formula_enrichment` | `false` | Math formula recognition (LaTeX output) |
| `do_picture_classification` | `false` | Classify images by type |
| `do_picture_description` | `false` | Generate image descriptions via VLM |
| `generate_picture_images` | `false` | Extract detected images as separate files |
| `generate_page_images` | `false` | Rasterize each page as an image |
| `images_scale` | `1.0` | Scale factor for generated images (0.1–10) |
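
As a rough illustration of how these options fit together, the sketch below collects the defaults from the table into one object and checks the two constrained values (`table_mode` and `images_scale`). The `PipelineOptions` class here is a hypothetical helper written for this page, not Docling's or Docling Studio's own class, and how the resulting dictionary is submitted to the backend is left out.

```python
from dataclasses import asdict, dataclass


@dataclass
class PipelineOptions:
    """Hypothetical container mirroring the option table (not Docling's own class)."""

    do_ocr: bool = True
    do_table_structure: bool = True
    table_mode: str = "accurate"
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    do_picture_classification: bool = False
    do_picture_description: bool = False
    generate_picture_images: bool = False
    generate_page_images: bool = False
    images_scale: float = 1.0

    def __post_init__(self) -> None:
        # Reject values outside the choices and range documented in the table.
        if self.table_mode not in ("accurate", "fast"):
            raise ValueError(f"table_mode must be 'accurate' or 'fast', got {self.table_mode!r}")
        if not 0.1 <= self.images_scale <= 10:
            raise ValueError(f"images_scale must be between 0.1 and 10, got {self.images_scale}")


# Example: faster table parsing, plus page images rendered at 2x scale.
options = PipelineOptions(table_mode="fast", generate_page_images=True, images_scale=2.0)
payload = asdict(options)  # plain dict, e.g. for serializing as JSON
```

Validating early like this surfaces a mistyped mode or an out-of-range scale before a long conversion starts.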

## Chunking Options

!!! note
    Chunking is only available in **local** mode. The chunking UI is hidden when using remote mode (Docling Serve).

After a document is analyzed, you can split the extracted content into semantic chunks. Chunking can be configured at analysis time or re-run later with different options via the **rechunk** action.

| Option | Default | Description |
|--------|---------|-------------|
| `chunker_type` | `hybrid` | `hybrid` (semantic + structural), `hierarchical` (heading-based), or `page` (one chunk per page) |
| `max_tokens` | `512` | Maximum tokens per chunk |
| `merge_peers` | `true` | Merge sibling elements under the same heading |
| `repeat_table_header` | `true` | Repeat table headers when a table is split across chunks |

Each chunk includes:

- **text**: the chunk content
- **headings**: heading hierarchy leading to the chunk
- **source_page**: the page number the chunk originates from
- **token_count**: number of tokens in the chunk
- **bboxes**: bounding boxes of the chunk's source elements (page + coordinates)
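
To make the chunk shape concrete, here is a minimal sketch that models the fields listed above and shows two things a consumer might do with them. The `Chunk` and `BBox` classes, the coordinate order inside `BBox`, and the sample values are all assumptions for illustration; the actual payload returned by Docling Studio may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class BBox:
    page: int
    # Assumed coordinate order: left, top, right, bottom.
    coords: tuple[float, float, float, float]


@dataclass
class Chunk:
    text: str
    headings: list[str]
    source_page: int
    token_count: int
    bboxes: list[BBox] = field(default_factory=list)


def chunks_on_page(chunks: list[Chunk], page: int) -> list[Chunk]:
    """Return the chunks whose source_page matches the given page."""
    return [c for c in chunks if c.source_page == page]


def breadcrumb(chunk: Chunk) -> str:
    """Join the heading hierarchy into a single readable path."""
    return " > ".join(chunk.headings)


# Illustrative data only; token counts are made up.
chunks = [
    Chunk("Install with Docker.", ["Getting Started", "Docker"], 1, 5,
          [BBox(1, (72.0, 90.0, 520.0, 110.0))]),
    Chunk("Set DOCLING_SERVE_URL.", ["Getting Started", "Configuration"], 2, 6),
]
page_two = chunks_on_page(chunks, 2)
path = breadcrumb(chunks[0])
```

Keeping `headings` as a list rather than a pre-joined string lets downstream code filter by section as well as display the path.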

## Configuration

All configuration is done via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `CONVERSION_ENGINE` | `local` | `local` (in-process Docling) or `remote` (Docling Serve) |
| `DOCLING_SERVE_URL` | `http://localhost:5001` | Docling Serve endpoint (remote mode only) |
| `DOCLING_SERVE_API_KEY` | (unset) | API key for Docling Serve (optional) |
| `CORS_ORIGINS` | `http://localhost:3000,...` | CORS allowed origins |
| `UPLOAD_DIR` | `./uploads` | File storage directory |
| `DB_PATH` | `./data/docling_studio.db` | SQLite database path |
| `CONVERSION_TIMEOUT` | `600` | Max seconds per Docling conversion |
| `MAX_CONCURRENT_ANALYSES` | `3` | Maximum parallel analysis jobs |
| `DEPLOYMENT_MODE` | `self-hosted` | `self-hosted` or `huggingface` (shows disclaimer banner) |
| `APP_VERSION` | `dev` | Application version (set automatically by CI/Docker) |
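
The table above can be read as a small settings loader. The sketch below is one way to express it, assuming each variable is read once at startup and falls back to the documented default; `Settings` and `load_settings` are hypothetical names, not the backend's actual code. `CORS_ORIGINS` is omitted because its full default is not spelled out in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    conversion_engine: str
    docling_serve_url: str
    docling_serve_api_key: Optional[str]
    upload_dir: str
    db_path: str
    conversion_timeout: int
    max_concurrent_analyses: int
    deployment_mode: str
    app_version: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    engine = env.get("CONVERSION_ENGINE", "local")
    if engine not in ("local", "remote"):
        raise ValueError(f"CONVERSION_ENGINE must be 'local' or 'remote', got {engine!r}")
    return Settings(
        conversion_engine=engine,
        docling_serve_url=env.get("DOCLING_SERVE_URL", "http://localhost:5001"),
        docling_serve_api_key=env.get("DOCLING_SERVE_API_KEY"),  # optional, may be None
        upload_dir=env.get("UPLOAD_DIR", "./uploads"),
        db_path=env.get("DB_PATH", "./data/docling_studio.db"),
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "600")),
        max_concurrent_analyses=int(env.get("MAX_CONCURRENT_ANALYSES", "3")),
        deployment_mode=env.get("DEPLOYMENT_MODE", "self-hosted"),
        app_version=env.get("APP_VERSION", "dev"),
    )


# Example: remote engine with a longer conversion timeout; everything else defaults.
settings = load_settings({"CONVERSION_ENGINE": "remote", "CONVERSION_TIMEOUT": "1200"})
```

Passing the environment in as a mapping keeps the loader easy to exercise in tests without touching the real process environment.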

## System Requirements

| | Remote image | Local image |
|---|---|---|
| **Image size** | ~270 MB | ~1.9 GB |
| **Memory** | 2 GB | 6 GB (recommended 8 GB+) |
| **CPUs** | 2 | 4 (recommended 8+) |

All Docker images are multi-arch (`linux/amd64` + `linux/arm64`). No GPU required.