# Getting Started
Docling Studio ships two Docker image variants:
| Variant | Image tag | Size | Description |
|---------|-----------|------|-------------|
| **remote** | `latest-remote` | ~270 MB | Lightweight: delegates to an external [Docling Serve](https://github.com/DS4SD/docling-serve) instance |
| **local** | `latest-local` | ~1.9 GB | Full: runs Docling in-process, CPU-only (downloads ML models on first run) |
![Docker architecture](images/docker.png){ width="600" }
## Docker: remote mode (fastest)
```bash
docker run -p 3000:3000 \
-e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
ghcr.io/scub-france/docling-studio:latest-remote
```
## Docker: local mode (self-contained)
```bash
docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local
```
> **Note:** In local mode, the first analysis takes longer because Docling downloads its ML models (~400 MB). Subsequent analyses are fast.

In either mode, open [http://localhost:3000](http://localhost:3000) once the container is running.
## Docker Compose (recommended for development)
```bash
git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio
# Local mode (default)
docker compose up --build
# Remote mode
CONVERSION_ENGINE=remote DOCLING_SERVE_URL=http://your-docling-serve:5001 docker compose up --build
```
## Local Development
=== "Backend (Python 3.12+)"
    ```bash
    cd document-parser
    python -m venv .venv && source .venv/bin/activate
    # Remote mode (lightweight)
    pip install -r requirements.txt
    # Local mode (with Docling)
    pip install -r requirements-local.txt
    uvicorn main:app --reload --port 8000
    ```
=== "Frontend (Node 20+)"
    ```bash
    cd frontend
    npm install
    npm run dev
    ```
The frontend runs on `http://localhost:3000` and proxies API calls to `http://localhost:8000`.
## Running Tests
=== "Backend"
    ```bash
    cd document-parser
    pip install pytest pytest-asyncio httpx
    pytest tests/ -v
    ```
=== "Frontend"
    ```bash
    cd frontend
    npm run test:run
    ```
## Pipeline Options
These options map directly to Docling's [`PdfPipelineOptions`](https://docling-project.github.io/docling/usage/).
| Option | Default | Description |
|--------|---------|-------------|
| `do_ocr` | `true` | OCR for scanned pages and embedded images |
| `do_table_structure` | `true` | Table detection and row/column reconstruction |
| `table_mode` | `accurate` | `accurate` (TableFormer) or `fast` |
| `do_code_enrichment` | `false` | Specialized OCR for code blocks |
| `do_formula_enrichment` | `false` | Math formula recognition (LaTeX output) |
| `do_picture_classification` | `false` | Classify images by type |
| `do_picture_description` | `false` | Generate image descriptions via VLM |
| `generate_picture_images` | `false` | Extract detected images as separate files |
| `generate_page_images` | `false` | Rasterize each page as an image |
| `images_scale` | `1.0` | Scale factor for generated images (0.1–10) |
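
As a rough illustration of the table above, the defaults can be collected into a small Python structure. The `PipelineOptions` dataclass and its `validate` method are hypothetical names for this sketch, not the API of Docling or Docling Studio; only the option names, defaults, and the `images_scale` range come from the table.

```python
from dataclasses import dataclass, asdict


@dataclass
class PipelineOptions:
    """Hypothetical container mirroring the defaults in the table above."""

    do_ocr: bool = True
    do_table_structure: bool = True
    table_mode: str = "accurate"  # "accurate" or "fast"
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    do_picture_classification: bool = False
    do_picture_description: bool = False
    generate_picture_images: bool = False
    generate_page_images: bool = False
    images_scale: float = 1.0  # documented range: 0.1 to 10

    def validate(self) -> None:
        """Reject values outside the documented choices and range."""
        if self.table_mode not in ("accurate", "fast"):
            raise ValueError(f"unknown table_mode: {self.table_mode!r}")
        if not 0.1 <= self.images_scale <= 10:
            raise ValueError("images_scale must be between 0.1 and 10")


# Override only what differs from the defaults.
options = PipelineOptions(table_mode="fast", generate_page_images=True, images_scale=2.0)
options.validate()
payload = asdict(options)  # plain dict, e.g. for serializing as JSON
```

Keeping the defaults in one place makes it easy to send only a complete, validated set of options; how Docling Studio actually transports them is not specified here.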
## Chunking Options
!!! note
    Chunking is only available in **local** mode. The chunking UI is hidden when using remote mode (Docling Serve).
After a document is analyzed, you can split the extracted content into semantic chunks. Chunking can be configured at analysis time or re-run later with different options via the **rechunk** action.
| Option | Default | Description |
|--------|---------|-------------|
| `chunker_type` | `hybrid` | `hybrid` (semantic + structural), `hierarchical` (heading-based), or `page` (one chunk per page) |
| `max_tokens` | `512` | Maximum tokens per chunk |
| `merge_peers` | `true` | Merge sibling elements under the same heading |
| `repeat_table_header` | `true` | Repeat table headers when a table is split across chunks |
Each chunk includes:
- **text**: the chunk content
- **headings**: heading hierarchy leading to the chunk
- **source_page**: the page number the chunk originates from
- **token_count**: number of tokens in the chunk
- **bboxes**: bounding boxes of the chunk's source elements (page + coordinates)
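
The chunk fields listed above could be modeled in Python roughly as follows. This is a minimal sketch: the `Chunk` and `BBox` classes, the coordinate field names (`x0`, `y0`, `x1`, `y1`), the Python types, and the two helper functions are assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass, field


@dataclass
class BBox:
    """Bounding box on a page; the coordinate field names are assumptions."""

    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Chunk:
    """The five fields listed above; the Python types are assumptions."""

    text: str
    headings: list[str]
    source_page: int
    token_count: int
    bboxes: list[BBox] = field(default_factory=list)


def chunks_on_page(chunks: list[Chunk], page: int) -> list[Chunk]:
    """Return the chunks whose source_page matches the given page."""
    return [c for c in chunks if c.source_page == page]


def breadcrumb(chunk: Chunk) -> str:
    """Join the heading hierarchy into a single context string."""
    return " > ".join(chunk.headings)


# Made-up sample data, for illustration only.
chunks = [
    Chunk("Install with Docker.", ["Guide", "Setup"], 1, 5,
          [BBox(1, 72.0, 90.0, 300.0, 110.0)]),
    Chunk("Run the tests.", ["Guide", "Testing"], 2, 4),
]
```

For example, `breadcrumb(chunks[0])` joins the headings of the first chunk, which can be useful as context when a chunk is displayed or indexed on its own.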
## Configuration
All configuration is done via environment variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `CONVERSION_ENGINE` | `local` | `local` (in-process Docling) or `remote` (Docling Serve) |
| `DOCLING_SERVE_URL` | `http://localhost:5001` | Docling Serve endpoint (remote mode only) |
| `DOCLING_SERVE_API_KEY` | (none) | API key for Docling Serve (optional) |
| `CORS_ORIGINS` | `http://localhost:3000,...` | CORS allowed origins |
| `UPLOAD_DIR` | `./uploads` | File storage directory |
| `DB_PATH` | `./data/docling_studio.db` | SQLite database path |
| `CONVERSION_TIMEOUT` | `600` | Max seconds per Docling conversion |
| `MAX_CONCURRENT_ANALYSES` | `3` | Maximum parallel analysis jobs |
| `DEPLOYMENT_MODE` | `self-hosted` | `self-hosted` or `huggingface` (shows disclaimer banner) |
| `APP_VERSION` | `dev` | Application version (set automatically by CI/Docker) |
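
To show how these variables and defaults fit together, here is a minimal Python sketch of reading them from the environment. The `Settings` class and `load_settings` function are hypothetical and are not taken from the Docling Studio code base; `CORS_ORIGINS` is left out because its default is truncated in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; fields mirror the variables above."""

    conversion_engine: str
    docling_serve_url: str
    docling_serve_api_key: Optional[str]
    upload_dir: str
    db_path: str
    conversion_timeout: int
    max_concurrent_analyses: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    engine = env.get("CONVERSION_ENGINE", "local")
    if engine not in ("local", "remote"):
        raise ValueError(f"CONVERSION_ENGINE must be 'local' or 'remote', got {engine!r}")
    return Settings(
        conversion_engine=engine,
        docling_serve_url=env.get("DOCLING_SERVE_URL", "http://localhost:5001"),
        docling_serve_api_key=env.get("DOCLING_SERVE_API_KEY"),  # optional, no default
        upload_dir=env.get("UPLOAD_DIR", "./uploads"),
        db_path=env.get("DB_PATH", "./data/docling_studio.db"),
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "600")),
        max_concurrent_analyses=int(env.get("MAX_CONCURRENT_ANALYSES", "3")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
settings = load_settings({"CONVERSION_ENGINE": "remote", "CONVERSION_TIMEOUT": "120"})
```

Accepting the environment as a parameter is a design choice of this sketch: it makes the defaults easy to test without touching the process environment.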
## System Requirements
| | Remote image | Local image |
|---|---|---|
| **Image size** | ~270 MB | ~1.9 GB |
| **Memory** | 2 GB | 6 GB (recommended 8 GB+) |
| **CPUs** | 2 | 4 (recommended 8+) |
All Docker images are multi-arch (`linux/amd64` + `linux/arm64`). No GPU required.
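
The figures in the table can be turned into a quick pre-flight check. The sketch below is hypothetical: `REQUIREMENTS` and `meets_requirements` are illustrative names, and treating the first value in each cell as a minimum is an assumption.

```python
import os

# Figures copied from the table above; treating the first value in each cell
# as a minimum is an assumption of this sketch.
REQUIREMENTS = {
    "remote": {"cpus": 2, "memory_gb": 2},
    "local": {"cpus": 4, "memory_gb": 6},
}


def meets_requirements(mode: str, cpus: int, memory_gb: float) -> bool:
    """Return True if the host has at least the listed CPUs and memory."""
    needed = REQUIREMENTS[mode]
    return cpus >= needed["cpus"] and memory_gb >= needed["memory_gb"]


# os.cpu_count() may return None, so fall back to 1.
host_cpus = os.cpu_count() or 1
print("local image OK with 8 GB:", meets_requirements("local", host_cpus, 8))
```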