Getting Started

Docling Studio ships two Docker image variants:

Variant	Image tag	Size	Description
remote	`latest-remote`	~270 MB	Lightweight — delegates to an external Docling Serve instance
local	`latest-local`	~1.9 GB	Full — runs Docling in-process, CPU-only (downloads ML models on first run)

Docker — remote mode (fastest)

docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote

Docker — local mode (self-contained)

docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local

Note: The first analysis takes longer because Docling downloads its ML models (~400 MB). Later analyses in the same container reuse the downloaded models and run faster.

With either variant running, open http://localhost:3000 in a browser.

Docker Compose (recommended for development)

git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio

# Local mode (default)
docker compose up --build

# Remote mode
CONVERSION_MODE=remote DOCLING_SERVE_URL=http://your-docling-serve:5001 docker compose up --build

Local Development

=== "Backend (Python 3.12+)"

```bash
cd document-parser
python -m venv .venv && source .venv/bin/activate

# Remote mode (lightweight)
pip install -r requirements.txt

# Local mode (with Docling)
pip install -r requirements-local.txt

uvicorn main:app --reload --port 8000
```

=== "Frontend (Node 20+)"

```bash
cd frontend
npm install
npm run dev
```

The frontend runs on http://localhost:3000 and proxies API calls to http://localhost:8000.

Running Tests

=== "Backend"

```bash
cd document-parser
pip install pytest pytest-asyncio httpx
pytest tests/ -v
```

=== "Frontend"

```bash
cd frontend
npm run test:run
```

Pipeline Options

These options map directly to Docling's `PdfPipelineOptions`.

Option	Default	Description
`do_ocr`	`true`	OCR for scanned pages and embedded images
`do_table_structure`	`true`	Table detection and row/column reconstruction
`table_mode`	`accurate`	`accurate` (TableFormer) or `fast`
`do_code_enrichment`	`false`	Specialized OCR for code blocks
`do_formula_enrichment`	`false`	Math formula recognition (LaTeX output)
`do_picture_classification`	`false`	Classify images by type
`do_picture_description`	`false`	Generate image descriptions via VLM
`generate_picture_images`	`false`	Extract detected images as separate files
`generate_page_images`	`false`	Rasterize each page as an image
`images_scale`	`1.0`	Scale factor for generated images (0.1–10)
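
To make the defaults and allowed values in the table concrete, here is a minimal sketch that mirrors them in a plain dataclass. The `PipelineOptions` class is hypothetical and written only for illustration; it is neither Docling Studio's code nor Docling's `PdfPipelineOptions`, and the validation rules are taken solely from the table above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineOptions:
    """Hypothetical mirror of the options table, for illustration only."""

    do_ocr: bool = True
    do_table_structure: bool = True
    table_mode: str = "accurate"
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    do_picture_classification: bool = False
    do_picture_description: bool = False
    generate_picture_images: bool = False
    generate_page_images: bool = False
    images_scale: float = 1.0

    def __post_init__(self) -> None:
        # Reject values outside the choices and range listed in the table.
        if self.table_mode not in ("accurate", "fast"):
            raise ValueError(f"table_mode must be 'accurate' or 'fast', got {self.table_mode!r}")
        if not 0.1 <= self.images_scale <= 10:
            raise ValueError(f"images_scale must be between 0.1 and 10, got {self.images_scale}")


# Override only what differs from the defaults.
fast_with_images = PipelineOptions(table_mode="fast", generate_page_images=True, images_scale=2.0)
payload = asdict(fast_with_images)
```

Here `payload` is an ordinary dictionary, so it could be serialized to JSON as-is; how Docling Studio actually accepts these options is not specified in this guide.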

Chunking Options

!!! note
    Chunking is only available in local mode. The chunking UI is hidden when using remote mode (Docling Serve).

After a document is analyzed, you can split the extracted content into semantic chunks. Chunking can be configured at analysis time or re-run later with different options via the rechunk action.

Option	Default	Description
`chunker_type`	`hybrid`	`hybrid` (semantic + structural), `hierarchical` (heading-based), or `page` (one chunk per page)
`max_tokens`	`512`	Maximum tokens per chunk
`merge_peers`	`true`	Merge sibling elements under the same heading
`repeat_table_header`	`true`	Repeat table headers when a table is split across chunks
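
As a rough intuition for how `merge_peers` and `max_tokens` interact, the toy sketch below greedily merges consecutive elements that share the same heading path while the merged text stays within the token budget. This is not Docling's chunking algorithm: the `Piece` type, the `merge_peers` function, and the whitespace-based token count are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical pre-chunk element: its heading path and its text."""

    headings: tuple[str, ...]
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def merge_peers(pieces: list[Piece], max_tokens: int = 512) -> list[Piece]:
    """Merge consecutive pieces with identical headings while staying within max_tokens."""
    merged: list[Piece] = []
    for piece in pieces:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.headings == piece.headings
            and count_tokens(last.text) + count_tokens(piece.text) <= max_tokens
        ):
            # Same heading and still within budget: fold into the previous chunk.
            merged[-1] = Piece(last.headings, last.text + "\n" + piece.text)
        else:
            merged.append(piece)
    return merged


pieces = [
    Piece(("Intro",), "one two three"),
    Piece(("Intro",), "four five"),
    Piece(("Methods",), "six seven"),
]
chunks = merge_peers(pieces, max_tokens=5)
```

With a budget of 5 tokens the two "Intro" pieces fit into one chunk and "Methods" stays separate; lowering the budget to 4 keeps all three pieces apart.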

Each chunk includes:

- `text`: the chunk content
- `headings`: the heading hierarchy leading to the chunk
- `source_page`: the page number the chunk originates from
- `token_count`: the number of tokens in the chunk
- `bboxes`: bounding boxes of the chunk's source elements (page + coordinates)
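
The fields above could be consumed along these lines. This is a sketch under assumptions: the JSON payload is invented sample data, and the keys inside each bounding box (`page`, `l`, `t`, `r`, `b`) are hypothetical, since this guide only says a box carries a page and coordinates.

```python
import json
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    headings: list[str]
    source_page: int
    token_count: int
    bboxes: list[dict]


# Hypothetical payload; the bbox keys (page, l, t, r, b) are assumed for illustration.
raw = """
[
  {"text": "Install with Docker.", "headings": ["Getting Started", "Docker"],
   "source_page": 1, "token_count": 4,
   "bboxes": [{"page": 1, "l": 72.0, "t": 90.5, "r": 520.0, "b": 110.0}]},
  {"text": "Set DOCLING_SERVE_URL.", "headings": ["Getting Started", "Configuration"],
   "source_page": 2, "token_count": 3,
   "bboxes": [{"page": 2, "l": 72.0, "t": 200.0, "r": 400.0, "b": 215.0}]}
]
"""

chunks = [Chunk(**item) for item in json.loads(raw)]

# Group chunk text by page, prefixing each entry with its heading path.
by_page: dict[int, list[str]] = {}
for chunk in chunks:
    label = " > ".join(chunk.headings)
    by_page.setdefault(chunk.source_page, []).append(f"[{label}] {chunk.text}")
```

Keeping `headings` as a list makes it easy to rebuild the path to each chunk, which is useful when the chunk text is shown out of context.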

Configuration

All configuration is done via environment variables:

Variable	Default	Description
`CONVERSION_ENGINE`	`local`	`local` (in-process Docling) or `remote` (Docling Serve)
`DOCLING_SERVE_URL`	`http://localhost:5001`	Docling Serve endpoint (remote mode only)
`DOCLING_SERVE_API_KEY`	—	API key for Docling Serve (optional)
`CORS_ORIGINS`	`http://localhost:3000,...`	CORS allowed origins
`UPLOAD_DIR`	`./uploads`	File storage directory
`DB_PATH`	`./data/docling_studio.db`	SQLite database path
`CONVERSION_TIMEOUT`	`600`	Max seconds per Docling conversion
`MAX_CONCURRENT_ANALYSES`	`3`	Maximum parallel analysis jobs
`DEPLOYMENT_MODE`	`self-hosted`	`self-hosted` or `huggingface` (shows disclaimer banner)
`APP_VERSION`	`dev`	Application version (set automatically by CI/Docker)
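
For readers wiring these variables into their own tooling, the following sketch reads a subset of them with the defaults documented above. The `Settings` class and `load_settings` function are hypothetical and do not reflect how Docling Studio itself parses its environment; `CORS_ORIGINS` is left out because its default is truncated in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    conversion_engine: str
    docling_serve_url: str
    docling_serve_api_key: Optional[str]
    conversion_timeout: int
    max_concurrent_analyses: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the variables, falling back to the documented defaults."""
    engine = env.get("CONVERSION_ENGINE", "local")
    if engine not in ("local", "remote"):
        raise ValueError(f"CONVERSION_ENGINE must be 'local' or 'remote', got {engine!r}")
    return Settings(
        conversion_engine=engine,
        docling_serve_url=env.get("DOCLING_SERVE_URL", "http://localhost:5001"),
        # An unset or empty key is treated as "no API key".
        docling_serve_api_key=env.get("DOCLING_SERVE_API_KEY") or None,
        conversion_timeout=int(env.get("CONVERSION_TIMEOUT", "600")),
        max_concurrent_analyses=int(env.get("MAX_CONCURRENT_ANALYSES", "3")),
    )


# Pass a plain dict instead of os.environ to try the function without touching the shell.
settings = load_settings({"CONVERSION_ENGINE": "remote", "CONVERSION_TIMEOUT": "120"})
```

Accepting any mapping rather than reading `os.environ` directly keeps the function easy to exercise with explicit values.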

System Requirements

	Remote image	Local image
Image size	~270 MB	~1.9 GB
Memory	2 GB	6 GB (recommended 8 GB+)
CPUs	2	4 (recommended 8+)

All Docker images are multi-arch (linux/amd64 + linux/arm64). No GPU required.