# Knowledge Graph Example
This example builds a simple **knowledge graph** on top of **SurrealDB** from
uploaded documents, using a **data-flow pattern** to drive ingestion.
At a high level:
1. You upload a document to the FastAPI server (`/upload`).
2. The server stores it in SurrealDB as a `document` record.
3. A background ingestion worker runs *flows* that:
- chunk documents into embedded `chunk` records
- infer `concept` nodes from each chunk and create `MENTIONS_CONCEPT` edges
4. A separate chat agent can retrieve relevant chunks (vector + graph context)
and answer using the ingested documents.
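For example, once the server from the "Run" section below is listening on port 8080, a document can be uploaded with a small client script like the following sketch (the multipart field name `file` is an assumption; `./scripts/upload_pdfs.sh` is the provided way to bulk-upload):

```python
import requests

# Sketch: assumes the FastAPI server is running on localhost:8080 and that
# /upload accepts a standard multipart "file" field.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/upload",
        files={"file": ("report.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json())  # presumably the created `document` record
```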
## Code layout
The package lives under `examples/knowledge-graph/src/knowledge_graph/`:
- `server.py`
- FastAPI application and lifecycle management.
- Starts the background ingestion loop on startup and stops it on shutdown.
- Creates a `flow.Executor` bound to the SurrealDB connection.
- `ingestion.py`
- Defines the ingestion pipeline by registering flow handlers with
`@exe.flow(...)`.
- Current flows:
- `chunk`: processes `document` records that have not been chunked yet
- `infer_concepts`: processes `chunk` records that have not had concepts
inferred yet
- `flow/`
- `executor.py`: generic runtime for the data-flow pattern (polls DB for work,
executes handlers, applies backoff when idle).
- `definitions.py`: `Flow` (Pydantic model) and shared types.
- `handlers/`
- `upload.py`: receives a file and stores it as an original document in DB.
- `chunk.py`: converts and chunks documents, embeds chunks, inserts them.
- `inference.py`: uses the configured LLM to extract concepts and write
`concept` nodes + `MENTIONS_CONCEPT` edges.
- `agent.py`
- PydanticAI agent with a `retrieve` tool.
- Runs SurrealQL from `surql/search_chunks.surql` to fetch relevant chunks and
supplies them as context to the model.
- `surql/`
- SurrealQL files for schema definition and retrieval queries.
## Data-flow pattern (flow/stamp) used here
This example uses a **database-driven data flow**:
- Each step (“flow”) queries a DB table for records that are eligible for
processing.
- The step writes results back to the DB.
- The step marks completion by setting a *stamp field* on the record.
### Flow definition
Flows are registered via the `@exe.flow(table=..., stamp=..., dependencies=..., priority=...)`
decorator. Under the hood:
- The executor stores flow metadata in the `flow` table.
- Each handler is assigned a **stable hash** derived from its compiled code.
This hash is written into the record’s stamp field after processing.
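For instance, the `chunk` flow in `ingestion.py` is registered along these lines, where `exe` is the `flow.Executor` created in `server.py` (a sketch only; the handler signature and body are assumptions, not the actual code):

```python
# Sketch: the decorator arguments mirror the documented @exe.flow(...) parameters;
# the handler signature and record shape are assumptions.
@exe.flow(table="document", stamp="chunked")
async def chunk(record):
    # Convert and split the document, embed each piece, then insert `chunk`
    # records linked back to the source document.
    ...
```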
### Eligibility and dependencies
A record becomes a candidate when:
- its stamp field is `NONE` (meaning “not yet processed by this flow”), and
- all dependency fields (if any) are present (not `NONE`).
This makes the pipeline **restart-safe** and **incremental**:
if the server stops, it resumes based on DB state rather than in-memory state.
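Concretely, candidate selection for the `chunk` flow boils down to a query of roughly this shape (illustrative only, not the executor's actual SurrealQL; the `content` dependency field and the batch size are assumptions):

```python
# Illustrative candidate query for the `chunk` flow, shown here as a plain string.
CANDIDATES_QUERY = """
SELECT * FROM document
WHERE chunked IS NONE      -- stamp not yet set by this flow
  AND content IS NOT NONE  -- dependency fields, if declared, must be present
LIMIT 10;
"""
```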
### Stamping and idempotency
Each handler must set its stamp field, e.g.:
- `document.chunked = <flow_hash>`
- `chunk.concepts_inferred = <flow_hash>`
This prevents reprocessing the same record and also makes changes traceable:
if you update a flow function, its hash changes and you can see which records
were processed by which version of the flow.
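A stable hash of the kind described above, derived from the handler's compiled code, could be computed roughly as follows (a sketch; the executor's actual hashing scheme may differ):

```python
import hashlib

def flow_hash(handler) -> str:
    # Derive a stable identifier from the handler's compiled bytecode and
    # constants, so the hash changes whenever the function's logic changes.
    code = handler.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.md5(payload).hexdigest()
```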
## Run:
### DB:
```bash
surreal start -u root -p root rocksdb:dbs/knowledge-graph
```
or use the helper script:
```bash
./scripts/run_surrealdb.sh
```
or `just knowledge-graph-db` from the repo base directory.
### LLM + embeddings (Blablador)
This example uses OpenAI-compatible APIs. For Blablador, set:
```bash
export OPENAI_API_KEY="$BLABLADOR_API_KEY"
export OPENAI_BASE_URL="${BLABLADOR_BASE_URL:-https://api.helmholtz-blablador.fz-juelich.de/v1/}"
```
Defaults are `alias-fast` for chat and local sentence-transformers embeddings.
Override chat model and fallbacks if needed:
```bash
export KG_LLM_MODEL=alias-fast
export KG_LLM_FALLBACK_MODELS=alias-large,alias-code
export KG_CHAT_MODEL=alias-fast
```
To use local embeddings explicitly:
```bash
export KG_EMBEDDINGS_PROVIDER=sentence-transformers
export KG_LOCAL_EMBEDDINGS_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
To try Blablador embeddings (may be unstable):
```bash
export KG_EMBEDDINGS_PROVIDER=openai
export KG_EMBEDDINGS_MODEL=alias-embeddings
```
### Server and ingestion worker
```bash
DB_NAME=test_db uv run --env-file .env -- fastapi run examples/knowledge-graph/src/knowledge_graph/server.py --port 8080
```
or `just knowledge-graph test_db` from the repo base directory.
By default, ingestion is disabled so the server can start quickly. To enable
ingestion at startup:
```bash
export KG_ENABLE_INGESTION=true
```
Recommended flow for large uploads:
1) Upload documents.
2) Run ingestion separately:
```bash
./scripts/start_ingestion.sh
```
If you see WebSocket disconnects, switch to HTTP for the DB client:
```bash
export KG_DB_URL=http://localhost:8000
```
### PDF converter selection
By default, the ingestion flow prefers Docling with no fallback. You can
override the converter selection with:
```bash
export KG_PDF_CONVERTER=docling
```
Other values: `kreuzberg` (prefer Kreuzberg), `auto` (try Kreuzberg first).
To enable fallback converters:
```bash
export KG_PDF_FALLBACK=true
```
Docling tokenizer configuration:
```bash
export KG_DOCLING_TOKENIZER=cl100k_base
```
### Markdown ingestion
You can upload `.md` files directly; they are chunked locally without PDF
conversion. The uploader also guesses the content type from the filename when
it is missing. If a file arrives as `application/octet-stream`, the ingestion
pipeline attempts to guess the type from the filename before converting.
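The filename-based fallback corresponds to standard MIME-type guessing; a minimal illustration (the real uploader may use a different mechanism):

```python
import mimetypes

# Guess the content type from the filename when the upload arrives without one
# (or as application/octet-stream).
guessed, _ = mimetypes.guess_type("party_plan_2026.pdf")
print(guessed)  # application/pdf
```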
### Party plan metadata
The knowledge-graph example includes a metadata file for the 2026 party plan
PDFs. It is used to expand acronyms and to surface plan URLs in answers:
- `examples/knowledge-graph/data/party_plan_metadata.json`
### Chat agent
```bash
DB_NAME=test_db uv run --env-file .env uvicorn knowledge_graph.agent:app --host 127.0.0.1 --port 7932
```
or `just knowledge-graph-agent test_db` from the repo base directory.
### Status check
```bash
./scripts/status_check.sh
```
This script checks service status and tails recent logs. Logs are written
to `logs/server.log` and `logs/ui.log`.
### Quickstart scripts
Start SurrealDB:
```bash
./scripts/run_surrealdb.sh
```
Start server (foreground):
```bash
./scripts/start_server.sh
```
Start server in background:
```bash
./scripts/start_server.sh -b
```
Upload PDFs/Markdowns:
```bash
./scripts/upload_pdfs.sh /path/to/folder
```
Run ingestion (process backlog):
```bash
./scripts/start_ingestion.sh
```
Start UI:
```bash
source scripts/start_ui.sh
```
### Streamlit app (query-first)
Run locally (requires SurrealDB running):
```bash
export DB_NAME=test_db
streamlit run examples/knowledge-graph/streamlit_app.py
```
Uploads are limited to one PDF/Markdown at a time (default max 50 MB).
Ingestion runs in a background thread and writes logs to `logs/ingestion.log`.
The Streamlit UI can display party banner images using `images/metadata.json`.
To set the upload limit explicitly:
```bash
export KG_MAX_UPLOAD_MB=50
export STREAMLIT_SERVER_MAX_UPLOAD_SIZE=50
```
Docker (single container with SurrealDB inside):
```bash
docker build -f Dockerfile.streamlit -t kaig-streamlit .
docker run -p 8501:8501 \
-e BLABLADOR_API_KEY=... \
-e BLABLADOR_BASE_URL=https://api.helmholtz-blablador.fz-juelich.de/v1/ \
kaig-streamlit
```
The build expects `dbs/knowledge-graph/` to be present in the build context.
SurrealDB retry settings (optional):
```bash
export KG_DB_RETRY_ATTEMPTS=3
export KG_DB_RETRY_DELAY=1.0
```
Check status:
```bash
./scripts/status_check.sh
```
Limit retrieval tool calls per question (default: 10):
```bash
export KG_MAX_RETRIEVE_CALLS=1
```
Tune search threshold and fallback:
```bash
export KG_SEARCH_THRESHOLD=0.15
export KG_SEARCH_FALLBACK=true
```
## SurrealQL queries:
**Visualise the graph:**
```surql
SELECT *,
->MENTIONS_CONCEPT->concept as concepts
FROM chunk;
```
**Flow status:**
This shows, for each "flow", how many records have been processed and how many are still pending.
```surql
LET $flows = SELECT * FROM flow;
RETURN $flows.fold([], |$a, $flow| {
LET $b = SELECT
$flow.id as flow,
type::field($flow.stamp) as flow_hash,
count() as count,
$flow.table as table
FROM type::table($flow.table)
GROUP BY flow_hash;
RETURN $a.concat($b)
});
```
Output example:
```surql
[
{
count: 1,
flow: flow:chunk,
flow_hash: NONE,
table: 'document'
},
{
count: 2,
flow: flow:chunk,
flow_hash: 'bbb6fe4b55cce1b3c8af0e7713a33d75',
table: 'document'
},
{
count: 4,
flow: flow:infer_concepts,
flow_hash: NONE,
table: 'chunk'
},
{
count: 27,
flow: flow:infer_concepts,
flow_hash: '75f90c71db9aeb2cf6f871ba1f75828c',
table: 'chunk'
}
]
```
Different hashes mean the records have been processed by different versions of the flow function. This can happen if the flow function has been updated.