Agentic-Service-Data-Eyond-Catalog

Running

File size: 13,343 Bytes

# Architecture — Data Eyond Agentic Service

**Last updated**: 2026-05-07
**Status**: Design phase — folder skeleton in place, implementation in progress

---

## Product vision (north star)

Data Eyond is an *AI data scientist* for business analytics, structured around **CRISP-DM** (Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment). Targets executives doing self-serve deep-dives and data analysts/scientists offloading routine work.

Envisioned user flow: **interview agent** captures goal → user connects data sources → asks natural-language question → CRISP-DM-structured analytical response, exportable as a **presentation** or **notebook-style report**.

The catalog-driven, IR-based architecture documented below is the *foundation*. The next architectural evolution is an agentic layer (analytical planner, per-stage CRISP-DM agents, evaluator, reporter) that consumes the existing IntentRouter → QueryPlanner → Executor → ChatbotAgent spine as its tool layer. See `REPO_CONTEXT.md` → *Roadmap — agentic evolution* for the target agent topology.

---

## TL;DR

A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.

The architecture has two paths:

- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text).
- **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.

The LLM produces *intent*, not query syntax. Deterministic code does the rest.

---

## 1. Why catalog-driven design

For a database or spreadsheet, a user's question maps to *known tables and columns* — not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.

A central per-user catalog also means:

- One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
- The query planner sees the user's full data landscape in a single prompt.
- Schema stays stable across user sessions without hitting the source DB on every query.
- New sources auto-update the catalog without re-embedding chunks.

---

## 2. Source taxonomy

```
Sources
├── Unstructured (pdf, docx, txt)        →  Cu  (prose chunks via DocumentRetriever)
└── Structured
    ├── Schema (DB)                       →  Cs  (DB tables + columns)
    └── Tabular (xlsx, csv, parquet)      →  Ct  (sheets + columns)
                                           Cs ∪ Ct = Data Catalog Context
```

- **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
- **Cs** = DB schema context (tables, columns, descriptions, sample values).
- **Ct** = tabular file context (sheets, columns, descriptions, sample values).
- **Data Catalog Context** = `Cs ∪ Ct`. Passed to the query planner as a single unified view.

DB vs tabular is **not** a routing concern — it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).

---

## 3. Routing model

```
source_hint ∈ { chat, unstructured, structured }
```

- `chat` — no search, conversational reply only
- `unstructured` — DocumentRetriever path (Cu)
- `structured` — catalog-driven path (Cs ∪ Ct → planner → compiler → executor)

The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.

---

## 4. Core architectural decisions

### 4.1 Catalog as primary context, not retrieval

For most users (≤50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.

When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.

### 4.2 JSON IR over raw SQL

The planner LLM emits a structured JSON IR describing query intent — not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.

Benefits:

- Validatable with Pydantic before execution
- Compiler whitelists allowed operations (no DROP, DELETE, etc.)
- Portable: same IR → SQL (any dialect) / pandas / polars
- Cheaper tokens, easier to debug, trivially testable without an LLM
- LLM cannot emit valid-but-wrong SQL syntax

### 4.3 Deterministic compiler, not LLM SQL writer

The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.

### 4.4 Pipeline stage isolation

Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.

### 4.5 Minimal LLM surface

LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free — the planner reads column names, stats, and sample rows directly):

1. **`IntentRouter`** — once per user message
2. **`QueryPlanner`** — once per structured query (produces the IR)
3. **`ChatbotAgent`** — once per answer (formats the response)

Compiler and executors are pure code. No LLM in the hot path of query construction.

---

## 5. End-to-end flow

### Ingestion (when user uploads a file or connects a DB)

```
source upload / DB connect
    ↓
introspect schema (DB: information_schema; tabular: file headers + sample rows)
    ↓
validate (Pydantic)
    ↓
write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id)
```

For unstructured files: chunk + embed → PGVector.

### Query (per user message)

```
User message
    ↓
Chat cache check (Redis, 24h TTL)
    ↓ miss
Load chat history
    ↓
IntentRouter LLM   →  needs_search?  source_hint?
    ↓
    ├── chat        → ChatbotAgent → SSE stream
    ├── unstructured → DocumentRetriever → answerer
    └── structured  →
            CatalogReader (load full Cs ∪ Ct for user)
                ↓
            QueryPlanner LLM  →  JSON IR
                ↓
            IRValidator  (Pydantic + columns-exist + ops whitelist)
                ↓
            QueryCompiler  →  SQL (schema source) or pandas (tabular source)
                ↓
            QueryExecutor  (DbExecutor or TabularExecutor)
                ↓
            QueryResult
                ↓
            ChatbotAgent → SSE stream
```

---

## 6. Data catalog

### Storage

Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.

### Schema (initial scope)

```
Catalog
├── user_id, schema_version, generated_at
└── sources[]
    └── Source
        ├── source_id, source_type, name, description, location_ref, updated_at
        └── tables[]
            └── Table
                ├── table_id, name, description, row_count
                └── columns[]
                    └── Column
                        ├── column_id, name, data_type, description
                        ├── nullable
                        ├── pii_flag
                        ├── sample_values[]
                        └── stats: { min, max, distinct_count } | null
```

### Best-practice fields deferred

`description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.

### Stable IDs

`source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.

### PII handling

Columns with `pii_flag: true` have `sample_values: null` — real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.

---

## 7. JSON IR

### Schema (initial scope)

```
QueryIR
├── ir_version          : "1.0"
├── source_id           : str   (references catalog)
├── table_id            : str   (references catalog)
├── select[]            : SelectItem
│   ├── { kind: "column", column_id, alias? }
│   └── { kind: "agg",    fn, column_id?, alias? }
├── filters[]           : { column_id, op, value, value_type }
├── group_by[]          : column_id
├── order_by[]          : { column_id | alias, dir }
└── limit               : int | null
```

### Whitelisted operators

```
Filter ops:  = != < <= > >= in not_in is_null is_not_null like between
Agg fns:     count count_distinct sum avg min max
```

### Validation rules (enforced before execution)

- `source_id` exists in catalog for this user
- `table_id` belongs to that source
- Every `column_id` exists in that table
- Every `agg.fn` and `filter.op` is whitelisted
- `value_type` consistent with column's `data_type`
- `limit` positive int, ≤ hard cap (e.g. 10000)

If any rule fails → reject IR → re-prompt planner with error context (max 3 retries).

### Deferred features

`having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.

---

## 8. Executors

Same input (validated IR), same output (`QueryResult`), different backends.

### DbExecutor (schema sources)

```
IR → SqlCompiler → SQL string + params
     ↓
sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
     ↓
asyncpg / pymysql in read-only transaction with timeout (30s)
     ↓
QueryResult
```

Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized — never inlined as strings.

### TabularExecutor (tabular sources)

```
IR → PandasCompiler → operation chain
     ↓
choose strategy by file size:
  ≤ 100 MB    → eager pandas
  100 MB-1 GB → pyarrow with predicate pushdown
  > 1 GB      → polars lazy scan
     ↓
execute in asyncio.to_thread (CPU work off the event loop)
     ↓
QueryResult
```

Initially eager pandas is sufficient. Add the others when a real file is too big.

### Shared safety guarantees

1. IR validated before reaching compiler
2. Compiler is deterministic (no LLM)
3. Identifiers from catalog (trusted)
4. Values parameterized
5. sqlglot second-line defence for SQL
6. Read-only at every layer
7. Timeouts and row caps

---

## 9. Implementation scope

### Initial PR — what ships first

| Item | Folder |
|---|---|
| Data catalog Pydantic models | `src/catalog/models.py` |
| Catalog ingestion (introspect → enrich → validate → store) | `src/catalog/`, `src/pipeline/` |
| `IntentRouter` with 3-way source_hint | `src/agents/` |
| `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
| `QueryPlanner` LLM call | `src/query/planner/` |
| JSON IR Pydantic models | `src/query/ir/models.py` |
| IR validator | `src/query/ir/validator.py` |

**Output**: a validated JSON IR object. Execution lands in a follow-up PR.

### Follow-up PRs

| PR | Scope |
|---|---|
| 2 | `QueryCompiler` (IR → SQL / pandas) |
| 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
| 4 | Retry / self-correction loop on execution failure |
| 5 | Eval harness (golden question→IR→result examples) |
| 6 | Auto PII tagging in catalog |
| Later | Joins in IR, schema drift detection, hybrid catalog search |

---

## 10. Open questions

| # | Question | Why it matters |
|---|---|---|
| 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
| 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
| 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
| 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
| 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |

---

## 11. References

- `docs/flowchart.html` — interactive end-to-end diagram (open in browser)
- `docs/flowchart.mmd` — mermaid source for the diagram

---

## Glossary

- **Cu** — unstructured context (prose chunks)
- **Cs** — schema context (DB tables/columns from catalog)
- **Ct** — tabular context (file sheets/columns from catalog)
- **IR** — intermediate representation (the JSON query shape)
- **PR** — pull request (a unit of code change)
- **PII** — personally identifiable information (names, emails, etc.)
- **ABC** — abstract base class (Python contract for subclasses)