Rifqi Hafizuddin
update repo context md files
35d3523
# Architecture β€” Data Eyond Agentic Service
**Last updated**: 2026-05-07
**Status**: Design phase β€” folder skeleton in place, implementation in progress
---
## Product vision (north star)
Data Eyond is an *AI data scientist* for business analytics, structured around **CRISP-DM** (Business Understanding β†’ Data Understanding β†’ Data Preparation β†’ Modeling β†’ Evaluation β†’ Deployment). Targets executives doing self-serve deep-dives and data analysts/scientists offloading routine work.
Envisioned user flow: **interview agent** captures goal β†’ user connects data sources β†’ asks natural-language question β†’ CRISP-DM-structured analytical response, exportable as a **presentation** or **notebook-style report**.
The catalog-driven, IR-based architecture documented below is the *foundation*. The next architectural evolution is an agentic layer (analytical planner, per-stage CRISP-DM agents, evaluator, reporter) that consumes the existing IntentRouter β†’ QueryPlanner β†’ Executor β†’ ChatbotAgent spine as its tool layer. See `REPO_CONTEXT.md` β†’ *Roadmap β€” agentic evolution* for the target agent topology.
---
## TL;DR
A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
The architecture has two paths:
- **Unstructured** (PDF, DOCX, TXT) β€” dense similarity over prose chunks (the right primitive for free-form text).
- **Structured** (databases, XLSX, CSV, Parquet) β€” a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
The LLM produces *intent*, not query syntax. Deterministic code does the rest.
---
## 1. Why catalog-driven design
For a database or spreadsheet, a user's question maps to *known tables and columns* β€” not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.
A central per-user catalog also means:
- One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
- The query planner sees the user's full data landscape in a single prompt.
- Schema stays stable across user sessions without hitting the source DB on every query.
- New sources auto-update the catalog without re-embedding chunks.
---
## 2. Source taxonomy
```
Sources
β”œβ”€β”€ Unstructured (pdf, docx, txt) β†’ Cu (prose chunks via DocumentRetriever)
└── Structured
β”œβ”€β”€ Schema (DB) β†’ Cs (DB tables + columns)
└── Tabular (xlsx, csv, parquet) β†’ Ct (sheets + columns)
Cs βˆͺ Ct = Data Catalog Context
```
- **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
- **Cs** = DB schema context (tables, columns, descriptions, sample values).
- **Ct** = tabular file context (sheets, columns, descriptions, sample values).
- **Data Catalog Context** = `Cs βˆͺ Ct`. Passed to the query planner as a single unified view.
DB vs tabular is **not** a routing concern β€” it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
---
## 3. Routing model
```
source_hint ∈ { chat, unstructured, structured }
```
- `chat` β€” no search, conversational reply only
- `unstructured` β€” DocumentRetriever path (Cu)
- `structured` β€” catalog-driven path (Cs βˆͺ Ct β†’ planner β†’ compiler β†’ executor)
The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.
---
## 4. Core architectural decisions
### 4.1 Catalog as primary context, not retrieval
For most users (≀50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
### 4.2 JSON IR over raw SQL
The planner LLM emits a structured JSON IR describing query intent β€” not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
Benefits:
- Validatable with Pydantic before execution
- Compiler whitelists allowed operations (no DROP, DELETE, etc.)
- Portable: same IR β†’ SQL (any dialect) / pandas / polars
- Cheaper tokens, easier to debug, trivially testable without an LLM
- LLM cannot emit valid-but-wrong SQL syntax
### 4.3 Deterministic compiler, not LLM SQL writer
The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
### 4.4 Pipeline stage isolation
Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
### 4.5 Minimal LLM surface
LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free β€” the planner reads column names, stats, and sample rows directly):
1. **`IntentRouter`** β€” once per user message
2. **`QueryPlanner`** β€” once per structured query (produces the IR)
3. **`ChatbotAgent`** β€” once per answer (formats the response)
Compiler and executors are pure code. No LLM in the hot path of query construction.
---
## 5. End-to-end flow
### Ingestion (when user uploads a file or connects a DB)
```
source upload / DB connect
↓
introspect schema (DB: information_schema; tabular: file headers + sample rows)
↓
validate (Pydantic)
↓
write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id)
```
For unstructured files: chunk + embed β†’ PGVector.
### Query (per user message)
```
User message
↓
Chat cache check (Redis, 24h TTL)
↓ miss
Load chat history
↓
IntentRouter LLM β†’ needs_search? source_hint?
↓
β”œβ”€β”€ chat β†’ ChatbotAgent β†’ SSE stream
β”œβ”€β”€ unstructured β†’ DocumentRetriever β†’ answerer
└── structured β†’
CatalogReader (load full Cs βˆͺ Ct for user)
↓
QueryPlanner LLM β†’ JSON IR
↓
IRValidator (Pydantic + columns-exist + ops whitelist)
↓
QueryCompiler β†’ SQL (schema source) or pandas (tabular source)
↓
QueryExecutor (DbExecutor or TabularExecutor)
↓
QueryResult
↓
ChatbotAgent β†’ SSE stream
```
---
## 6. Data catalog
### Storage
Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.
### Schema (initial scope)
```
Catalog
β”œβ”€β”€ user_id, schema_version, generated_at
└── sources[]
└── Source
β”œβ”€β”€ source_id, source_type, name, description, location_ref, updated_at
└── tables[]
└── Table
β”œβ”€β”€ table_id, name, description, row_count
└── columns[]
└── Column
β”œβ”€β”€ column_id, name, data_type, description
β”œβ”€β”€ nullable
β”œβ”€β”€ pii_flag
β”œβ”€β”€ sample_values[]
└── stats: { min, max, distinct_count } | null
```
### Best-practice fields deferred
`description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
### Stable IDs
`source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
### PII handling
Columns with `pii_flag: true` have `sample_values: null` β€” real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.
---
## 7. JSON IR
### Schema (initial scope)
```
QueryIR
β”œβ”€β”€ ir_version : "1.0"
β”œβ”€β”€ source_id : str (references catalog)
β”œβ”€β”€ table_id : str (references catalog)
β”œβ”€β”€ select[] : SelectItem
β”‚ β”œβ”€β”€ { kind: "column", column_id, alias? }
β”‚ └── { kind: "agg", fn, column_id?, alias? }
β”œβ”€β”€ filters[] : { column_id, op, value, value_type }
β”œβ”€β”€ group_by[] : column_id
β”œβ”€β”€ order_by[] : { column_id | alias, dir }
└── limit : int | null
```
### Whitelisted operators
```
Filter ops: = != < <= > >= in not_in is_null is_not_null like between
Agg fns: count count_distinct sum avg min max
```
### Validation rules (enforced before execution)
- `source_id` exists in catalog for this user
- `table_id` belongs to that source
- Every `column_id` exists in that table
- Every `agg.fn` and `filter.op` is whitelisted
- `value_type` consistent with column's `data_type`
- `limit` positive int, ≀ hard cap (e.g. 10000)
If any rule fails β†’ reject IR β†’ re-prompt planner with error context (max 3 retries).
### Deferred features
`having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
---
## 8. Executors
Same input (validated IR), same output (`QueryResult`), different backends.
### DbExecutor (schema sources)
```
IR β†’ SqlCompiler β†’ SQL string + params
↓
sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
↓
asyncpg / pymysql in read-only transaction with timeout (30s)
↓
QueryResult
```
Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized β€” never inlined as strings.
### TabularExecutor (tabular sources)
```
IR β†’ PandasCompiler β†’ operation chain
↓
choose strategy by file size:
≀ 100 MB β†’ eager pandas
100 MB-1 GB β†’ pyarrow with predicate pushdown
> 1 GB β†’ polars lazy scan
↓
execute in asyncio.to_thread (CPU work off the event loop)
↓
QueryResult
```
Initially eager pandas is sufficient. Add the others when a real file is too big.
### Shared safety guarantees
1. IR validated before reaching compiler
2. Compiler is deterministic (no LLM)
3. Identifiers from catalog (trusted)
4. Values parameterized
5. sqlglot second-line defence for SQL
6. Read-only at every layer
7. Timeouts and row caps
---
## 9. Implementation scope
### Initial PR β€” what ships first
| Item | Folder |
|---|---|
| Data catalog Pydantic models | `src/catalog/models.py` |
| Catalog ingestion (introspect β†’ enrich β†’ validate β†’ store) | `src/catalog/`, `src/pipeline/` |
| `IntentRouter` with 3-way source_hint | `src/agents/` |
| `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
| `QueryPlanner` LLM call | `src/query/planner/` |
| JSON IR Pydantic models | `src/query/ir/models.py` |
| IR validator | `src/query/ir/validator.py` |
**Output**: a validated JSON IR object. Execution lands in a follow-up PR.
### Follow-up PRs
| PR | Scope |
|---|---|
| 2 | `QueryCompiler` (IR β†’ SQL / pandas) |
| 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
| 4 | Retry / self-correction loop on execution failure |
| 5 | Eval harness (golden question→IR→result examples) |
| 6 | Auto PII tagging in catalog |
| Later | Joins in IR, schema drift detection, hybrid catalog search |
---
## 10. Open questions
| # | Question | Why it matters |
|---|---|---|
| 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
| 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
| 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
| 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
| 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
---
## 11. References
- `docs/flowchart.html` β€” interactive end-to-end diagram (open in browser)
- `docs/flowchart.mmd` β€” mermaid source for the diagram
---
## Glossary
- **Cu** β€” unstructured context (prose chunks)
- **Cs** β€” schema context (DB tables/columns from catalog)
- **Ct** β€” tabular context (file sheets/columns from catalog)
- **IR** β€” intermediate representation (the JSON query shape)
- **PR** β€” pull request (a unit of code change)
- **PII** β€” personally identifiable information (names, emails, etc.)
- **ABC** β€” abstract base class (Python contract for subclasses)