| # Architecture β Data Eyond Agentic Service |
|
|
| **Last updated**: 2026-05-07 |
| **Status**: Design phase β folder skeleton in place, implementation in progress |
|
|
| --- |
|
|
| ## Product vision (north star) |
|
|
| Data Eyond is an *AI data scientist* for business analytics, structured around **CRISP-DM** (Business Understanding β Data Understanding β Data Preparation β Modeling β Evaluation β Deployment). Targets executives doing self-serve deep-dives and data analysts/scientists offloading routine work. |
|
|
| Envisioned user flow: **interview agent** captures goal β user connects data sources β asks natural-language question β CRISP-DM-structured analytical response, exportable as a **presentation** or **notebook-style report**. |
|
|
| The catalog-driven, IR-based architecture documented below is the *foundation*. The next architectural evolution is an agentic layer (analytical planner, per-stage CRISP-DM agents, evaluator, reporter) that consumes the existing IntentRouter β QueryPlanner β Executor β ChatbotAgent spine as its tool layer. See `REPO_CONTEXT.md` β *Roadmap β agentic evolution* for the target agent topology. |
|
|
| --- |
|
|
| ## TL;DR |
|
|
| A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data. |
|
|
| The architecture has two paths: |
|
|
| - **Unstructured** (PDF, DOCX, TXT) β dense similarity over prose chunks (the right primitive for free-form text). |
| - **Structured** (databases, XLSX, CSV, Parquet) β a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations. |
|
|
| The LLM produces *intent*, not query syntax. Deterministic code does the rest. |
|
|
| --- |
|
|
| ## 1. Why catalog-driven design |
|
|
| For a database or spreadsheet, a user's question maps to *known tables and columns* β not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead. |
|
|
| A central per-user catalog also means: |
|
|
| - One place to keep table/column descriptions (AI-generated, refreshed when the source changes). |
| - The query planner sees the user's full data landscape in a single prompt. |
| - Schema stays stable across user sessions without hitting the source DB on every query. |
| - New sources auto-update the catalog without re-embedding chunks. |
|
|
| --- |
|
|
| ## 2. Source taxonomy |
|
|
| ``` |
| Sources |
| βββ Unstructured (pdf, docx, txt) β Cu (prose chunks via DocumentRetriever) |
| βββ Structured |
| βββ Schema (DB) β Cs (DB tables + columns) |
| βββ Tabular (xlsx, csv, parquet) β Ct (sheets + columns) |
| Cs βͺ Ct = Data Catalog Context |
| ``` |
|
|
| - **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks. |
| - **Cs** = DB schema context (tables, columns, descriptions, sample values). |
| - **Ct** = tabular file context (sheets, columns, descriptions, sample values). |
| - **Data Catalog Context** = `Cs βͺ Ct`. Passed to the query planner as a single unified view. |
|
|
| DB vs tabular is **not** a routing concern β it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas). |
|
|
| --- |
|
|
| ## 3. Routing model |
|
|
| ``` |
| source_hint β { chat, unstructured, structured } |
| ``` |
|
|
| - `chat` β no search, conversational reply only |
| - `unstructured` β DocumentRetriever path (Cu) |
| - `structured` β catalog-driven path (Cs βͺ Ct β planner β compiler β executor) |
|
|
| The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt. |
|
|
| --- |
|
|
| ## 4. Core architectural decisions |
|
|
| ### 4.1 Catalog as primary context, not retrieval |
|
|
| For most users (β€50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table. |
|
|
| When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed. |
|
|
| ### 4.2 JSON IR over raw SQL |
|
|
| The planner LLM emits a structured JSON IR describing query intent β not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations. |
|
|
| Benefits: |
|
|
| - Validatable with Pydantic before execution |
| - Compiler whitelists allowed operations (no DROP, DELETE, etc.) |
| - Portable: same IR β SQL (any dialect) / pandas / polars |
| - Cheaper tokens, easier to debug, trivially testable without an LLM |
| - LLM cannot emit valid-but-wrong SQL syntax |
|
|
| ### 4.3 Deterministic compiler, not LLM SQL writer |
|
|
| The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query. |
|
|
| ### 4.4 Pipeline stage isolation |
|
|
| Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation. |
|
|
| ### 4.5 Minimal LLM surface |
|
|
| LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free β the planner reads column names, stats, and sample rows directly): |
|
|
| 1. **`IntentRouter`** β once per user message |
| 2. **`QueryPlanner`** β once per structured query (produces the IR) |
| 3. **`ChatbotAgent`** β once per answer (formats the response) |
|
|
| Compiler and executors are pure code. No LLM in the hot path of query construction. |
|
|
| --- |
|
|
| ## 5. End-to-end flow |
|
|
| ### Ingestion (when user uploads a file or connects a DB) |
|
|
| ``` |
| source upload / DB connect |
| β |
| introspect schema (DB: information_schema; tabular: file headers + sample rows) |
| β |
| validate (Pydantic) |
| β |
| write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id) |
| ``` |
|
|
| For unstructured files: chunk + embed β PGVector. |
|
|
| ### Query (per user message) |
|
|
| ``` |
| User message |
| β |
| Chat cache check (Redis, 24h TTL) |
| β miss |
| Load chat history |
| β |
| IntentRouter LLM β needs_search? source_hint? |
| β |
| βββ chat β ChatbotAgent β SSE stream |
| βββ unstructured β DocumentRetriever β answerer |
| βββ structured β |
| CatalogReader (load full Cs βͺ Ct for user) |
| β |
| QueryPlanner LLM β JSON IR |
| β |
| IRValidator (Pydantic + columns-exist + ops whitelist) |
| β |
| QueryCompiler β SQL (schema source) or pandas (tabular source) |
| β |
| QueryExecutor (DbExecutor or TabularExecutor) |
| β |
| QueryResult |
| β |
| ChatbotAgent β SSE stream |
| ``` |
|
|
| --- |
|
|
| ## 6. Data catalog |
|
|
| ### Storage |
|
|
| Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`. |
|
|
| ### Schema (initial scope) |
|
|
| ``` |
| Catalog |
| βββ user_id, schema_version, generated_at |
| βββ sources[] |
| βββ Source |
| βββ source_id, source_type, name, description, location_ref, updated_at |
| βββ tables[] |
| βββ Table |
| βββ table_id, name, description, row_count |
| βββ columns[] |
| βββ Column |
| βββ column_id, name, data_type, description |
| βββ nullable |
| βββ pii_flag |
| βββ sample_values[] |
| βββ stats: { min, max, distinct_count } | null |
| ``` |
|
|
| ### Best-practice fields deferred |
|
|
| `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need. |
|
|
| ### Stable IDs |
|
|
| `source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs. |
|
|
| ### PII handling |
|
|
| Columns with `pii_flag: true` have `sample_values: null` β real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex. |
|
|
| --- |
|
|
| ## 7. JSON IR |
|
|
| ### Schema (initial scope) |
|
|
| ``` |
| QueryIR |
| βββ ir_version : "1.0" |
| βββ source_id : str (references catalog) |
| βββ table_id : str (references catalog) |
| βββ select[] : SelectItem |
| β βββ { kind: "column", column_id, alias? } |
| β βββ { kind: "agg", fn, column_id?, alias? } |
| βββ filters[] : { column_id, op, value, value_type } |
| βββ group_by[] : column_id |
| βββ order_by[] : { column_id | alias, dir } |
| βββ limit : int | null |
| ``` |
|
|
| ### Whitelisted operators |
|
|
| ``` |
| Filter ops: = != < <= > >= in not_in is_null is_not_null like between |
| Agg fns: count count_distinct sum avg min max |
| ``` |
|
|
| ### Validation rules (enforced before execution) |
|
|
| - `source_id` exists in catalog for this user |
| - `table_id` belongs to that source |
| - Every `column_id` exists in that table |
| - Every `agg.fn` and `filter.op` is whitelisted |
| - `value_type` consistent with column's `data_type` |
| - `limit` positive int, β€ hard cap (e.g. 10000) |
|
|
| If any rule fails β reject IR β re-prompt planner with error context (max 3 retries). |
|
|
| ### Deferred features |
|
|
| `having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation. |
|
|
| --- |
|
|
| ## 8. Executors |
|
|
| Same input (validated IR), same output (`QueryResult`), different backends. |
|
|
| ### DbExecutor (schema sources) |
|
|
| ``` |
| IR β SqlCompiler β SQL string + params |
| β |
| sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced) |
| β |
| asyncpg / pymysql in read-only transaction with timeout (30s) |
| β |
| QueryResult |
| ``` |
|
|
| Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized β never inlined as strings. |
|
|
| ### TabularExecutor (tabular sources) |
|
|
| ``` |
| IR β PandasCompiler β operation chain |
| β |
| choose strategy by file size: |
| β€ 100 MB β eager pandas |
| 100 MB-1 GB β pyarrow with predicate pushdown |
| > 1 GB β polars lazy scan |
| β |
| execute in asyncio.to_thread (CPU work off the event loop) |
| β |
| QueryResult |
| ``` |
|
|
| Initially eager pandas is sufficient. Add the others when a real file is too big. |
|
|
| ### Shared safety guarantees |
|
|
| 1. IR validated before reaching compiler |
| 2. Compiler is deterministic (no LLM) |
| 3. Identifiers from catalog (trusted) |
| 4. Values parameterized |
| 5. sqlglot second-line defence for SQL |
| 6. Read-only at every layer |
| 7. Timeouts and row caps |
|
|
| --- |
|
|
| ## 9. Implementation scope |
|
|
| ### Initial PR β what ships first |
|
|
| | Item | Folder | |
| |---|---| |
| | Data catalog Pydantic models | `src/catalog/models.py` | |
| | Catalog ingestion (introspect β enrich β validate β store) | `src/catalog/`, `src/pipeline/` | |
| | `IntentRouter` with 3-way source_hint | `src/agents/` | |
| | `CatalogReader` (loads full catalog) | `src/catalog/reader.py` | |
| | `QueryPlanner` LLM call | `src/query/planner/` | |
| | JSON IR Pydantic models | `src/query/ir/models.py` | |
| | IR validator | `src/query/ir/validator.py` | |
| |
| **Output**: a validated JSON IR object. Execution lands in a follow-up PR. |
| |
| ### Follow-up PRs |
| |
| | PR | Scope | |
| |---|---| |
| | 2 | `QueryCompiler` (IR β SQL / pandas) | |
| | 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` | |
| | 4 | Retry / self-correction loop on execution failure | |
| | 5 | Eval harness (golden questionβIRβresult examples) | |
| | 6 | Auto PII tagging in catalog | |
| | Later | Joins in IR, schema drift detection, hybrid catalog search | |
| |
| --- |
| |
| ## 10. Open questions |
| |
| | # | Question | Why it matters | |
| |---|---|---| |
| | 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance | |
| | 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources | |
| | 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff | |
| | 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered | |
| | 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts | |
|
|
| --- |
|
|
| ## 11. References |
|
|
| - `docs/flowchart.html` β interactive end-to-end diagram (open in browser) |
| - `docs/flowchart.mmd` β mermaid source for the diagram |
|
|
| --- |
|
|
| ## Glossary |
|
|
| - **Cu** β unstructured context (prose chunks) |
| - **Cs** β schema context (DB tables/columns from catalog) |
| - **Ct** β tabular context (file sheets/columns from catalog) |
| - **IR** β intermediate representation (the JSON query shape) |
| - **PR** β pull request (a unit of code change) |
| - **PII** β personally identifiable information (names, emails, etc.) |
| - **ABC** β abstract base class (Python contract for subclasses) |
|
|