# Architecture — Data Eyond Agentic Service **Last updated**: 2026-05-07 **Status**: Design phase — folder skeleton in place, implementation in progress --- ## Product vision (north star) Data Eyond is an *AI data scientist* for business analytics, structured around **CRISP-DM** (Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment). Targets executives doing self-serve deep-dives and data analysts/scientists offloading routine work. Envisioned user flow: **interview agent** captures goal → user connects data sources → asks natural-language question → CRISP-DM-structured analytical response, exportable as a **presentation** or **notebook-style report**. The catalog-driven, IR-based architecture documented below is the *foundation*. The next architectural evolution is an agentic layer (analytical planner, per-stage CRISP-DM agents, evaluator, reporter) that consumes the existing IntentRouter → QueryPlanner → Executor → ChatbotAgent spine as its tool layer. See `REPO_CONTEXT.md` → *Roadmap — agentic evolution* for the target agent topology. --- ## TL;DR A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data. The architecture has two paths: - **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text). - **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations. The LLM produces *intent*, not query syntax. Deterministic code does the rest. --- ## 1. Why catalog-driven design For a database or spreadsheet, a user's question maps to *known tables and columns* — not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead. A central per-user catalog also means: - One place to keep table/column descriptions (AI-generated, refreshed when the source changes). - The query planner sees the user's full data landscape in a single prompt. - Schema stays stable across user sessions without hitting the source DB on every query. - New sources auto-update the catalog without re-embedding chunks. --- ## 2. Source taxonomy ``` Sources ├── Unstructured (pdf, docx, txt) → Cu (prose chunks via DocumentRetriever) └── Structured ├── Schema (DB) → Cs (DB tables + columns) └── Tabular (xlsx, csv, parquet) → Ct (sheets + columns) Cs ∪ Ct = Data Catalog Context ``` - **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks. - **Cs** = DB schema context (tables, columns, descriptions, sample values). - **Ct** = tabular file context (sheets, columns, descriptions, sample values). - **Data Catalog Context** = `Cs ∪ Ct`. Passed to the query planner as a single unified view. DB vs tabular is **not** a routing concern — it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas). --- ## 3. Routing model ``` source_hint ∈ { chat, unstructured, structured } ``` - `chat` — no search, conversational reply only - `unstructured` — DocumentRetriever path (Cu) - `structured` — catalog-driven path (Cs ∪ Ct → planner → compiler → executor) The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt. --- ## 4. Core architectural decisions ### 4.1 Catalog as primary context, not retrieval For most users (≤50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table. When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed. ### 4.2 JSON IR over raw SQL The planner LLM emits a structured JSON IR describing query intent — not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations. Benefits: - Validatable with Pydantic before execution - Compiler whitelists allowed operations (no DROP, DELETE, etc.) - Portable: same IR → SQL (any dialect) / pandas / polars - Cheaper tokens, easier to debug, trivially testable without an LLM - LLM cannot emit valid-but-wrong SQL syntax ### 4.3 Deterministic compiler, not LLM SQL writer The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query. ### 4.4 Pipeline stage isolation Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation. ### 4.5 Minimal LLM surface LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free — the planner reads column names, stats, and sample rows directly): 1. **`IntentRouter`** — once per user message 2. **`QueryPlanner`** — once per structured query (produces the IR) 3. **`ChatbotAgent`** — once per answer (formats the response) Compiler and executors are pure code. No LLM in the hot path of query construction. --- ## 5. End-to-end flow ### Ingestion (when user uploads a file or connects a DB) ``` source upload / DB connect ↓ introspect schema (DB: information_schema; tabular: file headers + sample rows) ↓ validate (Pydantic) ↓ write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id) ``` For unstructured files: chunk + embed → PGVector. ### Query (per user message) ``` User message ↓ Chat cache check (Redis, 24h TTL) ↓ miss Load chat history ↓ IntentRouter LLM → needs_search? source_hint? ↓ ├── chat → ChatbotAgent → SSE stream ├── unstructured → DocumentRetriever → answerer └── structured → CatalogReader (load full Cs ∪ Ct for user) ↓ QueryPlanner LLM → JSON IR ↓ IRValidator (Pydantic + columns-exist + ops whitelist) ↓ QueryCompiler → SQL (schema source) or pandas (tabular source) ↓ QueryExecutor (DbExecutor or TabularExecutor) ↓ QueryResult ↓ ChatbotAgent → SSE stream ``` --- ## 6. Data catalog ### Storage Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`. ### Schema (initial scope) ``` Catalog ├── user_id, schema_version, generated_at └── sources[] └── Source ├── source_id, source_type, name, description, location_ref, updated_at └── tables[] └── Table ├── table_id, name, description, row_count └── columns[] └── Column ├── column_id, name, data_type, description ├── nullable ├── pii_flag ├── sample_values[] └── stats: { min, max, distinct_count } | null ``` ### Best-practice fields deferred `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need. ### Stable IDs `source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs. ### PII handling Columns with `pii_flag: true` have `sample_values: null` — real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex. --- ## 7. JSON IR ### Schema (initial scope) ``` QueryIR ├── ir_version : "1.0" ├── source_id : str (references catalog) ├── table_id : str (references catalog) ├── select[] : SelectItem │ ├── { kind: "column", column_id, alias? } │ └── { kind: "agg", fn, column_id?, alias? } ├── filters[] : { column_id, op, value, value_type } ├── group_by[] : column_id ├── order_by[] : { column_id | alias, dir } └── limit : int | null ``` ### Whitelisted operators ``` Filter ops: = != < <= > >= in not_in is_null is_not_null like between Agg fns: count count_distinct sum avg min max ``` ### Validation rules (enforced before execution) - `source_id` exists in catalog for this user - `table_id` belongs to that source - Every `column_id` exists in that table - Every `agg.fn` and `filter.op` is whitelisted - `value_type` consistent with column's `data_type` - `limit` positive int, ≤ hard cap (e.g. 10000) If any rule fails → reject IR → re-prompt planner with error context (max 3 retries). ### Deferred features `having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation. --- ## 8. Executors Same input (validated IR), same output (`QueryResult`), different backends. ### DbExecutor (schema sources) ``` IR → SqlCompiler → SQL string + params ↓ sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced) ↓ asyncpg / pymysql in read-only transaction with timeout (30s) ↓ QueryResult ``` Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized — never inlined as strings. ### TabularExecutor (tabular sources) ``` IR → PandasCompiler → operation chain ↓ choose strategy by file size: ≤ 100 MB → eager pandas 100 MB-1 GB → pyarrow with predicate pushdown > 1 GB → polars lazy scan ↓ execute in asyncio.to_thread (CPU work off the event loop) ↓ QueryResult ``` Initially eager pandas is sufficient. Add the others when a real file is too big. ### Shared safety guarantees 1. IR validated before reaching compiler 2. Compiler is deterministic (no LLM) 3. Identifiers from catalog (trusted) 4. Values parameterized 5. sqlglot second-line defence for SQL 6. Read-only at every layer 7. Timeouts and row caps --- ## 9. Implementation scope ### Initial PR — what ships first | Item | Folder | |---|---| | Data catalog Pydantic models | `src/catalog/models.py` | | Catalog ingestion (introspect → enrich → validate → store) | `src/catalog/`, `src/pipeline/` | | `IntentRouter` with 3-way source_hint | `src/agents/` | | `CatalogReader` (loads full catalog) | `src/catalog/reader.py` | | `QueryPlanner` LLM call | `src/query/planner/` | | JSON IR Pydantic models | `src/query/ir/models.py` | | IR validator | `src/query/ir/validator.py` | **Output**: a validated JSON IR object. Execution lands in a follow-up PR. ### Follow-up PRs | PR | Scope | |---|---| | 2 | `QueryCompiler` (IR → SQL / pandas) | | 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` | | 4 | Retry / self-correction loop on execution failure | | 5 | Eval harness (golden question→IR→result examples) | | 6 | Auto PII tagging in catalog | | Later | Joins in IR, schema drift detection, hybrid catalog search | --- ## 10. Open questions | # | Question | Why it matters | |---|---|---| | 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance | | 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources | | 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff | | 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered | | 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts | --- ## 11. References - `docs/flowchart.html` — interactive end-to-end diagram (open in browser) - `docs/flowchart.mmd` — mermaid source for the diagram --- ## Glossary - **Cu** — unstructured context (prose chunks) - **Cs** — schema context (DB tables/columns from catalog) - **Ct** — tabular context (file sheets/columns from catalog) - **IR** — intermediate representation (the JSON query shape) - **PR** — pull request (a unit of code change) - **PII** — personally identifiable information (names, emails, etc.) - **ABC** — abstract base class (Python contract for subclasses)