Rifqi Hafizuddin
[KM-560][KM-561] drop catalog LLM enrichment + rename store to data_catalog + add /data-catalog index endpoint
2d6eca0 | # Architecture β Data Eyond Agentic Service | |
| **Last updated**: 2026-05-07 | |
| **Status**: Design phase β folder skeleton in place, implementation in progress | |
| --- | |
| ## TL;DR | |
| A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data. | |
| The architecture has two paths: | |
| - **Unstructured** (PDF, DOCX, TXT) β dense similarity over prose chunks (the right primitive for free-form text). | |
| - **Structured** (databases, XLSX, CSV, Parquet) β a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations. | |
| The LLM produces *intent*, not query syntax. Deterministic code does the rest. | |
| --- | |
| ## 1. Why catalog-driven design | |
| For a database or spreadsheet, a user's question maps to *known tables and columns* β not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead. | |
| A central per-user catalog also means: | |
| - One place to keep table/column descriptions (AI-generated, refreshed when the source changes). | |
| - The query planner sees the user's full data landscape in a single prompt. | |
| - Schema stays stable across user sessions without hitting the source DB on every query. | |
| - New sources auto-update the catalog without re-embedding chunks. | |
| --- | |
| ## 2. Source taxonomy | |
| ``` | |
| Sources | |
| βββ Unstructured (pdf, docx, txt) β Cu (prose chunks via DocumentRetriever) | |
| βββ Structured | |
| βββ Schema (DB) β Cs (DB tables + columns) | |
| βββ Tabular (xlsx, csv, parquet) β Ct (sheets + columns) | |
| Cs βͺ Ct = Data Catalog Context | |
| ``` | |
| - **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks. | |
| - **Cs** = DB schema context (tables, columns, descriptions, sample values). | |
| - **Ct** = tabular file context (sheets, columns, descriptions, sample values). | |
| - **Data Catalog Context** = `Cs βͺ Ct`. Passed to the query planner as a single unified view. | |
| DB vs tabular is **not** a routing concern β it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas). | |
| --- | |
| ## 3. Routing model | |
| ``` | |
| source_hint β { chat, unstructured, structured } | |
| ``` | |
| - `chat` β no search, conversational reply only | |
| - `unstructured` β DocumentRetriever path (Cu) | |
| - `structured` β catalog-driven path (Cs βͺ Ct β planner β compiler β executor) | |
| The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt. | |
| --- | |
| ## 4. Core architectural decisions | |
| ### 4.1 Catalog as primary context, not retrieval | |
| For most users (β€50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table. | |
| When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed. | |
| ### 4.2 JSON IR over raw SQL | |
| The planner LLM emits a structured JSON IR describing query intent β not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations. | |
| Benefits: | |
| - Validatable with Pydantic before execution | |
| - Compiler whitelists allowed operations (no DROP, DELETE, etc.) | |
| - Portable: same IR β SQL (any dialect) / pandas / polars | |
| - Cheaper tokens, easier to debug, trivially testable without an LLM | |
| - LLM cannot emit valid-but-wrong SQL syntax | |
| ### 4.3 Deterministic compiler, not LLM SQL writer | |
| The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query. | |
| ### 4.4 Pipeline stage isolation | |
| Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation. | |
| ### 4.5 Minimal LLM surface | |
| LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free β the planner reads column names, stats, and sample rows directly): | |
| 1. **`IntentRouter`** β once per user message | |
| 2. **`QueryPlanner`** β once per structured query (produces the IR) | |
| 3. **`ChatbotAgent`** β once per answer (formats the response) | |
| Compiler and executors are pure code. No LLM in the hot path of query construction. | |
| --- | |
| ## 5. End-to-end flow | |
| ### Ingestion (when user uploads a file or connects a DB) | |
| ``` | |
| source upload / DB connect | |
| β | |
| introspect schema (DB: information_schema; tabular: file headers + sample rows) | |
| β | |
| validate (Pydantic) | |
| β | |
| write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id) | |
| ``` | |
| For unstructured files: chunk + embed β PGVector. | |
| ### Query (per user message) | |
| ``` | |
| User message | |
| β | |
| Chat cache check (Redis, 24h TTL) | |
| β miss | |
| Load chat history | |
| β | |
| IntentRouter LLM β needs_search? source_hint? | |
| β | |
| βββ chat β ChatbotAgent β SSE stream | |
| βββ unstructured β DocumentRetriever β answerer | |
| βββ structured β | |
| CatalogReader (load full Cs βͺ Ct for user) | |
| β | |
| QueryPlanner LLM β JSON IR | |
| β | |
| IRValidator (Pydantic + columns-exist + ops whitelist) | |
| β | |
| QueryCompiler β SQL (schema source) or pandas (tabular source) | |
| β | |
| QueryExecutor (DbExecutor or TabularExecutor) | |
| β | |
| QueryResult | |
| β | |
| ChatbotAgent β SSE stream | |
| ``` | |
| --- | |
| ## 6. Data catalog | |
| ### Storage | |
| Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`. | |
| ### Schema (initial scope) | |
| ``` | |
| Catalog | |
| βββ user_id, schema_version, generated_at | |
| βββ sources[] | |
| βββ Source | |
| βββ source_id, source_type, name, description, location_ref, updated_at | |
| βββ tables[] | |
| βββ Table | |
| βββ table_id, name, description, row_count | |
| βββ columns[] | |
| βββ Column | |
| βββ column_id, name, data_type, description | |
| βββ nullable | |
| βββ pii_flag | |
| βββ sample_values[] | |
| βββ stats: { min, max, distinct_count } | null | |
| ``` | |
| ### Best-practice fields deferred | |
| `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need. | |
| ### Stable IDs | |
| `source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs. | |
| ### PII handling | |
| Columns with `pii_flag: true` have `sample_values: null` β real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex. | |
| --- | |
| ## 7. JSON IR | |
| ### Schema (initial scope) | |
| ``` | |
| QueryIR | |
| βββ ir_version : "1.0" | |
| βββ source_id : str (references catalog) | |
| βββ table_id : str (references catalog) | |
| βββ select[] : SelectItem | |
| β βββ { kind: "column", column_id, alias? } | |
| β βββ { kind: "agg", fn, column_id?, alias? } | |
| βββ filters[] : { column_id, op, value, value_type } | |
| βββ group_by[] : column_id | |
| βββ order_by[] : { column_id | alias, dir } | |
| βββ limit : int | null | |
| ``` | |
| ### Whitelisted operators | |
| ``` | |
| Filter ops: = != < <= > >= in not_in is_null is_not_null like between | |
| Agg fns: count count_distinct sum avg min max | |
| ``` | |
| ### Validation rules (enforced before execution) | |
| - `source_id` exists in catalog for this user | |
| - `table_id` belongs to that source | |
| - Every `column_id` exists in that table | |
| - Every `agg.fn` and `filter.op` is whitelisted | |
| - `value_type` consistent with column's `data_type` | |
| - `limit` positive int, β€ hard cap (e.g. 10000) | |
| If any rule fails β reject IR β re-prompt planner with error context (max 3 retries). | |
| ### Deferred features | |
| `having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation. | |
| --- | |
| ## 8. Executors | |
| Same input (validated IR), same output (`QueryResult`), different backends. | |
| ### DbExecutor (schema sources) | |
| ``` | |
| IR β SqlCompiler β SQL string + params | |
| β | |
| sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced) | |
| β | |
| asyncpg / pymysql in read-only transaction with timeout (30s) | |
| β | |
| QueryResult | |
| ``` | |
| Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized β never inlined as strings. | |
| ### TabularExecutor (tabular sources) | |
| ``` | |
| IR β PandasCompiler β operation chain | |
| β | |
| choose strategy by file size: | |
| β€ 100 MB β eager pandas | |
| 100 MB-1 GB β pyarrow with predicate pushdown | |
| > 1 GB β polars lazy scan | |
| β | |
| execute in asyncio.to_thread (CPU work off the event loop) | |
| β | |
| QueryResult | |
| ``` | |
| Initially eager pandas is sufficient. Add the others when a real file is too big. | |
| ### Shared safety guarantees | |
| 1. IR validated before reaching compiler | |
| 2. Compiler is deterministic (no LLM) | |
| 3. Identifiers from catalog (trusted) | |
| 4. Values parameterized | |
| 5. sqlglot second-line defence for SQL | |
| 6. Read-only at every layer | |
| 7. Timeouts and row caps | |
| --- | |
| ## 9. Implementation scope | |
| ### Initial PR β what ships first | |
| | Item | Folder | | |
| |---|---| | |
| | Data catalog Pydantic models | `src/catalog/models.py` | | |
| | Catalog ingestion (introspect β enrich β validate β store) | `src/catalog/`, `src/pipeline/` | | |
| | `IntentRouter` with 3-way source_hint | `src/agents/` | | |
| | `CatalogReader` (loads full catalog) | `src/catalog/reader.py` | | |
| | `QueryPlanner` LLM call | `src/query/planner/` | | |
| | JSON IR Pydantic models | `src/query/ir/models.py` | | |
| | IR validator | `src/query/ir/validator.py` | | |
| **Output**: a validated JSON IR object. Execution lands in a follow-up PR. | |
| ### Follow-up PRs | |
| | PR | Scope | | |
| |---|---| | |
| | 2 | `QueryCompiler` (IR β SQL / pandas) | | |
| | 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` | | |
| | 4 | Retry / self-correction loop on execution failure | | |
| | 5 | Eval harness (golden questionβIRβresult examples) | | |
| | 6 | Auto PII tagging in catalog | | |
| | Later | Joins in IR, schema drift detection, hybrid catalog search | | |
| --- | |
| ## 10. Open questions | |
| | # | Question | Why it matters | | |
| |---|---|---| | |
| | 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance | | |
| | 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources | | |
| | 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff | | |
| | 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered | | |
| | 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts | | |
| --- | |
| ## 11. References | |
| - `docs/flowchart.html` β interactive end-to-end diagram (open in browser) | |
| - `docs/flowchart.mmd` β mermaid source for the diagram | |
| --- | |
| ## Glossary | |
| - **Cu** β unstructured context (prose chunks) | |
| - **Cs** β schema context (DB tables/columns from catalog) | |
| - **Ct** β tabular context (file sheets/columns from catalog) | |
| - **IR** β intermediate representation (the JSON query shape) | |
| - **PR** β pull request (a unit of code change) | |
| - **PII** β personally identifiable information (names, emails, etc.) | |
| - **ABC** β abstract base class (Python contract for subclasses) | |