Agentic-Service-Data-Eyond-Catalog


feat/Catalog Retrieval System

by rhbt6767 - opened 17 days ago

base: refs/heads/main ← from: refs/pr/1


+5655

-2365

This view is limited to 50 files because it contains too many changes.

Files changed (50)

ARCHITECTURE.md +340 -0
PHASE1_TO_PHASE2_REPORT.md +260 -0
PROGRESS.md +381 -0
REPO_CONTEXT.md +474 -0
main.py +15 -11
pyproject.toml +2 -1
scripts/build_initial_catalogs.py +73 -0
scripts/enrich_all_sources.py +16 -0
src/agents/chat_handler.py +274 -0
src/agents/chatbot.py +153 -69
src/agents/orchestration.py +100 -70
src/api/v1/chat.py +50 -167
src/api/v1/data_catalog.py +100 -0
src/api/v1/db_client.py +15 -25
src/api/v1/document.py +12 -2
src/api/v1/knowledge.py +0 -25
src/catalog/README.md +6 -0
src/catalog/__init__.py +1 -0
src/catalog/introspect/__init__.py +1 -0
src/catalog/introspect/base.py +18 -0
src/catalog/introspect/database.py +246 -0
src/catalog/introspect/tabular.py +239 -0
src/catalog/models.py +86 -0
src/catalog/pii_detector.py +39 -0
src/catalog/reader.py +40 -0
src/catalog/render.py +69 -0
src/catalog/store.py +82 -0
src/catalog/validator.py +49 -0
src/config/agents/guardrails_prompt.md +0 -7
src/config/agents/system_prompt.md +0 -26
src/{pipeline/document_pipeline → config/prompts}/__init__.py +0 -0
src/config/prompts/chatbot_system.md +31 -0
src/config/prompts/guardrails.md +11 -0
src/config/prompts/intent_router.md +66 -0
src/config/prompts/query_planner.md +168 -0
src/db/postgres/init_db.py +2 -1
src/db/postgres/models.py +19 -0
src/knowledge/processing_service.py +0 -106
src/middlewares/logging.py +3 -0
src/models/api/__init__.py +1 -0
src/models/api/catalog.py +27 -0
src/models/api/chat.py +17 -0
src/models/api/document.py +9 -0
src/models/user_info.py +0 -15
src/pipeline/db_pipeline/extractor.py +76 -35
src/pipeline/{document_pipeline/document_pipeline.py → document_pipeline.py} +32 -3
src/pipeline/structured_pipeline.py +91 -0
src/pipeline/triggers.py +115 -0
src/query/README.md +11 -0
src/query/base.py +0 -32

ARCHITECTURE.md ADDED

	@@ -0,0 +1,340 @@

+# Architecture — Data Eyond Agentic Service
+**Last updated**: 2026-05-07
+**Status**: Design phase — folder skeleton in place, implementation in progress
+---
+## TL;DR
+A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
+The architecture has two paths:
+- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text).
+- **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
+The LLM produces *intent*, not query syntax. Deterministic code does the rest.
+---
+## 1. Why catalog-driven design
+For a database or spreadsheet, a user's question maps to *known tables and columns*, not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) forces the right column to survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.
+A central per-user catalog also means:
+- One place to keep table/column descriptions, refreshed when the source changes.
+- The query planner sees the user's full data landscape in a single prompt.
+- Schema stays stable across user sessions without hitting the source DB on every query.
+- New sources auto-update the catalog without re-embedding chunks.
+---
+## 2. Source taxonomy
+```
+Sources
+├── Unstructured (pdf, docx, txt)        →  Cu  (prose chunks via DocumentRetriever)
+└── Structured
+    ├── Schema (DB)                       →  Cs  (DB tables + columns)
+    └── Tabular (xlsx, csv, parquet)      →  Ct  (sheets + columns)
+                                           Cs ∪ Ct = Data Catalog Context
+```
+- **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
+- **Cs** = DB schema context (tables, columns, descriptions, sample values).
+- **Ct** = tabular file context (sheets, columns, descriptions, sample values).
+- **Data Catalog Context** = `Cs ∪ Ct`. Passed to the query planner as a single unified view.
+DB vs tabular is **not** a routing concern — it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
+---
+## 3. Routing model
+```
+source_hint ∈ { chat, unstructured, structured }
+```
+- `chat` — no search, conversational reply only
+- `unstructured` — DocumentRetriever path (Cu)
+- `structured` — catalog-driven path (Cs ∪ Ct → planner → compiler → executor)
+The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.
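The three-way routing above can be sketched as a small dispatch table. This is only an illustration: the handler functions and `route` are hypothetical stand-ins, not the project's actual `IntentRouter` or `ChatHandler` code.

```python
from enum import Enum
from typing import Callable, Dict


class SourceHint(str, Enum):
    """The three routing targets named in the routing model."""
    CHAT = "chat"
    UNSTRUCTURED = "unstructured"
    STRUCTURED = "structured"


# Hypothetical handlers standing in for the real pipeline stages.
def answer_chat(message: str) -> str:
    return f"chat:{message}"


def answer_unstructured(message: str) -> str:
    return f"unstructured:{message}"


def answer_structured(message: str) -> str:
    return f"structured:{message}"


HANDLERS: Dict[SourceHint, Callable[[str], str]] = {
    SourceHint.CHAT: answer_chat,
    SourceHint.UNSTRUCTURED: answer_unstructured,
    SourceHint.STRUCTURED: answer_structured,
}


def route(source_hint: str, message: str) -> str:
    """Commit to exactly one path; an unknown hint raises ValueError."""
    hint = SourceHint(source_hint)
    return HANDLERS[hint](message)
```

Because DB vs tabular is an attribute of the catalog entry rather than a routing target, the table needs only three entries.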
+---
+## 4. Core architectural decisions
+### 4.1 Catalog as primary context, not retrieval
+For most users (≤50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
+When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
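As a rough illustration of passing the whole catalog verbatim, the sketch below flattens a catalog into prompt text. The nested-dict layout follows the field names in section 6, but `render_catalog` and its output format are assumptions, not the behavior of `src/catalog/render.py`.

```python
from typing import Any, Dict, List


def render_catalog(catalog: Dict[str, Any]) -> str:
    """Flatten sources, tables, and columns into indented prompt lines."""
    lines: List[str] = []
    for source in catalog["sources"]:
        lines.append(
            f"source {source['source_id']} ({source['source_type']}): {source['name']}"
        )
        for table in source["tables"]:
            lines.append(
                f"  table {table['table_id']}: {table['name']} rows={table['row_count']}"
            )
            for col in table["columns"]:
                samples = col.get("sample_values")
                # Columns without samples (for example PII columns) show no values.
                sample_text = (
                    " e.g. " + ", ".join(str(v) for v in samples[:3]) if samples else ""
                )
                lines.append(
                    f"    column {col['column_id']}: {col['name']} {col['data_type']}{sample_text}"
                )
    return "\n".join(lines)
```

One line per column keeps the token cost roughly proportional to the number of columns, which is what makes the small-catalog case cheap enough to skip retrieval.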
+### 4.2 JSON IR over raw SQL
+The planner LLM emits a structured JSON IR describing query intent — not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
+Benefits:
+- Validatable with Pydantic before execution
+- Compiler whitelists allowed operations (no DROP, DELETE, etc.)
+- Portable: same IR → SQL (any dialect) / pandas / polars
+- Cheaper tokens, easier to debug, trivially testable without an LLM
+- The LLM never writes SQL text, so syntax and dialect mistakes cannot originate from the model
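A minimal sketch of what "validatable before execution" might look like. It uses stdlib dataclasses in place of Pydantic to stay dependency-free, and the class layout is an assumption based on the IR schema in section 7; only the names `ALLOWED_FILTER_OPS` and `ALLOWED_AGG_FNS` come from the report.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

ALLOWED_FILTER_OPS = {
    "=", "!=", "<", "<=", ">", ">=",
    "in", "not_in", "is_null", "is_not_null", "like", "between",
}
ALLOWED_AGG_FNS = {"count", "count_distinct", "sum", "avg", "min", "max"}


@dataclass(frozen=True)
class Filter:
    column_id: str
    op: str
    value: Any = None

    def __post_init__(self) -> None:
        # Reject anything outside the whitelist at construction time.
        if self.op not in ALLOWED_FILTER_OPS:
            raise ValueError(f"filter op not allowed: {self.op!r}")


@dataclass(frozen=True)
class SelectItem:
    kind: str  # "column" or "agg"
    column_id: Optional[str] = None
    fn: Optional[str] = None
    alias: Optional[str] = None

    def __post_init__(self) -> None:
        if self.kind not in ("column", "agg"):
            raise ValueError(f"unknown select kind: {self.kind!r}")
        if self.kind == "column" and self.column_id is None:
            raise ValueError("column select needs a column_id")
        if self.kind == "agg" and self.fn not in ALLOWED_AGG_FNS:
            raise ValueError(f"agg fn not allowed: {self.fn!r}")


@dataclass(frozen=True)
class QueryIR:
    source_id: str
    table_id: str
    select: List[SelectItem]
    filters: List[Filter] = field(default_factory=list)
    group_by: List[str] = field(default_factory=list)
    limit: Optional[int] = None
```

Because there is no field that can carry a statement such as `DROP`, a destructive operation cannot be expressed in the IR at all; the whitelist only has to police operator names.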
+### 4.3 Deterministic compiler, not LLM SQL writer
+The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
+### 4.4 Pipeline stage isolation
+Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
+### 4.5 Minimal LLM surface
+LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free — the planner reads column names, stats, and sample rows directly):
+1. **`IntentRouter`** — once per user message
+2. **`QueryPlanner`** — once per structured query (produces the IR)
+3. **`ChatbotAgent`** — once per answer (formats the response)
+Compiler and executors are pure code. No LLM in the hot path of query construction.
+---
+## 5. End-to-end flow
+### Ingestion (when user uploads a file or connects a DB)
+```
+source upload / DB connect
+    ↓
+introspect schema (DB: information_schema; tabular: file headers + sample rows)
+    ↓
+validate (Pydantic)
+    ↓
+write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id)
+```
+For unstructured files: chunk + embed → PGVector.
+### Query (per user message)
+```
+User message
+    ↓
+Chat cache check (Redis, 24h TTL)
+    ↓ miss
+Load chat history
+    ↓
+IntentRouter LLM   →  needs_search?  source_hint?
+    ↓
+    ├── chat        → ChatbotAgent → SSE stream
+    ├── unstructured → DocumentRetriever → answerer
+    └── structured  →
+            CatalogReader (load full Cs ∪ Ct for user)
+                ↓
+            QueryPlanner LLM  →  JSON IR
+                ↓
+            IRValidator  (Pydantic + columns-exist + ops whitelist)
+                ↓
+            QueryCompiler  →  SQL (schema source) or pandas (tabular source)
+                ↓
+            QueryExecutor  (DbExecutor or TabularExecutor)
+                ↓
+            QueryResult
+                ↓
+            ChatbotAgent → SSE stream
+```
+---
+## 6. Data catalog
+### Storage
+Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.
+### Schema (initial scope)
+```
+Catalog
+├── user_id, schema_version, generated_at
+└── sources[]
+    └── Source
+        ├── source_id, source_type, name, description, location_ref, updated_at
+        └── tables[]
+            └── Table
+                ├── table_id, name, description, row_count
+                └── columns[]
+                    └── Column
+                        ├── column_id, name, data_type, description
+                        ├── nullable
+                        ├── pii_flag
+                        ├── sample_values[]
+                        └── stats: { min, max, distinct_count } | null
+```
+### Best-practice fields deferred
+`description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
+### Stable IDs
+`source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
+### PII handling
+Columns with `pii_flag: true` have `sample_values: null` — real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.
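The name-pattern plus value-regex approach could look roughly like the sketch below. The two patterns and the helper names are hypothetical illustrations; the actual rules in `src/security/pii_patterns.py` are not shown in this document.

```python
import re
from typing import List, Optional, Sequence

# Hypothetical patterns, for illustration only.
PII_NAME_PATTERN = re.compile(r"(email|phone|ssn|passport|birth)", re.IGNORECASE)
EMAIL_VALUE_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_pii(column_name: str, values: Sequence[object]) -> bool:
    """Flag a column if its name or any sampled value looks like PII."""
    if PII_NAME_PATTERN.search(column_name):
        return True
    return any(
        isinstance(v, str) and EMAIL_VALUE_PATTERN.match(v) is not None
        for v in values
    )


def safe_samples(column_name: str, values: Sequence[object]) -> Optional[List[object]]:
    """Return None for PII columns so real values never reach a prompt."""
    return None if is_pii(column_name, values) else list(values)
```

Checking values as well as names catches columns with uninformative headers (for example `contact`) that still hold email addresses.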
+---
+## 7. JSON IR
+### Schema (initial scope)
+```
+QueryIR
+├── ir_version          : "1.0"
+├── source_id           : str   (references catalog)
+├── table_id            : str   (references catalog)
+├── select[]            : SelectItem
+│   ├── { kind: "column", column_id, alias? }
+│   └── { kind: "agg",    fn, column_id?, alias? }
+├── filters[]           : { column_id, op, value, value_type }
+├── group_by[]          : column_id
+├── order_by[]          : { column_id | alias, dir }
+└── limit               : int | null
+```
+### Whitelisted operators
+```
+Filter ops:  = != < <= > >= in not_in is_null is_not_null like between
+Agg fns:     count count_distinct sum avg min max
+```
+### Validation rules (enforced before execution)
+- `source_id` exists in catalog for this user
+- `table_id` belongs to that source
+- Every `column_id` exists in that table
+- Every `agg.fn` and `filter.op` is whitelisted
+- `value_type` consistent with column's `data_type`
+- `limit` positive int, ≤ hard cap (e.g. 10000)
+If any rule fails → reject IR → re-prompt planner with error context (max 3 retries).
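A sketch of catalog-aware validation under simplifying assumptions: the IR and catalog are plain dicts, `validate_ir` is a hypothetical name, and the `value_type` and `order_by` checks are left out for brevity. It returns error strings rather than raising, so the messages can be fed back to the planner as error context.

```python
from typing import Any, Dict, List

ALLOWED_FILTER_OPS = {
    "=", "!=", "<", "<=", ">", ">=",
    "in", "not_in", "is_null", "is_not_null", "like", "between",
}
ALLOWED_AGG_FNS = {"count", "count_distinct", "sum", "avg", "min", "max"}
LIMIT_HARD_CAP = 10_000


def validate_ir(ir: Dict[str, Any], catalog: Dict[str, Any]) -> List[str]:
    """Return human-readable errors; an empty list means the IR passed."""
    sources = {s["source_id"]: s for s in catalog["sources"]}
    source = sources.get(ir["source_id"])
    if source is None:
        return [f"unknown source_id {ir['source_id']!r}"]
    tables = {t["table_id"]: t for t in source["tables"]}
    table = tables.get(ir["table_id"])
    if table is None:
        return [f"table_id {ir['table_id']!r} not in source {ir['source_id']!r}"]

    known = {c["column_id"] for c in table["columns"]}
    errors: List[str] = []
    for item in ir.get("select", []):
        cid = item.get("column_id")
        if cid is not None and cid not in known:
            errors.append(f"unknown column_id {cid!r} in select")
        if item.get("kind") == "agg" and item.get("fn") not in ALLOWED_AGG_FNS:
            errors.append(f"agg fn {item.get('fn')!r} not allowed")
    for flt in ir.get("filters", []):
        if flt["column_id"] not in known:
            errors.append(f"unknown column_id {flt['column_id']!r} in filters")
        if flt["op"] not in ALLOWED_FILTER_OPS:
            errors.append(f"filter op {flt['op']!r} not allowed")
    for cid in ir.get("group_by", []):
        if cid not in known:
            errors.append(f"unknown column_id {cid!r} in group_by")
    limit = ir.get("limit")
    if limit is not None and not (isinstance(limit, int) and 0 < limit <= LIMIT_HARD_CAP):
        errors.append(f"limit must be a positive int <= {LIMIT_HARD_CAP}")
    return errors
```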
+### Deferred features
+`having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
+---
+## 8. Executors
+Same input (validated IR), same output (`QueryResult`), different backends.
+### DbExecutor (schema sources)
+```
+IR → SqlCompiler → SQL string + params
+     ↓
+sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
+     ↓
+asyncpg / pymysql in read-only transaction with timeout (30s)
+     ↓
+QueryResult
+```
+Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized — never inlined as strings.
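The split between inlined identifiers and parameterized values can be illustrated with a tiny compiler for a subset of filter ops. This is a sketch under assumptions: `compile_select` is a hypothetical function, names are taken to be already resolved from catalog IDs, and the `:p0` placeholder style is one common convention rather than what `SqlCompiler` necessarily uses.

```python
from typing import Any, Dict, List, Tuple

COMPARISON_OPS = {"=", "!=", "<", "<=", ">", ">="}


def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(
    table: str,
    columns: List[str],
    filters: List[Dict[str, Any]],
    limit: int,
) -> Tuple[str, Dict[str, Any]]:
    """Build a SELECT string plus a params dict; values never enter the SQL text."""
    params: Dict[str, Any] = {}
    clauses: List[str] = []
    for i, flt in enumerate(filters):
        col = quote_ident(flt["column"])
        op = flt["op"]
        if op == "is_null":
            clauses.append(f"{col} IS NULL")
        elif op == "is_not_null":
            clauses.append(f"{col} IS NOT NULL")
        elif op in COMPARISON_OPS:
            key = f"p{i}"
            params[key] = flt["value"]
            clauses.append(f"{col} {op} :{key}")
        else:
            raise ValueError(f"op not supported in this sketch: {op!r}")
    sql = f"SELECT {', '.join(quote_ident(c) for c in columns)} FROM {quote_ident(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += f" LIMIT {int(limit)}"
    return sql, params
```

A hostile filter value such as `"1; DROP TABLE orders"` ends up only in the params dict, so the driver binds it as data and it never becomes part of the statement.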
+### TabularExecutor (tabular sources)
+```
+IR → PandasCompiler → operation chain
+     ↓
+choose strategy by file size:
+  ≤ 100 MB    → eager pandas
+  100 MB-1 GB → pyarrow with predicate pushdown
+  > 1 GB      → polars lazy scan
+     ↓
+execute in asyncio.to_thread (CPU work off the event loop)
+     ↓
+QueryResult
+```
+Initially eager pandas is sufficient. Add the others when a real file is too big.
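The size thresholds and the off-loop execution could be sketched as follows. The function names, the binary interpretation of MB/GB, and the list-of-dicts stand-in for a dataframe are all assumptions made to keep the example dependency-free.

```python
import asyncio
from typing import Any, Dict, List

MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB


def choose_strategy(file_size_bytes: int) -> str:
    """Map a file size to the execution strategy listed above."""
    if file_size_bytes <= 100 * MB:
        return "pandas_eager"
    if file_size_bytes <= GB:
        return "pyarrow_pushdown"
    return "polars_lazy"


def run_blocking(rows: List[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Stand-in for the CPU-bound operation chain (a simple filter here)."""
    return [r for r in rows if r["amount"] > threshold]


async def execute(rows: List[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Run the blocking work in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_blocking, rows, threshold)
```

`asyncio.to_thread` requires Python 3.9 or later; on older versions `loop.run_in_executor` plays the same role.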
+### Shared safety guarantees
+1. IR validated before reaching compiler
+2. Compiler is deterministic (no LLM)
+3. Identifiers from catalog (trusted)
+4. Values parameterized
+5. sqlglot second-line defence for SQL
+6. Read-only at every layer
+7. Timeouts and row caps
+---
+## 9. Implementation scope
+### Initial PR — what ships first
+| Item | Folder |
+|---|---|
+| Data catalog Pydantic models | `src/catalog/models.py` |
+| Catalog ingestion (introspect → enrich → validate → store) | `src/catalog/`, `src/pipeline/` |
+| `IntentRouter` with 3-way source_hint | `src/agents/` |
+| `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
+| `QueryPlanner` LLM call | `src/query/planner/` |
+| JSON IR Pydantic models | `src/query/ir/models.py` |
+| IR validator | `src/query/ir/validator.py` |
+**Output**: a validated JSON IR object. Execution lands in a follow-up PR.
+### Follow-up PRs
+| PR | Scope |
+|---|---|
+| 2 | `QueryCompiler` (IR → SQL / pandas) |
+| 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
+| 4 | Retry / self-correction loop on execution failure |
+| 5 | Eval harness (golden question→IR→result examples) |
+| 6 | Auto PII tagging in catalog |
+| Later | Joins in IR, schema drift detection, hybrid catalog search |
+---
+## 10. Open questions
+| # | Question | Why it matters |
+|---|---|---|
+| 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
+| 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
+| 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
+| 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
+| 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
+---
+## 11. References
+- `docs/flowchart.html` — interactive end-to-end diagram (open in browser)
+- `docs/flowchart.mmd` — mermaid source for the diagram
+---
+## Glossary
+- **Cu** — unstructured context (prose chunks)
+- **Cs** — schema context (DB tables/columns from catalog)
+- **Ct** — tabular context (file sheets/columns from catalog)
+- **IR** — intermediate representation (the JSON query shape)
+- **PR** — pull request (a unit of code change)
+- **PII** — personally identifiable information (names, emails, etc.)
+- **ABC** — abstract base class (Python contract for subclasses)

PHASE1_TO_PHASE2_REPORT.md ADDED

	@@ -0,0 +1,260 @@

+# Phase 1 → Phase 2 Migration Report
+A walkthrough of what changed between the original retrieval-style backend (Phase 1) and the current catalog-driven backend (Phase 2). Intended as a hand-off for the lead.
+---
+## 1. The conceptual change
+**Phase 1** was a single retrieval-style RAG pipeline. Every question — whether it pointed at a database, a spreadsheet, or a PDF — went through the same primitive: **chunk + embed + top-K** over PGVector. Schema and tabular columns were embedded as chunks and ranked alongside prose. When the question needed SQL, the LLM **wrote the SQL string directly** (via `query_executor`).
+**Phase 2** splits the system into two paths governed by an LLM router:
+| Path | Primitive | Why |
+|---|---|---|
+| Unstructured (PDF / DOCX / TXT) | Dense similarity over prose chunks (PGVector) | Right primitive for free text |
+| Structured (DB / CSV / XLSX / Parquet) | **Per-user data catalog** → LLM emits a **JSON IR** of intent → deterministic **compiler** → **executor** (SQL or pandas) | A column lookup shouldn't go through a similarity ranking lottery; the LLM emits intent, never SQL syntax |
+Three explicit LLM call sites only:
+1. **Intent router** (classifies the user message into `chat` / `unstructured` / `structured`)
+2. **Query planner** (turns the question + catalog into a Pydantic-validated `QueryIR`)
+3. **Chatbot agent** (formats the final answer, streamed over SSE)
+Everything else — IR validation, SQL/pandas compilation, execution — is deterministic Python.
+---
+## 2. File-by-file changes
+### 2.1 Deleted (Phase 1 only)
+| Phase 1 path | Reason it was removed |
+|---|---|
+| `src/rag/base.py`, `src/rag/retriever.py`, `src/rag/router.py` | Replaced by `src/retrieval/` |
+| `src/rag/retrievers/baseline.py`, `schema.py`, `document.py` | Schema retrieval gone (catalog replaces it); document retriever rewritten in `src/retrieval/document.py` |
+| `src/tools/search.py` (whole `tools/` folder) | Only consumer was `rag/router.py` |
+| `src/query/base.py` | Duplicate of `query/executor/base.py` |
+| `src/query/query_executor.py` | Replaced by `src/query/service.py` |
+| `src/query/executors/db_executor.py` | Replaced by `src/query/executor/db.py` |
+| `src/query/executors/tabular.py` | Replaced by `src/query/executor/tabular.py` |
+| `src/agents/chatbot.py` (Phase 1 LangChain chatbot) | Phase 2 `ChatbotAgent` lives at the same path now — see §2.2 |
+| `src/api/v1/knowledge.py` | Fake `/knowledge/rebuild` endpoint, never wired |
+| `src/config/agents/system_prompt.md`, `guardrails_prompt.md` | Replaced by `src/config/prompts/{chatbot_system,guardrails}.md` |
+| `src/models/structured_output.py` (`IntentClassification`) | Replaced by `IntentRouterDecision` Pydantic model inside `agents/orchestration.py` |
+| `src/models/sql_query.py` | LLM no longer emits SQL; IR replaces it |
+| `src/pipeline/orchestrator.py` (empty stub) | Redundant — `StructuredPipeline` takes the introspector at `run()` time |
+### 2.2 Renamed / moved (same role, new home)
+| Phase 1 location | Phase 2 location | Notes |
+|---|---|---|
+| `src/agents/answer_agent.py` (`AnswerAgent`), which had replaced the deleted Phase 1 `src/agents/chatbot.py` | `src/agents/chatbot.py::ChatbotAgent` | Renamed back to `chatbot.py`; final answer formation; streams via `astream` |
+| `src/knowledge/parquet_service.py` | `src/storage/parquet.py` | Parquet upload/download helper |
+| `src/pipeline/document_pipeline/document_pipeline.py` (folder) | `src/pipeline/document_pipeline.py` (flat) | Single module |
+| `src/rag/retrievers/document.py` | `src/retrieval/document.py` | `DocumentRetriever` migrated; tabular file types filtered out of results |
+| `src/rag/router.py` | `src/retrieval/router.py` | `RetrievalRouter`, Redis-cached, unstructured-only; dead `db: AsyncSession` + `source_hint` params removed |
+| `src/rag/base.py` (`RetrievalResult`, `BaseRetriever`) | `src/retrieval/base.py` | Same dataclass + ABC |
+> **Heads-up on the intent router**: the Phase 1 file `src/agents/orchestration.py` and its class `OrchestratorAgent` were **kept in place** for Phase 2 — but the body was fully rewritten. The class now emits `IntentRouterDecision(needs_search, source_hint ∈ {chat, unstructured, structured}, rewritten_query)`. The prompt file and test file use the `intent_router` name (`config/prompts/intent_router.md`, `tests/agents/test_intent_router.py`), but **the source module is still `orchestration.py` and the class is still `OrchestratorAgent`**. Existing imports continue to work; only the behavior changed.
+### 2.3 Added (Phase 2 new)
+**Catalog subsystem (whole new concept)**
+| Path | Role |
+|---|---|
+| `src/catalog/models.py` | Pydantic: `Catalog → Source[] → Table[] → Column[]`, `ForeignKey`, `ColumnStats.top_values` |
+| `src/catalog/introspect/base.py` | `BaseIntrospector` ABC |
+| `src/catalog/introspect/database.py` | DB introspector — wraps Phase 1 `db_pipeline/extractor.py` (`get_schema`, `profile_column`, `get_row_count`) |
+| `src/catalog/introspect/tabular.py` | CSV / XLSX / Parquet introspector — one `Table` per XLSX sheet |
+| `src/catalog/render.py` | Renders a `Source` for the planner prompt |
+| `src/catalog/validator.py` | Unique-ID + foreign-key-ref invariants |
+| `src/catalog/store.py` | Postgres `jsonb` upsert keyed by `user_id` (table `data_catalog`) |
+| `src/catalog/reader.py` | Loads + filters catalog by `source_hint` |
+| `src/catalog/pii_detector.py` | Flags PII columns at ingestion → suppresses `sample_values` |
+| `src/security/pii_patterns.py` | Name patterns + value regex used by the detector |
+**JSON IR + query subsystem**
+| Path | Role |
+|---|---|
+| `src/query/ir/models.py` | `QueryIR` Pydantic schema |
+| `src/query/ir/operators.py` | `ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP`, `TYPE_COMPATIBILITY` |
+| `src/query/ir/validator.py` | Catalog-aware IR validation (rejects unknown column ids, bad ops, type mismatches, oversize limits) |
+| `src/query/planner/service.py` | `QueryPlannerService.plan(question, catalog, previous_error)` — Azure OpenAI structured output → `QueryIR` |
+| `src/query/planner/prompt.py` | Builds the planner prompt from catalog text |
+| `src/query/compiler/base.py` | Compiler ABC |
+| `src/query/compiler/sql.py` | `SqlCompiler` (Postgres) — all 12 filter ops, params as a dict |
+| `src/query/compiler/pandas.py` | `PandasCompiler` — returns `CompiledPandas(apply, output_columns)` |
+| `src/query/executor/base.py` | `BaseExecutor` + `QueryResult` |
+| `src/query/executor/db.py` | `DbExecutor` — sqlglot SELECT-only guard, RO txn, 30 s `statement_timeout`, 10 k row cap |
+| `src/query/executor/tabular.py` | `TabularExecutor` — Parquet via blob, `asyncio.to_thread`, 10 k cap |
+| `src/query/executor/dispatcher.py` | `ExecutorDispatcher.pick(ir)` — picks by `source.source_type` |
+| `src/query/service.py` | `QueryService.run(user_id, question, catalog)` — plan → validate → retry (max 3) → dispatch → execute |
+**Agents**
+| Path | Role |
+|---|---|
+| `src/agents/orchestration.py` | `OrchestratorAgent` — Phase 1 file/class name preserved; Phase 2 body. Emits `IntentRouterDecision` |
+| `src/agents/chatbot.py` | `ChatbotAgent` — formerly `AnswerAgent` in `agents/answer_agent.py`; renamed in Cleanup PR |
+| `src/agents/chat_handler.py` | `ChatHandler.handle(...)` — top-level orchestrator; yields `intent` / `chunk` / `done` / `error` SSE events |
+**Pipelines & API**
+| Path | Role |
+|---|---|
+| `src/pipeline/structured_pipeline.py` | DB / tabular ingestion: introspect → merge → validate → upsert |
+| `src/pipeline/triggers.py` | `on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested` |
+| `src/api/v1/data_catalog.py` | `GET /api/v1/data-catalog/{user_id}` + `POST /api/v1/data-catalog/rebuild` |
+| `src/models/api/catalog.py` | Catalog request/response models |
+| `src/config/prompts/intent_router.md`, `query_planner.md`, `chatbot_system.md`, `guardrails.md` | New prompts. `guardrails.md` is appended to `chatbot_system.md` at load time |
+| `src/db/postgres/models.py` (added `Catalog` SQLAlchemy class) | Stores the per-user jsonb document in `data_catalog` |
+### 2.4 Rewired API endpoints
+| Endpoint | Phase 1 wiring | Phase 2 wiring |
+|---|---|---|
+| `POST /api/v1/chat/stream` | Inline in `chat.py`: `OrchestratorAgent` → `retriever` → `query_executor` → `chatbot` | Delegates to `ChatHandler.handle()`. Redis cache, fast intent, history load, and message persistence stay in the endpoint |
+| `POST /api/v1/database-clients/{id}/ingest` | Called `db_pipeline_service.run()` and dual-wrote vectors | Calls **only** `on_db_registered` (catalog build). Failure → HTTP 500 |
+| `POST /api/v1/document/process` | Always pushed to vector store | PDF/DOCX/TXT → `knowledge_processor` (vectors); CSV/XLSX → `on_tabular_uploaded` (catalog only, **no vector embedding**) |
+| `POST /api/v1/document/upload` | Storage + DB row | Same, plus `on_document_uploaded` trigger |
+| `POST /api/v1/data-catalog/rebuild` | — | New: iterates all sources, re-runs per-source trigger |
+| `GET /api/v1/data-catalog/{user_id}` | — | New: returns `list[CatalogIndexEntry]` |
+### 2.5 Phase 1 files still in production use
+These were **not rewritten** — Phase 2 imports them directly:
+- `src/database_client/database_client_service.py`
+- `src/utils/db_credential_encryption.py` (`decrypt_credentials_dict`) — `src/security/credentials.py` is still a stub
+- `src/pipeline/db_pipeline/db_pipeline_service.py` (`engine_scope` context manager — used by both the introspector and `DbExecutor`)
+- `src/pipeline/db_pipeline/extractor.py` (`get_schema`, `profile_column`, `get_row_count`)
+- `src/knowledge/processing_service.py` (PDF / DOCX / TXT extraction + embedding)
+- `src/db/postgres/{connection,init_db,vector_store}.py`, `src/storage/az_blob/`, `src/middlewares/`, `src/security/auth.py`
+---
+## 3. End-to-end flow (current state)
+### 3.1 Ingestion
+```
+User action                Pipeline                                Storage
+──────────────             ────────────────────────────             ─────────────────
+upload PDF/DOCX/TXT  →   DocumentPipeline                       →  Azure Blob + PGVector
+                          (extract → chunk → embed)                 (table: langchain_pg_embedding)
+                          + on_document_uploaded                    + retrieval cache invalidate
+upload CSV/XLSX     →    TabularIntrospector                   →  Azure Blob (Parquet)
+                          (sheets / columns + sample + stats)       + data_catalog jsonb row
+                          → CatalogValidator → CatalogStore         (NO vector store — catalog only)
+                          via on_tabular_uploaded
+register DB         →    DatabaseIntrospector                  →  data_catalog jsonb row
+                          (information_schema + sample + FKs)
+                          → validate → store
+                          via on_db_registered
+```
+### 3.2 Query (per user message → SSE stream)
+```
+POST /api/v1/chat/stream
+        │
+        ├── Redis cache check (24h TTL) — hit returns cached stream
+        ├── _fast_intent (greetings / goodbyes) — bypass LLM
+        ├── load history from chat_messages
+        │
+        └── ChatHandler.handle(message, user_id, history)         [src/agents/chat_handler.py]
+                │
+                ├─ OrchestratorAgent.classify()                    [agents/orchestration.py]
+                │     → needs_search, source_hint, rewritten_query
+                │
+                ├── source_hint == "chat"
+                │     → ChatbotAgent.astream() → yield chunk events
+                │
+                ├── source_hint == "unstructured"
+                │     → RetrievalRouter.retrieve()                 [retrieval/router.py, Redis-cached]
+                │         → DocumentRetriever (PGVector MMR/cosine/etc.)
+                │     → ChatbotAgent.astream(chunks=...)
+                │
+                └── source_hint == "structured"
+                      → CatalogReader.read(user_id, "structured")  [catalog/reader.py]
+                      → QueryService.run(user_id, question, catalog)   [query/service.py]
+                           │
+                           ├─ QueryPlannerService.plan(...)        [query/planner/service.py]
+                           │    LLM(catalog, question, prev_error?) → QueryIR
+                           │
+                           ├─ IRValidator.validate(ir, catalog)    [query/ir/validator.py]
+                           │    fail → loop back to planner with error context (max 3)
+                           │
+                           ├─ ExecutorDispatcher.pick(ir)          [query/executor/dispatcher.py]
+                           │    schema source  → DbExecutor
+                           │    tabular source → TabularExecutor
+                           │
+                           ├─ DbExecutor.run(ir):                  [query/executor/db.py]
+                           │    SqlCompiler → (sql, params)
+                           │      → sqlglot SELECT-only guard
+                           │      → engine_scope (Phase 1 utility) in asyncio.to_thread
+                           │      → RO txn + statement_timeout=30s + 10k cap
+                           │
+                           ├─ TabularExecutor.run(ir):             [query/executor/tabular.py]
+                           │    resolve Parquet blob path
+                           │      → download → PandasCompiler.apply(df)
+                           │      → asyncio.to_thread → 10k cap
+                           │
+                           └─ QueryResult { rows, columns, row_count,
+                                            truncated, source_id, error?, elapsed_ms }
+                      →
+                      ChatbotAgent.astream(query_result=...)
+                            → yield chunk events
+                │
+                └── final events: done / error
+        │
+        ├── persist user + assistant messages to chat_messages
+        └── populate Redis cache
+```
+**Safety invariants for the structured path** (read-only at every layer):
+1. IR validated against the catalog before reaching the compiler
+2. Identifiers come from the catalog (trusted; inlined as quoted identifiers)
+3. Values from `IR.filters` are always parameterized
+4. Compiler is deterministic — no LLM in the hot path
+5. sqlglot rejects anything that isn't a pure SELECT
+6. DB connection is read-only with a 30 s `statement_timeout`
+7. Hard 10 000 row cap on both executors; neither raises — errors go in `QueryResult.error`
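The plan → validate → retry loop in the flow above might be shaped like this sketch. `plan_with_retry` and its callable parameters are hypothetical, and it reads "max 3" as three planner attempts in total, which is one possible interpretation. In keeping with invariant 7, it reports failure as a returned error string instead of raising.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

MAX_PLAN_ATTEMPTS = 3  # assumption: three attempts in total


def plan_with_retry(
    plan: Callable[[str, Optional[str]], Dict[str, Any]],
    validate: Callable[[Dict[str, Any]], List[str]],
    question: str,
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (ir, None) on success or (None, error) once attempts run out."""
    previous_error: Optional[str] = None
    for _ in range(MAX_PLAN_ATTEMPTS):
        # The planner sees the last validation error as extra context.
        ir = plan(question, previous_error)
        errors = validate(ir)
        if not errors:
            return ir, None
        previous_error = "; ".join(errors)
    return None, previous_error
```

Passing the planner and validator in as callables keeps the loop testable without an LLM, which matches the stage-isolation goal in the architecture notes.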
+---
+## 4. Summary table for review
+| Concern | Phase 1 — where it lived | Phase 2 — where it lives | Change type |
+|---|---|---|---|
+| Intent classification | `agents/orchestration.py::OrchestratorAgent` (free-text intent) | **Same path + same class name** — body rewritten to emit `IntentRouterDecision` | Body rewrite only |
+| Top-level chat orchestration | Inline in `api/v1/chat.py` | `agents/chat_handler.py::ChatHandler` | Extracted to a reusable module |
+| Final answer formation | `agents/chatbot.py` (Phase 1 LangChain) | `agents/chatbot.py::ChatbotAgent` (was `AnswerAgent` in `answer_agent.py` mid-cycle) | Rewritten + renamed |
+| Schema retrieval (DB / tabular) | `rag/retrievers/schema.py` + PGVector chunks | **Removed**. Replaced by catalog (`catalog/store.py` jsonb) loaded verbatim into planner prompt | Whole concept replaced |
+| Doc retrieval (PDF / DOCX / TXT) | `rag/retrievers/document.py`, `rag/router.py` | `retrieval/document.py`, `retrieval/router.py` | Moved; Redis cache restored; tabular files filtered |
+| Query writing | `query/query_executor.py` + `models/sql_query.py` (LLM writes SQL) | `query/planner/service.py` (LLM writes IR) + `query/compiler/sql.py` (deterministic) | LLM emits intent, not SQL |
+| DB execution | `query/executors/db_executor.py` | `query/executor/db.py::DbExecutor` | Folder renamed (`executors` → `executor`); sqlglot guard + RO txn + 30 s timeout kept |
+| Tabular execution | `query/executors/tabular.py` | `query/executor/tabular.py::TabularExecutor` | Parquet-only; pandas compiler split out |
+| Executor selection | Hard-coded in `query_executor.py` | `query/executor/dispatcher.py::ExecutorDispatcher` | New; routes by `source.source_type` |
+| Catalog (NEW) | — | `catalog/` (models, introspect/, validator, store, reader, pii_detector, render) | New subsystem |
+| Catalog persistence | (data was embedded in PGVector) | Postgres jsonb table `data_catalog`, keyed by `user_id` | New table |
+| Ingestion triggers | Inline in API endpoints | `pipeline/triggers.py` (`on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested`) | Centralized event entry points |
+| Structured pipeline | `pipeline/db_pipeline/db_pipeline_service.py` (still present for `engine_scope` + extractor reuse) | `pipeline/structured_pipeline.py` (orchestrator) — reuses Phase 1 extractor | New orchestrator wraps Phase 1 introspection helpers |
+| Document pipeline | `pipeline/document_pipeline/document_pipeline.py` (folder) | `pipeline/document_pipeline.py` (file) | Flattened; CSV / XLSX now skip the vector store |
+| Parquet helper | `knowledge/parquet_service.py` | `storage/parquet.py` | Moved into `storage/` |
+| Prompts | `config/agents/system_prompt.md`, `guardrails_prompt.md` | `config/prompts/{intent_router,query_planner,chatbot_system,guardrails}.md` | Folder renamed; split into four files; guardrails appended to `chatbot_system` at load |
+| PII detection | — | `catalog/pii_detector.py` + `security/pii_patterns.py` | New. Columns flagged `pii_flag=true` get `sample_values: null` so PII never enters prompts |
+| Chat endpoint | `api/v1/chat.py` (does everything inline) | `api/v1/chat.py` (cache + history + persistence) → delegates to `ChatHandler` | Slimmed; SSE event shape is `intent` / `chunk` / `done` / `error` |
+| DB ingest endpoint | `api/v1/db_client.py::ingest` (Phase 1 `db_pipeline_service.run()`) | `api/v1/db_client.py::ingest` (calls `on_db_registered` only) | Phase 1 dual-write removed |
+| Document process endpoint | `api/v1/document.py::process` (always vectorize) | `api/v1/document.py::process` (PDF/DOCX/TXT → vectors; CSV/XLSX → catalog via `on_tabular_uploaded`) | Routing by file type |
+| Catalog management API | — | `api/v1/data_catalog.py` (GET index + POST rebuild) | New |
+**Bottom line.** Every Phase 1 file under `src/rag/`, `src/tools/`, `src/query/executors/`, `src/query/query_executor.py`, `src/query/base.py`, `src/api/v1/knowledge.py`, and `src/config/agents/` is gone. Phase 1 introspection helpers under `src/pipeline/db_pipeline/` and `src/database_client/` are still imported by Phase 2 — they were not rewritten, just wrapped. The three LLM call sites are now explicit and the SQL-writing one no longer exists; the planner emits a Pydantic-validated `QueryIR` instead.
+The one filename gotcha to remember: the **intent router** still lives at `src/agents/orchestration.py` as class `OrchestratorAgent` (Phase 1 name kept for import-site compatibility, Phase 2 body). The matching prompt and tests use the `intent_router` name, but the source module does not.

PROGRESS.md ADDED

	@@ -0,0 +1,381 @@

+# Progress — Phase 2 catalog-driven build
+Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
+**Last updated**: 2026-05-12 ([NOTICKET] Cleanup PR landed: ChatHandler wired to chat.py, Phase 1 dual-write dropped from /ingest, on_catalog_rebuild_requested implemented, dead modules deleted, answer_agent→chatbot renamed, retrieval cache restored via RetrievalRouter, top_values added to ColumnStats, lifespan migration, knowledge_router removed)
+**Current open PR**: `pr/1` — active. Cleanup PR committed and pushed.
+---
+## Legend
+- `[x]` done and merged
+- `[~]` in progress (open PR or active branch)
+- `[ ]` not started
+- **DB** / **TAB** / **B** — ownership (from REPO_CONTEXT.md)
+---
+## PR sequence
+| PR | Status | Owner(s) | Scope |
+|---|---|---|---|
+| PR1 | `[x]` merged | DB | Contract locks + catalog plumbing + DB introspector + IR validator + tests |
+| PR1-tab | `[x]` shipped | TAB | Tabular introspector + on_tabular_uploaded trigger + 31 unit tests |
+| PR2a | `[x]` merged | DB | CatalogEnricher + StructuredPipeline + on_db_registered trigger + FK extension on Table (enricher later removed in KM-557) |
+| KM-557 | `[x]` shipped | DB | Drop CatalogEnricher entirely (cost cut — planner uses stats + sample rows directly); rename jsonb table `catalogs` → `data_catalog`; add `GET /api/v1/data-catalog/{user_id}` index endpoint for catalog refresher |
+| PR2b | `[x]` shipped | DB-solo (B-review) | IntentRouter + planner prompt + planner LLM service |
+| PR3-DB | `[x]` shipped | DB | SqlCompiler (Postgres) + DbExecutor (sqlglot guard, RO + statement_timeout, asyncio.to_thread) + 36 golden IR→SQL tests |
+| PR3-TAB | `[x]` shipped | TAB | PandasCompiler + TabularExecutor + 43+12 golden IR→DataFrame tests |
+| PR4 | `[x]` | DB-solo (B-review) | ExecutorDispatcher + QueryService + ChatHandler module. **API rewired in Cleanup PR.** |
+| PR5 | `[x]` shipped | DB-solo (B-review) | Retry/self-correction loop on validation failure (lives in QueryService, max 3 attempts, planner re-prompted with prior error) |
+| PR6 | `[~]` scaffold | DB-solo (B-review) | Eval harness scaffold + 3 DB-targeting golden cases. Skipped without `RUN_PLANNER_EVAL=1` env. TAB extends with tabular cases. |
+| PR7 | `[x]` | DB-solo (B-review) | `ChatbotAgent` (renamed from `AnswerAgent`) + chatbot_system + guardrails prompts. `answer_agent.py` → `chatbot.py`, `AnswerAgent` → `ChatbotAgent`. API rewired in Cleanup PR. |
+| Cleanup | `[x]` | B | ChatHandler wired to chat.py; Phase 1 dual-write dropped from /ingest; on_catalog_rebuild_requested + POST /data-catalog/rebuild; dead modules deleted (chatbot Phase 1, orchestrator, query/base, knowledge.py, config/agents/); retrieval cache restored via RetrievalRouter; top_values added to ColumnStats; lifespan migration; knowledge_router removed. |
+---
+## All items
+### Contracts (B — shared)
+| # | Item | Status | Notes |
+|---|---|---|---|
+| 1 | Catalog Pydantic models (`catalog/models.py`) | `[x]` | PR1 added `location_ref` URI-scheme docstring; PR2a added `ForeignKey` model + `Table.foreign_keys` field |
+| 2 | IR Pydantic models (`query/ir/models.py`) | `[x]` | Pre-existing scaffold |
+| 3 | IR operator whitelists (`query/ir/operators.py`) | `[x]` | PR1 filled `TYPE_COMPATIBILITY` matrix |
+| 4 | PII patterns / regex (`security/pii_patterns.py`) | `[x]` | Pre-existing |
+| — | `data_catalog` Postgres jsonb table (`db/postgres/models.py`) | `[x]` | PR1 added `Catalog` SQLAlchemy class + `init_db.py` import. KM-557 renamed `__tablename__` from `catalogs` → `data_catalog`; created fresh (no migration) |
+| — | `QueryResult` shape (`query/executor/base.py`) | `[x]` | Pre-existing scaffold; `columns: list[str]` added (TAB owner, PR1-tab) — DbExecutor updated to populate it. |
+| — | `Source.location_ref` URI scheme | `[x]` | PR1 documented in `catalog/models.py` docstring |
+### Ingestion — introspection
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 5 | DB introspector (`catalog/introspect/database.py`) | DB | `[x]` | PR1 — reuses Phase 1 `database_client_service`, `db_credential_encryption`, `db_pipeline_service.engine_scope`, `extractor.get_schema/profile_column/get_row_count`. PR2a wired FK extraction (was discarded before). |
+| 6 | Tabular introspector (`catalog/introspect/tabular.py`) | TAB | `[~]` | PR1-tab — downloads original blob (CSV/XLSX/Parquet), one Table per sheet (XLSX) or one Table (CSV/Parquet). `source_id = document_id`. `fetch_doc`/`fetch_blob` injectable for unit tests (no Settings). |
+| 7 | `BaseIntrospector` ABC (`catalog/introspect/base.py`) | B | `[x]` | Pre-existing; signature locked |
+### Ingestion — shared catalog plumbing
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 8 | ~~Catalog enricher + prompt~~ | B | **REMOVED in KM-557** | Cost optimization — planner reads stats + sample rows + column names directly. `catalog/enricher.py` + `config/prompts/catalog_enricher.md` deleted. `render_source` (the only piece still needed) moved to `src/catalog/render.py`. Tests moved to `tests/catalog/test_render.py`. |
+| 9 | Catalog validator (`catalog/validator.py`) | B | `[x]` | PR1 (DB owner picked up) — uniqueness invariants |
+| 10 | Catalog store — Postgres jsonb (`catalog/store.py`) | B | `[x]` | PR1 (DB owner picked up) — `INSERT ... ON CONFLICT` |
+| 11 | Catalog reader (`catalog/reader.py`) | B | `[x]` | PR1 (DB owner picked up) — filters by source_hint, empty on miss |
+| 12 | PII detector (`catalog/pii_detector.py`) | B | `[x]` | PR1 (DB owner picked up) — name + value matching, bias toward over-flag |
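The name + value matching described for the PII detector, together with the `sample_values: null` rule, can be sketched as below. This is only an illustration: the regexes, the `Column` dataclass, and the `redact` helper are hypothetical stand-ins, not the contents of `catalog/pii_detector.py` or `security/pii_patterns.py`.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; the real ones live in security/pii_patterns.py.
NAME_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"e[-_]?mail", r"phone", r"ssn", r"passport")
]
VALUE_PATTERNS = [
    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),  # email-like value
    re.compile(r"^\+?\d[\d\s\-()]{7,}\d$"),     # phone-like value
]


@dataclass
class Column:
    name: str
    sample_values: list[str] | None
    pii_flag: bool = False


def looks_like_pii(name: str, samples: list[str]) -> bool:
    """Flag on a column-name hit OR any single value hit (biased toward over-flagging)."""
    if any(p.search(name) for p in NAME_PATTERNS):
        return True
    return any(p.match(str(v)) for v in samples for p in VALUE_PATTERNS)


def redact(column: Column) -> Column:
    """Return a copy with samples dropped when the column looks like PII."""
    samples = column.sample_values or []
    if looks_like_pii(column.name, samples):
        return Column(name=column.name, sample_values=None, pii_flag=True)
    return column
```

Flagging on a single matching value is what makes the sketch err on the side of over-flagging: one email-shaped sample is enough to drop every sample for that column.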
+### Ingestion — pipelines
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 13 | Structured pipeline (`pipeline/structured_pipeline.py`) | B | `[x]` | PR2a (DB owner) — Source-type-agnostic: caller supplies the introspector. `default_structured_pipeline()` factory wires production deps lazily so tests can inject mocks without `Settings()` construction. **KM-557**: enrich step removed; pipeline is now `introspect → merge with existing → validate → upsert`. Constructor no longer takes `enricher`. |
+| 14 | Triggers (`pipeline/triggers.py`) | B | `[x]` | PR2a — `on_db_registered` implemented (DB owner). PR1-tab — `on_tabular_uploaded` implemented (TAB owner). **2026-05-11** — `on_document_uploaded` implemented. **2026-05-12** — `on_catalog_rebuild_requested` implemented: iterates all Sources in current catalog, re-runs `on_db_registered` (schema) or `on_tabular_uploaded` (tabular) per source; per-source errors logged but don't abort. |
+| 15 | Ingestion orchestrator (`pipeline/orchestrator.py`) | B | **DELETED** | Redundant stub — `StructuredPipeline` already takes introspector at run() time. Deleted in Cleanup PR. |
+| 16 | Document pipeline (`pipeline/document_pipeline.py`) | TAB | `[x]` | Flattened `pipeline/document_pipeline/document_pipeline.py` (folder) → `pipeline/document_pipeline.py` (file). Updated import in `api/v1/document.py`. |
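The `introspect → merge with existing → validate → upsert` flow of the structured pipeline can be sketched with injected dependencies. Everything here is an assumption made for illustration: the dataclasses, the in-memory store, and especially the merge rule (a re-introspected source replaces the entry with the same `source_id`) are not taken from `pipeline/structured_pipeline.py`.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class Source:
    source_id: str
    source_type: str  # e.g. "schema" or "tabular"


@dataclass
class Catalog:
    user_id: str
    sources: list[Source] = field(default_factory=list)


class InMemoryStore:
    """Stand-in for the Postgres jsonb store (one catalog per user_id)."""

    def __init__(self) -> None:
        self._rows: dict[str, Catalog] = {}

    async def get(self, user_id: str) -> Catalog | None:
        return self._rows.get(user_id)

    async def upsert(self, catalog: Catalog) -> None:
        self._rows[catalog.user_id] = catalog


def validate_unique_sources(catalog: Catalog) -> None:
    """Hypothetical uniqueness invariant: no two sources share a source_id."""
    ids = [s.source_id for s in catalog.sources]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate source_id in catalog")


class PipelineSketch:
    def __init__(self, validator: Callable[[Catalog], None], store: InMemoryStore) -> None:
        self._validator = validator
        self._store = store

    async def run(self, user_id: str, introspect: Callable[[], Awaitable[Source]]) -> Catalog:
        new_source = await introspect()                   # 1. introspect
        existing = await self._store.get(user_id)
        kept = [
            s for s in (existing.sources if existing else [])
            if s.source_id != new_source.source_id        # assumed merge rule
        ]
        merged = Catalog(user_id, kept + [new_source])    # 2. merge with existing
        self._validator(merged)                           # 3. validate
        await self._store.upsert(merged)                  # 4. upsert
        return merged
```

Because the caller passes the introspector in, the same sketch serves both database and tabular sources, and tests can supply a stub coroutine instead of real I/O.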
+### Query — shared spine
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 17 | IR validator (`query/ir/validator.py`) | B | `[x]` | PR1 (DB owner) — full rule set; descriptive errors for planner retry |
+| 18 | Planner LLM service (`query/planner/service.py`) | B | `[x]` | PR2b — Azure OpenAI structured output → `QueryIR`. Injectable chain. Supports retry via `previous_error` argument. |
+| 19 | Planner prompt (`query/planner/prompt.py`, `config/prompts/query_planner.md`) | B | `[x]` | PR2b — system prompt with hard constraints + few-shot for DB and tabular sources. `build_planner_prompt(question, catalog, previous_error)` calls `catalog.render.render_source` (moved from `catalog.enricher.render_source` in KM-557). |
+| 20 | Intent router (`agents/orchestration.py` — class `OrchestratorAgent`; `config/prompts/intent_router.md`) | B | `[x]` | PR2b — single LLM call → `IntentRouterDecision(needs_search, source_hint, rewritten_query)`. Supports conversation history. **NOTE**: source filename + class name were kept from Phase 1 for import-site compatibility; only the body is Phase 2. Prompt file and test file use the `intent_router` name. |
+| 21 | Executor base + `QueryResult` (`query/executor/base.py`) | B | `[x]` | Pre-existing scaffold |
+| 22 | Executor dispatcher (`query/executor/dispatcher.py`) | B | `[x]` | PR4 — picks DbExecutor / TabularExecutor by `source.source_type`. Lazy imports of production executors keep import side-effect-free for tests. Caches per source_type. |
+| 23 | Compiler base ABC (`query/compiler/base.py`) | B | `[x]` | Pre-existing scaffold |
+| 24 | Top-level QueryService (`query/service.py`) | B | `[x]` | PR4+5 — `plan → validate → dispatch → execute → QueryResult`. Retry loop on validation failure (max 3, planner re-prompted with prior error). Catches NotImplementedError from TabularExecutor placeholder gracefully. Never raises. |
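The `plan → validate → dispatch → execute` loop with its retry on validation failure can be sketched as follows. The function and callable names are hypothetical, the IR is a plain dict rather than `QueryIR`, and treating `ValueError` as the validation failure type is an assumption; only the overall shape (max 3 attempts, prior error fed back to the planner, errors returned rather than raised) follows the notes above.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable

MAX_ATTEMPTS = 3  # matches the "max 3" noted for the retry loop


@dataclass
class QueryResult:
    rows: list[Any] = field(default_factory=list)
    error: str | None = None


async def run_query(
    question: str,
    plan: Callable[[str, str | None], Awaitable[dict]],
    validate: Callable[[dict], None],
    execute: Callable[[dict], Awaitable[QueryResult]],
) -> QueryResult:
    """Plan, validate, and execute; re-prompt the planner with the prior error."""
    previous_error: str | None = None
    try:
        for _ in range(MAX_ATTEMPTS):
            ir = await plan(question, previous_error)
            try:
                validate(ir)
            except ValueError as exc:
                # Feed the validator's message into the next planning attempt.
                previous_error = str(exc)
                continue
            return await execute(ir)
        return QueryResult(
            error=f"planning failed after {MAX_ATTEMPTS} attempts: {previous_error}"
        )
    except Exception as exc:  # never raise to the caller
        return QueryResult(error=str(exc))
```

Descriptive validator errors matter here because the error string is the only feedback the planner gets between attempts.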
+### Query — DB path
+| # | Item | Status | Notes |
+|---|---|---|---|
+| 25 | SQL compiler (`query/compiler/sql.py`) | `[x]` | PR3-DB — Postgres dialect (Supabase reuses); deterministic IR → (sql, named-params dict); double-quoted identifiers from catalog; all whitelisted ops (=, !=, <, <=, >, >=, in, not_in, is_null, is_not_null, like, between); alias-aware order_by; `CompiledSql.params: dict[str, Any]` (changed from `list`). MySQL/BigQuery/Snowflake compilers later. |
+| 26 | DB executor (`query/executor/db.py`) | `[x]` | PR3-DB — sync engine via `db_pipeline_service.engine_scope` inside `asyncio.to_thread`. sqlglot SELECT-only / no-DML guard. Postgres-only session settings: `default_transaction_read_only=on` + `statement_timeout=30000`. asyncio.wait_for backstop. Never raises — populates `QueryResult.error`. 10k row hard cap. |
+| 27 | Credential encryption (`security/credentials.py`) | `[ ]` | Stub exists; PR1 reused Phase 1 `utils/db_credential_encryption.py` instead. Move in cleanup PR |
+| 28 | User-DB connection management | `[x]` | PR3-DB reused Phase 1 `db_pipeline_service.engine_scope` (same as PR1 introspector); no new helper needed |
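A deterministic IR → `(sql, named-params dict)` step can be sketched as below, covering only a subset of the whitelisted operators. The `Filter` dataclass, the `compile_select` name, and the `:p0` placeholder style are assumptions for illustration and are not the actual `SqlCompiler` interface; the identifier quoting mirrors the described behavior of escaping embedded double quotes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

COMPARISON_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass
class Filter:
    column: str
    op: str
    value: Any = None


def qident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(
    table: str,
    columns: list[str],
    filters: list[Filter],
    limit: int | None = None,
) -> tuple[str, dict[str, Any]]:
    """Build a parameterized SELECT; values never get interpolated into the SQL text."""
    params: dict[str, Any] = {}
    clauses: list[str] = []
    for i, f in enumerate(filters):
        col = qident(f.column)
        if f.op in COMPARISON_OPS:
            params[f"p{i}"] = f.value
            clauses.append(f"{col} {f.op} :p{i}")
        elif f.op == "is_null":
            clauses.append(f"{col} IS NULL")
        elif f.op == "is_not_null":
            clauses.append(f"{col} IS NOT NULL")
        elif f.op == "between":
            low, high = f.value
            params[f"p{i}_lo"], params[f"p{i}_hi"] = low, high
            clauses.append(f"{col} BETWEEN :p{i}_lo AND :p{i}_hi")
        else:
            raise ValueError(f"operator not in whitelist: {f.op!r}")
    sql = f"SELECT {', '.join(qident(c) for c in columns)} FROM {qident(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql, params
```

The point of the split is that the LLM only ever produces the structured filter list; the SQL text is assembled by code that rejects anything outside the whitelist.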
+### Query — Tabular path
+| # | Item | Status | Notes |
+|---|---|---|---|
+| 29 | Pandas compiler (`query/compiler/pandas.py`) | `[~]` | PR3-TAB — `CompiledPandas` dataclass; all 12 filter ops; all 6 aggs; group_by via `pd.concat` of Series; alias-aware order_by; `_like_to_regex` (`%`→`.*`, `_`→`.`); pure module-level helpers |
+| 30 | Tabular executor (`query/executor/tabular.py`) | `[~]` | PR3-TAB — `fetch_blob` injectable for tests; blob path: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`; `asyncio.to_thread`; 10k row hard cap; errors → `QueryResult.error` |
+| 31 | Parquet upload/download wrapper | `[x]` | Moved `knowledge/parquet_service.py` → `storage/parquet.py`. Updated 4 import sites: `pipeline/document_pipeline.py`, `knowledge/processing_service.py`, `query/executor/tabular.py`, `query/executors/tabular.py`. |
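The `%`→`.*`, `_`→`.` translation used for `like` filters can be sketched like this. The function name, the full-string anchoring, and the escaping of other regex metacharacters are assumptions about how such a helper would typically be written, not a copy of `_like_to_regex`.

```python
from __future__ import annotations

import re


def like_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    parts: list[str] = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal; keeps "." from acting as a wildcard
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)
```

Escaping the literal characters is the detail that is easy to miss: without it a pattern such as `1.5%` would also match `125`.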
+### Agents + chat
+| # | Item | Status | Notes |
+|---|---|---|---|
+| 32 | Chatbot agent + prompt (`agents/chatbot.py`, `config/prompts/chatbot_system.md`) | `[x]` | PR7-bundle — `ChatbotAgent` (was `AnswerAgent`) streams tokens, accepts `QueryResult` or list[`DocumentChunk`] or neither. **Cleanup PR**: renamed `answer_agent.py` → `chatbot.py`, `AnswerAgent` → `ChatbotAgent`; Phase 1 `agents/chatbot.py` deleted. |
+| 33 | Guardrails prompt (`config/prompts/guardrails.md`) | `[x]` | PR7-bundle — appended to `chatbot_system.md` so guardrails take precedence in conflict. |
+| — | Chat handler / orchestrator (`agents/chat_handler.py`) | `[x]` | PR4-bundle — top-level Phase 2 orchestrator. Routes by `source_hint`: chat → `ChatbotAgent` direct; structured → CatalogReader + QueryService; unstructured → `RetrievalRouter` (wrapping `DocumentRetriever`) + `ChatbotAgent`. Yields `intent` / `chunk` / `done` / `error` SSE-style events. **Cleanup PR**: `api/v1/chat.py` rewired to call `ChatHandler.handle()`. |
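The routing by `source_hint` and the `intent` / `chunk` / `done` / `error` event sequence can be sketched as an async generator. The `handle` signature, the injected callables, and the `{"event": ..., "data": ...}` dict layout are hypothetical; only the three hint values and the four event names come from the notes above.

```python
from __future__ import annotations

import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable


async def handle(
    message: str,
    classify: Callable[[str], Awaitable[str]],
    fetch_context: Callable[[str, str], Awaitable[Any]],
    answer: Callable[[str, Any], AsyncIterator[str]],
) -> AsyncIterator[dict]:
    """Yield SSE-style events for one chat turn; failures become an error event."""
    try:
        source_hint = await classify(message)
        yield {"event": "intent", "data": source_hint}

        context = None
        if source_hint in ("structured", "unstructured"):
            # structured: catalog + query result; unstructured: document chunks
            context = await fetch_context(source_hint, message)
        # source_hint == "chat" goes straight to the answer step with no context

        async for token in answer(message, context):
            yield {"event": "chunk", "data": token}
        yield {"event": "done", "data": None}
    except Exception as exc:
        yield {"event": "error", "data": str(exc)}
```

Keeping every dependency injectable means the event ordering can be tested with plain coroutines and no LLM or database.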
+### API surface
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 34 | DB client endpoints (`api/v1/db_client.py`) | DB | `[x]` | **Cleanup PR** — `/ingest` now calls only `on_db_registered`. Phase 1 `db_pipeline_service.run()` + `decrypt_credentials_dict` removed. Error from catalog build now raises HTTP 500 (was silent log). Response simplified to `{"status": "success", "client_id": ...}`. |
+| 35 | Document/tabular upload endpoints (`api/v1/document.py`) | TAB | `[x]` | Rewired `/document/process` — after processing CSV/XLSX, calls `on_tabular_uploaded(document_id, user_id)`. Catalog ingestion failure is logged but does not fail the request. **2026-05-11** — CSV/XLSX no longer ingested to vector store (`knowledge_processor` skipped for tabular types in `document_pipeline.py`); they go to catalog only. |
+| 36 | Chat stream endpoint (`api/v1/chat.py`) | B | `[x]` | Rewired `/chat/stream` — replaced `query_executor.execute()` (Phase 1) with `CatalogReader + QueryService` (Phase 2). **Cleanup PR**: fully rewired to `ChatHandler.handle()`. Inline intent routing, retrieval, and answer generation removed. Redis cache, fast intent, history loading, and message persistence remain in chat.py. Sources event emits `[]` (retrieval not yet exposed by ChatHandler). |
+| 37 | Room / users endpoints (`api/v1/room.py`, `api/v1/users.py`) | B | `[ ]` | No catalog work; only touch if auth flow changes |
+| — | Data catalog index endpoint (`api/v1/data_catalog.py`) | DB | `[x]` | **KM-557** — `GET /api/v1/data-catalog/{user_id}` → `list[CatalogIndexEntry]`. **Cleanup PR** — added `POST /api/v1/data-catalog/rebuild?user_id=` → calls `on_catalog_rebuild_requested`; per-source errors logged but don't fail the request. |
+### Tests + eval
+| # | Item | Owner | Status | Notes |
+|---|---|---|---|---|
+| 38 | DB compiler golden tests (`tests/query/compiler/test_sql.py`) | DB | `[x]` | PR3-DB — 36 tests across all whitelisted ops, identifier quoting, agg / count_distinct / count(*), order_by alias resolution, parameter sequencing, error paths. Pure-Python, no LLM, no DB. |
+| 39 | Pandas compiler golden tests (`tests/unit/query/compiler/test_pandas_compiler.py`) | TAB | `[~]` | PR3-TAB — 43 tests: all 12 filter ops, all 6 aggs, group_by, order_by, limit, aliases, empty DataFrame, error paths. `test_tabular_executor.py` adds 12 more (blob name resolution + happy path + error paths). |
+| 40 | IR validator tests (`tests/query/ir/test_validator.py`) | B | `[x]` | PR1 — 19 tests, all rules covered |
+| — | PII detector tests (`tests/catalog/test_pii_detector.py`) | B | `[x]` | PR1 — 26 tests (parametrized) |
+| — | Catalog validator tests (`tests/catalog/test_validator.py`) | B | `[x]` | PR1 — 5 tests |
+| — | Catalog render tests (`tests/catalog/test_render.py`) | B | `[x]` | **KM-557** — 5 tests (renamed from `test_enricher.py`; LLM enrichment tests dropped, render-only tests kept). |
+| — | Catalog store integration test (`tests/catalog/test_store.py`) | DB | `[x]` | PR1 — module-level skip without `RUN_INTEGRATION_TESTS=1` |
+| — | DB introspector test | DB | `[ ]` | Deferred to PR2 — needs Postgres testcontainer or fixture infra |
+| — | Tabular introspector test | TAB | `[x]` | PR1-tab — 31 unit tests (CSV/XLSX/Parquet, stats, PII, error paths). No DB/blob I/O — mocks injected via constructor. |
+| 41 | Planner eval (`tests/query/planner/`) | B | `[x]` | PR6-scaffold — `test_golden_questions.py` with 3 DB-targeting cases. TAB added `test_golden_tabular.py` with 4 tabular cases (group_by+sum, top-N+limit, date range filter, XLSX sheet selection). All 4 passed against real Azure OpenAI. Fix shipped alongside: `query/planner/service.py` replaced `("system", text)` tuple with `SystemMessage` — without this, `{...}` in `query_planner.md` was parsed as f-string variables and crashed on every real invocation. |
+| 42 | E2E smoke tests (`tests/e2e/`) | B | `[ ]` | Deferred until Phase 2 endpoints were wired; the Cleanup PR has now done that, so this is unblocked. Component-level orchestration is already covered by `test_chat_handler.py` + `test_service.py`. |
+| — | Golden IR fixtures (`tests/fixtures/golden_irs.json`) | B | `[~]` | PR1 seeded with 5 DB-targeting examples; TAB extends in PR1-tab |
+| — | Shared `sample_catalog` fixture (`tests/conftest.py`) | B | `[x]` | PR1 — DB-shaped; TAB may add tabular sibling |
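The brace problem noted for item 41 can be reproduced in miniature with plain `str.format`, which is used here only as an analogy for template-style prompt parsing; the LangChain behavior itself is not reproduced, and the prompt text and `escape_braces` helper are hypothetical.

```python
# A prompt containing literal JSON braces, like the examples in a planner prompt.
PROMPT = 'Reply with JSON such as {"source_id": "..."} and nothing else.'


def render_as_template(text: str) -> str:
    """Treat the text as a format template: literal braces are parsed as fields."""
    return text.format()


def escape_braces(text: str) -> str:
    """Double the braces so a template engine emits them literally."""
    return text.replace("{", "{{").replace("}", "}}")


def fails_as_template(text: str) -> bool:
    try:
        render_as_template(text)
    except (KeyError, IndexError, ValueError):
        return True
    return False
```

Passing the prompt as an already-built message object, as the fix above does, avoids the template step entirely; escaping the braces is the alternative when the text must remain a template.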
+---
+## What just shipped (2026-05-12 — Cleanup PR)
+**Phase 1 removal + Phase 2 API rewiring:**
+- `src/api/v1/chat.py` — fully rewired to `ChatHandler.handle()`. Removed inline IntentRouter, retrieval, and ChatbotAgent calls. Redis cache, fast intent, load_history, save_messages stay in chat.py.
+- `src/api/v1/db_client.py` — `/ingest` now calls only `on_db_registered`. Phase 1 `db_pipeline_service.run()` block removed. Catalog build failure now raises HTTP 500.
+- `src/api/v1/data_catalog.py` — added `POST /api/v1/data-catalog/rebuild` endpoint.
+- `src/pipeline/triggers.py` — `on_catalog_rebuild_requested` implemented: iterates catalog sources, re-runs the appropriate trigger per source type, per-source errors logged.
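The rebuild behavior described in the last bullet (one trigger per source type, per-source errors logged without aborting) can be sketched as below. The dict-shaped sources, the `triggers` mapping, and the returned count are assumptions for illustration, not the signature of `on_catalog_rebuild_requested`.

```python
from __future__ import annotations

import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

Trigger = Callable[[str], Awaitable[None]]


async def rebuild_catalog(sources: list[dict], triggers: dict[str, Trigger]) -> int:
    """Re-run the matching trigger for each source; return how many succeeded."""
    rebuilt = 0
    for source in sources:
        trigger = triggers.get(source["source_type"])
        if trigger is None:
            logger.warning("no trigger for source_type=%s", source["source_type"])
            continue
        try:
            await trigger(source["source_id"])
            rebuilt += 1
        except Exception:
            # One broken source must not block the rest of the rebuild.
            logger.exception("rebuild failed for source %s", source["source_id"])
    return rebuilt
```

In this sketch the mapping would hold something like `{"schema": on_db_registered, "tabular": on_tabular_uploaded}`, with the real triggers adapted to take a single id.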
+**Dead modules deleted:**
+- `src/agents/chatbot.py` (Phase 1 LangChain chatbot)
+- `src/pipeline/orchestrator.py` (empty stub)
+- `src/query/base.py` (old duplicate of `executor/base.py`)
+- `src/api/v1/knowledge.py` (fake `/knowledge/rebuild` endpoint)
+- `src/config/agents/` (folder — prompts only used by deleted Phase 1 chatbot)
+**Renames:**
+- `src/agents/answer_agent.py` → `src/agents/chatbot.py`; `AnswerAgent` → `ChatbotAgent`; updated all import sites (`chat_handler.py`, `chat.py`)
+**Fixes + improvements:**
+- `src/agents/chat_handler.py` — `_get_document_retriever()` now returns `RetrievalRouter` (Redis-cached) instead of `DocumentRetriever` directly; retrieval-level cache restored.
+- `src/retrieval/router.py` — removed dead `db: AsyncSession` and `source_hint` parameters + `_UNSTRUCTURED_HINTS` constant from `retrieve()`. Cache key simplified.
+- `src/knowledge/processing_service.py` — removed dead `_build_csv_documents`, `_build_excel_documents`, `_profile_dataframe`, `_to_sheet_document` methods + `pandas` and `upload_parquet` imports.
+- `src/catalog/models.py` — added `top_values: list[Any] | None` to `ColumnStats`.
+- `src/catalog/introspect/tabular.py` — `_to_column` now populates `top_values` for columns with ≤10 distinct values; useful for query planner WHERE clause generation.
+- `main.py` — replaced deprecated `@app.on_event("startup")` with `lifespan` context manager; removed `knowledge_router`.
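The `top_values` rule (populate only for columns with ≤10 distinct values) can be sketched with the standard library. The real `_to_column` presumably works on a pandas column; the function name, null handling, and frequency ordering here are assumptions.

```python
from __future__ import annotations

from collections import Counter
from typing import Any, Iterable

MAX_DISTINCT = 10  # threshold taken from the note above


def top_values(values: Iterable[Any], max_distinct: int = MAX_DISTINCT) -> list[Any] | None:
    """Return distinct non-null values, most frequent first, for low-cardinality columns."""
    counts = Counter(v for v in values if v is not None)
    if not counts or len(counts) > max_distinct:
        return None  # empty or high-cardinality: nothing useful to show the planner
    return [value for value, _ in counts.most_common()]
```

Listing the actual category values is what lets a planner write `status = 'paid'` rather than guessing at the spelling stored in the data.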
+---
+## What just shipped (KM-557 — DB owner)
+After lead review of the catalog ingestion cost: dropped LLM enrichment,
+renamed the storage table, and exposed a lightweight index endpoint for
+the upcoming catalog refresher.
+**Files deleted**:
+- `src/catalog/enricher.py` — entire CatalogEnricher + EnrichmentResponse + apply_descriptions removed
+- `src/config/prompts/catalog_enricher.md` — dead prompt
+- `tests/catalog/test_enricher.py` — replaced by `test_render.py`
+**Files added**:
+- `src/catalog/render.py` — new home for `render_source` (the only piece of the old enricher still needed; consumed by `query/planner/prompt.py`)
+- `src/api/v1/data_catalog.py` — `GET /api/v1/data-catalog/{user_id}` returns `list[CatalogIndexEntry]`
+- `tests/catalog/test_render.py` — 5 tests (same coverage as the old render block)
+**Files modified**:
+- `src/db/postgres/models.py` — `__tablename__ = "data_catalog"` (was `"catalogs"`). Class name unchanged
+- `src/pipeline/structured_pipeline.py` — `StructuredPipeline(validator, store)` (was `(enricher, validator, store)`); pipeline is now `introspect → merge → validate → upsert`; `default_structured_pipeline()` no longer constructs an enricher
+- `src/pipeline/triggers.py` — docstrings updated; `on_catalog_rebuild_requested` docstring rewritten for the refresher use case
+- `src/query/planner/prompt.py` — import now `from ...catalog.render import render_source`
+- `src/catalog/introspect/{base,database,tabular}.py` — docstring scrubs (no behavior changes)
+- `src/models/api/catalog.py` — added `CatalogIndexEntry`; simplified `CatalogRebuildResponse` to `sources_rebuilt`
+- `main.py` — registered `data_catalog_router`
+- `src/security/README.md` — one stale wording fix
+**No migration**: the `data_catalog` table is created from scratch on first `init_db()`. The old `catalogs` table was never deployed against production data, so no rename SQL is needed.
+**Tests**: all 4 `test_structured_pipeline.py` tests reworked to construct `StructuredPipeline(validator=, store=)` without `enricher`. 5 `test_render.py` tests cover render_source standalone.
+**Lint**: `ruff check` clean on modified Phase 2 paths.
+**Open follow-ups left for the lead** (both since closed by the Cleanup PR, see above):
+- `on_catalog_rebuild_requested` body — the refresher will iterate the index endpoint and call this trigger per source
+- `api/v1/db_client.py` `/ingest` still doesn't call `on_db_registered` — same blocker as before, untouched by KM-557
+---
+## What just shipped (2026-05-11 — retrieval migration + bug fixes)
+**Files implemented / migrated**:
+- `src/retrieval/base.py` — `RetrievalResult` dataclass + `BaseRetriever` ABC (was in `src/rag/base.py`)
+- `src/retrieval/document.py` — full `DocumentRetriever` migrated from `src/rag/retrievers/document.py`; all retrieval methods (MMR/cosine/euclidean/inner_product/manhattan). Tabular file types filtered out from results.
+- `src/retrieval/router.py` — `RetrievalRouter` (Redis-cached, unstructured-only). `invalidate_cache(user_id)` clears all `retrieval:{user_id}:*` keys.
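The per-user cache with `retrieval:{user_id}:*` invalidation can be sketched with an in-memory dict standing in for Redis. The class name, the hashed-query suffix of the key, and the return value of `invalidate_cache` are assumptions; only the key prefix follows the note above.

```python
from __future__ import annotations

import asyncio
import hashlib
from typing import Awaitable, Callable


class CachedRetrieverSketch:
    """Dict-backed stand-in for a Redis-cached retrieval router."""

    def __init__(self, retrieve: Callable[[str, str], Awaitable[list[str]]]) -> None:
        self._retrieve = retrieve
        self._cache: dict[str, list[str]] = {}

    @staticmethod
    def _key(user_id: str, query: str) -> str:
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
        return f"retrieval:{user_id}:{digest}"

    async def retrieve(self, user_id: str, query: str) -> list[str]:
        key = self._key(user_id, query)
        if key not in self._cache:
            self._cache[key] = await self._retrieve(user_id, query)
        return self._cache[key]

    def invalidate_cache(self, user_id: str) -> int:
        """Drop every cached entry for one user; return how many were removed."""
        prefix = f"retrieval:{user_id}:"
        stale = [k for k in self._cache if k.startswith(prefix)]
        for key in stale:
            del self._cache[key]
        return len(stale)
```

The trailing colon in the prefix keeps user `u1` from also clearing entries that belong to `u10`.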
+**Deleted** (no longer used):
+- `src/rag/` — entire folder (base.py, retriever.py, router.py, retrievers/)
+- `src/tools/` — entire folder (search.py was the only real file; only called by deleted rag/ router)
+**Bug fixes**:
+- `src/pipeline/document_pipeline.py` — `retrieval_router.invalidate_cache(user_id)` called after `process()` and `delete()`. Redis failure is caught and logged (does not fail the document op).
+- `src/pipeline/document_pipeline.py` — CSV/XLSX now skips `knowledge_processor` (vector store). Tabular files go to catalog only; no duplicate embeddings.
+- `src/pipeline/triggers.py` — `on_document_uploaded` implemented (was `raise NotImplementedError`).
+- `src/agents/chat_handler.py` — `_normalize_chunks` now handles `RetrievalResult` objects. Previously they were silently dropped, causing empty context for unstructured queries through ChatHandler.
+**Import updates** (all changed from `src.rag.*` → `src.retrieval.*`):
+- `src/api/v1/chat.py`, `src/query/base.py`, `src/query/query_executor.py`, `src/query/executors/db_executor.py`, `src/query/executors/tabular.py`
+---
+## What shipped previously (PR2b/4/5/6/7-bundle — DB owner solo, teammate reviews)
+**Files implemented**:
+- `src/agents/orchestration.py` — `OrchestratorAgent.classify(message, history) → IntentRouterDecision`. Pydantic model for structured output. History-aware query rewriting. Phase 1 filename + class name preserved; body fully rewritten for Phase 2.
+- `src/agents/answer_agent.py` — `AnswerAgent.astream(...)` streams answer tokens; accepts `QueryResult` and/or `list[DocumentChunk]`. Since renamed to `chatbot.py` / `ChatbotAgent` in the Cleanup PR.
+- `src/agents/chat_handler.py` — `ChatHandler.handle(message, user_id, history)` returns `AsyncIterator[dict]` of `intent` / `chunk` / `done` / `error` SSE events. All deps injectable; lazy default builders.
+- `src/query/planner/prompt.py` — `render_catalog(catalog)` + `build_planner_prompt(question, catalog, previous_error)`. Reuses `catalog.enricher.render_source` for consistency across LLM call sites.
+- `src/query/planner/service.py` — `QueryPlannerService.plan(question, catalog, previous_error)` Azure OpenAI structured output → `QueryIR`.
+- `src/query/executor/dispatcher.py` — `ExecutorDispatcher.pick(ir) → BaseExecutor` by `source.source_type`. Lazy executor imports + per-source-type cache.
+- `src/query/service.py` — `QueryService.run(user_id, question, catalog) → QueryResult`. Plan→validate→retry-on-failure (max 3)→dispatch→execute. Catches NotImplementedError from TabularExecutor placeholder gracefully.
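The dispatcher's lazy construction and per-source-type cache can be sketched as follows. Here `pick` takes the `source_type` string directly and the factories are injected, whereas the real `ExecutorDispatcher.pick(ir)` resolves the source from the IR; the class and parameter names are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable


class ExecutorDispatcherSketch:
    """Build each executor on first use and reuse it afterwards."""

    def __init__(self, factories: dict[str, Callable[[], Any]]) -> None:
        # Factories defer heavy imports and construction until an executor is needed.
        self._factories = factories
        self._cache: dict[str, Any] = {}

    def pick(self, source_type: str) -> Any:
        if source_type not in self._cache:
            factory = self._factories.get(source_type)
            if factory is None:
                raise ValueError(f"no executor for source_type={source_type!r}")
            self._cache[source_type] = factory()
        return self._cache[source_type]
```

Deferring construction to the first `pick` call is what keeps importing the module free of side effects in tests that never touch a real executor.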
+**Prompts written** (filled in placeholders):
+- `src/config/prompts/intent_router.md`
+- `src/config/prompts/query_planner.md`
+- `src/config/prompts/chatbot_system.md`
+- `src/config/prompts/guardrails.md`
+**Tests added** (46 new — total now 146 + 2 skipped):
+- `tests/agents/test_intent_router.py` (4)
+- `tests/agents/test_answer_agent.py` (12)
+- `tests/agents/test_chat_handler.py` (6)
+- `tests/query/planner/test_prompt.py` (7)
+- `tests/query/planner/test_service.py` (3)
+- `tests/query/executor/test_dispatcher.py` (5)
+- `tests/query/test_service.py` (8)
+- `tests/query/planner/test_golden_questions.py` (3 — skipped by default; eval harness scaffold)
+**Lint**: `ruff check` clean on all Phase 2 paths. Phase 1 files have pre-existing E501/S608 issues — out of scope for this PR.
+**Placeholders / blockers for teammate** (status as of DB owner's commit, before merge):
+- `src/query/executor/tabular.py` (TAB) — DB owner's note: "still raises NotImplementedError". **Post-merge**: TAB shipped this in PR3-TAB; dispatcher now routes to the real `TabularExecutor`. The `NotImplementedError` catch in `QueryService` stays as a safety net.
+- `src/retrieval/document.py` — **implemented** (2026-05-11). Full `DocumentRetriever` migrated from `src/rag/retrievers/document.py`; supports MMR/cosine/euclidean/manhattan/inner_product. `_normalize_chunks` in `chat_handler.py` now handles `RetrievalResult` → `DocumentChunk` conversion correctly.
+- `src/api/v1/chat.py` (Phase 1) — NOT touched. Cleanup PR rewires the SSE endpoint to call `ChatHandler.handle(...)`.
+- `src/api/v1/db_client.py` (Phase 1) — NOT touched. Cleanup PR rewires `/database-clients/{id}/ingest` to call `pipeline.triggers.on_db_registered`.
+---
+## What shipped previously (PR3-TAB — TAB owner)
+**Files implemented**:
+- `src/query/compiler/pandas.py` — `PandasCompiler` + `CompiledPandas(apply, output_columns)` dataclass. Pure helper functions (easier to test in isolation): `_apply_filters` (all 12 ops, `_like_to_regex` for LIKE), `_apply_select` (column pick + rename), `_apply_agg` (scalar + group_by via `pd.concat` of Series → `reset_index`), `_apply_orderby` (alias-aware via `_resolve_order_col`). Closure captures all IR fields explicitly so `apply(df)` is self-contained.
+- `src/query/executor/tabular.py` — `TabularExecutor` with injectable `fetch_blob` (same testability pattern as `TabularIntrospector`). Resolves Parquet blob path from `az_blob://{uid}/{did}` + table: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`. Runs compile → download → `asyncio.to_thread(_load_and_apply)` → 10k hard cap. Never raises; errors populate `QueryResult.error`. Uses `compiled.output_columns` for column labels (safe on empty DataFrame).
+**Tests added** (55 new — total suite was 86 all passing at PR3-TAB time):
+- `tests/unit/query/compiler/test_pandas_compiler.py` — 43 tests across all 12 filter ops (including `is_null`, `not_in`, `like`, `between`), all 6 agg fns, group_by, order_by asc/desc, limit-after-order, alias round-trip, empty DataFrame, error paths.
+- `tests/unit/query/executor/test_tabular_executor.py` — 12 tests: `_resolve_blob_name` (single/multi-table, bad prefix), happy-path `QueryResult` shape (columns, rows, backend, truncated, source_id), wrong source_type → error, blob fetch failure → error, unknown source → error.
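The blob-name resolution exercised by these tests can be sketched from the two path templates given above. The function name, its arguments, and the use of a table count to decide between the single-table and multi-table form are assumptions; the real `_resolve_blob_name` may decide differently.

```python
from __future__ import annotations

PREFIX = "az_blob://"


def resolve_blob_name(location_ref: str, table_name: str, table_count: int) -> str:
    """Map an az_blob://{uid}/{did} reference plus a table to a Parquet blob path."""
    if not location_ref.startswith(PREFIX):
        raise ValueError(f"expected {PREFIX} prefix, got {location_ref!r}")
    user_id, sep, document_id = location_ref[len(PREFIX):].partition("/")
    if not sep or not user_id or not document_id:
        raise ValueError(f"malformed location_ref: {location_ref!r}")
    if table_count == 1:
        # CSV / Parquet uploads: one table, no sheet suffix
        return f"{user_id}/{document_id}.parquet"
    # XLSX uploads: one Parquet blob per sheet
    return f"{user_id}/{document_id}__{table_name}.parquet"
```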
+**Lint**: `ruff check` clean on both files.
+---
+## What shipped previously (PR1-tab — TAB owner)
+**Files implemented**:
+- `src/catalog/introspect/tabular.py` — `TabularIntrospector` reads original blob (CSV/XLSX/Parquet), profiles each column (dtype, stats, sample values), runs PIIDetector. For XLSX: one `Table` per sheet (`Table.name = sheet_name`); for CSV/Parquet: one `Table` (`Table.name = filename stem`). `fetch_doc`/`fetch_blob` are constructor-injectable for unit tests — no `Settings` or DB required at import time.
+- `src/pipeline/triggers.py` — `on_tabular_uploaded` wired (mirrors `on_db_registered` pattern).
+**Tests added** (31 new):
+- `tests/unit/catalog/test_introspect_tabular.py` — CSV / XLSX / Parquet shapes, per-column stats, nullable detection, PII name + value matching, sample capping, all error paths. Pure Python, no network I/O.
+**Executor contract note**: the introspector downloads the *original* blob for schema reading. The tabular executor (PR3-TAB) downloads *Parquet* blobs for query execution. For CSV/Parquet sources (single table), the executor must call `parquet_blob_name(uid, did, sheet_name=None)`; for XLSX (multi-table), `parquet_blob_name(uid, did, table.name)`.
+---
+## What shipped previously (PR3-DB — DB owner)
+**Files implemented**:
+- `src/query/compiler/sql.py` — `SqlCompiler` for Postgres dialect; `CompiledSql(sql, params)` dataclass with `params: dict[str, Any]` (changed from `list`); supports all 12 whitelisted filter ops, all 6 aggs, alias-aware order_by; `_qident` escapes embedded double-quotes
+- `src/query/executor/db.py` — `DbExecutor` with sqlglot SELECT-only guard, Postgres session-level read-only + 30s `statement_timeout`, `asyncio.wait_for` backstop, 10k row hard cap; rejects non-`schema` source_type and `dbclient://` URI mismatch; never raises (populates `QueryResult.error`)
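The executor's outer shell (blocking call moved off the event loop, `asyncio.wait_for` backstop, row cap, errors captured in the result) can be sketched as below. The function name and the `QueryResult` fields are simplified assumptions, and the sqlglot guard and Postgres session settings are deliberately left out.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable

ROW_CAP = 10_000  # hard cap noted above


@dataclass
class QueryResult:
    rows: list[Any] = field(default_factory=list)
    truncated: bool = False
    error: str | None = None


async def run_with_backstop(
    fetch_rows: Callable[[], list[Any]],
    timeout_s: float = 30.0,
) -> QueryResult:
    """Run a blocking fetch in a worker thread; never raise to the caller."""
    try:
        rows = await asyncio.wait_for(asyncio.to_thread(fetch_rows), timeout=timeout_s)
    except asyncio.TimeoutError:
        return QueryResult(error=f"query exceeded {timeout_s:g}s")
    except Exception as exc:
        return QueryResult(error=str(exc))
    return QueryResult(rows=rows[:ROW_CAP], truncated=len(rows) > ROW_CAP)
```

Note that `asyncio.wait_for` only stops waiting; it cannot interrupt the worker thread, which is why a database-side `statement_timeout` is still needed to actually end a runaway query.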
+**Files extended**:
+- `src/query/compiler/pandas.py` — fixed pre-existing UP035 (Callable import)
+- `pyproject.toml` — added `S608` to `tests/**` ruff ignore (false positive: tests assert literal SQL strings)
+**Tests added** (36 new, all passing — total now 100):
+- `tests/query/compiler/test_sql.py` — every filter op, every agg, count(*), count_distinct, order_by alias vs column, multi-filter AND, identifier quoting escape, error paths
+**Lint**: `ruff check` clean on Phase 2 paths.
+**Hand-off note for teammate**: `CompiledSql.params` is now `dict[str, Any]` not `list`. The pandas compiler will follow the same convention (or document its own) — coordinate when PR3-TAB lands.
+---
+## What shipped previously (PR2a — DB owner)
+**Files implemented**:
+- `src/catalog/enricher.py` — Azure OpenAI GPT-4o + structured output (`EnrichmentResponse`), `render_source` (reusable by planner prompt later), `apply_descriptions` merger, injectable `structured_chain` for tests
+- `src/pipeline/structured_pipeline.py` — `StructuredPipeline` orchestrator + `default_structured_pipeline()` factory with lazy production-dep imports
+- `src/pipeline/triggers.py` — `on_db_registered` wired; tabular/document/rebuild stubs preserved with implementation notes
+**Files extended**:
+- `src/catalog/models.py` — added `ForeignKey` model, `Table.foreign_keys: list[ForeignKey] = []`
+- `src/catalog/introspect/database.py` — `_extract_foreign_keys` populates `Table.foreign_keys` from extractor data
+- `src/config/prompts/catalog_enricher.md` — full system prompt with style rules and one few-shot example
+**Tests added** (14 new, all passing — total now 64):
+- `tests/catalog/test_enricher.py` — render / apply / end-to-end with fake chain (10 tests)
+- `tests/pipeline/test_structured_pipeline.py` — orchestration with stub deps (4 tests)
+**Lint**: `ruff check` clean on all Phase 2 paths. Phase 1 files (`pipeline/db_pipeline/`, `pipeline/document_pipeline/`) have pre-existing ruff issues — out of scope for this PR.
+---
+## What shipped previously (PR1 — DB owner's first chunk)
+**Files implemented** (was `NotImplementedError`):
+- `src/catalog/pii_detector.py`, `src/catalog/validator.py`, `src/catalog/store.py`, `src/catalog/reader.py`
+- `src/catalog/introspect/database.py` (FK extraction added in PR2a)
+- `src/query/ir/validator.py`
+**Files extended**:
+- `src/query/ir/operators.py` — `TYPE_COMPATIBILITY` matrix
+- `src/catalog/models.py` — `location_ref` URI-scheme docstring
+- `src/db/postgres/models.py` — `Catalog` SQLAlchemy table; `init_db.py` imports it
+**Tests**: 50 unit tests + 1 integration (gated on `RUN_INTEGRATION_TESTS=1`).
+**Reused Phase 1 utilities** (cleanup deferred):
+- `src/database_client/database_client_service.py:get`
+- `src/utils/db_credential_encryption.py:decrypt_credentials_dict`
+- `src/pipeline/db_pipeline/db_pipeline_service.py:engine_scope`
+- `src/pipeline/db_pipeline/extractor.py:get_schema/profile_column/get_row_count`
+---
+## Open contract items (not yet locked)
+- **Joins in IR** — currently single-table only (ARCHITECTURE.md §7); DB owner accepted the constraint for v1, will revisit in PR3 if it's blocking real queries
+- **`updated_at` on Source vs `generated_at` on Catalog** — Pydantic models have both; introspector sets per-Source; CatalogStore preserves both
+- **Catalog refresh trigger** (open question §3) — default policy is rebuild-on-upload-or-connect; auto-refresh deferred
+- **Unstructured catalog entries** (open question §2) — currently empty filter for `source_hint="unstructured"`; revisit when adding doc descriptions
+- **PII handling for `sample_values`** (open question §5) — currently nulls them out (skip); mask/synthesize deferred
+- **Dialect priority for SQL compiler** — PR3 will land Postgres first, MySQL second; BigQuery/Snowflake/SQL Server later
+---
+## How to update this file
+When a PR lands:
+1. Flip status from `[ ]` or `[~]` to `[x]`
+2. Add a short note (file paths, scope cuts, surprises)
+3. Bump "Last updated" at the top
+4. If a new contract decision lands, move it from "Open contract items" to the relevant inline note
+When opening a PR:
+1. Flip status to `[~]` and add yourself as the active owner in the PR row
+2. Don't promise items in the PR description that aren't in the table

REPO_CONTEXT.md ADDED

	@@ -0,0 +1,474 @@

+# Repo Context — Agentic Service Data Eyond Catalog
+Orientation file for future Claude Code sessions. Cross-reference `ARCHITECTURE.md` for the full design rationale and decision log.
+---
+## TL;DR
+FastAPI multi-agent backend for data analysis. Users upload documents and register databases / tabular files; they ask natural-language questions and get answers grounded in their data, streamed via SSE.
+The architecture has two paths:
+- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (PGVector).
+- **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a **JSON IR** of intent; a deterministic Python compiler turns the IR into SQL or pandas; the executor runs it.
+The LLM produces *intent*, not query syntax. Deterministic code does the rest.
+The Phase 2 end-to-end flow is **wired and runnable** as of 2026-05-12. See *Implementation status* below for the per-file matrix. `PROGRESS.md` is the authoritative line-by-line tracker; this file is the orientation.
+---
+## Stack
+- Python 3.12, FastAPI 0.115, uvicorn, sse-starlette
+- Async SQLAlchemy 2.0 + asyncpg (Postgres), psycopg3 (PGVector multi-statement workaround)
+- LangChain 0.3 + langchain-postgres (PGVector) + langchain-openai (Azure OpenAI GPT-4o + embeddings)
+- LangGraph 0.2 + langgraph-checkpoint-postgres
+- Redis 5 (response + retrieval cache)
+- Azure Blob Storage (uploads + Parquet)
+- pandas, pyarrow, polars-ready (deferred), sqlglot, pydantic v2, structlog, slowapi, langfuse
+- presidio-analyzer + spaCy `en_core_web_lg` (PII), pytesseract + pdf2image (PDF OCR)
+- DB connectors: psycopg2, pymysql, pymssql, sqlalchemy-bigquery, snowflake-sqlalchemy
+Run: `uv run --no-sync uvicorn main:app --host 0.0.0.0 --port 7860`. On Windows use `uv run --no-sync python run.py` (sets `WindowsSelectorEventLoopPolicy` for psycopg3 async).
+---
+## Top-level layout
+```
+main.py                — FastAPI app + middleware + router wiring + init_db() on startup
+run.py                 — Windows-safe local entry point
+ARCHITECTURE.md        — design intent (source of truth for shape + invariants)
+README.md
+Dockerfile             — python:3.12-slim, installs spaCy en_core_web_lg, tesseract, poppler
+pyproject.toml / uv.lock
+scripts/               — backfill scripts (build_initial_catalogs, enrich_all_sources)
+src/                   — all application code
+```
+---
+## src/ map
+### Core data shapes (only files with real content)
+| Path | Role |
+|---|---|
+| `catalog/models.py` | Pydantic: `Catalog → Source[] → Table[] → Column[]` |
+| `query/ir/models.py` | `QueryIR` (select / filters / group_by / order_by / limit) |
+| `query/ir/operators.py` | `ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP=10000` |
+| `security/pii_patterns.py` | name patterns + email/phone regex for PII detection |
+### Catalog — identity layer for structured sources (Cs ∪ Ct)
+| Path | Role |
+|---|---|
+| `catalog/introspect/base.py` | `BaseIntrospector.introspect(location_ref) -> Source` |
+| `catalog/introspect/database.py` | `information_schema` + ~100 row sample → draft Source |
+| `catalog/introspect/tabular.py` | Parquet/CSV/XLSX header reader + sample (one Table per sheet for XLSX) |
+| `catalog/render.py` | renders a `Source` as the canonical text block consumed by the planner (KM-557; LLM enrichment removed — planner reads stats + samples directly) |
+| `catalog/validator.py` | invariants beyond Pydantic shape (unique IDs, FK refs) |
+| `catalog/store.py` | persist as Postgres `jsonb` row keyed by user_id (`get/upsert/delete`) — table `data_catalog` |
+| `catalog/reader.py` | load + filter catalog by source_hint (returns full catalog for ≤50 tables) |
+| `catalog/pii_detector.py` | flag PII columns at ingestion → suppresses `sample_values` |
+### Query — catalog-driven structured path
+| Path | Role |
+|---|---|
+| `query/service.py` | `QueryService.run(user_id, question, catalog) -> QueryResult` (top-level) |
+| `query/planner/service.py` | LLM call: question + catalog → QueryIR (structured output) |
+| `query/planner/prompt.py` | renders catalog into the planner prompt |
+| `query/ir/validator.py` | catalog-aware IR validation: column_ids exist, ops whitelisted, value_type matches data_type, limit ≤ cap |
+| `query/compiler/base.py` | `BaseCompiler.compile(ir) -> object` |
+| `query/compiler/sql.py` | IR → `(sql, params)`; identifiers from catalog, values parameterized |
+| `query/compiler/pandas.py` | IR → callable that runs against a DataFrame |
+| `query/executor/base.py` | `BaseExecutor.run(ir) -> QueryResult` (uniform across backends) |
+| `query/executor/db.py` | runs compiled SQL via asyncpg/pymysql in read-only txn (sqlglot second-line defence) |
+| `query/executor/tabular.py` | runs the compiled pandas chain on a Parquet file (eager pandas in the initial scope; pyarrow pushdown and polars lazy scan, chosen by file size, are planned) |
+| `query/executor/dispatcher.py` | picks DB vs Tabular executor based on `source.source_type` of the IR's source |
+### Retrieval — unstructured path (Cu)
+| Path | Role |
+|---|---|
+| `retrieval/document.py` | `DocumentRetriever` over PGVector chunks |
+| `retrieval/router.py` | dispatches the `unstructured` route (the `chat` and `structured` routes do not pass through here) |
+### Agents — the three LLM call sites
+| Path | Role |
+|---|---|
+| `agents/orchestration.py` | `OrchestratorAgent` — classifies message → `needs_search`, `source_hint ∈ {chat, unstructured, structured}`, `rewritten_query`. Filename + class name kept from Phase 1; body replaced with Phase 2 logic. Output model is `IntentRouterDecision` |
+| `agents/chatbot.py` | `ChatbotAgent` — final answer formation (receives Cu chunks or QueryResult); SSE-streamed via `astream` |
+| `agents/chat_handler.py` | `ChatHandler` — top-level orchestrator; routes to chat / unstructured / structured and yields SSE-style `intent`/`chunk`/`done`/`error` events |
+(`QueryPlanner` is the third LLM call site, under `query/planner/`. The
+fourth — `CatalogEnricher` — was removed in KM-557; ingestion no longer
+makes any LLM calls.)
+### Pipelines — ingestion coordinators
+| Path | Role |
+|---|---|
+| `pipeline/structured_pipeline.py` | DB / tabular: introspect → merge → validate → store (no enrich step since KM-557) |
+| `pipeline/document_pipeline.py` | unstructured: extract → chunk → embed → PGVector. CSV/XLSX skip vector store (catalog only). Invalidates retrieval cache on process/delete. |
+| `pipeline/triggers.py` | event entry points called by API routes: `on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested` |
+(`pipeline/orchestrator.py` was deleted in the Cleanup PR — it was a redundant stub; `StructuredPipeline` already takes the introspector at `run()` time.)
+### Security — cross-cutting
+| Path | Role |
+|---|---|
+| `security/auth.py` | bcrypt password hash/verify, JWT encode/decode, get_user |
+| `security/credentials.py` | Fernet encrypt/decrypt for stored DB credentials |
+| `security/pii_patterns.py` | (already listed) |
+### API + infra + config
+| Path | Role |
+|---|---|
+| `api/v1/*.py` | FastAPI routers — thin endpoints delegating to `pipeline/triggers` and `query/service` |
+| `models/api/{catalog,chat,document}.py` | request/response Pydantic models |
+| `db/postgres/connection.py` | two async engines: `engine` (app) and `_pgvector_engine` (PGVector) |
+| `db/postgres/init_db.py` | startup: creates `vector` extension, all tables, HNSW + GIN indexes |
+| `db/postgres/models.py` | SQLAlchemy app tables (users, rooms, chat messages, …) |
+| `db/postgres/vector_store.py` | shared PGVector instance (collection `document_embeddings`) |
+| `db/redis/connection.py` | async Redis client |
+| `storage/az_blob/az_blob.py` | Azure Blob async wrapper (uploads + Parquet) |
+| `middlewares/{cors,logging,rate_limit}.py` | CORS allow-all (POC), structlog JSON, slowapi |
+| `observability/langfuse/langfuse.py` | trace helper |
+| `config/settings.py` | pydantic-settings; `.env` uses double-underscore aliases |
+| `config/env_constant.py` | env file path constant |
+| `config/prompts/*.md` | prompt templates: `intent_router`, `query_planner`, `chatbot_system`, `guardrails` (KM-557 removed `catalog_enricher`) |
+---
+## Core architectural decisions
+1. **Catalog as primary context, not retrieval.** For ≤50 tables (typical), the entire catalog is rendered into the planner prompt verbatim (~3–5k tokens). No vector search, no BM25, no top-k for structured data. Catalog-level retrieval (BM25 + table-level vectors with RRF) is the *deferred* upgrade for users with hundreds of tables.
+2. **JSON IR over raw SQL.** The planner LLM emits a Pydantic-validated intent, never a SQL string. The compiler is deterministic Python. Benefits: validatable before execution, dialect-portable (one IR → SQL of any dialect / pandas / polars), cheaper tokens, trivially testable without an LLM, and the LLM literally cannot emit invalid SQL syntax.
+3. **Deterministic compiler, not LLM SQL writer.** All actual query construction happens in pure code. Compiler bugs are reproducible and fixable. Same IR → same query.
+4. **Pipeline stage isolation.** Each stage (`IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`) is its own module with typed input and typed output. No god classes.
+5. **Minimal LLM surface.** Only three LLM call sites in the system (KM-557 dropped `CatalogEnricher` — ingestion is now LLM-free; the planner reads stats + sample rows + column names directly):
+   - `IntentRouter` — once per user message
+   - `QueryPlanner` — once per structured query
+   - `ChatbotAgent` — once per answer (formatting)
+6. **Three-way routing**: `chat` / `unstructured` / `structured`. The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees Cs ∪ Ct in one prompt. **DB vs tabular is not a routing concern** — it's a per-source attribute (`source_type`) that only matters at execution time.
+7. **Stable IDs.** `source_id`, `table_id`, `column_id` are stable internal references. Renaming a column in the source DB does not invalidate cached IRs.
+8. **PII suppression at the boundary.** Columns flagged with `pii_flag=true` have `sample_values: null` — real PII never enters LLM prompts. Auto-detected at ingestion via name patterns + value regex (`security/pii_patterns.py`). When in doubt, flag — false positives cost nothing; false negatives leak data.
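Decision 8 can be illustrated with a minimal sketch. The patterns below are hypothetical stand-ins, not the contents of `security/pii_patterns.py`, and the column is modeled as a plain dict rather than the Pydantic `Column`. It only shows the rule described above: flag on a name match or a value match, then null out `sample_values`.

```python
import re

# Hypothetical patterns for illustration; the real ones live in security/pii_patterns.py.
PII_NAME_PATTERN = re.compile(r"(email|phone|ssn|passport|password|address)", re.IGNORECASE)
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_VALUE = re.compile(r"^\+?\d[\d\s().-]{6,}\d$")


def looks_like_pii(column_name, sample_values):
    """Flag a column if its name or any sampled value matches a PII pattern."""
    if PII_NAME_PATTERN.search(column_name):
        return True
    for value in sample_values:
        text = str(value).strip()
        if EMAIL_VALUE.match(text) or PHONE_VALUE.match(text):
            return True
    return False


def suppress_samples(column):
    """Return a copy of the column dict with sample_values nulled when flagged."""
    flagged = looks_like_pii(column["name"], column.get("sample_values") or [])
    cleaned = dict(column, pii_flag=flagged)
    if flagged:
        cleaned["sample_values"] = None  # real values never reach an LLM prompt
    return cleaned
```

The value regexes are deliberately loose (a long numeric ID can match the phone pattern), which is in line with the "when in doubt, flag" stance.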
+---
+## End-to-end flows
+### Ingestion (when user uploads a file or connects a DB)
+```
+source upload / DB connect
+    │
+    ├── unstructured (pdf/docx/txt)
+    │     → DocumentPipeline: extract → chunk → embed → PGVector
+    │
+    └── structured (DB schema or tabular file)
+          → introspect (information_schema or file headers + sample rows)
+          → CatalogValidator (Pydantic + unique-IDs + FK refs)
+          → CatalogStore.upsert (per-user jsonb row in `data_catalog`)
+```
+### Query (per user message)
+```
+user message
+    │
+    → Redis cache check (24h TTL)  ── miss ─→ continue
+    │
+    → IntentRouter LLM   →  needs_search? source_hint?
+    │
+    ├── chat          → ChatbotAgent → SSE stream
+    ├── unstructured  → DocumentRetriever (Cu) → ChatbotAgent → SSE stream
+    └── structured    →
+          CatalogReader.read(user_id, "structured")          # full Cs ∪ Ct
+              ↓
+          QueryPlanner LLM(question, catalog) → QueryIR
+              ↓
+          IRValidator.validate(ir, catalog)
+              (source_id ∈ catalog, table_id ∈ source, column_ids ∈ table,
+               ops/aggs whitelisted, value_type matches data_type, limit ≤ 10000)
+              fail → re-prompt planner with error context (max 3 retries)
+              ↓
+          ExecutorDispatcher.pick(ir)              # by source.source_type
+              ├─ DbExecutor       → SqlCompiler → sqlglot guard → asyncpg/pymysql
+              │                     (read-only txn, 30s timeout)
+              └─ TabularExecutor  → PandasCompiler → eager pandas (≤100 MB)
+                                    or pyarrow pushdown (100 MB–1 GB)
+                                    or polars lazy scan (>1 GB)
+              ↓
+          QueryResult
+              ↓
+          ChatbotAgent → SSE stream
+```
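The plan, validate, re-prompt step in the flow above can be sketched as a small loop. This is a synchronous sketch, not the `QueryService` code: `plan` and `validate` are hypothetical callables, and it assumes "max 3 retries" means one initial attempt plus up to three re-prompts.

```python
MAX_RETRIES = 3  # assumption: one initial attempt plus up to three retries


def plan_with_retries(question, catalog, plan, validate):
    """Call the planner, re-prompting with the validator's error on failure.

    `plan(question, catalog, previous_error)` returns an IR dict;
    `validate(ir, catalog)` returns an error string, or None when the IR is valid.
    """
    previous_error = None
    for _ in range(1 + MAX_RETRIES):
        ir = plan(question, catalog, previous_error)
        previous_error = validate(ir, catalog)
        if previous_error is None:
            return ir
    raise ValueError(f"planner failed validation: {previous_error}")


def fake_plan(question, catalog, previous_error):
    # Stand-in for the LLM: the first attempt misspells a column, the retry fixes it.
    column = "amount" if previous_error else "amont"
    return {"table_id": "orders", "select": [{"kind": "column", "column_id": column}]}


def fake_validate(ir, catalog):
    # Stand-in for the IR validator: every selected column_id must exist in the table.
    known = catalog[ir["table_id"]]
    for item in ir["select"]:
        if item["column_id"] not in known:
            return f"unknown column_id: {item['column_id']}"
    return None


ir = plan_with_retries("total sales?", {"orders": {"amount"}}, fake_plan, fake_validate)
```

Feeding the validator's message back as `previous_error` is why the ownership table asks both owners to agree on exact error wording.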
+---
+## Catalog schema (per-user `jsonb` row)
+```
+Catalog
+├── user_id, schema_version, generated_at
+└── sources[]
+    └── Source { source_id, source_type, name, description, location_ref, updated_at }
+        └── tables[]
+            └── Table { table_id, name, description, row_count, foreign_keys[] }
+                ├── columns[]
+                │   └── Column { column_id, name, data_type, description,
+                │                 nullable, pii_flag, sample_values[]|null, stats|null }
+                └── foreign_keys[]
+                    └── ForeignKey { column_id, target_table_id, target_column_id }
+```
+`source_type ∈ {schema, tabular, unstructured}`.
+`data_type ∈ {int, decimal, string, datetime, date, bool, json}`.
+`ForeignKey` references are within the SAME `Source` only; cross-source FKs are not modeled.
+Deferred Column fields (add when justified): `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`.
+---
+## JSON IR schema
+```jsonc
+{
+  "ir_version": "1.0",
+  "source_id":  "...",
+  "table_id":   "...",
+  "select": [
+    {"kind": "column", "column_id": "...", "alias": "..."},
+    {"kind": "agg",    "fn": "count|count_distinct|sum|avg|min|max",
+                       "column_id": "...?", "alias": "..."}
+  ],
+  "filters": [
+    {"column_id": "...",
+     "op":    "= | != | < | <= | > | >= | in | not_in | is_null | is_not_null | like | between",
+     "value": ...,
+     "value_type": "int|decimal|string|datetime|date|bool"}
+  ],
+  "group_by": ["column_id", ...],
+  "order_by": [{"column_id": "...", "dir": "asc|desc"}],
+  "limit": 100
+}
+```
+Single-table only in v1. `having`, `offset`, boolean filter trees, `distinct`, joins, window functions are deferred until user demand proves the limitation.
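To make the IR-to-SQL idea concrete, here is a minimal sketch of a compiler for a small subset of the IR (plain column selects and comparison filters). It is not `SqlCompiler`: the function names, the `:p0` placeholder style, and the default limit of 100 are assumptions for illustration. It does follow the two rules stated in this file: identifiers are quoted, and values are bound as named params.

```python
LIMIT_HARD_CAP = 10_000
COMPARISON_OPS = {"=", "!=", "<", "<=", ">", ">="}


def quote_ident(name):
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(ir, table_name, column_names):
    """Compile a single-table IR dict into (sql, params).

    `column_names` maps column_id -> physical column name from the catalog.
    """
    select_parts = []
    for item in ir["select"]:
        if item["kind"] != "column":
            raise ValueError("this sketch handles kind='column' only")
        select_parts.append(quote_ident(column_names[item["column_id"]]))

    params = {}
    where_parts = []
    for index, flt in enumerate(ir.get("filters", [])):
        if flt["op"] not in COMPARISON_OPS:
            raise ValueError(f"op not covered by this sketch: {flt['op']}")
        key = f"p{index}"
        params[key] = flt["value"]  # values are bound, never inlined
        column = quote_ident(column_names[flt["column_id"]])
        where_parts.append(f"{column} {flt['op']} :{key}")

    sql = f"SELECT {', '.join(select_parts)} FROM {quote_ident(table_name)}"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    limit = min(int(ir.get("limit", 100)), LIMIT_HARD_CAP)
    sql += f" LIMIT {limit}"
    return sql, params
```

Because the function is pure, a golden test is just an IR fixture plus the expected `(sql, params)` pair, with no LLM or database involved.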
+---
+## Implementation status
+**As of 2026-05-12 — Phase 2 end-to-end flow is wired.** `PROGRESS.md` has the per-PR line-item table; this section is the high-level snapshot. Stub files (`raise NotImplementedError`) are now the exception, not the rule.
+| Area | Status | Notes |
+|---|---|---|
+| Catalog Pydantic models | ✅ | `catalog/models.py` — incl. `ForeignKey`, `ColumnStats.top_values` |
+| JSON IR Pydantic models | ✅ | `query/ir/models.py` + `operators.py` (TYPE_COMPATIBILITY filled) |
+| Catalog ingestion — DB | ✅ | introspect → validate → upsert. `on_db_registered` wired; `/api/v1/db-clients/{id}/ingest` calls it |
+| Catalog ingestion — tabular | ✅ | CSV/XLSX/Parquet; `on_tabular_uploaded` wired into `/api/v1/document/process`. XLSX → one Table per sheet. CSV/XLSX skip vector store |
+| Catalog ingestion — unstructured | ✅ | `on_document_uploaded` implemented; full DocumentPipeline (extract → chunk → embed → PGVector) |
+| Catalog store / reader / validator / PII detector | ✅ | `data_catalog` jsonb table (renamed from `catalogs` in KM-557) |
+| LLM enrichment | ❌ removed (KM-557) | Cost cut — planner reads `column.stats` + `sample_values` + `top_values` + `column.name` directly. `catalog/render.py` keeps the source-rendering helper |
+| `IntentRouter` (lives as `OrchestratorAgent` in `agents/orchestration.py`) | ✅ | 3-way `source_hint`, history-aware query rewriting. Filename + class name kept from Phase 1; Phase 2 body |
+| `CatalogReader` | ✅ | Loads full catalog; filters by `source_hint` |
+| `QueryPlanner` LLM call | ✅ | Azure OpenAI structured output → `QueryIR`; supports retry with `previous_error` |
+| IR validator | ✅ | Catalog-aware; full rule set; descriptive errors |
+| SQL compiler (Postgres) | ✅ | All 12 filter ops, all 6 aggs, alias-aware order_by, parameterized values, quoted identifiers |
+| DbExecutor | ✅ | sqlglot SELECT-only guard, RO txn, `statement_timeout=30000`, 10k row cap, never raises |
+| Pandas compiler | ✅ | Same op coverage as SQL; pure module-level helpers |
+| TabularExecutor | ✅ | Parquet blob path resolution, `asyncio.to_thread`, 10k cap, never raises |
+| ExecutorDispatcher | ✅ | Routes by `source.source_type`; lazy imports + cache |
+| QueryService | ✅ | plan → validate → retry-on-fail (max 3) → dispatch → execute → `QueryResult` |
+| `ChatbotAgent` + prompt + guardrails | ✅ | Renamed from `AnswerAgent` in Cleanup PR. Guardrails appended to `chatbot_system.md` |
+| `ChatHandler` (top-level chat orchestrator) | ✅ | SSE events: `intent` / `chunk` / `done` / `error` |
+| `DocumentRetriever` + `RetrievalRouter` (Redis-cached) | ✅ | Migrated from `src/rag/` (now deleted); MMR/cosine/euclidean/manhattan/inner_product |
+| `/api/v1/chat/stream` | ✅ | Rewired to `ChatHandler`; Redis cache + fast intent + history + message persistence remain in chat.py |
+| `/api/v1/db-clients/{id}/ingest` | ✅ | Calls only `on_db_registered`; Phase 1 dual-write removed |
+| `/api/v1/document/{upload,process,delete}` | ✅ | `/process` triggers `on_tabular_uploaded` for CSV/XLSX |
+| `GET /api/v1/data-catalog/{user_id}` | ✅ | Index endpoint (KM-557) |
+| `POST /api/v1/data-catalog/rebuild` | ✅ | Iterates sources, re-runs per-source trigger |
+| Credential encryption | ⚠️ stub | `security/credentials.py` not migrated; runtime reuses Phase 1 `utils/db_credential_encryption.py` |
+| Tests | ✅ 146+ unit | Compilers (DB 36, Pandas 43), validators, introspectors, agents, chat handler, dispatcher, planner |
+| Planner eval harness | 🟡 scaffold | 3 DB + 4 tabular golden cases. Gated on `RUN_PLANNER_EVAL=1`. Real Azure OpenAI passing |
+| E2E smoke tests | ❌ not started | Component-level orchestration is covered |
+| DB introspector unit test | ❌ deferred | Needs Postgres testcontainer |
+| Sources event in `/chat/stream` | ⚠️ emits `[]` | `ChatHandler` doesn't surface retrieval sources yet; same gap reflected in `save_messages` |
+**Deferred to later phases**: joins in IR, schema drift detection, hybrid catalog search (BM25 + RRF for 100+ table users), polars lazy scan for >1GB tabular files, MySQL/BigQuery/Snowflake SQL dialects, mask/synthesize PII strategies.
+---
+## Team — division of work
+The service is built by two engineers; many modules are source-type-agnostic and shared.
+- **DB** owns SQL paths: introspection, SQL compiler, DB executor, credential storage.
+- **TAB** owns tabular paths: CSV/XLSX/Parquet introspection, pandas compiler, tabular executor, blob/Parquet plumbing.
+- **B** = both — shared contracts and source-type-agnostic plumbing. Pair-program or split with explicit hand-off.
+### Step-by-step ownership
+| # | Step | File / area | Owner | Notes |
+|---|---|---|---|---|
+| 0 | **Lock contracts before coding** | — | B | See "Decisions to lock" below; block until aligned |
+| 1 | Catalog Pydantic models | `catalog/models.py` | B | Already done; only touch if both agree |
+| 2 | IR Pydantic models | `query/ir/models.py` | B | Already done; joins/window fns require joint sign-off |
+| 3 | IR operator whitelists | `query/ir/operators.py` | B | Already done; both compilers rely on these |
+| 4 | PII patterns / regex | `security/pii_patterns.py` | B | Already done; extend together as gaps appear |
+| **Ingestion — introspection** | | | | |
+| 5 | DB introspector (information_schema, sample, FKs) | `catalog/introspect/database.py` | DB | Use SQLAlchemy `inspect()`; dialect-aware quoting |
+| 6 | Tabular introspector (CSV/XLSX/Parquet headers + sample) | `catalog/introspect/tabular.py` | TAB | Each XLSX sheet → one Table |
+| 7 | `BaseIntrospector` ABC | `catalog/introspect/base.py` | B | Confirm signature returns the same `Source` shape |
+| **Ingestion — shared catalog plumbing** | | | | |
+| 8 | ~~Catalog enricher + prompt~~ | — | — | **REMOVED in KM-557.** Cost optimization: the planner reads stats + sample rows directly. `catalog/render.py` keeps the source-rendering helper. |
+| 9 | Catalog validator | `catalog/validator.py` | B | Type-agnostic |
+| 10 | Catalog store (Postgres jsonb) | `catalog/store.py` | B | Recommend DB (Postgres expertise) |
+| 11 | Catalog reader | `catalog/reader.py` | B | Type-agnostic |
+| 12 | PII detector | `catalog/pii_detector.py` | B | Either; uses `pii_patterns.py` |
+| **Ingestion — pipelines** | | | | |
+| 13 | Structured pipeline (introspect → merge → validate → store; no enrich step since KM-557) | `pipeline/structured_pipeline.py` | B | Pair on this; it is shared by both introspectors |
+| 14 | Triggers (`on_db_registered`, `on_tabular_uploaded`) | `pipeline/triggers.py` | B | Each owns their trigger function |
+| 15 | ~~Ingestion orchestrator~~ | ~~`pipeline/orchestrator.py`~~ | B | Deleted in the Cleanup PR as a redundant stub; `StructuredPipeline` already takes the introspector at `run()` time |
+| 16 | Document pipeline (PDF/DOCX/TXT) | `pipeline/document_pipeline.py` | TAB | Tabular-adjacent (file uploads) |
+| **Query — shared spine** | | | | |
+| 17 | IR validator (catalog-aware) | `query/ir/validator.py` | B | Recommend DB; both must agree on exact error messages so retry-prompt is consistent |
+| 18 | Planner LLM service | `query/planner/service.py` | B | Type-agnostic |
+| 19 | Planner prompt (catalog → text) | `query/planner/prompt.py`, `config/prompts/query_planner.md` | B | **Pair-program**. Must describe DB tables and tabular files in one consistent format |
+| 20 | Intent router (chat/unstructured/structured) | `agents/orchestration.py` (class `OrchestratorAgent` — Phase 1 filename + class name preserved; Phase 2 body), `config/prompts/intent_router.md` | B | Type-agnostic. The prompt file uses `intent_router.md`, but the source module is still `orchestration.py` |
+| 21 | Executor base + `QueryResult` | `query/executor/base.py` | B | Lock the shape before either implements an executor |
+| 22 | Executor dispatcher | `query/executor/dispatcher.py` | B | Reads `source.source_type` from catalog; pair |
+| 23 | Compiler base ABC | `query/compiler/base.py` | B | Already done |
+| 24 | Top-level QueryService | `query/service.py` | B | Wires planner → validator → compiler → executor; pair |
+| **Query — DB path** | | | | |
+| 25 | SQL compiler (IR → SQL + params, per dialect) | `query/compiler/sql.py` | DB | Identifiers from catalog (quoted), values parameterized |
+| 26 | DB executor (asyncpg/pymysql, sqlglot guard, RO txn, 30s timeout) | `query/executor/db.py` | DB | |
+| 27 | Credential encryption (Fernet) | `security/credentials.py` | DB | Needed for stored user DB creds |
+| 28 | User-DB connection management | helper in pipelines | DB | engine_scope context manager pattern |
+| **Query — Tabular path** | | | | |
+| 29 | Pandas compiler (IR → callable on DataFrame) | `query/compiler/pandas.py` | TAB | Same IR, different backend |
+| 30 | Tabular executor (eager pandas first; pyarrow / polars later) | `query/executor/tabular.py` | TAB | Initial scope: eager pandas only |
+| 31 | Parquet upload/download + Azure Blob wrapper | `storage/az_blob/az_blob.py` (+ helper) | TAB | XLSX sheet → one Parquet per sheet (deterministic blob name) |
+| **Agents + chat** | | | | |
+| 32 | Chatbot agent + prompt | `agents/chatbot.py`, `config/prompts/chatbot_system.md` | B | Receives QueryResult or Cu chunks |
+| 33 | Guardrails prompt | `config/prompts/guardrails.md` | B | |
+| **API surface** | | | | |
+| 34 | DB client endpoints (register/ingest/list/delete) | `api/v1/db_client.py` | DB | |
+| 35 | Document/tabular upload endpoints | `api/v1/document.py` | TAB | |
+| 36 | Chat stream endpoint (SSE) | `api/v1/chat.py` | B | Dispatches both paths; pair |
+| 37 | Room / users endpoints | `api/v1/room.py`, `api/v1/users.py` | B | Whoever has bandwidth |
+| **Tests + eval** | | | | |
+| 38 | DB compiler golden tests (IR → SQL fixtures) | `tests/query/compiler/test_sql.py` | DB | Pure-Python, no LLM |
+| 39 | Pandas compiler golden tests (IR → expected DataFrame) | `tests/query/compiler/test_pandas.py` | TAB | Pure-Python, no LLM |
+| 40 | IR validator tests (catalog × IR error matrix) | `tests/query/ir/test_validator.py` | B | Each contributes test cases for their source type |
+| 41 | Planner eval (golden question → IR examples) | `tests/query/planner/` | B | Each contributes ~10 question→IR examples |
+| 42 | E2E smoke tests | `tests/e2e/` | B | Pair |
+### Decisions to lock before coding
+If made unilaterally these create silent contract drift. Lock them in a 30-min sync first.
+| Decision | Why it matters | Recommended call |
+|---|---|---|
+| `QueryResult` shape (current scaffold: `source_id, backend, rows, row_count, truncated, elapsed_ms, error`) | Both executors return this; chatbot consumes it | Lock as-is unless either side needs more (e.g. `column_types` for formatting) |
+| `Source.location_ref` format (`az_blob://...` vs `dbclient://{id}` etc.) | Dispatcher and executors both parse this | Pick a convention now; document in `catalog/models.py` docstring |
+| Where do user DB credentials live? | DB executor needs creds to run queries; Source has `location_ref` but creds are encrypted separately | Recommend: `location_ref="dbclient://{client_id}"`; executor looks up creds by ID |
+| How does dispatcher pick the executor? | Routes by `source.source_type` — but where does dispatcher get it (catalog reload, or IR carries it)? | Recommend: dispatcher takes `(Catalog, IR)`, looks up source by `IR.source_id` |
+| Joins in v1 IR? | Excluded per ARCHITECTURE.md §7. DB path is most affected — real DB use often needs joins. | Recommend: ship single-table; revisit in PR 2. **DB owner must accept the constraint or push back early** |
+| Planner prompt — render tabular vs DB sources uniformly | If described differently, planner gets confused | Pair-program. Render both as `Table: name (n rows) — Columns: ...` regardless of source_type |
+| Error contract — raise or return `QueryResult.error`? | Both executors must behave the same so chatbot branches consistently | Recommend: never raise from `executor.run()`; populate `QueryResult.error` |
+| PII handling for tabular `sample_values` | DB samples come from the ~100-row sample taken alongside the `information_schema` read; tabular samples come from file reads. The same `pii_flag` rule must apply on both sides | Confirm tabular introspector calls `pii_detector` |
+| Catalog refresh trigger (open question §3) | Affects both pipelines symmetrically | Default: rebuild on every upload/connect; defer auto-refresh |
+| `updated_at` semantics — per-Source vs per-Catalog | Affects how each pipeline writes | Recommend: per-Source `updated_at` + Catalog-level `generated_at` |
+| Dialect support scope for v1 | DB compiler must implement at least one dialect well | Recommend: Postgres first (matches app DB); MySQL second |
+| Test-fixture format for golden IRs | Both compilers test against golden IR → expected output | Recommend: shared `tests/fixtures/golden_irs.json`; each side adds expected SQL or DataFrame |
+| Logging conventions | structlog is already in place; both should log the same fields | Quick agreement: log `source_id`, `table_id`, `ir_version`, `elapsed_ms` |
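The "never raise, populate `QueryResult.error`" recommendation can be sketched as a thin wrapper. The field names follow the scaffold listed in the table above; the field types, the `run_safely` helper, and the `fetch_rows` callable are assumptions for illustration, not the executors' actual code.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

ROW_CAP = 10_000  # matches the 10k row hard cap described for both executors


@dataclass
class QueryResult:
    # Field names from the current scaffold; types are assumed.
    source_id: str
    backend: str
    rows: list = field(default_factory=list)
    row_count: int = 0
    truncated: bool = False
    elapsed_ms: float = 0.0
    error: Optional[str] = None


def run_safely(source_id, backend, fetch_rows):
    """Run `fetch_rows()` and always return a QueryResult instead of raising."""
    started = time.perf_counter()
    try:
        rows = list(fetch_rows())
        result = QueryResult(
            source_id=source_id,
            backend=backend,
            rows=rows[:ROW_CAP],
            row_count=min(len(rows), ROW_CAP),
            truncated=len(rows) > ROW_CAP,
        )
    except Exception as exc:  # deliberate catch-all: the contract is "never raise"
        result = QueryResult(source_id, backend, error=f"{type(exc).__name__}: {exc}")
    result.elapsed_ms = (time.perf_counter() - started) * 1000
    return result
```

With this shape the chatbot can branch on `result.error is None` regardless of which executor produced the result.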
+### Working rhythm (suggested)
+1. **Day 1** — 30-min sync to lock the decisions table. PR any contract/docstring changes that fall out.
+2. **Week 1** — both build introspectors + agree on the planner prompt format. PR in parallel; review each other's.
+3. **Week 2** — DB builds SQL compiler + DB executor; TAB builds pandas compiler + tabular executor. Both write golden tests against shared IR fixtures.
+4. **Week 3** — pair on dispatcher, QueryService, and chat endpoint integration. End-to-end smoke test.
+5. **Ongoing** — short daily standup, mostly to flag IR-shape questions and catalog-field additions *before* either side implements against an unconfirmed contract.
+Biggest risk: **silent contract drift**. One side adds a `QueryResult` field or assumes a new IR op exists, the other ships without it, and integration breaks at the dispatcher. The step 0 contract lock and the shared golden-IR fixtures are what prevent that.
+### Onboarding to Claude Code
+If you're new to Claude Code, before you start:
+1. Read `ARCHITECTURE.md` end-to-end (~10 min) — this is the source of truth.
+2. Skim this file (`REPO_CONTEXT.md`) — find your section in the ownership table.
+3. Read your owned files' docstrings — every stub explains its contract.
+4. Open Claude Code in this repo. When you ask Claude to implement a stub:
+   - Reference the file path + the contract it should follow
+   - Point it at `ARCHITECTURE.md` section if relevant (e.g. §7 for IR validation)
+   - Ask it to write the test first (golden IR fixtures), then the implementation
+   - Always review the diff — don't auto-accept
+Useful slash commands while working: `/review` (PR review), `/security-review` (audit pending changes).
+---
+## Conventions & gotchas
+- **Async event loop on Windows**: `run.py` sets `WindowsSelectorEventLoopPolicy` because psycopg3 async needs it. Don't call `uvicorn` directly on Windows.
+- **Two Postgres engines**: `engine` (app tables) and `_pgvector_engine` (asyncpg with `prepared_statement_cache_size=0`) — the latter is required because PGVector emits `advisory_lock + CREATE EXTENSION` as a multi-statement string and asyncpg rejects multi-statement prepared queries. `init_db.py` creates the extension explicitly so `PGVector(create_extension=False)` skips that path.
+- **Read-only at every layer for user DBs**: IR validation + compiler whitelists + sqlglot SELECT-only check + read-only DB credentials + LIMIT enforcement + 30s timeout. Six layers; no single point of failure.
+- **Identifiers vs values**: identifiers (table/column names) come from the catalog and are inlined as quoted identifiers — they were verified at validation time so this is safe. Values from `IR.filters` are *always* parameterized, never inlined as strings.
+- **Credential encryption**: Fernet via `dataeyond__db__credential__key` env var; lives in `security/credentials.py`. Sensitive fields = `{"password", "service_account_json"}`.
+- **Settings env-var aliases**: `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings` exposes them as `azureai_api_key_4o` via `Field(alias=...)`. Mind both forms when adding settings.
+- **Prompts**: `src/config/prompts/*.md` — `intent_router`, `query_planner`, `chatbot_system`, `guardrails` are all written. `chatbot_system` has `guardrails` appended so guardrails take precedence in conflict. `catalog_enricher.md` was deleted in KM-557. `config/agents/` folder deleted in Cleanup PR.
+- **Planner prompt parsing gotcha**: `query/planner/service.py` uses `SystemMessage(content=...)` not `("system", text)`. The tuple form causes LangChain to interpret `{...}` in `query_planner.md` as f-string variables and crash on every real invocation. Don't refactor back to tuples.
+- **Tests**: 146+ unit tests in place. Run with `uv run pytest`. Planner eval gated on `RUN_PLANNER_EVAL=1`; catalog store integration test gated on `RUN_INTEGRATION_TESTS=1`.
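
A minimal sketch of the "Identifiers vs values" convention above, using the standard-library `sqlite3` module as a stand-in for a user database. The names `ALLOWED_COLUMNS`, `quote_ident`, and `build_select` are hypothetical and not taken from this repo; the real compiler may be structured differently.

```python
import sqlite3

# Hypothetical catalog-derived whitelist: table name -> allowed column names.
ALLOWED_COLUMNS = {"orders": {"id", "status", "total"}}


def quote_ident(name: str) -> str:
    """Quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table: str, column: str, value: object, limit: int = 100):
    """Return (sql, params): identifiers are inlined, the filter value is bound."""
    if column not in ALLOWED_COLUMNS.get(table, set()):
        raise ValueError(f"unknown identifier: {table}.{column}")
    sql = (
        f"SELECT * FROM {quote_ident(table)} "
        f"WHERE {quote_ident(column)} = ? LIMIT ?"
    )
    return sql, (value, limit)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "orders" (id INTEGER, status TEXT, total REAL)')
conn.execute("INSERT INTO orders VALUES (1, 'paid', 9.5), (2, 'open', 3.0)")

# An injection attempt is treated as a plain string value and matches nothing.
sql, params = build_select("orders", "status", "paid' OR '1'='1")
rows_injection = conn.execute(sql, params).fetchall()

# A normal value goes through the same bound-parameter path.
sql, params = build_select("orders", "status", "paid")
rows_ok = conn.execute(sql, params).fetchall()
```

Because the value never becomes part of the SQL text, the whitelist only has to cover identifiers, which come from the catalog rather than from the user.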
+---
+## Recommended reading order
+1. `ARCHITECTURE.md` — design intent (the source of truth)
+2. `src/catalog/models.py` + `src/query/ir/models.py` — the two data shapes everything else moves between
+3. `src/query/ir/operators.py` + `src/security/pii_patterns.py` — the explicit whitelists / patterns
+4. Skim every `__init__.py`-level docstring under `src/catalog/`, `src/query/`, `src/agents/`, `src/pipeline/` — each describes the contract its module enforces
+5. `main.py` + `src/db/postgres/{connection,init_db}.py` — runtime bootstrap
+6. `ARCHITECTURE.md §10` — five open questions that haven't been decided yet
+---
+## Open questions
+Status as Phase 2 landed (✅ resolved, 🟡 partially resolved, ❌ still open):
+1. ✅ Catalog storage shape — Postgres `jsonb` row in `data_catalog` table, keyed by `user_id`.
+2. ❌ Unstructured files in catalog — still not modeled; router uses `source_hint` from the LLM instead.
+3. 🟡 Catalog refresh trigger — rebuild-on-upload-or-connect is the default. Explicit endpoint `POST /api/v1/data-catalog/rebuild` exists. Background TTL deferred.
+4. ✅ Joins out of v1 IR — confirmed; single-table only. Revisit when real queries need it.
+5. 🟡 PII `sample_values` — currently nulled out (skip). Mask/synthesize deferred.
+---
+## Glossary
+- **Cu** — unstructured context (prose chunks)
+- **Cs** — schema context (DB tables/columns from catalog)
+- **Ct** — tabular context (file sheets/columns from catalog)
+- **IR** — intermediate representation (the JSON query shape)
+- **PII** — personally identifiable information
+- **ABC** — abstract base class

main.py CHANGED

@@ -1,5 +1,7 @@
 """Main application entry point."""
 from fastapi import FastAPI
 from src.middlewares.logging import configure_logging, get_logger
 from src.middlewares.cors import add_cors_middleware
@@ -9,8 +11,8 @@ from src.api.v1.document import router as document_router
 from src.api.v1.chat import router as chat_router
 from src.api.v1.room import router as room_router
 from src.api.v1.users import router as users_router
-from src.api.v1.knowledge import router as knowledge_router
 from src.api.v1.db_client import router as db_client_router
 from src.db.postgres.init_db import init_db
 import uvicorn
@@ -18,11 +20,21 @@ import uvicorn
 configure_logging()
 logger = get_logger("main")
 # Create FastAPI app
 app = FastAPI(
     title="DataEyond Agentic Service",
     description="Multi-agent AI backend with RAG capabilities",
-    version="0.1.0"
 )
 # Add middleware
@@ -33,18 +45,10 @@ app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
 # Include routers
 app.include_router(users_router)
 app.include_router(document_router)
-app.include_router(knowledge_router)
 app.include_router(room_router)
 app.include_router(chat_router)
 app.include_router(db_client_router)
-@app.on_event("startup")
-async def startup_event():
-    """Initialize database on startup."""
-    logger.info("Starting application...")
-    await init_db()
-    logger.info("Database initialized")
 @app.get("/")

 """Main application entry point."""
+from contextlib import asynccontextmanager
 from fastapi import FastAPI
 from src.middlewares.logging import configure_logging, get_logger
 from src.middlewares.cors import add_cors_middleware
 from src.api.v1.chat import router as chat_router
 from src.api.v1.room import router as room_router
 from src.api.v1.users import router as users_router
 from src.api.v1.db_client import router as db_client_router
+from src.api.v1.data_catalog import router as data_catalog_router
 from src.db.postgres.init_db import init_db
 import uvicorn
 configure_logging()
 logger = get_logger("main")
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    logger.info("Starting application...")
+    await init_db()
+    logger.info("Database initialized")
+    yield
 # Create FastAPI app
 app = FastAPI(
     title="DataEyond Agentic Service",
     description="Multi-agent AI backend with RAG capabilities",
+    version="0.1.0",
+    lifespan=lifespan,
 )
 # Add middleware
 # Include routers
 app.include_router(users_router)
 app.include_router(document_router)
 app.include_router(room_router)
 app.include_router(chat_router)
 app.include_router(db_client_router)
+app.include_router(data_catalog_router)
 @app.get("/")
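
The move from `@app.on_event("startup")` to a `lifespan` function relies on `contextlib.asynccontextmanager`: code before `yield` runs at startup and code after it runs at shutdown. A stdlib-only sketch of that ordering, with a hypothetical `fake_init_db` standing in for `init_db` and no FastAPI involved:

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


async def fake_init_db() -> None:
    """Hypothetical stand-in for init_db()."""
    events.append("init_db")


@asynccontextmanager
async def lifespan(app: object):
    events.append("startup")
    await fake_init_db()
    try:
        yield
    finally:
        # Runs at shutdown; the lifespan in this PR has no teardown step.
        events.append("shutdown")


async def serve() -> None:
    # A server enters the context once, handles requests, then exits it.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(serve())
```

If teardown is needed later (for example disposing an engine), it would go after the `yield`, which the old `on_event("startup")` hook had no place for.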

pyproject.toml CHANGED

@@ -121,7 +121,8 @@ ignore = [
 ]
 [tool.ruff.lint.per-file-ignores]
-"tests/**" = ["S101", "S105", "S106"]
 [tool.mypy]
 python_version = "3.12"

 ]
 [tool.ruff.lint.per-file-ignores]
+# S608 in tests is a false positive — tests assert literal SQL strings as fixtures.
+"tests/**" = ["S101", "S105", "S106", "S608"]
 [tool.mypy]
 python_version = "3.12"

scripts/build_initial_catalogs.py ADDED

	@@ -0,0 +1,73 @@

+"""Backfill catalogs for existing users.
+One-off script. For each user that already has registered DB connections or
+uploaded tabular files, run the structured pipeline to build their catalog.
+Run once against the live DB after deploying this branch to populate catalog
+rows for data registered before the catalog pipeline landed.
+Note: enrich_all_sources.py is not needed — LLM enrichment was removed in
+KM-557. The pipeline is now introspect → merge → validate → upsert.
+Usage:
+    uv run python scripts/build_initial_catalogs.py [--user-id USER_ID]
+"""
+import asyncio
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from sqlalchemy import select
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import DatabaseClient, Document
+from src.pipeline.triggers import on_db_registered, on_tabular_uploaded
+async def main() -> None:
+    user_id_filter = None
+    if "--user-id" in sys.argv:
+        idx = sys.argv.index("--user-id")
+        user_id_filter = sys.argv[idx + 1]
+        print(f"Filtering to user_id: {user_id_filter}")
+    async with AsyncSessionLocal() as db:
+        # ── 1. DB clients ──────────────────────────────────────────────
+        query = select(DatabaseClient).where(DatabaseClient.status == "active")
+        if user_id_filter:
+            query = query.where(DatabaseClient.user_id == user_id_filter)
+        result = await db.execute(query)
+        db_clients = result.scalars().all()
+        print(f"\nFound {len(db_clients)} active DB client(s)")
+        for client in db_clients:
+            try:
+                await on_db_registered(client.id, client.user_id)
+                print(f"  ✓ db_client {client.id} ({client.name})")
+            except Exception as e:
+                print(f"  ✗ db_client {client.id} ({client.name}): {e}")
+        # ── 2. Tabular files ───────────────────────────────────────────
+        query = select(Document).where(
+            Document.file_type.in_(["csv", "xlsx"]),
+            Document.status == "completed",
+        )
+        if user_id_filter:
+            query = query.where(Document.user_id == user_id_filter)
+        result = await db.execute(query)
+        docs = result.scalars().all()
+        print(f"\nFound {len(docs)} completed tabular file(s)")
+        for doc in docs:
+            try:
+                await on_tabular_uploaded(doc.id, doc.user_id)
+                print(f"  ✓ {doc.file_type} {doc.id} ({doc.filename})")
+            except Exception as e:
+                print(f"  ✗ {doc.file_type} {doc.id} ({doc.filename}): {e}")
+    print("\nDone.")
+if __name__ == "__main__":
+    asyncio.run(main())

scripts/enrich_all_sources.py ADDED

	@@ -0,0 +1,16 @@

+"""Bulk re-run CatalogEnricher with the current prompt.
+For when src/config/prompts/catalog_enricher.md changes and existing
+catalog descriptions need to be regenerated across all users.
+Usage:
+    uv run python scripts/enrich_all_sources.py [--user-id USER_ID]
+"""
+def main() -> None:
+    raise NotImplementedError
+if __name__ == "__main__":
+    main()

src/agents/chat_handler.py ADDED

	@@ -0,0 +1,274 @@

+"""ChatHandler — top-level Phase 2 chat orchestrator.
+End-to-end flow per user message:
+  1. `IntentRouter.classify` → `chat` / `unstructured` / `structured`.
+  2. Route:
+       - `chat`         → no context. Pass straight to ChatbotAgent.
+       - `structured`   → CatalogReader → QueryService → QueryResult.
+       - `unstructured` → DocumentRetriever (placeholder, raises until TAB
+                          ships) → list[DocumentChunk].
+  3. `ChatbotAgent.astream` → yield text tokens.
+  4. Wrap each step into an SSE-style event dict so the API endpoint can
+     stream them as Server-Sent Events.
+Phase 1's chat endpoint (`src/api/v1/chat.py`) is intentionally NOT touched
+in this PR. PR7 cleanup will rewire it to call `ChatHandler.handle(...)`.
+All dependencies are injectable for tests. Default constructors lazy-build
+production deps (no `Settings()` triggered at import time as long as you
+inject mocks).
+"""
+from __future__ import annotations
+import json
+from collections.abc import AsyncIterator
+from typing import TYPE_CHECKING, Any
+from langchain_core.messages import BaseMessage
+from src.middlewares.logging import get_logger
+from src.retrieval.base import RetrievalResult
+from .chatbot import ChatbotAgent, DocumentChunk
+from .orchestration import OrchestratorAgent
+if TYPE_CHECKING:
+    from ..catalog.reader import CatalogReader
+    from ..query.service import QueryService
+    from ..retrieval.router import RetrievalRouter
+logger = get_logger("chat_handler")
+class ChatHandler:
+    """Top-level chat orchestrator.
+    Returns an `AsyncIterator[dict]` of SSE-style events with shape
+    `{"event": <name>, "data": <str>}`. Event types:
+      - `intent`  — emitted once after classification (JSON-encoded decision)
+      - `sources` — JSON array of source refs (one per structured table, or
+                    per (document_id, page_label) for unstructured)
+      - `chunk`   — text fragment of the streaming answer (one per token)
+      - `done`    — end of stream (data is empty string)
+      - `error`   — failure; data is a user-facing message
+    """
+    def __init__(
+        self,
+        intent_router: OrchestratorAgent | None = None,
+        answer_agent: ChatbotAgent | None = None,
+        catalog_reader: CatalogReader | None = None,
+        query_service: QueryService | None = None,
+        document_retriever: RetrievalRouter | None = None,
+    ) -> None:
+        self._intent_router = intent_router
+        self._answer_agent = answer_agent
+        self._catalog_reader = catalog_reader
+        self._query_service = query_service
+        self._document_retriever = document_retriever
+    # ------------------------------------------------------------------
+    # Lazy default-dep builders
+    # ------------------------------------------------------------------
+    def _get_intent_router(self) -> OrchestratorAgent:
+        if self._intent_router is None:
+            self._intent_router = OrchestratorAgent()
+        return self._intent_router
+    def _get_answer_agent(self) -> ChatbotAgent:
+        if self._answer_agent is None:
+            self._answer_agent = ChatbotAgent()
+        return self._answer_agent
+    def _get_catalog_reader(self) -> CatalogReader:
+        if self._catalog_reader is None:
+            from ..catalog.reader import CatalogReader
+            from ..catalog.store import CatalogStore
+            self._catalog_reader = CatalogReader(CatalogStore())
+        return self._catalog_reader
+    def _get_query_service(self) -> QueryService:
+        if self._query_service is None:
+            from ..query.service import QueryService
+            self._query_service = QueryService()
+        return self._query_service
+    def _get_document_retriever(self) -> RetrievalRouter:
+        if self._document_retriever is None:
+            from ..retrieval.router import RetrievalRouter
+            self._document_retriever = RetrievalRouter()
+        return self._document_retriever
+    # ------------------------------------------------------------------
+    # Public entry
+    # ------------------------------------------------------------------
+    async def handle(
+        self,
+        message: str,
+        user_id: str,
+        history: list[BaseMessage] | None = None,
+    ) -> AsyncIterator[dict[str, Any]]:
+        # ---- 1. Classify intent --------------------------------------
+        try:
+            decision = await self._get_intent_router().classify(message, history)
+        except Exception as e:
+            logger.error("intent classification failed", error=str(e))
+            yield {"event": "error", "data": f"Could not classify message: {e}"}
+            return
+        yield {"event": "intent", "data": decision.model_dump_json()}
+        rewritten = decision.rewritten_query or message
+        query_result = None
+        chunks: list[DocumentChunk] | None = None
+        raw_chunks: Any = None
+        # ---- 2. Route ------------------------------------------------
+        if decision.source_hint == "structured":
+            try:
+                catalog = await self._get_catalog_reader().read(user_id, "structured")
+                query_result = await self._get_query_service().run(
+                    user_id, rewritten, catalog
+                )
+            except Exception as e:
+                logger.error(
+                    "structured route failed",
+                    user_id=user_id,
+                    error=str(e),
+                )
+                yield {"event": "error", "data": f"Structured query failed: {e}"}
+                return
+        elif decision.source_hint == "unstructured":
+            try:
+                raw_chunks = await self._get_document_retriever().retrieve(
+                    rewritten, user_id
+                )
+                chunks = _normalize_chunks(raw_chunks)
+            except NotImplementedError:
+                logger.warning("DocumentRetriever placeholder hit", user_id=user_id)
+                yield {
+                    "event": "error",
+                    "data": "Document retrieval is not yet available — pending implementation.",
+                }
+                return
+            except Exception as e:
+                logger.error(
+                    "unstructured route failed", user_id=user_id, error=str(e)
+                )
+                yield {"event": "error", "data": f"Document retrieval failed: {e}"}
+                return
+        # else: chat path — no context
+        # ---- 2b. Emit sources ---------------------------------------
+        sources = _build_sources(
+            decision.source_hint, user_id, query_result, raw_chunks
+        )
+        yield {"event": "sources", "data": json.dumps(sources)}
+        # ---- 3. Stream answer ----------------------------------------
+        try:
+            async for token in self._get_answer_agent().astream(
+                message,
+                history=history,
+                query_result=query_result,
+                chunks=chunks,
+            ):
+                yield {"event": "chunk", "data": token}
+        except Exception as e:
+            logger.error("answer streaming failed", user_id=user_id, error=str(e))
+            yield {"event": "error", "data": f"Answer generation failed: {e}"}
+            return
+        yield {"event": "done", "data": ""}
+def _build_sources(
+    source_hint: str,
+    user_id: str,
+    query_result: Any,
+    raw_chunks: Any,
+) -> list[dict[str, Any]]:
+    """Build the sources payload for the SSE `sources` event.
+    - structured: one entry per executed table (table_name only).
+    - unstructured: deduped by (document_id, page_label), Phase 1 shape.
+    - chat or error: empty list.
+    """
+    if source_hint == "structured":
+        if query_result is None or getattr(query_result, "error", None):
+            return []
+        table_name = getattr(query_result, "table_name", "") or ""
+        if not table_name:
+            return []
+        return [{
+            "document_id": f"{user_id}_{table_name}",
+            "filename": table_name,
+            "page_label": None,
+        }]
+    if source_hint == "unstructured" and raw_chunks:
+        seen: set[tuple[Any, Any]] = set()
+        sources: list[dict[str, Any]] = []
+        for item in raw_chunks:
+            if isinstance(item, RetrievalResult):
+                data = item.metadata.get("data", {})
+            elif isinstance(item, dict):
+                data = item
+            else:
+                continue
+            key = (data.get("document_id"), data.get("page_label"))
+            if key in seen or key == (None, None):
+                continue
+            seen.add(key)
+            sources.append({
+                "document_id": data.get("document_id"),
+                "filename": data.get("filename", "Unknown"),
+                "page_label": data.get("page_label", "Unknown"),
+            })
+        return sources
+    return []
+def _normalize_chunks(raw: Any) -> list[DocumentChunk]:
+    """Convert whatever the retriever returns into list[DocumentChunk].
+    The Phase 2 `DocumentRetriever.retrieve` interface is a stub today;
+    when TAB owner ships it, it should return `list[DocumentChunk]`
+    directly so this normalizer becomes a no-op. Until then we coerce
+    common shapes (dict-with-content, plain string) defensively.
+    """
+    if not raw:
+        return []
+    if isinstance(raw, list) and all(isinstance(c, DocumentChunk) for c in raw):
+        return raw
+    chunks: list[DocumentChunk] = []
+    for item in raw:
+        if isinstance(item, DocumentChunk):
+            chunks.append(item)
+        elif isinstance(item, dict):
+            chunks.append(
+                DocumentChunk(
+                    content=str(item.get("content", "")),
+                    filename=item.get("filename"),
+                    page_label=item.get("page_label"),
+                )
+            )
+        elif isinstance(item, RetrievalResult):
+            data = item.metadata.get("data", {})
+            page = data.get("page_label")
+            chunks.append(DocumentChunk(
+                content=item.content,
+                filename=data.get("filename"),
+                page_label=str(page) if page is not None else None,
+            ))
+        elif isinstance(item, str):
+            chunks.append(DocumentChunk(content=item))
+    return chunks
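
The event dicts yielded by `ChatHandler.handle` still have to be serialized before they go over the wire. A minimal sketch of that step, following the Server-Sent Events text format (`event:` and `data:` fields, terminated by a blank line). `format_sse` and `fake_handle` are hypothetical names; the actual endpoint in `src/api/v1/chat.py` may rely on a library helper instead.

```python
import asyncio
from collections.abc import AsyncIterator


def format_sse(event: dict[str, str]) -> str:
    """Serialize {"event": ..., "data": ...} into one SSE frame."""
    lines = [f"event: {event['event']}"]
    # Multi-line data must be sent as one `data:` field per line.
    for part in event["data"].split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


async def fake_handle() -> AsyncIterator[dict[str, str]]:
    """Stand-in for ChatHandler.handle on the chat-only path."""
    yield {"event": "intent", "data": '{"source_hint": "chat"}'}
    yield {"event": "sources", "data": "[]"}
    yield {"event": "chunk", "data": "Hello"}
    yield {"event": "done", "data": ""}


async def collect() -> list[str]:
    return [format_sse(e) async for e in fake_handle()]


frames = asyncio.run(collect())
```

Splitting on newlines matters for `chunk` events: a token that contains a line break would otherwise end the frame early on the client side.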

src/agents/chatbot.py CHANGED

@@ -1,85 +1,169 @@
-"""Chatbot agent with RAG capabilities."""
-import tiktoken
-from langchain_openai import AzureChatOpenAI
-from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
 from langchain_core.output_parsers import StrOutputParser
-from src.config.settings import settings
 from src.middlewares.logging import get_logger
-from langchain_core.messages import HumanMessage, AIMessage
 logger = get_logger("chatbot")
-_enc = tiktoken.get_encoding("cl100k_base")
-def _count_tokens(messages: list, context: str) -> dict:
-    msg_tokens = sum(len(_enc.encode(m.content)) for m in messages)
-    ctx_tokens = len(_enc.encode(context))
-    return {"messages_tokens": msg_tokens, "context_tokens": ctx_tokens, "total": msg_tokens + ctx_tokens}
-class ChatbotAgent:
-    """Chatbot agent with RAG capabilities."""
-    def __init__(self):
-        self.llm = AzureChatOpenAI(
-            azure_deployment=settings.azureai_deployment_name_4o,
-            openai_api_version=settings.azureai_api_version_4o,
-            azure_endpoint=settings.azureai_endpoint_url_4o,
-            api_key=settings.azureai_api_key_4o,
-            temperature=0.7
         )
-        # Read system prompt
-        try:
-            with open("src/config/agents/system_prompt.md", "r") as f:
-                system_prompt = f.read()
-        except FileNotFoundError:
-            system_prompt = "You are a helpful AI assistant with access to user's uploaded documents."
-        # Create prompt template
-        self.prompt = ChatPromptTemplate.from_messages([
-            ("system", system_prompt),
-            MessagesPlaceholder(variable_name="messages"),
-            ("system", "Relevant documents:\n{context}")
-        ])
-        # Create chain
-        self.chain = self.prompt | self.llm | StrOutputParser()
-    async def generate_response(
         self,
-        messages: list,
-        context: str = ""
-    ) -> str:
-        """Generate response with optional RAG context."""
-        try:
-            logger.info("Generating chatbot response")
-            # Generate response
-            response = await self.chain.ainvoke({
-                "messages": messages,
-                "context": context
-            })
-            logger.info(f"Generated response: {response[:100]}...")
-            return response
-        except Exception as e:
-            logger.error("Response generation failed", error=str(e))
-            raise
-    async def astream_response(self, messages: list, context: str = ""):
-        """Stream response tokens as they are generated."""
-        try:
-            token_counts = _count_tokens(messages, context)
-            logger.info("LLM input tokens", **token_counts)
-            async for token in self.chain.astream({"messages": messages, "context": context}):
-                yield token
-        except Exception as e:
-            logger.error("Response streaming failed", error=str(e))
-            raise
-chatbot = ChatbotAgent()

+"""ChatbotAgent — final answer formation. Phase 2 chatbot.
+Receives one of:
+  - a `QueryResult` (structured query path),
+  - a list of document chunks (unstructured path), or
+  - nothing (chat-only path: greeting, farewell, meta question).
+Streams the answer token-by-token so the chat handler can wrap each token
+into an SSE event. Conversation history is supported.
+"""
+from __future__ import annotations
+from collections.abc import AsyncIterator
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+from langchain_core.messages import BaseMessage
 from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
 from src.middlewares.logging import get_logger
+from ..query.executor.base import QueryResult
 logger = get_logger("chatbot")
+_PROMPT_DIR = Path(__file__).resolve().parent.parent / "config" / "prompts"
+_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "chatbot_system.md"
+_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
+@dataclass
+class DocumentChunk:
+    """One retrieved document chunk for the unstructured path."""
+    content: str
+    filename: str | None = None
+    page_label: str | None = None
+def _load_system_prompt() -> str:
+    """Compose system prompt = chatbot_system.md + guardrails.md.
+    Guardrails appended last so they take precedence in conflict (matches
+    the docstring at the top of guardrails.md).
+    """
+    chatbot = _SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")
+    guardrails = _GUARDRAILS_PATH.read_text(encoding="utf-8")
+    return f"{chatbot}\n\n{guardrails}"
+def _format_query_result(qr: QueryResult) -> str:
+    """Render a QueryResult as a compact context block for the LLM."""
+    source_label = qr.source_name or "(unknown source)"
+    table_label = qr.table_name or "(unknown table)"
+    if qr.error:
+        return (
+            f"[Query result — FAILED]\n"
+            f"source: {source_label}\n"
+            f"table: {table_label}\n"
+            f"error: {qr.error}"
         )
+    lines: list[str] = [
+        "[Query result]",
+        f"source: {source_label}",
+        f"table: {table_label}",
+        f"backend: {qr.backend}",
+        f"row_count: {qr.row_count}"
+        + (" (truncated)" if qr.truncated else ""),
+        f"elapsed_ms: {qr.elapsed_ms}",
+    ]
+    if qr.rows:
+        # Cap rendering at 25 rows; the LLM doesn't need the full set
+        cap = min(len(qr.rows), 25)
+        columns = list(qr.rows[0].keys())
+        lines.append("columns: " + ", ".join(columns))
+        lines.append("rows:")
+        for row in qr.rows[:cap]:
+            lines.append("  " + ", ".join(f"{k}={row[k]!r}" for k in columns))
+        if cap < len(qr.rows):
+            lines.append(f"  ... (+{len(qr.rows) - cap} more rows omitted from prompt)")
+    return "\n".join(lines)
+def _format_document_chunks(chunks: list[DocumentChunk]) -> str:
+    if not chunks:
+        return ""
+    blocks: list[str] = []
+    for c in chunks:
+        label_parts = [p for p in (c.filename, c.page_label) if p]
+        label = ", ".join(label_parts) if label_parts else "Unknown source"
+        blocks.append(f"[Source: {label}]\n{c.content}")
+    return "\n\n".join(blocks)
+def _build_context_block(
+    query_result: QueryResult | None,
+    chunks: list[DocumentChunk] | None,
+) -> str:
+    parts: list[str] = []
+    if query_result is not None:
+        parts.append(_format_query_result(query_result))
+    if chunks:
+        parts.append(_format_document_chunks(chunks))
+    return "\n\n".join(parts) if parts else "(no data context — answer conversationally)"
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0.3,
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_system_prompt()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            ("human", "{message}"),
+            ("system", "Data context for this turn:\n\n{context}"),
+        ]
+    )
+    return prompt | llm | StrOutputParser()
+class ChatbotAgent:
+    """Formats and streams the final user-facing answer.
+    `chain` is injectable: tests pass a fake that yields canned tokens.
+    Default constructs the production Azure OpenAI streaming chain on
+    first use.
+    """
+    def __init__(self, chain: Runnable | None = None) -> None:
+        self._chain = chain
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+    async def astream(
         self,
+        message: str,
+        history: list[BaseMessage] | None = None,
+        query_result: QueryResult | None = None,
+        chunks: list[DocumentChunk] | None = None,
+    ) -> AsyncIterator[str]:
+        """Stream tokens of the final answer.
+        Caller wraps each token into the SSE format. Empty `history` and
+        no context = pure chat reply.
+        """
+        chain = self._ensure_chain()
+        payload: dict[str, Any] = {
+            "message": message,
+            "history": history or [],
+            "context": _build_context_block(query_result, chunks),
+        }
+        async for token in chain.astream(payload):
+            yield token
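
Because `chain` is injectable, a test can exercise the streaming path without Azure credentials. A sketch of that idea; `FakeChain` and `StandInAgent` are hypothetical, and `StandInAgent` only mirrors the shape of `ChatbotAgent.astream` so the example runs on its own:

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Any


class FakeChain:
    """Yields canned tokens and records the payload it received."""

    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.payload: dict[str, Any] | None = None

    async def astream(self, payload: dict[str, Any]) -> AsyncIterator[str]:
        self.payload = payload
        for token in self.tokens:
            yield token


class StandInAgent:
    """Mirrors the shape of ChatbotAgent.astream; not the real class."""

    def __init__(self, chain: FakeChain) -> None:
        self._chain = chain

    async def astream(self, message: str) -> AsyncIterator[str]:
        payload = {"message": message, "history": [], "context": "(none)"}
        async for token in self._chain.astream(payload):
            yield token


chain = FakeChain(["Hel", "lo", "!"])


async def run() -> str:
    agent = StandInAgent(chain)
    return "".join([t async for t in agent.astream("hi")])


answer = asyncio.run(run())
```

In the real test suite the same fake could be passed as `ChatbotAgent(chain=...)`, which also lets a test assert on the `context` string built from a `QueryResult` or document chunks.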

src/agents/orchestration.py CHANGED

@@ -1,79 +1,109 @@
-"""Orchestrator agent for intent recognition and planning."""
-from langchain_openai import AzureChatOpenAI
 from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
-from src.config.settings import settings
 from src.middlewares.logging import get_logger
-from src.models.structured_output import IntentClassification
 logger = get_logger("orchestrator")
 class OrchestratorAgent:
-    """Orchestrator agent for intent recognition and planning."""
-    def __init__(self):
-        self.llm = AzureChatOpenAI(
-            azure_deployment=settings.azureai_deployment_name_4o,
-            openai_api_version=settings.azureai_api_version_4o,
-            azure_endpoint=settings.azureai_endpoint_url_4o,
-            api_key=settings.azureai_api_key_4o,
-            temperature=0
         )
-        self.prompt = ChatPromptTemplate.from_messages([
-            ("system", """You are an orchestrator agent. You receive recent conversation history and the user's latest message.
-Your task:
-1. Determine intent: question, greeting, goodbye, or other
-2. Decide whether to search the user's documents (needs_search)
-3. If search is needed, rewrite the user's message into a STANDALONE search query that incorporates necessary context from conversation history. If the user says "tell me more" or "how many papers?", the search_query must spell out the full topic explicitly from history.
-4. If no search needed, provide a short direct_response (plain text only, no markdown formatting).
-Intent Routing:
-- question -> needs_search=True, search_query=<standalone rewritten query>
-- greeting -> needs_search=False, direct_response="Hello! How can I assist you today?"
-- goodbye -> needs_search=False, direct_response="Goodbye! Have a great day!"
-- other -> needs_search=True, search_query=<standalone rewritten query>
-Source Routing (set source_hint):
-- Columns, tables, sheets, data types, schema, row counts, statistics -> source_hint=schema
-- Document content, paragraphs, reports, articles, text -> source_hint=document
-- Unclear or spans both -> source_hint=both
-"""),
-            MessagesPlaceholder(variable_name="history"),
-            ("user", "{message}")
-        ])
-        # with_structured_output uses function calling — guarantees valid schema regardless of LLM response style
-        self.chain = self.prompt | self.llm.with_structured_output(IntentClassification)
-    async def analyze_message(self, message: str, history: list = None) -> dict:
-        """Analyze user message and determine next actions.
-        Args:
-            message: The current user message.
-            history: Recent conversation as LangChain BaseMessage objects (oldest-first).
-                     Used to rewrite ambiguous follow-ups into standalone search queries.
-        """
-        try:
-            logger.info(f"Analyzing message: {message[:50]}...")
-            history_messages = history or []
-            result: IntentClassification = await self.chain.ainvoke({"message": message, "history": history_messages})
-            logger.info(f"Intent: {result.intent}, Needs search: {result.needs_search}, Search query: {result.search_query[:50] if result.search_query else ''}")
-            return result.model_dump()
-        except Exception as e:
-            logger.error("Message analysis failed", error=str(e))
-            # Fallback to treating everything as a question
-            return {
-                "intent": "question",
-                "needs_search": True,
-                "search_query": message,
-                "direct_response": None
-            }
-orchestrator = OrchestratorAgent()

+"""OrchestratorAgent — classifies a user message and emits source_hint.
+Output: needs_search (bool) + source_hint ∈ { chat, unstructured, structured }
++ rewritten_query (standalone form of the user's question, history-resolved).
+Phase 2 replaces the previous intent-classification body. The class name
+is preserved so existing import sites (`from src.agents.orchestration
+import OrchestratorAgent`) keep working. The default LLM chain is
+constructed lazily so the module is import-safe even without `.env`
+populated.
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import Literal
+from langchain_core.messages import BaseMessage
 from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+from pydantic import BaseModel, Field
 from src.middlewares.logging import get_logger
 logger = get_logger("orchestrator")
+SourceHint = Literal["chat", "unstructured", "structured"]
+_PROMPT_PATH = (
+    Path(__file__).resolve().parent.parent
+    / "config"
+    / "prompts"
+    / "intent_router.md"
+)
+class IntentRouterDecision(BaseModel):
+    """LLM output. Pydantic so it can be used with `with_structured_output`."""
+    needs_search: bool = Field(
+        ..., description="True if we must look at the user's data to answer."
+    )
+    source_hint: SourceHint = Field(
+        ...,
+        description="Which downstream path: 'chat' (no lookup), "
+        "'unstructured' (PDF/DOCX/TXT prose), 'structured' (DB / tabular file).",
+    )
+    rewritten_query: str | None = Field(
+        None,
+        description="Standalone version of the question, history-resolved. "
+        "Null when needs_search=false.",
+    )
+def _load_prompt_text() -> str:
+    return _PROMPT_PATH.read_text(encoding="utf-8")
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0,
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_prompt_text()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            ("human", "{message}"),
+        ]
+    )
+    return prompt | llm.with_structured_output(IntentRouterDecision)
 class OrchestratorAgent:
+    """Classifies a user message into chat / unstructured / structured.
+    Inject `structured_chain` for tests; default builds the production
+    Azure OpenAI chain on first use.
+    """
+    def __init__(self, structured_chain: Runnable | None = None) -> None:
+        self._chain = structured_chain
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+    async def classify(
+        self,
+        message: str,
+        history: list[BaseMessage] | None = None,
+    ) -> IntentRouterDecision:
+        chain = self._ensure_chain()
+        decision: IntentRouterDecision = await chain.ainvoke(
+            {"message": message, "history": history or []}
         )
+        logger.info(
+            "intent classified",
+            source_hint=decision.source_hint,
+            needs_search=decision.needs_search,
+        )
+        return decision
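
The `structured_chain` constructor argument exists so the router can be exercised without Azure credentials. A minimal sketch of that idea, using only the standard library: `FakeChain`, `Decision`, and `Router` are hypothetical stand-ins for a real `Runnable`, `IntentRouterDecision`, and `OrchestratorAgent`, and the sample query text is made up.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Decision:
    """Simplified stand-in for IntentRouterDecision (hypothetical)."""
    needs_search: bool
    source_hint: str
    rewritten_query: Optional[str] = None


class FakeChain:
    """Hypothetical test double: records inputs and returns a canned decision."""

    def __init__(self, decision: Decision) -> None:
        self.decision = decision
        self.calls: list = []

    async def ainvoke(self, payload: dict) -> Decision:
        self.calls.append(payload)
        return self.decision


class Router:
    """Mirrors the lazy-chain shape of OrchestratorAgent, without logging."""

    def __init__(self, structured_chain: Any = None) -> None:
        self._chain = structured_chain

    def _ensure_chain(self) -> Any:
        if self._chain is None:
            # The real class builds the Azure OpenAI chain at this point.
            raise RuntimeError("no default chain in this sketch")
        return self._chain

    async def classify(self, message: str, history: Optional[list] = None) -> Decision:
        chain = self._ensure_chain()
        return await chain.ainvoke({"message": message, "history": history or []})


fake = FakeChain(Decision(True, "structured", "total revenue by month in 2024"))
decision = asyncio.run(Router(fake).classify("and by month?"))
```

Because the chain is only built inside `_ensure_chain`, constructing the agent never touches settings, which is what keeps the module import-safe.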

src/api/v1/chat.py CHANGED

@@ -1,46 +1,40 @@
 """Chat endpoint with streaming support."""
-import asyncio
 import uuid
 from fastapi import APIRouter, Depends, HTTPException
 from sqlalchemy.ext.asyncio import AsyncSession
 from src.db.postgres.connection import get_db
 from src.db.postgres.models import ChatMessage, MessageSource
-from src.agents.orchestration import orchestrator
-from src.agents.chatbot import chatbot
-from src.rag.retriever import retriever
-from src.rag.base import RetrievalResult
-from src.query.query_executor import query_executor
-from src.query.base import QueryResult
 from src.db.redis.connection import get_redis
-from src.config.settings import settings
 from src.middlewares.logging import get_logger, log_execution
-from sse_starlette.sse import EventSourceResponse
-from langchain_core.messages import HumanMessage, AIMessage
-from sqlalchemy import select
-from pydantic import BaseModel
-from typing import List, Dict, Any, Optional
-import json
 _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
 _GOODBYES = frozenset(["bye", "goodbye", "thanks", "thank you", "terima kasih", "sampai jumpa"])
-def _fast_intent(message: str) -> Optional[dict]:
-    """Bypass LLM orchestrator for obvious greetings and farewells."""
     lower = message.lower().strip().rstrip("!.,?")
     if lower in _GREETINGS:
-        return {"intent": "greeting", "needs_search": False,
-                "direct_response": "Hello! How can I assist you today?", "search_query": ""}
     if lower in _GOODBYES:
-        return {"intent": "goodbye", "needs_search": False,
-                "direct_response": "Goodbye! Have a great day!", "search_query": ""}
     return None
-logger = get_logger("chat_api")
-router = APIRouter(prefix="/api/v1", tags=["Chat"])
 class ChatRequest(BaseModel):
     user_id: str
@@ -48,66 +42,6 @@ class ChatRequest(BaseModel):
     message: str
-def _format_context(results: List[RetrievalResult]) -> str:
-    """Format retrieval results as context string for the LLM."""
-    lines = []
-    for result in results:
-        data = result.metadata.get("data", {})
-        filename = data.get("filename", "Unknown")
-        page = data.get("page_label")
-        source_label = f"{filename}, p.{page}" if page else filename
-        lines.append(f"[Source: {source_label}]\n{result.content}\n")
-    return "\n".join(lines)
-def _extract_sources(results: List[RetrievalResult]) -> List[Dict[str, Any]]:
-    """Extract deduplicated source references from retrieval results."""
-    seen = set()
-    sources = []
-    for result in results:
-        meta = result.metadata
-        data = meta.get("data", {})
-        if "document_id" in data:
-            key = (data.get("document_id"), data.get("page_label"))
-            if key not in seen:
-                seen.add(key)
-                sources.append({
-                    "document_id": data.get("document_id"),
-                    "filename": data.get("filename", "Unknown"),
-                    "page_label": data.get("page_label", "Unknown"),
-                })
-        else:
-            key = (data.get("table_name"), data.get("column_name"))
-            if key not in seen:
-                seen.add(key)
-                table_name = data.get("table_name")
-                user_id = meta.get("user_id")
-                sources.append({
-                    "document_id": f"{user_id}_{table_name}",
-                    "filename": data.get("table_name", "Unknown"),
-                    "page_label": data.get("column_name", "Unknown"),
-                })
-    logger.debug(f"Extracted sources: {sources}")
-    return sources
-def _format_query_results(results: list[QueryResult]) -> str:
-    if not results:
-        return ""
-    lines = []
-    for r in results:
-        name = r.metadata.get("client_name", r.source_id)
-        lines.append(f"[Query result — {name}, tables: {r.table_or_file}]")
-        lines.append(f"SQL: {r.metadata.get('sql', '')}")
-        if r.columns and r.rows:
-            lines.append(" | ".join(r.columns))
-            for row in r.rows[:20]:
-                lines.append(" | ".join(str(row.get(c, "")) for c in r.columns))
-        lines.append(f"({r.row_count} rows total)\n")
-    return "\n".join(lines)
 async def get_cached_response(redis, cache_key: str) -> Optional[str]:
     cached = await redis.get(cache_key)
     if cached:
@@ -163,13 +97,15 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
     """Chat endpoint with streaming response.
     SSE event sequence:
-      1. sources  — JSON array of {document_id, filename, page_label}
       2. chunk    — text fragments of the answer
       3. done     — signals end of stream
     """
     redis = await get_redis()
     cache_key = f"{settings.redis_prefix}chat:{request.room_id}:{request.message}"
     cached = await get_cached_response(redis, cache_key)
     if cached:
         logger.info("Returning cached response")
@@ -183,96 +119,43 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
         return EventSourceResponse(stream_cached())
     try:
-        # Step 1: Fast local intent check (skips LLM for greetings/farewells)
-        intent_result = _fast_intent(request.message)
-        context = ""
-        sources: List[Dict[str, Any]] = []
-        if intent_result is None:
-            # Step 2: Launch retrieval and history loading in parallel, then run orchestrator.
-            # k=5
-            # tables — db_executor's FK expansion is one-hop and cannot bridge
-            # 2-hop gaps (e.g. customers -> order_items -> products) on its own.
-            retrieval_task = asyncio.create_task(
-                retriever.retrieve(request.message, request.user_id, db, k=5)
-            )
-            history_task = asyncio.create_task(
-                load_history(db, request.room_id, limit=6)  # 6 msgs (3 pairs) for orchestrator
-            )
-            history = await history_task  # fast DB query (<100ms), done before orchestrator finishes
-            intent_result = await orchestrator.analyze_message(request.message, history)
-            search_query = intent_result.get("search_query", request.message) or request.message
-            if not intent_result.get("needs_search"):
-                retrieval_task.cancel()
-                try:
-                    await retrieval_task
-                except asyncio.CancelledError:
-                    pass
-                raw_results = []
-            else:
-                logger.info(f"Searching for: {search_query}")
-                if search_query != request.message:
-                    retrieval_task.cancel()
-                    try:
-                        await retrieval_task
-                    except asyncio.CancelledError:
-                        pass
-                    raw_results = await retriever.retrieve(
-                        query=search_query,
-                        user_id=request.user_id,
-                        db=db,
-                        k=5,
-                        source_hint=intent_result.get("source_hint", "both"),
-                    )
-                else:
-                    raw_results = await retrieval_task
-            context = _format_context(raw_results)
-            sources = _extract_sources(raw_results)
-            source_hint = intent_result.get("source_hint", "both")
-            if source_hint in ("schema", "both"):
-                # Use search_query (orchestrator's standalone rewrite) so follow-up
-                # messages like "dive deeper" or "show me last year" resolve correctly.
-                # For first-turn questions search_query == request.message, so no change.
-                query_results = await query_executor.execute(
-                    results=raw_results,
-                    user_id=request.user_id,
-                    db=db,
-                    question=search_query,
-                )
-                query_context = _format_query_results(query_results)
-                if query_context:
-                    context = query_context + "\n\n" + context
-        # Step 3: Direct response for greetings / non-document intents
-        if intent_result.get("direct_response"):
-            response = intent_result["direct_response"]
-            await cache_response(redis, cache_key, response)
-            await save_messages(db, request.room_id, request.message, response, sources=[])
             async def stream_direct():
                 yield {"event": "sources", "data": json.dumps([])}
-                yield {"event": "message", "data": response}
             return EventSourceResponse(stream_direct())
-        # Step 4: Stream answer token-by-token as LLM generates it
-        # Load full history (10 msgs) for chatbot — richer context than the 6 used by orchestrator
-        full_history = await load_history(db, request.room_id, limit=10)
-        messages = full_history + [HumanMessage(content=request.message)]
         async def stream_response():
             full_response = ""
-            yield {"event": "sources", "data": json.dumps(sources)}
-            async for token in chatbot.astream_response(messages, context):
-                full_response += token
-                yield {"event": "chunk", "data": token}
-            yield {"event": "done", "data": ""}
-            await cache_response(redis, cache_key, full_response)
-            await save_messages(db, request.room_id, request.message, full_response, sources=sources)
         return EventSourceResponse(stream_response())

 """Chat endpoint with streaming support."""
 import uuid
+import json
+from typing import List, Dict, Any, Optional
 from fastapi import APIRouter, Depends, HTTPException
+from langchain_core.messages import HumanMessage, AIMessage
+from pydantic import BaseModel
+from sqlalchemy import select
 from sqlalchemy.ext.asyncio import AsyncSession
+from sse_starlette.sse import EventSourceResponse
+from src.agents.chat_handler import ChatHandler
+from src.config.settings import settings
 from src.db.postgres.connection import get_db
 from src.db.postgres.models import ChatMessage, MessageSource
 from src.db.redis.connection import get_redis
 from src.middlewares.logging import get_logger, log_execution
+logger = get_logger("chat_api")
+router = APIRouter(prefix="/api/v1", tags=["Chat"])
 _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
 _GOODBYES = frozenset(["bye", "goodbye", "thanks", "thank you", "terima kasih", "sampai jumpa"])
+def _fast_intent(message: str) -> Optional[str]:
+    """Return a direct response for obvious greetings/farewells, else None."""
     lower = message.lower().strip().rstrip("!.,?")
     if lower in _GREETINGS:
+        return "Hello! How can I assist you today?"
     if lower in _GOODBYES:
+        return "Goodbye! Have a great day!"
     return None
 class ChatRequest(BaseModel):
     user_id: str
     message: str
 async def get_cached_response(redis, cache_key: str) -> Optional[str]:
     cached = await redis.get(cache_key)
     if cached:
     """Chat endpoint with streaming response.
     SSE event sequence:
+      1. sources  — JSON array of source refs from ChatHandler (table for
+                    structured; deduped document_id/page_label for unstructured)
       2. chunk    — text fragments of the answer
       3. done     — signals end of stream
     """
     redis = await get_redis()
     cache_key = f"{settings.redis_prefix}chat:{request.room_id}:{request.message}"
+    # Redis cache hit
     cached = await get_cached_response(redis, cache_key)
     if cached:
         logger.info("Returning cached response")
         return EventSourceResponse(stream_cached())
     try:
+        # Fast intent: greetings/farewells bypass LLM entirely
+        direct = _fast_intent(request.message)
+        if direct:
+            await cache_response(redis, cache_key, direct)
+            await save_messages(db, request.room_id, request.message, direct, sources=[])
             async def stream_direct():
                 yield {"event": "sources", "data": json.dumps([])}
+                yield {"event": "chunk", "data": direct}
+                yield {"event": "done", "data": ""}
             return EventSourceResponse(stream_direct())
+        history = await load_history(db, request.room_id, limit=10)
+        handler = ChatHandler()
         async def stream_response():
             full_response = ""
+            sources: List[Dict[str, Any]] = []
+            async for event in handler.handle(request.message, request.user_id, history):
+                if event["event"] == "sources":
+                    try:
+                        sources = json.loads(event["data"]) or []
+                    except (TypeError, ValueError):
+                        sources = []
+                    yield event
+                elif event["event"] == "chunk":
+                    full_response += event["data"]
+                    yield event
+                elif event["event"] == "done":
+                    await cache_response(redis, cache_key, full_response)
+                    await save_messages(db, request.room_id, request.message, full_response, sources=sources)
+                    yield event
+                elif event["event"] == "error":
+                    yield event
+                    return
+                # "intent" event: consumed internally, not forwarded to frontend
         return EventSourceResponse(stream_response())
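
The new `stream_response` loop forwards `sources`, `chunk`, `done`, and `error` events and silently drops anything else (such as `intent`). A self-contained sketch of that filtering, assuming a hypothetical `fake_handler` in place of `ChatHandler.handle()` and a plain dict in place of the Redis cache and `save_messages`; the `{"table": "orders"}` source payload is invented for illustration.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Dict


async def fake_handler() -> AsyncIterator[Dict[str, str]]:
    """Hypothetical stand-in for ChatHandler.handle()."""
    yield {"event": "intent", "data": "structured"}
    yield {"event": "sources", "data": json.dumps([{"table": "orders"}])}
    yield {"event": "chunk", "data": "Total: "}
    yield {"event": "chunk", "data": "42"}
    yield {"event": "done", "data": ""}


async def forward(events, saved: Dict[str, Any]):
    """Forward client-facing events; persist the answer only on 'done'."""
    full_response = ""
    sources: list = []
    async for event in events:
        kind = event["event"]
        if kind == "sources":
            try:
                sources = json.loads(event["data"]) or []
            except (TypeError, ValueError):
                sources = []
            yield event
        elif kind == "chunk":
            full_response += event["data"]
            yield event
        elif kind == "done":
            # Stands in for cache_response(...) and save_messages(...).
            saved["response"] = full_response
            saved["sources"] = sources
            yield event
        elif kind == "error":
            yield event
            return
        # Any other event (e.g. "intent") is consumed and not forwarded.


async def collect():
    saved: Dict[str, Any] = {}
    forwarded = [e["event"] async for e in forward(fake_handler(), saved)]
    return forwarded, saved


forwarded, saved = asyncio.run(collect())
```

Persisting only on `done` means a stream that ends with `error` leaves nothing in the cache or message history, which matches the early `return` in the diff.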

src/api/v1/data_catalog.py ADDED

	@@ -0,0 +1,100 @@

+"""API endpoints for the per-user data catalog index.
+The index is a lightweight summary of every structured source registered
+by a user (DB connections and tabular files). It is intended to be
+consumed by the catalog refresher and by frontend listings — full
+catalog payloads (tables + columns + samples + stats) are not exposed
+here on purpose.
+"""
+from typing import List
+from fastapi import APIRouter, HTTPException, Query, status
+from src.catalog.store import CatalogStore
+from src.middlewares.logging import get_logger, log_execution
+from src.models.api.catalog import CatalogIndexEntry
+from src.pipeline.triggers import on_catalog_rebuild_requested
+logger = get_logger("data_catalog_api")
+router = APIRouter(prefix="/api/v1", tags=["Data Catalog"])
+@router.get(
+    "/data-catalog/{user_id}",
+    response_model=List[CatalogIndexEntry],
+    summary="List the user's data catalog index",
+    response_description="One entry per registered structured source.",
+    responses={
+        200: {"description": "Returns an empty list if the user has no registered sources."},
+        500: {"description": "Internal server error while reading the catalog."},
+    },
+)
+@log_execution(logger)
+async def list_data_catalog_index(user_id: str):
+    """
+    Return a lightweight index of every structured source registered by the user.
+    One entry per source (DB connection or tabular file), including the
+    `source_id`, `source_type`, display `name`, `location_ref`, current
+    `table_count`, and `updated_at` timestamp.
+    Used by the catalog refresher to decide which sources need to be
+    rebuilt. Returns an empty list if the user has no catalog yet.
+    """
+    try:
+        catalog = await CatalogStore().get(user_id)
+    except Exception as e:
+        logger.error("Failed to read catalog index", user_id=user_id, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Failed to read catalog index: {e}",
+        )
+    if catalog is None:
+        return []
+    return [
+        CatalogIndexEntry(
+            source_id=s.source_id,
+            source_type=s.source_type,
+            name=s.name,
+            location_ref=s.location_ref,
+            table_count=len(s.tables),
+            updated_at=s.updated_at,
+        )
+        for s in catalog.sources
+    ]
+@router.post(
+    "/data-catalog/rebuild",
+    status_code=status.HTTP_200_OK,
+    summary="Rebuild the catalog for a user",
+    response_description="Confirmation that the rebuild completed.",
+    responses={
+        200: {"description": "Rebuild completed. Per-source errors are logged but do not fail this request."},
+        500: {"description": "Unexpected error before the rebuild loop started."},
+    },
+)
+@log_execution(logger)
+async def rebuild_data_catalog(
+    user_id: str = Query(..., description="ID of the user whose catalog should be rebuilt."),
+):
+    """
+    Re-introspect every source in the user's catalog and upsert the results.
+    Each source (DB connection or tabular file) is processed independently.
+    A failure on one source is logged but does not abort the remaining sources.
+    If the user has no catalog yet, this is a no-op and still returns success.
+    """
+    try:
+        await on_catalog_rebuild_requested(user_id)
+    except Exception as e:
+        logger.error("catalog rebuild failed", user_id=user_id, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Catalog rebuild failed: {e}",
+        )
+    return {"status": "success", "user_id": user_id}
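
The rebuild docstring says a failure on one source is logged without aborting the rest. `on_catalog_rebuild_requested` is not shown in this diff, so the following is only a sketch of that isolation pattern; `rebuild_one`, `rebuild_all`, and the source IDs are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("catalog_rebuild_sketch")


async def rebuild_one(source_id: str) -> str:
    """Hypothetical per-source rebuild; one source fails to show isolation."""
    if source_id == "db_broken":
        raise ConnectionError("cannot reach database")
    return source_id


async def rebuild_all(source_ids):
    """Process every source independently; collect failures instead of raising."""
    rebuilt, failed = [], []
    for sid in source_ids:
        try:
            rebuilt.append(await rebuild_one(sid))
        except Exception as exc:
            logger.error("rebuild failed for %s: %s", sid, exc)
            failed.append(sid)
    return rebuilt, failed


rebuilt, failed = asyncio.run(rebuild_all(["db_a", "db_broken", "file_b"]))
```

With this shape the endpoint's 500 response is reserved for errors raised outside the per-source loop, as the `responses` table above describes.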

src/api/v1/db_client.py CHANGED

@@ -27,8 +27,7 @@ from src.models.credentials import (  # noqa: F401 — re-exported for Swagger s
     SqlServerCredentials,
     SupabaseCredentials,
 )
-from src.pipeline.db_pipeline import db_pipeline_service
-from src.utils.db_credential_encryption import decrypt_credentials_dict
 logger = get_logger("database_client_api")
@@ -407,20 +406,22 @@ async def delete_database_client(
         raise HTTPException(status_code=403, detail="Access denied")
     await database_client_service.delete(db, client_id)
     return {"status": "success", "message": "Database client deleted successfully"}
 @router.post(
     "/database-clients/{client_id}/ingest",
     status_code=status.HTTP_200_OK,
-    summary="Ingest schema from a registered database into the vector store",
-    response_description="Count of chunks ingested.",
     responses={
-        200: {"description": "Ingestion completed successfully."},
         403: {"description": "Access denied — user_id does not own this connection."},
         404: {"description": "Connection not found."},
-        501: {"description": "The connection's db_type is not yet supported by the pipeline."},
-        500: {"description": "Ingestion failed (connection error, profiling error, etc.)."},
     },
 )
 @limiter.limit("5/minute")
@@ -432,11 +433,9 @@ async def ingest_database_client(
     db: AsyncSession = Depends(get_db),
 ):
     """
-    Decrypt the stored credentials, connect to the user's database, introspect
-    its schema, profile each column, embed the descriptions, and store them in
-    the shared PGVector collection tagged with `source_type="database"`.
-    Chunks become retrievable via the same retriever used for document chunks.
     """
     client = await database_client_service.get(db, client_id)
@@ -453,21 +452,12 @@ async def ingest_database_client(
         )
     try:
-        creds = decrypt_credentials_dict(client.credentials)
-        with db_pipeline_service.engine_scope(
-            db_type=client.db_type,
-            credentials=creds,
-        ) as engine:
-            total = await db_pipeline_service.run(user_id=user_id, client_id=client_id, engine=engine)
-    except NotImplementedError as e:
-        raise HTTPException(status_code=status.HTTP_501_NOT_IMPLEMENTED, detail=str(e))
     except Exception as e:
-        logger.error(
-            f"Ingestion failed for client {client_id}", user_id=user_id, error=str(e)
-        )
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail=f"Ingestion failed: {e}",
         )
-    return {"status": "success", "client_id": client_id, "chunks_ingested": total}

     SqlServerCredentials,
     SupabaseCredentials,
 )
+from src.pipeline.triggers import on_db_registered
 logger = get_logger("database_client_api")
         raise HTTPException(status_code=403, detail="Access denied")
     await database_client_service.delete(db, client_id)
+    from src.pipeline.triggers import on_db_deleted
+    await on_db_deleted(client_id, user_id)
     return {"status": "success", "message": "Database client deleted successfully"}
 @router.post(
     "/database-clients/{client_id}/ingest",
     status_code=status.HTTP_200_OK,
+    summary="Build the catalog for a registered database connection",
+    response_description="Confirmation that the catalog was built.",
     responses={
+        200: {"description": "Catalog built successfully."},
         403: {"description": "Access denied — user_id does not own this connection."},
         404: {"description": "Connection not found."},
+        409: {"description": "Connection is inactive."},
+        500: {"description": "Catalog build failed."},
     },
 )
 @limiter.limit("5/minute")
     db: AsyncSession = Depends(get_db),
 ):
     """
+    Introspect the registered database and build (or rebuild) the catalog entry
+    for this connection. The catalog is stored in `data_catalog` and used by
+    the query pipeline to plan structured queries.
     """
     client = await database_client_service.get(db, client_id)
         )
     try:
+        await on_db_registered(client_id, user_id)
     except Exception as e:
+        logger.error("catalog build failed", client_id=client_id, error=str(e))
         raise HTTPException(
             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Catalog build failed: {e}",
         )
+    return {"status": "success", "client_id": client_id}

src/api/v1/document.py CHANGED

@@ -6,7 +6,7 @@ from src.db.postgres.connection import get_db
 from src.document.document_service import document_service
 from src.middlewares.logging import get_logger, log_execution
 from src.middlewares.rate_limit import limiter
-from src.pipeline.document_pipeline.document_pipeline import document_pipeline
 from pydantic import BaseModel
 from typing import List
@@ -24,7 +24,7 @@ class DocumentResponse(BaseModel):
     created_at: str
-# NOTE: Keep in sync with SUPPORTED_FILE_TYPES in src/pipeline/document_pipeline/document_pipeline.py
 _DOC_TYPES = [
     {"doc_type": "pdf", "max_size": 10, "status": "active", "message": None},
     {"doc_type": "docx", "max_size": 10, "status": "active", "message": None},
@@ -92,6 +92,8 @@ async def delete_document(
 ):
     """Delete a document."""
     await document_pipeline.delete(document_id, user_id, db)
     return {"status": "success", "message": "Document deleted successfully"}
@@ -104,5 +106,13 @@ async def process_document(
 ):
     """Process document and ingest to vector index."""
     data = await document_pipeline.process(document_id, user_id, db)
     return {"status": "success", "message": "Document processed successfully", "data": data}

 from src.document.document_service import document_service
 from src.middlewares.logging import get_logger, log_execution
 from src.middlewares.rate_limit import limiter
+from src.pipeline.document_pipeline import document_pipeline
 from pydantic import BaseModel
 from typing import List
     created_at: str
+# NOTE: Keep in sync with SUPPORTED_FILE_TYPES in src/pipeline/document_pipeline.py
 _DOC_TYPES = [
     {"doc_type": "pdf", "max_size": 10, "status": "active", "message": None},
     {"doc_type": "docx", "max_size": 10, "status": "active", "message": None},
 ):
     """Delete a document."""
     await document_pipeline.delete(document_id, user_id, db)
+    from src.pipeline.triggers import on_tabular_deleted
+    await on_tabular_deleted(document_id, user_id)
     return {"status": "success", "message": "Document deleted successfully"}
 ):
     """Process document and ingest to vector index."""
     data = await document_pipeline.process(document_id, user_id, db)
+    if data["file_type"] in ("csv", "xlsx"):
+        from src.pipeline.triggers import on_tabular_uploaded
+        try:
+            await on_tabular_uploaded(document_id, user_id)
+        except Exception as e:
+            logger.error("catalog ingestion failed after process", document_id=document_id, error=str(e))
     return {"status": "success", "message": "Document processed successfully", "data": data}

src/api/v1/knowledge.py DELETED

@@ -1,25 +0,0 @@
-"""Knowledge base management API endpoints."""
-from fastapi import APIRouter, Depends
-from sqlalchemy.ext.asyncio import AsyncSession
-from src.db.postgres.connection import get_db
-from src.middlewares.logging import get_logger, log_execution
-logger = get_logger("knowledge_api")
-router = APIRouter(prefix="/api/v1", tags=["Knowledge"])
-@router.post("/knowledge/rebuild")
-@log_execution(logger)
-async def rebuild_vector_index(
-    user_id: str,
-    db: AsyncSession = Depends(get_db)
-):
-    """Rebuild vector index for a user (admin endpoint)."""
-    # This would re-process all documents
-    # For POC, we'll skip this complexity
-    return {
-        "status": "success",
-        "message": "Vector index rebuild initiated"
-    }

src/catalog/README.md ADDED

	@@ -0,0 +1,6 @@

+# catalog
+Per-user data catalog: identity layer for structured sources (DB schemas + tabular files).
+Holds AI-enriched table/column descriptions, consumed by `query/planner` to generate JSON IR.
+See `ARCHITECTURE.md` (root) for the full design.

src/catalog/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Catalog domain — per-user data catalog (Cs + Ct)."""

src/catalog/introspect/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Source-specific schema introspection (databases, tabular files)."""

src/catalog/introspect/base.py ADDED

	@@ -0,0 +1,18 @@

+"""BaseIntrospector — contract for source-specific schema readers.
+Subclasses produce a Source object with raw schema (names, types, sample
+values, stats). The planner consumes this directly — descriptions are not
+LLM-generated.
+"""
+from abc import ABC, abstractmethod
+from ..models import Source
+class BaseIntrospector(ABC):
+    """Abstract base. Subclasses: DatabaseIntrospector, TabularIntrospector."""
+    @abstractmethod
+    async def introspect(self, location_ref: str) -> Source:
+        ...
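
To illustrate the contract, here is a sketch of a subclass that satisfies `introspect`. The `Source` dataclass is a simplified stand-in for `catalog.models.Source` (its real fields are not shown in this part of the diff), and `InMemoryIntrospector` is a hypothetical subclass that reads schemas from a dict rather than a database or file.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Source:
    """Simplified stand-in for catalog.models.Source (fields are assumptions)."""
    source_id: str
    source_type: str
    name: str
    location_ref: str
    tables: list = field(default_factory=list)


class BaseIntrospector(ABC):
    """Same shape as the abstract base in the diff."""

    @abstractmethod
    async def introspect(self, location_ref: str) -> Source:
        ...


class InMemoryIntrospector(BaseIntrospector):
    """Hypothetical subclass: looks up table names in a dict."""

    def __init__(self, schemas: dict) -> None:
        self._schemas = schemas

    async def introspect(self, location_ref: str) -> Source:
        tables = self._schemas[location_ref]
        return Source(
            source_id=f"mem_{location_ref}",
            source_type="memory",
            name=location_ref,
            location_ref=location_ref,
            tables=list(tables),
        )


introspector = InMemoryIntrospector({"demo": ["orders", "customers"]})
source = asyncio.run(introspector.introspect("demo"))
```

Because `introspect` is abstract, instantiating `BaseIntrospector` directly raises `TypeError`, so every concrete reader has to provide it.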

src/catalog/introspect/database.py ADDED

	@@ -0,0 +1,246 @@

+"""Database schema introspection (Postgres / MySQL / Supabase).
+Reads information_schema for tables/columns/types, samples ~100 rows per table
+for `sample_values` and basic stats. Description fields are left empty —
+the planner relies on names + samples + stats directly.
+Reuses Phase 1 utilities (`database_client_service`, `db_credential_encryption`,
+`db_pipeline_service.engine_scope`, `extractor.get_schema/profile_column/get_row_count`)
+to avoid reimplementation. The cleanup PR will move those into `security/` and
+`pipeline/db_pipeline/` respectively.
+"""
+import asyncio
+import hashlib
+from datetime import UTC, datetime
+from decimal import Decimal
+from typing import Any
+from src.database_client.database_client_service import database_client_service
+from src.db.postgres.connection import AsyncSessionLocal
+from src.middlewares.logging import get_logger
+from src.pipeline.db_pipeline import db_pipeline_service
+from src.pipeline.db_pipeline.extractor import (
+    get_row_count,
+    get_schema,
+    profile_column,
+)
+from src.utils.db_credential_encryption import decrypt_credentials_dict
+from ..models import Column, ColumnStats, DataType, ForeignKey, Source, Table
+from ..pii_detector import PIIDetector
+from .base import BaseIntrospector
+logger = get_logger("db_introspector")
+_DBCLIENT_PREFIX = "dbclient://"
+def _stable_id(prefix: str, *parts: str) -> str:
+    """Deterministic short ID from joined parts: the same parts always yield
+    the same ID, so cached IRs keep resolving across re-introspection. The ID
+    changes if any part (e.g. a renamed table or column) changes.
+    Hash is non-cryptographic (identifier only).
+    """
+    h = hashlib.sha1(
+        "/".join(parts).encode("utf-8"), usedforsecurity=False
+    ).hexdigest()[:12]
+    return f"{prefix}{h}"
+def _map_sql_type(sql_type: str) -> DataType:
+    """Map a stringified SQLAlchemy type to a Catalog DataType.
+    Matches on substring of the SQLAlchemy type repr (e.g. 'INTEGER',
+    'TIMESTAMP', 'BOOLEAN'). Conservative — unknowns fall back to "string"
+    so the column is at least addressable.
+    """
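+    s = sql_type.upper()
+    # "INT" is also a substring of INTERVAL and POINT; keep those as strings.
+    if "INTERVAL" in s or "POINT" in s:
+        return "string"
+    if "INT" in s:
+        return "int"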
+    if "FLOAT" in s or "NUMERIC" in s or "DECIMAL" in s or "REAL" in s or "DOUBLE" in s:
+        return "decimal"
+    if "BOOL" in s:
+        return "bool"
+    if "TIMESTAMP" in s or "DATETIME" in s:
+        return "datetime"
+    if "DATE" in s:
+        return "date"
+    if "JSON" in s:
+        return "json"
+    return "string"
+def _normalize(v: Any) -> Any:
+    """Coerce non-JSON-native scalars (Decimal, numpy, datetime) to types
+    that survive the jsonb round-trip when the catalog is persisted.
+    """
+    if v is None:
+        return None
+    if isinstance(v, Decimal):
+        return float(v)
+    try:
+        import numpy as np
+        if isinstance(v, np.generic):
+            return v.item()
+    except ImportError:
+        pass
+    if isinstance(v, datetime):
+        return v.isoformat()
+    return v
+class DatabaseIntrospector(BaseIntrospector):
+    """Connect to user DB → read information_schema → sample 100 rows/table."""
+    def __init__(self) -> None:
+        self._pii = PIIDetector()
+    async def introspect(self, location_ref: str) -> Source:
+        if not location_ref.startswith(_DBCLIENT_PREFIX):
+            raise ValueError(
+                f"DatabaseIntrospector expects 'dbclient://...' location_ref, "
+                f"got {location_ref!r}"
+            )
+        client_id = location_ref[len(_DBCLIENT_PREFIX):]
+        if not client_id:
+            raise ValueError("location_ref is missing client_id after 'dbclient://'")
+        async with AsyncSessionLocal() as session:
+            client = await database_client_service.get(session, client_id)
+        if client is None:
+            raise ValueError(f"DatabaseClient {client_id!r} not found")
+        creds = decrypt_credentials_dict(client.credentials)
+        logger.info(
+            "introspecting db source",
+            client_id=client_id,
+            db_type=client.db_type,
+            name=client.name,
+        )
+        # SQLAlchemy inspect() + pandas read_sql are synchronous — run in a
+        # threadpool so the event loop stays free.
+        tables: list[Table] = await asyncio.to_thread(
+            self._introspect_sync, client.db_type, creds
+        )
+        return Source(
+            source_id=client_id,
+            source_type="schema",
+            name=client.name,
+            location_ref=location_ref,
+            updated_at=datetime.now(UTC),
+            tables=tables,
+        )
+    def _introspect_sync(self, db_type: str, creds: dict) -> list[Table]:
+        with db_pipeline_service.engine_scope(db_type, creds) as engine:
+            schema = get_schema(engine)
+            tables: list[Table] = []
+            for table_name, cols in schema.items():
+                try:
+                    row_count = get_row_count(engine, table_name)
+                except Exception as e:
+                    logger.error(
+                        "row_count failed; skipping table",
+                        table=table_name,
+                        error=str(e),
+                    )
+                    continue
+                columns: list[Column] = []
+                for col in cols:
+                    try:
+                        profile = profile_column(
+                            engine,
+                            table_name,
+                            col["name"],
+                            col.get("is_numeric", False),
+                            row_count,
+                            is_temporal=col.get("is_temporal", False),
+                        )
+                    except Exception as e:
+                        logger.error(
+                            "profile_column failed; skipping column",
+                            table=table_name,
+                            column=col["name"],
+                            error=str(e),
+                        )
+                        continue
+                    columns.append(self._to_column(table_name, col, profile))
+                foreign_keys = self._extract_foreign_keys(table_name, cols)
+                tables.append(
+                    Table(
+                        table_id=_stable_id("t_", table_name),
+                        name=table_name,
+                        row_count=row_count,
+                        columns=columns,
+                        foreign_keys=foreign_keys,
+                    )
+                )
+        return tables
+    @staticmethod
+    def _extract_foreign_keys(
+        table_name: str, cols: list[dict[str, Any]]
+    ) -> list[ForeignKey]:
+        """Convert extractor's `foreign_key: 'target_table.target_col'` strings
+        into ForeignKey objects with stable IDs (derived deterministically from
+        names — same scheme used to generate table_id / column_id elsewhere).
+        """
+        fks: list[ForeignKey] = []
+        for col in cols:
+            fk_str = col.get("foreign_key")
+            if not fk_str:
+                continue
+            target_table, _, target_col = fk_str.partition(".")
+            if not target_table or not target_col:
+                continue
+            fks.append(
+                ForeignKey(
+                    column_id=_stable_id("c_", table_name, col["name"]),
+                    target_table_id=_stable_id("t_", target_table),
+                    target_column_id=_stable_id("c_", target_table, target_col),
+                )
+            )
+        return fks
+    def _to_column(
+        self, table_name: str, col: dict[str, Any], profile: dict[str, Any]
+    ) -> Column:
+        name = col["name"]
+        sample_values: list[Any] | None = [
+            _normalize(v) for v in (profile.get("sample_values") or [])
+        ] or None
+        top_raw = profile.get("top_values") or []
+        top_values: list[Any] | None = [
+            _normalize(v) for v, _cnt in top_raw
+        ] or None
+        column = Column(
+            column_id=_stable_id("c_", table_name, name),
+            name=name,
+            data_type=_map_sql_type(str(col["type"])),
+            nullable=True,  # nullable not surfaced by extractor; default permissive
+            pii_flag=False,
+            sample_values=sample_values,
+            stats=ColumnStats(
+                min=_normalize(profile.get("min")),
+                max=_normalize(profile.get("max")),
+                mean=_normalize(profile.get("mean")),
+                median=_normalize(profile.get("median")),
+                distinct_count=profile.get("distinct_count"),
+                top_values=top_values,
+            ),
+        )
+        if self._pii.detect(column):
+            return column.model_copy(update={"pii_flag": True, "sample_values": None})
+        return column
+database_introspector = DatabaseIntrospector()

src/catalog/introspect/tabular.py ADDED

	@@ -0,0 +1,239 @@

+"""Tabular file schema introspection (Parquet / CSV / XLSX).
+Loads the whole file into a DataFrame and keeps up to 3 non-null sample
+values per column. For XLSX, each sheet becomes a Table.
+Files are expected to live in Azure Blob (location_ref like az_blob://{user_id}/{document_id}).
+Table.name convention (executor contract)
+-----------------------------------------
+  CSV / Parquet  → Table.name = filename stem (e.g. "sales_data").
+                   Parquet blob was uploaded without a sheet suffix, so the
+                   executor must call parquet_blob_name(uid, did, sheet_name=None).
+  XLSX           → Table.name = sheet_name (e.g. "Sheet1").
+                   Executor calls parquet_blob_name(uid, did, table.name).
+"""
+import asyncio
+import hashlib
+from collections.abc import Callable, Coroutine
+from datetime import UTC, datetime
+from io import BytesIO
+from pathlib import Path
+from typing import Any
+import pandas as pd
+from src.middlewares.logging import get_logger
+from ..models import Column, ColumnStats, DataType, Source, Table
+from ..pii_detector import PIIDetector
+from .base import BaseIntrospector
+logger = get_logger("tabular_introspector")
+_AZ_BLOB_PREFIX = "az_blob://"
+def _stable_id(prefix: str, *parts: str) -> str:
+    h = hashlib.sha1(
+        "/".join(parts).encode("utf-8"), usedforsecurity=False
+    ).hexdigest()[:12]
+    return f"{prefix}{h}"
+def _map_pandas_type(dtype: Any) -> DataType:
+    s = str(dtype).lower()
+    if "int" in s:
+        return "int"
+    if "float" in s or "decimal" in s:
+        return "decimal"
+    if "bool" in s:
+        return "bool"
+    if "datetime" in s:
+        return "datetime"
+    if "date" in s:
+        return "date"
+    return "string"
+def _normalize(v: Any) -> Any:
+    """Coerce non-JSON-native scalars to types that survive the jsonb round-trip."""
+    if v is None:
+        return None
+    try:
+        import numpy as np
+        if isinstance(v, np.generic):
+            return v.item()
+    except ImportError:
+        pass
+    if isinstance(v, datetime):
+        return v.isoformat()
+    return v
+class TabularIntrospector(BaseIntrospector):
+    """Read column names, dtypes, and sample values from Parquet/CSV/XLSX.
+    Heavy I/O dependencies (`fetch_doc`, `fetch_blob`) are injectable so unit
+    tests can pass mocks without triggering Settings or DB construction.
+    """
+    def __init__(
+        self,
+        fetch_doc: Callable[[str], Coroutine[Any, Any, Any]] | None = None,
+        fetch_blob: Callable[[str], Coroutine[Any, Any, bytes]] | None = None,
+    ) -> None:
+        self._pii = PIIDetector()
+        self._fetch_doc = fetch_doc or self._default_fetch_doc
+        self._fetch_blob = fetch_blob or self._default_fetch_blob
+    @staticmethod
+    async def _default_fetch_doc(document_id: str) -> Any:
+        from sqlalchemy import select
+        from src.db.postgres.connection import AsyncSessionLocal
+        from src.db.postgres.models import Document as DBDocument
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(DBDocument).where(DBDocument.id == document_id)
+            )
+            return result.scalar_one_or_none()
+    @staticmethod
+    async def _default_fetch_blob(blob_name: str) -> bytes:
+        from src.storage.az_blob.az_blob import blob_storage
+        return await blob_storage.download_file(blob_name)
+    async def introspect(self, location_ref: str) -> Source:
+        if not location_ref.startswith(_AZ_BLOB_PREFIX):
+            raise ValueError(
+                f"TabularIntrospector expects 'az_blob://...' location_ref, "
+                f"got {location_ref!r}"
+            )
+        rest = location_ref[len(_AZ_BLOB_PREFIX):]
+        user_id, _, document_id = rest.partition("/")
+        if not user_id or not document_id:
+            raise ValueError(
+                f"location_ref must be 'az_blob://{{user_id}}/{{document_id}}', "
+                f"got {location_ref!r}"
+            )
+        doc = await self._fetch_doc(document_id)
+        if doc is None:
+            raise ValueError(f"Document {document_id!r} not found")
+        logger.info(
+            "introspecting tabular source",
+            document_id=document_id,
+            file_type=doc.file_type,
+            filename=doc.filename,
+        )
+        content = await self._fetch_blob(doc.blob_name)
+        tables: list[Table] = await asyncio.to_thread(
+            self._introspect_sync, content, doc.file_type, doc.filename, document_id
+        )
+        return Source(
+            source_id=document_id,
+            source_type="tabular",
+            name=doc.filename,
+            location_ref=location_ref,
+            updated_at=datetime.now(UTC),
+            tables=tables,
+        )
+    def _introspect_sync(
+        self,
+        content: bytes,
+        file_type: str,
+        filename: str,
+        document_id: str,
+    ) -> list[Table]:
+        if file_type == "csv":
+            df = pd.read_csv(BytesIO(content))
+            return [self._build_table(df, document_id, Path(filename).stem, sheet_name=None)]
+        if file_type == "xlsx":
+            sheets: dict[str, pd.DataFrame] = pd.read_excel(BytesIO(content), sheet_name=None)
+            return [
+                self._build_table(df, document_id, sheet_name, sheet_name=sheet_name)
+                for sheet_name, df in sheets.items()
+            ]
+        if file_type == "parquet":
+            df = pd.read_parquet(BytesIO(content))
+            return [self._build_table(df, document_id, Path(filename).stem, sheet_name=None)]
+        raise ValueError(f"Unsupported file_type {file_type!r} for tabular introspection")
+    def _build_table(
+        self,
+        df: pd.DataFrame,
+        document_id: str,
+        table_name: str,
+        sheet_name: str | None,
+    ) -> Table:
+        id_parts = (document_id, sheet_name) if sheet_name else (document_id,)
+        columns = [
+            self._to_column(df[col], document_id, sheet_name, col)
+            for col in df.columns
+        ]
+        return Table(
+            table_id=_stable_id("t_", *id_parts),
+            name=table_name,
+            row_count=len(df),
+            columns=columns,
+            foreign_keys=[],
+        )
+    def _to_column(
+        self,
+        series: pd.Series,
+        document_id: str,
+        sheet_name: str | None,
+        col_name: str,
+    ) -> Column:
+        id_parts = (
+            (document_id, sheet_name, col_name) if sheet_name else (document_id, col_name)
+        )
+        sample_raw = series.dropna().head(3).tolist()
+        sample_values: list[Any] | None = [_normalize(v) for v in sample_raw] or None
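+        # pandas counts bool as numeric; exclude it so min/max/mean stay unset for bool columns.
+        is_numeric = pd.api.types.is_numeric_dtype(series) and not pd.api.types.is_bool_dtype(series)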
+        is_dt = pd.api.types.is_datetime64_any_dtype(series)
+        non_null = series.dropna()
+        distinct_count = int(series.nunique())
+        top_values = (
+            [_normalize(v) for v in non_null.unique().tolist()]
+            if distinct_count <= 10
+            else None
+        )
+        has_values = len(non_null) > 0
+        wants_range = (is_numeric or is_dt) and has_values
+        wants_mean = is_numeric and has_values
+        stats = ColumnStats(
+            min=_normalize(non_null.min()) if wants_range else None,
+            max=_normalize(non_null.max()) if wants_range else None,
+            mean=float(non_null.mean()) if wants_mean else None,
+            median=float(non_null.median()) if wants_mean else None,
+            distinct_count=distinct_count,
+            top_values=top_values,
+        )
+        column = Column(
+            column_id=_stable_id("c_", *id_parts),
+            name=col_name,
+            data_type=_map_pandas_type(series.dtype),
+            nullable=bool(series.isnull().any()),
+            pii_flag=False,
+            sample_values=sample_values,
+            stats=stats,
+        )
+        if self._pii.detect(column):
+            return column.model_copy(update={"pii_flag": True, "sample_values": None})
+        return column
+tabular_introspector = TabularIntrospector()
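
The `location_ref` parsing in `introspect` can be exercised on its own. A minimal sketch, assuming a hypothetical free function `parse_blob_ref` that reproduces the same prefix check and `partition` split:

```python
AZ_BLOB_PREFIX = "az_blob://"


def parse_blob_ref(location_ref: str) -> tuple[str, str]:
    """Split 'az_blob://{user_id}/{document_id}' into (user_id, document_id)."""
    if not location_ref.startswith(AZ_BLOB_PREFIX):
        raise ValueError(f"expected 'az_blob://...', got {location_ref!r}")
    rest = location_ref[len(AZ_BLOB_PREFIX):]
    # partition splits at the first '/', so anything after it is the document_id.
    user_id, _, document_id = rest.partition("/")
    if not user_id or not document_id:
        raise ValueError(f"missing user_id or document_id in {location_ref!r}")
    return user_id, document_id


parsed = parse_blob_ref("az_blob://u1/doc-42")
```

A ref without a document part, such as `az_blob://u1`, raises `ValueError` rather than returning an empty ID.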

src/catalog/models.py ADDED

	@@ -0,0 +1,86 @@

+"""Pydantic models for the per-user data catalog (Cs + Ct).
+See ARCHITECTURE.md §6 for the full schema definition.
+Source.location_ref URI scheme
+------------------------------
+A `Source` is uniquely addressable by `location_ref`; introspectors and
+executors parse it to find the underlying data:
+  schema sources   → "dbclient://{database_client_id}"
+                     Resolves via `database_client_service.get(...)` which
+                     returns a `DatabaseClient` row whose Fernet-encrypted
+                     credentials are decrypted at runtime.
+  tabular sources  → "az_blob://{user_id}/{document_id}"
+                     The Source aggregates one or more sheets as Tables;
+                     each per-sheet Parquet blob is named via
+                     `parquet_service.parquet_blob_name(user_id, document_id, sheet_name)`,
+                     so executors derive the per-Table blob path from
+                     `Source.location_ref` plus `Table.name`.
+  unstructured     → reserved (deferred — see ARCHITECTURE.md §10 q2).
+"""
+from datetime import datetime
+from typing import Any, Literal
+from pydantic import BaseModel, Field
+SourceType = Literal["schema", "tabular", "unstructured"]
+DataType = Literal["int", "decimal", "string", "datetime", "date", "bool", "json"]
+class ColumnStats(BaseModel):
+    min: Any | None = None
+    max: Any | None = None
+    mean: float | None = None
+    median: float | None = None
+    distinct_count: int | None = None
+    top_values: list[Any] | None = None
+class Column(BaseModel):
+    column_id: str
+    name: str
+    data_type: DataType
+    nullable: bool
+    pii_flag: bool = False
+    sample_values: list[Any] | None = None
+    stats: ColumnStats | None = None
+class ForeignKey(BaseModel):
+    """A FK edge from one column in this table to a column in another table.
+    All references use stable IDs derived from source/table/column names so
+    edges survive renames at the `name` level. The target table must belong
+    to the SAME `Source` — cross-source FKs are not modeled in v1.
+    """
+    column_id: str           # the column in this table that holds the FK
+    target_table_id: str     # referenced table_id, within the same Source
+    target_column_id: str    # referenced column_id
+class Table(BaseModel):
+    table_id: str
+    name: str
+    row_count: int | None = None
+    columns: list[Column]
+    foreign_keys: list[ForeignKey] = Field(default_factory=list)
+class Source(BaseModel):
+    source_id: str
+    source_type: SourceType
+    name: str
+    location_ref: str
+    updated_at: datetime
+    tables: list[Table] = Field(default_factory=list)
+class Catalog(BaseModel):
+    user_id: str
+    schema_version: str = "1.0"
+    generated_at: datetime
+    sources: list[Source] = Field(default_factory=list)

src/catalog/pii_detector.py ADDED

	@@ -0,0 +1,39 @@

+"""PII auto-detection for catalog columns.
+When pii_flag is set True, sample_values is forced to None so real PII
+never enters LLM prompts. Patterns live in src/security/pii_patterns.py.
+"""
+from src.security.pii_patterns import EMAIL_REGEX, PHONE_REGEX, PII_NAME_PATTERNS
+from .models import Column
+class PIIDetector:
+    """Marks columns as pii_flag=True when name or sampled values look sensitive.
+    Bias is intentional: false positives hide harmless sample values,
+    false negatives leak data. When unsure, flag.
+    """
+    def detect(self, column: Column) -> bool:
+        if self._name_matches(column.name):
+            return True
+        if column.sample_values and self._values_match(column.sample_values):
+            return True
+        return False
+    @staticmethod
+    def _name_matches(name: str) -> bool:
+        lowered = name.lower()
+        return any(pat in lowered for pat in PII_NAME_PATTERNS)
+    @staticmethod
+    def _values_match(values: list) -> bool:
+        for v in values:
+            if v is None:
+                continue
+            s = str(v)
+            if EMAIL_REGEX.match(s) or PHONE_REGEX.match(s):
+                return True
+        return False

src/catalog/reader.py ADDED

	@@ -0,0 +1,40 @@

+"""CatalogReader — loads + filters catalog by source_hint.
+For typical users (≤50 tables), returns the FULL catalog with no slicing.
+Catalog-level search can be added later if a catalog grows past that limit.
+"""
+from datetime import UTC, datetime
+from typing import Literal
+from .models import Catalog
+from .store import CatalogStore
+SourceHint = Literal["chat", "unstructured", "structured"]
+class CatalogReader:
+    """Loads the user's catalog and filters by source_hint.
+    On miss, returns an empty Catalog (never raises) — query path is
+    responsible for handling "no data registered yet" gracefully.
+    Returned Catalog is always a copy; the underlying stored catalog
+    is never mutated.
+    """
+    def __init__(self, store: CatalogStore) -> None:
+        self._store = store
+    async def read(self, user_id: str, source_hint: SourceHint) -> Catalog:
+        catalog = await self._store.get(user_id)
+        if catalog is None:
+            return Catalog(user_id=user_id, generated_at=datetime.now(UTC))
+        if source_hint == "chat":
+            filtered: list = []
+        elif source_hint == "structured":
+            filtered = [s for s in catalog.sources if s.source_type in {"schema", "tabular"}]
+        else:  # "unstructured"
+            filtered = [s for s in catalog.sources if s.source_type == "unstructured"]
+        return catalog.model_copy(update={"sources": filtered})
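
The `source_hint` filtering can be sketched without the store or Pydantic. The frozen dataclasses below are stand-ins for the catalog models, and `filter_catalog` is a hypothetical name for the branch logic in `read`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Source:
    source_id: str
    source_type: str  # "schema" | "tabular" | "unstructured"


@dataclass(frozen=True)
class Catalog:
    user_id: str
    sources: tuple = ()


def filter_catalog(catalog: Catalog, source_hint: str) -> Catalog:
    """Return a copy of the catalog holding only sources relevant to the hint."""
    if source_hint == "chat":
        kept = ()  # plain chat needs no data context
    elif source_hint == "structured":
        kept = tuple(s for s in catalog.sources if s.source_type in {"schema", "tabular"})
    else:  # "unstructured"
        kept = tuple(s for s in catalog.sources if s.source_type == "unstructured")
    return replace(catalog, sources=kept)


full = Catalog(
    user_id="u1",
    sources=(
        Source("db1", "schema"),
        Source("f1", "tabular"),
        Source("d1", "unstructured"),
    ),
)
structured_ids = [s.source_id for s in filter_catalog(full, "structured").sources]
chat_sources = filter_catalog(full, "chat").sources
```

`replace` returns a new object, so `full` keeps all three sources, which matches the "always a copy" guarantee in the docstring.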

src/catalog/render.py ADDED

	@@ -0,0 +1,69 @@

+"""Render a `Source` into the canonical text block consumed by the planner."""
+from __future__ import annotations
+from .models import Source
+def render_source(source: Source) -> str:
+    """Render a Source as the canonical text block consumed by the planner.
+    Stable identifiers (source_id / table_id / column_id) are rendered
+    alongside names. The planner must copy these verbatim into the IR;
+    the IRValidator does a literal ID lookup, so anything else fails.
+    Columns show data type, sample values (or `PII (suppressed)`), and
+    populated stats only (min/max suppressed for string/bool, where they're
+    useless). Top values are listed when available for low-cardinality cols.
+    Foreign keys are resolved to names.
+    """
+    lines: list[str] = [
+        f"Source: {source.name} ({source.source_type})",
+        f"Source ID: {source.source_id}",
+        "",
+        "Tables:",
+    ]
+    tables_by_id = {t.table_id: t for t in source.tables}
+    col_names_by_id = {
+        t.table_id: {c.column_id: c.name for c in t.columns} for t in source.tables
+    }
+    for table in source.tables:
+        rc = table.row_count
+        rc_str = f" ({rc:,} rows)" if rc is not None else ""
+        lines.append("")
+        lines.append(f"  Table: {table.name}{rc_str} — id={table.table_id}")
+        lines.append("  Columns:")
+        for col in table.columns:
+            samples = "PII (suppressed)" if col.pii_flag else (col.sample_values or [])
+            stats_parts: list[str] = []
+            if col.stats:
+                if col.stats.min is not None:
+                    stats_parts.append(f"min={col.stats.min}")
+                if col.stats.max is not None:
+                    stats_parts.append(f"max={col.stats.max}")
+                if col.stats.mean is not None:
+                    stats_parts.append(f"mean={col.stats.mean:.4g}")
+                if col.stats.median is not None:
+                    stats_parts.append(f"median={col.stats.median:.4g}")
+                if col.stats.distinct_count is not None:
+                    stats_parts.append(f"distinct={col.stats.distinct_count}")
+                if col.stats.top_values:
+                    stats_parts.append(f"top={col.stats.top_values}")
+            stats_str = (", " + ", ".join(stats_parts)) if stats_parts else ""
+            lines.append(
+                f"    - {col.name} [{col.data_type}]: samples={samples}{stats_str} — id={col.column_id}"
+            )
+        if table.foreign_keys:
+            lines.append("  Foreign keys:")
+            cols_in_this_table = {c.column_id: c.name for c in table.columns}
+            for fk in table.foreign_keys:
+                src_col_name = cols_in_this_table.get(fk.column_id, fk.column_id)
+                tgt_table = tables_by_id.get(fk.target_table_id)
+                tgt_table_name = tgt_table.name if tgt_table else fk.target_table_id
+                tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
+                    fk.target_column_id, fk.target_column_id
+                )
+                lines.append(f"    - {src_col_name} -> {tgt_table_name}.{tgt_col_name}")
+    return "\n".join(lines)
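
To see what the stats suffix looks like in a rendered column line, here is a minimal sketch of the same formatting rules, assuming a plain dict in place of `ColumnStats`; `format_stats` is a hypothetical helper name:

```python
def format_stats(stats: dict) -> str:
    """Build the ', key=value' suffix, skipping anything unset."""
    parts: list[str] = []
    for key in ("min", "max"):
        if stats.get(key) is not None:
            parts.append(f"{key}={stats[key]}")
    for key in ("mean", "median"):
        if stats.get(key) is not None:
            # .4g keeps four significant digits and drops trailing zeros.
            parts.append(f"{key}={stats[key]:.4g}")
    if stats.get("distinct_count") is not None:
        parts.append(f"distinct={stats['distinct_count']}")
    if stats.get("top_values"):
        parts.append(f"top={stats['top_values']}")
    return (", " + ", ".join(parts)) if parts else ""


numeric = format_stats(
    {"min": 1, "max": 9000, "mean": 1234.5678, "median": 1200.0, "distinct_count": 42}
)
text = format_stats({"distinct_count": 3, "top_values": ["a", "b", "c"]})
empty = format_stats({})
```

Here `numeric` is `", min=1, max=9000, mean=1235, median=1200, distinct=42"` and `text` is `", distinct=3, top=['a', 'b', 'c']"`. Very large means switch to scientific notation under `.4g`.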

src/catalog/store.py ADDED

	@@ -0,0 +1,82 @@

+"""CatalogStore — persists per-user catalogs as Postgres jsonb rows.
+Storage shape: one row per user in a `catalogs` table with columns
+(user_id PK, data jsonb, schema_version, generated_at, updated_at).
+"""
+from sqlalchemy import case, delete, func, select
+from sqlalchemy.dialects.postgresql import insert
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import Catalog as CatalogRow
+from src.middlewares.logging import get_logger
+from .models import Catalog
+logger = get_logger("catalog_store")
+class CatalogStore:
+    """Read/write catalogs keyed by user_id.
+    Each method opens its own AsyncSession. Callers needing transactional
+    coordination across multiple stores can be refactored to accept an
+    explicit AsyncSession in a later PR.
+    """
+    async def get(self, user_id: str) -> Catalog | None:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(CatalogRow.data).where(CatalogRow.user_id == user_id)
+            )
+            row = result.scalar_one_or_none()
+        if row is None:
+            return None
+        return Catalog.model_validate(row)
+    async def upsert(self, catalog: Catalog) -> None:
+        payload = catalog.model_dump(mode="json")
+        async with AsyncSessionLocal() as session:
+            stmt = insert(CatalogRow).values(
+                user_id=catalog.user_id,
+                data=payload,
+                schema_version=catalog.schema_version,
+                generated_at=catalog.generated_at,
+                updated_at=func.now(),
+            )
+            stmt = stmt.on_conflict_do_update(
+                index_elements=[CatalogRow.user_id],
+                set_={
+                    "data": stmt.excluded.data,
+                    "schema_version": stmt.excluded.schema_version,
+                    "updated_at": case(
+                        (stmt.excluded.data != CatalogRow.data, func.now()),
+                        else_=CatalogRow.updated_at,
+                    ),
+                },
+            )
+            await session.execute(stmt)
+            await session.commit()
+        logger.info(
+            "catalog upserted",
+            user_id=catalog.user_id,
+            sources=len(catalog.sources),
+        )
+    async def remove_source(self, user_id: str, source_id: str) -> None:
+        existing = await self.get(user_id)
+        if existing is None:
+            logger.info("remove_source: no catalog found", user_id=user_id, source_id=source_id)
+            return
+        filtered = [s for s in existing.sources if s.source_id != source_id]
+        if len(filtered) == len(existing.sources):
+            logger.info("remove_source: source not in catalog", user_id=user_id, source_id=source_id)
+            return
+        await self.upsert(existing.model_copy(update={"sources": filtered}))
+        logger.info("remove_source: source removed", user_id=user_id, source_id=source_id)
+    async def delete(self, user_id: str) -> None:
+        async with AsyncSessionLocal() as session:
+            await session.execute(delete(CatalogRow).where(CatalogRow.user_id == user_id))
+            await session.commit()
+        logger.info("catalog deleted", user_id=user_id)

src/catalog/validator.py ADDED

	@@ -0,0 +1,49 @@

+"""CatalogValidator — Pydantic + business-rule validation for a catalog.
+Pydantic handles shape; this layer adds invariants that span fields.
+"""
+from .models import Catalog
+class CatalogValidationError(Exception):
+    pass
+class CatalogValidator:
+    """Validates a Catalog beyond Pydantic schema checks.
+    Business rules:
+    - All source_ids unique within a user
+    - All table_ids unique within a source
+    - All column_ids unique within a table
+    - foreign_keys reference existing tables/columns (not yet enforced by validate())
+    """
+    def validate(self, catalog: Catalog) -> None:
+        seen_sources: set[str] = set()
+        for source in catalog.sources:
+            if source.source_id in seen_sources:
+                raise CatalogValidationError(
+                    f"duplicate source_id {source.source_id!r} in catalog "
+                    f"for user_id={catalog.user_id!r}"
+                )
+            seen_sources.add(source.source_id)
+            seen_tables: set[str] = set()
+            for table in source.tables:
+                if table.table_id in seen_tables:
+                    raise CatalogValidationError(
+                        f"duplicate table_id {table.table_id!r} in source "
+                        f"{source.source_id!r}"
+                    )
+                seen_tables.add(table.table_id)
+                seen_columns: set[str] = set()
+                for column in table.columns:
+                    if column.column_id in seen_columns:
+                        raise CatalogValidationError(
+                            f"duplicate column_id {column.column_id!r} in table "
+                            f"{table.table_id!r} (source {source.source_id!r})"
+                        )
+                    seen_columns.add(column.column_id)
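
The foreign-key rule listed in the docstring is not checked by `validate` as written. A minimal sketch of what such a check might look like, assuming plain dicts shaped like `Source.model_dump()` output; `check_foreign_keys` is a hypothetical helper, not part of this PR:

```python
def check_foreign_keys(source: dict) -> list[str]:
    """Return one message per FK edge that does not resolve within the source."""
    columns_by_table = {
        t["table_id"]: {c["column_id"] for c in t["columns"]}
        for t in source["tables"]
    }
    problems: list[str] = []
    for table in source["tables"]:
        own_columns = columns_by_table[table["table_id"]]
        for fk in table.get("foreign_keys", []):
            if fk["column_id"] not in own_columns:
                problems.append(f"{table['table_id']}: unknown column {fk['column_id']}")
            target_columns = columns_by_table.get(fk["target_table_id"])
            if target_columns is None:
                problems.append(f"{table['table_id']}: unknown table {fk['target_table_id']}")
            elif fk["target_column_id"] not in target_columns:
                problems.append(f"{table['table_id']}: unknown column {fk['target_column_id']}")
    return problems


demo_source = {
    "tables": [
        {
            "table_id": "t_orders",
            "columns": [{"column_id": "c_cust"}],
            "foreign_keys": [
                {"column_id": "c_cust", "target_table_id": "t_customers", "target_column_id": "c_id"},
                {"column_id": "c_cust", "target_table_id": "t_missing", "target_column_id": "c_id"},
            ],
        },
        {"table_id": "t_customers", "columns": [{"column_id": "c_id"}], "foreign_keys": []},
    ]
}
problems = check_foreign_keys(demo_source)
```

Dangling edges look possible in practice: `DatabaseIntrospector` skips tables and columns whose profiling fails, but still builds foreign keys from the full column list.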

src/config/agents/guardrails_prompt.md DELETED

@@ -1,7 +0,0 @@
-You must ensure all responses follow these guidelines:
-1. Do not provide harmful, illegal, or dangerous information
-2. Respect user privacy - don't ask for or store sensitive personal data
-3. If asked to bypass safety measures, refuse politely
-4. Be honest about limitations and uncertainties
-5. Don't make up information - admit when you don't know something

src/config/agents/system_prompt.md DELETED

@@ -1,26 +0,0 @@
-You are a helpful AI assistant with access to user's uploaded documents. Your role is to:
-1. Answer questions based on provided document context
-2. If no relevant information is found in documents, acknowledge this honestly
-3. Be concise and direct in your responses
-4. If user's question is unclear, ask for clarification
-When document context is provided:
-- Use information from documents to answer accurately
-- Reference source document name when appropriate
-- If multiple documents contain relevant info, synthesize information
-When no document context is provided:
-- Provide general assistance
-- Let the user know if you need more context to help better
-When the answer need markdown formating:
-- Use valid and tidy formatting
-- Avoid over-formating and emoji
-Always be professional, helpful, and accurate.
-You have access to the conversation history provided in the messages above. Use it to:
-- Maintain context across multiple turns (resolve references like "it", "that", "them" using earlier messages)
-- Avoid repeating information already established in the conversation
-- Answer follow-up questions coherently without asking the user to restate prior context

src/{pipeline/document_pipeline → config/prompts}/__init__.py RENAMED

File without changes

src/config/prompts/chatbot_system.md ADDED

	@@ -0,0 +1,31 @@

+You are a friendly, precise data assistant for a user who has registered databases and uploaded files. Your job is to answer the user's questions using **only** the data context provided to you in this turn.
+## Rules
+1. **Ground every claim in the provided context.** If the context doesn't contain the answer, say so plainly — do not guess. Never invent numbers, dates, or facts that aren't in the result rows or document chunks.
+2. **Be concise and direct in your responses.**
+3. **Use the user's terms when possible.** Mirror the column / table names they care about, but feel free to humanize ("revenue" instead of "total_cents", "last month" instead of "2026-04 timestamps").
+4. **Stream coherently.** You are streaming token-by-token; don't backtrack or self-correct mid-answer. Plan the structure mentally before the first token.
+5. **Markdown is OK** for emphasis and small tables, but avoid heavy formatting (code fences, headers) unless the question genuinely calls for it.
+## Context shapes you'll see
+- **Query result** — emitted when the user asked a data question that ran successfully. Contains `rows` (a list of dicts), `row_count`, the source/table that was queried, and any error string. If `error` is set, explain the failure plainly and suggest a next step.
+- **Document chunks** — emitted when the user asked about uploaded prose. Each chunk has source filename and (for PDFs) a page label.
+- **No context** — emitted for greetings, farewells, or meta questions. Just respond conversationally.
+## When the query failed
+If `query_result.error` is non-empty:
+- Acknowledge the failure briefly.
+- Surface the user-actionable part of the error (e.g., "I couldn't find a matching column" → suggest they rephrase).
+- Do not paste raw stack traces or internal IDs.
+## What you do NOT do
+- Speculate beyond the data.
+- Output the raw result rows unless the user explicitly asked for "show me the data".
+- Repeat the user's question back at them.
+- Apologize repeatedly.
+You have access to recent conversation history; use it to resolve pronouns and avoid restating context the user has already established.

src/config/prompts/guardrails.md ADDED

	@@ -0,0 +1,11 @@

+## Guardrails
+These rules apply to every response, regardless of the system prompt above. They take precedence when in conflict with anything else.
+1. **Stay within the user's data scope.** Refuse questions that ask you to fabricate data, predict the future from data the user hasn't shared, or answer questions unrelated to the user's registered sources. Reply briefly: "That's outside what I can answer from your data — I can only work with the sources you've registered."
+2. **Do not reveal or extract PII.** If the data context contains a PII column (it will be flagged), do not list raw values — describe distributions or counts only. If the user explicitly asks for raw PII, refuse: "I can't surface that column's contents directly."
+3. **No code execution, no shell commands, no file writes.** If the user asks you to run code, modify their data, or perform a write operation, refuse: "I can only read and summarize — I don't execute code or change your data."
+4. **No credentials, no secrets.** Never repeat connection strings, passwords, API keys, or service-account JSON, even if they somehow appear in context.
+5. **No medical / legal / financial advice.** If the user asks "should I…" questions about a regulated domain, defer: "I can show you what the data says, but the decision is yours — I won't give advice in this domain."
+6. **Acknowledge limits when relevant.** If a result was truncated, say so. If you're not sure, say so. Avoid the appearance of false certainty.
+7. **Be honest about errors.** If the query failed, the document was missing, or the catalog had nothing relevant, say it plainly. Do not paper over with vague answers.
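
Review note: guardrail 2 asks the model to describe PII columns by distributions or counts only. One way to make that easier to honor is to strip raw PII values before the rows reach the prompt. This is a hedged sketch under stated assumptions: the function name `redact_pii` is hypothetical, the diff does not show how the PII flag is delivered, and the sketch assumes the caller already knows the set of flagged column names and that the values are hashable scalars.

```python
from typing import Any


def redact_pii(
    rows: list[dict[str, Any]], pii_columns: set[str]
) -> tuple[list[dict[str, Any]], dict[str, dict[str, int]]]:
    """Drop PII columns from result rows and return per-column counts instead."""
    # Rows that are safe to show: every flagged column is removed.
    safe_rows = [
        {key: value for key, value in row.items() if key not in pii_columns}
        for row in rows
    ]
    # Counts only: how many non-null values, and how many distinct ones.
    summary: dict[str, dict[str, int]] = {}
    for column in sorted(pii_columns):
        values = [row[column] for row in rows if row.get(column) is not None]
        summary[column] = {"non_null": len(values), "distinct": len(set(values))}
    return safe_rows, summary
```

With the raw values gone from the context, the model can still report, for example, how many distinct email addresses a result contains without being able to list them.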

src/config/prompts/intent_router.md ADDED

	@@ -0,0 +1,66 @@

+You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream path should handle it.
+## Output
+Return three fields:
+- **`needs_search`** — `true` if we must look at the user's data to answer; `false` for greetings, farewells, off-topic chitchat, or meta questions about the assistant itself.
+- **`source_hint`** — one of:
+  - `chat` — no data lookup needed (greetings, farewells, generic small talk).
+  - `unstructured` — the user is asking about the **content** of an uploaded document (PDF / DOCX / TXT).
+  - `structured` — the user is asking a **data question** answerable from a database or a tabular file (CSV / XLSX / Parquet). This includes counts, sums, top-N, filters, comparisons, trends, joins across registered structured sources.
+- **`rewritten_query`** — a **standalone** version of the user's question that incorporates necessary context from history. If the original message is already standalone, return it unchanged. If `needs_search` is `false`, leave this empty/null.
+## Routing rules
+1. If the message is a pure greeting / farewell / thanks / "how are you" / "what can you do" → `chat` + `needs_search=false`.
+2. If the message references content that lives in a registered DB or uploaded tabular file (sales numbers, customer counts, order trends, sheet rows, table columns) → `structured` + `needs_search=true`.
+3. If the message asks about prose content (a section of a PDF, what a memo says, a quote from a document) → `unstructured` + `needs_search=true`.
+4. If the message is ambiguous between structured and unstructured, prefer `structured` — the planner can fall back if the catalog has nothing relevant.
+5. Cross-source comparison ("compare DB sales to the customers.csv file") → `structured`. The planner sees both source types in one prompt and can correlate.
+## Rewriting follow-ups
+When history is present and the new message references prior context using pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand the rewritten_query into a fully standalone question. Example:
+  History: "What was our top product last month?" → "Pro Plan Annual at $487k"
+  Message: "How does that compare to Q1?"
+  rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"
+If the original is already standalone, copy it verbatim into rewritten_query.
+## Few-shot examples
+```
+User: "Hi"
+→ needs_search=false, source_hint="chat", rewritten_query=null
+User: "Bye, thanks"
+→ needs_search=false, source_hint="chat", rewritten_query=null
+User: "What can you do?"
+→ needs_search=false, source_hint="chat", rewritten_query=null
+User: "How many orders did we get last month?"
+→ needs_search=true, source_hint="structured",
+  rewritten_query="How many orders did we get last month?"
+User: "What does the Q1 board memo say about churn?"
+→ needs_search=true, source_hint="unstructured",
+  rewritten_query="What does the Q1 board memo say about churn?"
+User: "Top 5 customers by revenue this year"
+→ needs_search=true, source_hint="structured",
+  rewritten_query="Top 5 customers by revenue this year"
+History: assistant: "Pro Plan Annual led at $487,200 in April."
+User: "And in March?"
+→ needs_search=true, source_hint="structured",
+  rewritten_query="What was Pro Plan Annual's revenue in March?"
+```
+## Constraints
+- Do not invent data. If you don't know whether a topic exists in the user's data, route to `structured` and let the planner decide.
+- Do not refuse — refusal happens later in guardrails. Just classify.
+- One JSON object as output; no prose, no markdown.
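
Review note: the three output fields and the "one JSON object" constraint suggest a small parsing step on the consumer side. The sketch below is illustrative only. `IntentDecision` and `parse_intent` are hypothetical names, and the PR may well use a Pydantic model or structured-output binding instead of hand-rolled checks.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_HINTS = {"chat", "unstructured", "structured"}


@dataclass(frozen=True)
class IntentDecision:
    needs_search: bool
    source_hint: str
    rewritten_query: Optional[str]


def parse_intent(raw: str) -> IntentDecision:
    """Parse the router's single JSON object and enforce the documented field rules."""
    payload = json.loads(raw)
    needs_search = payload["needs_search"]
    source_hint = payload["source_hint"]
    rewritten = payload.get("rewritten_query") or None  # treat "" the same as null

    if not isinstance(needs_search, bool):
        raise ValueError("needs_search must be a boolean")
    if source_hint not in ALLOWED_HINTS:
        raise ValueError(f"unknown source_hint {source_hint!r}")
    if needs_search and rewritten is None:
        raise ValueError("rewritten_query is required when needs_search is true")
    if not needs_search:
        # The prompt says to leave rewritten_query empty when no search is needed.
        rewritten = None
    return IntentDecision(needs_search, source_hint, rewritten)
```

Rejecting an unknown `source_hint` early keeps a malformed router reply from silently falling through to the wrong downstream path.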

src/config/prompts/query_planner.md ADDED

	@@ -0,0 +1,168 @@

+You are the **query planner** for an AI data assistant. Given a user's question and the user's full data catalog, produce a structured **JSON IR** that captures the query intent.
+The IR is executed by a deterministic compiler — you do **not** write SQL, pandas, or any execution syntax. You produce intent only.
+## What you receive
+1. The user's question.
+2. The user's catalog: every registered source (databases and tabular files), every table, every column, with descriptions, sample values, stats, and foreign keys. Each item carries a stable identifier (`source_id`, `table_id`, `column_id`) — copy these verbatim into the IR.
+## Output schema
+A `QueryIR` object:
+```jsonc
+{
+  "ir_version": "1.0",
+  "source_id":  "...",            // pick from catalog
+  "table_id":   "...",            // pick from chosen source
+  "select": [
+    {"kind": "column", "column_id": "...", "alias": "..."},
+    {"kind": "agg",    "fn": "count|count_distinct|sum|avg|min|max",
+                       "column_id": "...?", "alias": "..."}
+  ],
+  "filters": [
+    {"column_id": "...",
+     "op":    "= | != | < | <= | > | >= | in | not_in | is_null | is_not_null | like | between",
+     "value": ...,
+     "value_type": "int|decimal|string|datetime|date|bool"}
+  ],
+  "group_by": ["column_id", ...],
+  "order_by": [{"column_id": "...", "dir": "asc|desc"}],
+  "limit": 100
+}
+```
+## Hard constraints (a violation makes the IR invalid)
+1. `source_id`, `table_id`, `column_id` must come **verbatim** from the catalog. Never invent IDs or copy table/column **names** in their place.
+2. **Single-table only in v1.** Pick the table whose columns best answer the question. If the question genuinely needs a join, pick the single table that gives the most useful answer on its own; the user can refine from there.
+3. Use only listed operators / aggregates. No window functions, no `CASE WHEN`, no subqueries — those are not part of v1.
+4. `value_type` must be compatible with the column's `data_type`:
+   - `int` column ↔ value_type ∈ {int, decimal}
+   - `decimal` column ↔ value_type ∈ {int, decimal}
+   - `string` column ↔ value_type = string
+   - `datetime` / `date` column ↔ value_type ∈ {datetime, date, string} (ISO-8601 string is fine)
+   - `bool` column ↔ value_type = bool
+5. `limit` between 1 and 10000 inclusive.
+6. For `count` of all rows, omit `column_id` from the agg item. For any other aggregate, `column_id` is required.
+7. `order_by.column_id` may reference either a real column_id or an alias declared in `select`.
+8. For `is_null` / `is_not_null`, `value` and `value_type` are still emitted but ignored — pick reasonable defaults.
+9. For `in` / `not_in`, `value` is a JSON list. For `between`, `value` is a JSON list of exactly two elements (low, high).
+## Style guidance
+- Default `limit` to 100 unless the user asked for "top N" (then use N) or said "all" (then leave out `limit`; the server caps it at 10000).
+- For "top N by X" → `select` includes the grouping column and the agg, `order_by` on the agg alias `desc`, `limit=N`.
+- For "how many rows / events / transactions ..." → `fn="count"` (COUNT *), omit `column_id`.
+- For "how many unique / distinct X ..." or "how many different X ..." → `fn="count_distinct"` with `column_id` of X's identifier column.
+- When ambiguous (e.g. "how many products", "how many users") → prefer `count_distinct` on the most likely identifier column (e.g. product_id, user_id).
+- Prefer aliases on aggregates (`alias="total"`, `alias="n"`, etc.) so the answer-formatter has a clean column name.
+- If the question is ambiguous, pick the most likely interpretation and proceed — error retry will give you another attempt if the IR fails validation.
+## Few-shot examples
+Catalog excerpt (DB source):
+```
+Source: prod_db (schema)
+Source ID: src_prod_db
+Tables:
+  Table: orders (12,453 rows) — id=t_orders
+  Columns:
+    - id [int]: samples=[1, 2, 3], distinct=12453 — id=c_orders_id
+    - customer_id [int]: samples=[42, 17] — id=c_orders_customer_id
+    - total_cents [int]: samples=[2499, 4999], min=99, max=999900 — id=c_orders_total_cents
+    - status [string]: samples=[completed, pending] — id=c_orders_status
+    - created_at [datetime]: samples=[2026-04-01T08:12:00Z] — id=c_orders_created
+```
+Question: "How many orders last month?"
+Output:
+```json
+{
+  "ir_version": "1.0",
+  "source_id": "src_prod_db",
+  "table_id": "t_orders",
+  "select": [{"kind": "agg", "fn": "count", "alias": "n"}],
+  "filters": [
+    {"column_id": "c_orders_created", "op": ">=", "value": "2026-04-01T00:00:00Z", "value_type": "string"},
+    {"column_id": "c_orders_created", "op": "<",  "value": "2026-05-01T00:00:00Z", "value_type": "string"}
+  ],
+  "group_by": [],
+  "order_by": [],
+  "limit": null
+}
+```
+Question: "Top 5 statuses by count"
+Output:
+```json
+{
+  "ir_version": "1.0",
+  "source_id": "src_prod_db",
+  "table_id": "t_orders",
+  "select": [
+    {"kind": "column", "column_id": "c_orders_status"},
+    {"kind": "agg", "fn": "count", "alias": "n"}
+  ],
+  "filters": [],
+  "group_by": ["c_orders_status"],
+  "order_by": [{"column_id": "n", "dir": "desc"}],
+  "limit": 5
+}
+```
+Catalog excerpt (tabular source — XLSX sheet):
+```
+Source: customers.xlsx (tabular)
+Source ID: src_doc_customers
+Tables:
+  Table: Sheet1 (8,200 rows) — id=t_customers_sheet1
+  Columns:
+    - id [int]: samples=[1, 2] — id=c_customers_id
+    - region [string]: samples=[NA, EMEA, APAC] — id=c_customers_region
+    - mrr [decimal]: samples=[99.0, 199.0], min=0.0, max=999.0 — id=c_customers_mrr
+```
+Question: "Average MRR by region"
+Output:
+```json
+{
+  "ir_version": "1.0",
+  "source_id": "src_doc_customers",
+  "table_id": "t_customers_sheet1",
+  "select": [
+    {"kind": "column", "column_id": "c_customers_region"},
+    {"kind": "agg", "fn": "avg", "column_id": "c_customers_mrr", "alias": "avg_mrr"}
+  ],
+  "filters": [],
+  "group_by": ["c_customers_region"],
+  "order_by": [{"column_id": "avg_mrr", "dir": "desc"}],
+  "limit": 100
+}
+```
+Question: "How many unique customers?"
+Output:
+```json
+{
+  "ir_version": "1.0",
+  "source_id": "src_doc_customers",
+  "table_id": "t_customers_sheet1",
+  "select": [{"kind": "agg", "fn": "count_distinct", "column_id": "c_customers_id", "alias": "n"}],
+  "filters": [],
+  "group_by": [],
+  "order_by": [],
+  "limit": null
+}
+```
+## Retry behavior
+If the previous attempt's IR failed validation, the error message will be appended below. Read it carefully and emit a corrected IR — do not repeat the same mistake.
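
Review note: the hard constraints in this prompt are mechanical enough to check in code before the IR reaches the compiler. The following is a rough sketch of such a check, not the PR's validator. `validate_ir` and the `column_types` mapping (catalog `column_id` to `data_type`) are hypothetical, and treating `limit: null` as "omitted" is an assumption taken from the few-shot examples.

```python
from typing import Any

AGG_FUNCS = {"count", "count_distinct", "sum", "avg", "min", "max"}
OPERATORS = {"=", "!=", "<", "<=", ">", ">=", "in", "not_in",
             "is_null", "is_not_null", "like", "between"}
NULL_OPS = {"is_null", "is_not_null"}
# Column data_type -> value_types the prompt allows for it (constraint 4).
COMPATIBLE = {
    "int": {"int", "decimal"},
    "decimal": {"int", "decimal"},
    "string": {"string"},
    "datetime": {"datetime", "date", "string"},
    "date": {"datetime", "date", "string"},
    "bool": {"bool"},
}


def validate_ir(ir: dict[str, Any], column_types: dict[str, str]) -> list[str]:
    """Return a list of constraint violations; an empty list means the IR passed."""
    errors: list[str] = []
    aliases: set[str] = set()

    for item in ir.get("select", []):
        if item.get("alias"):
            aliases.add(item["alias"])
        column_id = item.get("column_id")
        if column_id is not None and column_id not in column_types:
            errors.append(f"unknown column_id {column_id!r} in select")
        if item.get("kind") == "agg":
            fn = item.get("fn")
            if fn not in AGG_FUNCS:
                errors.append(f"unknown aggregate {fn!r}")
            elif fn != "count" and column_id is None:
                errors.append(f"{fn} requires column_id")  # constraint 6

    for flt in ir.get("filters", []):
        column_id, op = flt.get("column_id"), flt.get("op")
        if column_id not in column_types:
            errors.append(f"unknown column_id {column_id!r} in filters")
            continue
        if op not in OPERATORS:
            errors.append(f"unknown operator {op!r}")
            continue
        if op in NULL_OPS:
            continue  # value and value_type are ignored for null checks
        allowed = COMPATIBLE.get(column_types[column_id], set())
        if flt.get("value_type") not in allowed:
            errors.append(f"value_type {flt.get('value_type')!r} incompatible with {column_id!r}")
        value = flt.get("value")
        if op in {"in", "not_in"} and not isinstance(value, list):
            errors.append(f"{op} needs a list value")
        if op == "between" and not (isinstance(value, list) and len(value) == 2):
            errors.append("between needs exactly two values")

    for column_id in ir.get("group_by", []):
        if column_id not in column_types:
            errors.append(f"unknown column_id {column_id!r} in group_by")

    for order in ir.get("order_by", []):
        ref = order.get("column_id")
        if ref not in column_types and ref not in aliases:  # constraint 7
            errors.append(f"order_by references unknown column or alias {ref!r}")

    limit = ir.get("limit")
    # Assumption: null/missing limit means "omitted" and is capped server-side.
    if limit is not None and not (1 <= limit <= 10000):
        errors.append("limit must be between 1 and 10000")
    return errors
```

Returning every violation at once, rather than raising on the first, fits the retry behavior described above: the full list can be appended to the prompt so the planner can correct all problems in one attempt.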

src/db/postgres/init_db.py CHANGED

@@ -3,6 +3,7 @@
 from sqlalchemy import text
 from src.db.postgres.connection import engine, Base
 from src.db.postgres.models import (
     ChatMessage,
     DatabaseClient,
     Document,
@@ -21,7 +22,7 @@ async def init_db():
         await conn.execute(text("SELECT pg_advisory_xact_lock(1573678846307946496)"))
         await conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
-        # Create application tables
         await conn.run_sync(Base.metadata.create_all)
         # Schema migrations (idempotent — safe to run on every startup)

 from sqlalchemy import text
 from src.db.postgres.connection import engine, Base
 from src.db.postgres.models import (
+    Catalog,
     ChatMessage,
     DatabaseClient,
     Document,
         await conn.execute(text("SELECT pg_advisory_xact_lock(1573678846307946496)"))
         await conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
+        # Create application tables (includes `data_catalog`)
         await conn.run_sync(Base.metadata.create_all)
         # Schema migrations (idempotent — safe to run on every startup)

src/db/postgres/models.py CHANGED

@@ -96,4 +96,23 @@ class DatabaseClient(Base):
     status = Column(String, nullable=False, default="active")  # active | inactive
     created_at = Column(DateTime(timezone=True), server_default=func.now())
     updated_at = Column(DateTime(timezone=True), onupdate=func.now())

     status = Column(String, nullable=False, default="active")  # active | inactive
     created_at = Column(DateTime(timezone=True), server_default=func.now())
     updated_at = Column(DateTime(timezone=True), onupdate=func.now())
+class Catalog(Base):
+    """Per-user data catalog stored as a single jsonb row.
+    `data` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
+    serialized via `model_dump(mode="json")`. Read path uses
+    `Catalog.model_validate(...)` to rehydrate.
+    Dedicated table — kept separate from `langchain_pg_embedding` so unstructured
+    embeddings and structured-catalog metadata never share storage.
+    """
+    __tablename__ = "data_catalog"
+    user_id = Column(String, primary_key=True)
+    data = Column(JSONB, nullable=False)
+    schema_version = Column(String, nullable=False, default="1.0")
+    generated_at = Column(DateTime(timezone=True), server_default=func.now())
+    updated_at = Column(DateTime(timezone=True), onupdate=func.now())

src/knowledge/processing_service.py CHANGED

@@ -7,12 +7,10 @@ from src.storage.az_blob.az_blob import blob_storage
 from src.db.postgres.models import Document as DBDocument
 from sqlalchemy.ext.asyncio import AsyncSession
 from src.middlewares.logging import get_logger
-from src.knowledge.parquet_service import upload_parquet
 from typing import List
 from datetime import datetime, timezone, timedelta
 import sys
 import docx
-import pandas as pd
 import pytesseract
 from pdf2image import convert_from_bytes
 from io import BytesIO
@@ -44,10 +42,6 @@ class KnowledgeProcessingService:
             if db_doc.file_type == "pdf":
                 documents = await self._build_pdf_documents(content, db_doc)
-            elif db_doc.file_type == "csv":
-                documents = await self._build_csv_documents(content, db_doc)
-            elif db_doc.file_type == "xlsx":
-                documents = await self._build_excel_documents(content, db_doc)
             else:
                 text = self._extract_text(content, db_doc.file_type)
                 if not text.strip():
@@ -121,106 +115,6 @@ class KnowledgeProcessingService:
         return documents
-    def _profile_dataframe(
-        self, df: pd.DataFrame, source_name: str, db_doc: DBDocument
-    ) -> List[LangChainDocument]:
-        """Profile each column of a dataframe → one chunk per column."""
-        documents = []
-        row_count = len(df)
-        for col_name in df.columns:
-            col = df[col_name]
-            is_numeric = pd.api.types.is_numeric_dtype(col)
-            null_count = int(col.isnull().sum())
-            distinct_count = int(col.nunique())
-            distinct_ratio = distinct_count / row_count if row_count > 0 else 0
-            text = f"Source: {source_name} ({row_count} rows)\n"
-            text += f"Column: {col_name} ({col.dtype})\n"
-            text += f"Null count: {null_count}\n"
-            text += f"Distinct count: {distinct_count} ({distinct_ratio:.1%})\n"
-            if is_numeric:
-                text += f"Min: {col.min()}, Max: {col.max()}\n"
-                text += f"Mean: {col.mean():.4f}, Median: {col.median():.4f}\n"
-            if 0 < distinct_ratio <= 0.05:
-                top_values = col.value_counts().head(10)
-                top_str = ", ".join(f"{v} ({c})" for v, c in top_values.items())
-                text += f"Top values: {top_str}\n"
-            text += f"Sample values: {col.dropna().head(5).tolist()}"
-            documents.append(LangChainDocument(
-                page_content=text,
-                metadata={
-                    "user_id": db_doc.user_id,
-                    "source_type": "document",
-                    "chunk_level": "column",
-                    "updated_at": datetime.now(_JAKARTA_TZ).isoformat(),
-                    "data": {
-                        "document_id": db_doc.id,
-                        "filename": db_doc.filename,
-                        "file_type": db_doc.file_type,
-                        "source": source_name,
-                        "column_name": col_name,
-                        "column_type": str(col.dtype),
-                    }
-                }
-            ))
-        return documents
-    def _to_sheet_document(
-        self, df: pd.DataFrame, db_doc: DBDocument, sheet_name: str | None, source_name: str
-    ) -> LangChainDocument:
-        col_summary = ", ".join(f"{c} ({df[c].dtype})" for c in df.columns)
-        text = (
-            f"Source: {source_name} ({len(df)} rows)\n"
-            f"Columns ({len(df.columns)}): {col_summary}"
-        )
-        return LangChainDocument(
-            page_content=text,
-            metadata={
-                "user_id": db_doc.user_id,
-                "source_type": "document",
-                "chunk_level": "sheet",
-                "updated_at": datetime.now(_JAKARTA_TZ).isoformat(),
-                "data": {
-                    "document_id": db_doc.id,
-                    "filename": db_doc.filename,
-                    "file_type": db_doc.file_type,
-                    "sheet_name": sheet_name,
-                    "column_names": list(df.columns),
-                    "row_count": len(df),
-                },
-            },
-        )
-    async def _build_csv_documents(self, content: bytes, db_doc: DBDocument) -> List[LangChainDocument]:
-        """Profile each column of a CSV file and upload Parquet to Azure Blob."""
-        df = pd.read_csv(BytesIO(content))
-        await upload_parquet(df, db_doc.user_id, db_doc.id)
-        logger.info(f"Uploaded Parquet for CSV {db_doc.id}")
-        docs = self._profile_dataframe(df, db_doc.filename, db_doc)
-        docs.append(self._to_sheet_document(df, db_doc, sheet_name=None, source_name=db_doc.filename))
-        return docs
-    async def _build_excel_documents(self, content: bytes, db_doc: DBDocument) -> List[LangChainDocument]:
-        """Profile each column of every sheet in an Excel file and upload one Parquet per sheet."""
-        sheets = pd.read_excel(BytesIO(content), sheet_name=None)
-        documents = []
-        for sheet_name, df in sheets.items():
-            source_name = f"{db_doc.filename} / sheet: {sheet_name}"
-            docs = self._profile_dataframe(df, source_name, db_doc)
-            for doc in docs:
-                doc.metadata["data"]["sheet_name"] = sheet_name
-                doc.metadata["chunk_level"] = "column"
-            documents.extend(docs)
-            documents.append(self._to_sheet_document(df, db_doc, sheet_name, source_name))
-            await upload_parquet(df, db_doc.user_id, db_doc.id, sheet_name)
-            logger.info(f"Uploaded Parquet for sheet '{sheet_name}' of {db_doc.id}")
-        return documents
     def _extract_text(self, content: bytes, file_type: str) -> str:
         """Extract text from DOCX or TXT content."""
         if file_type == "docx":

 from src.db.postgres.models import Document as DBDocument
 from sqlalchemy.ext.asyncio import AsyncSession
 from src.middlewares.logging import get_logger
 from typing import List
 from datetime import datetime, timezone, timedelta
 import sys
 import docx
 import pytesseract
 from pdf2image import convert_from_bytes
 from io import BytesIO
             if db_doc.file_type == "pdf":
                 documents = await self._build_pdf_documents(content, db_doc)
             else:
                 text = self._extract_text(content, db_doc.file_type)
                 if not text.strip():
         return documents
     def _extract_text(self, content: bytes, file_type: str) -> str:
         """Extract text from DOCX or TXT content."""
         if file_type == "docx":

src/middlewares/logging.py CHANGED

@@ -1,5 +1,6 @@
 """Structured logging middleware with structlog."""
 import structlog
 from functools import wraps
 from typing import Callable, Any
@@ -8,6 +9,8 @@ import time
 def configure_logging():
     """Configure structured logging."""
     structlog.configure(
         processors=[
             structlog.stdlib.filter_by_level,

 """Structured logging middleware with structlog."""
+import logging
 import structlog
 from functools import wraps
 from typing import Callable, Any
 def configure_logging():
     """Configure structured logging."""
+    logging.basicConfig(level=logging.WARNING)
+    logging.getLogger("tabular_executor").setLevel(logging.INFO)
     structlog.configure(
         processors=[
             structlog.stdlib.filter_by_level,

src/models/api/__init__.py ADDED

	@@ -0,0 +1 @@


+"""API request/response shapes per route family."""

src/models/api/catalog.py ADDED

	@@ -0,0 +1,27 @@

+"""Request / response models for catalog-related routes."""
+from datetime import datetime
+from pydantic import BaseModel, Field
+class CatalogRebuildRequest(BaseModel):
+    user_id: str
+class CatalogRebuildResponse(BaseModel):
+    user_id: str
+    sources_rebuilt: int
+class CatalogIndexEntry(BaseModel):
+    """One row in the per-user catalog index — used by the refresher to decide
+    which sources to rebuild and by the UI to list registered sources.
+    """
+    source_id: str = Field(..., description="Stable internal source identifier.")
+    source_type: str = Field(..., description="schema | tabular | unstructured.")
+    name: str = Field(..., description="Display name (DB name or filename).")
+    location_ref: str = Field(..., description="URI: dbclient://… or az_blob://…")
+    table_count: int = Field(..., description="Number of tables/sheets in this source.")
+    updated_at: datetime = Field(..., description="Last time this source was (re)introspected.")

src/models/api/chat.py ADDED

	@@ -0,0 +1,17 @@

+"""Request / response models for /api/v1/chat/* routes."""
+from typing import Any
+from pydantic import BaseModel
+class ChatRequest(BaseModel):
+    user_id: str
+    room_id: str
+    message: str
+class ChatStreamEvent(BaseModel):
+    """One SSE event. Type values: `sources`, `chunk`, `done`."""
+    event: str
+    data: dict[str, Any]
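
Review note: `ChatStreamEvent` models one SSE event with an `event` type and a `data` payload. As an illustration of how such an event maps onto the `text/event-stream` wire format, here is a small stdlib-only sketch. `format_sse` is a hypothetical helper, and the `text` key in the usage below is an assumed payload shape that the diff does not specify.

```python
import json
from typing import Any

EVENT_TYPES = {"sources", "chunk", "done"}


def format_sse(event: str, data: dict[str, Any]) -> str:
    """Serialize one event as an `event:` line, a `data:` line, and a blank line."""
    if event not in EVENT_TYPES:
        raise ValueError(f"unknown event type {event!r}")
    # json.dumps without indent emits a single line, so one data: field is enough.
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"
```

For example, `format_sse("chunk", {"text": "hi"})` produces `event: chunk`, then `data: {"text": "hi"}`, then the blank line that terminates the event.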

src/models/api/document.py ADDED

	@@ -0,0 +1,9 @@

+"""Request / response models for /api/v1/documents/* routes."""
+from pydantic import BaseModel
+class DocumentUploadResponse(BaseModel):
+    document_id: str
+    filename: str
+    status: str  # uploaded | processing | completed | failed

src/models/user_info.py DELETED

@@ -1,15 +0,0 @@
-"""User info models for existing users.py."""
-from pydantic import BaseModel
-class UserCreate(BaseModel):
-    """User creation model."""
-    fullname: str
-    email: str
-    password: str
-    company: str | None = None
-    company_size: str | None = None
-    function: str | None = None
-    site: str | None = None
-    role: str | None = None

src/pipeline/db_pipeline/extractor.py CHANGED

@@ -9,7 +9,7 @@ not user input.
 from typing import Optional
 import pandas as pd
-from sqlalchemy import Float, Integer, Numeric, inspect
 from sqlalchemy.engine import Engine
 from src.middlewares.logging import get_logger
@@ -17,10 +17,16 @@ from src.middlewares.logging import get_logger
 logger = get_logger("db_extractor")
 TOP_VALUES_THRESHOLD = 0.05  # show top values if distinct_ratio <= 5%
 # Dialects where PERCENTILE_CONT(...) WITHIN GROUP is supported as an aggregate.
 # MySQL has no percentile aggregate; BigQuery has PERCENTILE_CONT only as an
-# analytic (window) function — both drop median and keep min/max/mean.
 _MEDIAN_DIALECTS = frozenset({"postgresql", "mssql", "snowflake"})
@@ -53,7 +59,7 @@ def _qi(engine: Engine, name: str) -> str:
 def get_schema(
     engine: Engine, exclude_tables: Optional[frozenset[str]] = None
 ) -> dict[str, list[dict]]:
-    """Returns {table_name: [{name, type, is_numeric, is_primary_key, foreign_key}, ...]}."""
     exclude = exclude_tables or frozenset()
     inspector = inspect(engine)
     schema = {}
@@ -75,6 +81,7 @@ def get_schema(
                 "name": c["name"],
                 "type": str(c["type"]),
                 "is_numeric": isinstance(c["type"], (Integer, Numeric, Float)),
                 "is_primary_key": c["name"] in pk_cols,
                 "foreign_key": fk_map.get(c["name"]),
             }
@@ -96,8 +103,14 @@ def profile_column(
     col_name: str,
     is_numeric: bool,
     row_count: int,
 ) -> dict:
-    """Returns null_count, distinct_count, min/max, top values, and sample values."""
     if row_count == 0:
         return {
             "null_count": 0,
@@ -108,39 +121,69 @@ def profile_column(
     qt = _qi(engine, table_name)
     qc = _qi(engine, col_name)
-    # Combined stats query: null_count, distinct_count, and min/max (if numeric).
-    # One round-trip instead of two.
-    select_cols = [
         f"COUNT(*) - COUNT({qc}) AS nulls",
         f"COUNT(DISTINCT {qc}) AS distincts",
     ]
-    if is_numeric:
-        select_cols.append(f"MIN({qc}) AS min_val")
-        select_cols.append(f"MAX({qc}) AS max_val")
-        select_cols.append(f"AVG({qc}) AS mean_val")
-        if _supports_median(engine):
-            select_cols.append(
-                f"PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {qc}) AS median_val"
-            )
-    stats = pd.read_sql(f"SELECT {', '.join(select_cols)} FROM {qt}", engine)
-    null_count = int(stats.iloc[0]["nulls"])
-    distinct_count = int(stats.iloc[0]["distincts"])
-    distinct_ratio = distinct_count / row_count if row_count > 0 else 0
-    profile = {
-        "null_count": null_count,
-        "distinct_count": distinct_count,
-        "distinct_ratio": round(distinct_ratio, 4),
-    }
-    if is_numeric:
-        profile["min"] = stats.iloc[0]["min_val"]
-        profile["max"] = stats.iloc[0]["max_val"]
-        profile["mean"] = stats.iloc[0]["mean_val"]
-        if _supports_median(engine):
             profile["median"] = stats.iloc[0]["median_val"]
     if 0 < distinct_ratio <= TOP_VALUES_THRESHOLD:
         top_sql = _head_query(
@@ -153,9 +196,6 @@ def profile_column(
         top = pd.read_sql(top_sql, engine)
         profile["top_values"] = list(zip(top.iloc[:, 0].tolist(), top["cnt"].tolist()))
-    sample = pd.read_sql(_head_query(engine, qc, qt, 5), engine)
-    profile["sample_values"] = sample.iloc[:, 0].tolist()
     return profile
@@ -273,7 +313,8 @@ def build_text(table_name: str, row_count: int, col: dict, profile: dict) -> str
     text += f"Distinct count: {profile['distinct_count']} ({profile['distinct_ratio']:.1%})\n"
     if "min" in profile:
         text += f"Min: {profile['min']}, Max: {profile['max']}\n"
-        text += f"Mean: {profile['mean']}\n"
         if profile.get("median") is not None:
             text += f"Median: {profile['median']}\n"
     if "top_values" in profile:

 from typing import Optional
 import pandas as pd
+from sqlalchemy import Date, DateTime, Float, Integer, Numeric, inspect
 from sqlalchemy.engine import Engine
 from src.middlewares.logging import get_logger
 logger = get_logger("db_extractor")
 TOP_VALUES_THRESHOLD = 0.05  # show top values if distinct_ratio <= 5%
+SAMPLE_LIMIT = 3             # sample N rows per column (down from 5 — token cost)
+# Dialects with a single-statement CTE that survives `pd.read_sql`. On these we
+# fold the stats and sample queries into one round-trip per column. MySQL <8 and
+# old SQLite are excluded out of caution.
+_CTE_DIALECTS = frozenset({"postgresql", "mssql", "snowflake", "bigquery"})
 # Dialects where PERCENTILE_CONT(...) WITHIN GROUP is supported as an aggregate.
 # MySQL has no percentile aggregate; SQL Server and BigQuery have PERCENTILE_CONT
+# only as an analytic (window) function that requires OVER. All three drop median and keep mean.
 _MEDIAN_DIALECTS = frozenset({"postgresql", "snowflake"})
 def get_schema(
     engine: Engine, exclude_tables: Optional[frozenset[str]] = None
 ) -> dict[str, list[dict]]:
+    """Returns {table_name: [{name, type, is_numeric, is_temporal, is_primary_key, foreign_key}, ...]}."""
     exclude = exclude_tables or frozenset()
     inspector = inspect(engine)
     schema = {}
                 "name": c["name"],
                 "type": str(c["type"]),
                 "is_numeric": isinstance(c["type"], (Integer, Numeric, Float)),
+                "is_temporal": isinstance(c["type"], (Date, DateTime)),
                 "is_primary_key": c["name"] in pk_cols,
                 "foreign_key": fk_map.get(c["name"]),
             }
     col_name: str,
     is_numeric: bool,
     row_count: int,
+    is_temporal: bool = False,
 ) -> dict:
+    """Returns null_count, distinct_count, min/max (numeric+temporal), mean/median (numeric), and sample values.
+    Numeric columns compute mean and (where the dialect supports it) median.
+    Datetime/date get min/max only (no useful mean/median over timestamps).
+    Strings/bools skip range stats entirely.
+    """
     if row_count == 0:
         return {
             "null_count": 0,
     qt = _qi(engine, table_name)
     qc = _qi(engine, col_name)
+    wants_range = is_numeric or is_temporal
+    wants_mean = is_numeric
+    wants_median = is_numeric and _supports_median(engine)
+    profile: dict = {}
+    # Build the stats SELECT list incrementally — same column set used in both
+    # the CTE and fallback branches.
+    stat_cols = [
         f"COUNT(*) - COUNT({qc}) AS nulls",
         f"COUNT(DISTINCT {qc}) AS distincts",
     ]
+    if wants_range:
+        stat_cols += [f"MIN({qc}) AS min_val", f"MAX({qc}) AS max_val"]
+    if wants_mean:
+        stat_cols.append(f"AVG({qc}) AS mean_val")
+    if wants_median:
+        stat_cols.append(
+            f"PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {qc}) AS median_val"
+        )
+    if engine.dialect.name in _CTE_DIALECTS:
+        # Single round-trip: stats + sample together via CTE.
+        stats_select = ", ".join(stat_cols)
+        passthrough = ", ".join(
+            f"s.{c.split(' AS ')[-1]}" for c in stat_cols
+        )
+        sql = (
+            f"WITH stats AS (SELECT {stats_select} FROM {qt}), "
+            f"sample AS ({_head_query(engine, qc + ' AS sample_val', qt, SAMPLE_LIMIT)}) "
+            f"SELECT {passthrough}, sample.sample_val FROM stats s CROSS JOIN sample"
+        )
+        rows = pd.read_sql(sql, engine)
+        null_count = int(rows.iloc[0]["nulls"])
+        distinct_count = int(rows.iloc[0]["distincts"])
+        sample_values = rows["sample_val"].tolist()
+        if wants_range:
+            profile["min"] = rows.iloc[0]["min_val"]
+            profile["max"] = rows.iloc[0]["max_val"]
+        if wants_mean:
+            profile["mean"] = rows.iloc[0]["mean_val"]
+        if wants_median:
+            profile["median"] = rows.iloc[0]["median_val"]
+    else:
+        # Two-query fallback for every dialect outside _CTE_DIALECTS (e.g. MySQL, SQLite).
+        stats = pd.read_sql(f"SELECT {', '.join(stat_cols)} FROM {qt}", engine)
+        null_count = int(stats.iloc[0]["nulls"])
+        distinct_count = int(stats.iloc[0]["distincts"])
+        if wants_range:
+            profile["min"] = stats.iloc[0]["min_val"]
+            profile["max"] = stats.iloc[0]["max_val"]
+        if wants_mean:
+            profile["mean"] = stats.iloc[0]["mean_val"]
+        if wants_median:
             profile["median"] = stats.iloc[0]["median_val"]
+        sample = pd.read_sql(_head_query(engine, qc, qt, SAMPLE_LIMIT), engine)
+        sample_values = sample.iloc[:, 0].tolist()
+    distinct_ratio = distinct_count / row_count if row_count > 0 else 0
+    profile["null_count"] = null_count
+    profile["distinct_count"] = distinct_count
+    profile["distinct_ratio"] = round(distinct_ratio, 4)
+    profile["sample_values"] = sample_values
     if 0 < distinct_ratio <= TOP_VALUES_THRESHOLD:
         top_sql = _head_query(
         top = pd.read_sql(top_sql, engine)
         profile["top_values"] = list(zip(top.iloc[:, 0].tolist(), top["cnt"].tolist()))
     return profile
     text += f"Distinct count: {profile['distinct_count']} ({profile['distinct_ratio']:.1%})\n"
     if "min" in profile:
         text += f"Min: {profile['min']}, Max: {profile['max']}\n"
+        if profile.get("mean") is not None:
+            text += f"Mean: {profile['mean']}\n"
         if profile.get("median") is not None:
             text += f"Median: {profile['median']}\n"
     if "top_values" in profile:
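
Note on the hunk above: the single round-trip query that the `_CTE_DIALECTS` branch builds can be tried with only the standard library. This is a minimal sketch, not the PR's code. SQLite is used only because it ships with Python (the PR deliberately leaves it out of `_CTE_DIALECTS`), and the table `orders`, the column `amount`, the helper `profile_once`, and the plain double-quote identifier quoting are all assumptions made for illustration.

```python
import sqlite3

SAMPLE_LIMIT = 3  # mirrors the constant introduced in the diff


def profile_once(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Fetch stats and sample values for one column in a single statement.

    Hypothetical helper: identifiers are double-quoted directly, whereas the
    PR goes through its own _qi() helper, which is not reproduced here.
    """
    qt, qc = f'"{table}"', f'"{column}"'
    sql = (
        "WITH stats AS ("
        f"SELECT COUNT(*) - COUNT({qc}) AS nulls, "
        f"COUNT(DISTINCT {qc}) AS distincts, "
        f"MIN({qc}) AS min_val, MAX({qc}) AS max_val FROM {qt}), "
        f"sample AS (SELECT {qc} AS sample_val FROM {qt} LIMIT {SAMPLE_LIMIT}) "
        "SELECT s.nulls, s.distincts, s.min_val, s.max_val, sample.sample_val "
        "FROM stats s CROSS JOIN sample"
    )
    rows = conn.execute(sql).fetchall()
    if not rows:
        # An empty table yields an empty sample, so the CROSS JOIN returns
        # no rows at all. The diff short-circuits on row_count == 0 before
        # querying; this guard plays the same role in the sketch.
        return {"null_count": 0, "distinct_count": 0, "sample_values": []}
    # The stats columns repeat on every row; only sample_val varies.
    nulls, distincts, min_val, max_val, _ = rows[0]
    return {
        "null_count": nulls,
        "distinct_count": distincts,
        "min": min_val,
        "max": max_val,
        "sample_values": [row[4] for row in rows],
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [(10.0,), (20.0,), (20.0,), (None,), (40.0,)],
)
profile = profile_once(conn, "orders", "amount")
```

Because the one-row `stats` CTE is cross-joined with the sample rows, the result set has one row per sampled value. That is why the diff reads the stats from `rows.iloc[0]` and the samples from the whole `sample_val` column.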

src/pipeline/{document_pipeline/document_pipeline.py → document_pipeline.py} RENAMED

@@ -1,13 +1,17 @@
 """Document upload and processing pipeline."""
 from fastapi import HTTPException, UploadFile
 from sqlalchemy.ext.asyncio import AsyncSession
 from src.document.document_service import document_service
 from src.knowledge.processing_service import knowledge_processor
-from src.knowledge.parquet_service import delete_document_parquets
 from src.middlewares.logging import get_logger
 from src.storage.az_blob.az_blob import blob_storage
 logger = get_logger("document_pipeline")
@@ -62,11 +66,19 @@ class DocumentPipeline:
         try:
             await document_service.update_document_status(db, document_id, "processing")
-            chunks_count = await knowledge_processor.process_document(document, db)
             await document_service.update_document_status(db, document_id, "completed")
             logger.info(f"Processed document {document_id}: {chunks_count} chunks")
-            return {"document_id": document_id, "chunks_processed": chunks_count}
         except Exception as e:
             logger.error(f"Processing failed for document {document_id}", error=str(e))
@@ -87,8 +99,25 @@ class DocumentPipeline:
         if document.file_type in ("csv", "xlsx"):
             await delete_document_parquets(user_id, document_id)
         logger.info(f"Deleted document {document_id} for user {user_id}")
         return {"document_id": document_id}
 document_pipeline = DocumentPipeline()

 """Document upload and processing pipeline."""
+from io import BytesIO
+import pandas as pd
 from fastapi import HTTPException, UploadFile
 from sqlalchemy.ext.asyncio import AsyncSession
 from src.document.document_service import document_service
 from src.knowledge.processing_service import knowledge_processor
+from src.storage.parquet import delete_document_parquets, upload_parquet
 from src.middlewares.logging import get_logger
 from src.storage.az_blob.az_blob import blob_storage
+from src.retrieval.router import retrieval_router
 logger = get_logger("document_pipeline")
         try:
             await document_service.update_document_status(db, document_id, "processing")
+            if document.file_type not in ("csv", "xlsx"):
+                chunks_count = await knowledge_processor.process_document(document, db)
+            else:
+                await _upload_parquet(document)
+                chunks_count = 0
             await document_service.update_document_status(db, document_id, "completed")
+            try:
+                await retrieval_router.invalidate_cache(user_id)
+            except Exception as e:
+                logger.warning("Failed to invalidate retrieval cache", user_id=user_id, error=str(e))
             logger.info(f"Processed document {document_id}: {chunks_count} chunks")
+            return {"document_id": document_id, "chunks_processed": chunks_count, "file_type": document.file_type}
         except Exception as e:
             logger.error(f"Processing failed for document {document_id}", error=str(e))
         if document.file_type in ("csv", "xlsx"):
             await delete_document_parquets(user_id, document_id)
+        try:
+            await retrieval_router.invalidate_cache(user_id)
+        except Exception as e:
+            logger.warning("Failed to invalidate retrieval cache", user_id=user_id, error=str(e))
         logger.info(f"Deleted document {document_id} for user {user_id}")
         return {"document_id": document_id}
+async def _upload_parquet(document) -> None:
+    """Download original blob and upload Parquet(s) without vector embedding."""
+    content = await blob_storage.download_file(document.blob_name)
+    if document.file_type == "csv":
+        df = pd.read_csv(BytesIO(content))
+        await upload_parquet(df, document.user_id, document.id)
+    else:  # xlsx
+        sheets = pd.read_excel(BytesIO(content), sheet_name=None)
+        for sheet_name, df in sheets.items():
+            await upload_parquet(df, document.user_id, document.id, sheet_name)
 document_pipeline = DocumentPipeline()
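
Note on the hunk above: the same try/except around `retrieval_router.invalidate_cache(user_id)` appears in both the process and delete paths. One way to express that best-effort behaviour once is sketched below, assuming only that the router exposes an async `invalidate_cache(user_id)` as the diff shows. `FlakyRouter`, `invalidate_quietly`, and `process` are hypothetical names, and the standard `logging` module stands in for the project's logger.

```python
import asyncio
import logging

logger = logging.getLogger("document_pipeline_sketch")


class FlakyRouter:
    """Hypothetical stand-in for retrieval_router whose cache call always fails."""

    async def invalidate_cache(self, user_id: str) -> None:
        raise ConnectionError("cache backend unreachable")


async def invalidate_quietly(router, user_id: str) -> bool:
    """Invalidate the retrieval cache, logging and swallowing any failure.

    Returns True on success so callers can still observe the outcome.
    """
    try:
        await router.invalidate_cache(user_id)
        return True
    except Exception as exc:
        logger.warning(
            "Failed to invalidate retrieval cache for %s: %s", user_id, exc
        )
        return False


async def process(router, user_id: str) -> dict:
    # Document processing would happen here. A cache failure afterwards
    # must not turn a completed document into a failed one.
    await invalidate_quietly(router, user_id)
    return {"status": "completed"}


result = asyncio.run(process(FlakyRouter(), "user-1"))
```

A stale cache is treated as recoverable while a failed upload is not, which matches the warning-level log in the diff.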

src/pipeline/structured_pipeline.py ADDED

	@@ -0,0 +1,91 @@

+"""StructuredPipeline — builds a catalog for DB / tabular sources.
+Steps (per source, end-to-end):
+  1. introspect (caller-supplied — DatabaseIntrospector or TabularIntrospector)
+  2. merge     (replace any existing source with the same source_id)
+  3. validate  (catalog/validator.py)
+  4. upsert    (catalog/store.py)
+LLM-driven enrichment was removed: the planner relies on stats + sample
+rows + column names directly. Source/table/column `description` fields stay
+in the model but are not populated by this pipeline.
+Source-type-agnostic: the caller picks the introspector. Triggers in
+`pipeline/triggers.py` know which one to use based on the upload event.
+"""
+from __future__ import annotations
+from datetime import UTC, datetime
+from typing import TYPE_CHECKING
+from src.catalog.introspect.base import BaseIntrospector
+from src.catalog.models import Catalog, Source
+from src.middlewares.logging import get_logger
+if TYPE_CHECKING:
+    from src.catalog.store import CatalogStore
+    from src.catalog.validator import CatalogValidator
+logger = get_logger("structured_pipeline")
+class StructuredPipeline:
+    """Orchestrates introspect → merge → validate → store.
+    Dependencies are injected (no concrete imports at class-definition time)
+    so tests can pass mocks without constructing Settings or opening DB
+    connections.
+    """
+    def __init__(
+        self,
+        validator: CatalogValidator,
+        store: CatalogStore,
+    ) -> None:
+        self._validator = validator
+        self._store = store
+    async def run(
+        self,
+        introspector: BaseIntrospector,
+        location_ref: str,
+        user_id: str,
+    ) -> Source:
+        source = await introspector.introspect(location_ref)
+        merged = await self._merge_with_existing(user_id, source)
+        self._validator.validate(merged)
+        await self._store.upsert(merged)
+        logger.info(
+            "structured pipeline complete",
+            user_id=user_id,
+            source_id=source.source_id,
+            source_type=source.source_type,
+            tables=len(source.tables),
+        )
+        return source
+    async def _merge_with_existing(self, user_id: str, new_source: Source) -> Catalog:
+        existing = await self._store.get(user_id)
+        now = datetime.now(UTC)
+        if existing is None:
+            return Catalog(user_id=user_id, generated_at=now, sources=[new_source])
+        kept = [s for s in existing.sources if s.source_id != new_source.source_id]
+        return existing.model_copy(
+            update={"sources": [*kept, new_source]}
+        )
+def default_structured_pipeline() -> StructuredPipeline:
+    """Build the production pipeline with default deps.
+    Lazy imports keep `from src.pipeline.structured_pipeline import …` cheap
+    and side-effect-free for tests.
+    """
+    from src.catalog.store import CatalogStore
+    from src.catalog.validator import CatalogValidator
+    return StructuredPipeline(
+        validator=CatalogValidator(),
+        store=CatalogStore(),
+    )
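
Note on the file above: the merge step replaces any existing source with the same `source_id` and appends the new one. The sketch below reproduces that behaviour with frozen dataclasses in place of the Pydantic models, which are not shown in this part of the diff. The field sets of `Source` and `Catalog` here are assumptions trimmed to what the merge needs, and `merge_source` is a hypothetical free-function version of `_merge_with_existing`.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class Source:
    # Assumed minimal shape; the real model carries more fields.
    source_id: str
    tables: tuple[str, ...] = ()


@dataclass(frozen=True)
class Catalog:
    user_id: str
    generated_at: datetime
    sources: tuple[Source, ...] = ()


def merge_source(
    existing: Catalog | None, user_id: str, new_source: Source
) -> Catalog:
    """Return a catalog in which new_source replaces any same-id source."""
    if existing is None:
        return Catalog(
            user_id=user_id,
            generated_at=datetime.now(timezone.utc),
            sources=(new_source,),
        )
    # Drop the stale entry (if any), then append the fresh one last.
    kept = tuple(
        s for s in existing.sources if s.source_id != new_source.source_id
    )
    return replace(existing, sources=(*kept, new_source))


c1 = merge_source(None, "u1", Source("a", ("t1",)))
c2 = merge_source(c1, "u1", Source("b"))
c3 = merge_source(c2, "u1", Source("a", ("t1", "t2")))
```

Re-introspecting source `a` moves it to the end of the list, so source order reflects the most recent refresh rather than the original registration order.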

src/pipeline/triggers.py ADDED

	@@ -0,0 +1,115 @@

+"""Pipeline trigger entry points called from API routes / event handlers.
+These thin functions are what the FastAPI routes invoke; they delegate to the
+appropriate pipeline (StructuredPipeline for DB/tabular, DocumentPipeline for
+unstructured).
+Errors propagate from the pipelines; the caller decides whether to surface
+them as HTTP 4xx/5xx or fail quietly. The triggers themselves do not catch,
+except `on_catalog_rebuild_requested`, which logs per-source failures and continues.
+"""
+from src.middlewares.logging import get_logger
+logger = get_logger("pipeline_triggers")
+async def on_db_registered(database_client_id: str, user_id: str) -> None:
+    """Build a dbclient:// location_ref and run the structured pipeline.
+    Called by `/api/v1/database-clients/{id}/ingest` (after rewiring in a
+    later PR). The DatabaseIntrospector resolves the client_id to a
+    DatabaseClient row, decrypts credentials, connects, and produces a Source.
+    The catalog is then validated and upserted (no LLM enrichment step).
+    """
+    from src.catalog.introspect.database import database_introspector
+    from src.pipeline.structured_pipeline import default_structured_pipeline
+    location_ref = f"dbclient://{database_client_id}"
+    logger.info(
+        "on_db_registered triggered",
+        user_id=user_id,
+        database_client_id=database_client_id,
+    )
+    pipeline = default_structured_pipeline()
+    await pipeline.run(database_introspector, location_ref, user_id)
+async def on_tabular_uploaded(document_id: str, user_id: str) -> None:
+    """Build an az_blob:// location_ref and run the structured pipeline.
+    Called after a CSV/XLSX/Parquet file has been processed and its Parquet
+    blob(s) uploaded. The TabularIntrospector downloads the original blob,
+    profiles each column, and produces a Source. The catalog is then validated
+    and upserted (no LLM enrichment step).
+    """
+    from src.catalog.introspect.tabular import tabular_introspector
+    from src.pipeline.structured_pipeline import default_structured_pipeline
+    location_ref = f"az_blob://{user_id}/{document_id}"
+    logger.info(
+        "on_tabular_uploaded triggered",
+        user_id=user_id,
+        document_id=document_id,
+    )
+    pipeline = default_structured_pipeline()
+    await pipeline.run(tabular_introspector, location_ref, user_id)
+async def on_document_uploaded(document_id: str, user_id: str) -> None:
+    """Process an unstructured document (PDF/DOCX/TXT) through the document pipeline.
+    Opens its own DB session so it can be called from event handlers that
+    don't have an injected session (same pattern as on_tabular_uploaded).
+    """
+    from src.db.postgres.connection import AsyncSessionLocal
+    from src.pipeline.document_pipeline import document_pipeline
+    logger.info("on_document_uploaded triggered", user_id=user_id, document_id=document_id)
+    async with AsyncSessionLocal() as db:
+        await document_pipeline.process(document_id, user_id, db)
+async def on_tabular_deleted(document_id: str, user_id: str) -> None:
+    """Remove a tabular source from the user's catalog when its document is deleted."""
+    from src.catalog.store import CatalogStore
+    logger.info("on_tabular_deleted triggered", user_id=user_id, document_id=document_id)
+    await CatalogStore().remove_source(user_id, source_id=document_id)
+async def on_db_deleted(client_id: str, user_id: str) -> None:
+    """Remove a schema source from the user's catalog when its DB client is deleted."""
+    from src.catalog.store import CatalogStore
+    logger.info("on_db_deleted triggered", user_id=user_id, client_id=client_id)
+    await CatalogStore().remove_source(user_id, source_id=client_id)
+async def on_catalog_rebuild_requested(user_id: str) -> None:
+    """Re-introspect every source in the user's catalog and upsert the result.
+    Iterates all Sources in the current catalog. Each source is re-run through
+    its original trigger (on_db_registered for schema, on_tabular_uploaded for
+    tabular). Per-source failures are logged but do not abort the remaining
+    sources.
+    """
+    from src.catalog.store import CatalogStore
+    catalog = await CatalogStore().get(user_id)
+    if catalog is None:
+        logger.info("no catalog to rebuild", user_id=user_id)
+        return
+    logger.info("on_catalog_rebuild_requested triggered", user_id=user_id, source_count=len(catalog.sources))
+    for source in catalog.sources:
+        try:
+            if source.source_type == "schema":
+                client_id = source.location_ref.split("://")[1]
+                await on_db_registered(client_id, user_id)
+            elif source.source_type == "tabular":
+                document_id = source.location_ref.split("://")[1].split("/")[1]
+                await on_tabular_uploaded(document_id, user_id)
+            else:
+                logger.warning("unsupported source_type for rebuild", source_type=source.source_type, source_id=source.source_id)
+        except Exception as e:
+            logger.error("rebuild failed for source", source_id=source.source_id, source_type=source.source_type, error=str(e))
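
Note on the rebuild loop above: it recovers ids by chaining `split("://")` and `split("/")`, which raises a bare `IndexError` on a malformed `location_ref`. A stricter parser is sketched below, assuming only the two formats visible in this file (`dbclient://{database_client_id}` and `az_blob://{user_id}/{document_id}`). `parse_location_ref` and `rebuild_target` are hypothetical helpers. `str.partition` is used rather than `urllib.parse.urlsplit` because an underscore is not a valid URL scheme character, so `urlsplit` would not recognise `az_blob` as a scheme.

```python
def parse_location_ref(location_ref: str) -> tuple[str, list[str]]:
    """Split 'scheme://a/b' into ('scheme', ['a', 'b'])."""
    scheme, sep, rest = location_ref.partition("://")
    if not sep or not rest:
        raise ValueError(f"malformed location_ref: {location_ref!r}")
    return scheme, rest.split("/")


def rebuild_target(source_type: str, location_ref: str) -> str:
    """Return the id that the rebuild loop would pass to its trigger."""
    scheme, parts = parse_location_ref(location_ref)
    if source_type == "schema":
        # dbclient://{database_client_id}
        if scheme != "dbclient" or len(parts) != 1:
            raise ValueError(f"unexpected schema ref: {location_ref!r}")
        return parts[0]
    if source_type == "tabular":
        # az_blob://{user_id}/{document_id}
        if scheme != "az_blob" or len(parts) != 2:
            raise ValueError(f"unexpected tabular ref: {location_ref!r}")
        return parts[1]
    raise ValueError(f"unsupported source_type: {source_type!r}")


db_id = rebuild_target("schema", "dbclient://42")
doc_id = rebuild_target("tabular", "az_blob://u1/doc9")
```

The per-source `except Exception` in the loop would still catch these `ValueError`s, and the logged message would then name the offending reference.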

src/query/README.md ADDED

	@@ -0,0 +1,11 @@

+# query
+Catalog-driven query subsystem. User question → IR → SQL/pandas → result.
+Subpackages:
+- `ir/` — JSON IR Pydantic models + validator
+- `planner/` — LLM step: question + catalog → IR
+- `compiler/` — deterministic IR → SQL or pandas op chain (no LLM)
+- `executor/` — runs the compiled query against DB or Parquet
+See `ARCHITECTURE.md` (root) for the full design.
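
Note on the README above: the `compiler/` step is described as a deterministic IR → SQL translation with no LLM involved. The actual IR models are not part of this section, so the sketch below invents a deliberately small IR to show what such a compile step could look like. `QueryIR`, `Filter`, `ALLOWED_OPS`, and `compile_to_sql` are all hypothetical, and the `?` placeholder style is an arbitrary choice.

```python
from dataclasses import dataclass

ALLOWED_OPS = frozenset({"=", "!=", "<", "<=", ">", ">="})


@dataclass(frozen=True)
class Filter:
    column: str
    op: str  # must be one of ALLOWED_OPS
    value: object


@dataclass(frozen=True)
class QueryIR:
    # Hypothetical IR shape; the PR's real models live under src/query/ir/.
    table: str
    columns: tuple[str, ...]
    filters: tuple[Filter, ...] = ()
    limit: int = 100


def _quote(identifier: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def compile_to_sql(ir: QueryIR) -> tuple[str, list[object]]:
    """Turn the IR into a parameterised SELECT; same IR, same SQL."""
    select_list = ", ".join(_quote(c) for c in ir.columns) or "*"
    sql = f"SELECT {select_list} FROM {_quote(ir.table)}"
    clauses: list[str] = []
    params: list[object] = []
    for flt in ir.filters:
        if flt.op not in ALLOWED_OPS:
            raise ValueError(f"operator not allowed: {flt.op!r}")
        # Values travel as bound parameters, never as SQL text.
        clauses.append(f"{_quote(flt.column)} {flt.op} ?")
        params.append(flt.value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += f" LIMIT {int(ir.limit)}"
    return sql, params


ir = QueryIR(
    table="orders",
    columns=("id", "amount"),
    filters=(Filter("amount", ">", 100),),
    limit=10,
)
sql, params = compile_to_sql(ir)
```

Keeping the LLM on the planner side means it only ever produces a validated data structure. All SQL text comes from a fixed template, and values are bound rather than interpolated.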

src/query/base.py DELETED

@@ -1,32 +0,0 @@
-"""Shared contract for query executors."""
-from abc import ABC, abstractmethod
-from dataclasses import dataclass, field
-from sqlalchemy.ext.asyncio import AsyncSession
-from src.rag.base import RetrievalResult
-@dataclass
-class QueryResult:
-    source_type: str        # "database" or "document"
-    source_id: str          # database_client_id or document_id
-    table_or_file: str
-    columns: list[str]
-    rows: list[dict]
-    row_count: int
-    metadata: dict = field(default_factory=dict)
-    # metadata should include "column_types": {"col_name": "dtype"} when available
-class BaseExecutor(ABC):
-    @abstractmethod
-    async def execute(
-        self,
-        results: list[RetrievalResult],
-        user_id: str,
-        db: AsyncSession,
-        question: str,
-        limit: int = 100,
-    ) -> list[QueryResult]: ...