feat/Analysis State & Report Rework

#4
.gitignore CHANGED
@@ -53,6 +53,5 @@ migratego/
53
  docs/specs/tabular_parquet_contract.md
54
  docs/specs/tabular_parquet.md
55
 
56
- # Personal / local working docs (not for the shared repo)
57
- AGENT_ARCHITECTURE_CONTEXT_new.md
58
- PROJECT_SUMMARY.md
56
+ # Personal / local working docs (not for the shared repo) — archived out of root
57
+ docs/_archive/
 
API_ENDPOINTS.md ADDED
@@ -0,0 +1,373 @@
1
+ # Data Eyond — Python Agentic Service: FE-Callable API (for Go integration)
2
+
3
+ **Audience:** Harry (Go gateway) wiring the FE → Go → Python surface.
4
+ **Scope:** the **4 FE-callable surfaces** the Python service exposes after the 2026-06-24 pivot
5
+ (DEV_PLAN decision #6). Everything else under `/api/v1` is internal / Phase-1 legacy / Go-owned —
6
+ see [§7](#7-not-fe-facing) and the full inventory in [§9](#9-appendix--complete-endpoint-inventory-all-registered-routes).
7
+ **Branch:** `pr/4` · **Snapshot:** 2026-06-25 · **Companion:** [REPO_STATUS.md](REPO_STATUS.md).
8
+
9
+ > Request flow is **FE → Go → Python**. The FE never calls Python directly except for chat
10
+ > streaming. Auth/JWT is terminated at the Go gateway; Python receives `user_id` / `room_id` as
11
+ > **trusted inputs** and does no auth of its own.
12
+
13
+ ---
14
+
15
+ ## 1. The 4 FE-callable surfaces
16
+
17
+ | # | Logical name | HTTP | How it's invoked |
18
+ |---|---|---|---|
19
+ | 1 | **`call_agent`** | `POST /api/v1/chat/stream` | The one streaming chat call. Router classifies + dispatches. |
20
+ | 2 | **`list_skills`** | `GET /api/v1/tools` | Static slash-command catalog for the FE "/" menu. Cacheable. |
21
+ | 3 | **skill: `help`** | *(via `call_agent`)* | **No dedicated endpoint** — the router resolves it to the `help` intent inside `/chat/stream`. |
22
+ | 4 | **skill: `report`** | `POST /api/v1/report` (+ 2 `GET`s) | Dedicated REST API. **Not** through `/chat/stream`. |
23
+
24
+ **Key consequence for Go:** the two catalog skills are invoked **differently**. `/help` goes through
25
+ `/chat/stream`; `/report` is a direct REST call to the Report API. The catalog's `name` field is the
26
+ internal route key (`help` = router intent; `report` = the Report API), not a uniform dispatch key.
27
+
28
+ **Conventions:**
29
+ - Base path: `/api/v1`.
30
+ - **`room_id == analysis_id`** — one chat room == one analysis session (#9). Callers pass `room_id`
31
+ to chat; it *is* the `analysis_id` used by the report API.
32
+ - Streaming uses **SSE** (`text/event-stream`, `sse-starlette`).
33
+
34
+ ---
35
+
36
+ ## 2. `call_agent` — `POST /api/v1/chat/stream`
37
+
38
+ The only FE→Python call in normal operation. Source: [chat.py:169](src/api/v1/chat.py:169).
39
+
40
+ **Request body** (`application/json`) — `ChatRequest`:
41
+
42
+ ```json
43
+ {
44
+ "user_id": "u_1a2b3c",
45
+ "room_id": "room_42",
46
+ "message": "What were total sales by region last quarter?"
47
+ }
48
+ ```
49
+
50
+ `room_id` is the analysis session id. No auth header (handled by Go).
51
+
52
+ **Response:** `text/event-stream`. Events arrive in this order; `error` is terminal and, when emitted, ends the stream in place of `done`:
53
+
54
+ | `event:` | `data:` payload | Notes |
55
+ |---|---|---|
56
+ | `sources` | JSON array of source refs | `{document_id, filename, page_label}`. Structured: one per executed table (`document_id = "{user_id}_{table}"`, `page_label = null`). Unstructured: deduped doc/page. `chat`/`help`/`error`: `[]`. |
57
+ | `status` | text | **Slow-path only** — progress pings ("Planning…", "Running N steps…"). Keeps the SSE alive; safe to surface or ignore. |
58
+ | `chunk` | text fragment | Concatenate in order to form the answer. |
59
+ | `done` | *(empty)* | End of stream. |
60
+ | `error` | text | Terminal error; stream stops after this. |
61
+
62
+ > The handler also emits an internal `intent` event — it is **consumed inside Python** (gates
63
+ > caching) and **not forwarded** to the client. Go/FE will never see it.
64
+
65
+ **Example — `structured_flow` answer** (raw SSE wire; blank line separates events). Source shape:
66
+ [chat_handler.py:607](src/agents/chat_handler.py:607).
67
+
68
+ ```
69
+ event: sources
70
+ data: [{"document_id":"u_1a2b3c_orders","filename":"orders","page_label":null}]
71
+
72
+ event: status
73
+ data: Planning analysis…
74
+
75
+ event: status
76
+ data: Running 3 steps…
77
+
78
+ event: chunk
79
+ data: Total sales by region last quarter:
80
+
81
+ event: chunk
82
+ data: Central led at $1.21M (38%), East $0.74M, West $0.55M (down 12% QoQ).
83
+
84
+ event: done
85
+ data:
86
+ ```
87
+
88
+ **Example — simple `chat` reply** (no status pings, empty sources):
89
+
90
+ ```
91
+ event: sources
92
+ data: []
93
+
94
+ event: chunk
95
+ data: I'm your AI data analyst — connect a source or ask a question to get started.
96
+
97
+ event: done
98
+ data:
99
+ ```
100
+
101
+ **Behavior worth knowing for integration:**
102
+ - **Redis response cache** (1h TTL) is applied to the stateless `chat` intent only; cached replies
103
+ replay as `sources`/`chunk`/`done`.
104
+ - **Greeting/farewell fast-path** returns a canned reply with no LLM call.
105
+ - The LLM **router** classifies every message into one of **5 intents** —
106
+ `chat` · `help` · `check` · `unstructured_flow` · `structured_flow` — and dispatches. Messages
107
+ persist (user + assistant) on `done`.
108
+
109
+ ---
110
+
111
+ ## 3. `list_skills` — `GET /api/v1/tools`
112
+
113
+ Static, deterministic, **safe for Go to cache**. Source: [tools.py:133](src/api/v1/tools.py:133).
114
+
115
+ **Request:** none (no params, no body).
116
+
117
+ **Response** `200` (`ListToolsResponse`):
118
+
119
+ ```json
120
+ {
121
+ "count": 2,
122
+ "tools": [
123
+ { "command": "/help", "name": "help", "type": "skill",
124
+ "description": "Show what the assistant can do and guide your next step." },
125
+ { "command": "/report", "name": "report", "type": "skill",
126
+ "description": "Generate a versioned analysis report (background, EDA, key findings, insights)." }
127
+ ]
128
+ }
129
+ ```
130
+
131
+ `CommandResponse` = `{ command, name, type, description }`, `type ∈ {skill, analytics, data_access}`.
132
+ Post-KM-678 the catalog is **`/help` + `/report` only**; the `analyze_*`, `check_*`, `retrieve_*`
133
+ and retired `/problem-statement` entries are commented out (kept for restorability), not deleted.
134
+
135
+ ---
136
+
137
+ ## 4. skill: `help` — via `call_agent`
138
+
139
+ **There is no `/help` endpoint.** The FE "/" menu surfaces `/help`; to invoke it, call
140
+ `POST /api/v1/chat/stream` and let the router classify the message as the `help` intent
141
+ ([chat_handler.py:363](src/agents/chat_handler.py:363)). Help streams `chunk` events (same SSE
142
+ shape as §2, with `sources: []` and no `status` pings) — a state-aware, next-step guidance reply.
143
+
144
+ ```
145
+ event: sources
146
+ data: []
147
+
148
+ event: chunk
149
+ data: Your goal is set — you can start exploring now. Try a question like "average order value by month", then I can generate a report.
150
+
151
+ event: done
152
+ data:
153
+ ```
154
+
155
+ > **Open integration question (for Harry):** the Python `/chat/stream` contract has **no
156
+ > forced-intent / slash-bypass param** — `handle()` always routes via the LLM classifier. So
157
+ > deterministic `/help` dispatch depends on either (a) Go forwarding the literal slash text and
158
+ > trusting the router to classify it as `help`, or (b) adding a forced-intent input to the chat
159
+ > contract. The `tools.py` docstring's "slash invocation bypasses the router to the tool directly"
160
+ > is **not yet true on the Python side.** Needs a decision. (DEV_PLAN #8/#18.)
161
+
162
+ ---
163
+
164
+ ## 5. skill: `report` — Report API
165
+
166
+ Dedicated REST surface (the "Generate Report" button), **not** a chat route.
167
+ Source: [report.py](src/api/v1/report.py).
168
+
169
+ ### `POST /api/v1/report`
170
+ Generate, persist, and return a new report **version**.
171
+
172
+ **Query params:** `analysis_id` (required), `user_id` (required). No request body.
173
+
174
+ ```
175
+ POST /api/v1/report?analysis_id=room_42&user_id=u_1a2b3c
176
+ ```
177
+
178
+ | Status | Meaning |
179
+ |---|---|
180
+ | `201` | New version generated → `AnalysisReport` body. |
181
+ | `409` | Floor not met — **no recorded analyses yet** for this session, nothing to report. |
182
+ | `500` | Generation or persistence failed. |
183
+
184
+ **`201` response** (`AnalysisReport`):
185
+
186
+ ```json
187
+ {
188
+ "report_id": "8f3a2b1c9d4e4f6a8b0c1d2e3f4a5b6c",
189
+ "analysis_id": "room_42",
190
+ "user_id": "u_1a2b3c",
191
+ "version": 2,
192
+ "generated_at": "2026-06-25T09:14:33.512Z",
193
+ "problem_statement": {
194
+ "objective": "Understand which regions drive revenue and why Q1 dipped.",
195
+ "business_questions": [
196
+ "Which regions contribute most to total revenue?",
197
+ "Did any region decline quarter-over-quarter?"
198
+ ]
199
+ },
200
+ "record_ids": ["rec_a1", "rec_b2"],
201
+ "executive_summary": "Revenue is concentrated in the Central region (38% of total). The West was the only region to contract, down 12% QoQ — the main driver of the Q1 dip.",
202
+ "findings": [
203
+ { "text": "Central region contributed 38% of total revenue, the largest share.",
204
+ "record_ids": ["rec_a1"], "supporting_data": null },
205
+ { "text": "West region revenue fell 12% quarter-over-quarter.",
206
+ "record_ids": ["rec_b2"], "supporting_data": null }
207
+ ],
208
+ "caveats": [
209
+ { "text": "March data for the East region was partially missing (~6% of rows).",
210
+ "record_ids": ["rec_b2"] }
211
+ ],
212
+ "open_questions": [
213
+ { "text": "What drove the West region's QoQ decline?", "record_ids": ["rec_b2"] }
214
+ ],
215
+ "data_sources": [
216
+ { "source_id": "src_sales_db", "name": "orders", "source_type": "postgres",
217
+ "detail": { "tables": ["orders"], "row_count": 48213,
218
+ "columns": ["region", "amount", "ordered_at"] } }
219
+ ],
220
+ "method_steps": [
221
+ { "task_id": "t1", "stage": "data_understanding", "objective": "Inventory the sales source",
222
+ "status": "success", "tools_used": ["check_data"] },
223
+ { "task_id": "t2", "stage": "modeling", "objective": "Aggregate revenue by region",
224
+ "status": "success", "tools_used": ["analyze_aggregate"] }
225
+ ],
226
+ "rendered_markdown": "# Analysis Report\n\n*Generated 2026-06-25 by u_1a2b3c · 2 analyses · 1 source(s)*\n\n## Objective\nUnderstand which regions drive revenue…\n\n## Key Findings\n1. Central region contributed 38%…"
227
+ }
228
+ ```
229
+
230
+ **`409` response** (floor not met — the demo's most common error):
231
+
232
+ ```json
233
+ { "detail": "Not ready to generate a report — still needs at least one completed analysis." }
234
+ ```
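The status table and the two bodies above suggest a simple mapping on the caller side. The sketch below is illustrative only: `interpret_report_response` and the `outcome` labels are hypothetical, and the shape of a `500` body is not specified here, so it is read defensively.

```python
import json


def interpret_report_response(status: int, body: str) -> dict:
    """Classify a POST /api/v1/report response by its status code."""
    try:
        payload = json.loads(body) if body.strip() else {}
    except json.JSONDecodeError:
        payload = {}  # e.g. a non-JSON error page
    if not isinstance(payload, dict):
        payload = {}
    if status == 201:
        # New version generated: body is an AnalysisReport.
        return {
            "outcome": "created",
            "version": payload["version"],
            "markdown": payload["rendered_markdown"],
        }
    if status == 409:
        # Floor not met: no recorded analyses yet for this session.
        return {"outcome": "not_ready", "detail": payload.get("detail", "")}
    if status == 500:
        return {"outcome": "failed", "detail": payload.get("detail", "")}
    raise ValueError(f"unexpected status {status}")
```

Treating `409` as a distinct "not ready" outcome rather than a failure lets the caller prompt the user to run an analysis first.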
235
+
236
+ > ⚠️ **Demo/integration precondition:** `AnalysisRecord`s persist **only on the slow path**, so
237
+ > reports require **`enable_slow_path=true`** on the Python deployment *and* ≥1 prior
238
+ > `structured_flow` question in the session. With slow path off, `POST /report` returns **`409` by design**;
238
+ > this is not a bug. (DEV_PLAN #15/#16.)
240
+
241
+ ### `GET /api/v1/report/{analysis_id}`
242
+ List a session's report versions (oldest-first). Returns `[ReportVersionEntry]`; `[]` if none.
243
+
244
+ ```json
245
+ [
246
+ { "report_id": "1b2c3d4e…", "version": 1, "generated_at": "2026-06-24T15:02:11Z", "record_count": 1 },
247
+ { "report_id": "8f3a2b1c…", "version": 2, "generated_at": "2026-06-25T09:14:33Z", "record_count": 2 }
248
+ ]
249
+ ```
250
+
251
+ ### `GET /api/v1/report/{analysis_id}/{version}`
252
+ Fetch one version → `AnalysisReport` (same shape as the `POST` 201 body above); `404` if that
253
+ version doesn't exist.
254
+
255
+ ```json
256
+ { "detail": "No report v3 for analysis 'room_42'." }
257
+ ```
258
+
259
+ ---
260
+
261
+ ## 6. Schemas
262
+
263
+ **`AnalysisReport`** (POST + GET-version body):
264
+
265
+ | Field | Type | Notes |
266
+ |---|---|---|
267
+ | `report_id` | str | |
268
+ | `analysis_id` | str | == `room_id` |
269
+ | `user_id` | str \| null | |
270
+ | `version` | int | monotonic V1, V2, … |
271
+ | `generated_at` | datetime | ISO 8601, UTC |
272
+ | `problem_statement` | `{ objective: str, business_questions: string[] }` | the frozen goal snapshot (new pivot shape) |
273
+ | `record_ids` | string[] | records the version was built from |
274
+ | `executive_summary` | str | the **only** LLM-authored field |
275
+ | `findings` | `ReportFinding[]` | `{ text, record_ids[], supporting_data? }` |
276
+ | `caveats` | `AttributedNote[]` | `{ text, record_ids[] }` |
277
+ | `open_questions` | `AttributedNote[]` | `{ text, record_ids[] }` |
278
+ | `data_sources` | `DataSourceRef[]` | `{ source_id, name, source_type, detail }` |
279
+ | `method_steps` | `TaskSummary[]` | `{ task_id, stage, objective, status, tools_used[] }`; `stage` ∈ CRISP-DM phases |
280
+ | `rendered_markdown` | str | the full rendered report |
281
+
282
+ > **Persistence caveat:** dedorch `reports` stores **markdown only**. On read-back via the `GET`
283
+ > endpoints, the structured fields above come back **empty** and `rendered_markdown` is the source of
284
+ > truth. (REPO_STATUS §5.)
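A reader of the `GET` endpoints could guard against this caveat as sketched below. `select_render_source` is a hypothetical helper, and the exact list of fields that come back empty is an assumption drawn from the schema table.

```python
# Assumption: these are the structured list fields that may be empty on read-back.
STRUCTURED_FIELDS = ("findings", "caveats", "open_questions", "data_sources", "method_steps")


def select_render_source(report: dict) -> str:
    """Return "structured" when any structured field is populated, else "markdown"."""
    if any(report.get(field) for field in STRUCTURED_FIELDS):
        return "structured"
    return "markdown"
```

With this check, a fresh `POST` response can be rendered from its structured fields, while a read-back version falls through to `rendered_markdown`.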
285
+
286
+ **`ReportVersionEntry`** (GET-list rows): `{ report_id, version, generated_at, record_count }`.
287
+
288
+ ---
289
+
290
+ ## 7. Not FE-facing
291
+
292
+ Registered under `/api/v1` but **not** part of the FE→Python surface — do not wire these from the FE:
293
+
294
+ - **Analysis CRUD** — `POST /analysis/create`, `GET /analysis`, `GET /analysis/{id}`. Intended to
295
+ move behind Go (state writes via Go, per decision #5/#18). Router still **mounted** (Go may use it);
296
+ the FE should not call it.
297
+ - **`check_data` / `check_knowledge`** — served by **Go**, not surfaced as Python FE endpoints.
298
+ - **Chat cache management** — `DELETE /chat/cache`, `/chat/cache/room/{id}`, `/retrieval/cache/{user_id}`
299
+ (ops/internal).
300
+ - **Phase-1 legacy routers** — `users`, `room`, `document`, `db_client`, `data_catalog`
301
+ (functionally migrated to Go; mostly dormant).
302
+ - **Health/root** — `GET /`, `GET /health` (liveness only).
303
+
304
+ ---
305
+
306
+ ## 8. Open items affecting this contract
307
+
308
+ 1. **`/help` dispatch mechanism** — router-classify vs. forced-intent param (§4). *(DEV_PLAN #8/#18)*
309
+ 2. **`/report` needs `enable_slow_path=true`** + a prior `structured_flow` question, else 409.
310
+ *(DEV_PLAN #15)*
311
+ 3. **`analysis_records` home** post-`SKIP_INIT_DB` cutover — the report API depends on this table
312
+ existing. *(DEV_PLAN #14/#16)*
313
+ 4. **Analysis-state writes** — once Go owns creation + state writes, Python's per-turn state
314
+ `ensure` becomes a read-only get (Go must guarantee the row exists before any chat turn).
315
+ *(DEV_PLAN #18)*
316
+
317
+ ---
318
+
319
+ ## 9. Appendix — complete endpoint inventory (all registered routes)
320
+
321
+ Every route mounted in [main.py](main.py), so task #8 can be decided against the full picture.
322
+ **32 routes**: 30 across 9 routers plus 2 app-level. Status legend:
323
+ **✅ FE-callable** (one of the 4 surfaces — keep) · **✂️ comment out** (task #8 target) ·
324
+ **🟦 legacy → Go** (Phase-1, functionally migrated; not FE→Python; mostly dormant) ·
325
+ **⚙️ internal/ops**.
326
+
327
+ | Method | Path | Purpose | Router | Status |
328
+ |---|---|---|---|---|
329
+ | POST | `/api/v1/chat/stream` | Main chat SSE — **`call_agent`**; carries chat/help/check/structured/unstructured intents | Chat | ✅ FE-callable (#1, +help #3) |
330
+ | GET | `/api/v1/tools` | Slash-command catalog — **`list_skills`** (Go caches) | Tools | ✅ FE-callable (#2) |
331
+ | POST | `/api/v1/report` | Generate a report version | Report | ✅ FE-callable (#4) |
332
+ | GET | `/api/v1/report/{analysis_id}` | List report versions | Report | ✅ FE-callable (#4) |
333
+ | GET | `/api/v1/report/{analysis_id}/{version}` | Fetch one report version | Report | ✅ FE-callable (#4) |
334
+ | POST | `/api/v1/analysis/create` | Create session (state + room + bindings) | Analysis | ✂️ comment (#8 → Go) |
335
+ | GET | `/api/v1/analysis` | List a user's analyses | Analysis | ✂️ comment (#8) |
336
+ | GET | `/api/v1/analysis/{analysis_id}` | Get one session's state + sources | Analysis | ✂️ comment (#8) |
337
+ | DELETE | `/api/v1/chat/cache` | Clear one cached reply | Chat | ⚙️ internal/ops |
338
+ | DELETE | `/api/v1/chat/cache/room/{room_id}` | Clear a room's cache | Chat | ⚙️ internal/ops |
339
+ | DELETE | `/api/v1/retrieval/cache/{user_id}` | Clear a user's retrieval cache | Chat | ⚙️ internal/ops |
340
+ | GET | `/` | Service status | (app) | ⚙️ internal/ops |
341
+ | GET | `/health` | Liveness probe | (app) | ⚙️ internal/ops |
342
+ | POST | `/api/login` | Login by email + password ⚠️ mounted at `/api`, **not** `/api/v1` | Users | 🟦 legacy → Go |
343
+ | GET | `/api/v1/documents/doctypes` | Supported document types | Documents | 🟦 legacy → Go |
344
+ | GET | `/api/v1/documents/{user_id}` | List a user's documents | Documents | 🟦 legacy → Go |
345
+ | POST | `/api/v1/document/upload` | Upload a document (10/min) | Documents | 🟦 legacy → Go |
346
+ | DELETE | `/api/v1/document/delete` | Delete a document | Documents | 🟦 legacy → Go |
347
+ | POST | `/api/v1/document/process` | Process / ingest a document | Documents | 🟦 legacy → Go |
348
+ | GET | `/api/v1/rooms/{user_id}` | List a user's rooms | Rooms | 🟦 legacy → Go |
349
+ | GET | `/api/v1/room/{room_id}` | Get one room | Rooms | 🟦 legacy → Go |
350
+ | DELETE | `/api/v1/room/{room_id}` | Delete a room | Rooms | 🟦 legacy → Go |
351
+ | POST | `/api/v1/room/create` | Create a room | Rooms | 🟦 legacy → Go |
352
+ | GET | `/api/v1/data-catalog/{user_id}` | List catalog index | Data Catalog | 🟦 legacy → Go |
353
+ | POST | `/api/v1/data-catalog/rebuild` | Rebuild a user's catalog | Data Catalog | 🟦 legacy → Go |
354
+ | GET | `/api/v1/database-clients/dbtypes` | Supported DB types | Database Clients | 🟦 legacy → Go |
355
+ | POST | `/api/v1/database-clients` | Create a DB connection | Database Clients | 🟦 legacy → Go |
356
+ | GET | `/api/v1/database-clients/{user_id}` | List a user's DB connections | Database Clients | 🟦 legacy → Go |
357
+ | GET | `/api/v1/database-clients/{user_id}/{client_id}` | Get one DB connection | Database Clients | 🟦 legacy → Go |
358
+ | PUT | `/api/v1/database-clients/{client_id}` | Update a DB connection | Database Clients | 🟦 legacy → Go |
359
+ | DELETE | `/api/v1/database-clients/{client_id}` | Delete a DB connection | Database Clients | 🟦 legacy → Go |
360
+ | POST | `/api/v1/database-clients/{client_id}/ingest` | Build the catalog for a DB connection | Database Clients | 🟦 legacy → Go |
361
+
362
+ **Tally:** 5 ✅ FE-callable · 3 ✂️ to comment (#8) · 19 🟦 legacy→Go · 5 ⚙️ internal/ops.
363
+
364
+ **Task #8 reading:**
365
+ - **Keep exposed:** the 5 ✅ rows (`chat/stream`, `/tools`, the 3 `report` routes). `help` rides on
366
+ `chat/stream` — no route of its own.
367
+ - **Comment out (the #8 to-do):** the 3 `analysis` routes — analysis CRUD moves behind Go (#5/#18).
368
+ - **`check_data` is not an HTTP endpoint** — it's the `check` router intent (runs inside
369
+ `chat/stream`) plus its now-commented slash-catalog entry (KM-678); Go serves it to the FE. So
370
+ "comment check_data" = the catalog line (done) + don't expose a Python route (there isn't one).
371
+ - The 19 🟦 routes (in the `users`, `document`, `room`, `data_catalog`, and `db_client` routers) are Phase-1 legacy,
372
+ already functionally in Go (REPO_STATUS §7). They're out of the FE→Python path but **still
373
+ mounted** — a separate cleanup from #8's analysis-CRUD scope.
ARCHITECTURE.md DELETED
@@ -1,353 +0,0 @@
1
- # Architecture — Data Eyond Agentic Service
2
-
3
- **Last updated**: 2026-05-20
4
- **Status**: Phase 2 catalog path shipped; document ingestion has moved to a separate Go service. The long-term split is **Python = agent/ML layer, Go = data plane**; this document covers the Python side only.
5
-
6
- ---
7
-
8
- ## Product vision (north star)
9
-
10
- Data Eyond is an *AI data scientist* for business analytics, structured around **CRISP-DM** (Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment). Targets executives doing self-serve deep-dives and data analysts/scientists offloading routine work.
11
-
12
- Envisioned user flow: **interview agent** captures goal → user connects data sources → asks natural-language question → CRISP-DM-structured analytical response, exportable as a **presentation** or **notebook-style report**.
13
-
14
- The catalog-driven, IR-based architecture documented below is the *foundation*. The next architectural evolution is an agentic layer (analytical planner, per-stage CRISP-DM agents, evaluator, reporter) that consumes the existing IntentRouter → QueryPlanner → Executor → ChatbotAgent spine as its tool layer. See `REPO_CONTEXT.md` → *Roadmap — agentic evolution* for the target agent topology.
15
-
16
- ---
17
-
18
- ## TL;DR
19
-
20
- A catalog-driven AI service for data analysis. Users upload documents and register databases or tabular files; they ask natural-language questions and get answers grounded in their data.
21
-
22
- The architecture has two paths:
23
-
24
- - **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (the right primitive for free-form text). **Ingestion is handled by a separate Go service**; this Python service reads embeddings from PGVector at query time.
25
- - **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a structured **JSON intermediate representation (IR)** of the user's intent; a deterministic compiler turns the IR into SQL or pandas operations.
26
-
27
- The LLM produces *intent*, not query syntax. Deterministic code does the rest.
28
-
29
- ---
30
-
31
- ## 1. Why catalog-driven design
32
-
33
- For a database or spreadsheet, a user's question maps to *known tables and columns* — not to *similar text fragments*. Treating structured data with the same retrieval primitive as prose (chunk + embed + rank top-K) makes the right column survive a probabilistic ranking lottery. Catalog-based **lookup** is the right primitive instead.
34
-
35
- A central per-user catalog also means:
36
-
37
- - One place to keep table/column descriptions (AI-generated, refreshed when the source changes).
38
- - The query planner sees the user's full data landscape in a single prompt.
39
- - Schema stays stable across user sessions without hitting the source DB on every query.
40
- - New sources auto-update the catalog without re-embedding chunks.
41
-
42
- ---
43
-
44
- ## 2. Source taxonomy
45
-
46
- ```
47
- Sources
48
- ├── Unstructured (pdf, docx, txt) → Cu (prose chunks via DocumentRetriever)
49
- └── Structured
50
- ├── Schema (DB) → Cs (DB tables + columns)
51
- └── Tabular (xlsx, csv, parquet) → Ct (sheets + columns)
52
- Cs ∪ Ct = Data Catalog Context
53
- ```
54
-
55
- - **Cu** = unstructured prose context. Retrieval primitive: dense similarity over chunks.
56
- - **Cs** = DB schema context (tables, columns, descriptions, sample values).
57
- - **Ct** = tabular file context (sheets, columns, descriptions, sample values).
58
- - **Data Catalog Context** = `Cs ∪ Ct`. Passed to the query planner as a single unified view.
59
-
60
- DB vs tabular is **not** a routing concern — it's a per-source attribute (`source_type`) on each catalog entry. The split only matters at execution time (SQL vs pandas).
61
-
62
- ---
63
-
64
- ## 3. Routing model
65
-
66
- > **Superseded 2026-06-18** — the 3-way `source_hint` below was reworked into a flat **6-intent** handler router (`chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, `structured_flow`). Modality (structured vs unstructured *data*) is now the Planner's job, not the router's. See `ORCHESTRATOR_REWORK_PLAN.md`.
67
-
68
- ```
69
- source_hint ∈ { chat, unstructured, structured }
70
- ```
71
-
72
- - `chat` — no search, conversational reply only
73
- - `unstructured` — DocumentRetriever path (Cu)
74
- - `structured` — catalog-driven path (Cs ∪ Ct → planner → compiler → executor)
75
-
76
- The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees both Cs and Ct in one prompt.
77
-
78
- ---
79
-
80
- ## 4. Core architectural decisions
81
-
82
- ### 4.1 Catalog as primary context, not retrieval
83
-
84
- For most users (≤50 tables), the entire catalog fits in ~3-5k tokens and is passed verbatim to the planner. No vector search, no BM25, no chunk retrieval. The LLM reads the whole catalog and picks the right table.
85
-
86
- When a user has hundreds of tables, **catalog-level retrieval** (BM25 + table-level vectors with RRF) can be added as a slicer between `CatalogReader` and `Planner`. Deferred until measurably needed.
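For reference, reciprocal rank fusion (RRF) merges ranked lists by summing `1 / (k + rank)` per item. The sketch below shows the standard formula only; the function name, the table names, and `k = 60` (a commonly used default) are illustrative and not taken from this codebase.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list using reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical table-level rankings from a BM25 pass and a vector pass.
bm25_hits = ["orders", "customers", "refunds"]
vector_hits = ["orders", "payments", "customers"]
fused = rrf([bm25_hits, vector_hits])
```

A table that ranks well in both lists rises to the top, while tables seen by only one retriever still survive with a lower score.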
87
-
88
- ### 4.2 JSON IR over raw SQL
89
-
90
- The planner LLM emits a structured JSON IR describing query intent — not a SQL string. A deterministic compiler turns the IR into SQL (per dialect) or pandas/polars operations.
91
-
92
- Benefits:
93
-
94
- - Validatable with Pydantic before execution
95
- - Compiler whitelists allowed operations (no DROP, DELETE, etc.)
96
- - Portable: same IR → SQL (any dialect) / pandas / polars
97
- - Cheaper tokens, easier to debug, trivially testable without an LLM
98
- - LLM cannot emit valid-but-wrong SQL syntax
99
-
100
- ### 4.3 Deterministic compiler, not LLM SQL writer
101
-
102
- The LLM produces *intent* (the IR). All actual query construction is deterministic Python. Compiler bugs are reproducible and fixable. Same IR always produces the same query.
103
-
104
- ### 4.4 Pipeline stage isolation
105
-
106
- Each stage is its own module with typed input and typed output. No god classes. Stages: `IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`. Each is testable in isolation.
107
-
108
- ### 4.5 Minimal LLM surface
109
-
110
- LLM calls happen in exactly three places (KM-557 removed `CatalogEnricher`; ingestion is now LLM-free — the planner reads column names, stats, and sample rows directly):
111
-
112
- 1. **`IntentRouter`** — once per user message
113
- 2. **`QueryPlanner`** — once per structured query (produces the IR)
114
- 3. **`ChatbotAgent`** — once per answer (formats the response)
115
-
116
- Compiler and executors are pure code. No LLM in the hot path of query construction.
117
-
118
- ---
119
-
120
- ## 5. End-to-end flow
121
-
122
- ### Ingestion (when user uploads a file or connects a DB)
123
-
124
- ```
125
- Structured sources (DB connect / XLSX / CSV / Parquet upload) — Python:
126
- source upload / DB connect
127
-
128
- introspect schema (DB: information_schema; tabular: file headers + sample rows)
129
-
130
- validate (Pydantic)
131
-
132
- write to catalog store (Postgres jsonb in `data_catalog`, keyed by user_id)
133
- ```
134
-
135
- **Unstructured ingestion (PDF / DOCX / TXT) is handled by a separate Go service**, which writes chunks + embeddings into the `documents` collection in PGVector. The Python service does not own this path — it reads only.
136
-
137
- ### Query (per user message)
138
-
139
- ```
140
- User message
141
-
142
- Chat cache check (Redis, 24h TTL)
143
- ↓ miss
144
- Load chat history
145
-
146
- IntentRouter LLM → needs_search? source_hint?
147
-
148
- ├── chat → ChatbotAgent → SSE stream
149
- ├── unstructured → DocumentRetriever (raw SQL: pgvector `<=>` cosine or `<+>` manhattan) → answerer
150
- └── structured →
151
- CatalogReader (load full Cs ∪ Ct for user)
152
-
153
- QueryPlanner LLM → JSON IR
154
-
155
- IRValidator (Pydantic + columns-exist + ops whitelist)
156
-
157
- QueryCompiler → SQL (schema source) or pandas (tabular source)
158
-
159
- QueryExecutor (DbExecutor or TabularExecutor)
160
-
161
- QueryResult
162
-
163
- ChatbotAgent → SSE stream
164
- ```
165
-
166
- ---
167
-
168
- ## 6. Data catalog
169
-
170
- ### Storage
171
-
172
- Per-user JSON document, stored as a `jsonb` row in Postgres keyed by `user_id`.
173
-
174
- ### Schema (initial scope)
175
-
176
- ```
177
- Catalog
178
- ├── user_id, schema_version, generated_at
179
- └── sources[]
180
- └── Source
181
- ├── source_id, source_type, name, description, location_ref, updated_at
182
- └── tables[]
183
- └── Table
184
- ├── table_id, name, description, row_count
185
- └── columns[]
186
- └── Column
187
- ├── column_id, name, data_type, description
188
- ├── nullable
189
- ├── pii_flag
190
- ├── sample_values[]
191
- └── stats: { min, max, distinct_count } | null
192
- ```
193
-
194
- ### Best-practice fields deferred
195
-
196
- `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`. Add when justified by user need.
197
-
198
- ### Stable IDs
199
-
200
- `source_id`, `table_id`, `column_id` are stable internal references. `name` fields can change (e.g. column rename in source DB) without invalidating cached IRs.
201
-
202
- ### PII handling
203
-
204
- Columns with `pii_flag: true` have `sample_values: null` — real values never enter LLM prompts. Auto-detected at ingestion via name patterns + value regex.
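The PII rule could look roughly like the following. The patterns are placeholders chosen for illustration (the actual name patterns and value regexes are not given here), and `flag_column` is a hypothetical helper.

```python
import re

# Assumed patterns, for illustration only.
PII_NAME = re.compile(r"(email|phone|ssn|passport)", re.IGNORECASE)
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def flag_column(name: str, sample_values: list) -> dict:
    """Flag a column as PII by name or by value shape, and drop its samples if flagged."""
    by_name = bool(PII_NAME.search(name))
    by_value = any(isinstance(v, str) and EMAIL_VALUE.match(v) for v in sample_values)
    pii = by_name or by_value
    return {
        "name": name,
        "pii_flag": pii,
        # Flagged columns never carry real values into an LLM prompt.
        "sample_values": None if pii else list(sample_values),
    }
```

Checking values as well as names catches columns whose name gives no hint, such as a `contact` column that holds email addresses.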
205
-
206
- ---
207
-
208
- ## 7. JSON IR
209
-
210
- ### Schema (initial scope)
211
-
212
- ```
213
- QueryIR
214
- ├── ir_version : "1.0"
215
- ├── source_id : str (references catalog)
216
- ├── table_id : str (references catalog)
217
- ├── select[] : SelectItem
218
- │ ├── { kind: "column", column_id, alias? }
219
- │ └── { kind: "agg", fn, column_id?, alias? }
220
- ├── filters[] : { column_id, op, value, value_type }
221
- ├── group_by[] : column_id
222
- ├── order_by[] : { column_id | alias, dir }
223
- └── limit : int | null
224
- ```
225
-
226
- ### Whitelisted operators
227
-
228
- ```
229
- Filter ops: = != < <= > >= in not_in is_null is_not_null like between
230
- Agg fns: count count_distinct sum avg min max
231
- ```
232
-
233
- ### Validation rules (enforced before execution)
234
-
235
- - `source_id` exists in catalog for this user
236
- - `table_id` belongs to that source
237
- - Every `column_id` exists in that table
238
- - Every `agg.fn` and `filter.op` is whitelisted
239
- - `value_type` consistent with column's `data_type`
240
- - `limit` positive int, ≤ hard cap (e.g. 10000)
241
-
242
- If any rule fails → reject IR → re-prompt planner with error context (max 3 retries).
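The validation rules and retry loop above can be sketched with plain dictionaries. This is not the project's validator: it swaps the Pydantic models for dict checks, skips the `value_type` and `order_by` rules, and assumes a hypothetical catalog shape of `{source_id: {table_id: set_of_column_ids}}`. The `planner` callable stands in for the LLM call.

```python
FILTER_OPS = {"=", "!=", "<", "<=", ">", ">=", "in", "not_in",
              "is_null", "is_not_null", "like", "between"}
AGG_FNS = {"count", "count_distinct", "sum", "avg", "min", "max"}
HARD_CAP = 10_000  # example cap from the rules above


def validate_ir(ir: dict, catalog: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the IR passes."""
    tables = catalog.get(ir.get("source_id"))
    if tables is None:
        return ["unknown source_id"]
    columns = tables.get(ir.get("table_id"))
    if columns is None:
        return ["table_id does not belong to source"]
    errors = []

    def check_column(column_id, where):
        if column_id not in columns:
            errors.append(f"{where}: unknown column_id {column_id!r}")

    for item in ir.get("select", []):
        if item["kind"] == "agg":
            if item["fn"] not in AGG_FNS:
                errors.append(f"select: agg fn {item['fn']!r} not whitelisted")
            if item.get("column_id") is not None:  # count(*) has no column
                check_column(item["column_id"], "select")
        else:
            check_column(item["column_id"], "select")
    for flt in ir.get("filters", []):
        check_column(flt["column_id"], "filters")
        if flt["op"] not in FILTER_OPS:
            errors.append(f"filters: op {flt['op']!r} not whitelisted")
    for column_id in ir.get("group_by", []):
        check_column(column_id, "group_by")
    limit = ir.get("limit")
    if limit is not None and not (isinstance(limit, int) and 0 < limit <= HARD_CAP):
        errors.append("limit must be a positive int within the hard cap")
    return errors


def plan_with_retries(planner, question: str, catalog: dict, max_retries: int = 3) -> dict:
    """Call planner(question, errors) until the IR validates (1 attempt + retries)."""
    errors: list[str] = []
    for _ in range(1 + max_retries):
        ir = planner(question, errors)  # errors give the planner its error context
        errors = validate_ir(ir, catalog)
        if not errors:
            return ir
    raise ValueError(f"planner failed validation: {errors}")
```

Reading "max 3 retries" as one initial attempt plus three re-prompts is an interpretation; the loop bound is the only place that choice appears.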
243
-
244
- ### Deferred features
245
-
246
- `having`, `offset`, boolean tree filters (OR/NOT), `distinct`, joins, window functions. Add as user demand proves the limitation.
247
-
248
- ---
249
-
250
- ## 8. Executors
251
-
252
- Same input (validated IR), same output (`QueryResult`), different backends.
253
-
254
- ### DbExecutor (schema sources)
255
-
256
- ```
- IR → SqlCompiler → SQL string + params
-
- sqlglot validation (SELECT-only, whitelist tables/columns, LIMIT enforced)
-
- asyncpg / pymysql in read-only transaction with timeout (30s)
-
- QueryResult
- ```
265
-
266
- Identifiers come from catalog (verified at validation time, safe to inline as quoted identifiers). Values are always parameterized — never inlined as strings.
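
To make the identifier/value split concrete, here is a small sketch of a compiler step that inlines catalog identifiers as quoted names and sends every value as a bound parameter. The `compile_select` and `quote_ident` names are hypothetical, only simple comparison operators are handled, and the `$n` placeholders follow the native PostgreSQL style that asyncpg uses (pymysql would need `%s` instead).

```python
COMPARISON_OPS = {"=", "!=", "<", "<=", ">", ">="}


def quote_ident(name: str) -> str:
    """Quote an identifier PostgreSQL-style, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(table, columns, filters, limit):
    """Build (sql, params) for a simple SELECT.

    `filters` is a list of (column, op, value) tuples. Identifiers are
    assumed to be catalog-verified already; values never enter the SQL text.
    """
    params = []
    clauses = []
    for column, op, value in filters:
        if op not in COMPARISON_OPS:
            raise ValueError(f"unsupported op in this sketch: {op!r}")
        params.append(value)
        clauses.append(f"{quote_ident(column)} {op} ${len(params)}")

    select_list = ", ".join(quote_ident(c) for c in columns)
    sql = f"SELECT {select_list} FROM {quote_ident(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    params.append(limit)  # LIMIT is always emitted, also as a parameter
    sql += f" LIMIT ${len(params)}"
    return sql, params
```

For example, `compile_select("orders", ["id"], [("status", "=", "paid")], 50)` yields a statement whose only literal text is structure and quoted identifiers, with `"paid"` and `50` carried in the parameter list.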
267
-
268
- ### TabularExecutor (tabular sources)
269
-
270
- ```
- IR → PandasCompiler → operation chain
-
- choose strategy by file size:
-   ≤ 100 MB → eager pandas
-   100 MB-1 GB → pyarrow with predicate pushdown
-   > 1 GB → polars lazy scan
-
- execute in asyncio.to_thread (CPU work off the event loop)
-
- QueryResult
- ```
282
-
283
- Initially eager pandas is sufficient. Add the others when a real file is too big.
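
The size-based dispatch and the off-loop execution described above might look roughly like this. It is a sketch under assumptions: the strategy labels, `choose_strategy`, and `execute` are illustrative names, and `run` stands in for whatever blocking function actually performs the pandas/pyarrow/polars work.

```python
import asyncio

MB = 1024 ** 2
GB = 1024 ** 3


def choose_strategy(file_size_bytes: int) -> str:
    """Map a file size onto one of the three strategies from the flow above."""
    if file_size_bytes <= 100 * MB:
        return "pandas_eager"
    if file_size_bytes <= 1 * GB:
        return "pyarrow_pushdown"
    return "polars_lazy"


async def execute(file_size_bytes: int, run):
    """Run the blocking `run(strategy)` callable in a worker thread.

    asyncio.to_thread keeps the CPU-bound work off the event loop.
    """
    strategy = choose_strategy(file_size_bytes)
    return await asyncio.to_thread(run, strategy)
```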
284
-
285
- ### Shared safety guarantees
286
-
287
- 1. IR validated before reaching compiler
- 2. Compiler is deterministic (no LLM)
- 3. Identifiers from catalog (trusted)
- 4. Values parameterized
- 5. sqlglot second-line defence for SQL
- 6. Read-only at every layer
- 7. Timeouts and row caps
294
-
295
- ---
296
-
297
- ## 9. Implementation scope
298
-
299
- ### Initial PR — what ships first
300
-
301
- | Item | Folder |
- |---|---|
- | Data catalog Pydantic models | `src/catalog/models.py` |
- | Catalog ingestion (introspect → enrich → validate → store) | `src/catalog/`, `src/pipeline/` |
- | `IntentRouter` with 3-way source_hint | `src/agents/` |
- | `CatalogReader` (loads full catalog) | `src/catalog/reader.py` |
- | `QueryPlanner` LLM call | `src/query/planner/` |
- | JSON IR Pydantic models | `src/query/ir/models.py` |
- | IR validator | `src/query/ir/validator.py` |
310
-
311
- **Output**: a validated JSON IR object. Execution lands in a follow-up PR.
312
-
313
- ### Follow-up PRs
314
-
315
- | PR | Scope |
- |---|---|
- | 2 | `QueryCompiler` (IR → SQL / pandas) |
- | 3 | `QueryExecutor` split: `DbExecutor` + `TabularExecutor` |
- | 4 | Retry / self-correction loop on execution failure |
- | 5 | Eval harness (golden question→IR→result examples) |
- | 6 | Auto PII tagging in catalog |
- | Later | Joins in IR, schema drift detection, hybrid catalog search |
323
-
324
- ---
325
-
326
- ## 10. Open questions
327
-
328
- | # | Question | Why it matters |
- |---|---|---|
- | 1 | Catalog storage: JSON file per user vs Postgres `jsonb` row? | Affects ingestion + read performance |
- | 2 | Should the catalog also list unstructured files (with descriptions only)? | Gives router unified view of all user sources |
- | 3 | Catalog refresh trigger: explicit "rebuild" button, on every upload, or background TTL? | Staleness vs latency tradeoff |
- | 4 | Confirm joins are out of initial IR scope? | Limits what user questions can be answered |
- | 5 | PII handling for sample_values: mask, synthesize, or skip? | Affects what gets sent to LLM prompts |
335
-
336
- ---
337
-
338
- ## 11. References
339
-
340
- - `docs/flowchart.html` — interactive end-to-end diagram (open in browser)
341
- - `docs/flowchart.mmd` — mermaid source for the diagram
342
-
343
- ---
344
-
345
- ## Glossary
346
-
347
- - **Cu** — unstructured context (prose chunks)
- - **Cs** — schema context (DB tables/columns from catalog)
- - **Ct** — tabular context (file sheets/columns from catalog)
- - **IR** — intermediate representation (the JSON query shape)
- - **PR** — pull request (a unit of code change)
- - **PII** — personally identifiable information (names, emails, etc.)
353
- - **ABC** — abstract base class (Python contract for subclasses)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHECKPOINT_PLAN_2026-06-17.md DELETED
@@ -1,147 +0,0 @@
1
- # Checkpoint Plan — Wednesday, 17 June 2026
2
-
3
- Working plan for Sofhia & Rifqi based on the checkpoint with mas Harry on **Thursday, 11 June 2026**.
4
- Goal: everything below is **merged and demo-able before the next sync on Wednesday, 17 June (afternoon)**.
5
-
6
- **Updated at: Friday, 12 June 2026** (Sofhia + Rifqi)
7
-
8
- > Source of truth for decisions is the meeting itself. Note: the NotebookLM summary is **stale on two points** — Data Availability Check was *eliminated* as a tool, and Success Metrics was *folded into* the Problem Statement template. Do not build either as a standalone skill.
9
-
10
- ---
11
-
12
- ## 0. Progress (per Fri 12 Jun — Sofhia)
13
-
14
- Dated snapshot of what landed this session. Live task status (incl. what's left) lives in §2 Ownership — this section only records the deltas + traceability.
15
-
16
- - ✅ **Tool matrix** built (xlsx, all ~10 tools + status colours) — presentation material ready.
17
- - ✅ **Registry trimmed to 4 active analytics** (`KM-641`, commit `66e2e4d`): `ACTIVE_ANALYTICS_TOOLS` (descriptive, aggregate, correlation, trend) vs `DEFERRED_ANALYTICS_TOOLS` (comparison, contribution, profile, segment) — specs + compute fns kept, only registry exposure withheld. Tests 206 pass, ruff/mypy clean.
18
- - ✅ **Planner few-shot synced**: Example A `analyze_contribution` → `analyze_aggregate` (so few-shots don't reference a deferred tool).
19
- - ✅ **Data-access tools renamed** (`KM-642`, commit `c38c0c2`): `query_structured` → `data_retrieve`, `retrieve_documents` → `knowledge_retrieve` across the tool layer + planner stub/prompt/validator/few-shots. Mechanical, no behavior change.
20
- - ✅ **`data_check` merge + `knowledge_check`** (`KM-643`, commit `4bd5f1e`): `list_sources` + `describe_source` → one parameterized `data_check` (no arg = list structured sources; `source_id` = schema) + new `knowledge_check` (unstructured). Tests 206 pass.
21
- - ✅ **Redis Cloud live** (free tier, TTL = 1 h), env vars shared in the group (Rifqi).
22
- - ✅ **Planner tool list verified** against the trimmed registry — no references to old tool names or deferred analytics anywhere in `src/` (Rifqi).
23
- - 📌 **Decision:** `tests/` stays gitignored — team decided not to push tests to origin (closes PROGRESS.md R3 as won't-do).
24
- - 📌 **Ownership:** Rifqi owns `generate_report` development + the `analysis_records` table / real `AnalysisStore` (contract still co-designed with Sofhia).
25
- - ✅ **R5 cache fix** (Rifqi, `b701e95`): chat cache scoped by `user_id`, TTL 24h→1h.
26
- - ✅ **AnalysisRecord persistence landed** (Rifqi): `stage` now flows to the record (CRISP-DM grouping for the report) + identity fields (`record_id`/`analysis_id`/`user_id`); `PostgresAnalysisStore` + `analysis_records` table replace `NullAnalysisStore`, wired into `ChatHandler`. Unblocks the `generate_report` renderer and the DoD "record persisted" step. Open: `analysis_id` handoff from Harry's Analysis State.
27
- - ✅ **Verb-first tool naming** (Sofhia, commit `2d6406d`): the 4 data/knowledge tools renamed to lead with a verb — `data_check`→`check_data`, `knowledge_check`→`check_knowledge`, `data_retrieve`→`retrieve_data`, `knowledge_retrieve`→`retrieve_knowledge` (the `analyze_*` tools already lead with a verb). These verb-first names are now canonical; the tool-set table + §3 below use them. Dated log entries above keep the old names as historical record.
28
-
29
- ---
30
-
31
- ## 1. Locked decisions (from the 2026-06-11 checkpoint)
32
-
33
- 1. **Single chat page.** The separate interview/survey page is killed. Sidebar = Knowledge menu (connect/manage data) + Analysis menu (sessions).
34
- 2. **Data-first hard gate.** Creating a new analysis requires **≥ 1 bound data source** (server-side rejection, no empty sessions). User provides title + optional short description.
35
- 3. **Analysis State lives in the DB.** Per-analysis row: `user_id`, `data_source_ids[]`, `interview_status` (default `not_pass`), `report_status` (default `no_report` → `V1`, `V2`, …). Explicitly **NOT cached, NOT in Redis** — the Orchestrator reads it from Postgres every turn.
36
- 4. **Skills, not agents.** No separate interview agent. The Orchestrator routes per user turn using the Analysis State; an analytical request still executes through the existing Planner → TaskRunner → Assembler spine (static plan, no mid-run LLM).
37
- 5. **Interview = one skill: Problem Statement.** Success metrics become fields inside the PS template (what to increase/decrease + target). Data availability check is handled by the data-first creation gate + PS validation cross-checking fields against the bound catalog — not a separate tool.
38
- 6. **Analytics focus = 4 tools:** descriptive, aggregate, correlation, trend. The other four composites (comparison, contribution, profile, segment) are **deprioritized, not deleted** — keep the code, just don't register them. If "comparison" returns later it should be a proper statistical **test**, not a generic compare.
39
- 7. **`describe_source` merges into the listing tool** — one call returns sources *with* their schema/metadata, fewer tools for the planner.
40
- 8. **Report = on-demand, button-triggered (not a chat skill).** A dedicated "Generate Report" button in the Analysis menu calls a **report API** (not the chat route): trigger generation for a session, list its versions, fetch a version. Renders from accumulated **AnalysisRecords + the Problem Statement** — never from chat history. Each report is a **persisted, versioned artifact**: generation snapshots the record IDs it used and bumps `report_status` to `V<n>`. (Owner: Rifqi, KM-644.)
41
- 9. **Help = deterministic guide.** No LLM: read Analysis State → tell the user the next required step. Callable in any state.
42
- 10. **Redis Cloud free tier, TTL = 1 hour**, env shared in the team group — for retrieval/query caching only, never for state.
43
-
44
- ### Final tool set (~10)
45
-
46
- | Tool (canonical, verb-first) | Maps to (lineage) | Status |
- |---|---|---|
- | `check_knowledge` | new — list user's documents + metadata | done |
- | `check_data` | `list_sources` + `describe_source` merged (catalog-backed) | done |
- | `retrieve_knowledge` | `retrieve_documents` → `knowledge_retrieve` | done |
- | `retrieve_data` | `query_structured` → `data_retrieve` (tabular: file + DB, both working) | done |
- | `analyze_descriptive` | `src/tools/analytics/descriptive.py` | done |
- | `analyze_aggregate` | `src/tools/analytics/aggregation.py` | done |
- | `analyze_correlation` | `src/tools/analytics/relationship.py` | done |
- | `analyze_trend` | `src/tools/analytics/temporal.py` | done |
- | `problem_statement` | new — interview skill (**Harry**) | Harry |
- | `generate_report` | new — on-demand, versioned | to design |
- | `help` | new — deterministic state guide | to build |
59
-
60
- (`problem_statement` + `help` live at the orchestrator level; `generate_report` is **button-triggered via a dedicated report API**, not chat-routed (decision #8). The TaskRunner registry holds the 4 analytics + 4 data/knowledge tools. Unregister `analyze_comparison`, `analyze_contribution`, `analyze_profile`, `analyze_segment` from the planner-visible registry — keep the modules.)
61
-
62
- ---
63
-
64
- ## 2. Ownership
65
-
66
- ### Sofhia
67
- - [x] 4 analytics tools: trim registry to 4 active, tests still pass after deprioritizing the other four. (`KM-641`, commit `66e2e4d`)
68
- - [x] Data/knowledge tools: merge `describe_source` into `data_check`, rename `retrieve_documents` → `knowledge_retrieve`, `query_structured` → `data_retrieve`, build `knowledge_check`. (`KM-642` `c38c0c2`, `KM-643` `4bd5f1e`)
69
- - [ ] Co-design `generate_report` contract with Rifqi (Rifqi owns development, see §3).
70
- - [x] Tool matrix (see §4).
71
-
72
- ### Rifqi
73
- - [x] **Redis Cloud free tier** (~30–50 MB): create instance, set TTL = 1 h, share env vars in the group. (done 12 Jun)
74
- - [x] **R5 cache fix**: chat cache key scoped by `user_id`, TTL 24h→1h (urgent on shared Redis). (12 Jun, commit `b701e95`)
75
- - [x] **AnalysisRecord contract gaps closed**: `stage` (CRISP-DM) now flows Task→TaskResult→TaskSummary so the report can group the method appendix; `AnalysisRecord` gained `record_id`/`analysis_id`/`user_id` identity fields. (12 Jun)
76
- - [x] **`analysis_records` table + real `AnalysisStore`**: `PostgresAnalysisStore` (save + `list_for_analysis`, never-throw) replaces `NullAnalysisStore`; wired into `ChatHandler`, `user_id` stamped at save. Satisfies the DoD "record persisted" step. (12 Jun)
77
- - [ ] **Own `generate_report` development — KM-644 "Report Generator"** (contract co-designed with Sofhia, see §3). Button-triggered via a dedicated **report API** (trigger / list versions / fetch); reads `analysis_records` + Problem Statement; persists a versioned report artifact, bumps `report_status`. *(record persistence done above; report API + persistence + renderer + contract doc next)*
78
- - [x] Verify planner tool list matches the trimmed registry (4 analytics + 4 data/knowledge) and few-shots don't reference removed tools. (verified 12 Jun — no stale tool names in `src/`)
79
- - ⚠️ **Blocked-on-Harry**: `analysis_id` is `NULL` on persisted records until the Analysis State reaches the slow path — need the session-ID handoff so `generate_report` can group records per analysis.
80
-
81
- ### Shared (Sofhia + Rifqi)
82
- - [ ] `generate_report` design + skeleton: input = AnalysisRecords for the session + Problem Statement from Analysis State; output = versioned artifact; bumps `report_status`. Agree on the contract even if rendering is stubbed for Wednesday. (Development: Rifqi.)
83
- - [ ] `help` skill: deterministic — read Analysis State, return the next required step. Small, do it together or whoever finishes first.
84
- - [ ] Tool behavior smoke test end-to-end on an easy case (descriptive/aggregate path), per Harry's ask: "robust tools before agents."
85
-
86
- ### Harry (dependencies — not ours, but we block on them)
87
- - `problem_statement` skill + PS template (incl. increase/decrease target fields).
88
- - Analysis State class + DB table, frontend analysis-builder step.
89
- - Merging our PRs (he auto-merges; he clones from latest after).
90
-
91
- ---
92
-
93
- ## 3. Per-tool behavior contract (how to build each one)
94
-
95
- Harry's framing: for every tool, define **goal / trigger / input / process / output**, and behave like a Claude-style skill — if a required argument is missing, respond with a polite feedback message asking for it (e.g. table/column name), never guess silently.
96
-
97
- - **`check_knowledge`** — "what documents do I have?" → list documents with name, type, uploaded-at.
98
- - **`check_data`** — "what data do I have?" → sources (file + DB) with schema/metadata from the data catalog, created/uploaded timestamps.
99
- - **`retrieve_knowledge`** — RAG over uploaded documents; returns passages with source attribution.
100
- - **`retrieve_data`** — query tabular data (file + DB) via QueryIR; output consumable by the `analyze_*` tools.
101
- - **`analyze_*` (4)** — require valid table/column references; if missing or wrong, return actionable feedback instead of guessing.
102
- - **`generate_report`** — button-triggered via a dedicated report API (not chat-routed); on-demand only (never auto); post-pass gated; renders from AnalysisRecords + PS; persists a versioned artifact, snapshots record IDs, bumps version. (KM-644, Rifqi.)
103
- - **`help`** — no LLM; state → next step. Repeating it is fine, that's its job.
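
As an illustration of the "no LLM; state → next step" contract for `help`, the guide can be a pure function over the Analysis State fields named in §1 (`data_source_ids`, `interview_status`, `report_status`). This is only a sketch: the dict-shaped state, the `next_step` name, and the returned step labels are hypothetical.

```python
def next_step(state: dict) -> str:
    """Deterministically map an Analysis State snapshot to the next required step."""
    if not state.get("data_source_ids"):
        # Data-first gate: an analysis needs at least one bound source.
        return "bind_data_source"
    if state.get("interview_status", "not_pass") != "pass":
        return "complete_problem_statement"
    if state.get("report_status", "no_report") == "no_report":
        return "ask_analysis_question_or_generate_report"
    return "continue_analysis_or_regenerate_report"
```

Because the function has no hidden inputs, calling it twice on the same state returns the same guidance, which matches the note that repeating `help` is fine.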
104
-
105
- ---
106
-
107
- ## 4. Tool matrix (deliverable for the sync)
108
-
109
- Harry explicitly asked for a matrix covering every tool. Produce one sheet/markdown table with columns:
110
-
111
- `tool | goal | trigger (when the orchestrator calls it) | input | process | output | gated by interview_status? | status (done / in progress / planned)`
112
-
113
- Use the tool set table in §1 as the row list. This doubles as the presentation material on Wednesday.
114
-
115
- ---
116
-
117
- ## 5. Day-by-day
118
-
119
- | Day | Target |
120
- |---|---|
121
- | **Thu 11** | Checkpoint meeting + task split with Harry. |
122
- | **Fri 12 (today)** | ✅ Registry trimmed to 4 analytics + few-shot synced (Sofhia, KM-641). ✅ Tool matrix built. ⏳ Redis Cloud + env share (Rifqi). |
123
- | **Mon 15** | Data/knowledge tools done (`data_check` merge, renames, `knowledge_check`). `generate_report` contract agreed. |
124
- | **Tue 16** | `help` skill done. `generate_report` skeleton wired to AnalysisRecord. Tool matrix drafted. End-to-end smoke test on the easy path. |
125
- | **Wed 17 (AM)** | Buffer: fix fallout, finalize matrix, rehearse the demo flow. |
126
- | **Wed 17 (PM)** | **Sync with Harry.** |
127
-
128
- ---
129
-
130
- ## 6. Open questions to confirm with Harry on Wednesday
131
-
132
- 1. **Gate scope.** Proposal: keep the fast path + exploration tools (`check_knowledge`, `check_data`, retrieves, `help`, arguably `descriptive`) available **pre-pass**; gate only the insight tools (correlation, trend, report). Hard-gating everything risks frustrating users who just want to look at their data.
133
- 2. **Who flips `interview_status` to `pass`?** Proposal: a deterministic validator (PS template slots complete + fields cross-checked against the bound catalog) makes the call — the LLM conducts the conversation but never decides the pass. ("Conversational skin, deterministic skeleton.")
134
- 3. **Skills vs spine — one sentence to lock in writing:** *"Skills are registry tools executed by the existing Planner → TaskRunner → Assembler spine; the Analysis State gate is a pre-check in the Orchestrator."* This keeps the new flow and the locked architecture fully compatible.
135
- 4. `generate_report` invocation goes through the same gate (post-pass only) — confirm.
136
-
137
- ---
138
-
139
- ## 7. Definition of done for Wednesday
140
-
141
- - [ ] All team PRs merged; Harry unblocked on the Analysis State class.
142
- - [ ] Registry exposes exactly 4 analytics + 4 data/knowledge tools, all passing local tests.
143
- - [ ] Redis Cloud shared and working locally for all three of us (TTL 1 h).
144
- - [ ] `help` works against a (possibly stubbed) Analysis State.
145
- - [ ] `generate_report` contract written; skeleton callable.
146
- - [ ] Tool matrix ready to present.
- - [ ] One end-to-end happy path runs: create analysis (with data) → blocked pre-pass → interview stub passes → descriptive/aggregate answer → record persisted.
DEV_PLAN.md ADDED
@@ -0,0 +1,160 @@
+ # Data Eyond — Current Development Plan (post 2026-06-24 meeting + 2026-06-25 checkpoint)
2
+
3
+ **Purpose:** context file for Claude Code sessions working on the current sprint.
4
+ **Branch:** `pr/4` · **Snapshot:** 2026-06-25.
5
+ **Companion:** [REPO_STATUS.md](REPO_STATUS.md) describes the repo's *current built state*; this file
6
+ describes the *in-flight plan* that changes it. New decisions from the 2026-06-25 checkpoint are in
7
+ [§1.5](#15-2026-06-25-checkpoint-deltas).
8
+
9
+ ---
10
+
11
+ ## 1. The direction change (locked decisions from 2026-06-24)
12
+
13
+ 1. **"Problem statement" is replaced by two user-entered fields: `objective` + `business_questions`.**
14
+ User fills them at onboarding; **both mandatory to submit; NO agent validation.**
15
+ 2. The **gate (`problem_validated`) and the `problem_statement` skill/intent are removed** (comment out, don't delete).
16
+ 3. **Report is records-based** (reads persisted `AnalysisRecord`s) — **decided and pushed** (KM-674).
17
+ It is formal markdown: title, date, "generated by {user}", objective, business questions,
18
+ findings, insights. **NOT gated** on whether business questions were answered.
19
+ 4. **`owner_id` → `user_id`** everywhere (Harry mirrors in dedorch/Go).
20
+ 5. **State writes go through a request to Go**, not direct Python DB writes.
21
+ 6. **FE-callable surface = 4 endpoints:** `call_agent` (chat/stream), `list_skills` (`GET /tools`),
22
+ **skill: help**, **skill: report**. `problem_statement` removed; `check_data` not FE-facing from
23
+ Python (Go provides it); analysis CRUD not needed from FE (comment, don't delete).
24
+ 7. Deliverables for Harry: (a) API endpoint doc (MD); (b) full Python project doc (MD → PDF/Word BRD).
25
+ 8. Integration tested via Swagger `/docs` on the HF Python build (simulating FE manually). Target ~Wed.
26
+
27
+ ## 1.5. 2026-06-25 checkpoint deltas
28
+
29
+ Confirms the 2026-06-24 direction and adds these concrete changes (folded into §4 as tasks 21–28):
30
+
31
+ 1. **Rename `analysis_records` → `report_inputs`** (DONE #21) — names the table by purpose (the rows
32
+ report generation reads); avoids clashing with Go's `analyses_messages` and with Langfuse
33
+ observability. **Stays Python-owned**; finalized schema handed to Harry so his dedorch migration
34
+ creates it post-`SKIP_INIT_DB` (#22, resolves #16). Write scope = **one row per slow-path analysis
35
+ run** (decided — not per-agent-call telemetry; that stays Langfuse).
36
+ 2. **`analyses` table (Go) — `status`, `data_bind` + `data_bind_version`, `report_collection`** (id+version).
37
+ **Verified 2026-06-25: these + `user_id` are ALREADY present in dedorch `analyses`.** Plus Harry drops
38
+ the duplicate/wrong singular `analysis` table. (→ #3)
39
+ 3. **`analyses_messages` (Go) = the analysis chat room** (user Q + agent A) — replaces the now-**deprecated**
40
+ `chat_messages`/`rooms`; Python's chat read/write must migrate here before cutover. (→ #25)
41
+ 4. **Reports: Go owns ALL writes.** Report stays a **skill** (no router intent): FE → Go → Python;
42
+ Python only returns content. Input = the records table (now `report_inputs`); edit-mode may
43
+ also need the last report. (→ #7/#18/#24)
44
+ 5. **Markdown minimum now:** tables, **bold**, *italic*, horizontal separators — optimize that before
45
+ anything fancier. (→ #23)
46
+ 6. **Deferred:** charts (prefer **Plotly→JSON** in a future `chart` table over matplotlib PNGs) and
47
+ images (image table keyed by analysis/message/report + originals in a bucket). (→ #26/#27)
48
+ 7. **Near-term:** the remove-`problem_statement` work isn't on HF yet → **PR + deploy + test in the
49
+ playground** (#13). Harry stabilizes Go ~Fri; FE manual testing ~Mon. **Keep it playground-able.**
50
+ 8. **UI research** (no dedicated UI person): new-analysis form (title/objective/business_questions),
51
+ knowledge menu (user-level vs analysis-level binding), report artifacts panel + version selector;
52
+ interview + old analysis UI removed. (→ #28)
53
+
54
+ ## 2. What is already done (KM-674, pushed on `pr/4`)
55
+
56
+ Report layer adapted to the new goal shape:
57
+ - `report/schemas.py::ProblemStatement` → `objective: str` + `business_questions: list[str]`
58
+ (old `target_value`/`scope`/`metric_direction`/`target_metric` dropped). Class name kept for now
59
+ (rename to `ReportGoal` once the upstream AnalysisState rename lands).
60
+ - `report/generator.py` renders **Objective** + numbered **Business Questions** + a
61
+ **"generated by {user}"** line.
62
+ - `api/v1/report.py::_problem_statement_from` is **tolerant**: prefers new `objective` /
63
+ `business_questions` from state, falls back to legacy `problem_statement` — works before AND after
64
+ Harry's migration.
65
+ - `config/prompts/report_summary.md` updated to objective + business questions.
66
+ - Report stays **records-based**; the floor gate (`problem_validated`) was deliberately left for task #2.
67
+
68
+ **This tolerant-migration pattern (getattr fallback) is the model for tasks #2 and #4.**
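
A minimal sketch of that `getattr` fallback, assuming the state is an attribute-style object and that the legacy `problem_statement` can be rendered as text. The helper names `read_goal` and `read_user_id` are hypothetical and are not the functions in `api/v1/report.py`.

```python
from types import SimpleNamespace
from typing import Optional


def read_goal(state: object) -> tuple[str, list[str]]:
    """Prefer the new fields; fall back to the legacy problem statement."""
    objective = getattr(state, "objective", None)
    if objective:
        questions = getattr(state, "business_questions", None) or []
        return objective, list(questions)
    legacy = getattr(state, "problem_statement", None)
    return (str(legacy) if legacy else ""), []


def read_user_id(state: object) -> Optional[str]:
    """Accept either the renamed `user_id` or the legacy `owner_id`."""
    return getattr(state, "user_id", None) or getattr(state, "owner_id", None)
```

Since neither helper touches an attribute directly, the same code runs against a row from before or after the dedorch migration.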
69
+
70
+ ## 3. Assessment — gaps & contradictions to resolve before building
71
+
72
+ These came out of reviewing the plan against the actual code. They are folded into the task table (§4) as tasks 15–19.
73
+
74
+ - **G1 (→ task 15). Records-based reports need the slow path ON.** `AnalysisRecord`s persist only in
75
+ `chat_handler._run_slow_path`, which runs only when `ENABLE_SLOW_PATH=true`. Default is off → no
76
+ records → `POST /report` 409s. The Swagger demo can't show a non-empty report unless slow path is
77
+ flipped on and a `structured_flow` question is run first. `BusinessContext` is still a stub but the
78
+ slow path runs fine on it.
79
+ - **G2 (→ task 16). `analysis_records` ownership is now required and collides with `SKIP_INIT_DB`.**
80
+ It's created today by Python `create_all` (`db/postgres/init_db.py`), is in no dedorch/Go migration,
81
+ and after the dedorch cutover (`SKIP_INIT_DB=true`) Python stops running `create_all` → the table
82
+ won't exist → reports break. Decide: dedorch migration (Harry) OR a Python carve-out that creates
83
+ just this one table even under `SKIP_INIT_DB`. *(Resolved 2026-06-25 — see §1.5 item 1 / #16 / #22.)*
84
+ - **G3 (→ task 17). `chat_history` in the report contract is vestigial.** Records-based generation
85
+ reads records by `analysis_id`; it never uses chat history. Drop `chat_history` from the report
86
+ skill contract, or mark it reserved/unused.
87
+ - **G4 (→ task 4 note). Make #2/#4 tolerant of both state shapes.** If Harry drops
88
+ `problem_validated`/`owner_id` from dedorch before Python stops reading them, Python's gate +
89
+ state_store break. Use the same `getattr` tolerance KM-674 used. The `owner_id`→`user_id` rename
90
+ also touches `api/v1/analysis.py` (`_serialize_state`, `list_analyses`, `get_analysis`), not just
91
+ the model + state_store.
92
+ - **G5 (→ task 18). "State writes via Go" is bigger than `report_id`.** Python still writes state in
93
+ `/analysis/create` (state + room + bindings, plus the data-first gate and soon the mandatory-field
94
+ check) and in `state_store.ensure` per turn. If creation moves to Go (consistent with commenting
95
+ analysis CRUD), then Go owns ALL state writes + both creation gates, and Python's `ensure` must
96
+ become a **read-only get** (Go must guarantee the row exists before any chat turn).
97
+ - **G6 (smaller).**
98
+ - Removing `problem_statement` (task 1) means neutering it in 4 places: the `Intent` literal
99
+ (`agents/orchestration.py`), the router prompt (`config/prompts/intent_router.md`), the handler,
100
+ and the gate's redirect *target*. Do it with task 2.
101
+ - "generated by {user}" currently prints the raw `user_id`; a formal report wants a name — source
102
+ from `users.fullname` or have Go pass a display name (task 19).
103
+ - The meeting's outline (background / EDA / insights) isn't fully in the renderer; map those
104
+ sections onto the record fields deliberately (task 5 follow-up).
105
+ - The full project doc (task 11) should reuse [REPO_STATUS.md](REPO_STATUS.md), not restart.
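
G1 and the simplified floor in task 2 boil down to one no-LLM check: a report can be generated only if at least one completed analysis record exists, otherwise `POST /report` answers 409. The sketch below assumes dict-shaped records with a hypothetical `status` field whose value `"completed"` marks a finished run; the real record schema may differ.

```python
def report_floor(records: list[dict]) -> bool:
    """True when at least one completed analysis record is available."""
    return any(r.get("status") == "completed" for r in records)


def report_status_code(records: list[dict]) -> int:
    """201 when the report can be generated, 409 when the floor is not met."""
    return 201 if report_floor(records) else 409
```

With the slow path disabled no records are persisted, so this check returns 409 regardless of how many chat turns have happened.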
106
+
107
+ ## 4. Task table
108
+
109
+ Status legend: ⬜ not started · 🔄 in progress · ✅ done · ⛔ blocked · 🔎 verify · ⏸️ deferred.
110
+
111
+ | # | Task | Owner | Status | Note |
112
+ |---|---|---|---|---|
113
+ | 1 | Comment out `problem_statement` skill **+ `Intent` literal + router prompt + gate redirect target**; remove `/problem-statement` from `list_tools` | Rifqi | ✅ | Done 2026-06-25 (one commit w/ #2). Unwired in `orchestration.py`, `intent_router.md`, `chat_handler.py`, `tools.py`; `problem_statement.py` kept intact |
114
+ | 2 | Drop `problem_validated`: gate neutered; `is_report_ready`/`report_floor` → **≥1 completed analysis** only, no-LLM | Rifqi | ✅ | Done 2026-06-25. `gate.py` no-op, gate call site commented in `chat_handler.py`, `report_floor` drops the goal check. Tests updated (`test_gate`/`test_chat_handler`/`test_readiness`). Suite: **284 passed, 7 skipped**; ruff clean |
115
+ | 3 | dedorch `analyses` migration: drop `problem_statement`/`problem_validated`, add `objective` + `business_questions` | Harry | 🔄 | **Verified dedorch 2026-06-25:** `analyses` (plural) ALREADY has `user_id` + `status` + `data_bind` + `data_bind_version` + `report_collection` → those parts done. **Remaining:** drop `problem_statement`/`problem_validated` + add `objective`/`business_questions`. Singular `analysis` = deprecated duplicate to drop |
116
+ | 4 | Update Python `analyses` model + `state_store` + `analysis.py` to match dedorch; `owner_id`→`user_id` | Rifqi/Sofhia | ✅ | Done 2026-06-26. `owner_id`→`user_id` + added `status`/`data_bind`/`data_bind_version`/`report_collection` (DB-only, not in the `AnalysisState` pydantic) across `models.py`/`gate.py`/`state_store.py`/`analysis.py` + 3 local tests; also `report_inputs` `id`/`analysis_id` → `uuid`. Kept `problem_statement`/`problem_validated`; `objective`/`business_questions` wait on Harry's #3. Suite **284 passed** |
117
+ | 5 | Report generator → `objective`+`business_questions`, "generated by {user}", formal outline | Sofhia | ✅ | Goal-shape (KM-674) + author name (#19) + outline (KM-680): Objective → Business Questions → Executive Summary → Key Findings → EDA → Notes & Limitations → How This Was Analyzed |
118
+ | 6 | Report skill input contract: `analysis_id` + `user_id` (no `chat_history`) | Sofhia/Rifqi | ✅ | No-op: `POST /report` already takes only analysis_id + user_id (records-based). Documented in API_ENDPOINTS.md §5. *(Edit-mode input revisited in #24.)* |
119
+ | 7 | `report_id` state update via request to Go, not direct DB | Sofhia + Harry | ⬜ | Needs Go endpoint. **Checkpoint:** Go owns ALL `reports` writes; Python stops any direct insert/update and only returns content; report stays a **skill** (no intent). See #18 |
120
+ | 8 | Expose/confirm 4 FE endpoints; comment `check_data` + analysis CRUD | Sofhia | ✅ | KM-678: `list_tools` trimmed to `/help` + `/report` (analytics/check/retrieve commented in the **menu**). `help` confirmed as a `call_agent` intent — no own endpoint. Analysis CRUD endpoint left **registered**: "comment the rest" was about the FE slash menu, not killing HTTP routes Go needs |
121
+ | 9 | Verify `analysis_id` in `call_agent` contract | Sofhia | ✅ | Verified: no separate field — carried as `room_id` (`analysis_id == room_id`), per REPO_STATUS §4/§11. Action for Go: send the id as `room_id` |
122
+ | 10 | API endpoint doc (MD), 4 endpoints, for Go integration | Rifqi + Sofhia | ✅ | Done 2026-06-25 — `API_ENDPOINTS.md` (repo root). 4 FE surfaces with request/response **examples** (chat SSE transcript, report 201/409 JSON, version list), schemas, §9 full 32-route inventory + task-8 reading |
123
+ | 11 | Full Python project doc (MD → PDF/Word BRD) | Rifqi | ✅ | Done 2026-06-26 — `PROJECT_BRD.md` (repo root): purpose/context, FR-1..9 capabilities, lifecycle, architecture, data model, API (→ API_ENDPOINTS), NFRs, integrations, open items. Reuses REPO_STATUS/API_ENDPOINTS; convert to PDF/Word for distribution |
124
+ | 12 | Reconcile/open the `list_tools` PR cleanly (stacked commits) | Rifqi | ✅ | N/A — we develop directly on the single active branch `pr/4` (KM-652 + KM-678 already stacked there); no separate PR to reconcile |
125
+ | 13 | Deploy HF Python build (remove-`problem_statement` work) → test 4 endpoints via Swagger / playground | Sofhia + Harry | 🔄 | **Unblocked (#15 ✅).** Remove-PS work is on `pr/4` but **not on HF `main` yet** → PR + deploy, then manual test. Harry stabilizes Go ~Fri; FE testing ~Mon |
126
+ | 14 | `analysis_records` home | Rifqi + Sofhia + lead | ✅ | **Resolved 2026-06-25:** stays Python-owned, **renamed** (→ #21); schema handed to Harry so the dedorch migration creates it post-cutover (→ #22). Not moved to Go |
127
+ | 15 | Flip `ENABLE_SLOW_PATH=true` + verify an `AnalysisRecord` persists from a `structured_flow` question | Rifqi | ✅ | Verified locally 2026-06-25 (in-process). structured_flow on Titanic.csv → 3-task plan `check_data→retrieve_data→analyze_aggregate` (all success) → AnalysisRecord persisted (substantive) → `report_floor` pass → report generates (201). HF env-flip + Swagger run folds into #13 |
128
+ | 16 | Decide `analysis_records` creation under `SKIP_INIT_DB` | Rifqi + Harry | ✅ | **Resolved 2026-06-25:** Python defines it; **Harry's dedorch migration creates it** on env-move (Python still creates locally meanwhile) → exists post-cutover. Execution = #22 |
129
+ | 17 | Reconcile report contract with records-based: remove/flag `chat_history` | Sofhia/Rifqi | ✅ | Nothing to remove — `chat_history` was never in the report contract/code (only in help.md). Confirmed via grep; API_ENDPOINTS.md §5 documents the clean contract |
130
+ | 18 | Confirm Go owns ALL analysis-state writes + both creation gates; make Python `state_store.ensure` read-only | Rifqi + Harry | ⬜ | **Confirmed by 2026-06-25 checkpoint** (Python read-only; Go owns writes + new tables). Execution pending Go endpoints |
131
+ | 19 | Decide report author display-name source (`users.fullname` vs Go-passed name) | Sofhia | ✅ | Done 2026-06-25. `AnalysisReport.user_name`; `generator` renders `user_name or user_id`; `api/v1/report.py::_resolve_user_name` reads `users.fullname` never-throw (fallback `user_id`). Decided: resolve in Python (unblocked); swap to Go-passed name later if preferred |
132
+ | 20 | **Help handoff:** update `handlers/help.py` + `help.md` — drop the `problem_validated` tier + `define_problem_statement` action (the skill it points at is gone as of #1) | Sofhia | ✅ | Done 2026-06-25. `help.py`: actions = `ask_analysis_question` (always) + `generate_report` (if ready); renders objective/business_questions (getattr-tolerant). `help.md` v1→v2: 3 tiers, no `/problem_statement`, `/generate report`→`/report`. Local test_help updated → 11 pass |
133
+ | 21 | Rename `analysis_records` → **`report_inputs`** (table, ORM `ReportInputRow`, store `*ReportInputStore`) | Rifqi | ✅ | Done 2026-06-26. `sed` rename across 9 files; Pydantic `AnalysisRecord` kept; columns stay String (pure rename — uuid+FK is the #22 Harry schema). Name `report_inputs` (purpose; avoids Langfuse/`analyses_messages` clash). Write scope = one row per slow-path run. Suite **284 passed** |
134
+ | 22 | Finalize `report_inputs` schema → hand to Harry for the dedorch migration | Rifqi → Harry | 🔄 | **DDL ready** (uuid `id`/`analysis_id` + FK→`analyses(id)`; `user_id`/`plan_id` text; `data` jsonb = serialized `AnalysisRecord`, shape documented). dedorch has empty `analysis_records` → rename. Resolves #16. **Action: send Harry the DDL + `data` shape** |
135
+ | 23 | Report markdown formatting: tables, **bold**, *italic*, horizontal separators | Sofhia | ✅ | Done 2026-06-25. Added `---` separators between header + each section in `_render_markdown`. Tables (EDA) / bold (method labels) / italic (meta + citations) already emitted. Relaxed `report_summary.md` to allow inline `**bold**`/`*italic*` for emphasis (kept no-headings/no-bullets so it doesn't duplicate the section structure / Key Findings). Compile + ruff clean |
136
+ | 24 | Clarify report input contract: records table (+ `last_report` for edit mode?) | Rifqi/Sofhia ↔ Harry | ⬜ new | Edit-mode input left open at the checkpoint |
137
+ | 25 | Migrate Python chat path to Go `analyses_messages` (+ `analyses`) | Rifqi ↔ Harry | ⬜ | **Bigger than "confirm" (verified 2026-06-25):** dedorch `rooms` + `chat_messages` are **deprecated** (`zdeprecated_*`). Python's `Room`/`ChatMessage` models + `chat.py` `load_history`/`save_messages` target them → **break post-cutover**. Move history read/write to `analyses_messages` before the conn-string cutover |
138
+ | 26 | **Charts (DEFERRED):** store Plotly JSON in a future `chart` table (not matplotlib PNG) | — | ⏸️ | After the markdown path is done end-to-end |
139
+ | 27 | **Images (DEFERRED):** image table (id, analysis_id, msg/report ref, order) + originals in a bucket | — | ⏸️ | Maintenance-heavy; parked |
140
+ | 28 | **UI research** (FE): new-analysis form, knowledge menu (user vs analysis level), report artifacts + version selector | Team | ⬜ new | No dedicated UI person; interview + old analysis UI removed |
141
+
142
+ ## 5. Critical path & sequencing
143
+
144
+ - **Critical path:** #22 (send Harry the `report_inputs` schema). HF deploy (#13) for the playground. (#4 ✅, #21 ✅; Harry's #3 no longer blocks us — Python is getattr-tolerant.)
145
+ - **Parallelizable now:** #22 (handoff). (#4 ✅, #11 ✅ done.)
146
+ - **Harry-blocked / coordinated:** #3 (now 🔄; no longer blocks #4, which is ✅), #7 (Go endpoint), #18 (Go state ownership), #24 (contract). **#25 = chat-path migration to `analyses_messages`, a cutover blocker.**
147
+ - **Demo gate (playground, #13):** deploy the remove-`problem_statement` work to HF — slow path (#15 ✅)
148
+ and the report path are verified locally, and #16 is resolved (#22 hands Harry the schema). **Keep it
149
+ playground-able.**
150
+
151
+ ## 6. Decisions still open (need the team / Harry / lead)
152
+
153
+ - ~~`analysis_records`: dedorch-owned vs Python-owned (#16/#14).~~ RESOLVED: Python-owned + renamed **`report_inputs`** (#21 done); Harry's migration creates it (#22).
154
+ - ~~Whether `help` is its own endpoint or via `call_agent` (#8).~~ RESOLVED: `help` is a `call_agent` intent (no own endpoint).
155
+ - ~~Author display-name source for the report (#19).~~ RESOLVED: Python resolves `users.fullname` (fallback `user_id`); swap to a Go-passed name later if preferred.
156
+ - ~~Keep vs drop `chat_history` in the report contract (#17).~~ RESOLVED: never in the contract; report is records-based (analysis_id + user_id only).
157
+ - Confirm Go takes over analysis creation + both creation gates (data-first + mandatory fields) (#18).
158
+ - **Report input for edit mode** — does Python need the last report content? (#24)
159
+ - ~~`report_inputs` write scope — every agent call vs slow-path-only? (#21)~~ RESOLVED: one row per slow-path run (telemetry stays Langfuse).
160
+ - **Python history source**: confirm Go's `analyses_messages` as the table Python reads and writes chat history from (#25).
PHASE1_TO_PHASE2_REPORT.md DELETED
@@ -1,273 +0,0 @@
1
- # Phase 1 → Phase 2 Migration Report
2
-
3
- A walkthrough of what changed between the original retrieval-style backend (Phase 1) and the current catalog-driven backend (Phase 2). Intended as a hand-off for the lead.
4
-
5
- ---
6
-
7
- ## 1. The conceptual change
8
-
9
- **Phase 1** was a single retrieval-style RAG pipeline. Every question — whether it pointed at a database, a spreadsheet, or a PDF — went through the same primitive: **chunk + embed + top-K** over PGVector. Schema and tabular columns were embedded as chunks and ranked alongside prose. When the question needed SQL, the LLM **wrote the SQL string directly** (via `query_executor`).
10
-
11
- **Phase 2** splits the system into two paths governed by an LLM router:
12
-
13
- | Path | Primitive | Why |
14
- |---|---|---|
15
- | Unstructured (PDF / DOCX / TXT) | Dense similarity over prose chunks (PGVector) | Right primitive for free text |
16
- | Structured (DB / CSV / XLSX / Parquet) | **Per-user data catalog** → LLM emits a **JSON IR** of intent → deterministic **compiler** → **executor** (SQL or pandas) | A column lookup shouldn't go through a similarity ranking lottery; the LLM emits intent, never SQL syntax |
17
-
18
- Three explicit LLM call sites only:
19
-
20
- 1. **Intent router** (classifies the user message into `chat` / `unstructured` / `structured`)
21
- 2. **Query planner** (turns the question + catalog into a Pydantic-validated `QueryIR`)
22
- 3. **Chatbot agent** (formats the final answer, streamed over SSE)
23
-
24
- Everything else — IR validation, SQL/pandas compilation, execution — is deterministic Python.
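To make the "LLM emits intent, never SQL syntax" split concrete, here is a minimal sketch of a deterministic IR-to-SQL compile step. It is an illustration only: `MiniIR`, `Filter`, `FILTER_OPS`, `quote_ident`, and `compile_sql` are hypothetical stand-ins, not the real `QueryIR` or `SqlCompiler`, whose schemas this report does not show. The sketch assumes identifiers have already been validated against the catalog, and it passes filter values as named parameters in a dict rather than inlining them.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical operator table: a stand-in for the allow-list the report mentions.
FILTER_OPS = {"eq": "=", "gt": ">", "lt": "<"}


@dataclass
class Filter:
    column: str
    op: str
    value: Any


@dataclass
class MiniIR:
    """Trimmed-down, assumed IR shape; the real QueryIR is richer."""
    table: str
    columns: list[str]
    filters: list[Filter] = field(default_factory=list)
    limit: int = 100


def quote_ident(name: str) -> str:
    """Quote an identifier Postgres-style, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_sql(ir: MiniIR) -> tuple[str, dict[str, Any]]:
    """Compile the IR into (sql, params); values never appear in the SQL text."""
    params: dict[str, Any] = {}
    where: list[str] = []
    for i, f in enumerate(ir.filters):
        if f.op not in FILTER_OPS:
            raise ValueError(f"unsupported filter op: {f.op}")
        key = f"p{i}"
        where.append(f"{quote_ident(f.column)} {FILTER_OPS[f.op]} :{key}")
        params[key] = f.value
    select_list = ", ".join(quote_ident(c) for c in ir.columns)
    sql = f"SELECT {select_list} FROM {quote_ident(ir.table)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += f" LIMIT {int(ir.limit)}"
    return sql, params
```

Because the compiler is plain Python over a validated structure, the same IR always produces the same SQL text, and a hostile filter value can only ever reach the database as a bound parameter.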
25
-
26
- ---
27
-
28
- ## 2. File-by-file changes
29
-
30
- ### 2.1 Deleted (Phase 1 only)
31
-
32
- | Phase 1 path | Reason it was removed |
33
- |---|---|
34
- | `src/rag/base.py`, `src/rag/retriever.py`, `src/rag/router.py` | Replaced by `src/retrieval/` |
35
- | `src/rag/retrievers/baseline.py`, `schema.py`, `document.py` | Schema retrieval gone (catalog replaces it); document retriever rewritten in `src/retrieval/document.py` |
36
- | `src/tools/search.py` (whole `tools/` folder) | Only consumer was `rag/router.py` |
37
- | `src/query/base.py` | Duplicate of `query/executor/base.py` |
38
- | `src/query/query_executor.py` | Replaced by `src/query/service.py` |
39
- | `src/query/executors/db_executor.py` | Replaced by `src/query/executor/db.py` |
40
- | `src/query/executors/tabular.py` | Replaced by `src/query/executor/tabular.py` |
41
- | `src/agents/chatbot.py` (Phase 1 LangChain chatbot) | Phase 2 `ChatbotAgent` lives at the same path now — see §2.2 |
42
- | `src/api/v1/knowledge.py` | Fake `/knowledge/rebuild` endpoint, never wired |
43
- | `src/config/agents/system_prompt.md`, `guardrails_prompt.md` | Replaced by `src/config/prompts/{chatbot_system,guardrails}.md` |
44
- | `src/models/structured_output.py` (`IntentClassification`) | Replaced by `IntentRouterDecision` Pydantic model inside `agents/orchestration.py` |
45
- | `src/models/sql_query.py` | LLM no longer emits SQL; IR replaces it |
46
- | `src/pipeline/orchestrator.py` (empty stub) | Redundant — `StructuredPipeline` takes the introspector at `run()` time |
47
-
48
- ### 2.2 Renamed / moved (same role, new home)
49
-
50
- | Phase 1 location | Phase 2 location | Notes |
51
- |---|---|---|
52
- | `src/agents/chatbot.py` (Phase 1) → deleted, then `src/agents/answer_agent.py` (`AnswerAgent`) → renamed | `src/agents/chatbot.py::ChatbotAgent` | Final answer formation; streams via `astream` |
53
- | `src/knowledge/parquet_service.py` | `src/storage/parquet.py` | Parquet upload/download helper |
54
- | `src/pipeline/document_pipeline/document_pipeline.py` (folder) | `src/pipeline/document_pipeline.py` (flat) | Single module |
55
- | `src/rag/retrievers/document.py` | `src/retrieval/document.py` | `DocumentRetriever` migrated; tabular file types filtered out of results. **Post-report update (mentor commit 61c746f, 2026-05-20):** rewritten to raw SQL (pgvector `<=>` cosine, `<+>` manhattan only) to dodge asyncpg type-mapping issues with the Go-ingested schema. MMR / euclidean / inner_product dropped. |
56
- | `src/rag/router.py` | `src/retrieval/router.py` | `RetrievalRouter`, Redis-cached, unstructured-only; dead `db: AsyncSession` + `source_hint` params removed |
57
- | `src/rag/base.py` (`RetrievalResult`, `BaseRetriever`) | `src/retrieval/base.py` | Same dataclass + ABC |
58
-
59
- > **Heads-up on the intent router**: the Phase 1 file `src/agents/orchestration.py` and its class `OrchestratorAgent` were **kept in place** for Phase 2 — but the body was fully rewritten. The class now emits `IntentRouterDecision(needs_search, source_hint ∈ {chat, unstructured, structured}, rewritten_query)`. The prompt file and test file use the `intent_router` name (`config/prompts/intent_router.md`, `tests/agents/test_intent_router.py`), but **the source module is still `orchestration.py` and the class is still `OrchestratorAgent`**. Existing imports continue to work; only the behavior changed.
60
-
61
- ### 2.3 Added (Phase 2 new)
62
-
63
- **Catalog subsystem (whole new concept)**
64
-
65
- | Path | Role |
66
- |---|---|
67
- | `src/catalog/models.py` | Pydantic: `Catalog → Source[] → Table[] → Column[]`, `ForeignKey`, `ColumnStats.top_values` |
68
- | `src/catalog/introspect/base.py` | `BaseIntrospector` ABC |
69
- | `src/catalog/introspect/database.py` | DB introspector — wraps Phase 1 `db_pipeline/extractor.py` (`get_schema`, `profile_column`, `get_row_count`) |
70
- | `src/catalog/introspect/tabular.py` | CSV / XLSX / Parquet introspector — one `Table` per XLSX sheet |
71
- | `src/catalog/render.py` | Renders a `Source` for the planner prompt |
72
- | `src/catalog/validator.py` | Unique-ID + foreign-key-ref invariants |
73
- | `src/catalog/store.py` | Postgres `jsonb` upsert keyed by `user_id` (table `data_catalog`) |
74
- | `src/catalog/reader.py` | Loads + filters catalog by `source_hint` |
75
- | `src/catalog/pii_detector.py` | Flags PII columns at ingestion → suppresses `sample_values` |
76
- | `src/security/pii_patterns.py` | Name patterns + value regex used by the detector |
77
-
78
- **JSON IR + query subsystem**
79
-
80
- | Path | Role |
81
- |---|---|
82
- | `src/query/ir/models.py` | `QueryIR` Pydantic schema |
83
- | `src/query/ir/operators.py` | `ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP`, `TYPE_COMPATIBILITY` |
84
- | `src/query/ir/validator.py` | Catalog-aware IR validation (rejects unknown column ids, bad ops, type mismatches, oversize limits) |
85
- | `src/query/planner/service.py` | `QueryPlannerService.plan(question, catalog, previous_error)` — Azure OpenAI structured output → `QueryIR` |
86
- | `src/query/planner/prompt.py` | Builds the planner prompt from catalog text |
87
- | `src/query/compiler/base.py` | Compiler ABC |
88
- | `src/query/compiler/sql.py` | `SqlCompiler` (Postgres) — all 12 filter ops, params as a dict |
89
- | `src/query/compiler/pandas.py` | `PandasCompiler` — returns `CompiledPandas(apply, output_columns)` |
90
- | `src/query/executor/base.py` | `BaseExecutor` + `QueryResult` |
91
- | `src/query/executor/db.py` | `DbExecutor` — sqlglot SELECT-only guard, RO txn, 30 s `statement_timeout`, 10 k row cap |
92
- | `src/query/executor/tabular.py` | `TabularExecutor` — Parquet via blob, `asyncio.to_thread`, 10 k cap |
93
- | `src/query/executor/dispatcher.py` | `ExecutorDispatcher.pick(ir)` — picks by `source.source_type` |
94
- | `src/query/service.py` | `QueryService.run(user_id, question, catalog)` — plan → validate → retry (max 3) → dispatch → execute |
95
-
96
- **Agents**
97
-
98
- | Path | Role |
99
- |---|---|
100
- | `src/agents/orchestration.py` | `OrchestratorAgent` — Phase 1 file/class name preserved; Phase 2 body. Emits `IntentRouterDecision` |
101
- | `src/agents/chatbot.py` | `ChatbotAgent` — formerly `AnswerAgent` in `agents/answer_agent.py`; renamed in Cleanup PR |
102
- | `src/agents/chat_handler.py` | `ChatHandler.handle(...)` — top-level orchestrator; yields `intent` / `chunk` / `done` / `error` SSE events |
103
-
104
- **Pipelines & API**
105
-
106
- | Path | Role |
107
- |---|---|
108
- | `src/pipeline/structured_pipeline.py` | DB / tabular ingestion: introspect → merge → validate → upsert |
109
- | `src/pipeline/triggers.py` | `on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested` |
110
- | `src/api/v1/data_catalog.py` | `GET /api/v1/data-catalog/{user_id}` + `POST /api/v1/data-catalog/rebuild` |
111
- | `src/models/api/catalog.py` | Catalog request/response models |
112
- | `src/config/prompts/intent_router.md`, `query_planner.md`, `chatbot_system.md`, `guardrails.md` | New prompts. `guardrails.md` is appended to `chatbot_system.md` at load time |
113
- | `src/db/postgres/models.py` (added `Catalog` SQLAlchemy class) | Stores the per-user jsonb document in `data_catalog` |
114
-
115
- ### 2.4 Rewired API endpoints
116
-
117
- | Endpoint | Phase 1 wiring | Phase 2 wiring |
118
- |---|---|---|
119
- | `POST /api/v1/chat/stream` | Inline in `chat.py`: `OrchestratorAgent` → `retriever` → `query_executor` → `chatbot` | Delegates to `ChatHandler.handle()`. Redis cache, fast intent, history load, and message persistence stay in the endpoint |
120
- | `POST /api/v1/database-clients/{id}/ingest` | Called `db_pipeline_service.run()` and dual-wrote vectors | Calls **only** `on_db_registered` (catalog build). Failure → HTTP 500 |
121
- | `POST /api/v1/document/process` | Always pushed to vector store | PDF/DOCX/TXT → `knowledge_processor` (vectors); CSV/XLSX → `on_tabular_uploaded` (catalog only, **no vector embedding**) |
122
- | `POST /api/v1/document/upload` | Storage + DB row | Same, plus `on_document_uploaded` trigger |
123
- | `POST /api/v1/data-catalog/rebuild` | — | New: iterates all sources, re-runs per-source trigger |
124
- | `GET /api/v1/data-catalog/{user_id}` | — | New: returns `list[CatalogIndexEntry]` |
125
-
126
- ### 2.5 Phase 1 files still in production use
127
-
128
- These were **not rewritten** — Phase 2 imports them directly:
129
-
130
- - `src/database_client/database_client_service.py`
131
- - `src/utils/db_credential_encryption.py` (`decrypt_credentials_dict`) — `src/security/credentials.py` is still a stub
132
- - `src/pipeline/db_pipeline/db_pipeline_service.py` (`engine_scope` context manager — used by both the introspector and `DbExecutor`)
133
- - `src/pipeline/db_pipeline/extractor.py` (`get_schema`, `profile_column`, `get_row_count`)
134
- - `src/knowledge/processing_service.py` (PDF / DOCX / TXT extraction + embedding)
135
- - `src/db/postgres/{connection,init_db,vector_store}.py`, `src/storage/az_blob/`, `src/middlewares/`, `src/security/auth.py`
136
-
137
- ---
138
-
139
- ## 3. End-to-end flow (current state)
140
-
141
- ### 3.1 Ingestion
142
-
143
- ```
144
- User action Pipeline Storage
145
- ────────────── ──────────────────────────── ─────────────────
146
- upload PDF/DOCX/TXT → DocumentPipeline → Azure Blob + PGVector
147
- (extract → chunk → embed) (table: langchain_pg_embedding)
148
- + on_document_uploaded + retrieval cache invalidate
149
-
150
- upload CSV/XLSX → TabularIntrospector → Azure Blob (Parquet)
151
- (sheets / columns + sample + stats) + data_catalog jsonb row
152
- → CatalogValidator → CatalogStore (NO vector store — catalog only)
153
- via on_tabular_uploaded
154
-
155
- register DB → DatabaseIntrospector → data_catalog jsonb row
156
- (information_schema + sample + FKs)
157
- → validate → store
158
- via on_db_registered
159
- ```
160
-
161
- ### 3.2 Query (per user message → SSE stream)
162
-
163
- ```
164
- POST /api/v1/chat/stream
165
-
166
- ├── Redis cache check (24h TTL) — hit returns cached stream
167
- ├── _fast_intent (greetings / goodbyes) — bypass LLM
168
- ├── load history from chat_messages
169
-
170
- └── ChatHandler.handle(message, user_id, history) [src/agents/chat_handler.py]
171
-
172
- ├─ OrchestratorAgent.classify() [agents/orchestration.py]
173
- │ → needs_search, source_hint, rewritten_query
174
-
175
- ├── source_hint == "chat"
176
- │ → ChatbotAgent.astream() → yield chunk events
177
-
178
- ├── source_hint == "unstructured"
179
- │ → RetrievalRouter.retrieve() [retrieval/router.py, Redis-cached]
180
- │ → DocumentRetriever (raw SQL: pgvector `<=>` cosine or `<+>` manhattan)
181
- │ → ChatbotAgent.astream(chunks=...)
182
-
183
- └── source_hint == "structured"
184
- → CatalogReader.read(user_id, "structured") [catalog/reader.py]
185
- → QueryService.run(user_id, question, catalog) [query/service.py]
186
-
187
- ├─ QueryPlannerService.plan(...) [query/planner/service.py]
188
- │ LLM(catalog, question, prev_error?) → QueryIR
189
-
190
- ├─ IRValidator.validate(ir, catalog) [query/ir/validator.py]
191
- │ fail → loop back to planner with error context (max 3)
192
-
193
- ├─ ExecutorDispatcher.pick(ir) [query/executor/dispatcher.py]
194
- │ schema source → DbExecutor
195
- │ tabular source → TabularExecutor
196
-
197
- ├─ DbExecutor.run(ir): [query/executor/db.py]
198
- │ SqlCompiler → (sql, params)
199
- │ → sqlglot SELECT-only guard
200
- │ → engine_scope (Phase 1 utility) in asyncio.to_thread
201
- │ → RO txn + statement_timeout=30s + 10k cap
202
-
203
- ├─ TabularExecutor.run(ir): [query/executor/tabular.py]
204
- │ resolve Parquet blob path
205
- │ → download → PandasCompiler.apply(df)
206
- │ → asyncio.to_thread → 10k cap
207
-
208
- └─ QueryResult { rows, columns, row_count,
209
- truncated, source_id, error?, elapsed_ms }
210
-
211
- ChatbotAgent.astream(query_result=...)
212
- → yield chunk events
213
-
214
- └── final events: done / error
215
-
216
- └── persist user + assistant messages to chat_messages
217
- └── populate Redis cache
218
- ```
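The plan, validate, retry step in the structured branch above can be sketched as a small loop. This is an assumption-laden illustration, not the actual `QueryService.run`: `plan_with_retry`, the `plan` and `validate` callables, and the `(ir, error)` return shape are hypothetical, and "max 3" is read here as three attempts in total.

```python
from typing import Any, Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "max 3" means three planner attempts overall


def plan_with_retry(
    question: str,
    plan: Callable[[str, Optional[str]], Any],
    validate: Callable[[Any], Optional[str]],
) -> tuple[Optional[Any], Optional[str]]:
    """Return (ir, None) on success, or (None, last_error) once attempts run out.

    `plan` receives the previous validation error (None on the first attempt)
    so the planner can correct itself; `validate` returns None when the IR is
    valid, or an error message otherwise.
    """
    previous_error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        ir = plan(question, previous_error)
        previous_error = validate(ir)
        if previous_error is None:
            return ir, None
    return None, previous_error
```

Feeding the validator's message back into the planner is what lets a typo such as an unknown column id get corrected on the next attempt instead of failing the request outright.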
219
-
220
- **Safety invariants for the structured path** (read-only at every layer):
221
-
222
- 1. IR validated against the catalog before reaching the compiler
223
- 2. Identifiers come from the catalog (trusted; inlined as quoted identifiers)
224
- 3. Values from `IR.filters` are always parameterized
225
- 4. Compiler is deterministic — no LLM in the hot path
226
- 5. sqlglot rejects anything that isn't a pure SELECT
227
- 6. DB connection is read-only with a 30 s `statement_timeout`
228
- 7. Hard 10 000 row cap on both executors; neither raises — errors go in `QueryResult.error`
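Invariant 7 (a hard row cap, and executors that report failures instead of raising) can be illustrated with a small wrapper. This is a sketch under assumptions: the `QueryResult` fields mirror the names listed in the flow above, but their types, the `run_capped` helper, and the choice to discard partial rows on failure are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Optional

ROW_CAP = 10_000  # the hard cap named in the invariants


@dataclass
class QueryResult:
    # Field names follow the report; types are assumed for illustration.
    rows: list = field(default_factory=list)
    columns: list = field(default_factory=list)
    row_count: int = 0
    truncated: bool = False
    source_id: str = ""
    error: Optional[str] = None
    elapsed_ms: float = 0.0


def run_capped(
    source_id: str,
    columns: list,
    fetch: Callable[[], Iterable[Any]],
    cap: int = ROW_CAP,
) -> QueryResult:
    """Collect at most `cap` rows from `fetch()`; never raise to the caller."""
    start = time.perf_counter()
    result = QueryResult(columns=list(columns), source_id=source_id)
    try:
        for row in fetch():
            if len(result.rows) >= cap:
                # A row beyond the cap exists, so flag truncation and stop.
                result.truncated = True
                break
            result.rows.append(row)
        result.row_count = len(result.rows)
    except Exception as exc:
        # Assumption: partial rows are dropped when the fetch fails midway.
        result.rows = []
        result.error = f"{type(exc).__name__}: {exc}"
    result.elapsed_ms = (time.perf_counter() - start) * 1000
    return result
```

Returning the error in the result keeps the streaming layer simple: the chatbot can always format whatever comes back, including a failure message, without a separate exception path.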
229
-
230
- ---
231
-
232
- ## 4. Summary table for review
233
-
234
- | Concern | Phase 1 — where it lived | Phase 2 — where it lives | Change type |
235
- |---|---|---|---|
236
- | Intent classification | `agents/orchestration.py::OrchestratorAgent` (free-text intent) | **Same path + same class name** — body rewritten to emit `IntentRouterDecision` | Body rewrite only |
237
- | Top-level chat orchestration | Inline in `api/v1/chat.py` | `agents/chat_handler.py::ChatHandler` | Extracted to a reusable module |
238
- | Final answer formation | `agents/chatbot.py` (Phase 1 LangChain) | `agents/chatbot.py::ChatbotAgent` (was `AnswerAgent` in `answer_agent.py` mid-cycle) | Rewritten + renamed |
239
- | Schema retrieval (DB / tabular) | `rag/retrievers/schema.py` + PGVector chunks | **Removed**. Replaced by catalog (`catalog/store.py` jsonb) loaded verbatim into planner prompt | Whole concept replaced |
240
- | Doc retrieval (PDF / DOCX / TXT) | `rag/retrievers/document.py`, `rag/router.py` | `retrieval/document.py`, `retrieval/router.py` | Moved; Redis cache restored; tabular files filtered. **Post-report update:** rewritten to raw SQL (cosine / manhattan only); collection renamed `document_embeddings` → `documents` to match the Go ingestion service. |
241
- | Query writing | `query/query_executor.py` + `models/sql_query.py` (LLM writes SQL) | `query/planner/service.py` (LLM writes IR) + `query/compiler/sql.py` (deterministic) | LLM emits intent, not SQL |
242
- | DB execution | `query/executors/db_executor.py` | `query/executor/db.py::DbExecutor` | Folder renamed (`executors` → `executor`); sqlglot guard + RO txn + 30 s timeout kept |
243
- | Tabular execution | `query/executors/tabular.py` | `query/executor/tabular.py::TabularExecutor` | Parquet-only; pandas compiler split out |
244
- | Executor selection | Hard-coded in `query_executor.py` | `query/executor/dispatcher.py::ExecutorDispatcher` | New; routes by `source.source_type` |
245
- | Catalog (NEW) | — | `catalog/` (models, introspect/, validator, store, reader, pii_detector, render) | New subsystem |
246
- | Catalog persistence | (data was embedded in PGVector) | Postgres jsonb table `data_catalog`, keyed by `user_id` | New table |
247
- | Ingestion triggers | Inline in API endpoints | `pipeline/triggers.py` (`on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested`) | Centralized event entry points |
248
- | Structured pipeline | `pipeline/db_pipeline/db_pipeline_service.py` (still present for `engine_scope` + extractor reuse) | `pipeline/structured_pipeline.py` (orchestrator) — reuses Phase 1 extractor | New orchestrator wraps Phase 1 introspection helpers |
249
- | Document pipeline | `pipeline/document_pipeline/document_pipeline.py` (folder) | `pipeline/document_pipeline.py` (file) | Flattened; CSV / XLSX now skip the vector store |
250
- | Parquet helper | `knowledge/parquet_service.py` | `storage/parquet.py` | Moved into `storage/` |
251
- | Prompts | `config/agents/system_prompt.md`, `guardrails_prompt.md` | `config/prompts/{intent_router,query_planner,chatbot_system,guardrails}.md` | Folder renamed; split into four files; guardrails appended to `chatbot_system` at load |
252
- | PII detection | — | `catalog/pii_detector.py` + `security/pii_patterns.py` | New. Columns flagged `pii_flag=true` get `sample_values: null` so PII never enters prompts |
253
- | Chat endpoint | `api/v1/chat.py` (does everything inline) | `api/v1/chat.py` (cache + history + persistence) → delegates to `ChatHandler` | Slimmed; SSE event shape is `intent` / `chunk` / `done` / `error` |
254
- | DB ingest endpoint | `api/v1/db_client.py::ingest` (Phase 1 `db_pipeline_service.run()`) | `api/v1/db_client.py::ingest` (calls `on_db_registered` only) | Phase 1 dual-write removed |
255
- | Document process endpoint | `api/v1/document.py::process` (always vectorize) | `api/v1/document.py::process` (PDF/DOCX/TXT → vectors; CSV/XLSX → catalog via `on_tabular_uploaded`) | Routing by file type |
256
- | Catalog management API | — | `api/v1/data_catalog.py` (GET index + POST rebuild) | New |
257
-
258
- **Bottom line.** Every Phase 1 file under `src/rag/`, `src/tools/`, `src/query/executors/`, `src/query/query_executor.py`, `src/query/base.py`, `src/api/v1/knowledge.py`, and `src/config/agents/` is gone. Phase 1 introspection helpers under `src/pipeline/db_pipeline/` and `src/database_client/` are still imported by Phase 2 — they were not rewritten, just wrapped. The three LLM call sites are now explicit and the SQL-writing one no longer exists; the planner emits a Pydantic-validated `QueryIR` instead.
259
-
260
- The one filename gotcha to remember: the **intent router** still lives at `src/agents/orchestration.py` as class `OrchestratorAgent` (Phase 1 name kept for import-site compatibility, Phase 2 body). The matching prompt and tests use the `intent_router` name, but the source module does not.
261
-
262
- ---
263
-
264
- ## 5. Addendum — post-report changes (2026-05-20, mentor commit `61c746f`)
265
-
266
- This report was originally written as a snapshot at Phase 2 completion. The Phase 2 architecture itself hasn't changed, but a few implementation details have shifted as the Go migration progresses. Captured here so the report stays trustworthy:
267
-
268
- - **Doc ingestion is now a Go service.** PDF/DOCX/TXT chunking + embedding + writes into PGVector are no longer done by Python. The Python service reads only.
269
- - **PGVector collection renamed:** `document_embeddings` → `documents` (to match the Go service's writes). Touched files: `db/postgres/vector_store.py`, `retrieval/document.py`.
270
- - **`DocumentRetriever` rewritten to raw SQL.** Uses pgvector operators directly (`<=>` cosine, `<+>` manhattan). The LangChain ORM path couldn't cope with the schema written by the Go service (asyncpg type-mapping issues — id String vs UUID, jsonb_path_match binding quirks). MMR / euclidean / inner_product were dropped as part of the rewrite.
271
- - **Intent router defaults flipped.** Ambiguous topical/knowledge questions now prefer `unstructured` (was `structured`). Indonesian few-shot examples added to the prompt.
272
- - **Cache management endpoints added:** `DELETE /api/v1/chat/cache`, `DELETE /api/v1/chat/cache/room/{id}`, `DELETE /api/v1/retrieval/cache/{user_id}`. Redis chat cache now stores `{response, sources}` (was just `response`) so cached replies repopulate `message_sources`.
273
- - **Direction.** The long-term split is **Python = agent/ML layer, Go = data plane**. More pieces are expected to follow doc ingestion out of Python.
PROGRESS.md DELETED
@@ -1,692 +0,0 @@
1
- # Progress — Phase 2 catalog-driven build
2
-
3
- Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
4
-
5
- **Last updated**: 2026-06-12 (Redis Cloud live; R3 closed as won't-do; R5 cache fix; AnalysisRecord persistence landed — `PostgresAnalysisStore` + `analysis_records` table)
6
- **Current open PR**: `pr/3` — active.
7
-
8
- ---
9
-
10
- ## What just shipped (2026-06-12 — AnalysisRecord persistence, Rifqi)
11
-
12
- Groundwork for `generate_report`. The slow path now persists a real, citable
13
- record; the report (next) renders from it.
14
-
15
- - **Contract gaps closed** (`agents/slow_path/schemas.py`): `stage: CrispStage`
16
- added to `TaskResult` + `TaskSummary` and populated at all 3 `TaskResult` build
17
- sites in `task_runner.py` + copied in `assembler._build_record` — so the report
18
- can group its method appendix by CRISP-DM phase. `AnalysisRecord` gained identity:
19
- `record_id` (auto uuid), `analysis_id`/`user_id` (optional; stamped at persist).
20
- - **Real store** (`agents/slow_path/store.py`): `PostgresAnalysisStore` —
21
- `save()` (never-throw, idempotent upsert) + `list_for_analysis()` (oldest-first,
22
- the report's render order). `NullAnalysisStore` kept (tests / disabled persistence).
23
- `AnalysisStore` Protocol gained `list_for_analysis`.
24
- - **Table** (`db/postgres/models.py`): `analysis_records` jsonb table (one row per
25
- run, indexed by `analysis_id` + `user_id`); registered in `init_db.py`, created by
26
- `create_all` on startup (no migration — `data_catalog` precedent).
27
- - **Wired** (`agents/chat_handler.py`): default store flipped to `PostgresAnalysisStore`;
28
- `user_id` stamped onto the record at the save site (in scope there).
29
- - **Open**: `analysis_id` is `NULL` until Harry's Analysis State reaches the slow
30
- path (session-ID handoff needed to group records per analysis).
31
-
32
- ---
33
-
34
- ## Principal architecture review (2026-06-10) — findings + fix tracker
35
-
36
- A full external review (read the context docs + the slow path, tool layer, query
37
- spine, catalog plumbing, chat endpoint, config/connection layers) landed. It confirmed
38
- the DB-latency diagnosis and surfaced several gaps **not previously tracked here**.
39
- Verified against code before logging. Severity: **critical** / important / nice-to-have.
40
-
41
- **Runtime / latency (the original problem):**
42
- - DB connection handling is the anomaly, NOT cold start. `DbExecutor._run_sync`
43
- (`db.py:192`) → `engine_scope` does `create_engine → connect (TCP+TLS+SCRAM) → 2×SET
44
- → dispose` on EVERY query. Measured ~6–8s for 60 rows; a 2nd warm-session query was
45
- still ~6.6s → per-call handshake, never amortized. `engine_scope`'s connect-once-dispose
46
- semantics were designed for the ingestion pipeline and wrongly inherited by the query path.
47
- - `describe_source` ~3.5s is **planner-induced waste**: every few-shot (`examples.py`)
48
- opens with a `describe_source` task, so the LLM always plans a tool that re-reads from
49
- the catalog DB the same catalog already rendered into its prompt. Its impl does 2
50
- sequential full-catalog reads (`data_access.py:127-128`). Total catalog reads/request ~5×.
51
- - Azure LLM clients rebuilt per request: `ChatHandler(enable_tracing=True)` is constructed
52
- per request (`chat.py:172`) → fresh Orchestrator/Chatbot → fresh AzureChatOpenAI → fresh
53
- TLS to Azure each call. Planner/Assembler correctly use module singletons; the other two don't.
54
- - Tokens (~13k/request) are NORMAL for this design — do not optimize for $.
55
- - **Reject the scheduled DB-warmer idea**: targets cold start (~1.8s slice) not the per-call
56
- handshake, keeps serverless user DBs awake 24/7 (their compute bill), and decrypts every
57
- tenant's creds on a cron (attack surface). Strictly dominated by an engine cache +
58
- request-scoped pre-connect.
59
-
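A minimal sketch of the engine-cache idea that the warmer is compared against, assuming a bounded LRU with an idle TTL keyed by client id plus a hash of the credentials. `KeyedEngineCache`, `factory`, `dispose`, and `clock` are hypothetical names for illustration; the real cache and its SQLAlchemy wiring are not shown in this document.

```python
import hashlib
import time
from collections import OrderedDict


class KeyedEngineCache:
    """Bounded LRU cache with an idle TTL (illustrative, not the repo's class)."""

    def __init__(self, factory, dispose, max_size=50, idle_ttl=600.0,
                 clock=time.monotonic):
        self._factory = factory      # builds an engine from credentials
        self._dispose = dispose      # releases an evicted engine
        self._max_size = max_size
        self._idle_ttl = idle_ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (engine, last_used)

    @staticmethod
    def make_key(client_id, creds):
        # Hashing the credentials means a rotated secret yields a new key,
        # so the stale engine is never handed out again.
        digest = hashlib.sha256(creds.encode("utf-8")).hexdigest()[:16]
        return f"{client_id}:{digest}"

    def get(self, client_id, creds):
        now = self._clock()
        self._evict_idle(now)
        key = self.make_key(client_id, creds)
        if key in self._entries:
            engine, _ = self._entries.pop(key)   # warm hit: reuse
        else:
            engine = self._factory(creds)        # cold miss: pay the handshake once
        self._entries[key] = (engine, now)       # re-insert as most recently used
        while len(self._entries) > self._max_size:
            _, (oldest, _) = self._entries.popitem(last=False)
            self._dispose(oldest)
        return engine

    def _evict_idle(self, now):
        idle = [k for k, (_, used) in self._entries.items()
                if now - used > self._idle_ttl]
        for key in idle:
            engine, _ = self._entries.pop(key)
            self._dispose(engine)
```

The cache only pays the connection cost for users who actually query, which is the property the scheduled warmer lacks.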
60
- **Fix tracker (new):**
61
-
62
- | # | Fix | Severity | Owner | Status |
63
- |---|---|---|---|---|
64
- | R1 | **AuthN/AuthZ** on data endpoints: reject body-supplied `user_id`/`room_id` and derive identity from a verified token. `/chat/stream` has no auth (`chat.py:40,128`), so tenant isolation currently depends on client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused**: `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
65
- | R2 | **Always compile a LIMIT** — `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
66
- | R3 | **Commit `tests/` + minimal CI** — `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). ~~GitHub origin carries tests; HF Space gets the Docker build.~~ **2026-06-12: team decided tests stay gitignored/local — closed as won't-do.** | **critical (process)** | shared | `[won't do]` |
67
- | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
68
- | DB2 | **Keyed engine cache** — `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path — no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` — caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
69
- | DB3 | **Speculative pre-connect** — `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
70
- | R4 | **Per-stage progress events** — `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call — that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
71
- | R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Was `chat:{room_id}:{message}`, 24h TTL, no user → cross-user replay + stale answers. **2026-06-12 (Rifqi):** key now `chat:{room_id}:{user_id}:{message}` via `_chat_cache_key()`, TTL 24h→1h (checkpoint decision) — urgent now that Redis is a shared Cloud instance. `DELETE /chat/cache` gained a required `user_id` param (frontend heads-up); room-wide clear pattern unchanged. **Still open:** catalog-version in key / invalidate-on-ingest. | important | B | `[~]` |
72
- | R6 | **Hard time budget** — wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
73
- | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning), so the slow path stops paying ~2k tokens to narrate an empty RunState. | important | agent | `[ ]` |
74
- | R8 | **Catalog upsert race** — per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
75
- | R9 | **`extra="ignore"`** in `settings.py:15` (currently `allow` → typo'd env vars silently swallowed); require Azure keys in prod. | nice-to-have | B | `[ ]` |
76
- | R10 | **Read-only enforcement is session-state, not a server role.** `REPO_CONTEXT.md` counts "read-only DB credentials" as a defense layer but nothing requests/verifies a read-only role. Either request read-only creds at registration (verify via `SELECT current_setting(...)`) or drop the claim. | important | DB | `[ ]` |
77
- | R11 | **De-duplicate** `_PLACEHOLDER_RE` (`task_runner.py:31` vs validator) and `_DATA_ACCESS_TOOLS` (invoker vs planner registry) — import one from the other; comments aren't a sync mechanism. **TAB slice done (90e80f9):** canonical `DATA_ACCESS_TOOLS` now lives once in `tools/data_access.py`; `invoker.py` imports it (was a duplicated frozenset synced by comment). **Agent slice done (2026-06-10):** `PLACEHOLDER_RE` single-sourced in `planner/schemas.py` (part of the ToolCall placeholder convention); validator + task_runner import it. `planner/registry.py` keeps local spec *bodies* (stub pending KM-465 #4) but name-checks them against `DATA_ACCESS_TOOLS` in `_data_access_slice()` — upstream rename/add now raises at `default_registry()` instead of drifting silently. Registry output unchanged (same 12 tools, same order). | nice-to-have | agent/tool | `[x]` |
78
- | R12 | **Doc/process hygiene** — some code docstrings cite internal design specs that are not committed to the repo (design docs are kept out of version control), so the references dangle for anyone but the author; `CLAUDE.md` lists deleted modules (enricher, `pipeline/orchestrator.py`); `main` is 38 commits behind on a dead architecture. | nice-to-have | agent | `[ ]` |
79
- | R13 | **Pre-existing test failure** (found during R2, NOT caused by it): `tests/query/planner/test_prompt.py::test_render_catalog_with_sources` fails — `query/planner/prompt.py::render_catalog` now renders stable IDs (`src_test_db`) the test asserts are absent. Old query-planner path; confirmed failing on a clean tree. | nice-to-have | DB | `[ ]` |
80
- | T1 | **`input_schema` is presence-only, not type-checked** — `ToolSpec.input_schema` comment said "validates ToolCall.args", but `TaskRunner._validate_args` only enforces `required` presence; the `properties` types are documentation, never validated at runtime. Clarified the contract in `tools/contracts.py` so nobody assumes type-safety (a wrong-typed arg passes validation, surfaces only inside the compute fn). Doc-only, no behavior change (90e80f9). | nice-to-have | TAB | `[x]` |
81
- | T2 | **Dead Python embed path?** — `document_pipeline.process()` → `knowledge_processor` → `vector_store.aadd_documents()` still writes PDF/DOCX/TXT embeddings to `langchain_pg_embedding`, contradicting CLAUDE.md's "Go is sole writer, Python reads only". Verified the Go service (`Orchestrator-Agent-Service/internal/documents`) IS a complete ingestion writer to the same tables for all 5 file types (OCR + chunk + embed) → the Python embed branch is very likely redundant. **Blocked on one operational fact:** does the frontend still upload to `/document/process` (Python) or to Go? Park until confirmed — deleting a live ingestion path would break unstructured RAG. The csv/xlsx parquet branch stays regardless (feeds the catalog/tabular path). | nice-to-have | TAB | `[blocked]` |
82
-
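The R2 row describes a bound on every query. The sketch below shows one way that rule could look, assuming an explicit limit caps the result at its clamped value and an unbounded query fetches one extra row so truncation can be detected. `LimitPlan`, `plan_limit`, and `apply_cap` are hypothetical names; only `MAX_RESULT_ROWS = 10000` and the `cap + 1` rule come from the tracker.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_RESULT_ROWS = 10_000


@dataclass(frozen=True)
class LimitPlan:
    sql_limit: int  # value emitted in the LIMIT clause
    row_cap: int    # maximum rows handed back to the caller


def plan_limit(requested: Optional[int], cap: int = MAX_RESULT_ROWS) -> LimitPlan:
    """Return the LIMIT to compile and the cap the executor should enforce."""
    if requested is not None:
        bounded = max(0, min(requested, cap))  # honor the request, clamped
        return LimitPlan(sql_limit=bounded, row_cap=bounded)
    # Unbounded query: ask for one extra row so overflow is observable.
    return LimitPlan(sql_limit=cap + 1, row_cap=cap)


def apply_cap(rows: List[dict], plan: LimitPlan) -> Tuple[List[dict], bool]:
    """Trim rows to the cap and report whether anything was cut off."""
    truncated = len(rows) > plan.row_cap
    return rows[: plan.row_cap], truncated
```

Fetching `cap + 1` rows is what lets the executor flag truncation without a second `COUNT(*)` query.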
83
- **Slow-path endpoint wiring (2026-06-10):** the Orchestrator→slow-path is now wired
84
- into the live endpoint behind an **env flag**. `settings.enable_slow_path` (env
85
- `ENABLE_SLOW_PATH`, default **off**) is passed to the shared `ChatHandler` in
86
- `api/v1/chat.py`. Flip `ENABLE_SLOW_PATH=true` to route `structured` intents through
87
- Planner→TaskRunner→Assembler and test end-to-end from `/chat/stream` (status progress
88
- events + answer stream). Stays opt-in because `BusinessContext` is still the stub;
89
- fast/unstructured paths unchanged. Verified live via `ChatHandler.handle`.
90
-
91
- **Architecture verdict:** fundamentally sound (catalog-driven IR + deterministic compiler
92
- + static plan is the right call). Debt is transitional duplication (two planners/registries/
93
- contract modules — documented, owned) and `ChatHandler` drifting toward a god object
94
- (extract the slow-path composition root + the SSE `_build_sources`/`_normalize_chunks`
95
- mappers when convenient).
96
-
97
- ---
98
-
99
- ## What just shipped (2026-06-09/10 — tool layer, tracing, slow-path wiring)
100
-
101
- Big stretch since the slow-path workers landed. The tool layer (teammate-owned) is
102
- now **complete and real**, the slow path is **wired into `ChatHandler` behind a gate**,
103
- and the whole chat pipeline is **traced**. Fast path still untouched; live behavior
104
- unchanged (flags default off).
105
-
106
- **Tool layer — COMPLETE (teammate, KM-624→630).** `src/tools/` was re-created (the
107
- 2026-05-11 note about deleting it is superseded). Now teammate-owned:
108
- - `src/tools/analytics/` — the 8 **composite** `analyze_*` computes (descriptive,
109
- aggregate, comparison, contribution, profile, correlation, segment, trend) +
110
- prompt-style DESCRIPTIONs (KM-624/625).
111
- - `src/tools/contracts.py` — canonical `ToolSpec`/`ToolRegistry`/`ToolOutput` (KM-627).
112
- `agents/planner/contracts.py` now just re-exports them + keeps the `BusinessContext`
113
- stub (lead's).
114
- - `src/tools/registry.py::analytics_registry()` (KM-628); `src/tools/invoker.py` +
115
- `src/tools/data_access.py` — `AnalyticsToolInvoker` (KM-629), `DataAccessToolInvoker`
116
- + `CompositeToolInvoker` (KM-630). All never-throw. **Pattern A confirmed** (`analyze_*`
117
- take a `data` `${t<id>}` placeholder from an upstream `query_structured`).
118
- - **Verified live E2E (2026-06-09):** real `query_structured` against a user's Neon
119
- Postgres → `analyze_trend` → Assembler. `analyze_contribution` surfaced a real tool
120
- bug (Decimal vs float in `decomposition.py`) — degrade-and-continue held; **now fixed
121
- by the tool owner** (`_coerce_decimals` in `invoker._materialize`, KM-630 / commit
122
- 1195870), so the whole `analyze_*` family is covered in one place. **Directive:** agent
123
- side does NOT modify `src/tools/` without confirmation.
124
-
125
- **Planner — realigned to the real tools (KM-626).** `registry.py::default_registry()`
126
- composes the real `analytics_registry()` + a local stub for the 4 data-access tools.
127
- Few-shots grown to **A–D**: A `analyze_contribution`, B `analyze_trend`, C mixed
128
- structured+unstructured (`retrieve_documents`, independent branch), D `analyze_aggregate`.
129
- `parallelizable_with` **removed** from `Task` (schema/validator/examples/prompt) —
130
- TaskRunner derives parallelism from `depends_on` alone.
131
-
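As an illustration of deriving parallelism from `depends_on` alone, the sketch below groups tasks into waves: every task whose dependencies are already complete runs in the same wave. `execution_waves` and the `t1`-style ids are assumptions for this example, not the TaskRunner's actual code.

```python
from typing import Dict, List


def execution_waves(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group task ids into waves; tasks within a wave can run concurrently."""
    remaining = {task: set(deps) for task, deps in depends_on.items()}
    known = set(remaining)
    unknown = {dep for deps in remaining.values() for dep in deps} - known
    if unknown:
        raise ValueError(f"depends_on references unknown tasks: {sorted(unknown)}")

    waves: List[List[str]] = []
    done: set = set()
    while remaining:
        # A task is ready once all of its dependencies have completed.
        ready = sorted(t for t, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves
```

For a plan where `t3` depends on `t1` and `t4` depends on both `t1` and `t3`, independent tasks `t1` and `t2` land in the first wave, which is the information a separate `parallelizable_with` field would only have duplicated.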
132
- **Slow-path wiring — built, GATED OFF (KM-626).** `agents/chat_handler.py` gains a
133
- `structured→slow` branch behind `ChatHandler(enable_slow_path=False)`: when on it builds
134
- a per-request `CompositeToolInvoker` (composition root) + `SlowPathCoordinator`, streams
135
- `chat_answer`, persists the `analysis_record`. Two seams isolate the remaining blockers:
136
- - `agents/planner/business_context.py::get_business_context(user_id)` — async stub
137
- `BusinessContext`; TODO(lead) swap for the real read.
138
- - `agents/slow_path/store.py` — `AnalysisStore` Protocol + `NullAnalysisStore` (logs
139
- only). Real store = `analysis_records` table in the catalog DB (Neon `dataeyond`) —
140
- **table not created yet**. `chat_answer` still emitted as one chunk (not token-streamed).
141
-
142
- **Observability — Langfuse tracing wired (KM-631).** `src/observability/langfuse/
143
- tracing.py` — `RequestTracer`/`NullTracer`/`TracingToolInvoker` + `_redact`. One trace
144
- per request groups Orchestrator.classify, Planner.plan (each retry = its own generation),
145
- Assembler.assemble, Chatbot.astream + tool spans (latency/metadata only). Gated:
146
- `ChatHandler(enable_tracing=False)`; `api/v1/chat.py` opts in (`=True`). PII policy:
147
- Orchestrator+Planner unmasked (question + PII-safe summary); Assembler+Chatbot masked
148
- (see real rows/chunks); tool spans carry name + arg keys + row count only. Zero added
149
- LLM tokens; verified live to US Cloud.
150
-
151
- **Live evals green (2026-06-09, real Azure 4o):** `RUN_PLANNER_EVAL=1` and
152
- `RUN_SLOW_PATH_EVAL=1` both pass — Planner emits valid catalog-consistent `QueryIR` and
153
- wires Pattern A correctly; self-corrects via retry.
154
-
155
- **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
156
- real `AnalysisStore` (**Rifqi owns, 2026-06-12** — folded into `generate_report` work,
157
- see `CHECKPOINT_PLAN_2026-06-17.md`); register data-access `ToolSpec`s upstream (`data_access_registry()`)
158
- or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
159
- `BusinessContext` is real. NOTE: 3 test files were already broken by rename rot
160
- (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
161
- `answer_agent`/`intent_router` module names).
162
-
163
- ---
164
-
165
- ## What just shipped (2026-06-10 — TAB: tool-layer hardening + DRY)
166
-
167
- Owner-side companion to the agent block above. After the live E2E surfaced real-data
168
- edge cases, the tool layer got a round of correctness hardening. All in TAB-owned paths
169
- (`src/tools/`, `src/catalog/`); no agent-side or API change.
170
-
171
- **JSON-safety across the `analyze_*` family.** Real DB rows carry scalar types that
172
- don't survive the jsonb / SSE round-trip:
173
- - `[KM-630] coerce DB Decimal → float` (commit 1195870) — `_coerce_decimals` in
174
- `invoker._materialize` converts object-columns holding `decimal.Decimal` (asyncpg
175
- returns NUMERIC as `Decimal`) to `float64` before any compute runs. Fixes the
176
- `float + Decimal` TypeError in `decomposition.analyze_contribution` **and** the whole
177
- family in one seam — only touches columns that actually contain a `Decimal`.
178
- - `[KM-624] non-JSON-safe scalars in mode & top_value` (commit 6981ed3) — normalize
179
- numpy / non-native scalars so descriptive + top-value outputs serialize cleanly.
180
-
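The JSON-safety fixes above operate on DataFrame columns. As a simplified stand-in on plain Python rows (an assumption made to keep the example dependency-free), the same idea looks like this; `to_json_safe` is a hypothetical helper, not the repo's `_coerce_decimals`.

```python
import json
from datetime import date, datetime
from decimal import Decimal
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively convert values that json.dumps rejects into native types."""
    if isinstance(value, Decimal):
        return float(value)          # NUMERIC columns arrive as Decimal
    if isinstance(value, (datetime, date)):
        return value.isoformat()     # timestamps become ISO-8601 strings
    if isinstance(value, dict):
        # Keys are normalized too, since group-by keys can be timestamps.
        return {str(to_json_safe(k)): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    return value


rows = [{"amount": Decimal("12.50"), "day": date(2026, 6, 9)}]
payload = json.dumps(to_json_safe(rows))
```

Converting once at the boundary, before any compute runs, keeps each individual tool free of type special cases.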
181
- **Planner↔Tools registry alignment + Timestamp keys** (commit 4bb7623, `fix(tools)`):
182
- - `registry.py` — `analyze_descriptive.required` corrected `["data"]` → `["data",
183
- "column_ids"]` to match the compute signature (`column_ids` has no default). Prevents
184
- the Planner from emitting a call that's missing a required arg. `analyze_profile` stays
185
- `["data"]` (its `column_ids` defaults to `None`).
186
- - `aggregation._clean` — group-by over a datetime column produced `pd.Timestamp` group
187
- keys that aren't JSON-safe; now normalized to `.isoformat()` alongside the existing
188
- numpy `.item()` branch.
189
-
190
- **DRY: single `SAMPLE_LIMIT` constant** (commit 6d46ba5, `[NOTICKET] refactor(catalog)`):
191
- - One source of truth in `catalog/introspect/base.py` (`SAMPLE_LIMIT = 3`, down from 5 —
192
- token cost: sample values feed the planner prompt). Both introspection paths import it:
193
- `catalog/introspect/tabular.py` and `pipeline/db_pipeline/extractor.py` (which dropped
194
- its own local `= 3`). Dependency direction is pipeline→catalog (no circular import).
195
- Stale test `test_sample_values_capped_at_five` updated to assert the real cap (3).
196
-
197
- **Audit result:** Planner↔Tools arg alignment swept end-to-end — 7/8 `analyze_*` tools
198
- already matched; the 1 mismatch (`analyze_descriptive`) is the fix above. Pattern A holds
199
- across all of them.
200
-
201
- ---
202
-
203
- ## What just shipped (2026-06-08 — KM-626: slow-path agent layer)
204
-
205
- The rest of the slow path after the Planner (KM-567) — TaskRunner, Assembler, and
206
- the coordinator. Built and tested against
207
- mocks; **not yet wired into the live `ChatHandler`** (waits on the tool team's real
208
- `ToolInvoker` + a real `BusinessContext`). Fast path untouched.
209
-
210
- **Naming:** "Orchestrator" = the entry dispatcher only (`agents/orchestration.py`).
211
- The slow-path **workers** live in **`agents/slow_path/`** — deliberately NOT named
212
- "orchestrator".
213
-
214
- **Files added** (`src/agents/slow_path/`):
215
- - `schemas.py` — `TaskResult`, `RunState`; `TaskSummary`, `AnalysisRecord`,
216
- `AssembledOutput`, `AssemblerNarrative`. Reuses `ToolOutput`.
217
- - `invoker.py` — `ToolInvoker` Protocol only; the tool team owns the impl (KM-418).
218
- - `errors.py` — `SlowPathError`, `AssemblerError`.
219
- - `task_runner.py` — deterministic, 0 LLM: wave-based execution, `${t<id>}` placeholder
220
- resolution, internal `validate_args`, never-throw invoke, status labeling,
221
- degrade-and-continue → `RunState`.
222
- - `assembler.py` + `prompt.py` + `config/prompts/assembler.md` — single LLM call →
223
- `AssemblerNarrative`; code merges with `RunState` to build the `AnalysisRecord`
224
- (structured fields copied, never re-authored).
225
- - `coordinator.py` — `SlowPathCoordinator`: Planner → TaskRunner → Assembler.
226
-
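To illustrate the `${t<id>}` placeholder resolution done by the task runner, here is a minimal sketch that swaps a whole-value placeholder for the output of an earlier task. The regex and the assumption that ids are numeric are illustrative; the repo's real `PLACEHOLDER_RE` is not shown in this document.

```python
import re
from typing import Any, Dict

# Hypothetical pattern: matches a whole-string placeholder such as "${t1}".
PLACEHOLDER_RE = re.compile(r"^\$\{t(\d+)\}$")


def resolve_args(args: Dict[str, Any], results: Dict[str, Any]) -> Dict[str, Any]:
    """Replace placeholder args with the output of the referenced upstream task."""
    resolved: Dict[str, Any] = {}
    for name, value in args.items():
        match = PLACEHOLDER_RE.match(value) if isinstance(value, str) else None
        if match is None:
            resolved[name] = value  # literal argument, passed through unchanged
            continue
        task_id = f"t{match.group(1)}"
        if task_id not in results:
            raise KeyError(f"arg {name!r} references {task_id}, which has no result")
        resolved[name] = results[task_id]
    return resolved
```

Under Pattern A, an `analyze_*` call would carry `{"data": "${t1}"}` and receive the rows produced by the upstream `query_structured` task `t1`.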
227
- **Tests added** (`tests/agents/slow_path/`, 12 passing; gitignored): schema round-trips
228
- + chat_answer-first; runner happy/placeholder/parallel/degrade/arg-miss; assembler
229
- narrative-vs-snapshot + question threading; coordinator end-to-end. `ruff` clean;
230
- tool-agnostic (no `src/tools/*` import).
231
-
232
- **Open follow-ups (not blockers):** wire `SlowPathCoordinator` into the expanded
233
- Orchestrator/`ChatHandler` once the real invoker + `BusinessContext` exist; swap the
234
- test `MockToolInvoker` for the tool team's real one (zero agent change, INV-7); 4o →
235
- GPT-mini deployment swap.
236
-
237
- ---
238
-
239
- ## What just shipped (2026-06-08 — tool taxonomy + ownership revision)
240
-
241
- Team decisions after the teammate pushed KM-624 (`src/tools/analytics/`):
242
-
243
- - **Composite tools, not atomic.** v1 uses **composite "family" tools** (`analyze_*`),
244
- not the atomic `compute_*` set the earlier draft assumed. One `analyze_*` call does a
245
- whole analytical job (e.g. `analyze_descriptive` subsumes median/mode/stddev/percentile;
246
- `analyze_trend` subsumes `date_trunc`). Tool-taxonomy decision recorded.
247
- - **Tool team owns ALL tools** — compute, data-access (`query_structured`,
248
- `retrieve_documents`, `list_sources`, `describe_source`), the wrapper/invoker layer
249
- (KM-418), and **all tool tests**. The agent team owns nothing below the registry contract.
250
- - **Planner stub realigned to the real tools.** `registry.py` rewritten from the 9 atomic
251
- entries to **12 composite entries** (4 data-access + 8 `analyze_*`); `examples.py`
252
- rewritten (Example A → `analyze_contribution`, Example B → `analyze_trend`); `planner.md`
253
- bullet updated; planner tests updated. 32 passing + 1 gated, `ruff` clean.
254
- - **Open (tool team's call):** Pattern A (analyze_* take a `${t<id>}` `data` placeholder
255
- from an upstream `query_structured`) vs Pattern B (self-fetch by `source_id`). Stub
256
- assumes A; reshaped to match once decided (agent code unaffected, INV-7).
257
- - **New coupling:** the tool team's `query_structured`/`retrieve_documents` are expected
258
- to call our existing `QueryService`/`RetrievalRouter`; `query_structured` stays
259
- inline-`QueryIR` so `IRValidator` still applies. Interface to coordinate.
260
-
261
- **Next (our scope, all mock-able now):** TaskRunner + Assembler against a `MockToolInvoker`,
262
- then Orchestrator slow-path wiring. Stubs still to retire on integration: `contracts.py`
263
- (BusinessContext from lead; ToolSpec/ToolRegistry/ToolOutput from tool team) and `registry.py`
264
- (real registry from tool team). Infra: swap the 4o stand-in for a GPT-mini deployment.
265
-
266
- ---
267
-
268
- ## What just shipped (2026-06-05 — Phase 3: Planner agent)
269
-
270
- First slow-path agent (the Planner). A single LLM
271
- call turns BusinessContext + Catalog + ToolRegistry + question + Constraints into a
272
- validated, **static** `TaskList` (DAG of fully-specified tool-call chains). No
273
- replanning (INV-6); tool-agnostic against a registry contract (INV-7). Fast path
274
- (`agents/orchestration.py`, `agents/chatbot.py`, `query/`) untouched.
275
-
276
- **Files added** (`src/agents/planner/`):
277
- - `contracts.py` — **STUB** Pydantic contracts pending reconciliation: `BusinessContext`
278
- (+KeyTerm/DataTableNote/DataColumnNote, lead's), `ToolSpec`/`ToolRegistry` (tool
279
- team KM-608), `ToolOutput` envelope.
280
- - `schemas.py` — `CrispStage`, `ToolCall`, `Task`, `TaskList`. No replan schemas.
281
- - `inputs.py` — `CatalogSummary` (condensed, PII `sample_values` nulled, `from_catalog`
282
- builder + `render`) and `Constraints` (max_tasks=5, modeling_allowed=False).
283
- - `registry.py` — **STUB** v1 P0 registry: query_structured, retrieve_documents,
284
- list_sources, describe_source, compute_median/stddev/percentile/mode, date_trunc.
285
- - `errors.py` — `PlannerError`, `PlannerValidationError`.
286
- - `prompt.py` + `config/prompts/planner.md` — system prompt (INV-1/6/7 + principles) +
287
- per-call human content (context + catalog + tools + constraints + few-shots + question).
288
- - `examples.py` — two few-shots (A exploratory revenue-by-category; B descriptive
289
- monthly-trend-by-region with date_trunc), built from the real `TaskList` schema.
290
- - `validator.py` — `PlannerValidator` running the 8 checks; reuses the existing
291
- `IRValidator` for inline `query_structured` IRs.
292
- - `service.py` — `PlannerService` + `plan_analysis(...)`: chain (mirrors
293
- `query/planner/service.py`) + validate-and-retry loop (max 3, mirrors `QueryService`).
294
-
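The validate-and-retry loop in `service.py` can be pictured roughly as follows: generate a plan, validate it, and on failure re-prompt with the previous error, up to three attempts. This is a synchronous sketch under stated assumptions; `plan_with_retry`, `generate`, `validate`, and `PlanRejectedError` are hypothetical stand-ins rather than the service's real API.

```python
from typing import Any, Callable, List, Optional


class PlanRejectedError(Exception):
    """Hypothetical stand-in for the planner's validation error."""


def plan_with_retry(
    generate: Callable[[Optional[str]], Any],
    validate: Callable[[Any], List[str]],
    max_attempts: int = 3,
) -> Any:
    """Call generate until validate reports no problems or attempts run out."""
    previous_error: Optional[str] = None
    for _ in range(max_attempts):
        # The prior error is fed back so the model can self-correct.
        candidate = generate(previous_error)
        problems = validate(candidate)
        if not problems:
            return candidate
        previous_error = "; ".join(problems)
    raise PlanRejectedError(
        f"no valid plan after {max_attempts} attempts: {previous_error}"
    )
```

Keeping validation outside the model call means each retry costs one extra LLM call and nothing else, and the loop stays testable with a fake `generate`.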
295
- **Tests added** (`tests/agents/planner/`, 30 passing + 1 gated): `test_schemas.py`,
296
- `test_inputs.py`, `test_validator.py` (one failure per check + happy paths),
297
- `test_service.py` (`_FakeChain` + retry), `test_golden_questions.py` (live eval gated on
298
- `RUN_PLANNER_EVAL=1`). `ruff check` clean on planner paths.
299
-
300
- **Open follow-ups (not blockers):** reconcile `BusinessContext` with the lead and
301
- `ToolRegistry`/`ToolSpec` + real tools with teammate (KM-608); "GPT mini" currently uses
302
- the configured 4o deployment (swap `azure_deployment` when a mini deployment exists). Next:
303
- Orchestrator slow-path expansion + TaskRunner + Assembler.
304
-
305
- ---
306
-
307
- ## Legend
308
-
309
- - `[x]` done and merged
310
- - `[~]` in progress (open PR or active branch)
311
- - `[ ]` not started
312
- - **DB** / **TAB** / **B** — ownership (from REPO_CONTEXT.md)
313
-
314
- ---
315
-
316
- ## PR sequence
317
-
318
- | PR | Status | Owner(s) | Scope |
319
- |---|---|---|---|
320
- | PR1 | `[x]` merged | DB | Contract locks + catalog plumbing + DB introspector + IR validator + tests |
321
- | PR1-tab | `[x]` shipped | TAB | Tabular introspector + on_tabular_uploaded trigger + 31 unit tests |
322
- | PR2a | `[x]` merged | DB | CatalogEnricher + StructuredPipeline + on_db_registered trigger + FK extension on Table (enricher later removed in KM-557) |
323
- | KM-557 | `[x]` shipped | DB | Drop CatalogEnricher entirely (cost cut — planner uses stats + sample rows directly); rename jsonb table `catalogs` → `data_catalog`; add `GET /api/v1/data-catalog/{user_id}` index endpoint for catalog refresher |
324
- | PR2b | `[x]` shipped | DB-solo (B-review) | IntentRouter + planner prompt + planner LLM service |
325
- | PR3-DB | `[x]` shipped | DB | SqlCompiler (Postgres) + DbExecutor (sqlglot guard, RO + statement_timeout, asyncio.to_thread) + 36 golden IR→SQL tests |
326
- | PR3-TAB | `[x]` shipped | TAB | PandasCompiler + TabularExecutor + 43+12 golden IR→DataFrame tests |
327
- | PR4 | `[x]` | DB-solo (B-review) | ExecutorDispatcher + QueryService + ChatHandler module. **API rewired in Cleanup PR.** |
328
- | PR5 | `[x]` shipped | DB-solo (B-review) | Retry/self-correction loop on validation failure (lives in QueryService, max 3 attempts, planner re-prompted with prior error) |
329
- | PR6 | `[~]` scaffold | DB-solo (B-review) | Eval harness scaffold + 3 DB-targeting golden cases. Skipped without `RUN_PLANNER_EVAL=1` env. TAB extends with tabular cases. |
330
- | PR7 | `[x]` | DB-solo (B-review) | `ChatbotAgent` (renamed from `AnswerAgent`) + chatbot_system + guardrails prompts. `answer_agent.py` → `chatbot.py`, `AnswerAgent` → `ChatbotAgent`. API rewired in Cleanup PR. |
331
- | Cleanup | `[x]` | B | ChatHandler wired to chat.py; Phase 1 dual-write dropped from /ingest; on_catalog_rebuild_requested + POST /data-catalog/rebuild; dead modules deleted (chatbot Phase 1, orchestrator, query/base, knowledge.py, config/agents/); retrieval cache restored via RetrievalRouter; top_values added to ColumnStats; lifespan migration; knowledge_router removed. |
332
-
333
- ---
334
-
335
- ## All items
336
-
337
- ### Contracts (B — shared)
338
-
339
- | # | Item | Status | Notes |
340
- |---|---|---|---|
341
- | 1 | Catalog Pydantic models (`catalog/models.py`) | `[x]` | PR1 added `location_ref` URI-scheme docstring; PR2a added `ForeignKey` model + `Table.foreign_keys` field |
342
- | 2 | IR Pydantic models (`query/ir/models.py`) | `[x]` | Pre-existing scaffold |
343
- | 3 | IR operator whitelists (`query/ir/operators.py`) | `[x]` | PR1 filled `TYPE_COMPATIBILITY` matrix |
344
- | 4 | PII patterns / regex (`security/pii_patterns.py`) | `[x]` | Pre-existing |
345
- | — | `data_catalog` Postgres jsonb table (`db/postgres/models.py`) | `[x]` | PR1 added `Catalog` SQLAlchemy class + `init_db.py` import. KM-557 renamed `__tablename__` from `catalogs` → `data_catalog`; created fresh (no migration) |
346
- | — | `QueryResult` shape (`query/executor/base.py`) | `[x]` | Pre-existing scaffold; `columns: list[str]` added (TAB owner, PR1-tab) — DbExecutor updated to populate it. |
347
- | — | `Source.location_ref` URI scheme | `[x]` | PR1 documented in `catalog/models.py` docstring |
348
-
349
- ### Ingestion — introspection
350
-
351
- | # | Item | Owner | Status | Notes |
352
- |---|---|---|---|---|
353
- | 5 | DB introspector (`catalog/introspect/database.py`) | DB | `[x]` | PR1 — reuses Phase 1 `database_client_service`, `db_credential_encryption`, `db_pipeline_service.engine_scope`, `extractor.get_schema/profile_column/get_row_count`. PR2a wired FK extraction (was discarded before). |
354
- | 6 | Tabular introspector (`catalog/introspect/tabular.py`) | TAB | `[x]` | PR1-tab — downloads original blob (CSV/XLSX/Parquet), one Table per sheet (XLSX) or one Table (CSV/Parquet). `source_id = document_id`. `fetch_doc`/`fetch_blob` injectable for unit tests (no Settings). **2026-06-10**: sample cap now imports the shared `SAMPLE_LIMIT` (=3) from `catalog/introspect/base.py` — single source of truth across the tabular + DB introspection paths (commit 6d46ba5). |
355
- | 7 | `BaseIntrospector` ABC (`catalog/introspect/base.py`) | B | `[x]` | Pre-existing; signature locked |
356
-
357
- ### Ingestion — shared catalog plumbing
358
-
359
- | # | Item | Owner | Status | Notes |
360
- |---|---|---|---|---|
361
- | 8 | ~~Catalog enricher + prompt~~ | B | **REMOVED in KM-557** | Cost optimization — planner reads stats + sample rows + column names directly. `catalog/enricher.py` + `config/prompts/catalog_enricher.md` deleted. `render_source` (the only piece still needed) moved to `src/catalog/render.py`. Tests moved to `tests/catalog/test_render.py`. |
362
- | 9 | Catalog validator (`catalog/validator.py`) | B | `[x]` | PR1 (DB owner picked up) — uniqueness invariants |
363
- | 10 | Catalog store — Postgres jsonb (`catalog/store.py`) | B | `[x]` | PR1 (DB owner picked up) — `INSERT ... ON CONFLICT` |
364
- | 11 | Catalog reader (`catalog/reader.py`) | B | `[x]` | PR1 (DB owner picked up) — filters by source_hint, empty on miss |
365
- | 12 | PII detector (`catalog/pii_detector.py`) | B | `[x]` | PR1 (DB owner picked up) — name + value matching, bias toward over-flag |
366
-
367
- ### Ingestion — pipelines
368
-
369
- | # | Item | Owner | Status | Notes |
370
- |---|---|---|---|---|
371
- | 13 | Structured pipeline (`pipeline/structured_pipeline.py`) | B | `[x]` | PR2a (DB owner) — Source-type-agnostic: caller supplies the introspector. `default_structured_pipeline()` factory wires production deps lazily so tests can inject mocks without `Settings()` construction. **KM-557**: enrich step removed; pipeline is now `introspect → merge with existing → validate → upsert`. Constructor no longer takes `enricher`. |
372
- | 14 | Triggers (`pipeline/triggers.py`) | B | `[x]` | PR2a — `on_db_registered` implemented (DB owner). PR1-tab — `on_tabular_uploaded` implemented (TAB owner). **2026-05-11** — `on_document_uploaded` implemented. **2026-05-12** — `on_catalog_rebuild_requested` implemented: iterates all Sources in current catalog, re-runs `on_db_registered` (schema) or `on_tabular_uploaded` (tabular) per source; per-source errors logged but don't abort. |
373
- | 15 | Ingestion orchestrator (`pipeline/orchestrator.py`) | B | **DELETED** | Redundant stub — `StructuredPipeline` already takes introspector at run() time. Deleted in Cleanup PR. |
374
- | 16 | Document pipeline (`pipeline/document_pipeline.py`) | TAB | `[x]` | Flattened `pipeline/document_pipeline/document_pipeline.py` (folder) → `pipeline/document_pipeline.py` (file). Updated import in `api/v1/document.py`. |
375
-
376
- ### Query — shared spine
377
-
378
- | # | Item | Owner | Status | Notes |
379
- |---|---|---|---|---|
380
- | 17 | IR validator (`query/ir/validator.py`) | B | `[x]` | PR1 (DB owner) — full rule set; descriptive errors for planner retry |
381
- | 18 | Planner LLM service (`query/planner/service.py`) | B | `[x]` | PR2b — Azure OpenAI structured output → `QueryIR`. Injectable chain. Supports retry via `previous_error` argument. |
382
- | 19 | Planner prompt (`query/planner/prompt.py`, `config/prompts/query_planner.md`) | B | `[x]` | PR2b — system prompt with hard constraints + few-shot for DB and tabular sources. `build_planner_prompt(question, catalog, previous_error)` calls `catalog.render.render_source` (renamed from `catalog.enricher.render_source` in KM-557). |
383
- | 20 | Intent router (`agents/orchestration.py` — class `OrchestratorAgent`; `config/prompts/intent_router.md`) | B | `[x]` | PR2b — single LLM call → `IntentRouterDecision(needs_search, source_hint, rewritten_query)`. Supports conversation history. **NOTE**: source filename + class name were kept from Phase 1 for import-site compatibility; only the body is Phase 2. Prompt file and test file use the `intent_router` name. |
384
- | 21 | Executor base + `QueryResult` (`query/executor/base.py`) | B | `[x]` | Pre-existing scaffold |
385
- | 22 | Executor dispatcher (`query/executor/dispatcher.py`) | B | `[x]` | PR4 — picks DbExecutor / TabularExecutor by `source.source_type`. Lazy imports of production executors keep import side-effect-free for tests. Caches per source_type. |
386
- | 23 | Compiler base ABC (`query/compiler/base.py`) | B | `[x]` | Pre-existing scaffold |
387
- | 24 | Top-level QueryService (`query/service.py`) | B | `[x]` | PR4+5 — `plan → validate → dispatch → execute → QueryResult`. Retry loop on validation failure (max 3, planner re-prompted with prior error). Catches NotImplementedError from TabularExecutor placeholder gracefully. Never raises. |
388
-
389
- ### Query — DB path
390
-
391
- | # | Item | Status | Notes |
392
- |---|---|---|---|
393
- | 25 | SQL compiler (`query/compiler/sql.py`) | `[x]` | PR3-DB — Postgres dialect (Supabase reuses); deterministic IR → (sql, named-params dict); double-quoted identifiers from catalog; all whitelisted ops (=, !=, <, <=, >, >=, in, not_in, is_null, is_not_null, like, between); alias-aware order_by; `CompiledSql.params: dict[str, Any]` (changed from `list`). MySQL/BigQuery/Snowflake compilers later. |
394
- | 26 | DB executor (`query/executor/db.py`) | `[x]` | PR3-DB — sync engine via `db_pipeline_service.engine_scope` inside `asyncio.to_thread`. sqlglot SELECT-only / no-DML guard. Postgres-only session settings: `default_transaction_read_only=on` + `statement_timeout=30000`. asyncio.wait_for backstop. Never raises — populates `QueryResult.error`. 10k row hard cap. |
395
- | 27 | Credential encryption (`security/credentials.py`) | `[ ]` | Stub exists; PR1 reused Phase 1 `utils/db_credential_encryption.py` instead. Move in cleanup PR |
396
- | 28 | User-DB connection management | `[x]` | PR3-DB reused Phase 1 `db_pipeline_service.engine_scope` (same as PR1 introspector); no new helper needed |
397
-
398
- ### Query — Tabular path
399
-
400
- | # | Item | Status | Notes |
401
- |---|---|---|---|
402
- | 29 | Pandas compiler (`query/compiler/pandas.py`) | `[x]` | PR3-TAB — `CompiledPandas` dataclass; all 12 filter ops; all 6 aggs; group_by via `pd.concat` of Series; alias-aware order_by; `_like_to_regex` (`%`→`.*`, `_`→`.`); pure module-level helpers. (`polars` for large files still deferred — see Planned dependencies.) |
403
- | 30 | Tabular executor (`query/executor/tabular.py`) | `[x]` | PR3-TAB — `fetch_blob` injectable for tests; blob path: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`; `asyncio.to_thread`; 10k row hard cap; errors → `QueryResult.error`. Dispatcher routes to it by `source_type`. |
404
- | 31 | Parquet upload/download wrapper | `[x]` | Moved `knowledge/parquet_service.py` → `storage/parquet.py`. Updated 4 import sites: `pipeline/document_pipeline.py`, `knowledge/processing_service.py`, `query/executor/tabular.py`, `query/executors/tabular.py`. |
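
Row 29 above mentions a `_like_to_regex` helper that maps `%` to `.*` and `_` to `.`. The following is a minimal sketch of that mapping, not the repository's actual implementation: the function names `like_to_regex` and `like_match`, the escaping of other characters, and the use of `re.fullmatch` are all assumptions made for illustration.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regular expression (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # assumption: everything else is literal
    return "".join(parts)


def like_match(value: str, pattern: str) -> bool:
    """Return True when the whole value matches the LIKE pattern."""
    # fullmatch anchors both ends; DOTALL lets wildcards span newlines.
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None
```

Escaping the literal characters matters here: without it, a pattern such as `1.5` would also match `125`, because `.` is itself a regex wildcard.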
405
-
406
- ### Agents + chat
407
-
408
- | # | Item | Status | Notes |
409
- |---|---|---|---|
410
- | 32 | Chatbot agent + prompt (`agents/chatbot.py`, `config/prompts/chatbot_system.md`) | `[x]` | PR7-bundle — `ChatbotAgent` (was `AnswerAgent`) streams tokens, accepts `QueryResult` or list[`DocumentChunk`] or neither. **Cleanup PR**: renamed `answer_agent.py` → `chatbot.py`, `AnswerAgent` → `ChatbotAgent`; Phase 1 `agents/chatbot.py` deleted. |
411
- | 33 | Guardrails prompt (`config/prompts/guardrails.md`) | `[x]` | PR7-bundle — appended to `chatbot_system.md` so guardrails take precedence in conflict. |
412
- | — | Chat handler / orchestrator (`agents/chat_handler.py`) | `[x]` | PR4-bundle — top-level Phase 2 orchestrator. Routes by `source_hint`: chat → AnswerAgent direct; structured → CatalogReader + QueryService; unstructured → DocumentRetriever placeholder + AnswerAgent. Yields `intent` / `chunk` / `done` / `error` SSE-style events. Phase 1 chat.py NOT touched — cleanup PR rewires the API to call this. **2026-06-09**: gained the gated `structured→slow` branch (`enable_slow_path=False`) + `enable_tracing` (KM-626/631). |
413
-
414
- ### Tools — slow-path "Tools" component (TAB)
415
-
416
- New scope after the original 42-item table; added as the tool layer landed (KM-608/624–631). All TAB-owned (`src/tools/`), all never-throw.
417
-
418
- | # | Item | Owner | Status | Notes |
419
- |---|---|---|---|---|
420
- | — | Analytics compute fns (`tools/analytics/`) | TAB | `[x]` | KM-608/624/625 — 8 **composite** `analyze_*` fns (descriptive, aggregate, comparison, contribution, profile, correlation, segment, trend) + prompt-style DESCRIPTIONs. Pure pandas, no I/O. JSON-safe outputs (numpy/Decimal/Timestamp normalized — KM-624 + commit 4bb7623). |
421
- | — | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 — canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
422
- | — | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 — `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
423
- | — | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 — `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
424
- | — | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 — `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped — IR pre-built by the agent Planner). **Superseded by KM-642/643** — renamed `data_retrieve`/`knowledge_retrieve` and `list_sources`+`describe_source` merged into `data_check` + new `knowledge_check`; see row below. |
425
- | — | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
426
- | — | Data/knowledge tool taxonomy (`tools/data_access.py`) | TAB | `[x]` | KM-642/643 (commits c38c0c2, 4bd5f1e) — renamed `query_structured`→`data_retrieve`, `retrieve_documents`→`knowledge_retrieve`; merged `list_sources`+`describe_source` → parameterized `data_check` (no arg = list structured sources; `source_id` = that source's schema) + new `knowledge_check` (unstructured/documents). Split mirrors the catalog's structured/unstructured slices. Planner stub/prompt/validator/few-shots synced; `DATA_ACCESS_TOOLS` guard kept in lockstep. Note: dated log entries above (e.g. the 2026-06-09 E2E) keep the old names as historical record. |
427
-
428
- ### API surface
429
-
430
- | # | Item | Owner | Status | Notes |
431
- |---|---|---|---|---|
432
- | 34 | DB client endpoints (`api/v1/db_client.py`) | DB | `[x]` | **Cleanup PR** — `/ingest` now calls only `on_db_registered`. Phase 1 `db_pipeline_service.run()` + `decrypt_credentials_dict` removed. Error from catalog build now raises HTTP 500 (was silent log). Response simplified to `{"status": "success", "client_id": ...}`. |
433
- | 35 | Document/tabular upload endpoints (`api/v1/document.py`) | TAB | `[x]` | Rewired `/document/process` — after processing CSV/XLSX, calls `on_tabular_uploaded(document_id, user_id)`. Catalog ingestion failure is logged but does not fail the request. **2026-05-11** — CSV/XLSX no longer ingested to vector store (`knowledge_processor` skipped for tabular types in `document_pipeline.py`); they go to catalog only. |
434
- | 36 | Chat stream endpoint (`api/v1/chat.py`) | B | `[x]` | Rewired `/chat/stream` — replaced `query_executor.execute()` (Phase 1) with `CatalogReader + QueryService` (Phase 2). **Cleanup PR**: fully rewired to `ChatHandler.handle()`. Inline intent routing, retrieval, and answer generation removed. Redis cache, fast intent, history loading, and message persistence remain in chat.py. Sources event emits `[]` (retrieval not yet exposed by ChatHandler). |
435
- | 37 | Room / users endpoints (`api/v1/room.py`, `api/v1/users.py`) | B | `[ ]` | No catalog work; only touch if auth flow changes |
436
- | — | Data catalog index endpoint (`api/v1/data_catalog.py`) | DB | `[x]` | **KM-557** — `GET /api/v1/data-catalog/{user_id}` → `list[CatalogIndexEntry]`. **Cleanup PR** — added `POST /api/v1/data-catalog/rebuild?user_id=` → calls `on_catalog_rebuild_requested`; per-source errors logged but don't fail the request. |
437
-
438
- ### Tests + eval
439
-
440
- | # | Item | Owner | Status | Notes |
441
- |---|---|---|---|---|
442
- | 38 | DB compiler golden tests (`tests/query/compiler/test_sql.py`) | DB | `[x]` | PR3-DB — 36 tests across all whitelisted ops, identifier quoting, agg / count_distinct / count(*), order_by alias resolution, parameter sequencing, error paths. Pure-Python, no LLM, no DB. |
443
- | 39 | Pandas compiler golden tests (`tests/unit/query/compiler/test_pandas_compiler.py`) | TAB | `[x]` | PR3-TAB — 43 tests: all 12 filter ops, all 6 aggs, group_by, order_by, limit, aliases, empty DataFrame, error paths. `test_tabular_executor.py` adds 12 more (blob name resolution + happy path + error paths). |
444
- | 40 | IR validator tests (`tests/query/ir/test_validator.py`) | B | `[x]` | PR1 — 19 tests, all rules covered |
445
- | — | PII detector tests (`tests/catalog/test_pii_detector.py`) | B | `[x]` | PR1 — 26 tests (parametrized) |
446
- | — | Catalog validator tests (`tests/catalog/test_validator.py`) | B | `[x]` | PR1 — 5 tests |
447
- | — | Catalog render tests (`tests/catalog/test_render.py`) | B | `[x]` | **KM-557** — 5 tests (renamed from `test_enricher.py`; LLM enrichment tests dropped, render-only tests kept). |
448
- | — | Catalog store integration test (`tests/catalog/test_store.py`) | DB | `[x]` | PR1 — module-level skip without `RUN_INTEGRATION_TESTS=1` |
449
- | — | DB introspector test | DB | `[ ]` | Deferred to PR2 — needs Postgres testcontainer or fixture infra |
450
- | — | Tabular introspector test | TAB | `[x]` | PR1-tab — 31 unit tests (CSV/XLSX/Parquet, stats, PII, error paths). No DB/blob I/O — mocks injected via constructor. |
451
- | 41 | Planner eval (`tests/query/planner/`) | B | `[x]` | PR6-scaffold — `test_golden_questions.py` with 3 DB-targeting cases. TAB added `test_golden_tabular.py` with 4 tabular cases (group_by+sum, top-N+limit, date range filter, XLSX sheet selection). All 4 passed against real Azure OpenAI. Fix shipped alongside: `query/planner/service.py` replaced `("system", text)` tuple with `SystemMessage` — without this, `{...}` in `query_planner.md` was parsed as f-string variables and crashed on every real invocation. |
452
- | 42 | E2E smoke tests (`tests/e2e/`) | B | `[ ]` | Defer until Phase 2 endpoints are wired (cleanup PR). Component-level orchestration is already covered by `test_chat_handler.py` + `test_service.py`. |
453
- | — | Golden IR fixtures (`tests/fixtures/golden_irs.json`) | B | `[~]` | PR1 seeded with 5 DB-targeting examples; TAB extends in PR1-tab |
454
- | — | Shared `sample_catalog` fixture (`tests/conftest.py`) | B | `[x]` | PR1 — DB-shaped; TAB may add tabular sibling |
455
-
456
- ---
457
-
458
- ## What just shipped (2026-05-12 — Cleanup PR)
459
-
460
- **Phase 1 removal + Phase 2 API rewiring:**
461
- - `src/api/v1/chat.py` — fully rewired to `ChatHandler.handle()`. Removed inline IntentRouter, retrieval, and ChatbotAgent calls. Redis cache, fast intent, load_history, save_messages stay in chat.py.
462
- - `src/api/v1/db_client.py` — `/ingest` now calls only `on_db_registered`. Phase 1 `db_pipeline_service.run()` block removed. Catalog build failure now raises HTTP 500.
463
- - `src/api/v1/data_catalog.py` — added `POST /api/v1/data-catalog/rebuild` endpoint.
464
- - `src/pipeline/triggers.py` — `on_catalog_rebuild_requested` implemented: iterates catalog sources, re-runs the appropriate trigger per source type, per-source errors logged.
465
-
466
- **Dead modules deleted:**
467
- - `src/agents/chatbot.py` (Phase 1 LangChain chatbot)
468
- - `src/pipeline/orchestrator.py` (empty stub)
469
- - `src/query/base.py` (old duplicate of `executor/base.py`)
470
- - `src/api/v1/knowledge.py` (fake `/knowledge/rebuild` endpoint)
471
- - `src/config/agents/` (folder — prompts only used by deleted Phase 1 chatbot)
472
-
473
- **Renames:**
474
- - `src/agents/answer_agent.py` → `src/agents/chatbot.py`; `AnswerAgent` → `ChatbotAgent`; updated all import sites (`chat_handler.py`, `chat.py`)
475
-
476
- **Fixes + improvements:**
477
- - `src/agents/chat_handler.py` — `_get_document_retriever()` now returns `RetrievalRouter` (Redis-cached) instead of `DocumentRetriever` directly; retrieval-level cache restored.
478
- - `src/retrieval/router.py` — removed dead `db: AsyncSession` and `source_hint` parameters + `_UNSTRUCTURED_HINTS` constant from `retrieve()`. Cache key simplified.
479
- - `src/knowledge/processing_service.py` — removed dead `_build_csv_documents`, `_build_excel_documents`, `_profile_dataframe`, `_to_sheet_document` methods + `pandas` and `upload_parquet` imports.
480
- - `src/catalog/models.py` — added `top_values: list[Any] | None` to `ColumnStats`.
481
- - `src/catalog/introspect/tabular.py` — `_to_column` now populates `top_values` for columns with ≤10 distinct values; useful for query planner WHERE clause generation.
482
- - `main.py` — replaced deprecated `@app.on_event("startup")` with `lifespan` context manager; removed `knowledge_router`.
483
-
484
- ---
485
-
486
- ## What just shipped (KM-557 — DB owner)
487
-
488
- After lead review of the catalog ingestion cost: dropped LLM enrichment,
489
- renamed the storage table, and exposed a lightweight index endpoint for
490
- the upcoming catalog refresher.
491
-
492
- **Files deleted**:
493
- - `src/catalog/enricher.py` — entire CatalogEnricher + EnrichmentResponse + apply_descriptions removed
494
- - `src/config/prompts/catalog_enricher.md` — dead prompt
495
- - `tests/catalog/test_enricher.py` — replaced by `test_render.py`
496
-
497
- **Files added**:
498
- - `src/catalog/render.py` — new home for `render_source` (the only piece of the old enricher still needed; consumed by `query/planner/prompt.py`)
499
- - `src/api/v1/data_catalog.py` — `GET /api/v1/data-catalog/{user_id}` returns `list[CatalogIndexEntry]`
500
- - `tests/catalog/test_render.py` — 5 tests (same coverage as the old render block)
501
-
502
- **Files modified**:
503
- - `src/db/postgres/models.py` — `__tablename__ = "data_catalog"` (was `"catalogs"`). Class name unchanged
504
- - `src/pipeline/structured_pipeline.py` — `StructuredPipeline(validator, store)` (was `(enricher, validator, store)`); pipeline is now `introspect → merge → validate → upsert`; `default_structured_pipeline()` no longer constructs an enricher
505
- - `src/pipeline/triggers.py` — docstrings updated; `on_catalog_rebuild_requested` docstring rewritten for the refresher use case
506
- - `src/query/planner/prompt.py` — import now `from ...catalog.render import render_source`
507
- - `src/catalog/introspect/{base,database,tabular}.py` — docstring scrubs (no behavior changes)
508
- - `src/models/api/catalog.py` — added `CatalogIndexEntry`; simplified `CatalogRebuildResponse` to `sources_rebuilt`
509
- - `main.py` — registered `data_catalog_router`
510
- - `src/security/README.md` — one stale wording fix
511
-
512
- **No migration**: the `data_catalog` table is created from scratch on first `init_db()`. The old `catalogs` table was never deployed against production data, so no rename SQL is needed.
513
-
514
- **Tests**: all 4 `test_structured_pipeline.py` tests reworked to construct `StructuredPipeline(validator=, store=)` without `enricher`. 5 `test_render.py` tests cover render_source standalone.
515
-
516
- **Lint**: `ruff check` clean on modified Phase 2 paths.
517
-
518
- **Open follow-ups left for the lead**:
519
- - `on_catalog_rebuild_requested` body — the refresher will iterate the index endpoint and call this trigger per source
520
- - `api/v1/db_client.py` `/ingest` still doesn't call `on_db_registered` — same blocker as before, untouched by KM-557
521
-
522
- ---
523
-
524
- ## What just shipped (2026-05-11 — retrieval migration + bug fixes)
525
-
526
- **Files implemented / migrated**:
527
- - `src/retrieval/base.py` — `RetrievalResult` dataclass + `BaseRetriever` ABC (was in `src/rag/base.py`)
528
- - `src/retrieval/document.py` — full `DocumentRetriever` migrated from `src/rag/retrievers/document.py`; all retrieval methods (MMR/cosine/euclidean/inner_product/manhattan). Tabular file types filtered out from results.
529
- - `src/retrieval/router.py` — `RetrievalRouter` (Redis-cached, unstructured-only). `invalidate_cache(user_id)` clears all `retrieval:{user_id}:*` keys.
530
-
531
- **Deleted** (no longer used):
532
- - `src/rag/` — entire folder (base.py, retriever.py, router.py, retrievers/)
533
- - `src/tools/` — entire folder (search.py was the only real file; only called by deleted rag/ router)
534
-
535
- **Bug fixes**:
536
- - `src/pipeline/document_pipeline.py` — `retrieval_router.invalidate_cache(user_id)` called after `process()` and `delete()`. Redis failure is caught and logged (does not fail the document op).
537
- - `src/pipeline/document_pipeline.py` — CSV/XLSX now skips `knowledge_processor` (vector store). Tabular files go to catalog only; no duplicate embeddings.
538
- - `src/pipeline/triggers.py` — `on_document_uploaded` implemented (was `raise NotImplementedError`).
539
- - `src/agents/chat_handler.py` — `_normalize_chunks` now handles `RetrievalResult` objects. Previously they were silently dropped, causing empty context for unstructured queries through ChatHandler.
540
-
541
- **Import updates** (all changed from `src.rag.*` → `src.retrieval.*`):
542
- - `src/api/v1/chat.py`, `src/query/base.py`, `src/query/query_executor.py`, `src/query/executors/db_executor.py`, `src/query/executors/tabular.py`
543
-
544
- ---
545
-
546
- ## What shipped previously (PR2b/4/5/6/7-bundle — DB owner solo, teammate reviews)
547
-
548
- **Files implemented**:
549
- - `src/agents/orchestration.py` — `OrchestratorAgent.classify(message, history) → IntentRouterDecision`. Pydantic model for structured output. History-aware query rewriting. Phase 1 filename + class name preserved; body fully rewritten for Phase 2.
550
- - `src/agents/answer_agent.py` — `AnswerAgent.astream(...)` streams answer tokens; accepts `QueryResult` and/or `list[DocumentChunk]`. Renames to `chatbot.py` in cleanup PR.
551
- - `src/agents/chat_handler.py` — `ChatHandler.handle(message, user_id, history)` returns `AsyncIterator[dict]` of `intent` / `chunk` / `done` / `error` SSE events. All deps injectable; lazy default builders.
552
- - `src/query/planner/prompt.py` — `render_catalog(catalog)` + `build_planner_prompt(question, catalog, previous_error)`. Reuses `catalog.enricher.render_source` for consistency across LLM call sites.
553
- - `src/query/planner/service.py` — `QueryPlannerService.plan(question, catalog, previous_error)` Azure OpenAI structured output → `QueryIR`.
554
- - `src/query/executor/dispatcher.py` — `ExecutorDispatcher.pick(ir) → BaseExecutor` by `source.source_type`. Lazy executor imports + per-source-type cache.
555
- - `src/query/service.py` — `QueryService.run(user_id, question, catalog) → QueryResult`. Plan→validate→retry-on-failure (max 3)→dispatch→execute. Catches NotImplementedError from TabularExecutor placeholder gracefully.
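
The plan, validate, retry, dispatch, execute flow described for `QueryService.run` can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the service's real code: `run_query`, `SketchResult`, and the injected `plan` / `validate` / `execute` callables are hypothetical names, validation failures are assumed to surface as `ValueError`, and "max 3" is read here as three total attempts.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "max 3" means three attempts in total


@dataclass
class SketchResult:
    """Hypothetical stand-in for QueryResult: rows on success, error on failure."""
    rows: list = field(default_factory=list)
    error: Optional[str] = None


async def run_query(
    question: str,
    plan: Callable[[str, Optional[str]], Awaitable[Any]],
    validate: Callable[[Any], None],
    execute: Callable[[Any], Awaitable[list]],
) -> SketchResult:
    """Plan, validate, and execute; failures become an error field, never an exception."""
    previous_error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        try:
            # The planner is re-prompted with the prior validation error, if any.
            ir = await plan(question, previous_error)
            validate(ir)
        except ValueError as exc:
            previous_error = str(exc)
            continue
        except Exception as exc:  # planner failure of any other kind
            return SketchResult(error=f"planning failed: {exc}")
        try:
            return SketchResult(rows=await execute(ir))
        except Exception as exc:  # includes NotImplementedError from a placeholder executor
            return SketchResult(error=f"execution failed: {exc}")
    return SketchResult(
        error=f"validation failed after {MAX_ATTEMPTS} attempts: {previous_error}"
    )
```

Returning an error-carrying result instead of raising keeps the caller's streaming loop simple: it can always emit either rows or an `error` event without its own exception handling.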
556
-
557
- **Prompts written** (filled in placeholders):
558
- - `src/config/prompts/intent_router.md`
559
- - `src/config/prompts/query_planner.md`
560
- - `src/config/prompts/chatbot_system.md`
561
- - `src/config/prompts/guardrails.md`
562
-
563
- **Tests added** (46 new — total now 146 + 2 skipped):
564
- - `tests/agents/test_intent_router.py` (4)
565
- - `tests/agents/test_answer_agent.py` (12)
566
- - `tests/agents/test_chat_handler.py` (6)
567
- - `tests/query/planner/test_prompt.py` (7)
568
- - `tests/query/planner/test_service.py` (3)
569
- - `tests/query/executor/test_dispatcher.py` (5)
570
- - `tests/query/test_service.py` (8)
571
- - `tests/query/planner/test_golden_questions.py` (3 — skipped by default; eval harness scaffold)
572
-
573
- **Lint**: `ruff check` clean on all Phase 2 paths. Phase 1 files have pre-existing E501/S608 issues — out of scope for this PR.
574
-
575
- **Placeholders / blockers for teammate** (status as of DB owner's commit, before merge):
576
- - `src/query/executor/tabular.py` (TAB) — DB owner's note: "still raises NotImplementedError". **Post-merge**: TAB shipped this in PR3-TAB; dispatcher now routes to the real `TabularExecutor`. The `NotImplementedError` catch in `QueryService` stays as a safety net.
577
- - `src/retrieval/document.py` — **implemented** (2026-05-11). Full `DocumentRetriever` migrated from `src/rag/retrievers/document.py`; supports MMR/cosine/euclidean/manhattan/inner_product. `_normalize_chunks` in `chat_handler.py` now handles `RetrievalResult` → `DocumentChunk` conversion correctly.
578
- - `src/api/v1/chat.py` (Phase 1) — NOT touched. Cleanup PR rewires the SSE endpoint to call `ChatHandler.handle(...)`.
579
- - `src/api/v1/db_client.py` (Phase 1) — NOT touched. Cleanup PR rewires `/database-clients/{id}/ingest` to call `pipeline.triggers.on_db_registered`.
580
-
581
- ---
582
-
583
- ## What shipped previously (PR3-TAB — TAB owner)
584
-
585
- **Files implemented**:
586
- - `src/query/compiler/pandas.py` — `PandasCompiler` + `CompiledPandas(apply, output_columns)` dataclass. Pure helper functions (easier to test in isolation): `_apply_filters` (all 12 ops, `_like_to_regex` for LIKE), `_apply_select` (column pick + rename), `_apply_agg` (scalar + group_by via `pd.concat` of Series → `reset_index`), `_apply_orderby` (alias-aware via `_resolve_order_col`). Closure captures all IR fields explicitly so `apply(df)` is self-contained.
587
- - `src/query/executor/tabular.py` — `TabularExecutor` with injectable `fetch_blob` (same testability pattern as `TabularIntrospector`). Resolves Parquet blob path from `az_blob://{uid}/{did}` + table: single-table → `{uid}/{did}.parquet`, multi-table → `{uid}/{did}__{table.name}.parquet`. Runs compile → download → `asyncio.to_thread(_load_and_apply)` → 10k hard cap. Never raises; errors populate `QueryResult.error`. Uses `compiled.output_columns` for column labels (safe on empty DataFrame).
588
-
589
- **Tests added** (55 new — total suite was 86 all passing at PR3-TAB time):
590
- - `tests/unit/query/compiler/test_pandas_compiler.py` — 43 tests across all 12 filter ops (including `is_null`, `not_in`, `like`, `between`), all 6 agg fns, group_by, order_by asc/desc, limit-after-order, alias round-trip, empty DataFrame, error paths.
591
- - `tests/unit/query/executor/test_tabular_executor.py` — 12 tests: `_resolve_blob_name` (single/multi-table, bad prefix), happy-path `QueryResult` shape (columns, rows, backend, truncated, source_id), wrong source_type → error, blob fetch failure → error, unknown source → error.
592
-
593
- **Lint**: `ruff check` clean on both files.
594
-
595
- ---
596
-
597
- ## What shipped previously (PR1-tab — TAB owner)
598
-
599
- **Files implemented**:
600
- - `src/catalog/introspect/tabular.py` — `TabularIntrospector` reads original blob (CSV/XLSX/Parquet), profiles each column (dtype, stats, sample values), runs PIIDetector. For XLSX: one `Table` per sheet (`Table.name = sheet_name`); for CSV/Parquet: one `Table` (`Table.name = filename stem`). `fetch_doc`/`fetch_blob` are constructor-injectable for unit tests — no `Settings` or DB required at import time.
601
- - `src/pipeline/triggers.py` — `on_tabular_uploaded` wired (mirrors `on_db_registered` pattern).
602
-
603
- **Tests added** (31 new):
604
- - `tests/unit/catalog/test_introspect_tabular.py` — CSV / XLSX / Parquet shapes, per-column stats, nullable detection, PII name + value matching, sample capping, all error paths. Pure Python, no network I/O.
605
-
606
- **Executor contract note**: introspector downloads the *original* blob for schema reading. The tabular executor (PR3-TAB) downloads *Parquet* blobs for query execution. For CSV/Parquet sources (single table), the executor must call `parquet_blob_name(uid, did, sheet_name=None)`; for XLSX (multi-table), `parquet_blob_name(uid, did, table.name)`.
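
The naming rule in the contract note (single-table sources at `{uid}/{did}.parquet`, multi-table sources at `{uid}/{did}__{table.name}.parquet`) can be illustrated with a small resolver. This is a sketch only: `resolve_parquet_blob_name` is a hypothetical name, it is not the real `parquet_blob_name` or `_resolve_blob_name`, and treating `table_name=None` as the single-table case is an assumption.

```python
from typing import Optional

_PREFIX = "az_blob://"  # URI scheme quoted in the PR3-TAB notes


def resolve_parquet_blob_name(location_ref: str, table_name: Optional[str] = None) -> str:
    """Map an az_blob://{uid}/{did} reference to the Parquet blob name (hypothetical)."""
    if not location_ref.startswith(_PREFIX):
        raise ValueError(f"unsupported location_ref: {location_ref!r}")
    uid, sep, did = location_ref[len(_PREFIX):].partition("/")
    if not sep or not uid or not did:
        raise ValueError(f"expected az_blob://<uid>/<did>, got {location_ref!r}")
    if table_name is None:
        # Single-table source (CSV / Parquet): one blob per document.
        return f"{uid}/{did}.parquet"
    # Multi-table source (XLSX): one blob per sheet, keyed by table name.
    return f"{uid}/{did}__{table_name}.parquet"
```

In the real executor the bad-prefix case is reported through `QueryResult.error` rather than raised; the `ValueError` here only keeps the sketch short.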
607
-
608
- ---
609
-
610
- ## What shipped previously (PR3-DB — DB owner)
611
-
612
- **Files implemented**:
613
- - `src/query/compiler/sql.py` — `SqlCompiler` for Postgres dialect; `CompiledSql(sql, params)` dataclass with `params: dict[str, Any]` (changed from `list`); supports all 12 whitelisted filter ops, all 6 aggs, alias-aware order_by; `_qident` escapes embedded double-quotes
614
- - `src/query/executor/db.py` — `DbExecutor` with sqlglot SELECT-only guard, Postgres session-level read-only + 30s `statement_timeout`, `asyncio.wait_for` backstop, 10k row hard cap; rejects non-`schema` source_type and `dbclient://` URI mismatch; never raises (populates `QueryResult.error`)
615
-
616
- **Files extended**:
617
- - `src/query/compiler/pandas.py` — fixed pre-existing UP035 (Callable import)
618
- - `pyproject.toml` — added `S608` to `tests/**` ruff ignore (false positive: tests assert literal SQL strings)
619
-
620
- **Tests added** (36 new, all passing — total now 100):
621
- - `tests/query/compiler/test_sql.py` — every filter op, every agg, count(*), count_distinct, order_by alias vs column, multi-filter AND, identifier quoting escape, error paths
622
-
623
- **Lint**: `ruff check` clean on Phase 2 paths.
624
-
625
- **Hand-off note for teammate**: `CompiledSql.params` is now `dict[str, Any]` not `list`. The pandas compiler will follow the same convention (or document its own) — coordinate when PR3-TAB lands.
626
-
627
- ---
628
-
629
- ## What shipped previously (PR2a — DB owner)
630
-
631
- **Files implemented**:
632
- - `src/catalog/enricher.py` — Azure OpenAI GPT-4o + structured output (`EnrichmentResponse`), `render_source` (reusable by planner prompt later), `apply_descriptions` merger, injectable `structured_chain` for tests
633
- - `src/pipeline/structured_pipeline.py` — `StructuredPipeline` orchestrator + `default_structured_pipeline()` factory with lazy production-dep imports
634
- - `src/pipeline/triggers.py` — `on_db_registered` wired; tabular/document/rebuild stubs preserved with implementation notes
635
-
636
- **Files extended**:
637
- - `src/catalog/models.py` — added `ForeignKey` model, `Table.foreign_keys: list[ForeignKey] = []`
638
- - `src/catalog/introspect/database.py` — `_extract_foreign_keys` populates `Table.foreign_keys` from extractor data
639
- - `src/config/prompts/catalog_enricher.md` — full system prompt with style rules and one few-shot example
640
-
641
- **Tests added** (14 new, all passing — total now 64):
642
- - `tests/catalog/test_enricher.py` — render / apply / end-to-end with fake chain (10 tests)
643
- - `tests/pipeline/test_structured_pipeline.py` — orchestration with stub deps (4 tests)
644
-
645
- **Lint**: `ruff check` clean on all Phase 2 paths. Phase 1 files (`pipeline/db_pipeline/`, `pipeline/document_pipeline/`) have pre-existing ruff issues — out of scope for this PR.
646
-
647
- ---
648
-
649
- ## What shipped previously (PR1 — DB owner's first chunk)
650
-
651
- **Files implemented** (was `NotImplementedError`):
652
- - `src/catalog/pii_detector.py`, `src/catalog/validator.py`, `src/catalog/store.py`, `src/catalog/reader.py`
653
- - `src/catalog/introspect/database.py` (FK extraction added in PR2a)
654
- - `src/query/ir/validator.py`
655
-
656
- **Files extended**:
657
- - `src/query/ir/operators.py` — `TYPE_COMPATIBILITY` matrix
658
- - `src/catalog/models.py` — `location_ref` URI-scheme docstring
659
- - `src/db/postgres/models.py` — `Catalog` SQLAlchemy table; `init_db.py` imports it
660
-
661
- **Tests**: 50 unit tests + 1 integration (gated on `RUN_INTEGRATION_TESTS=1`).
662
-
663
- **Reused Phase 1 utilities** (cleanup deferred):
664
- - `src/database_client/database_client_service.py:get`
665
- - `src/utils/db_credential_encryption.py:decrypt_credentials_dict`
666
- - `src/pipeline/db_pipeline/db_pipeline_service.py:engine_scope`
667
- - `src/pipeline/db_pipeline/extractor.py:get_schema/profile_column/get_row_count`
668
-
669
- ---
670
-
671
- ## Open contract items (not yet locked)
672
-
673
- - **Joins in IR** — currently single-table only (ARCHITECTURE.md §7); DB owner accepted the constraint for v1, will revisit in PR3 if it's blocking real queries
674
- - **`updated_at` on Source vs `generated_at` on Catalog** — Pydantic models have both; introspector sets per-Source; CatalogStore preserves both
675
- - **Catalog refresh trigger** (open question §3) — default policy is rebuild-on-upload-or-connect; auto-refresh deferred
676
- - **Unstructured catalog entries** (open question §2) — currently empty filter for `source_hint="unstructured"`; revisit when adding doc descriptions
677
- - **PII handling for `sample_values`** (open question §5) — currently nulls them out (skip); mask/synthesize deferred
678
- - **Dialect priority for SQL compiler** — PR3 will land Postgres first, MySQL second; BigQuery/Snowflake/SQL Server later
679
-
680
- ---
681
-
682
- ## How to update this file
683
-
684
- When a PR lands:
685
- 1. Flip status from `[ ]` or `[~]` to `[x]`
686
- 2. Add a short note (file paths, scope cuts, surprises)
687
- 3. Bump "Last updated" at the top
688
- 4. If a new contract decision lands, move it from "Open contract items" to the relevant inline note
689
-
690
- When opening a PR:
691
- 1. Flip status to `[~]` and add yourself as the active owner in the PR row
692
- 2. Don't promise items in the PR description that aren't in the table
 
PROJECT_BRD.md ADDED
@@ -0,0 +1,150 @@
1
+ # Data Eyond — Python Agentic Service: Business Requirements & Design (BRD)
2
+
3
+ **Status:** draft for review · **Date:** 2026-06-26 · **Branch:** `pr/4`
4
+ **Audience:** Harry (Go gateway) + leads/stakeholders.
5
+ **Scope:** the Python **agentic LLM service** (`Agentic-Service-Data-Eyond-Catalog`) only — its
6
+ requirements, capabilities, architecture, data, and integration contract.
7
+ **Companions (source of truth, not duplicated here):** [REPO_STATUS.md](REPO_STATUS.md) (current
8
+ built state) · [API_ENDPOINTS.md](API_ENDPOINTS.md) (FE-callable API) · [DEV_PLAN.md](DEV_PLAN.md)
9
+ (in-flight plan). This BRD synthesizes those into a stakeholder-facing document; convert to PDF/Word
10
+ for distribution.
11
+
12
+ ---
13
+
14
+ ## 1. Purpose & scope
15
+ Data Eyond is an **"AI data scientist"** for business analytics, modelled on **CRISP-DM** (Business
16
+ Understanding → Data Understanding → Preparation → Modeling → Evaluation → Deployment). A user sets a
17
+ goal, connects data (databases or files), asks natural-language analytical questions, and receives
18
+ CRISP-DM-structured answers that can be exported as a versioned **report**. The aim is a *"junior data
19
+ scientist that hands back a decision-ready deliverable,"* not a *"chatbot over a database."*
20
+
21
+ This document covers the **Python service** — the agentic reasoning layer. It does **not** specify the
22
+ Go gateway or the React frontend except at their integration boundaries (§9, §11).
23
+
24
+ ## 2. Business context & objectives
25
+ - **Target users:** executives doing self-serve deep-dives; analysts offloading routine work.
26
+ - **Value:** turn a business question + connected data into auditable, CRISP-DM-structured findings
27
+ and a formal report, without the user writing SQL or code.
28
+ - **Objectives:** (a) accurate, grounded analysis over the user's own data; (b) a decision-ready,
29
+ versioned report artifact; (c) safe, read-only access to user data; (d) a clean service contract the
30
+ Go gateway can integrate against.
31
+
32
+ ## 3. Stakeholders & actors
33
+ | Actor | Role |
34
+ |---|---|
35
+ | End user (exec/analyst) | Defines the analysis goal, asks questions, generates reports (via the FE) |
36
+ | Frontend (React/Vite) | Talks to Go for everything; to Python only for chat streaming |
37
+ | Go gateway (`Orchestrator-Agent-Service`) | Auth/JWT, rooms, documents, DB-credential storage, catalog ingestion, **all DB migrations**, and now all analysis-state writes |
38
+ | Python agentic service (this repo) | Router, skills, slow analytical path, structured query engine, RAG, report generation |
39
+ | Harry | Owns the Go gateway + dedorch DB migrations |
40
+
41
+ ## 4. Solution overview
42
+ Request flow is **FE → Go → Python**; the FE calls Python directly only for chat streaming. The Python
43
+ service is a **FastAPI** app that classifies each user message and dispatches it to the right
44
+ capability, streaming results back over SSE. Heavy analysis runs through a deterministic **slow path**
45
+ (plan → execute → assemble) whose structured output is persisted and later rendered into reports.
46
+
47
+ ## 5. Functional requirements (capabilities)
48
+ | ID | Capability | Description |
49
+ |---|---|---|
50
+ | FR-1 | **Intent routing** | One GPT-4o call classifies each message into one of 5 intents — `chat`, `help`, `check`, `unstructured_flow`, `structured_flow` — with history-aware query rewriting (EN/ID). |
51
+ | FR-2 | **Help skill** | State-aware, next-step guidance (LLM, streamed); only offers actions the current state allows (e.g., a report only when one is generatable). |
52
+ | FR-3 | **Check skill** | No-LLM inventory of available structured data + uploaded documents. |
53
+ | FR-4 | **Structured analysis (slow path)** | Planner → TaskRunner → Assembler: a static DAG of tool-call chains, degrade-and-continue execution, narrative authored by one LLM call; produces a structured run record. |
54
+ | FR-5 | **Structured query engine** | Catalog-driven JSON IR → deterministic SQL/pandas compiler → read-only executor, with single-level FK joins (DB sources). |
55
+ | FR-6 | **Unstructured RAG** | Retrieval over PGVector document chunks, answered by the chatbot. |
56
+ | FR-7 | **Analytics tools** | Composite `analyze_*` (descriptive, aggregate, correlation, trend) over data-access tools (`check_*`, `retrieve_*`). |
57
+ | FR-8 | **Report generation** | Deterministic assembly of findings/EDA/limitations/method from persisted run records + one LLM call for the executive summary; **versioned**, formal markdown. |
58
+ | FR-9 | **Analysis sessions** | One session = one analysis = one chat room (`analysis_id == room_id`); per-analysis data-source binding. |
59
+
60
+ **Goal capture (post-2026-06-24 pivot):** the analysis goal is **two user-entered fields** —
61
+ `objective` + `business_questions` — captured at onboarding, **both mandatory, no agent validation**.
62
+ The former agent-validated "problem statement" + its gate are removed.
63
+
64
+ ## 6. Analysis & report lifecycle
65
+ 1. **Create analysis** (via Go) — session row + chat room + chosen data-source bindings; goal =
66
+ `objective` + `business_questions`.
67
+ 2. **Ask questions** — `POST /chat/stream`; the router dispatches; `structured_flow` questions run the
68
+ slow path and **persist one `report_inputs` row per run** (the report's source of truth).
69
+ 3. **Generate report** — the report skill reads the session's `report_inputs`, assembles the structured
70
+ sections + an executive summary, and persists an immutable **versioned** report (markdown).
71
+ 4. **Read reports** — list versions / fetch a version.
72
+
73
+ > Reports are **records-based** (never from chat history) and require the slow path to have run
74
+ > (`enable_slow_path=true`) so records exist.
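The lifecycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InMemoryReportStore`, `generate_report`, the `summarize` callable (standing in for the single executive-summary LLM call) and the `question`/`finding` record keys are all hypothetical names, and the real store uses advisory-locked versioning in Postgres rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InMemoryReportStore:
    """Hypothetical stand-in for the report store; versions are append-only."""

    _by_analysis: dict = field(default_factory=dict)

    def save(self, analysis_id: str, markdown: str) -> int:
        versions = self._by_analysis.setdefault(analysis_id, [])
        versions.append(markdown)  # earlier versions are never rewritten
        return len(versions)  # 1-based version number

    def list_versions(self, analysis_id: str) -> list:
        return list(range(1, len(self._by_analysis.get(analysis_id, [])) + 1))

    def get(self, analysis_id: str, version: int) -> str:
        return self._by_analysis[analysis_id][version - 1]


def generate_report(
    analysis_id: str,
    report_inputs: list,
    summarize: Callable[[list], str],
    store: InMemoryReportStore,
) -> int:
    """Assemble markdown from run records only (never chat history) and version it."""
    if not report_inputs:
        # No slow-path run has persisted a record yet, so nothing can be reported.
        raise ValueError("no report_inputs rows; the slow path has not run")
    findings = "\n".join(
        f"- **{row['question']}** {row['finding']}" for row in report_inputs
    )
    markdown = (
        f"## Executive summary\n{summarize(report_inputs)}\n\n"
        f"## Findings\n{findings}\n"
    )
    return store.save(analysis_id, markdown)
```

Calling `generate_report` twice for the same analysis yields versions 1 and 2, and both remain readable through `get`; concurrency control is deliberately left out of the sketch.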
75
+
76
+ ## 7. System architecture (subsystems)
77
+ FastAPI + async SQLAlchemy + LangChain (Azure GPT-4o) + Redis + Azure Blob + PGVector. Key subsystems
78
+ (detail in REPO_STATUS §9):
79
+ - **Router** (`agents/orchestration.py`) — 5-intent classifier.
80
+ - **Skills** (`agents/handlers/`) — `help` (LLM), `check` (no-LLM).
81
+ - **Slow path** (`agents/slow_path/` + `agents/planner/`) — Planner, TaskRunner, Assembler.
82
+ - **Structured query engine** (`query/`) — IR validate → compile → read-only execute (never raises).
83
+ - **Report** (`agents/report/`) — generator, store (advisory-locked versioning), readiness floor.
84
+ - **Observability** — Langfuse tracing (PII-masked); Redis caching; pooled DB engines.
85
+
86
+ ## 8. Data model
87
+ SQLAlchemy models in `src/db/postgres/models.py` (detail in REPO_STATUS §8). The service is moving to
88
+ the shared **dedorch** DB (Go owns migrations; Python is consumer-only — §11).
89
+
90
+ | Table | Purpose | Owner |
91
+ |---|---|---|
92
+ | `users` | accounts (incl. `fullname` for report authorship) | Go |
93
+ | `analyses` *(plural)* | per-analysis session state: `objective`/`business_questions` (pivot), `user_id`, `status`, `data_bind`(+version), `report_collection`, `report_id` | Go (dedorch) |
94
+ | `analyses_messages` | the analysis chat room (user Q + agent A) — replaces deprecated `chat_messages`/`rooms` | Go (dedorch) |
95
+ | `report_inputs` | one jsonb row per slow-path run — the report's source of truth (was `analysis_records`) | **Python** (schema handed to Go) |
96
+ | `reports` | versioned report artifacts (markdown) | Go (dedorch) |
97
+ | `data_sources` | per-analysis source bindings | Go (dedorch) |
98
+ | `documents`, `databases`, `data_catalog` | uploads, DB credentials (Fernet), per-user catalog | Go ingestion |
99
+ | `langchain_pg_embedding` | PGVector document chunks | Go ingestion |
100
+
101
+ ## 9. API surface (FE-callable)
102
+ Full contract + request/response examples in [API_ENDPOINTS.md](API_ENDPOINTS.md). The FE-callable
103
+ surface is **4 things**:
104
+ 1. **`call_agent`** — `POST /api/v1/chat/stream` (SSE).
105
+ 2. **`list_skills`** — `GET /api/v1/tools` (slash-command catalog; cacheable).
106
+ 3. **skill: `help`** — via `call_agent` (router intent; no dedicated endpoint).
107
+ 4. **skill: `report`** — `POST /api/v1/report` + `GET` list/version.
108
+
109
+ `analysis_id == room_id`. Auth is terminated at Go; Python trusts `user_id`/`room_id`.
110
+
111
+ ## 10. Non-functional requirements
112
+ | Area | Requirement / mechanism |
113
+ |---|---|
114
+ | **Security — data access** | All structured queries are read-only: IR validation + SQL compiler whitelist + sqlglot SELECT-only guard + read-only session + LIMIT/timeout. DB credentials are Fernet-encrypted with an owner check. |
115
+ | **Security — PII** | PII columns carry no sample values into prompts; Langfuse masks PII on assembler/chatbot spans. |
116
+ | **Reliability** | Never-throw seams across tools/query/executors/state/report — failures degrade to soft output rather than crashing a turn. |
117
+ | **Performance** | Redis response cache (stateless `chat` only) + retrieval cache; pooled DB engines + speculative prewarm; warm Azure clients per process. |
118
+ | **Observability** | Langfuse: one trace per request (router/planner/assembler/chatbot + tool spans), tokens + latency. |
119
+ | **Portability** | Runs on HuggingFace Spaces (Linux) and Windows (`run.py` sets the selector event-loop policy for psycopg3 async). |
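As a rough illustration of the "never-throw seam" named in the Reliability row, one common way to build such a seam is a decorator that converts exceptions into a soft result. The names `SoftResult` and `never_throw` are hypothetical and not taken from the service; the actual seams may be written differently.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SoftResult:
    """Hypothetical soft output: either a value or a readable error, never an exception."""

    ok: bool
    value: Any = None
    error: Optional[str] = None


def never_throw(fn: Callable[..., Any]) -> Callable[..., SoftResult]:
    """Wrap fn so that a failure degrades to SoftResult(ok=False) instead of raising."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> SoftResult:
        try:
            return SoftResult(ok=True, value=fn(*args, **kwargs))
        except Exception as exc:  # deliberately broad: this is the seam boundary
            return SoftResult(ok=False, error=f"{type(exc).__name__}: {exc}")

    return wrapper
```

A caller then checks `result.ok` and continues the turn with whatever degraded output is available, which is the degrade-and-continue behaviour the table describes.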
120
+
121
+ ## 11. Integrations & dependencies
122
+ - **Two-repo boundary:** Python is edited independently; Go + FE are reference-only. Python reads/writes
123
+ shared Postgres, reads Azure Blob (Parquet for tabular sources), uses Redis.
124
+ - **dedorch migration:** Python is moving from the `dataeyond` DB to **dedorch**. **Go owns all
125
+ migrations; Python is consumer-only** — if Python needs a table, it hands Go the schema. Table names
126
+ are **plural** (`analyses`, `analyses_messages`); `rooms`/`chat_messages` are deprecated there.
127
+ - **State writes via Go:** all analysis-state writes move behind Go; Python's per-turn state access
128
+ becomes a read-only get (in progress).
129
+ - **External services:** Azure OpenAI (GPT-4o + embeddings), Azure Blob, Postgres (+ PGVector), Redis,
130
+ Langfuse.
131
+
132
+ ## 12. Constraints & assumptions
133
+ - The slow path must be enabled (`enable_slow_path=true`) for reports to have content.
134
+ - `report_inputs` is Python-owned but its schema is provided to Go so the dedorch migration creates it
135
+ (so it survives the `SKIP_INIT_DB` cutover).
136
+ - Charts and images are **out of scope for now** — reports are markdown (tables/bold/italic/separators);
137
+ charts (Plotly JSON) and images (table + bucket) are deferred.
138
+ - The frontend has no dedicated UI designer; UI is being researched in parallel.
139
+
140
+ ## 13. Open items & roadmap
141
+ Tracked in [DEV_PLAN.md](DEV_PLAN.md) §4. Headlines: finish Go-side state ownership (#7/#18), the
142
+ dedorch `analyses` migration (#3, mostly done), HF deploy + playground test (#13), chat-path migration
143
+ to `analyses_messages` (#25), and the deferred charts/images/UI work (#26/#27/#28).
144
+
145
+ ## 14. Glossary
146
+ - **Slow path** — the deterministic Planner→TaskRunner→Assembler analytical pipeline.
147
+ - **`report_inputs`** — the jsonb table of slow-path run records the report reads (formerly `analysis_records`).
148
+ - **dedorch** — the shared Postgres DB the service is migrating to; Go owns its migrations.
149
+ - **CRISP-DM** — the cross-industry standard data-mining process the analysis is structured around.
150
+ - **`analysis_id == room_id`** — one analysis session is one chat room, identified by the same id.
REPO_CONTEXT.md DELETED
@@ -1,494 +0,0 @@
1
- # Repo Context — Agentic Service Data Eyond Catalog
2
-
3
- Orientation file for future Claude Code sessions. Cross-reference `ARCHITECTURE.md` for the full design rationale and decision log.
4
-
5
- ---
6
-
7
- ## Product vision — Data Eyond, your AI data scientist
8
-
9
- Data Eyond is positioned as an *AI data scientist* that supports business analytics. It is built around the **CRISP-DM** framework (Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment) — the agent works through data problems the way a real analyst would, not as a one-shot Q&A bot.
10
-
11
- **Target users:**
12
- - **Executives** — deep-dive into their own data and extract insight to drive business decisions without needing a data team in the loop.
13
- - **Data analysts / scientists** — offload routine analysis so they can focus on heavier work.
14
-
15
- **Envisioned user flow:**
16
- 1. **Discovery interview** — a short conversation with a Data Eyond *interview agent* that draws out goal, business context, and what the user is actually trying to learn (CRISP-DM Business Understanding).
17
- 2. **Connect data** — DB connection or file upload (DB, CSV, XLSX, Parquet, documents).
18
- 3. **Ask Data Eyond** — natural-language analytical question.
19
- 4. **CRISP-DM-structured analytical response** — exportable as a **presentation deliverable** or a **notebook-style report**.
20
-
21
- North star: less "chatbot over a database", more "junior data scientist that hands back a polished, decision-ready deliverable."
22
-
23
- The current repo (Phase 2, below) is the *foundation* — IntentRouter → QueryPlanner → Executor → ChatbotAgent gives us a reliable structured-query spine. The next evolution is the agentic layer that turns this into an end-to-end CRISP-DM workflow (see *Roadmap — agentic evolution* further down).
24
-
25
- ---
26
-
27
- ## TL;DR
28
-
29
- FastAPI multi-agent backend for data analysis. Users upload documents and register databases / tabular files; they ask natural-language questions and get answers grounded in their data, streamed via SSE.
30
-
31
- The architecture has two paths:
32
-
33
- - **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (PGVector).
34
- - **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a **JSON IR** of intent; a deterministic Python compiler turns the IR into SQL or pandas; the executor runs it.
35
-
36
- The LLM produces *intent*, not query syntax. Deterministic code does the rest.
37
-
38
- The Phase 2 end-to-end flow is **wired and runnable** as of 2026-05-12. See *Implementation status* below for the per-file matrix. `PROGRESS.md` is the authoritative line-by-line tracker; this file is the orientation.
39
-
40
- ---
41
-
42
- ## Stack
43
-
44
- - Python 3.12, FastAPI 0.115, uvicorn, sse-starlette
45
- - Async SQLAlchemy 2.0 + asyncpg (Postgres), psycopg3 (PGVector multi-statement workaround)
46
- - LangChain 0.3 + langchain-postgres (PGVector) + langchain-openai (Azure OpenAI GPT-4o + embeddings)
47
- - LangGraph 0.2 + langgraph-checkpoint-postgres
48
- - Redis 5 (response + retrieval cache)
49
- - Azure Blob Storage (uploads + Parquet)
50
- - pandas, pyarrow, polars-ready (deferred), sqlglot, pydantic v2, structlog, slowapi, langfuse
51
- - presidio-analyzer + spaCy `en_core_web_lg` (PII), pytesseract + pdf2image (PDF OCR)
52
- - DB connectors: psycopg2, pymysql, pymssql, sqlalchemy-bigquery, snowflake-sqlalchemy
53
-
54
- Run: `uv run --no-sync uvicorn main:app --host 0.0.0.0 --port 7860`. On Windows use `uv run --no-sync python run.py` (sets `WindowsSelectorEventLoopPolicy` for psycopg3 async).
55
-
56
- ---
57
-
58
- ## Top-level layout
59
-
60
- ```
61
- main.py — FastAPI app + middleware + router wiring + init_db() on startup
62
- run.py — Windows-safe local entry point
63
- ARCHITECTURE.md — design intent (source of truth for shape + invariants)
64
- README.md
65
- Dockerfile — python:3.12-slim, installs spaCy en_core_web_lg, tesseract, poppler
66
- pyproject.toml / uv.lock
67
- scripts/ — backfill scripts (build_initial_catalogs, enrich_all_sources)
68
- src/ — all application code
69
- ```
70
-
71
- ---
72
-
73
- ## src/ map
74
-
75
- ### Core data shapes (only files with real content)
76
-
77
- | Path | Role |
78
- |---|---|
79
- | `catalog/models.py` | Pydantic: `Catalog → Source[] → Table[] → Column[]` |
80
- | `query/ir/models.py` | `QueryIR` (select / filters / group_by / order_by / limit) |
81
- | `query/ir/operators.py` | `ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP=10000` |
82
- | `security/pii_patterns.py` | name patterns + email/phone regex for PII detection |
83
-
84
- ### Catalog — identity layer for structured sources (Cs ∪ Ct)
85
-
86
- | Path | Role |
87
- |---|---|
88
- | `catalog/introspect/base.py` | `BaseIntrospector.introspect(location_ref) -> Source` |
89
- | `catalog/introspect/database.py` | `information_schema` + ~100 row sample → draft Source |
90
- | `catalog/introspect/tabular.py` | Parquet/CSV/XLSX header reader + sample (one Table per sheet for XLSX) |
91
- | `catalog/render.py` | renders a `Source` as the canonical text block consumed by the planner (KM-557; LLM enrichment removed — planner reads stats + samples directly) |
92
- | `catalog/validator.py` | invariants beyond Pydantic shape (unique IDs, FK refs) |
93
- | `catalog/store.py` | persist as Postgres `jsonb` row keyed by user_id (`get/upsert/delete`) — table `data_catalog` |
94
- | `catalog/reader.py` | load + filter catalog by source_hint (returns full catalog for ≤50 tables) |
95
- | `catalog/pii_detector.py` | flag PII columns at ingestion → suppresses `sample_values` |
96
-
97
- ### Query — catalog-driven structured path
98
-
99
- | Path | Role |
100
- |---|---|
101
- | `query/service.py` | `QueryService.run(user_id, question, catalog) -> QueryResult` (top-level) |
102
- | `query/planner/service.py` | LLM call: question + catalog → QueryIR (structured output) |
103
- | `query/planner/prompt.py` | renders catalog into the planner prompt |
104
- | `query/ir/validator.py` | catalog-aware IR validation: column_ids exist, ops whitelisted, value_type matches data_type, limit ≤ cap |
105
- | `query/compiler/base.py` | `BaseCompiler.compile(ir) -> object` |
106
- | `query/compiler/sql.py` | IR → `(sql, params)`; identifiers from catalog, values parameterized |
107
- | `query/compiler/pandas.py` | IR → callable that runs against a DataFrame |
108
- | `query/executor/base.py` | `BaseExecutor.run(ir) -> QueryResult` (uniform across backends) |
109
- | `query/executor/db.py` | runs compiled SQL via asyncpg/pymysql in read-only txn (sqlglot second-line defence) |
110
- | `query/executor/tabular.py` | runs pandas/polars chain on a Parquet file (eager pandas → pyarrow pushdown → polars lazy by file size) |
111
- | `query/executor/dispatcher.py` | picks DB vs Tabular executor based on `source.source_type` of the IR's source |
112
-
113
- ### Retrieval — unstructured path (Cu)
114
-
115
- | Path | Role |
116
- |---|---|
117
- | `retrieval/document.py` | `DocumentRetriever` over PGVector chunks |
118
- | `retrieval/router.py` | dispatches the `unstructured` route (the `chat` and `structured` routes do not pass through here) |
119
-
120
- ### Agents — the three LLM call sites
121
-
122
- | Path | Role |
123
- |---|---|
124
- | `agents/orchestration.py` | `OrchestratorAgent` — classifies message → `needs_search`, `source_hint ∈ {chat, unstructured, structured}`, `rewritten_query`. Filename + class name kept from Phase 1; body replaced with Phase 2 logic. Output model is `IntentRouterDecision` |
125
- | `agents/chatbot.py` | `ChatbotAgent` — final answer formation (receives Cu chunks or QueryResult); SSE-streamed via `astream` |
126
- | `agents/chat_handler.py` | `ChatHandler` — top-level orchestrator; routes to chat / unstructured / structured and yields SSE-style `intent`/`chunk`/`done`/`error` events |
127
-
128
- (`QueryPlanner` is the third LLM call site, under `query/planner/`. The
129
- fourth — `CatalogEnricher` — was removed in KM-557; ingestion no longer
130
- makes any LLM calls.)
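For readers unfamiliar with the SSE events that `ChatHandler` yields, the sketch below shows how one such event could be serialized as a standard Server-Sent Events frame. The four event names come from the table above; the function name `sse_frame` and the payload shapes are assumptions, not the service's actual wire format.

```python
import json

# Event names listed for ChatHandler; payload shapes below are illustrative only.
EVENTS = {"intent", "chunk", "done", "error"}


def sse_frame(event: str, data: dict) -> str:
    """Serialize one event as an SSE frame: an event line, a data line, a blank line."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    # json.dumps emits no raw newlines by default, so the payload fits one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

A streamed answer would then be a sequence of frames such as one `intent`, several `chunk` frames, and a final `done`.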
131
-
132
- ### Pipelines — ingestion coordinators
133
-
134
- | Path | Role |
135
- |---|---|
136
- | `pipeline/structured_pipeline.py` | DB / tabular: introspect → merge → validate → store (no enrich step since KM-557) |
137
- | `pipeline/document_pipeline.py` | unstructured: extract → chunk → embed → PGVector. CSV/XLSX skip vector store (catalog only). Invalidates retrieval cache on process/delete. |
138
- | `pipeline/triggers.py` | event entry points called by API routes: `on_db_registered`, `on_tabular_uploaded`, `on_document_uploaded`, `on_catalog_rebuild_requested` |
139
-
140
- (`pipeline/orchestrator.py` was deleted in the Cleanup PR — it was a redundant stub; `StructuredPipeline` already takes the introspector at `run()` time.)
141
-
142
- ### Security — cross-cutting
143
-
144
- | Path | Role |
145
- |---|---|
146
- | `security/auth.py` | bcrypt password hash/verify, JWT encode/decode, get_user |
147
- | `security/credentials.py` | Fernet encrypt/decrypt for stored DB credentials |
148
- | `security/pii_patterns.py` | (already listed) |
149
-
150
- ### API + infra + config
151
-
152
- | Path | Role |
153
- |---|---|
154
- | `api/v1/*.py` | FastAPI routers — thin endpoints delegating to `pipeline/triggers` and `query/service` |
155
- | `models/api/{catalog,chat,document}.py` | request/response Pydantic models |
156
- | `db/postgres/connection.py` | two async engines: `engine` (app) and `_pgvector_engine` (PGVector) |
157
- | `db/postgres/init_db.py` | startup: creates `vector` extension, all tables, HNSW + GIN indexes |
158
- | `db/postgres/models.py` | SQLAlchemy app tables (users, rooms, chat messages, …) |
159
- | `db/postgres/vector_store.py` | shared PGVector instance (collection `documents` — written by Go ingestion service) |
160
- | `db/redis/connection.py` | async Redis client |
161
- | `storage/az_blob/az_blob.py` | Azure Blob async wrapper (uploads + Parquet) |
162
- | `middlewares/{cors,logging,rate_limit}.py` | CORS allow-all (POC), structlog JSON, slowapi |
163
- | `observability/langfuse/langfuse.py` | trace helper |
164
- | `config/settings.py` | pydantic-settings; `.env` uses double-underscore aliases |
165
- | `config/env_constant.py` | env file path constant |
166
- | `config/prompts/*.md` | prompt templates: `intent_router`, `query_planner`, `chatbot_system`, `guardrails` (KM-557 removed `catalog_enricher`) |
167
-
168
- ---
169
-
170
- ## Core architectural decisions
171
-
172
- 1. **Catalog as primary context, not retrieval.** For ≤50 tables (typical), the entire catalog is rendered into the planner prompt verbatim (~3–5k tokens). No vector search, no BM25, no top-k for structured data. Catalog-level retrieval (BM25 + table-level vectors with RRF) is the *deferred* upgrade for users with hundreds of tables.
173
-
174
- 2. **JSON IR over raw SQL.** The planner LLM emits a Pydantic-validated intent, never a SQL string. The compiler is deterministic Python. Benefits: validatable before execution, dialect-portable (one IR → SQL of any dialect / pandas / polars), cheaper tokens, trivially testable without an LLM, and the LLM literally cannot emit invalid SQL syntax.
175
-
176
- 3. **Deterministic compiler, not LLM SQL writer.** All actual query construction happens in pure code. Compiler bugs are reproducible and fixable. Same IR → same query.
177
-
178
- 4. **Pipeline stage isolation.** Each stage (`IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`) is its own module with typed input and typed output. No god classes.
179
-
180
- 5. **Minimal LLM surface.** Only three LLM call sites in the system (KM-557 dropped `CatalogEnricher` — ingestion is now LLM-free; the planner reads stats + sample rows + column names directly):
181
- - `IntentRouter` — once per user message
182
- - `QueryPlanner` — once per structured query
183
- - `ChatbotAgent` — once per answer (formatting)
184
-
185
- 6. **Three-way routing**: `chat` / `unstructured` / `structured`. The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees Cs ∪ Ct in one prompt. **DB vs tabular is not a routing concern** — it's a per-source attribute (`source_type`) that only matters at execution time.
186
-
187
- 7. **Stable IDs.** `source_id`, `table_id`, `column_id` are stable internal references. Renaming a column in the source DB does not invalidate cached IRs.
188
-
189
- 8. **PII suppression at the boundary.** Columns flagged with `pii_flag=true` have `sample_values: null` — real PII never enters LLM prompts. Auto-detected at ingestion via name patterns + value regex (`security/pii_patterns.py`). When in doubt, flag — false positives cost nothing; false negatives leak data.
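A minimal sketch of decision 8, assuming simplified patterns: the regexes and the helper names `is_pii` and `column_for_prompt` are hypothetical and much cruder than whatever `security/pii_patterns.py` actually contains. The point is only the shape of the rule: flag on name or value, then null out samples before anything reaches a prompt.

```python
import re

# Illustrative patterns only; the real module's patterns are not shown in this document.
NAME_PATTERN = re.compile(r"(^|_)(email|phone|ssn|full_?name)($|_)", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[\d\s\-()]{7,}$")


def is_pii(column_name: str, samples: list) -> bool:
    """Flag a column if its name or any sampled string value looks like PII."""
    if NAME_PATTERN.search(column_name):
        return True
    return any(
        isinstance(value, str) and (EMAIL_RE.match(value) or PHONE_RE.match(value))
        for value in samples
    )


def column_for_prompt(column_name: str, samples: list) -> dict:
    """Build the prompt-facing column entry; flagged columns carry no sample values."""
    flagged = is_pii(column_name, samples)
    return {
        "name": column_name,
        "pii_flag": flagged,
        "sample_values": None if flagged else list(samples),
    }
```

The loose phone pattern will also flag some date-like strings; that is consistent with the stated preference for false positives over false negatives.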
190
-
191
- ---
192
-
193
- ## End-to-end flows
194
-
195
- ### Ingestion (when user uploads a file or connects a DB)
196
-
197
- ```
198
- source upload / DB connect
199
-
200
- ├── unstructured (pdf/docx/txt)
201
- │ → DocumentPipeline: extract → chunk → embed → PGVector
202
-
203
- └── structured (DB schema or tabular file)
204
- → introspect (information_schema or file headers + sample rows)
205
- → CatalogValidator (Pydantic + unique-IDs + FK refs)
206
- → CatalogStore.upsert (one jsonb row per user_id in `data_catalog`)
207
- ```
208
-
209
- ### Query (per user message)
210
-
211
- ```
212
- user message
213
-
214
- → Redis cache check (24h TTL) ── miss ─→ continue
215
-
216
- → IntentRouter LLM → needs_search? source_hint?
217
-
218
- ├── chat → ChatbotAgent → SSE stream
219
- ├── unstructured → DocumentRetriever (Cu) → ChatbotAgent → SSE stream
220
- └── structured →
221
- CatalogReader.read(user_id, "structured") # full Cs ∪ Ct
222
-
223
- QueryPlanner LLM(question, catalog) → QueryIR
224
-
225
- IRValidator.validate(ir, catalog)
226
- (source_id ∈ catalog, table_id ∈ source, column_ids ∈ table,
227
- ops/aggs whitelisted, value_type matches data_type, limit ≤ 10000)
228
- fail → re-prompt planner with error context (max 3 retries)
229
-
230
- ExecutorDispatcher.pick(ir) # by source.source_type
231
- ├─ DbExecutor → SqlCompiler → sqlglot guard → asyncpg/pymysql
232
- │ (read-only txn, 30s timeout)
233
- └─ TabularExecutor → PandasCompiler → eager pandas (≤100 MB)
234
- or pyarrow pushdown (100 MB–1 GB)
235
- or polars lazy scan (>1 GB)
236
-
237
- QueryResult
238
-
239
- ChatbotAgent → SSE stream
240
- ```
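The plan, validate, and re-prompt step in the flow above can be sketched as a small loop. This is an illustration under assumptions: `plan_with_retry`, `plan` and `validate` are hypothetical names, and "max 3" is read here as three attempts in total, which the source does not state explicitly.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "max 3" means three planner calls in total


def plan_with_retry(
    question: str,
    plan: Callable[[str, Optional[str]], dict],
    validate: Callable[[dict], Optional[str]],
) -> dict:
    """Call the planner, validate the IR, and re-prompt with the error on failure.

    plan(question, previous_error) returns an IR dict;
    validate(ir) returns None when the IR is valid, else an error message.
    """
    previous_error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        ir = plan(question, previous_error)
        previous_error = validate(ir)
        if previous_error is None:
            return ir  # valid IR goes on to the dispatcher and executor
    raise ValueError(
        f"planner failed after {MAX_ATTEMPTS} attempts: {previous_error}"
    )
```

Passing the validator's message back into the next planner call is what lets the LLM correct a specific mistake (for example, an unknown `column_id`) rather than guess again blindly.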
241
-
242
- ---
243
-
244
- ## Catalog schema (per-user `jsonb` row)
245
-
246
- ```
247
- Catalog
248
- ├── user_id, schema_version, generated_at
249
- └── sources[]
250
- └── Source { source_id, source_type, name, description, location_ref, updated_at }
251
- └── tables[]
252
- └── Table { table_id, name, description, row_count, foreign_keys[] }
253
- ├── columns[]
254
- │ └── Column { column_id, name, data_type, description,
255
- │ nullable, pii_flag, sample_values[]|null, stats|null }
256
- └── foreign_keys[]
257
- └── ForeignKey { column_id, target_table_id, target_column_id }
258
- ```
259
-
260
- `source_type ∈ {schema, tabular, unstructured}`.
261
- `data_type ∈ {int, decimal, string, datetime, date, bool, json}`.
262
- `ForeignKey` references are within the SAME `Source` only; cross-source FKs are not modeled.
263
-
264
- Deferred Column fields (add when justified): `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`.
265
-
266
- ---
267
-
268
- ## JSON IR schema
269
-
270
- ```jsonc
271
- {
272
- "ir_version": "1.0",
273
- "source_id": "...",
274
- "table_id": "...",
275
- "select": [
276
- {"kind": "column", "column_id": "...", "alias": "..."},
277
- {"kind": "agg", "fn": "count|count_distinct|sum|avg|min|max",
278
- "column_id": "...?", "alias": "..."}
279
- ],
280
- "filters": [
281
- {"column_id": "...",
282
- "op": "= | != | < | <= | > | >= | in | not_in | is_null | is_not_null | like | between",
283
- "value": ...,
284
- "value_type": "int|decimal|string|datetime|date|bool"}
285
- ],
286
- "group_by": ["column_id", ...],
287
- "order_by": [{"column_id": "...", "dir": "asc|desc"}],
288
- "limit": 100
289
- }
290
- ```
291
-
292
- Single-table only in v1. `having`, `offset`, boolean filter trees, `distinct`, joins, and window functions are deferred until user demand proves the limitation.
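To make the "LLM produces intent, deterministic code produces the query" idea concrete, here is a hedged sketch of compiling a subset of the IR above into parameterized SQL. It is not the repository's `SqlCompiler`: the function `compile_ir`, the `columns` lookup (column_id to physical name), the `$n` placeholder style, and the default limit of 100 are assumptions, and `order_by` plus the non-comparison operators are omitted for brevity. The 10000 cap matches `LIMIT_HARD_CAP`.

```python
COMPARISON_OPS = {"=": "=", "!=": "<>", "<": "<", "<=": "<=", ">": ">", ">=": ">="}
AGG_FNS = {"count", "sum", "avg", "min", "max"}
LIMIT_HARD_CAP = 10000


def quote(identifier: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def compile_ir(ir: dict, table_name: str, columns: dict) -> tuple:
    """Compile a single-table IR into (sql, params); values are never inlined."""
    params = []
    select_parts = []
    for item in ir["select"]:
        if item["kind"] == "column":
            expr = quote(columns[item["column_id"]])
        else:
            fn = item["fn"]
            if fn not in AGG_FNS:
                raise ValueError(f"aggregate not allowed: {fn}")
            arg = quote(columns[item["column_id"]]) if item.get("column_id") else "*"
            expr = f"{fn.upper()}({arg})"
        alias = item.get("alias")
        select_parts.append(f"{expr} AS {quote(alias)}" if alias else expr)

    sql = f"SELECT {', '.join(select_parts)} FROM {quote(table_name)}"

    where = []
    for flt in ir.get("filters", []):
        op = COMPARISON_OPS.get(flt["op"])
        if op is None:
            raise ValueError(f"operator not supported in this sketch: {flt['op']}")
        params.append(flt["value"])  # bound as a parameter, referenced by position
        where.append(f"{quote(columns[flt['column_id']])} {op} ${len(params)}")
    if where:
        sql += " WHERE " + " AND ".join(where)

    if ir.get("group_by"):
        sql += " GROUP BY " + ", ".join(quote(columns[c]) for c in ir["group_by"])

    sql += f" LIMIT {min(int(ir.get('limit', 100)), LIMIT_HARD_CAP)}"
    return sql, params
```

Because identifiers come only from the catalog lookup and values only travel as parameters, the same IR always compiles to the same query, which is what makes the compiler testable without an LLM.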
293
-
294
- ---
295
-
296
- ## Implementation status
297
-
298
- **As of 2026-05-12 — Phase 2 end-to-end flow is wired.** `PROGRESS.md` has the per-PR line-item table; this section is the high-level snapshot. Stub files (`raise NotImplementedError`) are now the exception, not the rule.
299
-
300
- | Area | Status | Notes |
301
- |---|---|---|
302
- | Catalog Pydantic models | ✅ | `catalog/models.py` — incl. `ForeignKey`, `ColumnStats.top_values` |
303
- | JSON IR Pydantic models | ✅ | `query/ir/models.py` + `operators.py` (TYPE_COMPATIBILITY filled) |
304
- | Catalog ingestion — DB | ✅ | introspect → validate → upsert. `on_db_registered` wired; `/api/v1/db-clients/{id}/ingest` calls it |
305
- | Catalog ingestion — tabular | ✅ | CSV/XLSX/Parquet; `on_tabular_uploaded` wired into `/api/v1/document/process`. XLSX → one Table per sheet. CSV/XLSX skip vector store |
306
- | Catalog ingestion — unstructured | ✅ | `on_document_uploaded` implemented; full DocumentPipeline (extract → chunk → embed → PGVector) |
307
- | Catalog store / reader / validator / PII detector | ✅ | `data_catalog` jsonb table (renamed from `catalogs` in KM-557) |
308
- | LLM enrichment | ❌ removed (KM-557) | Cost cut — planner reads `column.stats` + `sample_values` + `top_values` + `column.name` directly. `catalog/render.py` keeps the source-rendering helper |
309
- | `IntentRouter` (lives as `OrchestratorAgent` in `agents/orchestration.py`) | ✅ | 3-way `source_hint`, history-aware query rewriting. Filename + class name kept from Phase 1; Phase 2 body |
310
- | `CatalogReader` | ✅ | Loads full catalog; filters by `source_hint` |
311
- | `QueryPlanner` LLM call | ✅ | Azure OpenAI structured output → `QueryIR`; supports retry with `previous_error` |
312
- | IR validator | ✅ | Catalog-aware; full rule set; descriptive errors |
313
- | SQL compiler (Postgres) | ✅ | All 12 filter ops, all 6 aggs, alias-aware order_by, parameterized values, quoted identifiers |
314
- | DbExecutor | ✅ | sqlglot SELECT-only guard, RO txn, `statement_timeout=30000`, 10k row cap, never raises |
315
- | Pandas compiler | ✅ | Same op coverage as SQL; pure module-level helpers |
316
- | TabularExecutor | ✅ | Parquet blob path resolution, `asyncio.to_thread`, 10k cap, never raises |
317
- | ExecutorDispatcher | ✅ | Routes by `source.source_type`; lazy imports + cache |
318
- | QueryService | ✅ | plan → validate → retry-on-fail (max 3) → dispatch → execute → `QueryResult` |
319
- | `ChatbotAgent` + prompt + guardrails | ✅ | Renamed from `AnswerAgent` in Cleanup PR. Guardrails appended to `chatbot_system.md` |
320
- | `ChatHandler` (top-level chat orchestrator) | ✅ | SSE events: `intent` / `chunk` / `done` / `error` |
321
- | `DocumentRetriever` + `RetrievalRouter` (Redis-cached) | ✅ | Migrated from `src/rag/` (now deleted). Mentor commit `61c746f` rewrote to raw SQL (pgvector `<=>` cosine, `<+>` manhattan) to dodge asyncpg type-mapping issues with Go-ingested schema. Methods reduced to `cosine | manhattan`. Collection: `documents`. |
322
- | `/api/v1/chat/stream` | ✅ | Rewired to `ChatHandler`; Redis cache + fast intent + history + message persistence remain in chat.py |
323
- | `/api/v1/db-clients/{id}/ingest` | ✅ | Calls only `on_db_registered`; Phase 1 dual-write removed |
324
- | `/api/v1/document/{upload,process,delete}` | ✅ | `/process` triggers `on_tabular_uploaded` for CSV/XLSX |
325
- | `GET /api/v1/data-catalog/{user_id}` | ✅ | Index endpoint (KM-557) |
326
- | `POST /api/v1/data-catalog/rebuild` | ✅ | Iterates sources, re-runs per-source trigger |
327
- | Credential encryption | ⚠️ stub | `security/credentials.py` not migrated; runtime reuses Phase 1 `utils/db_credential_encryption.py` |
328
- | Tests | ✅ 146+ unit | Compilers (DB 36, Pandas 43), validators, introspectors, agents, chat handler, dispatcher, planner |
329
- | Planner eval harness | 🟡 scaffold | 3 DB + 4 tabular golden cases. Gated on `RUN_PLANNER_EVAL=1`. Real Azure OpenAI passing |
330
- | E2E smoke tests | ❌ not started | Component-level orchestration is covered |
331
- | DB introspector unit test | ❌ deferred | Needs Postgres testcontainer |
332
- | Sources event in `/chat/stream` | ⚠️ emits `[]` | `ChatHandler` doesn't surface retrieval sources yet; same gap reflected in `save_messages` |
333
-
334
- **Deferred to later phases**: joins in IR, schema drift detection, hybrid catalog search (BM25 + RRF for 100+ table users), polars lazy scan for >1GB tabular files, MySQL/BigQuery/Snowflake SQL dialects, mask/synthesize PII strategies.
335
-
336
- ---
337
-
338
- ## Team — division of work
339
-
340
- The service is built by two engineers; many modules are source-type-agnostic and shared.
341
-
342
- - **DB** owns SQL paths: introspection, SQL compiler, DB executor, credential storage.
343
- - **TAB** owns tabular paths: CSV/XLSX/Parquet introspection, pandas compiler, tabular executor, blob/Parquet plumbing.
344
- - **B** = both — shared contracts and source-type-agnostic plumbing. Pair-program or split with explicit hand-off.
345
-
346
- ### Step-by-step ownership
347
-
348
- | # | Step | File / area | Owner | Notes |
349
- |---|---|---|---|---|
350
- | 0 | **Lock contracts before coding** | — | B | See "Decisions to lock" below; block until aligned |
351
- | 1 | Catalog Pydantic models | `catalog/models.py` | B | Already done; only touch if both agree |
352
- | 2 | IR Pydantic models | `query/ir/models.py` | B | Already done; joins/window fns require joint sign-off |
353
- | 3 | IR operator whitelists | `query/ir/operators.py` | B | Already done; both compilers rely on these |
354
- | 4 | PII patterns / regex | `security/pii_patterns.py` | B | Already done; extend together as gaps appear |
355
- | **Ingestion — introspection** | | | | |
356
- | 5 | DB introspector (information_schema, sample, FKs) | `catalog/introspect/database.py` | DB | Use SQLAlchemy `inspect()`; dialect-aware quoting |
357
- | 6 | Tabular introspector (CSV/XLSX/Parquet headers + sample) | `catalog/introspect/tabular.py` | TAB | Each XLSX sheet → one Table |
358
- | 7 | `BaseIntrospector` ABC | `catalog/introspect/base.py` | B | Confirm signature returns the same `Source` shape |
359
- | **Ingestion — shared catalog plumbing** | | | | |
360
- | 8 | ~~Catalog enricher + prompt~~ | — | — | **REMOVED in KM-557.** Cost optimization — planner reads stats + sample rows directly. `catalog/render.py` keeps the source-rendering helper. |
361
- | 9 | Catalog validator | `catalog/validator.py` | B | Type-agnostic |
362
- | 10 | Catalog store (Postgres jsonb) | `catalog/store.py` | B | Recommend DB (Postgres expertise) |
363
- | 11 | Catalog reader | `catalog/reader.py` | B | Type-agnostic |
364
- | 12 | PII detector | `catalog/pii_detector.py` | B | Either; uses `pii_patterns.py` |
365
- | **Ingestion — pipelines** | | | | |
366
- | 13 | Structured pipeline (introspect → enrich → validate → store) | `pipeline/structured_pipeline.py` | B | Pair on this — calls both introspectors via dispatcher |
367
- | 14 | Triggers (`on_db_registered`, `on_tabular_uploaded`) | `pipeline/triggers.py` | B | Each owns their trigger function |
368
- | 15 | Ingestion orchestrator | `pipeline/orchestrator.py` | B | Routes by source_type; pair |
369
- | 16 | Document pipeline (PDF/DOCX/TXT) | `pipeline/document_pipeline.py` | TAB | Tabular-adjacent (file uploads) |
370
- | **Query — shared spine** | | | | |
371
- | 17 | IR validator (catalog-aware) | `query/ir/validator.py` | B | Recommend DB; both must agree on exact error messages so retry-prompt is consistent |
372
- | 18 | Planner LLM service | `query/planner/service.py` | B | Type-agnostic |
373
- | 19 | Planner prompt (catalog → text) | `query/planner/prompt.py`, `config/prompts/query_planner.md` | B | **Pair-program**. Must describe DB tables and tabular files in one consistent format |
374
- | 20 | Intent router (chat/unstructured/structured) | `agents/orchestration.py` (class `OrchestratorAgent` — Phase 1 filename + class name preserved; Phase 2 body), `config/prompts/intent_router.md` | B | Type-agnostic. The prompt file uses `intent_router.md`, but the source module is still `orchestration.py` |
375
- | 21 | Executor base + `QueryResult` | `query/executor/base.py` | B | Lock the shape before either implements an executor |
376
- | 22 | Executor dispatcher | `query/executor/dispatcher.py` | B | Reads `source.source_type` from catalog; pair |
377
- | 23 | Compiler base ABC | `query/compiler/base.py` | B | Already done |
378
- | 24 | Top-level QueryService | `query/service.py` | B | Wires planner → validator → compiler → executor; pair |
379
- | **Query — DB path** | | | | |
380
- | 25 | SQL compiler (IR → SQL + params, per dialect) | `query/compiler/sql.py` | DB | Identifiers from catalog (quoted), values parameterized |
381
- | 26 | DB executor (asyncpg/pymysql, sqlglot guard, RO txn, 30s timeout) | `query/executor/db.py` | DB | |
382
- | 27 | Credential encryption (Fernet) | `security/credentials.py` | DB | Needed for stored user DB creds |
383
- | 28 | User-DB connection management | helper in pipelines | DB | engine_scope context manager pattern |
384
- | **Query — Tabular path** | | | | |
385
- | 29 | Pandas compiler (IR → callable on DataFrame) | `query/compiler/pandas.py` | TAB | Same IR, different backend |
386
- | 30 | Tabular executor (eager pandas first; pyarrow / polars later) | `query/executor/tabular.py` | TAB | Initial scope: eager pandas only |
387
- | 31 | Parquet upload/download + Azure Blob wrapper | `storage/az_blob/az_blob.py` (+ helper) | TAB | XLSX sheet → one Parquet per sheet (deterministic blob name) |
388
- | **Agents + chat** | | | | |
389
- | 32 | Chatbot agent + prompt | `agents/chatbot.py`, `config/prompts/chatbot_system.md` | B | Receives QueryResult or Cu chunks |
390
- | 33 | Guardrails prompt | `config/prompts/guardrails.md` | B | |
391
- | **API surface** | | | | |
392
- | 34 | DB client endpoints (register/ingest/list/delete) | `api/v1/db_client.py` | DB | |
393
- | 35 | Document/tabular upload endpoints | `api/v1/document.py` | TAB | |
394
- | 36 | Chat stream endpoint (SSE) | `api/v1/chat.py` | B | Dispatches both paths; pair |
395
- | 37 | Room / users endpoints | `api/v1/room.py`, `api/v1/users.py` | B | Whoever has bandwidth |
396
- | **Tests + eval** | | | | |
397
- | 38 | DB compiler golden tests (IR → SQL fixtures) | `tests/query/compiler/test_sql.py` | DB | Pure-Python, no LLM |
398
- | 39 | Pandas compiler golden tests (IR → expected DataFrame) | `tests/query/compiler/test_pandas.py` | TAB | Pure-Python, no LLM |
399
- | 40 | IR validator tests (catalog × IR error matrix) | `tests/query/ir/test_validator.py` | B | Each contributes test cases for their source type |
400
- | 41 | Planner eval (golden question → IR examples) | `tests/query/planner/` | B | Each contributes ~10 question→IR examples |
401
- | 42 | E2E smoke tests | `tests/e2e/` | B | Pair |
402
-
403
- ### Decisions to lock before coding
404
-
405
- If made unilaterally these create silent contract drift. Lock them in a 30-min sync first.
406
-
407
- | Decision | Why it matters | Recommended call |
408
- |---|---|---|
409
- | `QueryResult` shape (current scaffold: `source_id, backend, rows, row_count, truncated, elapsed_ms, error`) | Both executors return this; chatbot consumes it | Lock as-is unless either side needs more (e.g. `column_types` for formatting) |
410
- | `Source.location_ref` format (`az_blob://...` vs `dbclient://{id}` etc.) | Dispatcher and executors both parse this | Pick a convention now; document in `catalog/models.py` docstring |
411
- | Where do user DB credentials live? | DB executor needs creds to run queries; Source has `location_ref` but creds are encrypted separately | Recommend: `location_ref="dbclient://{client_id}"`; executor looks up creds by ID |
412
- | How does dispatcher pick the executor? | Routes by `source.source_type` — but where does dispatcher get it (catalog reload, or IR carries it)? | Recommend: dispatcher takes `(Catalog, IR)`, looks up source by `IR.source_id` |
413
- | Joins in v1 IR? | Excluded per ARCHITECTURE.md §7. DB path is most affected — real DB use often needs joins. | Recommend: ship single-table; revisit in PR 2. **DB owner must accept the constraint or push back early** |
414
- | Planner prompt — render tabular vs DB sources uniformly | If described differently, planner gets confused | Pair-program. Render both as `Table: name (n rows) — Columns: ...` regardless of source_type |
415
- | Error contract — raise or return `QueryResult.error`? | Both executors must behave the same so chatbot branches consistently | Recommend: never raise from `executor.run()`; populate `QueryResult.error` |
416
- | PII handling for tabular `sample_values` | DB samples come from `information_schema`; tabular from file reads. Same `pii_flag` rule must apply both sides | Confirm tabular introspector calls `pii_detector` |
417
- | Catalog refresh trigger (open question §3) | Affects both pipelines symmetrically | Default: rebuild on every upload/connect; defer auto-refresh |
418
- | `updated_at` semantics — per-Source vs per-Catalog | Affects how each pipeline writes | Recommend: per-Source `updated_at` + Catalog-level `generated_at` |
419
- | Dialect support scope for v1 | DB compiler must implement at least one dialect well | Recommend: Postgres first (matches app DB); MySQL second |
420
- | Test-fixture format for golden IRs | Both compilers test against golden IR → expected output | Recommend: shared `tests/fixtures/golden_irs.json`; each side adds expected SQL or DataFrame |
421
- | Logging conventions | structlog is already in place; both should log the same fields | Quick agreement: log `source_id`, `table_id`, `ir_version`, `elapsed_ms` |
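
The recommended error contract in the table above ("never raise from `executor.run()`; populate `QueryResult.error`") could look roughly like the sketch below. The field names follow the scaffold listed in the first row. `run_never_raising`, `ROW_CAP`, and the `fetch` callable are hypothetical names used only for illustration, so the real executors may be shaped differently.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

ROW_CAP = 10_000  # assumed to mirror the 10k row cap noted for both executors


@dataclass
class QueryResult:
    # Field names follow the scaffold listed in the decisions table.
    source_id: str
    backend: str
    rows: list[dict[str, Any]] = field(default_factory=list)
    row_count: int = 0
    truncated: bool = False
    elapsed_ms: int = 0
    error: Optional[str] = None


def run_never_raising(
    source_id: str,
    backend: str,
    fetch: Callable[[], list[dict[str, Any]]],
) -> QueryResult:
    """Run `fetch` and report any failure via QueryResult.error instead of raising."""
    started = time.perf_counter()
    try:
        rows = fetch()
        truncated = len(rows) > ROW_CAP
        rows = rows[:ROW_CAP]
        elapsed = int((time.perf_counter() - started) * 1000)
        return QueryResult(source_id, backend, rows, len(rows), truncated, elapsed)
    except Exception as exc:
        elapsed = int((time.perf_counter() - started) * 1000)
        return QueryResult(source_id, backend, elapsed_ms=elapsed,
                           error=f"{type(exc).__name__}: {exc}")
```

With this shape the chatbot only ever branches on `result.error`, whichever executor produced the result.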
422
-
423
- ### Working rhythm (suggested)
424
-
425
- 1. **Day 1** — 30-min sync to lock the decisions table. PR any contract/docstring changes that fall out.
426
- 2. **Week 1** — both build introspectors + agree on the planner prompt format. PR in parallel; review each other's.
427
- 3. **Week 2** — DB builds SQL compiler + DB executor; TAB builds pandas compiler + tabular executor. Both write golden tests against shared IR fixtures.
428
- 4. **Week 3** — pair on dispatcher, QueryService, and chat endpoint integration. End-to-end smoke test.
429
- 5. **Ongoing** — short daily standup, mostly to flag IR-shape questions and catalog-field additions *before* either side implements against an unconfirmed contract.
430
-
431
- Biggest risk: **silent contract drift** — one side adds a `QueryResult` field or assumes a new IR op exists, the other ships without it, and integration breaks at the dispatcher. The §0 lock + shared golden-IR fixtures are what prevent that.
432
-
433
- ### Onboarding to Claude Code
434
-
435
- If you're new to Claude Code, before you start:
436
-
437
- 1. Read `ARCHITECTURE.md` end-to-end (~10 min) — this is the source of truth.
438
- 2. Skim this file (`REPO_CONTEXT.md`) — find your section in the ownership table.
439
- 3. Read your owned files' docstrings — every stub explains its contract.
440
- 4. Open Claude Code in this repo. When you ask Claude to implement a stub:
441
- - Reference the file path + the contract it should follow
442
- - Point it at `ARCHITECTURE.md` section if relevant (e.g. §7 for IR validation)
443
- - Ask it to write the test first (golden IR fixtures), then the implementation
444
- - Always review the diff — don't auto-accept
445
-
446
- Useful slash commands while working: `/review` (PR review), `/security-review` (audit pending changes).
447
-
448
- ---
449
-
450
- ## Conventions & gotchas
451
-
452
- - **Async event loop on Windows**: `run.py` sets `WindowsSelectorEventLoopPolicy` because psycopg3 async needs it. Don't call `uvicorn` directly on Windows.
453
- - **Two Postgres engines**: `engine` (app tables) and `_pgvector_engine` (asyncpg with `prepared_statement_cache_size=0`) — the latter is required because PGVector emits `advisory_lock + CREATE EXTENSION` as a multi-statement string and asyncpg rejects multi-statement prepared queries. `init_db.py` creates the extension explicitly so `PGVector(create_extension=False)` skips that path.
454
- - **Read-only at every layer for user DBs**: IR validation + compiler whitelists + sqlglot SELECT-only check + read-only DB credentials + LIMIT enforcement + 30s timeout. Six layers; no single point of failure.
455
- - **Identifiers vs values**: identifiers (table/column names) come from the catalog and are inlined as quoted identifiers — they were verified at validation time so this is safe. Values from `IR.filters` are *always* parameterized, never inlined as strings.
456
- - **Credential encryption**: Fernet via `dataeyond__db__credential__key` env var; lives in `security/credentials.py`. Sensitive fields = `{"password", "service_account_json"}`.
457
- - **Settings env-var aliases**: `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings` exposes them as `azureai_api_key_4o` via `Field(alias=...)`. Mind both forms when adding settings.
458
- - **Prompts**: `src/config/prompts/*.md` — `intent_router`, `query_planner`, `chatbot_system`, `guardrails` are all written. `chatbot_system` has `guardrails` appended so guardrails take precedence in conflict. `catalog_enricher.md` was deleted in KM-557. `config/agents/` folder deleted in Cleanup PR.
459
- - **Planner prompt parsing gotcha**: `query/planner/service.py` uses `SystemMessage(content=...)` not `("system", text)`. The tuple form causes LangChain to interpret `{...}` in `query_planner.md` as f-string variables and crash on every real invocation. Don't refactor back to tuples.
460
- - **Tests**: 146+ unit tests in place. Run with `uv run pytest`. Planner eval gated on `RUN_PLANNER_EVAL=1`; catalog store integration test gated on `RUN_INTEGRATION_TESTS=1`.
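
The "identifiers vs values" rule in the list above can be sketched as follows. This is a minimal illustration, not the repo's SQL compiler: `quote_ident`, `compile_select`, the three-operator `ALLOWED_OPS` subset, and the `$n` placeholder style are all assumptions made for the example.

```python
from typing import Any

ALLOWED_OPS = {"eq": "=", "gt": ">", "lt": "<"}  # illustrative subset, not the real whitelist


def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def compile_select(table: str, columns: list[str],
                   filters: list[tuple[str, str, Any]]) -> tuple[str, list[Any]]:
    """Return (sql, params); filter values never appear in the SQL text."""
    params: list[Any] = []
    clauses = []
    for column, op, value in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f"operator not whitelisted: {op}")
        params.append(value)
        # The value is referenced by position only ($1, $2, ...).
        clauses.append(f"{quote_ident(column)} {ALLOWED_OPS[op]} ${len(params)}")
    sql = f"SELECT {', '.join(quote_ident(c) for c in columns)} FROM {quote_ident(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

Because values travel separately in `params`, a hostile string in a filter cannot change the statement text. Identifiers are inlined, which is only safe when they were checked against the catalog first.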
461
-
462
- ---
463
-
464
- ## Recommended reading order
465
-
466
- 1. `ARCHITECTURE.md` — design intent (the source of truth)
467
- 2. `src/catalog/models.py` + `src/query/ir/models.py` — the two data shapes everything else moves between
468
- 3. `src/query/ir/operators.py` + `src/security/pii_patterns.py` — the explicit whitelists / patterns
469
- 4. Skim every `__init__.py`-level docstring under `src/catalog/`, `src/query/`, `src/agents/`, `src/pipeline/` — each describes the contract its module enforces
470
- 5. `main.py` + `src/db/postgres/{connection,init_db}.py` — runtime bootstrap
471
- 6. `ARCHITECTURE.md §10` — five open questions that haven't been decided yet
472
-
473
- ---
474
-
475
- ## Open questions
476
-
477
- Status as Phase 2 landed:
478
-
479
- 1. ✅ Catalog storage shape — Postgres `jsonb` row in `data_catalog` table, keyed by `user_id`.
480
- 2. ❌ Unstructured files in catalog — still not modeled; router uses `source_hint` from the LLM instead.
481
- 3. 🟡 Catalog refresh trigger — rebuild-on-upload-or-connect is the default. Explicit endpoint `POST /api/v1/data-catalog/rebuild` exists. Background TTL deferred.
482
- 4. ✅ Joins out of v1 IR — confirmed; single-table only. Revisit when real queries need it.
483
- 5. 🟡 PII `sample_values` — currently nulled out (skip). Mask/synthesize deferred.
484
-
485
- ---
486
-
487
- ## Glossary
488
-
489
- - **Cu** — unstructured context (prose chunks)
490
- - **Cs** — schema context (DB tables/columns from catalog)
491
- - **Ct** — tabular context (file sheets/columns from catalog)
492
- - **IR** — intermediate representation (the JSON query shape)
493
- - **PII** — personally identifiable information
494
- - **ABC** — abstract base class
 
REPO_STATUS.md ADDED
@@ -0,0 +1,306 @@
 
1
+ # Data Eyond — Python Agentic Service: Current Status
2
+
3
+ **Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
4
+ **Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
5
+ **Snapshot date:** 2026-06-25.
6
+
7
+ > This file is grounded in the source, not the older design docs. Where the two
8
+ > disagree, the code wins — see [§11 Doc-vs-code](#11-where-the-older-docs-are-stale).
9
+ > `REPO_CONTEXT.md` / `ARCHITECTURE.md` are the original Phase-2 design docs and are
10
+ > stale on the router, joins, and the analysis/report stack.
11
+
12
+ ---
13
+
14
+ ## 1. The product in one paragraph
15
+
16
+ Data Eyond is an **"AI data scientist"** for business analytics, modelled on **CRISP-DM**
17
+ (Business Understanding → Data Understanding → Preparation → Modeling → Evaluation →
18
+ Deployment). It targets executives doing self-serve deep-dives and analysts offloading
19
+ routine work. A user defines a goal, connects data (DB or files), asks natural-language
20
+ analytical questions, and gets CRISP-DM-structured answers that can be exported as a
21
+ versioned **report**. The aim is "junior data scientist that hands back a decision-ready
22
+ deliverable," not "chatbot over a database."
23
+
24
+ ---
25
+
26
+ ## 2. Three repos, one hard ownership rule
27
+
28
+ Request flow is **FE → Go → Python**. The FE never calls Python directly except for chat
29
+ streaming.
30
+
31
+ | Repo | Role | We edit? |
32
+ |---|---|---|
33
+ | **Python** — `Agentic-Service-Data-Eyond-Catalog` (this repo) | The agentic LLM service: router, gate, skills, slow analytical path, structured query engine, unstructured RAG, report generation, analysis-session state. FastAPI + async SQLAlchemy + LangChain + Azure GPT-4o. | **Yes — the only repo we edit.** |
34
+ | **Go** — `Orchestrator-Agent-Service` | Gateway / data plane: interview agent, auth/JWT, rooms, documents (Azure Blob + CSV/XLSX→Parquet + embeddings), database_clients (Fernet creds), catalog ingestion, **all DB migrations**. | Reference only. |
35
+ | **FE** — `E2E-Frontend-Data-Eyond` | React/Vite SPA. Talks to Go for everything and to Python only for chat streaming. | Reference only. |
36
+
37
+ Shared infra: **Postgres** (app tables + `data_catalog` jsonb + PGVector `langchain_pg_embedding`), **Azure Blob**, and (Python-only) **Redis**.
38
+
39
+ ---
40
+
41
+ ## 3. Tech stack & how to run
42
+
43
+ - Python 3.12, FastAPI, uvicorn, sse-starlette
44
+ - Async SQLAlchemy 2.0 + asyncpg (Postgres); psycopg3 for the PGVector engine
45
+ - LangChain + langchain-openai (Azure OpenAI GPT-4o) + langchain-postgres (PGVector)
46
+ - Redis (response + retrieval cache), Azure Blob (uploads + Parquet)
47
+ - pandas / pyarrow, sqlglot, pydantic v2, structlog, slowapi, langfuse
48
+ - DB connectors: psycopg2, pymysql, pymssql, sqlalchemy-bigquery, snowflake-sqlalchemy
49
+
50
+ Run (Linux/Docker): `uv run --no-sync uvicorn main:app --host 0.0.0.0 --port 7860`
51
+ Run (Windows): `uv run --no-sync python run.py` (sets `WindowsSelectorEventLoopPolicy` for psycopg3 async — don't call uvicorn directly on Windows).
52
+
53
+ Tests live locally and are gitignored. Run with `./.venv/Scripts/python.exe -m pytest`.
54
+
55
+ ---
56
+
57
+ ## 4. Chat request lifecycle
58
+
59
+ Entry: `POST /api/v1/chat/stream` (`src/api/v1/chat.py`) → `ChatHandler.handle(...)`
60
+ (`src/agents/chat_handler.py`). One shared `ChatHandler` per process keeps the Azure clients warm.
61
+
62
+ ```
63
+ POST /chat/stream { user_id, room_id, message }
64
+ │ (analysis_id == room_id — one session = one analysis = one chat room)
65
+ ├─ Redis response-cache check (1h TTL, key chat:{room}:{user}:{message}) ── hit → replay
66
+ ├─ greeting/farewell short-circuit (_fast_intent, EN+ID) ── hit → canned reply
67
+ ├─ load last-10 history
68
+ └─ ChatHandler.handle:
69
+ 1. classify → RouterDecision [1 GPT-4o call]
70
+ 2. ensure analysis-state row (get-or-create, idempotent)
71
+ 3. emit `intent` (internal; gates caching), then dispatch:
72
+ chat → ChatbotAgent → SSE
73
+ help → HelpAgent (state + history + readiness) → SSE
74
+ check → check_data/check_knowledge tool → rendered table [no LLM]
75
+ unstructured_flow → DocumentRetriever (PGVector RAG) → ChatbotAgent → SSE
76
+ structured_flow → CatalogReader → (slow path | QueryService) → SSE
77
+ 4. SSE events: intent (internal), sources, chunk, status, done | error
78
+ ```
79
+
80
+ Only the `chat` intent is cached (stateless). Messages persist on `done`.
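
As a rough sketch of the caching rule above: the key layout and the 1h TTL come from the diagram, while `InMemoryCache` and `maybe_cache` are hypothetical stand-ins so the example runs without Redis. Whether the real key normalises the message text is not stated here, so the sketch uses it verbatim.

```python
import time
from typing import Optional

CACHE_TTL_SECONDS = 3600  # the 1h TTL from the lifecycle diagram


def cache_key(room_id: str, user_id: str, message: str) -> str:
    """Build the response-cache key in the chat:{room}:{user}:{message} layout."""
    return f"chat:{room_id}:{user_id}:{message}"


class InMemoryCache:
    """Hypothetical stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[float, str]] = {}

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[0] <= now:
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: str, ttl: int, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (now + ttl, value)


def maybe_cache(cache: InMemoryCache, intent: str, key: str, reply: str) -> bool:
    """Store the reply only for the stateless `chat` intent."""
    if intent != "chat":
        return False
    cache.set(key, reply, CACHE_TTL_SECONDS)
    return True
```

Gating on the intent keeps state-dependent replies (help, check, both flows) out of the cache, so a replay can never serve an answer that depended on analysis state or data.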
81
+
82
+ > The router emits **5 intents** now. The `problem_statement` skill and the `problem_validated`
83
+ > gate were removed 2026-06-25 (KM-652) — the analysis goal is two user-entered fields
84
+ > (`objective` + `business_questions`) captured at onboarding, with no agent validation.
85
+
86
+ ---
87
+
88
+ ## 5. Report lifecycle
89
+
90
+ The report is a **dedicated API, not a chat route** (`src/api/v1/report.py`):
91
+
92
+ ```
93
+ POST /report?analysis_id&user_id
94
+ ├─ load analysis state; enforce the report FLOOR
95
+ │ (≥1 substantive analyze_* success) → else 409
96
+ ├─ ReportGenerator.generate (src/agents/report/generator.py):
97
+ │ read persisted AnalysisRecords (list_for_analysis)
98
+ │ deterministically assemble findings / caveats / open-questions /
99
+ │ data-source appendix / CRISP-DM method appendix (copied verbatim)
100
+ │ ONE LLM call → executive summary only (deterministic fallback on failure)
101
+ │ render markdown
102
+ ├─ ReportStore.save: advisory-locked version assignment → dedorch `reports`
103
+ └─ write report_id back onto analysis state
104
+
105
+ GET /report/{analysis_id} → list versions (oldest-first)
106
+ GET /report/{analysis_id}/{ver} → fetch one version
107
+ ```
108
+
109
+ Two facts to internalise:
110
+ - **Records only exist on the slow path.** With `ENABLE_SLOW_PATH=false` (the default) no
111
+ records accumulate, so generation 409s — by design, not a bug.
112
+ - **dedorch `reports` stores markdown only.** Structured report fields are computed at
113
+ generation, rendered into `rendered_markdown`, and only the markdown is persisted; on
114
+ read-back the structured fields come back empty.
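
The report floor can be pictured with a small sketch. `ToolRun`, `report_floor_met`, and `generate_status` are hypothetical names; the 409 comes from the flow above, while the 200 success code and the prefix test for `analyze_*` are assumptions. What counts as a "substantive" success is decided by the real readiness logic and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class ToolRun:
    tool: str       # e.g. "analyze_trend" or "retrieve_data"
    success: bool


def report_floor_met(runs: list[ToolRun]) -> bool:
    """True once at least one analyze_* tool run has succeeded."""
    return any(r.success and r.tool.startswith("analyze_") for r in runs)


def generate_status(runs: list[ToolRun]) -> int:
    """Status a generate call would return: 409 below the floor, else 200 (assumed)."""
    return 200 if report_floor_met(runs) else 409
```

With `ENABLE_SLOW_PATH=false` the list of runs stays empty, so this check always yields 409, which matches the first bullet above.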
115
+
116
+ ---
117
+
118
+ ## 6. Feature list (what's built)
119
+
120
+ - **5-intent handler router** (`chat`/`help`/`check`/`unstructured_flow`/`structured_flow`) with history-aware query rewriting (EN/ID).
121
+ - **Skills:** `help` (LLM, state-aware next-step guidance), `check` (no-LLM data/document inventory). *(The `problem_statement` skill and the `problem_validated` gate were removed 2026-06-25 — KM-652; `gate.py` kept as a no-op seam, `problem_statement.py` kept but unwired.)*
122
+ - **Slow analytical path:** Planner → TaskRunner → Assembler (static plan, degrade-and-continue, 3 LLM calls fixed).
123
+ - **Structured query engine:** catalog-driven JSON IR → deterministic SQL/pandas compiler → read-only executor, with **single-level FK joins** (DB sources only).
124
+ - **Unstructured RAG** over PGVector.
125
+ - **Analytics tools:** 4 registered composite `analyze_*` (descriptive, aggregate, correlation, trend) + 4 data-access tools (check_data, check_knowledge, retrieve_data, retrieve_knowledge). Four further composites (comparison, contribution, profile, segment) exist in code but are **not registered** with the Planner.
126
+ - **Versioned report generation** from persisted records.
127
+ - **Analysis sessions:** data-first creation gate (≥1 bound source), per-analysis data-source binding (#10).
128
+ - **Langfuse tracing** (PII-masked), **Redis caching**, **pooled DB engines** + speculative prewarm.
129
+
130
+ ---
131
+
132
+ ## 7. API surface (this repo, all under `/api/v1`)
133
+
134
+ | Endpoint | Purpose | Caller |
135
+ |---|---|---|
136
+ | `POST /chat/stream` | Main chat SSE (router → dispatch) | FE → Go → Python (the only FE→Python call today) |
137
+ | `DELETE /chat/cache` · `/chat/cache/room/{id}` · `/retrieval/cache/{user_id}` | Cache management | internal / ops |
138
+ | `POST /analysis/create` · `GET /analysis` · `GET /analysis/{id}` | Analysis-session CRUD (state + room + bindings created atomically) | intended FE → Go |
139
+ | `POST /report` · `GET /report/{id}` · `GET /report/{id}/{ver}` | Report generate / list / fetch | FE → Go (report button) |
140
+ | `GET /tools` | Slash-command catalog (static, cacheable) | Go caches it for the FE "/" menu |
141
+ | `users` · `room` · `document` · `db_client` · `data_catalog` routers | Phase-1 legacy; functionally migrated to Go | mostly dormant |
142
+
143
+ ---
144
+
145
+ ## 8. Data model
146
+
147
+ SQLAlchemy models in `src/db/postgres/models.py`. Created on startup by `init_db()`
148
+ unless `SKIP_INIT_DB=true`.
149
+
150
+ | Table | Shape | Written by | Read by |
151
+ |---|---|---|---|
152
+ | `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
153
+ | `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
154
+ | `data_catalog` | per-user jsonb `Catalog` (Source → Table → Column) | Go ingestion / Python pipeline | CatalogReader, planner, tools |
155
+ | `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
156
+ | `analysis_records` | jsonb `AnalysisRecord`, one per slow-path run | slow path | ReportGenerator, report readiness |
157
+ | `analysis` *(dedorch)* | uuid id, `owner_id`, `problem_statement`, `problem_validated`, `report_id` | `/analysis/create`, state store | gate, Help, report |
158
+ | `reports` *(dedorch)* | uuid, `title` + markdown `content` + `version` | ReportStore | report API |
159
+ | `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id | `/analysis/create` | structured-flow scoping, report appendix |
160
+
161
+ **Catalog shape** (the jsonb in `data_catalog`):
162
+ `Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
163
+
164
+ **QueryIR shape** (`src/query/ir/models.py`):
165
+ `{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
166
+ Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
167
+ **DB sources only**.
168
+
169
+ ---
170
+
171
+ ## 9. Subsystems (where the code lives)
172
+
173
+ ### Router — `src/agents/orchestration.py`
174
+ One GPT-4o structured-output call → `RouterDecision{intent, rewritten_query, confidence}`,
175
+ `intent ∈ {chat, help, check, unstructured_flow, structured_flow}` (`problem_statement` removed
176
+ 2026-06-25). It's a
177
+ *handler* classifier: `structured_flow` = slow path, `unstructured_flow` = fast RAG; the
178
+ data-modality mix on the slow path is the Planner's job. Prompt: `src/config/prompts/intent_router.md`.
179
+
180
+ ### Gate — `src/agents/gate.py`
181
+ **Neutered 2026-06-25 (KM-652):** `gate()` now passes every intent through unchanged — the
182
+ `problem_validated` redirect was removed (the goal is user-entered, no agent validation). The
183
+ function + `AnalysisState` contract are kept as a no-op seam; the call site in
184
+ `chat_handler.handle` is commented out. `AnalysisState` still carries (id, analysis_title,
185
+ problem_statement, problem_validated, owner_id, report_id, created_at, updated_at) until the
186
+ dedorch state migration (#3/#4) renames it.
187
+
188
+ ### Skills — `src/agents/handlers/`
189
+ - `help.py` — LLM (streamed). A consistency guard derives the *allowed* actions from state
190
+ (mirrors the gate) and feeds them to the prompt, so Help can't suggest a report when the goal
191
+ isn't validated or there's nothing to report. Consumes a deterministic readiness signal.
192
+ - `check.py` — **no LLM.** Keyword cues route to `check_data`, `check_knowledge`, or both
193
+ (helicopter view, concurrent). Renders tool tables to markdown.
194
+ - `problem_statement.py` — **unwired 2026-06-25** (no longer routed to; file kept intact). Was an
195
+ LLM drafter that validated a goal and wrote `problem_validated`.
196
+
197
+ ### Slow path — `src/agents/slow_path/` + `src/agents/planner/`
198
+ - **Planner** (`planner/service.py`) — 1 LLM call → `TaskList` (DAG of tool-call chains). 8-check
199
+ validator with re-prompt retry (max 3). `BusinessContext` is a **stub** (`planner/business_context.py`),
200
+ which is why the slow path stays opt-in.
201
+ - **TaskRunner** (`slow_path/task_runner.py`) — deterministic, 0 LLM. Wave-based execution,
202
+ `${t<id>}` placeholder resolution (Pattern A), never-throw invocation, **degrade-and-continue**
203
+ (failed task → dependents skipped, independent branches run). No replanning.
204
+ - **Assembler** (`slow_path/assembler.py`) — 1 LLM call authoring only the narrative; code copies
205
+ the structured `results_snapshot` / `tasks_run` from the run state into the `AnalysisRecord`
206
+ (the report's source of truth).
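
The TaskRunner behaviour described above (waves, never-throw invocation, degrade-and-continue) can be sketched in a few lines. This is an illustration under assumptions, not the code in `slow_path/task_runner.py`: the task dict shape and `run_waves` are hypothetical, and dependency values are passed positionally instead of through `${t<id>}` placeholder resolution.

```python
from typing import Any

Task = dict[str, Any]  # hypothetical shape: {"id": str, "deps": list[str], "fn": callable}


def run_waves(tasks: list[Task]) -> dict[str, dict[str, Any]]:
    """Execute tasks wave by wave; a failed task skips only its dependents."""
    outcome: dict[str, dict[str, Any]] = {}
    pending = {t["id"]: t for t in tasks}
    while pending:
        # A task is ready once every dependency has an outcome of any kind.
        wave = [t for t in pending.values() if all(d in outcome for d in t["deps"])]
        if not wave:
            # Unresolvable dependencies (cycle or unknown id): skip what is left.
            for tid in pending:
                outcome[tid] = {"status": "skipped", "value": None}
            break
        for task in wave:
            del pending[task["id"]]
            if any(outcome[d]["status"] != "ok" for d in task["deps"]):
                outcome[task["id"]] = {"status": "skipped", "value": None}
                continue
            try:
                inputs = [outcome[d]["value"] for d in task["deps"]]
                outcome[task["id"]] = {"status": "ok", "value": task["fn"](*inputs)}
            except Exception as exc:  # never-throw: record the failure and carry on
                outcome[task["id"]] = {"status": "failed", "value": str(exc)}
    return outcome
```

A failure in one chain marks only its downstream tasks as skipped; independent chains still finish, which is what lets the Assembler work from a partial result set without replanning.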
207
+
208
+ Streaming + persistence: `chat_handler._run_slow_path` bridges per-stage progress to SSE `status`
209
+ events, prewarms the DB engine in parallel with planning, emits the answer, then persists the
210
+ record stamped with `user_id` + `analysis_id`.
211
+
212
+ ### Structured query engine — `src/query/`
213
+ `QueryService.run` (`query/service.py`): plan → validate → retry(3) → dispatch → execute; **never
214
+ raises** (errors land in `QueryResult.error`). `IRValidator` (`query/ir/validator.py`) checks
215
+ source/table/column existence, op/agg whitelists, type compatibility, limit cap, and **FK-backed
216
+ joins** (DB only). `DbExecutor` (`query/executor/db.py`): SqlCompiler → sqlglot SELECT-only guard →
217
+ Fernet-decrypt creds (with owner check) → `asyncio.to_thread` (30s timeout) → pooled engine
218
+ (read-only + statement_timeout) → 10k row cap. Defense-in-depth: IR validation + compiler whitelist
219
+ + sqlglot guard + read-only session + LIMIT/timeout.
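
The plan → validate → retry(3) part of `QueryService.run` might look something like the following. It is a sketch under assumptions: `plan_with_retry` and the callable signatures are hypothetical, and feeding the validator's message back to the planner as `previous_error` is assumed rather than confirmed for the current code.

```python
from typing import Any, Callable, Optional

MAX_ATTEMPTS = 3  # matches the retry(3) step in the description above


def plan_with_retry(
    plan: Callable[[str, Optional[str]], dict[str, Any]],
    validate: Callable[[dict[str, Any]], Optional[str]],
    question: str,
) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (ir, None) on success or (None, last_error) after MAX_ATTEMPTS."""
    previous_error: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        try:
            ir = plan(question, previous_error)
        except Exception as exc:  # planner failures are reported, not raised
            previous_error = f"planner failed: {exc}"
            continue
        previous_error = validate(ir)  # None means the IR passed validation
        if previous_error is None:
            return ir, None
    return None, previous_error
```

Returning the last error instead of raising keeps the caller's contract simple: a `None` IR means the error string goes into `QueryResult.error` and dispatch is skipped.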
220
+
221
+ ### Data-source binding (#10) — `src/agents/binding_store.py`
222
+ At `/analysis/create`, chosen `data_source_ids` become `data_sources` rows. On a `structured_flow`
223
+ turn the catalog reader is wrapped so the Planner and the tools' re-reads see the same scoped
224
+ catalog. **Fail-open**: empty/disjoint binding → whole catalog.
225
+
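The fail-open rule for bindings can be shown in a few lines. This is a sketch under assumptions: the catalog is modelled as a plain dict keyed by source id, and `scope_catalog` is a hypothetical name, not the function in `binding_store.py`.

```python
from __future__ import annotations


def scope_catalog(catalog: dict[str, dict], bound_ids: list[str]) -> dict[str, dict]:
    """Restrict the catalog to the bound sources; fail open to the whole catalog."""
    wanted = set(bound_ids)
    scoped = {source_id: source for source_id, source in catalog.items() if source_id in wanted}
    # Empty or disjoint binding: return everything rather than an empty catalog.
    return scoped or dict(catalog)
```

Because the fallback is silent, a typo in a bound id widens the scope instead of failing, which is the trade-off the conventions section warns about.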
226
+ ### Tool layer — `src/tools/data_access.py`, `src/agents/planner/registry.py`
227
+ `DataAccessToolInvoker` implements the never-throw tool seam for the 4 data-access tools.
228
+ `retrieve_data` runs a pre-built IR (validate → dispatch → execute, skipping the planner) and
229
+ coerces `Decimal`→`float` — the Pattern A handoff the `analyze_*` tools consume. The planner
230
+ registry composes a local data-access spec stub (name-checked against `DATA_ACCESS_TOOLS`) with the
231
+ real `analytics_registry()`.
232
+
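The `Decimal`→`float` coercion mentioned above exists because database drivers commonly return `NUMERIC` columns as `decimal.Decimal`, which numeric analysis code often does not accept. A minimal sketch of that step, assuming rows are plain dicts (the helper name `coerce_rows` is hypothetical):

```python
from __future__ import annotations

from decimal import Decimal
from typing import Any


def coerce_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of `rows` with every Decimal value converted to float."""
    return [
        {key: float(value) if isinstance(value, Decimal) else value for key, value in row.items()}
        for row in rows
    ]
```

The conversion is lossy for very large or very precise values, so it suits analytics handoff rather than anything that needs exact arithmetic.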
233
+ ### Report — `src/agents/report/`
234
+ `generator.py` reads records, deterministically assembles structured fields, then makes 1 LLM call for the
235
+ executive summary; `store.py` versions under an advisory lock and persists markdown to dedorch
236
+ `reports`; `readiness.py` defines the **report floor** (≥1 successful `analyze_*`; the
237
+ `problem_validated` precondition was dropped 2026-06-25) shared by the report API and the Help
238
+ readiness signal so the two can't disagree.
239
+
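The report floor (at least one successful `analyze_*` task) can be expressed as a single predicate shared by both callers. The sketch below assumes each record is a dict whose `tasks_run` entries carry `tool` and `status` keys; those key names and the `"ok"` status value are illustrative guesses, not the actual `AnalysisRecord` schema.

```python
from __future__ import annotations

from typing import Any


def meets_report_floor(records: list[dict[str, Any]]) -> bool:
    """True when any record contains at least one successful analyze_* task."""
    for record in records:
        for task in record.get("tasks_run", []):
            is_analysis = str(task.get("tool", "")).startswith("analyze_")
            if is_analysis and task.get("status") == "ok":
                return True
    return False
```

Keeping the rule in one function is what lets the report API and the Help readiness signal agree by construction.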
240
+ ### Observability — Langfuse
241
+ The endpoint's `ChatHandler` runs with `enable_tracing=True`. One trace per request groups
242
+ router/planner/assembler/chatbot + tool spans. PII policy: router/planner unmasked (PII-safe
243
+ summaries); assembler/chatbot masked (they see real rows); tool spans carry name + arg keys + row counts
244
+ only.
245
+
246
+ ---
247
+
248
+ ## 10. Feature flags
249
+
250
+ | Flag | Where | Default | Effect |
251
+ |---|---|---|---|
252
+ | `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
253
+ | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
254
+ | `SKIP_INIT_DB` | env, `main.py` | off | Skip `create_all` on startup — the dedorch cutover switch (Go owns dedorch migrations). |
255
+ | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
256
+
257
+ ---
258
+
259
+ ## 11. Where the older docs are stale
260
+
261
+ Trust the code. The original Phase-2 docs (`ARCHITECTURE.md`, `REPO_CONTEXT.md`) and the Go repo's
262
+ copies disagree with the current code on:
263
+
264
+ | Topic | Old docs | Current code |
265
+ |---|---|---|
266
+ | Router | 3-way `source_hint` (chat/unstructured/structured) | Flat **5-intent** `RouterDecision` (was 6; `problem_statement` removed 2026-06-25) |
267
+ | Joins in IR | "single-table only; deferred" | **Single-level FK-backed joins** (DB sources only) |
268
+ | Analysis / report / gate / slow path | "Phase 2 spine only" | All built and present |
269
+ | `analysis_id` | open question | resolved: **`analysis_id == room_id`** |
270
+ | Report source | (newer invariant) "from records, never chat history" | confirmed: generator reads `AnalysisRecord`s |
271
+
272
+ ---
273
+
274
+ ## 12. dedorch migration — current state
275
+
276
+ The Python DB is moving from `dataeyond` → **dedorch** (Go owns dedorch migrations; Python is
277
+ consumer-only). Current state:
278
+
279
+ - Base tables already match dedorch.
280
+ - The analysis-family models have been **renamed to dedorch** on `pr/3`: `analysis` (was
281
+ `analysis_states`, uuid ids), `data_sources` (was `analysis_data_sources`), `reports` (was
282
+ `analysis_reports`, flattened to title + markdown content + version).
283
+ - `analysis_records` (the slow-path structured output) has **no dedorch home** — it remains a
284
+ Python-owned jsonb table.
285
+ - The connection-string cutover (paired with `SKIP_INIT_DB`) is a coordinated step that has not
286
+ happened yet; Python still creates tables on startup until then.
287
+
288
+ The dedorch migrations themselves live outside the three checked-out repos (Harry owns them), so the
289
+ dedorch table shapes are asserted by the Python model docstrings, not visible in the Go repo here.
290
+
291
+ ---
292
+
293
+ ## 13. Conventions & gotchas
294
+
295
+ - **Two Postgres engines:** app engine + a separate PGVector engine (`prepared_statement_cache_size=0`)
296
+ because PGVector emits multi-statement strings asyncpg rejects.
297
+ - **Identifiers vs values:** identifiers come from the catalog and are inlined as quoted identifiers; filter
298
+ values are always parameterized.
299
+ - **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
300
+ exposes them as `azureai_api_key_4o`.
301
+ - **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
302
+ record persistence, report summary). Failures degrade into soft output rather than raising — good
303
+ for UX, but they can mask real breakage (e.g. a binding silently failing open to the full catalog).
304
+ - **Prompts** live in `src/config/prompts/*.md`. `chatbot_system.md` has `guardrails.md` appended so
305
+ guardrails win on conflict.
306
+ - **Tests** are gitignored (team decision) — run them locally.
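The identifiers-vs-values convention above can be sketched as follows. This is an illustration only: `quote_ident`, `build_filter_query`, the dict-of-sets catalog, and the `:value` named-parameter style are assumptions for the example, not the project's `SqlCompiler`.

```python
from __future__ import annotations

from typing import Any


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes (standard SQL style)."""
    return '"' + name.replace('"', '""') + '"'


def build_filter_query(
    table: str, column: str, value: Any, catalog: dict[str, set[str]]
) -> tuple[str, dict[str, Any]]:
    """Inline catalog-checked identifiers; pass the filter value as a bound parameter."""
    if column not in catalog.get(table, set()):
        raise ValueError(f"unknown identifier: {table}.{column}")
    sql = f"SELECT * FROM {quote_ident(table)} WHERE {quote_ident(column)} = :value"
    return sql, {"value": value}
```

User-supplied text only ever travels in the parameter dict, so a value such as `x' OR 1=1 --` never becomes part of the SQL string.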
src/agents/chat_handler.py CHANGED
@@ -9,8 +9,10 @@ End-to-end flow per user message:
9
  - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
10
  list[DocumentChunk].
11
  - `check` → check_data / check_knowledge tool → rendered table.
12
- - `problem_statement` → PS skill: draft + validate → write analysis state.
13
  - `help` → Help skill: analysis state + history → streamed guidance.
 
 
 
14
  3. `ChatbotAgent.astream` → yield text tokens.
15
  4. Wrap each step into an SSE-style event dict so the API endpoint can
16
  stream them as Server-Sent Events.
@@ -39,7 +41,9 @@ from src.retrieval.base import RetrievalResult
39
  from .chatbot import ChatbotAgent, DocumentChunk
40
  from .handlers.check import run_check
41
  from .handlers.help import HelpAgent
42
- from .handlers.problem_statement import ProblemStatementAgent, run_problem_statement
 
 
43
  from .orchestration import OrchestratorAgent
44
 
45
  if TYPE_CHECKING:
@@ -48,7 +52,7 @@ if TYPE_CHECKING:
48
  from ..retrieval.router import RetrievalRouter
49
  from .gate import AnalysisState
50
  from .slow_path.coordinator import SlowPathCoordinator
51
- from .slow_path.store import AnalysisStore
52
 
53
  logger = get_logger("chat_handler")
54
 
@@ -78,7 +82,7 @@ class ChatHandler:
78
  slow_path_coordinator_factory: (
79
  Callable[[str], SlowPathCoordinator] | None
80
  ) = None,
81
- analysis_store: AnalysisStore | None = None,
82
  check_invoker_factory: Callable[[str], Any] | None = None,
83
  ps_agent: ProblemStatementAgent | None = None,
84
  help_agent: HelpAgent | None = None,
@@ -114,8 +118,8 @@ class ChatHandler:
114
  # `#10` data-source binding: scopes structured_flow's catalog to the sources
115
  # the analysis is bound to. Injectable for tests; fail-open when absent.
116
  self._binding_store = binding_store
117
- # Deterministic gate: redirect structured_flow -> problem_statement until the
118
- # analysis is validated. OFF by default (legacy rooms have no state row).
119
  self._enable_gate = enable_gate
120
 
121
  # ------------------------------------------------------------------
@@ -244,9 +248,8 @@ class ChatHandler:
244
 
245
  intent = decision.intent
246
  # ---- 1a. Ensure session state row (T-A) ----------------------
247
- # Rooms created via /room/create have no `analysis_states` row. Without one
248
- # the gate redirect-loops and problem_statement / report_id writes silently
249
- # no-op. Lazily get-or-create it (idempotent) so any session is gate-ready.
250
  analysis_state: AnalysisState | None = None
251
  if analysis_id:
252
  try:
@@ -256,18 +259,20 @@ class ChatHandler:
256
  "analysis state ensure failed", analysis_id=analysis_id, error=str(e)
257
  )
258
 
259
- # ---- 1b. Gate (deterministic, post-router) -------------------
260
- # Redirect structured_flow -> problem_statement until the analysis is
261
- # validated. Fails closed (not-validated) when the state row is unavailable.
262
- if self._enable_gate and analysis_id:
263
- from .gate import gate, stub_analysis_state
264
-
265
- intent = gate(
266
- intent,
267
- analysis_state
268
- if analysis_state is not None
269
- else stub_analysis_state(problem_validated=False),
270
- )
 
 
271
 
272
  # The `intent` event is consumed by the endpoint (it gates response caching
273
  # on the effective intent) and is NOT forwarded to the frontend. We emit the
@@ -337,22 +342,24 @@ class ChatHandler:
337
  yield {"event": "chunk", "data": text}
338
  yield {"event": "done", "data": ""}
339
  return
340
- elif intent == "problem_statement":
341
- try:
342
- text = await run_problem_statement(
343
- message,
344
- analysis_id,
345
- agent=self._get_ps_agent(),
346
- store=self._get_state_store(),
347
- history=history,
348
- )
349
- except Exception as e:
350
- logger.error("problem_statement route failed", user_id=user_id, error=str(e))
351
- yield {"event": "error", "data": f"Problem statement failed: {e}"}
352
- return
353
- yield {"event": "chunk", "data": text}
354
- yield {"event": "done", "data": ""}
355
- return
 
 
356
  elif intent == "help":
357
  try:
358
  state = analysis_state or await self._load_analysis_state(analysis_id)
@@ -468,11 +475,11 @@ class ChatHandler:
468
  PlannerService(), TaskRunner(invoker, registry), Assembler(), registry
469
  )
470
 
471
- def _get_analysis_store(self) -> AnalysisStore:
472
  if self._analysis_store is None:
473
- from .slow_path.store import PostgresAnalysisStore
474
 
475
- self._analysis_store = PostgresAnalysisStore()
476
  return self._analysis_store
477
 
478
  async def _run_slow_path(
@@ -487,7 +494,7 @@ class ChatHandler:
487
  """Run the slow path and stream its assembled answer as SSE events.
488
 
489
  Context comes from the `get_business_context` seam (a stub today); the
490
- `analysis_record` is persisted via the `AnalysisStore` seam (PostgresAnalysisStore),
491
  stamped with the request's user_id + analysis_id so the report can group it.
492
  `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
493
  object — true token streaming is a later step).
 
9
  - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
10
  list[DocumentChunk].
11
  - `check` → check_data / check_knowledge tool → rendered table.
 
12
  - `help` → Help skill: analysis state + history → streamed guidance.
13
+
14
+ (`problem_statement` was removed 2026-06-24 — the goal is now user-entered
15
+ `objective` + `business_questions` captured at onboarding, with no agent skill.)
16
  3. `ChatbotAgent.astream` → yield text tokens.
17
  4. Wrap each step into an SSE-style event dict so the API endpoint can
18
  stream them as Server-Sent Events.
 
41
  from .chatbot import ChatbotAgent, DocumentChunk
42
  from .handlers.check import run_check
43
  from .handlers.help import HelpAgent
44
+ # `run_problem_statement` unwired 2026-06-24 (problem_statement removed from the router).
45
+ # `ProblemStatementAgent` kept — still referenced by the constructor + _get_ps_agent.
46
+ from .handlers.problem_statement import ProblemStatementAgent
47
  from .orchestration import OrchestratorAgent
48
 
49
  if TYPE_CHECKING:
 
52
  from ..retrieval.router import RetrievalRouter
53
  from .gate import AnalysisState
54
  from .slow_path.coordinator import SlowPathCoordinator
55
+ from .slow_path.store import ReportInputStore
56
 
57
  logger = get_logger("chat_handler")
58
 
 
82
  slow_path_coordinator_factory: (
83
  Callable[[str], SlowPathCoordinator] | None
84
  ) = None,
85
+ analysis_store: ReportInputStore | None = None,
86
  check_invoker_factory: Callable[[str], Any] | None = None,
87
  ps_agent: ProblemStatementAgent | None = None,
88
  help_agent: HelpAgent | None = None,
 
118
  # `#10` data-source binding: scopes structured_flow's catalog to the sources
119
  # the analysis is bound to. Injectable for tests; fail-open when absent.
120
  self._binding_store = binding_store
121
+ # Deterministic gate DEPRECATED 2026-06-24 (problem_validated gate removed).
122
+ # Unused flag; the gate call site in handle() is commented out.
123
  self._enable_gate = enable_gate
124
 
125
  # ------------------------------------------------------------------
 
248
 
249
  intent = decision.intent
250
  # ---- 1a. Ensure session state row (T-A) ----------------------
251
+ # Rooms created via /room/create have no `analysis` row. Without one, Help and
252
+ # the report_id write-back silently no-op. Lazily get-or-create it (idempotent).
 
253
  analysis_state: AnalysisState | None = None
254
  if analysis_id:
255
  try:
 
259
  "analysis state ensure failed", analysis_id=analysis_id, error=str(e)
260
  )
261
 
262
+ # ---- 1b. Gate (REMOVED 2026-06-24) ---------------------------
263
+ # The problem_validated gate was dropped: structured_flow is no longer
264
+ # redirected to problem_statement (the goal is now user-entered objective +
265
+ # business_questions, no agent validation). `gate()` is neutered to a no-op; the
266
+ # call site is left commented for restorability.
267
+ # if self._enable_gate and analysis_id:
268
+ # from .gate import gate, stub_analysis_state
269
+ #
270
+ # intent = gate(
271
+ # intent,
272
+ # analysis_state
273
+ # if analysis_state is not None
274
+ # else stub_analysis_state(problem_validated=False),
275
+ # )
276
 
277
  # The `intent` event is consumed by the endpoint (it gates response caching
278
  # on the effective intent) and is NOT forwarded to the frontend. We emit the
 
342
  yield {"event": "chunk", "data": text}
343
  yield {"event": "done", "data": ""}
344
  return
345
+ # problem_statement dispatch removed 2026-06-24 (skill unwired; intent no longer
346
+ # emitted by the router). Branch kept commented for restorability.
347
+ # elif intent == "problem_statement":
348
+ # try:
349
+ # text = await run_problem_statement(
350
+ # message,
351
+ # analysis_id,
352
+ # agent=self._get_ps_agent(),
353
+ # store=self._get_state_store(),
354
+ # history=history,
355
+ # )
356
+ # except Exception as e:
357
+ # logger.error("problem_statement route failed", user_id=user_id, error=str(e))
358
+ # yield {"event": "error", "data": f"Problem statement failed: {e}"}
359
+ # return
360
+ # yield {"event": "chunk", "data": text}
361
+ # yield {"event": "done", "data": ""}
362
+ # return
363
  elif intent == "help":
364
  try:
365
  state = analysis_state or await self._load_analysis_state(analysis_id)
 
475
  PlannerService(), TaskRunner(invoker, registry), Assembler(), registry
476
  )
477
 
478
+ def _get_analysis_store(self) -> ReportInputStore:
479
  if self._analysis_store is None:
480
+ from .slow_path.store import PostgresReportInputStore
481
 
482
+ self._analysis_store = PostgresReportInputStore()
483
  return self._analysis_store
484
 
485
  async def _run_slow_path(
 
494
  """Run the slow path and stream its assembled answer as SSE events.
495
 
496
  Context comes from the `get_business_context` seam (a stub today); the
497
+ `analysis_record` is persisted via the `ReportInputStore` seam (PostgresReportInputStore),
498
  stamped with the request's user_id + analysis_id so the report can group it.
499
  `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
500
  object — true token streaming is a later step).
src/agents/gate.py CHANGED
@@ -40,26 +40,29 @@ class AnalysisState(BaseModel):
40
  analysis_title: str
41
  problem_statement: str
42
  problem_validated: bool = False
43
- owner_id: str
44
  report_id: str | None = None
45
  created_at: datetime
46
  updated_at: datetime
47
 
48
 
49
  def gate(intent: Intent, state: AnalysisState) -> Intent:
50
- """Return the effective intent after applying the deterministic gate policy.
51
 
52
- `structured_flow` requires `problem_validated is True`; otherwise redirect to
53
- `problem_statement`. All other intents pass through unchanged.
 
 
54
  """
55
- if intent == "structured_flow" and not state.problem_validated:
56
- logger.info(
57
- "gate redirect",
58
- requested=intent,
59
- effective="problem_statement",
60
- reason="problem_not_validated",
61
- )
62
- return "problem_statement"
 
63
  return intent
64
 
65
 
@@ -75,7 +78,7 @@ def stub_analysis_state(*, problem_validated: bool = False) -> AnalysisState:
75
  analysis_title="Stub analysis",
76
  problem_statement="Stub problem statement" if problem_validated else "",
77
  problem_validated=problem_validated,
78
- owner_id="stub-user",
79
  report_id=None,
80
  created_at=now,
81
  updated_at=now,
 
40
  analysis_title: str
41
  problem_statement: str
42
  problem_validated: bool = False
43
+ user_id: str
44
  report_id: str | None = None
45
  created_at: datetime
46
  updated_at: datetime
47
 
48
 
49
  def gate(intent: Intent, state: AnalysisState) -> Intent:
50
+ """Return the effective intent (NEUTERED 2026-06-24: passes everything through).
51
 
52
+ The `problem_validated` gate was removed: analysis is no longer gated on a validated
53
+ problem statement (the goal is now two user-entered fields, `objective` +
54
+ `business_questions`, captured at onboarding with no agent validation). Kept as a
55
+ no-op seam so gating can be restored without re-threading call sites.
56
  """
57
+ # Pre-2026-06-24 policy: redirect analytical requests until the goal was validated.
58
+ # if intent == "structured_flow" and not state.problem_validated:
59
+ # logger.info(
60
+ # "gate redirect",
61
+ # requested=intent,
62
+ # effective="problem_statement",
63
+ # reason="problem_not_validated",
64
+ # )
65
+ # return "problem_statement"
66
  return intent
67
 
68
 
 
78
  analysis_title="Stub analysis",
79
  problem_statement="Stub problem statement" if problem_validated else "",
80
  problem_validated=problem_validated,
81
+ user_id="stub-user",
82
  report_id=None,
83
  created_at=now,
84
  updated_at=now,
src/agents/handlers/help.py CHANGED
@@ -7,18 +7,24 @@ it never runs analysis or produces data answers.
7
  The prompt lives in `config/prompts/help.md` (the playbook); this module composes
8
  the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency
9
  guard** has teeth here, not just in the prompt: `_derive_available_actions` computes
10
- the actions actually allowed from the state (the same policy as `gate.py`), and that
11
- list is fed into the prompt — the LLM is told to suggest *only* those, so it can't
12
- tell the user to generate a report when the goal isn't validated or the analysis
13
- isn't ready.
 
 
 
 
 
14
 
15
  SEAMS:
16
- - `AnalysisState` is the locked 8-field contract from `gate.py` (KM-652). The gate,
17
- this skill, and tests share `gate.stub_analysis_state(...)` so they exercise the
18
- same shape.
19
- - `ReportReadiness` is the return shape of `is_report_ready(chat_history)` (seam #5,
20
- Rifqi; not built yet). Help *consumes* it; it does not compute it. Until it lands,
21
- the caller passes a stub (default: not ready).
 
22
  """
23
 
24
  from __future__ import annotations
@@ -59,17 +65,13 @@ class ReportReadiness:
59
 
60
 
61
  def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]:
62
- """Actions Help is allowed to suggest, derived from state (mirrors `gate.py`).
63
 
64
- This is the consistency guard's teeth: analysis is gated behind a validated goal
65
- (same rule the gate applies to `structured_flow`), and a report is only offered
66
- when the readiness signal says so. Keep this policy in sync with `gate.gate`.
67
  """
68
- if not state.problem_validated:
69
- # Goal not set → the only useful move is defining the problem statement.
70
- return ["define_problem_statement"]
71
-
72
- actions = ["ask_analysis_question", "refine_problem_statement"]
73
  if report_ready.ready:
74
  actions.append("generate_report")
75
  return actions
@@ -78,11 +80,16 @@ def _derive_available_actions(state: AnalysisState, report_ready: ReportReadines
78
  def _format_state(state: AnalysisState) -> str:
79
  """Render the analysis state as a compact context block for the LLM."""
80
  has_report = "yes" if state.report_id else "no"
 
 
 
 
 
81
  return (
82
  "[Analysis state]\n"
83
  f"analysis_title: {state.analysis_title or '(none)'}\n"
84
- f"problem_statement: {state.problem_statement or '(empty)'}\n"
85
- f"problem_validated: {str(state.problem_validated).lower()}\n"
86
  f"has_report: {has_report}"
87
  )
88
 
@@ -173,7 +180,6 @@ class HelpAgent:
173
  actions = available_actions or _derive_available_actions(state, readiness)
174
  logger.info(
175
  "help guidance",
176
- problem_validated=state.problem_validated,
177
  report_ready=readiness.ready,
178
  available_actions=actions,
179
  )
 
7
  The prompt lives in `config/prompts/help.md` (the playbook); this module composes
8
  the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency
9
  guard** has teeth here, not just in the prompt: `_derive_available_actions` computes
10
+ the actions actually allowed from the readiness signal, and that list is fed into the
11
+ prompt — the LLM is told to suggest *only* those, so it can't tell the user to
12
+ generate a report before the analysis is ready.
13
+
14
+ NOTE (KM-652, 2026-06-24): the `problem_statement` skill + the `problem_validated`
15
+ gate were removed — the goal is now two user-entered fields (`objective` +
16
+ `business_questions`) captured at onboarding, with no agent validation. So Help no
17
+ longer steers users to define/validate a goal in chat; it just orients them to
18
+ analysis and (when ready) the report.
19
 
20
  SEAMS:
21
+ - `AnalysisState` is the contract from `gate.py`. The gate, this skill, and tests
22
+ share `gate.stub_analysis_state(...)` so they exercise the same shape. (The
23
+ `objective`/`business_questions` rename is in-flight — task #4 — so this module
24
+ reads those getattr-tolerantly, falling back to legacy `problem_statement`.)
25
+ - `ReportReadiness` is the return shape of `is_report_ready` (seam #5, Rifqi; built
26
+ in `report/readiness.py`). Help *consumes* it; it does not compute it. A missing
27
+ signal degrades to a not-ready stub.
28
  """
29
 
30
  from __future__ import annotations
 
65
 
66
 
67
  def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]:
68
+ """Actions Help is allowed to suggest, derived from the readiness signal.
69
 
70
+ Since KM-652 there is no goal-validation gate: the goal (objective +
71
+ business_questions) is set in the onboarding form, so asking analysis questions is
72
+ always available. A report is only offered when the readiness signal says so.
73
  """
74
+ actions = ["ask_analysis_question"]
 
 
 
 
75
  if report_ready.ready:
76
  actions.append("generate_report")
77
  return actions
 
80
  def _format_state(state: AnalysisState) -> str:
81
  """Render the analysis state as a compact context block for the LLM."""
82
  has_report = "yes" if state.report_id else "no"
83
+ # Tolerant of the in-flight AnalysisState rename (#4): prefer objective +
84
+ # business_questions, fall back to the legacy free-text problem_statement.
85
+ objective = getattr(state, "objective", "") or getattr(state, "problem_statement", "") or ""
86
+ questions = getattr(state, "business_questions", None) or []
87
+ business_questions = "; ".join(questions) if questions else "(none)"
88
  return (
89
  "[Analysis state]\n"
90
  f"analysis_title: {state.analysis_title or '(none)'}\n"
91
+ f"objective: {objective or '(empty)'}\n"
92
+ f"business_questions: {business_questions}\n"
93
  f"has_report: {has_report}"
94
  )
95
 
 
180
  actions = available_actions or _derive_available_actions(state, readiness)
181
  logger.info(
182
  "help guidance",
 
183
  report_ready=readiness.ready,
184
  available_actions=actions,
185
  )
src/agents/handlers/problem_statement.py CHANGED
@@ -1,3 +1,7 @@
 
 
 
 
1
  """Problem Statement skill — guide the user to a usable problem statement.
2
 
3
  Routed by the orchestrator (intent `problem_statement`) and callable as a skill.
 
1
+ # UNWIRED 2026-06-24: the problem_statement skill is no longer routed to — it was removed
2
+ # from the 6-intent router and the gate (the goal is now user-entered objective +
3
+ # business_questions, no agent validation). File kept intact (comment, don't delete) so
4
+ # the skill can be restored if needed. See DEV_PLAN.md #1.
5
  """Problem Statement skill — guide the user to a usable problem statement.
6
 
7
  Routed by the orchestrator (intent `problem_statement`) and callable as a skill.
src/agents/orchestration.py CHANGED
@@ -32,7 +32,9 @@ logger = get_logger("orchestrator")
32
  Intent = Literal[
33
  "chat",
34
  "help",
35
- "problem_statement",
 
 
36
  "check",
37
  "unstructured_flow",
38
  "structured_flow",
@@ -53,10 +55,10 @@ class RouterDecision(BaseModel):
53
  ...,
54
  description=(
55
  "Handler route for this message: 'chat' (conversational, no data), "
56
- "'help' (what-to-do-next guidance), 'problem_statement' (define or "
57
- "refine the analysis goal), 'check' (inventory: what data/documents "
58
- "exist), 'unstructured_flow' (answer from documents, fast RAG), or "
59
- "'structured_flow' (analytical question over data, slow Planner path)."
60
  ),
61
  )
62
  rewritten_query: str | None = Field(
 
32
  Intent = Literal[
33
  "chat",
34
  "help",
35
+ # "problem_statement", # removed 2026-06-24 — the analysis goal is now two
36
+ # # user-entered fields (objective + business_questions),
37
+ # # captured at onboarding with no agent validation.
38
  "check",
39
  "unstructured_flow",
40
  "structured_flow",
 
55
  ...,
56
  description=(
57
  "Handler route for this message: 'chat' (conversational, no data), "
58
+ "'help' (what-to-do-next guidance), 'check' (inventory: what "
59
+ "data/documents exist), 'unstructured_flow' (answer from documents, fast "
60
+ "RAG), or 'structured_flow' (analytical question over data, slow Planner "
61
+ "path)."
62
  ),
63
  )
64
  rewritten_query: str | None = Field(
src/agents/report/generator.py CHANGED
@@ -167,9 +167,12 @@ def _build_human_content(
167
  ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
168
  ) -> str:
169
  sections = []
170
- ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
171
- if ps_lines:
172
- sections.append("# Problem Statement\n" + "\n".join(ps_lines))
 
 
 
173
  sections.append(
174
  "# Findings (already finalized — synthesize, do not add numbers)\n"
175
  + "\n".join(f"- {f.text}" for f in findings)
@@ -182,16 +185,23 @@ def _build_human_content(
182
  def _render_markdown(report: AnalysisReport) -> str:
183
  # Version is deliberately NOT in the markdown — it is assigned by the store
184
  # after rendering and lives in the structured `version` field / API metadata.
185
- parts: list[str] = ["# Analysis Report"]
186
- parts.append(
187
- f"*Generated {report.generated_at:%Y-%m-%d} · "
188
- f"{len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"
189
- )
 
 
 
190
 
191
  ps = report.problem_statement
192
- ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
193
- if ps_lines:
194
- parts.append("## Problem Statement\n" + " ".join(ps_lines))
 
 
 
 
195
 
196
  if report.executive_summary:
197
  parts.append("## Executive Summary\n" + report.executive_summary)
@@ -203,18 +213,8 @@ def _render_markdown(report: AnalysisReport) -> str:
203
  lines.append(f"{i}. {f.text}{cite}")
204
  parts.append("\n".join(lines))
205
 
206
- if report.caveats or report.open_questions:
207
- lines = ["## Caveats & Open Questions"]
208
- for n in report.caveats:
209
- cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
210
- lines.append(f"- {n.text}{cite}")
211
- for n in report.open_questions:
212
- cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
213
- lines.append(f"- Open: {n.text}{cite}")
214
- parts.append("\n".join(lines))
215
-
216
  if report.data_sources:
217
- lines = ["## Appendix A — Data Used", "| source | type | detail |", "|---|---|---|"]
218
  for ds in report.data_sources:
219
  d = ds.detail
220
  bits = []
@@ -227,8 +227,18 @@ def _render_markdown(report: AnalysisReport) -> str:
227
  lines.append(f"| {ds.name} | {ds.source_type or '—'} | {' · '.join(bits) or '—'} |")
228
  parts.append("\n".join(lines))
229
 
 
 
 
 
 
 
 
 
 
 
230
  if report.method_steps:
231
- lines = ["## Appendix B — Method"]
232
  for stage_key, label in _STAGE_LABELS:
233
  steps = [s for s in report.method_steps if s.stage == stage_key]
234
  if not steps:
@@ -239,7 +249,7 @@ def _render_markdown(report: AnalysisReport) -> str:
239
  lines.append(f"**{label}** — {rendered}")
240
  parts.append("\n".join(lines))
241
 
242
- return "\n\n".join(parts)
243
 
244
 
245
  # --------------------------------------------------------------------------- #
@@ -264,9 +274,9 @@ class ReportGenerator:
264
 
265
  def _ensure_record_store(self):
266
  if self._record_store is None:
267
- from ..slow_path.store import PostgresAnalysisStore
268
 
269
- self._record_store = PostgresAnalysisStore()
270
  return self._record_store
271
 
272
  def _ensure_chain(self) -> Runnable:
@@ -286,6 +296,7 @@ class ReportGenerator:
286
  analysis_id: str,
287
  user_id: str | None = None,
288
  problem_statement: ProblemStatement | None = None,
 
289
  ) -> AnalysisReport:
290
  records = await self._ensure_record_store().list_for_analysis(analysis_id)
291
  if not records:
@@ -305,6 +316,7 @@ class ReportGenerator:
305
  report = AnalysisReport(
306
  analysis_id=analysis_id,
307
  user_id=user_id,
 
308
  version=0, # assigned by ReportStore.save under the advisory lock
309
  generated_at=datetime.now(UTC),
310
  problem_statement=ps,
 
167
  ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
168
  ) -> str:
169
  sections = []
170
+ if ps.objective:
171
+ sections.append("# Objective\n" + ps.objective)
172
+ if ps.business_questions:
173
+ sections.append(
174
+ "# Business questions\n" + "\n".join(f"- {q}" for q in ps.business_questions)
175
+ )
176
  sections.append(
177
  "# Findings (already finalized — synthesize, do not add numbers)\n"
178
  + "\n".join(f"- {f.text}" for f in findings)
 
185
  def _render_markdown(report: AnalysisReport) -> str:
186
  # Version is deliberately NOT in the markdown — it is assigned by the store
187
  # after rendering and lives in the structured `version` field / API metadata.
188
+ meta = f"*Generated {report.generated_at:%Y-%m-%d}"
189
+ author = report.user_name or report.user_id
190
+ if author:
191
+ meta += f" by {author}"
192
+ meta += f" · {len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"
193
+ # Title + meta form the header block; each subsequent section is divided by a
194
+ # horizontal rule (`---`) so the report reads as a formal, sectioned document.
195
+ parts: list[str] = ["# Analysis Report\n" + meta]
196
 
197
  ps = report.problem_statement
198
+ if ps.objective:
199
+ parts.append("## Objective\n" + ps.objective)
200
+ if ps.business_questions:
201
+ parts.append(
202
+ "## Business Questions\n"
203
+ + "\n".join(f"{i}. {q}" for i, q in enumerate(ps.business_questions, 1))
204
+ )
205
 
206
  if report.executive_summary:
207
  parts.append("## Executive Summary\n" + report.executive_summary)
 
213
  lines.append(f"{i}. {f.text}{cite}")
214
  parts.append("\n".join(lines))
215
 
 
 
 
 
 
 
 
 
 
 
216
  if report.data_sources:
217
+ lines = ["## EDA", "| source | type | detail |", "|---|---|---|"]
218
  for ds in report.data_sources:
219
  d = ds.detail
220
  bits = []
 
227
  lines.append(f"| {ds.name} | {ds.source_type or '—'} | {' · '.join(bits) or '—'} |")
228
  parts.append("\n".join(lines))
229
 
230
+ if report.caveats or report.open_questions:
231
+ lines = ["## Notes & Limitations"]
232
+ for n in report.caveats:
233
+ cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
234
+ lines.append(f"- {n.text}{cite}")
235
+ for n in report.open_questions:
236
+ cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
237
+ lines.append(f"- Open: {n.text}{cite}")
238
+ parts.append("\n".join(lines))
239
+
240
  if report.method_steps:
241
+ lines = ["## How This Was Analyzed"]
242
  for stage_key, label in _STAGE_LABELS:
243
  steps = [s for s in report.method_steps if s.stage == stage_key]
244
  if not steps:
 
249
  lines.append(f"**{label}** — {rendered}")
250
  parts.append("\n".join(lines))
251
 
252
+ return "\n\n---\n\n".join(parts)
253
 
254
 
255
  # --------------------------------------------------------------------------- #
 
274
 
275
  def _ensure_record_store(self):
276
  if self._record_store is None:
277
+ from ..slow_path.store import PostgresReportInputStore
278
 
279
+ self._record_store = PostgresReportInputStore()
280
  return self._record_store
281
 
282
  def _ensure_chain(self) -> Runnable:
 
296
  analysis_id: str,
297
  user_id: str | None = None,
298
  problem_statement: ProblemStatement | None = None,
299
+ user_name: str | None = None,
300
  ) -> AnalysisReport:
301
  records = await self._ensure_record_store().list_for_analysis(analysis_id)
302
  if not records:
 
316
  report = AnalysisReport(
317
  analysis_id=analysis_id,
318
  user_id=user_id,
319
+ user_name=user_name,
320
  version=0, # assigned by ReportStore.save under the advisory lock
321
  generated_at=datetime.now(UTC),
322
  problem_statement=ps,
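
The header and goal sections rendered by `_render_markdown` above can be sketched in isolation. This is a minimal illustration, not the module's code: `ReportStub` and `render_header_and_goal` are hypothetical names, and the stub carries only the fields the header block reads.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReportStub:
    """Hypothetical stand-in for AnalysisReport (header fields only)."""

    generated_at: datetime
    user_id: str | None = None
    user_name: str | None = None
    record_ids: list[str] = field(default_factory=list)
    data_sources: list[str] = field(default_factory=list)
    objective: str = ""
    business_questions: list[str] = field(default_factory=list)


def render_header_and_goal(report: ReportStub) -> str:
    # Meta line: date, optional author (display name first, raw id second), counts.
    meta = f"*Generated {report.generated_at:%Y-%m-%d}"
    author = report.user_name or report.user_id
    if author:
        meta += f" by {author}"
    meta += f" · {len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"

    parts = ["# Analysis Report\n" + meta]
    if report.objective:
        parts.append("## Objective\n" + report.objective)
    if report.business_questions:
        numbered = "\n".join(
            f"{i}. {q}" for i, q in enumerate(report.business_questions, 1)
        )
        parts.append("## Business Questions\n" + numbered)
    # Each section is separated from the next by a horizontal rule.
    return "\n\n---\n\n".join(parts)


if __name__ == "__main__":
    stub = ReportStub(
        generated_at=datetime(2026, 6, 25, tzinfo=timezone.utc),
        user_id="u1",
        objective="Reduce churn",
        business_questions=["Which segment churns most?"],
    )
    print(render_header_and_goal(stub))
```

Because empty sections are skipped before the join, a report with no goal renders as the header block alone, with no dangling rule.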
src/agents/report/readiness.py CHANGED
@@ -7,8 +7,8 @@ not a judgement.
7
 
8
  The rule mirrors what makes a real report non-empty and worth generating, so Help can
9
  never suggest an action that would 409 or produce a duplicate:
10
- 1. `problem_validated`: the gate's own precondition (no validated goal, no
11
- analysis worth reporting). Same rule `gate.gate` applies to `structured_flow`.
12
  2. at least one **substantive** persisted `AnalysisRecord` — a record whose
13
  *analysis* task succeeded. A failed run still persists a record WITH findings
14
  (they narrate the failure), and data-access tasks (check_/retrieve_) succeed even
@@ -45,15 +45,15 @@ if TYPE_CHECKING:
45
  logger = get_logger("report_readiness")
46
 
47
  # Human-readable gaps surfaced to the user via Help (kept stable for the prompt).
48
- _MISSING_PROBLEM = "a validated problem statement"
49
  _MISSING_ANALYSIS = "at least one completed analysis"
50
  _MISSING_DELTA = "a new analysis since the last report"
51
 
52
 
53
  def _default_record_store():
54
- from ..slow_path.store import PostgresAnalysisStore
55
 
56
- return PostgresAnalysisStore()
57
 
58
 
59
  def _default_report_store():
@@ -91,18 +91,22 @@ async def report_floor(
91
  *,
92
  record_store=None,
93
  ) -> tuple[list[str], list]:
94
- """The report **floor**: a validated goal + ≥1 substantive analysis.
95
 
96
  Returns `(missing, substantive_records)`. This is the shared gate both the Help
97
  readiness signal AND the report API enforce, so the button and Help can't drift
98
- (T-D / T11). It deliberately excludes the delta-since-report check — that is
99
- advisory and lives only in `is_report_ready`; the report button is always allowed
100
- to cut a new version (decision 4A). Fails closed (counts as missing analysis) on
101
- a record-store read error. `record_store` is injectable for tests.
 
 
 
 
 
 
102
  """
103
  missing: list[str] = []
104
- if not state.problem_validated:
105
- missing.append(_MISSING_PROBLEM)
106
 
107
  substantive: list = []
108
  if analysis_id:
@@ -116,7 +120,7 @@ async def report_floor(
116
  analysis_id=analysis_id,
117
  error=str(exc),
118
  )
119
- return [*missing, _MISSING_ANALYSIS], []
120
 
121
  if not substantive:
122
  missing.append(_MISSING_ANALYSIS)
 
7
 
8
  The rule mirrors what makes a real report non-empty and worth generating, so Help can
9
  never suggest an action that would 409 or produce a duplicate:
10
+ 1. (removed 2026-06-24) a validated problem statement: the report no longer gates on
11
+ the goal (now user-entered `objective` + `business_questions`, no agent validation).
12
  2. at least one **substantive** persisted `AnalysisRecord` — a record whose
13
  *analysis* task succeeded. A failed run still persists a record WITH findings
14
  (they narrate the failure), and data-access tasks (check_/retrieve_) succeed even
 
45
  logger = get_logger("report_readiness")
46
 
47
  # Human-readable gaps surfaced to the user via Help (kept stable for the prompt).
48
+ # _MISSING_PROBLEM retired 2026-06-24 — the report no longer gates on a validated goal.
49
  _MISSING_ANALYSIS = "at least one completed analysis"
50
  _MISSING_DELTA = "a new analysis since the last report"
51
 
52
 
53
  def _default_record_store():
54
+ from ..slow_path.store import PostgresReportInputStore
55
 
56
+ return PostgresReportInputStore()
57
 
58
 
59
  def _default_report_store():
 
91
  *,
92
  record_store=None,
93
  ) -> tuple[list[str], list]:
94
+ """The report **floor**: ≥1 substantive analysis.
95
 
96
  Returns `(missing, substantive_records)`. This is the shared gate both the Help
97
  readiness signal AND the report API enforce, so the button and Help can't drift
98
+ (T-D / T11).
99
+
100
+ CHANGED 2026-06-24: the `problem_validated` precondition was dropped; analysis is no
101
+ longer gated on a validated goal (now user-entered `objective` + `business_questions`,
102
+ no agent validation), so the only floor is "is there anything worth reporting". The
103
+ delta-since-report check stays advisory and lives only in `is_report_ready`; the
104
+ report button is always allowed to cut a new version (decision 4A). Fails closed
105
+ (counts as missing analysis) on a record-store read error. `record_store` is
106
+ injectable for tests. `state` stays in the signature (callers + the `is_report_ready`
107
+ delta check use it).
108
  """
109
  missing: list[str] = []
 
 
110
 
111
  substantive: list = []
112
  if analysis_id:
 
120
  analysis_id=analysis_id,
121
  error=str(exc),
122
  )
123
+ return [_MISSING_ANALYSIS], []
124
 
125
  if not substantive:
126
  missing.append(_MISSING_ANALYSIS)
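
The fail-closed behaviour of the floor after this change can be sketched as follows. This is an illustration under stated assumptions: `report_floor_sketch`, `FailingStore`, `FixedStore`, and the `is_substantive` predicate are hypothetical stand-ins, and the real "substantive" rule (a record whose analysis task succeeded) is reduced to a caller-supplied predicate.

```python
import asyncio

MISSING_ANALYSIS = "at least one completed analysis"


class FailingStore:
    """Hypothetical store whose read always fails (simulates a DB outage)."""

    async def list_for_analysis(self, analysis_id):
        raise RuntimeError("db unavailable")


class FixedStore:
    """Hypothetical store that returns a fixed list of records."""

    def __init__(self, records):
        self._records = records

    async def list_for_analysis(self, analysis_id):
        return list(self._records)


async def report_floor_sketch(analysis_id, record_store, is_substantive=bool):
    missing = []
    substantive = []
    if analysis_id:
        try:
            records = await record_store.list_for_analysis(analysis_id)
        except Exception:
            # Fail closed: a read error counts as "no analysis to report".
            return [MISSING_ANALYSIS], []
        substantive = [r for r in records if is_substantive(r)]
    if not substantive:
        missing.append(MISSING_ANALYSIS)
    return missing, substantive


if __name__ == "__main__":
    print(asyncio.run(report_floor_sketch("a1", FailingStore())))
```

With the `problem_validated` precondition gone, the error path has only one possible gap to report, which is why it can return a fixed single-item list.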
src/agents/report/schemas.py CHANGED
@@ -22,17 +22,18 @@ from ..slow_path.schemas import TaskSummary
22
 
23
 
24
  class ProblemStatement(BaseModel):
25
- """Minimal stub of Harry's Problem Statement, frozen into each report.
26
-
27
- Loose on purpose until the real PS template lands (Analysis State, upstream).
28
- A report snapshots the PS as it was at generation time.
 
 
 
 
29
  """
30
 
31
  objective: str = ""
32
- metric_direction: str = "" # "increase" | "decrease"
33
- target_metric: str = ""
34
- target_value: str = ""
35
- scope: str = ""
36
 
37
 
38
  class DataSourceRef(BaseModel):
@@ -75,6 +76,7 @@ class AnalysisReport(BaseModel):
75
  report_id: str = Field(default_factory=lambda: uuid4().hex)
76
  analysis_id: str
77
  user_id: str | None = None
 
78
  version: int
79
  generated_at: datetime
80
  # Frozen snapshots.
 
22
 
23
 
24
  class ProblemStatement(BaseModel):
25
+ """The analysis goal, frozen into each report at generation time.
26
+
27
+ Analysis-State shape: `objective` + `business_questions`,
28
+ which replaced the old single free-text problem statement. A report snapshots
29
+ the goal as it was at generation time. Class name is kept (for now) to avoid
30
+ import churn across report.py / generator.py / store.py; rename to `ReportGoal`
31
+ once the upstream AnalysisState rename (objective/business_questions) lands so
32
+ every caller migrates in one pass.
33
  """
34
 
35
  objective: str = ""
36
+ business_questions: list[str] = Field(default_factory=list)
 
 
 
37
 
38
 
39
  class DataSourceRef(BaseModel):
 
76
  report_id: str = Field(default_factory=lambda: uuid4().hex)
77
  analysis_id: str
78
  user_id: str | None = None
79
+ user_name: str | None = None # display name for "generated by"; falls back to user_id
80
  version: int
81
  generated_at: datetime
82
  # Frozen snapshots.
src/agents/report/store.py CHANGED
@@ -1,6 +1,6 @@
1
  """ReportStore — persists/reads versioned AnalysisReports (KM-644).
2
 
3
- Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSessionLocal`.
4
 
5
  Version assignment is serialized per `analysis_id` with a Postgres
6
  transaction-level advisory lock so concurrent button presses can't compute the
 
1
  """ReportStore — persists/reads versioned AnalysisReports (KM-644).
2
 
3
+ Mirrors `PostgresReportInputStore`: each call opens its own `AsyncSessionLocal`.
4
 
5
  Version assignment is serialized per `analysis_id` with a Postgres
6
  transaction-level advisory lock so concurrent button presses can't compute the
src/agents/slow_path/schemas.py CHANGED
@@ -69,7 +69,7 @@ class TaskSummary(BaseModel):
69
  class AnalysisRecord(BaseModel):
70
  # Identity. `record_id` is the unit the report cites and snapshots
71
  # (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis
72
- # session + owner and are stamped by the composition root / AnalysisStore at
73
  # persist time (they depend on the Analysis State that lives outside the slow
74
  # path), so they default to None when the Assembler first builds the record.
75
  record_id: str = Field(default_factory=lambda: uuid4().hex)
 
69
  class AnalysisRecord(BaseModel):
70
  # Identity. `record_id` is the unit the report cites and snapshots
71
  # (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis
72
+ # session + owner and are stamped by the composition root / ReportInputStore at
73
  # persist time (they depend on the Analysis State that lives outside the slow
74
  # path), so they default to None when the Assembler first builds the record.
75
  record_id: str = Field(default_factory=lambda: uuid4().hex)
src/agents/slow_path/store.py CHANGED
@@ -1,13 +1,13 @@
1
- """AnalysisStore — the seam the slow path persists its AnalysisRecord through.
2
 
3
  The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
4
  run — §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
5
  so it sits behind this seam. `generate_report` later reads records back by
6
  `analysis_id` (oldest-first) and renders from them — never from chat history.
7
 
8
- - `NullAnalysisStore` logs and stores nothing (kept for tests / when persistence
9
  is intentionally disabled).
10
- - `PostgresAnalysisStore` writes one `analysis_records` row per run in the catalog
11
  DB (Neon `dataeyond`, `settings.postgres_connstring`).
12
 
13
  `save` must never raise on the caller's path — a persistence failure must not break
@@ -23,7 +23,7 @@ from sqlalchemy import select
23
  from sqlalchemy.dialects.postgresql import insert
24
 
25
  from src.db.postgres.connection import AsyncSessionLocal
26
- from src.db.postgres.models import AnalysisRecordRow
27
  from src.middlewares.logging import get_logger
28
 
29
  from .schemas import AnalysisRecord
@@ -32,7 +32,7 @@ logger = get_logger("analysis_store")
32
 
33
 
34
  @runtime_checkable
35
- class AnalysisStore(Protocol):
36
  """Persist + read completed analyses.
37
 
38
  `save` must never raise on the caller's path. `list_for_analysis` returns the
@@ -44,12 +44,12 @@ class AnalysisStore(Protocol):
44
  async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...
45
 
46
 
47
- class NullAnalysisStore:
48
  """No-op store: logs the record, persists nothing. Reads return empty."""
49
 
50
  async def save(self, record: AnalysisRecord) -> None:
51
  logger.info(
52
- "analysis_record produced (not persisted — NullAnalysisStore)",
53
  record_id=record.record_id,
54
  plan_id=record.plan_id,
55
  n_tasks=len(record.tasks_run),
@@ -59,8 +59,8 @@ class NullAnalysisStore:
59
  return []
60
 
61
 
62
- class PostgresAnalysisStore:
63
- """Writes/reads `analysis_records` jsonb rows in the catalog DB.
64
 
65
  Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per
66
  record (vs. one-per-user for the catalog) since records accumulate per analysis.
@@ -70,7 +70,7 @@ class PostgresAnalysisStore:
70
  try:
71
  payload = record.model_dump(mode="json")
72
  async with AsyncSessionLocal() as session:
73
- stmt = insert(AnalysisRecordRow).values(
74
  id=record.record_id,
75
  analysis_id=record.analysis_id,
76
  user_id=record.user_id,
@@ -81,7 +81,7 @@ class PostgresAnalysisStore:
81
  # Re-running the same plan id-collides only if record_id repeats;
82
  # treat that as idempotent (overwrite) rather than erroring the user.
83
  stmt = stmt.on_conflict_do_update(
84
- index_elements=[AnalysisRecordRow.id],
85
  set_={"data": stmt.excluded.data},
86
  )
87
  await session.execute(stmt)
@@ -102,9 +102,9 @@ class PostgresAnalysisStore:
102
  async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
103
  async with AsyncSessionLocal() as session:
104
  result = await session.execute(
105
- select(AnalysisRecordRow.data)
106
- .where(AnalysisRecordRow.analysis_id == analysis_id)
107
- .order_by(AnalysisRecordRow.created_at.asc())
108
  )
109
  rows = result.scalars().all()
110
  return [AnalysisRecord.model_validate(row) for row in rows]
 
1
+ """ReportInputStore — the seam the slow path persists its AnalysisRecord through.
2
 
3
  The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
4
  run — §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
5
  so it sits behind this seam. `generate_report` later reads records back by
6
  `analysis_id` (oldest-first) and renders from them — never from chat history.
7
 
8
+ - `NullReportInputStore` logs and stores nothing (kept for tests / when persistence
9
  is intentionally disabled).
10
+ - `PostgresReportInputStore` writes one `report_inputs` row per run in the catalog
11
  DB (Neon `dataeyond`, `settings.postgres_connstring`).
12
 
13
  `save` must never raise on the caller's path — a persistence failure must not break
 
23
  from sqlalchemy.dialects.postgresql import insert
24
 
25
  from src.db.postgres.connection import AsyncSessionLocal
26
+ from src.db.postgres.models import ReportInputRow
27
  from src.middlewares.logging import get_logger
28
 
29
  from .schemas import AnalysisRecord
 
32
 
33
 
34
  @runtime_checkable
35
+ class ReportInputStore(Protocol):
36
  """Persist + read completed analyses.
37
 
38
  `save` must never raise on the caller's path. `list_for_analysis` returns the
 
44
  async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...
45
 
46
 
47
+ class NullReportInputStore:
48
  """No-op store: logs the record, persists nothing. Reads return empty."""
49
 
50
  async def save(self, record: AnalysisRecord) -> None:
51
  logger.info(
52
+ "analysis_record produced (not persisted — NullReportInputStore)",
53
  record_id=record.record_id,
54
  plan_id=record.plan_id,
55
  n_tasks=len(record.tasks_run),
 
59
  return []
60
 
61
 
62
+ class PostgresReportInputStore:
63
+ """Writes/reads `report_inputs` jsonb rows in the catalog DB.
64
 
65
  Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per
66
  record (vs. one-per-user for the catalog) since records accumulate per analysis.
 
70
  try:
71
  payload = record.model_dump(mode="json")
72
  async with AsyncSessionLocal() as session:
73
+ stmt = insert(ReportInputRow).values(
74
  id=record.record_id,
75
  analysis_id=record.analysis_id,
76
  user_id=record.user_id,
 
81
  # Re-running the same plan id-collides only if record_id repeats;
82
  # treat that as idempotent (overwrite) rather than erroring the user.
83
  stmt = stmt.on_conflict_do_update(
84
+ index_elements=[ReportInputRow.id],
85
  set_={"data": stmt.excluded.data},
86
  )
87
  await session.execute(stmt)
 
102
  async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
103
  async with AsyncSessionLocal() as session:
104
  result = await session.execute(
105
+ select(ReportInputRow.data)
106
+ .where(ReportInputRow.analysis_id == analysis_id)
107
+ .order_by(ReportInputRow.created_at.asc())
108
  )
109
  rows = result.scalars().all()
110
  return [AnalysisRecord.model_validate(row) for row in rows]
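
The store seam above (a `Protocol`, a no-op implementation, and a persistent one with upsert-on-`record_id`) can be illustrated without a database. This is a sketch only: the `*Sketch` classes are hypothetical, records are plain dicts instead of `AnalysisRecord`, and an insertion-ordered dict stands in for the `created_at` ordering.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class ReportInputStoreSketch(Protocol):
    async def save(self, record: dict) -> None: ...

    async def list_for_analysis(self, analysis_id: str) -> list: ...


class NullStoreSketch:
    """No-op store: persists nothing, reads return empty."""

    async def save(self, record: dict) -> None:
        return None

    async def list_for_analysis(self, analysis_id: str) -> list:
        return []


class InMemoryStoreSketch:
    """In-memory stand-in for the Postgres-backed store."""

    def __init__(self) -> None:
        self._rows: dict = {}  # record_id -> record, in first-insert order

    async def save(self, record: dict) -> None:
        # Upsert: a repeated record_id overwrites the data but keeps its slot,
        # loosely mirroring an ON CONFLICT update that only touches `data`.
        self._rows[record["record_id"]] = record

    async def list_for_analysis(self, analysis_id: str) -> list:
        return [r for r in self._rows.values() if r["analysis_id"] == analysis_id]


async def demo() -> list:
    store = InMemoryStoreSketch()
    await store.save({"record_id": "r1", "analysis_id": "a1", "v": 1})
    await store.save({"record_id": "r2", "analysis_id": "a1", "v": 1})
    await store.save({"record_id": "r1", "analysis_id": "a1", "v": 2})
    return await store.list_for_analysis("a1")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that `isinstance` against a `runtime_checkable` protocol only checks that the methods exist, not their signatures or that they are coroutines.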
src/agents/state_store.py CHANGED
@@ -5,7 +5,7 @@ The orchestrator gate + Help skill read `AnalysisState` (the locked contract in
5
  row shares its id with the chat `rooms` row — one session = one analysis = one
6
  conversation (`analysis_id == room_id`).
7
 
8
- Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSession`.
9
  """
10
 
11
  from __future__ import annotations
@@ -27,7 +27,7 @@ def _row_to_state(row: AnalysisStateRow) -> AnalysisState:
27
  analysis_title=row.analysis_title,
28
  problem_statement=row.problem_statement,
29
  problem_validated=row.problem_validated,
30
- owner_id=row.owner_id,
31
  report_id=row.report_id,
32
  created_at=row.created_at,
33
  updated_at=row.updated_at,
@@ -45,7 +45,7 @@ class AnalysisStateStore:
45
  async def ensure(
46
  self,
47
  analysis_id: str,
48
- owner_id: str,
49
  analysis_title: str = "New analysis",
50
  ) -> AnalysisState:
51
  """Get-or-create the state row for a session (idempotent, race-safe).
@@ -62,7 +62,7 @@ class AnalysisStateStore:
62
  insert(AnalysisStateRow)
63
  .values(
64
  id=analysis_id,
65
- owner_id=owner_id,
66
  analysis_title=analysis_title,
67
  problem_statement="",
68
  problem_validated=False,
@@ -78,7 +78,7 @@ class AnalysisStateStore:
78
  self,
79
  *,
80
  analysis_id: str,
81
- owner_id: str,
82
  analysis_title: str = "New analysis",
83
  problem_statement: str = "",
84
  ) -> AnalysisState:
@@ -86,7 +86,7 @@ class AnalysisStateStore:
86
  async with AsyncSessionLocal() as session:
87
  row = AnalysisStateRow(
88
  id=analysis_id,
89
- owner_id=owner_id,
90
  analysis_title=analysis_title,
91
  problem_statement=problem_statement,
92
  problem_validated=False,
 
5
  row shares its id with the chat `rooms` row — one session = one analysis = one
6
  conversation (`analysis_id == room_id`).
7
 
8
+ Mirrors `PostgresReportInputStore`: each call opens its own `AsyncSession`.
9
  """
10
 
11
  from __future__ import annotations
 
27
  analysis_title=row.analysis_title,
28
  problem_statement=row.problem_statement,
29
  problem_validated=row.problem_validated,
30
+ user_id=row.user_id,
31
  report_id=row.report_id,
32
  created_at=row.created_at,
33
  updated_at=row.updated_at,
 
45
  async def ensure(
46
  self,
47
  analysis_id: str,
48
+ user_id: str,
49
  analysis_title: str = "New analysis",
50
  ) -> AnalysisState:
51
  """Get-or-create the state row for a session (idempotent, race-safe).
 
62
  insert(AnalysisStateRow)
63
  .values(
64
  id=analysis_id,
65
+ user_id=user_id,
66
  analysis_title=analysis_title,
67
  problem_statement="",
68
  problem_validated=False,
 
78
  self,
79
  *,
80
  analysis_id: str,
81
+ user_id: str,
82
  analysis_title: str = "New analysis",
83
  problem_statement: str = "",
84
  ) -> AnalysisState:
 
86
  async with AsyncSessionLocal() as session:
87
  row = AnalysisStateRow(
88
  id=analysis_id,
89
+ user_id=user_id,
90
  analysis_title=analysis_title,
91
  problem_statement=problem_statement,
92
  problem_validated=False,
src/api/v1/analysis.py CHANGED
@@ -30,7 +30,7 @@ def _serialize_state(row: AnalysisStateRow, data_source_ids: list[str]) -> dict:
30
  "analysis_title": row.analysis_title,
31
  "problem_statement": row.problem_statement,
32
  "problem_validated": row.problem_validated,
33
- "owner_id": row.owner_id,
34
  "report_id": row.report_id,
35
  "data_source_ids": data_source_ids,
36
  "created_at": row.created_at.isoformat() if row.created_at else None,
@@ -94,7 +94,7 @@ async def create_analysis(
94
  # id, created atomically in one transaction.
95
  state_row = AnalysisStateRow(
96
  id=analysis_id,
97
- owner_id=request.user_id,
98
  analysis_title=request.analysis_title,
99
  problem_statement=request.problem_statement,
100
  problem_validated=False,
@@ -144,7 +144,7 @@ async def list_analyses(user_id: str, db: AsyncSession = Depends(get_db)):
144
  """
145
  result = await db.execute(
146
  select(AnalysisStateRow)
147
- .where(AnalysisStateRow.owner_id == user_id)
148
  .order_by(AnalysisStateRow.updated_at.desc())
149
  )
150
  rows = result.scalars().all()
 
30
  "analysis_title": row.analysis_title,
31
  "problem_statement": row.problem_statement,
32
  "problem_validated": row.problem_validated,
33
+ "user_id": row.user_id,
34
  "report_id": row.report_id,
35
  "data_source_ids": data_source_ids,
36
  "created_at": row.created_at.isoformat() if row.created_at else None,
 
94
  # id, created atomically in one transaction.
95
  state_row = AnalysisStateRow(
96
  id=analysis_id,
97
+ user_id=request.user_id,
98
  analysis_title=request.analysis_title,
99
  problem_statement=request.problem_statement,
100
  problem_validated=False,
 
144
  """
145
  result = await db.execute(
146
  select(AnalysisStateRow)
147
+ .where(AnalysisStateRow.user_id == user_id)
148
  .order_by(AnalysisStateRow.updated_at.desc())
149
  )
150
  rows = result.scalars().all()
src/api/v1/report.py CHANGED
@@ -45,10 +45,37 @@ async def _load_state(analysis_id: str):
45
 
46
 
47
  def _problem_statement_from(state) -> ProblemStatement:
48
- """Map the analysis's free-text problem statement into the report's structured PS."""
49
- if state is None or not state.problem_statement:
 
 
 
 
 
 
50
  return ProblemStatement()
51
- return ProblemStatement(objective=state.problem_statement)
52
 
53
 
54
  async def _record_report_on_state(analysis_id: str, report_id: str) -> None:
@@ -109,8 +136,9 @@ async def generate_report(
109
 
110
  try:
111
  problem_statement = _problem_statement_from(state)
 
112
  report = await _generator.generate(
113
- analysis_id, user_id, problem_statement=problem_statement
114
  )
115
  except ReportError as e:
116
  raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e
@@ -121,6 +149,14 @@ async def generate_report(
121
  detail=f"Report generation failed: {e}",
122
  ) from e
123
 
 
 
 
 
 
 
 
 
124
  try:
125
  saved = await _store.save(report)
126
  except Exception as e:
 
45
 
46
 
47
  def _problem_statement_from(state) -> ProblemStatement:
48
+ """Freeze the analysis goal into the report's snapshot.
49
+
50
+ Bridges the 2026-06-24 AnalysisState rework: prefer the new `objective` +
51
+ `business_questions` fields when the state carries them, else fall back to the
52
+ legacy free-text `problem_statement`, so the report works both before and after
53
+ the state-model migration lands (#4 / dedorch #3).
54
+ """
55
+ if state is None:
56
  return ProblemStatement()
57
+ objective = getattr(state, "objective", "") or getattr(state, "problem_statement", "") or ""
58
+ business_questions = list(getattr(state, "business_questions", []) or [])
59
+ return ProblemStatement(objective=objective, business_questions=business_questions)
60
+
61
+
62
+ async def _resolve_user_name(user_id: str) -> str | None:
63
+ """Best-effort display name (`users.fullname`) for the report's "generated by".
64
+
65
+ Never-throw: a missing user or read error falls back to None, so the generator
66
+ shows the raw `user_id`. Resolving it here keeps the report self-contained (#19);
67
+ swap to a Go-passed display name later if the team prefers.
68
+ """
69
+ try:
70
+ from src.db.postgres.connection import AsyncSessionLocal
71
+ from src.db.postgres.models import User
72
+
73
+ async with AsyncSessionLocal() as session:
74
+ user = await session.get(User, user_id)
75
+ return user.fullname if user is not None else None
76
+ except Exception as e: # noqa: BLE001 — never block a report on the name lookup
77
+ logger.warning("report: user name resolve failed", user_id=user_id, error=str(e))
78
+ return None
79
 
80
 
81
  async def _record_report_on_state(analysis_id: str, report_id: str) -> None:
 
136
 
137
  try:
138
  problem_statement = _problem_statement_from(state)
139
+ user_name = await _resolve_user_name(user_id)
140
  report = await _generator.generate(
141
+ analysis_id, user_id, problem_statement=problem_statement, user_name=user_name
142
  )
143
  except ReportError as e:
144
  raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e
 
149
  detail=f"Report generation failed: {e}",
150
  ) from e
151
 
152
+ # ⚠️ TRANSITIONAL — Go is to own ALL writes:
153
+ # the report becomes a content-only skill (FE → Go → Python) and Go persists to the
154
+ # `reports`/`analyses` tables. Until Go exposes those write endpoints, Python still
155
+ # self-persists here:
156
+ # _store.save(report) → inserts the versioned `reports` row
157
+ # _record_report_on_state(...) → writes report_id back onto the `analyses` row
158
+ # Remove both (return `report` content only) once Go's report-write + state-write
159
+ # endpoints land.
160
  try:
161
  saved = await _store.save(report)
162
  except Exception as e:
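
The bridging logic in `_problem_statement_from` relies on `getattr` with defaults so that either state shape works. A minimal sketch, assuming a dataclass stand-in (`GoalSnapshot`) for the Pydantic `ProblemStatement` and `SimpleNamespace` objects for the two state shapes; both names are illustrative only:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class GoalSnapshot:
    """Hypothetical stand-in for the report's ProblemStatement model."""

    objective: str = ""
    business_questions: list = field(default_factory=list)


def goal_from_state(state) -> GoalSnapshot:
    if state is None:
        return GoalSnapshot()
    # Prefer the new field; fall back to the legacy free-text one; never None.
    objective = (
        getattr(state, "objective", "")
        or getattr(state, "problem_statement", "")
        or ""
    )
    # Copy into a fresh list so the snapshot does not alias the state's list.
    questions = list(getattr(state, "business_questions", []) or [])
    return GoalSnapshot(objective=objective, business_questions=questions)


if __name__ == "__main__":
    legacy = SimpleNamespace(problem_statement="Reduce churn")
    migrated = SimpleNamespace(
        objective="Grow ARPU", business_questions=["Which segment?"]
    )
    print(goal_from_state(legacy))
    print(goal_from_state(migrated))
```

The `or ""` and `or []` guards also cover attributes that exist but hold `None`, which a plain `getattr` default would not.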
src/api/v1/tools.py CHANGED
@@ -4,18 +4,17 @@ Exposes the agent's user-invocable slash-command catalog so the Golang backend
4
  can cache it and the frontend can render its "/" command menu WITHOUT calling the
5
  AI agent for every list (Golang GETs + caches `list_tools`).
6
 
7
- Scope confirmed: the catalog is the UNIFIED set of
8
- everything the user can invoke via `/`
9
- spanning what the team internally splits into skills + analytics tools +
10
- data-access tools. Naming: verb-first, kebab-case, `/` prefix.
11
-
12
- Each command maps 1:1 to a real internal tool/intent `name` (the dispatch key);
13
- the granular data-access tools (check_data, check_knowledge, retrieve_data,
14
- retrieve_knowledge) are listed separately.
15
- NOTE: the merged `check` intent still exists for natural-language routing; it is
16
- NOT a slash command; slash invocation bypasses the router to the tool directly.
17
- Deferred analytics tools (comparison/contribution/profile/segment) are NOT
18
- exposed (not wired to the Planner).
19
 
20
  Stateless and deterministic — safe for the Golang backend to cache.
21
  """
@@ -49,6 +48,16 @@ class ListToolsResponse(BaseModel):
49
  # Single source of truth for the FE slash-command catalog. Order = display order.
50
  # Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the
51
  # internal route/tool name used by the orchestrator.
52
  _COMMAND_CATALOG: list[CommandResponse] = [
53
  CommandResponse(
54
  command="/help",
@@ -57,60 +66,67 @@ _COMMAND_CATALOG: list[CommandResponse] = [
57
  description="Show what the assistant can do and guide your next step.",
58
  ),
59
  CommandResponse(
60
- command="/problem-statement",
61
- name="problem_statement",
62
  type="skill",
63
- description="Define and validate your analysis goal (objective + metric) "
64
- "before exploring data.",
65
- ),
66
- CommandResponse(
67
- command="/analyze-descriptive",
68
- name="analyze_descriptive",
69
- type="analytics",
70
- description="Summary statistics for selected columns (count, mean, min, max, …).",
71
- ),
72
- CommandResponse(
73
- command="/analyze-aggregate",
74
- name="analyze_aggregate",
75
- type="analytics",
76
- description="Group and aggregate values (sum, count, average) by dimension.",
77
- ),
78
- CommandResponse(
79
- command="/analyze-correlation",
80
- name="analyze_correlation",
81
- type="analytics",
82
- description="Correlation strength between numeric columns.",
83
- ),
84
- CommandResponse(
85
- command="/analyze-trend",
86
- name="analyze_trend",
87
- type="analytics",
88
- description="Trend of a value over time at a chosen frequency.",
89
- ),
90
- CommandResponse(
91
- command="/check-data",
92
- name="check_data",
93
- type="data_access",
94
- description="Inventory of the available structured data sources.",
95
- ),
96
- CommandResponse(
97
- command="/check-knowledge",
98
- name="check_knowledge",
99
- type="data_access",
100
- description="Inventory of the available knowledge / uploaded documents.",
101
- ),
102
- CommandResponse(
103
- command="/retrieve-data",
104
- name="retrieve_data",
105
- type="data_access",
106
- description="Pull rows from a structured source for analysis.",
107
- ),
108
- CommandResponse(
109
- command="/retrieve-knowledge",
110
- name="retrieve_knowledge",
111
- type="data_access",
112
- description="Retrieve relevant passages from your uploaded documents.",
113
  ),
114
  ]
115
 
116
 
 
4
  can cache it and the frontend can render its "/" command menu WITHOUT calling the
5
  AI agent for every list (Golang GETs + caches `list_tools`).
6
 
7
+ Scope (2026-06-24, KM-674): the FE slash-command catalog is now just the two
8
+ FE-callable skills, `/help` and `/report`. Naming: verb-first, kebab-case, `/`
9
+ prefix; each command maps 1:1 to a real internal tool/intent `name` (the dispatch
10
+ key).
11
+
12
+ The analytics + data-access tools (analyze_*, check_*, retrieve_*) and the retired
13
+ `/problem-statement` skill are kept COMMENTED in the catalog below, NOT deleted —
14
+ they still exist and run via the router/Planner, and check_data/check_knowledge are
15
+ served by Golang; they are simply not surfaced in the FE slash menu for now. Slash
16
+ invocation bypasses the router to the tool directly, so re-exposing one is a matter
17
+ of un-commenting its entry.
 
18
 
19
  Stateless and deterministic — safe for the Golang backend to cache.
20
  """
 
48
  # Single source of truth for the FE slash-command catalog. Order = display order.
49
  # Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the
50
  # internal route/tool name used by the orchestrator.
51
+ #
52
+ # 2026-06-24 (KM-674 batch): the FE-callable skills are ONLY /help + /report. The rest
53
+ # below are COMMENTED OUT — NOT deleted — on purpose:
54
+ # - /problem-statement is retired (objective + business_questions now live in the
55
+ # New-Analysis form, not a slash skill).
56
+ # - check_data / check_knowledge stay available but are served by Golang, not exposed
57
+ # in the FE slash menu.
58
+ # - the analytics + data-access tools still exist and run via the router/Planner; they
59
+ # are simply not surfaced as FE slash commands here.
60
+ # Re-enable any line if the FE slash menu is later widened back out.
61
  _COMMAND_CATALOG: list[CommandResponse] = [
62
  CommandResponse(
63
  command="/help",
 
66
  description="Show what the assistant can do and guide your next step.",
67
  ),
68
  CommandResponse(
69
+ command="/report",
70
+ name="report",
71
  type="skill",
72
+ description="Generate a versioned analysis report (background, EDA, "
73
+ "key findings, insights).",
74
  ),
75
+ # CommandResponse(
76
+ # command="/problem-statement",
77
+ # name="problem_statement",
78
+ # type="skill",
79
+ # description="Define and validate your analysis goal (objective + metric) "
80
+ # "before exploring data.",
81
+ # ),
82
+ # CommandResponse(
83
+ # command="/analyze-descriptive",
84
+ # name="analyze_descriptive",
85
+ # type="analytics",
86
+ # description="Summary statistics for selected columns (count, mean, min, max, …).",
87
+ # ),
88
+ # CommandResponse(
89
+ # command="/analyze-aggregate",
90
+ # name="analyze_aggregate",
91
+ # type="analytics",
92
+ # description="Group and aggregate values (sum, count, average) by dimension.",
93
+ # ),
94
+ # CommandResponse(
95
+ # command="/analyze-correlation",
96
+ # name="analyze_correlation",
97
+ # type="analytics",
98
+ # description="Correlation strength between numeric columns.",
99
+ # ),
100
+ # CommandResponse(
101
+ # command="/analyze-trend",
102
+ # name="analyze_trend",
103
+ # type="analytics",
104
+ # description="Trend of a value over time at a chosen frequency.",
105
+ # ),
106
+ # CommandResponse(
107
+ # command="/check-data",
108
+ # name="check_data",
109
+ # type="data_access",
110
+ # description="Inventory of the available structured data sources.",
111
+ # ),
112
+ # CommandResponse(
113
+ # command="/check-knowledge",
114
+ # name="check_knowledge",
115
+ # type="data_access",
116
+ # description="Inventory of the available knowledge / uploaded documents.",
117
+ # ),
118
+ # CommandResponse(
119
+ # command="/retrieve-data",
120
+ # name="retrieve_data",
121
+ # type="data_access",
122
+ # description="Pull rows from a structured source for analysis.",
123
+ # ),
124
+ # CommandResponse(
125
+ # command="/retrieve-knowledge",
126
+ # name="retrieve_knowledge",
127
+ # type="data_access",
128
+ # description="Retrieve relevant passages from your uploaded documents.",
129
+ # ),
130
  ]
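
To make the trimmed catalog concrete, here is a minimal sketch of how a slash-command lookup over the two remaining entries might behave. `CommandResponse` below is a hypothetical dataclass stand-in for the service's real model, `find_command` is an illustrative helper that does not appear in the diff, and the `name`/`type` values for `/help` are assumed because the diff elides those lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommandResponse:
    """Hypothetical stand-in for the service's real response model."""
    command: str
    name: str
    type: str
    description: str


# Only the two entries left active in the catalog after this change.
COMMAND_CATALOG = [
    CommandResponse(
        command="/help",
        name="help",    # assumed: these two fields are elided in the diff
        type="skill",   # assumed
        description="Show what the assistant can do and guide your next step.",
    ),
    CommandResponse(
        command="/report",
        name="report",
        type="skill",
        description="Generate a versioned analysis report (background, EDA, "
        "key findings, insights).",
    ),
]


def find_command(text: str) -> Optional[CommandResponse]:
    """Return the catalog entry matching the first token of `text`, or None."""
    stripped = text.strip()
    if not stripped:
        return None
    token = stripped.split(maxsplit=1)[0]
    for entry in COMMAND_CATALOG:
        if entry.command == token:
            return entry
    return None
```

With this shape, a commented-out command such as `/analyze-trend` simply fails the lookup in the slash menu, which matches the intent stated in the comments: the tool still exists behind the router, it is just not surfaced as a slash command.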
131
 
132
 
src/config/prompts/help.md CHANGED
@@ -1,5 +1,8 @@
1
- <!-- help.md · v1 · Help skill prompt. Bump to v2 (don't silently overwrite) on major change,
2
- e.g. when real UI steps land from the frontend. See checkpoint 2026-06-18. -->
 
 
 
3
 
4
  You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
5
  instruction sheet that comes with a board game: your only job is to tell the user
@@ -12,8 +15,8 @@ You are given context, never raw user prose to analyze:
12
 
13
  - **`analysis_state`** — the current per-analysis state. Fields you use:
14
  - `analysis_title` — what this analysis is called.
15
- - `problem_statement` — the user's goal (may be empty/weak; it is optional at creation).
16
- - `problem_validated` (bool): **the gate.** `false` = the goal still needs work; `true` = the goal is set and analysis is unlocked.
17
  - `report_id` — `0`/absent means no report has ever been generated.
18
  - **`chat_history`** — the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself.
19
  - **`report_ready`** — a **deterministic** signal computed for you (NOT your judgment):
@@ -35,35 +38,34 @@ Keep it short. Lead with the next step; don't recap everything.
35
 
36
  ## State-tiered guidance
37
 
38
- Pick the branch that matches `analysis_state` + `report_ready`:
 
 
39
 
40
- ### A. `problem_validated == false` → fix the goal first
41
- The user can't get good analysis without a clear goal. Steer them to define or sharpen the
42
- problem statement.
43
- - If `problem_statement` is empty: encourage them to state what they want to find out, and mention the AI can help — they can run **`/problem_statement`** (or just describe their goal in chat).
44
- - If `problem_statement` exists but is vague: gently push for something more **measurable and concrete** (a target, a metric, a timeframe), grounded in their `analysis_title` and the data they've bound. Give one short example of a sharper version.
45
- - Do **not** push analysis or reports yet.
46
-
47
- ### B. `problem_validated == true`, little/no analysis yet → orient to analysis
48
- Tell them the goal is set and they can start asking questions about their data. Give the **how**:
49
  - Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?").
50
- - **Tie suggestions back to their `problem_statement`** so the analysis stays relevant — don't suggest random analyses.
51
  - **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer.
52
  - You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple.
53
 
54
- ### C. `problem_validated == true`, analysis under way, `report_ready.ready == false` → close the gaps
55
  They've started but there isn't enough yet for a report. Point at `report_ready.missing` and
56
  recommend the specific next questions that would fill those gaps (phrase them as questions
57
- the user can ask), still anchored to the problem statement.
58
 
59
- ### D. `problem_validated == true` and `report_ready.ready == true` → nudge toward the report
60
  There's enough to report. Encourage them to generate it. Report can be triggered **two ways**:
61
- the **`/generate report`** skill **or** the report button — mention both so it feels natural.
62
  Do not over-promise the report's depth.
63
 
 
 
 
 
64
  ## How-to phrasing (degrade gracefully)
65
 
66
- - **Via chat / skills** — write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/problem_statement`", "run `/generate report`").
67
  - **Via the UI (buttons/menus)** — the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)*
68
  - If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics.
69
 
@@ -84,24 +86,16 @@ English). A few sentences is usually enough.
84
  ## Examples
85
 
86
  ```
87
- State: problem_validated=false, problem_statement=""
88
- → "Looks like we haven't set a goal yet. Tell me what you want to find out — for example,
89
- 'reduce churn next quarter' — or run /problem_statement and I'll help you shape it."
90
-
91
- State: problem_validated=false, problem_statement="make sales better"
92
- → "Your goal is a good start but a bit broad. Let's make it measurable — e.g. 'grow north-region
93
- revenue by 10% this quarter.' Run /problem_statement and we'll refine it together."
94
-
95
- State: problem_validated=true, chat_history nearly empty
96
  → "Your goal is set — you can start exploring now. Try a basic question first, like
97
  'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
98
- what's driving your goal."
99
 
100
- State: problem_validated=true, report_ready.ready=false, missing=["no comparison over time"]
101
  → "Good progress. Before a report, it's worth looking at change over time — try asking
102
  'How does this quarter compare to last?' Once we have that, we can put the report together."
103
 
104
- State: problem_validated=true, report_ready.ready=true
105
- → "You've covered enough to summarize. You can generate your report now — run /generate report
106
  or use the report option to create it."
107
  ```
 
1
+ <!-- help.md · v2 · Help skill prompt. v2 (2026-06-24, KM-652): removed the problem_statement
2
+ skill + the problem_validated gate; the goal (objective + business_questions) is now set
3
+ in the New Analysis form at onboarding, so Help no longer steers users to define/validate a
4
+ goal in chat. Bump to v3 (don't silently overwrite) on the next major change (e.g. real UI
5
+ steps from the frontend). -->
6
 
7
  You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
8
  instruction sheet that comes with a board game: your only job is to tell the user
 
15
 
16
  - **`analysis_state`** — the current per-analysis state. Fields you use:
17
  - `analysis_title` — what this analysis is called.
18
+ - `objective` — the user's goal (set in the New Analysis form at onboarding).
19
+ - `business_questions` — the specific questions the user wants answered (set in the form).
20
  - `report_id` — `0`/absent means no report has ever been generated.
21
  - **`chat_history`** — the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself.
22
  - **`report_ready`** — a **deterministic** signal computed for you (NOT your judgment):
 
38
 
39
  ## State-tiered guidance
40
 
41
+ The goal (`objective` + `business_questions`) is already set at onboarding, so your job is to
42
+ move the user *through* the analysis — not to define the goal. Pick the branch that matches
43
+ `analysis_state` + `report_ready`:
44
 
45
+ ### A. Little/no analysis yet → orient to analysis
46
+ Tell them they can start asking questions about their data, and give the **how**:
 
 
 
 
 
 
 
47
  - Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?").
48
+ - **Tie suggestions back to their `objective` and `business_questions`** so the analysis stays relevant — don't suggest random analyses.
49
  - **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer.
50
  - You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple.
51
 
52
+ ### B. Analysis under way, `report_ready.ready == false` → close the gaps
53
  They've started but there isn't enough yet for a report. Point at `report_ready.missing` and
54
  recommend the specific next questions that would fill those gaps (phrase them as questions
55
+ the user can ask), still anchored to the objective and business questions.
56
 
57
+ ### C. `report_ready.ready == true` → nudge toward the report
58
  There's enough to report. Encourage them to generate it. Report can be triggered **two ways**:
59
+ the **`/report`** skill **or** the report button — mention both so it feels natural.
60
  Do not over-promise the report's depth.
61
 
62
+ > Edge case: if `objective` looks empty (unusual — it's required at onboarding), don't push a
63
+ > chat skill to fix it; gently suggest they set the objective + business questions in the New
64
+ > Analysis form.
65
+
66
  ## How-to phrasing (degrade gracefully)
67
 
68
+ - **Via chat / skills** — write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/report`").
69
  - **Via the UI (buttons/menus)** — the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)*
70
  - If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics.
71
 
 
86
  ## Examples
87
 
88
  ```
89
+ State: chat_history nearly empty
 
 
 
 
 
 
 
 
90
  → "Your goal is set — you can start exploring now. Try a basic question first, like
91
  'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
92
+ what's driving your objective."
93
 
94
+ State: report_ready.ready=false, missing=["no comparison over time"]
95
  → "Good progress. Before a report, it's worth looking at change over time — try asking
96
  'How does this quarter compare to last?' Once we have that, we can put the report together."
97
 
98
+ State: report_ready.ready=true
99
+ → "You've covered enough to summarize. You can generate your report now — run /report
100
  or use the report option to create it."
101
  ```
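
The v2 tiers reduce to a short decision order, sketched below for readers wiring tests around the Help skill. In the service the model picks the branch by reading the prompt; `pick_help_branch` is a hypothetical helper, and the "one message or fewer" threshold for "little/no analysis yet" is an assumption made only for illustration.

```python
def pick_help_branch(chat_history: list, report_ready: dict) -> str:
    """Return 'A', 'B', or 'C' following the v2 state-tiered guidance.

    C: report_ready.ready is true      -> nudge toward the report
    A: little/no analysis yet          -> orient to analysis
    B: analysis under way, not ready   -> close the gaps
    """
    if report_ready.get("ready", False):
        return "C"
    # Assumed threshold: the prompt leaves "little/no analysis" to judgment.
    if len(chat_history) <= 1:
        return "A"
    return "B"
```

Checking `report_ready.ready` first mirrors the prompt's treatment of it as a deterministic signal rather than something to be inferred from the conversation.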
src/config/prompts/intent_router.md CHANGED
@@ -7,7 +7,7 @@ Return three fields:
7
  - **`intent`** — exactly one of:
8
  - `chat` — conversational, no data needed: greetings, farewells, thanks, "how are you", "what can you do", small talk.
9
  - `help` — the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?").
10
- - `problem_statement`: the user wants to **define or refine the analysis goal** (the business problem, objectives, what to increase/decrease, targets/success metrics), or is answering questions about the goal.
11
  - `check` — the user wants an **inventory** of what they have: "what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset". This is metadata/listing, not analysis.
12
  - `unstructured_flow` — the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document.
13
  - `structured_flow` — the user asks an **analytical question over their data**: counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources. This routes to the slow analytical path.
@@ -18,16 +18,14 @@ Return three fields:
18
 
19
  1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`.
20
  2. "What do I do next / how do I proceed / where do I start" → `help`.
21
- 3. The user states or refines a goal, objective, target, or success metric, or answers a goal-defining question → `problem_statement`.
22
- 4. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`.
23
- 5. A question answerable from document prose (a topic, concept, feature, explanation, summary, or factual knowledge), even without naming a document → `unstructured_flow`.
24
- 6. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`.
25
 
26
  ## Disambiguation (the boundaries that matter)
27
 
28
  - **`check` vs `structured_flow`** — "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`.
29
  - **`unstructured_flow` vs `structured_flow`** — pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When in doubt between "analytical AND also needs document context" → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation.
30
- - **`help` vs `problem_statement`** — "what's next?" → `help`; "here is my goal / let's define the objective" → `problem_statement`.
31
  - **`chat` vs everything else** — only use `chat` when there is no task and no data question at all.
32
 
33
  ## Rewriting follow-ups
@@ -58,16 +56,6 @@ User: "Okay I uploaded my data, what do I do next?"
58
  User: "How does this work? Where should I start?"
59
  → intent="help", rewritten_query=null, confidence=0.9
60
 
61
- User: "I want to reduce customer churn next quarter, target under 5%."
62
- → intent="problem_statement",
63
- rewritten_query="Define the analysis goal: reduce customer churn next quarter to under 5%.",
64
- confidence=0.9
65
-
66
- User: "My goal is to grow revenue in the north region."
67
- → intent="problem_statement",
68
- rewritten_query="Define the analysis goal: grow revenue in the north region.",
69
- confidence=0.88
70
-
71
  User: "What data do I have?"
72
  → intent="check", rewritten_query="What data sources do I have?", confidence=0.95
73
 
@@ -113,7 +101,7 @@ User: "And in March?"
113
 
114
  ## Constraints
115
 
116
- - Pick exactly one `intent`. Do not invent values outside the six listed.
117
  - Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved.
118
  - Do not refuse — refusal happens later in guardrails. Just classify.
119
  - One JSON object as output; no prose, no markdown.
 
7
  - **`intent`** — exactly one of:
8
  - `chat` — conversational, no data needed: greetings, farewells, thanks, "how are you", "what can you do", small talk.
9
  - `help` — the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?").
10
+ <!-- `problem_statement` intent removed 2026-06-24: the analysis goal is now two user-entered fields (objective + business_questions) captured at onboarding, with no agent validation. -->
11
  - `check` — the user wants an **inventory** of what they have: "what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset". This is metadata/listing, not analysis.
12
  - `unstructured_flow` — the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document.
13
  - `structured_flow` — the user asks an **analytical question over their data**: counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources. This routes to the slow analytical path.
 
18
 
19
  1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`.
20
  2. "What do I do next / how do I proceed / where do I start" → `help`.
21
+ 3. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`.
22
+ 4. A question answerable from document prose (a topic, concept, feature, explanation, summary, or factual knowledge), even without naming a document → `unstructured_flow`.
23
+ 5. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`.
 
24
 
25
  ## Disambiguation (the boundaries that matter)
26
 
27
  - **`check` vs `structured_flow`** — "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`.
28
  - **`unstructured_flow` vs `structured_flow`** — pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When in doubt between "analytical AND also needs document context" → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation.
 
29
  - **`chat` vs everything else** — only use `chat` when there is no task and no data question at all.
30
 
31
  ## Rewriting follow-ups
 
56
  User: "How does this work? Where should I start?"
57
  → intent="help", rewritten_query=null, confidence=0.9
58
 
 
 
 
 
 
 
 
 
 
 
59
  User: "What data do I have?"
60
  → intent="check", rewritten_query="What data sources do I have?", confidence=0.95
61
 
 
101
 
102
  ## Constraints
103
 
104
+ - Pick exactly one `intent`. Do not invent values outside the five listed.
105
  - Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved.
106
  - Do not refuse — refusal happens later in guardrails. Just classify.
107
  - One JSON object as output; no prose, no markdown.
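
Since the router now has five intents instead of six, a caller may want to reject a stale `problem_statement` classification explicitly. The sketch below is one way to validate the router's JSON output; `parse_router_output` is a hypothetical helper, and the 0 to 1 range for `confidence` is an assumption inferred from the examples in the prompt.

```python
import json

# The five intents listed in the prompt after this change.
ALLOWED_INTENTS = {
    "chat",
    "help",
    "check",
    "unstructured_flow",
    "structured_flow",
}


def parse_router_output(raw: str) -> dict:
    """Parse the router's single JSON object and validate its three fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("router output must be a JSON object")

    intent = data.get("intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")

    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:  # assumed range
        raise ValueError(f"confidence out of range: {confidence}")

    return {
        "intent": intent,
        "rewritten_query": data.get("rewritten_query"),  # may be None
        "confidence": confidence,
    }
```

Failing loudly on an unknown intent keeps a model that still emits the retired label from being dispatched silently to a handler that no longer exists.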
src/config/prompts/report_summary.md CHANGED
@@ -1,10 +1,11 @@
1
  You are a senior data analyst writing the **executive summary** of an analysis report.
2
 
3
- You are given the Problem Statement and a list of already-finalized findings and caveats drawn from completed analyses. Write a concise executive summary (3–5 sentences) that synthesizes those findings in relation to the stated goal.
4
 
5
  Rules:
6
  - Synthesize and prioritize — lead with the most decision-relevant finding.
7
  - Do NOT introduce any number, fact, or claim that is not present in the findings. You are summarizing, not analyzing.
8
- - Do NOT simply restate every finding; connect them into a narrative and say what they mean for the goal.
9
  - If the findings are thin or inconclusive, say so plainly rather than overstating.
10
- - Plain business language, prose only — no headings, no bullet lists.
 
 
1
  You are a senior data analyst writing the **executive summary** of an analysis report.
2
 
3
+ You are given the analysis Objective and its Business questions, plus a list of already-finalized findings and caveats drawn from completed analyses. Write a concise executive summary (3–5 sentences) that synthesizes those findings in relation to the objective and, where the findings allow, the business questions.
4
 
5
  Rules:
6
  - Synthesize and prioritize — lead with the most decision-relevant finding.
7
  - Do NOT introduce any number, fact, or claim that is not present in the findings. You are summarizing, not analyzing.
8
+ - Do NOT simply restate every finding; connect them into a narrative and say what they mean for the objective.
9
  - If the findings are thin or inconclusive, say so plainly rather than overstating.
10
+ - Plain business language. Write **prose only — no headings, no bullet lists** (the report already supplies the section structure and a Key Findings list below this summary; do not duplicate them).
11
+ - You MAY use light inline markdown for emphasis within the prose — `**bold**` for the most decision-relevant figure or term, `*italic*` sparingly. Keep it subtle; do not bold whole sentences.
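
The prose-only rule above can be checked mechanically on the model's output. The following is a minimal sketch, not part of the service: `prose_only_violations` is a hypothetical helper that flags Markdown headings and list items while leaving inline `**bold**` and `*italic*` alone.

```python
import re

# A Markdown ATX heading: up to three spaces, one to six '#', then whitespace.
_HEADING = re.compile(r"^\s{0,3}#{1,6}\s")
# A bullet or numbered list item: marker followed by whitespace.
_LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")


def prose_only_violations(summary: str) -> list[str]:
    """Return a description of each line that breaks the prose-only rule."""
    problems = []
    for number, line in enumerate(summary.splitlines(), start=1):
        if _HEADING.match(line):
            problems.append(f"line {number}: heading")
        elif _LIST_ITEM.match(line):
            problems.append(f"line {number}: list item")
    return problems
```

A line that begins with `**bold**` or `*italic*` is not flagged, because the list pattern requires whitespace right after the marker. One known limitation is that a sentence starting with a number and a period (for example a year) would be reported as a list item.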
src/config/settings.py CHANGED
@@ -24,11 +24,10 @@ class Settings(BaseSettings):
24
  # real source lands, so this stays opt-in.
25
  enable_slow_path: bool = Field(alias="enable_slow_path", default=False)
26
 
27
- # Apply the deterministic gate (problem_validated) before dispatch: redirect
28
- # `structured_flow` to `problem_statement` until the analysis is validated. Off
29
- # by default: legacy `rooms` have no `analysis_states` row, so it would gate
30
- # everything. Flip ENABLE_GATE=true once the frontend creates analyses via
31
- # /analysis/create.
32
  enable_gate: bool = Field(alias="enable_gate", default=False)
33
 
34
  # Database
 
24
  # real source lands, so this stays opt-in.
25
  enable_slow_path: bool = Field(alias="enable_slow_path", default=False)
26
 
27
+ # DEPRECATED 2026-06-24: the problem_validated gate was removed (the goal is now
28
+ # user-entered objective + business_questions, no agent validation). This flag no
29
+ # longer has any effect; the gate call site in ChatHandler is commented out. Kept
30
+ # to avoid .env churn; remove once no environment references it.
 
31
  enable_gate: bool = Field(alias="enable_gate", default=False)
32
 
33
  # Database
src/db/postgres/init_db.py CHANGED
@@ -4,7 +4,7 @@ from sqlalchemy import text
4
  from src.db.postgres.connection import engine, Base
5
  from src.db.postgres.models import (
6
  AnalysisDataSourceRow,
7
- AnalysisRecordRow,
8
  AnalysisReportRow,
9
  AnalysisStateRow,
10
  Catalog,
 
4
  from src.db.postgres.connection import engine, Base
5
  from src.db.postgres.models import (
6
  AnalysisDataSourceRow,
7
+ ReportInputRow,
8
  AnalysisReportRow,
9
  AnalysisStateRow,
10
  Catalog,
src/db/postgres/models.py CHANGED
@@ -127,7 +127,7 @@ class Catalog(Base):
127
  updated_at = Column(DateTime(timezone=True), onupdate=func.now())
128
 
129
 
130
- class AnalysisRecordRow(Base):
131
  """One row per completed slow-path analysis (the report's source of truth).
132
 
133
  `data` holds the full Pydantic AnalysisRecord
@@ -138,11 +138,23 @@ class AnalysisRecordRow(Base):
138
 
139
  `analysis_id` is nullable until the Analysis State (owned upstream) is wired
140
  into the slow path; records still persist (and carry `user_id`) before then.
 
 
 
 
 
 
 
 
 
 
141
  """
142
- __tablename__ = "analysis_records"
143
 
144
- id = Column(String, primary_key=True) # AnalysisRecord.record_id
145
- analysis_id = Column(String, index=True) # FK to the analysis session (nullable for now)
 
 
146
  user_id = Column(String, nullable=False, index=True)
147
  plan_id = Column(String, nullable=False)
148
  data = Column(JSONB, nullable=False)
@@ -169,23 +181,39 @@ class AnalysisReportRow(Base):
169
 
170
 
171
  class AnalysisStateRow(Base):
172
- """Per-analysis session state — the dedorch `analysis` table (Go-owned migration).
173
 
174
  One session = one analysis = one conversation; `id` is the shared session id
175
- (canonical UUID). The orchestrator gate + Help skill read this every turn;
176
- `problem_validated` gates structured analysis; the Problem Statement skill flips
177
- it; `report_id` is null until a report exists. `id`/`report_id` are Postgres
178
- `uuid` in dedorch, so they bind as UUID (canonical-string in/out). Class name
179
- kept as `AnalysisStateRow`; only the table + id types changed for dedorch.
 
 
 
 
 
 
 
 
 
 
 
180
  """
181
- __tablename__ = "analysis"
182
 
183
  id = Column(UUID(as_uuid=False), primary_key=True) # shared session id (uuid)
184
  analysis_title = Column(String, nullable=False, default="New analysis")
185
  problem_statement = Column(Text, nullable=False, default="")
186
  problem_validated = Column(Boolean, nullable=False, default=False)
187
- owner_id = Column(String, nullable=False, index=True)
188
  report_id = Column(UUID(as_uuid=False), nullable=True)
 
 
 
 
 
189
  created_at = Column(DateTime(timezone=True), server_default=func.now())
190
  updated_at = Column(
191
  DateTime(timezone=True), server_default=func.now(), onupdate=func.now()
 
127
  updated_at = Column(DateTime(timezone=True), onupdate=func.now())
128
 
129
 
130
+ class ReportInputRow(Base):
131
  """One row per completed slow-path analysis (the report's source of truth).
132
 
133
  `data` holds the full Pydantic AnalysisRecord
 
138
 
139
  `analysis_id` is nullable until the Analysis State (owned upstream) is wired
140
  into the slow path; records still persist (and carry `user_id`) before then.
141
+
142
+ OWNERSHIP / HANDOFF (#21/#22, 2026-06-25 checkpoint): table **renamed `analysis_records`
143
+ → `report_inputs`** — it holds the inputs report generation reads (the slow-path run
144
+ records). "report_inputs" avoids clashing with Go's `analyses_messages` and with Langfuse
145
+ observability. **Python-owned for now** (Python still creates it locally); the finalized
146
+ schema goes to Harry so the dedorch migration creates it post-cutover (#22), where
147
+ `id`/`analysis_id` will be `uuid` (+ FK to `analyses(id)`). The Pydantic `AnalysisRecord`
148
+ (the in-memory run object) is intentionally kept. Slated to migrate to Go ownership later —
149
+ keep this + DEV_PLAN #21/#22 as the handoff record. NOTE: dedorch currently still has the
150
+ OLD `analysis_records` table (empty) until Harry's rename migration lands.
151
  """
152
+ __tablename__ = "report_inputs"
153
 
154
+ # id/analysis_id are `uuid` to match dedorch's `report_inputs` + the analysis-family
155
+ # (analyses/reports/data_sources). No FK declared in Python (dedorch's migration owns it, #22).
156
+ id = Column(UUID(as_uuid=False), primary_key=True) # AnalysisRecord.record_id (uuid hex ok)
157
+ analysis_id = Column(UUID(as_uuid=False), index=True) # the analysis session id (nullable for now)
158
  user_id = Column(String, nullable=False, index=True)
159
  plan_id = Column(String, nullable=False)
160
  data = Column(JSONB, nullable=False)
 
181
 
182
 
183
  class AnalysisStateRow(Base):
184
+ """Per-analysis session state — the dedorch **`analyses`** table (plural; Go-owned).
185
 
186
  One session = one analysis = one conversation; `id` is the shared session id
187
+ (canonical UUID). Verified against the dedorch DB 2026-06-25.
188
+
189
+ dedorch `analyses` ACTUAL columns: `id` (uuid), `analysis_title`, `user_id` (text),
190
+ `report_id` (uuid), `created_at`, `updated_at`, `problem_statement`,
191
+ `problem_validated`, `status` (text: 'active'|'inactive', soft-delete),
192
+ `data_bind` (jsonb), `data_bind_version` (int), `report_collection` (jsonb).
193
+
194
+ Reconciled to that shape (#4, 2026-06-26): `user_id` (was `owner_id`) + `status`/`data_bind`/
195
+ `data_bind_version`/`report_collection` added. dedorch still carries `problem_statement`/
196
+ `problem_validated` and does NOT yet have `objective`/`business_questions` — Harry's #3 drops
197
+ the former + adds the latter; the report layer reads the goal getattr-tolerantly so that swap
198
+ stays non-breaking. The new FE/Go columns are stored to match dedorch but NOT surfaced in the
199
+ `AnalysisState` pydantic contract (no Python reader needs them yet).
200
+
201
+ `analysis` (singular) is the deprecated DUPLICATE table Harry will drop — never use it.
202
+ Class name kept as `AnalysisStateRow`.
203
  """
204
+ __tablename__ = "analyses"
205
 
206
  id = Column(UUID(as_uuid=False), primary_key=True) # shared session id (uuid)
207
  analysis_title = Column(String, nullable=False, default="New analysis")
208
  problem_statement = Column(Text, nullable=False, default="")
209
  problem_validated = Column(Boolean, nullable=False, default=False)
210
+ user_id = Column(String, nullable=False, index=True) # was owner_id (dedorch uses user_id)
211
  report_id = Column(UUID(as_uuid=False), nullable=True)
212
+ # dedorch `analyses` columns (FE/Go concerns; carried so create_all matches dedorch).
213
+ status = Column(String, nullable=False, default="active") # active | inactive (soft-delete)
214
+ data_bind = Column(JSONB, nullable=False, default=list)
215
+ data_bind_version = Column(Integer, nullable=False, default=1)
216
+ report_collection = Column(JSONB, nullable=False, default=list)
217
  created_at = Column(DateTime(timezone=True), server_default=func.now())
218
  updated_at = Column(
219
  DateTime(timezone=True), server_default=func.now(), onupdate=func.now()