feat/Knowledge & Data Tools

#3
by rhbt6767 - opened
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. ARCHITECTURE.md +2 -0
  2. CHECKPOINT_PLAN_2026-06-17.md +147 -0
  3. PROGRESS.md +32 -6
  4. eval/__init__.py +0 -0
  5. eval/intent/README.md +70 -0
  6. eval/intent/__init__.py +0 -0
  7. eval/intent/intent_dataset.json +56 -0
  8. eval/intent/results/.gitkeep +0 -0
  9. eval/intent/run_eval.py +384 -0
  10. eval/readiness/README.md +34 -0
  11. eval/readiness/__init__.py +0 -0
  12. eval/readiness/readiness_dataset.json +40 -0
  13. eval/readiness/results/.gitkeep +0 -0
  14. eval/readiness/results/readiness_result_2026-06-22_101645.json +268 -0
  15. eval/readiness/results/readiness_result_2026-06-22_143809.json +284 -0
  16. eval/readiness/run_eval.py +309 -0
  17. main.py +6 -0
  18. pyproject.toml +2 -0
  19. src/agents/binding_store.py +34 -0
  20. src/agents/chat_handler.py +254 -32
  21. src/agents/gate.py +108 -0
  22. src/agents/handlers/__init__.py +1 -0
  23. src/agents/handlers/check.py +165 -0
  24. src/agents/handlers/help.py +192 -0
  25. src/agents/handlers/problem_statement.py +171 -0
  26. src/agents/orchestration.py +42 -27
  27. src/agents/planner/examples.py +20 -22
  28. src/agents/planner/inputs.py +3 -3
  29. src/agents/planner/registry.py +39 -28
  30. src/agents/planner/service.py +1 -1
  31. src/agents/planner/validator.py +5 -5
  32. src/agents/report/__init__.py +9 -0
  33. src/agents/report/errors.py +7 -0
  34. src/agents/report/generator.py +363 -0
  35. src/agents/report/readiness.py +165 -0
  36. src/agents/report/schemas.py +91 -0
  37. src/agents/report/store.py +119 -0
  38. src/agents/slow_path/assembler.py +32 -1
  39. src/agents/slow_path/coordinator.py +4 -4
  40. src/agents/slow_path/schemas.py +12 -0
  41. src/agents/slow_path/store.py +78 -12
  42. src/agents/slow_path/task_runner.py +3 -0
  43. src/agents/state_store.py +128 -0
  44. src/api/v1/analysis.py +174 -0
  45. src/api/v1/chat.py +52 -9
  46. src/api/v1/report.py +189 -0
  47. src/api/v1/tools.py +124 -0
  48. src/catalog/reader.py +3 -2
  49. src/config/prompts/help.md +107 -0
  50. src/config/prompts/intent_router.md +76 -39
ARCHITECTURE.md CHANGED
@@ -63,6 +63,8 @@ DB vs tabular is **not** a routing concern - it's a per-source attribute (`sou
 
  ## 3. Routing model
 
+ > **Superseded 2026-06-18** - the 3-way `source_hint` below was reworked into a flat **6-intent** handler router (`chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, `structured_flow`). Modality (structured vs unstructured *data*) is now the Planner's job, not the router's. See `ORCHESTRATOR_REWORK_PLAN.md`.
+
  ```
  source_hint ∈ { chat, unstructured, structured }
  ```
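Review note: the flat 6-intent router described in the superseded note could be sketched roughly as below. This is only an illustration. The six intent names come from the note, while `Intent`, `HANDLERS`, `route`, the stub handlers, and the fallback to `chat` for unknown labels are all assumptions, not the code in `src/agents/handlers/`.

```python
from enum import Enum
from typing import Callable, Dict


class Intent(str, Enum):
    """The six intents named in the ARCHITECTURE.md note."""
    CHAT = "chat"
    HELP = "help"
    PROBLEM_STATEMENT = "problem_statement"
    CHECK = "check"
    UNSTRUCTURED_FLOW = "unstructured_flow"
    STRUCTURED_FLOW = "structured_flow"


Handler = Callable[[str], str]


def _stub(name: str) -> Handler:
    # Hypothetical stand-in for a real handler module.
    def handle(message: str) -> str:
        return f"[{name}] {message}"
    return handle


# Flat mapping: one handler per intent, no nested modality routing.
HANDLERS: Dict[Intent, Handler] = {intent: _stub(intent.value) for intent in Intent}


def route(intent_label: str, message: str) -> str:
    """Dispatch one user turn; unknown labels fall back to chat (assumption)."""
    try:
        intent = Intent(intent_label)
    except ValueError:
        intent = Intent.CHAT
    return HANDLERS[intent](message)
```

With a flat table like this the router never inspects whether the data is structured or unstructured, which matches the note's point that modality is the Planner's job.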
CHECKPOINT_PLAN_2026-06-17.md ADDED
@@ -0,0 +1,147 @@
+ # Checkpoint Plan - Wednesday, 17 June 2026
+
+ Working plan for Sofhia & Rifqi based on the checkpoint with mas Harry on **Thursday, 11 June 2026**.
+ Goal: everything below is **merged and demo-able before the next sync on Wednesday, 17 June (afternoon)**.
+
+ **Updated at: Friday, 12 June 2026** (Sofhia + Rifqi)
+
+ > Source of truth for decisions is the meeting itself. Note: the NotebookLM summary is **stale on two points** - Data Availability Check was *eliminated* as a tool, and Success Metrics was *folded into* the Problem Statement template. Do not build either as a standalone skill.
+
+ ---
+
+ ## 0. Progress (per Fri 12 Jun - Sofhia)
+
+ Dated snapshot of what landed this session. Live task status (incl. what's left) lives in §2 Ownership - this section only records the deltas + traceability.
+
+ - ✅ **Tool matrix** built (xlsx, all ~10 tools + status colours) - presentation material ready.
+ - ✅ **Registry trimmed to 4 active analytics** (`KM-641`, commit `66e2e4d`): `ACTIVE_ANALYTICS_TOOLS` (descriptive, aggregate, correlation, trend) vs `DEFERRED_ANALYTICS_TOOLS` (comparison, contribution, profile, segment) - specs + compute fns kept, only registry exposure withheld. Tests 206 pass, ruff/mypy clean.
+ - ✅ **Planner few-shot synced**: Example A `analyze_contribution` → `analyze_aggregate` (so few-shots don't reference a deferred tool).
+ - ✅ **Data-access tools renamed** (`KM-642`, commit `c38c0c2`): `query_structured` → `data_retrieve`, `retrieve_documents` → `knowledge_retrieve` across the tool layer + planner stub/prompt/validator/few-shots. Mechanical, no behavior change.
+ - ✅ **`data_check` merge + `knowledge_check`** (`KM-643`, commit `4bd5f1e`): `list_sources` + `describe_source` → one parameterized `data_check` (no arg = list structured sources; `source_id` = schema) + new `knowledge_check` (unstructured). Tests 206 pass.
+ - ✅ **Redis Cloud live** (free tier, TTL = 1 h), env vars shared in the group (Rifqi).
+ - ✅ **Planner tool list verified** against the trimmed registry - no references to old tool names or deferred analytics anywhere in `src/` (Rifqi).
+ - 📌 **Decision:** `tests/` stays gitignored - team decided not to push tests to origin (closes PROGRESS.md R3 as won't-do).
+ - 📌 **Ownership:** Rifqi owns `generate_report` development + the `analysis_records` table / real `AnalysisStore` (contract still co-designed with Sofhia).
+ - ✅ **R5 cache fix** (Rifqi, `b701e95`): chat cache scoped by `user_id`, TTL 24h→1h.
+ - ✅ **AnalysisRecord persistence landed** (Rifqi): `stage` now flows to the record (CRISP-DM grouping for the report) + identity fields (`record_id`/`analysis_id`/`user_id`); `PostgresAnalysisStore` + `analysis_records` table replace `NullAnalysisStore`, wired into `ChatHandler`. Unblocks the `generate_report` renderer and the DoD "record persisted" step. Open: `analysis_id` handoff from Harry's Analysis State.
+ - ✅ **Verb-first tool naming** (Sofhia, commit `2d6406d`): the 4 data/knowledge tools renamed to lead with a verb - `data_check`→`check_data`, `knowledge_check`→`check_knowledge`, `data_retrieve`→`retrieve_data`, `knowledge_retrieve`→`retrieve_knowledge` (the `analyze_*` tools already lead with a verb). These verb-first names are now canonical; the tool-set table + §3 below use them. Dated log entries above keep the old names as historical record.
+
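Review note: the parameterized `data_check` / `check_data` behaviour logged above (no argument lists the structured sources, a `source_id` returns that source's schema) might look roughly like this. A minimal sketch under stated assumptions: the in-memory `CATALOG`, its contents, and the return shapes are hypothetical and stand in for the real catalog-backed tool.

```python
from typing import Dict, List, Optional, Union

# Hypothetical in-memory catalog: source_id -> column names.
CATALOG: Dict[str, List[str]] = {
    "sales_csv": ["order_id", "region", "amount"],
    "crm_db.customers": ["customer_id", "segment"],
}


def check_data(source_id: Optional[str] = None) -> Union[List[str], Dict[str, object]]:
    """One tool, two modes: list sources, or describe a single source."""
    if source_id is None:
        # No argument: list the structured sources.
        return sorted(CATALOG)
    if source_id not in CATALOG:
        # Actionable feedback instead of a silent guess.
        known = ", ".join(sorted(CATALOG))
        return {"error": f"Unknown source '{source_id}'. Known sources: {known}."}
    # With a source_id: return that source's schema.
    return {"source_id": source_id, "columns": CATALOG[source_id]}
```

Merging the two former tools behind one optional parameter keeps the planner-visible tool list shorter, which is the stated motivation for the merge.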
+ ---
+
+ ## 1. Locked decisions (from the 2026-06-11 checkpoint)
+
+ 1. **Single chat page.** The separate interview/survey page is killed. Sidebar = Knowledge menu (connect/manage data) + Analysis menu (sessions).
+ 2. **Data-first hard gate.** Creating a new analysis requires **≥ 1 bound data source** (server-side rejection, no empty sessions). User provides title + optional short description.
+ 3. **Analysis State lives in the DB.** Per-analysis row: `user_id`, `data_source_ids[]`, `interview_status` (default `not_pass`), `report_status` (default `no_report` → `V1`, `V2`, …). Explicitly **NOT cached, NOT in Redis** - the Orchestrator reads it from Postgres every turn.
+ 4. **Skills, not agents.** No separate interview agent. The Orchestrator routes per user turn using the Analysis State; an analytical request still executes through the existing Planner → TaskRunner → Assembler spine (static plan, no mid-run LLM).
+ 5. **Interview = one skill: Problem Statement.** Success metrics become fields inside the PS template (what to increase/decrease + target). Data availability check is handled by the data-first creation gate + PS validation cross-checking fields against the bound catalog - not a separate tool.
+ 6. **Analytics focus = 4 tools:** descriptive, aggregate, correlation, trend. The other four composites (comparison, contribution, profile, segment) are **deprioritized, not deleted** - keep the code, just don't register them. If "comparison" returns later it should be a proper statistical **test**, not a generic compare.
+ 7. **`describe_source` merges into the listing tool** - one call returns sources *with* their schema/metadata, fewer tools for the planner.
+ 8. **Report = on-demand, button-triggered (not a chat skill).** A dedicated "Generate Report" button in the Analysis menu calls a **report API** (not the chat route): trigger generation for a session, list its versions, fetch a version. Renders from accumulated **AnalysisRecords + the Problem Statement** - never from chat history. Each report is a **persisted, versioned artifact**: generation snapshots the record IDs it used and bumps `report_status` to `V<n>`. (Owner: Rifqi, KM-644.)
+ 9. **Help = deterministic guide.** No LLM: read Analysis State → tell the user the next required step. Callable in any state.
+ 10. **Redis Cloud free tier, TTL = 1 hour**, env shared in the team group - for retrieval/query caching only, never for state.
+
+ ### Final tool set (~10)
+
+ | Tool (canonical, verb-first) | Maps to (lineage) | Status |
+ |---|---|---|
+ | `check_knowledge` | new - list user's documents + metadata | done |
+ | `check_data` | `list_sources` + `describe_source` merged (catalog-backed) | done |
+ | `retrieve_knowledge` | `retrieve_documents` → `knowledge_retrieve` | done |
+ | `retrieve_data` | `query_structured` → `data_retrieve` (tabular: file + DB, both working) | done |
+ | `analyze_descriptive` | `src/tools/analytics/descriptive.py` | done |
+ | `analyze_aggregate` | `src/tools/analytics/aggregation.py` | done |
+ | `analyze_correlation` | `src/tools/analytics/relationship.py` | done |
+ | `analyze_trend` | `src/tools/analytics/temporal.py` | done |
+ | `problem_statement` | new - interview skill (**Harry**) | Harry |
+ | `generate_report` | new - on-demand, versioned | to design |
+ | `help` | new - deterministic state guide | to build |
+
+ (`problem_statement` + `help` live at the orchestrator level; `generate_report` is **button-triggered via a dedicated report API**, not chat-routed (decision #8). The TaskRunner registry holds the 4 analytics + 4 data/knowledge tools. Unregister `analyze_comparison`, `analyze_contribution`, `analyze_profile`, `analyze_segment` from the planner-visible registry - keep the modules.)
+
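Review note: decisions 2, 3 and 8 together imply a small state object with a creation gate and a version bump. The sketch below is one possible shape, not the Analysis State class Harry owns. The field names and defaults are taken from decision 3, while `AnalysisState` as a dataclass, `create_analysis`, and `bump_report_status` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnalysisState:
    user_id: str
    title: str
    data_source_ids: List[str]
    description: str = ""
    interview_status: str = "not_pass"   # default from decision 3
    report_status: str = "no_report"     # later "V1", "V2", ...


def create_analysis(user_id: str, title: str, data_source_ids: List[str],
                    description: str = "") -> AnalysisState:
    """Data-first hard gate: reject creation without a bound source."""
    if not data_source_ids:
        raise ValueError("An analysis needs at least one bound data source.")
    return AnalysisState(user_id=user_id, title=title,
                         data_source_ids=list(data_source_ids),
                         description=description)


def bump_report_status(state: AnalysisState) -> str:
    """Move no_report -> V1 -> V2 ... each time a report is generated."""
    current = state.report_status
    next_version = 1 if current == "no_report" else int(current[1:]) + 1
    state.report_status = f"V{next_version}"
    return state.report_status
```

In the real system this row would be read from Postgres on every turn rather than held in memory, per decision 3.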
+ ---
+
+ ## 2. Ownership
+
+ ### Sofhia
+ - [x] 4 analytics tools: trim registry to 4 active, tests still pass after deprioritizing the other four. (`KM-641`, commit `66e2e4d`)
+ - [x] Data/knowledge tools: merge `describe_source` into `data_check`, rename `retrieve_documents` → `knowledge_retrieve`, `query_structured` → `data_retrieve`, build `knowledge_check`. (`KM-642` `c38c0c2`, `KM-643` `4bd5f1e`)
+ - [ ] Co-design `generate_report` contract with Rifqi (Rifqi owns development, see §3).
+ - [x] Tool matrix (see §4).
+
+ ### Rifqi
+ - [x] **Redis Cloud free tier** (~30–50 MB): create instance, set TTL = 1 h, share env vars in the group. (done 12 Jun)
+ - [x] **R5 cache fix**: chat cache key scoped by `user_id`, TTL 24h→1h (urgent on shared Redis). (12 Jun, commit `b701e95`)
+ - [x] **AnalysisRecord contract gaps closed**: `stage` (CRISP-DM) now flows Task→TaskResult→TaskSummary so the report can group the method appendix; `AnalysisRecord` gained `record_id`/`analysis_id`/`user_id` identity fields. (12 Jun)
+ - [x] **`analysis_records` table + real `AnalysisStore`**: `PostgresAnalysisStore` (save + `list_for_analysis`, never-throw) replaces `NullAnalysisStore`; wired into `ChatHandler`, `user_id` stamped at save. Satisfies the DoD "record persisted" step. (12 Jun)
+ - [ ] **Own `generate_report` development - KM-644 "Report Generator"** (contract co-designed with Sofhia, see §3). Button-triggered via a dedicated **report API** (trigger / list versions / fetch); reads `analysis_records` + Problem Statement; persists a versioned report artifact, bumps `report_status`. *(record persistence done above; report API + persistence + renderer + contract doc next)*
+ - [x] Verify planner tool list matches the trimmed registry (4 analytics + 4 data/knowledge) and few-shots don't reference removed tools. (verified 12 Jun - no stale tool names in `src/`)
+ - ⚠️ **Blocked-on-Harry**: `analysis_id` is `NULL` on persisted records until the Analysis State reaches the slow path - need the session-ID handoff so `generate_report` can group records per analysis.
+
+ ### Shared (Sofhia + Rifqi)
+ - [ ] `generate_report` design + skeleton: input = AnalysisRecords for the session + Problem Statement from Analysis State; output = versioned artifact; bumps `report_status`. Agree on the contract even if rendering is stubbed for Wednesday. (Development: Rifqi.)
+ - [ ] `help` skill: deterministic - read Analysis State, return the next required step. Small, do it together or whoever finishes first.
+ - [ ] Tool behavior smoke test end-to-end on an easy case (descriptive/aggregate path), per Harry's ask: "robust tools before agents."
+
+ ### Harry (dependencies - not ours, but we block on them)
+ - `problem_statement` skill + PS template (incl. increase/decrease target fields).
+ - Analysis State class + DB table, frontend analysis-builder step.
+ - Merging our PRs (he auto-merges; he clones from latest after).
+
+ ---
+
+ ## 3. Per-tool behavior contract (how to build each one)
+
+ Harry's framing: for every tool, define **goal / trigger / input / process / output**, and behave like a Claude-style skill - if a required argument is missing, respond with a polite feedback message asking for it (e.g. table/column name), never guess silently.
+
+ - **`check_knowledge`** - "what documents do I have?" → list documents with name, type, uploaded-at.
+ - **`check_data`** - "what data do I have?" → sources (file + DB) with schema/metadata from the data catalog, created/uploaded timestamps.
+ - **`retrieve_knowledge`** - RAG over uploaded documents; returns passages with source attribution.
+ - **`retrieve_data`** - query tabular data (file + DB) via QueryIR; output consumable by the `analyze_*` tools.
+ - **`analyze_*` (4)** - require valid table/column references; if missing or wrong, return actionable feedback instead of guessing.
+ - **`generate_report`** - button-triggered via a dedicated report API (not chat-routed); on-demand only (never auto); post-pass gated; renders from AnalysisRecords + PS; persists a versioned artifact, snapshots record IDs, bumps version. (KM-644, Rifqi.)
+ - **`help`** - no LLM; state → next step. Repeating it is fine, that's its job.
+
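Review note: the "ask politely, never guess" contract for the `analyze_*` tools could be sketched as below. This is an illustration under assumptions: `ToolReply`, `describe`, the argument names `table` / `columns`, and the feedback wording are hypothetical and are not the real `ToolOutput` contract.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToolReply:
    ok: bool
    feedback: Optional[str] = None   # polite request for the missing or wrong input
    result: Optional[Any] = None


def describe(args: Dict[str, Any], schema: Dict[str, List[str]]) -> ToolReply:
    """Validate table/column references before doing any work."""
    table = args.get("table")
    if not table:
        return ToolReply(False, feedback="Please tell me which table to describe.")
    if table not in schema:
        known = ", ".join(sorted(schema))
        return ToolReply(False, feedback=f"I could not find table '{table}'. Available tables: {known}.")
    columns = args.get("columns")
    if not columns:
        available = ", ".join(schema[table])
        return ToolReply(False, feedback=f"Please name at least one column of '{table}'. Available columns: {available}.")
    unknown = [c for c in columns if c not in schema[table]]
    if unknown:
        return ToolReply(False, feedback=f"Unknown column(s) in '{table}': {', '.join(unknown)}.")
    # All references are valid; the real tool would compute statistics here.
    return ToolReply(True, result={"table": table, "columns": list(columns)})
```

Returning feedback as data rather than raising lets the orchestrator relay the message to the user unchanged.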
+ ---
+
+ ## 4. Tool matrix (deliverable for the sync)
+
+ Harry explicitly asked for a matrix covering every tool. Produce one sheet/markdown table with columns:
+
+ `tool | goal | trigger (when the orchestrator calls it) | input | process | output | gated by interview_status? | status (done / in progress / planned)`
+
+ Use the tool set table in §1 as the row list. This doubles as the presentation material on Wednesday.
+
+ ---
+
+ ## 5. Day-by-day
+
+ | Day | Target |
+ |---|---|
+ | **Thu 11** | Checkpoint meeting + task split with Harry. |
+ | **Fri 12 (today)** | ✅ Registry trimmed to 4 analytics + few-shot synced (Sofhia, KM-641). ✅ Tool matrix built. ⏳ Redis Cloud + env share (Rifqi). |
+ | **Mon 15** | Data/knowledge tools done (`data_check` merge, renames, `knowledge_check`). `generate_report` contract agreed. |
+ | **Tue 16** | `help` skill done. `generate_report` skeleton wired to AnalysisRecord. Tool matrix drafted. End-to-end smoke test on the easy path. |
+ | **Wed 17 (AM)** | Buffer: fix fallout, finalize matrix, rehearse the demo flow. |
+ | **Wed 17 (PM)** | **Sync with Harry.** |
+
+ ---
+
+ ## 6. Open questions to confirm with Harry on Wednesday
+
+ 1. **Gate scope.** Proposal: keep the fast path + exploration tools (`check_knowledge`, `check_data`, retrieves, `help`, arguably `descriptive`) available **pre-pass**; gate only the insight tools (correlation, trend, report). Hard-gating everything risks frustrating users who just want to look at their data.
+ 2. **Who flips `interview_status` to `pass`?** Proposal: a deterministic validator (PS template slots complete + fields cross-checked against the bound catalog) makes the call - the LLM conducts the conversation but never decides the pass. ("Conversational skin, deterministic skeleton.")
+ 3. **Skills vs spine - one sentence to lock in writing:** *"Skills are registry tools executed by the existing Planner → TaskRunner → Assembler spine; the Analysis State gate is a pre-check in the Orchestrator."* This keeps the new flow and the locked architecture fully compatible.
+ 4. `generate_report` invocation goes through the same gate (post-pass only) - confirm.
+
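Review note: the gate-scope proposal in question 1 amounts to an allow-list that is checked before dispatch. A rough sketch follows, assuming the proposal is accepted as written. `PRE_PASS_TOOLS` and `is_allowed` are hypothetical names, and because the placement of `analyze_descriptive` is still an open question, this sketch conservatively gates everything that is not explicitly listed as pre-pass.

```python
# Exploration tools proposed to stay available before the interview passes.
PRE_PASS_TOOLS = frozenset({
    "check_knowledge",
    "check_data",
    "retrieve_knowledge",
    "retrieve_data",
    "help",
})


def is_allowed(tool_name: str, interview_status: str) -> bool:
    """Pre-check in the Orchestrator: exploration is always open,
    everything else requires interview_status == "pass"."""
    if tool_name in PRE_PASS_TOOLS:
        return True
    return interview_status == "pass"
```

Keeping the rule as a pure function of the tool name and `interview_status` is consistent with the "deterministic skeleton" idea in question 2.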
+ ---
+
+ ## 7. Definition of done for Wednesday
+
+ - [ ] All team PRs merged; Harry unblocked on the Analysis State class.
+ - [ ] Registry exposes exactly 4 analytics + 4 data/knowledge tools, all passing local tests.
+ - [ ] Redis Cloud shared and working locally for all three of us (TTL 1 h).
+ - [ ] `help` works against a (possibly stubbed) Analysis State.
+ - [ ] `generate_report` contract written; skeleton callable.
+ - [ ] Tool matrix ready to present.
+ - [ ] One end-to-end happy path runs: create analysis (with data) → blocked pre-pass → interview stub passes → descriptive/aggregate answer → record persisted.
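Review note: for the "`help` works against a (possibly stubbed) Analysis State" item, a deterministic state-to-next-step mapping could be as small as the sketch below. The status values come from the plan; the function name `next_step` and every message string are assumptions for illustration, not the text in `src/config/prompts/help.md`.

```python
from typing import List


def next_step(data_source_ids: List[str], interview_status: str, report_status: str) -> str:
    """No LLM involved: read the Analysis State, return the next required step."""
    if not data_source_ids:
        return "Connect at least one data source in the Knowledge menu."
    if interview_status != "pass":
        return "Complete the Problem Statement so the analysis can proceed."
    if report_status == "no_report":
        return "Ask an analytical question, or generate the first report."
    return f"Latest report is {report_status}. Continue analysing or generate a new version."
```

Because the output depends only on the three state fields, calling it repeatedly in the same state returns the same guidance, which the plan explicitly accepts.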
PROGRESS.md CHANGED
@@ -2,8 +2,32 @@
 
  Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team - division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
 
- **Last updated**: 2026-06-10 (tool layer complete + hardening/DRY + Langfuse tracing + gated slow-path wiring)
- **Current open PR**: `pr/2` - active.
 
  ---
 
@@ -39,12 +63,12 @@ Verified against code before logging. Severity: **critical** / important / nice-
  |---|---|---|---|---|
  | R1 | **AuthN/AuthZ** on data endpoints - reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation is client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused** - `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
  | R2 | **Always compile a LIMIT** - `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
- | R3 | **Commit `tests/` + minimal CI** - `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). GitHub origin carries tests; HF Space gets the Docker build (already doesn't COPY tests). | **critical (process)** | shared | `[ ]` |
  | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
  | DB2 | **Keyed engine cache** - `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path - no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` - caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
  | DB3 | **Speculative pre-connect** - `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
  | R4 | **Per-stage progress events** - `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call - that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
- | R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Today `chat:{room_id}:{message}`, 24h TTL, no user (`chat.py:138`) → cross-room replay + stale answers. | important | B | `[ ]` |
  | R6 | **Hard time budget** - wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
  | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) - stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
  | R8 | **Catalog upsert race** - per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
@@ -129,7 +153,8 @@ LLM tokens; verified live to US Cloud.
  wires Pattern A correctly; self-corrects via retry.
 
  **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
- real `AnalysisStore`; register data-access `ToolSpec`s upstream (`data_access_registry()`)
  or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
  `BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
  (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
@@ -396,8 +421,9 @@ New scope after the original 42-item table; added as the tool layer landed (KM-6
  | - | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 - canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
  | - | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 - `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
  | - | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 - `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
- | - | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 - `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped - IR pre-built by the agent Planner). |
  | - | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
 
  ### API surface
 
 
  Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team - division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
 
+ **Last updated**: 2026-06-12 (Redis Cloud live; R3 closed as won't-do; R5 cache fix; AnalysisRecord persistence landed - `PostgresAnalysisStore` + `analysis_records` table)
+ **Current open PR**: `pr/3` - active.
+
+ ---
+
+ ## What just shipped (2026-06-12 - AnalysisRecord persistence, Rifqi)
+
+ Groundwork for `generate_report`. The slow path now persists a real, citable
+ record; the report (next) renders from it.
+
+ - **Contract gaps closed** (`agents/slow_path/schemas.py`): `stage: CrispStage`
+ added to `TaskResult` + `TaskSummary` and populated at all 3 `TaskResult` build
+ sites in `task_runner.py` + copied in `assembler._build_record` - so the report
+ can group its method appendix by CRISP-DM phase. `AnalysisRecord` gained identity:
+ `record_id` (auto uuid), `analysis_id`/`user_id` (optional; stamped at persist).
+ - **Real store** (`agents/slow_path/store.py`): `PostgresAnalysisStore` -
+ `save()` (never-throw, idempotent upsert) + `list_for_analysis()` (oldest-first,
+ the report's render order). `NullAnalysisStore` kept (tests / disabled persistence).
+ `AnalysisStore` Protocol gained `list_for_analysis`.
+ - **Table** (`db/postgres/models.py`): `analysis_records` jsonb table (one row per
+ run, indexed by `analysis_id` + `user_id`); registered in `init_db.py`, created by
+ `create_all` on startup (no migration - `data_catalog` precedent).
+ - **Wired** (`agents/chat_handler.py`): default store flipped to `PostgresAnalysisStore`;
+ `user_id` stamped onto the record at the save site (in scope there).
+ - **Open**: `analysis_id` is `NULL` until Harry's Analysis State reaches the slow
+ path (session-ID handoff needed to group records per analysis).
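Review note: the store contract described above (never-throw `save()` with idempotent upsert, `list_for_analysis()` oldest-first) can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not `PostgresAnalysisStore`: the `Record` fields other than the three identity fields, the `created_at` ordering key, the synchronous methods, and the boolean return value of `save` are all hypothetical.

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional
from uuid import uuid4

logger = logging.getLogger(__name__)


@dataclass
class Record:
    summary: str
    record_id: str = field(default_factory=lambda: str(uuid4()))  # auto uuid
    analysis_id: Optional[str] = None   # stamped at persist time
    user_id: Optional[str] = None       # stamped at persist time
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class InMemoryAnalysisStore:
    """Hypothetical stand-in that mirrors the described store behaviour."""

    def __init__(self) -> None:
        self._rows: Dict[str, Record] = {}

    def save(self, record: Record) -> bool:
        """Never-throw: log and report failure instead of raising."""
        try:
            # Keyed by record_id, so saving the same record twice is an upsert.
            self._rows[record.record_id] = record
            return True
        except Exception:
            logger.exception("analysis record not persisted")
            return False

    def list_for_analysis(self, analysis_id: str) -> List[Record]:
        """Oldest-first, which is the report's render order."""
        rows = [r for r in self._rows.values() if r.analysis_id == analysis_id]
        return sorted(rows, key=lambda r: r.created_at)
```

The never-throw choice means a persistence failure cannot break the chat response; the trade-off is that callers must check the result or rely on logs to notice missing records.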
31
 
32
  ---
33
 
 
63
  |---|---|---|---|---|
64
  | R1 | **AuthN/AuthZ** on data endpoints β€” reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation is client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused** β€” `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
65
  | R2 | **Always compile a LIMIT** β€” `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
66
+ | R3 | **Commit `tests/` + minimal CI** β€” `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). ~~GitHub origin carries tests; HF Space gets the Docker build.~~ **2026-06-12: team decided tests stay gitignored/local β€” closed as won't-do.** | **critical (process)** | shared | `[won't do]` |
67
  | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5sβ†’~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5β†’~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
68
  | DB2 | **Keyed engine cache** - `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path - no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` - caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
  | DB3 | **Speculative pre-connect** - `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
  | R4 | **Per-stage progress events** - `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call - that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
+ | R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Was `chat:{room_id}:{message}`, 24h TTL, no user → cross-user replay + stale answers. **2026-06-12 (Rifqi):** key now `chat:{room_id}:{user_id}:{message}` via `_chat_cache_key()`, TTL 24h→1h (checkpoint decision) — urgent now that Redis is a shared Cloud instance. `DELETE /chat/cache` gained a required `user_id` param (frontend heads-up); room-wide clear pattern unchanged. **Still open:** catalog-version in key / invalidate-on-ingest. | important | B | `[~]` |
  | R6 | **Hard time budget** - wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
  | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) - stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
  | R8 | **Catalog upsert race** - per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
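
Row R6 proposes wrapping `coordinator.run()` in `asyncio.wait_for`. The sketch below illustrates that pattern only; `run_with_budget`, the fallback value and the `_slow_stage` stand-in are hypothetical names, since the real coordinator signature is not shown in this diff.

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")


async def run_with_budget(work: Awaitable[T], budget_seconds: float, fallback: T) -> T:
    """Await `work`, returning `fallback` if the time budget runs out.

    Hypothetical helper: name and fallback handling are illustrative only.
    """
    try:
        # wait_for cancels the inner task once the timeout elapses.
        return await asyncio.wait_for(work, timeout=budget_seconds)
    except asyncio.TimeoutError:
        return fallback


async def _slow_stage(seconds: float) -> str:
    # Stand-in for a long-running coordinator call.
    await asyncio.sleep(seconds)
    return "done"


def demo() -> tuple[str, str]:
    fast = asyncio.run(run_with_budget(_slow_stage(0.01), 1.0, "timed out"))
    slow = asyncio.run(run_with_budget(_slow_stage(1.0), 0.05, "timed out"))
    return fast, slow
```

In a real handler the fallback would presumably be a templated answer rather than a string literal; that choice is left open here.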
 
  wires Pattern A correctly; self-corrects via retry.
  
  **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
+ real `AnalysisStore` (**Rifqi owns, 2026-06-12** - folded into `generate_report` work,
+ see `CHECKPOINT_PLAN_2026-06-17.md`); register data-access `ToolSpec`s upstream (`data_access_registry()`)
  or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
  `BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
  (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
 
  | - | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 - canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
  | - | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 - `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
  | - | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 - `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
+ | - | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 - `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped - IR pre-built by the agent Planner). **Superseded by KM-642/643** - renamed `data_retrieve`/`knowledge_retrieve` and `list_sources`+`describe_source` merged into `data_check` + new `knowledge_check`; see row below. |
  | - | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
+ | - | Data/knowledge tool taxonomy (`tools/data_access.py`) | TAB | `[x]` | KM-642/643 (commits c38c0c2, 4bd5f1e) - renamed `query_structured`→`data_retrieve`, `retrieve_documents`→`knowledge_retrieve`; merged `list_sources`+`describe_source` → parameterized `data_check` (no arg = list structured sources; `source_id` = that source's schema) + new `knowledge_check` (unstructured/documents). Split mirrors the catalog's structured/unstructured slices. Planner stub/prompt/validator/few-shots synced; `DATA_ACCESS_TOOLS` guard kept in lockstep. Note: dated log entries above (e.g. the 2026-06-09 E2E) keep the old names as historical record. |
  
  ### API surface
  
eval/__init__.py ADDED
File without changes
eval/intent/README.md ADDED
@@ -0,0 +1,70 @@
+ # Intent-Routing Eval (E3)
+
+ Scores the live 6-intent router (`OrchestratorAgent.classify`) against a golden
+ dataset of labelled messages. Run it before any deploy that touches the router
+ prompt (`src/config/prompts/intent_router.md`) or its few-shots.
+
+ ## Files
+
+ | File | What |
+ |---|---|
+ | `intent_dataset.json` | **Golden dataset** - `message` + known-correct `expected_intent` per case. The source of truth scoring compares against. |
+ | `run_eval.py` | Runner - calls the router per case, scores correctness, records latency + tokens. |
+ | `results/` | Timestamped run reports, one JSON per run (never overwritten). |
+
+ ## Run
+
+ Run as a module (`-m`), not the file path - module mode puts the repo root on
+ `sys.path` so `src` imports resolve; `python eval/intent/run_eval.py` fails.
+
+ ```bash
+ uv run python -m eval.intent.run_eval              # full dataset
+ uv run python -m eval.intent.run_eval --limit 6    # quick smoke test
+ uv run python -m eval.intent.run_eval --langfuse   # also stream traces to Langfuse
+ ```
+
+ Needs a populated `.env` (Azure OpenAI) - it calls the live model and spends
+ tokens. Output: a per-case detail table + an aggregate summary in the terminal,
+ and `results/eval_result_<timestamp>.json`.
+
+ **Tracking is the committed result files, not Langfuse** - the JSON reports in
+ `results/` are the versionable audit trail (see below). `--langfuse` is an
+ *optional* extra: when set, each case is also sent as a Langfuse trace (grouped
+ under one `intent_eval_<ts>` session) with an `intent_correct` 1/0 score, so the
+ same run is browsable in the Langfuse dashboard. It is off by default and the
+ eval runs fully without Langfuse configured.
+
+ ## What's measured
+
+ - **correctness** - overall + per-intent + per-language accuracy (`got == expected`)
+ - **runtime** - average ms per case
+ - **tokens** - input / output / total (read from the model response, no Langfuse)
+
+ ## Commit convention for `results/`
+
+ The reports are **versionable**, not a scratch log:
+
+ - **Do commit** a result after a meaningful change - e.g. a new
+   `intent_router.md` version, or new dataset cases. The new timestamped file
+   *adds* to the history; old files are never replaced. This is how we answer
+   "did accuracy improve after prompt v2?" - diff two committed result files.
+ - **Don't commit** throwaway runs while iterating. Just leave them unstaged or
+   delete them.
+
+ So the audit trail = prompt versions (in `src/config/prompts/`) lined up against
+ the committed result files here.
+
+ ## Dataset notes
+
+ - 6 intents: `chat`, `help`, `problem_statement`, `check`, `unstructured_flow`,
+   `structured_flow`. Each has 6-7 **distinct** scenarios (not EN/ID translation
+   pairs), balanced across English + Indonesian.
+ - `carried_over: true` rows mirror the pre-rework `intent_router.md` examples
+   (regression). `lang` enables per-language scoring. `id` is a stable handle for
+   diffing the same case across runs.
+ - Routing labels are decided from the question **phrasing**, not from which file
+   holds the answer (the router has no catalog access). See the `_grounding` note
+   in `intent_dataset.json`.
+ - Owner: Rifqi (structured/DB-grounded rows) + Sofhia (unstructured/document +
+   tabular-file rows). Merge both into this one file.
+ ```
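
The README's commit convention relies on diffing two committed result files. A minimal sketch of such a comparison follows, assuming the report layout written by `run_eval.py` (a top-level `cases` list whose entries carry `id` and `correct`); `load_cases_by_id` and `diff_reports` are hypothetical helper names, not part of the PR.

```python
import json
from pathlib import Path
from typing import Any


def load_cases_by_id(path: Path) -> dict[str, dict[str, Any]]:
    """Index one result file's cases by their stable `id` handle."""
    report = json.loads(path.read_text(encoding="utf-8"))
    return {case["id"]: case for case in report["cases"]}


def diff_reports(
    old: dict[str, dict[str, Any]], new: dict[str, dict[str, Any]]
) -> dict[str, list[str]]:
    """Compare per-case correctness between two runs (hypothetical helper)."""
    shared = sorted(old.keys() & new.keys())
    return {
        # Wrong before, right now.
        "fixed": [i for i in shared if not old[i]["correct"] and new[i]["correct"]],
        # Right before, wrong now.
        "regressed": [i for i in shared if old[i]["correct"] and not new[i]["correct"]],
        # Cases present in only one of the two runs.
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }
```

Keying on `id` rather than list position keeps the comparison stable when cases are added or reordered between runs.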
eval/intent/__init__.py ADDED
File without changes
eval/intent/intent_dataset.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "_about": "Golden intent dataset for the reworked 6-intent router (E3 eval - runs against the live LLM, not the mocked unit tests). Each of the 6 intents has 6-7 DISTINCT scenarios (not EN/ID translation pairs of one scenario) balanced across English + Indonesian, since users code-switch (technical terms often stay English: revenue, churn, upload, base_price). `carried_over` rows mirror the old intent_router.md examples for regression; the rest are new. `id` is a stable per-case handle so timestamped run files (eval_result_<ts>.json) can be diffed case-by-case across runs - it is NOT the run timestamp (that lives in the filename). `lang` enables per-language correctness scoring (aggregate, not matched-pairs).",
+   "_grounding": "structured_flow + structured `check` rows are grounded in Sofhia's real test files: online_vs_offline_learning_dataset.csv (cols: Learning_Mode, Subject, Study_Hours, Retention_Score, Focus_Level, Exam_Score) and xl_knowledge_large.xlsx (telco product catalog: product_name, category, base_price, is_active, region_restriction...). unstructured_flow + document `check` rows are grounded in the IoT connectivity indicative-price PDF and the Internet of Things DOCX. IMPORTANT: routing labels are decided from the question PHRASING, not from which file holds the answer (the router has no catalog access) - so structured rows are clearly analytical (avg/correlation/count) and unstructured rows are clearly explanatory (jelaskan/summarize/features), with no price-lookup collisions. Rifqi's DB-grounded rows can be merged into this same file for variety; owner: Rifqi + Sofhia.",
+   "_next_layer": "Not in this seed: (1) deliberate near-boundary cases (chat-vs-help, check-vs-structured); (2) follow-up/contextual handling - decide whether follow-ups route to `chat` or are out of scope.",
+   "schema": {
+     "id": "stable per-case handle, <intent>_<NN>",
+     "message": "the user utterance fed to the router",
+     "expected_intent": "one of: chat | help | problem_statement | check | unstructured_flow | structured_flow",
+     "lang": "en | id",
+     "carried_over": "true if mirrored from the pre-rework intent_router.md examples"
+   },
+   "cases": [
+     { "id": "chat_01", "message": "Hi", "expected_intent": "chat", "lang": "en", "carried_over": true },
+     { "id": "chat_02", "message": "Bye, thanks", "expected_intent": "chat", "lang": "en", "carried_over": true },
+     { "id": "chat_03", "message": "What can you do?", "expected_intent": "chat", "lang": "en", "carried_over": true },
+     { "id": "chat_04", "message": "Kamu bisa ngerti bahasa Indonesia gk?", "expected_intent": "chat", "lang": "id", "carried_over": false },
+     { "id": "chat_05", "message": "Test, kebaca gak?", "expected_intent": "chat", "lang": "id", "carried_over": false },
+     { "id": "chat_06", "message": "Oh paham2", "expected_intent": "chat", "lang": "id", "carried_over": false },
+
+     { "id": "help_01", "message": "Okay I uploaded my data, what do I do next?", "expected_intent": "help", "lang": "en", "carried_over": false },
+     { "id": "help_02", "message": "How does this work, where should I start?", "expected_intent": "help", "lang": "en", "carried_over": false },
+     { "id": "help_03", "message": "How do I connect my database to this?", "expected_intent": "help", "lang": "en", "carried_over": false },
+     { "id": "help_04", "message": "Setelah analisis selesai, aku bisa ngapain lagi?", "expected_intent": "help", "lang": "id", "carried_over": false },
+     { "id": "help_05", "message": "Aku harus upload file dulu atau connect database dulu atau bisa langsung tanpa keduanya?", "expected_intent": "help", "lang": "id", "carried_over": false },
+     { "id": "help_06", "message": "Cara bikin report-nya gimana deh?", "expected_intent": "help", "lang": "id", "carried_over": false },
+
+     { "id": "ps_01", "message": "I want to reduce customer churn next quarter, target under 5%.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+     { "id": "ps_02", "message": "My goal is to improve online students' exam scores this semester.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+     { "id": "ps_03", "message": "We need to figure out which product categories to push next year.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+     { "id": "ps_04", "message": "Aku mau tau faktor apa yg paling ngaruh ke retention score siswa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+     { "id": "ps_05", "message": "Tujuanku naikin penjualan produk prepaid kuartal depan.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+     { "id": "ps_06", "message": "Aku pengen fokus benahin paket internet yang kurang laku di luar Jawa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+
+     { "id": "check_01", "message": "What data do I have?", "expected_intent": "check", "lang": "en", "carried_over": false },
+     { "id": "check_02", "message": "What columns are in the online vs offline learning dataset?", "expected_intent": "check", "lang": "en", "carried_over": false },
+     { "id": "check_03", "message": "Is the IoT connectivity pricing PDF already uploaded?", "expected_intent": "check", "lang": "en", "carried_over": false },
+     { "id": "check_04", "message": "Kolom di tabel product master list apa aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
+     { "id": "check_05", "message": "Dokumen apa aja yang udh aku upload?", "expected_intent": "check", "lang": "id", "carried_over": false },
+     { "id": "check_06", "message": "Sumber dataku yang berupa database yg mana aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
+
+     { "id": "unstructured_01", "message": "apa key feature dari iot connectivity?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": true },
+     { "id": "unstructured_02", "message": "Jelaskan tentang Internet of Things.", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
+     { "id": "unstructured_03", "message": "Menurut dokumen IoT connectivity, paket apa aja yang ditawarkan?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
+     { "id": "unstructured_04", "message": "What pricing tiers are in the IoT connectivity document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+     { "id": "unstructured_05", "message": "Summarize the key points from the IoT connectivity pricing document.", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+     { "id": "unstructured_06", "message": "What use cases of IoT are mentioned in the document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+
+     { "id": "structured_01", "message": "How many orders did we get last month?", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
+     { "id": "structured_02", "message": "Top 5 customers by revenue this year", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
+     { "id": "structured_03", "message": "What's the average exam score per learning mode?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
+     { "id": "structured_04", "message": "Is there a correlation between study hours and exam score?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
+     { "id": "structured_05", "message": "Rata-rata base price per kategori produk berapa?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
+     { "id": "structured_06", "message": "Ada berapa produk yang masih aktif per kategori?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
+     { "id": "structured_07", "message": "Bandingin retention score antara siswa online sama offline.", "expected_intent": "structured_flow", "lang": "id", "carried_over": false }
+   ]
+ }
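
Because two owners merge rows into this one file, a quick structural check before running the live eval may be useful. The sketch below checks the constraints stated in the file's `schema` block (unique `id`, known `expected_intent`, `lang` of `en` or `id`, non-empty `message`); `validate_cases` is a hypothetical helper, not something the PR ships.

```python
from collections import Counter
from typing import Any

INTENTS = {
    "chat", "help", "problem_statement",
    "check", "unstructured_flow", "structured_flow",
}
LANGS = {"en", "id"}


def validate_cases(cases: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means the cases look sane."""
    problems: list[str] = []
    # Ids must be unique so runs can be diffed case-by-case.
    id_counts = Counter(case.get("id") for case in cases)
    for case_id, count in id_counts.items():
        if count > 1:
            problems.append(f"{case_id}: id used {count} times")
    for case in cases:
        case_id = case.get("id")
        if case.get("expected_intent") not in INTENTS:
            problems.append(f"{case_id}: unknown intent {case.get('expected_intent')!r}")
        if case.get("lang") not in LANGS:
            problems.append(f"{case_id}: unknown lang {case.get('lang')!r}")
        if not str(case.get("message", "")).strip():
            problems.append(f"{case_id}: empty message")
    return problems
```

Running it on `json.load(...)["cases"]` costs no tokens, so it could sit in front of the eval as a cheap guard.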
eval/intent/results/.gitkeep ADDED
File without changes
eval/intent/run_eval.py ADDED
@@ -0,0 +1,384 @@
+ """Intent-routing eval runner (E3).
+
+ Feeds each golden case in `intent_dataset.json` to the live 6-intent router
+ (`OrchestratorAgent.classify`), then scores correctness + records latency and
+ token usage. Prints a per-case detail table and an aggregate summary, and
+ writes a timestamped JSON report under `results/` (never overwritten - one file
+ per run, so runs can be diffed over time).
+
+ Run before every deploy that touches the router prompt or its few-shots.
+ Invoke as a module (`-m`) so the repo root is on `sys.path` and `src` imports
+ resolve - running the file path directly (`python eval/intent/run_eval.py`)
+ puts only `eval/intent/` on the path and fails:
+
+     uv run python -m eval.intent.run_eval
+     uv run python -m eval.intent.run_eval --limit 6   # quick smoke test
+
+ Tokens come straight from the model response (LangChain `usage_metadata` via a
+ callback) - no Langfuse needed. The router is called unmodified: it already
+ accepts a `callbacks=` list and forwards it into the chain config.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import asyncio
+ import json
+ import statistics
+ import time
+ from dataclasses import asdict, dataclass
+ from datetime import datetime
+ from pathlib import Path
+ from typing import Any
+
+ from langchain_core.callbacks import BaseCallbackHandler
+ from langchain_core.outputs import LLMResult
+
+ from src.agents.orchestration import OrchestratorAgent
+
+ _HERE = Path(__file__).resolve().parent
+ DATASET = _HERE / "intent_dataset.json"
+ RESULTS_DIR = _HERE / "results"
+
+ INTENTS = [
+     "chat",
+     "help",
+     "problem_statement",
+     "check",
+     "unstructured_flow",
+     "structured_flow",
+ ]
+
+ # Short labels so the EXPECT->GOT column stays narrow in the detail table.
+ _ABBR = {
+     "chat": "chat",
+     "help": "help",
+     "problem_statement": "prob_stmt",
+     "check": "check",
+     "unstructured_flow": "unstruct",
+     "structured_flow": "structF",
+ }
+
+
+ class _UsageCollector(BaseCallbackHandler):
+     """Sums token usage across the LLM calls made during one classify().
+
+     Reads `usage_metadata` off each returned message (the canonical LangChain
+     field), falling back to `llm_output['token_usage']` for providers that only
+     populate the legacy field.
+     """
+
+     def __init__(self) -> None:
+         self.input_tokens = 0
+         self.output_tokens = 0
+         self.total_tokens = 0
+
+     def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
+         before = self.total_tokens
+         for generation_list in response.generations:
+             for generation in generation_list:
+                 message = getattr(generation, "message", None)
+                 usage = getattr(message, "usage_metadata", None) if message else None
+                 if usage:
+                     self.input_tokens += usage.get("input_tokens", 0)
+                     self.output_tokens += usage.get("output_tokens", 0)
+                     self.total_tokens += usage.get("total_tokens", 0)
+         if self.total_tokens == before and response.llm_output:
+             usage = response.llm_output.get("token_usage") or {}
+             self.input_tokens += usage.get("prompt_tokens", 0)
+             self.output_tokens += usage.get("completion_tokens", 0)
+             self.total_tokens += usage.get("total_tokens", 0)
+
+     @property
+     def tokens(self) -> dict[str, int]:
+         return {
+             "input": self.input_tokens,
+             "output": self.output_tokens,
+             "total": self.total_tokens,
+         }
+
+
+ @dataclass
+ class CaseResult:
+     id: str
+     lang: str
+     message: str
+     expected: str
+     got: str
+     correct: bool
+     latency_ms: int
+     tokens: dict[str, int]
+
+
+ def load_cases(path: Path) -> list[dict[str, Any]]:
+     """Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
+     data = json.loads(path.read_text(encoding="utf-8"))
+     return list(data["cases"])
+
+
+ @dataclass
+ class _LangfuseCtx:
+     """Optional Langfuse sink - one session groups all cases of a run."""
+
+     session_id: str
+     client: Any
+
+
+ def _new_langfuse_handler(lf_ctx: _LangfuseCtx, case: dict[str, Any]) -> Any:
+     """Per-case LangChain callback so each trace carries the case's labels."""
+     from langfuse.callback import CallbackHandler
+
+     from src.config.settings import settings
+
+     return CallbackHandler(
+         public_key=settings.LANGFUSE_PUBLIC_KEY,
+         secret_key=settings.LANGFUSE_SECRET_KEY,
+         host=settings.LANGFUSE_HOST,
+         session_id=lf_ctx.session_id,
+         trace_name=f"intent_eval/{case['id']}",
+         metadata={
+             "case_id": case["id"],
+             "expected": case["expected_intent"],
+             "lang": case["lang"],
+         },
+         tags=["intent-eval", case["expected_intent"], case["lang"]],
+     )
+
+
+ def _score_langfuse(lf_ctx: _LangfuseCtx, handler: Any, result: CaseResult) -> None:
+     """Attach a 1/0 correctness score to the case's trace. Best-effort."""
+     try:
+         lf_ctx.client.score(
+             trace_id=handler.get_trace_id(),
+             name="intent_correct",
+             value=1 if result.correct else 0,
+             comment=f"{result.expected} -> {result.got}",
+         )
+     except Exception:  # noqa: BLE001, S110 - scoring must never break the run
+         pass
+
+
+ async def run_case(
+     agent: OrchestratorAgent,
+     case: dict[str, Any],
+     lf_ctx: _LangfuseCtx | None = None,
+ ) -> CaseResult:
+     """Classify one message; never throws - a failed call is recorded as ERROR."""
+     collector = _UsageCollector()
+     callbacks: list[Any] = [collector]
+     lf_handler = _new_langfuse_handler(lf_ctx, case) if lf_ctx else None
+     if lf_handler is not None:
+         callbacks.append(lf_handler)
+
+     start = time.perf_counter()
+     got: str
+     try:
+         decision = await agent.classify(case["message"], callbacks=callbacks)
+         got = decision.intent
+     except Exception as exc:  # noqa: BLE001 - one bad case shouldn't kill the run
+         got = f"ERROR:{type(exc).__name__}"
+     latency_ms = round((time.perf_counter() - start) * 1000)
+
+     result = CaseResult(
+         id=case["id"],
+         lang=case["lang"],
+         message=case["message"],
+         expected=case["expected_intent"],
+         got=got,
+         correct=got == case["expected_intent"],
+         latency_ms=latency_ms,
+         tokens=collector.tokens,
+     )
+     if lf_ctx is not None and lf_handler is not None:
+         _score_langfuse(lf_ctx, lf_handler, result)
+     return result
+
+
+ def _group_accuracy(results: list[CaseResult], key: str) -> dict[str, dict[str, Any]]:
+     out: dict[str, dict[str, Any]] = {}
+     keys = INTENTS if key == "expected" else sorted({getattr(r, key) for r in results})
+     for k in keys:
+         sub = [r for r in results if getattr(r, key) == k]
+         if not sub:
+             continue
+         passed = sum(r.correct for r in sub)
+         out[k] = {
+             "n": len(sub),
+             "passed": passed,
+             "accuracy": round(passed / len(sub), 3),
+         }
+     return out
+
+
+ def summarize(results: list[CaseResult]) -> dict[str, Any]:
+     n = len(results)
+     passed = sum(r.correct for r in results)
+     latencies = [r.latency_ms for r in results]
+     tok_in = sum(r.tokens["input"] for r in results)
+     tok_out = sum(r.tokens["output"] for r in results)
+     tok_total = sum(r.tokens["total"] for r in results)
+     return {
+         "total": n,
+         "passed": passed,
+         "accuracy": round(passed / n, 3) if n else 0.0,
+         "runtime_avg_ms": round(statistics.mean(latencies)) if latencies else 0,
+         "runtime_total_s": round(sum(latencies) / 1000, 1),
+         "tokens": {
+             "input": tok_in,
+             "output": tok_out,
+             "total": tok_total,
+             "avg_total_per_case": round(tok_total / n) if n else 0,
+         },
+         "by_intent": _group_accuracy(results, "expected"),
+         "by_lang": _group_accuracy(results, "lang"),
+     }
+
+
+ def _truncate(text: str, width: int) -> str:
+     text = text.replace("\n", " ")
+     return text if len(text) <= width else text[: width - 3] + "..."
+
+
+ def format_table(results: list[CaseResult]) -> str:
+     header = (
+         f"{'ID':<15} {'L':<3} {'QUESTION':<40} "
+         f"{'EXPECT->GOT':<20} {'OK':<3} {'MS':>5} {'TOK':>6}"
+     )
+     rule = "-" * len(header)
+     lines = [rule, header, rule]
+     for r in results:
+         exp_got = f"{_ABBR.get(r.expected, r.expected)}->{_ABBR.get(r.got, r.got)}"
+         ok = "ok" if r.correct else "X"
+         lines.append(
+             f"{r.id:<15} {r.lang:<3} {_truncate(r.message, 40):<40} "
+             f"{_truncate(exp_got, 20):<20} {ok:<3} {r.latency_ms:>5} {r.tokens['total']:>6}"
+         )
+     lines.append(rule)
+     return "\n".join(lines)
+
+
+ def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
+     lines = ["SUMMARY"]
+     lines.append(
+         f" Overall {summary['passed']}/{summary['total']} correct"
+         f" ({summary['accuracy'] * 100:.1f}%)"
+     )
+     lines.append(
+         f" Runtime avg {summary['runtime_avg_ms']} ms"
+         f" | total {summary['runtime_total_s']} s"
+     )
+     tok = summary["tokens"]
+     lines.append(
+         f" Tokens avg {tok['avg_total_per_case']}"
+         f" | total {tok['total']} (in {tok['input']} / out {tok['output']})"
+     )
+     lines.append("")
+     lines.append(" By intent")
+     for intent, m in summary["by_intent"].items():
+         lines.append(
+             f" {intent:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%"
+         )
+     lines.append(" By language")
+     for lang, m in summary["by_lang"].items():
+         lines.append(
+             f" {lang:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%"
+         )
+     failures = [r for r in results if not r.correct]
+     lines.append("")
+     lines.append(f" FAILURES ({len(failures)})")
+     for r in failures:
+         lines.append(f" {r.id:<14} [{r.lang}] {r.expected:<12} -> {r.got}")
+     return "\n".join(lines)
+
+
+ def build_report(
+     results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
+ ) -> dict[str, Any]:
+     run = {**meta, **{k: summary[k] for k in (
+         "total", "passed", "accuracy", "runtime_avg_ms", "runtime_total_s", "tokens"
+     )}}
+     return {
+         "run": run,
+         "by_intent": summary["by_intent"],
+         "by_lang": summary["by_lang"],
+         "cases": [asdict(r) for r in results],
+     }
+
+
+ def _model_name() -> str:
+     try:
+         from src.config.settings import settings
+
+         return str(settings.azureai_deployment_name_4o)
+     except Exception:  # noqa: BLE001 - meta only; .env may be absent
+         return "gpt-4o"
+
+
+ async def main() -> None:
+     parser = argparse.ArgumentParser(description="Intent-routing eval (E3)")
+     parser.add_argument("--dataset", type=Path, default=DATASET)
+     parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
+     parser.add_argument("--prompt-version", default="intent_router.md")
+     parser.add_argument("--no-table", action="store_true", help="skip the detail table")
+     parser.add_argument(
+         "--langfuse", action="store_true",
+         help="also send each case as a Langfuse trace + correctness score",
+     )
+     args = parser.parse_args()
+
+     cases = load_cases(args.dataset)
+     if args.limit:
+         cases = cases[: args.limit]
+
+     started = datetime.now()
+     print(f"Intent Routing Eval -- {started:%Y-%m-%d %H:%M:%S}")
+     print(f"dataset: {args.dataset.name} ({len(cases)}) model: {_model_name()} "
+           f"prompt: {args.prompt_version}")
+
+     lf_ctx: _LangfuseCtx | None = None
+     if args.langfuse:
+         try:
+             from src.observability.langfuse.langfuse import get_langfuse
+
+             lf_ctx = _LangfuseCtx(
+                 session_id=f"intent_eval_{started:%Y%m%d_%H%M%S}",
+                 client=get_langfuse(),  # type: ignore[no-untyped-call]
+             )
+             print(f"langfuse: enabled (session {lf_ctx.session_id})")
+         except Exception as exc:  # noqa: BLE001 - Langfuse is optional
+             print(f"langfuse: disabled ({type(exc).__name__}: {exc})")
+
+     agent = OrchestratorAgent()
+     results: list[CaseResult] = []
+     for case in cases:
+         results.append(await run_case(agent, case, lf_ctx))
+
+     if lf_ctx is not None:
+         try:
+             lf_ctx.client.flush()
+         except Exception:  # noqa: BLE001, S110 - flush failure shouldn't fail the run
+             pass
+
+     summary = summarize(results)
+     if not args.no_table:
+         print(format_table(results))
+     print(format_summary(summary, results))
+
+     meta = {
+         "timestamp": started.isoformat(timespec="seconds"),
+         "dataset": args.dataset.name,
+         "model": _model_name(),
+         "prompt_version": args.prompt_version,
+         "langfuse_session": lf_ctx.session_id if lf_ctx else None,
+     }
+     report = build_report(results, summary, meta)
+     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+     out_path = RESULTS_DIR / f"eval_result_{started:%Y-%m-%d_%H%M%S}.json"
+     out_path.write_text(
+         json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
+     )
+     print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
eval/readiness/README.md ADDED
@@ -0,0 +1,34 @@
+ # Report-readiness eval
+
+ Scores the deterministic `is_report_ready` signal (`src/agents/report/readiness.py`)
+ that the Help skill consumes to decide whether to nudge the user toward generating a
+ report. No LLM, no DB: each golden case declares an analysis state + a set of
+ persisted records/reports, and the runner feeds them through `is_report_ready` via
+ injectable fake stores.
+
+ ## Run
+
+ ```bash
+ uv run python -m eval.readiness.run_eval
+ uv run python -m eval.readiness.run_eval --limit 5     # smoke test
+ uv run python -m eval.readiness.run_eval --no-table    # summary only
+ ```
+
+ Each run writes a timestamped `results/readiness_result_<ts>.json` (never
+ overwritten, diffable across runs).
+
+ ## What it measures
+
+ - **Floor correctness**: exact `ready` + `missing` for the deterministic floor
+   (validated goal · ≥1 substantive record · delta-since-report). Should sit at ~100%;
+   this is the regression guard as criteria evolve.
+ - **Alignment gap**: `alignment` cases have substantive records (floor says
+   `ready=true`); those labelled `aligned=false` are the gap, because the analyses
+   don't address the problem statement. The floor can't see this. The gap count is
+   the evidence for/against adding the deferred LLM-judge: "ship the floor, earn
+   the judge."
+
+ ## Dataset
+
+ `readiness_dataset.json` holds four groups: `floor`, `delta`, `edge` (doc-only product
+ question), `alignment`. See the `_about` / `_alignment` doc keys in the file. The
+ `aligned` label is a semantic judgment; owner: Rifqi (report semantics) + Sofhia.
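To make the three floor conditions in "What it measures" concrete, here is a minimal sketch of how such a check could be written. It is not the code in `src/agents/report/readiness.py`, which this diff view does not show; `Task`, `Record`, `is_substantive` and `floor_ready` are hypothetical names, the short codes `problem` / `analysis` / `delta` are borrowed from the dataset, and the "substantive" rule follows the dataset's `_floor` note (a successful task that used an `analyze_*` tool).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Task:  # hypothetical stand-in for a task summary
    status: str  # "success" | "partial" | "failure"
    tools_used: list[str] = field(default_factory=list)


@dataclass
class Record:  # hypothetical stand-in for a persisted analysis record
    created_at: datetime
    tasks_run: list[Task] = field(default_factory=list)


def is_substantive(record: Record) -> bool:
    """True only if some task both succeeded and used an analyze_* tool."""
    return any(
        task.status == "success"
        and any(tool.startswith("analyze_") for tool in task.tools_used)
        for task in record.tasks_run
    )


def floor_ready(
    problem_validated: bool, records: list[Record], report_times: list[datetime]
) -> tuple[bool, list[str]]:
    """Return (ready, missing) using the three-part floor described above."""
    missing: list[str] = []
    if not problem_validated:
        missing.append("problem")
    substantive = [r for r in records if is_substantive(r)]
    if not substantive:
        missing.append("analysis")
    elif report_times:
        # Assumption: only the newest report matters for the delta check.
        newest_report = max(report_times)
        if not any(r.created_at > newest_report for r in substantive):
            missing.append("delta")
    return (not missing, missing)


now = datetime.now(timezone.utc)
old_ok = Record(now - timedelta(minutes=120), [Task("success", ["analyze_aggregate"])])
new_fail = Record(now - timedelta(minutes=5), [Task("failure", ["analyze_aggregate"])])
# Mirrors delta_05: the only analysis newer than the report is a failure.
print(floor_ready(True, [old_ok, new_fail], [now - timedelta(minutes=60)]))  # (False, ['delta'])
```

The sketch treats a failed retry as invisible to the delta check, which is the behaviour the `delta_05` case guards.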
eval/readiness/__init__.py ADDED
File without changes
eval/readiness/readiness_dataset.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge; see _alignment.",
+   "_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed, so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
+   "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded; the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task: NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
+   "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement; a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
+   "schema": {
+     "id": "stable per-case handle, <group>_<NN>",
+     "group": "floor | delta | edge | alignment",
+     "problem_validated": "bool",
+     "report_id": "null = never generated; a string = a report exists",
+     "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
+     "reports": "[{ age_min: int }] (only meaningful when report_id set)",
+     "aligned": "bool: do the analyses address the problem statement? (floor ignores this)",
+     "expected_ready": "what the deterministic floor SHOULD return",
+     "expected_missing": "subset of [problem, analysis, delta]",
+     "note": "human-readable description"
+   },
+   "cases": [
+     { "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" },
+     { "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" },
+     { "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" },
+     { "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready: this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
+     { "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass; must NOT be ready." },
+     { "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report → ready" },
+     { "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
+     { "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
+
+     { "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
+     { "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
+     { "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
+     { "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports: newest wins; analysis older than newest report → not ready" },
+     { "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
+
+     { "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
+
+     { "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." },
+     { "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" },
+     { "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
+   ]
+ }
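The `schema` block in this dataset is descriptive, so a typo in a case would only surface when the runner misbehaves. A small pre-flight check along the following lines could catch that earlier. This is a sketch under assumptions: `validate_case` and its messages are hypothetical and not part of the PR, and the rule that `expected_ready=true` implies an empty `expected_missing` is inferred from the cases above rather than stated by the schema.

```python
from typing import Any

GROUPS = {"floor", "delta", "edge", "alignment"}
ANALYSIS_KINDS = {"success", "failure", "none"}
MISSING_CODES = {"problem", "analysis", "delta"}


def validate_case(case: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the case looks well-formed."""
    errors: list[str] = []
    if case.get("group") not in GROUPS:
        errors.append(f"unknown group: {case.get('group')!r}")
    for key in ("problem_validated", "aligned", "expected_ready"):
        if not isinstance(case.get(key), bool):
            errors.append(f"{key} must be a bool")
    for rec in case.get("records", []):
        if rec.get("analysis") not in ANALYSIS_KINDS:
            errors.append(f"bad records[].analysis: {rec.get('analysis')!r}")
        if not isinstance(rec.get("age_min"), int):
            errors.append("records[].age_min must be an int")
    unknown = set(case.get("expected_missing", [])) - MISSING_CODES
    if unknown:
        errors.append(f"unknown expected_missing codes: {sorted(unknown)}")
    # Assumed invariant: a ready case has nothing missing.
    if case.get("expected_ready") and case.get("expected_missing"):
        errors.append("expected_ready=true contradicts a non-empty expected_missing")
    return errors
```

Running it over `data["cases"]` before the eval loop and aborting on any non-empty result would keep dataset mistakes from being reported as floor failures.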
eval/readiness/results/.gitkeep ADDED
File without changes
eval/readiness/results/readiness_result_2026-06-22_101645.json ADDED
@@ -0,0 +1,268 @@
+ {
+   "run": {
+     "timestamp": "2026-06-22T10:16:45",
+     "dataset": "readiness_dataset.json",
+     "target": "src/agents/report/readiness.is_report_ready",
+     "total": 16,
+     "passed": 16,
+     "accuracy": 1.0,
+     "runtime_avg_ms": 0.0
+   },
+   "alignment_gap": {
+     "count": 2,
+     "ids": [
+       "align_01",
+       "align_02"
+     ]
+   },
+   "by_group": {
+     "floor": {
+       "n": 8,
+       "passed": 8,
+       "accuracy": 1.0
+     },
+     "delta": {
+       "n": 4,
+       "passed": 4,
+       "accuracy": 1.0
+     },
+     "edge": {
+       "n": 1,
+       "passed": 1,
+       "accuracy": 1.0
+     },
+     "alignment": {
+       "n": 3,
+       "passed": 3,
+       "accuracy": 1.0
+     }
+   },
+   "cases": [
+     {
+       "id": "floor_01",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a validated problem statement",
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "a validated problem statement",
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_02",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a validated problem statement"
+       ],
+       "got_missing": [
+         "a validated problem statement"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_03",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_04",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_05",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_06",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_07",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_08",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_01",
+       "group": "delta",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a new analysis since the last report"
+       ],
+       "got_missing": [
+         "a new analysis since the last report"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_02",
+       "group": "delta",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_03",
+       "group": "delta",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_04",
+       "group": "delta",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a new analysis since the last report"
+       ],
+       "got_missing": [
+         "a new analysis since the last report"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "edge_01",
+       "group": "edge",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_01",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": false,
+       "gap": true,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_02",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": false,
+       "gap": true,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_03",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     }
+   ]
+ }
eval/readiness/results/readiness_result_2026-06-22_143809.json ADDED
@@ -0,0 +1,284 @@
+ {
+   "run": {
+     "timestamp": "2026-06-22T14:38:09",
+     "dataset": "readiness_dataset.json",
+     "target": "src/agents/report/readiness.is_report_ready",
+     "total": 17,
+     "passed": 17,
+     "accuracy": 1.0,
+     "runtime_avg_ms": 0.01
+   },
+   "alignment_gap": {
+     "count": 2,
+     "ids": [
+       "align_01",
+       "align_02"
+     ]
+   },
+   "by_group": {
+     "floor": {
+       "n": 8,
+       "passed": 8,
+       "accuracy": 1.0
+     },
+     "delta": {
+       "n": 5,
+       "passed": 5,
+       "accuracy": 1.0
+     },
+     "edge": {
+       "n": 1,
+       "passed": 1,
+       "accuracy": 1.0
+     },
+     "alignment": {
+       "n": 3,
+       "passed": 3,
+       "accuracy": 1.0
+     }
+   },
+   "cases": [
+     {
+       "id": "floor_01",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a validated problem statement",
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "a validated problem statement",
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_02",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a validated problem statement"
+       ],
+       "got_missing": [
+         "a validated problem statement"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_03",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_04",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_05",
+       "group": "floor",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_06",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_07",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "floor_08",
+       "group": "floor",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.1
+     },
+     {
+       "id": "delta_01",
+       "group": "delta",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a new analysis since the last report"
+       ],
+       "got_missing": [
+         "a new analysis since the last report"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_02",
+       "group": "delta",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_03",
+       "group": "delta",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_04",
+       "group": "delta",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a new analysis since the last report"
+       ],
+       "got_missing": [
+         "a new analysis since the last report"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "delta_05",
+       "group": "delta",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "a new analysis since the last report"
+       ],
+       "got_missing": [
+         "a new analysis since the last report"
+       ],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "edge_01",
+       "group": "edge",
+       "expected_ready": false,
+       "got_ready": false,
+       "expected_missing": [
+         "at least one completed analysis"
+       ],
+       "got_missing": [
+         "at least one completed analysis"
+       ],
+       "correct": true,
+       "aligned": false,
+       "gap": false,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_01",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": false,
+       "gap": true,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_02",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": false,
+       "gap": true,
+       "latency_ms": 0.0
+     },
+     {
+       "id": "align_03",
+       "group": "alignment",
+       "expected_ready": true,
+       "got_ready": true,
+       "expected_missing": [],
+       "got_missing": [],
+       "correct": true,
+       "aligned": true,
+       "gap": false,
+       "latency_ms": 0.0
+     }
+   ]
+ }
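Because every run is saved as its own file, two runs can be compared case by case. The following is a minimal sketch of such a comparison, assuming only the report layout visible in the two result files above (`cases[].id`, `got_ready`, `got_missing`, `correct`). `load_run` and `diff_runs` are hypothetical helpers, not part of this PR, and latency is deliberately ignored because it varies between runs.

```python
import json
from pathlib import Path
from typing import Any

# Fields whose change between runs signals a behavioural difference.
COMPARED_KEYS = ("got_ready", "got_missing", "correct")


def load_run(path: Path) -> dict[str, Any]:
    """Read one saved readiness_result_<ts>.json report."""
    return json.loads(path.read_text(encoding="utf-8"))


def diff_runs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, list[str]]:
    """List case ids that were added, removed, or changed outcome between two runs."""
    old_cases = {c["id"]: c for c in old["cases"]}
    new_cases = {c["id"]: c for c in new["cases"]}
    shared = sorted(old_cases.keys() & new_cases.keys())
    changed = [
        cid
        for cid in shared
        if any(old_cases[cid][k] != new_cases[cid][k] for k in COMPARED_KEYS)
    ]
    return {
        "added": sorted(new_cases.keys() - old_cases.keys()),
        "removed": sorted(old_cases.keys() - new_cases.keys()),
        "changed": changed,
    }
```

Applied to the 10:16:45 and 14:38:09 runs above, this comparison would list `delta_05` as added and nothing as removed or changed, which matches the move from 16 to 17 cases.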
eval/readiness/run_eval.py ADDED
@@ -0,0 +1,309 @@
1
+ """Report-readiness eval runner.
2
+
3
+ Feeds each golden case in `readiness_dataset.json` to the deterministic
4
+ `is_report_ready` signal (`src/agents/report/readiness.py`) via injectable FAKE
5
+ stores β€” no LLM, no DB β€” then scores both the boolean `ready` and the `missing`
6
+ gaps. Prints a per-case detail table + aggregate summary and writes a timestamped
7
+ JSON report under `results/` (never overwritten β€” one file per run, diffable).
8
+
9
+ Two metrics matter:
10
+ - FLOOR correctness (ready + missing exact) β€” should be ~100%; this is the
11
+ regression guard as the criteria evolve.
12
+ - ALIGNMENT GAP β€” cases the floor calls ready=true but whose analyses are NOT
13
+ aligned to the problem statement (`aligned=false`). The floor can't see this;
14
+ the gap count is the evidence for/against adding the deferred LLM-judge.
15
+
16
+ Invoke as a module so `src` imports resolve:
17
+
18
+ uv run python -m eval.readiness.run_eval
19
+ uv run python -m eval.readiness.run_eval --limit 5
20
+ """
21
+
22
+ from __future__ import annotations
23
+
24
+ import argparse
25
+ import asyncio
26
+ import json
27
+ import statistics
28
+ import time
29
+ from dataclasses import asdict, dataclass, field
30
+ from datetime import UTC, datetime, timedelta
31
+ from pathlib import Path
32
+ from typing import Any
33
+
34
+ from src.agents.gate import stub_analysis_state
35
+ from src.agents.report.readiness import (
36
+ _MISSING_ANALYSIS,
37
+ _MISSING_DELTA,
38
+ _MISSING_PROBLEM,
39
+ is_report_ready,
40
+ )
41
+
42
+ _HERE = Path(__file__).resolve().parent
43
+ DATASET = _HERE / "readiness_dataset.json"
44
+ RESULTS_DIR = _HERE / "results"
45
+ GROUPS = ["floor", "delta", "edge", "alignment"]
46
+
47
+ # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
48
+ # from the module so the dataset stays readable and survives wording changes.
49
+ _CODE_TO_MISSING = {
50
+ "problem": _MISSING_PROBLEM,
51
+ "analysis": _MISSING_ANALYSIS,
52
+ "delta": _MISSING_DELTA,
53
+ }
54
+
55
+
56
+ @dataclass
57
+ class _FakeTask:
58
+ """Mirrors slow_path.schemas.TaskSummary (the bits is_report_ready reads)."""
59
+
60
+ status: str # success | partial | failure
61
+ tools_used: list[str]
62
+
63
+
64
+ @dataclass
65
+ class _FakeRecord:
66
+ findings: list[Any]
67
+ created_at: datetime
68
+ tasks_run: list[_FakeTask]
69
+
70
+
71
+ @dataclass
72
+ class _FakeReport:
73
+ generated_at: datetime
74
+
75
+
76
+ class _FakeStore:
77
+ """Stand-in for the Postgres record/report store β€” returns canned rows."""
78
+
79
+ def __init__(self, rows: list[Any]) -> None:
80
+ self._rows = rows
81
+
82
+ async def list_for_analysis(self, _analysis_id: str) -> list[Any]:
83
+ return self._rows
84
+
85
+
86
+ @dataclass
87
+ class CaseResult:
88
+ id: str
89
+ group: str
90
+ expected_ready: bool
91
+ got_ready: bool
92
+ expected_missing: list[str]
93
+ got_missing: list[str]
94
+ correct: bool
95
+ aligned: bool
96
+ gap: bool # floor said ready but analyses not aligned to the problem statement
97
+ latency_ms: float
98
+
99
+
100
+ def load_cases(path: Path) -> list[dict[str, Any]]:
101
+ data = json.loads(path.read_text(encoding="utf-8"))
102
+ return list(data["cases"])
103
+
104
+
105
+ def _build_tasks(analysis: str) -> list[_FakeTask]:
106
+ """Realistic tasks_run: data-access always succeeds; the analyze_* task varies.
107
+
108
+ analysis = 'success' (analyze_* succeeded) | 'failure' (analyze_* failed) |
109
+ 'none' (no analyze task at all β€” only check/retrieve succeeded).
110
+ """
111
+ tasks = [
112
+ _FakeTask(status="success", tools_used=["check_data"]),
113
+ _FakeTask(status="success", tools_used=["retrieve_data"]),
114
+ ]
115
+ if analysis == "success":
116
+ tasks.append(_FakeTask(status="success", tools_used=["analyze_aggregate"]))
117
+ elif analysis == "failure":
118
+ tasks.append(_FakeTask(status="failure", tools_used=["analyze_aggregate"]))
119
+ return tasks
120
+
121
+
122
+ def _build_records(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRecord]:
123
+ return [
124
+ _FakeRecord(
125
+ findings=["f"] * int(spec.get("findings", 0)),
126
+ created_at=now - timedelta(minutes=int(spec["age_min"])),
127
+ tasks_run=_build_tasks(str(spec.get("analysis", "success"))),
128
+ )
129
+ for spec in specs
130
+ ]
131
+
132
+
133
+ def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeReport]:
134
+ return [
135
+ _FakeReport(generated_at=now - timedelta(minutes=int(spec["age_min"])))
136
+ for spec in specs
137
+ ]
138
+
139
+
140
+ async def run_case(case: dict[str, Any]) -> CaseResult:
141
+ now = datetime.now(UTC)
142
+ state = stub_analysis_state(problem_validated=bool(case["problem_validated"]))
143
+ if case.get("report_id"):
144
+ state = state.model_copy(update={"report_id": case["report_id"]})
145
+
146
+ record_store = _FakeStore(_build_records(case.get("records", []), now))
147
+ report_store = _FakeStore(_build_reports(case.get("reports", []), now))
148
+ expected_missing = sorted(_CODE_TO_MISSING[c] for c in case["expected_missing"])
149
+
150
+ start = time.perf_counter()
151
+ rr = await is_report_ready(
152
+ case["id"], state, record_store=record_store, report_store=report_store
153
+ )
154
+ latency_ms = round((time.perf_counter() - start) * 1000, 1)
155
+
156
+ got_missing = sorted(rr.missing)
157
+ ready_ok = rr.ready == bool(case["expected_ready"])
158
+ missing_ok = got_missing == expected_missing
159
+ return CaseResult(
160
+ id=case["id"],
161
+ group=case["group"],
162
+ expected_ready=bool(case["expected_ready"]),
163
+ got_ready=rr.ready,
164
+ expected_missing=expected_missing,
165
+ got_missing=got_missing,
166
+ correct=ready_ok and missing_ok,
167
+ aligned=bool(case["aligned"]),
168
+ gap=rr.ready and not bool(case["aligned"]),
169
+ latency_ms=latency_ms,
170
+ )
171
+
172
+
173
+ def _group_accuracy(results: list[CaseResult]) -> dict[str, dict[str, Any]]:
174
+ out: dict[str, dict[str, Any]] = {}
175
+ for g in GROUPS:
176
+ sub = [r for r in results if r.group == g]
177
+ if not sub:
178
+ continue
179
+ passed = sum(r.correct for r in sub)
180
+ out[g] = {"n": len(sub), "passed": passed, "accuracy": round(passed / len(sub), 3)}
181
+ return out
182
+
183
+
184
+ def summarize(results: list[CaseResult]) -> dict[str, Any]:
185
+ n = len(results)
186
+ passed = sum(r.correct for r in results)
187
+ gaps = [r for r in results if r.gap]
188
+ latencies = [r.latency_ms for r in results]
189
+ return {
190
+ "total": n,
191
+ "passed": passed,
192
+ "accuracy": round(passed / n, 3) if n else 0.0,
193
+ "runtime_avg_ms": round(statistics.mean(latencies), 2) if latencies else 0,
194
+ "alignment_gap": {"count": len(gaps), "ids": [r.id for r in gaps]},
195
+ "by_group": _group_accuracy(results),
196
+ }
197
+
198
+
199
+ def _fmt_bool(value: bool) -> str:
200
+ return "T" if value else "F"
201
+
202
+
203
+ def _truncate(text: str, width: int) -> str:
204
+ return text if len(text) <= width else text[: width - 3] + "..."
205
+
206
+
207
+ def format_table(results: list[CaseResult]) -> str:
+     header = (
+         f"{'ID':<12} {'GROUP':<10} {'RDY e/g':<8} "
+         f"{'MISSING (got)':<40} {'OK':<3} {'GAP':<4}"
+     )
+     rule = "-" * len(header)
+     lines = [rule, header, rule]
+     for r in results:
+         rdy = f"{_fmt_bool(r.expected_ready)}/{_fmt_bool(r.got_ready)}"
+         missing = ", ".join(r.got_missing) or "-"
+         ok = "ok" if r.correct else "X"
+         gap = "GAP" if r.gap else ""
+         lines.append(
+             f"{r.id:<12} {r.group:<10} {rdy:<8} "
+             f"{_truncate(missing, 40):<40} {ok:<3} {gap:<4}"
+         )
+     lines.append(rule)
+     return "\n".join(lines)
+
+
+ def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
+     lines = ["SUMMARY"]
+     lines.append(
+         f" Floor {summary['passed']}/{summary['total']} correct"
+         f" ({summary['accuracy'] * 100:.1f}%) avg {summary['runtime_avg_ms']} ms"
+     )
+     gap = summary["alignment_gap"]
+     lines.append(
+         f" Align gap {gap['count']} case(s) ready-but-misaligned"
+         + (f" -> {', '.join(gap['ids'])}" if gap["ids"] else "")
+     )
+     lines.append(" (floor can't catch these; this count is the LLM-judge justification)")
+     lines.append("")
+     lines.append(" By group")
+     for g, m in summary["by_group"].items():
+         lines.append(f" {g:<12} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%")
+     failures = [r for r in results if not r.correct]
+     lines.append("")
+     lines.append(f" FAILURES ({len(failures)})")
+     for r in failures:
+         lines.append(
+             f" {r.id:<12} ready {_fmt_bool(r.expected_ready)}->{_fmt_bool(r.got_ready)}"
+             f" missing {r.expected_missing} -> {r.got_missing}"
+         )
+     return "\n".join(lines)
+
+
+ def build_report(
+     results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
+ ) -> dict[str, Any]:
+     run = {**meta, **{k: summary[k] for k in ("total", "passed", "accuracy", "runtime_avg_ms")}}
+     return {
+         "run": run,
+         "alignment_gap": summary["alignment_gap"],
+         "by_group": summary["by_group"],
+         "cases": [asdict(r) for r in results],
+     }
+
+
+ @dataclass
+ class _Args:
+     dataset: Path = DATASET
+     limit: int = 0
+     no_table: bool = False
+     extra: dict[str, Any] = field(default_factory=dict)
+
+
+ async def main() -> None:
+     parser = argparse.ArgumentParser(description="Report-readiness eval")
+     parser.add_argument("--dataset", type=Path, default=DATASET)
+     parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
+     parser.add_argument("--no-table", action="store_true", help="skip the detail table")
+     args = parser.parse_args()
+
+     cases = load_cases(args.dataset)
+     if args.limit:
+         cases = cases[: args.limit]
+
+     started = datetime.now()
+     print(f"Report-Readiness Eval -- {started:%Y-%m-%d %H:%M:%S}")
+     print(f"dataset: {args.dataset.name} ({len(cases)} cases) target: is_report_ready")
+
+     results = [await run_case(case) for case in cases]
+
+     summary = summarize(results)
+     if not args.no_table:
+         print(format_table(results))
+     print(format_summary(summary, results))
+
+     meta = {
+         "timestamp": started.isoformat(timespec="seconds"),
+         "dataset": args.dataset.name,
+         "target": "src/agents/report/readiness.is_report_ready",
+     }
+     report = build_report(results, summary, meta)
+     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+     out_path = RESULTS_DIR / f"readiness_result_{started:%Y-%m-%d_%H%M%S}.json"
+     out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+     print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
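Review note: `format_summary` and `build_report` both read a `summary` dict whose producer (`summarize`) is not part of this hunk. The sketch below shows one producer that would satisfy the keys they access (`total`, `passed`, `accuracy`, `runtime_avg_ms`, `alignment_gap`, `by_group`). It is an illustration only: the trimmed `CaseResult` and its `runtime_ms` field are assumptions, and the real `summarize` in `run_eval.py` may compute these values differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CaseResult:
    """Trimmed stand-in: only the fields this sketch needs (runtime_ms is assumed)."""

    id: str
    group: str
    correct: bool
    gap: bool
    runtime_ms: float = 0.0


def summarize(results: list[CaseResult]) -> dict[str, Any]:
    """Aggregate per-case results into the keys the formatters read."""
    total = len(results)
    passed = sum(r.correct for r in results)

    # Per-group pass counts, then accuracy once the counts are complete.
    by_group: dict[str, dict[str, Any]] = {}
    for r in results:
        g = by_group.setdefault(r.group, {"n": 0, "passed": 0})
        g["n"] += 1
        g["passed"] += int(r.correct)
    for g in by_group.values():
        g["accuracy"] = g["passed"] / g["n"]

    # Cases flagged as "ready but misaligned" feed the alignment-gap line.
    gap_ids = [r.id for r in results if r.gap]
    avg_ms = round(sum(r.runtime_ms for r in results) / total, 1) if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "runtime_avg_ms": avg_ms,
        "alignment_gap": {"count": len(gap_ids), "ids": gap_ids},
        "by_group": by_group,
    }


demo = summarize(
    [
        CaseResult("r-01", "ready", True, False, 2.0),
        CaseResult("r-02", "ready", True, True, 4.0),
        CaseResult("r-03", "blocked", False, False, 3.0),
    ]
)
```

Keeping `alignment_gap` separate from `accuracy` matches the summary text above: a case can pass the deterministic floor and still be counted in the gap.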
main.py CHANGED
@@ -13,6 +13,9 @@ from src.api.v1.room import router as room_router
  from src.api.v1.users import router as users_router
  from src.api.v1.db_client import router as db_client_router
  from src.api.v1.data_catalog import router as data_catalog_router
+ from src.api.v1.report import router as report_router
+ from src.api.v1.analysis import router as analysis_router
+ from src.api.v1.tools import router as tools_router
  from src.db.postgres.init_db import init_db
  import os
  import uvicorn
@@ -53,6 +56,9 @@ app.include_router(room_router)
  app.include_router(chat_router)
  app.include_router(db_client_router)
  app.include_router(data_catalog_router)
+ app.include_router(report_router)
+ app.include_router(analysis_router)
+ app.include_router(tools_router)


  @app.get("/")
pyproject.toml CHANGED
@@ -123,6 +123,8 @@ ignore = [
  # S608: golden compiler tests assert literal SQL strings (incl. concatenated
  # suffixes) - they never execute against a DB, so it's a false positive here.
  "tests/**" = ["S101", "S105", "S106", "S608"]
+ # T201: eval/ scripts are CLIs - print() is their intended output channel.
+ "eval/**" = ["T201"]

  [tool.mypy]
  python_version = "3.12"
src/agents/binding_store.py ADDED
@@ -0,0 +1,34 @@
+ """AnalysisDataSourceStore - read per-analysis data-source bindings (#10).
+
+ The dedorch `data_sources` table records which catalog sources an analysis is scoped
+ to (`reference_id` = the catalog source id). It's written at `/analysis/create`; this
+ store is the read seam for the two consumers - `structured_flow` catalog scoping and
+ the report's data-source appendix.
+
+ Fail-open by convention at the call sites: an empty binding (legacy room, or the FE
+ not yet sending ids) means "no restriction" - fall back to the whole catalog. Mirrors
+ `AnalysisStateStore`: each call opens its own `AsyncSession`.
+ """
+
+ from __future__ import annotations
+
+ from sqlalchemy import select
+
+ from src.db.postgres.connection import AsyncSessionLocal
+ from src.db.postgres.models import AnalysisDataSourceRow
+ from src.middlewares.logging import get_logger
+
+ logger = get_logger("binding_store")
+
+
+ class AnalysisDataSourceStore:
+     """Read the bound catalog `source_id`s for an analysis."""
+
+     async def get(self, analysis_id: str) -> list[str]:
+         async with AsyncSessionLocal() as session:
+             result = await session.execute(
+                 select(AnalysisDataSourceRow.reference_id).where(
+                     AnalysisDataSourceRow.analysis_id == analysis_id
+                 )
+             )
+             return list(result.scalars().all())
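Review note: the docstring above leaves the fail-open rule to the call sites. A minimal sketch of that rule follows, with an in-memory stand-in for the store so it runs without a database. `FakeBindingStore`, `Source`, `Catalog`, and `scoped_catalog` are hypothetical names for illustration; only the `get(analysis_id) -> list[str]` shape comes from the diff.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Source:
    source_id: str


@dataclass(frozen=True)
class Catalog:
    sources: tuple[Source, ...]


class FakeBindingStore:
    """In-memory stand-in with the same `get(analysis_id) -> list[str]` shape."""

    def __init__(self, rows: dict[str, list[str]]) -> None:
        self._rows = rows

    async def get(self, analysis_id: str) -> list[str]:
        return list(self._rows.get(analysis_id, []))


async def scoped_catalog(
    store: FakeBindingStore, analysis_id: str | None, catalog: Catalog
) -> Catalog:
    """Restrict the catalog to bound sources, falling back to the whole catalog."""
    bound = set(await store.get(analysis_id)) if analysis_id else set()
    kept = tuple(s for s in catalog.sources if s.source_id in bound)
    # Fail-open: an empty or fully disjoint binding leaves the catalog untouched.
    return replace(catalog, sources=kept) if kept else catalog


catalog = Catalog((Source("a"), Source("b"), Source("c")))
store = FakeBindingStore({"room-1": ["a", "c"], "room-2": ["zzz"]})

scoped = asyncio.run(scoped_catalog(store, "room-1", catalog))  # bound to a, c
stale = asyncio.run(scoped_catalog(store, "room-2", catalog))   # disjoint binding
legacy = asyncio.run(scoped_catalog(store, None, catalog))      # no analysis_id
```

A fake with this shape could also be passed as `binding_store` when constructing the handler in tests, since that dependency is injectable.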
src/agents/chat_handler.py CHANGED
@@ -2,18 +2,22 @@

  End-to-end flow per user message:

- 1. `IntentRouter.classify` → `chat` / `unstructured` / `structured`.
- 2. Route:
-    - `chat` → no context. Pass straight to ChatbotAgent.
-    - `structured` → CatalogReader → QueryService → QueryResult.
-    - `unstructured` → DocumentRetriever (placeholder, raises until TAB
-      ships) → list[DocumentChunk].
+ 1. `OrchestratorAgent.classify` → RouterDecision (one of six intents).
+ 2. Route by intent:
+    - `chat` → no context. Pass straight to ChatbotAgent.
+    - `structured_flow` → CatalogReader → slow path / QueryService.
+    - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
+      list[DocumentChunk].
+    - `check` → check_data / check_knowledge tool → rendered table.
+    - `problem_statement` → PS skill: draft + validate → write analysis state.
+    - `help` → Help skill: analysis state + history → streamed guidance.
  3. `ChatbotAgent.astream` → yield text tokens.
  4. Wrap each step into an SSE-style event dict so the API endpoint can
     stream them as Server-Sent Events.

- Phase 1's chat endpoint (`src/api/v1/chat.py`) is intentionally NOT touched
- in this PR. PR7 cleanup will rewire it to call `ChatHandler.handle(...)`.
+ The chat endpoint (`src/api/v1/chat.py`) calls `ChatHandler.handle(...)` per
+ request, behind two endpoint-level pre-filters: a greeting/farewell
+ short-circuit and a Redis response cache (both skip the LLM on a hit).

  All dependencies are injectable for tests. Default constructors lazy-build
  production deps (no `Settings()` triggered at import time as long as you
@@ -33,12 +37,16 @@ from src.middlewares.logging import get_logger
  from src.retrieval.base import RetrievalResult

  from .chatbot import ChatbotAgent, DocumentChunk
+ from .handlers.check import run_check
+ from .handlers.help import HelpAgent
+ from .handlers.problem_statement import ProblemStatementAgent, run_problem_statement
  from .orchestration import OrchestratorAgent

  if TYPE_CHECKING:
      from ..catalog.reader import CatalogReader
      from ..query.service import QueryService
      from ..retrieval.router import RetrievalRouter
+     from .gate import AnalysisState
      from .slow_path.coordinator import SlowPathCoordinator
      from .slow_path.store import AnalysisStore

@@ -71,6 +79,12 @@ class ChatHandler:
              Callable[[str], SlowPathCoordinator] | None
          ) = None,
          analysis_store: AnalysisStore | None = None,
+         check_invoker_factory: Callable[[str], Any] | None = None,
+         ps_agent: ProblemStatementAgent | None = None,
+         help_agent: HelpAgent | None = None,
+         state_store: Any | None = None,
+         binding_store: Any | None = None,
+         enable_gate: bool = False,
          enable_tracing: bool = False,
      ) -> None:
          self._intent_router = intent_router
@@ -88,6 +102,21 @@ class ChatHandler:
          self._enable_slow_path = enable_slow_path
          self._slow_path_factory = slow_path_coordinator_factory
          self._analysis_store = analysis_store
+         # `check` skill: builds the data-access invoker (check_data/check_knowledge)
+         # per request with the authenticated user_id. Injectable for tests.
+         self._check_invoker_factory = check_invoker_factory
+         # `problem_statement` skill: LLM drafter + the Analysis State store it writes
+         # `problem_validated` to. Both injectable for tests.
+         self._ps_agent = ps_agent
+         # `help` skill: LLM guide that reads the Analysis State + chat history.
+         self._help_agent = help_agent
+         self._state_store = state_store
+         # `#10` data-source binding: scopes structured_flow's catalog to the sources
+         # the analysis is bound to. Injectable for tests; fail-open when absent.
+         self._binding_store = binding_store
+         # Deterministic gate: redirect structured_flow -> problem_statement until the
+         # analysis is validated. OFF by default (legacy rooms have no state row).
+         self._enable_gate = enable_gate

      # ------------------------------------------------------------------
      # Lazy default-dep builders
@@ -125,6 +154,71 @@ class ChatHandler:
              self._document_retriever = RetrievalRouter()
          return self._document_retriever

+     def _get_check_invoker(self, user_id: str) -> Any:
+         """Build the per-request data-access invoker for the `check` skill."""
+         if self._check_invoker_factory is not None:
+             return self._check_invoker_factory(user_id)
+         from ..tools.data_access import DataAccessToolInvoker
+
+         return DataAccessToolInvoker(user_id, self._get_catalog_reader())
+
+     def _get_ps_agent(self) -> ProblemStatementAgent:
+         if self._ps_agent is None:
+             self._ps_agent = ProblemStatementAgent()
+         return self._ps_agent
+
+     def _get_help_agent(self) -> HelpAgent:
+         if self._help_agent is None:
+             self._help_agent = HelpAgent()
+         return self._help_agent
+
+     def _get_state_store(self) -> Any:
+         if self._state_store is None:
+             from .state_store import AnalysisStateStore
+
+             self._state_store = AnalysisStateStore()
+         return self._state_store
+
+     def _get_binding_store(self) -> Any:
+         if self._binding_store is None:
+             from .binding_store import AnalysisDataSourceStore
+
+             self._binding_store = AnalysisDataSourceStore()
+         return self._binding_store
+
+     async def _bound_source_ids(self, analysis_id: str | None) -> set[str]:
+         """#10: the catalog source_ids this analysis is bound to (empty = unscoped).
+
+         Fail-open: no analysis_id, no binding rows (legacy room / FE not sending
+         ids), or a read error → empty set, which the caller treats as "whole
+         catalog". Used to build a `_ScopedCatalogReader` so the Planner AND the
+         data-access tools (which re-read the catalog themselves) see the same scope.
+         """
+         if not analysis_id:
+             return set()
+         try:
+             return set(await self._get_binding_store().get(analysis_id))
+         except Exception as e:  # noqa: BLE001 - never block the query on this
+             logger.warning("binding read failed - unscoped", analysis_id=analysis_id, error=str(e))
+             return set()
+
+     async def _load_analysis_state(self, analysis_id: str | None) -> AnalysisState:
+         """Load Analysis State for the Help skill; fail closed to a not-validated stub.
+
+         Mirrors the gate's never-throw fallback so Help degrades gracefully on a
+         missing row, a read error, or a legacy room with no `analysis_id`.
+         """
+         from .gate import stub_analysis_state
+
+         if not analysis_id:
+             return stub_analysis_state(problem_validated=False)
+         try:
+             state = await self._get_state_store().get(analysis_id)
+         except Exception as e:
+             logger.warning("help state read failed - not-validated", error=str(e))
+             state = None
+         return state if state is not None else stub_analysis_state(problem_validated=False)
+
      # ------------------------------------------------------------------
      # Public entry
      # ------------------------------------------------------------------
@@ -134,6 +228,7 @@ class ChatHandler:
          message: str,
          user_id: str,
          history: list[BaseMessage] | None = None,
+         analysis_id: str | None = None,
      ) -> AsyncIterator[dict[str, Any]]:
          tracer = self._make_tracer(user_id, message)

@@ -147,7 +242,39 @@ class ChatHandler:
              yield {"event": "error", "data": f"Could not classify message: {e}"}
              return

-         yield {"event": "intent", "data": decision.model_dump_json()}
+         intent = decision.intent
+         # ---- 1a. Ensure session state row (T-A) ----------------------
+         # Rooms created via /room/create have no `analysis_states` row. Without one
+         # the gate redirect-loops and problem_statement / report_id writes silently
+         # no-op. Lazily get-or-create it (idempotent) so any session is gate-ready.
+         analysis_state: AnalysisState | None = None
+         if analysis_id:
+             try:
+                 analysis_state = await self._get_state_store().ensure(analysis_id, user_id)
+             except Exception as e:
+                 logger.warning(
+                     "analysis state ensure failed", analysis_id=analysis_id, error=str(e)
+                 )
+
+         # ---- 1b. Gate (deterministic, post-router) -------------------
+         # Redirect structured_flow -> problem_statement until the analysis is
+         # validated. Fails closed (not-validated) when the state row is unavailable.
+         if self._enable_gate and analysis_id:
+             from .gate import gate, stub_analysis_state
+
+             intent = gate(
+                 intent,
+                 analysis_state
+                 if analysis_state is not None
+                 else stub_analysis_state(problem_validated=False),
+             )
+
+         # The `intent` event is consumed by the endpoint (it gates response caching
+         # on the effective intent) and is NOT forwarded to the frontend. We emit the
+         # post-gate intent so the cache keys on what actually ran.
+         event_data = decision.model_dump()
+         event_data["intent"] = intent
+         yield {"event": "intent", "data": json.dumps(event_data)}

          rewritten = decision.rewritten_query or message
          query_result = None
@@ -155,7 +282,7 @@ class ChatHandler:
          raw_chunks: Any = None

          # ---- 2. Route ------------------------------------------------
-         if decision.source_hint == "structured":
+         if intent == "structured_flow":
              try:
                  # One memoizing reader per request: the same catalog is otherwise
                  # re-fetched from the catalog DB 4-5x across the slow-path run. This
@@ -164,10 +291,15 @@ class ChatHandler:
                  from ..catalog.reader import MemoizingCatalogReader

                  req_reader = MemoizingCatalogReader(self._get_catalog_reader())
-                 catalog = await req_reader.read(user_id, "structured")
+                 # #10: scope every catalog read - the Planner's AND the data-access
+                 # tools' own re-reads - to the analysis's bound sources, so binding
+                 # is a boundary, not just a planner hint (T-B). Fail-open (T-C).
+                 bound = await self._bound_source_ids(analysis_id)
+                 reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader
+                 catalog = await reader.read(user_id, "structured")
                  if self._enable_slow_path:
                      async for event in self._run_slow_path(
-                         user_id, rewritten, catalog, tracer, req_reader
+                         user_id, rewritten, catalog, tracer, reader, analysis_id
                      ):
                          yield event
                      return
@@ -182,32 +314,88 @@ class ChatHandler:
                  )
                  yield {"event": "error", "data": f"Structured query failed: {e}"}
                  return
-         elif decision.source_hint == "unstructured":
+         elif intent == "unstructured_flow":
              try:
                  raw_chunks = await self._get_document_retriever().retrieve(
                      rewritten, user_id
                  )
                  chunks = _normalize_chunks(raw_chunks)
-             except NotImplementedError:
-                 logger.warning("DocumentRetriever placeholder hit", user_id=user_id)
-                 yield {
-                     "event": "error",
-                     "data": "Document retrieval is not yet available - pending implementation.",
-                 }
-                 return
              except Exception as e:
                  logger.error(
                      "unstructured route failed", user_id=user_id, error=str(e)
                  )
                  yield {"event": "error", "data": f"Document retrieval failed: {e}"}
                  return
+         elif intent == "check":
+             try:
+                 invoker = self._get_check_invoker(user_id)
+                 text = await run_check(rewritten, invoker)
+             except Exception as e:
+                 logger.error("check route failed", user_id=user_id, error=str(e))
+                 yield {"event": "error", "data": f"Lookup failed: {e}"}
+                 return
+             yield {"event": "chunk", "data": text}
+             yield {"event": "done", "data": ""}
+             return
+         elif intent == "problem_statement":
+             try:
+                 text = await run_problem_statement(
+                     message,
+                     analysis_id,
+                     agent=self._get_ps_agent(),
+                     store=self._get_state_store(),
+                     history=history,
+                 )
+             except Exception as e:
+                 logger.error("problem_statement route failed", user_id=user_id, error=str(e))
+                 yield {"event": "error", "data": f"Problem statement failed: {e}"}
+                 return
+             yield {"event": "chunk", "data": text}
+             yield {"event": "done", "data": ""}
+             return
+         elif intent == "help":
+             try:
+                 state = analysis_state or await self._load_analysis_state(analysis_id)
+             except Exception as e:
+                 logger.error("help route failed", user_id=user_id, error=str(e))
+                 yield {"event": "error", "data": f"Help failed: {e}"}
+                 return
+             # report_ready (seam #5): deterministic - validated goal + ≥1 recorded
+             # analysis (mirrors the report API's own 409 gate). Never-throws (fails
+             # closed to not-ready), so Help degrades safely. The consistency guard in
+             # HelpAgent only offers `generate_report` when this says ready.
+             from .report.readiness import is_report_ready
+
+             report_ready = await is_report_ready(analysis_id, state)
+             # The prompt sees chat history -> masked.
+             hc = tracer.callbacks(masked=True)
+             hkw = {"callbacks": hc} if hc else {}
+             try:
+                 async for token in self._get_help_agent().astream(
+                     state,
+                     history=history,
+                     message=message,
+                     report_ready=report_ready,
+                     **hkw,
+                 ):
+                     yield {"event": "chunk", "data": token}
+             except Exception as e:
+                 logger.error("help streaming failed", user_id=user_id, error=str(e))
+                 yield {"event": "error", "data": f"Help generation failed: {e}"}
+                 return
+             tracer.end()
+             yield {"event": "done", "data": ""}
+             return
          # else: chat path - no context

          # ---- 2b. Emit sources ---------------------------------------
-         sources = _build_sources(
-             decision.source_hint, user_id, query_result, raw_chunks
+         sources = _build_sources(intent, user_id, query_result, raw_chunks)
+         logger.info(
+             "built sources",
+             intent=intent,
+             sources_count=len(sources),
+             raw_chunks_count=len(raw_chunks) if raw_chunks else 0,
          )
-         logger.info("built sources", source_hint=decision.source_hint, sources_count=len(sources), raw_chunks_count=len(raw_chunks) if raw_chunks else 0)
          yield {"event": "sources", "data": json.dumps(sources)}

          # ---- 3. Stream answer ----------------------------------------
@@ -282,9 +470,9 @@ class ChatHandler:

      def _get_analysis_store(self) -> AnalysisStore:
          if self._analysis_store is None:
-             from .slow_path.store import NullAnalysisStore
+             from .slow_path.store import PostgresAnalysisStore

-             self._analysis_store = NullAnalysisStore()
+             self._analysis_store = PostgresAnalysisStore()
          return self._analysis_store

      async def _run_slow_path(
@@ -294,11 +482,13 @@ class ChatHandler:
          catalog: Any,
          tracer: Any = None,
          catalog_reader: CatalogReader | None = None,
+         analysis_id: str | None = None,
      ) -> AsyncIterator[dict[str, Any]]:
          """Run the slow path and stream its assembled answer as SSE events.

          Context comes from the `get_business_context` seam (a stub today); the
-         `analysis_record` is persisted via the `AnalysisStore` seam (a no-op today).
+         `analysis_record` is persisted via the `AnalysisStore` seam (PostgresAnalysisStore),
+         stamped with the request's user_id + analysis_id so the report can group it.
          `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
          object - true token streaming is a later step).
          """
@@ -368,26 +558,58 @@ class ChatHandler:
          yield {"event": "sources", "data": json.dumps([])}  # TODO: derive from record
          yield {"event": "chunk", "data": result.chat_answer}
          try:
-             await self._get_analysis_store().save(result.analysis_record)
+             # Stamp identity from the request scope: owner + the shared session id
+             # (analysis_id == room_id). Without analysis_id the record is orphaned -
+             # list_for_analysis can't find it, so the report + is_report_ready go
+             # blind. The store is never-throw.
+             record = result.analysis_record.model_copy(
+                 update={"user_id": user_id, "analysis_id": analysis_id}
+             )
+             await self._get_analysis_store().save(record)
          except Exception as e:  # persistence must never break the user's answer
              logger.error("analysis_record persist failed", user_id=user_id, error=str(e))
          tracer.end()  # output omitted (chat_answer may contain PII on Cloud)
          yield {"event": "done", "data": ""}


+ class _ScopedCatalogReader:
+     """Wraps a CatalogReader, restricting `structured` reads to an analysis's bound
+     sources (#10).
+
+     Scoping lives here - not at a single call site - so the Planner AND the
+     data-access tools (which re-read the catalog themselves) see the same scoped
+     view; otherwise binding is only a hint to the Planner while the executor runs
+     against the full catalog. Fail-open: an empty or fully-disjoint binding yields
+     the whole catalog, so a stale / cross-source binding degrades instead of
+     emptying the catalog. Only `structured` reads are scoped (all #10 binds today);
+     `unstructured` / retrieval reads pass through.
+     """
+
+     def __init__(self, inner: Any, bound: set[str]) -> None:
+         self._inner = inner
+         self._bound = bound
+
+     async def read(self, user_id: str, source_hint: str) -> Any:
+         catalog = await self._inner.read(user_id, source_hint)
+         if not self._bound or source_hint != "structured":
+             return catalog
+         scoped = [s for s in catalog.sources if s.source_id in self._bound]
+         return catalog.model_copy(update={"sources": scoped or catalog.sources})
+
+
  def _build_sources(
-     source_hint: str,
+     intent: str,
      user_id: str,
      query_result: Any,
      raw_chunks: Any,
  ) -> list[dict[str, Any]]:
      """Build the sources payload for the SSE `sources` event.

-     - structured: one entry per executed table (table_name only).
-     - unstructured: deduped by (document_id, page_label), Phase 1 shape.
+     - structured_flow: one entry per executed table (table_name only).
+     - unstructured_flow: deduped by (document_id, page_label), Phase 1 shape.
      - chat or error: empty list.
      """
-     if source_hint == "structured":
+     if intent == "structured_flow":
          if query_result is None or getattr(query_result, "error", None):
              return []
          table_name = getattr(query_result, "table_name", "") or ""
@@ -399,7 +621,7 @@ def _build_sources(
              "page_label": None,
          }]

-     if source_hint == "unstructured" and raw_chunks:
+     if intent == "unstructured_flow" and raw_chunks:
          seen: set[tuple[Any, Any]] = set()
          sources: list[dict[str, Any]] = []
          for item in raw_chunks:
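Review note: the post-router steps in `ChatHandler.handle` (1a ensure state, 1b gate) reduce to a small pure decision once the state row has been fetched. The sketch below restates that decision so the three behaviours are easy to check: gate off, gate on with a missing state row (fails closed), and gate on with a validated analysis. `State` and `effective_intent` are hypothetical names, and the `Intent` literal is assumed from the six intents listed in the module docstring; the real `Intent` type lives in `orchestration.py`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Optional

# Assumed from the docstring's six routes; not copied from orchestration.py.
Intent = Literal[
    "chat", "structured_flow", "unstructured_flow", "check", "problem_statement", "help"
]


@dataclass
class State:
    """Stand-in for AnalysisState: only the field the gate reads."""

    problem_validated: bool = False


def effective_intent(
    intent: Intent,
    state: Optional[State],
    *,
    enable_gate: bool,
    analysis_id: Optional[str],
) -> Intent:
    """Return the intent that actually runs after the deterministic gate."""
    # Gate is skipped entirely when disabled or when the room has no analysis_id.
    if not (enable_gate and analysis_id):
        return intent
    # Fail closed: a missing state row is treated as not validated.
    if state is None:
        state = State(problem_validated=False)
    if intent == "structured_flow" and not state.problem_validated:
        return "problem_statement"
    return intent


gate_off = effective_intent("structured_flow", None, enable_gate=False, analysis_id="a1")
no_state = effective_intent("structured_flow", None, enable_gate=True, analysis_id="a1")
validated = effective_intent(
    "structured_flow", State(problem_validated=True), enable_gate=True, analysis_id="a1"
)
other = effective_intent("check", State(), enable_gate=True, analysis_id="a1")
```

Note the asymmetry with data-source binding in the same handler: binding fails open (whole catalog), while the gate fails closed (redirect to `problem_statement`).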
src/agents/gate.py ADDED
@@ -0,0 +1,108 @@
+ """Deterministic routing gate - policy check over the router's intent.
+
+ After the LLM router picks an intent, the gate checks it against the per-analysis
+ Analysis State and returns the **effective** intent: allow as-is, or redirect. No
+ LLM, no I/O in `gate()` itself.
+
+ Only one rule has teeth in v1: an analytical request (`structured_flow`) requires a
+ validated problem statement (`problem_validated is True`); otherwise it is
+ redirected to `problem_statement` so the user defines the goal first. Everything
+ else passes through. `generate_report` is not a router intent (button / report
+ API), so it is not gated here.
+
+ `AnalysisState` is the locked 8-field contract (mirrors the `analysis_states` DB
+ table). `get_analysis_state` reads the real per-analysis row via `AnalysisStateStore`
+ (#9, landed); it fails closed to a not-validated stub on a missing row or read error.
+ See `ORCHESTRATOR_REWORK_PLAN.md` §4.
+ """
+
+ from __future__ import annotations
+
+ from datetime import UTC, datetime
+
+ from pydantic import BaseModel
+
+ from src.agents.orchestration import Intent
+ from src.middlewares.logging import get_logger
+
+ logger = get_logger("gate")
+
+
+ class AnalysisState(BaseModel):
+     """Per-analysis state the gate + Help skill read every turn (locked contract).
+
+     `problem_validated` is the gate driver; `report_id` is null until a report
+     exists. Field names mirror the `analysis_states` table so the DB read swaps in
+     without touching readers.
+     """
+
+     id: str
+     analysis_title: str
+     problem_statement: str
+     problem_validated: bool = False
+     owner_id: str
+     report_id: str | None = None
+     created_at: datetime
+     updated_at: datetime
+
+
+ def gate(intent: Intent, state: AnalysisState) -> Intent:
+     """Return the effective intent after applying the deterministic gate policy.
+
+     `structured_flow` requires `problem_validated is True`; otherwise redirect to
+     `problem_statement`. All other intents pass through unchanged.
+     """
+     if intent == "structured_flow" and not state.problem_validated:
+         logger.info(
+             "gate redirect",
+             requested=intent,
+             effective="problem_statement",
+             reason="problem_not_validated",
+         )
+         return "problem_statement"
+     return intent
+
+
+ def stub_analysis_state(*, problem_validated: bool = False) -> AnalysisState:
+     """Hardcoded Analysis State for building/testing before the DB lands (#9).
+
+     Shared fixture so the gate, the Help skill, and tests all exercise the same
+     shape. `problem_validated=True` simulates a passed interview.
+     """
+     now = datetime.now(UTC)
+     return AnalysisState(
+         id="stub-analysis",
+         analysis_title="Stub analysis",
+         problem_statement="Stub problem statement" if problem_validated else "",
+         problem_validated=problem_validated,
+         owner_id="stub-user",
+         report_id=None,
+         created_at=now,
+         updated_at=now,
+     )
+
+
+ async def get_analysis_state(analysis_id: str) -> AnalysisState:
+     """Load the Analysis State for an analysis (shared id with the chat room).
+
+     Reads the `analysis_states` row via `AnalysisStateStore`. Never-throw seam: a
+     missing row (e.g. a legacy room created before this table) or a read failure
+     degrades to a **not-validated** stub, so the gate fails closed (→ steer to
+     `problem_statement`) rather than running ungated analysis. The store import is
+     lazy so this module stays import-safe without a DB.
+     """
+     try:
+         from src.agents.state_store import AnalysisStateStore
+
+         state = await AnalysisStateStore().get(analysis_id)
+     except Exception as exc:  # noqa: BLE001 - never-throw; fail closed to not-validated
+         logger.warning(
+             "get_analysis_state read failed - default not-validated",
+             analysis_id=analysis_id,
+             error=str(exc),
+         )
+         return stub_analysis_state(problem_validated=False)
105
+ if state is None:
106
+ logger.debug("analysis_state missing β€” default not-validated", analysis_id=analysis_id)
107
+ return stub_analysis_state(problem_validated=False)
108
+ return state
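Reviewer sketch (not part of the PR): the gate rule and the fail-closed state read in `gate.py` can be exercised without pydantic or a database. The version below is a minimal stand-in under stated assumptions: `StateSketch`, `FailingStore`, `EmptyStore`, `load_state_sketch`, and `effective_intent` are hypothetical names invented for illustration, and logging is omitted.

```python
import asyncio
from dataclasses import dataclass
from typing import Literal, Optional

Intent = Literal[
    "chat", "help", "problem_statement", "check",
    "unstructured_flow", "structured_flow",
]


@dataclass
class StateSketch:
    """Stand-in for AnalysisState: only the field the gate reads."""
    problem_validated: bool = False


def gate_sketch(intent: Intent, state: StateSketch) -> Intent:
    """Redirect analytical requests until the problem statement is validated."""
    if intent == "structured_flow" and not state.problem_validated:
        return "problem_statement"
    return intent


class FailingStore:
    """Hypothetical store whose read always raises (simulates a DB outage)."""
    async def get(self, analysis_id: str) -> Optional[StateSketch]:
        raise RuntimeError("db unavailable")


class EmptyStore:
    """Hypothetical store with no row for the analysis (e.g. a legacy room)."""
    async def get(self, analysis_id: str) -> Optional[StateSketch]:
        return None


async def load_state_sketch(store, analysis_id: str) -> StateSketch:
    """Never raise: a read failure or a missing row yields a not-validated state."""
    try:
        state = await store.get(analysis_id)
    except Exception:
        return StateSketch(problem_validated=False)
    return state if state is not None else StateSketch(problem_validated=False)


async def effective_intent(store, analysis_id: str, intent: Intent) -> Intent:
    """Load state (fail closed), then apply the gate."""
    return gate_sketch(intent, await load_state_sketch(store, analysis_id))
```

With either hypothetical store, a `structured_flow` request is steered to `problem_statement`, while every other intent passes through unchanged. That is the fail-closed behaviour the module docstring describes.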
src/agents/handlers/__init__.py ADDED
@@ -0,0 +1 @@
+"""Deterministic skill handlers dispatched by the orchestrator (non-LLM)."""
src/agents/handlers/check.py ADDED
@@ -0,0 +1,165 @@
+"""`check` skill handler - deterministic data/document inventory (no LLM).
+
+The router emits a single `check` intent; this handler picks the concrete tool
+(`check_data` for structured sources, `check_knowledge` for documents) and renders
+the tool's `ToolOutput` table into a markdown reply. Broad queries with no
+specific cue call both tools concurrently and stitch a helicopter-view inventory.
+See `ORCHESTRATOR_REWORK_PLAN.md` §2.
+
+The data-access invoker never throws (§8.4); `render_tool_output` handles the
+`error` envelope defensively.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import re
+from typing import TYPE_CHECKING
+
+from src.tools.contracts import ToolOutput
+
+if TYPE_CHECKING:
+    from src.agents.slow_path.invoker import ToolInvoker
+
+# Cues that point at documents rather than structured data.
+_KNOWLEDGE_CUES = (
+    "document",
+    "docs",
+    "doc ",
+    "file",
+    "pdf",
+    "docx",
+    ".txt",
+    "uploaded",
+    "knowledge",
+    "dokumen",
+)
+
+# Cues that point at structured/tabular data specifically.
+_DATA_CUES = (
+    "kolom",
+    "column",
+    "tabel",
+    "table",
+    "baris",
+    "row",
+    "schema",
+    "skema",
+    "database",
+    " db",
+)
+
+
+def _intent(message: str) -> str:
+    """Return 'knowledge', 'data', or 'both' (helicopter view) from keyword cues."""
+    lowered = message.lower()
+    is_knowledge = any(cue in lowered for cue in _KNOWLEDGE_CUES)
+    is_data = any(cue in lowered for cue in _DATA_CUES)
+    if is_knowledge and not is_data:
+        return "knowledge"
+    if is_data and not is_knowledge:
+        return "data"
+    return "both"
+
+
+def render_tool_output(out: ToolOutput) -> str:
+    """Render a `check_*` ToolOutput table into a markdown string, or '' if empty."""
+    if out.kind == "error":
+        return f"Sorry, I couldn't look that up: {out.error}"
+    columns = out.columns or []
+    rows = out.rows or []
+    if not rows:
+        return ""
+    header = "| " + " | ".join(columns) + " |"
+    separator = "| " + " | ".join("---" for _ in columns) + " |"
+    body = "\n".join(
+        "| " + " | ".join(str(cell) for cell in row) + " |" for row in rows
+    )
+    return f"{header}\n{separator}\n{body}"
+
+
+def _matched_source_ids(message: str, inventory: ToolOutput) -> list[str]:
+    """All source_ids whose name appears as a whole word in the message.
+
+    The user names sources in plain words ("sales", "kolom sales sama orders");
+    the tool needs exact `source_id`s. We resolve them against the inventory
+    rows (kind="table", columns include "source_id" + "name") instead of an LLM
+    - a cheap match against catalog metadata already in hand. Whole-word match
+    (`\\b`) avoids nuisance hits ("orders" inside "reorders") and treats `_` as
+    part of the word, so "sales" won't pick up "sales_archive". Multiple named
+    sources all match, so the caller can show each schema.
+    """
+    if inventory.kind != "table" or not inventory.rows:
+        return []
+    cols = inventory.columns or []
+    try:
+        id_idx = cols.index("source_id")
+        name_idx = cols.index("name")
+    except ValueError:
+        return []
+
+    matched: list[str] = []
+    for row in inventory.rows:
+        name = str(row[name_idx])
+        if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE):
+            matched.append(str(row[id_idx]))
+    return matched
+
+
+def _render_helicopter(data_out: ToolOutput, knowledge_out: ToolOutput) -> str:
+    """Stitch structured + document inventory into one helicopter-view reply."""
+    parts: list[str] = []
+
+    data_table = render_tool_output(data_out)
+    if data_table:
+        parts.append(f"**Data terstruktur**\n{data_table}")
+
+    knowledge_table = render_tool_output(knowledge_out)
+    if knowledge_table:
+        parts.append(f"**Dokumen**\n{knowledge_table}")
+
+    if not parts:
+        return "Nothing registered yet - I don't see any sources or documents."
+
+    return "\n\n".join(parts)
+
+
+async def run_check(message: str, invoker: ToolInvoker) -> str:
+    """Route to check_data, check_knowledge, or both (helicopter view) based on cues."""
+    intent = _intent(message)
+
+    _no_match = "Nothing registered yet - I don't see any matching sources."
+
+    if intent == "knowledge":
+        out = await invoker.invoke("check_knowledge", {})
+        return render_tool_output(out) or _no_match
+
+    if intent == "data":
+        inventory = await invoker.invoke("check_data", {})
+        if inventory.kind == "error":
+            return render_tool_output(inventory)
+        # Drill down to the schema of each source the user named; if they named
+        # none, return the source listing.
+        source_ids = _matched_source_ids(message, inventory)
+        if not source_ids:
+            return render_tool_output(inventory) or _no_match
+        schemas = await asyncio.gather(
+            *(invoker.invoke("check_data", {"source_id": sid}) for sid in source_ids)
+        )
+        if len(schemas) == 1:
+            return render_tool_output(schemas[0]) or _no_match
+        # Multiple named sources → one labelled section per source.
+        sections: list[str] = []
+        for out in schemas:
+            table = render_tool_output(out)
+            if table:
+                label = (out.meta or {}).get("source_name") or "source"
+                sections.append(f"**{label}**\n{table}")
+        return "\n\n".join(sections) or _no_match
+
+    # broad / ambiguous → helicopter view: call both concurrently
+    data_out, knowledge_out = await asyncio.gather(
+        invoker.invoke("check_data", {}),
+        invoker.invoke("check_knowledge", {}),
+    )
+    return _render_helicopter(data_out, knowledge_out)
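Reviewer sketch (not part of the PR): the two deterministic pieces of `check.py`, cue-based routing and whole-word source matching, are easy to try in isolation. The snippet below copies the cue tuples from the diff and uses a plain list of `(source_id, name)` pairs in place of a `ToolOutput`; `classify`, `matched_ids`, and the `SOURCES` catalog are hypothetical names made up for illustration.

```python
import re

KNOWLEDGE_CUES = (
    "document", "docs", "doc ", "file", "pdf", "docx",
    ".txt", "uploaded", "knowledge", "dokumen",
)
DATA_CUES = (
    "kolom", "column", "tabel", "table", "baris", "row",
    "schema", "skema", "database", " db",
)


def classify(message: str) -> str:
    """Substring cue match: 'knowledge', 'data', or 'both' when unclear."""
    lowered = message.lower()
    is_knowledge = any(cue in lowered for cue in KNOWLEDGE_CUES)
    is_data = any(cue in lowered for cue in DATA_CUES)
    if is_knowledge and not is_data:
        return "knowledge"
    if is_data and not is_knowledge:
        return "data"
    return "both"


def matched_ids(message: str, sources: list[tuple[str, str]]) -> list[str]:
    """Return ids of sources whose name occurs as a whole word in the message."""
    return [
        source_id
        for source_id, name in sources
        if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE)
    ]


# Hypothetical catalog rows: (source_id, name).
SOURCES = [
    ("src_sales", "sales"),
    ("src_orders", "orders"),
    ("src_archive", "sales_archive"),
]

print(classify("what columns are in sales"))            # cue "column" only
print(classify("browse my documents"))                  # "row" inside "browse"
print(matched_ids("kolom sales sama orders", SOURCES))
print(matched_ids("show sales_archive columns", SOURCES))
```

One property worth knowing when reading the cue lists: cues are matched as substrings, so a short cue such as `row` also fires inside words like "browse". In that case the message counts as both a data and a knowledge request and falls through to the helicopter view. Source names, by contrast, go through `\b` word boundaries, so "orders" does not match inside "reorders".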
src/agents/handlers/help.py ADDED
@@ -0,0 +1,192 @@
+"""`help` skill handler - state-aware next-step guidance (LLM call).
+
+Reads the per-analysis state + chat history (and a deterministic report-readiness
+signal) and tells the user where they are and what to do next. Help only guides;
+it never runs analysis or produces data answers.
+
+The prompt lives in `config/prompts/help.md` (the playbook); this module composes
+the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency
+guard** has teeth here, not just in the prompt: `_derive_available_actions` computes
+the actions actually allowed from the state (the same policy as `gate.py`), and that
+list is fed into the prompt - the LLM is told to suggest *only* those, so it can't
+tell the user to generate a report when the goal isn't validated or the analysis
+isn't ready.
+
+SEAMS:
+- `AnalysisState` is the locked 8-field contract from `gate.py` (KM-652). The gate,
+  this skill, and tests share `gate.stub_analysis_state(...)` so they exercise the
+  same shape.
+- `ReportReadiness` is the return shape of `is_report_ready(chat_history)` (seam #5,
+  Rifqi - not built yet). Help *consumes* it; it does not compute it. Until it lands,
+  the caller passes a stub (default: not ready).
+"""
+
+from __future__ import annotations
+
+from collections.abc import AsyncIterator
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from langchain_core.messages import BaseMessage
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+
+from src.agents.gate import AnalysisState
+from src.middlewares.logging import get_logger
+
+logger = get_logger("help")
+
+_PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompts"
+_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
+_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
+
+# Neutral human turn when Help is triggered by a slash command with no real content.
+_DEFAULT_TRIGGER = "What should I do next?"
+
+
+@dataclass
+class ReportReadiness:
+    """Deterministic report-readiness signal - the return of Rifqi's `is_report_ready`.
+
+    `missing` lists the gaps to fill when `ready` is False.
+    """
+
+    ready: bool = False
+    missing: list[str] = field(default_factory=list)
+
+
+def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]:
+    """Actions Help is allowed to suggest, derived from state (mirrors `gate.py`).
+
+    This is the consistency guard's teeth: analysis is gated behind a validated goal
+    (same rule the gate applies to `structured_flow`), and a report is only offered
+    when the readiness signal says so. Keep this policy in sync with `gate.gate`.
+    """
+    if not state.problem_validated:
+        # Goal not set → the only useful move is defining the problem statement.
+        return ["define_problem_statement"]
+
+    actions = ["ask_analysis_question", "refine_problem_statement"]
+    if report_ready.ready:
+        actions.append("generate_report")
+    return actions
+
+
+def _format_state(state: AnalysisState) -> str:
+    """Render the analysis state as a compact context block for the LLM."""
+    has_report = "yes" if state.report_id else "no"
+    return (
+        "[Analysis state]\n"
+        f"analysis_title: {state.analysis_title or '(none)'}\n"
+        f"problem_statement: {state.problem_statement or '(empty)'}\n"
+        f"problem_validated: {str(state.problem_validated).lower()}\n"
+        f"has_report: {has_report}"
+    )
+
+
+def _format_report_ready(report_ready: ReportReadiness) -> str:
+    missing = ", ".join(report_ready.missing) if report_ready.missing else "(none)"
+    return (
+        "[Report readiness]\n"
+        f"ready: {str(report_ready.ready).lower()}\n"
+        f"missing: {missing}"
+    )
+
+
+def _build_context_block(
+    state: AnalysisState,
+    report_ready: ReportReadiness,
+    available_actions: list[str],
+) -> str:
+    """Compose the deterministic context the prompt's 'never misguide' rule trusts."""
+    return "\n\n".join(
+        [
+            _format_state(state),
+            _format_report_ready(report_ready),
+            "[Available actions]\n" + ", ".join(available_actions),
+        ]
+    )
+
+
+def _load_system_prompt() -> str:
+    """Compose system prompt = help.md + guardrails.md (guardrails last, as elsewhere)."""
+    help_md = _SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")
+    guardrails = _GUARDRAILS_PATH.read_text(encoding="utf-8")
+    return f"{help_md}\n\n{guardrails}"
+
+
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0.3,
+        model_kwargs={"stream_options": {"include_usage": True}},
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_system_prompt()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            ("human", "{message}"),
+            ("system", "Analysis state and signals for this turn:\n\n{context}"),
+        ]
+    )
+    return prompt | llm | StrOutputParser()
+
+
+class HelpAgent:
+    """Streams state-aware guidance to the user.
+
+    `chain` is injectable: tests pass a fake that yields canned tokens. Default
+    constructs the production Azure OpenAI streaming chain on first use.
+    """
+
+    def __init__(self, chain: Runnable | None = None) -> None:
+        self._chain = chain
+
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+
+    async def astream(
+        self,
+        state: AnalysisState,
+        history: list[BaseMessage] | None = None,
+        report_ready: ReportReadiness | None = None,
+        message: str | None = None,
+        available_actions: list[str] | None = None,
+        callbacks: list | None = None,
+    ) -> AsyncIterator[str]:
+        """Stream tokens of the guidance reply.
+
+        `report_ready` defaults to "not ready" so a missing signal degrades safely.
+        `available_actions`, when omitted, is derived deterministically from state.
+        """
+        readiness = report_ready or ReportReadiness()
+        actions = available_actions or _derive_available_actions(state, readiness)
+        logger.info(
+            "help guidance",
+            problem_validated=state.problem_validated,
+            report_ready=readiness.ready,
+            available_actions=actions,
+        )
+
+        chain = self._ensure_chain()
+        payload: dict[str, Any] = {
+            "message": message or _DEFAULT_TRIGGER,
+            "history": history or [],
+            "context": _build_context_block(state, readiness, actions),
+        }
+        if callbacks:
+            async for token in chain.astream(payload, config={"callbacks": callbacks}):
+                yield token
+        else:
+            async for token in chain.astream(payload):
+                yield token
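Reviewer sketch (not part of the PR): the consistency guard in `help.py` comes down to a small pure function plus a text block handed to the prompt. The stand-in below shows how the allowed actions follow from state and readiness. It assumes trimmed-down dataclasses, and `StateSketch`, `ReadinessSketch`, `available_actions`, and `context_block` are hypothetical names that leave out fields irrelevant to the rule.

```python
from dataclasses import dataclass, field


@dataclass
class StateSketch:
    """Stand-in for AnalysisState: only the gate-driving flag."""
    problem_validated: bool = False


@dataclass
class ReadinessSketch:
    """Stand-in for ReportReadiness."""
    ready: bool = False
    missing: list[str] = field(default_factory=list)


def available_actions(state: StateSketch, readiness: ReadinessSketch) -> list[str]:
    """Only suggest what the gate would allow; a report needs readiness too."""
    if not state.problem_validated:
        return ["define_problem_statement"]
    actions = ["ask_analysis_question", "refine_problem_statement"]
    if readiness.ready:
        actions.append("generate_report")
    return actions


def context_block(state: StateSketch, readiness: ReadinessSketch) -> str:
    """Deterministic text handed to the LLM alongside the user's message."""
    missing = ", ".join(readiness.missing) or "(none)"
    return "\n\n".join(
        [
            f"[Analysis state]\nproblem_validated: {str(state.problem_validated).lower()}",
            f"[Report readiness]\nready: {str(readiness.ready).lower()}\nmissing: {missing}",
            "[Available actions]\n" + ", ".join(available_actions(state, readiness)),
        ]
    )


print(context_block(StateSketch(True), ReadinessSketch(False, ["findings"])))
```

Because the action list is computed before the LLM call, an unvalidated goal yields `define_problem_statement` even when the readiness signal is `ready=True`. That ordering is what keeps Help from contradicting the gate.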
src/agents/handlers/problem_statement.py ADDED
@@ -0,0 +1,171 @@
+"""Problem Statement skill - guide the user to a usable problem statement.
+
+Routed by the orchestrator (intent `problem_statement`) and callable as a skill.
+An LLM drafts/refines the statement from the analysis title + the user's message and
+declares what's still `missing`; a check validates only when nothing is missing. The
+model is instructed to fill `objective`/`metric` ONLY from what the user explicitly
+stated - a bare data question ("which X has the most Y?") leaves them in `missing`, so
+it does not auto-validate (the gate stays meaningful). On a valid draft it persists
+`problem_statement` + `problem_validated=True`; otherwise it streams guidance and
+leaves the analysis un-validated.
+
+NOTE: completeness is still a (hardened) LLM judgment - the truly deterministic gate
+is an explicit user confirmation, planned with the frontend (see T3b / #11).
+
+See `ORCHESTRATOR_REWORK_PLAN.md` §4 and the 2026-06-18 checkpoint.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+from langchain_core.messages import BaseMessage
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+from pydantic import BaseModel, Field
+
+from src.middlewares.logging import get_logger
+
+if TYPE_CHECKING:
+    from src.agents.state_store import AnalysisStateStore
+
+logger = get_logger("problem_statement")
+
+_PROMPT_PATH = (
+    Path(__file__).resolve().parent.parent.parent
+    / "config"
+    / "prompts"
+    / "problem_statement.md"
+)
+
+
+class ProblemStatementDraft(BaseModel):
+    """LLM output for the Problem Statement skill."""
+
+    problem_statement: str = Field(
+        ..., description="The refined, standalone problem statement (never empty)."
+    )
+    objective: str = Field(
+        "", description="What success looks like - fill ONLY when the user explicitly "
+        "stated it; never inferred from a data question. Empty otherwise."
+    )
+    metric: str = Field(
+        "", description="The KPI to move/investigate - fill ONLY when the user "
+        "explicitly stated it; never inferred from a data question. Empty otherwise."
+    )
+    missing: list[str] = Field(
+        default_factory=list,
+        description="Which of 'objective' / 'metric' the user has NOT explicitly stated "
+        "yet. A bare data question leaves both here. Empty list = complete.",
+    )
+    feedback: str = Field(
+        ...,
+        description="Message to the user - guidance if incomplete, confirmation if complete.",
+    )
+
+
+def is_valid(draft: ProblemStatementDraft) -> bool:
+    """Complete iff there's a statement and the model flagged nothing missing.
+
+    Keying on the model's explicit `missing` list (rather than 'are objective/metric
+    non-empty') is what stops a bare data question from auto-validating: the hardened
+    prompt puts the un-stated parts in `missing`, so this returns False for it.
+    """
+    return bool(draft.problem_statement.strip()) and not draft.missing
+
+
+def _load_prompt_text() -> str:
+    return _PROMPT_PATH.read_text(encoding="utf-8")
+
+
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0,
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_prompt_text()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            (
+                "human",
+                "Analysis title: {analysis_title}\n"
+                "Current problem statement: {current}\n\n"
+                "User message: {message}",
+            ),
+        ]
+    )
+    return prompt | llm.with_structured_output(ProblemStatementDraft)
+
+
+class ProblemStatementAgent:
+    """Single LLM call that drafts/refines a problem statement.
+
+    Inject `chain` for tests; the default builds the Azure OpenAI chain on first use.
+    """
+
+    def __init__(self, chain: Runnable | None = None) -> None:
+        self._chain = chain
+
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+
+    async def draft(
+        self,
+        message: str,
+        analysis_title: str,
+        current: str,
+        history: list[BaseMessage] | None = None,
+    ) -> ProblemStatementDraft:
+        chain = self._ensure_chain()
+        return await chain.ainvoke(
+            {
+                "message": message,
+                "analysis_title": analysis_title,
+                "current": current,
+                "history": history or [],
+            }
+        )
+
+
+async def run_problem_statement(
+    message: str,
+    analysis_id: str | None,
+    *,
+    agent: ProblemStatementAgent,
+    store: AnalysisStateStore,
+    history: list[BaseMessage] | None = None,
+) -> str:
+    """Draft + validate the problem statement; persist on a valid draft.
+
+    Loads the current title/statement (if the analysis exists), drafts a refinement,
+    runs the deterministic completeness check, and writes `problem_statement` +
+    `problem_validated` back. Returns the user-facing feedback. When `analysis_id` is
+    missing (e.g. a legacy room), it still drafts + returns guidance but cannot persist.
+    """
+    analysis_title, current = "New analysis", ""
+    if analysis_id:
+        state = await store.get(analysis_id)
+        if state is not None:
+            analysis_title, current = state.analysis_title, state.problem_statement
+
+    draft = await agent.draft(message, analysis_title, current, history)
+    validated = is_valid(draft)
+
+    if analysis_id:
+        await store.update(
+            analysis_id,
+            problem_statement=draft.problem_statement,
+            problem_validated=validated,
+        )
+    logger.info("problem_statement drafted", analysis_id=analysis_id, validated=validated)
+    return draft.feedback
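Reviewer sketch (not part of the PR): the validation step keys on the model's `missing` list, and the write-back happens whether or not the draft validates. The stand-in below replays that flow with canned drafts instead of an LLM call. `DraftSketch`, `InMemoryStore`, and `persist_sketch` are hypothetical names, and the store's `update(analysis_id, **fields)` shape is assumed from the call site in the diff rather than from `state_store.py`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DraftSketch:
    """Stand-in for ProblemStatementDraft (objective/metric omitted)."""
    problem_statement: str
    missing: list[str] = field(default_factory=list)
    feedback: str = ""


def is_valid_sketch(draft: DraftSketch) -> bool:
    """Complete only if a statement exists and nothing is flagged missing."""
    return bool(draft.problem_statement.strip()) and not draft.missing


class InMemoryStore:
    """Hypothetical store; mirrors the update(analysis_id, **fields) call site."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    async def update(self, analysis_id: str, **fields) -> None:
        self.rows.setdefault(analysis_id, {}).update(fields)


async def persist_sketch(
    draft: DraftSketch, analysis_id: Optional[str], store: InMemoryStore
) -> str:
    """Validate, write both fields back when an id exists, return the feedback."""
    validated = is_valid_sketch(draft)
    if analysis_id:
        await store.update(
            analysis_id,
            problem_statement=draft.problem_statement,
            problem_validated=validated,
        )
    return draft.feedback


store = InMemoryStore()
bare_question = DraftSketch(
    "Which region has the most orders?",
    missing=["objective", "metric"],
    feedback="What outcome are you trying to improve?",
)
print(asyncio.run(persist_sketch(bare_question, "a1", store)))
print(store.rows["a1"]["problem_validated"])
```

Note that the statement text is stored even for an incomplete draft, with `problem_validated` left at `False`, so a later turn can refine it while the gate keeps redirecting `structured_flow`.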
src/agents/orchestration.py CHANGED
@@ -1,13 +1,17 @@
-"""OrchestratorAgent - classifies a user message and emits source_hint.
+"""OrchestratorAgent - classifies a user message into one of six intents.
 
-Output: needs_search (bool) + source_hint ∈ { chat, unstructured, structured }
-        + rewritten_query (standalone form of the user's question, history-resolved).
+Output: RouterDecision { intent, rewritten_query, confidence }.
 
-Phase 2 replaces the previous intent-classification body. The class name
-is preserved so existing import sites (`from src.agents.orchestration
-import OrchestratorAgent`) keep working. The default LLM chain is
-constructed lazily so the module is import-safe even without `.env`
-populated.
+The router is a **handler-level** intent classifier, not a data-modality
+classifier: `structured_flow` routes to the slow Planner spine and
+`unstructured_flow` to the fast RAG path; the structured/unstructured data mix on
+the slow path is the Planner's job, not the router's. See
+`ORCHESTRATOR_REWORK_PLAN.md`.
+
+The class name `OrchestratorAgent` is preserved so existing import sites
+(`from src.agents.orchestration import OrchestratorAgent`) keep working. The
+default LLM chain is built lazily so the module is import-safe even without
+`.env` populated.
 """
 
 from __future__ import annotations
@@ -25,7 +29,14 @@ from src.middlewares.logging import get_logger
 
 logger = get_logger("orchestrator")
 
-SourceHint = Literal["chat", "unstructured", "structured"]
+Intent = Literal[
+    "chat",
+    "help",
+    "problem_statement",
+    "check",
+    "unstructured_flow",
+    "structured_flow",
+]
 
 _PROMPT_PATH = (
     Path(__file__).resolve().parent.parent
@@ -35,21 +46,29 @@ _PROMPT_PATH = (
 )
 
 
-class IntentRouterDecision(BaseModel):
+class RouterDecision(BaseModel):
     """LLM output. Pydantic so it can be used with `with_structured_output`."""
 
-    needs_search: bool = Field(
-        ..., description="True if we must look at the user's data to answer."
-    )
-    source_hint: SourceHint = Field(
+    intent: Intent = Field(
         ...,
-        description="Which downstream path: 'chat' (no lookup), "
-        "'unstructured' (PDF/DOCX/TXT prose), 'structured' (DB / tabular file).",
+        description=(
+            "Handler route for this message: 'chat' (conversational, no data), "
+            "'help' (what-to-do-next guidance), 'problem_statement' (define or "
+            "refine the analysis goal), 'check' (inventory: what data/documents "
+            "exist), 'unstructured_flow' (answer from documents, fast RAG), or "
+            "'structured_flow' (analytical question over data, slow Planner path)."
+        ),
     )
     rewritten_query: str | None = Field(
         None,
-        description="Standalone version of the question, history-resolved. "
-        "Null when needs_search=false.",
+        description=(
+            "Standalone version of the question, history-resolved. Null for "
+            "'chat' and 'help' (no data lookup needed)."
+        ),
+    )
+    confidence: float | None = Field(
+        None,
+        description="Classifier confidence in [0, 1]. Optional.",
     )
 
 
@@ -74,11 +93,11 @@ def _build_default_chain() -> Runnable:
             ("human", "{message}"),
         ]
     )
-    return prompt | llm.with_structured_output(IntentRouterDecision)
+    return prompt | llm.with_structured_output(RouterDecision)
 
 
 class OrchestratorAgent:
-    """Classifies a user message into chat / unstructured / structured.
+    """Classifies a user message into one of the six router intents.
 
     Inject `structured_chain` for tests; default builds the production
     Azure OpenAI chain on first use.
@@ -97,18 +116,14 @@ class OrchestratorAgent:
         message: str,
         history: list[BaseMessage] | None = None,
         callbacks: list | None = None,
-    ) -> IntentRouterDecision:
+    ) -> RouterDecision:
         chain = self._ensure_chain()
         payload = {"message": message, "history": history or []}
         if callbacks:
-            decision: IntentRouterDecision = await chain.ainvoke(
+            decision: RouterDecision = await chain.ainvoke(
                 payload, config={"callbacks": callbacks}
             )
         else:
             decision = await chain.ainvoke(payload)
-        logger.info(
-            "intent classified",
-            source_hint=decision.source_hint,
-            needs_search=decision.needs_search,
-        )
+        logger.info("intent classified", intent=decision.intent)
         return decision
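Reviewer sketch (not part of the PR): the new `RouterDecision` contract is six allowed intents, an optional rewritten query that is null for `chat` and `help`, and an optional confidence in [0, 1]. The snippet below checks a raw dict against that contract using only the standard library. `DecisionSketch` and `parse_decision` are hypothetical names, and both the explicit range check and the dropping of `rewritten_query` for `chat`/`help` are assumptions of this sketch; the diff states those constraints only in the field descriptions.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

Intent = Literal[
    "chat", "help", "problem_statement", "check",
    "unstructured_flow", "structured_flow",
]

# Intents the field description lists as needing no data lookup.
NO_LOOKUP = {"chat", "help"}


@dataclass
class DecisionSketch:
    """Stand-in for RouterDecision."""
    intent: str
    rewritten_query: Optional[str] = None
    confidence: Optional[float] = None


def parse_decision(raw: dict) -> DecisionSketch:
    """Validate a raw router output against the six-intent contract."""
    intent = raw.get("intent")
    if intent not in get_args(Intent):
        raise ValueError(f"unknown intent: {intent!r}")
    confidence = raw.get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    # Sketch-only choice: discard a rewritten query for no-lookup intents.
    query = None if intent in NO_LOOKUP else raw.get("rewritten_query")
    return DecisionSketch(intent, query, confidence)


decision = parse_decision(
    {"intent": "check", "rewritten_query": "list my tables", "confidence": 0.9}
)
print(decision)
```

An old-style value such as `"structured"` (from the removed `SourceHint` literal) is rejected here, which gives a quick way to spot callers that still send the previous vocabulary.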
src/agents/planner/examples.py CHANGED
@@ -2,7 +2,7 @@
2
 
3
  Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
4
  stages, dependency edges, ordered tool-call chains, inline QueryIR,
5
- "${t<id>}" placeholders, and the assumed data-flow convention - `query_structured`
6
  pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
7
  referencing the upstream result's column aliases (Pattern A; the tool team may
8
  instead pick self-fetch by `source_id`, in which case these examples are reshaped
@@ -21,9 +21,8 @@ from .schemas import Task, TaskList, ToolCall
21
  # --------------------------------------------------------------------------- #
22
  # Example A - exploratory, no modeling.
23
  # "Which product categories drove last quarter's revenue?"
24
- # Shows: query_structured pulls rows -> analyze_contribution computes each
25
- # category's share of the total in one call (no manual per-category + total
26
- # queries).
27
  # --------------------------------------------------------------------------- #
28
 
29
  _EXAMPLE_A = TaskList(
@@ -36,7 +35,7 @@ _EXAMPLE_A = TaskList(
36
  id="t1",
37
  stage="data_understanding",
38
  objective="Confirm the sales source exposes category, revenue, and order date.",
39
- tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
40
  expected_output="source_shape",
41
  success_criteria="Produced the orders table schema; the 3 needed columns are present.",
42
  depends_on=[],
@@ -48,7 +47,7 @@ _EXAMPLE_A = TaskList(
48
  objective="Pull last quarter's order-level category and revenue rows.",
49
  tool_calls=[
50
  ToolCall(
51
- tool="query_structured",
52
  args={
53
  "ir": {
54
  "source_id": "src_sales",
@@ -78,20 +77,19 @@ _EXAMPLE_A = TaskList(
78
  Task(
79
  id="t3",
80
  stage="evaluation",
81
- objective="Rank each category's revenue share of the quarter total.",
82
  tool_calls=[
83
  ToolCall(
84
- tool="analyze_contribution",
85
  args={
86
  "data": "${t2}",
87
- "dimension": "category",
88
- "value_column": "revenue",
89
- "agg": "sum",
90
  },
91
  )
92
  ],
93
- expected_output="category_contribution",
94
- success_criteria="Produced each category's revenue share, ranked high to low.",
95
  depends_on=["t2"],
96
  estimated_cost="low",
97
  ),
@@ -113,7 +111,7 @@ _EXAMPLE_B = TaskList(
113
  id="t1",
114
  stage="data_understanding",
115
  objective="Confirm the sales source exposes order date, revenue, and region.",
116
- tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
117
  expected_output="source_shape",
118
  success_criteria="Produced the orders table schema; the needed columns are present.",
119
  depends_on=[],
@@ -125,7 +123,7 @@ _EXAMPLE_B = TaskList(
125
  objective="Pull this year's order dates, revenue, and region.",
126
  tool_calls=[
127
  ToolCall(
128
- tool="query_structured",
129
  args={
130
  "ir": {
131
  "source_id": "src_sales",
@@ -189,8 +187,8 @@ _EXAMPLE_B = TaskList(
189
  # Example C - mixed structured + unstructured.
190
  # "Revenue dipped in Q1 - what happened?"
191
  # Shows: a structured branch (query -> analyze_trend) runs alongside an
192
- # INDEPENDENT retrieve_documents branch that pulls qualitative context. Note
193
- # retrieve_documents takes a natural-language `query` (NOT a `${t<id>}` data
194
  # placeholder — it is a source, not a consumer) and can run in parallel; the
195
  # Assembler folds the document context into the explanation.
196
  # --------------------------------------------------------------------------- #
@@ -205,7 +203,7 @@ _EXAMPLE_C = TaskList(
205
  id="t1",
206
  stage="data_understanding",
207
  objective="Confirm the sales source exposes order date and revenue.",
208
- tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
209
  expected_output="source_shape",
210
  success_criteria="Produced the orders table schema; date and revenue columns present.",
211
  depends_on=[],
@@ -217,7 +215,7 @@ _EXAMPLE_C = TaskList(
217
  objective="Pull Q1 order dates and revenue.",
218
  tool_calls=[
219
  ToolCall(
220
- tool="query_structured",
221
  args={
222
  "ir": {
223
  "source_id": "src_sales",
@@ -275,7 +273,7 @@ _EXAMPLE_C = TaskList(
275
  objective="Retrieve qualitative context on Q1 operational events behind a dip.",
276
  tool_calls=[
277
  ToolCall(
278
- tool="retrieve_documents",
279
  args={
280
  "query": "operational issues, outages, or notable events in Q1 2026",
281
  "top_k": 5,
@@ -310,7 +308,7 @@ _EXAMPLE_D = TaskList(
310
  id="t1",
311
  stage="data_understanding",
312
  objective="Confirm the sales source exposes region and revenue.",
313
- tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
314
  expected_output="source_shape",
315
  success_criteria="Produced the orders table schema; region and revenue present.",
316
  depends_on=[],
@@ -322,7 +320,7 @@ _EXAMPLE_D = TaskList(
322
  objective="Pull order-level region and revenue.",
323
  tool_calls=[
324
  ToolCall(
325
- tool="query_structured",
326
  args={
327
  "ir": {
328
  "source_id": "src_sales",
 
2
 
3
  Several illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
4
  stages, dependency edges, ordered tool-call chains, inline QueryIR,
5
+ "${t<id>}" placeholders, and the assumed data-flow convention — `retrieve_data`
6
  pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
7
  referencing the upstream result's column aliases (Pattern A; the tool team may
8
  instead pick self-fetch by `source_id`, in which case these examples are reshaped
 
21
  # --------------------------------------------------------------------------- #
22
  # Example A — exploratory, no modeling.
23
  # "Which product categories drove last quarter's revenue?"
24
+ # Shows: retrieve_data pulls rows -> analyze_aggregate sums revenue per
25
+ # category in one call (no manual per-category queries).
 
26
  # --------------------------------------------------------------------------- #
27
 
28
  _EXAMPLE_A = TaskList(
 
35
  id="t1",
36
  stage="data_understanding",
37
  objective="Confirm the sales source exposes category, revenue, and order date.",
38
+ tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
39
  expected_output="source_shape",
40
  success_criteria="Produced the orders table schema; the 3 needed columns are present.",
41
  depends_on=[],
 
47
  objective="Pull last quarter's order-level category and revenue rows.",
48
  tool_calls=[
49
  ToolCall(
50
+ tool="retrieve_data",
51
  args={
52
  "ir": {
53
  "source_id": "src_sales",
 
77
  Task(
78
  id="t3",
79
  stage="evaluation",
80
+ objective="Sum revenue per category for the quarter.",
81
  tool_calls=[
82
  ToolCall(
83
+ tool="analyze_aggregate",
84
  args={
85
  "data": "${t2}",
86
+ "aggregations": {"revenue": ["sum"]},
87
+ "group_by": ["category"],
 
88
  },
89
  )
90
  ],
91
+ expected_output="category_revenue",
92
+ success_criteria="Produced total revenue per category, one row each.",
93
  depends_on=["t2"],
94
  estimated_cost="low",
95
  ),
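Example A passes the rows pulled by `t2` into `analyze_aggregate` through the `"${t2}"` placeholder. A minimal sketch of how such a placeholder could be resolved at execution time, assuming task outputs are kept in a dict keyed by task id. `resolve_placeholders` and the `results` mapping are hypothetical names for illustration, not the executor's actual API, and the sample rows are made up.

```python
import re
from typing import Any

# Matches a whole-string placeholder such as "${t2}" (syntax taken from the
# examples; placeholders embedded inside longer strings are not handled).
_PLACEHOLDER = re.compile(r"^\$\{(t\d+)\}$")


def resolve_placeholders(args: Any, results: dict[str, Any]) -> Any:
    """Recursively replace "${t<id>}" strings with the upstream task result."""
    if isinstance(args, str):
        match = _PLACEHOLDER.match(args)
        if match is None:
            return args
        task_id = match.group(1)
        if task_id not in results:
            raise KeyError(f"placeholder refers to unknown task {task_id!r}")
        return results[task_id]
    if isinstance(args, dict):
        return {key: resolve_placeholders(value, results) for key, value in args.items()}
    if isinstance(args, list):
        return [resolve_placeholders(item, results) for item in args]
    return args


# Hypothetical upstream output of t2 (rows pulled by retrieve_data).
results = {"t2": [{"category": "toys", "revenue": 120.0}]}
call_args = {
    "data": "${t2}",
    "aggregations": {"revenue": ["sum"]},
    "group_by": ["category"],
}
resolved = resolve_placeholders(call_args, results)
```

Only `data` is substituted here; the other arguments pass through unchanged, which is why a source tool like `retrieve_knowledge` (plain `query` string, no placeholder) needs no upstream task.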
 
111
  id="t1",
112
  stage="data_understanding",
113
  objective="Confirm the sales source exposes order date, revenue, and region.",
114
+ tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
115
  expected_output="source_shape",
116
  success_criteria="Produced the orders table schema; the needed columns are present.",
117
  depends_on=[],
 
123
  objective="Pull this year's order dates, revenue, and region.",
124
  tool_calls=[
125
  ToolCall(
126
+ tool="retrieve_data",
127
  args={
128
  "ir": {
129
  "source_id": "src_sales",
 
187
  # Example C — mixed structured + unstructured.
188
  # "Revenue dipped in Q1 — what happened?"
189
  # Shows: a structured branch (query -> analyze_trend) runs alongside an
190
+ # INDEPENDENT retrieve_knowledge branch that pulls qualitative context. Note
191
+ # retrieve_knowledge takes a natural-language `query` (NOT a `${t<id>}` data
192
  # placeholder — it is a source, not a consumer) and can run in parallel; the
193
  # Assembler folds the document context into the explanation.
194
  # --------------------------------------------------------------------------- #
 
203
  id="t1",
204
  stage="data_understanding",
205
  objective="Confirm the sales source exposes order date and revenue.",
206
+ tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
207
  expected_output="source_shape",
208
  success_criteria="Produced the orders table schema; date and revenue columns present.",
209
  depends_on=[],
 
215
  objective="Pull Q1 order dates and revenue.",
216
  tool_calls=[
217
  ToolCall(
218
+ tool="retrieve_data",
219
  args={
220
  "ir": {
221
  "source_id": "src_sales",
 
273
  objective="Retrieve qualitative context on Q1 operational events behind a dip.",
274
  tool_calls=[
275
  ToolCall(
276
+ tool="retrieve_knowledge",
277
  args={
278
  "query": "operational issues, outages, or notable events in Q1 2026",
279
  "top_k": 5,
 
308
  id="t1",
309
  stage="data_understanding",
310
  objective="Confirm the sales source exposes region and revenue.",
311
+ tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
312
  expected_output="source_shape",
313
  success_criteria="Produced the orders table schema; region and revenue present.",
314
  depends_on=[],
 
320
  objective="Pull order-level region and revenue.",
321
  tool_calls=[
322
  ToolCall(
323
+ tool="retrieve_data",
324
  args={
325
  "ir": {
326
  "source_id": "src_sales",
src/agents/planner/inputs.py CHANGED
@@ -4,9 +4,9 @@
4
  for the planner prompt. It carries every table + column id/type/PII flag + row
5
  counts + low-cardinality top_values, with `sample_values` nulled on PII columns
6
  (INV: no PII sample values into the prompt, see doc §13). It also lists the
7
- available unstructured sources so the planner can plan `retrieve_documents`.
8
 
9
- The planner *validator* still checks inline `query_structured` IRs against the
10
  full `Catalog` via the existing IRValidator — the summary is a prompt input, not
11
  the validation source of truth.
12
 
@@ -124,7 +124,7 @@ class CatalogSummary(BaseModel):
124
  lines.append("")
125
 
126
  if self.unstructured_sources:
127
- lines.append("Unstructured sources (for retrieve_documents):")
128
  for src in self.unstructured_sources:
129
  lines.append(f" - {src.name} — id={src.source_id}")
130
 
 
4
  for the planner prompt. It carries every table + column id/type/PII flag + row
5
  counts + low-cardinality top_values, with `sample_values` nulled on PII columns
6
  (INV: no PII sample values into the prompt, see doc §13). It also lists the
7
+ available unstructured sources so the planner can plan `retrieve_knowledge`.
8
 
9
+ The planner *validator* still checks inline `retrieve_data` IRs against the
10
  full `Catalog` via the existing IRValidator — the summary is a prompt input, not
11
  the validation source of truth.
12
 
 
124
  lines.append("")
125
 
126
  if self.unstructured_sources:
127
+ lines.append("Unstructured sources (for retrieve_knowledge):")
128
  for src in self.unstructured_sources:
129
  lines.append(f" - {src.name} — id={src.source_id}")
130
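The module docstring states the invariant that `sample_values` are nulled on PII columns before anything reaches the planner prompt. A minimal sketch of that rule, assuming a simplified column record; `ColumnInfo` and `summarize_columns` are hypothetical stand-ins, not the real `CatalogSummary` model, and the sample values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    """Hypothetical stand-in for a catalog column."""

    name: str
    dtype: str
    is_pii: bool = False
    sample_values: list[str] = field(default_factory=list)


def summarize_columns(columns: list[ColumnInfo]) -> list[str]:
    """Render one prompt line per column, dropping samples for PII columns."""
    lines = []
    for col in columns:
        # The invariant: a PII column never contributes sample values.
        samples = None if col.is_pii else col.sample_values
        line = f" - {col.name} ({col.dtype})"
        if col.is_pii:
            line += " [PII]"
        if samples:
            line += " e.g. " + ", ".join(samples)
        lines.append(line)
    return lines


columns = [
    ColumnInfo("region", "string", sample_values=["north", "south"]),
    ColumnInfo("email", "string", is_pii=True, sample_values=["a@example.com"]),
]
summary = summarize_columns(columns)
```

The PII flag itself is still shown, so the planner knows the column exists and is sensitive without ever seeing a value from it.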
 
src/agents/planner/registry.py CHANGED
@@ -7,8 +7,8 @@ outside it).
7
  `src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
8
  `ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
9
  descriptions (KM-625). No longer a stub on our side — it tracks the real registry.
10
- - **Data access (`query_structured` / `retrieve_documents` / `list_sources` /
11
- `describe_source`) — spec BODIES still a local stub.** The tool team owns these too,
12
  but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
13
  spec bodies here so the Planner can plan end-to-end — but the NAMES derive from
14
  `src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
@@ -16,10 +16,10 @@ outside it).
16
  this slice and swap `default_registry()` for the tool team's full composition.
17
 
18
  **Confirmed conventions (KM-465):** Pattern A — `analyze_*` tools take a `data`
19
- `"${t<id>}"` placeholder pointing at an upstream `query_structured` output (no
20
  self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
21
  lightweight `{required, properties}` dict the planner validator (check #8) reads;
22
- `query_structured.args["ir"]` carries an inline QueryIR validated against the
23
  catalog by the existing IRValidator.
24
 
25
  See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
@@ -38,25 +38,32 @@ from .contracts import ToolRegistry, ToolSpec
38
  # --------------------------------------------------------------------------- #
39
  _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
40
  ToolSpec(
41
- name="query_structured",
42
  category="analytics.query",
43
  input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
44
  output_kind="table",
45
  description=(
46
- "Run one validated, single-table query against a structured source (DB "
47
- "schema or tabular file) and return rows. The `ir` argument is an inline "
48
- "QueryIR (the JSON intent: source_id, table_id, select, filters, group_by, "
49
- "order_by, limit) — never SQL. This is the data-access entry point: use it "
50
- "to select, filter, and pull the rows the analytics (`analyze_*`) tools "
51
- "then consume. It also does simple built-in aggregation the IR can express "
52
- "(count/sum/avg/min/max/count_distinct). Do NOT use it for richer statistics "
 
 
 
 
 
 
 
  "(median/percentile/mode/stddev/skew → analyze_descriptive), trends "
54
  "(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
55
- "use it to read documents (use retrieve_documents)."
56
  ),
57
  ),
58
  ToolSpec(
59
- name="retrieve_documents",
60
  category="retrieval.documents",
61
  input_schema={
62
  "required": ["query"],
@@ -71,32 +78,36 @@ _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
71
  "Dense-retrieve the most relevant chunks from the user's unstructured "
72
  "sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
73
  "qualitative context into an analysis. Optionally scope to one `source_id`. "
74
- "Do NOT use it for numbers in tables — that is query_structured's job."
75
  ),
76
  ),
77
  ToolSpec(
78
- name="list_sources",
79
  category="catalog.introspection",
80
- input_schema={"required": [], "properties": {}},
 
 
 
81
  output_kind="table",
82
  description=(
83
- "List the user's available data sources (id, name, type, table count). Use "
84
- "early in data_understanding when the plan must discover what exists before "
85
- "querying. Cheap. Do NOT use it to read column details (use describe_source)."
 
 
 
86
  ),
87
  ),
88
  ToolSpec(
89
- name="describe_source",
90
  category="catalog.introspection",
91
- input_schema={
92
- "required": ["source_id"],
93
- "properties": {"source_id": {"type": "string"}},
94
- },
95
  output_kind="table",
96
  description=(
97
- "Return the tables and columns (names, types, row counts) of one source by "
98
- "`source_id`. Use in data_understanding to confirm the shape of a source "
99
- "before querying it. Do NOT use it to fetch data rows (use query_structured)."
 
100
  ),
101
  ),
102
  )
 
7
  `src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
8
  `ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
9
  descriptions (KM-625). No longer a stub on our side — it tracks the real registry.
10
+ - **Data access (`retrieve_data` / `retrieve_knowledge` / `check_data` /
11
+ `check_knowledge`) — spec BODIES still a local stub.** The tool team owns these too,
12
  but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
13
  spec bodies here so the Planner can plan end-to-end — but the NAMES derive from
14
  `src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
 
16
  this slice and swap `default_registry()` for the tool team's full composition.
17
 
18
  **Confirmed conventions (KM-465):** Pattern A — `analyze_*` tools take a `data`
19
+ `"${t<id>}"` placeholder pointing at an upstream `retrieve_data` output (no
20
  self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
21
  lightweight `{required, properties}` dict the planner validator (check #8) reads;
22
+ `retrieve_data.args["ir"]` carries an inline QueryIR validated against the
23
  catalog by the existing IRValidator.
24
 
25
  See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
 
38
  # --------------------------------------------------------------------------- #
39
  _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
40
  ToolSpec(
41
+ name="retrieve_data",
42
  category="analytics.query",
43
  input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
44
  output_kind="table",
45
  description=(
46
+ "Run one validated query against a structured source and return rows. The "
47
+ "`ir` argument is an inline QueryIR (the JSON intent: source_id, table_id, "
48
+ "joins, select, filters, group_by, order_by, limit) — never SQL. This is the "
49
+ "data-access entry point: use it to select, filter, and pull the rows the "
50
+ "analytics (`analyze_*`) tools then consume. It also does simple built-in "
51
+ "aggregation the IR can express (count/sum/avg/min/max/count_distinct). "
52
+ "JOINS (database sources only): to group a measure in one table by a "
53
+ "dimension in a RELATED table, add a `joins` entry "
54
+ "({target_table_id, left_column_id, right_column_id}) along a declared "
55
+ "foreign key — e.g. sum order_items.line_total grouped by products.category "
56
+ "via order_items.product_id = products.id. Prefer an existing measure column "
57
+ "(e.g. line_total) over recomputing, and a single table when the measure and "
58
+ "dimension already live together. Joins are NOT supported on tabular/file "
59
+ "sources yet. Do NOT use this for richer statistics "
  "(median/percentile/mode/stddev/skew → analyze_descriptive), trends "
61
  "(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
62
+ "use it to read documents (use retrieve_knowledge)."
63
  ),
64
  ),
65
  ToolSpec(
66
+ name="retrieve_knowledge",
67
  category="retrieval.documents",
68
  input_schema={
69
  "required": ["query"],
 
78
  "Dense-retrieve the most relevant chunks from the user's unstructured "
79
  "sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
80
  "qualitative context into an analysis. Optionally scope to one `source_id`. "
81
+ "Do NOT use it for numbers in tables — that is retrieve_data's job."
82
  ),
83
  ),
84
  ToolSpec(
85
+ name="check_data",
86
  category="catalog.introspection",
87
+ input_schema={
88
+ "required": [],
89
+ "properties": {"source_id": {"type": "string"}},
90
+ },
91
  output_kind="table",
92
  description=(
93
+ "Inspect the user's structured data sources (DB + tabular). With no "
94
+ "arguments, lists the sources (id, name, type, table count) — use early in "
95
+ "data_understanding to discover what exists. With a `source_id`, returns that "
96
+ "source's tables and columns (names, types, row counts) — use to confirm a "
97
+ "source's shape before querying it. Cheap. Do NOT use it to fetch data rows "
98
+ "(use retrieve_data) or to inspect documents (use check_knowledge)."
99
  ),
100
  ),
101
  ToolSpec(
102
+ name="check_knowledge",
103
  category="catalog.introspection",
104
+ input_schema={"required": [], "properties": {}},
 
 
 
105
  output_kind="table",
106
  description=(
107
+ "List the user's unstructured sources / documents (id, name, type). Use in "
108
+ "data_understanding to discover what qualitative material exists before "
109
+ "retrieving from it. Do NOT use it to read document content (use "
110
+ "retrieve_knowledge) or to inspect structured data (use check_data)."
111
  ),
112
  ),
113
  )
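The spec bodies above use the lightweight `{required, properties}` schema that the planner validator reads, and the `retrieve_data` description introduces a `joins` entry. A minimal sketch of checking call args against such a schema, together with a join-shaped IR. Assumptions: `check_args`, the type-name mapping, and the rejection of unknown args are illustrative and not the validator's actual check #8; the shapes of the `select` and `group_by` entries are guesses, since the diff only names the fields.

```python
from typing import Any

# Same shape as the `retrieve_data` spec's input_schema above.
RETRIEVE_DATA_SCHEMA = {"required": ["ir"], "properties": {"ir": {"type": "object"}}}

# Assumed mapping from schema type names to Python types.
_TYPE_MAP = {"object": dict, "string": str, "integer": int, "array": list}


def check_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the args fit the schema."""
    problems = [
        f"missing required arg {name!r}"
        for name in schema["required"]
        if name not in args
    ]
    for name, value in args.items():
        prop = schema["properties"].get(name)
        if prop is None:
            problems.append(f"unknown arg {name!r}")
            continue
        expected = _TYPE_MAP.get(prop.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(
                f"arg {name!r} should be {prop['type']}, got {type(value).__name__}"
            )
    return problems


# A join IR following the description: sum order_items.line_total grouped by
# products.category via order_items.product_id = products.id.
join_args = {
    "ir": {
        "source_id": "src_sales",
        "table_id": "order_items",
        "joins": [
            {
                "target_table_id": "products",
                "left_column_id": "product_id",
                "right_column_id": "id",
            }
        ],
        "select": [{"column_id": "line_total", "agg": "sum"}],  # guessed shape
        "group_by": ["category"],  # guessed shape
    }
}
```

A schema check like this only confirms that `ir` is an object. Whether the join follows a declared foreign key is a catalog question, which is why the inline IR is validated separately by the IRValidator.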
src/agents/planner/service.py CHANGED
@@ -9,7 +9,7 @@ static plan.
9
 
10
  The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
11
  the PII-safe `CatalogSummary` for the prompt, but validation needs the full
12
- catalog so the existing `IRValidator` can check inline `query_structured` IRs.
13
 
14
  See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
15
  """
 
9
 
10
  The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
11
  the PII-safe `CatalogSummary` for the prompt, but validation needs the full
12
+ catalog so the existing `IRValidator` can check inline `retrieve_data` IRs.
13
 
14
  See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
15
  """
src/agents/planner/validator.py CHANGED
@@ -95,8 +95,8 @@ class PlannerValidator:
95
  f"source_id {src!r} (known: {sorted(known_sources)})"
96
  )
97
 
98
- # Check 8b — inline query_structured IR validates against the catalog.
99
- if call.tool == "query_structured":
100
  self._validate_inline_ir(task.id, call.args, catalog)
101
 
102
  # Check 7 — success_criteria is checkable.
@@ -114,20 +114,20 @@ class PlannerValidator:
114
  raw_ir = args.get("ir")
115
  if not isinstance(raw_ir, dict):
116
  raise PlannerValidationError(
117
- f"task {task_id}: query_structured.args.ir must be an inline QueryIR "
118
  f"object, got {type(raw_ir).__name__}"
119
  )
120
  try:
121
  ir = QueryIR.model_validate(raw_ir)
122
  except ValidationError as e:
123
  raise PlannerValidationError(
124
- f"task {task_id}: query_structured.args.ir is not a valid QueryIR: {e}"
125
  ) from e
126
  try:
127
  self._ir_validator.validate(ir, catalog)
128
  except IRValidationError as e:
129
  raise PlannerValidationError(
130
- f"task {task_id}: query_structured IR failed catalog validation: {e}"
131
  ) from e
132
 
133
  @staticmethod
 
95
  f"source_id {src!r} (known: {sorted(known_sources)})"
96
  )
97
 
98
+ # Check 8b — inline retrieve_data IR validates against the catalog.
99
+ if call.tool == "retrieve_data":
100
  self._validate_inline_ir(task.id, call.args, catalog)
101
 
102
  # Check 7 — success_criteria is checkable.
 
114
  raw_ir = args.get("ir")
115
  if not isinstance(raw_ir, dict):
116
  raise PlannerValidationError(
117
+ f"task {task_id}: retrieve_data.args.ir must be an inline QueryIR "
118
  f"object, got {type(raw_ir).__name__}"
119
  )
120
  try:
121
  ir = QueryIR.model_validate(raw_ir)
122
  except ValidationError as e:
123
  raise PlannerValidationError(
124
+ f"task {task_id}: retrieve_data.args.ir is not a valid QueryIR: {e}"
125
  ) from e
126
  try:
127
  self._ir_validator.validate(ir, catalog)
128
  except IRValidationError as e:
129
  raise PlannerValidationError(
130
+ f"task {task_id}: retrieve_data IR failed catalog validation: {e}"
131
  ) from e
132
 
133
  @staticmethod
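`_validate_inline_ir` above runs three checks in order (shape, model parse, catalog validation) and re-raises each failure as a `PlannerValidationError` with the original exception chained. A stdlib-only sketch of that layering, assuming simplified stand-ins: `parse_ir` replaces `QueryIR.model_validate`, `check_against_catalog` replaces `IRValidator.validate`, and a set of table ids replaces the catalog.

```python
class PlannerValidationError(Exception):
    """Stand-in for the planner's error type."""


class IRValidationError(Exception):
    """Stand-in for the IR validator's error type."""


def parse_ir(raw: dict) -> dict:
    # Stand-in for QueryIR.model_validate: only checks two required keys.
    missing = [key for key in ("source_id", "table_id") if key not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return raw


def check_against_catalog(ir: dict, known_tables: set[str]) -> None:
    # Stand-in for IRValidator.validate.
    if ir["table_id"] not in known_tables:
        raise IRValidationError(f"unknown table {ir['table_id']!r}")


def validate_inline_ir(task_id: str, args: dict, known_tables: set[str]) -> dict:
    raw_ir = args.get("ir")
    # 1. Shape: the IR must be an inline object, never a SQL string.
    if not isinstance(raw_ir, dict):
        raise PlannerValidationError(
            f"task {task_id}: ir must be an object, got {type(raw_ir).__name__}"
        )
    # 2. Parse: structural validity of the IR itself.
    try:
        ir = parse_ir(raw_ir)
    except ValueError as e:
        raise PlannerValidationError(f"task {task_id}: not a valid IR: {e}") from e
    # 3. Catalog: the IR must refer to things that exist.
    try:
        check_against_catalog(ir, known_tables)
    except IRValidationError as e:
        raise PlannerValidationError(f"task {task_id}: catalog check failed: {e}") from e
    return ir
```

Chaining with `from e` keeps the low-level cause on `__cause__`, so callers handle one error type while logs still show which layer rejected the plan.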
src/agents/report/__init__.py ADDED
@@ -0,0 +1,9 @@
1
+ """Report generator (KM-644).
2
+
3
+ A button-triggered *service* — not a chat skill, not a slow-path agent. It turns a
4
+ session's persisted `AnalysisRecord`s + Problem Statement into a versioned,
5
+ business-readable `AnalysisReport`. Architecturally it mirrors the Assembler: one
6
+ constrained LLM call (the executive summary) wrapped in deterministic assembly that
7
+ copies every other field verbatim from the records (INV-4). Reports are immutable
8
+ per version and persisted to the `analysis_reports` table.
9
+ """
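The package docstring describes one constrained LLM call (the executive summary) wrapped in deterministic assembly, and the generator's docstring adds that a failed call still yields a report with a fallback summary (decision D1). A minimal sketch of that degrade-gracefully step, assuming the LLM call is an injected async callable. `summarize`, `failing_llm`, and `working_llm` are hypothetical names, the finding text is made up, and treating an empty response as a failure is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK_SUMMARY = "Automated summary unavailable; see the findings below."


async def summarize(
    call_llm: Callable[[list[str]], Awaitable[str]], findings: list[str]
) -> str:
    """Return the LLM summary, or the fallback if the call fails or is empty."""
    try:
        text = await call_llm(findings)
    except Exception:
        # Assumption: any failure degrades to the fallback rather than aborting.
        return FALLBACK_SUMMARY
    return text.strip() or FALLBACK_SUMMARY


async def failing_llm(findings: list[str]) -> str:
    raise TimeoutError("model unavailable")


async def working_llm(findings: list[str]) -> str:
    return f"{len(findings)} findings reviewed."


findings = ["Toys drove the largest share of revenue."]
degraded = asyncio.run(summarize(failing_llm, findings))
normal = asyncio.run(summarize(working_llm, findings))
```

Because every other field is copied from the records, a degraded summary costs the reader one paragraph, not the report.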
src/agents/report/errors.py ADDED
@@ -0,0 +1,7 @@
1
+ """Typed errors for the report generator (KM-644)."""
2
+
3
+ from __future__ import annotations
4
+
5
+
6
+ class ReportError(Exception):
7
+ """The report could not be generated (e.g. no records for the analysis)."""
src/agents/report/generator.py ADDED
@@ -0,0 +1,363 @@
1
+ """ReportGenerator — turns a session's AnalysisRecords into an AnalysisReport (KM-644).
2
+
3
+ A button-triggered service shaped like the Assembler: deterministic assembly of the
4
+ records (findings/caveats/open_questions/data_sources/method_steps, copied verbatim —
5
+ INV-4) wrapped around exactly ONE LLM call that authors only the executive summary.
6
+ If that call fails the report is still returned with a deterministic fallback
7
+ summary (decision D1) — the deterministic body is the real value.
8
+
9
+ Versioning + persistence live in `ReportStore`; this service does generation only
10
+ (returns an `AnalysisReport` with `version=0`; the store assigns the real version).
11
+ Chain construction mirrors `agents/slow_path/assembler.py`.
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ from datetime import UTC, datetime
17
+ from pathlib import Path
18
+
19
+ from langchain_core.messages import SystemMessage
20
+ from langchain_core.prompts import ChatPromptTemplate
21
+ from langchain_core.runnables import Runnable
22
+ from langchain_openai import AzureChatOpenAI
23
+
24
+ from src.middlewares.logging import get_logger
25
+
26
+ from ..slow_path.schemas import AnalysisRecord, TaskSummary
27
+ from .errors import ReportError
28
+ from .schemas import (
29
+ AnalysisReport,
30
+ AttributedNote,
31
+ DataSourceRef,
32
+ ProblemStatement,
33
+ ReportFinding,
34
+ ReportSummaryNarrative,
35
+ )
36
+
37
+ logger = get_logger("report_generator")
38
+
39
+ _FALLBACK_SUMMARY = "Automated summary unavailable — see the findings below."
40
+
41
+ # CRISP-DM phases in narrative order, with human labels for the method appendix.
42
+ _STAGE_LABELS: list[tuple[str, str]] = [
43
+ ("data_understanding", "Data understanding"),
44
+ ("data_preparation", "Data preparation"),
45
+ ("modeling", "Modeling"),
46
+ ("evaluation", "Evaluation"),
47
+ ]
48
+
49
+ _PROMPT_PATH = (
50
+ Path(__file__).resolve().parent.parent.parent / "config" / "prompts" / "report_summary.md"
51
+ )
52
+
53
+
54
+ def _load_prompt_text() -> str:
+     return _PROMPT_PATH.read_text(encoding="utf-8")
56
+
57
+
58
+ def _build_default_chain() -> Runnable:
+     from src.config.settings import settings
+
+     llm = AzureChatOpenAI(
+         azure_deployment=settings.azureai_deployment_name_4o,
+         openai_api_version=settings.azureai_api_version_4o,
+         azure_endpoint=settings.azureai_endpoint_url_4o,
+         api_key=settings.azureai_api_key_4o,
+         temperature=0,
+     )
+     prompt = ChatPromptTemplate.from_messages(
+         [
+             SystemMessage(content=_load_prompt_text()),
+             ("human", "{human_content}"),
+         ]
+     )
+     return prompt | llm.with_structured_output(ReportSummaryNarrative)
75
+
76
+
77
+ _default_chain: Runnable | None = None
78
+
79
+
80
+ def _get_default_chain() -> Runnable:
+     global _default_chain
+     if _default_chain is None:
+         _default_chain = _build_default_chain()
+     return _default_chain
85
+
86
+
87
+ # --------------------------------------------------------------------------- #
88
+ # Deterministic assembly (pure; no LLM, no I/O) — easy to unit-test.
89
+ # --------------------------------------------------------------------------- #
90
+
91
+
92
+ def _collect_findings(records: list[AnalysisRecord]) -> list[ReportFinding]:
+     # Findings are distinct insights — not deduped; each traces to its record.
+     return [
+         ReportFinding(text=text, record_ids=[rec.record_id])
+         for rec in records
+         for text in rec.findings
+     ]
99
+
100
+
101
+ def _collect_notes(records: list[AnalysisRecord], field: str) -> list[AttributedNote]:
+     # Caveats / open_questions are deduped by text; a merged note cites every
+     # record it came from (plural record_ids).
+     merged: dict[str, list[str]] = {}
+     for rec in records:
+         for text in getattr(rec, field):
+             ids = merged.setdefault(text, [])
+             if rec.record_id not in ids:
+                 ids.append(rec.record_id)
+     return [AttributedNote(text=text, record_ids=ids) for text, ids in merged.items()]
111
+
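`_collect_notes` dedupes caveats and open questions by exact text while keeping every contributing record id. A small self-contained run of the same merge logic on made-up records shows the effect; `merge_notes` is a hypothetical copy that returns plain tuples instead of `AttributedNote` objects, and `SimpleNamespace` stands in for `AnalysisRecord`.

```python
from types import SimpleNamespace


def merge_notes(records, field):
    """Dedupe note texts; each text keeps every record id it appeared in."""
    merged: dict[str, list[str]] = {}
    for rec in records:
        for text in getattr(rec, field):
            ids = merged.setdefault(text, [])
            if rec.record_id not in ids:
                ids.append(rec.record_id)
    return list(merged.items())


# Made-up records: both carry the same caveat, r2 adds a second one.
records = [
    SimpleNamespace(record_id="r1", caveats=["Q1 data is partial."]),
    SimpleNamespace(record_id="r2", caveats=["Q1 data is partial.", "Returns excluded."]),
]
notes = merge_notes(records, "caveats")
```

Dict insertion order means notes appear in first-seen order. Matching is by exact string, so two caveats that differ only in wording stay separate.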
112
+
113
+ def _collect_method_steps(records: list[AnalysisRecord]) -> list[TaskSummary]:
+     steps: list[TaskSummary] = []
+     for rec in records:
+         steps.extend(rec.tasks_run)
+     return steps
118
+
119
+
120
+ def _build_data_sources(
+     records: list[AnalysisRecord], catalog, bound_ids: list[str] | None = None
+ ) -> list[DataSourceRef]:
+     """Freeze real catalog metadata for the sources this analysis used.
+
+     When the analysis has a data-source binding (#10), the candidate set is scoped
+     to the bound sources first (fail-open if the binding doesn't intersect the
+     catalog). Within that set, matches catalog sources against the records'
+     (narrative) `data_used` by name/id; falls back to all (bound) sources, then to
+     bare `data_used` strings if no catalog is available — so the section is always
+     populated, best-effort.
+     """
+     if catalog is None or not catalog.sources:
+         seen: list[str] = []
+         for rec in records:
+             for du in rec.data_used:
+                 if du not in seen:
+                     seen.append(du)
+         return [DataSourceRef(source_id=d, name=d, source_type="", detail={}) for d in seen]
+
+     candidates = catalog.sources
+     if bound_ids:
+         bound = set(bound_ids)  # build the lookup set once, not once per source
+         scoped = [s for s in candidates if s.source_id in bound]
+         candidates = scoped or candidates  # fail-open if binding doesn't match catalog
+
+     def _ref(s) -> DataSourceRef:
+         return DataSourceRef(
+             source_id=s.source_id,
+             name=s.name,
+             source_type=s.source_type,
+             detail={
+                 "tables": [t.name for t in s.tables],
+                 "row_count": sum((t.row_count or 0) for t in s.tables) or None,
+                 "columns": [c.name for t in s.tables for c in t.columns],
+             },
+         )
+
+     used = " ".join(du for rec in records for du in rec.data_used).lower()
+     matched = [
+         _ref(s)
+         for s in candidates
+         if s.name.lower() in used or s.source_id.lower() in used
+     ]
+     return matched or [_ref(s) for s in candidates]
164
+
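`_build_data_sources` layers two fallbacks: a binding that matches nothing in the catalog is ignored (fail-open), and if no candidate is mentioned in `data_used`, every candidate is listed. A reduced sketch of just that selection logic, assuming a two-field source record; `Source` and `pick_sources` are hypothetical and the source names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a catalog source."""

    source_id: str
    name: str


def pick_sources(sources, data_used, bound_ids=None):
    candidates = list(sources)
    if bound_ids:
        bound = set(bound_ids)
        scoped = [s for s in candidates if s.source_id in bound]
        candidates = scoped or candidates  # fail-open: ignore a stale binding
    # Case-insensitive substring match against the narrative data_used text.
    used = " ".join(data_used).lower()
    matched = [
        s for s in candidates
        if s.name.lower() in used or s.source_id.lower() in used
    ]
    return matched or candidates  # nothing matched: list every candidate


sales = Source("src_sales", "Sales DB")
hr = Source("src_hr", "HR export")
catalog = [sales, hr]
```

For example, `pick_sources(catalog, ["Sales DB"], bound_ids=["src_hr"])` returns only the HR source: the binding narrows the candidates first, and the name match then finds nothing inside that set. Substring matching is best-effort, so a short source name can match unrelated narrative text.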
165
+
166
+ def _build_human_content(
+     ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
+ ) -> str:
+     sections = []
+     ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
+     if ps_lines:
+         sections.append("# Problem Statement\n" + "\n".join(ps_lines))
+     sections.append(
+         "# Findings (already finalized — synthesize, do not add numbers)\n"
+         + "\n".join(f"- {f.text}" for f in findings)
+     )
+     if caveats:
+         sections.append("# Caveats\n" + "\n".join(f"- {c.text}" for c in caveats))
+     return "\n\n".join(sections)
180
+
181
+
182
+ def _render_markdown(report: AnalysisReport) -> str:
+     # Version is deliberately NOT in the markdown — it is assigned by the store
+     # after rendering and lives in the structured `version` field / API metadata.
+     parts: list[str] = ["# Analysis Report"]
+     parts.append(
+         f"*Generated {report.generated_at:%Y-%m-%d} · "
+         f"{len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"
+     )
+
+     ps = report.problem_statement
+     ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
+     if ps_lines:
+         parts.append("## Problem Statement\n" + " ".join(ps_lines))
+
+     if report.executive_summary:
+         parts.append("## Executive Summary\n" + report.executive_summary)
+
+     if report.findings:
+         lines = ["## Key Findings"]
+         for i, f in enumerate(report.findings, 1):
+             cite = f" *({', '.join(f.record_ids)})*" if f.record_ids else ""
+             lines.append(f"{i}. {f.text}{cite}")
+         parts.append("\n".join(lines))
+
+     if report.caveats or report.open_questions:
+         lines = ["## Caveats & Open Questions"]
+         for n in report.caveats:
+             cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
+             lines.append(f"- {n.text}{cite}")
+         for n in report.open_questions:
+             cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
+             lines.append(f"- Open: {n.text}{cite}")
+         parts.append("\n".join(lines))
+
+     if report.data_sources:
+         lines = ["## Appendix A — Data Used", "| source | type | detail |", "|---|---|---|"]
+         for ds in report.data_sources:
+             d = ds.detail
+             bits = []
+             if d.get("tables"):
+                 bits.append("tables: " + ", ".join(d["tables"]))
+             if d.get("row_count"):
+                 bits.append(f"{d['row_count']} rows")
+             if d.get("columns"):
+                 bits.append(f"{len(d['columns'])} cols")
+             lines.append(f"| {ds.name} | {ds.source_type or '—'} | {' · '.join(bits) or '—'} |")
+         parts.append("\n".join(lines))
+
+     if report.method_steps:
+         lines = ["## Appendix B — Method"]
+         for stage_key, label in _STAGE_LABELS:
+             steps = [s for s in report.method_steps if s.stage == stage_key]
+             if not steps:
+                 continue
+             rendered = "; ".join(
+                 f"{', '.join(s.tools_used) or '—'} ({s.status})" for s in steps
+             )
+             lines.append(f"**{label}** — {rendered}")
+         parts.append("\n".join(lines))
+
+     return "\n\n".join(parts)
243
+
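The Appendix A table in `_render_markdown` condenses each source's `detail` dict into one cell, skipping empty parts. A standalone sketch of that cell logic makes the output easy to check; `detail_cell` is a hypothetical helper, the sample values are made up, and it uses `n/a` as the empty placeholder where the renderer uses a dash.

```python
def detail_cell(detail: dict) -> str:
    """Render the `detail` column of the data-sources table."""
    bits = []
    if detail.get("tables"):
        bits.append("tables: " + ", ".join(detail["tables"]))
    if detail.get("row_count"):
        bits.append(f"{detail['row_count']} rows")
    if detail.get("columns"):
        bits.append(f"{len(detail['columns'])} cols")
    # Join whatever is present; fall back to a placeholder when nothing is.
    return " · ".join(bits) or "n/a"


full = detail_cell({"tables": ["orders"], "row_count": 1200, "columns": ["id", "revenue"]})
empty = detail_cell({"tables": [], "row_count": None, "columns": []})
```

Truthiness checks mean a source with a row count of zero is rendered without a row count, the same as one whose count is unknown.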
244
+
245
+ # --------------------------------------------------------------------------- #
246
+ # Service
247
+ # --------------------------------------------------------------------------- #
248
+
249
+
250
+ class ReportGenerator:
+     """Generates an `AnalysisReport` from persisted records. Inject deps for tests."""
+
+     def __init__(
+         self,
+         record_store=None,
+         structured_chain: Runnable | None = None,
+         catalog_store=None,
+         binding_store=None,
+     ) -> None:
+         self._record_store = record_store
+         self._chain = structured_chain
+         self._catalog_store = catalog_store
+         self._binding_store = binding_store
+
+     def _ensure_record_store(self):
+         if self._record_store is None:
+             from ..slow_path.store import PostgresAnalysisStore
+
+             self._record_store = PostgresAnalysisStore()
+         return self._record_store
+
+     def _ensure_chain(self) -> Runnable:
+         if self._chain is None:
+             self._chain = _get_default_chain()
+         return self._chain
+
+     def _ensure_catalog_store(self):
+         if self._catalog_store is None:
+             from src.catalog.store import CatalogStore
+
+             self._catalog_store = CatalogStore()
+         return self._catalog_store
+
+     async def generate(
+         self,
+         analysis_id: str,
+         user_id: str | None = None,
+         problem_statement: ProblemStatement | None = None,
+     ) -> AnalysisReport:
+         records = await self._ensure_record_store().list_for_analysis(analysis_id)
+         if not records:
+             raise ReportError(f"no analyses recorded for {analysis_id!r} yet")
+
+         ps = problem_statement or ProblemStatement()
+         findings = _collect_findings(records)
+         caveats = _collect_notes(records, "caveats")
+         open_questions = _collect_notes(records, "open_questions")
+         method_steps = _collect_method_steps(records)
+         bound_ids = await self._read_binding(analysis_id)
+         data_sources = _build_data_sources(
+             records, await self._read_catalog(user_id), bound_ids
+         )
+         executive_summary = await self._summarize(ps, findings, caveats)
+
+         report = AnalysisReport(
+             analysis_id=analysis_id,
+             user_id=user_id,
+             version=0,  # assigned by ReportStore.save under the advisory lock
+             generated_at=datetime.now(UTC),
+             problem_statement=ps,
+             record_ids=[r.record_id for r in records],
+             executive_summary=executive_summary,
+             findings=findings,
+             caveats=caveats,
+             open_questions=open_questions,
+             data_sources=data_sources,
+             method_steps=method_steps,
+         )
+         report.rendered_markdown = _render_markdown(report)
+         logger.info(
+             "report generated",
+             analysis_id=analysis_id,
+             records=len(records),
+             findings=len(findings),
+         )
+         return report
+
+     async def _read_catalog(self, user_id: str | None):
+         if not user_id:
+             return None
+         try:
+             return await self._ensure_catalog_store().get(user_id)
+         except Exception as exc:  # data_sources falls back; never break the report
+             logger.warning("catalog read failed; data_sources will fall back", error=str(exc))
+             return None
+
+     def _ensure_binding_store(self):
+         if self._binding_store is None:
+             from ..binding_store import AnalysisDataSourceStore
+
+             self._binding_store = AnalysisDataSourceStore()
+         return self._binding_store
343
+
344
+ async def _read_binding(self, analysis_id: str) -> list[str]:
345
+ """Bound source ids for the analysis (#10). Never-throw → [] (unscoped)."""
346
+ try:
347
+ return await self._ensure_binding_store().get(analysis_id)
348
+ except Exception as exc: # data_sources falls back to whole catalog
349
+ logger.warning("binding read failed; data_sources unscoped", error=str(exc))
350
+ return []
351
+
352
+ async def _summarize(
353
+ self, ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
354
+ ) -> str:
355
+ human_content = _build_human_content(ps, findings, caveats)
356
+ try:
357
+ narrative: ReportSummaryNarrative = await self._ensure_chain().ainvoke(
358
+ {"human_content": human_content}
359
+ )
360
+ return narrative.executive_summary
361
+ except Exception as exc: # D1: degrade, don't fail the whole report
362
+ logger.warning("report summary LLM failed; using fallback", error=str(exc))
363
+ return _FALLBACK_SUMMARY
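The "degrade, don't fail" behaviour of `_summarize` can be shown in isolation. The sketch below is a minimal stand-in, not the PR's code: `summarize_with_fallback`, `GoodChain`, `FlakyChain` and the fallback text are all hypothetical names invented for illustration, and the fake chains only mimic the `ainvoke` call shape used above.

```python
import asyncio

# Hypothetical fallback text; the PR's real _FALLBACK_SUMMARY value is not shown here.
FALLBACK_SUMMARY = "Summary unavailable; see the findings below."


class GoodChain:
    """Stand-in for a structured-output chain whose call succeeds."""

    async def ainvoke(self, payload: dict) -> str:
        return f"Summary of: {payload['human_content']}"


class FlakyChain:
    """Stand-in for a chain whose LLM call fails."""

    async def ainvoke(self, payload: dict) -> str:
        raise RuntimeError("LLM timeout")


async def summarize_with_fallback(chain, human_content: str) -> str:
    """Return the chain's summary, or a fixed fallback if the call raises."""
    try:
        return await chain.ainvoke({"human_content": human_content})
    except Exception:  # degrade: the deterministic parts of the report stay valid
        return FALLBACK_SUMMARY


good = asyncio.run(summarize_with_fallback(GoodChain(), "3 findings"))
bad = asyncio.run(summarize_with_fallback(FlakyChain(), "3 findings"))
```

Because only the executive summary depends on the LLM, swallowing the error here costs one paragraph of narrative rather than the whole report.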
src/agents/report/readiness.py ADDED
@@ -0,0 +1,165 @@
+ """`is_report_ready` - deterministic report-readiness signal (seam #5, KM-652).
+
+ The Help skill asks "can the user generate a report yet?" before it offers that as
+ a next step. This is the producer of that answer; Help only *consumes* it (see
+ `handlers/help.ReportReadiness`). No LLM - readiness is a fact about persisted state,
+ not a judgement.
+
+ The rule mirrors what makes a real report non-empty and worth generating, so Help can
+ never suggest an action that would 409 or produce a duplicate:
+   1. `problem_validated` - the gate's own precondition (no validated goal, no
+      analysis worth reporting). Same rule `gate.gate` applies to `structured_flow`.
+   2. at least one **substantive** persisted `AnalysisRecord` - a record whose
+      *analysis* task succeeded. A failed run still persists a record WITH findings
+      (they narrate the failure), and data-access tasks (check_/retrieve_) succeed even
+      when the analysis fails - so neither "has findings" nor "any task succeeded" is
+      enough. We require a genuine analysis tool (analyze_*) to have completed. We count
+      *results*, not chat turns.
+   3. delta-since-report - if a report already exists (`state.report_id`), only ready
+      when there's a substantive analysis newer than the latest report; otherwise the
+      new report would be identical.
+
+ `missing` names whichever criterion is absent, so Help can tell the user the next gap
+ to fill (the team values `missing` over the bare boolean). Bias is anti-false-positive
+ (report is also button-triggered): a record-store read failure fails **closed**
+ (not ready); a report-store read failure during the delta check fails **open** (we
+ can't prove staleness, and the button is always there).
+
+ NOT in scope (deferred, pending the readiness eval set): semantic *alignment* of the
+ analyses to the problem statement and *depth*/variety scoring - both need an LLM judge
+ and shouldn't sit in the per-turn Help hot path until eval justifies the cost.
+ """
+
+ from __future__ import annotations
+
+ from datetime import UTC, datetime
+ from typing import TYPE_CHECKING
+
+ from src.middlewares.logging import get_logger
+
+ from ..handlers.help import ReportReadiness
+
+ if TYPE_CHECKING:
+     from ..gate import AnalysisState
+
+ logger = get_logger("report_readiness")
+
+ # Human-readable gaps surfaced to the user via Help (kept stable for the prompt).
+ _MISSING_PROBLEM = "a validated problem statement"
+ _MISSING_ANALYSIS = "at least one completed analysis"
+ _MISSING_DELTA = "a new analysis since the last report"
+
+
+ def _default_record_store():
+     from ..slow_path.store import PostgresAnalysisStore
+
+     return PostgresAnalysisStore()
+
+
+ def _default_report_store():
+     from .store import ReportStore
+
+     return ReportStore()
+
+
+ def _is_newer(a: datetime, b: datetime) -> bool:
+     """True if `a` is later than `b`, tolerating naive/aware mismatch (assume UTC)."""
+     if a.tzinfo is None:
+         a = a.replace(tzinfo=UTC)
+     if b.tzinfo is None:
+         b = b.replace(tzinfo=UTC)
+     return a > b
+
+
+ def _has_successful_analysis(record) -> bool:
+     """True if the record has at least one *analysis* task that succeeded.
+
+     A failed run still writes findings (narrating the failure) and its data-access
+     tasks (check_/retrieve_) succeed, so we can't key on findings or on "any task
+     succeeded". An analysis tool (analyze_*) completing is the real "we produced a
+     result" signal.
+     """
+     return any(
+         t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used)
+         for t in record.tasks_run
+     )
+
+
+ async def report_floor(
+     analysis_id: str | None,
+     state: AnalysisState,
+     *,
+     record_store=None,
+ ) -> tuple[list[str], list]:
+     """The report **floor**: a validated goal + ≥1 substantive analysis.
+
+     Returns `(missing, substantive_records)`. This is the shared gate both the Help
+     readiness signal AND the report API enforce, so the button and Help can't drift
+     (T-D / T11). It deliberately excludes the delta-since-report check - that is
+     advisory and lives only in `is_report_ready`; the report button is always allowed
+     to cut a new version (decision 4A). Fails closed (counts as missing analysis) on
+     a record-store read error. `record_store` is injectable for tests.
+     """
+     missing: list[str] = []
+     if not state.problem_validated:
+         missing.append(_MISSING_PROBLEM)
+
+     substantive: list = []
+     if analysis_id:
+         try:
+             store = record_store or _default_record_store()
+             records = await store.list_for_analysis(analysis_id)
+             substantive = [r for r in records if _has_successful_analysis(r)]
+         except Exception as exc:  # noqa: BLE001 - never-throw; fail closed to not-ready
+             logger.warning(
+                 "report_floor: record store read failed - not ready",
+                 analysis_id=analysis_id,
+                 error=str(exc),
+             )
+             return [*missing, _MISSING_ANALYSIS], []
+
+     if not substantive:
+         missing.append(_MISSING_ANALYSIS)
+     return missing, substantive
+
+
+ async def is_report_ready(
+     analysis_id: str | None,
+     state: AnalysisState,
+     *,
+     record_store=None,
+     report_store=None,
+ ) -> ReportReadiness:
+     """Return whether a report can be generated for this analysis, and the gaps if not.
+
+     `record_store` / `report_store` are injectable for tests; they default to the
+     real Postgres stores.
+     """
+     missing, substantive = await report_floor(
+         analysis_id, state, record_store=record_store
+     )
+
+     if not substantive:
+         # No analyses to report on → the delta check is moot.
+         return ReportReadiness(ready=not missing, missing=missing)
+
+     # Delta-since-report: a report already exists, so only ready if a substantive
+     # analysis is newer than the latest report. Fail-open on a report-store error.
+     if state.report_id:
+         last_report_at: datetime | None = None
+         try:
+             rstore = report_store or _default_report_store()
+             reports = await rstore.list_for_analysis(analysis_id)
+             last_report_at = max((r.generated_at for r in reports), default=None)
+         except Exception as exc:  # noqa: BLE001 - skip delta; can't prove staleness
+             logger.warning(
+                 "is_report_ready: report store read failed - skipping delta check",
+                 analysis_id=analysis_id,
+                 error=str(exc),
+             )
+         if last_report_at is not None and not any(
+             _is_newer(r.created_at, last_report_at) for r in substantive
+         ):
+             missing.append(_MISSING_DELTA)
+
+     return ReportReadiness(ready=not missing, missing=missing)
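The three readiness criteria above can be exercised without a database. The following is a simplified, synchronous sketch under stated assumptions: `Task`, `Record` and `readiness` are hypothetical stand-ins for the PR's `TaskSummary`, `AnalysisRecord` and `is_report_ready`, the tool name `analyze_trend` is invented, and store failures are not modelled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Task:
    status: str
    tools_used: list = field(default_factory=list)


@dataclass
class Record:
    created_at: datetime
    tasks_run: list = field(default_factory=list)


def has_successful_analysis(record: Record) -> bool:
    """A record counts only if an analyze_* tool ran inside a successful task."""
    return any(
        t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used)
        for t in record.tasks_run
    )


def readiness(problem_validated: bool, records: list, last_report_at: Optional[datetime]):
    """Return (ready, missing) using the floor plus the delta-since-report check."""
    missing = []
    if not problem_validated:
        missing.append("a validated problem statement")
    substantive = [r for r in records if has_successful_analysis(r)]
    if not substantive:
        missing.append("at least one completed analysis")
    elif last_report_at is not None and not any(
        r.created_at > last_report_at for r in substantive
    ):
        missing.append("a new analysis since the last report")
    return (not missing, missing)


t0 = datetime(2026, 6, 22, 10, 0, tzinfo=timezone.utc)
# Data access succeeded but the analysis itself failed: not substantive.
failed_run = Record(t0, [Task("success", ["retrieve_data"]), Task("failure", ["analyze_trend"])])
good_run = Record(t0 + timedelta(hours=1), [Task("success", ["analyze_trend"])])
```

The `failed_run` case is the one the docstring warns about: a task succeeded and findings may exist, yet no analysis tool completed, so the report floor is not met.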
src/agents/report/schemas.py ADDED
@@ -0,0 +1,91 @@
+ """Report contract - `AnalysisReport` and its parts (KM-644).
+
+ The report generator turns a session's persisted `AnalysisRecord`s + Problem
+ Statement into a versioned report. Only `executive_summary` is LLM-authored
+ (`ReportSummaryNarrative`); every other field is copied verbatim from the records
+ by code (INV-4), so the report stays a faithful, auditable artifact.
+
+ Two deliberate looseness choices for v1 (tighten later once usage shows):
+ `ProblemStatement` (stub of Harry's real PS) and `ReportFinding.supporting_data`.
+
+ See CHECKPOINT_PLAN_2026-06-17.md decision #8.
+ """
+
+ from __future__ import annotations
+
+ from datetime import datetime
+ from uuid import uuid4
+
+ from pydantic import BaseModel, Field
+
+ from ..slow_path.schemas import TaskSummary
+
+
+ class ProblemStatement(BaseModel):
+     """Minimal stub of Harry's Problem Statement, frozen into each report.
+
+     Loose on purpose until the real PS template lands (Analysis State, upstream).
+     A report snapshots the PS as it was at generation time.
+     """
+
+     objective: str = ""
+     metric_direction: str = ""  # "increase" | "decrease"
+     target_metric: str = ""
+     target_value: str = ""
+     scope: str = ""
+
+
+ class DataSourceRef(BaseModel):
+     """Frozen catalog metadata for a source used in the analysis.
+
+     Snapshotted at generation time (NOT re-fetched at render) so a re-ingested
+     source never retroactively changes an old report - same freeze rationale as
+     `ProblemStatement`.
+     """
+
+     source_id: str
+     name: str
+     source_type: str  # postgres | file | ...
+     detail: dict = Field(default_factory=dict)  # rows in scope, columns, window
+
+
+ class ReportFinding(BaseModel):
+     text: str
+     record_ids: list[str] = Field(default_factory=list)  # records backing this finding
+     supporting_data: dict | None = None  # loose for v1; the chart-able slice
+
+
+ class AttributedNote(BaseModel):
+     """A caveat or open question carrying the records it came from.
+
+     Plural `record_ids` because a note can be deduped/merged across records.
+     """
+
+     text: str
+     record_ids: list[str] = Field(default_factory=list)
+
+
+ class ReportSummaryNarrative(BaseModel):
+     """The ONLY LLM-authored part of the report (with_structured_output target)."""
+
+     executive_summary: str
+
+
+ class AnalysisReport(BaseModel):
+     report_id: str = Field(default_factory=lambda: uuid4().hex)
+     analysis_id: str
+     user_id: str | None = None
+     version: int
+     generated_at: datetime
+     # Frozen snapshots.
+     problem_statement: ProblemStatement = Field(default_factory=ProblemStatement)
+     record_ids: list[str] = Field(default_factory=list)  # records used (snapshot)
+     # LLM-authored.
+     executive_summary: str = ""
+     # Deterministic pass-through from records.
+     findings: list[ReportFinding] = Field(default_factory=list)
+     caveats: list[AttributedNote] = Field(default_factory=list)
+     open_questions: list[AttributedNote] = Field(default_factory=list)
+     data_sources: list[DataSourceRef] = Field(default_factory=list)
+     method_steps: list[TaskSummary] = Field(default_factory=list)  # carries `stage`
+     rendered_markdown: str = ""
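`AttributedNote` carries plural `record_ids` because the same caveat can surface in several records. One way that merge might work is sketched below; `Note` and `merge_notes` are hypothetical, and the choice to treat texts as equal after trimming and lower-casing is an assumption for illustration, not something this diff confirms about `_collect_notes`.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    text: str
    record_ids: list = field(default_factory=list)


def merge_notes(per_record: dict) -> list:
    """Merge identical note texts, keeping first-seen order and every source record id.

    `per_record` maps a record id to the note texts that record produced.
    """
    merged: dict = {}
    for record_id, texts in per_record.items():
        for text in texts:
            key = text.strip().lower()  # assumed normalisation rule
            note = merged.setdefault(key, Note(text=text.strip()))
            if record_id not in note.record_ids:
                note.record_ids.append(record_id)
    return list(merged.values())


notes = merge_notes(
    {
        "r1": ["Q3 data is incomplete"],
        "r2": ["q3 data is incomplete ", "Only EU region covered"],
    }
)
```

Keeping the record ids alongside the merged text is what lets a reader trace each caveat back to the runs that raised it.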
src/agents/report/store.py ADDED
@@ -0,0 +1,119 @@
+ """ReportStore - persists/reads versioned AnalysisReports (KM-644).
+
+ Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSessionLocal`.
+
+ Version assignment is serialized per `analysis_id` with a Postgres
+ transaction-level advisory lock so concurrent button presses can't compute the
+ same version number; the `(analysis_id, version)` unique constraint is the
+ backstop. Per decision 4A every generation is a new version, so two
+ near-simultaneous presses legitimately produce V<n> and V<n+1> - the lock only
+ prevents a duplicate-number race, not double generation.
+ """
+
+ from __future__ import annotations
+
+ import hashlib
+
+ from sqlalchemy import func, select, text
+
+ from src.db.postgres.connection import AsyncSessionLocal
+ from src.db.postgres.models import AnalysisReportRow
+ from src.middlewares.logging import get_logger
+
+ from .schemas import AnalysisReport
+
+ logger = get_logger("report_store")
+
+
+ def _lock_key(analysis_id: str) -> int:
+     """Stable signed 64-bit key for `pg_advisory_xact_lock`.
+
+     Python's builtin `hash(str)` is randomized per process, so derive a
+     deterministic key from a digest instead.
+     """
+     digest = hashlib.sha256(analysis_id.encode()).digest()
+     return int.from_bytes(digest[:8], "big", signed=True)
+
+
+ def _report_title(report: AnalysisReport) -> str:
+     """Title for the dedorch `reports.title` column - the goal, else a generic label."""
+     objective = (report.problem_statement.objective or "").strip()
+     return objective[:200] if objective else "Analysis Report"
+
+
+ def _row_to_report(row) -> AnalysisReport:
+     """Rebuild a minimal AnalysisReport from the flat dedorch row.
+
+     dedorch stores markdown only, so structured fields (findings/caveats/…) come back
+     empty; `rendered_markdown` carries the content the FE renders/downloads.
+     """
+     return AnalysisReport(
+         report_id=row.id,
+         analysis_id=row.analysis_id,
+         version=row.version,
+         generated_at=row.generated_at,
+         rendered_markdown=row.content,
+     )
+
+
+ class ReportStore:
+     """Read/write versioned reports keyed by `analysis_id`."""
+
+     async def save(self, report: AnalysisReport) -> AnalysisReport:
+         """Assign the next version under an advisory lock and persist.
+
+         Mutates and returns `report` with its final `version`.
+         """
+         async with AsyncSessionLocal() as session:
+             async with session.begin():
+                 await session.execute(
+                     text("SELECT pg_advisory_xact_lock(:k)"),
+                     {"k": _lock_key(report.analysis_id)},
+                 )
+                 result = await session.execute(
+                     select(func.max(AnalysisReportRow.version)).where(
+                         AnalysisReportRow.analysis_id == report.analysis_id
+                     )
+                 )
+                 report.version = (result.scalar_one_or_none() or 0) + 1
+                 session.add(
+                     AnalysisReportRow(
+                         id=report.report_id,
+                         analysis_id=report.analysis_id,
+                         title=_report_title(report),
+                         content=report.rendered_markdown or "",
+                         generated_at=report.generated_at,
+                         version=report.version,
+                     )
+                 )
+             # leaving session.begin() commits, which releases the advisory lock
+         logger.info(
+             "report persisted",
+             analysis_id=report.analysis_id,
+             version=report.version,
+             report_id=report.report_id,
+         )
+         return report
+
+     async def list_for_analysis(self, analysis_id: str) -> list[AnalysisReport]:
+         async with AsyncSessionLocal() as session:
+             result = await session.execute(
+                 select(AnalysisReportRow)
+                 .where(AnalysisReportRow.analysis_id == analysis_id)
+                 .order_by(AnalysisReportRow.version.asc())
+             )
+             rows = result.scalars().all()
+         return [_row_to_report(row) for row in rows]
+
+     async def get(self, analysis_id: str, version: int) -> AnalysisReport | None:
+         async with AsyncSessionLocal() as session:
+             result = await session.execute(
+                 select(AnalysisReportRow).where(
+                     AnalysisReportRow.analysis_id == analysis_id,
+                     AnalysisReportRow.version == version,
+                 )
+             )
+             row = result.scalar_one_or_none()
+         if row is None:
+             return None
+         return _row_to_report(row)
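The lock-key derivation in `_lock_key` can be checked on its own. The sketch below repeats the same digest-based derivation under a hypothetical name (`lock_key`) with made-up analysis ids, and confirms the result fits the signed 64-bit `bigint` argument that `pg_advisory_xact_lock` takes.

```python
import hashlib

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def lock_key(analysis_id: str) -> int:
    """Derive a process-independent signed 64-bit key from the first 8 digest bytes."""
    digest = hashlib.sha256(analysis_id.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Hypothetical ids, for illustration only.
key_a = lock_key("analysis-123")
key_b = lock_key("analysis-456")
```

Unlike the builtin `hash()`, this value is identical across worker processes and restarts, which is what makes two workers contend on the same lock for the same `analysis_id`.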
src/agents/slow_path/assembler.py CHANGED
@@ -33,6 +33,7 @@ from .schemas import (
33
  AssembledOutput,
34
  AssemblerNarrative,
35
  RunState,
 
36
  TaskSummary,
37
  )
38
 
@@ -116,16 +117,46 @@ class Assembler:
116
  return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
117
 
118
 
119
  def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
120
  tasks_run = [
121
  TaskSummary(
122
  task_id=task_id,
 
123
  objective=result.objective,
124
  status=result.status,
125
  tools_used=[o.tool for o in result.outputs],
126
  )
127
  for task_id, result in run_state.results.items()
128
  ]
 
 
 
129
  return AnalysisRecord(
130
  goal_restated=narrative.goal_restated,
131
  findings=narrative.findings,
@@ -133,7 +164,7 @@ def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> Analysi
133
  data_used=narrative.data_used,
134
  open_questions=narrative.open_questions,
135
  tasks_run=tasks_run,
136
- results_snapshot=run_state.results,
137
  plan_id=run_state.plan_id,
138
  business_context_id=run_state.business_context_id,
139
  created_at=datetime.now(UTC),
 
33
  AssembledOutput,
34
  AssemblerNarrative,
35
  RunState,
36
+ TaskResult,
37
  TaskSummary,
38
  )
39
 
 
117
  return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
118
 
119
 
120
+ # Persisted records keep `analyze_*` outputs (scalar/stats/series - small, and the
121
+ # basis a future report/chart renders from) in full, but cap raw `table` rows from
122
+ # data-access tools (retrieve_data can return up to the 10k LIMIT): the report never
123
+ # renders raw rows, so storing them all would bloat every record's jsonb.
124
+ _SNAPSHOT_ROW_SAMPLE = 10
125
+
126
+
127
+ def _trim_for_snapshot(result: TaskResult) -> TaskResult:
128
+ trimmed = []
129
+ changed = False
130
+ for out in result.outputs:
131
+ if out.kind == "table" and out.rows is not None and len(out.rows) > _SNAPSHOT_ROW_SAMPLE:
132
+ changed = True
133
+ trimmed.append(
134
+ out.model_copy(
135
+ update={
136
+ "rows": out.rows[:_SNAPSHOT_ROW_SAMPLE],
137
+ "meta": {**out.meta, "total_rows": len(out.rows), "rows_truncated": True},
138
+ }
139
+ )
140
+ )
141
+ else:
142
+ trimmed.append(out)
143
+ return result.model_copy(update={"outputs": trimmed}) if changed else result
144
+
145
+
146
  def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
147
  tasks_run = [
148
  TaskSummary(
149
  task_id=task_id,
150
+ stage=result.stage,
151
  objective=result.objective,
152
  status=result.status,
153
  tools_used=[o.tool for o in result.outputs],
154
  )
155
  for task_id, result in run_state.results.items()
156
  ]
157
+ results_snapshot = {
158
+ task_id: _trim_for_snapshot(result) for task_id, result in run_state.results.items()
159
+ }
160
  return AnalysisRecord(
161
  goal_restated=narrative.goal_restated,
162
  findings=narrative.findings,
 
164
  data_used=narrative.data_used,
165
  open_questions=narrative.open_questions,
166
  tasks_run=tasks_run,
167
+ results_snapshot=results_snapshot,
168
  plan_id=run_state.plan_id,
169
  business_context_id=run_state.business_context_id,
170
  created_at=datetime.now(UTC),
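The row-capping rule in `_trim_for_snapshot` can be tried on plain dictionaries. This is a sketch only: `trim_output` is a hypothetical name, the dict keys mirror the `kind` / `rows` / `meta` fields used above, and the sample tool names are assumptions.

```python
SNAPSHOT_ROW_SAMPLE = 10


def trim_output(output: dict, limit: int = SNAPSHOT_ROW_SAMPLE) -> dict:
    """Cap raw table rows at `limit`; leave every other output untouched."""
    rows = output.get("rows")
    if output.get("kind") != "table" or rows is None or len(rows) <= limit:
        return output  # same object, nothing copied
    return {
        **output,
        "rows": rows[:limit],
        # Record how much was dropped so a reader knows the sample is partial.
        "meta": {**output.get("meta", {}), "total_rows": len(rows), "rows_truncated": True},
    }


big = {"kind": "table", "tool": "retrieve_data", "rows": [{"i": i} for i in range(250)], "meta": {}}
stats = {"kind": "stats", "tool": "analyze_trend", "rows": None, "meta": {"mean": 4.2}}
trimmed = trim_output(big)
```

The original output is not mutated, so the live run state keeps all rows while only the persisted snapshot is reduced.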
src/agents/slow_path/coordinator.py CHANGED
@@ -1,9 +1,9 @@
1
  """SlowPathCoordinator - wires the slow path: Planner -> TaskRunner -> Assembler.
2
 
3
- A thin coordination object. This is the unit the (future) expanded Orchestrator /
4
- ChatHandler will call on a `structured` analytical query. It is built and tested
5
- here but **not yet wired into the live chat flow** - that step waits on the tool
6
- team's real `ToolInvoker` and a real `BusinessContext` source.
7
 
8
  See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
9
  """
 
1
  """SlowPathCoordinator - wires the slow path: Planner -> TaskRunner -> Assembler.
2
 
3
+ A thin coordination object. `ChatHandler` calls it on a `structured_flow` query when
4
+ `ENABLE_SLOW_PATH` is on (the real `ToolInvoker` is composed in
5
+ `ChatHandler._get_slow_path_coordinator`). `BusinessContext` is still a stub until the
6
+ lead's real source lands.
7
 
8
  See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
9
  """
src/agents/slow_path/schemas.py CHANGED
@@ -21,10 +21,12 @@ from __future__ import annotations
21
 
22
  from datetime import datetime
23
  from typing import Literal
 
24
 
25
  from pydantic import BaseModel, Field
26
 
27
  from ..planner.contracts import ToolOutput
 
28
 
29
  TaskStatus = Literal["success", "partial", "failure"]
30
 
@@ -36,6 +38,7 @@ TaskStatus = Literal["success", "partial", "failure"]
36
 
37
  class TaskResult(BaseModel):
38
  task_id: str
 
39
  status: TaskStatus
40
  objective: str
41
  outputs: list[ToolOutput] = Field(default_factory=list) # one per tool_call
@@ -57,12 +60,21 @@ class RunState(BaseModel):
57
 
58
  class TaskSummary(BaseModel):
59
  task_id: str
 
60
  objective: str
61
  status: TaskStatus
62
  tools_used: list[str] = Field(default_factory=list)
63
 
64
 
65
  class AnalysisRecord(BaseModel):
 
 
 
 
 
 
 
 
66
  # Narrative fields - authored by the Assembler LLM.
67
  goal_restated: str
68
  findings: list[str] = Field(default_factory=list)
 
21
 
22
  from datetime import datetime
23
  from typing import Literal
24
+ from uuid import uuid4
25
 
26
  from pydantic import BaseModel, Field
27
 
28
  from ..planner.contracts import ToolOutput
29
+ from ..planner.schemas import CrispStage
30
 
31
  TaskStatus = Literal["success", "partial", "failure"]
32
 
 
38
 
39
  class TaskResult(BaseModel):
40
  task_id: str
41
+ stage: CrispStage # copied from the plan Task; carries CRISP-DM grouping to the report
42
  status: TaskStatus
43
  objective: str
44
  outputs: list[ToolOutput] = Field(default_factory=list) # one per tool_call
 
60
 
61
  class TaskSummary(BaseModel):
62
  task_id: str
63
+ stage: CrispStage # lets the report group the method appendix by CRISP-DM phase
64
  objective: str
65
  status: TaskStatus
66
  tools_used: list[str] = Field(default_factory=list)
67
 
68
 
69
  class AnalysisRecord(BaseModel):
70
+ # Identity. `record_id` is the unit the report cites and snapshots
71
+ # (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis
72
+ # session + owner and are stamped by the composition root / AnalysisStore at
73
+ # persist time (they depend on the Analysis State that lives outside the slow
74
+ # path), so they default to None when the Assembler first builds the record.
75
+ record_id: str = Field(default_factory=lambda: uuid4().hex)
76
+ analysis_id: str | None = None
77
+ user_id: str | None = None
78
  # Narrative fields - authored by the Assembler LLM.
79
  goal_restated: str
80
  findings: list[str] = Field(default_factory=list)
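The new `stage` field on `TaskSummary` exists so the report can group its method appendix by CRISP-DM phase. A rough sketch of that grouping follows; the stage keys and labels in `STAGE_LABELS` are assumptions loosely based on CRISP-DM phase names (the PR's actual `_STAGE_LABELS` values are not visible in this section), and `Step`, `render_method` and the tool names are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed (key, label) pairs; ordering here controls the appendix ordering.
STAGE_LABELS = [
    ("data_understanding", "Data understanding"),
    ("data_preparation", "Data preparation"),
    ("modeling", "Modeling"),
]


@dataclass
class Step:
    stage: str
    status: str
    tools_used: list = field(default_factory=list)


def render_method(steps: list) -> str:
    """Render one line per non-empty stage, in STAGE_LABELS order."""
    lines = ["## Appendix B - Method"]
    for stage_key, label in STAGE_LABELS:
        in_stage = [s for s in steps if s.stage == stage_key]
        if not in_stage:
            continue  # skip stages with no recorded steps
        rendered = "; ".join(
            f"{', '.join(s.tools_used) or 'n/a'} ({s.status})" for s in in_stage
        )
        lines.append(f"**{label}** - {rendered}")
    return "\n".join(lines)


steps = [
    Step("modeling", "success", ["analyze_trend"]),
    Step("data_understanding", "success", ["check_schema", "retrieve_data"]),
    Step("modeling", "failure", []),
]
appendix = render_method(steps)
```

Iterating over the label list rather than over the steps keeps the appendix in a fixed phase order regardless of the order in which tasks happened to run.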
src/agents/slow_path/store.py CHANGED
@@ -2,21 +2,28 @@
2
 
3
  The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
4
  run - §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
5
- so it sits behind this one-method seam.
 
6
 
7
- `NullAnalysisStore` is the default: it logs that a record was produced but stores
8
- nothing, because the backing table does not exist yet. The plan is to store records
9
- in the **same catalog DB** (Neon `dataeyond`, `settings.postgres_connstring`).
 
10
 
11
- TODO(persistence): add a Postgres-backed `AnalysisStore` writing an
12
- `analysis_records` table in the catalog DB, keyed on
13
- (business_context_id, plan_id, created_at), then inject it into ChatHandler.
14
  """
15
 
16
  from __future__ import annotations
17
 
18
  from typing import Protocol, runtime_checkable
19
 
 
 
 
 
 
20
  from src.middlewares.logging import get_logger
21
 
22
  from .schemas import AnalysisRecord
@@ -26,19 +33,78 @@ logger = get_logger("analysis_store")
26
 
27
  @runtime_checkable
28
  class AnalysisStore(Protocol):
29
- """Persist a completed analysis. Implementations must never raise on the
30
- caller's path - a persistence failure must not break the user's answer."""
 
 
 
31
 
32
  async def save(self, record: AnalysisRecord) -> None: ...
33
 
 
 
34
 
35
  class NullAnalysisStore:
36
- """Default no-op store: logs the record, persists nothing (no table yet)."""
37
 
38
  async def save(self, record: AnalysisRecord) -> None:
39
  logger.info(
40
- "analysis_record produced (not persisted - no store configured)",
 
41
  plan_id=record.plan_id,
42
- business_context_id=record.business_context_id,
43
  n_tasks=len(record.tasks_run),
44
  )
2
 
3
  The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
4
  run - §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
5
+ so it sits behind this seam. `generate_report` later reads records back by
6
+ `analysis_id` (oldest-first) and renders from them - never from chat history.
7
 
8
+ - `NullAnalysisStore` logs and stores nothing (kept for tests / when persistence
9
+ is intentionally disabled).
10
+ - `PostgresAnalysisStore` writes one `analysis_records` row per run in the catalog
11
+ DB (Neon `dataeyond`, `settings.postgres_connstring`).
12
 
13
+ `save` must never raise on the caller's path - a persistence failure must not break
14
+ the user's answer (§8.3). `list_for_analysis` is a read for the report generator and
15
+ is allowed to surface errors to its caller.
16
  """
17
 
18
  from __future__ import annotations
19
 
20
  from typing import Protocol, runtime_checkable
21
 
22
+ from sqlalchemy import select
23
+ from sqlalchemy.dialects.postgresql import insert
24
+
25
+ from src.db.postgres.connection import AsyncSessionLocal
26
+ from src.db.postgres.models import AnalysisRecordRow
27
  from src.middlewares.logging import get_logger
28
 
29
  from .schemas import AnalysisRecord
 
33
 
34
  @runtime_checkable
35
  class AnalysisStore(Protocol):
36
+ """Persist + read completed analyses.
37
+
38
+ `save` must never raise on the caller's path. `list_for_analysis` returns the
39
+ records for one analysis session, oldest-first (the order the report renders in).
40
+ """
41
 
42
  async def save(self, record: AnalysisRecord) -> None: ...
43
 
44
+ async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...
45
+
46
 
47
  class NullAnalysisStore:
48
+ """No-op store: logs the record, persists nothing. Reads return empty."""
49
 
50
  async def save(self, record: AnalysisRecord) -> None:
51
  logger.info(
52
+ "analysis_record produced (not persisted - NullAnalysisStore)",
53
+ record_id=record.record_id,
54
  plan_id=record.plan_id,
 
55
  n_tasks=len(record.tasks_run),
56
  )
57
+
58
+ async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
59
+ return []
60
+
61
+
62
+ class PostgresAnalysisStore:
63
+ """Writes/reads `analysis_records` jsonb rows in the catalog DB.
64
+
65
+ Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per
66
+ record (vs. one-per-user for the catalog) since records accumulate per analysis.
67
+ """
68
+
69
+ async def save(self, record: AnalysisRecord) -> None:
70
+ try:
71
+ payload = record.model_dump(mode="json")
72
+ async with AsyncSessionLocal() as session:
73
+ stmt = insert(AnalysisRecordRow).values(
74
+ id=record.record_id,
75
+ analysis_id=record.analysis_id,
76
+ user_id=record.user_id,
77
+ plan_id=record.plan_id,
78
+ data=payload,
79
+ created_at=record.created_at,
80
+ )
81
+ # Re-running the same plan id-collides only if record_id repeats;
82
+ # treat that as idempotent (overwrite) rather than erroring the user.
83
+ stmt = stmt.on_conflict_do_update(
84
+ index_elements=[AnalysisRecordRow.id],
85
+ set_={"data": stmt.excluded.data},
86
+ )
87
+ await session.execute(stmt)
88
+ await session.commit()
89
+ logger.info(
90
+ "analysis_record persisted",
91
+ record_id=record.record_id,
92
+ analysis_id=record.analysis_id,
93
+ user_id=record.user_id,
94
+ )
95
+ except Exception as exc: # never break the user's answer (§8.3)
96
+ logger.error(
97
+ "analysis_record persist failed",
98
+ record_id=record.record_id,
99
+ error=str(exc),
100
+ )
101
+
102
+ async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
103
+ async with AsyncSessionLocal() as session:
104
+ result = await session.execute(
105
+ select(AnalysisRecordRow.data)
106
+ .where(AnalysisRecordRow.analysis_id == analysis_id)
107
+ .order_by(AnalysisRecordRow.created_at.asc())
108
+ )
109
+ rows = result.scalars().all()
110
+ return [AnalysisRecord.model_validate(row) for row in rows]
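The store contract described above (a `save` that never raises, idempotent writes on a repeated `record_id`, and oldest-first reads) can be mimicked with an in-memory stand-in. Everything here is a hypothetical sketch: `InMemoryAnalysisStore` is not part of the PR, records are plain dicts, and a repeated id replaces the whole record, whereas the Postgres version updates only the `data` column.

```python
import asyncio
from datetime import datetime, timezone


class InMemoryAnalysisStore:
    """Test double: swallows write errors, returns records oldest-first."""

    def __init__(self, fail_writes: bool = False) -> None:
        self._rows: dict = {}
        self._fail_writes = fail_writes
        self.errors: list = []

    async def save(self, record: dict) -> None:
        try:
            if self._fail_writes:
                raise ConnectionError("database unavailable")
            self._rows[record["record_id"]] = record  # same id overwrites
        except Exception as exc:  # never propagate to the caller's path
            self.errors.append(str(exc))

    async def list_for_analysis(self, analysis_id: str) -> list:
        rows = [r for r in self._rows.values() if r["analysis_id"] == analysis_id]
        return sorted(rows, key=lambda r: r["created_at"])


async def demo():
    late = {"record_id": "b", "analysis_id": "a1",
            "created_at": datetime(2026, 6, 22, 12, tzinfo=timezone.utc)}
    early = {"record_id": "a", "analysis_id": "a1",
             "created_at": datetime(2026, 6, 22, 9, tzinfo=timezone.utc)}
    store = InMemoryAnalysisStore()
    for rec in (late, early, late):  # the second `late` is an idempotent re-save
        await store.save(rec)
    ordered = [r["record_id"] for r in await store.list_for_analysis("a1")]

    broken = InMemoryAnalysisStore(fail_writes=True)
    await broken.save(early)  # must not raise
    return ordered, broken.errors


ordered, errors = asyncio.run(demo())
```

A double like this is one way to inject `record_store` into `report_floor` or `ReportGenerator` in tests without a database.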
src/agents/slow_path/task_runner.py CHANGED
@@ -53,6 +53,7 @@ class TaskRunner:
53
  for tid in list(remaining):
54
  results[tid] = TaskResult(
55
  task_id=tid,
 
56
  status="failure",
57
  objective=tasks_by_id[tid].objective,
58
  error="unresolved dependency; task could not run",
@@ -68,6 +69,7 @@ class TaskRunner:
68
  if failed:
69
  results[tid] = TaskResult(
70
  task_id=tid,
 
71
  status="failure",
72
  objective=task.objective,
73
  error=f"skipped: upstream {failed} did not succeed",
@@ -110,6 +112,7 @@ class TaskRunner:
110
  error = errs[0] if errs else "all tool calls failed"
111
  return TaskResult(
112
  task_id=task.id,
 
113
  status=status,
114
  objective=task.objective,
115
  outputs=outputs,
 
53
  for tid in list(remaining):
54
  results[tid] = TaskResult(
55
  task_id=tid,
56
+ stage=tasks_by_id[tid].stage,
57
  status="failure",
58
  objective=tasks_by_id[tid].objective,
59
  error="unresolved dependency; task could not run",
 
69
  if failed:
70
  results[tid] = TaskResult(
71
  task_id=tid,
72
+ stage=task.stage,
73
  status="failure",
74
  objective=task.objective,
75
  error=f"skipped: upstream {failed} did not succeed",
 
112
  error = errs[0] if errs else "all tool calls failed"
113
  return TaskResult(
114
  task_id=task.id,
115
+ stage=task.stage,
116
  status=status,
117
  objective=task.objective,
118
  outputs=outputs,
src/agents/state_store.py ADDED
@@ -0,0 +1,128 @@
1
+ """AnalysisStateStore: read/write the per-analysis session state.
2
+
3
+ The orchestrator gate + Help skill read `AnalysisState` (the locked contract in
4
+ `gate.py`) every turn; the Problem Statement skill writes `problem_validated`. The
5
+ row shares its id with the chat `rooms` row: one session = one analysis = one
6
+ conversation (`analysis_id == room_id`).
7
+
8
+ Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSession`.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ from sqlalchemy.dialects.postgresql import insert
14
+
15
+ from src.agents.gate import AnalysisState
16
+ from src.db.postgres.connection import AsyncSessionLocal
17
+ from src.db.postgres.models import AnalysisStateRow
18
+ from src.middlewares.logging import get_logger
19
+
20
+ logger = get_logger("analysis_state_store")
21
+
22
+
23
+ def _row_to_state(row: AnalysisStateRow) -> AnalysisState:
24
+ """Map a DB row to the frozen `AnalysisState` contract."""
25
+ return AnalysisState(
26
+ id=row.id,
27
+ analysis_title=row.analysis_title,
28
+ problem_statement=row.problem_statement,
29
+ problem_validated=row.problem_validated,
30
+ owner_id=row.owner_id,
31
+ report_id=row.report_id,
32
+ created_at=row.created_at,
33
+ updated_at=row.updated_at,
34
+ )
35
+
36
+
37
+ class AnalysisStateStore:
38
+ """Read/write the dedorch `analysis` table, keyed by the shared session id."""
39
+
40
+ async def get(self, analysis_id: str) -> AnalysisState | None:
41
+ async with AsyncSessionLocal() as session:
42
+ row = await session.get(AnalysisStateRow, analysis_id)
43
+ return _row_to_state(row) if row is not None else None
44
+
45
+ async def ensure(
46
+ self,
47
+ analysis_id: str,
48
+ owner_id: str,
49
+ analysis_title: str = "New analysis",
50
+ ) -> AnalysisState:
51
+ """Get-or-create the state row for a session (idempotent, race-safe).
52
+
53
+ Sessions born from `/room/create` have no `analysis_states` row; without
54
+ one the gate redirect-loops and `problem_statement` / `report_id` writes
55
+ silently no-op. Called per turn (analysis_id == room_id) so any session is
56
+ gate-ready. `INSERT ... ON CONFLICT DO NOTHING` makes concurrent first
57
+ turns safe; the row is then read back. Legacy rows created this way carry
58
+ no source bindings, so binding scoping fails open to the whole catalog.
59
+ """
60
+ async with AsyncSessionLocal() as session:
61
+ stmt = (
62
+ insert(AnalysisStateRow)
63
+ .values(
64
+ id=analysis_id,
65
+ owner_id=owner_id,
66
+ analysis_title=analysis_title,
67
+ problem_statement="",
68
+ problem_validated=False,
69
+ )
70
+ .on_conflict_do_nothing(index_elements=[AnalysisStateRow.id])
71
+ )
72
+ await session.execute(stmt)
73
+ await session.commit()
74
+ row = await session.get(AnalysisStateRow, analysis_id)
75
+ return _row_to_state(row)
76
+
77
+ async def create(
78
+ self,
79
+ *,
80
+ analysis_id: str,
81
+ owner_id: str,
82
+ analysis_title: str = "New analysis",
83
+ problem_statement: str = "",
84
+ ) -> AnalysisState:
85
+ """Create the state row for a new analysis (id shared with its chat room)."""
86
+ async with AsyncSessionLocal() as session:
87
+ row = AnalysisStateRow(
88
+ id=analysis_id,
89
+ owner_id=owner_id,
90
+ analysis_title=analysis_title,
91
+ problem_statement=problem_statement,
92
+ problem_validated=False,
93
+ )
94
+ session.add(row)
95
+ await session.commit()
96
+ await session.refresh(row)
97
+ return _row_to_state(row)
98
+
99
+ async def update(
100
+ self,
101
+ analysis_id: str,
102
+ *,
103
+ problem_statement: str | None = None,
104
+ problem_validated: bool | None = None,
105
+ report_id: str | None = None,
106
+ ) -> AnalysisState | None:
107
+ """Patch the given fields (only non-None args are written). Returns the row.
108
+
109
+ Used by the Problem Statement skill (`problem_validated`) and the report
110
+ flow (`report_id`). Returns None if the analysis doesn't exist.
111
+ """
112
+ async with AsyncSessionLocal() as session:
113
+ row = await session.get(AnalysisStateRow, analysis_id)
114
+ if row is None:
115
+ logger.warning(
116
+ "analysis row missing; update skipped",
117
+ analysis_id=analysis_id,
118
+ )
119
+ return None
120
+ if problem_statement is not None:
121
+ row.problem_statement = problem_statement
122
+ if problem_validated is not None:
123
+ row.problem_validated = problem_validated
124
+ if report_id is not None:
125
+ row.report_id = report_id
126
+ await session.commit()
127
+ await session.refresh(row)
128
+ return _row_to_state(row)
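A small sketch of the "only non-None args are written" patch rule in `update`, using a plain dataclass in place of the ORM row. The `State` class and `apply_patch` helper are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    """Hypothetical stand-in for the three patchable columns."""
    problem_statement: str = ""
    problem_validated: bool = False
    report_id: Optional[str] = None


def apply_patch(
    state: State,
    *,
    problem_statement: Optional[str] = None,
    problem_validated: Optional[bool] = None,
    report_id: Optional[str] = None,
) -> State:
    """Write only the arguments that are not None; None means 'leave as-is'."""
    if problem_statement is not None:
        state.problem_statement = problem_statement
    if problem_validated is not None:
        state.problem_validated = problem_validated
    if report_id is not None:
        state.report_id = report_id
    return state


state = State(problem_statement="Why did churn rise?", report_id="rep-1")
apply_patch(state, problem_validated=True)
apply_patch(state, problem_validated=False)  # False is not None, so it is written
apply_patch(state, report_id=None)           # None is skipped, report_id is kept
print(state)
```

One consequence of this rule is that a field cannot be reset to NULL through the same interface; if that were ever needed, a dedicated sentinel value would be one option.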
src/api/v1/analysis.py ADDED
@@ -0,0 +1,174 @@
1
+ """Analysis session API: create a new analysis (the per-session workspace).
2
+
3
+ An analysis IS the chat session: the `analysis_states` row and the chat `rooms`
4
+ row share one id (`analysis_id == room_id`), so the existing `room_id` on the chat
5
+ request doubles as the `analysis_id`. Creating an analysis enforces the data-first
6
+ gate (>=1 bound source) and seeds the state with a title + an optional problem
7
+ statement (validated later by the Problem Statement skill).
8
+ """
9
+
10
+ import uuid
11
+
12
+ from fastapi import APIRouter, Depends, HTTPException
13
+ from pydantic import BaseModel, Field
14
+ from sqlalchemy import select
15
+ from sqlalchemy.ext.asyncio import AsyncSession
16
+
17
+ from src.db.postgres.connection import get_db
18
+ from src.db.postgres.models import AnalysisDataSourceRow, AnalysisStateRow, Room
19
+ from src.middlewares.logging import get_logger, log_execution
20
+
21
+ logger = get_logger("analysis_api")
22
+
23
+ router = APIRouter(prefix="/api/v1", tags=["Analysis"])
24
+
25
+
26
+ def _serialize_state(row: AnalysisStateRow, data_source_ids: list[str]) -> dict:
27
+ """The full analysis payload: the 8 state fields + the bound source ids."""
28
+ return {
29
+ "id": row.id,
30
+ "analysis_title": row.analysis_title,
31
+ "problem_statement": row.problem_statement,
32
+ "problem_validated": row.problem_validated,
33
+ "owner_id": row.owner_id,
34
+ "report_id": row.report_id,
35
+ "data_source_ids": data_source_ids,
36
+ "created_at": row.created_at.isoformat() if row.created_at else None,
37
+ "updated_at": row.updated_at.isoformat() if row.updated_at else None,
38
+ }
39
+
40
+
41
+ async def _bound_source_ids(db: AsyncSession, analysis_id: str) -> list[str]:
42
+ result = await db.execute(
43
+ select(AnalysisDataSourceRow.reference_id).where(
44
+ AnalysisDataSourceRow.analysis_id == analysis_id
45
+ )
46
+ )
47
+ return list(result.scalars().all())
48
+
49
+
50
+ async def _sources_by_id(user_id: str) -> dict:
51
+ """Catalog sources keyed by source_id, to resolve `type`/`name` on binding.
52
+
53
+ Never-throw: missing catalog / read error → empty map, and binding rows fall back
54
+ to type='unknown' / name=reference_id.
55
+ """
56
+ try:
57
+ from src.catalog.store import CatalogStore
58
+
59
+ catalog = await CatalogStore().get(user_id)
60
+ except Exception as e: # noqa: BLE001 - binding must not fail on catalog read
61
+ logger.warning("analysis: catalog read failed for binding", user_id=user_id, error=str(e))
62
+ return {}
63
+ return {s.source_id: s for s in catalog.sources} if catalog else {}
64
+
65
+
66
+ class CreateAnalysisRequest(BaseModel):
67
+ user_id: str
68
+ analysis_title: str = "New analysis"
69
+ problem_statement: str = ""
70
+ data_source_ids: list[str] = Field(default_factory=list)
71
+
72
+
73
+ @router.post("/analysis/create")
74
+ @log_execution(logger)
75
+ async def create_analysis(
76
+ request: CreateAnalysisRequest,
77
+ db: AsyncSession = Depends(get_db),
78
+ ):
79
+ """Create a new analysis session: one shared id for its state + chat room.
80
+
81
+ Data-first gate (decision #2): an analysis requires >=1 bound data source.
82
+ The bound sources are persisted as dedorch `data_sources` rows (#10) in the same
83
+ transaction as the state + room, so the analysis is scoped to exactly the sources
84
+ the user picked. `structured_flow` and the report read this binding back.
85
+ """
86
+ if not request.data_source_ids:
87
+ raise HTTPException(
88
+ status_code=400,
89
+ detail="An analysis requires at least one bound data source.",
90
+ )
91
+
92
+ analysis_id = str(uuid.uuid4())
93
+ # The analysis IS the session: state row + chat room + source bindings share one
94
+ # id, created atomically in one transaction.
95
+ state_row = AnalysisStateRow(
96
+ id=analysis_id,
97
+ owner_id=request.user_id,
98
+ analysis_title=request.analysis_title,
99
+ problem_statement=request.problem_statement,
100
+ problem_validated=False,
101
+ )
102
+ db.add(Room(id=analysis_id, user_id=request.user_id, title=request.analysis_title))
103
+ db.add(state_row)
104
+ # dict.fromkeys dedupes while preserving order. Each binding row snapshots the
105
+ # source's type + name from the catalog (reference_id = catalog source id);
106
+ # bound_at/created_at default to now() in dedorch.
107
+ bound_ids = list(dict.fromkeys(request.data_source_ids))
108
+ src_by_id = await _sources_by_id(request.user_id)
109
+ for source_id in bound_ids:
110
+ src = src_by_id.get(source_id)
111
+ db.add(
112
+ AnalysisDataSourceRow(
113
+ id=str(uuid.uuid4()),
114
+ analysis_id=analysis_id,
115
+ type=src.source_type if src else "unknown",
116
+ name=src.name if src else source_id,
117
+ reference_id=source_id,
118
+ bound_by=request.user_id,
119
+ )
120
+ )
121
+ await db.commit()
122
+ await db.refresh(state_row)
123
+
124
+ logger.info(
125
+ "analysis created",
126
+ analysis_id=analysis_id,
127
+ user_id=request.user_id,
128
+ sources=len(bound_ids),
129
+ )
130
+ return {
131
+ "status": "success",
132
+ "message": "Analysis created successfully",
133
+ "data": _serialize_state(state_row, bound_ids),
134
+ }
135
+
136
+
137
+ @router.get("/analysis")
138
+ @log_execution(logger)
139
+ async def list_analyses(user_id: str, db: AsyncSession = Depends(get_db)):
140
+ """List a user's analyses, most-recently-updated first (Analysis sidebar).
141
+
142
+ Summary fields only (no per-row source bindings; fetch those via the detail
143
+ endpoint) to keep the list a single query.
144
+ """
145
+ result = await db.execute(
146
+ select(AnalysisStateRow)
147
+ .where(AnalysisStateRow.owner_id == user_id)
148
+ .order_by(AnalysisStateRow.updated_at.desc())
149
+ )
150
+ rows = result.scalars().all()
151
+ return {
152
+ "status": "success",
153
+ "data": [
154
+ {
155
+ "id": r.id,
156
+ "analysis_title": r.analysis_title,
157
+ "problem_validated": r.problem_validated,
158
+ "report_id": r.report_id,
159
+ "updated_at": r.updated_at.isoformat() if r.updated_at else None,
160
+ }
161
+ for r in rows
162
+ ],
163
+ }
164
+
165
+
166
+ @router.get("/analysis/{analysis_id}")
167
+ @log_execution(logger)
168
+ async def get_analysis(analysis_id: str, db: AsyncSession = Depends(get_db)):
169
+ """Read one analysis's state + bound data sources (the FE workspace render)."""
170
+ row = await db.get(AnalysisStateRow, analysis_id)
171
+ if row is None:
172
+ raise HTTPException(status_code=404, detail=f"Analysis {analysis_id!r} not found.")
173
+ data_source_ids = await _bound_source_ids(db, analysis_id)
174
+ return {"status": "success", "data": _serialize_state(row, data_source_ids)}
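The binding step in `create_analysis` can be sketched in isolation as below. This is an illustration under assumptions: `Source` and `build_bindings` are hypothetical names, and the catalog is a plain dict rather than the real `CatalogStore`. It shows the order-preserving dedupe via `dict.fromkeys` and the fallback to `type="unknown"` / `name=reference_id` when a source is missing from the catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical catalog entry with the three attributes the binding reads."""
    source_id: str
    source_type: str
    name: str


def build_bindings(requested_ids, sources_by_id):
    """Dedupe ids in request order and snapshot type + name for each binding."""
    bound_ids = list(dict.fromkeys(requested_ids))  # keeps first occurrence order
    bindings = []
    for source_id in bound_ids:
        src = sources_by_id.get(source_id)
        bindings.append(
            {
                "reference_id": source_id,
                # Unknown to the catalog: fall back instead of failing the create.
                "type": src.source_type if src else "unknown",
                "name": src.name if src else source_id,
            }
        )
    return bindings


catalog = {"s1": Source("s1", "csv", "Sales 2025")}
bindings = build_bindings(["s1", "s2", "s1"], catalog)
print(bindings)
```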
src/api/v1/chat.py CHANGED
@@ -31,6 +31,7 @@ router = APIRouter(prefix="/api/v1", tags=["Chat"])
31
  _chat_handler = ChatHandler(
32
  enable_tracing=True,
33
  enable_slow_path=settings.enable_slow_path,
 
34
  )
35
 
36
  _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
@@ -64,8 +65,39 @@ async def get_cached_response(redis, cache_key: str) -> Optional[dict]:
64
  return None
65
 
66
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
  async def cache_response(redis, cache_key: str, response: str, sources: list):
68
- await redis.setex(cache_key, 86400, json.dumps({"response": response, "sources": sources}))
 
 
 
 
69
 
70
 
71
  async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
@@ -107,10 +139,10 @@ async def save_messages(
107
 
108
 
109
  @router.delete("/chat/cache")
110
- async def clear_chat_cache(room_id: str, message: str):
111
- """Delete the Redis cache entry for a specific room + message pair."""
112
  redis = await get_redis()
113
- cache_key = f"{settings.redis_prefix}chat:{room_id}:{message}"
114
  deleted = await redis.delete(cache_key)
115
  return {"deleted": deleted > 0, "cache_key": cache_key}
116
 
@@ -146,7 +178,7 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
146
  3. done - signals end of stream
147
  """
148
  redis = await get_redis()
149
- cache_key = f"{settings.redis_prefix}chat:{request.room_id}:{request.message}"
150
 
151
  # Redis cache hit
152
  cached = await get_cached_response(redis, cache_key)
@@ -186,8 +218,17 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
186
  logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
187
  full_response = ""
188
  sources: List[Dict[str, Any]] = []
189
- async for event in handler.handle(request.message, request.user_id, history):
190
- if event["event"] == "sources":
 
 
 
 
 
 
 
 
 
191
  try:
192
  sources = json.loads(event["data"]) or []
193
  except (TypeError, ValueError):
@@ -197,7 +238,10 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
197
  full_response += event["data"]
198
  yield event
199
  elif event["event"] == "done":
200
- await cache_response(redis, cache_key, full_response, sources=sources)
 
 
 
201
  logger.info("saving messages", sources_count=len(sources), sources=sources)
202
  try:
203
  await save_messages(db, request.room_id, request.message, full_response, sources=sources)
@@ -211,7 +255,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
211
  elif event["event"] == "error":
212
  yield event
213
  return
214
- # "intent" event: consumed internally, not forwarded to frontend
215
 
216
  return EventSourceResponse(stream_response())
217
 
 
31
  _chat_handler = ChatHandler(
32
  enable_tracing=True,
33
  enable_slow_path=settings.enable_slow_path,
34
+ enable_gate=settings.enable_gate,
35
  )
36
 
37
  _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
 
65
  return None
66
 
67
 
68
+ # 1h TTL per the 2026-06-11 checkpoint decision (Redis = retrieval/query caching
69
+ # only, short-lived). Was 24h, which served stale answers after re-ingestion.
70
+ _CHAT_CACHE_TTL_SECONDS = 3600
71
+
72
+ # Only stateless replies are safe to cache. The cache key is (room, user, message)
73
+ # with no analysis-state/data version, so caching a state- or data-dependent answer
74
+ # (help / problem_statement / check / structured_flow / unstructured_flow) would
75
+ # replay a stale answer after the state or data changes and, since the read check
76
+ # runs before the gate, could even bypass the gate when the same message repeats.
77
+ # So we cache ONLY the `chat` intent. Caching analysis answers needs proper
78
+ # invalidation on data/state change β€” deferred. The write is gated by the intent the
79
+ # handler already emits; the read stays as-is (safe because only `chat` is ever
80
+ # stored).
81
+ _CACHEABLE_INTENTS = frozenset({"chat"})
82
+
83
+
84
+ def _chat_cache_key(room_id: str, user_id: str, message: str) -> str:
85
+ # user_id is part of the key so one user's cached answer can never be
86
+ # replayed to another (R5); room_id stays first so the room-wide clear
87
+ # endpoint can keep matching on a `chat:{room_id}:*` prefix.
88
+ # LIMITATION (T-G): the key omits conversation history, so a repeated message
89
+ # replays its cached answer even if the conversation has since moved on. Only
90
+ # the stateless `chat` intent is cached, so the blast radius is small, but a
91
+ # history-aware key (hash of last-N turns) would close it. Flagged to Harry.
92
+ return f"{settings.redis_prefix}chat:{room_id}:{user_id}:{message}"
93
+
94
+
95
  async def cache_response(redis, cache_key: str, response: str, sources: list):
96
+ await redis.setex(
97
+ cache_key,
98
+ _CHAT_CACHE_TTL_SECONDS,
99
+ json.dumps({"response": response, "sources": sources}),
100
+ )
101
 
102
 
103
  async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
 
139
 
140
 
141
  @router.delete("/chat/cache")
142
+ async def clear_chat_cache(room_id: str, user_id: str, message: str):
143
+ """Delete the Redis cache entry for a specific room + user + message triple."""
144
  redis = await get_redis()
145
+ cache_key = _chat_cache_key(room_id, user_id, message)
146
  deleted = await redis.delete(cache_key)
147
  return {"deleted": deleted > 0, "cache_key": cache_key}
148
 
 
178
  3. done - signals end of stream
179
  """
180
  redis = await get_redis()
181
+ cache_key = _chat_cache_key(request.room_id, request.user_id, request.message)
182
 
183
  # Redis cache hit
184
  cached = await get_cached_response(redis, cache_key)
 
218
  logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
219
  full_response = ""
220
  sources: List[Dict[str, Any]] = []
221
+ effective_intent: Optional[str] = None
222
+ async for event in handler.handle(
223
+ request.message, request.user_id, history, analysis_id=request.room_id
224
+ ):
225
+ if event["event"] == "intent":
226
+ # consumed internally (not forwarded); gates caching below.
227
+ try:
228
+ effective_intent = json.loads(event["data"]).get("intent")
229
+ except (TypeError, ValueError, AttributeError):
230
+ effective_intent = None
231
+ elif event["event"] == "sources":
232
  try:
233
  sources = json.loads(event["data"]) or []
234
  except (TypeError, ValueError):
 
238
  full_response += event["data"]
239
  yield event
240
  elif event["event"] == "done":
241
+ # Only cache stateless `chat` replies: caching a state/data-
242
+ # dependent answer would replay it stale (see _CACHEABLE_INTENTS).
243
+ if effective_intent in _CACHEABLE_INTENTS:
244
+ await cache_response(redis, cache_key, full_response, sources=sources)
245
  logger.info("saving messages", sources_count=len(sources), sources=sources)
246
  try:
247
  await save_messages(db, request.room_id, request.message, full_response, sources=sources)
 
255
  elif event["event"] == "error":
256
  yield event
257
  return
 
258
 
259
  return EventSourceResponse(stream_response())
260
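The caching rules in this file can be sketched without Redis as pure functions. The first three helpers mirror the diff (key layout, intent parsing, intent gate); `history_aware_key` is a hypothetical illustration of the "hash of last-N turns" idea mentioned in the limitation comment and is not part of the PR.

```python
import hashlib
import json

CACHEABLE_INTENTS = frozenset({"chat"})


def chat_cache_key(prefix, room_id, user_id, message):
    """room_id first (prefix-matchable per room), user_id second (no cross-user replay)."""
    return f"{prefix}chat:{room_id}:{user_id}:{message}"


def parse_intent(event_data):
    """Extract the intent from an `intent` event payload; None on any malformed input."""
    try:
        return json.loads(event_data).get("intent")
    except (TypeError, ValueError, AttributeError):
        return None


def should_cache(intent):
    """Only stateless intents are written to the cache; unknown or missing means no."""
    return intent in CACHEABLE_INTENTS


def history_aware_key(prefix, room_id, user_id, message, history, last_n=4):
    """Hypothetical variant: fold a digest of the last N turns into the key."""
    recent = json.dumps(history[-last_n:], sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(recent).hexdigest()[:16]
    return f"{prefix}chat:{room_id}:{user_id}:{digest}:{message}"


key_a = chat_cache_key("app:", "room-1", "alice", "hello")
key_b = chat_cache_key("app:", "room-1", "bob", "hello")
print(key_a, key_b, should_cache(parse_intent('{"intent": "chat"}')))
```

In the history-aware variant the room id still comes first, so a room-wide clear on a `chat:{room_id}:*` prefix would keep working.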
 
src/api/v1/report.py ADDED
@@ -0,0 +1,189 @@
1
+ """Report API (KM-644): the dedicated "Generate Report" surface.
2
+
3
+ NOT a chat route. The frontend button calls these endpoints directly:
4
+ POST /report generate a new version for a session
5
+ GET /report/{analysis_id} list a session's report versions
6
+ GET /report/{analysis_id}/{ver} fetch one version
7
+
8
+ Generation reads persisted AnalysisRecords + Problem Statement, makes one LLM call
9
+ (the executive summary), and persists an immutable versioned artifact. The
10
+ ReportGenerator + ReportStore are process singletons (the generator caches its LLM
11
+ chain warm across requests, like ChatHandler).
12
+
13
+ Note (T-E): AnalysisRecords are only persisted by the slow path, so reports require
14
+ `ENABLE_SLOW_PATH=on`. With it off, no records exist and generation 409s, by design,
15
+ not a bug. POST gates on the same floor as Help's readiness signal (validated goal +
16
+ ≥1 substantive analysis) so the button and Help never disagree.
17
+ """
18
+
19
+ from fastapi import APIRouter, HTTPException, Query, status
20
+
21
+ from src.agents.report.errors import ReportError
22
+ from src.agents.report.generator import ReportGenerator
23
+ from src.agents.report.schemas import AnalysisReport, ProblemStatement
24
+ from src.agents.report.store import ReportStore
25
+ from src.middlewares.logging import get_logger, log_execution
26
+ from src.models.api.report import ReportVersionEntry
27
+
28
+ logger = get_logger("report_api")
29
+
30
+ router = APIRouter(prefix="/api/v1", tags=["Report"])
31
+
32
+ _generator = ReportGenerator()
33
+ _store = ReportStore()
34
+
35
+
36
+ async def _load_state(analysis_id: str):
37
+ """Load the AnalysisState (for the floor gate + problem statement). Never-throw."""
38
+ try:
39
+ from src.agents.state_store import AnalysisStateStore
40
+
41
+ return await AnalysisStateStore().get(analysis_id)
42
+ except Exception as e: # noqa: BLE001 - never block report generation on this
43
+ logger.warning("report: state load failed", analysis_id=analysis_id, error=str(e))
44
+ return None
45
+
46
+
47
+ def _problem_statement_from(state) -> ProblemStatement:
48
+ """Map the analysis's free-text problem statement into the report's structured PS."""
49
+ if state is None or not state.problem_statement:
50
+ return ProblemStatement()
51
+ return ProblemStatement(objective=state.problem_statement)
52
+
53
+
54
+ async def _record_report_on_state(analysis_id: str, report_id: str) -> None:
55
+ """Write the new `report_id` back onto the Analysis State (never-throw).
56
+
57
+ Closes the loop so Help's `has_report` and the readiness delta-check can see
58
+ that a report exists. A missing state row / write error must not fail a report
59
+ that already generated and persisted.
60
+ """
61
+ try:
62
+ from src.agents.state_store import AnalysisStateStore
63
+
64
+ await AnalysisStateStore().update(analysis_id, report_id=report_id)
65
+ except Exception as e: # noqa: BLE001
66
+ logger.warning(
67
+ "report: report_id write-back failed", analysis_id=analysis_id, error=str(e)
68
+ )
69
+
70
+
71
+ @router.post(
72
+ "/report",
73
+ response_model=AnalysisReport,
74
+ status_code=status.HTTP_201_CREATED,
75
+ summary="Generate a new report version for an analysis session",
76
+ responses={
77
+ 201: {"description": "A new versioned report was generated and persisted."},
78
+ 409: {"description": "No analyses recorded for this session yet; nothing to report."},
79
+ 500: {"description": "Report generation or persistence failed."},
80
+ },
81
+ )
82
+ @log_execution(logger)
83
+ async def generate_report(
84
+ analysis_id: str = Query(..., description="The analysis session to report on."),
85
+ user_id: str = Query(..., description="Owner of the analysis session."),
86
+ ):
87
+ """Generate, persist, and return a new report version.
88
+
89
+ Each call produces a new version (V1, V2, …) that snapshots the records and
90
+ Problem Statement it used. Server-side gate: the report **floor** (a validated
91
+ goal + ≥1 substantive analysis), the same floor Help's readiness signal uses, so
92
+ the button and Help can't disagree (T-D). The delta-since-report check is NOT
93
+ applied here: a new version is always allowed (decision 4A).
94
+ """
95
+ from src.agents.gate import stub_analysis_state
96
+ from src.agents.report.readiness import report_floor
97
+
98
+ state = await _load_state(analysis_id)
99
+ floor_missing, _ = await report_floor(
100
+ analysis_id, state or stub_analysis_state(problem_validated=False)
101
+ )
102
+ if floor_missing:
103
+ raise HTTPException(
104
+ status_code=status.HTTP_409_CONFLICT,
105
+ detail="Not ready to generate a report: still needs "
106
+ + ", ".join(floor_missing)
107
+ + ".",
108
+ )
109
+
110
+ try:
111
+ problem_statement = _problem_statement_from(state)
112
+ report = await _generator.generate(
113
+ analysis_id, user_id, problem_statement=problem_statement
114
+ )
115
+ except ReportError as e:
116
+ raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e
117
+ except Exception as e:
118
+ logger.error("report generation failed", analysis_id=analysis_id, error=str(e))
119
+ raise HTTPException(
120
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
121
+ detail=f"Report generation failed: {e}",
122
+ ) from e
123
+
124
+ try:
125
+ saved = await _store.save(report)
126
+ except Exception as e:
127
+ logger.error("report persist failed", analysis_id=analysis_id, error=str(e))
128
+ raise HTTPException(
129
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
130
+ detail=f"Report persistence failed: {e}",
131
+ ) from e
132
+
133
+ await _record_report_on_state(analysis_id, saved.report_id)
134
+ return saved
135
+
136
+
137
+ @router.get(
138
+ "/report/{analysis_id}",
139
+ response_model=list[ReportVersionEntry],
140
+ summary="List a session's report versions",
141
+ response_description="Version metadata, oldest-first. Empty if none generated yet.",
142
+ )
143
+ @log_execution(logger)
144
+ async def list_report_versions(analysis_id: str):
145
+ """Return version metadata for a session (for the Analysis-menu sidebar)."""
146
+ try:
147
+ reports = await _store.list_for_analysis(analysis_id)
148
+ except Exception as e:
149
+ logger.error("report list failed", analysis_id=analysis_id, error=str(e))
150
+ raise HTTPException(
151
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
152
+ detail=f"Failed to list reports: {e}",
153
+ ) from e
154
+
155
+ return [
156
+ ReportVersionEntry(
157
+ report_id=r.report_id,
158
+ version=r.version,
159
+ generated_at=r.generated_at,
160
+ record_count=len(r.record_ids),
161
+ )
162
+ for r in reports
163
+ ]
164
+
165
+
166
+ @router.get(
167
+ "/report/{analysis_id}/{version}",
168
+ response_model=AnalysisReport,
169
+ summary="Fetch one report version",
170
+ responses={404: {"description": "No report at that version for this session."}},
171
+ )
172
+ @log_execution(logger)
173
+ async def get_report_version(analysis_id: str, version: int):
174
+ """Return the full content of a specific report version."""
175
+ try:
176
+ report = await _store.get(analysis_id, version)
177
+ except Exception as e:
178
+ logger.error("report fetch failed", analysis_id=analysis_id, version=version, error=str(e))
179
+ raise HTTPException(
180
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
181
+ detail=f"Failed to fetch report: {e}",
182
+ ) from e
183
+
184
+ if report is None:
185
+ raise HTTPException(
186
+ status_code=status.HTTP_404_NOT_FOUND,
187
+ detail=f"No report v{version} for analysis {analysis_id!r}.",
188
+ )
189
+ return report
src/api/v1/tools.py ADDED
@@ -0,0 +1,124 @@
1
+ """Tool / command catalog API endpoints.
2
+
3
+ Exposes the agent's user-invocable slash-command catalog so the Golang backend
4
+ can cache it and the frontend can render its "/" command menu WITHOUT calling the
5
+ AI agent for every list (Golang GETs + caches `list_tools`).
6
+
7
+ Scope confirmed: the catalog is the UNIFIED set of
8
+ everything the user can invoke via `/`,
9
+ spanning what the team internally splits into skills + analytics tools +
10
+ data-access tools. Naming: verb-first, kebab-case, `/` prefix.
11
+
12
+ Each command maps 1:1 to a real internal tool/intent `name` (the dispatch key);
13
+ the granular data-access tools (check_data, check_knowledge, retrieve_data,
14
+ retrieve_knowledge) are listed separately.
15
+ NOTE: the merged `check` intent still exists for natural-language routing, but it is
16
+ NOT a slash command; slash invocation bypasses the router to the tool directly.
17
+ Deferred analytics tools (comparison/contribution/profile/segment) are NOT
18
+ exposed (not wired to the Planner).
19
+
20
+ Stateless and deterministic, so safe for the Golang backend to cache.
21
+ """
22
+
23
+ from typing import Literal
24
+
25
+ from fastapi import APIRouter
26
+ from pydantic import BaseModel
27
+
28
+ from src.middlewares.logging import get_logger, log_execution
29
+
30
+ logger = get_logger("tools_api")
31
+
32
+ router = APIRouter(prefix="/api/v1", tags=["Tools"])
33
+
34
+ CommandType = Literal["skill", "analytics", "data_access"]
35
+
36
+
37
+ class CommandResponse(BaseModel):
38
+ command: str # FE-facing slash command, e.g. "/analyze-descriptive"
39
+ name: str # internal handler/tool name, e.g. "analyze_descriptive"
40
+ type: CommandType
41
+ description: str
42
+
43
+
44
+ class ListToolsResponse(BaseModel):
45
+ count: int
46
+ tools: list[CommandResponse]
47
+
48
+
49
+ # Single source of truth for the FE slash-command catalog. Order = display order.
50
+ # Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the
51
+ # internal route/tool name used by the orchestrator.
52
+ _COMMAND_CATALOG: list[CommandResponse] = [
53
+ CommandResponse(
54
+ command="/help",
55
+ name="help",
56
+ type="skill",
57
+ description="Show what the assistant can do and guide your next step.",
58
+ ),
59
+ CommandResponse(
60
+ command="/problem-statement",
61
+ name="problem_statement",
62
+ type="skill",
63
+ description="Define and validate your analysis goal (objective + metric) "
64
+ "before exploring data.",
65
+ ),
66
+ CommandResponse(
67
+ command="/analyze-descriptive",
68
+ name="analyze_descriptive",
69
+ type="analytics",
70
+ description="Summary statistics for selected columns (count, mean, min, max, …).",
71
+ ),
72
+ CommandResponse(
73
+ command="/analyze-aggregate",
74
+ name="analyze_aggregate",
75
+ type="analytics",
76
+ description="Group and aggregate values (sum, count, average) by dimension.",
77
+ ),
78
+ CommandResponse(
79
+ command="/analyze-correlation",
80
+ name="analyze_correlation",
81
+ type="analytics",
82
+ description="Correlation strength between numeric columns.",
83
+ ),
84
+ CommandResponse(
85
+ command="/analyze-trend",
86
+ name="analyze_trend",
87
+ type="analytics",
88
+ description="Trend of a value over time at a chosen frequency.",
89
+ ),
90
+ CommandResponse(
91
+ command="/check-data",
92
+ name="check_data",
93
+ type="data_access",
94
+ description="Inventory of the available structured data sources.",
95
+ ),
96
+ CommandResponse(
97
+ command="/check-knowledge",
98
+ name="check_knowledge",
99
+ type="data_access",
100
+ description="Inventory of the available knowledge / uploaded documents.",
101
+ ),
102
+ CommandResponse(
103
+ command="/retrieve-data",
104
+ name="retrieve_data",
105
+ type="data_access",
106
+ description="Pull rows from a structured source for analysis.",
107
+ ),
108
+ CommandResponse(
109
+ command="/retrieve-knowledge",
110
+ name="retrieve_knowledge",
111
+ type="data_access",
112
+ description="Retrieve relevant passages from your uploaded documents.",
113
+ ),
114
+ ]
115
+
116
+
117
+ @router.get("/tools", response_model=ListToolsResponse)
118
+ @log_execution(logger)
119
+ async def list_tools() -> ListToolsResponse:
120
+ """List the user-invocable slash-command catalog (skills + tools).
121
+
122
+ Static per deployment β€” safe for the Golang backend to cache.
123
+ """
124
+ return ListToolsResponse(count=len(_COMMAND_CATALOG), tools=_COMMAND_CATALOG)
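Reviewer sketch (not part of the diff): one way a caller might resolve a typed slash command against this catalog. It assumes a plain dataclass stand-in for the PR's `CommandResponse` model; `Command`, `CATALOG`, and `resolve_command` are hypothetical names, and only two catalog entries are reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    """Hypothetical stand-in for the PR's CommandResponse model."""
    command: str  # user-facing slash command, e.g. "/check-data"
    name: str     # internal route/tool name, e.g. "check_data"
    type: str     # "skill", "analytics", or "data_access"


# Two entries copied from the catalog above; the rest are omitted for brevity.
CATALOG = [
    Command(command="/help", name="help", type="skill"),
    Command(command="/check-data", name="check_data", type="data_access"),
]

# The catalog is static per deployment, so the lookup table is built once.
_BY_COMMAND = {c.command: c for c in CATALOG}


def resolve_command(text: str) -> Optional[Command]:
    """Return the catalog entry when the message starts with a known slash command."""
    parts = text.strip().split(maxsplit=1)
    if not parts:
        return None
    return _BY_COMMAND.get(parts[0].lower())
```

Matching on the first whitespace-delimited token keeps any trailing arguments (such as a table name) out of the lookup; whether the real backend lowercases commands is an assumption here.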
src/catalog/reader.py CHANGED
@@ -45,8 +45,9 @@ class MemoizingCatalogReader(CatalogReader):

      One per request. The same per-user catalog is otherwise fetched from the
      catalog DB 4-5x during a single slow-path run (planner load, then
-     describe_source's structured+unstructured reads, then query_structured's
-     structured read). Wrapping the base reader collapses those to one round-trip
      per distinct source_hint and pins a single consistent snapshot for the whole
      request (plan-time and execution-time catalogs can no longer diverge).
      """
 

      One per request. The same per-user catalog is otherwise fetched from the
      catalog DB 4-5x during a single slow-path run (planner load, then
+     check_data's structured read + check_knowledge's unstructured read, then
+     retrieve_data's structured read). Wrapping the base reader collapses those
+     to one round-trip
      per distinct source_hint and pins a single consistent snapshot for the whole
      request (plan-time and execution-time catalogs can no longer diverge).
      """
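Reviewer sketch (not part of the diff): the per-request memoization this docstring describes, reduced to a toy. It assumes a synchronous `read(user_id, source_hint)` method, which may not match the real `CatalogReader` interface; `CountingReader` and `MemoizingReader` are hypothetical names.

```python
class CountingReader:
    """Hypothetical base reader that counts simulated catalog DB round-trips."""

    def __init__(self):
        self.calls = 0

    def read(self, user_id: str, source_hint: str) -> dict:
        self.calls += 1
        return {"user_id": user_id, "source_hint": source_hint, "sources": []}


class MemoizingReader:
    """Per-request wrapper: at most one base read per distinct source_hint."""

    def __init__(self, base):
        self._base = base
        self._cache = {}

    def read(self, user_id: str, source_hint: str) -> dict:
        key = (user_id, source_hint)
        if key not in self._cache:
            # First request for this key: go to the base reader once.
            self._cache[key] = self._base.read(user_id, source_hint)
        # Later callers get the same snapshot object back.
        return self._cache[key]


base = CountingReader()
reader = MemoizingReader(base)  # create one per request, then discard it
for hint in ["structured", "structured", "unstructured", "structured"]:
    reader.read("u1", hint)
print(base.calls)  # 2: one round-trip per distinct source_hint
```

Because the cache lives on the wrapper instance, discarding the wrapper at the end of the request is what bounds staleness; no explicit invalidation is sketched here.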
src/config/prompts/help.md ADDED
@@ -0,0 +1,107 @@
+ <!-- help.md · v1 · Help skill prompt. Bump to v2 (don't silently overwrite) on major change,
+ e.g. when real UI steps land from the frontend. See checkpoint 2026-06-18. -->
+
+ You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
+ instruction sheet that comes with a board game: your only job is to tell the user
+ **where they are in their analysis and what to do next**, so they are never lost. You do
+ **not** do analysis, answer data questions, or invent facts about their data.
+
+ ## What you receive this turn
+
+ You are given context, never raw user prose to analyze:
+
+ - **`analysis_state`**: the current per-analysis state. Fields you use:
+   - `analysis_title`: what this analysis is called.
+   - `problem_statement`: the user's goal (may be empty/weak; it is optional at creation).
+   - `problem_validated` (bool): **the gate.** `false` = the goal still needs work; `true` = the goal is set and analysis is unlocked.
+   - `report_id`: `0`/absent means no report has ever been generated.
+ - **`chat_history`**: the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself.
+ - **`report_ready`**: a **deterministic** signal computed for you (NOT your judgment):
+   - `ready` (bool): whether there is enough analysis to generate a report.
+   - `missing` (list): if not ready, the gaps to fill.
+ - **`available_actions`** *(optional)*: which actions are actually wired right now. If present, **only suggest actions listed here.**
+
+ > **Hard rule: never misguide.** Trust the signals above for *what is possible*, not your
+ > own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
+ > report. If an action isn't in `available_actions`, do not suggest it. If Help is wrong,
+ > the user is wrong.
+
+ ## How to answer: two layers, always
+
+ 1. **Where you are + what's next**: one short sentence locating the user, then the single most useful next step.
+ 2. **How**: concrete, do-able instructions for that step (not just "you can analyze now"; show *how* to start).
+
+ Keep it short. Lead with the next step; don't recap everything.
+
+ ## State-tiered guidance
+
+ Pick the branch that matches `analysis_state` + `report_ready`:
+
+ ### A. `problem_validated == false` → fix the goal first
+ The user can't get good analysis without a clear goal. Steer them to define or sharpen the
+ problem statement.
+ - If `problem_statement` is empty: encourage them to state what they want to find out, and mention the AI can help: they can run **`/problem-statement`** (or just describe their goal in chat).
+ - If `problem_statement` exists but is vague: gently push for something more **measurable and concrete** (a target, a metric, a timeframe), grounded in their `analysis_title` and the data they've bound. Give one short example of a sharper version.
+ - Do **not** push analysis or reports yet.
+
+ ### B. `problem_validated == true`, little/no analysis yet → orient to analysis
+ Tell them the goal is set and they can start asking questions about their data. Give the **how**:
+ - Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?").
+ - **Tie suggestions back to their `problem_statement`** so the analysis stays relevant; don't suggest random analyses.
+ - **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer.
+ - You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple.
+
+ ### C. `problem_validated == true`, analysis under way, `report_ready.ready == false` → close the gaps
+ They've started but there isn't enough yet for a report. Point at `report_ready.missing` and
+ recommend the specific next questions that would fill those gaps (phrase them as questions
+ the user can ask), still anchored to the problem statement.
+
+ ### D. `problem_validated == true` and `report_ready.ready == true` → nudge toward the report
+ There's enough to report. Encourage them to generate it. The report can be triggered **two ways**:
+ the **`/generate report`** skill **or** the report button; mention both so it feels natural.
+ Do not over-promise the report's depth.
+
+ ## How-to phrasing (degrade gracefully)
+
+ - **Via chat / skills**: write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/problem-statement`", "run `/generate report`").
+ - **Via the UI (buttons/menus)**: the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)*
+ - If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics.
+
+ ## Tone
+
+ Plain, warm, and encouraging, like a helpful guide, **not** a hype trailer. No exclamation
+ spam, no overselling. Respond in the **user's language** (match `chat_history`: Indonesian or
+ English). A few sentences is usually enough.
+
+ ## Constraints
+
+ - You **only** guide. Never run analysis, never produce report content, never quote data values.
+ - Never suggest an action that the signals say isn't available or isn't ready.
+ - One step at a time: give the next step, not the whole roadmap.
+ - When you suggest questions, **dedupe against `chat_history`**: only propose analyses not yet run that move the goal forward; a question that already has an answer adds no fresh evidence.
+ - No markdown headers or code fences in your reply; short prose (and an inline `/command` or a tiny bullet list) is fine.
+
+ ## Examples
+
+ ```
+ State: problem_validated=false, problem_statement=""
+ → "Looks like we haven't set a goal yet. Tell me what you want to find out, for example
+ 'reduce churn next quarter', or run /problem-statement and I'll help you shape it."
+
+ State: problem_validated=false, problem_statement="make sales better"
+ → "Your goal is a good start but a bit broad. Let's make it measurable, e.g. 'grow north-region
+ revenue by 10% this quarter.' Run /problem-statement and we'll refine it together."
+
+ State: problem_validated=true, chat_history nearly empty
+ → "Your goal is set, so you can start exploring now. Try a basic question first, like
+ 'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
+ what's driving your goal."
+
+ State: problem_validated=true, report_ready.ready=false, missing=["no comparison over time"]
+ → "Good progress. Before a report, it's worth looking at change over time; try asking
+ 'How does this quarter compare to last?' Once we have that, we can put the report together."
+
+ State: problem_validated=true, report_ready.ready=true
+ → "You've covered enough to summarize. You can generate your report now: run /generate report
+ or use the report option to create it."
+ ```
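Reviewer sketch (not part of the diff): the A/B/C/D tiering in this prompt can also be expressed deterministically, for instance to unit-test which branch a given state should land in. `pick_help_branch` and the `analysis_turns` count are hypothetical; the prompt itself leaves "little/no analysis yet" to the model's reading of `chat_history`, so the zero-turn cutoff below is an assumption, as is treating a missing `problem_validated` as `false`.

```python
def pick_help_branch(analysis_state: dict, report_ready: dict, analysis_turns: int) -> str:
    """Map the signals described in help.md to guidance branch "A", "B", "C", or "D".

    analysis_turns is a hypothetical count of analysis questions already
    answered in chat_history.
    """
    if not analysis_state.get("problem_validated", False):
        return "A"  # fix the goal first; never push analysis or reports
    if report_ready.get("ready", False):
        return "D"  # enough evidence: nudge toward the report
    if analysis_turns == 0:
        return "B"  # goal set, nothing analyzed yet: orient to analysis
    return "C"      # analysis under way: close the gaps in report_ready["missing"]


state = {"problem_validated": True, "problem_statement": "reduce churn"}
print(pick_help_branch(state, {"ready": False, "missing": ["no comparison over time"]}, 3))  # C
```

Checking the gate before `report_ready` mirrors the hard rule that an unvalidated goal blocks everything else, even if the readiness signal happens to be true.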
src/config/prompts/intent_router.md CHANGED
@@ -1,82 +1,119 @@
- You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream path should handle it.

  ## Output

  Return three fields:

- - **`needs_search`**: `true` if we must look at the user's data to answer; `false` for greetings, farewells, off-topic chitchat, or meta questions about the assistant itself.
- - **`source_hint`**: one of:
-   - `chat`: no data lookup needed (greetings, farewells, generic small talk).
-   - `unstructured`: the user is asking about a topic, concept, feature, or factual knowledge that may exist in uploaded documents (PDF / DOCX / TXT). The user does not need to explicitly mention a document.
-   - `structured`: the user is asking a **data question** answerable from a database or a tabular file (CSV / XLSX / Parquet). This includes counts, sums, top-N, filters, comparisons, trends, joins across registered structured sources.
- - **`rewritten_query`**: a **standalone** version of the user's question that incorporates necessary context from history. If the original message is already standalone, return it unchanged. If `needs_search` is `false`, leave this empty/null.

  ## Routing rules

- 1. If the message is ONLY a pure greeting / farewell / thanks / "how are you" / "what can you do" / compliment with no factual question → `chat` + `needs_search=false`.
- 2. If the message asks a data question answerable from a database or tabular file (counts, sums, top-N, filters, comparisons, trends, sheet rows, table columns) → `structured` + `needs_search=true`.
- 3. If the message asks about a topic, concept, feature, explanation, summary, or factual knowledge, even without explicitly mentioning a document, route to `unstructured` + `needs_search=true`. The user may have uploaded relevant documents covering that topic.
- 4. If ambiguous between structured and unstructured → prefer `unstructured`. Only prefer `structured` if there are clear signals of tabular/numeric data questions.
- 5. Cross-source comparison ("compare DB sales to the customers.csv file") → `structured`. The planner sees both source types in one prompt and can correlate.

  ## Rewriting follow-ups

- When history is present and the new message references prior context using pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand the rewritten_query into a fully standalone question. Example:

  History: "What was our top product last month?" → "Pro Plan Annual at $487k"
  Message: "How does that compare to Q1?"
  rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"

- If the original is already standalone, copy it verbatim into rewritten_query.

  ## Few-shot examples

  ```
  User: "Hi"
- → needs_search=false, source_hint="chat", rewritten_query=null

  User: "Bye, thanks"
- → needs_search=false, source_hint="chat", rewritten_query=null

  User: "What can you do?"
- → needs_search=false, source_hint="chat", rewritten_query=null

- User: "How many orders did we get last month?"
- → needs_search=true, source_hint="structured",
-   rewritten_query="How many orders did we get last month?"

  User: "What does the Q1 board memo say about churn?"
- → needs_search=true, source_hint="unstructured",
-   rewritten_query="What does the Q1 board memo say about churn?"

- User: "Top 5 customers by revenue this year"
- → needs_search=true, source_hint="structured",
-   rewritten_query="Top 5 customers by revenue this year"

  User: "apa key feature dari iot connectivity?"
- → needs_search=true, source_hint="unstructured",
-   rewritten_query="What are the key features of IoT connectivity?"

- User: "jelaskan tentang machine learning"
- → needs_search=true, source_hint="unstructured",
-   rewritten_query="Explain machine learning"

- User: "bagaimana cara kerja neural network?"
- → needs_search=true, source_hint="unstructured",
-   rewritten_query="How does a neural network work?"

- User: "what is the main purpose of this system?"
- → needs_search=true, source_hint="unstructured",
-   rewritten_query="What is the main purpose of this system?"

  History: assistant: "Pro Plan Annual led at $487,200 in April."
  User: "And in March?"
- → needs_search=true, source_hint="structured",
-   rewritten_query="What was Pro Plan Annual's revenue in March?"
  ```

  ## Constraints

- - Do not invent data. If the question is factual or knowledge-based (not clearly tabular), route to `unstructured` and let the retriever decide. Only route to `structured` if the question clearly involves counts, sums, filters, or trends from tabular sources.
  - Do not refuse; refusal happens later in guardrails. Just classify.
  - One JSON object as output; no prose, no markdown.
 
+ You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream **handler** should process it. You classify the route only; you do not answer the question.

  ## Output

  Return three fields:

+ - **`intent`**: exactly one of:
+   - `chat`: conversational, no data needed (greetings, farewells, thanks, "how are you", "what can you do", small talk).
+   - `help`: the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?").
+   - `problem_statement`: the user wants to **define or refine the analysis goal** (the business problem, objectives, what to increase/decrease, targets/success metrics) or is answering questions about the goal.
+   - `check`: the user wants an **inventory** of what they have ("what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset"). This is metadata/listing, not analysis.
+   - `unstructured_flow`: the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document.
+   - `structured_flow`: the user asks an **analytical question over their data** (counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources). This routes to the slow analytical path.
+ - **`rewritten_query`**: a **standalone** version of the user's question, with context from history resolved. If the message is already standalone, copy it verbatim. Leave empty/null for `chat` and `help`.
+ - **`confidence`**: your confidence in the chosen intent, a number in [0, 1].

  ## Routing rules

+ 1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`.
+ 2. "What do I do next / how do I proceed / where do I start" → `help`.
+ 3. The user states or refines a goal, objective, target, or success metric, or answers a goal-defining question → `problem_statement`.
+ 4. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`.
+ 5. A question answerable from document prose (a topic, concept, feature, explanation, summary, or factual knowledge), even without naming a document → `unstructured_flow`.
+ 6. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`.
+
+ ## Disambiguation (the boundaries that matter)
+
+ - **`check` vs `structured_flow`**: "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`.
+ - **`unstructured_flow` vs `structured_flow`**: pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When a question is analytical and also needs document context → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation.
+ - **`help` vs `problem_statement`**: "what's next?" → `help`; "here is my goal / let's define the objective" → `problem_statement`.
+ - **`chat` vs everything else**: only use `chat` when there is no task and no data question at all.

  ## Rewriting follow-ups

+ When history is present and the new message references prior context with pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand `rewritten_query` into a fully standalone question. Example:

  History: "What was our top product last month?" → "Pro Plan Annual at $487k"
  Message: "How does that compare to Q1?"
  rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"

+ If the original is already standalone, copy it verbatim into `rewritten_query`.

  ## Few-shot examples

  ```
  User: "Hi"
+ → intent="chat", rewritten_query=null, confidence=0.99

  User: "Bye, thanks"
+ → intent="chat", rewritten_query=null, confidence=0.99

  User: "What can you do?"
+ → intent="chat", rewritten_query=null, confidence=0.95

+ User: "Okay I uploaded my data, what do I do next?"
+ → intent="help", rewritten_query=null, confidence=0.93
+
+ User: "How does this work? Where should I start?"
+ → intent="help", rewritten_query=null, confidence=0.9
+
+ User: "I want to reduce customer churn next quarter, target under 5%."
+ → intent="problem_statement",
+   rewritten_query="Define the analysis goal: reduce customer churn next quarter to under 5%.",
+   confidence=0.9
+
+ User: "My goal is to grow revenue in the north region."
+ → intent="problem_statement",
+   rewritten_query="Define the analysis goal: grow revenue in the north region.",
+   confidence=0.88
+
+ User: "What data do I have?"
+ → intent="check", rewritten_query="What data do I have?", confidence=0.95
+
+ User: "What columns are in the orders table?"
+ → intent="check", rewritten_query="What columns are in the orders table?", confidence=0.93
+
+ User: "What documents have I uploaded?"
+ → intent="check", rewritten_query="What documents have I uploaded?", confidence=0.93

  User: "What does the Q1 board memo say about churn?"
+ → intent="unstructured_flow",
+   rewritten_query="What does the Q1 board memo say about churn?", confidence=0.9

+ User: "jelaskan tentang machine learning"
+ → intent="unstructured_flow", rewritten_query="Explain machine learning", confidence=0.85

  User: "apa key feature dari iot connectivity?"
+ → intent="unstructured_flow",
+   rewritten_query="What are the key features of IoT connectivity?", confidence=0.85

+ User: "How many orders did we get last month?"
+ → intent="structured_flow",
+   rewritten_query="How many orders did we get last month?", confidence=0.92
+
+ User: "Top 5 customers by revenue this year"
+ → intent="structured_flow",
+   rewritten_query="Top 5 customers by revenue this year", confidence=0.93

+ User: "Is there a correlation between discount and units sold?"
+ → intent="structured_flow",
+   rewritten_query="Is there a correlation between discount and units sold?", confidence=0.9

+ User: "How has monthly revenue trended by region, and what stands out?"
+ → intent="structured_flow",
+   rewritten_query="How has monthly revenue trended by region, and what stands out?",
+   confidence=0.88

  History: assistant: "Pro Plan Annual led at $487,200 in April."
  User: "And in March?"
+ → intent="structured_flow",
+   rewritten_query="What was Pro Plan Annual's revenue in March?", confidence=0.9
  ```

  ## Constraints

+ - Pick exactly one `intent`. Do not invent values outside the six listed.
+ - Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved.
  - Do not refuse; refusal happens later in guardrails. Just classify.
  - One JSON object as output; no prose, no markdown.
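Reviewer sketch (not part of the diff): a minimal validator for the three-field output this prompt asks for, useful as a guard between the router model and the handlers. `parse_router_output` is a hypothetical helper, and requiring a non-empty `rewritten_query` for every intent other than `chat` and `help` is an assumption; the prompt only says to leave it empty for those two.

```python
import json

# The six intents listed in the prompt above.
ALLOWED_INTENTS = {
    "chat", "help", "problem_statement", "check",
    "unstructured_flow", "structured_flow",
}
# Intents for which the prompt says rewritten_query is left empty/null.
NO_QUERY_INTENTS = {"chat", "help"}


def parse_router_output(raw: str) -> dict:
    """Parse and validate one router response; raise ValueError on contract violations."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("router output must be a single JSON object")

    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")

    query = data.get("rewritten_query")
    if intent in NO_QUERY_INTENTS:
        query = None  # normalize "" and null to None
    elif not isinstance(query, str) or not query.strip():
        raise ValueError(f"rewritten_query is required for intent {intent!r}")

    return {"intent": intent, "rewritten_query": query, "confidence": float(confidence)}
```

A caller could treat a `ValueError` (or a `json.JSONDecodeError`, which subclasses it) as a signal to retry the router or fall back to a default route; which fallback is appropriate is left open here.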