Agentic-Service-Data-Eyond-Catalog

Running

App Files Files Community

feat/Knowledge & Data Tools

by rhbt6767 - opened 23 days ago

base: refs/heads/main

←

from: refs/pr/3

Discussion Files changed

+4909

-363

This view is limited to 50 files because it contains too many changes. See the raw diff here.

Files changed (50) hide show

ARCHITECTURE.md +2 -0
CHECKPOINT_PLAN_2026-06-17.md +147 -0
PROGRESS.md +32 -6
eval/__init__.py +0 -0
eval/intent/README.md +70 -0
eval/intent/__init__.py +0 -0
eval/intent/intent_dataset.json +56 -0
eval/intent/results/.gitkeep +0 -0
eval/intent/run_eval.py +384 -0
eval/readiness/README.md +34 -0
eval/readiness/__init__.py +0 -0
eval/readiness/readiness_dataset.json +40 -0
eval/readiness/results/.gitkeep +0 -0
eval/readiness/results/readiness_result_2026-06-22_101645.json +268 -0
eval/readiness/results/readiness_result_2026-06-22_143809.json +284 -0
eval/readiness/run_eval.py +309 -0
main.py +6 -0
pyproject.toml +2 -0
src/agents/binding_store.py +34 -0
src/agents/chat_handler.py +254 -32
src/agents/gate.py +108 -0
src/agents/handlers/__init__.py +1 -0
src/agents/handlers/check.py +165 -0
src/agents/handlers/help.py +192 -0
src/agents/handlers/problem_statement.py +171 -0
src/agents/orchestration.py +42 -27
src/agents/planner/examples.py +20 -22
src/agents/planner/inputs.py +3 -3
src/agents/planner/registry.py +39 -28
src/agents/planner/service.py +1 -1
src/agents/planner/validator.py +5 -5
src/agents/report/__init__.py +9 -0
src/agents/report/errors.py +7 -0
src/agents/report/generator.py +363 -0
src/agents/report/readiness.py +165 -0
src/agents/report/schemas.py +91 -0
src/agents/report/store.py +119 -0
src/agents/slow_path/assembler.py +32 -1
src/agents/slow_path/coordinator.py +4 -4
src/agents/slow_path/schemas.py +12 -0
src/agents/slow_path/store.py +78 -12
src/agents/slow_path/task_runner.py +3 -0
src/agents/state_store.py +128 -0
src/api/v1/analysis.py +174 -0
src/api/v1/chat.py +52 -9
src/api/v1/report.py +189 -0
src/api/v1/tools.py +124 -0
src/catalog/reader.py +3 -2
src/config/prompts/help.md +107 -0
src/config/prompts/intent_router.md +76 -39

ARCHITECTURE.md CHANGED Viewed

@@ -63,6 +63,8 @@ DB vs tabular is **not** a routing concern — it's a per-source attribute (`sou
 ## 3. Routing model
 ```
 source_hint ∈ { chat, unstructured, structured }
 ```

 ## 3. Routing model
+> **Superseded 2026-06-18** — the 3-way `source_hint` below was reworked into a flat **6-intent** handler router (`chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, `structured_flow`). Modality (structured vs unstructured *data*) is now the Planner's job, not the router's. See `ORCHESTRATOR_REWORK_PLAN.md`.
 ```
 source_hint ∈ { chat, unstructured, structured }
 ```

CHECKPOINT_PLAN_2026-06-17.md ADDED Viewed

	@@ -0,0 +1,147 @@

+# Checkpoint Plan — Wednesday, 17 June 2026
+Working plan for Sofhia & Rifqi based on the checkpoint with mas Harry on **Thursday, 11 June 2026**.
+Goal: everything below is **merged and demo-able before the next sync on Wednesday, 17 June (afternoon)**.
+**Updated at: Friday, 12 June 2026** (Sofhia + Rifqi)
+> Source of truth for decisions is the meeting itself. Note: the NotebookLM summary is **stale on two points** — Data Availability Check was *eliminated* as a tool, and Success Metrics was *folded into* the Problem Statement template. Do not build either as a standalone skill.
+---
+## 0. Progress (per Fri 12 Jun — Sofhia)
+Dated snapshot of what landed this session. Live task status (incl. what's left) lives in §2 Ownership — this section only records the deltas + traceability.
+- ✅ **Tool matrix** built (xlsx, all ~10 tools + status colours) — presentation material ready.
+- ✅ **Registry trimmed to 4 active analytics** (`KM-641`, commit `66e2e4d`): `ACTIVE_ANALYTICS_TOOLS` (descriptive, aggregate, correlation, trend) vs `DEFERRED_ANALYTICS_TOOLS` (comparison, contribution, profile, segment) — specs + compute fns kept, only registry exposure withheld. Tests 206 pass, ruff/mypy clean.
+- ✅ **Planner few-shot synced**: Example A `analyze_contribution` → `analyze_aggregate` (so few-shots don't reference a deferred tool).
+- ✅ **Data-access tools renamed** (`KM-642`, commit `c38c0c2`): `query_structured` → `data_retrieve`, `retrieve_documents` → `knowledge_retrieve` across the tool layer + planner stub/prompt/validator/few-shots. Mechanical, no behavior change.
+- ✅ **`data_check` merge + `knowledge_check`** (`KM-643`, commit `4bd5f1e`): `list_sources` + `describe_source` → one parameterized `data_check` (no arg = list structured sources; `source_id` = schema) + new `knowledge_check` (unstructured). Tests 206 pass.
+- ✅ **Redis Cloud live** (free tier, TTL = 1 h), env vars shared in the group (Rifqi).
+- ✅ **Planner tool list verified** against the trimmed registry — no references to old tool names or deferred analytics anywhere in `src/` (Rifqi).
+- 📌 **Decision:** `tests/` stays gitignored — team decided not to push tests to origin (closes PROGRESS.md R3 as won't-do).
+- 📌 **Ownership:** Rifqi owns `generate_report` development + the `analysis_records` table / real `AnalysisStore` (contract still co-designed with Sofhia).
+- ✅ **R5 cache fix** (Rifqi, `b701e95`): chat cache scoped by `user_id`, TTL 24h→1h.
+- ✅ **AnalysisRecord persistence landed** (Rifqi): `stage` now flows to the record (CRISP-DM grouping for the report) + identity fields (`record_id`/`analysis_id`/`user_id`); `PostgresAnalysisStore` + `analysis_records` table replace `NullAnalysisStore`, wired into `ChatHandler`. Unblocks the `generate_report` renderer and the DoD "record persisted" step. Open: `analysis_id` handoff from Harry's Analysis State.
+- ✅ **Verb-first tool naming** (Sofhia, commit `2d6406d`): the 4 data/knowledge tools renamed to lead with a verb — `data_check`→`check_data`, `knowledge_check`→`check_knowledge`, `data_retrieve`→`retrieve_data`, `knowledge_retrieve`→`retrieve_knowledge` (the `analyze_*` tools already lead with a verb). These verb-first names are now canonical; the tool-set table + §3 below use them. Dated log entries above keep the old names as historical record.
+---
+## 1. Locked decisions (from the 2026-06-11 checkpoint)
+1. **Single chat page.** The separate interview/survey page is killed. Sidebar = Knowledge menu (connect/manage data) + Analysis menu (sessions).
+2. **Data-first hard gate.** Creating a new analysis requires **≥ 1 bound data source** (server-side rejection, no empty sessions). User provides title + optional short description.
+3. **Analysis State lives in the DB.** Per-analysis row: `user_id`, `data_source_ids[]`, `interview_status` (default `not_pass`), `report_status` (default `no_report` → `V1`, `V2`, …). Explicitly **NOT cached, NOT in Redis** — the Orchestrator reads it from Postgres every turn.
+4. **Skills, not agents.** No separate interview agent. The Orchestrator routes per user turn using the Analysis State; an analytical request still executes through the existing Planner → TaskRunner → Assembler spine (static plan, no mid-run LLM).
+5. **Interview = one skill: Problem Statement.** Success metrics become fields inside the PS template (what to increase/decrease + target). Data availability check is handled by the data-first creation gate + PS validation cross-checking fields against the bound catalog — not a separate tool.
+6. **Analytics focus = 4 tools:** descriptive, aggregate, correlation, trend. The other four composites (comparison, contribution, profile, segment) are **deprioritized, not deleted** — keep the code, just don't register them. If "comparison" returns later it should be a proper statistical **test**, not a generic compare.
+7. **`describe_source` merges into the listing tool** — one call returns sources *with* their schema/metadata, fewer tools for the planner.
+8. **Report = on-demand, button-triggered (not a chat skill).** A dedicated "Generate Report" button in the Analysis menu calls a **report API** (not the chat route): trigger generation for a session, list its versions, fetch a version. Renders from accumulated **AnalysisRecords + the Problem Statement** — never from chat history. Each report is a **persisted, versioned artifact**: generation snapshots the record IDs it used and bumps `report_status` to `V<n>`. (Owner: Rifqi, KM-644.)
+9. **Help = deterministic guide.** No LLM: read Analysis State → tell the user the next required step. Callable in any state.
+10. **Redis Cloud free tier, TTL = 1 hour**, env shared in the team group — for retrieval/query caching only, never for state.
+### Final tool set (~10)
+| Tool (canonical, verb-first) | Maps to (lineage) | Status |
+|---|---|---|
+| `check_knowledge` | new — list user's documents + metadata | done |
+| `check_data` | `list_sources` + `describe_source` merged (catalog-backed) | done |
+| `retrieve_knowledge` | `retrieve_documents` → `knowledge_retrieve` | done |
+| `retrieve_data` | `query_structured` → `data_retrieve` (tabular: file + DB, both working) | done |
+| `analyze_descriptive` | `src/tools/analytics/descriptive.py` | done |
+| `analyze_aggregate` | `src/tools/analytics/aggregation.py` | done |
+| `analyze_correlation` | `src/tools/analytics/relationship.py` | done |
+| `analyze_trend` | `src/tools/analytics/temporal.py` | done |
+| `problem_statement` | new — interview skill (**Harry**) | Harry |
+| `generate_report` | new — on-demand, versioned | to design |
+| `help` | new — deterministic state guide | to build |
+(`problem_statement` + `help` live at the orchestrator level; `generate_report` is **button-triggered via a dedicated report API**, not chat-routed (decision #8). The TaskRunner registry holds the 4 analytics + 4 data/knowledge tools. Unregister `analyze_comparison`, `analyze_contribution`, `analyze_profile`, `analyze_segment` from the planner-visible registry — keep the modules.)
+---
+## 2. Ownership
+### Sofhia
+- [x] 4 analytics tools: trim registry to 4 active, tests still pass after deprioritizing the other four. (`KM-641`, commit `66e2e4d`)
+- [x] Data/knowledge tools: merge `describe_source` into `data_check`, rename `retrieve_documents` → `knowledge_retrieve`, `query_structured` → `data_retrieve`, build `knowledge_check`. (`KM-642` `c38c0c2`, `KM-643` `4bd5f1e`)
+- [ ] Co-design `generate_report` contract with Rifqi (Rifqi owns development, see §3).
+- [x] Tool matrix (see §4).
+### Rifqi
+- [x] **Redis Cloud free tier** (~30–50 MB): create instance, set TTL = 1 h, share env vars in the group. (done 12 Jun)
+- [x] **R5 cache fix**: chat cache key scoped by `user_id`, TTL 24h→1h (urgent on shared Redis). (12 Jun, commit `b701e95`)
+- [x] **AnalysisRecord contract gaps closed**: `stage` (CRISP-DM) now flows Task→TaskResult→TaskSummary so the report can group the method appendix; `AnalysisRecord` gained `record_id`/`analysis_id`/`user_id` identity fields. (12 Jun)
+- [x] **`analysis_records` table + real `AnalysisStore`**: `PostgresAnalysisStore` (save + `list_for_analysis`, never-throw) replaces `NullAnalysisStore`; wired into `ChatHandler`, `user_id` stamped at save. Satisfies the DoD "record persisted" step. (12 Jun)
+- [ ] **Own `generate_report` development — KM-644 "Report Generator"** (contract co-designed with Sofhia, see §3). Button-triggered via a dedicated **report API** (trigger / list versions / fetch); reads `analysis_records` + Problem Statement; persists a versioned report artifact, bumps `report_status`. *(record persistence done above; report API + persistence + renderer + contract doc next)*
+- [x] Verify planner tool list matches the trimmed registry (4 analytics + 4 data/knowledge) and few-shots don't reference removed tools. (verified 12 Jun — no stale tool names in `src/`)
+- ⚠️ **Blocked-on-Harry**: `analysis_id` is `NULL` on persisted records until the Analysis State reaches the slow path — need the session-ID handoff so `generate_report` can group records per analysis.
+### Shared (Sofhia + Rifqi)
+- [ ] `generate_report` design + skeleton: input = AnalysisRecords for the session + Problem Statement from Analysis State; output = versioned artifact; bumps `report_status`. Agree on the contract even if rendering is stubbed for Wednesday. (Development: Rifqi.)
+- [ ] `help` skill: deterministic — read Analysis State, return the next required step. Small, do it together or whoever finishes first.
+- [ ] Tool behavior smoke test end-to-end on an easy case (descriptive/aggregate path), per Harry's ask: "robust tools before agents."
+### Harry (dependencies — not ours, but we block on them)
+- `problem_statement` skill + PS template (incl. increase/decrease target fields).
+- Analysis State class + DB table, frontend analysis-builder step.
+- Merging our PRs (he auto-merges; he clones from latest after).
+---
+## 3. Per-tool behavior contract (how to build each one)
+Harry's framing: for every tool, define **goal / trigger / input / process / output**, and behave like a Claude-style skill — if a required argument is missing, respond with a polite feedback message asking for it (e.g. table/column name), never guess silently.
+- **`check_knowledge`** — "what documents do I have?" → list documents with name, type, uploaded-at.
+- **`check_data`** — "what data do I have?" → sources (file + DB) with schema/metadata from the data catalog, created/uploaded timestamps.
+- **`retrieve_knowledge`** — RAG over uploaded documents; returns passages with source attribution.
+- **`retrieve_data`** — query tabular data (file + DB) via QueryIR; output consumable by the `analyze_*` tools.
+- **`analyze_*` (4)** — require valid table/column references; if missing or wrong, return actionable feedback instead of guessing.
+- **`generate_report`** — button-triggered via a dedicated report API (not chat-routed); on-demand only (never auto); post-pass gated; renders from AnalysisRecords + PS; persists a versioned artifact, snapshots record IDs, bumps version. (KM-644, Rifqi.)
+- **`help`** — no LLM; state → next step. Repeating it is fine, that's its job.
+---
+## 4. Tool matrix (deliverable for the sync)
+Harry explicitly asked for a matrix covering every tool. Produce one sheet/markdown table with columns:
+`tool | goal | trigger (when the orchestrator calls it) | input | process | output | gated by interview_status? | status (done / in progress / planned)`
+Use the tool set table in §1 as the row list. This doubles as the presentation material on Wednesday.
+---
+## 5. Day-by-day
+| Day | Target |
+|---|---|
+| **Thu 11** | Checkpoint meeting + task split with Harry. |
+| **Fri 12 (today)** | ✅ Registry trimmed to 4 analytics + few-shot synced (Sofhia, KM-641). ✅ Tool matrix built. ⏳ Redis Cloud + env share (Rifqi). |
+| **Mon 15** | Data/knowledge tools done (`data_check` merge, renames, `knowledge_check`). `generate_report` contract agreed. |
+| **Tue 16** | `help` skill done. `generate_report` skeleton wired to AnalysisRecord. Tool matrix drafted. End-to-end smoke test on the easy path. |
+| **Wed 17 (AM)** | Buffer: fix fallout, finalize matrix, rehearse the demo flow. |
+| **Wed 17 (PM)** | **Sync with Harry.** |
+---
+## 6. Open questions to confirm with Harry on Wednesday
+1. **Gate scope.** Proposal: keep the fast path + exploration tools (`check_knowledge`, `check_data`, retrieves, `help`, arguably `descriptive`) available **pre-pass**; gate only the insight tools (correlation, trend, report). Hard-gating everything risks frustrating users who just want to look at their data.
+2. **Who flips `interview_status` to `pass`?** Proposal: a deterministic validator (PS template slots complete + fields cross-checked against the bound catalog) makes the call — the LLM conducts the conversation but never decides the pass. ("Conversational skin, deterministic skeleton.")
+3. **Skills vs spine — one sentence to lock in writing:** *"Skills are registry tools executed by the existing Planner → TaskRunner → Assembler spine; the Analysis State gate is a pre-check in the Orchestrator."* This keeps the new flow and the locked architecture fully compatible.
+4. `generate_report` invocation goes through the same gate (post-pass only) — confirm.
+---
+## 7. Definition of done for Wednesday
+- [ ] All team PRs merged; Harry unblocked on the Analysis State class.
+- [ ] Registry exposes exactly 4 analytics + 4 data/knowledge tools, all passing local tests.
+- [ ] Redis Cloud shared and working locally for all three of us (TTL 1 h).
+- [ ] `help` works against a (possibly stubbed) Analysis State.
+- [ ] `generate_report` contract written; skeleton callable.
+- [ ] Tool matrix ready to present.
+- [ ] One end-to-end happy path runs: create analysis (with data) → blocked pre-pass → interview stub passes → descriptive/aggregate answer → record persisted.

PROGRESS.md CHANGED Viewed

@@ -2,8 +2,32 @@
 Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
-**Last updated**: 2026-06-10 (tool layer complete + hardening/DRY + Langfuse tracing + gated slow-path wiring)
-**Current open PR**: `pr/2` — active.
 ---
@@ -39,12 +63,12 @@ Verified against code before logging. Severity: **critical** / important / nice-
 |---|---|---|---|---|
 | R1 | **AuthN/AuthZ** on data endpoints — reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation is client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused** — `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
 | R2 | **Always compile a LIMIT** — `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
-| R3 | **Commit `tests/` + minimal CI** — `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). GitHub origin carries tests; HF Space gets the Docker build (already doesn't COPY tests). | **critical (process)** | shared | `[ ]` |
 | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
 | DB2 | **Keyed engine cache** — `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path — no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` — caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
 | DB3 | **Speculative pre-connect** — `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
 | R4 | **Per-stage progress events** — `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call — that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
-| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Today `chat:{room_id}:{message}`, 24h TTL, no user (`chat.py:138`) → cross-room replay + stale answers. | important | B | `[ ]` |
 | R6 | **Hard time budget** — wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
 | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) — stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
 | R8 | **Catalog upsert race** — per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
@@ -129,7 +153,8 @@ LLM tokens; verified live to US Cloud.
 wires Pattern A correctly; self-corrects via retry.
 **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
-real `AnalysisStore`; register data-access `ToolSpec`s upstream (`data_access_registry()`)
 or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
 `BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
 (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
@@ -396,8 +421,9 @@ New scope after the original 42-item table; added as the tool layer landed (KM-6
 | — | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 — canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
 | — | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 — `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
 | — | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 — `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
-| — | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 — `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped — IR pre-built by the agent Planner). |
 | — | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
 ### API surface

 Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
+**Last updated**: 2026-06-12 (Redis Cloud live; R3 closed as won't-do; R5 cache fix; AnalysisRecord persistence landed — `PostgresAnalysisStore` + `analysis_records` table)
+**Current open PR**: `pr/3` — active.
+---
+## What just shipped (2026-06-12 — AnalysisRecord persistence, Rifqi)
+Groundwork for `generate_report`. The slow path now persists a real, citable
+record; the report (next) renders from it.
+- **Contract gaps closed** (`agents/slow_path/schemas.py`): `stage: CrispStage`
+  added to `TaskResult` + `TaskSummary` and populated at all 3 `TaskResult` build
+  sites in `task_runner.py` + copied in `assembler._build_record` — so the report
+  can group its method appendix by CRISP-DM phase. `AnalysisRecord` gained identity:
+  `record_id` (auto uuid), `analysis_id`/`user_id` (optional; stamped at persist).
+- **Real store** (`agents/slow_path/store.py`): `PostgresAnalysisStore` —
+  `save()` (never-throw, idempotent upsert) + `list_for_analysis()` (oldest-first,
+  the report's render order). `NullAnalysisStore` kept (tests / disabled persistence).
+  `AnalysisStore` Protocol gained `list_for_analysis`.
+- **Table** (`db/postgres/models.py`): `analysis_records` jsonb table (one row per
+  run, indexed by `analysis_id` + `user_id`); registered in `init_db.py`, created by
+  `create_all` on startup (no migration — `data_catalog` precedent).
+- **Wired** (`agents/chat_handler.py`): default store flipped to `PostgresAnalysisStore`;
+  `user_id` stamped onto the record at the save site (in scope there).
+- **Open**: `analysis_id` is `NULL` until Harry's Analysis State reaches the slow
+  path (session-ID handoff needed to group records per analysis).
 ---
 |---|---|---|---|---|
 | R1 | **AuthN/AuthZ** on data endpoints — reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation is client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused** — `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
 | R2 | **Always compile a LIMIT** — `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
+| R3 | **Commit `tests/` + minimal CI** — `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). ~~GitHub origin carries tests; HF Space gets the Docker build.~~ **2026-06-12: team decided tests stay gitignored/local — closed as won't-do.** | **critical (process)** | shared | `[won't do]` |
 | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
 | DB2 | **Keyed engine cache** — `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path — no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` — caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
 | DB3 | **Speculative pre-connect** — `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
 | R4 | **Per-stage progress events** — `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call — that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
+| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Was `chat:{room_id}:{message}`, 24h TTL, no user → cross-user replay + stale answers. **2026-06-12 (Rifqi):** key now `chat:{room_id}:{user_id}:{message}` via `_chat_cache_key()`, TTL 24h→1h (checkpoint decision) — urgent now that Redis is a shared Cloud instance. `DELETE /chat/cache` gained a required `user_id` param (frontend heads-up); room-wide clear pattern unchanged. **Still open:** catalog-version in key / invalidate-on-ingest. | important | B | `[~]` |
 | R6 | **Hard time budget** — wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
 | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) — stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
 | R8 | **Catalog upsert race** — per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
 wires Pattern A correctly; self-corrects via retry.
 **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
+real `AnalysisStore` (**Rifqi owns, 2026-06-12** — folded into `generate_report` work,
+see `CHECKPOINT_PLAN_2026-06-17.md`); register data-access `ToolSpec`s upstream (`data_access_registry()`)
 or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
 `BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
 (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
 | — | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 — canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
 | — | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 — `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
 | — | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 — `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
+| — | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 — `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped — IR pre-built by the agent Planner). **Superseded by KM-642/643** — renamed `data_retrieve`/`knowledge_retrieve` and `list_sources`+`describe_source` merged into `data_check` + new `knowledge_check`; see row below. |
 | — | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
+| — | Data/knowledge tool taxonomy (`tools/data_access.py`) | TAB | `[x]` | KM-642/643 (commits c38c0c2, 4bd5f1e) — renamed `query_structured`→`data_retrieve`, `retrieve_documents`→`knowledge_retrieve`; merged `list_sources`+`describe_source` → parameterized `data_check` (no arg = list structured sources; `source_id` = that source's schema) + new `knowledge_check` (unstructured/documents). Split mirrors the catalog's structured/unstructured slices. Planner stub/prompt/validator/few-shots synced; `DATA_ACCESS_TOOLS` guard kept in lockstep. Note: dated log entries above (e.g. the 2026-06-09 E2E) keep the old names as historical record. |
 ### API surface

eval/__init__.py ADDED Viewed

File without changes

eval/intent/README.md ADDED Viewed

	@@ -0,0 +1,70 @@

+# Intent-Routing Eval (E3)
+Scores the live 6-intent router (`OrchestratorAgent.classify`) against a golden
+dataset of labelled messages. Run it before any deploy that touches the router
+prompt (`src/config/prompts/intent_router.md`) or its few-shots.
+## Files
+| File | What |
+|---|---|
+| `intent_dataset.json` | **Golden dataset** — `message` + known-correct `expected_intent` per case. The source of truth scoring compares against. |
+| `run_eval.py` | Runner — calls the router per case, scores correctness, records latency + tokens. |
+| `results/` | Timestamped run reports, one JSON per run (never overwritten). |
+## Run
+Run as a module (`-m`), not the file path — module mode puts the repo root on
+`sys.path` so `src` imports resolve; `python eval/intent/run_eval.py` fails.
+```bash
+uv run python -m eval.intent.run_eval            # full dataset
+uv run python -m eval.intent.run_eval --limit 6  # quick smoke test
+uv run python -m eval.intent.run_eval --langfuse # also stream traces to Langfuse
+```
+Needs a populated `.env` (Azure OpenAI) — it calls the live model and spends
+tokens. Output: a per-case detail table + an aggregate summary in the terminal,
+and `results/eval_result_<timestamp>.json`.
+**Tracking is the committed result files, not Langfuse** — the JSON reports in
+`results/` are the versionable audit trail (see below). `--langfuse` is an
+*optional* extra: when set, each case is also sent as a Langfuse trace (grouped
+under one `intent_eval_<ts>` session) with a `intent_correct` 1/0 score, so the
+same run is browsable in the Langfuse dashboard. It is off by default and the
+eval runs fully without Langfuse configured.
+## What's measured
+- **correctness** — overall + per-intent + per-language accuracy (`got == expected`)
+- **runtime** — average ms per case
+- **tokens** — input / output / total (read from the model response, no Langfuse)
+## Commit convention for `results/`
+The reports are **versionable**, not a scratch log:
+- **Do commit** a result after a meaningful change — e.g. a new
+  `intent_router.md` version, or new dataset cases. The new timestamped file
+  *adds* to the history; old files are never replaced. This is how we answer
+  "did accuracy improve after prompt v2?" — diff two committed result files.
+- **Don't commit** throwaway runs while iterating. Just leave them unstaged or
+  delete them.
+So the audit trail = prompt versions (in `src/config/prompts/`) lined up against
+the committed result files here.
+## Dataset notes
+- 6 intents: `chat`, `help`, `problem_statement`, `check`, `unstructured_flow`,
+  `structured_flow`. Each has 6-7 **distinct** scenarios (not EN/ID translation
+  pairs), balanced across English + Indonesian.
+- `carried_over: true` rows mirror the pre-rework `intent_router.md` examples
+  (regression). `lang` enables per-language scoring. `id` is a stable handle for
+  diffing the same case across runs.
+- Routing labels are decided from the question **phrasing**, not from which file
+  holds the answer (the router has no catalog access). See the `_grounding` note
+  in `intent_dataset.json`.
+- Owner: Rifqi (structured/DB-grounded rows) + Sofhia (unstructured/document +
+  tabular-file rows). Merge both into this one file.
+```

eval/intent/__init__.py ADDED Viewed

File without changes

eval/intent/intent_dataset.json ADDED Viewed

	@@ -0,0 +1,56 @@

+{
+  "_about": "Golden intent dataset for the reworked 6-intent router (E3 eval — runs against the live LLM, not the mocked unit tests). Each of the 6 intents has 6-7 DISTINCT scenarios (not EN/ID translation pairs of one scenario) balanced across English + Indonesian, since users code-switch (technical terms often stay English: revenue, churn, upload, base_price). `carried_over` rows mirror the old intent_router.md examples for regression; the rest are new. `id` is a stable per-case handle so timestamped run files (eval_result_<ts>.json) can be diffed case-by-case across runs — it is NOT the run timestamp (that lives in the filename). `lang` enables per-language correctness scoring (aggregate, not matched-pairs).",
+  "_grounding": "structured_flow + structured `check` rows are grounded in Sofhia's real test files: online_vs_offline_learning_dataset.csv (cols: Learning_Mode, Subject, Study_Hours, Retention_Score, Focus_Level, Exam_Score) and xl_knowledge_large.xlsx (telco product catalog: product_name, category, base_price, is_active, region_restriction...). unstructured_flow + document `check` rows are grounded in the IoT connectivity indicative-price PDF and the Internet of Things DOCX. IMPORTANT: routing labels are decided from the question PHRASING, not from which file holds the answer (the router has no catalog access) — so structured rows are clearly analytical (avg/correlation/count) and unstructured rows are clearly explanatory (jelaskan/summarize/features), with no price-lookup collisions. Rifqi's DB-grounded rows can be merged into this same file for variety; owner: Rifqi + Sofhia.",
+  "_next_layer": "Not in this seed: (1) deliberate near-boundary cases (chat-vs-help, check-vs-structured); (2) follow-up/contextual handling — decide whether follow-ups route to `chat` or are out of scope.",
+  "schema": {
+    "id": "stable per-case handle, <intent>_<NN>",
+    "message": "the user utterance fed to the router",
+    "expected_intent": "one of: chat | help | problem_statement | check | unstructured_flow | structured_flow",
+    "lang": "en | id",
+    "carried_over": "true if mirrored from the pre-rework intent_router.md examples"
+  },
+  "cases": [
+    { "id": "chat_01", "message": "Hi", "expected_intent": "chat", "lang": "en", "carried_over": true },
+    { "id": "chat_02", "message": "Bye, thanks", "expected_intent": "chat", "lang": "en", "carried_over": true },
+    { "id": "chat_03", "message": "What can you do?", "expected_intent": "chat", "lang": "en", "carried_over": true },
+    { "id": "chat_04", "message": "Kamu bisa ngerti bahasa Indonesia gk?", "expected_intent": "chat", "lang": "id", "carried_over": false },
+    { "id": "chat_05", "message": "Test, kebaca gak?", "expected_intent": "chat", "lang": "id", "carried_over": false },
+    { "id": "chat_06", "message": "Oh paham2", "expected_intent": "chat", "lang": "id", "carried_over": false },
+    { "id": "help_01", "message": "Okay I uploaded my data, what do I do next?", "expected_intent": "help", "lang": "en", "carried_over": false },
+    { "id": "help_02", "message": "How does this work, where should I start?", "expected_intent": "help", "lang": "en", "carried_over": false },
+    { "id": "help_03", "message": "How do I connect my database to this?", "expected_intent": "help", "lang": "en", "carried_over": false },
+    { "id": "help_04", "message": "Setelah analisis selesai, aku bisa ngapain lagi?", "expected_intent": "help", "lang": "id", "carried_over": false },
+    { "id": "help_05", "message": "Aku harus upload file dulu atau connect database dulu atau bisa langsung tanpa keduanya?", "expected_intent": "help", "lang": "id", "carried_over": false },
+    { "id": "help_06", "message": "Cara bikin report-nya gimana deh?", "expected_intent": "help", "lang": "id", "carried_over": false },
+    { "id": "ps_01", "message": "I want to reduce customer churn next quarter, target under 5%.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+    { "id": "ps_02", "message": "My goal is to improve online students' exam scores this semester.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+    { "id": "ps_03", "message": "We need to figure out which product categories to push next year.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
+    { "id": "ps_04", "message": "Aku mau tau faktor apa yg paling ngaruh ke retention score siswa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+    { "id": "ps_05", "message": "Tujuanku naikin penjualan produk prepaid kuartal depan.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+    { "id": "ps_06", "message": "Aku pengen fokus benahin paket internet yang kurang laku di luar Jawa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
+    { "id": "check_01", "message": "What data do I have?", "expected_intent": "check", "lang": "en", "carried_over": false },
+    { "id": "check_02", "message": "What columns are in the online vs offline learning dataset?", "expected_intent": "check", "lang": "en", "carried_over": false },
+    { "id": "check_03", "message": "Is the IoT connectivity pricing PDF already uploaded?", "expected_intent": "check", "lang": "en", "carried_over": false },
+    { "id": "check_04", "message": "Kolom di tabel product master list apa aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
+    { "id": "check_05", "message": "Dokumen apa aja yang udh aku upload?", "expected_intent": "check", "lang": "id", "carried_over": false },
+    { "id": "check_06", "message": "Sumber dataku yang berupa database yg mana aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
+    { "id": "unstructured_01", "message": "apa key feature dari iot connectivity?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": true },
+    { "id": "unstructured_02", "message": "Jelaskan tentang Internet of Things.", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
+    { "id": "unstructured_03", "message": "Menurut dokumen IoT connectivity, paket apa aja yang ditawarkan?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
+    { "id": "unstructured_04", "message": "What pricing tiers are in the IoT connectivity document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+    { "id": "unstructured_05", "message": "Summarize the key points from the IoT connectivity pricing document.", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+    { "id": "unstructured_06", "message": "What use cases of IoT are mentioned in the document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
+    { "id": "structured_01", "message": "How many orders did we get last month?", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
+    { "id": "structured_02", "message": "Top 5 customers by revenue this year", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
+    { "id": "structured_03", "message": "What's the average exam score per learning mode?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
+    { "id": "structured_04", "message": "Is there a correlation between study hours and exam score?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
+    { "id": "structured_05", "message": "Rata-rata base price per kategori produk berapa?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
+    { "id": "structured_06", "message": "Ada berapa produk yang masih aktif per kategori?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
+    { "id": "structured_07", "message": "Bandingin retention score antara siswa online sama offline.", "expected_intent": "structured_flow", "lang": "id", "carried_over": false }
+  ]
+}

eval/intent/results/.gitkeep ADDED Viewed

File without changes

eval/intent/run_eval.py ADDED Viewed

	@@ -0,0 +1,384 @@

+"""Intent-routing eval runner (E3).
+Feeds each golden case in `intent_dataset.json` to the live 6-intent router
+(`OrchestratorAgent.classify`), then scores correctness + records latency and
+token usage. Prints a per-case detail table and an aggregate summary, and
+writes a timestamped JSON report under `results/` (never overwritten — one file
+per run, so runs can be diffed over time).
+Run before every deploy that touches the router prompt or its few-shots.
+Invoke as a module (`-m`) so the repo root is on `sys.path` and `src` imports
+resolve — running the file path directly (`python eval/intent/run_eval.py`)
+puts only `eval/intent/` on the path and fails:
+    uv run python -m eval.intent.run_eval
+    uv run python -m eval.intent.run_eval --limit 6     # quick smoke test
+Tokens come straight from the model response (LangChain `usage_metadata` via a
+callback) — no Langfuse needed. The router is called unmodified: it already
+accepts a `callbacks=` list and forwards it into the chain config.
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import statistics
+import time
+from dataclasses import asdict, dataclass
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+from langchain_core.callbacks import BaseCallbackHandler
+from langchain_core.outputs import LLMResult
+from src.agents.orchestration import OrchestratorAgent
+_HERE = Path(__file__).resolve().parent
+DATASET = _HERE / "intent_dataset.json"
+RESULTS_DIR = _HERE / "results"
+INTENTS = [
+    "chat",
+    "help",
+    "problem_statement",
+    "check",
+    "unstructured_flow",
+    "structured_flow",
+]
+# Short labels so the EXPECT->GOT column stays narrow in the detail table.
+_ABBR = {
+    "chat": "chat",
+    "help": "help",
+    "problem_statement": "prob_stmt",
+    "check": "check",
+    "unstructured_flow": "unstruct",
+    "structured_flow": "structF",
+}
+class _UsageCollector(BaseCallbackHandler):
+    """Sums token usage across the LLM calls made during one classify().
+    Reads `usage_metadata` off each returned message (the canonical LangChain
+    field), falling back to `llm_output['token_usage']` for providers that only
+    populate the legacy field.
+    """
+    def __init__(self) -> None:
+        self.input_tokens = 0
+        self.output_tokens = 0
+        self.total_tokens = 0
+    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
+        before = self.total_tokens
+        for generation_list in response.generations:
+            for generation in generation_list:
+                message = getattr(generation, "message", None)
+                usage = getattr(message, "usage_metadata", None) if message else None
+                if usage:
+                    self.input_tokens += usage.get("input_tokens", 0)
+                    self.output_tokens += usage.get("output_tokens", 0)
+                    self.total_tokens += usage.get("total_tokens", 0)
+        if self.total_tokens == before and response.llm_output:
+            usage = response.llm_output.get("token_usage") or {}
+            self.input_tokens += usage.get("prompt_tokens", 0)
+            self.output_tokens += usage.get("completion_tokens", 0)
+            self.total_tokens += usage.get("total_tokens", 0)
+    @property
+    def tokens(self) -> dict[str, int]:
+        return {
+            "input": self.input_tokens,
+            "output": self.output_tokens,
+            "total": self.total_tokens,
+        }
+@dataclass
+class CaseResult:
+    id: str
+    lang: str
+    message: str
+    expected: str
+    got: str
+    correct: bool
+    latency_ms: int
+    tokens: dict[str, int]
+def load_cases(path: Path) -> list[dict[str, Any]]:
+    """Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
+    data = json.loads(path.read_text(encoding="utf-8"))
+    return list(data["cases"])
+@dataclass
+class _LangfuseCtx:
+    """Optional Langfuse sink — one session groups all cases of a run."""
+    session_id: str
+    client: Any
+def _new_langfuse_handler(lf_ctx: _LangfuseCtx, case: dict[str, Any]) -> Any:
+    """Per-case LangChain callback so each trace carries the case's labels."""
+    from langfuse.callback import CallbackHandler
+    from src.config.settings import settings
+    return CallbackHandler(
+        public_key=settings.LANGFUSE_PUBLIC_KEY,
+        secret_key=settings.LANGFUSE_SECRET_KEY,
+        host=settings.LANGFUSE_HOST,
+        session_id=lf_ctx.session_id,
+        trace_name=f"intent_eval/{case['id']}",
+        metadata={
+            "case_id": case["id"],
+            "expected": case["expected_intent"],
+            "lang": case["lang"],
+        },
+        tags=["intent-eval", case["expected_intent"], case["lang"]],
+    )
+def _score_langfuse(lf_ctx: _LangfuseCtx, handler: Any, result: CaseResult) -> None:
+    """Attach a 1/0 correctness score to the case's trace. Best-effort."""
+    try:
+        lf_ctx.client.score(
+            trace_id=handler.get_trace_id(),
+            name="intent_correct",
+            value=1 if result.correct else 0,
+            comment=f"{result.expected} -> {result.got}",
+        )
+    except Exception:  # noqa: BLE001, S110 — scoring must never break the run
+        pass
+async def run_case(
+    agent: OrchestratorAgent,
+    case: dict[str, Any],
+    lf_ctx: _LangfuseCtx | None = None,
+) -> CaseResult:
+    """Classify one message; never throws — a failed call is recorded as ERROR."""
+    collector = _UsageCollector()
+    callbacks: list[Any] = [collector]
+    lf_handler = _new_langfuse_handler(lf_ctx, case) if lf_ctx else None
+    if lf_handler is not None:
+        callbacks.append(lf_handler)
+    start = time.perf_counter()
+    got: str
+    try:
+        decision = await agent.classify(case["message"], callbacks=callbacks)
+        got = decision.intent
+    except Exception as exc:  # noqa: BLE001 — one bad case shouldn't kill the run
+        got = f"ERROR:{type(exc).__name__}"
+    latency_ms = round((time.perf_counter() - start) * 1000)
+    result = CaseResult(
+        id=case["id"],
+        lang=case["lang"],
+        message=case["message"],
+        expected=case["expected_intent"],
+        got=got,
+        correct=got == case["expected_intent"],
+        latency_ms=latency_ms,
+        tokens=collector.tokens,
+    )
+    if lf_ctx is not None and lf_handler is not None:
+        _score_langfuse(lf_ctx, lf_handler, result)
+    return result
+def _group_accuracy(results: list[CaseResult], key: str) -> dict[str, dict[str, Any]]:
+    out: dict[str, dict[str, Any]] = {}
+    keys = INTENTS if key == "expected" else sorted({getattr(r, key) for r in results})
+    for k in keys:
+        sub = [r for r in results if getattr(r, key) == k]
+        if not sub:
+            continue
+        passed = sum(r.correct for r in sub)
+        out[k] = {
+            "n": len(sub),
+            "passed": passed,
+            "accuracy": round(passed / len(sub), 3),
+        }
+    return out
+def summarize(results: list[CaseResult]) -> dict[str, Any]:
+    n = len(results)
+    passed = sum(r.correct for r in results)
+    latencies = [r.latency_ms for r in results]
+    tok_in = sum(r.tokens["input"] for r in results)
+    tok_out = sum(r.tokens["output"] for r in results)
+    tok_total = sum(r.tokens["total"] for r in results)
+    return {
+        "total": n,
+        "passed": passed,
+        "accuracy": round(passed / n, 3) if n else 0.0,
+        "runtime_avg_ms": round(statistics.mean(latencies)) if latencies else 0,
+        "runtime_total_s": round(sum(latencies) / 1000, 1),
+        "tokens": {
+            "input": tok_in,
+            "output": tok_out,
+            "total": tok_total,
+            "avg_total_per_case": round(tok_total / n) if n else 0,
+        },
+        "by_intent": _group_accuracy(results, "expected"),
+        "by_lang": _group_accuracy(results, "lang"),
+    }
+def _truncate(text: str, width: int) -> str:
+    text = text.replace("\n", " ")
+    return text if len(text) <= width else text[: width - 3] + "..."
+def format_table(results: list[CaseResult]) -> str:
+    header = (
+        f"{'ID':<15} {'L':<3} {'QUESTION':<40} "
+        f"{'EXPECT->GOT':<20} {'OK':<3} {'MS':>5} {'TOK':>6}"
+    )
+    rule = "-" * len(header)
+    lines = [rule, header, rule]
+    for r in results:
+        exp_got = f"{_ABBR.get(r.expected, r.expected)}->{_ABBR.get(r.got, r.got)}"
+        ok = "ok" if r.correct else "X"
+        lines.append(
+            f"{r.id:<15} {r.lang:<3} {_truncate(r.message, 40):<40} "
+            f"{_truncate(exp_got, 20):<20} {ok:<3} {r.latency_ms:>5} {r.tokens['total']:>6}"
+        )
+    lines.append(rule)
+    return "\n".join(lines)
+def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
+    lines = ["SUMMARY"]
+    lines.append(
+        f"  Overall    {summary['passed']}/{summary['total']} correct"
+        f"   ({summary['accuracy'] * 100:.1f}%)"
+    )
+    lines.append(
+        f"  Runtime    avg {summary['runtime_avg_ms']} ms"
+        f"   |  total {summary['runtime_total_s']} s"
+    )
+    tok = summary["tokens"]
+    lines.append(
+        f"  Tokens     avg {tok['avg_total_per_case']}"
+        f"   |  total {tok['total']}   (in {tok['input']} / out {tok['output']})"
+    )
+    lines.append("")
+    lines.append("  By intent")
+    for intent, m in summary["by_intent"].items():
+        lines.append(
+            f"    {intent:<18} {m['passed']}/{m['n']}  {m['accuracy'] * 100:.0f}%"
+        )
+    lines.append("  By language")
+    for lang, m in summary["by_lang"].items():
+        lines.append(
+            f"    {lang:<18} {m['passed']}/{m['n']}  {m['accuracy'] * 100:.0f}%"
+        )
+    failures = [r for r in results if not r.correct]
+    lines.append("")
+    lines.append(f"  FAILURES ({len(failures)})")
+    for r in failures:
+        lines.append(f"    {r.id:<14} [{r.lang}]  {r.expected:<12} -> {r.got}")
+    return "\n".join(lines)
+def build_report(
+    results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
+) -> dict[str, Any]:
+    run = {**meta, **{k: summary[k] for k in (
+        "total", "passed", "accuracy", "runtime_avg_ms", "runtime_total_s", "tokens"
+    )}}
+    return {
+        "run": run,
+        "by_intent": summary["by_intent"],
+        "by_lang": summary["by_lang"],
+        "cases": [asdict(r) for r in results],
+    }
+def _model_name() -> str:
+    try:
+        from src.config.settings import settings
+        return str(settings.azureai_deployment_name_4o)
+    except Exception:  # noqa: BLE001 — meta only; .env may be absent
+        return "gpt-4o"
+async def main() -> None:
+    parser = argparse.ArgumentParser(description="Intent-routing eval (E3)")
+    parser.add_argument("--dataset", type=Path, default=DATASET)
+    parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
+    parser.add_argument("--prompt-version", default="intent_router.md")
+    parser.add_argument("--no-table", action="store_true", help="skip the detail table")
+    parser.add_argument(
+        "--langfuse", action="store_true",
+        help="also send each case as a Langfuse trace + correctness score",
+    )
+    args = parser.parse_args()
+    cases = load_cases(args.dataset)
+    if args.limit:
+        cases = cases[: args.limit]
+    started = datetime.now()
+    print(f"Intent Routing Eval -- {started:%Y-%m-%d %H:%M:%S}")
+    print(f"dataset: {args.dataset.name} ({len(cases)})  model: {_model_name()}  "
+          f"prompt: {args.prompt_version}")
+    lf_ctx: _LangfuseCtx | None = None
+    if args.langfuse:
+        try:
+            from src.observability.langfuse.langfuse import get_langfuse
+            lf_ctx = _LangfuseCtx(
+                session_id=f"intent_eval_{started:%Y%m%d_%H%M%S}",
+                client=get_langfuse(),  # type: ignore[no-untyped-call]
+            )
+            print(f"langfuse: enabled (session {lf_ctx.session_id})")
+        except Exception as exc:  # noqa: BLE001 — Langfuse is optional
+            print(f"langfuse: disabled ({type(exc).__name__}: {exc})")
+    agent = OrchestratorAgent()
+    results: list[CaseResult] = []
+    for case in cases:
+        results.append(await run_case(agent, case, lf_ctx))
+    if lf_ctx is not None:
+        try:
+            lf_ctx.client.flush()
+        except Exception:  # noqa: BLE001, S110 — flush failure shouldn't fail the run
+            pass
+    summary = summarize(results)
+    if not args.no_table:
+        print(format_table(results))
+    print(format_summary(summary, results))
+    meta = {
+        "timestamp": started.isoformat(timespec="seconds"),
+        "dataset": args.dataset.name,
+        "model": _model_name(),
+        "prompt_version": args.prompt_version,
+        "langfuse_session": lf_ctx.session_id if lf_ctx else None,
+    }
+    report = build_report(results, summary, meta)
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    out_path = RESULTS_DIR / f"eval_result_{started:%Y-%m-%d_%H%M%S}.json"
+    out_path.write_text(
+        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
+if __name__ == "__main__":
+    asyncio.run(main())

eval/readiness/README.md ADDED Viewed

	@@ -0,0 +1,34 @@

+# Report-readiness eval
+Scores the deterministic `is_report_ready` signal (`src/agents/report/readiness.py`)
+that the Help skill consumes to decide whether to nudge the user toward generating a
+report. No LLM, no DB — each golden case declares an analysis state + a set of
+persisted records/reports, and the runner feeds them through `is_report_ready` via
+injectable fake stores.
+## Run
+```bash
+uv run python -m eval.readiness.run_eval
+uv run python -m eval.readiness.run_eval --limit 5     # smoke test
+uv run python -m eval.readiness.run_eval --no-table    # summary only
+```
+Each run writes a timestamped `results/readiness_result_<ts>.json` (never
+overwritten, diffable across runs).
+## What it measures
+- **Floor correctness** — exact `ready` + `missing` for the deterministic floor
+  (validated goal · ≥1 substantive record · delta-since-report). Should sit at ~100%;
+  this is the regression guard as criteria evolve.
+- **Alignment gap** — `alignment` cases have substantive records (floor says
+  `ready=true`) but `aligned=false`: the analyses don't address the problem statement.
+  The floor can't see this. The gap count is the evidence for/against adding the
+  deferred LLM-judge — "ship the floor, earn the judge."
+## Dataset
+`readiness_dataset.json` — groups: `floor`, `delta`, `edge` (doc-only product
+question), `alignment`. See the `_about` / `_alignment` doc keys in the file. The
+`aligned` label is a semantic judgment; owner: Rifqi (report semantics) + Sofhia.

eval/readiness/__init__.py ADDED Viewed

File without changes

eval/readiness/readiness_dataset.json ADDED Viewed

	@@ -0,0 +1,40 @@

+{
+  "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
+  "_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
+  "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
+  "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
+  "schema": {
+    "id": "stable per-case handle, <group>_<NN>",
+    "group": "floor | delta | edge | alignment",
+    "problem_validated": "bool",
+    "report_id": "null = never generated; a string = a report exists",
+    "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
+    "reports": "[{ age_min: int }] (only meaningful when report_id set)",
+    "aligned": "bool — do the analyses address the problem statement? (floor ignores this)",
+    "expected_ready": "what the deterministic floor SHOULD return",
+    "expected_missing": "subset of [problem, analysis, delta]",
+    "note": "human-readable description"
+  },
+  "cases": [
+    { "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" },
+    { "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" },
+    { "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" },
+    { "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
+    { "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
+    { "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report → ready" },
+    { "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
+    { "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
+    { "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
+    { "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
+    { "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
+    { "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
+    { "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
+    { "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
+    { "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." },
+    { "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" },
+    { "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
+  ]
+}

eval/readiness/results/.gitkeep ADDED Viewed

File without changes

eval/readiness/results/readiness_result_2026-06-22_101645.json ADDED Viewed

	@@ -0,0 +1,268 @@

+{
+  "run": {
+    "timestamp": "2026-06-22T10:16:45",
+    "dataset": "readiness_dataset.json",
+    "target": "src/agents/report/readiness.is_report_ready",
+    "total": 16,
+    "passed": 16,
+    "accuracy": 1.0,
+    "runtime_avg_ms": 0.0
+  },
+  "alignment_gap": {
+    "count": 2,
+    "ids": [
+      "align_01",
+      "align_02"
+    ]
+  },
+  "by_group": {
+    "floor": {
+      "n": 8,
+      "passed": 8,
+      "accuracy": 1.0
+    },
+    "delta": {
+      "n": 4,
+      "passed": 4,
+      "accuracy": 1.0
+    },
+    "edge": {
+      "n": 1,
+      "passed": 1,
+      "accuracy": 1.0
+    },
+    "alignment": {
+      "n": 3,
+      "passed": 3,
+      "accuracy": 1.0
+    }
+  },
+  "cases": [
+    {
+      "id": "floor_01",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a validated problem statement",
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "a validated problem statement",
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_02",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a validated problem statement"
+      ],
+      "got_missing": [
+        "a validated problem statement"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_03",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_04",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_05",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_06",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_07",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_08",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_01",
+      "group": "delta",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a new analysis since the last report"
+      ],
+      "got_missing": [
+        "a new analysis since the last report"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_02",
+      "group": "delta",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_03",
+      "group": "delta",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_04",
+      "group": "delta",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a new analysis since the last report"
+      ],
+      "got_missing": [
+        "a new analysis since the last report"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "edge_01",
+      "group": "edge",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_01",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": false,
+      "gap": true,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_02",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": false,
+      "gap": true,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_03",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    }
+  ]
+}

eval/readiness/results/readiness_result_2026-06-22_143809.json ADDED Viewed

	@@ -0,0 +1,284 @@

+{
+  "run": {
+    "timestamp": "2026-06-22T14:38:09",
+    "dataset": "readiness_dataset.json",
+    "target": "src/agents/report/readiness.is_report_ready",
+    "total": 17,
+    "passed": 17,
+    "accuracy": 1.0,
+    "runtime_avg_ms": 0.01
+  },
+  "alignment_gap": {
+    "count": 2,
+    "ids": [
+      "align_01",
+      "align_02"
+    ]
+  },
+  "by_group": {
+    "floor": {
+      "n": 8,
+      "passed": 8,
+      "accuracy": 1.0
+    },
+    "delta": {
+      "n": 5,
+      "passed": 5,
+      "accuracy": 1.0
+    },
+    "edge": {
+      "n": 1,
+      "passed": 1,
+      "accuracy": 1.0
+    },
+    "alignment": {
+      "n": 3,
+      "passed": 3,
+      "accuracy": 1.0
+    }
+  },
+  "cases": [
+    {
+      "id": "floor_01",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a validated problem statement",
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "a validated problem statement",
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_02",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a validated problem statement"
+      ],
+      "got_missing": [
+        "a validated problem statement"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_03",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_04",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_05",
+      "group": "floor",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_06",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_07",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "floor_08",
+      "group": "floor",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.1
+    },
+    {
+      "id": "delta_01",
+      "group": "delta",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a new analysis since the last report"
+      ],
+      "got_missing": [
+        "a new analysis since the last report"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_02",
+      "group": "delta",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_03",
+      "group": "delta",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_04",
+      "group": "delta",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a new analysis since the last report"
+      ],
+      "got_missing": [
+        "a new analysis since the last report"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "delta_05",
+      "group": "delta",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "a new analysis since the last report"
+      ],
+      "got_missing": [
+        "a new analysis since the last report"
+      ],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "edge_01",
+      "group": "edge",
+      "expected_ready": false,
+      "got_ready": false,
+      "expected_missing": [
+        "at least one completed analysis"
+      ],
+      "got_missing": [
+        "at least one completed analysis"
+      ],
+      "correct": true,
+      "aligned": false,
+      "gap": false,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_01",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": false,
+      "gap": true,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_02",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": false,
+      "gap": true,
+      "latency_ms": 0.0
+    },
+    {
+      "id": "align_03",
+      "group": "alignment",
+      "expected_ready": true,
+      "got_ready": true,
+      "expected_missing": [],
+      "got_missing": [],
+      "correct": true,
+      "aligned": true,
+      "gap": false,
+      "latency_ms": 0.0
+    }
+  ]
+}

eval/readiness/run_eval.py ADDED Viewed

	@@ -0,0 +1,309 @@

+"""Report-readiness eval runner.
+Feeds each golden case in `readiness_dataset.json` to the deterministic
+`is_report_ready` signal (`src/agents/report/readiness.py`) via injectable FAKE
+stores — no LLM, no DB — then scores both the boolean `ready` and the `missing`
+gaps. Prints a per-case detail table + aggregate summary and writes a timestamped
+JSON report under `results/` (never overwritten — one file per run, diffable).
+Two metrics matter:
+  - FLOOR correctness (ready + missing exact) — should be ~100%; this is the
+    regression guard as the criteria evolve.
+  - ALIGNMENT GAP — cases the floor calls ready=true but whose analyses are NOT
+    aligned to the problem statement (`aligned=false`). The floor can't see this;
+    the gap count is the evidence for/against adding the deferred LLM-judge.
+Invoke as a module so `src` imports resolve:
+    uv run python -m eval.readiness.run_eval
+    uv run python -m eval.readiness.run_eval --limit 5
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import statistics
+import time
+from dataclasses import asdict, dataclass, field
+from datetime import UTC, datetime, timedelta
+from pathlib import Path
+from typing import Any
+from src.agents.gate import stub_analysis_state
+from src.agents.report.readiness import (
+    _MISSING_ANALYSIS,
+    _MISSING_DELTA,
+    _MISSING_PROBLEM,
+    is_report_ready,
+)
+_HERE = Path(__file__).resolve().parent
+DATASET = _HERE / "readiness_dataset.json"
+RESULTS_DIR = _HERE / "results"
+GROUPS = ["floor", "delta", "edge", "alignment"]
+# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
+# from the module so the dataset stays readable and survives wording changes.
+_CODE_TO_MISSING = {
+    "problem": _MISSING_PROBLEM,
+    "analysis": _MISSING_ANALYSIS,
+    "delta": _MISSING_DELTA,
+}
+@dataclass
+class _FakeTask:
+    """Mirrors slow_path.schemas.TaskSummary (the bits is_report_ready reads)."""
+    status: str  # success | partial | failure
+    tools_used: list[str]
+@dataclass
+class _FakeRecord:
+    findings: list[Any]
+    created_at: datetime
+    tasks_run: list[_FakeTask]
+@dataclass
+class _FakeReport:
+    generated_at: datetime
+class _FakeStore:
+    """Stand-in for the Postgres record/report store — returns canned rows."""
+    def __init__(self, rows: list[Any]) -> None:
+        self._rows = rows
+    async def list_for_analysis(self, _analysis_id: str) -> list[Any]:
+        return self._rows
+@dataclass
+class CaseResult:
+    id: str
+    group: str
+    expected_ready: bool
+    got_ready: bool
+    expected_missing: list[str]
+    got_missing: list[str]
+    correct: bool
+    aligned: bool
+    gap: bool  # floor said ready but analyses not aligned to the problem statement
+    latency_ms: float
+def load_cases(path: Path) -> list[dict[str, Any]]:
+    data = json.loads(path.read_text(encoding="utf-8"))
+    return list(data["cases"])
+def _build_tasks(analysis: str) -> list[_FakeTask]:
+    """Realistic tasks_run: data-access always succeeds; the analyze_* task varies.
+    analysis = 'success' (analyze_* succeeded) | 'failure' (analyze_* failed) |
+    'none' (no analyze task at all — only check/retrieve succeeded).
+    """
+    tasks = [
+        _FakeTask(status="success", tools_used=["check_data"]),
+        _FakeTask(status="success", tools_used=["retrieve_data"]),
+    ]
+    if analysis == "success":
+        tasks.append(_FakeTask(status="success", tools_used=["analyze_aggregate"]))
+    elif analysis == "failure":
+        tasks.append(_FakeTask(status="failure", tools_used=["analyze_aggregate"]))
+    return tasks
+def _build_records(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRecord]:
+    return [
+        _FakeRecord(
+            findings=["f"] * int(spec.get("findings", 0)),
+            created_at=now - timedelta(minutes=int(spec["age_min"])),
+            tasks_run=_build_tasks(str(spec.get("analysis", "success"))),
+        )
+        for spec in specs
+    ]
+def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeReport]:
+    return [
+        _FakeReport(generated_at=now - timedelta(minutes=int(spec["age_min"])))
+        for spec in specs
+    ]
+async def run_case(case: dict[str, Any]) -> CaseResult:
+    now = datetime.now(UTC)
+    state = stub_analysis_state(problem_validated=bool(case["problem_validated"]))
+    if case.get("report_id"):
+        state = state.model_copy(update={"report_id": case["report_id"]})
+    record_store = _FakeStore(_build_records(case.get("records", []), now))
+    report_store = _FakeStore(_build_reports(case.get("reports", []), now))
+    expected_missing = sorted(_CODE_TO_MISSING[c] for c in case["expected_missing"])
+    start = time.perf_counter()
+    rr = await is_report_ready(
+        case["id"], state, record_store=record_store, report_store=report_store
+    )
+    latency_ms = round((time.perf_counter() - start) * 1000, 1)
+    got_missing = sorted(rr.missing)
+    ready_ok = rr.ready == bool(case["expected_ready"])
+    missing_ok = got_missing == expected_missing
+    return CaseResult(
+        id=case["id"],
+        group=case["group"],
+        expected_ready=bool(case["expected_ready"]),
+        got_ready=rr.ready,
+        expected_missing=expected_missing,
+        got_missing=got_missing,
+        correct=ready_ok and missing_ok,
+        aligned=bool(case["aligned"]),
+        gap=rr.ready and not bool(case["aligned"]),
+        latency_ms=latency_ms,
+    )
+def _group_accuracy(results: list[CaseResult]) -> dict[str, dict[str, Any]]:
+    out: dict[str, dict[str, Any]] = {}
+    for g in GROUPS:
+        sub = [r for r in results if r.group == g]
+        if not sub:
+            continue
+        passed = sum(r.correct for r in sub)
+        out[g] = {"n": len(sub), "passed": passed, "accuracy": round(passed / len(sub), 3)}
+    return out
+def summarize(results: list[CaseResult]) -> dict[str, Any]:
+    n = len(results)
+    passed = sum(r.correct for r in results)
+    gaps = [r for r in results if r.gap]
+    latencies = [r.latency_ms for r in results]
+    return {
+        "total": n,
+        "passed": passed,
+        "accuracy": round(passed / n, 3) if n else 0.0,
+        "runtime_avg_ms": round(statistics.mean(latencies), 2) if latencies else 0,
+        "alignment_gap": {"count": len(gaps), "ids": [r.id for r in gaps]},
+        "by_group": _group_accuracy(results),
+    }
+def _fmt_bool(value: bool) -> str:
+    return "T" if value else "F"
+def _truncate(text: str, width: int) -> str:
+    return text if len(text) <= width else text[: width - 3] + "..."
+def format_table(results: list[CaseResult]) -> str:
+    header = (
+        f"{'ID':<12} {'GROUP':<10} {'RDY e/g':<8} "
+        f"{'MISSING (got)':<40} {'OK':<3} {'GAP':<4}"
+    )
+    rule = "-" * len(header)
+    lines = [rule, header, rule]
+    for r in results:
+        rdy = f"{_fmt_bool(r.expected_ready)}/{_fmt_bool(r.got_ready)}"
+        missing = ", ".join(r.got_missing) or "-"
+        ok = "ok" if r.correct else "X"
+        gap = "GAP" if r.gap else ""
+        lines.append(
+            f"{r.id:<12} {r.group:<10} {rdy:<8} "
+            f"{_truncate(missing, 40):<40} {ok:<3} {gap:<4}"
+        )
+    lines.append(rule)
+    return "\n".join(lines)
+def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
+    lines = ["SUMMARY"]
+    lines.append(
+        f"  Floor      {summary['passed']}/{summary['total']} correct"
+        f"   ({summary['accuracy'] * 100:.1f}%)   avg {summary['runtime_avg_ms']} ms"
+    )
+    gap = summary["alignment_gap"]
+    lines.append(
+        f"  Align gap  {gap['count']} case(s) ready-but-misaligned"
+        + (f"  -> {', '.join(gap['ids'])}" if gap["ids"] else "")
+    )
+    lines.append("    (floor can't catch these; this count is the LLM-judge justification)")
+    lines.append("")
+    lines.append("  By group")
+    for g, m in summary["by_group"].items():
+        lines.append(f"    {g:<12} {m['passed']}/{m['n']}  {m['accuracy'] * 100:.0f}%")
+    failures = [r for r in results if not r.correct]
+    lines.append("")
+    lines.append(f"  FAILURES ({len(failures)})")
+    for r in failures:
+        lines.append(
+            f"    {r.id:<12} ready {_fmt_bool(r.expected_ready)}->{_fmt_bool(r.got_ready)}"
+            f"  missing {r.expected_missing} -> {r.got_missing}"
+        )
+    return "\n".join(lines)
+def build_report(
+    results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
+) -> dict[str, Any]:
+    run = {**meta, **{k: summary[k] for k in ("total", "passed", "accuracy", "runtime_avg_ms")}}
+    return {
+        "run": run,
+        "alignment_gap": summary["alignment_gap"],
+        "by_group": summary["by_group"],
+        "cases": [asdict(r) for r in results],
+    }
+@dataclass
+class _Args:
+    dataset: Path = DATASET
+    limit: int = 0
+    no_table: bool = False
+    extra: dict[str, Any] = field(default_factory=dict)
+async def main() -> None:
+    parser = argparse.ArgumentParser(description="Report-readiness eval")
+    parser.add_argument("--dataset", type=Path, default=DATASET)
+    parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
+    parser.add_argument("--no-table", action="store_true", help="skip the detail table")
+    args = parser.parse_args()
+    cases = load_cases(args.dataset)
+    if args.limit:
+        cases = cases[: args.limit]
+    started = datetime.now()
+    print(f"Report-Readiness Eval -- {started:%Y-%m-%d %H:%M:%S}")
+    print(f"dataset: {args.dataset.name} ({len(cases)} cases)  target: is_report_ready")
+    results = [await run_case(case) for case in cases]
+    summary = summarize(results)
+    if not args.no_table:
+        print(format_table(results))
+    print(format_summary(summary, results))
+    meta = {
+        "timestamp": started.isoformat(timespec="seconds"),
+        "dataset": args.dataset.name,
+        "target": "src/agents/report/readiness.is_report_ready",
+    }
+    report = build_report(results, summary, meta)
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    out_path = RESULTS_DIR / f"readiness_result_{started:%Y-%m-%d_%H%M%S}.json"
+    out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
+if __name__ == "__main__":
+    asyncio.run(main())

main.py CHANGED Viewed

@@ -13,6 +13,9 @@ from src.api.v1.room import router as room_router
 from src.api.v1.users import router as users_router
 from src.api.v1.db_client import router as db_client_router
 from src.api.v1.data_catalog import router as data_catalog_router
 from src.db.postgres.init_db import init_db
 import os
 import uvicorn
@@ -53,6 +56,9 @@ app.include_router(room_router)
 app.include_router(chat_router)
 app.include_router(db_client_router)
 app.include_router(data_catalog_router)
 @app.get("/")

 from src.api.v1.users import router as users_router
 from src.api.v1.db_client import router as db_client_router
 from src.api.v1.data_catalog import router as data_catalog_router
+from src.api.v1.report import router as report_router
+from src.api.v1.analysis import router as analysis_router
+from src.api.v1.tools import router as tools_router
 from src.db.postgres.init_db import init_db
 import os
 import uvicorn
 app.include_router(chat_router)
 app.include_router(db_client_router)
 app.include_router(data_catalog_router)
+app.include_router(report_router)
+app.include_router(analysis_router)
+app.include_router(tools_router)
 @app.get("/")

pyproject.toml CHANGED Viewed

@@ -123,6 +123,8 @@ ignore = [
 # S608: golden compiler tests assert literal SQL strings (incl. concatenated
 # suffixes) — they never execute against a DB, so it's a false positive here.
 "tests/**" = ["S101", "S105", "S106", "S608"]
 [tool.mypy]
 python_version = "3.12"

 # S608: golden compiler tests assert literal SQL strings (incl. concatenated
 # suffixes) — they never execute against a DB, so it's a false positive here.
 "tests/**" = ["S101", "S105", "S106", "S608"]
+# T201: eval/ scripts are CLIs — print() is their intended output channel.
+"eval/**" = ["T201"]
 [tool.mypy]
 python_version = "3.12"

src/agents/binding_store.py ADDED Viewed

	@@ -0,0 +1,34 @@

+"""AnalysisDataSourceStore — read per-analysis data-source bindings (#10).
+The dedorch `data_sources` table records which catalog sources an analysis is scoped
+to (`reference_id` = the catalog source id). It's written at `/analysis/create`; this
+store is the read seam for the two consumers — `structured_flow` catalog scoping and
+the report's data-source appendix.
+Fail-open by convention at the call sites: an empty binding (legacy room, or the FE
+not yet sending ids) means "no restriction" — fall back to the whole catalog. Mirrors
+`AnalysisStateStore`: each call opens its own `AsyncSession`.
+"""
+from __future__ import annotations
+from sqlalchemy import select
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import AnalysisDataSourceRow
+from src.middlewares.logging import get_logger
+logger = get_logger("binding_store")
+class AnalysisDataSourceStore:
+    """Read the bound catalog `source_id`s for an analysis."""
+    async def get(self, analysis_id: str) -> list[str]:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(AnalysisDataSourceRow.reference_id).where(
+                    AnalysisDataSourceRow.analysis_id == analysis_id
+                )
+            )
+            return list(result.scalars().all())

src/agents/chat_handler.py CHANGED Viewed

@@ -2,18 +2,22 @@
 End-to-end flow per user message:
-  1. `IntentRouter.classify` → `chat` / `unstructured` / `structured`.
-  2. Route:
-       - `chat`         → no context. Pass straight to ChatbotAgent.
-       - `structured`   → CatalogReader → QueryService → QueryResult.
-       - `unstructured` → DocumentRetriever (placeholder, raises until TAB
-                          ships) → list[DocumentChunk].
   3. `ChatbotAgent.astream` → yield text tokens.
   4. Wrap each step into an SSE-style event dict so the API endpoint can
      stream them as Server-Sent Events.
-Phase 1's chat endpoint (`src/api/v1/chat.py`) is intentionally NOT touched
-in this PR. PR7 cleanup will rewire it to call `ChatHandler.handle(...)`.
 All dependencies are injectable for tests. Default constructors lazy-build
 production deps (no `Settings()` triggered at import time as long as you
@@ -33,12 +37,16 @@ from src.middlewares.logging import get_logger
 from src.retrieval.base import RetrievalResult
 from .chatbot import ChatbotAgent, DocumentChunk
 from .orchestration import OrchestratorAgent
 if TYPE_CHECKING:
     from ..catalog.reader import CatalogReader
     from ..query.service import QueryService
     from ..retrieval.router import RetrievalRouter
     from .slow_path.coordinator import SlowPathCoordinator
     from .slow_path.store import AnalysisStore
@@ -71,6 +79,12 @@ class ChatHandler:
             Callable[[str], SlowPathCoordinator] | None
         ) = None,
         analysis_store: AnalysisStore | None = None,
         enable_tracing: bool = False,
     ) -> None:
         self._intent_router = intent_router
@@ -88,6 +102,21 @@ class ChatHandler:
         self._enable_slow_path = enable_slow_path
         self._slow_path_factory = slow_path_coordinator_factory
         self._analysis_store = analysis_store
     # ------------------------------------------------------------------
     # Lazy default-dep builders
@@ -125,6 +154,71 @@ class ChatHandler:
             self._document_retriever = RetrievalRouter()
         return self._document_retriever
     # ------------------------------------------------------------------
     # Public entry
     # ------------------------------------------------------------------
@@ -134,6 +228,7 @@ class ChatHandler:
         message: str,
         user_id: str,
         history: list[BaseMessage] | None = None,
     ) -> AsyncIterator[dict[str, Any]]:
         tracer = self._make_tracer(user_id, message)
@@ -147,7 +242,39 @@ class ChatHandler:
             yield {"event": "error", "data": f"Could not classify message: {e}"}
             return
-        yield {"event": "intent", "data": decision.model_dump_json()}
         rewritten = decision.rewritten_query or message
         query_result = None
@@ -155,7 +282,7 @@ class ChatHandler:
         raw_chunks: Any = None
         # ---- 2. Route ------------------------------------------------
-        if decision.source_hint == "structured":
             try:
                 # One memoizing reader per request: the same catalog is otherwise
                 # re-fetched from the catalog DB 4-5x across the slow-path run. This
@@ -164,10 +291,15 @@ class ChatHandler:
                 from ..catalog.reader import MemoizingCatalogReader
                 req_reader = MemoizingCatalogReader(self._get_catalog_reader())
-                catalog = await req_reader.read(user_id, "structured")
                 if self._enable_slow_path:
                     async for event in self._run_slow_path(
-                        user_id, rewritten, catalog, tracer, req_reader
                     ):
                         yield event
                     return
@@ -182,32 +314,88 @@ class ChatHandler:
                 )
                 yield {"event": "error", "data": f"Structured query failed: {e}"}
                 return
-        elif decision.source_hint == "unstructured":
             try:
                 raw_chunks = await self._get_document_retriever().retrieve(
                     rewritten, user_id
                 )
                 chunks = _normalize_chunks(raw_chunks)
-            except NotImplementedError:
-                logger.warning("DocumentRetriever placeholder hit", user_id=user_id)
-                yield {
-                    "event": "error",
-                    "data": "Document retrieval is not yet available — pending implementation.",
-                }
-                return
             except Exception as e:
                 logger.error(
                     "unstructured route failed", user_id=user_id, error=str(e)
                 )
                 yield {"event": "error", "data": f"Document retrieval failed: {e}"}
                 return
         # else: chat path — no context
         # ---- 2b. Emit sources ---------------------------------------
-        sources = _build_sources(
-            decision.source_hint, user_id, query_result, raw_chunks
         )
-        logger.info("built sources", source_hint=decision.source_hint, sources_count=len(sources), raw_chunks_count=len(raw_chunks) if raw_chunks else 0)
         yield {"event": "sources", "data": json.dumps(sources)}
         # ---- 3. Stream answer ----------------------------------------
@@ -282,9 +470,9 @@ class ChatHandler:
     def _get_analysis_store(self) -> AnalysisStore:
         if self._analysis_store is None:
-            from .slow_path.store import NullAnalysisStore
-            self._analysis_store = NullAnalysisStore()
         return self._analysis_store
     async def _run_slow_path(
@@ -294,11 +482,13 @@ class ChatHandler:
         catalog: Any,
         tracer: Any = None,
         catalog_reader: CatalogReader | None = None,
     ) -> AsyncIterator[dict[str, Any]]:
         """Run the slow path and stream its assembled answer as SSE events.
         Context comes from the `get_business_context` seam (a stub today); the
-        `analysis_record` is persisted via the `AnalysisStore` seam (a no-op today).
         `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
         object — true token streaming is a later step).
         """
@@ -368,26 +558,58 @@ class ChatHandler:
         yield {"event": "sources", "data": json.dumps([])}  # TODO: derive from record
         yield {"event": "chunk", "data": result.chat_answer}
         try:
-            await self._get_analysis_store().save(result.analysis_record)
         except Exception as e:  # persistence must never break the user's answer
             logger.error("analysis_record persist failed", user_id=user_id, error=str(e))
         tracer.end()  # output omitted (chat_answer may contain PII on Cloud)
         yield {"event": "done", "data": ""}
 def _build_sources(
-    source_hint: str,
     user_id: str,
     query_result: Any,
     raw_chunks: Any,
 ) -> list[dict[str, Any]]:
     """Build the sources payload for the SSE `sources` event.
-    - structured: one entry per executed table (table_name only).
-    - unstructured: deduped by (document_id, page_label), Phase 1 shape.
     - chat or error: empty list.
     """
-    if source_hint == "structured":
         if query_result is None or getattr(query_result, "error", None):
             return []
         table_name = getattr(query_result, "table_name", "") or ""
@@ -399,7 +621,7 @@ def _build_sources(
             "page_label": None,
         }]
-    if source_hint == "unstructured" and raw_chunks:
         seen: set[tuple[Any, Any]] = set()
         sources: list[dict[str, Any]] = []
         for item in raw_chunks:

 End-to-end flow per user message:
+  1. `OrchestratorAgent.classify` → RouterDecision (one of six intents).
+  2. Route by intent:
+       - `chat`              → no context. Pass straight to ChatbotAgent.
+       - `structured_flow`   → CatalogReader → slow path / QueryService.
+       - `unstructured_flow` → DocumentRetriever (RAG over PGVector) →
+                               list[DocumentChunk].
+       - `check`             → check_data / check_knowledge tool → rendered table.
+       - `problem_statement` → PS skill: draft + validate → write analysis state.
+       - `help`              → Help skill: analysis state + history → streamed guidance.
   3. `ChatbotAgent.astream` → yield text tokens.
   4. Wrap each step into an SSE-style event dict so the API endpoint can
      stream them as Server-Sent Events.
+The chat endpoint (`src/api/v1/chat.py`) calls `ChatHandler.handle(...)` per
+request, behind two endpoint-level pre-filters: a greeting/farewell
+short-circuit and a Redis response cache (both skip the LLM on a hit).
 All dependencies are injectable for tests. Default constructors lazy-build
 production deps (no `Settings()` triggered at import time as long as you
 from src.retrieval.base import RetrievalResult
 from .chatbot import ChatbotAgent, DocumentChunk
+from .handlers.check import run_check
+from .handlers.help import HelpAgent
+from .handlers.problem_statement import ProblemStatementAgent, run_problem_statement
 from .orchestration import OrchestratorAgent
 if TYPE_CHECKING:
     from ..catalog.reader import CatalogReader
     from ..query.service import QueryService
     from ..retrieval.router import RetrievalRouter
+    from .gate import AnalysisState
     from .slow_path.coordinator import SlowPathCoordinator
     from .slow_path.store import AnalysisStore
             Callable[[str], SlowPathCoordinator] | None
         ) = None,
         analysis_store: AnalysisStore | None = None,
+        check_invoker_factory: Callable[[str], Any] | None = None,
+        ps_agent: ProblemStatementAgent | None = None,
+        help_agent: HelpAgent | None = None,
+        state_store: Any | None = None,
+        binding_store: Any | None = None,
+        enable_gate: bool = False,
         enable_tracing: bool = False,
     ) -> None:
         self._intent_router = intent_router
         self._enable_slow_path = enable_slow_path
         self._slow_path_factory = slow_path_coordinator_factory
         self._analysis_store = analysis_store
+        # `check` skill: builds the data-access invoker (check_data/check_knowledge)
+        # per request with the authenticated user_id. Injectable for tests.
+        self._check_invoker_factory = check_invoker_factory
+        # `problem_statement` skill: LLM drafter + the Analysis State store it writes
+        # `problem_validated` to. Both injectable for tests.
+        self._ps_agent = ps_agent
+        # `help` skill: LLM guide that reads the Analysis State + chat history.
+        self._help_agent = help_agent
+        self._state_store = state_store
+        # `#10` data-source binding: scopes structured_flow's catalog to the sources
+        # the analysis is bound to. Injectable for tests; fail-open when absent.
+        self._binding_store = binding_store
+        # Deterministic gate: redirect structured_flow -> problem_statement until the
+        # analysis is validated. OFF by default (legacy rooms have no state row).
+        self._enable_gate = enable_gate
     # ------------------------------------------------------------------
     # Lazy default-dep builders
             self._document_retriever = RetrievalRouter()
         return self._document_retriever
+    def _get_check_invoker(self, user_id: str) -> Any:
+        """Build the per-request data-access invoker for the `check` skill."""
+        if self._check_invoker_factory is not None:
+            return self._check_invoker_factory(user_id)
+        from ..tools.data_access import DataAccessToolInvoker
+        return DataAccessToolInvoker(user_id, self._get_catalog_reader())
+    def _get_ps_agent(self) -> ProblemStatementAgent:
+        if self._ps_agent is None:
+            self._ps_agent = ProblemStatementAgent()
+        return self._ps_agent
+    def _get_help_agent(self) -> HelpAgent:
+        if self._help_agent is None:
+            self._help_agent = HelpAgent()
+        return self._help_agent
+    def _get_state_store(self) -> Any:
+        if self._state_store is None:
+            from .state_store import AnalysisStateStore
+            self._state_store = AnalysisStateStore()
+        return self._state_store
+    def _get_binding_store(self) -> Any:
+        if self._binding_store is None:
+            from .binding_store import AnalysisDataSourceStore
+            self._binding_store = AnalysisDataSourceStore()
+        return self._binding_store
+    async def _bound_source_ids(self, analysis_id: str | None) -> set[str]:
+        """#10: the catalog source_ids this analysis is bound to (empty = unscoped).
+        Fail-open: no analysis_id, no binding rows (legacy room / FE not sending
+        ids), or a read error → empty set, which the caller treats as "whole
+        catalog". Used to build a `_ScopedCatalogReader` so the Planner AND the
+        data-access tools (which re-read the catalog themselves) see the same scope.
+        """
+        if not analysis_id:
+            return set()
+        try:
+            return set(await self._get_binding_store().get(analysis_id))
+        except Exception as e:  # noqa: BLE001 — never block the query on this
+            logger.warning("binding read failed — unscoped", analysis_id=analysis_id, error=str(e))
+            return set()
+    async def _load_analysis_state(self, analysis_id: str | None) -> AnalysisState:
+        """Load Analysis State for the Help skill; fail closed to a not-validated stub.
+        Mirrors the gate's never-throw fallback so Help degrades gracefully on a
+        missing row, a read error, or a legacy room with no `analysis_id`.
+        """
+        from .gate import stub_analysis_state
+        if not analysis_id:
+            return stub_analysis_state(problem_validated=False)
+        try:
+            state = await self._get_state_store().get(analysis_id)
+        except Exception as e:
+            logger.warning("help state read failed — not-validated", error=str(e))
+            state = None
+        return state if state is not None else stub_analysis_state(problem_validated=False)
     # ------------------------------------------------------------------
     # Public entry
     # ------------------------------------------------------------------
         message: str,
         user_id: str,
         history: list[BaseMessage] | None = None,
+        analysis_id: str | None = None,
     ) -> AsyncIterator[dict[str, Any]]:
         tracer = self._make_tracer(user_id, message)
             yield {"event": "error", "data": f"Could not classify message: {e}"}
             return
+        intent = decision.intent
+        # ---- 1a. Ensure session state row (T-A) ----------------------
+        # Rooms created via /room/create have no `analysis_states` row. Without one
+        # the gate redirect-loops and problem_statement / report_id writes silently
+        # no-op. Lazily get-or-create it (idempotent) so any session is gate-ready.
+        analysis_state: AnalysisState | None = None
+        if analysis_id:
+            try:
+                analysis_state = await self._get_state_store().ensure(analysis_id, user_id)
+            except Exception as e:
+                logger.warning(
+                    "analysis state ensure failed", analysis_id=analysis_id, error=str(e)
+                )
+        # ---- 1b. Gate (deterministic, post-router) -------------------
+        # Redirect structured_flow -> problem_statement until the analysis is
+        # validated. Fails closed (not-validated) when the state row is unavailable.
+        if self._enable_gate and analysis_id:
+            from .gate import gate, stub_analysis_state
+            intent = gate(
+                intent,
+                analysis_state
+                if analysis_state is not None
+                else stub_analysis_state(problem_validated=False),
+            )
+        # The `intent` event is consumed by the endpoint (it gates response caching
+        # on the effective intent) and is NOT forwarded to the frontend. We emit the
+        # post-gate intent so the cache keys on what actually ran.
+        event_data = decision.model_dump()
+        event_data["intent"] = intent
+        yield {"event": "intent", "data": json.dumps(event_data)}
         rewritten = decision.rewritten_query or message
         query_result = None
         raw_chunks: Any = None
         # ---- 2. Route ------------------------------------------------
+        if intent == "structured_flow":
             try:
                 # One memoizing reader per request: the same catalog is otherwise
                 # re-fetched from the catalog DB 4-5x across the slow-path run. This
                 from ..catalog.reader import MemoizingCatalogReader
                 req_reader = MemoizingCatalogReader(self._get_catalog_reader())
+                # #10: scope every catalog read — the Planner's AND the data-access
+                # tools' own re-reads — to the analysis's bound sources, so binding
+                # is a boundary, not just a planner hint (T-B). Fail-open (T-C).
+                bound = await self._bound_source_ids(analysis_id)
+                reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader
+                catalog = await reader.read(user_id, "structured")
                 if self._enable_slow_path:
                     async for event in self._run_slow_path(
+                        user_id, rewritten, catalog, tracer, reader, analysis_id
                     ):
                         yield event
                     return
                 )
                 yield {"event": "error", "data": f"Structured query failed: {e}"}
                 return
+        elif intent == "unstructured_flow":
             try:
                 raw_chunks = await self._get_document_retriever().retrieve(
                     rewritten, user_id
                 )
                 chunks = _normalize_chunks(raw_chunks)
             except Exception as e:
                 logger.error(
                     "unstructured route failed", user_id=user_id, error=str(e)
                 )
                 yield {"event": "error", "data": f"Document retrieval failed: {e}"}
                 return
+        elif intent == "check":
+            try:
+                invoker = self._get_check_invoker(user_id)
+                text = await run_check(rewritten, invoker)
+            except Exception as e:
+                logger.error("check route failed", user_id=user_id, error=str(e))
+                yield {"event": "error", "data": f"Lookup failed: {e}"}
+                return
+            yield {"event": "chunk", "data": text}
+            yield {"event": "done", "data": ""}
+            return
+        elif intent == "problem_statement":
+            try:
+                text = await run_problem_statement(
+                    message,
+                    analysis_id,
+                    agent=self._get_ps_agent(),
+                    store=self._get_state_store(),
+                    history=history,
+                )
+            except Exception as e:
+                logger.error("problem_statement route failed", user_id=user_id, error=str(e))
+                yield {"event": "error", "data": f"Problem statement failed: {e}"}
+                return
+            yield {"event": "chunk", "data": text}
+            yield {"event": "done", "data": ""}
+            return
+        elif intent == "help":
+            try:
+                state = analysis_state or await self._load_analysis_state(analysis_id)
+            except Exception as e:
+                logger.error("help route failed", user_id=user_id, error=str(e))
+                yield {"event": "error", "data": f"Help failed: {e}"}
+                return
+            # report_ready (seam #5): deterministic — validated goal + ≥1 recorded
+            # analysis (mirrors the report API's own 409 gate). Never-throws (fails
+            # closed to not-ready), so Help degrades safely. The consistency guard in
+            # HelpAgent only offers `generate_report` when this says ready.
+            from .report.readiness import is_report_ready
+            report_ready = await is_report_ready(analysis_id, state)
+            # The prompt sees chat history -> masked.
+            hc = tracer.callbacks(masked=True)
+            hkw = {"callbacks": hc} if hc else {}
+            try:
+                async for token in self._get_help_agent().astream(
+                    state,
+                    history=history,
+                    message=message,
+                    report_ready=report_ready,
+                    **hkw,
+                ):
+                    yield {"event": "chunk", "data": token}
+            except Exception as e:
+                logger.error("help streaming failed", user_id=user_id, error=str(e))
+                yield {"event": "error", "data": f"Help generation failed: {e}"}
+                return
+            tracer.end()
+            yield {"event": "done", "data": ""}
+            return
         # else: chat path — no context
         # ---- 2b. Emit sources ---------------------------------------
+        sources = _build_sources(intent, user_id, query_result, raw_chunks)
+        logger.info(
+            "built sources",
+            intent=intent,
+            sources_count=len(sources),
+            raw_chunks_count=len(raw_chunks) if raw_chunks else 0,
         )
         yield {"event": "sources", "data": json.dumps(sources)}
         # ---- 3. Stream answer ----------------------------------------
     def _get_analysis_store(self) -> AnalysisStore:
         if self._analysis_store is None:
+            from .slow_path.store import PostgresAnalysisStore
+            self._analysis_store = PostgresAnalysisStore()
         return self._analysis_store
     async def _run_slow_path(
         catalog: Any,
         tracer: Any = None,
         catalog_reader: CatalogReader | None = None,
+        analysis_id: str | None = None,
     ) -> AsyncIterator[dict[str, Any]]:
         """Run the slow path and stream its assembled answer as SSE events.
         Context comes from the `get_business_context` seam (a stub today); the
+        `analysis_record` is persisted via the `AnalysisStore` seam (PostgresAnalysisStore),
+        stamped with the request's user_id + analysis_id so the report can group it.
         `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
         object — true token streaming is a later step).
         """
         yield {"event": "sources", "data": json.dumps([])}  # TODO: derive from record
         yield {"event": "chunk", "data": result.chat_answer}
         try:
+            # Stamp identity from the request scope: owner + the shared session id
+            # (analysis_id == room_id). Without analysis_id the record is orphaned —
+            # list_for_analysis can't find it, so the report + is_report_ready go
+            # blind. The store is never-throw.
+            record = result.analysis_record.model_copy(
+                update={"user_id": user_id, "analysis_id": analysis_id}
+            )
+            await self._get_analysis_store().save(record)
         except Exception as e:  # persistence must never break the user's answer
             logger.error("analysis_record persist failed", user_id=user_id, error=str(e))
         tracer.end()  # output omitted (chat_answer may contain PII on Cloud)
         yield {"event": "done", "data": ""}
+class _ScopedCatalogReader:
+    """Wraps a CatalogReader, restricting `structured` reads to an analysis's bound
+    sources (#10).
+    Scoping lives here — not at a single call site — so the Planner AND the
+    data-access tools (which re-read the catalog themselves) see the same scoped
+    view; otherwise binding is only a hint to the Planner while the executor runs
+    against the full catalog. Fail-open: an empty or fully-disjoint binding yields
+    the whole catalog, so a stale / cross-source binding degrades instead of
+    emptying the catalog. Only `structured` reads are scoped (all #10 binds today);
+    `unstructured` / retrieval reads pass through.
+    """
+    def __init__(self, inner: Any, bound: set[str]) -> None:
+        self._inner = inner
+        self._bound = bound
+    async def read(self, user_id: str, source_hint: str) -> Any:
+        catalog = await self._inner.read(user_id, source_hint)
+        if not self._bound or source_hint != "structured":
+            return catalog
+        scoped = [s for s in catalog.sources if s.source_id in self._bound]
+        return catalog.model_copy(update={"sources": scoped or catalog.sources})
 def _build_sources(
+    intent: str,
     user_id: str,
     query_result: Any,
     raw_chunks: Any,
 ) -> list[dict[str, Any]]:
     """Build the sources payload for the SSE `sources` event.
+    - structured_flow: one entry per executed table (table_name only).
+    - unstructured_flow: deduped by (document_id, page_label), Phase 1 shape.
     - chat or error: empty list.
     """
+    if intent == "structured_flow":
         if query_result is None or getattr(query_result, "error", None):
             return []
         table_name = getattr(query_result, "table_name", "") or ""
             "page_label": None,
         }]
+    if intent == "unstructured_flow" and raw_chunks:
         seen: set[tuple[Any, Any]] = set()
         sources: list[dict[str, Any]] = []
         for item in raw_chunks:

src/agents/gate.py ADDED Viewed

	@@ -0,0 +1,108 @@

+"""Deterministic routing gate — policy check over the router's intent.
+After the LLM router picks an intent, the gate checks it against the per-analysis
+Analysis State and returns the **effective** intent: allow as-is, or redirect. No
+LLM, no I/O in `gate()` itself.
+Only one rule has teeth in v1: an analytical request (`structured_flow`) requires a
+validated problem statement (`problem_validated is True`); otherwise it is
+redirected to `problem_statement` so the user defines the goal first. Everything
+else passes through. `generate_report` is not a router intent (button / report
+API), so it is not gated here.
+`AnalysisState` is the locked 8-field contract (mirrors the `analysis_states` DB
+table). `get_analysis_state` reads the real per-analysis row via `AnalysisStateStore`
+(#9, landed); it fails closed to a not-validated stub on a missing row or read error.
+See `ORCHESTRATOR_REWORK_PLAN.md` §4.
+"""
+from __future__ import annotations
+from datetime import UTC, datetime
+from pydantic import BaseModel
+from src.agents.orchestration import Intent
+from src.middlewares.logging import get_logger
+logger = get_logger("gate")
+class AnalysisState(BaseModel):
+    """Per-analysis state the gate + Help skill read every turn (locked contract).
+    `problem_validated` is the gate driver; `report_id` is null until a report
+    exists. Field names mirror the `analysis_states` table so the DB read swaps in
+    without touching readers.
+    """
+    id: str
+    analysis_title: str
+    problem_statement: str
+    problem_validated: bool = False
+    owner_id: str
+    report_id: str | None = None
+    created_at: datetime
+    updated_at: datetime
+def gate(intent: Intent, state: AnalysisState) -> Intent:
+    """Return the effective intent after applying the deterministic gate policy.
+    `structured_flow` requires `problem_validated is True`; otherwise redirect to
+    `problem_statement`. All other intents pass through unchanged.
+    """
+    if intent == "structured_flow" and not state.problem_validated:
+        logger.info(
+            "gate redirect",
+            requested=intent,
+            effective="problem_statement",
+            reason="problem_not_validated",
+        )
+        return "problem_statement"
+    return intent
+def stub_analysis_state(*, problem_validated: bool = False) -> AnalysisState:
+    """Hardcoded Analysis State for building/testing before the DB lands (#9).
+    Shared fixture so the gate, the Help skill, and tests all exercise the same
+    shape. `problem_validated=True` simulates a passed interview.
+    """
+    now = datetime.now(UTC)
+    return AnalysisState(
+        id="stub-analysis",
+        analysis_title="Stub analysis",
+        problem_statement="Stub problem statement" if problem_validated else "",
+        problem_validated=problem_validated,
+        owner_id="stub-user",
+        report_id=None,
+        created_at=now,
+        updated_at=now,
+    )
+async def get_analysis_state(analysis_id: str) -> AnalysisState:
+    """Load the Analysis State for an analysis (shared id with the chat room).
+    Reads the `analysis_states` row via `AnalysisStateStore`. Never-throw seam: a
+    missing row (e.g. a legacy room created before this table) or a read failure
+    degrades to a **not-validated** stub, so the gate fails closed (→ steer to
+    `problem_statement`) rather than running ungated analysis. The store import is
+    lazy so this module stays import-safe without a DB.
+    """
+    try:
+        from src.agents.state_store import AnalysisStateStore
+        state = await AnalysisStateStore().get(analysis_id)
+    except Exception as exc:  # noqa: BLE001 — never-throw; fail closed to not-validated
+        logger.warning(
+            "get_analysis_state read failed — default not-validated",
+            analysis_id=analysis_id,
+            error=str(exc),
+        )
+        return stub_analysis_state(problem_validated=False)
+    if state is None:
+        logger.debug("analysis_state missing — default not-validated", analysis_id=analysis_id)
+        return stub_analysis_state(problem_validated=False)
+    return state

src/agents/handlers/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ """Deterministic skill handlers dispatched by the orchestrator (non-LLM)."""

src/agents/handlers/check.py ADDED Viewed

	@@ -0,0 +1,165 @@

+"""`check` skill handler — deterministic data/document inventory (no LLM).
+The router emits a single `check` intent; this handler picks the concrete tool
+(`check_data` for structured sources, `check_knowledge` for documents) and renders
+the tool's `ToolOutput` table into a markdown reply. Broad queries with no
+specific cue call both tools concurrently and stitch a helicopter-view inventory.
+See `ORCHESTRATOR_REWORK_PLAN.md` §2.
+The data-access invoker never throws (§8.4); `render_tool_output` handles the
+`error` envelope defensively.
+"""
+from __future__ import annotations
+import asyncio
+import re
+from typing import TYPE_CHECKING
+from src.tools.contracts import ToolOutput
+if TYPE_CHECKING:
+    from src.agents.slow_path.invoker import ToolInvoker
+# Cues that point at documents rather than structured data.
+_KNOWLEDGE_CUES = (
+    "document",
+    "docs",
+    "doc ",
+    "file",
+    "pdf",
+    "docx",
+    ".txt",
+    "uploaded",
+    "knowledge",
+    "dokumen",
+)
+# Cues that point at structured/tabular data specifically.
+_DATA_CUES = (
+    "kolom",
+    "column",
+    "tabel",
+    "table",
+    "baris",
+    "row",
+    "schema",
+    "skema",
+    "database",
+    " db",
+)
+def _intent(message: str) -> str:
+    """Return 'knowledge', 'data', or 'both' (helicopter view) from keyword cues."""
+    lowered = message.lower()
+    is_knowledge = any(cue in lowered for cue in _KNOWLEDGE_CUES)
+    is_data = any(cue in lowered for cue in _DATA_CUES)
+    if is_knowledge and not is_data:
+        return "knowledge"
+    if is_data and not is_knowledge:
+        return "data"
+    return "both"
+def render_tool_output(out: ToolOutput) -> str:
+    """Render a `check_*` ToolOutput table into a markdown string, or '' if empty."""
+    if out.kind == "error":
+        return f"Sorry, I couldn't look that up: {out.error}"
+    columns = out.columns or []
+    rows = out.rows or []
+    if not rows:
+        return ""
+    header = "| " + " | ".join(columns) + " |"
+    separator = "| " + " | ".join("---" for _ in columns) + " |"
+    body = "\n".join(
+        "| " + " | ".join(str(cell) for cell in row) + " |" for row in rows
+    )
+    return f"{header}\n{separator}\n{body}"
+def _matched_source_ids(message: str, inventory: ToolOutput) -> list[str]:
+    """All source_ids whose name appears as a whole word in the message.
+    The user names sources in plain words ("sales", "kolom sales sama orders");
+    the tool needs exact `source_id`s. We resolve them against the inventory
+    rows (kind="table", columns include "source_id" + "name") instead of an LLM
+    — a cheap match against catalog metadata already in hand. Whole-word match
+    (`\\b`) avoids nuisance hits ("orders" inside "reorders") and treats `_` as
+    part of the word, so "sales" won't pick up "sales_archive". Multiple named
+    sources all match, so the caller can show each schema.
+    """
+    if inventory.kind != "table" or not inventory.rows:
+        return []
+    cols = inventory.columns or []
+    try:
+        id_idx = cols.index("source_id")
+        name_idx = cols.index("name")
+    except ValueError:
+        return []
+    matched: list[str] = []
+    for row in inventory.rows:
+        name = str(row[name_idx])
+        if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE):
+            matched.append(str(row[id_idx]))
+    return matched
+def _render_helicopter(data_out: ToolOutput, knowledge_out: ToolOutput) -> str:
+    """Stitch structured + document inventory into one helicopter-view reply."""
+    parts: list[str] = []
+    data_table = render_tool_output(data_out)
+    if data_table:
+        parts.append(f"**Data terstruktur**\n{data_table}")
+    knowledge_table = render_tool_output(knowledge_out)
+    if knowledge_table:
+        parts.append(f"**Dokumen**\n{knowledge_table}")
+    if not parts:
+        return "Nothing registered yet — I don't see any sources or documents."
+    return "\n\n".join(parts)
+async def run_check(message: str, invoker: ToolInvoker) -> str:
+    """Route to check_data, check_knowledge, or both (helicopter view) based on cues."""
+    intent = _intent(message)
+    _no_match = "Nothing registered yet — I don't see any matching sources."
+    if intent == "knowledge":
+        out = await invoker.invoke("check_knowledge", {})
+        return render_tool_output(out) or _no_match
+    if intent == "data":
+        inventory = await invoker.invoke("check_data", {})
+        if inventory.kind == "error":
+            return render_tool_output(inventory)
+        # Drill down to the schema of each source the user named; if they named
+        # none, return the source listing.
+        source_ids = _matched_source_ids(message, inventory)
+        if not source_ids:
+            return render_tool_output(inventory) or _no_match
+        schemas = await asyncio.gather(
+            *(invoker.invoke("check_data", {"source_id": sid}) for sid in source_ids)
+        )
+        if len(schemas) == 1:
+            return render_tool_output(schemas[0]) or _no_match
+        # Multiple named sources → one labelled section per source.
+        sections: list[str] = []
+        for out in schemas:
+            table = render_tool_output(out)
+            if table:
+                label = (out.meta or {}).get("source_name") or "source"
+                sections.append(f"**{label}**\n{table}")
+        return "\n\n".join(sections) or _no_match
+    # broad / ambiguous → helicopter view: call both concurrently
+    data_out, knowledge_out = await asyncio.gather(
+        invoker.invoke("check_data", {}),
+        invoker.invoke("check_knowledge", {}),
+    )
+    return _render_helicopter(data_out, knowledge_out)

src/agents/handlers/help.py ADDED Viewed

	@@ -0,0 +1,192 @@

+"""`help` skill handler — state-aware next-step guidance (LLM call).
+Reads the per-analysis state + chat history (and a deterministic report-readiness
+signal) and tells the user where they are and what to do next. Help only guides;
+it never runs analysis or produces data answers.
+The prompt lives in `config/prompts/help.md` (the playbook); this module composes
+the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency
+guard** has teeth here, not just in the prompt: `_derive_available_actions` computes
+the actions actually allowed from the state (the same policy as `gate.py`), and that
+list is fed into the prompt — the LLM is told to suggest *only* those, so it can't
+tell the user to generate a report when the goal isn't validated or the analysis
+isn't ready.
+SEAMS:
+  - `AnalysisState` is the locked 8-field contract from `gate.py` (KM-652). The gate,
+    this skill, and tests share `gate.stub_analysis_state(...)` so they exercise the
+    same shape.
+  - `ReportReadiness` is the return shape of `is_report_ready(chat_history)` (seam #5,
+    Rifqi — not built yet). Help *consumes* it; it does not compute it. Until it lands,
+    the caller passes a stub (default: not ready).
+"""
+from __future__ import annotations
+from collections.abc import AsyncIterator
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+from langchain_core.messages import BaseMessage
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+from src.agents.gate import AnalysisState
+from src.middlewares.logging import get_logger
+logger = get_logger("help")
+_PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompts"
+_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
+_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
+# Neutral human turn when Help is triggered by a slash command with no real content.
+_DEFAULT_TRIGGER = "What should I do next?"
+@dataclass
+class ReportReadiness:
+    """Deterministic report-readiness signal — the return of Rifqi's `is_report_ready`.
+    `missing` lists the gaps to fill when `ready` is False.
+    """
+    ready: bool = False
+    missing: list[str] = field(default_factory=list)
+def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]:
+    """Actions Help is allowed to suggest, derived from state (mirrors `gate.py`).
+    This is the consistency guard's teeth: analysis is gated behind a validated goal
+    (same rule the gate applies to `structured_flow`), and a report is only offered
+    when the readiness signal says so. Keep this policy in sync with `gate.gate`.
+    """
+    if not state.problem_validated:
+        # Goal not set → the only useful move is defining the problem statement.
+        return ["define_problem_statement"]
+    actions = ["ask_analysis_question", "refine_problem_statement"]
+    if report_ready.ready:
+        actions.append("generate_report")
+    return actions
+def _format_state(state: AnalysisState) -> str:
+    """Render the analysis state as a compact context block for the LLM."""
+    has_report = "yes" if state.report_id else "no"
+    return (
+        "[Analysis state]\n"
+        f"analysis_title: {state.analysis_title or '(none)'}\n"
+        f"problem_statement: {state.problem_statement or '(empty)'}\n"
+        f"problem_validated: {str(state.problem_validated).lower()}\n"
+        f"has_report: {has_report}"
+    )
+def _format_report_ready(report_ready: ReportReadiness) -> str:
+    missing = ", ".join(report_ready.missing) if report_ready.missing else "(none)"
+    return (
+        "[Report readiness]\n"
+        f"ready: {str(report_ready.ready).lower()}\n"
+        f"missing: {missing}"
+    )
+def _build_context_block(
+    state: AnalysisState,
+    report_ready: ReportReadiness,
+    available_actions: list[str],
+) -> str:
+    """Compose the deterministic context the prompt's 'never misguide' rule trusts."""
+    return "\n\n".join(
+        [
+            _format_state(state),
+            _format_report_ready(report_ready),
+            "[Available actions]\n" + ", ".join(available_actions),
+        ]
+    )
+def _load_system_prompt() -> str:
+    """Compose system prompt = help.md + guardrails.md (guardrails last, as elsewhere)."""
+    help_md = _SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")
+    guardrails = _GUARDRAILS_PATH.read_text(encoding="utf-8")
+    return f"{help_md}\n\n{guardrails}"
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0.3,
+        model_kwargs={"stream_options": {"include_usage": True}},
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_system_prompt()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            ("human", "{message}"),
+            ("system", "Analysis state and signals for this turn:\n\n{context}"),
+        ]
+    )
+    return prompt | llm | StrOutputParser()
+class HelpAgent:
+    """Streams state-aware guidance to the user.
+    `chain` is injectable: tests pass a fake that yields canned tokens. Default
+    constructs the production Azure OpenAI streaming chain on first use.
+    """
+    def __init__(self, chain: Runnable | None = None) -> None:
+        self._chain = chain
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+    async def astream(
+        self,
+        state: AnalysisState,
+        history: list[BaseMessage] | None = None,
+        report_ready: ReportReadiness | None = None,
+        message: str | None = None,
+        available_actions: list[str] | None = None,
+        callbacks: list | None = None,
+    ) -> AsyncIterator[str]:
+        """Stream tokens of the guidance reply.
+        `report_ready` defaults to "not ready" so a missing signal degrades safely.
+        `available_actions`, when omitted, is derived deterministically from state.
+        """
+        readiness = report_ready or ReportReadiness()
+        actions = available_actions or _derive_available_actions(state, readiness)
+        logger.info(
+            "help guidance",
+            problem_validated=state.problem_validated,
+            report_ready=readiness.ready,
+            available_actions=actions,
+        )
+        chain = self._ensure_chain()
+        payload: dict[str, Any] = {
+            "message": message or _DEFAULT_TRIGGER,
+            "history": history or [],
+            "context": _build_context_block(state, readiness, actions),
+        }
+        if callbacks:
+            async for token in chain.astream(payload, config={"callbacks": callbacks}):
+                yield token
+        else:
+            async for token in chain.astream(payload):
+                yield token

src/agents/handlers/problem_statement.py ADDED Viewed

	@@ -0,0 +1,171 @@

+"""Problem Statement skill — guide the user to a usable problem statement.
+Routed by the orchestrator (intent `problem_statement`) and callable as a skill.
+An LLM drafts/refines the statement from the analysis title + the user's message and
+declares what's still `missing`; a check validates only when nothing is missing. The
+model is instructed to fill `objective`/`metric` ONLY from what the user explicitly
+stated — a bare data question ("which X has the most Y?") leaves them in `missing`, so
+it does not auto-validate (the gate stays meaningful). On a valid draft it persists
+`problem_statement` + `problem_validated=True`; otherwise it streams guidance and
+leaves the analysis un-validated.
+NOTE: completeness is still a (hardened) LLM judgment — the truly deterministic gate
+is an explicit user confirmation, planned with the frontend (see T3b / #11).
+See `ORCHESTRATOR_REWORK_PLAN.md` §4 and the 2026-06-18 checkpoint.
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import TYPE_CHECKING
+from langchain_core.messages import BaseMessage
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+from pydantic import BaseModel, Field
+from src.middlewares.logging import get_logger
+if TYPE_CHECKING:
+    from src.agents.state_store import AnalysisStateStore
+logger = get_logger("problem_statement")
+_PROMPT_PATH = (
+    Path(__file__).resolve().parent.parent.parent
+    / "config"
+    / "prompts"
+    / "problem_statement.md"
+)
+class ProblemStatementDraft(BaseModel):
+    """LLM output for the Problem Statement skill."""
+    problem_statement: str = Field(
+        ..., description="The refined, standalone problem statement (never empty)."
+    )
+    objective: str = Field(
+        "", description="What success looks like — fill ONLY when the user explicitly "
+        "stated it; never inferred from a data question. Empty otherwise."
+    )
+    metric: str = Field(
+        "", description="The KPI to move/investigate — fill ONLY when the user "
+        "explicitly stated it; never inferred from a data question. Empty otherwise."
+    )
+    missing: list[str] = Field(
+        default_factory=list,
+        description="Which of 'objective' / 'metric' the user has NOT explicitly stated "
+        "yet. A bare data question leaves both here. Empty list = complete.",
+    )
+    feedback: str = Field(
+        ...,
+        description="Message to the user — guidance if incomplete, confirmation if complete.",
+    )
+def is_valid(draft: ProblemStatementDraft) -> bool:
+    """Complete iff there's a statement and the model flagged nothing missing.
+    Keying on the model's explicit `missing` list (rather than 'are objective/metric
+    non-empty') is what stops a bare data question from auto-validating: the hardened
+    prompt puts the un-stated parts in `missing`, so this returns False for it.
+    """
+    return bool(draft.problem_statement.strip()) and not draft.missing
+def _load_prompt_text() -> str:
+    return _PROMPT_PATH.read_text(encoding="utf-8")
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0,
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            ("system", _load_prompt_text()),
+            MessagesPlaceholder(variable_name="history", optional=True),
+            (
+                "human",
+                "Analysis title: {analysis_title}\n"
+                "Current problem statement: {current}\n\n"
+                "User message: {message}",
+            ),
+        ]
+    )
+    return prompt | llm.with_structured_output(ProblemStatementDraft)
+class ProblemStatementAgent:
+    """Single LLM call that drafts/refines a problem statement.
+    Inject `chain` for tests; the default builds the Azure OpenAI chain on first use.
+    """
+    def __init__(self, chain: Runnable | None = None) -> None:
+        self._chain = chain
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _build_default_chain()
+        return self._chain
+    async def draft(
+        self,
+        message: str,
+        analysis_title: str,
+        current: str,
+        history: list[BaseMessage] | None = None,
+    ) -> ProblemStatementDraft:
+        chain = self._ensure_chain()
+        return await chain.ainvoke(
+            {
+                "message": message,
+                "analysis_title": analysis_title,
+                "current": current,
+                "history": history or [],
+            }
+        )
+async def run_problem_statement(
+    message: str,
+    analysis_id: str | None,
+    *,
+    agent: ProblemStatementAgent,
+    store: AnalysisStateStore,
+    history: list[BaseMessage] | None = None,
+) -> str:
+    """Draft + validate the problem statement; persist on a valid draft.
+    Loads the current title/statement (if the analysis exists), drafts a refinement,
+    runs the deterministic completeness check, and writes `problem_statement` +
+    `problem_validated` back. Returns the user-facing feedback. When `analysis_id` is
+    missing (e.g. a legacy room), it still drafts + returns guidance but cannot persist.
+    """
+    analysis_title, current = "New analysis", ""
+    if analysis_id:
+        state = await store.get(analysis_id)
+        if state is not None:
+            analysis_title, current = state.analysis_title, state.problem_statement
+    draft = await agent.draft(message, analysis_title, current, history)
+    validated = is_valid(draft)
+    if analysis_id:
+        await store.update(
+            analysis_id,
+            problem_statement=draft.problem_statement,
+            problem_validated=validated,
+        )
+    logger.info("problem_statement drafted", analysis_id=analysis_id, validated=validated)
+    return draft.feedback

src/agents/orchestration.py CHANGED Viewed

@@ -1,13 +1,17 @@
-"""OrchestratorAgent — classifies a user message and emits source_hint.
-Output: needs_search (bool) + source_hint ∈ { chat, unstructured, structured }
-+ rewritten_query (standalone form of the user's question, history-resolved).
-Phase 2 replaces the previous intent-classification body. The class name
-is preserved so existing import sites (`from src.agents.orchestration
-import OrchestratorAgent`) keep working. The default LLM chain is
-constructed lazily so the module is import-safe even without `.env`
-populated.
 """
 from __future__ import annotations
@@ -25,7 +29,14 @@ from src.middlewares.logging import get_logger
 logger = get_logger("orchestrator")
-SourceHint = Literal["chat", "unstructured", "structured"]
 _PROMPT_PATH = (
     Path(__file__).resolve().parent.parent
@@ -35,21 +46,29 @@ _PROMPT_PATH = (
 )
-class IntentRouterDecision(BaseModel):
     """LLM output. Pydantic so it can be used with `with_structured_output`."""
-    needs_search: bool = Field(
-        ..., description="True if we must look at the user's data to answer."
-    )
-    source_hint: SourceHint = Field(
         ...,
-        description="Which downstream path: 'chat' (no lookup), "
-        "'unstructured' (PDF/DOCX/TXT prose), 'structured' (DB / tabular file).",
     )
     rewritten_query: str | None = Field(
         None,
-        description="Standalone version of the question, history-resolved. "
-        "Null when needs_search=false.",
     )
@@ -74,11 +93,11 @@ def _build_default_chain() -> Runnable:
             ("human", "{message}"),
         ]
     )
-    return prompt | llm.with_structured_output(IntentRouterDecision)
 class OrchestratorAgent:
-    """Classifies a user message into chat / unstructured / structured.
     Inject `structured_chain` for tests; default builds the production
     Azure OpenAI chain on first use.
@@ -97,18 +116,14 @@ class OrchestratorAgent:
         message: str,
         history: list[BaseMessage] | None = None,
         callbacks: list | None = None,
-    ) -> IntentRouterDecision:
         chain = self._ensure_chain()
         payload = {"message": message, "history": history or []}
         if callbacks:
-            decision: IntentRouterDecision = await chain.ainvoke(
                 payload, config={"callbacks": callbacks}
             )
         else:
             decision = await chain.ainvoke(payload)
-        logger.info(
-            "intent classified",
-            source_hint=decision.source_hint,
-            needs_search=decision.needs_search,
-        )
         return decision

+"""OrchestratorAgent — classifies a user message into one of six intents.
+Output: RouterDecision { intent, rewritten_query, confidence }.
+The router is a **handler-level** intent classifier, not a data-modality
+classifier: `structured_flow` routes to the slow Planner spine and
+`unstructured_flow` to the fast RAG path; the structured/unstructured data mix on
+the slow path is the Planner's job, not the router's. See
+`ORCHESTRATOR_REWORK_PLAN.md`.
+The class name `OrchestratorAgent` is preserved so existing import sites
+(`from src.agents.orchestration import OrchestratorAgent`) keep working. The
+default LLM chain is built lazily so the module is import-safe even without
+`.env` populated.
 """
 from __future__ import annotations
 logger = get_logger("orchestrator")
+Intent = Literal[
+    "chat",
+    "help",
+    "problem_statement",
+    "check",
+    "unstructured_flow",
+    "structured_flow",
+]
 _PROMPT_PATH = (
     Path(__file__).resolve().parent.parent
 )
+class RouterDecision(BaseModel):
     """LLM output. Pydantic so it can be used with `with_structured_output`."""
+    intent: Intent = Field(
         ...,
+        description=(
+            "Handler route for this message: 'chat' (conversational, no data), "
+            "'help' (what-to-do-next guidance), 'problem_statement' (define or "
+            "refine the analysis goal), 'check' (inventory: what data/documents "
+            "exist), 'unstructured_flow' (answer from documents, fast RAG), or "
+            "'structured_flow' (analytical question over data, slow Planner path)."
+        ),
     )
     rewritten_query: str | None = Field(
         None,
+        description=(
+            "Standalone version of the question, history-resolved. Null for "
+            "'chat' and 'help' (no data lookup needed)."
+        ),
+    )
+    confidence: float | None = Field(
+        None,
+        description="Classifier confidence in [0, 1]. Optional.",
     )
             ("human", "{message}"),
         ]
     )
+    return prompt | llm.with_structured_output(RouterDecision)
 class OrchestratorAgent:
+    """Classifies a user message into one of the six router intents.
     Inject `structured_chain` for tests; default builds the production
     Azure OpenAI chain on first use.
         message: str,
         history: list[BaseMessage] | None = None,
         callbacks: list | None = None,
+    ) -> RouterDecision:
         chain = self._ensure_chain()
         payload = {"message": message, "history": history or []}
         if callbacks:
+            decision: RouterDecision = await chain.ainvoke(
                 payload, config={"callbacks": callbacks}
             )
         else:
             decision = await chain.ainvoke(payload)
+        logger.info("intent classified", intent=decision.intent)
         return decision

src/agents/planner/examples.py CHANGED Viewed

@@ -2,7 +2,7 @@
 Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
 stages, dependency edges, ordered tool-call chains, inline QueryIR,
-"${t<id>}" placeholders, and the assumed data-flow convention — `query_structured`
 pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
 referencing the upstream result's column aliases (Pattern A; the tool team may
 instead pick self-fetch by `source_id`, in which case these examples are reshaped
@@ -21,9 +21,8 @@ from .schemas import Task, TaskList, ToolCall
 # --------------------------------------------------------------------------- #
 # Example A — exploratory, no modeling.
 # "Which product categories drove last quarter's revenue?"
-# Shows: query_structured pulls rows -> analyze_contribution computes each
-# category's share of the total in one call (no manual per-category + total
-# queries).
 # --------------------------------------------------------------------------- #
 _EXAMPLE_A = TaskList(
@@ -36,7 +35,7 @@ _EXAMPLE_A = TaskList(
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes category, revenue, and order date.",
-            tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; the 3 needed columns are present.",
             depends_on=[],
@@ -48,7 +47,7 @@ _EXAMPLE_A = TaskList(
             objective="Pull last quarter's order-level category and revenue rows.",
             tool_calls=[
                 ToolCall(
-                    tool="query_structured",
                     args={
                         "ir": {
                             "source_id": "src_sales",
@@ -78,20 +77,19 @@ _EXAMPLE_A = TaskList(
         Task(
             id="t3",
             stage="evaluation",
-            objective="Rank each category's revenue share of the quarter total.",
             tool_calls=[
                 ToolCall(
-                    tool="analyze_contribution",
                     args={
                         "data": "${t2}",
-                        "dimension": "category",
-                        "value_column": "revenue",
-                        "agg": "sum",
                     },
                 )
             ],
-            expected_output="category_contribution",
-            success_criteria="Produced each category's revenue share, ranked high to low.",
             depends_on=["t2"],
             estimated_cost="low",
         ),
@@ -113,7 +111,7 @@ _EXAMPLE_B = TaskList(
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes order date, revenue, and region.",
-            tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; the needed columns are present.",
             depends_on=[],
@@ -125,7 +123,7 @@ _EXAMPLE_B = TaskList(
             objective="Pull this year's order dates, revenue, and region.",
             tool_calls=[
                 ToolCall(
-                    tool="query_structured",
                     args={
                         "ir": {
                             "source_id": "src_sales",
@@ -189,8 +187,8 @@ _EXAMPLE_B = TaskList(
 # Example C — mixed structured + unstructured.
 # "Revenue dipped in Q1 — what happened?"
 # Shows: a structured branch (query -> analyze_trend) runs alongside an
-# INDEPENDENT retrieve_documents branch that pulls qualitative context. Note
-# retrieve_documents takes a natural-language `query` (NOT a `${t<id>}` data
 # placeholder — it is a source, not a consumer) and can run in parallel; the
 # Assembler folds the document context into the explanation.
 # --------------------------------------------------------------------------- #
@@ -205,7 +203,7 @@ _EXAMPLE_C = TaskList(
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes order date and revenue.",
-            tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; date and revenue columns present.",
             depends_on=[],
@@ -217,7 +215,7 @@ _EXAMPLE_C = TaskList(
             objective="Pull Q1 order dates and revenue.",
             tool_calls=[
                 ToolCall(
-                    tool="query_structured",
                     args={
                         "ir": {
                             "source_id": "src_sales",
@@ -275,7 +273,7 @@ _EXAMPLE_C = TaskList(
             objective="Retrieve qualitative context on Q1 operational events behind a dip.",
             tool_calls=[
                 ToolCall(
-                    tool="retrieve_documents",
                     args={
                         "query": "operational issues, outages, or notable events in Q1 2026",
                         "top_k": 5,
@@ -310,7 +308,7 @@ _EXAMPLE_D = TaskList(
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes region and revenue.",
-            tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; region and revenue present.",
             depends_on=[],
@@ -322,7 +320,7 @@ _EXAMPLE_D = TaskList(
             objective="Pull order-level region and revenue.",
             tool_calls=[
                 ToolCall(
-                    tool="query_structured",
                     args={
                         "ir": {
                             "source_id": "src_sales",

 Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
 stages, dependency edges, ordered tool-call chains, inline QueryIR,
+"${t<id>}" placeholders, and the assumed data-flow convention — `retrieve_data`
 pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
 referencing the upstream result's column aliases (Pattern A; the tool team may
 instead pick self-fetch by `source_id`, in which case these examples are reshaped
 # --------------------------------------------------------------------------- #
 # Example A — exploratory, no modeling.
 # "Which product categories drove last quarter's revenue?"
+# Shows: retrieve_data pulls rows -> analyze_aggregate sums revenue per
+# category in one call (no manual per-category queries).
 # --------------------------------------------------------------------------- #
 _EXAMPLE_A = TaskList(
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes category, revenue, and order date.",
+            tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; the 3 needed columns are present.",
             depends_on=[],
             objective="Pull last quarter's order-level category and revenue rows.",
             tool_calls=[
                 ToolCall(
+                    tool="retrieve_data",
                     args={
                         "ir": {
                             "source_id": "src_sales",
         Task(
             id="t3",
             stage="evaluation",
+            objective="Sum revenue per category for the quarter.",
             tool_calls=[
                 ToolCall(
+                    tool="analyze_aggregate",
                     args={
                         "data": "${t2}",
+                        "aggregations": {"revenue": ["sum"]},
+                        "group_by": ["category"],
                     },
                 )
             ],
+            expected_output="category_revenue",
+            success_criteria="Produced total revenue per category, one row each.",
             depends_on=["t2"],
             estimated_cost="low",
         ),
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes order date, revenue, and region.",
+            tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; the needed columns are present.",
             depends_on=[],
             objective="Pull this year's order dates, revenue, and region.",
             tool_calls=[
                 ToolCall(
+                    tool="retrieve_data",
                     args={
                         "ir": {
                             "source_id": "src_sales",
 # Example C — mixed structured + unstructured.
 # "Revenue dipped in Q1 — what happened?"
 # Shows: a structured branch (query -> analyze_trend) runs alongside an
+# INDEPENDENT retrieve_knowledge branch that pulls qualitative context. Note
+# retrieve_knowledge takes a natural-language `query` (NOT a `${t<id>}` data
 # placeholder — it is a source, not a consumer) and can run in parallel; the
 # Assembler folds the document context into the explanation.
 # --------------------------------------------------------------------------- #
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes order date and revenue.",
+            tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; date and revenue columns present.",
             depends_on=[],
             objective="Pull Q1 order dates and revenue.",
             tool_calls=[
                 ToolCall(
+                    tool="retrieve_data",
                     args={
                         "ir": {
                             "source_id": "src_sales",
             objective="Retrieve qualitative context on Q1 operational events behind a dip.",
             tool_calls=[
                 ToolCall(
+                    tool="retrieve_knowledge",
                     args={
                         "query": "operational issues, outages, or notable events in Q1 2026",
                         "top_k": 5,
             id="t1",
             stage="data_understanding",
             objective="Confirm the sales source exposes region and revenue.",
+            tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
             expected_output="source_shape",
             success_criteria="Produced the orders table schema; region and revenue present.",
             depends_on=[],
             objective="Pull order-level region and revenue.",
             tool_calls=[
                 ToolCall(
+                    tool="retrieve_data",
                     args={
                         "ir": {
                             "source_id": "src_sales",

src/agents/planner/inputs.py CHANGED Viewed

@@ -4,9 +4,9 @@
 for the planner prompt. It carries every table + column id/type/PII flag + row
 counts + low-cardinality top_values, with `sample_values` nulled on PII columns
 (INV: no PII sample values into the prompt, see doc §13). It also lists the
-available unstructured sources so the planner can plan `retrieve_documents`.
-The planner *validator* still checks inline `query_structured` IRs against the
 full `Catalog` via the existing IRValidator — the summary is a prompt input, not
 the validation source of truth.
@@ -124,7 +124,7 @@ class CatalogSummary(BaseModel):
             lines.append("")
         if self.unstructured_sources:
-            lines.append("Unstructured sources (for retrieve_documents):")
             for src in self.unstructured_sources:
                 lines.append(f"  - {src.name} — id={src.source_id}")

 for the planner prompt. It carries every table + column id/type/PII flag + row
 counts + low-cardinality top_values, with `sample_values` nulled on PII columns
 (INV: no PII sample values into the prompt, see doc §13). It also lists the
+available unstructured sources so the planner can plan `retrieve_knowledge`.
+The planner *validator* still checks inline `retrieve_data` IRs against the
 full `Catalog` via the existing IRValidator — the summary is a prompt input, not
 the validation source of truth.
             lines.append("")
         if self.unstructured_sources:
+            lines.append("Unstructured sources (for retrieve_knowledge):")
             for src in self.unstructured_sources:
                 lines.append(f"  - {src.name} — id={src.source_id}")

src/agents/planner/registry.py CHANGED Viewed

@@ -7,8 +7,8 @@ outside it).
   `src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
   `ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
   descriptions (KM-625). No longer a stub on our side — it tracks the real registry.
-- **Data access (`query_structured` / `retrieve_documents` / `list_sources` /
-  `describe_source`) — spec BODIES still a local stub.** The tool team owns these too,
   but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
   spec bodies here so the Planner can plan end-to-end — but the NAMES derive from
   `src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
@@ -16,10 +16,10 @@ outside it).
   this slice and swap `default_registry()` for the tool team's full composition.
 **Confirmed conventions (KM-465):** Pattern A — `analyze_*` tools take a `data`
-`"${t<id>}"` placeholder pointing at an upstream `query_structured` output (no
 self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
 lightweight `{required, properties}` dict the planner validator (check #8) reads;
-`query_structured.args["ir"]` carries an inline QueryIR validated against the
 catalog by the existing IRValidator.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
@@ -38,25 +38,32 @@ from .contracts import ToolRegistry, ToolSpec
 # --------------------------------------------------------------------------- #
 _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
     ToolSpec(
-        name="query_structured",
         category="analytics.query",
         input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
         output_kind="table",
         description=(
-            "Run one validated, single-table query against a structured source (DB "
-            "schema or tabular file) and return rows. The `ir` argument is an inline "
-            "QueryIR (the JSON intent: source_id, table_id, select, filters, group_by, "
-            "order_by, limit) — never SQL. This is the data-access entry point: use it "
-            "to select, filter, and pull the rows the analytics (`analyze_*`) tools "
-            "then consume. It also does simple built-in aggregation the IR can express "
-            "(count/sum/avg/min/max/count_distinct). Do NOT use it for richer statistics "
             "(median/percentile/mode/stddev/skew → analyze_descriptive), trends "
             "(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
-            "use it to read documents (use retrieve_documents)."
         ),
     ),
     ToolSpec(
-        name="retrieve_documents",
         category="retrieval.documents",
         input_schema={
             "required": ["query"],
@@ -71,32 +78,36 @@ _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
             "Dense-retrieve the most relevant chunks from the user's unstructured "
             "sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
             "qualitative context into an analysis. Optionally scope to one `source_id`. "
-            "Do NOT use it for numbers in tables — that is query_structured's job."
         ),
     ),
     ToolSpec(
-        name="list_sources",
         category="catalog.introspection",
-        input_schema={"required": [], "properties": {}},
         output_kind="table",
         description=(
-            "List the user's available data sources (id, name, type, table count). Use "
-            "early in data_understanding when the plan must discover what exists before "
-            "querying. Cheap. Do NOT use it to read column details (use describe_source)."
         ),
     ),
     ToolSpec(
-        name="describe_source",
         category="catalog.introspection",
-        input_schema={
-            "required": ["source_id"],
-            "properties": {"source_id": {"type": "string"}},
-        },
         output_kind="table",
         description=(
-            "Return the tables and columns (names, types, row counts) of one source by "
-            "`source_id`. Use in data_understanding to confirm the shape of a source "
-            "before querying it. Do NOT use it to fetch data rows (use query_structured)."
         ),
     ),
 )

   `src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
   `ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
   descriptions (KM-625). No longer a stub on our side — it tracks the real registry.
+- **Data access (`retrieve_data` / `retrieve_knowledge` / `check_data` /
+  `check_knowledge`) — spec BODIES still a local stub.** The tool team owns these too,
   but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
   spec bodies here so the Planner can plan end-to-end — but the NAMES derive from
   `src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
   this slice and swap `default_registry()` for the tool team's full composition.
 **Confirmed conventions (KM-465):** Pattern A — `analyze_*` tools take a `data`
+`"${t<id>}"` placeholder pointing at an upstream `retrieve_data` output (no
 self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
 lightweight `{required, properties}` dict the planner validator (check #8) reads;
+`retrieve_data.args["ir"]` carries an inline QueryIR validated against the
 catalog by the existing IRValidator.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
 # --------------------------------------------------------------------------- #
 _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
     ToolSpec(
+        name="retrieve_data",
         category="analytics.query",
         input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
         output_kind="table",
         description=(
+            "Run one validated query against a structured source and return rows. The "
+            "`ir` argument is an inline QueryIR (the JSON intent: source_id, table_id, "
+            "joins, select, filters, group_by, order_by, limit) — never SQL. This is the "
+            "data-access entry point: use it to select, filter, and pull the rows the "
+            "analytics (`analyze_*`) tools then consume. It also does simple built-in "
+            "aggregation the IR can express (count/sum/avg/min/max/count_distinct). "
+            "JOINS (database sources only): to group a measure in one table by a "
+            "dimension in a RELATED table, add a `joins` entry "
+            "({target_table_id, left_column_id, right_column_id}) along a declared "
+            "foreign key — e.g. sum order_items.line_total grouped by products.category "
+            "via order_items.product_id = products.id. Prefer an existing measure column "
+            "(e.g. line_total) over recomputing, and a single table when the measure and "
+            "dimension already live together. Joins are NOT supported on tabular/file "
+            "sources yet. Do NOT use this for richer statistics "
             "(median/percentile/mode/stddev/skew → analyze_descriptive), trends "
             "(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
+            "use it to read documents (use retrieve_knowledge)."
         ),
     ),
     ToolSpec(
+        name="retrieve_knowledge",
         category="retrieval.documents",
         input_schema={
             "required": ["query"],
             "Dense-retrieve the most relevant chunks from the user's unstructured "
             "sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
             "qualitative context into an analysis. Optionally scope to one `source_id`. "
+            "Do NOT use it for numbers in tables — that is retrieve_data's job."
         ),
     ),
     ToolSpec(
+        name="check_data",
         category="catalog.introspection",
+        input_schema={
+            "required": [],
+            "properties": {"source_id": {"type": "string"}},
+        },
         output_kind="table",
         description=(
+            "Inspect the user's structured data sources (DB + tabular). With no "
+            "arguments, lists the sources (id, name, type, table count) — use early in "
+            "data_understanding to discover what exists. With a `source_id`, returns that "
+            "source's tables and columns (names, types, row counts) — use to confirm a "
+            "source's shape before querying it. Cheap. Do NOT use it to fetch data rows "
+            "(use retrieve_data) or to inspect documents (use check_knowledge)."
         ),
     ),
     ToolSpec(
+        name="check_knowledge",
         category="catalog.introspection",
+        input_schema={"required": [], "properties": {}},
         output_kind="table",
         description=(
+            "List the user's unstructured sources / documents (id, name, type). Use in "
+            "data_understanding to discover what qualitative material exists before "
+            "retrieving from it. Do NOT use it to read document content (use "
+            "retrieve_knowledge) or to inspect structured data (use check_data)."
         ),
     ),
 )

src/agents/planner/service.py CHANGED Viewed

@@ -9,7 +9,7 @@ static plan.
 The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
 the PII-safe `CatalogSummary` for the prompt, but validation needs the full
-catalog so the existing `IRValidator` can check inline `query_structured` IRs.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
 """

 The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
 the PII-safe `CatalogSummary` for the prompt, but validation needs the full
+catalog so the existing `IRValidator` can check inline `retrieve_data` IRs.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
 """

src/agents/planner/validator.py CHANGED Viewed

@@ -95,8 +95,8 @@ class PlannerValidator:
                             f"source_id {src!r} (known: {sorted(known_sources)})"
                         )
-                # Check 8b — inline query_structured IR validates against the catalog.
-                if call.tool == "query_structured":
                     self._validate_inline_ir(task.id, call.args, catalog)
             # Check 7 — success_criteria is checkable.
@@ -114,20 +114,20 @@ class PlannerValidator:
         raw_ir = args.get("ir")
         if not isinstance(raw_ir, dict):
             raise PlannerValidationError(
-                f"task {task_id}: query_structured.args.ir must be an inline QueryIR "
                 f"object, got {type(raw_ir).__name__}"
             )
         try:
             ir = QueryIR.model_validate(raw_ir)
         except ValidationError as e:
             raise PlannerValidationError(
-                f"task {task_id}: query_structured.args.ir is not a valid QueryIR: {e}"
             ) from e
         try:
             self._ir_validator.validate(ir, catalog)
         except IRValidationError as e:
             raise PlannerValidationError(
-                f"task {task_id}: query_structured IR failed catalog validation: {e}"
             ) from e
     @staticmethod

                             f"source_id {src!r} (known: {sorted(known_sources)})"
                         )
+                # Check 8b — inline retrieve_data IR validates against the catalog.
+                if call.tool == "retrieve_data":
                     self._validate_inline_ir(task.id, call.args, catalog)
             # Check 7 — success_criteria is checkable.
         raw_ir = args.get("ir")
         if not isinstance(raw_ir, dict):
             raise PlannerValidationError(
+                f"task {task_id}: retrieve_data.args.ir must be an inline QueryIR "
                 f"object, got {type(raw_ir).__name__}"
             )
         try:
             ir = QueryIR.model_validate(raw_ir)
         except ValidationError as e:
             raise PlannerValidationError(
+                f"task {task_id}: retrieve_data.args.ir is not a valid QueryIR: {e}"
             ) from e
         try:
             self._ir_validator.validate(ir, catalog)
         except IRValidationError as e:
             raise PlannerValidationError(
+                f"task {task_id}: retrieve_data IR failed catalog validation: {e}"
             ) from e
     @staticmethod

src/agents/report/__init__.py ADDED Viewed

	@@ -0,0 +1,9 @@

+"""Report generator (KM-644).
+A button-triggered *service* — not a chat skill, not a slow-path agent. It turns a
+session's persisted `AnalysisRecord`s + Problem Statement into a versioned,
+business-readable `AnalysisReport`. Architecturally it mirrors the Assembler: one
+constrained LLM call (the executive summary) wrapped in deterministic assembly that
+copies every other field verbatim from the records (INV-4). Reports are immutable
+per version and persisted to the `analysis_reports` table.
+"""

src/agents/report/errors.py ADDED Viewed

	@@ -0,0 +1,7 @@

+"""Typed errors for the report generator (KM-644)."""
+from __future__ import annotations
+class ReportError(Exception):
+    """The report could not be generated (e.g. no records for the analysis)."""

src/agents/report/generator.py ADDED Viewed

	@@ -0,0 +1,363 @@

+"""ReportGenerator — turns a session's AnalysisRecords into an AnalysisReport (KM-644).
+A button-triggered service shaped like the Assembler: deterministic assembly of the
+records (findings/caveats/open_questions/data_sources/method_steps, copied verbatim —
+INV-4) wrapped around exactly ONE LLM call that authors only the executive summary.
+If that call fails the report is still returned with a deterministic fallback
+summary (decision D1) — the deterministic body is the real value.
+Versioning + persistence live in `ReportStore`; this service does generation only
+(returns an `AnalysisReport` with `version=0`; the store assigns the real version).
+Chain construction mirrors `agents/slow_path/assembler.py`.
+"""
+from __future__ import annotations
+from datetime import UTC, datetime
+from pathlib import Path
+from langchain_core.messages import SystemMessage
+from langchain_core.prompts import ChatPromptTemplate
+from langchain_core.runnables import Runnable
+from langchain_openai import AzureChatOpenAI
+from src.middlewares.logging import get_logger
+from ..slow_path.schemas import AnalysisRecord, TaskSummary
+from .errors import ReportError
+from .schemas import (
+    AnalysisReport,
+    AttributedNote,
+    DataSourceRef,
+    ProblemStatement,
+    ReportFinding,
+    ReportSummaryNarrative,
+)
+logger = get_logger("report_generator")
+_FALLBACK_SUMMARY = "Automated summary unavailable — see the findings below."
+# CRISP-DM phases in narrative order, with human labels for the method appendix.
+_STAGE_LABELS: list[tuple[str, str]] = [
+    ("data_understanding", "Data understanding"),
+    ("data_preparation", "Data preparation"),
+    ("modeling", "Modeling"),
+    ("evaluation", "Evaluation"),
+]
+_PROMPT_PATH = (
+    Path(__file__).resolve().parent.parent.parent / "config" / "prompts" / "report_summary.md"
+)
+def _load_prompt_text() -> str:
+    return _PROMPT_PATH.read_text(encoding="utf-8")
+def _build_default_chain() -> Runnable:
+    from src.config.settings import settings
+    llm = AzureChatOpenAI(
+        azure_deployment=settings.azureai_deployment_name_4o,
+        openai_api_version=settings.azureai_api_version_4o,
+        azure_endpoint=settings.azureai_endpoint_url_4o,
+        api_key=settings.azureai_api_key_4o,
+        temperature=0,
+    )
+    prompt = ChatPromptTemplate.from_messages(
+        [
+            SystemMessage(content=_load_prompt_text()),
+            ("human", "{human_content}"),
+        ]
+    )
+    return prompt | llm.with_structured_output(ReportSummaryNarrative)
+_default_chain: Runnable | None = None
+def _get_default_chain() -> Runnable:
+    global _default_chain
+    if _default_chain is None:
+        _default_chain = _build_default_chain()
+    return _default_chain
+# --------------------------------------------------------------------------- #
+# Deterministic assembly (pure; no LLM, no I/O) — easy to unit-test.
+# --------------------------------------------------------------------------- #
+def _collect_findings(records: list[AnalysisRecord]) -> list[ReportFinding]:
+    # Findings are distinct insights — not deduped; each traces to its record.
+    return [
+        ReportFinding(text=text, record_ids=[rec.record_id])
+        for rec in records
+        for text in rec.findings
+    ]
+def _collect_notes(records: list[AnalysisRecord], field: str) -> list[AttributedNote]:
+    # Caveats / open_questions are deduped by text; a merged note cites every
+    # record it came from (plural record_ids).
+    merged: dict[str, list[str]] = {}
+    for rec in records:
+        for text in getattr(rec, field):
+            ids = merged.setdefault(text, [])
+            if rec.record_id not in ids:
+                ids.append(rec.record_id)
+    return [AttributedNote(text=text, record_ids=ids) for text, ids in merged.items()]
+def _collect_method_steps(records: list[AnalysisRecord]) -> list[TaskSummary]:
+    steps: list[TaskSummary] = []
+    for rec in records:
+        steps.extend(rec.tasks_run)
+    return steps
+def _build_data_sources(
+    records: list[AnalysisRecord], catalog, bound_ids: list[str] | None = None
+) -> list[DataSourceRef]:
+    """Freeze real catalog metadata for the sources this analysis used.
+    When the analysis has a data-source binding (#10), the candidate set is scoped
+    to the bound sources first (fail-open if the binding doesn't intersect the
+    catalog). Within that set, matches catalog sources against the records'
+    (narrative) `data_used` by name/id; falls back to all (bound) sources, then to
+    bare `data_used` strings if no catalog is available — so the section is always
+    populated, best-effort.
+    """
+    if catalog is None or not catalog.sources:
+        seen: list[str] = []
+        for rec in records:
+            for du in rec.data_used:
+                if du not in seen:
+                    seen.append(du)
+        return [DataSourceRef(source_id=d, name=d, source_type="", detail={}) for d in seen]
+    candidates = catalog.sources
+    if bound_ids:
+        scoped = [s for s in candidates if s.source_id in set(bound_ids)]
+        candidates = scoped or candidates  # fail-open if binding doesn't match catalog
+    def _ref(s) -> DataSourceRef:
+        return DataSourceRef(
+            source_id=s.source_id,
+            name=s.name,
+            source_type=s.source_type,
+            detail={
+                "tables": [t.name for t in s.tables],
+                "row_count": sum((t.row_count or 0) for t in s.tables) or None,
+                "columns": [c.name for t in s.tables for c in t.columns],
+            },
+        )
+    used = " ".join(du for rec in records for du in rec.data_used).lower()
+    matched = [
+        _ref(s)
+        for s in candidates
+        if s.name.lower() in used or s.source_id.lower() in used
+    ]
+    return matched or [_ref(s) for s in candidates]
+def _build_human_content(
+    ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
+) -> str:
+    sections = []
+    ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
+    if ps_lines:
+        sections.append("# Problem Statement\n" + "\n".join(ps_lines))
+    sections.append(
+        "# Findings (already finalized — synthesize, do not add numbers)\n"
+        + "\n".join(f"- {f.text}" for f in findings)
+    )
+    if caveats:
+        sections.append("# Caveats\n" + "\n".join(f"- {c.text}" for c in caveats))
+    return "\n\n".join(sections)
+def _render_markdown(report: AnalysisReport) -> str:
+    # Version is deliberately NOT in the markdown — it is assigned by the store
+    # after rendering and lives in the structured `version` field / API metadata.
+    parts: list[str] = ["# Analysis Report"]
+    parts.append(
+        f"*Generated {report.generated_at:%Y-%m-%d} · "
+        f"{len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"
+    )
+    ps = report.problem_statement
+    ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
+    if ps_lines:
+        parts.append("## Problem Statement\n" + " ".join(ps_lines))
+    if report.executive_summary:
+        parts.append("## Executive Summary\n" + report.executive_summary)
+    if report.findings:
+        lines = ["## Key Findings"]
+        for i, f in enumerate(report.findings, 1):
+            cite = f" *({', '.join(f.record_ids)})*" if f.record_ids else ""
+            lines.append(f"{i}. {f.text}{cite}")
+        parts.append("\n".join(lines))
+    if report.caveats or report.open_questions:
+        lines = ["## Caveats & Open Questions"]
+        for n in report.caveats:
+            cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
+            lines.append(f"- {n.text}{cite}")
+        for n in report.open_questions:
+            cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
+            lines.append(f"- Open: {n.text}{cite}")
+        parts.append("\n".join(lines))
+    if report.data_sources:
+        lines = ["## Appendix A — Data Used", "| source | type | detail |", "|---|---|---|"]
+        for ds in report.data_sources:
+            d = ds.detail
+            bits = []
+            if d.get("tables"):
+                bits.append("tables: " + ", ".join(d["tables"]))
+            if d.get("row_count"):
+                bits.append(f"{d['row_count']} rows")
+            if d.get("columns"):
+                bits.append(f"{len(d['columns'])} cols")
+            lines.append(f"| {ds.name} | {ds.source_type or '—'} | {' · '.join(bits) or '—'} |")
+        parts.append("\n".join(lines))
+    if report.method_steps:
+        lines = ["## Appendix B — Method"]
+        for stage_key, label in _STAGE_LABELS:
+            steps = [s for s in report.method_steps if s.stage == stage_key]
+            if not steps:
+                continue
+            rendered = "; ".join(
+                f"{', '.join(s.tools_used) or '—'} ({s.status})" for s in steps
+            )
+            lines.append(f"**{label}** — {rendered}")
+        parts.append("\n".join(lines))
+    return "\n\n".join(parts)
+# --------------------------------------------------------------------------- #
+# Service
+# --------------------------------------------------------------------------- #
+class ReportGenerator:
+    """Generates an `AnalysisReport` from persisted records. Inject deps for tests."""
+    def __init__(
+        self,
+        record_store=None,
+        structured_chain: Runnable | None = None,
+        catalog_store=None,
+        binding_store=None,
+    ) -> None:
+        self._record_store = record_store
+        self._chain = structured_chain
+        self._catalog_store = catalog_store
+        self._binding_store = binding_store
+    def _ensure_record_store(self):
+        if self._record_store is None:
+            from ..slow_path.store import PostgresAnalysisStore
+            self._record_store = PostgresAnalysisStore()
+        return self._record_store
+    def _ensure_chain(self) -> Runnable:
+        if self._chain is None:
+            self._chain = _get_default_chain()
+        return self._chain
+    def _ensure_catalog_store(self):
+        if self._catalog_store is None:
+            from src.catalog.store import CatalogStore
+            self._catalog_store = CatalogStore()
+        return self._catalog_store
+    async def generate(
+        self,
+        analysis_id: str,
+        user_id: str | None = None,
+        problem_statement: ProblemStatement | None = None,
+    ) -> AnalysisReport:
+        records = await self._ensure_record_store().list_for_analysis(analysis_id)
+        if not records:
+            raise ReportError(f"no analyses recorded for {analysis_id!r} yet")
+        ps = problem_statement or ProblemStatement()
+        findings = _collect_findings(records)
+        caveats = _collect_notes(records, "caveats")
+        open_questions = _collect_notes(records, "open_questions")
+        method_steps = _collect_method_steps(records)
+        bound_ids = await self._read_binding(analysis_id)
+        data_sources = _build_data_sources(
+            records, await self._read_catalog(user_id), bound_ids
+        )
+        executive_summary = await self._summarize(ps, findings, caveats)
+        report = AnalysisReport(
+            analysis_id=analysis_id,
+            user_id=user_id,
+            version=0,  # assigned by ReportStore.save under the advisory lock
+            generated_at=datetime.now(UTC),
+            problem_statement=ps,
+            record_ids=[r.record_id for r in records],
+            executive_summary=executive_summary,
+            findings=findings,
+            caveats=caveats,
+            open_questions=open_questions,
+            data_sources=data_sources,
+            method_steps=method_steps,
+        )
+        report.rendered_markdown = _render_markdown(report)
+        logger.info(
+            "report generated",
+            analysis_id=analysis_id,
+            records=len(records),
+            findings=len(findings),
+        )
+        return report
+    async def _read_catalog(self, user_id: str | None):
+        if not user_id:
+            return None
+        try:
+            return await self._ensure_catalog_store().get(user_id)
+        except Exception as exc:  # data_sources falls back; never break the report
+            logger.warning("catalog read failed; data_sources will fall back", error=str(exc))
+            return None
+    def _ensure_binding_store(self):
+        if self._binding_store is None:
+            from ..binding_store import AnalysisDataSourceStore
+            self._binding_store = AnalysisDataSourceStore()
+        return self._binding_store
+    async def _read_binding(self, analysis_id: str) -> list[str]:
+        """Bound source ids for the analysis (#10). Never-throw → [] (unscoped)."""
+        try:
+            return await self._ensure_binding_store().get(analysis_id)
+        except Exception as exc:  # data_sources falls back to whole catalog
+            logger.warning("binding read failed; data_sources unscoped", error=str(exc))
+            return []
+    async def _summarize(
+        self, ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
+    ) -> str:
+        human_content = _build_human_content(ps, findings, caveats)
+        try:
+            narrative: ReportSummaryNarrative = await self._ensure_chain().ainvoke(
+                {"human_content": human_content}
+            )
+            return narrative.executive_summary
+        except Exception as exc:  # D1: degrade, don't fail the whole report
+            logger.warning("report summary LLM failed; using fallback", error=str(exc))
+            return _FALLBACK_SUMMARY

src/agents/report/readiness.py ADDED Viewed

	@@ -0,0 +1,165 @@

+"""`is_report_ready` — deterministic report-readiness signal (seam #5, KM-652).
+The Help skill asks "can the user generate a report yet?" before it offers that as
+a next step. This is the producer of that answer; Help only *consumes* it (see
+`handlers/help.ReportReadiness`). No LLM — readiness is a fact about persisted state,
+not a judgement.
+The rule mirrors what makes a real report non-empty and worth generating, so Help can
+never suggest an action that would 409 or produce a duplicate:
+  1. `problem_validated` — the gate's own precondition (no validated goal, no
+     analysis worth reporting). Same rule `gate.gate` applies to `structured_flow`.
+  2. at least one **substantive** persisted `AnalysisRecord` — a record whose
+     *analysis* task succeeded. A failed run still persists a record WITH findings
+     (they narrate the failure), and data-access tasks (check_/retrieve_) succeed even
+     when the analysis fails — so neither "has findings" nor "any task succeeded" is
+     enough. We require a genuine analysis tool (analyze_*) to have completed. We count
+     *results*, not chat turns.
+  3. delta-since-report — if a report already exists (`state.report_id`), only ready
+     when there's a substantive analysis newer than the latest report; otherwise the
+     new report would be identical.
+`missing` names whichever criterion is absent, so Help can tell the user the next gap
+to fill (the team values `missing` over the bare boolean). Bias is anti-false-positive
+(report is also button-triggered): a record-store read failure fails **closed**
+(not ready); a report-store read failure during the delta check fails **open** (we
+can't prove staleness, and the button is always there).
+NOT in scope (deferred, pending the readiness eval set): semantic *alignment* of the
+analyses to the problem statement and *depth*/variety scoring — both need an LLM judge
+and shouldn't sit in the per-turn Help hot path until eval justifies the cost.
+"""
+from __future__ import annotations
+from datetime import UTC, datetime
+from typing import TYPE_CHECKING
+from src.middlewares.logging import get_logger
+from ..handlers.help import ReportReadiness
+if TYPE_CHECKING:
+    from ..gate import AnalysisState
+logger = get_logger("report_readiness")
+# Human-readable gaps surfaced to the user via Help (kept stable for the prompt).
+_MISSING_PROBLEM = "a validated problem statement"
+_MISSING_ANALYSIS = "at least one completed analysis"
+_MISSING_DELTA = "a new analysis since the last report"
+def _default_record_store():
+    from ..slow_path.store import PostgresAnalysisStore
+    return PostgresAnalysisStore()
+def _default_report_store():
+    from .store import ReportStore
+    return ReportStore()
+def _is_newer(a: datetime, b: datetime) -> bool:
+    """True if `a` is later than `b`, tolerating naive/aware mismatch (assume UTC)."""
+    if a.tzinfo is None:
+        a = a.replace(tzinfo=UTC)
+    if b.tzinfo is None:
+        b = b.replace(tzinfo=UTC)
+    return a > b
+def _has_successful_analysis(record) -> bool:
+    """True if the record has at least one *analysis* task that succeeded.
+    A failed run still writes findings (narrating the failure) and its data-access
+    tasks (check_/retrieve_) succeed, so we can't key on findings or on "any task
+    succeeded". An analysis tool (analyze_*) completing is the real "we produced a
+    result" signal.
+    """
+    return any(
+        t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used)
+        for t in record.tasks_run
+    )
+async def report_floor(
+    analysis_id: str | None,
+    state: AnalysisState,
+    *,
+    record_store=None,
+) -> tuple[list[str], list]:
+    """The report **floor**: a validated goal + ≥1 substantive analysis.
+    Returns `(missing, substantive_records)`. This is the shared gate both the Help
+    readiness signal AND the report API enforce, so the button and Help can't drift
+    (T-D / T11). It deliberately excludes the delta-since-report check — that is
+    advisory and lives only in `is_report_ready`; the report button is always allowed
+    to cut a new version (decision 4A). Fails closed (counts as missing analysis) on
+    a record-store read error. `record_store` is injectable for tests.
+    """
+    missing: list[str] = []
+    if not state.problem_validated:
+        missing.append(_MISSING_PROBLEM)
+    substantive: list = []
+    if analysis_id:
+        try:
+            store = record_store or _default_record_store()
+            records = await store.list_for_analysis(analysis_id)
+            substantive = [r for r in records if _has_successful_analysis(r)]
+        except Exception as exc:  # noqa: BLE001 — never-throw; fail closed to not-ready
+            logger.warning(
+                "report_floor: record store read failed — not ready",
+                analysis_id=analysis_id,
+                error=str(exc),
+            )
+            return [*missing, _MISSING_ANALYSIS], []
+    if not substantive:
+        missing.append(_MISSING_ANALYSIS)
+    return missing, substantive
+async def is_report_ready(
+    analysis_id: str | None,
+    state: AnalysisState,
+    *,
+    record_store=None,
+    report_store=None,
+) -> ReportReadiness:
+    """Return whether a report can be generated for this analysis, and the gaps if not.
+    `record_store` / `report_store` are injectable for tests; they default to the
+    real Postgres stores.
+    """
+    missing, substantive = await report_floor(
+        analysis_id, state, record_store=record_store
+    )
+    if not substantive:
+        # No analyses to report on → the delta check is moot.
+        return ReportReadiness(ready=not missing, missing=missing)
+    # Delta-since-report: a report already exists, so only ready if a substantive
+    # analysis is newer than the latest report. Fail-open on a report-store error.
+    if state.report_id:
+        last_report_at: datetime | None = None
+        try:
+            rstore = report_store or _default_report_store()
+            reports = await rstore.list_for_analysis(analysis_id)
+            last_report_at = max((r.generated_at for r in reports), default=None)
+        except Exception as exc:  # noqa: BLE001 — skip delta; can't prove staleness
+            logger.warning(
+                "is_report_ready: report store read failed — skipping delta check",
+                analysis_id=analysis_id,
+                error=str(exc),
+            )
+        if last_report_at is not None and not any(
+            _is_newer(r.created_at, last_report_at) for r in substantive
+        ):
+            missing.append(_MISSING_DELTA)
+    return ReportReadiness(ready=not missing, missing=missing)

src/agents/report/schemas.py ADDED Viewed

	@@ -0,0 +1,91 @@

+"""Report contract — `AnalysisReport` and its parts (KM-644).
+The report generator turns a session's persisted `AnalysisRecord`s + Problem
+Statement into a versioned report. Only `executive_summary` is LLM-authored
+(`ReportSummaryNarrative`); every other field is copied verbatim from the records
+by code (INV-4), so the report stays a faithful, auditable artifact.
+Two deliberate looseness choices for v1 (tighten later once usage shows):
+`ProblemStatement` (stub of Harry's real PS) and `ReportFinding.supporting_data`.
+See CHECKPOINT_PLAN_2026-06-17.md decision #8.
+"""
+from __future__ import annotations
+from datetime import datetime
+from uuid import uuid4
+from pydantic import BaseModel, Field
+from ..slow_path.schemas import TaskSummary
+class ProblemStatement(BaseModel):
+    """Minimal stub of Harry's Problem Statement, frozen into each report.
+    Loose on purpose until the real PS template lands (Analysis State, upstream).
+    A report snapshots the PS as it was at generation time.
+    """
+    objective: str = ""
+    metric_direction: str = ""  # "increase" | "decrease"
+    target_metric: str = ""
+    target_value: str = ""
+    scope: str = ""
+class DataSourceRef(BaseModel):
+    """Frozen catalog metadata for a source used in the analysis.
+    Snapshotted at generation time (NOT re-fetched at render) so a re-ingested
+    source never retroactively changes an old report — same freeze rationale as
+    `ProblemStatement`.
+    """
+    source_id: str
+    name: str
+    source_type: str  # postgres | file | ...
+    detail: dict = Field(default_factory=dict)  # rows in scope, columns, window
+class ReportFinding(BaseModel):
+    text: str
+    record_ids: list[str] = Field(default_factory=list)  # records backing this finding
+    supporting_data: dict | None = None  # loose for v1; the chart-able slice
+class AttributedNote(BaseModel):
+    """A caveat or open question carrying the records it came from.
+    Plural `record_ids` because a note can be deduped/merged across records.
+    """
+    text: str
+    record_ids: list[str] = Field(default_factory=list)
+class ReportSummaryNarrative(BaseModel):
+    """The ONLY LLM-authored part of the report (with_structured_output target)."""
+    executive_summary: str
+class AnalysisReport(BaseModel):
+    report_id: str = Field(default_factory=lambda: uuid4().hex)
+    analysis_id: str
+    user_id: str | None = None
+    version: int
+    generated_at: datetime
+    # Frozen snapshots.
+    problem_statement: ProblemStatement = Field(default_factory=ProblemStatement)
+    record_ids: list[str] = Field(default_factory=list)  # records used (snapshot)
+    # LLM-authored.
+    executive_summary: str = ""
+    # Deterministic pass-through from records.
+    findings: list[ReportFinding] = Field(default_factory=list)
+    caveats: list[AttributedNote] = Field(default_factory=list)
+    open_questions: list[AttributedNote] = Field(default_factory=list)
+    data_sources: list[DataSourceRef] = Field(default_factory=list)
+    method_steps: list[TaskSummary] = Field(default_factory=list)  # carries `stage`
+    rendered_markdown: str = ""

src/agents/report/store.py ADDED Viewed

	@@ -0,0 +1,119 @@

+"""ReportStore — persists/reads versioned AnalysisReports (KM-644).
+Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSessionLocal`.
+Version assignment is serialized per `analysis_id` with a Postgres
+transaction-level advisory lock so concurrent button presses can't compute the
+same version number; the `(analysis_id, version)` unique constraint is the
+backstop. Per decision 4A every generation is a new version, so two
+near-simultaneous presses legitimately produce V<n> and V<n+1> — the lock only
+prevents a duplicate-number race, not double generation.
+"""
+from __future__ import annotations
+import hashlib
+from sqlalchemy import func, select, text
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import AnalysisReportRow
+from src.middlewares.logging import get_logger
+from .schemas import AnalysisReport
+logger = get_logger("report_store")
+def _lock_key(analysis_id: str) -> int:
+    """Stable signed 64-bit key for `pg_advisory_xact_lock`.
+    Python's builtin `hash(str)` is randomized per process, so derive a
+    deterministic key from a digest instead.
+    """
+    digest = hashlib.sha256(analysis_id.encode()).digest()
+    return int.from_bytes(digest[:8], "big", signed=True)
+def _report_title(report: AnalysisReport) -> str:
+    """Title for the dedorch `reports.title` column — the goal, else a generic label."""
+    objective = (report.problem_statement.objective or "").strip()
+    return objective[:200] if objective else "Analysis Report"
+def _row_to_report(row) -> AnalysisReport:
+    """Rebuild a minimal AnalysisReport from the flat dedorch row.
+    dedorch stores markdown only, so structured fields (findings/caveats/…) come back
+    empty; `rendered_markdown` carries the content the FE renders/downloads.
+    """
+    return AnalysisReport(
+        report_id=row.id,
+        analysis_id=row.analysis_id,
+        version=row.version,
+        generated_at=row.generated_at,
+        rendered_markdown=row.content,
+    )
+class ReportStore:
+    """Read/write versioned reports keyed by `analysis_id`."""
+    async def save(self, report: AnalysisReport) -> AnalysisReport:
+        """Assign the next version under an advisory lock and persist.
+        Mutates and returns `report` with its final `version`.
+        """
+        async with AsyncSessionLocal() as session:
+            async with session.begin():
+                await session.execute(
+                    text("SELECT pg_advisory_xact_lock(:k)"),
+                    {"k": _lock_key(report.analysis_id)},
+                )
+                result = await session.execute(
+                    select(func.max(AnalysisReportRow.version)).where(
+                        AnalysisReportRow.analysis_id == report.analysis_id
+                    )
+                )
+                report.version = (result.scalar_one_or_none() or 0) + 1
+                session.add(
+                    AnalysisReportRow(
+                        id=report.report_id,
+                        analysis_id=report.analysis_id,
+                        title=_report_title(report),
+                        content=report.rendered_markdown or "",
+                        generated_at=report.generated_at,
+                        version=report.version,
+                    )
+                )
+            # leaving session.begin() commits, which releases the advisory lock
+        logger.info(
+            "report persisted",
+            analysis_id=report.analysis_id,
+            version=report.version,
+            report_id=report.report_id,
+        )
+        return report
+    async def list_for_analysis(self, analysis_id: str) -> list[AnalysisReport]:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(AnalysisReportRow)
+                .where(AnalysisReportRow.analysis_id == analysis_id)
+                .order_by(AnalysisReportRow.version.asc())
+            )
+            rows = result.scalars().all()
+        return [_row_to_report(row) for row in rows]
+    async def get(self, analysis_id: str, version: int) -> AnalysisReport | None:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(AnalysisReportRow).where(
+                    AnalysisReportRow.analysis_id == analysis_id,
+                    AnalysisReportRow.version == version,
+                )
+            )
+            row = result.scalar_one_or_none()
+        if row is None:
+            return None
+        return _row_to_report(row)

src/agents/slow_path/assembler.py CHANGED Viewed

@@ -33,6 +33,7 @@ from .schemas import (
     AssembledOutput,
     AssemblerNarrative,
     RunState,
     TaskSummary,
 )
@@ -116,16 +117,46 @@ class Assembler:
         return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
 def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
     tasks_run = [
         TaskSummary(
             task_id=task_id,
             objective=result.objective,
             status=result.status,
             tools_used=[o.tool for o in result.outputs],
         )
         for task_id, result in run_state.results.items()
     ]
     return AnalysisRecord(
         goal_restated=narrative.goal_restated,
         findings=narrative.findings,
@@ -133,7 +164,7 @@ def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> Analysi
         data_used=narrative.data_used,
         open_questions=narrative.open_questions,
         tasks_run=tasks_run,
-        results_snapshot=run_state.results,
         plan_id=run_state.plan_id,
         business_context_id=run_state.business_context_id,
         created_at=datetime.now(UTC),

     AssembledOutput,
     AssemblerNarrative,
     RunState,
+    TaskResult,
     TaskSummary,
 )
         return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
+# Persisted records keep `analyze_*` outputs (scalar/stats/series — small, and the
+# basis a future report/chart renders from) in full, but cap raw `table` rows from
+# data-access tools (retrieve_data can return up to the 10k LIMIT): the report never
+# renders raw rows, so storing them all would bloat every record's jsonb.
+_SNAPSHOT_ROW_SAMPLE = 10
+def _trim_for_snapshot(result: TaskResult) -> TaskResult:
+    trimmed = []
+    changed = False
+    for out in result.outputs:
+        if out.kind == "table" and out.rows is not None and len(out.rows) > _SNAPSHOT_ROW_SAMPLE:
+            changed = True
+            trimmed.append(
+                out.model_copy(
+                    update={
+                        "rows": out.rows[:_SNAPSHOT_ROW_SAMPLE],
+                        "meta": {**out.meta, "total_rows": len(out.rows), "rows_truncated": True},
+                    }
+                )
+            )
+        else:
+            trimmed.append(out)
+    return result.model_copy(update={"outputs": trimmed}) if changed else result
 def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
     tasks_run = [
         TaskSummary(
             task_id=task_id,
+            stage=result.stage,
             objective=result.objective,
             status=result.status,
             tools_used=[o.tool for o in result.outputs],
         )
         for task_id, result in run_state.results.items()
     ]
+    results_snapshot = {
+        task_id: _trim_for_snapshot(result) for task_id, result in run_state.results.items()
+    }
     return AnalysisRecord(
         goal_restated=narrative.goal_restated,
         findings=narrative.findings,
         data_used=narrative.data_used,
         open_questions=narrative.open_questions,
         tasks_run=tasks_run,
+        results_snapshot=results_snapshot,
         plan_id=run_state.plan_id,
         business_context_id=run_state.business_context_id,
         created_at=datetime.now(UTC),

src/agents/slow_path/coordinator.py CHANGED Viewed

@@ -1,9 +1,9 @@
 """SlowPathCoordinator — wires the slow path: Planner -> TaskRunner -> Assembler.
-A thin coordination object. This is the unit the (future) expanded Orchestrator /
-ChatHandler will call on a `structured` analytical query. It is built and tested
-here but **not yet wired into the live chat flow** — that step waits on the tool
-team's real `ToolInvoker` and a real `BusinessContext` source.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
 """

 """SlowPathCoordinator — wires the slow path: Planner -> TaskRunner -> Assembler.
+A thin coordination object. `ChatHandler` calls it on a `structured_flow` query when
+`ENABLE_SLOW_PATH` is on (the real `ToolInvoker` is composed in
+`ChatHandler._get_slow_path_coordinator`). `BusinessContext` is still a stub until the
+lead's real source lands.
 See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
 """

src/agents/slow_path/schemas.py CHANGED Viewed

@@ -21,10 +21,12 @@ from __future__ import annotations
 from datetime import datetime
 from typing import Literal
 from pydantic import BaseModel, Field
 from ..planner.contracts import ToolOutput
 TaskStatus = Literal["success", "partial", "failure"]
@@ -36,6 +38,7 @@ TaskStatus = Literal["success", "partial", "failure"]
 class TaskResult(BaseModel):
     task_id: str
     status: TaskStatus
     objective: str
     outputs: list[ToolOutput] = Field(default_factory=list)  # one per tool_call
@@ -57,12 +60,21 @@ class RunState(BaseModel):
 class TaskSummary(BaseModel):
     task_id: str
     objective: str
     status: TaskStatus
     tools_used: list[str] = Field(default_factory=list)
 class AnalysisRecord(BaseModel):
     # Narrative fields — authored by the Assembler LLM.
     goal_restated: str
     findings: list[str] = Field(default_factory=list)

 from datetime import datetime
 from typing import Literal
+from uuid import uuid4
 from pydantic import BaseModel, Field
 from ..planner.contracts import ToolOutput
+from ..planner.schemas import CrispStage
 TaskStatus = Literal["success", "partial", "failure"]
 class TaskResult(BaseModel):
     task_id: str
+    stage: CrispStage  # copied from the plan Task; carries CRISP-DM grouping to the report
     status: TaskStatus
     objective: str
     outputs: list[ToolOutput] = Field(default_factory=list)  # one per tool_call
 class TaskSummary(BaseModel):
     task_id: str
+    stage: CrispStage  # lets the report group the method appendix by CRISP-DM phase
     objective: str
     status: TaskStatus
     tools_used: list[str] = Field(default_factory=list)
 class AnalysisRecord(BaseModel):
+    # Identity. `record_id` is the unit the report cites and snapshots
+    # (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis
+    # session + owner and are stamped by the composition root / AnalysisStore at
+    # persist time (they depend on the Analysis State that lives outside the slow
+    # path), so they default to None when the Assembler first builds the record.
+    record_id: str = Field(default_factory=lambda: uuid4().hex)
+    analysis_id: str | None = None
+    user_id: str | None = None
     # Narrative fields — authored by the Assembler LLM.
     goal_restated: str
     findings: list[str] = Field(default_factory=list)

src/agents/slow_path/store.py CHANGED Viewed

@@ -2,21 +2,28 @@
 The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
 run — §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
-so it sits behind this one-method seam.
-`NullAnalysisStore` is the default: it logs that a record was produced but stores
-nothing, because the backing table does not exist yet. The plan is to store records
-in the **same catalog DB** (Neon `dataeyond`, `settings.postgres_connstring`).
-TODO(persistence): add a Postgres-backed `AnalysisStore` writing an
-`analysis_records` table in the catalog DB, keyed on
-(business_context_id, plan_id, created_at), then inject it into ChatHandler.
 """
 from __future__ import annotations
 from typing import Protocol, runtime_checkable
 from src.middlewares.logging import get_logger
 from .schemas import AnalysisRecord
@@ -26,19 +33,78 @@ logger = get_logger("analysis_store")
 @runtime_checkable
 class AnalysisStore(Protocol):
-    """Persist a completed analysis. Implementations must never raise on the
-    caller's path — a persistence failure must not break the user's answer."""
     async def save(self, record: AnalysisRecord) -> None: ...
 class NullAnalysisStore:
-    """Default no-op store: logs the record, persists nothing (no table yet)."""
     async def save(self, record: AnalysisRecord) -> None:
         logger.info(
-            "analysis_record produced (not persisted — no store configured)",
             plan_id=record.plan_id,
-            business_context_id=record.business_context_id,
             n_tasks=len(record.tasks_run),
         )

 The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
 run — §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
+so it sits behind this seam. `generate_report` later reads records back by
+`analysis_id` (oldest-first) and renders from them — never from chat history.
+- `NullAnalysisStore` logs and stores nothing (kept for tests / when persistence
+  is intentionally disabled).
+- `PostgresAnalysisStore` writes one `analysis_records` row per run in the catalog
+  DB (Neon `dataeyond`, `settings.postgres_connstring`).
+`save` must never raise on the caller's path — a persistence failure must not break
+the user's answer (§8.3). `list_for_analysis` is a read for the report generator and
+is allowed to surface errors to its caller.
 """
 from __future__ import annotations
 from typing import Protocol, runtime_checkable
+from sqlalchemy import select
+from sqlalchemy.dialects.postgresql import insert
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import AnalysisRecordRow
 from src.middlewares.logging import get_logger
 from .schemas import AnalysisRecord
 @runtime_checkable
 class AnalysisStore(Protocol):
+    """Persist + read completed analyses.
+    `save` must never raise on the caller's path. `list_for_analysis` returns the
+    records for one analysis session, oldest-first (the order the report renders in).
+    """
     async def save(self, record: AnalysisRecord) -> None: ...
+    async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...
 class NullAnalysisStore:
+    """No-op store: logs the record, persists nothing. Reads return empty."""
     async def save(self, record: AnalysisRecord) -> None:
         logger.info(
+            "analysis_record produced (not persisted — NullAnalysisStore)",
+            record_id=record.record_id,
             plan_id=record.plan_id,
             n_tasks=len(record.tasks_run),
         )
+    async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
+        return []
+class PostgresAnalysisStore:
+    """Writes/reads `analysis_records` jsonb rows in the catalog DB.
+    Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per
+    record (vs. one-per-user for the catalog) since records accumulate per analysis.
+    """
+    async def save(self, record: AnalysisRecord) -> None:
+        try:
+            payload = record.model_dump(mode="json")
+            async with AsyncSessionLocal() as session:
+                stmt = insert(AnalysisRecordRow).values(
+                    id=record.record_id,
+                    analysis_id=record.analysis_id,
+                    user_id=record.user_id,
+                    plan_id=record.plan_id,
+                    data=payload,
+                    created_at=record.created_at,
+                )
+                # Re-running the same plan id-collides only if record_id repeats;
+                # treat that as idempotent (overwrite) rather than erroring the user.
+                stmt = stmt.on_conflict_do_update(
+                    index_elements=[AnalysisRecordRow.id],
+                    set_={"data": stmt.excluded.data},
+                )
+                await session.execute(stmt)
+                await session.commit()
+            logger.info(
+                "analysis_record persisted",
+                record_id=record.record_id,
+                analysis_id=record.analysis_id,
+                user_id=record.user_id,
+            )
+        except Exception as exc:  # never break the user's answer (§8.3)
+            logger.error(
+                "analysis_record persist failed",
+                record_id=record.record_id,
+                error=str(exc),
+            )
+    async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(AnalysisRecordRow.data)
+                .where(AnalysisRecordRow.analysis_id == analysis_id)
+                .order_by(AnalysisRecordRow.created_at.asc())
+            )
+            rows = result.scalars().all()
+        return [AnalysisRecord.model_validate(row) for row in rows]

src/agents/slow_path/task_runner.py CHANGED Viewed

@@ -53,6 +53,7 @@ class TaskRunner:
                 for tid in list(remaining):
                     results[tid] = TaskResult(
                         task_id=tid,
                         status="failure",
                         objective=tasks_by_id[tid].objective,
                         error="unresolved dependency; task could not run",
@@ -68,6 +69,7 @@ class TaskRunner:
                 if failed:
                     results[tid] = TaskResult(
                         task_id=tid,
                         status="failure",
                         objective=task.objective,
                         error=f"skipped: upstream {failed} did not succeed",
@@ -110,6 +112,7 @@ class TaskRunner:
             error = errs[0] if errs else "all tool calls failed"
         return TaskResult(
             task_id=task.id,
             status=status,
             objective=task.objective,
             outputs=outputs,

                 for tid in list(remaining):
                     results[tid] = TaskResult(
                         task_id=tid,
+                        stage=tasks_by_id[tid].stage,
                         status="failure",
                         objective=tasks_by_id[tid].objective,
                         error="unresolved dependency; task could not run",
                 if failed:
                     results[tid] = TaskResult(
                         task_id=tid,
+                        stage=task.stage,
                         status="failure",
                         objective=task.objective,
                         error=f"skipped: upstream {failed} did not succeed",
             error = errs[0] if errs else "all tool calls failed"
         return TaskResult(
             task_id=task.id,
+            stage=task.stage,
             status=status,
             objective=task.objective,
             outputs=outputs,

src/agents/state_store.py ADDED Viewed

	@@ -0,0 +1,128 @@

+"""AnalysisStateStore — read/write the per-analysis session state.
+The orchestrator gate + Help skill read `AnalysisState` (the locked contract in
+`gate.py`) every turn; the Problem Statement skill writes `problem_validated`. The
+row shares its id with the chat `rooms` row — one session = one analysis = one
+conversation (`analysis_id == room_id`).
+Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSession`.
+"""
+from __future__ import annotations
+from sqlalchemy.dialects.postgresql import insert
+from src.agents.gate import AnalysisState
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import AnalysisStateRow
+from src.middlewares.logging import get_logger
+logger = get_logger("analysis_state_store")
+def _row_to_state(row: AnalysisStateRow) -> AnalysisState:
+    """Map a DB row to the frozen `AnalysisState` contract."""
+    return AnalysisState(
+        id=row.id,
+        analysis_title=row.analysis_title,
+        problem_statement=row.problem_statement,
+        problem_validated=row.problem_validated,
+        owner_id=row.owner_id,
+        report_id=row.report_id,
+        created_at=row.created_at,
+        updated_at=row.updated_at,
+    )
+class AnalysisStateStore:
+    """Read/write the dedorch `analysis` table, keyed by the shared session id."""
+    async def get(self, analysis_id: str) -> AnalysisState | None:
+        async with AsyncSessionLocal() as session:
+            row = await session.get(AnalysisStateRow, analysis_id)
+            return _row_to_state(row) if row is not None else None
+    async def ensure(
+        self,
+        analysis_id: str,
+        owner_id: str,
+        analysis_title: str = "New analysis",
+    ) -> AnalysisState:
+        """Get-or-create the state row for a session (idempotent, race-safe).
+        Sessions born from `/room/create` have no `analysis_states` row; without
+        one the gate redirect-loops and `problem_statement` / `report_id` writes
+        silently no-op. Called per turn (analysis_id == room_id) so any session is
+        gate-ready. `INSERT ... ON CONFLICT DO NOTHING` makes concurrent first
+        turns safe; the row is then read back. Legacy rows created this way carry
+        no source bindings — binding scoping fail-opens to the whole catalog.
+        """
+        async with AsyncSessionLocal() as session:
+            stmt = (
+                insert(AnalysisStateRow)
+                .values(
+                    id=analysis_id,
+                    owner_id=owner_id,
+                    analysis_title=analysis_title,
+                    problem_statement="",
+                    problem_validated=False,
+                )
+                .on_conflict_do_nothing(index_elements=[AnalysisStateRow.id])
+            )
+            await session.execute(stmt)
+            await session.commit()
+            row = await session.get(AnalysisStateRow, analysis_id)
+            return _row_to_state(row)
+    async def create(
+        self,
+        *,
+        analysis_id: str,
+        owner_id: str,
+        analysis_title: str = "New analysis",
+        problem_statement: str = "",
+    ) -> AnalysisState:
+        """Create the state row for a new analysis (id shared with its chat room)."""
+        async with AsyncSessionLocal() as session:
+            row = AnalysisStateRow(
+                id=analysis_id,
+                owner_id=owner_id,
+                analysis_title=analysis_title,
+                problem_statement=problem_statement,
+                problem_validated=False,
+            )
+            session.add(row)
+            await session.commit()
+            await session.refresh(row)
+            return _row_to_state(row)
+    async def update(
+        self,
+        analysis_id: str,
+        *,
+        problem_statement: str | None = None,
+        problem_validated: bool | None = None,
+        report_id: str | None = None,
+    ) -> AnalysisState | None:
+        """Patch the given fields (only non-None args are written). Returns the row.
+        Used by the Problem Statement skill (`problem_validated`) and the report
+        flow (`report_id`). Returns None if the analysis doesn't exist.
+        """
+        async with AsyncSessionLocal() as session:
+            row = await session.get(AnalysisStateRow, analysis_id)
+            if row is None:
+                logger.warning(
+                    "analysis row missing — update skipped",
+                    analysis_id=analysis_id,
+                )
+                return None
+            if problem_statement is not None:
+                row.problem_statement = problem_statement
+            if problem_validated is not None:
+                row.problem_validated = problem_validated
+            if report_id is not None:
+                row.report_id = report_id
+            await session.commit()
+            await session.refresh(row)
+            return _row_to_state(row)

src/api/v1/analysis.py ADDED Viewed

	@@ -0,0 +1,174 @@

+"""Analysis session API — create a new analysis (the per-session workspace).
+An analysis IS the chat session: the `analysis_states` row and the chat `rooms`
+row share one id (`analysis_id == room_id`), so the existing `room_id` on the chat
+request doubles as the `analysis_id`. Creating an analysis enforces the data-first
+gate (>=1 bound source) and seeds the state with a title + an optional problem
+statement (validated later by the Problem Statement skill).
+"""
+import uuid
+from fastapi import APIRouter, Depends, HTTPException
+from pydantic import BaseModel, Field
+from sqlalchemy import select
+from sqlalchemy.ext.asyncio import AsyncSession
+from src.db.postgres.connection import get_db
+from src.db.postgres.models import AnalysisDataSourceRow, AnalysisStateRow, Room
+from src.middlewares.logging import get_logger, log_execution
+logger = get_logger("analysis_api")
+router = APIRouter(prefix="/api/v1", tags=["Analysis"])
+def _serialize_state(row: AnalysisStateRow, data_source_ids: list[str]) -> dict:
+    """The full analysis payload: the 8 state fields + the bound source ids."""
+    return {
+        "id": row.id,
+        "analysis_title": row.analysis_title,
+        "problem_statement": row.problem_statement,
+        "problem_validated": row.problem_validated,
+        "owner_id": row.owner_id,
+        "report_id": row.report_id,
+        "data_source_ids": data_source_ids,
+        "created_at": row.created_at.isoformat() if row.created_at else None,
+        "updated_at": row.updated_at.isoformat() if row.updated_at else None,
+    }
+async def _bound_source_ids(db: AsyncSession, analysis_id: str) -> list[str]:
+    result = await db.execute(
+        select(AnalysisDataSourceRow.reference_id).where(
+            AnalysisDataSourceRow.analysis_id == analysis_id
+        )
+    )
+    return list(result.scalars().all())
+async def _sources_by_id(user_id: str) -> dict:
+    """Catalog sources keyed by source_id, to resolve `type`/`name` on binding.
+    Never-throw: missing catalog / read error → empty map, and binding rows fall back
+    to type='unknown' / name=reference_id.
+    """
+    try:
+        from src.catalog.store import CatalogStore
+        catalog = await CatalogStore().get(user_id)
+    except Exception as e:  # noqa: BLE001 — binding must not fail on catalog read
+        logger.warning("analysis: catalog read failed for binding", user_id=user_id, error=str(e))
+        return {}
+    return {s.source_id: s for s in catalog.sources} if catalog else {}
+class CreateAnalysisRequest(BaseModel):
+    user_id: str
+    analysis_title: str = "New analysis"
+    problem_statement: str = ""
+    data_source_ids: list[str] = Field(default_factory=list)
+@router.post("/analysis/create")
+@log_execution(logger)
+async def create_analysis(
+    request: CreateAnalysisRequest,
+    db: AsyncSession = Depends(get_db),
+):
+    """Create a new analysis session: one shared id for its state + chat room.
+    Data-first gate (decision #2): an analysis requires >=1 bound data source.
+    The bound sources are persisted as dedorch `data_sources` rows (#10) in the same
+    transaction as the state + room, so the analysis is scoped to exactly the sources
+    the user picked. `structured_flow` and the report read this binding back.
+    """
+    if not request.data_source_ids:
+        raise HTTPException(
+            status_code=400,
+            detail="An analysis requires at least one bound data source.",
+        )
+    analysis_id = str(uuid.uuid4())
+    # The analysis IS the session: state row + chat room + source bindings share one
+    # id, created atomically in one transaction.
+    state_row = AnalysisStateRow(
+        id=analysis_id,
+        owner_id=request.user_id,
+        analysis_title=request.analysis_title,
+        problem_statement=request.problem_statement,
+        problem_validated=False,
+    )
+    db.add(Room(id=analysis_id, user_id=request.user_id, title=request.analysis_title))
+    db.add(state_row)
+    # dict.fromkeys dedupes while preserving order. Each binding row snapshots the
+    # source's type + name from the catalog (reference_id = catalog source id);
+    # bound_at/created_at default to now() in dedorch.
+    bound_ids = list(dict.fromkeys(request.data_source_ids))
+    src_by_id = await _sources_by_id(request.user_id)
+    for source_id in bound_ids:
+        src = src_by_id.get(source_id)
+        db.add(
+            AnalysisDataSourceRow(
+                id=str(uuid.uuid4()),
+                analysis_id=analysis_id,
+                type=src.source_type if src else "unknown",
+                name=src.name if src else source_id,
+                reference_id=source_id,
+                bound_by=request.user_id,
+            )
+        )
+    await db.commit()
+    await db.refresh(state_row)
+    logger.info(
+        "analysis created",
+        analysis_id=analysis_id,
+        user_id=request.user_id,
+        sources=len(bound_ids),
+    )
+    return {
+        "status": "success",
+        "message": "Analysis created successfully",
+        "data": _serialize_state(state_row, bound_ids),
+    }
+@router.get("/analysis")
+@log_execution(logger)
+async def list_analyses(user_id: str, db: AsyncSession = Depends(get_db)):
+    """List a user's analyses, most-recently-updated first (Analysis sidebar).
+    Summary fields only (no per-row source bindings — fetch those via the detail
+    endpoint) to keep the list a single query.
+    """
+    result = await db.execute(
+        select(AnalysisStateRow)
+        .where(AnalysisStateRow.owner_id == user_id)
+        .order_by(AnalysisStateRow.updated_at.desc())
+    )
+    rows = result.scalars().all()
+    return {
+        "status": "success",
+        "data": [
+            {
+                "id": r.id,
+                "analysis_title": r.analysis_title,
+                "problem_validated": r.problem_validated,
+                "report_id": r.report_id,
+                "updated_at": r.updated_at.isoformat() if r.updated_at else None,
+            }
+            for r in rows
+        ],
+    }
+@router.get("/analysis/{analysis_id}")
+@log_execution(logger)
+async def get_analysis(analysis_id: str, db: AsyncSession = Depends(get_db)):
+    """Read one analysis's state + bound data sources (the FE workspace render)."""
+    row = await db.get(AnalysisStateRow, analysis_id)
+    if row is None:
+        raise HTTPException(status_code=404, detail=f"Analysis {analysis_id!r} not found.")
+    data_source_ids = await _bound_source_ids(db, analysis_id)
+    return {"status": "success", "data": _serialize_state(row, data_source_ids)}

src/api/v1/chat.py CHANGED Viewed

@@ -31,6 +31,7 @@ router = APIRouter(prefix="/api/v1", tags=["Chat"])
 _chat_handler = ChatHandler(
     enable_tracing=True,
     enable_slow_path=settings.enable_slow_path,
 )
 _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
@@ -64,8 +65,39 @@ async def get_cached_response(redis, cache_key: str) -> Optional[dict]:
     return None
 async def cache_response(redis, cache_key: str, response: str, sources: list):
-    await redis.setex(cache_key, 86400, json.dumps({"response": response, "sources": sources}))
 async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
@@ -107,10 +139,10 @@ async def save_messages(
 @router.delete("/chat/cache")
-async def clear_chat_cache(room_id: str, message: str):
-    """Delete the Redis cache entry for a specific room + message pair."""
     redis = await get_redis()
-    cache_key = f"{settings.redis_prefix}chat:{room_id}:{message}"
     deleted = await redis.delete(cache_key)
     return {"deleted": deleted > 0, "cache_key": cache_key}
@@ -146,7 +178,7 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
       3. done     — signals end of stream
     """
     redis = await get_redis()
-    cache_key = f"{settings.redis_prefix}chat:{request.room_id}:{request.message}"
     # Redis cache hit
     cached = await get_cached_response(redis, cache_key)
@@ -186,8 +218,17 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
             logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
             full_response = ""
             sources: List[Dict[str, Any]] = []
-            async for event in handler.handle(request.message, request.user_id, history):
-                if event["event"] == "sources":
                     try:
                         sources = json.loads(event["data"]) or []
                     except (TypeError, ValueError):
@@ -197,7 +238,10 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
                     full_response += event["data"]
                     yield event
                 elif event["event"] == "done":
-                    await cache_response(redis, cache_key, full_response, sources=sources)
                     logger.info("saving messages", sources_count=len(sources), sources=sources)
                     try:
                         await save_messages(db, request.room_id, request.message, full_response, sources=sources)
@@ -211,7 +255,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
                 elif event["event"] == "error":
                     yield event
                     return
-                # "intent" event: consumed internally, not forwarded to frontend
         return EventSourceResponse(stream_response())

 _chat_handler = ChatHandler(
     enable_tracing=True,
     enable_slow_path=settings.enable_slow_path,
+    enable_gate=settings.enable_gate,
 )
 _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
     return None
+# 1h TTL per the 2026-06-11 checkpoint decision (Redis = retrieval/query caching
+# only, short-lived). Was 24h, which served stale answers after re-ingestion.
+_CHAT_CACHE_TTL_SECONDS = 3600
+# Only stateless replies are safe to cache. The cache key is (room, user, message)
+# with no analysis-state/data version, so caching a state- or data-dependent answer
+# (help / problem_statement / check / structured_flow / unstructured_flow) would
+# replay a stale answer after the state or data changes — and, since the read check
+# runs before the gate, could even bypass the gate when the same message repeats.
+# So we cache ONLY the `chat` intent. Caching analysis answers needs proper
+# invalidation on data/state change — deferred. The write is gated by the intent the
+# handler already emits; the read stays as-is (safe because only `chat` is ever
+# stored).
+_CACHEABLE_INTENTS = frozenset({"chat"})
+def _chat_cache_key(room_id: str, user_id: str, message: str) -> str:
+    # user_id is part of the key so one user's cached answer can never be
+    # replayed to another (R5); room_id stays first so the room-wide clear
+    # endpoint can keep matching on a `chat:{room_id}:*` prefix.
+    # LIMITATION (T-G): the key omits conversation history, so a repeated message
+    # replays its cached answer even if the conversation has since moved on. Only
+    # the stateless `chat` intent is cached, so the blast radius is small — but a
+    # history-aware key (hash of last-N turns) would close it. Flagged to Harry.
+    return f"{settings.redis_prefix}chat:{room_id}:{user_id}:{message}"
 async def cache_response(redis, cache_key: str, response: str, sources: list):
+    await redis.setex(
+        cache_key,
+        _CHAT_CACHE_TTL_SECONDS,
+        json.dumps({"response": response, "sources": sources}),
+    )
 async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
 @router.delete("/chat/cache")
+async def clear_chat_cache(room_id: str, user_id: str, message: str):
+    """Delete the Redis cache entry for a specific room + user + message pair."""
     redis = await get_redis()
+    cache_key = _chat_cache_key(room_id, user_id, message)
     deleted = await redis.delete(cache_key)
     return {"deleted": deleted > 0, "cache_key": cache_key}
       3. done     — signals end of stream
     """
     redis = await get_redis()
+    cache_key = _chat_cache_key(request.room_id, request.user_id, request.message)
     # Redis cache hit
     cached = await get_cached_response(redis, cache_key)
             logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
             full_response = ""
             sources: List[Dict[str, Any]] = []
+            effective_intent: Optional[str] = None
+            async for event in handler.handle(
+                request.message, request.user_id, history, analysis_id=request.room_id
+            ):
+                if event["event"] == "intent":
+                    # consumed internally (not forwarded); gates caching below.
+                    try:
+                        effective_intent = json.loads(event["data"]).get("intent")
+                    except (TypeError, ValueError, AttributeError):
+                        effective_intent = None
+                elif event["event"] == "sources":
                     try:
                         sources = json.loads(event["data"]) or []
                     except (TypeError, ValueError):
                     full_response += event["data"]
                     yield event
                 elif event["event"] == "done":
+                    # Only cache stateless `chat` replies — caching a state/data-
+                    # dependent answer would replay it stale (see _CACHEABLE_INTENTS).
+                    if effective_intent in _CACHEABLE_INTENTS:
+                        await cache_response(redis, cache_key, full_response, sources=sources)
                     logger.info("saving messages", sources_count=len(sources), sources=sources)
                     try:
                         await save_messages(db, request.room_id, request.message, full_response, sources=sources)
                 elif event["event"] == "error":
                     yield event
                     return
         return EventSourceResponse(stream_response())

src/api/v1/report.py ADDED Viewed

	@@ -0,0 +1,189 @@

+"""Report API (KM-644) — the dedicated "Generate Report" surface.
+NOT a chat route. The frontend button calls these endpoints directly:
+  POST /report                       generate a new version for a session
+  GET  /report/{analysis_id}         list a session's report versions
+  GET  /report/{analysis_id}/{ver}   fetch one version
+Generation reads persisted AnalysisRecords + Problem Statement, makes one LLM call
+(the executive summary), and persists an immutable versioned artifact. The
+ReportGenerator + ReportStore are process singletons (the generator caches its LLM
+chain warm across requests, like ChatHandler).
+Note (T-E): AnalysisRecords are only persisted by the slow path, so reports require
+`ENABLE_SLOW_PATH=on`. With it off, no records exist and generation 409s — by design,
+not a bug. POST gates on the same floor as Help's readiness signal (validated goal +
+≥1 substantive analysis) so the button and Help never disagree.
+"""
+from fastapi import APIRouter, HTTPException, Query, status
+from src.agents.report.errors import ReportError
+from src.agents.report.generator import ReportGenerator
+from src.agents.report.schemas import AnalysisReport, ProblemStatement
+from src.agents.report.store import ReportStore
+from src.middlewares.logging import get_logger, log_execution
+from src.models.api.report import ReportVersionEntry
+logger = get_logger("report_api")
+router = APIRouter(prefix="/api/v1", tags=["Report"])
+_generator = ReportGenerator()
+_store = ReportStore()
+async def _load_state(analysis_id: str):
+    """Load the AnalysisState (for the floor gate + problem statement). Never-throw."""
+    try:
+        from src.agents.state_store import AnalysisStateStore
+        return await AnalysisStateStore().get(analysis_id)
+    except Exception as e:  # noqa: BLE001 — never block report generation on this
+        logger.warning("report: state load failed", analysis_id=analysis_id, error=str(e))
+        return None
+def _problem_statement_from(state) -> ProblemStatement:
+    """Map the analysis's free-text problem statement into the report's structured PS."""
+    if state is None or not state.problem_statement:
+        return ProblemStatement()
+    return ProblemStatement(objective=state.problem_statement)
+async def _record_report_on_state(analysis_id: str, report_id: str) -> None:
+    """Write the new `report_id` back onto the Analysis State (never-throw).
+    Closes the loop so Help's `has_report` and the readiness delta-check can see
+    that a report exists. A missing state row / write error must not fail a report
+    that already generated and persisted.
+    """
+    try:
+        from src.agents.state_store import AnalysisStateStore
+        await AnalysisStateStore().update(analysis_id, report_id=report_id)
+    except Exception as e:  # noqa: BLE001
+        logger.warning(
+            "report: report_id write-back failed", analysis_id=analysis_id, error=str(e)
+        )
+@router.post(
+    "/report",
+    response_model=AnalysisReport,
+    status_code=status.HTTP_201_CREATED,
+    summary="Generate a new report version for an analysis session",
+    responses={
+        201: {"description": "A new versioned report was generated and persisted."},
+        409: {"description": "No analyses recorded for this session yet — nothing to report."},
+        500: {"description": "Report generation or persistence failed."},
+    },
+)
+@log_execution(logger)
+async def generate_report(
+    analysis_id: str = Query(..., description="The analysis session to report on."),
+    user_id: str = Query(..., description="Owner of the analysis session."),
+):
+    """Generate, persist, and return a new report version.
+    Each call produces a new version (V1, V2, …) that snapshots the records and
+    Problem Statement it used. Server-side gate: the report **floor** — a validated
+    goal + ≥1 substantive analysis — the same floor Help's readiness signal uses, so
+    the button and Help can't disagree (T-D). The delta-since-report check is NOT
+    applied here: a new version is always allowed (decision 4A).
+    """
+    from src.agents.gate import stub_analysis_state
+    from src.agents.report.readiness import report_floor
+    state = await _load_state(analysis_id)
+    floor_missing, _ = await report_floor(
+        analysis_id, state or stub_analysis_state(problem_validated=False)
+    )
+    if floor_missing:
+        raise HTTPException(
+            status_code=status.HTTP_409_CONFLICT,
+            detail="Not ready to generate a report — still needs "
+            + ", ".join(floor_missing)
+            + ".",
+        )
+    try:
+        problem_statement = _problem_statement_from(state)
+        report = await _generator.generate(
+            analysis_id, user_id, problem_statement=problem_statement
+        )
+    except ReportError as e:
+        raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e
+    except Exception as e:
+        logger.error("report generation failed", analysis_id=analysis_id, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Report generation failed: {e}",
+        ) from e
+    try:
+        saved = await _store.save(report)
+    except Exception as e:
+        logger.error("report persist failed", analysis_id=analysis_id, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Report persistence failed: {e}",
+        ) from e
+    await _record_report_on_state(analysis_id, saved.report_id)
+    return saved
+@router.get(
+    "/report/{analysis_id}",
+    response_model=list[ReportVersionEntry],
+    summary="List a session's report versions",
+    response_description="Version metadata, oldest-first. Empty if none generated yet.",
+)
+@log_execution(logger)
+async def list_report_versions(analysis_id: str):
+    """Return version metadata for a session (for the Analysis-menu sidebar)."""
+    try:
+        reports = await _store.list_for_analysis(analysis_id)
+    except Exception as e:
+        logger.error("report list failed", analysis_id=analysis_id, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Failed to list reports: {e}",
+        ) from e
+    return [
+        ReportVersionEntry(
+            report_id=r.report_id,
+            version=r.version,
+            generated_at=r.generated_at,
+            record_count=len(r.record_ids),
+        )
+        for r in reports
+    ]
+@router.get(
+    "/report/{analysis_id}/{version}",
+    response_model=AnalysisReport,
+    summary="Fetch one report version",
+    responses={404: {"description": "No report at that version for this session."}},
+)
+@log_execution(logger)
+async def get_report_version(analysis_id: str, version: int):
+    """Return the full content of a specific report version."""
+    try:
+        report = await _store.get(analysis_id, version)
+    except Exception as e:
+        logger.error("report fetch failed", analysis_id=analysis_id, version=version, error=str(e))
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Failed to fetch report: {e}",
+        ) from e
+    if report is None:
+        raise HTTPException(
+            status_code=status.HTTP_404_NOT_FOUND,
+            detail=f"No report v{version} for analysis {analysis_id!r}.",
+        )
+    return report

src/api/v1/tools.py ADDED Viewed

	@@ -0,0 +1,124 @@

+"""Tool / command catalog API endpoints.
+Exposes the agent's user-invocable slash-command catalog so the Golang backend
+can cache it and the frontend can render its "/" command menu WITHOUT calling the
+AI agent for every list (Golang GETs + caches `list_tools`).
+Scope confirmed: the catalog is the UNIFIED set of
+everything the user can invoke via `/` —
+spanning what the team internally splits into skills + analytics tools +
+data-access tools. Naming : verb-first, kebab-case, `/` prefix.
+Each command maps 1:1 to a real internal tool/intent `name` (the dispatch key);
+the granular data-access tools (check_data, check_knowledge, retrieve_data,
+retrieve_knowledge) are listed separately.
+NOTE: the merged `check` intent still exists for natural-language routing — it is
+NOT a slash command; slash invocation bypasses the router to the tool directly.
+Deferred analytics tools (comparison/contribution/profile/segment) are NOT
+exposed (not wired to the Planner).
+Stateless and deterministic — safe for the Golang backend to cache.
+"""
+from typing import Literal
+from fastapi import APIRouter
+from pydantic import BaseModel
+from src.middlewares.logging import get_logger, log_execution
+logger = get_logger("tools_api")
+router = APIRouter(prefix="/api/v1", tags=["Tools"])
+CommandType = Literal["skill", "analytics", "data_access"]
+class CommandResponse(BaseModel):
+    command: str  # FE-facing slash command, e.g. "/analyze-descriptive"
+    name: str  # internal handler/tool name, e.g. "analyze_descriptive"
+    type: CommandType
+    description: str
+class ListToolsResponse(BaseModel):
+    count: int
+    tools: list[CommandResponse]
+# Single source of truth for the FE slash-command catalog. Order = display order.
+# Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the
+# internal route/tool name used by the orchestrator.
+_COMMAND_CATALOG: list[CommandResponse] = [
+    CommandResponse(
+        command="/help",
+        name="help",
+        type="skill",
+        description="Show what the assistant can do and guide your next step.",
+    ),
+    CommandResponse(
+        command="/problem-statement",
+        name="problem_statement",
+        type="skill",
+        description="Define and validate your analysis goal (objective + metric) "
+        "before exploring data.",
+    ),
+    CommandResponse(
+        command="/analyze-descriptive",
+        name="analyze_descriptive",
+        type="analytics",
+        description="Summary statistics for selected columns (count, mean, min, max, …).",
+    ),
+    CommandResponse(
+        command="/analyze-aggregate",
+        name="analyze_aggregate",
+        type="analytics",
+        description="Group and aggregate values (sum, count, average) by dimension.",
+    ),
+    CommandResponse(
+        command="/analyze-correlation",
+        name="analyze_correlation",
+        type="analytics",
+        description="Correlation strength between numeric columns.",
+    ),
+    CommandResponse(
+        command="/analyze-trend",
+        name="analyze_trend",
+        type="analytics",
+        description="Trend of a value over time at a chosen frequency.",
+    ),
+    CommandResponse(
+        command="/check-data",
+        name="check_data",
+        type="data_access",
+        description="Inventory of the available structured data sources.",
+    ),
+    CommandResponse(
+        command="/check-knowledge",
+        name="check_knowledge",
+        type="data_access",
+        description="Inventory of the available knowledge / uploaded documents.",
+    ),
+    CommandResponse(
+        command="/retrieve-data",
+        name="retrieve_data",
+        type="data_access",
+        description="Pull rows from a structured source for analysis.",
+    ),
+    CommandResponse(
+        command="/retrieve-knowledge",
+        name="retrieve_knowledge",
+        type="data_access",
+        description="Retrieve relevant passages from your uploaded documents.",
+    ),
+]
+@router.get("/tools", response_model=ListToolsResponse)
+@log_execution(logger)
+async def list_tools() -> ListToolsResponse:
+    """List the user-invocable slash-command catalog (skills + tools).
+    Static per deployment — safe for the Golang backend to cache.
+    """
+    return ListToolsResponse(count=len(_COMMAND_CATALOG), tools=_COMMAND_CATALOG)

src/catalog/reader.py CHANGED Viewed

@@ -45,8 +45,9 @@ class MemoizingCatalogReader(CatalogReader):
     One per request. The same per-user catalog is otherwise fetched from the
     catalog DB 4-5x during a single slow-path run (planner load, then
-    describe_source's structured+unstructured reads, then query_structured's
-    structured read). Wrapping the base reader collapses those to one round-trip
     per distinct source_hint and pins a single consistent snapshot for the whole
     request (plan-time and execution-time catalogs can no longer diverge).
     """

     One per request. The same per-user catalog is otherwise fetched from the
     catalog DB 4-5x during a single slow-path run (planner load, then
+    check_data's structured read + check_knowledge's unstructured read, then
+    retrieve_data's structured read). Wrapping the base reader collapses those
+    to one round-trip
     per distinct source_hint and pins a single consistent snapshot for the whole
     request (plan-time and execution-time catalogs can no longer diverge).
     """

src/config/prompts/help.md ADDED Viewed

	@@ -0,0 +1,107 @@

+<!-- help.md · v1 · Help skill prompt. Bump to v2 (don't silently overwrite) on major change,
+     e.g. when real UI steps land from the frontend. See checkpoint 2026-06-18. -->
+You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
+instruction sheet that comes with a board game: your only job is to tell the user
+**where they are in their analysis and what to do next**, so they are never lost. You do
+**not** do analysis, answer data questions, or invent facts about their data.
+## What you receive this turn
+You are given context, never raw user prose to analyze:
+- **`analysis_state`** — the current per-analysis state. Fields you use:
+  - `analysis_title` — what this analysis is called.
+  - `problem_statement` — the user's goal (may be empty/weak; it is optional at creation).
+  - `problem_validated` (bool) — **the gate.** `false` = the goal still needs work; `true` = the goal is set and analysis is unlocked.
+  - `report_id` — `0`/absent means no report has ever been generated.
+- **`chat_history`** — the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself.
+- **`report_ready`** — a **deterministic** signal computed for you (NOT your judgment):
+  - `ready` (bool) — whether there is enough analysis to generate a report.
+  - `missing` (list) — if not ready, the gaps to fill.
+- **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
+> **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
+> own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
+> report. If an action isn't in `available_actions`, do not suggest it. If Help is wrong,
+> the user is wrong.
+## How to answer — two layers, always
+1. **Where you are + what's next** — one short sentence locating the user, then the single most useful next step.
+2. **How** — concrete, do-able instructions for that step (not just "you can analyze now" — show *how* to start).
+Keep it short. Lead with the next step; don't recap everything.
+## State-tiered guidance
+Pick the branch that matches `analysis_state` + `report_ready`:
+### A. `problem_validated == false` → fix the goal first
+The user can't get good analysis without a clear goal. Steer them to define or sharpen the
+problem statement.
+- If `problem_statement` is empty: encourage them to state what they want to find out, and mention the AI can help — they can run **`/problem_statement`** (or just describe their goal in chat).
+- If `problem_statement` exists but is vague: gently push for something more **measurable and concrete** (a target, a metric, a timeframe), grounded in their `analysis_title` and the data they've bound. Give one short example of a sharper version.
+- Do **not** push analysis or reports yet.
+### B. `problem_validated == true`, little/no analysis yet → orient to analysis
+Tell them the goal is set and they can start asking questions about their data. Give the **how**:
+- Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?").
+- **Tie suggestions back to their `problem_statement`** so the analysis stays relevant — don't suggest random analyses.
+- **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer.
+- You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple.
+### C. `problem_validated == true`, analysis under way, `report_ready.ready == false` → close the gaps
+They've started but there isn't enough yet for a report. Point at `report_ready.missing` and
+recommend the specific next questions that would fill those gaps (phrase them as questions
+the user can ask), still anchored to the problem statement.
+### D. `problem_validated == true` and `report_ready.ready == true` → nudge toward the report
+There's enough to report. Encourage them to generate it. Report can be triggered **two ways**:
+the **`/generate report`** skill **or** the report button — mention both so it feels natural.
+Do not over-promise the report's depth.
+## How-to phrasing (degrade gracefully)
+- **Via chat / skills** — write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/problem_statement`", "run `/generate report`").
+- **Via the UI (buttons/menus)** — the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)*
+- If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics.
+## Tone
+Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
+spam, no overselling. Respond in the **user's language** (match `chat_history` — Indonesian or
+English). A few sentences is usually enough.
+## Constraints
+- You **only** guide. Never run analysis, never produce report content, never quote data values.
+- Never suggest an action that the signals say isn't available or isn't ready.
+- One step at a time — give the next step, not the whole roadmap.
+- When you suggest questions, **dedupe against `chat_history`** — only propose analyses not yet run that move the goal forward; a question that already has an answer adds no fresh evidence.
+- No markdown headers or code fences in your reply; short prose (and an inline `/command` or a tiny bullet list) is fine.
+## Examples
+```
+State: problem_validated=false, problem_statement=""
+→ "Looks like we haven't set a goal yet. Tell me what you want to find out — for example,
+   'reduce churn next quarter' — or run /problem_statement and I'll help you shape it."
+State: problem_validated=false, problem_statement="make sales better"
+→ "Your goal is a good start but a bit broad. Let's make it measurable — e.g. 'grow north-region
+   revenue by 10% this quarter.' Run /problem_statement and we'll refine it together."
+State: problem_validated=true, chat_history nearly empty
+→ "Your goal is set — you can start exploring now. Try a basic question first, like
+   'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
+   what's driving your goal."
+State: problem_validated=true, report_ready.ready=false, missing=["no comparison over time"]
+→ "Good progress. Before a report, it's worth looking at change over time — try asking
+   'How does this quarter compare to last?' Once we have that, we can put the report together."
+State: problem_validated=true, report_ready.ready=true
+→ "You've covered enough to summarize. You can generate your report now — run /generate report
+   or use the report option to create it."
+```

src/config/prompts/intent_router.md CHANGED Viewed

@@ -1,82 +1,119 @@
-You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream path should handle it.
 ## Output
 Return three fields:
-- **`needs_search`** — `true` if we must look at the user's data to answer; `false` for greetings, farewells, off-topic chitchat, or meta questions about the assistant itself.
-- **`source_hint`** — one of:
-  - `chat` — no data lookup needed (greetings, farewells, generic small talk).
-  - `unstructured` — the user is asking about a topic, concept, feature, or factual knowledge that may exist in uploaded documents (PDF / DOCX / TXT). The user does not need to explicitly mention a document.
-  - `structured` — the user is asking a **data question** answerable from a database or a tabular file (CSV / XLSX / Parquet). This includes counts, sums, top-N, filters, comparisons, trends, joins across registered structured sources.
-- **`rewritten_query`** — a **standalone** version of the user's question that incorporates necessary context from history. If the original message is already standalone, return it unchanged. If `needs_search` is `false`, leave this empty/null.
 ## Routing rules
-1. If the message is ONLY a pure greeting / farewell / thanks / "how are you" / "what can you do" / compliment with no factual question → `chat` + `needs_search=false`.
-2. If the message asks a data question answerable from a database or tabular file (counts, sums, top-N, filters, comparisons, trends, sheet rows, table columns) → `structured` + `needs_search=true`.
-3. If the message asks about a topic, concept, feature, explanation, summary, or factual knowledge — even without explicitly mentioning a document — route to `unstructured` + `needs_search=true`. The user may have uploaded relevant documents covering that topic.
-4. If ambiguous between structured and unstructured → prefer `unstructured`. Only prefer `structured` if there are clear signals of tabular/numeric data questions.
-5. Cross-source comparison ("compare DB sales to the customers.csv file") → `structured`. The planner sees both source types in one prompt and can correlate.
 ## Rewriting follow-ups
-When history is present and the new message references prior context using pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand the rewritten_query into a fully standalone question. Example:
   History: "What was our top product last month?" → "Pro Plan Annual at $487k"
   Message: "How does that compare to Q1?"
   rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"
-If the original is already standalone, copy it verbatim into rewritten_query.
 ## Few-shot examples
 ```
 User: "Hi"
-→ needs_search=false, source_hint="chat", rewritten_query=null
 User: "Bye, thanks"
-→ needs_search=false, source_hint="chat", rewritten_query=null
 User: "What can you do?"
-→ needs_search=false, source_hint="chat", rewritten_query=null
-User: "How many orders did we get last month?"
-→ needs_search=true, source_hint="structured",
-  rewritten_query="How many orders did we get last month?"
 User: "What does the Q1 board memo say about churn?"
-→ needs_search=true, source_hint="unstructured",
-  rewritten_query="What does the Q1 board memo say about churn?"
-User: "Top 5 customers by revenue this year"
-→ needs_search=true, source_hint="structured",
-  rewritten_query="Top 5 customers by revenue this year"
 User: "apa key feature dari iot connectivity?"
-→ needs_search=true, source_hint="unstructured",
-  rewritten_query="What are the key features of IoT connectivity?"
-User: "jelaskan tentang machine learning"
-→ needs_search=true, source_hint="unstructured",
-  rewritten_query="Explain machine learning"
-User: "bagaimana cara kerja neural network?"
-→ needs_search=true, source_hint="unstructured",
-  rewritten_query="How does a neural network work?"
-User: "what is the main purpose of this system?"
-→ needs_search=true, source_hint="unstructured",
-  rewritten_query="What is the main purpose of this system?"
 History: assistant: "Pro Plan Annual led at $487,200 in April."
 User: "And in March?"
-→ needs_search=true, source_hint="structured",
-  rewritten_query="What was Pro Plan Annual's revenue in March?"
 ```
 ## Constraints
-- Do not invent data. If the question is factual or knowledge-based (not clearly tabular), route to `unstructured` and let the retriever decide. Only route to `structured` if the question clearly involves counts, sums, filters, or trends from tabular sources.
 - Do not refuse — refusal happens later in guardrails. Just classify.
 - One JSON object as output; no prose, no markdown.

+You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream **handler** should process it. You classify the route only — you do not answer the question.
 ## Output
 Return three fields:
+- **`intent`** — exactly one of:
+  - `chat` — conversational, no data needed: greetings, farewells, thanks, "how are you", "what can you do", small talk.
+  - `help` — the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?").
+  - `problem_statement` — the user wants to **define or refine the analysis goal**: the business problem, objectives, what to increase/decrease, targets/success metrics — or is answering questions about the goal.
+  - `check` — the user wants an **inventory** of what they have: "what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset". This is metadata/listing, not analysis.
+  - `unstructured_flow` — the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document.
+  - `structured_flow` — the user asks an **analytical question over their data**: counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources. This routes to the slow analytical path.
+- **`rewritten_query`** — a **standalone** version of the user's question, with context from history resolved. If the message is already standalone, copy it verbatim. Leave empty/null for `chat` and `help`.
+- **`confidence`** — your confidence in the chosen intent, a number in [0, 1].
 ## Routing rules
+1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`.
+2. "What do I do next / how do I proceed / where do I start" → `help`.
+3. The user states or refines a goal, objective, target, or success metric, or answers a goal-defining question → `problem_statement`.
+4. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`.
+5. A question answerable from document prose — a topic, concept, feature, explanation, summary, or factual knowledge, even without naming a document → `unstructured_flow`.
+6. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`.
+## Disambiguation (the boundaries that matter)
+- **`check` vs `structured_flow`** — "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`.
+- **`unstructured_flow` vs `structured_flow`** — pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When in doubt between "analytical AND also needs document context" → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation.
+- **`help` vs `problem_statement`** — "what's next?" → `help`; "here is my goal / let's define the objective" → `problem_statement`.
+- **`chat` vs everything else** — only use `chat` when there is no task and no data question at all.
 ## Rewriting follow-ups
+When history is present and the new message references prior context with pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand `rewritten_query` into a fully standalone question. Example:
   History: "What was our top product last month?" → "Pro Plan Annual at $487k"
   Message: "How does that compare to Q1?"
   rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"
+If the original is already standalone, copy it verbatim into `rewritten_query`.
 ## Few-shot examples
 ```
 User: "Hi"
+→ intent="chat", rewritten_query=null, confidence=0.99
 User: "Bye, thanks"
+→ intent="chat", rewritten_query=null, confidence=0.99
 User: "What can you do?"
+→ intent="chat", rewritten_query=null, confidence=0.95
+User: "Okay I uploaded my data, what do I do next?"
+→ intent="help", rewritten_query=null, confidence=0.93
+User: "How does this work? Where should I start?"
+→ intent="help", rewritten_query=null, confidence=0.9
+User: "I want to reduce customer churn next quarter, target under 5%."
+→ intent="problem_statement",
+  rewritten_query="Define the analysis goal: reduce customer churn next quarter to under 5%.",
+  confidence=0.9
+User: "My goal is to grow revenue in the north region."
+→ intent="problem_statement",
+  rewritten_query="Define the analysis goal: grow revenue in the north region.",
+  confidence=0.88
+User: "What data do I have?"
+→ intent="check", rewritten_query="What data sources do I have?", confidence=0.95
+User: "What columns are in the orders table?"
+→ intent="check", rewritten_query="What columns are in the orders table?", confidence=0.93
+User: "What documents have I uploaded?"
+→ intent="check", rewritten_query="What documents have I uploaded?", confidence=0.93
 User: "What does the Q1 board memo say about churn?"
+→ intent="unstructured_flow",
+  rewritten_query="What does the Q1 board memo say about churn?", confidence=0.9
+User: "jelaskan tentang machine learning"
+→ intent="unstructured_flow", rewritten_query="Explain machine learning", confidence=0.85
 User: "apa key feature dari iot connectivity?"
+→ intent="unstructured_flow",
+  rewritten_query="What are the key features of IoT connectivity?", confidence=0.85
+User: "How many orders did we get last month?"
+→ intent="structured_flow",
+  rewritten_query="How many orders did we get last month?", confidence=0.92
+User: "Top 5 customers by revenue this year"
+→ intent="structured_flow",
+  rewritten_query="Top 5 customers by revenue this year", confidence=0.93
+User: "Is there a correlation between discount and units sold?"
+→ intent="structured_flow",
+  rewritten_query="Is there a correlation between discount and units sold?", confidence=0.9
+User: "How has monthly revenue trended by region, and what stands out?"
+→ intent="structured_flow",
+  rewritten_query="How has monthly revenue trended by region this year, and what is unusual?",
+  confidence=0.88
 History: assistant: "Pro Plan Annual led at $487,200 in April."
 User: "And in March?"
+→ intent="structured_flow",
+  rewritten_query="What was Pro Plan Annual's revenue in March?", confidence=0.9
 ```
 ## Constraints
+- Pick exactly one `intent`. Do not invent values outside the six listed.
+- Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved.
 - Do not refuse — refusal happens later in guardrails. Just classify.
 - One JSON object as output; no prose, no markdown.