feat/Knowledge & Data Tools
#3
by rhbt6767 - opened
This view is limited to 50 files because it contains too many changes. See the raw diff here.
- ARCHITECTURE.md +2 -0
- CHECKPOINT_PLAN_2026-06-17.md +147 -0
- PROGRESS.md +32 -6
- eval/__init__.py +0 -0
- eval/intent/README.md +70 -0
- eval/intent/__init__.py +0 -0
- eval/intent/intent_dataset.json +56 -0
- eval/intent/results/.gitkeep +0 -0
- eval/intent/run_eval.py +384 -0
- eval/readiness/README.md +34 -0
- eval/readiness/__init__.py +0 -0
- eval/readiness/readiness_dataset.json +40 -0
- eval/readiness/results/.gitkeep +0 -0
- eval/readiness/results/readiness_result_2026-06-22_101645.json +268 -0
- eval/readiness/results/readiness_result_2026-06-22_143809.json +284 -0
- eval/readiness/run_eval.py +309 -0
- main.py +6 -0
- pyproject.toml +2 -0
- src/agents/binding_store.py +34 -0
- src/agents/chat_handler.py +254 -32
- src/agents/gate.py +108 -0
- src/agents/handlers/__init__.py +1 -0
- src/agents/handlers/check.py +165 -0
- src/agents/handlers/help.py +192 -0
- src/agents/handlers/problem_statement.py +171 -0
- src/agents/orchestration.py +42 -27
- src/agents/planner/examples.py +20 -22
- src/agents/planner/inputs.py +3 -3
- src/agents/planner/registry.py +39 -28
- src/agents/planner/service.py +1 -1
- src/agents/planner/validator.py +5 -5
- src/agents/report/__init__.py +9 -0
- src/agents/report/errors.py +7 -0
- src/agents/report/generator.py +363 -0
- src/agents/report/readiness.py +165 -0
- src/agents/report/schemas.py +91 -0
- src/agents/report/store.py +119 -0
- src/agents/slow_path/assembler.py +32 -1
- src/agents/slow_path/coordinator.py +4 -4
- src/agents/slow_path/schemas.py +12 -0
- src/agents/slow_path/store.py +78 -12
- src/agents/slow_path/task_runner.py +3 -0
- src/agents/state_store.py +128 -0
- src/api/v1/analysis.py +174 -0
- src/api/v1/chat.py +52 -9
- src/api/v1/report.py +189 -0
- src/api/v1/tools.py +124 -0
- src/catalog/reader.py +3 -2
- src/config/prompts/help.md +107 -0
- src/config/prompts/intent_router.md +76 -39
ARCHITECTURE.md
CHANGED
|
@@ -63,6 +63,8 @@ DB vs tabular is **not** a routing concern; it's a per-source attribute (`sou
|
|
| 63 |
|
| 64 |
## 3. Routing model
|
| 65 |
|
|
|
|
|
|
|
| 66 |
```
|
| 67 |
source_hint ∈ { chat, unstructured, structured }
|
| 68 |
```
|
|
|
|
| 63 |
|
| 64 |
## 3. Routing model
|
| 65 |
|
| 66 |
+
> **Superseded 2026-06-18**: the 3-way `source_hint` below was reworked into a flat **6-intent** handler router (`chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, `structured_flow`). Modality (structured vs unstructured *data*) is now the Planner's job, not the router's. See `ORCHESTRATOR_REWORK_PLAN.md`.
|
| 67 |
+
|
| 68 |
```
|
| 69 |
source_hint ∈ { chat, unstructured, structured }
|
| 70 |
```
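The flat router described in the superseded note can be pictured as a plain lookup from intent label to handler. This is only a sketch: the `Intent` enum, the `HANDLERS` table, the handler names, and the fallback to `chat` for unknown labels are all assumptions made for illustration, not code from this PR.

```python
from enum import Enum


class Intent(str, Enum):
    # The six intent labels listed in the superseded note.
    CHAT = "chat"
    HELP = "help"
    PROBLEM_STATEMENT = "problem_statement"
    CHECK = "check"
    UNSTRUCTURED_FLOW = "unstructured_flow"
    STRUCTURED_FLOW = "structured_flow"


# Hypothetical handler names: one flat entry per intent, no nesting.
HANDLERS = {
    Intent.CHAT: "chat_handler",
    Intent.HELP: "help_handler",
    Intent.PROBLEM_STATEMENT: "problem_statement_handler",
    Intent.CHECK: "check_handler",
    Intent.UNSTRUCTURED_FLOW: "unstructured_flow_handler",
    Intent.STRUCTURED_FLOW: "structured_flow_handler",
}


def route(label: str) -> str:
    """Return the handler name for a router label."""
    try:
        intent = Intent(label)
    except ValueError:
        # Assumed fallback: unknown labels are treated as plain chat.
        intent = Intent.CHAT
    return HANDLERS[intent]
```

A flat table keeps the router free of modality logic, which matches the note that structured vs unstructured data is decided later by the Planner.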
|
CHECKPOINT_PLAN_2026-06-17.md
ADDED
|
@@ -0,0 +1,147 @@
|
| 1 |
+
# Checkpoint Plan: Wednesday, 17 June 2026
|
| 2 |
+
|
| 3 |
+
Working plan for Sofhia & Rifqi based on the checkpoint with mas Harry on **Thursday, 11 June 2026**.
|
| 4 |
+
Goal: everything below is **merged and demo-able before the next sync on Wednesday, 17 June (afternoon)**.
|
| 5 |
+
|
| 6 |
+
**Updated at: Friday, 12 June 2026** (Sofhia + Rifqi)
|
| 7 |
+
|
| 8 |
+
> Source of truth for decisions is the meeting itself. Note: the NotebookLM summary is **stale on two points**: Data Availability Check was *eliminated* as a tool, and Success Metrics was *folded into* the Problem Statement template. Do not build either as a standalone skill.
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## 0. Progress (per Fri 12 Jun, Sofhia)
|
| 13 |
+
|
| 14 |
+
Dated snapshot of what landed this session. Live task status (incl. what's left) lives in §2 Ownership; this section only records the deltas + traceability.
|
| 15 |
+
|
| 16 |
+
- ✅ **Tool matrix** built (xlsx, all ~10 tools + status colours); presentation material ready.
|
| 17 |
+
- ✅ **Registry trimmed to 4 active analytics** (`KM-641`, commit `66e2e4d`): `ACTIVE_ANALYTICS_TOOLS` (descriptive, aggregate, correlation, trend) vs `DEFERRED_ANALYTICS_TOOLS` (comparison, contribution, profile, segment); specs + compute fns kept, only registry exposure withheld. Tests 206 pass, ruff/mypy clean.
|
| 18 |
+
- ✅ **Planner few-shot synced**: Example A `analyze_contribution` → `analyze_aggregate` (so few-shots don't reference a deferred tool).
|
| 19 |
+
- ✅ **Data-access tools renamed** (`KM-642`, commit `c38c0c2`): `query_structured` → `data_retrieve`, `retrieve_documents` → `knowledge_retrieve` across the tool layer + planner stub/prompt/validator/few-shots. Mechanical, no behavior change.
|
| 20 |
+
- ✅ **`data_check` merge + `knowledge_check`** (`KM-643`, commit `4bd5f1e`): `list_sources` + `describe_source` → one parameterized `data_check` (no arg = list structured sources; `source_id` = schema) + new `knowledge_check` (unstructured). Tests 206 pass.
|
| 21 |
+
- ✅ **Redis Cloud live** (free tier, TTL = 1 h), env vars shared in the group (Rifqi).
|
| 22 |
+
- ✅ **Planner tool list verified** against the trimmed registry: no references to old tool names or deferred analytics anywhere in `src/` (Rifqi).
|
| 23 |
+
- **Decision:** `tests/` stays gitignored; the team decided not to push tests to origin (closes PROGRESS.md R3 as won't-do).
|
| 24 |
+
- **Ownership:** Rifqi owns `generate_report` development + the `analysis_records` table / real `AnalysisStore` (contract still co-designed with Sofhia).
|
| 25 |
+
- ✅ **R5 cache fix** (Rifqi, `b701e95`): chat cache scoped by `user_id`, TTL 24h→1h.
|
| 26 |
+
- ✅ **AnalysisRecord persistence landed** (Rifqi): `stage` now flows to the record (CRISP-DM grouping for the report) + identity fields (`record_id`/`analysis_id`/`user_id`); `PostgresAnalysisStore` + `analysis_records` table replace `NullAnalysisStore`, wired into `ChatHandler`. Unblocks the `generate_report` renderer and the DoD "record persisted" step. Open: `analysis_id` handoff from Harry's Analysis State.
|
| 27 |
+
- ✅ **Verb-first tool naming** (Sofhia, commit `2d6406d`): the 4 data/knowledge tools renamed to lead with a verb: `data_check`→`check_data`, `knowledge_check`→`check_knowledge`, `data_retrieve`→`retrieve_data`, `knowledge_retrieve`→`retrieve_knowledge` (the `analyze_*` tools already lead with a verb). These verb-first names are now canonical; the tool-set table + §3 below use them. Dated log entries above keep the old names as historical record.
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 1. Locked decisions (from the 2026-06-11 checkpoint)
|
| 32 |
+
|
| 33 |
+
1. **Single chat page.** The separate interview/survey page is killed. Sidebar = Knowledge menu (connect/manage data) + Analysis menu (sessions).
|
| 34 |
+
2. **Data-first hard gate.** Creating a new analysis requires **≥ 1 bound data source** (server-side rejection, no empty sessions). User provides title + optional short description.
|
| 35 |
+
3. **Analysis State lives in the DB.** Per-analysis row: `user_id`, `data_source_ids[]`, `interview_status` (default `not_pass`), `report_status` (default `no_report` → `V1`, `V2`, …). Explicitly **NOT cached, NOT in Redis**; the Orchestrator reads it from Postgres every turn.
|
| 36 |
+
4. **Skills, not agents.** No separate interview agent. The Orchestrator routes per user turn using the Analysis State; an analytical request still executes through the existing Planner → TaskRunner → Assembler spine (static plan, no mid-run LLM).
|
| 37 |
+
5. **Interview = one skill: Problem Statement.** Success metrics become fields inside the PS template (what to increase/decrease + target). Data availability check is handled by the data-first creation gate + PS validation cross-checking fields against the bound catalog, not by a separate tool.
|
| 38 |
+
6. **Analytics focus = 4 tools:** descriptive, aggregate, correlation, trend. The other four composites (comparison, contribution, profile, segment) are **deprioritized, not deleted**: keep the code, just don't register them. If "comparison" returns later it should be a proper statistical **test**, not a generic compare.
|
| 39 |
+
7. **`describe_source` merges into the listing tool**: one call returns sources *with* their schema/metadata, so the planner sees fewer tools.
|
| 40 |
+
8. **Report = on-demand, button-triggered (not a chat skill).** A dedicated "Generate Report" button in the Analysis menu calls a **report API** (not the chat route): trigger generation for a session, list its versions, fetch a version. Renders from accumulated **AnalysisRecords + the Problem Statement**, never from chat history. Each report is a **persisted, versioned artifact**: generation snapshots the record IDs it used and bumps `report_status` to `V<n>`. (Owner: Rifqi, KM-644.)
|
| 41 |
+
9. **Help = deterministic guide.** No LLM: read Analysis State → tell the user the next required step. Callable in any state.
|
| 42 |
+
10. **Redis Cloud free tier, TTL = 1 hour**, env shared in the team group; for retrieval/query caching only, never for state.
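Decisions 2, 3 and 9 fit together, and a small sketch may make that concrete. The field names and defaults follow decision 3; everything else (the `AnalysisState` dataclass standing in for the Postgres row, `NoDataSourceError`, `create_analysis`, `next_required_step`, and the wording of the help messages) is hypothetical and only illustrates the intended behaviour.

```python
from dataclasses import dataclass


class NoDataSourceError(ValueError):
    """Raised when an analysis is created without a bound data source."""


@dataclass
class AnalysisState:
    # Stand-in for the per-analysis DB row (the real one lives in Postgres).
    user_id: str
    title: str
    data_source_ids: list[str]
    description: str = ""
    interview_status: str = "not_pass"  # default per decision 3
    report_status: str = "no_report"    # later V1, V2, ...


def create_analysis(user_id, title, data_source_ids, description=""):
    """Data-first hard gate: reject creation when no source is bound."""
    if not data_source_ids:
        raise NoDataSourceError(
            "Bind at least one data source before creating an analysis."
        )
    return AnalysisState(user_id, title, list(data_source_ids), description)


def next_required_step(state):
    """Deterministic help: map the stored state to the next step, no LLM."""
    if state.interview_status != "pass":
        return "Complete the Problem Statement interview."
    if state.report_status == "no_report":
        return "Run an analysis, then use the Generate Report button."
    return f"Report {state.report_status} exists; you can generate a new version."
```

Because `next_required_step` depends only on the stored state, calling it repeatedly returns the same answer until the state itself changes.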
|
| 43 |
+
|
| 44 |
+
### Final tool set (~10)
|
| 45 |
+
|
| 46 |
+
| Tool (canonical, verb-first) | Maps to (lineage) | Status |
|
| 47 |
+
|---|---|---|
|
| 48 |
+
| `check_knowledge` | new: list user's documents + metadata | done |
|
| 49 |
+
| `check_data` | `list_sources` + `describe_source` merged (catalog-backed) | done |
|
| 50 |
+
| `retrieve_knowledge` | `retrieve_documents` → `knowledge_retrieve` | done |
|
| 51 |
+
| `retrieve_data` | `query_structured` → `data_retrieve` (tabular: file + DB, both working) | done |
|
| 52 |
+
| `analyze_descriptive` | `src/tools/analytics/descriptive.py` | done |
|
| 53 |
+
| `analyze_aggregate` | `src/tools/analytics/aggregation.py` | done |
|
| 54 |
+
| `analyze_correlation` | `src/tools/analytics/relationship.py` | done |
|
| 55 |
+
| `analyze_trend` | `src/tools/analytics/temporal.py` | done |
|
| 56 |
+
| `problem_statement` | new: interview skill (**Harry**) | Harry |
|
| 57 |
+
| `generate_report` | new: on-demand, versioned | to design |
|
| 58 |
+
| `help` | new: deterministic state guide | to build |
|
| 59 |
+
|
| 60 |
+
(`problem_statement` + `help` live at the orchestrator level; `generate_report` is **button-triggered via a dedicated report API**, not chat-routed (decision #8). The TaskRunner registry holds the 4 analytics + 4 data/knowledge tools. Unregister `analyze_comparison`, `analyze_contribution`, `analyze_profile`, `analyze_segment` from the planner-visible registry, but keep the modules.)
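The "keep the modules, withhold registry exposure" rule can be sketched as follows. `ACTIVE_ANALYTICS_TOOLS` and `DEFERRED_ANALYTICS_TOOLS` are the names used in the progress log, but their shape here (tuples of tool names) is an assumption, and `DATA_KNOWLEDGE_TOOLS`, `planner_visible_tools` and `is_planner_visible` are hypothetical helpers.

```python
# Assumed shape: plain tuples of tool names.
ACTIVE_ANALYTICS_TOOLS = (
    "analyze_descriptive",
    "analyze_aggregate",
    "analyze_correlation",
    "analyze_trend",
)

# Code for these stays in the repo; they are simply not exposed.
DEFERRED_ANALYTICS_TOOLS = (
    "analyze_comparison",
    "analyze_contribution",
    "analyze_profile",
    "analyze_segment",
)

# Hypothetical constant for the four data/knowledge tools.
DATA_KNOWLEDGE_TOOLS = (
    "check_knowledge",
    "check_data",
    "retrieve_knowledge",
    "retrieve_data",
)


def planner_visible_tools() -> list[str]:
    """Tools the planner may reference: 4 data/knowledge + 4 analytics."""
    return [*DATA_KNOWLEDGE_TOOLS, *ACTIVE_ANALYTICS_TOOLS]


def is_planner_visible(tool_name: str) -> bool:
    """Deferred tools exist as modules but are never planner-visible."""
    return tool_name in planner_visible_tools()
```

A check like `is_planner_visible` is one way a plan validator could reject few-shots or plans that still mention a deferred tool.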
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## 2. Ownership
|
| 65 |
+
|
| 66 |
+
### Sofhia
|
| 67 |
+
- [x] 4 analytics tools: trim registry to 4 active, tests still pass after deprioritizing the other four. (`KM-641`, commit `66e2e4d`)
|
| 68 |
+
- [x] Data/knowledge tools: merge `describe_source` into `data_check`, rename `retrieve_documents` → `knowledge_retrieve`, `query_structured` → `data_retrieve`, build `knowledge_check`. (`KM-642` `c38c0c2`, `KM-643` `4bd5f1e`)
|
| 69 |
+
- [ ] Co-design `generate_report` contract with Rifqi (Rifqi owns development, see §3).
|
| 70 |
+
- [x] Tool matrix (see §4).
|
| 71 |
+
|
| 72 |
+
### Rifqi
|
| 73 |
+
- [x] **Redis Cloud free tier** (~30–50 MB): create instance, set TTL = 1 h, share env vars in the group. (done 12 Jun)
|
| 74 |
+
- [x] **R5 cache fix**: chat cache key scoped by `user_id`, TTL 24h→1h (urgent on shared Redis). (12 Jun, commit `b701e95`)
|
| 75 |
+
- [x] **AnalysisRecord contract gaps closed**: `stage` (CRISP-DM) now flows Task→TaskResult→TaskSummary so the report can group the method appendix; `AnalysisRecord` gained `record_id`/`analysis_id`/`user_id` identity fields. (12 Jun)
|
| 76 |
+
- [x] **`analysis_records` table + real `AnalysisStore`**: `PostgresAnalysisStore` (save + `list_for_analysis`, never-throw) replaces `NullAnalysisStore`; wired into `ChatHandler`, `user_id` stamped at save. Satisfies the DoD "record persisted" step. (12 Jun)
|
| 77 |
+
- [ ] **Own `generate_report` development: KM-644 "Report Generator"** (contract co-designed with Sofhia, see §3). Button-triggered via a dedicated **report API** (trigger / list versions / fetch); reads `analysis_records` + Problem Statement; persists a versioned report artifact, bumps `report_status`. *(record persistence done above; report API + persistence + renderer + contract doc next)*
|
| 78 |
+
- [x] Verify planner tool list matches the trimmed registry (4 analytics + 4 data/knowledge) and few-shots don't reference removed tools. (verified 12 Jun: no stale tool names in `src/`)
|
| 79 |
+
- ⚠️ **Blocked-on-Harry**: `analysis_id` is `NULL` on persisted records until the Analysis State reaches the slow path; we need the session-ID handoff so `generate_report` can group records per analysis.
|
| 80 |
+
|
| 81 |
+
### Shared (Sofhia + Rifqi)
|
| 82 |
+
- [ ] `generate_report` design + skeleton: input = AnalysisRecords for the session + Problem Statement from Analysis State; output = versioned artifact; bumps `report_status`. Agree on the contract even if rendering is stubbed for Wednesday. (Development: Rifqi.)
|
| 83 |
+
- [ ] `help` skill: deterministic; read Analysis State, return the next required step. Small, do it together or whoever finishes first.
|
| 84 |
+
- [ ] Tool behavior smoke test end-to-end on an easy case (descriptive/aggregate path), per Harry's ask: "robust tools before agents."
|
| 85 |
+
|
| 86 |
+
### Harry (dependencies: not ours, but we block on them)
|
| 87 |
+
- `problem_statement` skill + PS template (incl. increase/decrease target fields).
|
| 88 |
+
- Analysis State class + DB table, frontend analysis-builder step.
|
| 89 |
+
- Merging our PRs (he auto-merges; he clones from latest after).
|
| 90 |
+
|
| 91 |
+
---
|
| 92 |
+
|
| 93 |
+
## 3. Per-tool behavior contract (how to build each one)
|
| 94 |
+
|
| 95 |
+
Harry's framing: for every tool, define **goal / trigger / input / process / output**, and behave like a Claude-style skill: if a required argument is missing, respond with a polite feedback message asking for it (e.g. table/column name), never guess silently.
|
| 96 |
+
|
| 97 |
+
- **`check_knowledge`**: "what documents do I have?" → list documents with name, type, uploaded-at.
|
| 98 |
+
- **`check_data`**: "what data do I have?" → sources (file + DB) with schema/metadata from the data catalog, created/uploaded timestamps.
|
| 99 |
+
- **`retrieve_knowledge`**: RAG over uploaded documents; returns passages with source attribution.
|
| 100 |
+
- **`retrieve_data`**: query tabular data (file + DB) via QueryIR; output consumable by the `analyze_*` tools.
|
| 101 |
+
- **`analyze_*` (4)**: require valid table/column references; if missing or wrong, return actionable feedback instead of guessing.
|
| 102 |
+
- **`generate_report`**: button-triggered via a dedicated report API (not chat-routed); on-demand only (never auto); post-pass gated; renders from AnalysisRecords + PS; persists a versioned artifact, snapshots record IDs, bumps version. (KM-644, Rifqi.)
|
| 103 |
+
- **`help`**: no LLM; state → next step. Repeating it is fine, that's its job.
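The "ask, never guess" rule for the `analyze_*` tools could look roughly like the sketch below. Only the `analyze_descriptive` requirement (`data`, `column_ids`) comes from the repo notes; `ToolFeedback`, `REQUIRED_ARGS`, `check_arguments` and the message wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolFeedback:
    ok: bool
    message: str
    missing: list[str] = field(default_factory=list)


# Hypothetical lookup of required arguments per tool.
REQUIRED_ARGS = {
    "analyze_descriptive": ["data", "column_ids"],
}


def check_arguments(tool, args, known_columns):
    """Return actionable feedback instead of guessing missing or wrong inputs."""
    missing = [name for name in REQUIRED_ARGS[tool] if not args.get(name)]
    if missing:
        return ToolFeedback(
            ok=False,
            message=f"Please provide {', '.join(missing)} for {tool}.",
            missing=missing,
        )
    # Cross-check column references against the columns actually available.
    unknown = [c for c in args["column_ids"] if c not in known_columns]
    if unknown:
        available = ", ".join(sorted(known_columns))
        return ToolFeedback(
            ok=False,
            message=f"Unknown column(s): {', '.join(unknown)}. Available: {available}.",
        )
    return ToolFeedback(ok=True, message="ok")
```

The feedback object names exactly what is missing, so the orchestrator can turn it into a polite follow-up question rather than a failed run.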
|
| 104 |
+
|
| 105 |
+
---
|
| 106 |
+
|
| 107 |
+
## 4. Tool matrix (deliverable for the sync)
|
| 108 |
+
|
| 109 |
+
Harry explicitly asked for a matrix covering every tool. Produce one sheet/markdown table with columns:
|
| 110 |
+
|
| 111 |
+
`tool | goal | trigger (when the orchestrator calls it) | input | process | output | gated by interview_status? | status (done / in progress / planned)`
|
| 112 |
+
|
| 113 |
+
Use the tool set table in §1 as the row list. This doubles as the presentation material on Wednesday.
|
| 114 |
+
|
| 115 |
+
---
|
| 116 |
+
|
| 117 |
+
## 5. Day-by-day
|
| 118 |
+
|
| 119 |
+
| Day | Target |
|
| 120 |
+
|---|---|
|
| 121 |
+
| **Thu 11** | Checkpoint meeting + task split with Harry. |
|
| 122 |
+
| **Fri 12 (today)** | ✅ Registry trimmed to 4 analytics + few-shot synced (Sofhia, KM-641). ✅ Tool matrix built. ⏳ Redis Cloud + env share (Rifqi). |
|
| 123 |
+
| **Mon 15** | Data/knowledge tools done (`data_check` merge, renames, `knowledge_check`). `generate_report` contract agreed. |
|
| 124 |
+
| **Tue 16** | `help` skill done. `generate_report` skeleton wired to AnalysisRecord. Tool matrix drafted. End-to-end smoke test on the easy path. |
|
| 125 |
+
| **Wed 17 (AM)** | Buffer: fix fallout, finalize matrix, rehearse the demo flow. |
|
| 126 |
+
| **Wed 17 (PM)** | **Sync with Harry.** |
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## 6. Open questions to confirm with Harry on Wednesday
|
| 131 |
+
|
| 132 |
+
1. **Gate scope.** Proposal: keep the fast path + exploration tools (`check_knowledge`, `check_data`, retrieves, `help`, arguably `descriptive`) available **pre-pass**; gate only the insight tools (correlation, trend, report). Hard-gating everything risks frustrating users who just want to look at their data.
|
| 133 |
+
2. **Who flips `interview_status` to `pass`?** Proposal: a deterministic validator (PS template slots complete + fields cross-checked against the bound catalog) makes the call; the LLM conducts the conversation but never decides the pass. ("Conversational skin, deterministic skeleton.")
|
| 134 |
+
3. **Skills vs spine, one sentence to lock in writing:** *"Skills are registry tools executed by the existing Planner → TaskRunner → Assembler spine; the Analysis State gate is a pre-check in the Orchestrator."* This keeps the new flow and the locked architecture fully compatible.
|
| 135 |
+
4. `generate_report` invocation goes through the same gate (post-pass only); to be confirmed.
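The proposal in question 2 (a deterministic validator decides the pass) might be sketched like this. The slot names in `REQUIRED_SLOTS`, the shape of the Problem Statement as a dict, and `evaluate_interview` are assumptions; the actual PS template is owned by Harry and is not shown in this plan.

```python
# Hypothetical PS template slots, including the increase/decrease target fields.
REQUIRED_SLOTS = ("objective", "metric_column", "direction", "target")
ALLOWED_DIRECTIONS = ("increase", "decrease")


def evaluate_interview(problem_statement, catalog_columns):
    """Return (interview_status, problems) from slots and catalog alone, no LLM."""
    problems = []

    # 1. Every template slot must be filled.
    for slot in REQUIRED_SLOTS:
        if problem_statement.get(slot) in (None, ""):
            problems.append(f"missing slot: {slot}")

    # 2. The direction must be one of the two supported values.
    direction = problem_statement.get("direction")
    if direction and direction not in ALLOWED_DIRECTIONS:
        problems.append(f"direction must be one of {ALLOWED_DIRECTIONS}")

    # 3. Cross-check the metric against the bound catalog.
    column = problem_statement.get("metric_column")
    if column and column not in catalog_columns:
        problems.append(f"column not in bound catalog: {column}")

    status = "pass" if not problems else "not_pass"
    return status, problems
```

The returned `problems` list gives the conversational layer something concrete to ask about while the pass decision itself stays rule-based.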
|
| 136 |
+
|
| 137 |
+
---
|
| 138 |
+
|
| 139 |
+
## 7. Definition of done for Wednesday
|
| 140 |
+
|
| 141 |
+
- [ ] All team PRs merged; Harry unblocked on the Analysis State class.
|
| 142 |
+
- [ ] Registry exposes exactly 4 analytics + 4 data/knowledge tools, all passing local tests.
|
| 143 |
+
- [ ] Redis Cloud shared and working locally for all three of us (TTL 1 h).
|
| 144 |
+
- [ ] `help` works against a (possibly stubbed) Analysis State.
|
| 145 |
+
- [ ] `generate_report` contract written; skeleton callable.
|
| 146 |
+
- [ ] Tool matrix ready to present.
|
| 147 |
+
- [ ] One end-to-end happy path runs: create analysis (with data) → blocked pre-pass → interview stub passes → descriptive/aggregate answer → record persisted.
|
PROGRESS.md
CHANGED
|
@@ -2,8 +2,32 @@
|
|
| 2 |
|
| 3 |
Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team - division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
|
| 4 |
|
| 5 |
-
**Last updated**: 2026-06-
|
| 6 |
-
**Current open PR**: `pr/
|
| 7 |
|
| 8 |
---
|
| 9 |
|
|
@@ -39,12 +63,12 @@ Verified against code before logging. Severity: **critical** / important / nice-
|
|
| 39 |
|---|---|---|---|---|
|
| 40 |
| R1 | **AuthN/AuthZ** on data endpoints: reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation relies on client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused**: `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
|
| 41 |
| R2 | **Always compile a LIMIT**: `sql.py` now emits a bound for every query. An explicit limit is honored (clamped to `MAX_RESULT_ROWS=10000`); unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
|
| 42 |
-
| R3 | **Commit `tests/` + minimal CI**: `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). GitHub origin carries tests; HF Space gets the Docker build
|
| 43 |
| DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
|
| 44 |
| DB2 | **Keyed engine cache**: `src/database_client/engine.py::UserEngineCache` (process singleton) holds pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path, so no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` (caught live); moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
|
| 45 |
| DB3 | **Speculative pre-connect**: `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
|
| 46 |
| R4 | **Per-stage progress events**: `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence), which fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call. That doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
|
| 47 |
-
| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest.
|
| 48 |
| R6 | **Hard time budget**: wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
|
| 49 |
| R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning); stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
|
| 50 |
| R8 | **Catalog upsert race**: per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
|
|
@@ -129,7 +153,8 @@ LLM tokens; verified live to US Cloud.
|
|
| 129 |
wires Pattern A correctly; self-corrects via retry.
|
| 130 |
|
| 131 |
**Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
|
| 132 |
-
real `AnalysisStore`
|
|
|
|
| 133 |
or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
|
| 134 |
`BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
|
| 135 |
(`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
|
|
@@ -396,8 +421,9 @@ New scope after the original 42-item table; added as the tool layer landed (KM-6
|
|
| 396 |
| - | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627: canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
|
| 397 |
| - | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628: `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
|
| 398 |
| - | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629: `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
|
| 399 |
-
| - | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630: `DataAccessToolInvoker` exposes `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped; IR pre-built by the agent Planner). |
|
| 400 |
| - | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
|
|
|
|
| 401 |
|
| 402 |
### API surface
|
| 403 |
|
|
|
|
| 2 |
|
| 3 |
Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team - division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done.
|
| 4 |
|
| 5 |
+
**Last updated**: 2026-06-12 (Redis Cloud live; R3 closed as won't-do; R5 cache fix; AnalysisRecord persistence landed: `PostgresAnalysisStore` + `analysis_records` table)
|
| 6 |
+
**Current open PR**: `pr/3` (active).
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## What just shipped (2026-06-12: AnalysisRecord persistence, Rifqi)
|
| 11 |
+
|
| 12 |
+
Groundwork for `generate_report`. The slow path now persists a real, citable
|
| 13 |
+
record; the report (next) renders from it.
|
| 14 |
+
|
| 15 |
+
- **Contract gaps closed** (`agents/slow_path/schemas.py`): `stage: CrispStage`
|
| 16 |
+
added to `TaskResult` + `TaskSummary` and populated at all 3 `TaskResult` build
|
| 17 |
+
sites in `task_runner.py` + copied in `assembler._build_record`, so the report
|
| 18 |
+
can group its method appendix by CRISP-DM phase. `AnalysisRecord` gained identity:
|
| 19 |
+
`record_id` (auto uuid), `analysis_id`/`user_id` (optional; stamped at persist).
|
| 20 |
+
- **Real store** (`agents/slow_path/store.py`): `PostgresAnalysisStore` with
|
| 21 |
+
`save()` (never-throw, idempotent upsert) + `list_for_analysis()` (oldest-first,
|
| 22 |
+
the report's render order). `NullAnalysisStore` kept (tests / disabled persistence).
|
| 23 |
+
`AnalysisStore` Protocol gained `list_for_analysis`.
|
| 24 |
+
- **Table** (`db/postgres/models.py`): `analysis_records` jsonb table (one row per
|
| 25 |
+
run, indexed by `analysis_id` + `user_id`); registered in `init_db.py`, created by
|
| 26 |
+
`create_all` on startup (no migration, per the `data_catalog` precedent).
|
| 27 |
+
- **Wired** (`agents/chat_handler.py`): default store flipped to `PostgresAnalysisStore`;
|
| 28 |
+
`user_id` stamped onto the record at the save site (in scope there).
|
| 29 |
+
- **Open**: `analysis_id` is `NULL` until Harry's Analysis State reaches the slow
|
| 30 |
+
path (session-ID handoff needed to group records per analysis).
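The store contract described above (never-throw `save()`, idempotent upsert, oldest-first `list_for_analysis()`) can be illustrated with an in-memory stand-in. This is a sketch under assumptions: the real `PostgresAnalysisStore` is not shown in this excerpt, the synchronous signatures, the `created_at` field and the boolean return of `save()` are guesses, and `InMemoryAnalysisStore` is purely hypothetical.

```python
import logging
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


@dataclass
class AnalysisRecord:
    # Identity fields named in the notes; created_at is an assumed ordering key.
    analysis_id: Optional[str] = None
    user_id: Optional[str] = None
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AnalysisStore(Protocol):
    def save(self, record: AnalysisRecord) -> bool: ...
    def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...


class InMemoryAnalysisStore:
    """Dict-backed stand-in that mimics the described store behaviour."""

    def __init__(self) -> None:
        self._rows: dict[str, AnalysisRecord] = {}

    def save(self, record: AnalysisRecord) -> bool:
        try:
            # Keyed by record_id, so saving the same record twice is an upsert.
            self._rows[record.record_id] = record
            return True
        except Exception:
            # Never-throw: a persistence failure must not break the chat turn.
            logger.exception("failed to persist analysis record")
            return False

    def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
        rows = [r for r in self._rows.values() if r.analysis_id == analysis_id]
        # Oldest first, which is the order the report renders in.
        return sorted(rows, key=lambda r: r.created_at)
```

Records saved with `analysis_id=None` never match a lookup, which mirrors the open item: until the session-ID handoff lands, persisted records cannot be grouped per analysis.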
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
|
|
| 63 |
|---|---|---|---|---|
|
| 64 |
| R1 | **AuthN/AuthZ** on data endpoints: reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation relies on client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused**: `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` |
|
| 65 |
| R2 | **Always compile a LIMIT**: `sql.py` now emits a bound for every query. An explicit limit is honored (clamped to `MAX_RESULT_ROWS=10000`); unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` |
|
| 66 |
+
| R3 | **Commit `tests/` + minimal CI**: `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). ~~GitHub origin carries tests; HF Space gets the Docker build.~~ **2026-06-12: team decided tests stay gitignored/local; closed as won't-do.** | **critical (process)** | shared | `[won't do]` |
|
| 67 |
| DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` |
|
| 68 |
| DB2 | **Keyed engine cache**: `src/database_client/engine.py::UserEngineCache` (process singleton) holds pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path, so no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` (caught live); moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` |
|
| 69 |
| DB3 | **Speculative pre-connect** - `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` |
|
| 70 |
| R4 | **Per-stage progress events** - `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) - fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call - that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` |
|
| 71 |
+
| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Was `chat:{room_id}:{message}`, 24h TTL, no user → cross-user replay + stale answers. **2026-06-12 (Rifqi):** key now `chat:{room_id}:{user_id}:{message}` via `_chat_cache_key()`, TTL 24h→1h (checkpoint decision) - urgent now that Redis is a shared Cloud instance. `DELETE /chat/cache` gained a required `user_id` param (frontend heads-up); room-wide clear pattern unchanged. **Still open:** catalog-version in key / invalidate-on-ingest. | important | B | `[~]` |
|
| 72 |
| R6 | **Hard time budget** - wrap `coordinator.run()` in `asyncio.wait_for` (60-90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` |
|
| 73 |
| R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) - stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` |
|
| 74 |
| R8 | **Catalog upsert race** - per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` |
|
|
|
|
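Row R6 above proposes wrapping `coordinator.run()` in `asyncio.wait_for`. A minimal sketch of that pattern follows, assuming a fallback string is returned on timeout; `run_with_budget` and the two stand-in coroutines are hypothetical names for illustration and are not part of the repo.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_budget(
    make_call: Callable[[], Awaitable[str]],
    budget_seconds: float,
    fallback: str,
) -> str:
    """Run one call under a hard deadline (hypothetical helper)."""
    try:
        # wait_for cancels the inner task once the deadline passes.
        return await asyncio.wait_for(make_call(), timeout=budget_seconds)
    except asyncio.TimeoutError:
        return fallback


async def _slow_run() -> str:
    # Stand-in for `coordinator.run()`; sleeps longer than the budget used below.
    await asyncio.sleep(0.2)
    return "full answer"


async def _fast_run() -> str:
    # Stand-in for a run that finishes well inside the budget.
    await asyncio.sleep(0)
    return "full answer"


timed_out = asyncio.run(run_with_budget(_slow_run, 0.05, "budget exceeded"))
in_time = asyncio.run(run_with_budget(_fast_run, 1.0, "budget exceeded"))
```

In the real handler the budget would presumably come from `Constraints.time_budget_seconds`, and the fallback would be whatever templated reply the team settles on.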
| 153 |
wires Pattern A correctly; self-corrects via retry.
|
| 154 |
|
| 155 |
**Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table +
|
| 156 |
+
real `AnalysisStore` (**Rifqi owns, 2026-06-12** - folded into `generate_report` work,
|
| 157 |
+
see `CHECKPOINT_PLAN_2026-06-17.md`); register data-access `ToolSpec`s upstream (`data_access_registry()`)
|
| 158 |
or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once
|
| 159 |
`BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot
|
| 160 |
(`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old
|
|
|
|
| 421 |
| - | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 - canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). |
|
| 422 |
| - | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 - `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). |
|
| 423 |
| - | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 - `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t<id>}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). |
|
| 424 |
+
| - | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 - `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped - IR pre-built by the agent Planner). **Superseded by KM-642/643** - renamed `data_retrieve`/`knowledge_retrieve` and `list_sources`+`describe_source` merged into `data_check` + new `knowledge_check`; see row below. |
|
| 425 |
| - | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. |
|
| 426 |
+
| - | Data/knowledge tool taxonomy (`tools/data_access.py`) | TAB | `[x]` | KM-642/643 (commits c38c0c2, 4bd5f1e) - renamed `query_structured`→`data_retrieve`, `retrieve_documents`→`knowledge_retrieve`; merged `list_sources`+`describe_source` → parameterized `data_check` (no arg = list structured sources; `source_id` = that source's schema) + new `knowledge_check` (unstructured/documents). Split mirrors the catalog's structured/unstructured slices. Planner stub/prompt/validator/few-shots synced; `DATA_ACCESS_TOOLS` guard kept in lockstep. Note: dated log entries above (e.g. the 2026-06-09 E2E) keep the old names as historical record. |
|
| 427 |
|
| 428 |
### API surface
|
| 429 |
|
eval/__init__.py
ADDED
|
File without changes
|
eval/intent/README.md
ADDED
|
@@ -0,0 +1,70 @@
|
| 1 |
+
# Intent-Routing Eval (E3)
|
| 2 |
+
|
| 3 |
+
Scores the live 6-intent router (`OrchestratorAgent.classify`) against a golden
|
| 4 |
+
dataset of labelled messages. Run it before any deploy that touches the router
|
| 5 |
+
prompt (`src/config/prompts/intent_router.md`) or its few-shots.
|
| 6 |
+
|
| 7 |
+
## Files
|
| 8 |
+
|
| 9 |
+
| File | What |
|
| 10 |
+
|---|---|
|
| 11 |
+
| `intent_dataset.json` | **Golden dataset**: `message` + known-correct `expected_intent` per case. The source of truth scoring compares against. |
|
| 12 |
+
| `run_eval.py` | Runner: calls the router per case, scores correctness, records latency + tokens. |
|
| 13 |
+
| `results/` | Timestamped run reports, one JSON per run (never overwritten). |
|
| 14 |
+
|
| 15 |
+
## Run
|
| 16 |
+
|
| 17 |
+
Run as a module (`-m`), not the file path; module mode puts the repo root on
|
| 18 |
+
`sys.path` so `src` imports resolve; `python eval/intent/run_eval.py` fails.
|
| 19 |
+
|
| 20 |
+
```bash
|
| 21 |
+
uv run python -m eval.intent.run_eval # full dataset
|
| 22 |
+
uv run python -m eval.intent.run_eval --limit 6 # quick smoke test
|
| 23 |
+
uv run python -m eval.intent.run_eval --langfuse # also stream traces to Langfuse
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
Needs a populated `.env` (Azure OpenAI), since it calls the live model and spends
|
| 27 |
+
tokens. Output: a per-case detail table + an aggregate summary in the terminal,
|
| 28 |
+
and `results/eval_result_<timestamp>.json`.
|
| 29 |
+
|
| 30 |
+
**Tracking is the committed result files, not Langfuse**: the JSON reports in
|
| 31 |
+
`results/` are the versionable audit trail (see below). `--langfuse` is an
|
| 32 |
+
*optional* extra: when set, each case is also sent as a Langfuse trace (grouped
|
| 33 |
+
under one `intent_eval_<ts>` session) with an `intent_correct` 1/0 score, so the
|
| 34 |
+
same run is browsable in the Langfuse dashboard. It is off by default and the
|
| 35 |
+
eval runs fully without Langfuse configured.
|
| 36 |
+
|
| 37 |
+
## What's measured
|
| 38 |
+
|
| 39 |
+
- **correctness**: overall + per-intent + per-language accuracy (`got == expected`)
|
| 40 |
+
- **runtime**: average ms per case
|
| 41 |
+
- **tokens**: input / output / total (read from the model response, no Langfuse)
|
| 42 |
+
|
| 43 |
+
## Commit convention for `results/`
|
| 44 |
+
|
| 45 |
+
The reports are **versionable**, not a scratch log:
|
| 46 |
+
|
| 47 |
+
- **Do commit** a result after a meaningful change, e.g. a new
|
| 48 |
+
`intent_router.md` version, or new dataset cases. The new timestamped file
|
| 49 |
+
*adds* to the history; old files are never replaced. This is how we answer
|
| 50 |
+
"did accuracy improve after prompt v2?": diff two committed result files.
|
| 51 |
+
- **Don't commit** throwaway runs while iterating. Just leave them unstaged or
|
| 52 |
+
delete them.
|
| 53 |
+
|
| 54 |
+
So the audit trail = prompt versions (in `src/config/prompts/`) lined up against
|
| 55 |
+
the committed result files here.
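As a rough illustration of diffing two committed result files, the sketch below compares two reports by stable case `id`. It assumes each report has a `cases` list whose entries carry `id` and `correct`, as the runner's report builder emits; `diff_runs` is a hypothetical helper, and the two inline dicts are made-up stand-ins for real files under `results/`.

```python
from typing import Any


def diff_runs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, list[str]]:
    """Compare two eval reports case-by-case using the stable `id` field."""
    old_ok = {case["id"]: case["correct"] for case in old["cases"]}
    new_ok = {case["id"]: case["correct"] for case in new["cases"]}
    shared = sorted(old_ok.keys() & new_ok.keys())
    return {
        # Wrong before, right now.
        "fixed": [i for i in shared if not old_ok[i] and new_ok[i]],
        # Right before, wrong now.
        "regressed": [i for i in shared if old_ok[i] and not new_ok[i]],
        # Cases that exist only in the newer dataset.
        "added": sorted(new_ok.keys() - old_ok.keys()),
    }


# Made-up stand-ins for two committed reports; real ones would be loaded with json.loads.
run_v1 = {"cases": [
    {"id": "chat_01", "correct": True},
    {"id": "help_01", "correct": False},
]}
run_v2 = {"cases": [
    {"id": "chat_01", "correct": False},
    {"id": "help_01", "correct": True},
    {"id": "check_01", "correct": True},
]}
changes = diff_runs(run_v1, run_v2)
```

Keying on `id` rather than list position keeps the comparison valid when new cases are inserted in the middle of the dataset.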
|
| 56 |
+
|
| 57 |
+
## Dataset notes
|
| 58 |
+
|
| 59 |
+
- 6 intents: `chat`, `help`, `problem_statement`, `check`, `unstructured_flow`,
|
| 60 |
+
`structured_flow`. Each has 6-7 **distinct** scenarios (not EN/ID translation
|
| 61 |
+
pairs), balanced across English + Indonesian.
|
| 62 |
+
- `carried_over: true` rows mirror the pre-rework `intent_router.md` examples
|
| 63 |
+
(regression). `lang` enables per-language scoring. `id` is a stable handle for
|
| 64 |
+
diffing the same case across runs.
|
| 65 |
+
- Routing labels are decided from the question **phrasing**, not from which file
|
| 66 |
+
holds the answer (the router has no catalog access). See the `_grounding` note
|
| 67 |
+
in `intent_dataset.json`.
|
| 68 |
+
- Owner: Rifqi (structured/DB-grounded rows) + Sofhia (unstructured/document +
|
| 69 |
+
tabular-file rows). Merge both into this one file.
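Because two owners merge rows into one file, a quick sanity check before running the eval may help catch duplicate ids or mislabelled rows. The sketch below is one possible check, not part of the runner; `validate_cases` is a hypothetical helper, and the intent and language sets are copied from the notes above.

```python
from collections import Counter
from typing import Any

# Labels as listed in the dataset notes above.
INTENTS = {
    "chat", "help", "problem_statement",
    "check", "unstructured_flow", "structured_flow",
}
LANGS = {"en", "id"}


def validate_cases(cases: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the rows look sane."""
    problems: list[str] = []
    # `id` must be unique so the same case can be diffed across runs.
    id_counts = Counter(case.get("id") for case in cases)
    for case_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {case_id}")
    for case in cases:
        label = case.get("id", "<missing id>")
        if case.get("expected_intent") not in INTENTS:
            problems.append(f"{label}: unknown intent {case.get('expected_intent')!r}")
        if case.get("lang") not in LANGS:
            problems.append(f"{label}: unknown lang {case.get('lang')!r}")
        if not str(case.get("message", "")).strip():
            problems.append(f"{label}: empty message")
    return problems


# Made-up rows: one duplicate id and one misspelled intent.
sample = [
    {"id": "chat_01", "message": "Hi", "expected_intent": "chat", "lang": "en"},
    {"id": "chat_01", "message": "Bye, thanks", "expected_intent": "chat", "lang": "en"},
    {"id": "help_01", "message": "How does this work?", "expected_intent": "helper", "lang": "en"},
]
problems = validate_cases(sample)
```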
|
|
eval/intent/__init__.py
ADDED
|
File without changes
|
eval/intent/intent_dataset.json
ADDED
|
@@ -0,0 +1,56 @@
|
| 1 |
+
{
|
| 2 |
+
  "_about": "Golden intent dataset for the reworked 6-intent router (E3 eval - runs against the live LLM, not the mocked unit tests). Each of the 6 intents has 6-7 DISTINCT scenarios (not EN/ID translation pairs of one scenario) balanced across English + Indonesian, since users code-switch (technical terms often stay English: revenue, churn, upload, base_price). `carried_over` rows mirror the old intent_router.md examples for regression; the rest are new. `id` is a stable per-case handle so timestamped run files (eval_result_<ts>.json) can be diffed case-by-case across runs - it is NOT the run timestamp (that lives in the filename). `lang` enables per-language correctness scoring (aggregate, not matched-pairs).",
|
| 3 |
+
  "_grounding": "structured_flow + structured `check` rows are grounded in Sofhia's real test files: online_vs_offline_learning_dataset.csv (cols: Learning_Mode, Subject, Study_Hours, Retention_Score, Focus_Level, Exam_Score) and xl_knowledge_large.xlsx (telco product catalog: product_name, category, base_price, is_active, region_restriction...). unstructured_flow + document `check` rows are grounded in the IoT connectivity indicative-price PDF and the Internet of Things DOCX. IMPORTANT: routing labels are decided from the question PHRASING, not from which file holds the answer (the router has no catalog access) - so structured rows are clearly analytical (avg/correlation/count) and unstructured rows are clearly explanatory (jelaskan/summarize/features), with no price-lookup collisions. Rifqi's DB-grounded rows can be merged into this same file for variety; owner: Rifqi + Sofhia.",
|
| 4 |
+
  "_next_layer": "Not in this seed: (1) deliberate near-boundary cases (chat-vs-help, check-vs-structured); (2) follow-up/contextual handling - decide whether follow-ups route to `chat` or are out of scope.",
|
| 5 |
+
"schema": {
|
| 6 |
+
"id": "stable per-case handle, <intent>_<NN>",
|
| 7 |
+
"message": "the user utterance fed to the router",
|
| 8 |
+
"expected_intent": "one of: chat | help | problem_statement | check | unstructured_flow | structured_flow",
|
| 9 |
+
"lang": "en | id",
|
| 10 |
+
"carried_over": "true if mirrored from the pre-rework intent_router.md examples"
|
| 11 |
+
},
|
| 12 |
+
"cases": [
|
| 13 |
+
{ "id": "chat_01", "message": "Hi", "expected_intent": "chat", "lang": "en", "carried_over": true },
|
| 14 |
+
{ "id": "chat_02", "message": "Bye, thanks", "expected_intent": "chat", "lang": "en", "carried_over": true },
|
| 15 |
+
{ "id": "chat_03", "message": "What can you do?", "expected_intent": "chat", "lang": "en", "carried_over": true },
|
| 16 |
+
{ "id": "chat_04", "message": "Kamu bisa ngerti bahasa Indonesia gk?", "expected_intent": "chat", "lang": "id", "carried_over": false },
|
| 17 |
+
{ "id": "chat_05", "message": "Test, kebaca gak?", "expected_intent": "chat", "lang": "id", "carried_over": false },
|
| 18 |
+
{ "id": "chat_06", "message": "Oh paham2", "expected_intent": "chat", "lang": "id", "carried_over": false },
|
| 19 |
+
|
| 20 |
+
{ "id": "help_01", "message": "Okay I uploaded my data, what do I do next?", "expected_intent": "help", "lang": "en", "carried_over": false },
|
| 21 |
+
{ "id": "help_02", "message": "How does this work, where should I start?", "expected_intent": "help", "lang": "en", "carried_over": false },
|
| 22 |
+
{ "id": "help_03", "message": "How do I connect my database to this?", "expected_intent": "help", "lang": "en", "carried_over": false },
|
| 23 |
+
{ "id": "help_04", "message": "Setelah analisis selesai, aku bisa ngapain lagi?", "expected_intent": "help", "lang": "id", "carried_over": false },
|
| 24 |
+
{ "id": "help_05", "message": "Aku harus upload file dulu atau connect database dulu atau bisa langsung tanpa keduanya?", "expected_intent": "help", "lang": "id", "carried_over": false },
|
| 25 |
+
{ "id": "help_06", "message": "Cara bikin report-nya gimana deh?", "expected_intent": "help", "lang": "id", "carried_over": false },
|
| 26 |
+
|
| 27 |
+
{ "id": "ps_01", "message": "I want to reduce customer churn next quarter, target under 5%.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
|
| 28 |
+
{ "id": "ps_02", "message": "My goal is to improve online students' exam scores this semester.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
|
| 29 |
+
{ "id": "ps_03", "message": "We need to figure out which product categories to push next year.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false },
|
| 30 |
+
{ "id": "ps_04", "message": "Aku mau tau faktor apa yg paling ngaruh ke retention score siswa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
|
| 31 |
+
{ "id": "ps_05", "message": "Tujuanku naikin penjualan produk prepaid kuartal depan.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
|
| 32 |
+
{ "id": "ps_06", "message": "Aku pengen fokus benahin paket internet yang kurang laku di luar Jawa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false },
|
| 33 |
+
|
| 34 |
+
{ "id": "check_01", "message": "What data do I have?", "expected_intent": "check", "lang": "en", "carried_over": false },
|
| 35 |
+
{ "id": "check_02", "message": "What columns are in the online vs offline learning dataset?", "expected_intent": "check", "lang": "en", "carried_over": false },
|
| 36 |
+
{ "id": "check_03", "message": "Is the IoT connectivity pricing PDF already uploaded?", "expected_intent": "check", "lang": "en", "carried_over": false },
|
| 37 |
+
{ "id": "check_04", "message": "Kolom di tabel product master list apa aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
|
| 38 |
+
{ "id": "check_05", "message": "Dokumen apa aja yang udh aku upload?", "expected_intent": "check", "lang": "id", "carried_over": false },
|
| 39 |
+
{ "id": "check_06", "message": "Sumber dataku yang berupa database yg mana aja?", "expected_intent": "check", "lang": "id", "carried_over": false },
|
| 40 |
+
|
| 41 |
+
{ "id": "unstructured_01", "message": "apa key feature dari iot connectivity?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": true },
|
| 42 |
+
{ "id": "unstructured_02", "message": "Jelaskan tentang Internet of Things.", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
|
| 43 |
+
{ "id": "unstructured_03", "message": "Menurut dokumen IoT connectivity, paket apa aja yang ditawarkan?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false },
|
| 44 |
+
{ "id": "unstructured_04", "message": "What pricing tiers are in the IoT connectivity document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
|
| 45 |
+
{ "id": "unstructured_05", "message": "Summarize the key points from the IoT connectivity pricing document.", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
|
| 46 |
+
{ "id": "unstructured_06", "message": "What use cases of IoT are mentioned in the document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false },
|
| 47 |
+
|
| 48 |
+
{ "id": "structured_01", "message": "How many orders did we get last month?", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
|
| 49 |
+
{ "id": "structured_02", "message": "Top 5 customers by revenue this year", "expected_intent": "structured_flow", "lang": "en", "carried_over": true },
|
| 50 |
+
{ "id": "structured_03", "message": "What's the average exam score per learning mode?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
|
| 51 |
+
{ "id": "structured_04", "message": "Is there a correlation between study hours and exam score?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false },
|
| 52 |
+
{ "id": "structured_05", "message": "Rata-rata base price per kategori produk berapa?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
|
| 53 |
+
{ "id": "structured_06", "message": "Ada berapa produk yang masih aktif per kategori?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false },
|
| 54 |
+
{ "id": "structured_07", "message": "Bandingin retention score antara siswa online sama offline.", "expected_intent": "structured_flow", "lang": "id", "carried_over": false }
|
| 55 |
+
]
|
| 56 |
+
}
|
eval/intent/results/.gitkeep
ADDED
|
File without changes
|
eval/intent/run_eval.py
ADDED
|
@@ -0,0 +1,384 @@
|
| 1 |
+
"""Intent-routing eval runner (E3).
|
| 2 |
+
|
| 3 |
+
Feeds each golden case in `intent_dataset.json` to the live 6-intent router
|
| 4 |
+
(`OrchestratorAgent.classify`), then scores correctness + records latency and
|
| 5 |
+
token usage. Prints a per-case detail table and an aggregate summary, and
|
| 6 |
+
writes a timestamped JSON report under `results/` (never overwritten - one file
|
| 7 |
+
per run, so runs can be diffed over time).
|
| 8 |
+
|
| 9 |
+
Run before every deploy that touches the router prompt or its few-shots.
|
| 10 |
+
Invoke as a module (`-m`) so the repo root is on `sys.path` and `src` imports
|
| 11 |
+
resolve; running the file path directly (`python eval/intent/run_eval.py`)
|
| 12 |
+
puts only `eval/intent/` on the path and fails:
|
| 13 |
+
|
| 14 |
+
uv run python -m eval.intent.run_eval
|
| 15 |
+
uv run python -m eval.intent.run_eval --limit 6 # quick smoke test
|
| 16 |
+
|
| 17 |
+
Tokens come straight from the model response (LangChain `usage_metadata` via a
|
| 18 |
+
callback); no Langfuse needed. The router is called unmodified: it already
|
| 19 |
+
accepts a `callbacks=` list and forwards it into the chain config.
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
+
from __future__ import annotations
|
| 23 |
+
|
| 24 |
+
import argparse
|
| 25 |
+
import asyncio
|
| 26 |
+
import json
|
| 27 |
+
import statistics
|
| 28 |
+
import time
|
| 29 |
+
from dataclasses import asdict, dataclass
|
| 30 |
+
from datetime import datetime
|
| 31 |
+
from pathlib import Path
|
| 32 |
+
from typing import Any
|
| 33 |
+
|
| 34 |
+
from langchain_core.callbacks import BaseCallbackHandler
|
| 35 |
+
from langchain_core.outputs import LLMResult
|
| 36 |
+
|
| 37 |
+
from src.agents.orchestration import OrchestratorAgent
|
| 38 |
+
|
| 39 |
+
_HERE = Path(__file__).resolve().parent
|
| 40 |
+
DATASET = _HERE / "intent_dataset.json"
|
| 41 |
+
RESULTS_DIR = _HERE / "results"
|
| 42 |
+
|
| 43 |
+
INTENTS = [
|
| 44 |
+
"chat",
|
| 45 |
+
"help",
|
| 46 |
+
"problem_statement",
|
| 47 |
+
"check",
|
| 48 |
+
"unstructured_flow",
|
| 49 |
+
"structured_flow",
|
| 50 |
+
]
|
| 51 |
+
|
| 52 |
+
# Short labels so the EXPECT->GOT column stays narrow in the detail table.
|
| 53 |
+
_ABBR = {
|
| 54 |
+
"chat": "chat",
|
| 55 |
+
"help": "help",
|
| 56 |
+
"problem_statement": "prob_stmt",
|
| 57 |
+
"check": "check",
|
| 58 |
+
"unstructured_flow": "unstruct",
|
| 59 |
+
"structured_flow": "structF",
|
| 60 |
+
}
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
class _UsageCollector(BaseCallbackHandler):
|
| 64 |
+
"""Sums token usage across the LLM calls made during one classify().
|
| 65 |
+
|
| 66 |
+
Reads `usage_metadata` off each returned message (the canonical LangChain
|
| 67 |
+
field), falling back to `llm_output['token_usage']` for providers that only
|
| 68 |
+
populate the legacy field.
|
| 69 |
+
"""
|
| 70 |
+
|
| 71 |
+
def __init__(self) -> None:
|
| 72 |
+
self.input_tokens = 0
|
| 73 |
+
self.output_tokens = 0
|
| 74 |
+
self.total_tokens = 0
|
| 75 |
+
|
| 76 |
+
def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
|
| 77 |
+
before = self.total_tokens
|
| 78 |
+
for generation_list in response.generations:
|
| 79 |
+
for generation in generation_list:
|
| 80 |
+
message = getattr(generation, "message", None)
|
| 81 |
+
usage = getattr(message, "usage_metadata", None) if message else None
|
| 82 |
+
if usage:
|
| 83 |
+
self.input_tokens += usage.get("input_tokens", 0)
|
| 84 |
+
self.output_tokens += usage.get("output_tokens", 0)
|
| 85 |
+
self.total_tokens += usage.get("total_tokens", 0)
|
| 86 |
+
if self.total_tokens == before and response.llm_output:
|
| 87 |
+
usage = response.llm_output.get("token_usage") or {}
|
| 88 |
+
self.input_tokens += usage.get("prompt_tokens", 0)
|
| 89 |
+
self.output_tokens += usage.get("completion_tokens", 0)
|
| 90 |
+
self.total_tokens += usage.get("total_tokens", 0)
|
| 91 |
+
|
| 92 |
+
@property
|
| 93 |
+
def tokens(self) -> dict[str, int]:
|
| 94 |
+
return {
|
| 95 |
+
"input": self.input_tokens,
|
| 96 |
+
"output": self.output_tokens,
|
| 97 |
+
"total": self.total_tokens,
|
| 98 |
+
}
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
@dataclass
|
| 102 |
+
class CaseResult:
|
| 103 |
+
id: str
|
| 104 |
+
lang: str
|
| 105 |
+
message: str
|
| 106 |
+
expected: str
|
| 107 |
+
got: str
|
| 108 |
+
correct: bool
|
| 109 |
+
latency_ms: int
|
| 110 |
+
tokens: dict[str, int]
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def load_cases(path: Path) -> list[dict[str, Any]]:
|
| 114 |
+
"""Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
|
| 115 |
+
data = json.loads(path.read_text(encoding="utf-8"))
|
| 116 |
+
return list(data["cases"])
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
@dataclass
|
| 120 |
+
class _LangfuseCtx:
|
| 121 |
+
    """Optional Langfuse sink: one session groups all cases of a run."""
|
| 122 |
+
|
| 123 |
+
session_id: str
|
| 124 |
+
client: Any
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def _new_langfuse_handler(lf_ctx: _LangfuseCtx, case: dict[str, Any]) -> Any:
|
| 128 |
+
"""Per-case LangChain callback so each trace carries the case's labels."""
|
| 129 |
+
from langfuse.callback import CallbackHandler
|
| 130 |
+
|
| 131 |
+
from src.config.settings import settings
|
| 132 |
+
|
| 133 |
+
return CallbackHandler(
|
| 134 |
+
public_key=settings.LANGFUSE_PUBLIC_KEY,
|
| 135 |
+
secret_key=settings.LANGFUSE_SECRET_KEY,
|
| 136 |
+
host=settings.LANGFUSE_HOST,
|
| 137 |
+
session_id=lf_ctx.session_id,
|
| 138 |
+
trace_name=f"intent_eval/{case['id']}",
|
| 139 |
+
metadata={
|
| 140 |
+
"case_id": case["id"],
|
| 141 |
+
"expected": case["expected_intent"],
|
| 142 |
+
"lang": case["lang"],
|
| 143 |
+
},
|
| 144 |
+
tags=["intent-eval", case["expected_intent"], case["lang"]],
|
| 145 |
+
)
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
def _score_langfuse(lf_ctx: _LangfuseCtx, handler: Any, result: CaseResult) -> None:
|
| 149 |
+
"""Attach a 1/0 correctness score to the case's trace. Best-effort."""
|
| 150 |
+
try:
|
| 151 |
+
lf_ctx.client.score(
|
| 152 |
+
trace_id=handler.get_trace_id(),
|
| 153 |
+
name="intent_correct",
|
| 154 |
+
value=1 if result.correct else 0,
|
| 155 |
+
comment=f"{result.expected} -> {result.got}",
|
| 156 |
+
)
|
| 157 |
+
    except Exception:  # noqa: BLE001, S110 - scoring must never break the run
|
| 158 |
+
pass
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
async def run_case(
|
| 162 |
+
agent: OrchestratorAgent,
|
| 163 |
+
case: dict[str, Any],
|
| 164 |
+
lf_ctx: _LangfuseCtx | None = None,
|
| 165 |
+
) -> CaseResult:
|
| 166 |
+
"""Classify one message; never throws β a failed call is recorded as ERROR."""
|
| 167 |
+
collector = _UsageCollector()
|
| 168 |
+
callbacks: list[Any] = [collector]
|
| 169 |
+
lf_handler = _new_langfuse_handler(lf_ctx, case) if lf_ctx else None
|
| 170 |
+
if lf_handler is not None:
|
| 171 |
+
callbacks.append(lf_handler)
|
| 172 |
+
|
| 173 |
+
start = time.perf_counter()
|
| 174 |
+
got: str
|
| 175 |
+
try:
|
| 176 |
+
decision = await agent.classify(case["message"], callbacks=callbacks)
|
| 177 |
+
got = decision.intent
|
| 178 |
+
    except Exception as exc:  # noqa: BLE001 - one bad case shouldn't kill the run
|
| 179 |
+
got = f"ERROR:{type(exc).__name__}"
|
| 180 |
+
latency_ms = round((time.perf_counter() - start) * 1000)
|
| 181 |
+
|
| 182 |
+
result = CaseResult(
|
| 183 |
+
id=case["id"],
|
| 184 |
+
lang=case["lang"],
|
| 185 |
+
message=case["message"],
|
| 186 |
+
expected=case["expected_intent"],
|
| 187 |
+
got=got,
|
| 188 |
+
correct=got == case["expected_intent"],
|
| 189 |
+
latency_ms=latency_ms,
|
| 190 |
+
tokens=collector.tokens,
|
| 191 |
+
)
|
| 192 |
+
if lf_ctx is not None and lf_handler is not None:
|
| 193 |
+
_score_langfuse(lf_ctx, lf_handler, result)
|
| 194 |
+
return result
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def _group_accuracy(results: list[CaseResult], key: str) -> dict[str, dict[str, Any]]:
|
| 198 |
+
out: dict[str, dict[str, Any]] = {}
|
| 199 |
+
keys = INTENTS if key == "expected" else sorted({getattr(r, key) for r in results})
|
| 200 |
+
for k in keys:
|
| 201 |
+
sub = [r for r in results if getattr(r, key) == k]
|
| 202 |
+
if not sub:
|
| 203 |
+
continue
|
| 204 |
+
passed = sum(r.correct for r in sub)
|
| 205 |
+
out[k] = {
|
| 206 |
+
"n": len(sub),
|
| 207 |
+
"passed": passed,
|
| 208 |
+
"accuracy": round(passed / len(sub), 3),
|
| 209 |
+
}
|
| 210 |
+
return out
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
def summarize(results: list[CaseResult]) -> dict[str, Any]:
|
| 214 |
+
n = len(results)
|
| 215 |
+
passed = sum(r.correct for r in results)
|
| 216 |
+
latencies = [r.latency_ms for r in results]
|
| 217 |
+
tok_in = sum(r.tokens["input"] for r in results)
|
| 218 |
+
tok_out = sum(r.tokens["output"] for r in results)
|
| 219 |
+
tok_total = sum(r.tokens["total"] for r in results)
|
| 220 |
+
return {
|
| 221 |
+
"total": n,
|
| 222 |
+
"passed": passed,
|
| 223 |
+
"accuracy": round(passed / n, 3) if n else 0.0,
|
| 224 |
+
"runtime_avg_ms": round(statistics.mean(latencies)) if latencies else 0,
|
| 225 |
+
"runtime_total_s": round(sum(latencies) / 1000, 1),
|
| 226 |
+
"tokens": {
|
| 227 |
+
"input": tok_in,
|
| 228 |
+
"output": tok_out,
|
| 229 |
+
"total": tok_total,
|
| 230 |
+
"avg_total_per_case": round(tok_total / n) if n else 0,
|
| 231 |
+
},
|
| 232 |
+
"by_intent": _group_accuracy(results, "expected"),
|
| 233 |
+
"by_lang": _group_accuracy(results, "lang"),
|
| 234 |
+
}
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def _truncate(text: str, width: int) -> str:
|
| 238 |
+
text = text.replace("\n", " ")
|
| 239 |
+
return text if len(text) <= width else text[: width - 3] + "..."
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
def format_table(results: list[CaseResult]) -> str:
|
| 243 |
+
header = (
|
| 244 |
+
f"{'ID':<15} {'L':<3} {'QUESTION':<40} "
|
| 245 |
+
f"{'EXPECT->GOT':<20} {'OK':<3} {'MS':>5} {'TOK':>6}"
|
| 246 |
+
)
|
| 247 |
+
rule = "-" * len(header)
|
| 248 |
+
lines = [rule, header, rule]
|
| 249 |
+
for r in results:
|
| 250 |
+
exp_got = f"{_ABBR.get(r.expected, r.expected)}->{_ABBR.get(r.got, r.got)}"
|
| 251 |
+
ok = "ok" if r.correct else "X"
|
| 252 |
+
lines.append(
|
| 253 |
+
f"{r.id:<15} {r.lang:<3} {_truncate(r.message, 40):<40} "
|
| 254 |
+
f"{_truncate(exp_got, 20):<20} {ok:<3} {r.latency_ms:>5} {r.tokens['total']:>6}"
|
| 255 |
+
)
|
| 256 |
+
lines.append(rule)
|
| 257 |
+
return "\n".join(lines)
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
|
| 261 |
+
lines = ["SUMMARY"]
|
| 262 |
+
lines.append(
|
| 263 |
+
f" Overall {summary['passed']}/{summary['total']} correct"
|
| 264 |
+
f" ({summary['accuracy'] * 100:.1f}%)"
|
| 265 |
+
)
|
| 266 |
+
lines.append(
|
| 267 |
+
f" Runtime avg {summary['runtime_avg_ms']} ms"
|
| 268 |
+
f" | total {summary['runtime_total_s']} s"
|
| 269 |
+
)
|
| 270 |
+
tok = summary["tokens"]
|
| 271 |
+
lines.append(
|
| 272 |
+
f" Tokens avg {tok['avg_total_per_case']}"
|
| 273 |
+
f" | total {tok['total']} (in {tok['input']} / out {tok['output']})"
|
| 274 |
+
)
|
| 275 |
+
lines.append("")
|
| 276 |
+
lines.append(" By intent")
|
| 277 |
+
for intent, m in summary["by_intent"].items():
|
| 278 |
+
lines.append(
|
| 279 |
+
f" {intent:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%"
|
| 280 |
+
)
|
| 281 |
+
lines.append(" By language")
|
| 282 |
+
for lang, m in summary["by_lang"].items():
|
| 283 |
+
lines.append(
|
| 284 |
+
f" {lang:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%"
|
| 285 |
+
)
|
| 286 |
+
failures = [r for r in results if not r.correct]
|
| 287 |
+
lines.append("")
|
| 288 |
+
lines.append(f" FAILURES ({len(failures)})")
|
| 289 |
+
for r in failures:
|
| 290 |
+
lines.append(f" {r.id:<14} [{r.lang}] {r.expected:<12} -> {r.got}")
|
| 291 |
+
return "\n".join(lines)
|
| 292 |
+
|
| 293 |
+
|
| 294 |
+
def build_report(
|
| 295 |
+
results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
|
| 296 |
+
) -> dict[str, Any]:
|
| 297 |
+
run = {**meta, **{k: summary[k] for k in (
|
| 298 |
+
"total", "passed", "accuracy", "runtime_avg_ms", "runtime_total_s", "tokens"
|
| 299 |
+
)}}
|
| 300 |
+
return {
|
| 301 |
+
"run": run,
|
| 302 |
+
"by_intent": summary["by_intent"],
|
| 303 |
+
"by_lang": summary["by_lang"],
|
| 304 |
+
"cases": [asdict(r) for r in results],
|
| 305 |
+
}
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
def _model_name() -> str:
|
| 309 |
+
try:
|
| 310 |
+
from src.config.settings import settings
|
| 311 |
+
|
| 312 |
+
return str(settings.azureai_deployment_name_4o)
|
| 313 |
+
except Exception: # noqa: BLE001 - meta only; .env may be absent
|
| 314 |
+
return "gpt-4o"
|
| 315 |
+
|
| 316 |
+
|
| 317 |
+
async def main() -> None:
|
| 318 |
+
parser = argparse.ArgumentParser(description="Intent-routing eval (E3)")
|
| 319 |
+
parser.add_argument("--dataset", type=Path, default=DATASET)
|
| 320 |
+
parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
|
| 321 |
+
parser.add_argument("--prompt-version", default="intent_router.md")
|
| 322 |
+
parser.add_argument("--no-table", action="store_true", help="skip the detail table")
|
| 323 |
+
parser.add_argument(
|
| 324 |
+
"--langfuse", action="store_true",
|
| 325 |
+
help="also send each case as a Langfuse trace + correctness score",
|
| 326 |
+
)
|
| 327 |
+
args = parser.parse_args()
|
| 328 |
+
|
| 329 |
+
cases = load_cases(args.dataset)
|
| 330 |
+
if args.limit:
|
| 331 |
+
cases = cases[: args.limit]
|
| 332 |
+
|
| 333 |
+
started = datetime.now()
|
| 334 |
+
print(f"Intent Routing Eval -- {started:%Y-%m-%d %H:%M:%S}")
|
| 335 |
+
print(f"dataset: {args.dataset.name} ({len(cases)}) model: {_model_name()} "
|
| 336 |
+
f"prompt: {args.prompt_version}")
|
| 337 |
+
|
| 338 |
+
lf_ctx: _LangfuseCtx | None = None
|
| 339 |
+
if args.langfuse:
|
| 340 |
+
try:
|
| 341 |
+
from src.observability.langfuse.langfuse import get_langfuse
|
| 342 |
+
|
| 343 |
+
lf_ctx = _LangfuseCtx(
|
| 344 |
+
session_id=f"intent_eval_{started:%Y%m%d_%H%M%S}",
|
| 345 |
+
client=get_langfuse(), # type: ignore[no-untyped-call]
|
| 346 |
+
)
|
| 347 |
+
print(f"langfuse: enabled (session {lf_ctx.session_id})")
|
| 348 |
+
except Exception as exc: # noqa: BLE001 - Langfuse is optional
|
| 349 |
+
print(f"langfuse: disabled ({type(exc).__name__}: {exc})")
|
| 350 |
+
|
| 351 |
+
agent = OrchestratorAgent()
|
| 352 |
+
results: list[CaseResult] = []
|
| 353 |
+
for case in cases:
|
| 354 |
+
results.append(await run_case(agent, case, lf_ctx))
|
| 355 |
+
|
| 356 |
+
if lf_ctx is not None:
|
| 357 |
+
try:
|
| 358 |
+
lf_ctx.client.flush()
|
| 359 |
+
except Exception: # noqa: BLE001, S110 - flush failure shouldn't fail the run
|
| 360 |
+
pass
|
| 361 |
+
|
| 362 |
+
summary = summarize(results)
|
| 363 |
+
if not args.no_table:
|
| 364 |
+
print(format_table(results))
|
| 365 |
+
print(format_summary(summary, results))
|
| 366 |
+
|
| 367 |
+
meta = {
|
| 368 |
+
"timestamp": started.isoformat(timespec="seconds"),
|
| 369 |
+
"dataset": args.dataset.name,
|
| 370 |
+
"model": _model_name(),
|
| 371 |
+
"prompt_version": args.prompt_version,
|
| 372 |
+
"langfuse_session": lf_ctx.session_id if lf_ctx else None,
|
| 373 |
+
}
|
| 374 |
+
report = build_report(results, summary, meta)
|
| 375 |
+
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
|
| 376 |
+
out_path = RESULTS_DIR / f"eval_result_{started:%Y-%m-%d_%H%M%S}.json"
|
| 377 |
+
out_path.write_text(
|
| 378 |
+
json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
|
| 379 |
+
)
|
| 380 |
+
print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
|
| 381 |
+
|
| 382 |
+
|
| 383 |
+
if __name__ == "__main__":
|
| 384 |
+
asyncio.run(main())
|
eval/readiness/README.md
ADDED
|
@@ -0,0 +1,34 @@
|
| 1 |
+
# Report-readiness eval
|
| 2 |
+
|
| 3 |
+
Scores the deterministic `is_report_ready` signal (`src/agents/report/readiness.py`)
|
| 4 |
+
that the Help skill consumes to decide whether to nudge the user toward generating a
|
| 5 |
+
report. No LLM, no DB: each golden case declares an analysis state + a set of
|
| 6 |
+
persisted records/reports, and the runner feeds them through `is_report_ready` via
|
| 7 |
+
injectable fake stores.
|
| 8 |
+
|
| 9 |
+
## Run
|
| 10 |
+
|
| 11 |
+
```bash
|
| 12 |
+
uv run python -m eval.readiness.run_eval
|
| 13 |
+
uv run python -m eval.readiness.run_eval --limit 5 # smoke test
|
| 14 |
+
uv run python -m eval.readiness.run_eval --no-table # summary only
|
| 15 |
+
```
|
| 16 |
+
|
| 17 |
+
Each run writes a timestamped `results/readiness_result_<ts>.json` (never
|
| 18 |
+
overwritten, diffable across runs).
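Because result files are never overwritten, two runs can be compared directly. The sketch below is one hedged way to do that, assuming the `by_group` layout (`n`, `passed`, `accuracy`) seen in the committed result files; the helper names `load_result` and `group_accuracy_changes` are hypothetical and not part of the repo.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_result(path: Path) -> dict[str, Any]:
    """Read one results/readiness_result_<ts>.json file."""
    return json.loads(path.read_text(encoding="utf-8"))


def group_accuracy_changes(
    old: dict[str, Any], new: dict[str, Any]
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Map each group to (old_accuracy, new_accuracy) where the two runs differ.

    A group present in only one run shows None on the other side.
    """
    changes: dict[str, tuple[Optional[float], Optional[float]]] = {}
    for group in sorted(set(old["by_group"]) | set(new["by_group"])):
        before = old["by_group"].get(group, {}).get("accuracy")
        after = new["by_group"].get(group, {}).get("accuracy")
        if before != after:
            changes[group] = (before, after)
    return changes
```

An empty result means no group's accuracy moved between the two runs; changes in `n` alone (for example a newly added case that passes) are deliberately not reported by this sketch.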
|
| 19 |
+
|
| 20 |
+
## What it measures
|
| 21 |
+
|
| 22 |
+
- **Floor correctness**: exact `ready` + `missing` for the deterministic floor
|
| 23 |
+
(validated goal · ≥1 substantive record · delta-since-report). Should sit at ~100%;
|
| 24 |
+
this is the regression guard as criteria evolve.
|
| 25 |
+
- **Alignment gap**: `alignment` cases have substantive records (floor says
|
| 26 |
+
`ready=true`) but `aligned=false`: the analyses don't address the problem statement.
|
| 27 |
+
The floor can't see this. The gap count is the evidence for/against adding the
|
| 28 |
+
deferred LLM-judge: "ship the floor, earn the judge."
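The three-part floor described above can be sketched as follows. This is an illustration only, not the code in `src/agents/report/readiness.py`: the `Record` and `Case` types, the `floor_ready` name, and the short `missing` keys (`problem`, `analysis`, `delta`, taken from the dataset schema) are assumptions, and the real function reads from stores rather than from plain objects.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    analysis: str  # "success" | "failure" | "none", as in the dataset schema
    age_min: int   # minutes ago; smaller = newer


@dataclass
class Case:
    problem_validated: bool
    records: list[Record] = field(default_factory=list)
    report_ages: list[int] = field(default_factory=list)  # age_min per report


def floor_ready(case: Case) -> tuple[bool, list[str]]:
    """Return (ready, missing) for the deterministic floor."""
    missing: list[str] = []
    if not case.problem_validated:
        missing.append("problem")
    # Substantive = the analysis itself succeeded; findings alone do not count.
    substantive = [r for r in case.records if r.analysis == "success"]
    if not substantive:
        missing.append("analysis")
    elif case.report_ages:
        newest_report = min(case.report_ages)  # smallest age = newest report
        # Delta: at least one substantive record must be newer than that report.
        if not any(r.age_min < newest_report for r in substantive):
            missing.append("delta")
    return (not missing, missing)
```

Under these assumptions a failed retry after a report (the `delta_05` shape) stays not ready, because only successful records are considered when checking for something new.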
|
| 29 |
+
|
| 30 |
+
## Dataset
|
| 31 |
+
|
| 32 |
+
`readiness_dataset.json` - groups: `floor`, `delta`, `edge` (doc-only product
|
| 33 |
+
question), `alignment`. See the `_about` / `_alignment` doc keys in the file. The
|
| 34 |
+
`aligned` label is a semantic judgment; owner: Rifqi (report semantics) + Sofhia.
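As a rough illustration of how the `aligned` label feeds the alignment gap, the sketch below counts cases the floor calls ready but a human labelled as not aligned. The `alignment_gap` helper is hypothetical; it only assumes the `id`, `expected_ready`, and `aligned` keys from the dataset schema and mirrors the `count` / `ids` shape of the `alignment_gap` block in the result files.

```python
from typing import Any


def alignment_gap(cases: list[dict[str, Any]]) -> dict[str, Any]:
    """Collect cases where the floor says ready but the analyses are not aligned."""
    ids = [c["id"] for c in cases if c["expected_ready"] and not c["aligned"]]
    return {"count": len(ids), "ids": ids}


# Three trimmed-down cases: only the middle one is a gap.
sample = [
    {"id": "floor_04", "expected_ready": False, "aligned": False},
    {"id": "align_01", "expected_ready": True, "aligned": False},
    {"id": "align_03", "expected_ready": True, "aligned": True},
]
print(alignment_gap(sample))
```

A case that is not aligned but also not ready (like `floor_04`) is not a gap, since the floor already withholds the nudge there.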
|
eval/readiness/__init__.py
ADDED
|
File without changes
|
eval/readiness/readiness_dataset.json
ADDED
|
@@ -0,0 +1,40 @@
|
| 1 |
+
{
|
| 2 |
+
"_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge - see _alignment.",
|
| 3 |
+
"_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed - so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
|
| 4 |
+
"_records": "records[].analysis = 'success' (analyze_* succeeded -> substantive) | 'failure' (analyze_* failed, data-access still succeeded - the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task -> NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
|
| 5 |
+
"_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement - a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
|
| 6 |
+
"schema": {
|
| 7 |
+
"id": "stable per-case handle, <group>_<NN>",
|
| 8 |
+
"group": "floor | delta | edge | alignment",
|
| 9 |
+
"problem_validated": "bool",
|
| 10 |
+
"report_id": "null = never generated; a string = a report exists",
|
| 11 |
+
"records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
|
| 12 |
+
"reports": "[{ age_min: int }] (only meaningful when report_id set)",
|
| 13 |
+
"aligned": "bool - do the analyses address the problem statement? (floor ignores this)",
|
| 14 |
+
"expected_ready": "what the deterministic floor SHOULD return",
|
| 15 |
+
"expected_missing": "subset of [problem, analysis, delta]",
|
| 16 |
+
"note": "human-readable description"
|
| 17 |
+
},
|
| 18 |
+
"cases": [
|
| 19 |
+
{ "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" },
|
| 20 |
+
{ "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" },
|
| 21 |
+
{ "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" },
|
| 22 |
+
{ "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready - this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
|
| 23 |
+
{ "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass - must NOT be ready." },
|
| 24 |
+
{ "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report -> ready" },
|
| 25 |
+
{ "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses -> ready" },
|
| 26 |
+
{ "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis - the successful one is enough -> ready" },
|
| 27 |
+
|
| 28 |
+
{ "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it -> nothing new to report" },
|
| 29 |
+
{ "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report -> ready to regenerate" },
|
| 30 |
+
{ "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success -> ready" },
|
| 31 |
+
{ "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports -> newest wins; analysis older than newest report -> not ready" },
|
| 32 |
+
{ "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE -> no NEW substantive since the report -> not ready. A failed retry must not unlock a duplicate report." },
|
| 33 |
+
|
| 34 |
+
{ "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord -> never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
|
| 35 |
+
|
| 36 |
+
{ "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." },
|
| 37 |
+
{ "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" },
|
| 38 |
+
{ "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned -> genuinely ready, no gap" }
|
| 39 |
+
]
|
| 40 |
+
}
|
eval/readiness/results/.gitkeep
ADDED
|
File without changes
|
eval/readiness/results/readiness_result_2026-06-22_101645.json
ADDED
|
@@ -0,0 +1,268 @@
|
| 1 |
+
{
|
| 2 |
+
"run": {
|
| 3 |
+
"timestamp": "2026-06-22T10:16:45",
|
| 4 |
+
"dataset": "readiness_dataset.json",
|
| 5 |
+
"target": "src/agents/report/readiness.is_report_ready",
|
| 6 |
+
"total": 16,
|
| 7 |
+
"passed": 16,
|
| 8 |
+
"accuracy": 1.0,
|
| 9 |
+
"runtime_avg_ms": 0.0
|
| 10 |
+
},
|
| 11 |
+
"alignment_gap": {
|
| 12 |
+
"count": 2,
|
| 13 |
+
"ids": [
|
| 14 |
+
"align_01",
|
| 15 |
+
"align_02"
|
| 16 |
+
]
|
| 17 |
+
},
|
| 18 |
+
"by_group": {
|
| 19 |
+
"floor": {
|
| 20 |
+
"n": 8,
|
| 21 |
+
"passed": 8,
|
| 22 |
+
"accuracy": 1.0
|
| 23 |
+
},
|
| 24 |
+
"delta": {
|
| 25 |
+
"n": 4,
|
| 26 |
+
"passed": 4,
|
| 27 |
+
"accuracy": 1.0
|
| 28 |
+
},
|
| 29 |
+
"edge": {
|
| 30 |
+
"n": 1,
|
| 31 |
+
"passed": 1,
|
| 32 |
+
"accuracy": 1.0
|
| 33 |
+
},
|
| 34 |
+
"alignment": {
|
| 35 |
+
"n": 3,
|
| 36 |
+
"passed": 3,
|
| 37 |
+
"accuracy": 1.0
|
| 38 |
+
}
|
| 39 |
+
},
|
| 40 |
+
"cases": [
|
| 41 |
+
{
|
| 42 |
+
"id": "floor_01",
|
| 43 |
+
"group": "floor",
|
| 44 |
+
"expected_ready": false,
|
| 45 |
+
"got_ready": false,
|
| 46 |
+
"expected_missing": [
|
| 47 |
+
"a validated problem statement",
|
| 48 |
+
"at least one completed analysis"
|
| 49 |
+
],
|
| 50 |
+
"got_missing": [
|
| 51 |
+
"a validated problem statement",
|
| 52 |
+
"at least one completed analysis"
|
| 53 |
+
],
|
| 54 |
+
"correct": true,
|
| 55 |
+
"aligned": false,
|
| 56 |
+
"gap": false,
|
| 57 |
+
"latency_ms": 0.0
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"id": "floor_02",
|
| 61 |
+
"group": "floor",
|
| 62 |
+
"expected_ready": false,
|
| 63 |
+
"got_ready": false,
|
| 64 |
+
"expected_missing": [
|
| 65 |
+
"a validated problem statement"
|
| 66 |
+
],
|
| 67 |
+
"got_missing": [
|
| 68 |
+
"a validated problem statement"
|
| 69 |
+
],
|
| 70 |
+
"correct": true,
|
| 71 |
+
"aligned": true,
|
| 72 |
+
"gap": false,
|
| 73 |
+
"latency_ms": 0.0
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"id": "floor_03",
|
| 77 |
+
"group": "floor",
|
| 78 |
+
"expected_ready": false,
|
| 79 |
+
"got_ready": false,
|
| 80 |
+
"expected_missing": [
|
| 81 |
+
"at least one completed analysis"
|
| 82 |
+
],
|
| 83 |
+
"got_missing": [
|
| 84 |
+
"at least one completed analysis"
|
| 85 |
+
],
|
| 86 |
+
"correct": true,
|
| 87 |
+
"aligned": false,
|
| 88 |
+
"gap": false,
|
| 89 |
+
"latency_ms": 0.0
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"id": "floor_04",
|
| 93 |
+
"group": "floor",
|
| 94 |
+
"expected_ready": false,
|
| 95 |
+
"got_ready": false,
|
| 96 |
+
"expected_missing": [
|
| 97 |
+
"at least one completed analysis"
|
| 98 |
+
],
|
| 99 |
+
"got_missing": [
|
| 100 |
+
"at least one completed analysis"
|
| 101 |
+
],
|
| 102 |
+
"correct": true,
|
| 103 |
+
"aligned": false,
|
| 104 |
+
"gap": false,
|
| 105 |
+
"latency_ms": 0.0
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"id": "floor_05",
|
| 109 |
+
"group": "floor",
|
| 110 |
+
"expected_ready": false,
|
| 111 |
+
"got_ready": false,
|
| 112 |
+
"expected_missing": [
|
| 113 |
+
"at least one completed analysis"
|
| 114 |
+
],
|
| 115 |
+
"got_missing": [
|
| 116 |
+
"at least one completed analysis"
|
| 117 |
+
],
|
| 118 |
+
"correct": true,
|
| 119 |
+
"aligned": false,
|
| 120 |
+
"gap": false,
|
| 121 |
+
"latency_ms": 0.0
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"id": "floor_06",
|
| 125 |
+
"group": "floor",
|
| 126 |
+
"expected_ready": true,
|
| 127 |
+
"got_ready": true,
|
| 128 |
+
"expected_missing": [],
|
| 129 |
+
"got_missing": [],
|
| 130 |
+
"correct": true,
|
| 131 |
+
"aligned": true,
|
| 132 |
+
"gap": false,
|
| 133 |
+
"latency_ms": 0.0
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"id": "floor_07",
|
| 137 |
+
"group": "floor",
|
| 138 |
+
"expected_ready": true,
|
| 139 |
+
"got_ready": true,
|
| 140 |
+
"expected_missing": [],
|
| 141 |
+
"got_missing": [],
|
| 142 |
+
"correct": true,
|
| 143 |
+
"aligned": true,
|
| 144 |
+
"gap": false,
|
| 145 |
+
"latency_ms": 0.0
|
| 146 |
+
},
|
| 147 |
+
{
|
| 148 |
+
"id": "floor_08",
|
| 149 |
+
"group": "floor",
|
| 150 |
+
"expected_ready": true,
|
| 151 |
+
"got_ready": true,
|
| 152 |
+
"expected_missing": [],
|
| 153 |
+
"got_missing": [],
|
| 154 |
+
"correct": true,
|
| 155 |
+
"aligned": true,
|
| 156 |
+
"gap": false,
|
| 157 |
+
"latency_ms": 0.0
|
| 158 |
+
},
|
| 159 |
+
{
|
| 160 |
+
"id": "delta_01",
|
| 161 |
+
"group": "delta",
|
| 162 |
+
"expected_ready": false,
|
| 163 |
+
"got_ready": false,
|
| 164 |
+
"expected_missing": [
|
| 165 |
+
"a new analysis since the last report"
|
| 166 |
+
],
|
| 167 |
+
"got_missing": [
|
| 168 |
+
"a new analysis since the last report"
|
| 169 |
+
],
|
| 170 |
+
"correct": true,
|
| 171 |
+
"aligned": true,
|
| 172 |
+
"gap": false,
|
| 173 |
+
"latency_ms": 0.0
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"id": "delta_02",
|
| 177 |
+
"group": "delta",
|
| 178 |
+
"expected_ready": true,
|
| 179 |
+
"got_ready": true,
|
| 180 |
+
"expected_missing": [],
|
| 181 |
+
"got_missing": [],
|
| 182 |
+
"correct": true,
|
| 183 |
+
"aligned": true,
|
| 184 |
+
"gap": false,
|
| 185 |
+
"latency_ms": 0.0
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"id": "delta_03",
|
| 189 |
+
"group": "delta",
|
| 190 |
+
"expected_ready": true,
|
| 191 |
+
"got_ready": true,
|
| 192 |
+
"expected_missing": [],
|
| 193 |
+
"got_missing": [],
|
| 194 |
+
"correct": true,
|
| 195 |
+
"aligned": true,
|
| 196 |
+
"gap": false,
|
| 197 |
+
"latency_ms": 0.0
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"id": "delta_04",
|
| 201 |
+
"group": "delta",
|
| 202 |
+
"expected_ready": false,
|
| 203 |
+
"got_ready": false,
|
| 204 |
+
"expected_missing": [
|
| 205 |
+
"a new analysis since the last report"
|
| 206 |
+
],
|
| 207 |
+
"got_missing": [
|
| 208 |
+
"a new analysis since the last report"
|
| 209 |
+
],
|
| 210 |
+
"correct": true,
|
| 211 |
+
"aligned": true,
|
| 212 |
+
"gap": false,
|
| 213 |
+
"latency_ms": 0.0
|
| 214 |
+
},
|
| 215 |
+
{
|
| 216 |
+
"id": "edge_01",
|
| 217 |
+
"group": "edge",
|
| 218 |
+
"expected_ready": false,
|
| 219 |
+
"got_ready": false,
|
| 220 |
+
"expected_missing": [
|
| 221 |
+
"at least one completed analysis"
|
| 222 |
+
],
|
| 223 |
+
"got_missing": [
|
| 224 |
+
"at least one completed analysis"
|
| 225 |
+
],
|
| 226 |
+
"correct": true,
|
| 227 |
+
"aligned": false,
|
| 228 |
+
"gap": false,
|
| 229 |
+
"latency_ms": 0.0
|
| 230 |
+
},
|
| 231 |
+
{
|
| 232 |
+
"id": "align_01",
|
| 233 |
+
"group": "alignment",
|
| 234 |
+
"expected_ready": true,
|
| 235 |
+
"got_ready": true,
|
| 236 |
+
"expected_missing": [],
|
| 237 |
+
"got_missing": [],
|
| 238 |
+
"correct": true,
|
| 239 |
+
"aligned": false,
|
| 240 |
+
"gap": true,
|
| 241 |
+
"latency_ms": 0.0
|
| 242 |
+
},
|
| 243 |
+
{
|
| 244 |
+
"id": "align_02",
|
| 245 |
+
"group": "alignment",
|
| 246 |
+
"expected_ready": true,
|
| 247 |
+
"got_ready": true,
|
| 248 |
+
"expected_missing": [],
|
| 249 |
+
"got_missing": [],
|
| 250 |
+
"correct": true,
|
| 251 |
+
"aligned": false,
|
| 252 |
+
"gap": true,
|
| 253 |
+
"latency_ms": 0.0
|
| 254 |
+
},
|
| 255 |
+
{
|
| 256 |
+
"id": "align_03",
|
| 257 |
+
"group": "alignment",
|
| 258 |
+
"expected_ready": true,
|
| 259 |
+
"got_ready": true,
|
| 260 |
+
"expected_missing": [],
|
| 261 |
+
"got_missing": [],
|
| 262 |
+
"correct": true,
|
| 263 |
+
"aligned": true,
|
| 264 |
+
"gap": false,
|
| 265 |
+
"latency_ms": 0.0
|
| 266 |
+
}
|
| 267 |
+
]
|
| 268 |
+
}
|
eval/readiness/results/readiness_result_2026-06-22_143809.json
ADDED
|
@@ -0,0 +1,284 @@
|
| 1 |
+
{
|
| 2 |
+
"run": {
|
| 3 |
+
"timestamp": "2026-06-22T14:38:09",
|
| 4 |
+
"dataset": "readiness_dataset.json",
|
| 5 |
+
"target": "src/agents/report/readiness.is_report_ready",
|
| 6 |
+
"total": 17,
|
| 7 |
+
"passed": 17,
|
| 8 |
+
"accuracy": 1.0,
|
| 9 |
+
"runtime_avg_ms": 0.01
|
| 10 |
+
},
|
| 11 |
+
"alignment_gap": {
|
| 12 |
+
"count": 2,
|
| 13 |
+
"ids": [
|
| 14 |
+
"align_01",
|
| 15 |
+
"align_02"
|
| 16 |
+
]
|
| 17 |
+
},
|
| 18 |
+
"by_group": {
|
| 19 |
+
"floor": {
|
| 20 |
+
"n": 8,
|
| 21 |
+
"passed": 8,
|
| 22 |
+
"accuracy": 1.0
|
| 23 |
+
},
|
| 24 |
+
"delta": {
|
| 25 |
+
"n": 5,
|
| 26 |
+
"passed": 5,
|
| 27 |
+
"accuracy": 1.0
|
| 28 |
+
},
|
| 29 |
+
"edge": {
|
| 30 |
+
"n": 1,
|
| 31 |
+
"passed": 1,
|
| 32 |
+
"accuracy": 1.0
|
| 33 |
+
},
|
| 34 |
+
"alignment": {
|
| 35 |
+
"n": 3,
|
| 36 |
+
"passed": 3,
|
| 37 |
+
"accuracy": 1.0
|
| 38 |
+
}
|
| 39 |
+
},
|
| 40 |
+
"cases": [
|
| 41 |
+
{
|
| 42 |
+
"id": "floor_01",
|
| 43 |
+
"group": "floor",
|
| 44 |
+
"expected_ready": false,
|
| 45 |
+
"got_ready": false,
|
| 46 |
+
"expected_missing": [
|
| 47 |
+
"a validated problem statement",
|
| 48 |
+
"at least one completed analysis"
|
| 49 |
+
],
|
| 50 |
+
"got_missing": [
|
| 51 |
+
"a validated problem statement",
|
| 52 |
+
"at least one completed analysis"
|
| 53 |
+
],
|
| 54 |
+
"correct": true,
|
| 55 |
+
"aligned": false,
|
| 56 |
+
"gap": false,
|
| 57 |
+
"latency_ms": 0.0
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"id": "floor_02",
|
| 61 |
+
"group": "floor",
|
| 62 |
+
"expected_ready": false,
|
| 63 |
+
"got_ready": false,
|
| 64 |
+
"expected_missing": [
|
| 65 |
+
"a validated problem statement"
|
| 66 |
+
],
|
| 67 |
+
"got_missing": [
|
| 68 |
+
"a validated problem statement"
|
| 69 |
+
],
|
| 70 |
+
"correct": true,
|
| 71 |
+
"aligned": true,
|
| 72 |
+
"gap": false,
|
| 73 |
+
"latency_ms": 0.0
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"id": "floor_03",
|
| 77 |
+
"group": "floor",
|
| 78 |
+
"expected_ready": false,
|
| 79 |
+
"got_ready": false,
|
| 80 |
+
"expected_missing": [
|
| 81 |
+
"at least one completed analysis"
|
| 82 |
+
],
|
| 83 |
+
"got_missing": [
|
| 84 |
+
"at least one completed analysis"
|
| 85 |
+
],
|
| 86 |
+
"correct": true,
|
| 87 |
+
"aligned": false,
|
| 88 |
+
"gap": false,
|
| 89 |
+
"latency_ms": 0.0
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"id": "floor_04",
|
| 93 |
+
"group": "floor",
|
| 94 |
+
"expected_ready": false,
|
| 95 |
+
"got_ready": false,
|
| 96 |
+
"expected_missing": [
|
| 97 |
+
"at least one completed analysis"
|
| 98 |
+
],
|
| 99 |
+
"got_missing": [
|
| 100 |
+
"at least one completed analysis"
|
| 101 |
+
],
|
| 102 |
+
"correct": true,
|
| 103 |
+
"aligned": false,
|
| 104 |
+
"gap": false,
|
| 105 |
+
"latency_ms": 0.0
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"id": "floor_05",
|
| 109 |
+
"group": "floor",
|
| 110 |
+
"expected_ready": false,
|
| 111 |
+
"got_ready": false,
|
| 112 |
+
"expected_missing": [
|
| 113 |
+
"at least one completed analysis"
|
| 114 |
+
],
|
| 115 |
+
"got_missing": [
|
| 116 |
+
"at least one completed analysis"
|
| 117 |
+
],
|
| 118 |
+
"correct": true,
|
| 119 |
+
"aligned": false,
|
| 120 |
+
"gap": false,
|
| 121 |
+
"latency_ms": 0.0
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"id": "floor_06",
|
| 125 |
+
"group": "floor",
|
| 126 |
+
"expected_ready": true,
|
| 127 |
+
"got_ready": true,
|
| 128 |
+
"expected_missing": [],
|
| 129 |
+
"got_missing": [],
|
| 130 |
+
"correct": true,
|
| 131 |
+
"aligned": true,
|
| 132 |
+
"gap": false,
|
| 133 |
+
"latency_ms": 0.0
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"id": "floor_07",
|
| 137 |
+
"group": "floor",
|
| 138 |
+
"expected_ready": true,
|
| 139 |
+
"got_ready": true,
|
| 140 |
+
"expected_missing": [],
|
| 141 |
+
"got_missing": [],
|
| 142 |
+
"correct": true,
|
| 143 |
+
"aligned": true,
|
| 144 |
+
"gap": false,
|
| 145 |
+
"latency_ms": 0.0
|
| 146 |
+
},
|
| 147 |
+
{
|
| 148 |
+
"id": "floor_08",
|
| 149 |
+
"group": "floor",
|
| 150 |
+
"expected_ready": true,
|
| 151 |
+
"got_ready": true,
|
| 152 |
+
"expected_missing": [],
|
| 153 |
+
"got_missing": [],
|
| 154 |
+
"correct": true,
|
| 155 |
+
"aligned": true,
|
| 156 |
+
"gap": false,
|
| 157 |
+
"latency_ms": 0.1
|
| 158 |
+
},
|
| 159 |
+
{
|
| 160 |
+
"id": "delta_01",
|
| 161 |
+
"group": "delta",
|
| 162 |
+
"expected_ready": false,
|
| 163 |
+
"got_ready": false,
|
| 164 |
+
"expected_missing": [
|
| 165 |
+
"a new analysis since the last report"
|
| 166 |
+
],
|
| 167 |
+
"got_missing": [
|
| 168 |
+
"a new analysis since the last report"
|
| 169 |
+
],
|
| 170 |
+
"correct": true,
|
| 171 |
+
"aligned": true,
|
| 172 |
+
"gap": false,
|
| 173 |
+
"latency_ms": 0.0
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"id": "delta_02",
|
| 177 |
+
"group": "delta",
|
| 178 |
+
"expected_ready": true,
|
| 179 |
+
"got_ready": true,
|
| 180 |
+
"expected_missing": [],
|
| 181 |
+
"got_missing": [],
|
| 182 |
+
"correct": true,
|
| 183 |
+
"aligned": true,
|
| 184 |
+
"gap": false,
|
| 185 |
+
"latency_ms": 0.0
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"id": "delta_03",
|
| 189 |
+
"group": "delta",
|
| 190 |
+
"expected_ready": true,
|
| 191 |
+
"got_ready": true,
|
| 192 |
+
"expected_missing": [],
|
| 193 |
+
"got_missing": [],
|
| 194 |
+
"correct": true,
|
| 195 |
+
"aligned": true,
|
| 196 |
+
"gap": false,
|
| 197 |
+
"latency_ms": 0.0
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"id": "delta_04",
|
| 201 |
+
"group": "delta",
|
| 202 |
+
"expected_ready": false,
|
| 203 |
+
"got_ready": false,
|
| 204 |
+
"expected_missing": [
|
| 205 |
+
"a new analysis since the last report"
|
| 206 |
+
],
|
| 207 |
+
"got_missing": [
|
| 208 |
+
"a new analysis since the last report"
|
| 209 |
+
],
|
| 210 |
+
"correct": true,
|
| 211 |
+
"aligned": true,
|
| 212 |
+
"gap": false,
|
| 213 |
+
"latency_ms": 0.0
|
| 214 |
+
},
|
| 215 |
+
{
|
| 216 |
+
"id": "delta_05",
|
| 217 |
+
"group": "delta",
|
| 218 |
+
"expected_ready": false,
|
| 219 |
+
"got_ready": false,
|
| 220 |
+
"expected_missing": [
|
| 221 |
+
"a new analysis since the last report"
|
| 222 |
+
],
|
| 223 |
+
"got_missing": [
|
| 224 |
+
"a new analysis since the last report"
|
| 225 |
+
],
|
| 226 |
+
"correct": true,
|
| 227 |
+
"aligned": true,
|
| 228 |
+
"gap": false,
|
| 229 |
+
"latency_ms": 0.0
|
| 230 |
+
},
|
| 231 |
+
{
|
| 232 |
+
"id": "edge_01",
|
| 233 |
+
"group": "edge",
|
| 234 |
+
"expected_ready": false,
|
| 235 |
+
"got_ready": false,
|
| 236 |
+
"expected_missing": [
|
| 237 |
+
"at least one completed analysis"
|
| 238 |
+
],
|
| 239 |
+
"got_missing": [
|
| 240 |
+
"at least one completed analysis"
|
| 241 |
+
],
|
| 242 |
+
"correct": true,
|
| 243 |
+
"aligned": false,
|
| 244 |
+
"gap": false,
|
| 245 |
+
"latency_ms": 0.0
|
| 246 |
+
},
|
| 247 |
+
{
|
| 248 |
+
"id": "align_01",
|
| 249 |
+
"group": "alignment",
|
| 250 |
+
"expected_ready": true,
|
| 251 |
+
"got_ready": true,
|
| 252 |
+
"expected_missing": [],
|
| 253 |
+
"got_missing": [],
|
| 254 |
+
"correct": true,
|
| 255 |
+
"aligned": false,
|
| 256 |
+
"gap": true,
|
| 257 |
+
"latency_ms": 0.0
|
| 258 |
+
},
|
| 259 |
+
{
|
| 260 |
+
"id": "align_02",
|
| 261 |
+
"group": "alignment",
|
| 262 |
+
"expected_ready": true,
|
| 263 |
+
"got_ready": true,
|
| 264 |
+
"expected_missing": [],
|
| 265 |
+
"got_missing": [],
|
| 266 |
+
"correct": true,
|
| 267 |
+
"aligned": false,
|
| 268 |
+
"gap": true,
|
| 269 |
+
"latency_ms": 0.0
|
| 270 |
+
},
|
| 271 |
+
{
|
| 272 |
+
"id": "align_03",
|
| 273 |
+
"group": "alignment",
|
| 274 |
+
"expected_ready": true,
|
| 275 |
+
"got_ready": true,
|
| 276 |
+
"expected_missing": [],
|
| 277 |
+
"got_missing": [],
|
| 278 |
+
"correct": true,
|
| 279 |
+
"aligned": true,
|
| 280 |
+
"gap": false,
|
| 281 |
+
"latency_ms": 0.0
|
| 282 |
+
}
|
| 283 |
+
]
|
| 284 |
+
}
|
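The per-case rows in the results file above carry enough information to recompute the run's headline numbers. A minimal sketch, assuming a results file shaped like the one shown (a `cases` list whose entries have `id`, `correct`, and `gap`); the helper name `recompute_summary` and the inline sample are illustrative and not part of the PR:

```python
from typing import Any


def recompute_summary(cases: list[dict[str, Any]]) -> dict[str, Any]:
    """Recompute floor accuracy and the alignment-gap list from per-case rows."""
    total = len(cases)
    passed = sum(1 for c in cases if c["correct"])
    gap_ids = [c["id"] for c in cases if c["gap"]]
    return {
        "total": total,
        "passed": passed,
        "accuracy": round(passed / total, 3) if total else 0.0,
        "alignment_gap": {"count": len(gap_ids), "ids": gap_ids},
    }


# Three rows taken from the results above, trimmed to the fields used here.
sample = [
    {"id": "edge_01", "correct": True, "gap": False},
    {"id": "align_01", "correct": True, "gap": True},
    {"id": "align_03", "correct": True, "gap": False},
]
summary = recompute_summary(sample)
print(summary)
```

A case can be `correct` and still count toward the gap: `correct` scores the floor against its own expectations, while `gap` flags ready-but-misaligned cases the floor cannot see.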
eval/readiness/run_eval.py
ADDED
|
@@ -0,0 +1,309 @@
|
| 1 |
+
"""Report-readiness eval runner.
|
| 2 |
+
|
| 3 |
+
Feeds each golden case in `readiness_dataset.json` to the deterministic
|
| 4 |
+
`is_report_ready` signal (`src/agents/report/readiness.py`) via injectable FAKE
|
| 5 |
+
stores (no LLM, no DB), then scores both the boolean `ready` and the `missing`
|
| 6 |
+
gaps. Prints a per-case detail table + aggregate summary and writes a timestamped
|
| 7 |
+
JSON report under `results/` (never overwritten: one file per run, diffable).
|
| 8 |
+
|
| 9 |
+
Two metrics matter:
|
| 10 |
+
- FLOOR correctness (ready + missing exact): should be ~100%; this is the
|
| 11 |
+
regression guard as the criteria evolve.
|
| 12 |
+
- ALIGNMENT GAP: cases the floor calls ready=true but whose analyses are NOT
|
| 13 |
+
aligned to the problem statement (`aligned=false`). The floor can't see this;
|
| 14 |
+
the gap count is the evidence for/against adding the deferred LLM-judge.
|
| 15 |
+
|
| 16 |
+
Invoke as a module so `src` imports resolve:
|
| 17 |
+
|
| 18 |
+
uv run python -m eval.readiness.run_eval
|
| 19 |
+
uv run python -m eval.readiness.run_eval --limit 5
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
+
from __future__ import annotations
|
| 23 |
+
|
| 24 |
+
import argparse
|
| 25 |
+
import asyncio
|
| 26 |
+
import json
|
| 27 |
+
import statistics
|
| 28 |
+
import time
|
| 29 |
+
from dataclasses import asdict, dataclass, field
|
| 30 |
+
from datetime import UTC, datetime, timedelta
|
| 31 |
+
from pathlib import Path
|
| 32 |
+
from typing import Any
|
| 33 |
+
|
| 34 |
+
from src.agents.gate import stub_analysis_state
|
| 35 |
+
from src.agents.report.readiness import (
|
| 36 |
+
_MISSING_ANALYSIS,
|
| 37 |
+
_MISSING_DELTA,
|
| 38 |
+
_MISSING_PROBLEM,
|
| 39 |
+
is_report_ready,
|
| 40 |
+
)
|
| 41 |
+
|
| 42 |
+
_HERE = Path(__file__).resolve().parent
|
| 43 |
+
DATASET = _HERE / "readiness_dataset.json"
|
| 44 |
+
RESULTS_DIR = _HERE / "results"
|
| 45 |
+
GROUPS = ["floor", "delta", "edge", "alignment"]
|
| 46 |
+
|
| 47 |
+
# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
|
| 48 |
+
# from the module so the dataset stays readable and survives wording changes.
|
| 49 |
+
_CODE_TO_MISSING = {
|
| 50 |
+
"problem": _MISSING_PROBLEM,
|
| 51 |
+
"analysis": _MISSING_ANALYSIS,
|
| 52 |
+
"delta": _MISSING_DELTA,
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
@dataclass
|
| 57 |
+
class _FakeTask:
|
| 58 |
+
"""Mirrors slow_path.schemas.TaskSummary (the bits is_report_ready reads)."""
|
| 59 |
+
|
| 60 |
+
status: str # success | partial | failure
|
| 61 |
+
tools_used: list[str]
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
@dataclass
|
| 65 |
+
class _FakeRecord:
|
| 66 |
+
findings: list[Any]
|
| 67 |
+
created_at: datetime
|
| 68 |
+
tasks_run: list[_FakeTask]
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
@dataclass
|
| 72 |
+
class _FakeReport:
|
| 73 |
+
generated_at: datetime
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
class _FakeStore:
|
| 77 |
+
"""Stand-in for the Postgres record/report store; returns canned rows."""
|
| 78 |
+
|
| 79 |
+
def __init__(self, rows: list[Any]) -> None:
|
| 80 |
+
self._rows = rows
|
| 81 |
+
|
| 82 |
+
async def list_for_analysis(self, _analysis_id: str) -> list[Any]:
|
| 83 |
+
return self._rows
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
@dataclass
|
| 87 |
+
class CaseResult:
|
| 88 |
+
id: str
|
| 89 |
+
group: str
|
| 90 |
+
expected_ready: bool
|
| 91 |
+
got_ready: bool
|
| 92 |
+
expected_missing: list[str]
|
| 93 |
+
got_missing: list[str]
|
| 94 |
+
correct: bool
|
| 95 |
+
aligned: bool
|
| 96 |
+
gap: bool # floor said ready but analyses not aligned to the problem statement
|
| 97 |
+
latency_ms: float
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def load_cases(path: Path) -> list[dict[str, Any]]:
|
| 101 |
+
data = json.loads(path.read_text(encoding="utf-8"))
|
| 102 |
+
return list(data["cases"])
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def _build_tasks(analysis: str) -> list[_FakeTask]:
|
| 106 |
+
"""Realistic tasks_run: data-access always succeeds; the analyze_* task varies.
|
| 107 |
+
|
| 108 |
+
analysis = 'success' (analyze_* succeeded) | 'failure' (analyze_* failed) |
|
| 109 |
+
'none' (no analyze task at all; only check/retrieve succeeded).
|
| 110 |
+
"""
|
| 111 |
+
tasks = [
|
| 112 |
+
_FakeTask(status="success", tools_used=["check_data"]),
|
| 113 |
+
_FakeTask(status="success", tools_used=["retrieve_data"]),
|
| 114 |
+
]
|
| 115 |
+
if analysis == "success":
|
| 116 |
+
tasks.append(_FakeTask(status="success", tools_used=["analyze_aggregate"]))
|
| 117 |
+
elif analysis == "failure":
|
| 118 |
+
tasks.append(_FakeTask(status="failure", tools_used=["analyze_aggregate"]))
|
| 119 |
+
return tasks
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def _build_records(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRecord]:
|
| 123 |
+
return [
|
| 124 |
+
_FakeRecord(
|
| 125 |
+
findings=["f"] * int(spec.get("findings", 0)),
|
| 126 |
+
created_at=now - timedelta(minutes=int(spec["age_min"])),
|
| 127 |
+
tasks_run=_build_tasks(str(spec.get("analysis", "success"))),
|
| 128 |
+
)
|
| 129 |
+
for spec in specs
|
| 130 |
+
]
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeReport]:
|
| 134 |
+
return [
|
| 135 |
+
_FakeReport(generated_at=now - timedelta(minutes=int(spec["age_min"])))
|
| 136 |
+
for spec in specs
|
| 137 |
+
]
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
async def run_case(case: dict[str, Any]) -> CaseResult:
|
| 141 |
+
now = datetime.now(UTC)
|
| 142 |
+
state = stub_analysis_state(problem_validated=bool(case["problem_validated"]))
|
| 143 |
+
if case.get("report_id"):
|
| 144 |
+
state = state.model_copy(update={"report_id": case["report_id"]})
|
| 145 |
+
|
| 146 |
+
record_store = _FakeStore(_build_records(case.get("records", []), now))
|
| 147 |
+
report_store = _FakeStore(_build_reports(case.get("reports", []), now))
|
| 148 |
+
expected_missing = sorted(_CODE_TO_MISSING[c] for c in case["expected_missing"])
|
| 149 |
+
|
| 150 |
+
start = time.perf_counter()
|
| 151 |
+
rr = await is_report_ready(
|
| 152 |
+
case["id"], state, record_store=record_store, report_store=report_store
|
| 153 |
+
)
|
| 154 |
+
latency_ms = round((time.perf_counter() - start) * 1000, 1)
|
| 155 |
+
|
| 156 |
+
got_missing = sorted(rr.missing)
|
| 157 |
+
ready_ok = rr.ready == bool(case["expected_ready"])
|
| 158 |
+
missing_ok = got_missing == expected_missing
|
| 159 |
+
return CaseResult(
|
| 160 |
+
id=case["id"],
|
| 161 |
+
group=case["group"],
|
| 162 |
+
expected_ready=bool(case["expected_ready"]),
|
| 163 |
+
got_ready=rr.ready,
|
| 164 |
+
expected_missing=expected_missing,
|
| 165 |
+
got_missing=got_missing,
|
| 166 |
+
correct=ready_ok and missing_ok,
|
| 167 |
+
aligned=bool(case["aligned"]),
|
| 168 |
+
gap=rr.ready and not bool(case["aligned"]),
|
| 169 |
+
latency_ms=latency_ms,
|
| 170 |
+
)
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
def _group_accuracy(results: list[CaseResult]) -> dict[str, dict[str, Any]]:
|
| 174 |
+
out: dict[str, dict[str, Any]] = {}
|
| 175 |
+
for g in GROUPS:
|
| 176 |
+
sub = [r for r in results if r.group == g]
|
| 177 |
+
if not sub:
|
| 178 |
+
continue
|
| 179 |
+
passed = sum(r.correct for r in sub)
|
| 180 |
+
out[g] = {"n": len(sub), "passed": passed, "accuracy": round(passed / len(sub), 3)}
|
| 181 |
+
return out
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
def summarize(results: list[CaseResult]) -> dict[str, Any]:
|
| 185 |
+
n = len(results)
|
| 186 |
+
passed = sum(r.correct for r in results)
|
| 187 |
+
gaps = [r for r in results if r.gap]
|
| 188 |
+
latencies = [r.latency_ms for r in results]
|
| 189 |
+
return {
|
| 190 |
+
"total": n,
|
| 191 |
+
"passed": passed,
|
| 192 |
+
"accuracy": round(passed / n, 3) if n else 0.0,
|
| 193 |
+
"runtime_avg_ms": round(statistics.mean(latencies), 2) if latencies else 0,
|
| 194 |
+
"alignment_gap": {"count": len(gaps), "ids": [r.id for r in gaps]},
|
| 195 |
+
"by_group": _group_accuracy(results),
|
| 196 |
+
}
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
def _fmt_bool(value: bool) -> str:
|
| 200 |
+
return "T" if value else "F"
|
| 201 |
+
|
| 202 |
+
|
| 203 |
+
def _truncate(text: str, width: int) -> str:
|
| 204 |
+
return text if len(text) <= width else text[: width - 3] + "..."
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def format_table(results: list[CaseResult]) -> str:
|
| 208 |
+
header = (
|
| 209 |
+
f"{'ID':<12} {'GROUP':<10} {'RDY e/g':<8} "
|
| 210 |
+
f"{'MISSING (got)':<40} {'OK':<3} {'GAP':<4}"
|
| 211 |
+
)
|
| 212 |
+
rule = "-" * len(header)
|
| 213 |
+
lines = [rule, header, rule]
|
| 214 |
+
for r in results:
|
| 215 |
+
rdy = f"{_fmt_bool(r.expected_ready)}/{_fmt_bool(r.got_ready)}"
|
| 216 |
+
missing = ", ".join(r.got_missing) or "-"
|
| 217 |
+
ok = "ok" if r.correct else "X"
|
| 218 |
+
gap = "GAP" if r.gap else ""
|
| 219 |
+
lines.append(
|
| 220 |
+
f"{r.id:<12} {r.group:<10} {rdy:<8} "
|
| 221 |
+
f"{_truncate(missing, 40):<40} {ok:<3} {gap:<4}"
|
| 222 |
+
)
|
| 223 |
+
lines.append(rule)
|
| 224 |
+
return "\n".join(lines)
|
| 225 |
+
|
| 226 |
+
|
| 227 |
+
def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
|
| 228 |
+
lines = ["SUMMARY"]
|
| 229 |
+
lines.append(
|
| 230 |
+
f" Floor {summary['passed']}/{summary['total']} correct"
|
| 231 |
+
f" ({summary['accuracy'] * 100:.1f}%) avg {summary['runtime_avg_ms']} ms"
|
| 232 |
+
)
|
| 233 |
+
gap = summary["alignment_gap"]
|
| 234 |
+
lines.append(
|
| 235 |
+
f" Align gap {gap['count']} case(s) ready-but-misaligned"
|
| 236 |
+
+ (f" -> {', '.join(gap['ids'])}" if gap["ids"] else "")
|
| 237 |
+
)
|
| 238 |
+
lines.append(" (floor can't catch these; this count is the LLM-judge justification)")
|
| 239 |
+
lines.append("")
|
| 240 |
+
lines.append(" By group")
|
| 241 |
+
for g, m in summary["by_group"].items():
|
| 242 |
+
lines.append(f" {g:<12} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%")
|
| 243 |
+
failures = [r for r in results if not r.correct]
|
| 244 |
+
lines.append("")
|
| 245 |
+
lines.append(f" FAILURES ({len(failures)})")
|
| 246 |
+
for r in failures:
|
| 247 |
+
lines.append(
|
| 248 |
+
f" {r.id:<12} ready {_fmt_bool(r.expected_ready)}->{_fmt_bool(r.got_ready)}"
|
| 249 |
+
f" missing {r.expected_missing} -> {r.got_missing}"
|
| 250 |
+
)
|
| 251 |
+
return "\n".join(lines)
|
| 252 |
+
|
| 253 |
+
|
| 254 |
+
def build_report(
|
| 255 |
+
results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
|
| 256 |
+
) -> dict[str, Any]:
|
| 257 |
+
run = {**meta, **{k: summary[k] for k in ("total", "passed", "accuracy", "runtime_avg_ms")}}
|
| 258 |
+
return {
|
| 259 |
+
"run": run,
|
| 260 |
+
"alignment_gap": summary["alignment_gap"],
|
| 261 |
+
"by_group": summary["by_group"],
|
| 262 |
+
"cases": [asdict(r) for r in results],
|
| 263 |
+
}
|
| 264 |
+
|
| 265 |
+
|
| 266 |
+
@dataclass
|
| 267 |
+
class _Args:
|
| 268 |
+
dataset: Path = DATASET
|
| 269 |
+
limit: int = 0
|
| 270 |
+
no_table: bool = False
|
| 271 |
+
extra: dict[str, Any] = field(default_factory=dict)
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
async def main() -> None:
|
| 275 |
+
parser = argparse.ArgumentParser(description="Report-readiness eval")
|
| 276 |
+
parser.add_argument("--dataset", type=Path, default=DATASET)
|
| 277 |
+
parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
|
| 278 |
+
parser.add_argument("--no-table", action="store_true", help="skip the detail table")
|
| 279 |
+
args = parser.parse_args()
|
| 280 |
+
|
| 281 |
+
cases = load_cases(args.dataset)
|
| 282 |
+
if args.limit:
|
| 283 |
+
cases = cases[: args.limit]
|
| 284 |
+
|
| 285 |
+
started = datetime.now()
|
| 286 |
+
print(f"Report-Readiness Eval -- {started:%Y-%m-%d %H:%M:%S}")
|
| 287 |
+
print(f"dataset: {args.dataset.name} ({len(cases)} cases) target: is_report_ready")
|
| 288 |
+
|
| 289 |
+
results = [await run_case(case) for case in cases]
|
| 290 |
+
|
| 291 |
+
summary = summarize(results)
|
| 292 |
+
if not args.no_table:
|
| 293 |
+
print(format_table(results))
|
| 294 |
+
print(format_summary(summary, results))
|
| 295 |
+
|
| 296 |
+
meta = {
|
| 297 |
+
"timestamp": started.isoformat(timespec="seconds"),
|
| 298 |
+
"dataset": args.dataset.name,
|
| 299 |
+
"target": "src/agents/report/readiness.is_report_ready",
|
| 300 |
+
}
|
| 301 |
+
report = build_report(results, summary, meta)
|
| 302 |
+
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
|
| 303 |
+
out_path = RESULTS_DIR / f"readiness_result_{started:%Y-%m-%d_%H%M%S}.json"
|
| 304 |
+
out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 305 |
+
print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
if __name__ == "__main__":
|
| 309 |
+
asyncio.run(main())
|
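`readiness_dataset.json` itself is not shown in this view, so the case shape below is inferred only from the keys that `run_case` and the `_build_*` helpers read. A hedged sketch: `key_problems` is a hypothetical helper, and every value in `example_case` (including the expected outcome) is illustrative rather than copied from the dataset:

```python
from typing import Any

# Keys run_case reads with case[...] (required) versus case.get(...) (optional).
REQUIRED = {
    "id", "group", "problem_validated",
    "expected_ready", "expected_missing", "aligned",
}
OPTIONAL = {"report_id", "records", "reports"}


def key_problems(case: dict[str, Any]) -> list[str]:
    """List missing required keys and unrecognized keys for one dataset case."""
    keys = set(case)
    problems = [f"missing: {k}" for k in sorted(REQUIRED - keys)]
    problems += [f"unknown: {k}" for k in sorted(keys - REQUIRED - OPTIONAL)]
    return problems


# Hypothetical case; ages are minutes before "now", as in _build_records.
example_case = {
    "id": "delta_example",
    "group": "delta",
    "problem_validated": True,
    "records": [{"age_min": 30, "findings": 2, "analysis": "success"}],
    "reports": [{"age_min": 10}],
    "expected_ready": False,
    "expected_missing": ["delta"],  # short code, mapped via _CODE_TO_MISSING
    "aligned": True,
}
print(key_problems(example_case))
```

A check like this could run in `load_cases` so that a typo in a dataset key fails loudly instead of silently falling back to a `.get(...)` default.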
main.py
CHANGED
|
@@ -13,6 +13,9 @@ from src.api.v1.room import router as room_router
|
|
| 13 |
from src.api.v1.users import router as users_router
|
| 14 |
from src.api.v1.db_client import router as db_client_router
|
| 15 |
from src.api.v1.data_catalog import router as data_catalog_router
|
| 16 |
from src.db.postgres.init_db import init_db
|
| 17 |
import os
|
| 18 |
import uvicorn
|
|
@@ -53,6 +56,9 @@ app.include_router(room_router)
|
|
| 53 |
app.include_router(chat_router)
|
| 54 |
app.include_router(db_client_router)
|
| 55 |
app.include_router(data_catalog_router)
|
| 56 |
|
| 57 |
|
| 58 |
@app.get("/")
|
|
|
|
| 13 |
from src.api.v1.users import router as users_router
|
| 14 |
from src.api.v1.db_client import router as db_client_router
|
| 15 |
from src.api.v1.data_catalog import router as data_catalog_router
|
| 16 |
+
from src.api.v1.report import router as report_router
|
| 17 |
+
from src.api.v1.analysis import router as analysis_router
|
| 18 |
+
from src.api.v1.tools import router as tools_router
|
| 19 |
from src.db.postgres.init_db import init_db
|
| 20 |
import os
|
| 21 |
import uvicorn
|
|
|
|
| 56 |
app.include_router(chat_router)
|
| 57 |
app.include_router(db_client_router)
|
| 58 |
app.include_router(data_catalog_router)
|
| 59 |
+
app.include_router(report_router)
|
| 60 |
+
app.include_router(analysis_router)
|
| 61 |
+
app.include_router(tools_router)
|
| 62 |
|
| 63 |
|
| 64 |
@app.get("/")
|
pyproject.toml
CHANGED
|
@@ -123,6 +123,8 @@ ignore = [
|
|
| 123 |
# S608: golden compiler tests assert literal SQL strings (incl. concatenated
|
| 124 |
# suffixes); they never execute against a DB, so it's a false positive here.
|
| 125 |
"tests/**" = ["S101", "S105", "S106", "S608"]
|
| 126 |
|
| 127 |
[tool.mypy]
|
| 128 |
python_version = "3.12"
|
|
|
|
| 123 |
# S608: golden compiler tests assert literal SQL strings (incl. concatenated
|
| 124 |
# suffixes); they never execute against a DB, so it's a false positive here.
|
| 125 |
"tests/**" = ["S101", "S105", "S106", "S608"]
|
| 126 |
+
# T201: eval/ scripts are CLIs; print() is their intended output channel.
|
| 127 |
+
"eval/**" = ["T201"]
|
| 128 |
|
| 129 |
[tool.mypy]
|
| 130 |
python_version = "3.12"
|
src/agents/binding_store.py
ADDED
|
@@ -0,0 +1,34 @@
|
| 1 |
+
"""AnalysisDataSourceStore: read per-analysis data-source bindings (#10).
|
| 2 |
+
|
| 3 |
+
The dedorch `data_sources` table records which catalog sources an analysis is scoped
|
| 4 |
+
to (`reference_id` = the catalog source id). It's written at `/analysis/create`; this
|
| 5 |
+
store is the read seam for the two consumers: `structured_flow` catalog scoping and
|
| 6 |
+
the report's data-source appendix.
|
| 7 |
+
|
| 8 |
+
Fail-open by convention at the call sites: an empty binding (legacy room, or the FE
|
| 9 |
+
not yet sending ids) means "no restriction": fall back to the whole catalog. Mirrors
|
| 10 |
+
`AnalysisStateStore`: each call opens its own `AsyncSession`.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
from sqlalchemy import select
|
| 16 |
+
|
| 17 |
+
from src.db.postgres.connection import AsyncSessionLocal
|
| 18 |
+
from src.db.postgres.models import AnalysisDataSourceRow
|
| 19 |
+
from src.middlewares.logging import get_logger
|
| 20 |
+
|
| 21 |
+
logger = get_logger("binding_store")
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class AnalysisDataSourceStore:
|
| 25 |
+
"""Read the bound catalog `source_id`s for an analysis."""
|
| 26 |
+
|
| 27 |
+
async def get(self, analysis_id: str) -> list[str]:
|
| 28 |
+
async with AsyncSessionLocal() as session:
|
| 29 |
+
result = await session.execute(
|
| 30 |
+
select(AnalysisDataSourceRow.reference_id).where(
|
| 31 |
+
AnalysisDataSourceRow.analysis_id == analysis_id
|
| 32 |
+
)
|
| 33 |
+
)
|
| 34 |
+
return list(result.scalars().all())
|
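The fail-open convention described in the `binding_store.py` docstring lives at the call sites, which this part of the diff does not show. A minimal sketch of what such a call site might do, assuming the catalog can be reduced to a list of source ids; `scope_sources` and its parameters are hypothetical names:

```python
def scope_sources(catalog_ids: list[str], bound_ids: list[str]) -> list[str]:
    """Restrict catalog source ids to the analysis binding, failing open."""
    if not bound_ids:
        # Empty binding (legacy room, or ids not sent yet): no restriction.
        return list(catalog_ids)
    bound = set(bound_ids)
    # Keep catalog order; drop anything the analysis is not bound to.
    return [sid for sid in catalog_ids if sid in bound]


catalog = ["src_a", "src_b", "src_c"]
print(scope_sources(catalog, []))          # whole catalog
print(scope_sources(catalog, ["src_b"]))   # scoped to the binding
```

Note the trade-off: an empty result from `AnalysisDataSourceStore.get` is indistinguishable from "never bound", so an analysis cannot be scoped to zero sources under this convention.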
src/agents/chat_handler.py
CHANGED
|
@@ -2,18 +2,22 @@
|
|
| 2 |
|
| 3 |
End-to-end flow per user message:
|
| 4 |
|
| 5 |
-
1. `
|
| 6 |
-
2. Route:
|
| 7 |
-
- `chat`
|
| 8 |
-
- `
|
| 9 |
-
- `
|
| 10 |
-
|
| 11 |
3. `ChatbotAgent.astream` -> yield text tokens.
|
| 12 |
4. Wrap each step into an SSE-style event dict so the API endpoint can
|
| 13 |
stream them as Server-Sent Events.
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
|
|
|
| 17 |
|
| 18 |
All dependencies are injectable for tests. Default constructors lazy-build
|
| 19 |
production deps (no `Settings()` triggered at import time as long as you
|
|
@@ -33,12 +37,16 @@ from src.middlewares.logging import get_logger
|
|
| 33 |
from src.retrieval.base import RetrievalResult
|
| 34 |
|
| 35 |
from .chatbot import ChatbotAgent, DocumentChunk
|
|
|
|
|
|
|
|
|
|
| 36 |
from .orchestration import OrchestratorAgent
|
| 37 |
|
| 38 |
if TYPE_CHECKING:
|
| 39 |
from ..catalog.reader import CatalogReader
|
| 40 |
from ..query.service import QueryService
|
| 41 |
from ..retrieval.router import RetrievalRouter
|
|
|
|
| 42 |
from .slow_path.coordinator import SlowPathCoordinator
|
| 43 |
from .slow_path.store import AnalysisStore
|
| 44 |
|
|
@@ -71,6 +79,12 @@ class ChatHandler:
|
|
| 71 |
Callable[[str], SlowPathCoordinator] | None
|
| 72 |
) = None,
|
| 73 |
analysis_store: AnalysisStore | None = None,
|
| 74 |
enable_tracing: bool = False,
|
| 75 |
) -> None:
|
| 76 |
self._intent_router = intent_router
|
|
@@ -88,6 +102,21 @@ class ChatHandler:
|
|
| 88 |
self._enable_slow_path = enable_slow_path
|
| 89 |
self._slow_path_factory = slow_path_coordinator_factory
|
| 90 |
self._analysis_store = analysis_store
|
| 91 |
|
| 92 |
# ------------------------------------------------------------------
|
| 93 |
# Lazy default-dep builders
|
|
@@ -125,6 +154,71 @@ class ChatHandler:
|
|
| 125 |
self._document_retriever = RetrievalRouter()
|
| 126 |
return self._document_retriever
|
| 127 |
|
| 128 |
# ------------------------------------------------------------------
|
| 129 |
# Public entry
|
| 130 |
# ------------------------------------------------------------------
|
|
@@ -134,6 +228,7 @@ class ChatHandler:
|
|
| 134 |
message: str,
|
| 135 |
user_id: str,
|
| 136 |
history: list[BaseMessage] | None = None,
|
|
|
|
| 137 |
) -> AsyncIterator[dict[str, Any]]:
|
| 138 |
tracer = self._make_tracer(user_id, message)
|
| 139 |
|
|
@@ -147,7 +242,39 @@ class ChatHandler:
|
|
| 147 |
yield {"event": "error", "data": f"Could not classify message: {e}"}
|
| 148 |
return
|
| 149 |
|
| 150 |
-
|
| 151 |
|
| 152 |
rewritten = decision.rewritten_query or message
|
| 153 |
query_result = None
|
|
@@ -155,7 +282,7 @@ class ChatHandler:
|
|
| 155 |
raw_chunks: Any = None
|
| 156 |
|
| 157 |
# ---- 2. Route ------------------------------------------------
|
| 158 |
-
if
|
| 159 |
try:
|
| 160 |
# One memoizing reader per request: the same catalog is otherwise
|
| 161 |
# re-fetched from the catalog DB 4-5x across the slow-path run. This
|
|
@@ -164,10 +291,15 @@ class ChatHandler:
|
|
| 164 |
from ..catalog.reader import MemoizingCatalogReader
|
| 165 |
|
| 166 |
req_reader = MemoizingCatalogReader(self._get_catalog_reader())
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
if self._enable_slow_path:
|
| 169 |
async for event in self._run_slow_path(
|
| 170 |
-
user_id, rewritten, catalog, tracer,
|
| 171 |
):
|
| 172 |
yield event
|
| 173 |
return
|
|
@@ -182,32 +314,88 @@ class ChatHandler:
|
|
| 182 |
)
|
| 183 |
yield {"event": "error", "data": f"Structured query failed: {e}"}
|
| 184 |
return
|
| 185 |
-
elif
|
| 186 |
try:
|
| 187 |
raw_chunks = await self._get_document_retriever().retrieve(
|
| 188 |
rewritten, user_id
|
| 189 |
)
|
| 190 |
chunks = _normalize_chunks(raw_chunks)
|
| 191 |
-
except NotImplementedError:
|
| 192 |
-
logger.warning("DocumentRetriever placeholder hit", user_id=user_id)
|
| 193 |
-
yield {
|
| 194 |
-
"event": "error",
|
| 195 |
-
"data": "Document retrieval is not yet available - pending implementation.",
|
| 196 |
-
}
|
| 197 |
-
return
|
| 198 |
except Exception as e:
|
| 199 |
logger.error(
|
| 200 |
"unstructured route failed", user_id=user_id, error=str(e)
|
| 201 |
)
|
| 202 |
yield {"event": "error", "data": f"Document retrieval failed: {e}"}
|
| 203 |
return
|
| 204 |
# else: chat path β no context
|
| 205 |
|
| 206 |
# ---- 2b. Emit sources ---------------------------------------
|
| 207 |
-
sources = _build_sources(
|
| 208 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
)
|
| 210 |
-
logger.info("built sources", source_hint=decision.source_hint, sources_count=len(sources), raw_chunks_count=len(raw_chunks) if raw_chunks else 0)
|
| 211 |
yield {"event": "sources", "data": json.dumps(sources)}
|
| 212 |
|
| 213 |
# ---- 3. Stream answer ----------------------------------------
|
|
@@ -282,9 +470,9 @@ class ChatHandler:
|
|
| 282 |
|
| 283 |
def _get_analysis_store(self) -> AnalysisStore:
|
| 284 |
if self._analysis_store is None:
|
| 285 |
-
from .slow_path.store import
|
| 286 |
|
| 287 |
-
self._analysis_store =
|
| 288 |
return self._analysis_store
|
| 289 |
|
| 290 |
async def _run_slow_path(
|
|
@@ -294,11 +482,13 @@ class ChatHandler:
|
|
| 294 |
catalog: Any,
|
| 295 |
tracer: Any = None,
|
| 296 |
catalog_reader: CatalogReader | None = None,
|
|
|
|
| 297 |
) -> AsyncIterator[dict[str, Any]]:
|
| 298 |
"""Run the slow path and stream its assembled answer as SSE events.
|
| 299 |
|
| 300 |
Context comes from the `get_business_context` seam (a stub today); the
|
| 301 |
-
`analysis_record` is persisted via the `AnalysisStore` seam (
|
|
|
|
| 302 |
`chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
|
| 303 |
object; true token streaming is a later step).
|
| 304 |
"""
|
|
@@ -368,26 +558,58 @@ class ChatHandler:
|
|
| 368 |
yield {"event": "sources", "data": json.dumps([])} # TODO: derive from record
|
| 369 |
yield {"event": "chunk", "data": result.chat_answer}
|
| 370 |
try:
|
| 371 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 372 |
except Exception as e: # persistence must never break the user's answer
|
| 373 |
logger.error("analysis_record persist failed", user_id=user_id, error=str(e))
|
| 374 |
tracer.end() # output omitted (chat_answer may contain PII on Cloud)
|
| 375 |
yield {"event": "done", "data": ""}
|
| 376 |
|
| 377 |
|
| 378 |
def _build_sources(
|
| 379 |
-
|
| 380 |
user_id: str,
|
| 381 |
query_result: Any,
|
| 382 |
raw_chunks: Any,
|
| 383 |
) -> list[dict[str, Any]]:
|
| 384 |
"""Build the sources payload for the SSE `sources` event.
|
| 385 |
|
| 386 |
-
-
|
| 387 |
-
-
|
| 388 |
- chat or error: empty list.
|
| 389 |
"""
|
| 390 |
-
if
|
| 391 |
if query_result is None or getattr(query_result, "error", None):
|
| 392 |
return []
|
| 393 |
table_name = getattr(query_result, "table_name", "") or ""
|
|
@@ -399,7 +621,7 @@ def _build_sources(
|
|
| 399 |
"page_label": None,
|
| 400 |
}]
|
| 401 |
|
| 402 |
-
if
|
| 403 |
seen: set[tuple[Any, Any]] = set()
|
| 404 |
sources: list[dict[str, Any]] = []
|
| 405 |
for item in raw_chunks:
|
|
|
|
| 2 |
|
| 3 |
End-to-end flow per user message:
|
| 4 |
|
| 5 |
+
1. `OrchestratorAgent.classify` -> RouterDecision (one of six intents).
|
| 6 |
+
2. Route by intent:
|
| 7 |
+
- `chat` -> no context. Pass straight to ChatbotAgent.
|
| 8 |
+
- `structured_flow` -> CatalogReader -> slow path / QueryService.
|
| 9 |
+
- `unstructured_flow` -> DocumentRetriever (RAG over PGVector) ->
|
| 10 |
+
list[DocumentChunk].
|
| 11 |
+
- `check` -> check_data / check_knowledge tool -> rendered table.
|
| 12 |
+
- `problem_statement` -> PS skill: draft + validate -> write analysis state.
|
| 13 |
+
- `help` -> Help skill: analysis state + history -> streamed guidance.
|
| 14 |
3. `ChatbotAgent.astream` -> yield text tokens.
|
| 15 |
4. Wrap each step into an SSE-style event dict so the API endpoint can
|
| 16 |
stream them as Server-Sent Events.
|
| 17 |
|
| 18 |
+
The chat endpoint (`src/api/v1/chat.py`) calls `ChatHandler.handle(...)` per
|
| 19 |
+
request, behind two endpoint-level pre-filters: a greeting/farewell
|
| 20 |
+
short-circuit and a Redis response cache (both skip the LLM on a hit).
|
| 21 |
|
| 22 |
All dependencies are injectable for tests. Default constructors lazy-build
|
| 23 |
production deps (no `Settings()` triggered at import time as long as you
|
|
|
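The module docstring says each step is wrapped in an SSE-style event dict, and the handler yields dicts with `event` and `data` keys. How `src/api/v1/chat.py` serializes them is not visible here, so the following is only a sketch of the standard Server-Sent Events framing; `to_sse` is a hypothetical helper, not the endpoint's code:

```python
def to_sse(event: dict[str, str]) -> str:
    """Frame one {"event": ..., "data": ...} dict as a Server-Sent Events message."""
    lines = [f"event: {event['event']}"]
    # A payload with newlines needs one "data:" field per line; an empty
    # payload still gets a single empty data field.
    for part in event["data"].splitlines() or [""]:
        lines.append(f"data: {part}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


print(to_sse({"event": "chunk", "data": "line one\nline two"}), end="")
print(to_sse({"event": "done", "data": ""}), end="")
```

Splitting on newlines matters for the `chunk` event in particular, since an assembled answer emitted as one chunk is likely to span several lines.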
|
| 37 |
from src.retrieval.base import RetrievalResult
|
| 38 |
|
| 39 |
from .chatbot import ChatbotAgent, DocumentChunk
|
| 40 |
+
from .handlers.check import run_check
|
| 41 |
+
from .handlers.help import HelpAgent
|
| 42 |
+
from .handlers.problem_statement import ProblemStatementAgent, run_problem_statement
|
| 43 |
from .orchestration import OrchestratorAgent
|
| 44 |
|
| 45 |
if TYPE_CHECKING:
|
| 46 |
from ..catalog.reader import CatalogReader
|
| 47 |
from ..query.service import QueryService
|
| 48 |
from ..retrieval.router import RetrievalRouter
|
| 49 |
+
from .gate import AnalysisState
|
| 50 |
from .slow_path.coordinator import SlowPathCoordinator
|
| 51 |
from .slow_path.store import AnalysisStore
|
| 52 |
|
|
|
|
| 79 |
Callable[[str], SlowPathCoordinator] | None
|
| 80 |
) = None,
|
| 81 |
analysis_store: AnalysisStore | None = None,
|
| 82 |
+
check_invoker_factory: Callable[[str], Any] | None = None,
|
| 83 |
+
ps_agent: ProblemStatementAgent | None = None,
|
| 84 |
+
help_agent: HelpAgent | None = None,
|
| 85 |
+
state_store: Any | None = None,
|
| 86 |
+
binding_store: Any | None = None,
|
| 87 |
+
enable_gate: bool = False,
|
| 88 |
enable_tracing: bool = False,
|
| 89 |
) -> None:
|
| 90 |
self._intent_router = intent_router
|
|
|
|
| 102 |
self._enable_slow_path = enable_slow_path
|
| 103 |
self._slow_path_factory = slow_path_coordinator_factory
|
| 104 |
self._analysis_store = analysis_store
|
| 105 |
+
# `check` skill: builds the data-access invoker (check_data/check_knowledge)
|
| 106 |
+
# per request with the authenticated user_id. Injectable for tests.
|
| 107 |
+
self._check_invoker_factory = check_invoker_factory
|
| 108 |
+
# `problem_statement` skill: LLM drafter + the Analysis State store it writes
|
| 109 |
+
# `problem_validated` to. Both injectable for tests.
|
| 110 |
+
self._ps_agent = ps_agent
|
| 111 |
+
# `help` skill: LLM guide that reads the Analysis State + chat history.
|
| 112 |
+
self._help_agent = help_agent
|
| 113 |
+
self._state_store = state_store
|
| 114 |
+
# `#10` data-source binding: scopes structured_flow's catalog to the sources
|
| 115 |
+
# the analysis is bound to. Injectable for tests; fail-open when absent.
|
| 116 |
+
self._binding_store = binding_store
|
| 117 |
+
# Deterministic gate: redirect structured_flow -> problem_statement until the
|
| 118 |
+
# analysis is validated. OFF by default (legacy rooms have no state row).
|
| 119 |
+
self._enable_gate = enable_gate
|
| 120 |
|
| 121 |
# ------------------------------------------------------------------
|
| 122 |
# Lazy default-dep builders
|
|
|
|
| 154 |
self._document_retriever = RetrievalRouter()
|
| 155 |
return self._document_retriever
|
| 156 |
|
| 157 |
+
def _get_check_invoker(self, user_id: str) -> Any:
|
| 158 |
+
"""Build the per-request data-access invoker for the `check` skill."""
|
| 159 |
+
if self._check_invoker_factory is not None:
|
| 160 |
+
return self._check_invoker_factory(user_id)
|
| 161 |
+
from ..tools.data_access import DataAccessToolInvoker
|
| 162 |
+
|
| 163 |
+
return DataAccessToolInvoker(user_id, self._get_catalog_reader())
|
| 164 |
+
|
| 165 |
+
def _get_ps_agent(self) -> ProblemStatementAgent:
|
| 166 |
+
if self._ps_agent is None:
|
| 167 |
+
self._ps_agent = ProblemStatementAgent()
|
| 168 |
+
return self._ps_agent
|
| 169 |
+
|
| 170 |
+
def _get_help_agent(self) -> HelpAgent:
|
| 171 |
+
if self._help_agent is None:
|
| 172 |
+
self._help_agent = HelpAgent()
|
| 173 |
+
return self._help_agent
|
| 174 |
+
|
| 175 |
+
def _get_state_store(self) -> Any:
|
| 176 |
+
if self._state_store is None:
|
| 177 |
+
from .state_store import AnalysisStateStore
|
| 178 |
+
|
| 179 |
+
self._state_store = AnalysisStateStore()
|
| 180 |
+
return self._state_store
|
| 181 |
+
|
| 182 |
+
def _get_binding_store(self) -> Any:
|
| 183 |
+
if self._binding_store is None:
|
| 184 |
+
from .binding_store import AnalysisDataSourceStore
|
| 185 |
+
|
| 186 |
+
self._binding_store = AnalysisDataSourceStore()
|
| 187 |
+
return self._binding_store
|
| 188 |
+
|
| 189 |
+
async def _bound_source_ids(self, analysis_id: str | None) -> set[str]:
|
| 190 |
+
"""#10: the catalog source_ids this analysis is bound to (empty = unscoped).
|
| 191 |
+
|
| 192 |
+
Fail-open: no analysis_id, no binding rows (legacy room / FE not sending
|
| 193 |
+
ids), or a read error → empty set, which the caller treats as "whole
|
| 194 |
+
catalog". Used to build a `_ScopedCatalogReader` so the Planner AND the
|
| 195 |
+
data-access tools (which re-read the catalog themselves) see the same scope.
|
| 196 |
+
"""
|
| 197 |
+
if not analysis_id:
|
| 198 |
+
return set()
|
| 199 |
+
try:
|
| 200 |
+
return set(await self._get_binding_store().get(analysis_id))
|
| 201 |
+
except Exception as e:  # noqa: BLE001 - never block the query on this
|
| 202 |
+
logger.warning("binding read failed → unscoped", analysis_id=analysis_id, error=str(e))
|
| 203 |
+
return set()
|
| 204 |
+
|
| 205 |
+
async def _load_analysis_state(self, analysis_id: str | None) -> AnalysisState:
|
| 206 |
+
"""Load Analysis State for the Help skill; fail closed to a not-validated stub.
|
| 207 |
+
|
| 208 |
+
Mirrors the gate's never-throw fallback so Help degrades gracefully on a
|
| 209 |
+
missing row, a read error, or a legacy room with no `analysis_id`.
|
| 210 |
+
"""
|
| 211 |
+
from .gate import stub_analysis_state
|
| 212 |
+
|
| 213 |
+
if not analysis_id:
|
| 214 |
+
return stub_analysis_state(problem_validated=False)
|
| 215 |
+
try:
|
| 216 |
+
state = await self._get_state_store().get(analysis_id)
|
| 217 |
+
except Exception as e:
|
| 218 |
+
logger.warning("help state read failed → not-validated", error=str(e))
|
| 219 |
+
state = None
|
| 220 |
+
return state if state is not None else stub_analysis_state(problem_validated=False)
|
| 221 |
+
|
| 222 |
# ------------------------------------------------------------------
|
| 223 |
# Public entry
|
| 224 |
# ------------------------------------------------------------------
|
|
|
|
| 228 |
message: str,
|
| 229 |
user_id: str,
|
| 230 |
history: list[BaseMessage] | None = None,
|
| 231 |
+
analysis_id: str | None = None,
|
| 232 |
) -> AsyncIterator[dict[str, Any]]:
|
| 233 |
tracer = self._make_tracer(user_id, message)
|
| 234 |
|
|
|
|
| 242 |
yield {"event": "error", "data": f"Could not classify message: {e}"}
|
| 243 |
return
|
| 244 |
|
| 245 |
+
intent = decision.intent
|
| 246 |
+
# ---- 1a. Ensure session state row (T-A) ----------------------
|
| 247 |
+
# Rooms created via /room/create have no `analysis_states` row. Without one
|
| 248 |
+
# the gate redirect-loops and problem_statement / report_id writes silently
|
| 249 |
+
# no-op. Lazily get-or-create it (idempotent) so any session is gate-ready.
|
| 250 |
+
analysis_state: AnalysisState | None = None
|
| 251 |
+
if analysis_id:
|
| 252 |
+
try:
|
| 253 |
+
analysis_state = await self._get_state_store().ensure(analysis_id, user_id)
|
| 254 |
+
except Exception as e:
|
| 255 |
+
logger.warning(
|
| 256 |
+
"analysis state ensure failed", analysis_id=analysis_id, error=str(e)
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
# ---- 1b. Gate (deterministic, post-router) -------------------
|
| 260 |
+
# Redirect structured_flow -> problem_statement until the analysis is
|
| 261 |
+
# validated. Fails closed (not-validated) when the state row is unavailable.
|
| 262 |
+
if self._enable_gate and analysis_id:
|
| 263 |
+
from .gate import gate, stub_analysis_state
|
| 264 |
+
|
| 265 |
+
intent = gate(
|
| 266 |
+
intent,
|
| 267 |
+
analysis_state
|
| 268 |
+
if analysis_state is not None
|
| 269 |
+
else stub_analysis_state(problem_validated=False),
|
| 270 |
+
)
|
| 271 |
+
|
| 272 |
+
# The `intent` event is consumed by the endpoint (it gates response caching
|
| 273 |
+
# on the effective intent) and is NOT forwarded to the frontend. We emit the
|
| 274 |
+
# post-gate intent so the cache keys on what actually ran.
|
| 275 |
+
event_data = decision.model_dump()
|
| 276 |
+
event_data["intent"] = intent
|
| 277 |
+
yield {"event": "intent", "data": json.dumps(event_data)}
|
| 278 |
|
| 279 |
rewritten = decision.rewritten_query or message
|
| 280 |
query_result = None
|
|
|
|
| 282 |
raw_chunks: Any = None
|
| 283 |
|
| 284 |
# ---- 2. Route ------------------------------------------------
|
| 285 |
+
if intent == "structured_flow":
|
| 286 |
try:
|
| 287 |
# One memoizing reader per request: the same catalog is otherwise
|
| 288 |
# re-fetched from the catalog DB 4-5x across the slow-path run. This
|
|
|
|
| 291 |
from ..catalog.reader import MemoizingCatalogReader
|
| 292 |
|
| 293 |
req_reader = MemoizingCatalogReader(self._get_catalog_reader())
|
| 294 |
+
# #10: scope every catalog read - the Planner's AND the data-access
|
| 295 |
+
# tools' own re-reads - to the analysis's bound sources, so binding
|
| 296 |
+
# is a boundary, not just a planner hint (T-B). Fail-open (T-C).
|
| 297 |
+
bound = await self._bound_source_ids(analysis_id)
|
| 298 |
+
reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader
|
| 299 |
+
catalog = await reader.read(user_id, "structured")
|
| 300 |
if self._enable_slow_path:
|
| 301 |
async for event in self._run_slow_path(
|
| 302 |
+
user_id, rewritten, catalog, tracer, reader, analysis_id
|
| 303 |
):
|
| 304 |
yield event
|
| 305 |
return
|
|
|
|
| 314 |
)
|
| 315 |
yield {"event": "error", "data": f"Structured query failed: {e}"}
|
| 316 |
return
|
| 317 |
+
elif intent == "unstructured_flow":
|
| 318 |
try:
|
| 319 |
raw_chunks = await self._get_document_retriever().retrieve(
|
| 320 |
rewritten, user_id
|
| 321 |
)
|
| 322 |
chunks = _normalize_chunks(raw_chunks)
|
| 323 |
except Exception as e:
|
| 324 |
logger.error(
|
| 325 |
"unstructured route failed", user_id=user_id, error=str(e)
|
| 326 |
)
|
| 327 |
yield {"event": "error", "data": f"Document retrieval failed: {e}"}
|
| 328 |
return
|
| 329 |
+
elif intent == "check":
|
| 330 |
+
try:
|
| 331 |
+
invoker = self._get_check_invoker(user_id)
|
| 332 |
+
text = await run_check(rewritten, invoker)
|
| 333 |
+
except Exception as e:
|
| 334 |
+
logger.error("check route failed", user_id=user_id, error=str(e))
|
| 335 |
+
yield {"event": "error", "data": f"Lookup failed: {e}"}
|
| 336 |
+
return
|
| 337 |
+
yield {"event": "chunk", "data": text}
|
| 338 |
+
yield {"event": "done", "data": ""}
|
| 339 |
+
return
|
| 340 |
+
elif intent == "problem_statement":
|
| 341 |
+
try:
|
| 342 |
+
text = await run_problem_statement(
|
| 343 |
+
message,
|
| 344 |
+
analysis_id,
|
| 345 |
+
agent=self._get_ps_agent(),
|
| 346 |
+
store=self._get_state_store(),
|
| 347 |
+
history=history,
|
| 348 |
+
)
|
| 349 |
+
except Exception as e:
|
| 350 |
+
logger.error("problem_statement route failed", user_id=user_id, error=str(e))
|
| 351 |
+
yield {"event": "error", "data": f"Problem statement failed: {e}"}
|
| 352 |
+
return
|
| 353 |
+
yield {"event": "chunk", "data": text}
|
| 354 |
+
yield {"event": "done", "data": ""}
|
| 355 |
+
return
|
| 356 |
+
elif intent == "help":
|
| 357 |
+
try:
|
| 358 |
+
state = analysis_state or await self._load_analysis_state(analysis_id)
|
| 359 |
+
except Exception as e:
|
| 360 |
+
logger.error("help route failed", user_id=user_id, error=str(e))
|
| 361 |
+
yield {"event": "error", "data": f"Help failed: {e}"}
|
| 362 |
+
return
|
| 363 |
+
# report_ready (seam #5): deterministic - validated goal + ≥1 recorded
|
| 364 |
+
# analysis (mirrors the report API's own 409 gate). Never-throws (fails
|
| 365 |
+
# closed to not-ready), so Help degrades safely. The consistency guard in
|
| 366 |
+
# HelpAgent only offers `generate_report` when this says ready.
|
| 367 |
+
from .report.readiness import is_report_ready
|
| 368 |
+
|
| 369 |
+
report_ready = await is_report_ready(analysis_id, state)
|
| 370 |
+
# The prompt sees chat history -> masked.
|
| 371 |
+
hc = tracer.callbacks(masked=True)
|
| 372 |
+
hkw = {"callbacks": hc} if hc else {}
|
| 373 |
+
try:
|
| 374 |
+
async for token in self._get_help_agent().astream(
|
| 375 |
+
state,
|
| 376 |
+
history=history,
|
| 377 |
+
message=message,
|
| 378 |
+
report_ready=report_ready,
|
| 379 |
+
**hkw,
|
| 380 |
+
):
|
| 381 |
+
yield {"event": "chunk", "data": token}
|
| 382 |
+
except Exception as e:
|
| 383 |
+
logger.error("help streaming failed", user_id=user_id, error=str(e))
|
| 384 |
+
yield {"event": "error", "data": f"Help generation failed: {e}"}
|
| 385 |
+
return
|
| 386 |
+
tracer.end()
|
| 387 |
+
yield {"event": "done", "data": ""}
|
| 388 |
+
return
|
| 389 |
# else: chat path - no context
|
| 390 |
|
| 391 |
# ---- 2b. Emit sources ---------------------------------------
|
| 392 |
+
sources = _build_sources(intent, user_id, query_result, raw_chunks)
|
| 393 |
+
logger.info(
|
| 394 |
+
"built sources",
|
| 395 |
+
intent=intent,
|
| 396 |
+
sources_count=len(sources),
|
| 397 |
+
raw_chunks_count=len(raw_chunks) if raw_chunks else 0,
|
| 398 |
)
|
|
|
|
| 399 |
yield {"event": "sources", "data": json.dumps(sources)}
|
| 400 |
|
| 401 |
# ---- 3. Stream answer ----------------------------------------
|
|
|
|
| 470 |
|
| 471 |
def _get_analysis_store(self) -> AnalysisStore:
|
| 472 |
if self._analysis_store is None:
|
| 473 |
+
from .slow_path.store import PostgresAnalysisStore
|
| 474 |
|
| 475 |
+
self._analysis_store = PostgresAnalysisStore()
|
| 476 |
return self._analysis_store
|
| 477 |
|
| 478 |
async def _run_slow_path(
|
|
|
|
| 482 |
catalog: Any,
|
| 483 |
tracer: Any = None,
|
| 484 |
catalog_reader: CatalogReader | None = None,
|
| 485 |
+
analysis_id: str | None = None,
|
| 486 |
) -> AsyncIterator[dict[str, Any]]:
|
| 487 |
"""Run the slow path and stream its assembled answer as SSE events.
|
| 488 |
|
| 489 |
Context comes from the `get_business_context` seam (a stub today); the
|
| 490 |
+
`analysis_record` is persisted via the `AnalysisStore` seam (PostgresAnalysisStore),
|
| 491 |
+
stamped with the request's user_id + analysis_id so the report can group it.
|
| 492 |
`chat_answer` is emitted as a single `chunk` (the Assembler returns the whole
|
| 493 |
object; true token streaming is a later step).
|
| 494 |
"""
|
|
|
|
| 558 |
yield {"event": "sources", "data": json.dumps([])} # TODO: derive from record
|
| 559 |
yield {"event": "chunk", "data": result.chat_answer}
|
| 560 |
try:
|
| 561 |
+
# Stamp identity from the request scope: owner + the shared session id
|
| 562 |
+
# (analysis_id == room_id). Without analysis_id the record is orphaned:
|
| 563 |
+
# list_for_analysis can't find it, so the report + is_report_ready go
|
| 564 |
+
# blind. The store is never-throw.
|
| 565 |
+
record = result.analysis_record.model_copy(
|
| 566 |
+
update={"user_id": user_id, "analysis_id": analysis_id}
|
| 567 |
+
)
|
| 568 |
+
await self._get_analysis_store().save(record)
|
| 569 |
except Exception as e: # persistence must never break the user's answer
|
| 570 |
logger.error("analysis_record persist failed", user_id=user_id, error=str(e))
|
| 571 |
tracer.end() # output omitted (chat_answer may contain PII on Cloud)
|
| 572 |
yield {"event": "done", "data": ""}
|
| 573 |
|
| 574 |
|
| 575 |
+
class _ScopedCatalogReader:
|
| 576 |
+
"""Wraps a CatalogReader, restricting `structured` reads to an analysis's bound
|
| 577 |
+
sources (#10).
|
| 578 |
+
|
| 579 |
+
Scoping lives here, not at a single call site, so the Planner AND the
|
| 580 |
+
data-access tools (which re-read the catalog themselves) see the same scoped
|
| 581 |
+
view; otherwise binding is only a hint to the Planner while the executor runs
|
| 582 |
+
against the full catalog. Fail-open: an empty or fully-disjoint binding yields
|
| 583 |
+
the whole catalog, so a stale / cross-source binding degrades instead of
|
| 584 |
+
emptying the catalog. Only `structured` reads are scoped (all #10 binds today);
|
| 585 |
+
`unstructured` / retrieval reads pass through.
|
| 586 |
+
"""
|
| 587 |
+
|
| 588 |
+
def __init__(self, inner: Any, bound: set[str]) -> None:
|
| 589 |
+
self._inner = inner
|
| 590 |
+
self._bound = bound
|
| 591 |
+
|
| 592 |
+
async def read(self, user_id: str, source_hint: str) -> Any:
|
| 593 |
+
catalog = await self._inner.read(user_id, source_hint)
|
| 594 |
+
if not self._bound or source_hint != "structured":
|
| 595 |
+
return catalog
|
| 596 |
+
scoped = [s for s in catalog.sources if s.source_id in self._bound]
|
| 597 |
+
return catalog.model_copy(update={"sources": scoped or catalog.sources})
|
| 598 |
+
|
| 599 |
+
|
| 600 |
def _build_sources(
|
| 601 |
+
intent: str,
|
| 602 |
user_id: str,
|
| 603 |
query_result: Any,
|
| 604 |
raw_chunks: Any,
|
| 605 |
) -> list[dict[str, Any]]:
|
| 606 |
"""Build the sources payload for the SSE `sources` event.
|
| 607 |
|
| 608 |
+
- structured_flow: one entry per executed table (table_name only).
|
| 609 |
+
- unstructured_flow: deduped by (document_id, page_label), Phase 1 shape.
|
| 610 |
- chat or error: empty list.
|
| 611 |
"""
|
| 612 |
+
if intent == "structured_flow":
|
| 613 |
if query_result is None or getattr(query_result, "error", None):
|
| 614 |
return []
|
| 615 |
table_name = getattr(query_result, "table_name", "") or ""
|
|
|
|
| 621 |
"page_label": None,
|
| 622 |
}]
|
| 623 |
|
| 624 |
+
if intent == "unstructured_flow" and raw_chunks:
|
| 625 |
seen: set[tuple[Any, Any]] = set()
|
| 626 |
sources: list[dict[str, Any]] = []
|
| 627 |
for item in raw_chunks:
|
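The fail-open scoping that `_ScopedCatalogReader` applies to `structured` reads can be sketched in isolation. This is a minimal illustration rather than the project's code: `Source`, `Catalog`, and `scope_catalog` are hypothetical dataclass stand-ins for the real catalog model, and only the filtering rule (empty or fully disjoint binding yields the whole catalog) is taken from the diff above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for one catalog source entry."""
    source_id: str
    name: str


@dataclass(frozen=True)
class Catalog:
    """Hypothetical stand-in for the catalog model (only `sources` matters here)."""
    sources: tuple[Source, ...]


def scope_catalog(catalog: Catalog, bound: set[str]) -> Catalog:
    """Restrict a catalog to bound sources, failing open to the full catalog."""
    if not bound:
        return catalog  # no binding rows: unscoped
    scoped = tuple(s for s in catalog.sources if s.source_id in bound)
    # A fully disjoint (for example stale) binding keeps the whole catalog
    # instead of emptying it.
    return replace(catalog, sources=scoped or catalog.sources)


catalog = Catalog(sources=(Source("s1", "sales"), Source("s2", "orders")))
print([s.source_id for s in scope_catalog(catalog, {"s2"}).sources])    # ['s2']
print([s.source_id for s in scope_catalog(catalog, set()).sources])     # ['s1', 's2']
print([s.source_id for s in scope_catalog(catalog, {"gone"}).sources])  # ['s1', 's2']
```

The trade-off is visible in the last call: a binding that matches nothing behaves like no binding at all, which keeps queries working but means the scope is only a boundary while at least one bound source still exists in the catalog.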
src/agents/gate.py
ADDED
|
@@ -0,0 +1,108 @@
|
| 1 |
+
"""Deterministic routing gate: policy check over the router's intent.
|
| 2 |
+
|
| 3 |
+
After the LLM router picks an intent, the gate checks it against the per-analysis
|
| 4 |
+
Analysis State and returns the **effective** intent: allow as-is, or redirect. No
|
| 5 |
+
LLM, no I/O in `gate()` itself.
|
| 6 |
+
|
| 7 |
+
Only one rule has teeth in v1: an analytical request (`structured_flow`) requires a
|
| 8 |
+
validated problem statement (`problem_validated is True`); otherwise it is
|
| 9 |
+
redirected to `problem_statement` so the user defines the goal first. Everything
|
| 10 |
+
else passes through. `generate_report` is not a router intent (button / report
|
| 11 |
+
API), so it is not gated here.
|
| 12 |
+
|
| 13 |
+
`AnalysisState` is the locked 8-field contract (mirrors the `analysis_states` DB
|
| 14 |
+
table). `get_analysis_state` reads the real per-analysis row via `AnalysisStateStore`
|
| 15 |
+
(#9, landed); it fails closed to a not-validated stub on a missing row or read error.
|
| 16 |
+
See `ORCHESTRATOR_REWORK_PLAN.md` §4.
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
from datetime import UTC, datetime
|
| 22 |
+
|
| 23 |
+
from pydantic import BaseModel
|
| 24 |
+
|
| 25 |
+
from src.agents.orchestration import Intent
|
| 26 |
+
from src.middlewares.logging import get_logger
|
| 27 |
+
|
| 28 |
+
logger = get_logger("gate")
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
class AnalysisState(BaseModel):
|
| 32 |
+
"""Per-analysis state the gate + Help skill read every turn (locked contract).
|
| 33 |
+
|
| 34 |
+
`problem_validated` is the gate driver; `report_id` is null until a report
|
| 35 |
+
exists. Field names mirror the `analysis_states` table so the DB read swaps in
|
| 36 |
+
without touching readers.
|
| 37 |
+
"""
|
| 38 |
+
|
| 39 |
+
id: str
|
| 40 |
+
analysis_title: str
|
| 41 |
+
problem_statement: str
|
| 42 |
+
problem_validated: bool = False
|
| 43 |
+
owner_id: str
|
| 44 |
+
report_id: str | None = None
|
| 45 |
+
created_at: datetime
|
| 46 |
+
updated_at: datetime
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
def gate(intent: Intent, state: AnalysisState) -> Intent:
|
| 50 |
+
"""Return the effective intent after applying the deterministic gate policy.
|
| 51 |
+
|
| 52 |
+
`structured_flow` requires `problem_validated is True`; otherwise redirect to
|
| 53 |
+
`problem_statement`. All other intents pass through unchanged.
|
| 54 |
+
"""
|
| 55 |
+
if intent == "structured_flow" and not state.problem_validated:
|
| 56 |
+
logger.info(
|
| 57 |
+
"gate redirect",
|
| 58 |
+
requested=intent,
|
| 59 |
+
effective="problem_statement",
|
| 60 |
+
reason="problem_not_validated",
|
| 61 |
+
)
|
| 62 |
+
return "problem_statement"
|
| 63 |
+
return intent
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def stub_analysis_state(*, problem_validated: bool = False) -> AnalysisState:
|
| 67 |
+
"""Hardcoded Analysis State: the fail-closed fallback and a shared fixture (predates the DB, #9).
|
| 68 |
+
|
| 69 |
+
Shared fixture so the gate, the Help skill, and tests all exercise the same
|
| 70 |
+
shape. `problem_validated=True` simulates a passed interview.
|
| 71 |
+
"""
|
| 72 |
+
now = datetime.now(UTC)
|
| 73 |
+
return AnalysisState(
|
| 74 |
+
id="stub-analysis",
|
| 75 |
+
analysis_title="Stub analysis",
|
| 76 |
+
problem_statement="Stub problem statement" if problem_validated else "",
|
| 77 |
+
problem_validated=problem_validated,
|
| 78 |
+
owner_id="stub-user",
|
| 79 |
+
report_id=None,
|
| 80 |
+
created_at=now,
|
| 81 |
+
updated_at=now,
|
| 82 |
+
)
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
async def get_analysis_state(analysis_id: str) -> AnalysisState:
|
| 86 |
+
"""Load the Analysis State for an analysis (shared id with the chat room).
|
| 87 |
+
|
| 88 |
+
Reads the `analysis_states` row via `AnalysisStateStore`. Never-throw seam: a
|
| 89 |
+
missing row (e.g. a legacy room created before this table) or a read failure
|
| 90 |
+
degrades to a **not-validated** stub, so the gate fails closed (→ steer to
|
| 91 |
+
`problem_statement`) rather than running ungated analysis. The store import is
|
| 92 |
+
lazy so this module stays import-safe without a DB.
|
| 93 |
+
"""
|
| 94 |
+
try:
|
| 95 |
+
from src.agents.state_store import AnalysisStateStore
|
| 96 |
+
|
| 97 |
+
state = await AnalysisStateStore().get(analysis_id)
|
| 98 |
+
except Exception as exc:  # noqa: BLE001 - never-throw; fail closed to not-validated
|
| 99 |
+
logger.warning(
|
| 100 |
+
"get_analysis_state read failed → default not-validated",
|
| 101 |
+
analysis_id=analysis_id,
|
| 102 |
+
error=str(exc),
|
| 103 |
+
)
|
| 104 |
+
return stub_analysis_state(problem_validated=False)
|
| 105 |
+
if state is None:
|
| 106 |
+
logger.debug("analysis_state missing → default not-validated", analysis_id=analysis_id)
|
| 107 |
+
return stub_analysis_state(problem_validated=False)
|
| 108 |
+
return state
|
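The single v1 gate rule described in `gate.py` can be exercised without a database or the real `AnalysisState` model. The sketch below is an assumption-laden reduction: `Intent` is redeclared locally as a `Literal` over the six intents named in the chat handler docstring (the real type lives in `src.agents.orchestration`), and `effective_intent` is a hypothetical name that takes the `problem_validated` flag directly instead of a state object.

```python
from typing import Literal

# Local stand-in for the router's intent type (assumed to cover these six values).
Intent = Literal[
    "chat",
    "structured_flow",
    "unstructured_flow",
    "check",
    "problem_statement",
    "help",
]


def effective_intent(intent: Intent, problem_validated: bool) -> Intent:
    """Redirect analytical requests until the problem statement is validated."""
    if intent == "structured_flow" and not problem_validated:
        return "problem_statement"
    return intent  # every other intent passes through unchanged


print(effective_intent("structured_flow", problem_validated=False))  # problem_statement
print(effective_intent("structured_flow", problem_validated=True))   # structured_flow
print(effective_intent("check", problem_validated=False))            # check
```

Because the function is pure (no LLM, no I/O), failing closed is a matter of what the caller passes in: when the state row is missing or unreadable, passing `problem_validated=False` yields the redirect.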
src/agents/handlers/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
"""Skill handlers dispatched by the orchestrator (`check` is deterministic; `help` and `problem_statement` call an LLM)."""
|
src/agents/handlers/check.py
ADDED
|
@@ -0,0 +1,165 @@
|
| 1 |
+
"""`check` skill handler: deterministic data/document inventory (no LLM).
|
| 2 |
+
|
| 3 |
+
The router emits a single `check` intent; this handler picks the concrete tool
|
| 4 |
+
(`check_data` for structured sources, `check_knowledge` for documents) and renders
|
| 5 |
+
the tool's `ToolOutput` table into a markdown reply. Broad queries with no
|
| 6 |
+
specific cue call both tools concurrently and stitch a helicopter-view inventory.
|
| 7 |
+
See `ORCHESTRATOR_REWORK_PLAN.md` §2.
|
| 8 |
+
|
| 9 |
+
The data-access invoker never throws (§8.4); `render_tool_output` handles the
|
| 10 |
+
`error` envelope defensively.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
import asyncio
|
| 16 |
+
import re
|
| 17 |
+
from typing import TYPE_CHECKING
|
| 18 |
+
|
| 19 |
+
from src.tools.contracts import ToolOutput
|
| 20 |
+
|
| 21 |
+
if TYPE_CHECKING:
|
| 22 |
+
from src.agents.slow_path.invoker import ToolInvoker
|
| 23 |
+
|
| 24 |
+
# Cues that point at documents rather than structured data.
|
| 25 |
+
_KNOWLEDGE_CUES = (
|
| 26 |
+
"document",
|
| 27 |
+
"docs",
|
| 28 |
+
"doc ",
|
| 29 |
+
"file",
|
| 30 |
+
"pdf",
|
| 31 |
+
"docx",
|
| 32 |
+
".txt",
|
| 33 |
+
"uploaded",
|
| 34 |
+
"knowledge",
|
| 35 |
+
"dokumen",
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
# Cues that point at structured/tabular data specifically.
|
| 39 |
+
_DATA_CUES = (
|
| 40 |
+
"kolom",
|
| 41 |
+
"column",
|
| 42 |
+
"tabel",
|
| 43 |
+
"table",
|
| 44 |
+
"baris",
|
| 45 |
+
"row",
|
| 46 |
+
"schema",
|
| 47 |
+
"skema",
|
| 48 |
+
"database",
|
| 49 |
+
" db",
|
| 50 |
+
)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def _intent(message: str) -> str:
|
| 54 |
+
"""Return 'knowledge', 'data', or 'both' (helicopter view) from keyword cues."""
|
| 55 |
+
lowered = message.lower()
|
| 56 |
+
is_knowledge = any(cue in lowered for cue in _KNOWLEDGE_CUES)
|
| 57 |
+
is_data = any(cue in lowered for cue in _DATA_CUES)
|
| 58 |
+
if is_knowledge and not is_data:
|
| 59 |
+
return "knowledge"
|
| 60 |
+
if is_data and not is_knowledge:
|
| 61 |
+
return "data"
|
| 62 |
+
return "both"
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def render_tool_output(out: ToolOutput) -> str:
|
| 66 |
+
"""Render a `check_*` ToolOutput table into a markdown string, or '' if empty."""
|
| 67 |
+
if out.kind == "error":
|
| 68 |
+
return f"Sorry, I couldn't look that up: {out.error}"
|
| 69 |
+
columns = out.columns or []
|
| 70 |
+
rows = out.rows or []
|
| 71 |
+
if not rows:
|
| 72 |
+
return ""
|
| 73 |
+
header = "| " + " | ".join(columns) + " |"
|
| 74 |
+
separator = "| " + " | ".join("---" for _ in columns) + " |"
|
| 75 |
+
body = "\n".join(
|
| 76 |
+
"| " + " | ".join(str(cell) for cell in row) + " |" for row in rows
|
| 77 |
+
)
|
| 78 |
+
return f"{header}\n{separator}\n{body}"
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
def _matched_source_ids(message: str, inventory: ToolOutput) -> list[str]:
|
| 82 |
+
"""All source_ids whose name appears as a whole word in the message.
|
| 83 |
+
|
| 84 |
+
The user names sources in plain words ("sales", "kolom sales sama orders");
|
| 85 |
+
the tool needs exact `source_id`s. We resolve them against the inventory
|
| 86 |
+
rows (kind="table", columns include "source_id" + "name") instead of an LLM:
|
| 87 |
+
a cheap match against catalog metadata already in hand. Whole-word match
|
| 88 |
+
(`\\b`) avoids nuisance hits ("orders" inside "reorders") and treats `_` as
|
| 89 |
+
part of the word, so "sales" won't pick up "sales_archive". Multiple named
|
| 90 |
+
sources all match, so the caller can show each schema.
|
| 91 |
+
"""
|
| 92 |
+
if inventory.kind != "table" or not inventory.rows:
|
| 93 |
+
return []
|
| 94 |
+
cols = inventory.columns or []
|
| 95 |
+
try:
|
| 96 |
+
id_idx = cols.index("source_id")
|
| 97 |
+
name_idx = cols.index("name")
|
| 98 |
+
except ValueError:
|
| 99 |
+
return []
|
| 100 |
+
|
| 101 |
+
matched: list[str] = []
|
| 102 |
+
for row in inventory.rows:
|
| 103 |
+
name = str(row[name_idx])
|
| 104 |
+
if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE):
|
| 105 |
+
matched.append(str(row[id_idx]))
|
| 106 |
+
return matched
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def _render_helicopter(data_out: ToolOutput, knowledge_out: ToolOutput) -> str:
|
| 110 |
+
"""Stitch structured + document inventory into one helicopter-view reply."""
|
| 111 |
+
parts: list[str] = []
|
| 112 |
+
|
| 113 |
+
data_table = render_tool_output(data_out)
|
| 114 |
+
if data_table:
|
| 115 |
+
parts.append(f"**Data terstruktur**\n{data_table}")
|
| 116 |
+
|
| 117 |
+
knowledge_table = render_tool_output(knowledge_out)
|
| 118 |
+
if knowledge_table:
|
| 119 |
+
parts.append(f"**Dokumen**\n{knowledge_table}")
|
| 120 |
+
|
| 121 |
+
if not parts:
|
| 122 |
+
return "Nothing registered yet - I don't see any sources or documents."
|
| 123 |
+
|
| 124 |
+
return "\n\n".join(parts)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
async def run_check(message: str, invoker: ToolInvoker) -> str:
|
| 128 |
+
"""Route to check_data, check_knowledge, or both (helicopter view) based on cues."""
|
| 129 |
+
intent = _intent(message)
|
| 130 |
+
|
| 131 |
+
_no_match = "Nothing registered yet - I don't see any matching sources."
|
| 132 |
+
|
| 133 |
+
if intent == "knowledge":
|
| 134 |
+
out = await invoker.invoke("check_knowledge", {})
|
| 135 |
+
return render_tool_output(out) or _no_match
|
| 136 |
+
|
| 137 |
+
if intent == "data":
|
| 138 |
+
inventory = await invoker.invoke("check_data", {})
|
| 139 |
+
if inventory.kind == "error":
|
| 140 |
+
return render_tool_output(inventory)
|
| 141 |
+
# Drill down to the schema of each source the user named; if they named
|
| 142 |
+
# none, return the source listing.
|
| 143 |
+
source_ids = _matched_source_ids(message, inventory)
|
| 144 |
+
if not source_ids:
|
| 145 |
+
return render_tool_output(inventory) or _no_match
|
| 146 |
+
schemas = await asyncio.gather(
|
| 147 |
+
*(invoker.invoke("check_data", {"source_id": sid}) for sid in source_ids)
|
| 148 |
+
)
|
| 149 |
+
if len(schemas) == 1:
|
| 150 |
+
return render_tool_output(schemas[0]) or _no_match
|
| 151 |
+
# Multiple named sources → one labelled section per source.
|
| 152 |
+
sections: list[str] = []
|
| 153 |
+
for out in schemas:
|
| 154 |
+
table = render_tool_output(out)
|
| 155 |
+
if table:
|
| 156 |
+
label = (out.meta or {}).get("source_name") or "source"
|
| 157 |
+
sections.append(f"**{label}**\n{table}")
|
| 158 |
+
return "\n\n".join(sections) or _no_match
|
| 159 |
+
|
| 160 |
+
# broad / ambiguous → helicopter view: call both concurrently
|
| 161 |
+
data_out, knowledge_out = await asyncio.gather(
|
| 162 |
+
invoker.invoke("check_data", {}),
|
| 163 |
+
invoker.invoke("check_knowledge", {}),
|
| 164 |
+
)
|
| 165 |
+
return _render_helicopter(data_out, knowledge_out)
|
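The whole-word matching that `_matched_source_ids` relies on is easy to check on its own. The sketch below keeps the same regular expression (`\b` + `re.escape(name)` + `\b`, case-insensitive) but works on a plain list of names; `matched_names` and the sample names are hypothetical and stand in for the inventory rows of a `ToolOutput`.

```python
import re


def matched_names(message: str, names: list[str]) -> list[str]:
    """Return the names that appear in the message as whole words."""
    return [
        name
        for name in names
        # re.escape guards against regex metacharacters in source names;
        # \b treats "_" as a word character, so "sales" != "sales_archive".
        if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE)
    ]


names = ["sales", "orders", "sales_archive"]
print(matched_names("kolom sales sama orders", names))  # ['sales', 'orders']
print(matched_names("show reorders", names))            # []
print(matched_names("schema of Sales_Archive", names))  # ['sales_archive']
```

Note that `\b` only anchors where a name starts and ends with a word character; a source name that begins or ends with punctuation would need a different boundary check.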
src/agents/handlers/help.py
ADDED
|
@@ -0,0 +1,192 @@
|
| 1 |
+
"""`help` skill handler: state-aware next-step guidance (LLM call).
|
| 2 |
+
|
| 3 |
+
Reads the per-analysis state + chat history (and a deterministic report-readiness
|
| 4 |
+
signal) and tells the user where they are and what to do next. Help only guides;
|
| 5 |
+
it never runs analysis or produces data answers.
|
| 6 |
+
|
| 7 |
+
The prompt lives in `config/prompts/help.md` (the playbook); this module composes
|
| 8 |
+
the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency
|
| 9 |
+
guard** has teeth here, not just in the prompt: `_derive_available_actions` computes
|
| 10 |
+
the actions actually allowed from the state (the same policy as `gate.py`), and that
|
| 11 |
+
list is fed into the prompt; the LLM is told to suggest *only* those, so it can't
|
| 12 |
+
tell the user to generate a report when the goal isn't validated or the analysis
|
| 13 |
+
isn't ready.
|
| 14 |
+
|
| 15 |
+
SEAMS:
|
| 16 |
+
- `AnalysisState` is the locked 8-field contract from `gate.py` (KM-652). The gate,
|
| 17 |
+
this skill, and tests share `gate.stub_analysis_state(...)` so they exercise the
|
| 18 |
+
same shape.
|
| 19 |
+
- `ReportReadiness` is the return shape of `is_report_ready(chat_history)` (seam #5,
|
| 20 |
+
Rifqi; not built yet). Help *consumes* it; it does not compute it. Until it lands,
|
| 21 |
+
the caller passes a stub (default: not ready).
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
from __future__ import annotations
|
| 25 |
+
|
| 26 |
+
from collections.abc import AsyncIterator
|
| 27 |
+
from dataclasses import dataclass, field
|
| 28 |
+
from pathlib import Path
|
| 29 |
+
from typing import Any
|
| 30 |
+
|
| 31 |
+
from langchain_core.messages import BaseMessage
|
| 32 |
+
from langchain_core.output_parsers import StrOutputParser
|
| 33 |
+
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
|
| 34 |
+
from langchain_core.runnables import Runnable
|
| 35 |
+
from langchain_openai import AzureChatOpenAI
|
| 36 |
+
|
| 37 |
+
from src.agents.gate import AnalysisState
|
| 38 |
+
from src.middlewares.logging import get_logger
|
| 39 |
+
|
| 40 |
+
logger = get_logger("help")
|
| 41 |
+
|
| 42 |
+
_PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompts"
|
| 43 |
+
_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
|
| 44 |
+
_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
|
| 45 |
+
|
| 46 |
+
# Neutral human turn when Help is triggered by a slash command with no real content.
|
| 47 |
+
_DEFAULT_TRIGGER = "What should I do next?"
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
@dataclass
|
| 51 |
+
class ReportReadiness:
|
| 52 |
+
"""Deterministic report-readiness signal: the return of Rifqi's `is_report_ready`.
|
| 53 |
+
|
| 54 |
+
`missing` lists the gaps to fill when `ready` is False.
|
| 55 |
+
"""
|
| 56 |
+
|
| 57 |
+
ready: bool = False
|
| 58 |
+
missing: list[str] = field(default_factory=list)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]:
|
| 62 |
+
"""Actions Help is allowed to suggest, derived from state (mirrors `gate.py`).
|
| 63 |
+
|
| 64 |
+
This is the consistency guard's teeth: analysis is gated behind a validated goal
|
| 65 |
+
(same rule the gate applies to `structured_flow`), and a report is only offered
|
| 66 |
+
when the readiness signal says so. Keep this policy in sync with `gate.gate`.
|
| 67 |
+
"""
|
| 68 |
+
if not state.problem_validated:
|
| 69 |
+
# Goal not set: the only useful move is defining the problem statement.
|
| 70 |
+
return ["define_problem_statement"]
|
| 71 |
+
|
| 72 |
+
actions = ["ask_analysis_question", "refine_problem_statement"]
|
| 73 |
+
if report_ready.ready:
|
| 74 |
+
actions.append("generate_report")
|
| 75 |
+
return actions
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def _format_state(state: AnalysisState) -> str:
|
| 79 |
+
"""Render the analysis state as a compact context block for the LLM."""
|
| 80 |
+
has_report = "yes" if state.report_id else "no"
|
| 81 |
+
return (
|
| 82 |
+
"[Analysis state]\n"
|
| 83 |
+
f"analysis_title: {state.analysis_title or '(none)'}\n"
|
| 84 |
+
f"problem_statement: {state.problem_statement or '(empty)'}\n"
|
| 85 |
+
f"problem_validated: {str(state.problem_validated).lower()}\n"
|
| 86 |
+
f"has_report: {has_report}"
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def _format_report_ready(report_ready: ReportReadiness) -> str:
|
| 91 |
+
missing = ", ".join(report_ready.missing) if report_ready.missing else "(none)"
|
| 92 |
+
return (
|
| 93 |
+
"[Report readiness]\n"
|
| 94 |
+
f"ready: {str(report_ready.ready).lower()}\n"
|
| 95 |
+
f"missing: {missing}"
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def _build_context_block(
|
| 100 |
+
state: AnalysisState,
|
| 101 |
+
report_ready: ReportReadiness,
|
| 102 |
+
available_actions: list[str],
|
| 103 |
+
) -> str:
|
| 104 |
+
"""Compose the deterministic context the prompt's 'never misguide' rule trusts."""
|
| 105 |
+
return "\n\n".join(
|
| 106 |
+
[
|
| 107 |
+
_format_state(state),
|
| 108 |
+
_format_report_ready(report_ready),
|
| 109 |
+
"[Available actions]\n" + ", ".join(available_actions),
|
| 110 |
+
]
|
| 111 |
+
)
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def _load_system_prompt() -> str:
|
| 115 |
+
"""Compose system prompt = help.md + guardrails.md (guardrails last, as elsewhere)."""
|
| 116 |
+
help_md = _SYSTEM_PROMPT_PATH.read_text(encoding="utf-8")
|
| 117 |
+
guardrails = _GUARDRAILS_PATH.read_text(encoding="utf-8")
|
| 118 |
+
return f"{help_md}\n\n{guardrails}"
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
def _build_default_chain() -> Runnable:
|
| 122 |
+
from src.config.settings import settings
|
| 123 |
+
|
| 124 |
+
llm = AzureChatOpenAI(
|
| 125 |
+
azure_deployment=settings.azureai_deployment_name_4o,
|
| 126 |
+
openai_api_version=settings.azureai_api_version_4o,
|
| 127 |
+
azure_endpoint=settings.azureai_endpoint_url_4o,
|
| 128 |
+
api_key=settings.azureai_api_key_4o,
|
| 129 |
+
temperature=0.3,
|
| 130 |
+
model_kwargs={"stream_options": {"include_usage": True}},
|
| 131 |
+
)
|
| 132 |
+
prompt = ChatPromptTemplate.from_messages(
|
| 133 |
+
[
|
| 134 |
+
("system", _load_system_prompt()),
|
| 135 |
+
MessagesPlaceholder(variable_name="history", optional=True),
|
| 136 |
+
("human", "{message}"),
|
| 137 |
+
("system", "Analysis state and signals for this turn:\n\n{context}"),
|
| 138 |
+
]
|
| 139 |
+
)
|
| 140 |
+
return prompt | llm | StrOutputParser()
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
class HelpAgent:
|
| 144 |
+
"""Streams state-aware guidance to the user.
|
| 145 |
+
|
| 146 |
+
`chain` is injectable: tests pass a fake that yields canned tokens. Default
|
| 147 |
+
constructs the production Azure OpenAI streaming chain on first use.
|
| 148 |
+
"""
|
| 149 |
+
|
| 150 |
+
def __init__(self, chain: Runnable | None = None) -> None:
|
| 151 |
+
self._chain = chain
|
| 152 |
+
|
| 153 |
+
def _ensure_chain(self) -> Runnable:
|
| 154 |
+
if self._chain is None:
|
| 155 |
+
self._chain = _build_default_chain()
|
| 156 |
+
return self._chain
|
| 157 |
+
|
| 158 |
+
async def astream(
|
| 159 |
+
self,
|
| 160 |
+
state: AnalysisState,
|
| 161 |
+
history: list[BaseMessage] | None = None,
|
| 162 |
+
report_ready: ReportReadiness | None = None,
|
| 163 |
+
message: str | None = None,
|
| 164 |
+
available_actions: list[str] | None = None,
|
| 165 |
+
callbacks: list | None = None,
|
| 166 |
+
) -> AsyncIterator[str]:
|
| 167 |
+
"""Stream tokens of the guidance reply.
|
| 168 |
+
|
| 169 |
+
`report_ready` defaults to "not ready" so a missing signal degrades safely.
|
| 170 |
+
`available_actions`, when omitted, is derived deterministically from state.
|
| 171 |
+
"""
|
| 172 |
+
readiness = report_ready or ReportReadiness()
|
| 173 |
+
actions = available_actions or _derive_available_actions(state, readiness)
|
| 174 |
+
logger.info(
|
| 175 |
+
"help guidance",
|
| 176 |
+
problem_validated=state.problem_validated,
|
| 177 |
+
report_ready=readiness.ready,
|
| 178 |
+
available_actions=actions,
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
chain = self._ensure_chain()
|
| 182 |
+
payload: dict[str, Any] = {
|
| 183 |
+
"message": message or _DEFAULT_TRIGGER,
|
| 184 |
+
"history": history or [],
|
| 185 |
+
"context": _build_context_block(state, readiness, actions),
|
| 186 |
+
}
|
| 187 |
+
if callbacks:
|
| 188 |
+
async for token in chain.astream(payload, config={"callbacks": callbacks}):
|
| 189 |
+
yield token
|
| 190 |
+
else:
|
| 191 |
+
async for token in chain.astream(payload):
|
| 192 |
+
yield token
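Reviewer sketch of the consistency guard in `_derive_available_actions`, reproduced with stand-in dataclasses so it runs without the repo. `StubState` and `StubReadiness` are hypothetical and model only the fields the policy reads; the real `AnalysisState` lives in `gate.py` and is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class StubState:
    """Hypothetical stand-in for AnalysisState (only the field the policy reads)."""

    problem_validated: bool = False


@dataclass
class StubReadiness:
    """Hypothetical stand-in for ReportReadiness."""

    ready: bool = False
    missing: list[str] = field(default_factory=list)


def derive_available_actions(state: StubState, readiness: StubReadiness) -> list[str]:
    # Same policy as the diff above: no validated goal means the only
    # action on offer is defining the problem statement.
    if not state.problem_validated:
        return ["define_problem_statement"]
    actions = ["ask_analysis_question", "refine_problem_statement"]
    # A report is offered only when the readiness signal says so.
    if readiness.ready:
        actions.append("generate_report")
    return actions


unvalidated = derive_available_actions(StubState(False), StubReadiness(ready=True))
validated = derive_available_actions(StubState(True), StubReadiness())
report_ok = derive_available_actions(StubState(True), StubReadiness(ready=True))
```

Note that `generate_report` stays hidden when the goal is not validated, even if readiness is true, which matches the module docstring. One thing to check in `astream`: `available_actions or _derive_available_actions(...)` treats an explicitly passed empty list the same as `None`.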
|
src/agents/handlers/problem_statement.py
ADDED
|
@@ -0,0 +1,171 @@
|
| 1 |
+
"""Problem Statement skill: guide the user to a usable problem statement.
|
| 2 |
+
|
| 3 |
+
Routed by the orchestrator (intent `problem_statement`) and callable as a skill.
|
| 4 |
+
An LLM drafts/refines the statement from the analysis title + the user's message and
|
| 5 |
+
declares what's still `missing`; a check validates only when nothing is missing. The
|
| 6 |
+
model is instructed to fill `objective`/`metric` ONLY from what the user explicitly
|
| 7 |
+
stated; a bare data question ("which X has the most Y?") leaves them in `missing`, so
|
| 8 |
+
it does not auto-validate (the gate stays meaningful). On a valid draft it persists
|
| 9 |
+
`problem_statement` + `problem_validated=True`; otherwise it streams guidance and
|
| 10 |
+
leaves the analysis un-validated.
|
| 11 |
+
|
| 12 |
+
NOTE: completeness is still a (hardened) LLM judgment; the truly deterministic gate
|
| 13 |
+
is an explicit user confirmation, planned with the frontend (see T3b / #11).
|
| 14 |
+
|
| 15 |
+
See `ORCHESTRATOR_REWORK_PLAN.md` §4 and the 2026-06-18 checkpoint.
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
from __future__ import annotations
|
| 19 |
+
|
| 20 |
+
from pathlib import Path
|
| 21 |
+
from typing import TYPE_CHECKING
|
| 22 |
+
|
| 23 |
+
from langchain_core.messages import BaseMessage
|
| 24 |
+
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
|
| 25 |
+
from langchain_core.runnables import Runnable
|
| 26 |
+
from langchain_openai import AzureChatOpenAI
|
| 27 |
+
from pydantic import BaseModel, Field
|
| 28 |
+
|
| 29 |
+
from src.middlewares.logging import get_logger
|
| 30 |
+
|
| 31 |
+
if TYPE_CHECKING:
|
| 32 |
+
from src.agents.state_store import AnalysisStateStore
|
| 33 |
+
|
| 34 |
+
logger = get_logger("problem_statement")
|
| 35 |
+
|
| 36 |
+
_PROMPT_PATH = (
|
| 37 |
+
Path(__file__).resolve().parent.parent.parent
|
| 38 |
+
/ "config"
|
| 39 |
+
/ "prompts"
|
| 40 |
+
/ "problem_statement.md"
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class ProblemStatementDraft(BaseModel):
|
| 45 |
+
"""LLM output for the Problem Statement skill."""
|
| 46 |
+
|
| 47 |
+
problem_statement: str = Field(
|
| 48 |
+
..., description="The refined, standalone problem statement (never empty)."
|
| 49 |
+
)
|
| 50 |
+
objective: str = Field(
|
| 51 |
+
"", description="What success looks like - fill ONLY when the user explicitly "
|
| 52 |
+
"stated it; never inferred from a data question. Empty otherwise."
|
| 53 |
+
)
|
| 54 |
+
metric: str = Field(
|
| 55 |
+
"", description="The KPI to move/investigate - fill ONLY when the user "
|
| 56 |
+
"explicitly stated it; never inferred from a data question. Empty otherwise."
|
| 57 |
+
)
|
| 58 |
+
missing: list[str] = Field(
|
| 59 |
+
default_factory=list,
|
| 60 |
+
description="Which of 'objective' / 'metric' the user has NOT explicitly stated "
|
| 61 |
+
"yet. A bare data question leaves both here. Empty list = complete.",
|
| 62 |
+
)
|
| 63 |
+
feedback: str = Field(
|
| 64 |
+
...,
|
| 65 |
+
description="Message to the user - guidance if incomplete, confirmation if complete.",
|
| 66 |
+
)
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def is_valid(draft: ProblemStatementDraft) -> bool:
|
| 70 |
+
"""Complete iff there's a statement and the model flagged nothing missing.
|
| 71 |
+
|
| 72 |
+
Keying on the model's explicit `missing` list (rather than 'are objective/metric
|
| 73 |
+
non-empty') is what stops a bare data question from auto-validating: the hardened
|
| 74 |
+
prompt puts the un-stated parts in `missing`, so this returns False for it.
|
| 75 |
+
"""
|
| 76 |
+
return bool(draft.problem_statement.strip()) and not draft.missing
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def _load_prompt_text() -> str:
|
| 80 |
+
return _PROMPT_PATH.read_text(encoding="utf-8")
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def _build_default_chain() -> Runnable:
|
| 84 |
+
from src.config.settings import settings
|
| 85 |
+
|
| 86 |
+
llm = AzureChatOpenAI(
|
| 87 |
+
azure_deployment=settings.azureai_deployment_name_4o,
|
| 88 |
+
openai_api_version=settings.azureai_api_version_4o,
|
| 89 |
+
azure_endpoint=settings.azureai_endpoint_url_4o,
|
| 90 |
+
api_key=settings.azureai_api_key_4o,
|
| 91 |
+
temperature=0,
|
| 92 |
+
)
|
| 93 |
+
prompt = ChatPromptTemplate.from_messages(
|
| 94 |
+
[
|
| 95 |
+
("system", _load_prompt_text()),
|
| 96 |
+
MessagesPlaceholder(variable_name="history", optional=True),
|
| 97 |
+
(
|
| 98 |
+
"human",
|
| 99 |
+
"Analysis title: {analysis_title}\n"
|
| 100 |
+
"Current problem statement: {current}\n\n"
|
| 101 |
+
"User message: {message}",
|
| 102 |
+
),
|
| 103 |
+
]
|
| 104 |
+
)
|
| 105 |
+
return prompt | llm.with_structured_output(ProblemStatementDraft)
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
class ProblemStatementAgent:
|
| 109 |
+
"""Single LLM call that drafts/refines a problem statement.
|
| 110 |
+
|
| 111 |
+
Inject `chain` for tests; the default builds the Azure OpenAI chain on first use.
|
| 112 |
+
"""
|
| 113 |
+
|
| 114 |
+
def __init__(self, chain: Runnable | None = None) -> None:
|
| 115 |
+
self._chain = chain
|
| 116 |
+
|
| 117 |
+
def _ensure_chain(self) -> Runnable:
|
| 118 |
+
if self._chain is None:
|
| 119 |
+
self._chain = _build_default_chain()
|
| 120 |
+
return self._chain
|
| 121 |
+
|
| 122 |
+
async def draft(
|
| 123 |
+
self,
|
| 124 |
+
message: str,
|
| 125 |
+
analysis_title: str,
|
| 126 |
+
current: str,
|
| 127 |
+
history: list[BaseMessage] | None = None,
|
| 128 |
+
) -> ProblemStatementDraft:
|
| 129 |
+
chain = self._ensure_chain()
|
| 130 |
+
return await chain.ainvoke(
|
| 131 |
+
{
|
| 132 |
+
"message": message,
|
| 133 |
+
"analysis_title": analysis_title,
|
| 134 |
+
"current": current,
|
| 135 |
+
"history": history or [],
|
| 136 |
+
}
|
| 137 |
+
)
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
async def run_problem_statement(
|
| 141 |
+
message: str,
|
| 142 |
+
analysis_id: str | None,
|
| 143 |
+
*,
|
| 144 |
+
agent: ProblemStatementAgent,
|
| 145 |
+
store: AnalysisStateStore,
|
| 146 |
+
history: list[BaseMessage] | None = None,
|
| 147 |
+
) -> str:
|
| 148 |
+
"""Draft + validate the problem statement; persist on a valid draft.
|
| 149 |
+
|
| 150 |
+
Loads the current title/statement (if the analysis exists), drafts a refinement,
|
| 151 |
+
runs the deterministic completeness check, and writes `problem_statement` +
|
| 152 |
+
`problem_validated` back. Returns the user-facing feedback. When `analysis_id` is
|
| 153 |
+
missing (e.g. a legacy room), it still drafts + returns guidance but cannot persist.
|
| 154 |
+
"""
|
| 155 |
+
analysis_title, current = "New analysis", ""
|
| 156 |
+
if analysis_id:
|
| 157 |
+
state = await store.get(analysis_id)
|
| 158 |
+
if state is not None:
|
| 159 |
+
analysis_title, current = state.analysis_title, state.problem_statement
|
| 160 |
+
|
| 161 |
+
draft = await agent.draft(message, analysis_title, current, history)
|
| 162 |
+
validated = is_valid(draft)
|
| 163 |
+
|
| 164 |
+
if analysis_id:
|
| 165 |
+
await store.update(
|
| 166 |
+
analysis_id,
|
| 167 |
+
problem_statement=draft.problem_statement,
|
| 168 |
+
problem_validated=validated,
|
| 169 |
+
)
|
| 170 |
+
logger.info("problem_statement drafted", analysis_id=analysis_id, validated=validated)
|
| 171 |
+
return draft.feedback
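Reviewer sketch of why keying `is_valid` on the `missing` list stops a bare data question from auto-validating. `DraftStub` is a hypothetical dataclass standing in for the Pydantic `ProblemStatementDraft`, and the sample drafts are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DraftStub:
    """Hypothetical stand-in for ProblemStatementDraft (subset of fields)."""

    problem_statement: str
    missing: list[str] = field(default_factory=list)


def is_valid(draft: DraftStub) -> bool:
    # Complete only if there is a non-blank statement AND nothing is flagged missing.
    return bool(draft.problem_statement.strip()) and not draft.missing


# A bare data question: the model is expected to leave both parts in `missing`.
bare_question = DraftStub(
    problem_statement="Which region has the most orders?",
    missing=["objective", "metric"],
)
# The user stated an objective and a metric explicitly, so nothing is missing.
complete = DraftStub(
    problem_statement="Reduce churn by finding which plan tier has the highest monthly churn rate.",
)
blank = DraftStub(problem_statement="   ")

results = [is_valid(bare_question), is_valid(complete), is_valid(blank)]
```

The check itself is deterministic, but as the module's NOTE says, the contents of `missing` still come from the LLM.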
|
src/agents/orchestration.py
CHANGED
|
@@ -1,13 +1,17 @@
|
|
| 1 |
-
"""OrchestratorAgent - classifies a user message
|
| 2 |
|
| 3 |
-
Output:
|
| 4 |
-
+ rewritten_query (standalone form of the user's question, history-resolved).
|
| 5 |
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
"""
|
| 12 |
|
| 13 |
from __future__ import annotations
|
|
@@ -25,7 +29,14 @@ from src.middlewares.logging import get_logger
|
|
| 25 |
|
| 26 |
logger = get_logger("orchestrator")
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
_PROMPT_PATH = (
|
| 31 |
Path(__file__).resolve().parent.parent
|
|
@@ -35,21 +46,29 @@ _PROMPT_PATH = (
|
|
| 35 |
)
|
| 36 |
|
| 37 |
|
| 38 |
-
class
|
| 39 |
"""LLM output. Pydantic so it can be used with `with_structured_output`."""
|
| 40 |
|
| 41 |
-
|
| 42 |
-
..., description="True if we must look at the user's data to answer."
|
| 43 |
-
)
|
| 44 |
-
source_hint: SourceHint = Field(
|
| 45 |
...,
|
| 46 |
-
description=
|
| 47 |
-
|
| 48 |
)
|
| 49 |
rewritten_query: str | None = Field(
|
| 50 |
None,
|
| 51 |
-
description=
|
| 52 |
-
|
| 53 |
)
|
| 54 |
|
| 55 |
|
|
@@ -74,11 +93,11 @@ def _build_default_chain() -> Runnable:
|
|
| 74 |
("human", "{message}"),
|
| 75 |
]
|
| 76 |
)
|
| 77 |
-
return prompt | llm.with_structured_output(
|
| 78 |
|
| 79 |
|
| 80 |
class OrchestratorAgent:
|
| 81 |
-
"""Classifies a user message into
|
| 82 |
|
| 83 |
Inject `structured_chain` for tests; default builds the production
|
| 84 |
Azure OpenAI chain on first use.
|
|
@@ -97,18 +116,14 @@ class OrchestratorAgent:
|
|
| 97 |
message: str,
|
| 98 |
history: list[BaseMessage] | None = None,
|
| 99 |
callbacks: list | None = None,
|
| 100 |
-
) ->
|
| 101 |
chain = self._ensure_chain()
|
| 102 |
payload = {"message": message, "history": history or []}
|
| 103 |
if callbacks:
|
| 104 |
-
decision:
|
| 105 |
payload, config={"callbacks": callbacks}
|
| 106 |
)
|
| 107 |
else:
|
| 108 |
decision = await chain.ainvoke(payload)
|
| 109 |
-
logger.info(
|
| 110 |
-
"intent classified",
|
| 111 |
-
source_hint=decision.source_hint,
|
| 112 |
-
needs_search=decision.needs_search,
|
| 113 |
-
)
|
| 114 |
return decision
|
|
|
|
| 1 |
+
"""OrchestratorAgent - classifies a user message into one of six intents.
|
| 2 |
|
| 3 |
+
Output: RouterDecision { intent, rewritten_query, confidence }.
|
|
|
|
| 4 |
|
| 5 |
+
The router is a **handler-level** intent classifier, not a data-modality
|
| 6 |
+
classifier: `structured_flow` routes to the slow Planner spine and
|
| 7 |
+
`unstructured_flow` to the fast RAG path; the structured/unstructured data mix on
|
| 8 |
+
the slow path is the Planner's job, not the router's. See
|
| 9 |
+
`ORCHESTRATOR_REWORK_PLAN.md`.
|
| 10 |
+
|
| 11 |
+
The class name `OrchestratorAgent` is preserved so existing import sites
|
| 12 |
+
(`from src.agents.orchestration import OrchestratorAgent`) keep working. The
|
| 13 |
+
default LLM chain is built lazily so the module is import-safe even without
|
| 14 |
+
`.env` populated.
|
| 15 |
"""
|
| 16 |
|
| 17 |
from __future__ import annotations
|
|
|
|
| 29 |
|
| 30 |
logger = get_logger("orchestrator")
|
| 31 |
|
| 32 |
+
Intent = Literal[
|
| 33 |
+
"chat",
|
| 34 |
+
"help",
|
| 35 |
+
"problem_statement",
|
| 36 |
+
"check",
|
| 37 |
+
"unstructured_flow",
|
| 38 |
+
"structured_flow",
|
| 39 |
+
]
|
| 40 |
|
| 41 |
_PROMPT_PATH = (
|
| 42 |
Path(__file__).resolve().parent.parent
|
|
|
|
| 46 |
)
|
| 47 |
|
| 48 |
|
| 49 |
+
class RouterDecision(BaseModel):
|
| 50 |
"""LLM output. Pydantic so it can be used with `with_structured_output`."""
|
| 51 |
|
| 52 |
+
intent: Intent = Field(
|
| 53 |
...,
|
| 54 |
+
description=(
|
| 55 |
+
"Handler route for this message: 'chat' (conversational, no data), "
|
| 56 |
+
"'help' (what-to-do-next guidance), 'problem_statement' (define or "
|
| 57 |
+
"refine the analysis goal), 'check' (inventory: what data/documents "
|
| 58 |
+
"exist), 'unstructured_flow' (answer from documents, fast RAG), or "
|
| 59 |
+
"'structured_flow' (analytical question over data, slow Planner path)."
|
| 60 |
+
),
|
| 61 |
)
|
| 62 |
rewritten_query: str | None = Field(
|
| 63 |
None,
|
| 64 |
+
description=(
|
| 65 |
+
"Standalone version of the question, history-resolved. Null for "
|
| 66 |
+
"'chat' and 'help' (no data lookup needed)."
|
| 67 |
+
),
|
| 68 |
+
)
|
| 69 |
+
confidence: float | None = Field(
|
| 70 |
+
None,
|
| 71 |
+
description="Classifier confidence in [0, 1]. Optional.",
|
| 72 |
)
|
| 73 |
|
| 74 |
|
|
|
|
| 93 |
("human", "{message}"),
|
| 94 |
]
|
| 95 |
)
|
| 96 |
+
return prompt | llm.with_structured_output(RouterDecision)
|
| 97 |
|
| 98 |
|
| 99 |
class OrchestratorAgent:
|
| 100 |
+
"""Classifies a user message into one of the six router intents.
|
| 101 |
|
| 102 |
Inject `structured_chain` for tests; default builds the production
|
| 103 |
Azure OpenAI chain on first use.
|
|
|
|
| 116 |
message: str,
|
| 117 |
history: list[BaseMessage] | None = None,
|
| 118 |
callbacks: list | None = None,
|
| 119 |
+
) -> RouterDecision:
|
| 120 |
chain = self._ensure_chain()
|
| 121 |
payload = {"message": message, "history": history or []}
|
| 122 |
if callbacks:
|
| 123 |
+
decision: RouterDecision = await chain.ainvoke(
|
| 124 |
payload, config={"callbacks": callbacks}
|
| 125 |
)
|
| 126 |
else:
|
| 127 |
decision = await chain.ainvoke(payload)
|
| 128 |
+
logger.info("intent classified", intent=decision.intent)
|
| 129 |
return decision
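Reviewer sketch of how a caller might dispatch on `RouterDecision.intent`. The `Intent` literal is copied from the diff above. `pick_handler` and its redirect of an ungated `structured_flow` to `problem_statement` are assumptions made for illustration. The real gating policy lives in `gate.py`, which is not shown in this part of the diff.

```python
from typing import Literal, get_args

Intent = Literal[
    "chat",
    "help",
    "problem_statement",
    "check",
    "unstructured_flow",
    "structured_flow",
]


def pick_handler(intent: str, problem_validated: bool) -> str:
    """Hypothetical dispatcher: returns the name of the handler to run."""
    if intent not in get_args(Intent):
        raise ValueError(f"unknown intent: {intent!r}")
    if intent == "structured_flow" and not problem_validated:
        # Assumption: analysis is gated behind a validated goal, so an
        # ungated analytical question is sent to define the goal first.
        return "problem_statement"
    return intent


routes = {
    "gated": pick_handler("structured_flow", problem_validated=False),
    "open": pick_handler("structured_flow", problem_validated=True),
    "chat": pick_handler("chat", problem_validated=False),
}
```

Deriving the allowed values with `get_args(Intent)` keeps the runtime check tied to the same literal that `with_structured_output` uses.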
|
src/agents/planner/examples.py
CHANGED
|
@@ -2,7 +2,7 @@
|
|
| 2 |
|
| 3 |
Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
|
| 4 |
stages, dependency edges, ordered tool-call chains, inline QueryIR,
|
| 5 |
-
"${t<id>}" placeholders, and the assumed data-flow convention: `
|
| 6 |
pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
|
| 7 |
referencing the upstream result's column aliases (Pattern A; the tool team may
|
| 8 |
instead pick self-fetch by `source_id`, in which case these examples are reshaped
|
|
@@ -21,9 +21,8 @@ from .schemas import Task, TaskList, ToolCall
|
|
| 21 |
# --------------------------------------------------------------------------- #
|
| 22 |
# Example A - exploratory, no modeling.
|
| 23 |
# "Which product categories drove last quarter's revenue?"
|
| 24 |
-
# Shows:
|
| 25 |
-
# category
|
| 26 |
-
# queries).
|
| 27 |
# --------------------------------------------------------------------------- #
|
| 28 |
|
| 29 |
_EXAMPLE_A = TaskList(
|
|
@@ -36,7 +35,7 @@ _EXAMPLE_A = TaskList(
|
|
| 36 |
id="t1",
|
| 37 |
stage="data_understanding",
|
| 38 |
objective="Confirm the sales source exposes category, revenue, and order date.",
|
| 39 |
-
tool_calls=[ToolCall(tool="
|
| 40 |
expected_output="source_shape",
|
| 41 |
success_criteria="Produced the orders table schema; the 3 needed columns are present.",
|
| 42 |
depends_on=[],
|
|
@@ -48,7 +47,7 @@ _EXAMPLE_A = TaskList(
|
|
| 48 |
objective="Pull last quarter's order-level category and revenue rows.",
|
| 49 |
tool_calls=[
|
| 50 |
ToolCall(
|
| 51 |
-
tool="
|
| 52 |
args={
|
| 53 |
"ir": {
|
| 54 |
"source_id": "src_sales",
|
|
@@ -78,20 +77,19 @@ _EXAMPLE_A = TaskList(
|
|
| 78 |
Task(
|
| 79 |
id="t3",
|
| 80 |
stage="evaluation",
|
| 81 |
-
objective="
|
| 82 |
tool_calls=[
|
| 83 |
ToolCall(
|
| 84 |
-
tool="
|
| 85 |
args={
|
| 86 |
"data": "${t2}",
|
| 87 |
-
"
|
| 88 |
-
"
|
| 89 |
-
"agg": "sum",
|
| 90 |
},
|
| 91 |
)
|
| 92 |
],
|
| 93 |
-
expected_output="
|
| 94 |
-
success_criteria="Produced
|
| 95 |
depends_on=["t2"],
|
| 96 |
estimated_cost="low",
|
| 97 |
),
|
|
@@ -113,7 +111,7 @@ _EXAMPLE_B = TaskList(
|
|
| 113 |
id="t1",
|
| 114 |
stage="data_understanding",
|
| 115 |
objective="Confirm the sales source exposes order date, revenue, and region.",
|
| 116 |
-
tool_calls=[ToolCall(tool="
|
| 117 |
expected_output="source_shape",
|
| 118 |
success_criteria="Produced the orders table schema; the needed columns are present.",
|
| 119 |
depends_on=[],
|
|
@@ -125,7 +123,7 @@ _EXAMPLE_B = TaskList(
|
|
| 125 |
objective="Pull this year's order dates, revenue, and region.",
|
| 126 |
tool_calls=[
|
| 127 |
ToolCall(
|
| 128 |
-
tool="
|
| 129 |
args={
|
| 130 |
"ir": {
|
| 131 |
"source_id": "src_sales",
|
|
@@ -189,8 +187,8 @@ _EXAMPLE_B = TaskList(
|
|
| 189 |
# Example C - mixed structured + unstructured.
|
| 190 |
# "Revenue dipped in Q1 - what happened?"
|
| 191 |
# Shows: a structured branch (query -> analyze_trend) runs alongside an
|
| 192 |
-
# INDEPENDENT
|
| 193 |
-
#
|
| 194 |
# placeholder: it is a source, not a consumer) and can run in parallel; the
|
| 195 |
# Assembler folds the document context into the explanation.
|
| 196 |
# --------------------------------------------------------------------------- #
|
|
@@ -205,7 +203,7 @@ _EXAMPLE_C = TaskList(
|
|
| 205 |
id="t1",
|
| 206 |
stage="data_understanding",
|
| 207 |
objective="Confirm the sales source exposes order date and revenue.",
|
| 208 |
-
tool_calls=[ToolCall(tool="
|
| 209 |
expected_output="source_shape",
|
| 210 |
success_criteria="Produced the orders table schema; date and revenue columns present.",
|
| 211 |
depends_on=[],
|
|
@@ -217,7 +215,7 @@ _EXAMPLE_C = TaskList(
|
|
| 217 |
objective="Pull Q1 order dates and revenue.",
|
| 218 |
tool_calls=[
|
| 219 |
ToolCall(
|
| 220 |
-
tool="
|
| 221 |
args={
|
| 222 |
"ir": {
|
| 223 |
"source_id": "src_sales",
|
|
@@ -275,7 +273,7 @@ _EXAMPLE_C = TaskList(
|
|
| 275 |
objective="Retrieve qualitative context on Q1 operational events behind a dip.",
|
| 276 |
tool_calls=[
|
| 277 |
ToolCall(
|
| 278 |
-
tool="
|
| 279 |
args={
|
| 280 |
"query": "operational issues, outages, or notable events in Q1 2026",
|
| 281 |
"top_k": 5,
|
|
@@ -310,7 +308,7 @@ _EXAMPLE_D = TaskList(
|
|
| 310 |
id="t1",
|
| 311 |
stage="data_understanding",
|
| 312 |
objective="Confirm the sales source exposes region and revenue.",
|
| 313 |
-
tool_calls=[ToolCall(tool="
|
| 314 |
expected_output="source_shape",
|
| 315 |
success_criteria="Produced the orders table schema; region and revenue present.",
|
| 316 |
depends_on=[],
|
|
@@ -322,7 +320,7 @@ _EXAMPLE_D = TaskList(
|
|
| 322 |
objective="Pull order-level region and revenue.",
|
| 323 |
tool_calls=[
|
| 324 |
ToolCall(
|
| 325 |
-
tool="
|
| 326 |
args={
|
| 327 |
"ir": {
|
| 328 |
"source_id": "src_sales",
|
|
|
|
| 2 |
|
| 3 |
Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE:
|
| 4 |
stages, dependency edges, ordered tool-call chains, inline QueryIR,
|
| 5 |
+
"${t<id>}" placeholders, and the assumed data-flow convention: `retrieve_data`
|
| 6 |
pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder
|
| 7 |
referencing the upstream result's column aliases (Pattern A; the tool team may
|
| 8 |
instead pick self-fetch by `source_id`, in which case these examples are reshaped
|
|
|
|
| 21 |
# --------------------------------------------------------------------------- #
|
| 22 |
# Example A - exploratory, no modeling.
|
| 23 |
# "Which product categories drove last quarter's revenue?"
|
| 24 |
+
# Shows: retrieve_data pulls rows -> analyze_aggregate sums revenue per
|
| 25 |
+
# category in one call (no manual per-category queries).
|
|
|
|
| 26 |
# --------------------------------------------------------------------------- #
|
| 27 |
|
| 28 |
_EXAMPLE_A = TaskList(
|
|
|
|
| 35 |
id="t1",
|
| 36 |
stage="data_understanding",
|
| 37 |
objective="Confirm the sales source exposes category, revenue, and order date.",
|
| 38 |
+
tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
|
| 39 |
expected_output="source_shape",
|
| 40 |
success_criteria="Produced the orders table schema; the 3 needed columns are present.",
|
| 41 |
depends_on=[],
|
|
|
|
| 47 |
objective="Pull last quarter's order-level category and revenue rows.",
|
| 48 |
tool_calls=[
|
| 49 |
ToolCall(
|
| 50 |
+
tool="retrieve_data",
|
| 51 |
args={
|
| 52 |
"ir": {
|
| 53 |
"source_id": "src_sales",
|
|
|
|
| 77 |
Task(
|
| 78 |
id="t3",
|
| 79 |
stage="evaluation",
|
| 80 |
+
objective="Sum revenue per category for the quarter.",
|
| 81 |
tool_calls=[
|
| 82 |
ToolCall(
|
| 83 |
+
tool="analyze_aggregate",
|
| 84 |
args={
|
| 85 |
"data": "${t2}",
|
| 86 |
+
"aggregations": {"revenue": ["sum"]},
|
| 87 |
+
"group_by": ["category"],
|
|
|
|
| 88 |
},
|
| 89 |
)
|
| 90 |
],
|
| 91 |
+
expected_output="category_revenue",
|
| 92 |
+
success_criteria="Produced total revenue per category, one row each.",
|
| 93 |
depends_on=["t2"],
|
| 94 |
estimated_cost="low",
|
| 95 |
),
|
|
|
|
| 111 |
id="t1",
|
| 112 |
stage="data_understanding",
|
| 113 |
objective="Confirm the sales source exposes order date, revenue, and region.",
|
| 114 |
+
tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
|
| 115 |
expected_output="source_shape",
|
| 116 |
success_criteria="Produced the orders table schema; the needed columns are present.",
|
| 117 |
depends_on=[],
|
|
|
|
| 123 |
objective="Pull this year's order dates, revenue, and region.",
|
| 124 |
tool_calls=[
|
| 125 |
ToolCall(
|
| 126 |
+
tool="retrieve_data",
|
| 127 |
args={
|
| 128 |
"ir": {
|
| 129 |
"source_id": "src_sales",
|
|
|
|
| 187 |
# Example C - mixed structured + unstructured.
|
| 188 |
# "Revenue dipped in Q1 - what happened?"
|
| 189 |
# Shows: a structured branch (query -> analyze_trend) runs alongside an
|
| 190 |
+
# INDEPENDENT retrieve_knowledge branch that pulls qualitative context. Note
|
| 191 |
+
# retrieve_knowledge takes a natural-language `query` (NOT a `${t<id>}` data
|
| 192 |
# placeholder: it is a source, not a consumer) and can run in parallel; the
|
| 193 |
# Assembler folds the document context into the explanation.
|
| 194 |
# --------------------------------------------------------------------------- #
|
|
|
|
| 203 |
id="t1",
|
| 204 |
stage="data_understanding",
|
| 205 |
objective="Confirm the sales source exposes order date and revenue.",
|
| 206 |
+
tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
|
| 207 |
expected_output="source_shape",
|
| 208 |
success_criteria="Produced the orders table schema; date and revenue columns present.",
|
| 209 |
depends_on=[],
|
|
|
|
| 215 |
objective="Pull Q1 order dates and revenue.",
|
| 216 |
tool_calls=[
|
| 217 |
ToolCall(
|
| 218 |
+
tool="retrieve_data",
|
| 219 |
args={
|
| 220 |
"ir": {
|
| 221 |
"source_id": "src_sales",
|
|
|
|
| 273 |
objective="Retrieve qualitative context on Q1 operational events behind a dip.",
|
| 274 |
tool_calls=[
|
| 275 |
ToolCall(
|
| 276 |
+
tool="retrieve_knowledge",
|
| 277 |
args={
|
| 278 |
"query": "operational issues, outages, or notable events in Q1 2026",
|
| 279 |
"top_k": 5,
|
|
|
|
| 308 |
id="t1",
|
| 309 |
stage="data_understanding",
|
| 310 |
objective="Confirm the sales source exposes region and revenue.",
|
| 311 |
+
tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})],
|
| 312 |
expected_output="source_shape",
|
| 313 |
success_criteria="Produced the orders table schema; region and revenue present.",
|
| 314 |
depends_on=[],
|
|
|
|
| 320 |
objective="Pull order-level region and revenue.",
|
| 321 |
tool_calls=[
|
| 322 |
ToolCall(
|
| 323 |
+
tool="retrieve_data",
|
| 324 |
args={
|
| 325 |
"ir": {
|
| 326 |
"source_id": "src_sales",
|
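The example plans above distinguish consumers, whose `data` argument is a `"${t<id>}"` placeholder, from sources such as `retrieve_knowledge`, which take plain arguments. The following is a minimal sketch of how such placeholders could be detected and resolved. The helper names `placeholder_deps` and `resolve_placeholders`, the regex, and the `date_column` argument are illustrative assumptions, not code from this PR.

```python
import re
from typing import Any

# Assumed pattern for the "${t<id>}" data placeholder described in the examples.
_PLACEHOLDER = re.compile(r"^\$\{(t\d+)\}$")


def placeholder_deps(args: dict[str, Any]) -> list[str]:
    """Return the upstream task ids an args dict refers to, in first-seen order."""
    deps: list[str] = []
    for value in args.values():
        match = _PLACEHOLDER.match(value) if isinstance(value, str) else None
        if match and match.group(1) not in deps:
            deps.append(match.group(1))
    return deps


def resolve_placeholders(args: dict[str, Any], outputs: dict[str, Any]) -> dict[str, Any]:
    """Swap each "${t<id>}" value for that task's output; leave other args as-is."""
    resolved: dict[str, Any] = {}
    for key, value in args.items():
        match = _PLACEHOLDER.match(value) if isinstance(value, str) else None
        resolved[key] = outputs[match.group(1)] if match else value
    return resolved


# A consumer depends on t2; a source (natural-language query) depends on nothing.
trend_args = {"data": "${t2}", "date_column": "order_date"}  # date_column is hypothetical
knowledge_args = {"query": "operational issues in Q1 2026", "top_k": 5}
```

Because `knowledge_args` yields no dependencies, a scheduler built on this idea could run that branch in parallel with the structured one.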
src/agents/planner/inputs.py
CHANGED
|
@@ -4,9 +4,9 @@
|
|
| 4 |
for the planner prompt. It carries every table + column id/type/PII flag + row
|
| 5 |
counts + low-cardinality top_values, with `sample_values` nulled on PII columns
|
| 6 |
(INV: no PII sample values into the prompt, see doc §13). It also lists the
|
| 7 |
-
available unstructured sources so the planner can plan `
|
| 8 |
|
| 9 |
-
The planner *validator* still checks inline `
|
| 10 |
full `Catalog` via the existing IRValidator - the summary is a prompt input, not
|
| 11 |
the validation source of truth.
|
| 12 |
|
|
@@ -124,7 +124,7 @@ class CatalogSummary(BaseModel):
|
|
| 124 |
lines.append("")
|
| 125 |
|
| 126 |
if self.unstructured_sources:
|
| 127 |
-
lines.append("Unstructured sources (for
|
| 128 |
for src in self.unstructured_sources:
|
| 129 |
lines.append(f" - {src.name} - id={src.source_id}")
|
| 130 |
|
|
|
|
| 4 |
for the planner prompt. It carries every table + column id/type/PII flag + row
|
| 5 |
counts + low-cardinality top_values, with `sample_values` nulled on PII columns
|
| 6 |
(INV: no PII sample values into the prompt, see doc §13). It also lists the
|
| 7 |
+
available unstructured sources so the planner can plan `retrieve_knowledge`.
|
| 8 |
|
| 9 |
+
The planner *validator* still checks inline `retrieve_data` IRs against the
|
| 10 |
full `Catalog` via the existing IRValidator - the summary is a prompt input, not
|
| 11 |
the validation source of truth.
|
| 12 |
|
|
|
|
| 124 |
lines.append("")
|
| 125 |
|
| 126 |
if self.unstructured_sources:
|
| 127 |
+
lines.append("Unstructured sources (for retrieve_knowledge):")
|
| 128 |
for src in self.unstructured_sources:
|
| 129 |
lines.append(f" - {src.name} - id={src.source_id}")
|
| 130 |
|
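The `inputs.py` docstring states two properties of the prompt summary: `sample_values` are nulled on PII columns, and unstructured sources are listed for `retrieve_knowledge`. The sketch below illustrates both with small stand-in dataclasses. `ColumnSummary`, `SourceRef`, `null_pii_samples`, and `render_prompt_lines` are hypothetical names, and the real `CatalogSummary` model may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnSummary:  # hypothetical stand-in, not the real CatalogSummary model
    name: str
    dtype: str
    is_pii: bool = False
    sample_values: list[str] | None = None


@dataclass
class SourceRef:  # hypothetical stand-in for an unstructured source entry
    source_id: str
    name: str


def null_pii_samples(columns: list[ColumnSummary]) -> list[ColumnSummary]:
    """Return copies of the columns with sample_values dropped on PII columns."""
    return [
        ColumnSummary(c.name, c.dtype, c.is_pii, None if c.is_pii else c.sample_values)
        for c in columns
    ]


def render_prompt_lines(
    columns: list[ColumnSummary], unstructured: list[SourceRef]
) -> list[str]:
    """Render prompt lines; PII sample values never reach the output."""
    lines: list[str] = []
    for c in null_pii_samples(columns):
        pii = " [PII]" if c.is_pii else ""
        samples = f" samples={c.sample_values}" if c.sample_values else ""
        lines.append(f"  - {c.name}: {c.dtype}{pii}{samples}")
    if unstructured:
        lines.append("Unstructured sources (for retrieve_knowledge):")
        for src in unstructured:
            lines.append(f"  - {src.name} (id={src.source_id})")
    return lines
```

Nulling happens before rendering, so a later change to the rendering code cannot reintroduce PII samples by accident.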
src/agents/planner/registry.py
CHANGED
|
@@ -7,8 +7,8 @@ outside it).
|
|
| 7 |
`src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
|
| 8 |
`ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
|
| 9 |
descriptions (KM-625). No longer a stub on our side - it tracks the real registry.
|
| 10 |
-
- **Data access (`
|
| 11 |
-
`
|
| 12 |
but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
|
| 13 |
spec bodies here so the Planner can plan end-to-end - but the NAMES derive from
|
| 14 |
`src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
|
|
@@ -16,10 +16,10 @@ outside it).
|
|
| 16 |
this slice and swap `default_registry()` for the tool team's full composition.
|
| 17 |
|
| 18 |
**Confirmed conventions (KM-465):** Pattern A - `analyze_*` tools take a `data`
|
| 19 |
-
`"${t<id>}"` placeholder pointing at an upstream `
|
| 20 |
self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
|
| 21 |
lightweight `{required, properties}` dict the planner validator (check #8) reads;
|
| 22 |
-
`
|
| 23 |
catalog by the existing IRValidator.
|
| 24 |
|
| 25 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
|
|
@@ -38,25 +38,32 @@ from .contracts import ToolRegistry, ToolSpec
|
|
| 38 |
# --------------------------------------------------------------------------- #
|
| 39 |
_DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
|
| 40 |
ToolSpec(
|
| 41 |
-
name="
|
| 42 |
category="analytics.query",
|
| 43 |
input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
|
| 44 |
output_kind="table",
|
| 45 |
description=(
|
| 46 |
-
"Run one validated
|
| 47 |
-
"
|
| 48 |
-
"
|
| 49 |
-
"
|
| 50 |
-
"
|
| 51 |
-
"
|
| 52 |
-
"(
|
| 53 |
"(median/percentile/mode/stddev/skew -> analyze_descriptive), trends "
|
| 54 |
"(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
|
| 55 |
-
"use it to read documents (use
|
| 56 |
),
|
| 57 |
),
|
| 58 |
ToolSpec(
|
| 59 |
-
name="
|
| 60 |
category="retrieval.documents",
|
| 61 |
input_schema={
|
| 62 |
"required": ["query"],
|
|
@@ -71,32 +78,36 @@ _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
|
|
| 71 |
"Dense-retrieve the most relevant chunks from the user's unstructured "
|
| 72 |
"sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
|
| 73 |
"qualitative context into an analysis. Optionally scope to one `source_id`. "
|
| 74 |
-
"Do NOT use it for numbers in tables - that is
|
| 75 |
),
|
| 76 |
),
|
| 77 |
ToolSpec(
|
| 78 |
-
name="
|
| 79 |
category="catalog.introspection",
|
| 80 |
-
input_schema={
|
| 81 |
output_kind="table",
|
| 82 |
description=(
|
| 83 |
-
"
|
| 84 |
-
"
|
| 85 |
-
"
|
| 86 |
),
|
| 87 |
),
|
| 88 |
ToolSpec(
|
| 89 |
-
name="
|
| 90 |
category="catalog.introspection",
|
| 91 |
-
input_schema={
|
| 92 |
-
"required": ["source_id"],
|
| 93 |
-
"properties": {"source_id": {"type": "string"}},
|
| 94 |
-
},
|
| 95 |
output_kind="table",
|
| 96 |
description=(
|
| 97 |
-
"
|
| 98 |
-
"
|
| 99 |
-
"
|
|
|
|
| 100 |
),
|
| 101 |
),
|
| 102 |
)
|
|
|
|
| 7 |
`src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical
|
| 8 |
`ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool
|
| 9 |
descriptions (KM-625). No longer a stub on our side - it tracks the real registry.
|
| 10 |
+
- **Data access (`retrieve_data` / `retrieve_knowledge` / `check_data` /
|
| 11 |
+
`check_knowledge`) - spec BODIES still a local stub.** The tool team owns these too,
|
| 12 |
but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess
|
| 13 |
spec bodies here so the Planner can plan end-to-end - but the NAMES derive from
|
| 14 |
`src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream
|
|
|
|
| 16 |
this slice and swap `default_registry()` for the tool team's full composition.
|
| 17 |
|
| 18 |
**Confirmed conventions (KM-465):** Pattern A - `analyze_*` tools take a `data`
|
| 19 |
+
`"${t<id>}"` placeholder pointing at an upstream `retrieve_data` output (no
|
| 20 |
self-fetch); resolved to a DataFrame at execution time. `input_schema` is the
|
| 21 |
lightweight `{required, properties}` dict the planner validator (check #8) reads;
|
| 22 |
+
`retrieve_data.args["ir"]` carries an inline QueryIR validated against the
|
| 23 |
catalog by the existing IRValidator.
|
| 24 |
|
| 25 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3.
|
|
|
|
| 38 |
# --------------------------------------------------------------------------- #
|
| 39 |
_DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = (
|
| 40 |
ToolSpec(
|
| 41 |
+
name="retrieve_data",
|
| 42 |
category="analytics.query",
|
| 43 |
input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}},
|
| 44 |
output_kind="table",
|
| 45 |
description=(
|
| 46 |
+
"Run one validated query against a structured source and return rows. The "
|
| 47 |
+
"`ir` argument is an inline QueryIR (the JSON intent: source_id, table_id, "
|
| 48 |
+
"joins, select, filters, group_by, order_by, limit) - never SQL. This is the "
|
| 49 |
+
"data-access entry point: use it to select, filter, and pull the rows the "
|
| 50 |
+
"analytics (`analyze_*`) tools then consume. It also does simple built-in "
|
| 51 |
+
"aggregation the IR can express (count/sum/avg/min/max/count_distinct). "
|
| 52 |
+
"JOINS (database sources only): to group a measure in one table by a "
|
| 53 |
+
"dimension in a RELATED table, add a `joins` entry "
|
| 54 |
+
"({target_table_id, left_column_id, right_column_id}) along a declared "
|
| 55 |
+
"foreign key - e.g. sum order_items.line_total grouped by products.category "
|
| 56 |
+
"via order_items.product_id = products.id. Prefer an existing measure column "
|
| 57 |
+
"(e.g. line_total) over recomputing, and a single table when the measure and "
|
| 58 |
+
"dimension already live together. Joins are NOT supported on tabular/file "
|
| 59 |
+
"sources yet. Do NOT use this for richer statistics "
|
| 60 |
"(median/percentile/mode/stddev/skew -> analyze_descriptive), trends "
|
| 61 |
"(analyze_trend), correlation, segmentation, or share-of-total; and do NOT "
|
| 62 |
+
"use it to read documents (use retrieve_knowledge)."
|
| 63 |
),
|
| 64 |
),
|
| 65 |
ToolSpec(
|
| 66 |
+
name="retrieve_knowledge",
|
| 67 |
category="retrieval.documents",
|
| 68 |
input_schema={
|
| 69 |
"required": ["query"],
|
|
|
|
| 78 |
"Dense-retrieve the most relevant chunks from the user's unstructured "
|
| 79 |
"sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull "
|
| 80 |
"qualitative context into an analysis. Optionally scope to one `source_id`. "
|
| 81 |
+
"Do NOT use it for numbers in tables - that is retrieve_data's job."
|
| 82 |
),
|
| 83 |
),
|
| 84 |
ToolSpec(
|
| 85 |
+
name="check_data",
|
| 86 |
category="catalog.introspection",
|
| 87 |
+
input_schema={
|
| 88 |
+
"required": [],
|
| 89 |
+
"properties": {"source_id": {"type": "string"}},
|
| 90 |
+
},
|
| 91 |
output_kind="table",
|
| 92 |
description=(
|
| 93 |
+
"Inspect the user's structured data sources (DB + tabular). With no "
|
| 94 |
+
"arguments, lists the sources (id, name, type, table count) - use early in "
|
| 95 |
+
"data_understanding to discover what exists. With a `source_id`, returns that "
|
| 96 |
+
"source's tables and columns (names, types, row counts) - use to confirm a "
|
| 97 |
+
"source's shape before querying it. Cheap. Do NOT use it to fetch data rows "
|
| 98 |
+
"(use retrieve_data) or to inspect documents (use check_knowledge)."
|
| 99 |
),
|
| 100 |
),
|
| 101 |
ToolSpec(
|
| 102 |
+
name="check_knowledge",
|
| 103 |
category="catalog.introspection",
|
| 104 |
+
input_schema={"required": [], "properties": {}},
|
| 105 |
output_kind="table",
|
| 106 |
description=(
|
| 107 |
+
"List the user's unstructured sources / documents (id, name, type). Use in "
|
| 108 |
+
"data_understanding to discover what qualitative material exists before "
|
| 109 |
+
"retrieving from it. Do NOT use it to read document content (use "
|
| 110 |
+
"retrieve_knowledge) or to inspect structured data (use check_data)."
|
| 111 |
),
|
| 112 |
),
|
| 113 |
)
|
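The registry docstring describes `input_schema` as a lightweight `{required, properties}` dict that the planner validator reads. The sketch below shows one way such a dict could be checked against a tool call's arguments. `MiniToolSpec` and `check_args` are hypothetical names, and rejecting unknown arguments is an assumption; the actual check #8 may be stricter or looser.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MiniToolSpec:  # hypothetical; mirrors only the two fields discussed above
    name: str
    input_schema: dict[str, Any]


def check_args(spec: MiniToolSpec, args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the args fit the schema."""
    required = spec.input_schema.get("required", [])
    properties = spec.input_schema.get("properties", {})
    problems = [
        f"{spec.name}: missing required arg {key!r}" for key in required if key not in args
    ]
    # Assumption: args not declared under `properties` are treated as errors.
    problems += [f"{spec.name}: unknown arg {key!r}" for key in args if key not in properties]
    return problems


# Schemas copied from the spec bodies in this diff.
CHECK_DATA = MiniToolSpec(
    "check_data", {"required": [], "properties": {"source_id": {"type": "string"}}}
)
RETRIEVE_DATA = MiniToolSpec(
    "retrieve_data", {"required": ["ir"], "properties": {"ir": {"type": "object"}}}
)
```

With `"required": []`, `check_data` accepts both the no-argument listing call and the per-source call, which matches the two modes its description names.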
src/agents/planner/service.py
CHANGED
|
@@ -9,7 +9,7 @@ static plan.
|
|
| 9 |
|
| 10 |
The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
|
| 11 |
the PII-safe `CatalogSummary` for the prompt, but validation needs the full
|
| 12 |
-
catalog so the existing `IRValidator` can check inline `
|
| 13 |
|
| 14 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
|
| 15 |
"""
|
|
|
|
| 9 |
|
| 10 |
The service takes the full `Catalog` (not just a `CatalogSummary`): it derives
|
| 11 |
the PII-safe `CatalogSummary` for the prompt, but validation needs the full
|
| 12 |
+
catalog so the existing `IRValidator` can check inline `retrieve_data` IRs.
|
| 13 |
|
| 14 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3.
|
| 15 |
"""
|
src/agents/planner/validator.py
CHANGED
|
@@ -95,8 +95,8 @@ class PlannerValidator:
|
|
| 95 |
f"source_id {src!r} (known: {sorted(known_sources)})"
|
| 96 |
)
|
| 97 |
|
| 98 |
-
# Check 8b - inline
|
| 99 |
-
if call.tool == "
|
| 100 |
self._validate_inline_ir(task.id, call.args, catalog)
|
| 101 |
|
| 102 |
# Check 7 - success_criteria is checkable.
|
|
@@ -114,20 +114,20 @@ class PlannerValidator:
|
|
| 114 |
raw_ir = args.get("ir")
|
| 115 |
if not isinstance(raw_ir, dict):
|
| 116 |
raise PlannerValidationError(
|
| 117 |
-
f"task {task_id}:
|
| 118 |
f"object, got {type(raw_ir).__name__}"
|
| 119 |
)
|
| 120 |
try:
|
| 121 |
ir = QueryIR.model_validate(raw_ir)
|
| 122 |
except ValidationError as e:
|
| 123 |
raise PlannerValidationError(
|
| 124 |
-
f"task {task_id}:
|
| 125 |
) from e
|
| 126 |
try:
|
| 127 |
self._ir_validator.validate(ir, catalog)
|
| 128 |
except IRValidationError as e:
|
| 129 |
raise PlannerValidationError(
|
| 130 |
-
f"task {task_id}:
|
| 131 |
) from e
|
| 132 |
|
| 133 |
@staticmethod
|
|
|
|
| 95 |
f"source_id {src!r} (known: {sorted(known_sources)})"
|
| 96 |
)
|
| 97 |
|
| 98 |
+
# Check 8b - inline retrieve_data IR validates against the catalog.
|
| 99 |
+
if call.tool == "retrieve_data":
|
| 100 |
self._validate_inline_ir(task.id, call.args, catalog)
|
| 101 |
|
| 102 |
# Check 7 - success_criteria is checkable.
|
|
|
|
| 114 |
raw_ir = args.get("ir")
|
| 115 |
if not isinstance(raw_ir, dict):
|
| 116 |
raise PlannerValidationError(
|
| 117 |
+
f"task {task_id}: retrieve_data.args.ir must be an inline QueryIR "
|
| 118 |
f"object, got {type(raw_ir).__name__}"
|
| 119 |
)
|
| 120 |
try:
|
| 121 |
ir = QueryIR.model_validate(raw_ir)
|
| 122 |
except ValidationError as e:
|
| 123 |
raise PlannerValidationError(
|
| 124 |
+
f"task {task_id}: retrieve_data.args.ir is not a valid QueryIR: {e}"
|
| 125 |
) from e
|
| 126 |
try:
|
| 127 |
self._ir_validator.validate(ir, catalog)
|
| 128 |
except IRValidationError as e:
|
| 129 |
raise PlannerValidationError(
|
| 130 |
+
f"task {task_id}: retrieve_data IR failed catalog validation: {e}"
|
| 131 |
) from e
|
| 132 |
|
| 133 |
@staticmethod
|
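`_validate_inline_ir` above runs three checks in order (is the value an object, does it parse as a QueryIR, does it pass catalog validation) and re-raises each failure as one planner-level error with the cause chained. The sketch below reproduces that layering with stdlib stand-ins. `PlanError`, `parse_ir`, and `check_against_catalog` are hypothetical substitutes for `PlannerValidationError`, `QueryIR.model_validate`, and `IRValidator.validate`, and the dict-based catalog is purely illustrative.

```python
from __future__ import annotations

from typing import Any


class PlanError(Exception):
    """Hypothetical stand-in for PlannerValidationError."""


def parse_ir(raw: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for schema parsing: only checks the keys this sketch needs.
    for key in ("source_id", "table_id"):
        if not isinstance(raw.get(key), str):
            raise ValueError(f"{key} must be a string")
    return raw


def check_against_catalog(ir: dict[str, Any], catalog: dict[str, set[str]]) -> None:
    # Stand-in for catalog validation: the table must exist in the source.
    if ir["table_id"] not in catalog.get(ir["source_id"], set()):
        raise LookupError(f"unknown table {ir['table_id']!r} in {ir['source_id']!r}")


def validate_inline_ir(task_id: str, args: dict[str, Any], catalog: dict[str, set[str]]) -> None:
    raw_ir = args.get("ir")
    if not isinstance(raw_ir, dict):
        raise PlanError(f"task {task_id}: ir must be an object, got {type(raw_ir).__name__}")
    try:
        ir = parse_ir(raw_ir)
    except ValueError as e:
        raise PlanError(f"task {task_id}: ir is not valid: {e}") from e
    try:
        check_against_catalog(ir, catalog)
    except LookupError as e:
        raise PlanError(f"task {task_id}: ir failed catalog validation: {e}") from e
```

Chaining with `from e` keeps the original exception on `__cause__`, so callers handle a single error type while logs still show which stage failed.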
src/agents/report/__init__.py
ADDED
|
@@ -0,0 +1,9 @@
|
|
| 1 |
+
"""Report generator (KM-644).
|
| 2 |
+
|
| 3 |
+
A button-triggered *service* - not a chat skill, not a slow-path agent. It turns a
|
| 4 |
+
session's persisted `AnalysisRecord`s + Problem Statement into a versioned,
|
| 5 |
+
business-readable `AnalysisReport`. Architecturally it mirrors the Assembler: one
|
| 6 |
+
constrained LLM call (the executive summary) wrapped in deterministic assembly that
|
| 7 |
+
copies every other field verbatim from the records (INV-4). Reports are immutable
|
| 8 |
+
per version and persisted to the `analysis_reports` table.
|
| 9 |
+
"""
|
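The package docstring describes one constrained LLM call wrapped in deterministic assembly. A minimal sketch of that shape follows, assuming an injected async callable in place of the real chain. `summarize`, `build_report`, `flaky_llm`, and `ok_llm` are hypothetical names, and the fallback wording is illustrative.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK = "Automated summary unavailable; see the findings below."  # illustrative wording

LlmCall = Callable[[list[str]], Awaitable[str]]


async def summarize(call_llm: LlmCall, findings: list[str]) -> str:
    """Return the LLM summary, or a fixed fallback if the call raises."""
    try:
        return await call_llm(findings)
    except Exception:
        return FALLBACK


async def build_report(call_llm: LlmCall, findings: list[str]) -> dict:
    # Everything except `executive_summary` is copied verbatim from the inputs.
    return {
        "version": 0,  # placeholder; a store would assign the real version
        "findings": list(findings),
        "executive_summary": await summarize(call_llm, findings),
    }


async def flaky_llm(findings: list[str]) -> str:
    raise TimeoutError("model unavailable")


async def ok_llm(findings: list[str]) -> str:
    return f"{len(findings)} finding(s) reviewed."
```

Running `asyncio.run(build_report(flaky_llm, ["a"]))` still yields a report: the findings are intact and only the summary degrades to the fallback.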
src/agents/report/errors.py
ADDED
|
@@ -0,0 +1,7 @@
|
|
| 1 |
+
"""Typed errors for the report generator (KM-644)."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
class ReportError(Exception):
|
| 7 |
+
"""The report could not be generated (e.g. no records for the analysis)."""
|
src/agents/report/generator.py
ADDED
|
@@ -0,0 +1,363 @@
|
|
| 1 |
+
"""ReportGenerator - turns a session's AnalysisRecords into an AnalysisReport (KM-644).
|
| 2 |
+
|
| 3 |
+
A button-triggered service shaped like the Assembler: deterministic assembly of the
|
| 4 |
+
records (findings/caveats/open_questions/data_sources/method_steps, copied verbatim -
|
| 5 |
+
INV-4) wrapped around exactly ONE LLM call that authors only the executive summary.
|
| 6 |
+
If that call fails the report is still returned with a deterministic fallback
|
| 7 |
+
summary (decision D1) - the deterministic body is the real value.
|
| 8 |
+
|
| 9 |
+
Versioning + persistence live in `ReportStore`; this service does generation only
|
| 10 |
+
(returns an `AnalysisReport` with `version=0`; the store assigns the real version).
|
| 11 |
+
Chain construction mirrors `agents/slow_path/assembler.py`.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
from datetime import UTC, datetime
|
| 17 |
+
from pathlib import Path
|
| 18 |
+
|
| 19 |
+
from langchain_core.messages import SystemMessage
|
| 20 |
+
from langchain_core.prompts import ChatPromptTemplate
|
| 21 |
+
from langchain_core.runnables import Runnable
|
| 22 |
+
from langchain_openai import AzureChatOpenAI
|
| 23 |
+
|
| 24 |
+
from src.middlewares.logging import get_logger
|
| 25 |
+
|
| 26 |
+
from ..slow_path.schemas import AnalysisRecord, TaskSummary
|
| 27 |
+
from .errors import ReportError
|
| 28 |
+
from .schemas import (
|
| 29 |
+
AnalysisReport,
|
| 30 |
+
AttributedNote,
|
| 31 |
+
DataSourceRef,
|
| 32 |
+
ProblemStatement,
|
| 33 |
+
ReportFinding,
|
| 34 |
+
ReportSummaryNarrative,
|
| 35 |
+
)
|
| 36 |
+
|
| 37 |
+
logger = get_logger("report_generator")
|
| 38 |
+
|
| 39 |
+
_FALLBACK_SUMMARY = "Automated summary unavailable - see the findings below."
|
| 40 |
+
|
| 41 |
+
# CRISP-DM phases in narrative order, with human labels for the method appendix.
|
| 42 |
+
_STAGE_LABELS: list[tuple[str, str]] = [
|
| 43 |
+
("data_understanding", "Data understanding"),
|
| 44 |
+
("data_preparation", "Data preparation"),
|
| 45 |
+
("modeling", "Modeling"),
|
| 46 |
+
("evaluation", "Evaluation"),
|
| 47 |
+
]
|
| 48 |
+
|
| 49 |
+
_PROMPT_PATH = (
|
| 50 |
+
Path(__file__).resolve().parent.parent.parent / "config" / "prompts" / "report_summary.md"
|
| 51 |
+
)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def _load_prompt_text() -> str:
|
| 55 |
+
return _PROMPT_PATH.read_text(encoding="utf-8")
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def _build_default_chain() -> Runnable:
|
| 59 |
+
from src.config.settings import settings
|
| 60 |
+
|
| 61 |
+
llm = AzureChatOpenAI(
|
| 62 |
+
azure_deployment=settings.azureai_deployment_name_4o,
|
| 63 |
+
openai_api_version=settings.azureai_api_version_4o,
|
| 64 |
+
azure_endpoint=settings.azureai_endpoint_url_4o,
|
| 65 |
+
api_key=settings.azureai_api_key_4o,
|
| 66 |
+
temperature=0,
|
| 67 |
+
)
|
| 68 |
+
prompt = ChatPromptTemplate.from_messages(
|
| 69 |
+
[
|
| 70 |
+
SystemMessage(content=_load_prompt_text()),
|
| 71 |
+
("human", "{human_content}"),
|
| 72 |
+
]
|
| 73 |
+
)
|
| 74 |
+
return prompt | llm.with_structured_output(ReportSummaryNarrative)
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
_default_chain: Runnable | None = None
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def _get_default_chain() -> Runnable:
|
| 81 |
+
global _default_chain
|
| 82 |
+
if _default_chain is None:
|
| 83 |
+
_default_chain = _build_default_chain()
|
| 84 |
+
return _default_chain
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
# --------------------------------------------------------------------------- #
|
| 88 |
+
# Deterministic assembly (pure; no LLM, no I/O) - easy to unit-test.
|
| 89 |
+
# --------------------------------------------------------------------------- #
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def _collect_findings(records: list[AnalysisRecord]) -> list[ReportFinding]:
|
| 93 |
+
# Findings are distinct insights - not deduped; each traces to its record.
|
| 94 |
+
return [
|
| 95 |
+
ReportFinding(text=text, record_ids=[rec.record_id])
|
| 96 |
+
for rec in records
|
| 97 |
+
for text in rec.findings
|
| 98 |
+
]
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def _collect_notes(records: list[AnalysisRecord], field: str) -> list[AttributedNote]:
|
| 102 |
+
# Caveats / open_questions are deduped by text; a merged note cites every
|
| 103 |
+
# record it came from (plural record_ids).
|
| 104 |
+
merged: dict[str, list[str]] = {}
|
| 105 |
+
for rec in records:
|
| 106 |
+
for text in getattr(rec, field):
|
| 107 |
+
ids = merged.setdefault(text, [])
|
| 108 |
+
if rec.record_id not in ids:
|
| 109 |
+
ids.append(rec.record_id)
|
| 110 |
+
return [AttributedNote(text=text, record_ids=ids) for text, ids in merged.items()]
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def _collect_method_steps(records: list[AnalysisRecord]) -> list[TaskSummary]:
|
| 114 |
+
steps: list[TaskSummary] = []
|
| 115 |
+
for rec in records:
|
| 116 |
+
steps.extend(rec.tasks_run)
|
| 117 |
+
return steps
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
def _build_data_sources(
|
| 121 |
+
records: list[AnalysisRecord], catalog, bound_ids: list[str] | None = None
|
| 122 |
+
) -> list[DataSourceRef]:
|
| 123 |
+
"""Freeze real catalog metadata for the sources this analysis used.
|
| 124 |
+
|
| 125 |
+
When the analysis has a data-source binding (#10), the candidate set is scoped
|
| 126 |
+
to the bound sources first (fail-open if the binding doesn't intersect the
|
| 127 |
+
catalog). Within that set, matches catalog sources against the records'
|
| 128 |
+
(narrative) `data_used` by name/id; falls back to all (bound) sources, then to
|
| 129 |
+
bare `data_used` strings if no catalog is available - so the section is always
|
| 130 |
+
populated, best-effort.
|
| 131 |
+
"""
|
| 132 |
+
if catalog is None or not catalog.sources:
|
| 133 |
+
seen: list[str] = []
|
| 134 |
+
for rec in records:
|
| 135 |
+
for du in rec.data_used:
|
| 136 |
+
if du not in seen:
|
| 137 |
+
seen.append(du)
|
| 138 |
+
return [DataSourceRef(source_id=d, name=d, source_type="", detail={}) for d in seen]
|
| 139 |
+
|
| 140 |
+
candidates = catalog.sources
|
| 141 |
+
if bound_ids:
|
| 142 |
+
scoped = [s for s in candidates if s.source_id in set(bound_ids)]
|
| 143 |
+
candidates = scoped or candidates # fail-open if binding doesn't match catalog
|
| 144 |
+
|
| 145 |
+
def _ref(s) -> DataSourceRef:
|
| 146 |
+
return DataSourceRef(
|
| 147 |
+
source_id=s.source_id,
|
| 148 |
+
name=s.name,
|
| 149 |
+
source_type=s.source_type,
|
| 150 |
+
detail={
|
| 151 |
+
"tables": [t.name for t in s.tables],
|
| 152 |
+
"row_count": sum((t.row_count or 0) for t in s.tables) or None,
|
| 153 |
+
"columns": [c.name for t in s.tables for c in t.columns],
|
| 154 |
+
},
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
used = " ".join(du for rec in records for du in rec.data_used).lower()
|
| 158 |
+
matched = [
|
| 159 |
+
_ref(s)
|
| 160 |
+
for s in candidates
|
| 161 |
+
if s.name.lower() in used or s.source_id.lower() in used
|
| 162 |
+
]
|
| 163 |
+
return matched or [_ref(s) for s in candidates]
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def _build_human_content(
|
| 167 |
+
ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
|
| 168 |
+
) -> str:
|
| 169 |
+
sections = []
|
| 170 |
+
ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
|
| 171 |
+
if ps_lines:
|
| 172 |
+
sections.append("# Problem Statement\n" + "\n".join(ps_lines))
|
| 173 |
+
sections.append(
|
| 174 |
+
"# Findings (already finalized - synthesize, do not add numbers)\n"
|
| 175 |
+
+ "\n".join(f"- {f.text}" for f in findings)
|
| 176 |
+
)
|
| 177 |
+
if caveats:
|
| 178 |
+
sections.append("# Caveats\n" + "\n".join(f"- {c.text}" for c in caveats))
|
| 179 |
+
return "\n\n".join(sections)
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
def _render_markdown(report: AnalysisReport) -> str:
|
| 183 |
+
# Version is deliberately NOT in the markdown - it is assigned by the store
|
| 184 |
+
# after rendering and lives in the structured `version` field / API metadata.
|
| 185 |
+
parts: list[str] = ["# Analysis Report"]
|
| 186 |
+
parts.append(
|
| 187 |
+
f"*Generated {report.generated_at:%Y-%m-%d} · "
|
| 188 |
+
f"{len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*"
|
| 189 |
+
)
|
| 190 |
+
|
| 191 |
+
ps = report.problem_statement
|
| 192 |
+
ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v]
|
| 193 |
+
if ps_lines:
|
| 194 |
+
parts.append("## Problem Statement\n" + " ".join(ps_lines))
|
| 195 |
+
|
| 196 |
+
if report.executive_summary:
|
| 197 |
+
parts.append("## Executive Summary\n" + report.executive_summary)
|
| 198 |
+
|
| 199 |
+
if report.findings:
|
| 200 |
+
lines = ["## Key Findings"]
|
| 201 |
+
for i, f in enumerate(report.findings, 1):
|
| 202 |
+
cite = f" *({', '.join(f.record_ids)})*" if f.record_ids else ""
|
| 203 |
+
lines.append(f"{i}. {f.text}{cite}")
|
| 204 |
+
parts.append("\n".join(lines))
|
| 205 |
+
|
| 206 |
+
if report.caveats or report.open_questions:
|
| 207 |
+
lines = ["## Caveats & Open Questions"]
|
| 208 |
+
for n in report.caveats:
|
| 209 |
+
cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
|
| 210 |
+
lines.append(f"- {n.text}{cite}")
|
| 211 |
+
for n in report.open_questions:
|
| 212 |
+
cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else ""
|
| 213 |
+
lines.append(f"- Open: {n.text}{cite}")
|
| 214 |
+
parts.append("\n".join(lines))
|
| 215 |
+
|
| 216 |
+
if report.data_sources:
|
| 217 |
+
lines = ["## Appendix A - Data Used", "| source | type | detail |", "|---|---|---|"]
|
| 218 |
+
for ds in report.data_sources:
|
| 219 |
+
d = ds.detail
|
| 220 |
+
bits = []
|
| 221 |
+
if d.get("tables"):
|
| 222 |
+
bits.append("tables: " + ", ".join(d["tables"]))
|
| 223 |
+
if d.get("row_count"):
|
| 224 |
+
bits.append(f"{d['row_count']} rows")
|
| 225 |
+
if d.get("columns"):
|
| 226 |
+
bits.append(f"{len(d['columns'])} cols")
|
| 227 |
+
lines.append(f"| {ds.name} | {ds.source_type or '-'} | {' · '.join(bits) or '-'} |")
|
| 228 |
+
parts.append("\n".join(lines))
|
| 229 |
+
|
| 230 |
+
if report.method_steps:
|
| 231 |
+
lines = ["## Appendix B - Method"]
|
| 232 |
+
for stage_key, label in _STAGE_LABELS:
|
| 233 |
+
steps = [s for s in report.method_steps if s.stage == stage_key]
|
| 234 |
+
if not steps:
|
| 235 |
+
continue
|
| 236 |
+
rendered = "; ".join(
|
| 237 |
+
f"{', '.join(s.tools_used) or '-'} ({s.status})" for s in steps
|
| 238 |
+
)
|
| 239 |
+
lines.append(f"**{label}** - {rendered}")
|
| 240 |
+
parts.append("\n".join(lines))
|
| 241 |
+
|
| 242 |
+
return "\n\n".join(parts)
|
| 243 |
+
|
| 244 |
+
|
| 245 |
+
# --------------------------------------------------------------------------- #
|
| 246 |
+
# Service
|
| 247 |
+
# --------------------------------------------------------------------------- #
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
class ReportGenerator:
|
| 251 |
+
"""Generates an `AnalysisReport` from persisted records. Inject deps for tests."""
|
| 252 |
+
|
| 253 |
+
def __init__(
|
| 254 |
+
self,
|
| 255 |
+
record_store=None,
|
| 256 |
+
structured_chain: Runnable | None = None,
|
| 257 |
+
catalog_store=None,
|
| 258 |
+
binding_store=None,
|
| 259 |
+
) -> None:
|
| 260 |
+
self._record_store = record_store
|
| 261 |
+
self._chain = structured_chain
|
| 262 |
+
self._catalog_store = catalog_store
|
| 263 |
+
self._binding_store = binding_store
|
| 264 |
+
|
| 265 |
+
def _ensure_record_store(self):
|
| 266 |
+
if self._record_store is None:
|
| 267 |
+
from ..slow_path.store import PostgresAnalysisStore
|
| 268 |
+
|
| 269 |
+
self._record_store = PostgresAnalysisStore()
|
| 270 |
+
return self._record_store
|
| 271 |
+
|
| 272 |
+
def _ensure_chain(self) -> Runnable:
|
| 273 |
+
if self._chain is None:
|
| 274 |
+
self._chain = _get_default_chain()
|
| 275 |
+
return self._chain
|
| 276 |
+
|
| 277 |
+
def _ensure_catalog_store(self):
|
| 278 |
+
if self._catalog_store is None:
|
| 279 |
+
from src.catalog.store import CatalogStore
|
| 280 |
+
|
| 281 |
+
self._catalog_store = CatalogStore()
|
| 282 |
+
return self._catalog_store
|
| 283 |
+
|
| 284 |
+
async def generate(
|
| 285 |
+
self,
|
| 286 |
+
analysis_id: str,
|
| 287 |
+
user_id: str | None = None,
|
| 288 |
+
problem_statement: ProblemStatement | None = None,
|
| 289 |
+
) -> AnalysisReport:
|
| 290 |
+
records = await self._ensure_record_store().list_for_analysis(analysis_id)
|
| 291 |
+
if not records:
|
| 292 |
+
raise ReportError(f"no analyses recorded for {analysis_id!r} yet")
|
| 293 |
+
|
| 294 |
+
ps = problem_statement or ProblemStatement()
|
| 295 |
+
findings = _collect_findings(records)
|
| 296 |
+
caveats = _collect_notes(records, "caveats")
|
| 297 |
+
open_questions = _collect_notes(records, "open_questions")
|
| 298 |
+
method_steps = _collect_method_steps(records)
|
| 299 |
+
bound_ids = await self._read_binding(analysis_id)
|
| 300 |
+
data_sources = _build_data_sources(
|
| 301 |
+
records, await self._read_catalog(user_id), bound_ids
|
| 302 |
+
)
|
| 303 |
+
executive_summary = await self._summarize(ps, findings, caveats)
|
| 304 |
+
|
| 305 |
+
report = AnalysisReport(
|
| 306 |
+
analysis_id=analysis_id,
|
| 307 |
+
user_id=user_id,
|
| 308 |
+
version=0, # assigned by ReportStore.save under the advisory lock
|
| 309 |
+
generated_at=datetime.now(UTC),
|
| 310 |
+
problem_statement=ps,
|
| 311 |
+
record_ids=[r.record_id for r in records],
|
| 312 |
+
executive_summary=executive_summary,
|
| 313 |
+
findings=findings,
|
| 314 |
+
caveats=caveats,
|
| 315 |
+
open_questions=open_questions,
|
| 316 |
+
data_sources=data_sources,
|
| 317 |
+
method_steps=method_steps,
|
| 318 |
+
)
|
| 319 |
+
report.rendered_markdown = _render_markdown(report)
|
| 320 |
+
logger.info(
|
| 321 |
+
"report generated",
|
| 322 |
+
analysis_id=analysis_id,
|
| 323 |
+
records=len(records),
|
| 324 |
+
findings=len(findings),
|
| 325 |
+
)
|
| 326 |
+
return report
|
| 327 |
+
|
| 328 |
+
async def _read_catalog(self, user_id: str | None):
|
| 329 |
+
if not user_id:
|
| 330 |
+
return None
|
| 331 |
+
try:
|
| 332 |
+
return await self._ensure_catalog_store().get(user_id)
|
| 333 |
+
except Exception as exc: # data_sources falls back; never break the report
|
| 334 |
+
logger.warning("catalog read failed; data_sources will fall back", error=str(exc))
|
| 335 |
+
return None
|
| 336 |
+
|
| 337 |
+
def _ensure_binding_store(self):
|
| 338 |
+
if self._binding_store is None:
|
| 339 |
+
from ..binding_store import AnalysisDataSourceStore
|
| 340 |
+
|
| 341 |
+
self._binding_store = AnalysisDataSourceStore()
|
| 342 |
+
return self._binding_store
|
| 343 |
+
|
| 344 |
+
async def _read_binding(self, analysis_id: str) -> list[str]:
|
| 345 |
+
"""Bound source ids for the analysis (#10). Never-throw -> [] (unscoped)."""
|
| 346 |
+
try:
|
| 347 |
+
return await self._ensure_binding_store().get(analysis_id)
|
| 348 |
+
except Exception as exc: # data_sources falls back to whole catalog
|
| 349 |
+
logger.warning("binding read failed; data_sources unscoped", error=str(exc))
|
| 350 |
+
return []
|
| 351 |
+
|
| 352 |
+
async def _summarize(
|
| 353 |
+
self, ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote]
|
| 354 |
+
) -> str:
|
| 355 |
+
human_content = _build_human_content(ps, findings, caveats)
|
| 356 |
+
try:
|
| 357 |
+
narrative: ReportSummaryNarrative = await self._ensure_chain().ainvoke(
|
| 358 |
+
{"human_content": human_content}
|
| 359 |
+
)
|
| 360 |
+
return narrative.executive_summary
|
| 361 |
+
except Exception as exc: # D1: degrade, don't fail the whole report
|
| 362 |
+
logger.warning("report summary LLM failed; using fallback", error=str(exc))
|
| 363 |
+
return _FALLBACK_SUMMARY
|
src/agents/report/readiness.py
ADDED
|
@@ -0,0 +1,165 @@
|
| 1 |
+
"""`is_report_ready`: deterministic report-readiness signal (seam #5, KM-652).
|
| 2 |
+
|
| 3 |
+
The Help skill asks "can the user generate a report yet?" before it offers that as
|
| 4 |
+
a next step. This is the producer of that answer; Help only *consumes* it (see
|
| 5 |
+
`handlers/help.ReportReadiness`). No LLM: readiness is a fact about persisted state,
|
| 6 |
+
not a judgement.
|
| 7 |
+
|
| 8 |
+
The rule mirrors what makes a real report non-empty and worth generating, so Help can
|
| 9 |
+
never suggest an action that would 409 or produce a duplicate:
|
| 10 |
+
1. `problem_validated`: the gate's own precondition (no validated goal, no
|
| 11 |
+
analysis worth reporting). Same rule `gate.gate` applies to `structured_flow`.
|
| 12 |
+
2. at least one **substantive** persisted `AnalysisRecord`: a record whose
|
| 13 |
+
*analysis* task succeeded. A failed run still persists a record WITH findings
|
| 14 |
+
(they narrate the failure), and data-access tasks (check_/retrieve_) succeed even
|
| 15 |
+
when the analysis fails, so neither "has findings" nor "any task succeeded" is
|
| 16 |
+
enough. We require a genuine analysis tool (analyze_*) to have completed. We count
|
| 17 |
+
*results*, not chat turns.
|
| 18 |
+
3. delta-since-report: if a report already exists (`state.report_id`), only ready
|
| 19 |
+
when there's a substantive analysis newer than the latest report; otherwise the
|
| 20 |
+
new report would be identical.
|
| 21 |
+
|
| 22 |
+
`missing` names whichever criterion is absent, so Help can tell the user the next gap
|
| 23 |
+
to fill (the team values `missing` over the bare boolean). Bias is anti-false-positive
|
| 24 |
+
(report is also button-triggered): a record-store read failure fails **closed**
|
| 25 |
+
(not ready); a report-store read failure during the delta check fails **open** (we
|
| 26 |
+
can't prove staleness, and the button is always there).
|
| 27 |
+
|
| 28 |
+
NOT in scope (deferred, pending the readiness eval set): semantic *alignment* of the
|
| 29 |
+
analyses to the problem statement and *depth*/variety scoring; both need an LLM judge
|
| 30 |
+
and shouldn't sit in the per-turn Help hot path until eval justifies the cost.
|
| 31 |
+
"""
|
| 32 |
+
|
| 33 |
+
from __future__ import annotations
|
| 34 |
+
|
| 35 |
+
from datetime import UTC, datetime
|
| 36 |
+
from typing import TYPE_CHECKING
|
| 37 |
+
|
| 38 |
+
from src.middlewares.logging import get_logger
|
| 39 |
+
|
| 40 |
+
from ..handlers.help import ReportReadiness
|
| 41 |
+
|
| 42 |
+
if TYPE_CHECKING:
|
| 43 |
+
from ..gate import AnalysisState
|
| 44 |
+
|
| 45 |
+
logger = get_logger("report_readiness")
|
| 46 |
+
|
| 47 |
+
# Human-readable gaps surfaced to the user via Help (kept stable for the prompt).
|
| 48 |
+
_MISSING_PROBLEM = "a validated problem statement"
|
| 49 |
+
_MISSING_ANALYSIS = "at least one completed analysis"
|
| 50 |
+
_MISSING_DELTA = "a new analysis since the last report"
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def _default_record_store():
|
| 54 |
+
from ..slow_path.store import PostgresAnalysisStore
|
| 55 |
+
|
| 56 |
+
return PostgresAnalysisStore()
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def _default_report_store():
|
| 60 |
+
from .store import ReportStore
|
| 61 |
+
|
| 62 |
+
return ReportStore()
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def _is_newer(a: datetime, b: datetime) -> bool:
|
| 66 |
+
"""True if `a` is later than `b`, tolerating naive/aware mismatch (assume UTC)."""
|
| 67 |
+
if a.tzinfo is None:
|
| 68 |
+
a = a.replace(tzinfo=UTC)
|
| 69 |
+
if b.tzinfo is None:
|
| 70 |
+
b = b.replace(tzinfo=UTC)
|
| 71 |
+
return a > b
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def _has_successful_analysis(record) -> bool:
|
| 75 |
+
"""True if the record has at least one *analysis* task that succeeded.
|
| 76 |
+
|
| 77 |
+
A failed run still writes findings (narrating the failure) and its data-access
|
| 78 |
+
tasks (check_/retrieve_) succeed, so we can't key on findings or on "any task
|
| 79 |
+
succeeded". An analysis tool (analyze_*) completing is the real "we produced a
|
| 80 |
+
result" signal.
|
| 81 |
+
"""
|
| 82 |
+
return any(
|
| 83 |
+
t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used)
|
| 84 |
+
for t in record.tasks_run
|
| 85 |
+
)
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
async def report_floor(
|
| 89 |
+
analysis_id: str | None,
|
| 90 |
+
state: AnalysisState,
|
| 91 |
+
*,
|
| 92 |
+
record_store=None,
|
| 93 |
+
) -> tuple[list[str], list]:
|
| 94 |
+
"""The report **floor**: a validated goal + ≥1 substantive analysis.
|
| 95 |
+
|
| 96 |
+
Returns `(missing, substantive_records)`. This is the shared gate both the Help
|
| 97 |
+
readiness signal AND the report API enforce, so the button and Help can't drift
|
| 98 |
+
(T-D / T11). It deliberately excludes the delta-since-report check; that is
|
| 99 |
+
advisory and lives only in `is_report_ready`; the report button is always allowed
|
| 100 |
+
to cut a new version (decision 4A). Fails closed (counts as missing analysis) on
|
| 101 |
+
a record-store read error. `record_store` is injectable for tests.
|
| 102 |
+
"""
|
| 103 |
+
missing: list[str] = []
|
| 104 |
+
if not state.problem_validated:
|
| 105 |
+
missing.append(_MISSING_PROBLEM)
|
| 106 |
+
|
| 107 |
+
substantive: list = []
|
| 108 |
+
if analysis_id:
|
| 109 |
+
try:
|
| 110 |
+
store = record_store or _default_record_store()
|
| 111 |
+
records = await store.list_for_analysis(analysis_id)
|
| 112 |
+
substantive = [r for r in records if _has_successful_analysis(r)]
|
| 113 |
+
except Exception as exc: # noqa: BLE001 - never-throw; fail closed to not-ready
|
| 114 |
+
logger.warning(
|
| 115 |
+
"report_floor: record store read failed - not ready",
|
| 116 |
+
analysis_id=analysis_id,
|
| 117 |
+
error=str(exc),
|
| 118 |
+
)
|
| 119 |
+
return [*missing, _MISSING_ANALYSIS], []
|
| 120 |
+
|
| 121 |
+
if not substantive:
|
| 122 |
+
missing.append(_MISSING_ANALYSIS)
|
| 123 |
+
return missing, substantive
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
async def is_report_ready(
|
| 127 |
+
analysis_id: str | None,
|
| 128 |
+
state: AnalysisState,
|
| 129 |
+
*,
|
| 130 |
+
record_store=None,
|
| 131 |
+
report_store=None,
|
| 132 |
+
) -> ReportReadiness:
|
| 133 |
+
"""Return whether a report can be generated for this analysis, and the gaps if not.
|
| 134 |
+
|
| 135 |
+
`record_store` / `report_store` are injectable for tests; they default to the
|
| 136 |
+
real Postgres stores.
|
| 137 |
+
"""
|
| 138 |
+
missing, substantive = await report_floor(
|
| 139 |
+
analysis_id, state, record_store=record_store
|
| 140 |
+
)
|
| 141 |
+
|
| 142 |
+
if not substantive:
|
| 143 |
+
# No analyses to report on, so the delta check is moot.
|
| 144 |
+
return ReportReadiness(ready=not missing, missing=missing)
|
| 145 |
+
|
| 146 |
+
# Delta-since-report: a report already exists, so only ready if a substantive
|
| 147 |
+
# analysis is newer than the latest report. Fail-open on a report-store error.
|
| 148 |
+
if state.report_id:
|
| 149 |
+
last_report_at: datetime | None = None
|
| 150 |
+
try:
|
| 151 |
+
rstore = report_store or _default_report_store()
|
| 152 |
+
reports = await rstore.list_for_analysis(analysis_id)
|
| 153 |
+
last_report_at = max((r.generated_at for r in reports), default=None)
|
| 154 |
+
except Exception as exc: # noqa: BLE001 - skip delta; can't prove staleness
|
| 155 |
+
logger.warning(
|
| 156 |
+
"is_report_ready: report store read failed - skipping delta check",
|
| 157 |
+
analysis_id=analysis_id,
|
| 158 |
+
error=str(exc),
|
| 159 |
+
)
|
| 160 |
+
if last_report_at is not None and not any(
|
| 161 |
+
_is_newer(r.created_at, last_report_at) for r in substantive
|
| 162 |
+
):
|
| 163 |
+
missing.append(_MISSING_DELTA)
|
| 164 |
+
|
| 165 |
+
return ReportReadiness(ready=not missing, missing=missing)
|
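The three readiness rules in `readiness.py` can be sketched without the stores. This is a minimal, self-contained illustration rather than the module itself: `Task`, `Record`, and `readiness_gaps` are hypothetical stand-ins (for `TaskSummary`, `AnalysisRecord`, and the combined `report_floor` / `is_report_ready` logic), and the last report timestamp is passed in directly instead of being read from `ReportStore`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Task:  # hypothetical stand-in for TaskSummary
    status: str
    tools_used: list[str] = field(default_factory=list)


@dataclass
class Record:  # hypothetical stand-in for AnalysisRecord
    created_at: datetime
    tasks_run: list[Task] = field(default_factory=list)


def is_newer(a: datetime, b: datetime) -> bool:
    """Compare timestamps, treating naive values as UTC."""
    if a.tzinfo is None:
        a = a.replace(tzinfo=timezone.utc)
    if b.tzinfo is None:
        b = b.replace(tzinfo=timezone.utc)
    return a > b


def has_successful_analysis(record: Record) -> bool:
    """A record is substantive only if an analyze_* tool ran in a successful task."""
    return any(
        t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used)
        for t in record.tasks_run
    )


def readiness_gaps(
    problem_validated: bool, records: list[Record], last_report_at: datetime | None
) -> list[str]:
    """Return the missing criteria; an empty list means ready."""
    missing: list[str] = []
    if not problem_validated:
        missing.append("a validated problem statement")
    substantive = [r for r in records if has_successful_analysis(r)]
    if not substantive:
        missing.append("at least one completed analysis")
        return missing  # delta check is moot without substantive records
    if last_report_at is not None and not any(
        is_newer(r.created_at, last_report_at) for r in substantive
    ):
        missing.append("a new analysis since the last report")
    return missing
```

A record whose only successful task is a data-access step (for example `retrieve_data`) is not counted, which is the anti-false-positive bias the docstring describes.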
src/agents/report/schemas.py
ADDED
|
@@ -0,0 +1,91 @@
|
| 1 |
+
"""Report contract: `AnalysisReport` and its parts (KM-644).
|
| 2 |
+
|
| 3 |
+
The report generator turns a session's persisted `AnalysisRecord`s + Problem
|
| 4 |
+
Statement into a versioned report. Only `executive_summary` is LLM-authored
|
| 5 |
+
(`ReportSummaryNarrative`); every other field is copied verbatim from the records
|
| 6 |
+
by code (INV-4), so the report stays a faithful, auditable artifact.
|
| 7 |
+
|
| 8 |
+
Two deliberate looseness choices for v1 (tighten later once usage shows):
|
| 9 |
+
`ProblemStatement` (stub of Harry's real PS) and `ReportFinding.supporting_data`.
|
| 10 |
+
|
| 11 |
+
See CHECKPOINT_PLAN_2026-06-17.md decision #8.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
from datetime import datetime
|
| 17 |
+
from uuid import uuid4
|
| 18 |
+
|
| 19 |
+
from pydantic import BaseModel, Field
|
| 20 |
+
|
| 21 |
+
from ..slow_path.schemas import TaskSummary
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class ProblemStatement(BaseModel):
|
| 25 |
+
"""Minimal stub of Harry's Problem Statement, frozen into each report.
|
| 26 |
+
|
| 27 |
+
Loose on purpose until the real PS template lands (Analysis State, upstream).
|
| 28 |
+
A report snapshots the PS as it was at generation time.
|
| 29 |
+
"""
|
| 30 |
+
|
| 31 |
+
objective: str = ""
|
| 32 |
+
metric_direction: str = "" # "increase" | "decrease"
|
| 33 |
+
target_metric: str = ""
|
| 34 |
+
target_value: str = ""
|
| 35 |
+
scope: str = ""
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class DataSourceRef(BaseModel):
|
| 39 |
+
"""Frozen catalog metadata for a source used in the analysis.
|
| 40 |
+
|
| 41 |
+
Snapshotted at generation time (NOT re-fetched at render) so a re-ingested
|
| 42 |
+
source never retroactively changes an old report; same freeze rationale as
|
| 43 |
+
`ProblemStatement`.
|
| 44 |
+
"""
|
| 45 |
+
|
| 46 |
+
source_id: str
|
| 47 |
+
name: str
|
| 48 |
+
source_type: str # postgres | file | ...
|
| 49 |
+
detail: dict = Field(default_factory=dict) # rows in scope, columns, window
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
class ReportFinding(BaseModel):
|
| 53 |
+
text: str
|
| 54 |
+
record_ids: list[str] = Field(default_factory=list) # records backing this finding
|
| 55 |
+
supporting_data: dict | None = None # loose for v1; the chart-able slice
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
class AttributedNote(BaseModel):
|
| 59 |
+
"""A caveat or open question carrying the records it came from.
|
| 60 |
+
|
| 61 |
+
Plural `record_ids` because a note can be deduped/merged across records.
|
| 62 |
+
"""
|
| 63 |
+
|
| 64 |
+
text: str
|
| 65 |
+
record_ids: list[str] = Field(default_factory=list)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
class ReportSummaryNarrative(BaseModel):
|
| 69 |
+
"""The ONLY LLM-authored part of the report (with_structured_output target)."""
|
| 70 |
+
|
| 71 |
+
executive_summary: str
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
class AnalysisReport(BaseModel):
|
| 75 |
+
report_id: str = Field(default_factory=lambda: uuid4().hex)
|
| 76 |
+
analysis_id: str
|
| 77 |
+
user_id: str | None = None
|
| 78 |
+
version: int
|
| 79 |
+
generated_at: datetime
|
| 80 |
+
# Frozen snapshots.
|
| 81 |
+
problem_statement: ProblemStatement = Field(default_factory=ProblemStatement)
|
| 82 |
+
record_ids: list[str] = Field(default_factory=list) # records used (snapshot)
|
| 83 |
+
# LLM-authored.
|
| 84 |
+
executive_summary: str = ""
|
| 85 |
+
# Deterministic pass-through from records.
|
| 86 |
+
findings: list[ReportFinding] = Field(default_factory=list)
|
| 87 |
+
caveats: list[AttributedNote] = Field(default_factory=list)
|
| 88 |
+
open_questions: list[AttributedNote] = Field(default_factory=list)
|
| 89 |
+
data_sources: list[DataSourceRef] = Field(default_factory=list)
|
| 90 |
+
method_steps: list[TaskSummary] = Field(default_factory=list) # carries `stage`
|
| 91 |
+
rendered_markdown: str = ""
|
src/agents/report/store.py
ADDED
|
@@ -0,0 +1,119 @@
|
| 1 |
+
"""ReportStore: persists/reads versioned AnalysisReports (KM-644).
|
| 2 |
+
|
| 3 |
+
Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSessionLocal`.
|
| 4 |
+
|
| 5 |
+
Version assignment is serialized per `analysis_id` with a Postgres
|
| 6 |
+
transaction-level advisory lock so concurrent button presses can't compute the
|
| 7 |
+
same version number; the `(analysis_id, version)` unique constraint is the
|
| 8 |
+
backstop. Per decision 4A every generation is a new version, so two
|
| 9 |
+
near-simultaneous presses legitimately produce V<n> and V<n+1>; the lock only
|
| 10 |
+
prevents a duplicate-number race, not double generation.
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
import hashlib
|
| 16 |
+
|
| 17 |
+
from sqlalchemy import func, select, text
|
| 18 |
+
|
| 19 |
+
from src.db.postgres.connection import AsyncSessionLocal
|
| 20 |
+
from src.db.postgres.models import AnalysisReportRow
|
| 21 |
+
from src.middlewares.logging import get_logger
|
| 22 |
+
|
| 23 |
+
from .schemas import AnalysisReport
|
| 24 |
+
|
| 25 |
+
logger = get_logger("report_store")
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def _lock_key(analysis_id: str) -> int:
|
| 29 |
+
"""Stable signed 64-bit key for `pg_advisory_xact_lock`.
|
| 30 |
+
|
| 31 |
+
Python's builtin `hash(str)` is randomized per process, so derive a
|
| 32 |
+
deterministic key from a digest instead.
|
| 33 |
+
"""
|
| 34 |
+
digest = hashlib.sha256(analysis_id.encode()).digest()
|
| 35 |
+
return int.from_bytes(digest[:8], "big", signed=True)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _report_title(report: AnalysisReport) -> str:
|
| 39 |
+
"""Title for the dedorch `reports.title` column: the goal, else a generic label."""
|
| 40 |
+
objective = (report.problem_statement.objective or "").strip()
|
| 41 |
+
return objective[:200] if objective else "Analysis Report"
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def _row_to_report(row) -> AnalysisReport:
|
| 45 |
+
"""Rebuild a minimal AnalysisReport from the flat dedorch row.
|
| 46 |
+
|
| 47 |
+
dedorch stores markdown only, so structured fields (findings/caveats/...) come back
|
| 48 |
+
empty; `rendered_markdown` carries the content the FE renders/downloads.
|
| 49 |
+
"""
|
| 50 |
+
return AnalysisReport(
|
| 51 |
+
report_id=row.id,
|
| 52 |
+
analysis_id=row.analysis_id,
|
| 53 |
+
version=row.version,
|
| 54 |
+
generated_at=row.generated_at,
|
| 55 |
+
rendered_markdown=row.content,
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
class ReportStore:
|
| 60 |
+
"""Read/write versioned reports keyed by `analysis_id`."""
|
| 61 |
+
|
| 62 |
+
async def save(self, report: AnalysisReport) -> AnalysisReport:
|
| 63 |
+
"""Assign the next version under an advisory lock and persist.
|
| 64 |
+
|
| 65 |
+
Mutates and returns `report` with its final `version`.
|
| 66 |
+
"""
|
| 67 |
+
async with AsyncSessionLocal() as session:
|
| 68 |
+
async with session.begin():
|
| 69 |
+
await session.execute(
|
| 70 |
+
text("SELECT pg_advisory_xact_lock(:k)"),
|
| 71 |
+
{"k": _lock_key(report.analysis_id)},
|
| 72 |
+
)
|
| 73 |
+
result = await session.execute(
|
| 74 |
+
select(func.max(AnalysisReportRow.version)).where(
|
| 75 |
+
AnalysisReportRow.analysis_id == report.analysis_id
|
| 76 |
+
)
|
| 77 |
+
)
|
| 78 |
+
report.version = (result.scalar_one_or_none() or 0) + 1
|
| 79 |
+
session.add(
|
| 80 |
+
AnalysisReportRow(
|
| 81 |
+
id=report.report_id,
|
| 82 |
+
analysis_id=report.analysis_id,
|
| 83 |
+
title=_report_title(report),
|
| 84 |
+
content=report.rendered_markdown or "",
|
| 85 |
+
generated_at=report.generated_at,
|
| 86 |
+
version=report.version,
|
| 87 |
+
)
|
| 88 |
+
)
|
| 89 |
+
# leaving session.begin() commits, which releases the advisory lock
|
| 90 |
+
logger.info(
|
| 91 |
+
"report persisted",
|
| 92 |
+
analysis_id=report.analysis_id,
|
| 93 |
+
version=report.version,
|
| 94 |
+
report_id=report.report_id,
|
| 95 |
+
)
|
| 96 |
+
return report
|
| 97 |
+
|
| 98 |
+
async def list_for_analysis(self, analysis_id: str) -> list[AnalysisReport]:
|
| 99 |
+
async with AsyncSessionLocal() as session:
|
| 100 |
+
result = await session.execute(
|
| 101 |
+
select(AnalysisReportRow)
|
| 102 |
+
.where(AnalysisReportRow.analysis_id == analysis_id)
|
| 103 |
+
.order_by(AnalysisReportRow.version.asc())
|
| 104 |
+
)
|
| 105 |
+
rows = result.scalars().all()
|
| 106 |
+
return [_row_to_report(row) for row in rows]
|
| 107 |
+
|
| 108 |
+
async def get(self, analysis_id: str, version: int) -> AnalysisReport | None:
|
| 109 |
+
async with AsyncSessionLocal() as session:
|
| 110 |
+
result = await session.execute(
|
| 111 |
+
select(AnalysisReportRow).where(
|
| 112 |
+
AnalysisReportRow.analysis_id == analysis_id,
|
| 113 |
+
AnalysisReportRow.version == version,
|
| 114 |
+
)
|
| 115 |
+
)
|
| 116 |
+
row = result.scalar_one_or_none()
|
| 117 |
+
if row is None:
|
| 118 |
+
return None
|
| 119 |
+
return _row_to_report(row)
|
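The lock-key derivation and the "max + 1" version rule in `ReportStore.save` can be checked in isolation. The sketch below copies the digest-based key from the diff and adds a hypothetical `next_version` helper that mirrors `(result.scalar_one_or_none() or 0) + 1` over a plain list. It does not touch Postgres, so it says nothing about the advisory lock itself.

```python
import hashlib


def lock_key(analysis_id: str) -> int:
    """Deterministic signed 64-bit key derived from a SHA-256 digest."""
    digest = hashlib.sha256(analysis_id.encode()).digest()
    # The first 8 bytes, read as a signed big-endian integer, fit Postgres bigint.
    return int.from_bytes(digest[:8], "big", signed=True)


def next_version(existing_versions: list[int]) -> int:
    """Hypothetical helper: next version is max(existing) + 1, or 1 when none exist."""
    return (max(existing_versions, default=None) or 0) + 1
```

Unlike the builtin `hash(str)`, the digest-based key is the same in every worker process, which is what lets two processes contend on the same advisory lock for one `analysis_id`.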
src/agents/slow_path/assembler.py
CHANGED
|
@@ -33,6 +33,7 @@ from .schemas import (
|
|
| 33 |
AssembledOutput,
|
| 34 |
AssemblerNarrative,
|
| 35 |
RunState,
|
|
|
|
| 36 |
TaskSummary,
|
| 37 |
)
|
| 38 |
|
|
@@ -116,16 +117,46 @@ class Assembler:
|
|
| 116 |
return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
|
| 117 |
|
| 118 |
|
| 119 |
def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
|
| 120 |
tasks_run = [
|
| 121 |
TaskSummary(
|
| 122 |
task_id=task_id,
|
|
|
|
| 123 |
objective=result.objective,
|
| 124 |
status=result.status,
|
| 125 |
tools_used=[o.tool for o in result.outputs],
|
| 126 |
)
|
| 127 |
for task_id, result in run_state.results.items()
|
| 128 |
]
|
|
|
|
|
|
|
|
|
|
| 129 |
return AnalysisRecord(
|
| 130 |
goal_restated=narrative.goal_restated,
|
| 131 |
findings=narrative.findings,
|
|
@@ -133,7 +164,7 @@ def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> Analysi
|
|
| 133 |
data_used=narrative.data_used,
|
| 134 |
open_questions=narrative.open_questions,
|
| 135 |
tasks_run=tasks_run,
|
| 136 |
-
results_snapshot=
|
| 137 |
plan_id=run_state.plan_id,
|
| 138 |
business_context_id=run_state.business_context_id,
|
| 139 |
created_at=datetime.now(UTC),
|
|
|
|
| 33 |
AssembledOutput,
|
| 34 |
AssemblerNarrative,
|
| 35 |
RunState,
|
| 36 |
+
TaskResult,
|
| 37 |
TaskSummary,
|
| 38 |
)
|
| 39 |
|
|
|
|
| 117 |
return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record)
|
| 118 |
|
| 119 |
|
| 120 |
+
# Persisted records keep `analyze_*` outputs (scalar/stats/series: small, and the
|
| 121 |
+
# basis a future report/chart renders from) in full, but cap raw `table` rows from
|
| 122 |
+
# data-access tools (retrieve_data can return up to the 10k LIMIT): the report never
|
| 123 |
+
# renders raw rows, so storing them all would bloat every record's jsonb.
|
| 124 |
+
_SNAPSHOT_ROW_SAMPLE = 10
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def _trim_for_snapshot(result: TaskResult) -> TaskResult:
|
| 128 |
+
trimmed = []
|
| 129 |
+
changed = False
|
| 130 |
+
for out in result.outputs:
|
| 131 |
+
if out.kind == "table" and out.rows is not None and len(out.rows) > _SNAPSHOT_ROW_SAMPLE:
|
| 132 |
+
changed = True
|
| 133 |
+
trimmed.append(
|
| 134 |
+
out.model_copy(
|
| 135 |
+
update={
|
| 136 |
+
"rows": out.rows[:_SNAPSHOT_ROW_SAMPLE],
|
| 137 |
+
"meta": {**out.meta, "total_rows": len(out.rows), "rows_truncated": True},
|
| 138 |
+
}
|
| 139 |
+
)
|
| 140 |
+
)
|
| 141 |
+
else:
|
| 142 |
+
trimmed.append(out)
|
| 143 |
+
return result.model_copy(update={"outputs": trimmed}) if changed else result
|
| 144 |
+
|
| 145 |
+
|
| 146 |
def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord:
|
| 147 |
tasks_run = [
|
| 148 |
TaskSummary(
|
| 149 |
task_id=task_id,
|
| 150 |
+
stage=result.stage,
|
| 151 |
objective=result.objective,
|
| 152 |
status=result.status,
|
| 153 |
tools_used=[o.tool for o in result.outputs],
|
| 154 |
)
|
| 155 |
for task_id, result in run_state.results.items()
|
| 156 |
]
|
| 157 |
+
results_snapshot = {
|
| 158 |
+
task_id: _trim_for_snapshot(result) for task_id, result in run_state.results.items()
|
| 159 |
+
}
|
| 160 |
return AnalysisRecord(
|
| 161 |
goal_restated=narrative.goal_restated,
|
| 162 |
findings=narrative.findings,
|
|
|
|
| 164 |
data_used=narrative.data_used,
|
| 165 |
open_questions=narrative.open_questions,
|
| 166 |
tasks_run=tasks_run,
|
| 167 |
+
results_snapshot=results_snapshot,
|
| 168 |
plan_id=run_state.plan_id,
|
| 169 |
business_context_id=run_state.business_context_id,
|
| 170 |
created_at=datetime.now(UTC),
|
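The snapshot trimming added to `assembler.py` can be illustrated with plain dataclasses. This is a sketch under assumptions: `Output` is a hypothetical stand-in for the pydantic `ToolOutput`, and `dataclasses.replace` plays the role of `model_copy(update=...)`. Only the `kind`, `rows`, and `meta` fields seen in the diff are modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace

SNAPSHOT_ROW_SAMPLE = 10  # same cap as _SNAPSHOT_ROW_SAMPLE in the diff


@dataclass(frozen=True)
class Output:  # hypothetical stand-in for ToolOutput
    kind: str
    rows: list | None = None
    meta: dict = field(default_factory=dict)


def trim(out: Output, sample: int = SNAPSHOT_ROW_SAMPLE) -> Output:
    """Cap raw table rows and record how many rows existed before trimming."""
    if out.kind == "table" and out.rows is not None and len(out.rows) > sample:
        return replace(
            out,
            rows=out.rows[:sample],
            meta={**out.meta, "total_rows": len(out.rows), "rows_truncated": True},
        )
    return out  # non-table or small outputs pass through untouched
```

Because the trimmed output is a copy, the in-memory run state still holds the full rows; only the persisted snapshot is reduced.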
src/agents/slow_path/coordinator.py
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
"""SlowPathCoordinator - wires the slow path: Planner -> TaskRunner -> Assembler.
|
| 2 |
|
| 3 |
-
A thin coordination object.
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
|
| 8 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
|
| 9 |
"""
|
|
|
|
| 1 |
"""SlowPathCoordinator - wires the slow path: Planner -> TaskRunner -> Assembler.
|
| 2 |
|
| 3 |
+
A thin coordination object. `ChatHandler` calls it on a `structured_flow` query when
|
| 4 |
+
`ENABLE_SLOW_PATH` is on (the real `ToolInvoker` is composed in
|
| 5 |
+
`ChatHandler._get_slow_path_coordinator`). `BusinessContext` is still a stub until the
|
| 6 |
+
lead's real source lands.
|
| 7 |
|
| 8 |
See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1.
|
| 9 |
"""
|
src/agents/slow_path/schemas.py
CHANGED
|
@@ -21,10 +21,12 @@ from __future__ import annotations
|
|
| 21 |
|
| 22 |
from datetime import datetime
|
| 23 |
from typing import Literal
|
|
|
|
| 24 |
|
| 25 |
from pydantic import BaseModel, Field
|
| 26 |
|
| 27 |
from ..planner.contracts import ToolOutput
|
|
|
|
| 28 |
|
| 29 |
TaskStatus = Literal["success", "partial", "failure"]
|
| 30 |
|
|
@@ -36,6 +38,7 @@ TaskStatus = Literal["success", "partial", "failure"]
|
|
| 36 |
|
| 37 |
class TaskResult(BaseModel):
|
| 38 |
task_id: str
|
|
|
|
| 39 |
status: TaskStatus
|
| 40 |
objective: str
|
| 41 |
outputs: list[ToolOutput] = Field(default_factory=list) # one per tool_call
|
|
@@ -57,12 +60,21 @@ class RunState(BaseModel):
|
|
| 57 |
|
| 58 |
class TaskSummary(BaseModel):
|
| 59 |
task_id: str
|
|
|
|
| 60 |
objective: str
|
| 61 |
status: TaskStatus
|
| 62 |
tools_used: list[str] = Field(default_factory=list)
|
| 63 |
|
| 64 |
|
| 65 |
class AnalysisRecord(BaseModel):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
# Narrative fields: authored by the Assembler LLM.
|
| 67 |
goal_restated: str
|
| 68 |
findings: list[str] = Field(default_factory=list)
|
|
|
|
| 21 |
|
| 22 |
from datetime import datetime
|
| 23 |
from typing import Literal
|
| 24 |
+
from uuid import uuid4
|
| 25 |
|
| 26 |
from pydantic import BaseModel, Field
|
| 27 |
|
| 28 |
from ..planner.contracts import ToolOutput
|
| 29 |
+
from ..planner.schemas import CrispStage
|
| 30 |
|
| 31 |
TaskStatus = Literal["success", "partial", "failure"]
|
| 32 |
|
|
|
|
| 38 |
|
| 39 |
class TaskResult(BaseModel):
|
| 40 |
task_id: str
|
| 41 |
+
stage: CrispStage # copied from the plan Task; carries CRISP-DM grouping to the report
|
| 42 |
status: TaskStatus
|
| 43 |
objective: str
|
| 44 |
outputs: list[ToolOutput] = Field(default_factory=list) # one per tool_call
|
|
|
|
| 60 |
|
| 61 |
class TaskSummary(BaseModel):
|
| 62 |
task_id: str
|
| 63 |
+
stage: CrispStage # lets the report group the method appendix by CRISP-DM phase
|
| 64 |
objective: str
|
| 65 |
status: TaskStatus
|
| 66 |
tools_used: list[str] = Field(default_factory=list)
|
| 67 |
|
| 68 |
|
| 69 |
class AnalysisRecord(BaseModel):
|
| 70 |
+
# Identity. `record_id` is the unit the report cites and snapshots
|
| 71 |
+
# (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis
|
| 72 |
+
# session + owner and are stamped by the composition root / AnalysisStore at
|
| 73 |
+
# persist time (they depend on the Analysis State that lives outside the slow
|
| 74 |
+
# path), so they default to None when the Assembler first builds the record.
|
| 75 |
+
record_id: str = Field(default_factory=lambda: uuid4().hex)
|
| 76 |
+
analysis_id: str | None = None
|
| 77 |
+
user_id: str | None = None
|
| 78 |
# Narrative fields: authored by the Assembler LLM.
|
| 79 |
goal_restated: str
|
| 80 |
findings: list[str] = Field(default_factory=list)
|
src/agents/slow_path/store.py
CHANGED
|
@@ -2,21 +2,28 @@
|
|
| 2 |
|
| 3 |
The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
|
| 4 |
run - §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
|
| 5 |
-
so it sits behind this
|
|
|
|
| 6 |
|
| 7 |
-
`NullAnalysisStore`
|
| 8 |
-
|
| 9 |
-
|
|
|
|
| 10 |
|
| 11 |
-
|
| 12 |
-
`
|
| 13 |
-
|
| 14 |
"""
|
| 15 |
|
| 16 |
from __future__ import annotations
|
| 17 |
|
| 18 |
from typing import Protocol, runtime_checkable
|
| 19 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
from src.middlewares.logging import get_logger
|
| 21 |
|
| 22 |
from .schemas import AnalysisRecord
|
|
@@ -26,19 +33,78 @@ logger = get_logger("analysis_store")
|
|
| 26 |
|
| 27 |
@runtime_checkable
|
| 28 |
class AnalysisStore(Protocol):
|
| 29 |
-
"""Persist
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
async def save(self, record: AnalysisRecord) -> None: ...
|
| 33 |
|
|
|
|
|
|
|
| 34 |
|
| 35 |
class NullAnalysisStore:
|
| 36 |
-
"""
|
| 37 |
|
| 38 |
async def save(self, record: AnalysisRecord) -> None:
|
| 39 |
logger.info(
|
| 40 |
-
"analysis_record produced (not persisted -
|
|
|
|
| 41 |
plan_id=record.plan_id,
|
| 42 |
-
business_context_id=record.business_context_id,
|
| 43 |
n_tasks=len(record.tasks_run),
|
| 44 |
)
|
|
| 2 |
|
| 3 |
The Assembler produces an `AnalysisRecord` (the faithful, structured record of a
|
| 4 |
run - §8.3, INV-4). Persisting it is a separate concern from streaming the answer,
|
| 5 |
+
so it sits behind this seam. `generate_report` later reads records back by
|
| 6 |
+
`analysis_id` (oldest-first) and renders from them, never from chat history.
|
| 7 |
|
| 8 |
+
- `NullAnalysisStore` logs and stores nothing (kept for tests / when persistence
|
| 9 |
+
is intentionally disabled).
|
| 10 |
+
- `PostgresAnalysisStore` writes one `analysis_records` row per run in the catalog
|
| 11 |
+
DB (Neon `dataeyond`, `settings.postgres_connstring`).
|
| 12 |
|
| 13 |
+
`save` must never raise on the caller's path: a persistence failure must not break
|
| 14 |
+
the user's answer (§8.3). `list_for_analysis` is a read for the report generator and
|
| 15 |
+
is allowed to surface errors to its caller.
|
| 16 |
"""
|
| 17 |
|
| 18 |
from __future__ import annotations
|
| 19 |
|
| 20 |
from typing import Protocol, runtime_checkable
|
| 21 |
|
| 22 |
+
from sqlalchemy import select
|
| 23 |
+
from sqlalchemy.dialects.postgresql import insert
|
| 24 |
+
|
| 25 |
+
from src.db.postgres.connection import AsyncSessionLocal
|
| 26 |
+
from src.db.postgres.models import AnalysisRecordRow
|
| 27 |
from src.middlewares.logging import get_logger
|
| 28 |
|
| 29 |
from .schemas import AnalysisRecord
|
|
|
|
| 33 |
|
| 34 |
@runtime_checkable
|
| 35 |
class AnalysisStore(Protocol):
|
| 36 |
+
"""Persist + read completed analyses.
|
| 37 |
+
|
| 38 |
+
`save` must never raise on the caller's path. `list_for_analysis` returns the
|
| 39 |
+
records for one analysis session, oldest-first (the order the report renders in).
|
| 40 |
+
"""
|
| 41 |
|
| 42 |
async def save(self, record: AnalysisRecord) -> None: ...
|
| 43 |
|
| 44 |
+
async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ...
|
| 45 |
+
|
| 46 |
|
| 47 |
class NullAnalysisStore:
|
| 48 |
+
"""No-op store: logs the record, persists nothing. Reads return empty."""
|
| 49 |
|
| 50 |
async def save(self, record: AnalysisRecord) -> None:
|
| 51 |
logger.info(
|
| 52 |
+
"analysis_record produced (not persisted - NullAnalysisStore)",
|
| 53 |
+
record_id=record.record_id,
|
| 54 |
plan_id=record.plan_id,
|
|
|
|
| 55 |
n_tasks=len(record.tasks_run),
|
| 56 |
)
|
| 57 |
+
|
| 58 |
+
async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
|
| 59 |
+
return []
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
class PostgresAnalysisStore:
|
| 63 |
+
"""Writes/reads `analysis_records` jsonb rows in the catalog DB.
|
| 64 |
+
|
| 65 |
+
Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per
|
| 66 |
+
record (vs. one-per-user for the catalog) since records accumulate per analysis.
|
| 67 |
+
"""
|
| 68 |
+
|
| 69 |
+
async def save(self, record: AnalysisRecord) -> None:
|
| 70 |
+
try:
|
| 71 |
+
payload = record.model_dump(mode="json")
|
| 72 |
+
async with AsyncSessionLocal() as session:
|
| 73 |
+
stmt = insert(AnalysisRecordRow).values(
|
| 74 |
+
id=record.record_id,
|
| 75 |
+
analysis_id=record.analysis_id,
|
| 76 |
+
user_id=record.user_id,
|
| 77 |
+
plan_id=record.plan_id,
|
| 78 |
+
data=payload,
|
| 79 |
+
created_at=record.created_at,
|
| 80 |
+
)
|
| 81 |
+
# Re-running the same plan id-collides only if record_id repeats;
|
| 82 |
+
# treat that as idempotent (overwrite) rather than erroring the user.
|
| 83 |
+
stmt = stmt.on_conflict_do_update(
|
| 84 |
+
index_elements=[AnalysisRecordRow.id],
|
| 85 |
+
set_={"data": stmt.excluded.data},
|
| 86 |
+
)
|
| 87 |
+
await session.execute(stmt)
|
| 88 |
+
await session.commit()
|
| 89 |
+
logger.info(
|
| 90 |
+
"analysis_record persisted",
|
| 91 |
+
record_id=record.record_id,
|
| 92 |
+
analysis_id=record.analysis_id,
|
| 93 |
+
user_id=record.user_id,
|
| 94 |
+
)
|
| 95 |
+
except Exception as exc: # never break the user's answer (§8.3)
|
| 96 |
+
logger.error(
|
| 97 |
+
"analysis_record persist failed",
|
| 98 |
+
record_id=record.record_id,
|
| 99 |
+
error=str(exc),
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]:
|
| 103 |
+
async with AsyncSessionLocal() as session:
|
| 104 |
+
result = await session.execute(
|
| 105 |
+
select(AnalysisRecordRow.data)
|
| 106 |
+
.where(AnalysisRecordRow.analysis_id == analysis_id)
|
| 107 |
+
.order_by(AnalysisRecordRow.created_at.asc())
|
| 108 |
+
)
|
| 109 |
+
rows = result.scalars().all()
|
| 110 |
+
return [AnalysisRecord.model_validate(row) for row in rows]
|
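The store contract above (a `save` that never raises on the caller's path, and an oldest-first `list_for_analysis`) can be sketched without a database. This is a minimal illustration only: `Record`, `InMemoryAnalysisStore`, and the `fail_writes` flag are hypothetical stand-ins, not names from this PR, and the sketch overwrites the whole record on a repeated id whereas the Postgres store in the diff overwrites only `data`.

```python
import asyncio
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone

logger = logging.getLogger("analysis_store_sketch")


@dataclass
class Record:
    # Hypothetical stand-in for AnalysisRecord; only the fields the sketch needs.
    record_id: str
    analysis_id: str
    created_at: datetime
    data: dict = field(default_factory=dict)


class InMemoryAnalysisStore:
    """Illustrative store following the same contract as the stores in the diff."""

    def __init__(self, fail_writes: bool = False) -> None:
        self._rows: dict[str, Record] = {}
        self._fail_writes = fail_writes  # assumption: simulates a backend outage

    async def save(self, record: Record) -> None:
        try:
            if self._fail_writes:
                raise RuntimeError("simulated backend outage")
            # A repeated record_id overwrites, so re-saving is idempotent.
            self._rows[record.record_id] = record
        except Exception as exc:  # swallowed: the caller's path never sees it
            logger.error("persist failed: %s", exc)

    async def list_for_analysis(self, analysis_id: str) -> list[Record]:
        rows = [r for r in self._rows.values() if r.analysis_id == analysis_id]
        # Oldest-first, the order the report renders in.
        return sorted(rows, key=lambda r: r.created_at)


async def demo() -> tuple[list[str], int]:
    store = InMemoryAnalysisStore()
    t0 = datetime(2026, 6, 22, tzinfo=timezone.utc)
    await store.save(Record("r2", "a1", t0.replace(hour=2)))
    await store.save(Record("r1", "a1", t0.replace(hour=1)))
    await store.save(Record("r1", "a1", t0.replace(hour=1), {"v": 2}))  # same id
    ordered = [r.record_id for r in await store.list_for_analysis("a1")]

    broken = InMemoryAnalysisStore(fail_writes=True)
    await broken.save(Record("r3", "a1", t0))  # logged, not raised
    return ordered, len(await broken.list_for_analysis("a1"))
```

Running `asyncio.run(demo())` exercises both paths: the healthy store returns its two distinct records in creation order, and the failing store returns nothing while the caller continues normally.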
src/agents/slow_path/task_runner.py
CHANGED
|
@@ -53,6 +53,7 @@ class TaskRunner:
|
|
| 53 |
for tid in list(remaining):
|
| 54 |
results[tid] = TaskResult(
|
| 55 |
task_id=tid,
|
|
|
|
| 56 |
status="failure",
|
| 57 |
objective=tasks_by_id[tid].objective,
|
| 58 |
error="unresolved dependency; task could not run",
|
|
@@ -68,6 +69,7 @@ class TaskRunner:
|
|
| 68 |
if failed:
|
| 69 |
results[tid] = TaskResult(
|
| 70 |
task_id=tid,
|
|
|
|
| 71 |
status="failure",
|
| 72 |
objective=task.objective,
|
| 73 |
error=f"skipped: upstream {failed} did not succeed",
|
|
@@ -110,6 +112,7 @@ class TaskRunner:
|
|
| 110 |
error = errs[0] if errs else "all tool calls failed"
|
| 111 |
return TaskResult(
|
| 112 |
task_id=task.id,
|
|
|
|
| 113 |
status=status,
|
| 114 |
objective=task.objective,
|
| 115 |
outputs=outputs,
|
|
|
|
| 53 |
for tid in list(remaining):
|
| 54 |
results[tid] = TaskResult(
|
| 55 |
task_id=tid,
|
| 56 |
+
stage=tasks_by_id[tid].stage,
|
| 57 |
status="failure",
|
| 58 |
objective=tasks_by_id[tid].objective,
|
| 59 |
error="unresolved dependency; task could not run",
|
|
|
|
| 69 |
if failed:
|
| 70 |
results[tid] = TaskResult(
|
| 71 |
task_id=tid,
|
| 72 |
+
stage=task.stage,
|
| 73 |
status="failure",
|
| 74 |
objective=task.objective,
|
| 75 |
error=f"skipped: upstream {failed} did not succeed",
|
|
|
|
| 112 |
error = errs[0] if errs else "all tool calls failed"
|
| 113 |
return TaskResult(
|
| 114 |
task_id=task.id,
|
| 115 |
+
stage=task.stage,
|
| 116 |
status=status,
|
| 117 |
objective=task.objective,
|
| 118 |
outputs=outputs,
|
src/agents/state_store.py
ADDED
|
@@ -0,0 +1,128 @@
|
| 1 |
+
"""AnalysisStateStore: read/write the per-analysis session state.
|
| 2 |
+
|
| 3 |
+
The orchestrator gate + Help skill read `AnalysisState` (the locked contract in
|
| 4 |
+
`gate.py`) every turn; the Problem Statement skill writes `problem_validated`. The
|
| 5 |
+
row shares its id with the chat `rooms` row: one session = one analysis = one
|
| 6 |
+
conversation (`analysis_id == room_id`).
|
| 7 |
+
|
| 8 |
+
Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSession`.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
from sqlalchemy.dialects.postgresql import insert
|
| 14 |
+
|
| 15 |
+
from src.agents.gate import AnalysisState
|
| 16 |
+
from src.db.postgres.connection import AsyncSessionLocal
|
| 17 |
+
from src.db.postgres.models import AnalysisStateRow
|
| 18 |
+
from src.middlewares.logging import get_logger
|
| 19 |
+
|
| 20 |
+
logger = get_logger("analysis_state_store")
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _row_to_state(row: AnalysisStateRow) -> AnalysisState:
|
| 24 |
+
"""Map a DB row to the frozen `AnalysisState` contract."""
|
| 25 |
+
return AnalysisState(
|
| 26 |
+
id=row.id,
|
| 27 |
+
analysis_title=row.analysis_title,
|
| 28 |
+
problem_statement=row.problem_statement,
|
| 29 |
+
problem_validated=row.problem_validated,
|
| 30 |
+
owner_id=row.owner_id,
|
| 31 |
+
report_id=row.report_id,
|
| 32 |
+
created_at=row.created_at,
|
| 33 |
+
updated_at=row.updated_at,
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class AnalysisStateStore:
|
| 38 |
+
"""Read/write the dedorch `analysis` table, keyed by the shared session id."""
|
| 39 |
+
|
| 40 |
+
async def get(self, analysis_id: str) -> AnalysisState | None:
|
| 41 |
+
async with AsyncSessionLocal() as session:
|
| 42 |
+
row = await session.get(AnalysisStateRow, analysis_id)
|
| 43 |
+
return _row_to_state(row) if row is not None else None
|
| 44 |
+
|
| 45 |
+
async def ensure(
|
| 46 |
+
self,
|
| 47 |
+
analysis_id: str,
|
| 48 |
+
owner_id: str,
|
| 49 |
+
analysis_title: str = "New analysis",
|
| 50 |
+
) -> AnalysisState:
|
| 51 |
+
"""Get-or-create the state row for a session (idempotent, race-safe).
|
| 52 |
+
|
| 53 |
+
Sessions born from `/room/create` have no `analysis_states` row; without
|
| 54 |
+
one the gate redirect-loops and `problem_statement` / `report_id` writes
|
| 55 |
+
silently no-op. Called per turn (analysis_id == room_id) so any session is
|
| 56 |
+
gate-ready. `INSERT ... ON CONFLICT DO NOTHING` makes concurrent first
|
| 57 |
+
turns safe; the row is then read back. Legacy rows created this way carry
|
| 58 |
+
no source bindings; binding scoping fail-opens to the whole catalog.
|
| 59 |
+
"""
|
| 60 |
+
async with AsyncSessionLocal() as session:
|
| 61 |
+
stmt = (
|
| 62 |
+
insert(AnalysisStateRow)
|
| 63 |
+
.values(
|
| 64 |
+
id=analysis_id,
|
| 65 |
+
owner_id=owner_id,
|
| 66 |
+
analysis_title=analysis_title,
|
| 67 |
+
problem_statement="",
|
| 68 |
+
problem_validated=False,
|
| 69 |
+
)
|
| 70 |
+
.on_conflict_do_nothing(index_elements=[AnalysisStateRow.id])
|
| 71 |
+
)
|
| 72 |
+
await session.execute(stmt)
|
| 73 |
+
await session.commit()
|
| 74 |
+
row = await session.get(AnalysisStateRow, analysis_id)
|
| 75 |
+
return _row_to_state(row)
|
| 76 |
+
|
| 77 |
+
async def create(
|
| 78 |
+
self,
|
| 79 |
+
*,
|
| 80 |
+
analysis_id: str,
|
| 81 |
+
owner_id: str,
|
| 82 |
+
analysis_title: str = "New analysis",
|
| 83 |
+
problem_statement: str = "",
|
| 84 |
+
) -> AnalysisState:
|
| 85 |
+
"""Create the state row for a new analysis (id shared with its chat room)."""
|
| 86 |
+
async with AsyncSessionLocal() as session:
|
| 87 |
+
row = AnalysisStateRow(
|
| 88 |
+
id=analysis_id,
|
| 89 |
+
owner_id=owner_id,
|
| 90 |
+
analysis_title=analysis_title,
|
| 91 |
+
problem_statement=problem_statement,
|
| 92 |
+
problem_validated=False,
|
| 93 |
+
)
|
| 94 |
+
session.add(row)
|
| 95 |
+
await session.commit()
|
| 96 |
+
await session.refresh(row)
|
| 97 |
+
return _row_to_state(row)
|
| 98 |
+
|
| 99 |
+
async def update(
|
| 100 |
+
self,
|
| 101 |
+
analysis_id: str,
|
| 102 |
+
*,
|
| 103 |
+
problem_statement: str | None = None,
|
| 104 |
+
problem_validated: bool | None = None,
|
| 105 |
+
report_id: str | None = None,
|
| 106 |
+
) -> AnalysisState | None:
|
| 107 |
+
"""Patch the given fields (only non-None args are written). Returns the row.
|
| 108 |
+
|
| 109 |
+
Used by the Problem Statement skill (`problem_validated`) and the report
|
| 110 |
+
flow (`report_id`). Returns None if the analysis doesn't exist.
|
| 111 |
+
"""
|
| 112 |
+
async with AsyncSessionLocal() as session:
|
| 113 |
+
row = await session.get(AnalysisStateRow, analysis_id)
|
| 114 |
+
if row is None:
|
| 115 |
+
logger.warning(
|
| 116 |
+
"analysis row missing - update skipped",
|
| 117 |
+
analysis_id=analysis_id,
|
| 118 |
+
)
|
| 119 |
+
return None
|
| 120 |
+
if problem_statement is not None:
|
| 121 |
+
row.problem_statement = problem_statement
|
| 122 |
+
if problem_validated is not None:
|
| 123 |
+
row.problem_validated = problem_validated
|
| 124 |
+
if report_id is not None:
|
| 125 |
+
row.report_id = report_id
|
| 126 |
+
await session.commit()
|
| 127 |
+
await session.refresh(row)
|
| 128 |
+
return _row_to_state(row)
|
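The two behaviours described in the docstrings above, an idempotent get-or-create and a patch that writes only non-None arguments, can be sketched in memory. This is an illustration under assumptions: `State` and `InMemoryStateStore` are hypothetical, trimmed stand-ins, and `dict.setdefault` plays the role that `INSERT ... ON CONFLICT DO NOTHING` plays in the real store.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class State:
    # Hypothetical, trimmed stand-in for the frozen AnalysisState contract.
    id: str
    owner_id: str
    analysis_title: str = "New analysis"
    problem_statement: str = ""
    problem_validated: bool = False
    report_id: Optional[str] = None


class InMemoryStateStore:
    def __init__(self) -> None:
        self._rows: dict[str, State] = {}

    def ensure(self, analysis_id: str, owner_id: str) -> State:
        # An existing row wins and is returned unchanged; a missing one is created.
        return self._rows.setdefault(analysis_id, State(analysis_id, owner_id))

    def update(
        self,
        analysis_id: str,
        *,
        problem_statement: Optional[str] = None,
        problem_validated: Optional[bool] = None,
        report_id: Optional[str] = None,
    ) -> Optional[State]:
        row = self._rows.get(analysis_id)
        if row is None:
            return None  # missing analysis: nothing is written
        candidate = {
            "problem_statement": problem_statement,
            "problem_validated": problem_validated,
            "report_id": report_id,
        }
        # Only arguments that were actually supplied are applied.
        patch = {k: v for k, v in candidate.items() if v is not None}
        row = replace(row, **patch)
        self._rows[analysis_id] = row
        return row
```

One consequence of treating `None` as "leave unchanged" is that this style of patch cannot reset a field such as `report_id` back to `None`; that would need a separate sentinel value.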
src/api/v1/analysis.py
ADDED
|
@@ -0,0 +1,174 @@
|
| 1 |
+
"""Analysis session API: create a new analysis (the per-session workspace).
|
| 2 |
+
|
| 3 |
+
An analysis IS the chat session: the `analysis_states` row and the chat `rooms`
|
| 4 |
+
row share one id (`analysis_id == room_id`), so the existing `room_id` on the chat
|
| 5 |
+
request doubles as the `analysis_id`. Creating an analysis enforces the data-first
|
| 6 |
+
gate (>=1 bound source) and seeds the state with a title + an optional problem
|
| 7 |
+
statement (validated later by the Problem Statement skill).
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
import uuid
|
| 11 |
+
|
| 12 |
+
from fastapi import APIRouter, Depends, HTTPException
|
| 13 |
+
from pydantic import BaseModel, Field
|
| 14 |
+
from sqlalchemy import select
|
| 15 |
+
from sqlalchemy.ext.asyncio import AsyncSession
|
| 16 |
+
|
| 17 |
+
from src.db.postgres.connection import get_db
|
| 18 |
+
from src.db.postgres.models import AnalysisDataSourceRow, AnalysisStateRow, Room
|
| 19 |
+
from src.middlewares.logging import get_logger, log_execution
|
| 20 |
+
|
| 21 |
+
logger = get_logger("analysis_api")
|
| 22 |
+
|
| 23 |
+
router = APIRouter(prefix="/api/v1", tags=["Analysis"])
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def _serialize_state(row: AnalysisStateRow, data_source_ids: list[str]) -> dict:
|
| 27 |
+
"""The full analysis payload: the 8 state fields + the bound source ids."""
|
| 28 |
+
return {
|
| 29 |
+
"id": row.id,
|
| 30 |
+
"analysis_title": row.analysis_title,
|
| 31 |
+
"problem_statement": row.problem_statement,
|
| 32 |
+
"problem_validated": row.problem_validated,
|
| 33 |
+
"owner_id": row.owner_id,
|
| 34 |
+
"report_id": row.report_id,
|
| 35 |
+
"data_source_ids": data_source_ids,
|
| 36 |
+
"created_at": row.created_at.isoformat() if row.created_at else None,
|
| 37 |
+
"updated_at": row.updated_at.isoformat() if row.updated_at else None,
|
| 38 |
+
}
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
async def _bound_source_ids(db: AsyncSession, analysis_id: str) -> list[str]:
|
| 42 |
+
result = await db.execute(
|
| 43 |
+
select(AnalysisDataSourceRow.reference_id).where(
|
| 44 |
+
AnalysisDataSourceRow.analysis_id == analysis_id
|
| 45 |
+
)
|
| 46 |
+
)
|
| 47 |
+
return list(result.scalars().all())
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
async def _sources_by_id(user_id: str) -> dict:
|
| 51 |
+
"""Catalog sources keyed by source_id, to resolve `type`/`name` on binding.
|
| 52 |
+
|
| 53 |
+
Never-throw: missing catalog / read error -> empty map, and binding rows fall back
|
| 54 |
+
to type='unknown' / name=reference_id.
|
| 55 |
+
"""
|
| 56 |
+
try:
|
| 57 |
+
from src.catalog.store import CatalogStore
|
| 58 |
+
|
| 59 |
+
catalog = await CatalogStore().get(user_id)
|
| 60 |
+
except Exception as e: # noqa: BLE001 - binding must not fail on catalog read
|
| 61 |
+
logger.warning("analysis: catalog read failed for binding", user_id=user_id, error=str(e))
|
| 62 |
+
return {}
|
| 63 |
+
return {s.source_id: s for s in catalog.sources} if catalog else {}
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
class CreateAnalysisRequest(BaseModel):
|
| 67 |
+
user_id: str
|
| 68 |
+
analysis_title: str = "New analysis"
|
| 69 |
+
problem_statement: str = ""
|
| 70 |
+
data_source_ids: list[str] = Field(default_factory=list)
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
@router.post("/analysis/create")
|
| 74 |
+
@log_execution(logger)
|
| 75 |
+
async def create_analysis(
|
| 76 |
+
request: CreateAnalysisRequest,
|
| 77 |
+
db: AsyncSession = Depends(get_db),
|
| 78 |
+
):
|
| 79 |
+
"""Create a new analysis session: one shared id for its state + chat room.
|
| 80 |
+
|
| 81 |
+
Data-first gate (decision #2): an analysis requires >=1 bound data source.
|
| 82 |
+
The bound sources are persisted as dedorch `data_sources` rows (#10) in the same
|
| 83 |
+
transaction as the state + room, so the analysis is scoped to exactly the sources
|
| 84 |
+
the user picked. `structured_flow` and the report read this binding back.
|
| 85 |
+
"""
|
| 86 |
+
if not request.data_source_ids:
|
| 87 |
+
raise HTTPException(
|
| 88 |
+
status_code=400,
|
| 89 |
+
detail="An analysis requires at least one bound data source.",
|
| 90 |
+
)
|
| 91 |
+
|
| 92 |
+
analysis_id = str(uuid.uuid4())
|
| 93 |
+
# The analysis IS the session: state row + chat room + source bindings share one
|
| 94 |
+
# id, created atomically in one transaction.
|
| 95 |
+
state_row = AnalysisStateRow(
|
| 96 |
+
id=analysis_id,
|
| 97 |
+
owner_id=request.user_id,
|
| 98 |
+
analysis_title=request.analysis_title,
|
| 99 |
+
problem_statement=request.problem_statement,
|
| 100 |
+
problem_validated=False,
|
| 101 |
+
)
|
| 102 |
+
db.add(Room(id=analysis_id, user_id=request.user_id, title=request.analysis_title))
|
| 103 |
+
db.add(state_row)
|
| 104 |
+
# dict.fromkeys dedupes while preserving order. Each binding row snapshots the
|
| 105 |
+
# source's type + name from the catalog (reference_id = catalog source id);
|
| 106 |
+
# bound_at/created_at default to now() in dedorch.
|
| 107 |
+
bound_ids = list(dict.fromkeys(request.data_source_ids))
|
| 108 |
+
src_by_id = await _sources_by_id(request.user_id)
|
| 109 |
+
for source_id in bound_ids:
|
| 110 |
+
src = src_by_id.get(source_id)
|
| 111 |
+
db.add(
|
| 112 |
+
AnalysisDataSourceRow(
|
| 113 |
+
id=str(uuid.uuid4()),
|
| 114 |
+
analysis_id=analysis_id,
|
| 115 |
+
type=src.source_type if src else "unknown",
|
| 116 |
+
name=src.name if src else source_id,
|
| 117 |
+
reference_id=source_id,
|
| 118 |
+
bound_by=request.user_id,
|
| 119 |
+
)
|
| 120 |
+
)
|
| 121 |
+
await db.commit()
|
| 122 |
+
await db.refresh(state_row)
|
| 123 |
+
|
| 124 |
+
logger.info(
|
| 125 |
+
"analysis created",
|
| 126 |
+
analysis_id=analysis_id,
|
| 127 |
+
user_id=request.user_id,
|
| 128 |
+
sources=len(bound_ids),
|
| 129 |
+
)
|
| 130 |
+
return {
|
| 131 |
+
"status": "success",
|
| 132 |
+
"message": "Analysis created successfully",
|
| 133 |
+
"data": _serialize_state(state_row, bound_ids),
|
| 134 |
+
}
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
@router.get("/analysis")
|
| 138 |
+
@log_execution(logger)
|
| 139 |
+
async def list_analyses(user_id: str, db: AsyncSession = Depends(get_db)):
|
| 140 |
+
"""List a user's analyses, most-recently-updated first (Analysis sidebar).
|
| 141 |
+
|
| 142 |
+
Summary fields only (no per-row source bindings; fetch those via the detail
|
| 143 |
+
endpoint) to keep the list a single query.
|
| 144 |
+
"""
|
| 145 |
+
result = await db.execute(
|
| 146 |
+
select(AnalysisStateRow)
|
| 147 |
+
.where(AnalysisStateRow.owner_id == user_id)
|
| 148 |
+
.order_by(AnalysisStateRow.updated_at.desc())
|
| 149 |
+
)
|
| 150 |
+
rows = result.scalars().all()
|
| 151 |
+
return {
|
| 152 |
+
"status": "success",
|
| 153 |
+
"data": [
|
| 154 |
+
{
|
| 155 |
+
"id": r.id,
|
| 156 |
+
"analysis_title": r.analysis_title,
|
| 157 |
+
"problem_validated": r.problem_validated,
|
| 158 |
+
"report_id": r.report_id,
|
| 159 |
+
"updated_at": r.updated_at.isoformat() if r.updated_at else None,
|
| 160 |
+
}
|
| 161 |
+
for r in rows
|
| 162 |
+
],
|
| 163 |
+
}
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
@router.get("/analysis/{analysis_id}")
|
| 167 |
+
@log_execution(logger)
|
| 168 |
+
async def get_analysis(analysis_id: str, db: AsyncSession = Depends(get_db)):
|
| 169 |
+
"""Read one analysis's state + bound data sources (the FE workspace render)."""
|
| 170 |
+
row = await db.get(AnalysisStateRow, analysis_id)
|
| 171 |
+
if row is None:
|
| 172 |
+
raise HTTPException(status_code=404, detail=f"Analysis {analysis_id!r} not found.")
|
| 173 |
+
data_source_ids = await _bound_source_ids(db, analysis_id)
|
| 174 |
+
return {"status": "success", "data": _serialize_state(row, data_source_ids)}
|
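The binding step in `create_analysis` (order-preserving dedupe via `dict.fromkeys`, then a catalog lookup with a fallback) can be isolated as a pure function. A minimal sketch, assuming a hypothetical `CatalogSource` shape and a hypothetical `build_bindings` helper; it returns plain dicts rather than ORM rows.

```python
from dataclasses import dataclass


@dataclass
class CatalogSource:
    # Hypothetical stand-in for a catalog source entry.
    source_id: str
    source_type: str
    name: str


def build_bindings(
    requested_ids: list[str], catalog: dict[str, CatalogSource]
) -> list[dict]:
    """Return one binding per distinct source id, in first-seen order."""
    rows = []
    # dict.fromkeys drops duplicates while keeping the original order.
    for source_id in dict.fromkeys(requested_ids):
        src = catalog.get(source_id)
        rows.append(
            {
                "reference_id": source_id,
                # Fallbacks used when the catalog has no entry for this id.
                "type": src.source_type if src else "unknown",
                "name": src.name if src else source_id,
            }
        )
    return rows
```

With an empty catalog map (the never-throw path of `_sources_by_id`), every binding falls back to `type="unknown"` and `name=reference_id`, so creation still succeeds.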
src/api/v1/chat.py
CHANGED
|
@@ -31,6 +31,7 @@ router = APIRouter(prefix="/api/v1", tags=["Chat"])
|
|
| 31 |
_chat_handler = ChatHandler(
|
| 32 |
enable_tracing=True,
|
| 33 |
enable_slow_path=settings.enable_slow_path,
|
|
|
|
| 34 |
)
|
| 35 |
|
| 36 |
_GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
|
|
@@ -64,8 +65,39 @@ async def get_cached_response(redis, cache_key: str) -> Optional[dict]:
|
|
| 64 |
return None
|
| 65 |
|
| 66 |
|
| 67 |
async def cache_response(redis, cache_key: str, response: str, sources: list):
|
| 68 |
-
await redis.setex(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
|
| 71 |
async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
|
|
@@ -107,10 +139,10 @@ async def save_messages(
|
|
| 107 |
|
| 108 |
|
| 109 |
@router.delete("/chat/cache")
|
| 110 |
-
async def clear_chat_cache(room_id: str, message: str):
|
| 111 |
-
"""Delete the Redis cache entry for a specific room + message pair."""
|
| 112 |
redis = await get_redis()
|
| 113 |
-
cache_key =
|
| 114 |
deleted = await redis.delete(cache_key)
|
| 115 |
return {"deleted": deleted > 0, "cache_key": cache_key}
|
| 116 |
|
|
@@ -146,7 +178,7 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
|
|
| 146 |
3. done - signals end of stream
|
| 147 |
"""
|
| 148 |
redis = await get_redis()
|
| 149 |
-
cache_key =
|
| 150 |
|
| 151 |
# Redis cache hit
|
| 152 |
cached = await get_cached_response(redis, cache_key)
|
|
@@ -186,8 +218,17 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
|
|
| 186 |
logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
|
| 187 |
full_response = ""
|
| 188 |
sources: List[Dict[str, Any]] = []
|
| 189 |
-
|
| 190 |
-
|
|
| 191 |
try:
|
| 192 |
sources = json.loads(event["data"]) or []
|
| 193 |
except (TypeError, ValueError):
|
|
@@ -197,7 +238,10 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
|
|
| 197 |
full_response += event["data"]
|
| 198 |
yield event
|
| 199 |
elif event["event"] == "done":
|
| 200 |
-
|
|
|
|
|
|
|
|
|
|
| 201 |
logger.info("saving messages", sources_count=len(sources), sources=sources)
|
| 202 |
try:
|
| 203 |
await save_messages(db, request.room_id, request.message, full_response, sources=sources)
|
|
@@ -211,7 +255,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)):
|
|
| 211 |
elif event["event"] == "error":
|
| 212 |
yield event
|
| 213 |
return
|
| 214 |
-
# "intent" event: consumed internally, not forwarded to frontend
|
| 215 |
|
| 216 |
return EventSourceResponse(stream_response())
|
| 217 |
|
|
|
|
| 31 |
_chat_handler = ChatHandler(
|
| 32 |
enable_tracing=True,
|
| 33 |
enable_slow_path=settings.enable_slow_path,
|
| 34 |
+
enable_gate=settings.enable_gate,
|
| 35 |
)
|
| 36 |
|
| 37 |
_GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"])
|
|
|
|
| 65 |
return None
|
| 66 |
|
| 67 |
|
| 68 |
+
# 1h TTL per the 2026-06-11 checkpoint decision (Redis = retrieval/query caching
|
| 69 |
+
# only, short-lived). Was 24h, which served stale answers after re-ingestion.
|
| 70 |
+
_CHAT_CACHE_TTL_SECONDS = 3600
|
| 71 |
+
|
| 72 |
+
# Only stateless replies are safe to cache. The cache key is (room, user, message)
|
| 73 |
+
# with no analysis-state/data version, so caching a state- or data-dependent answer
|
| 74 |
+
# (help / problem_statement / check / structured_flow / unstructured_flow) would
|
| 75 |
+
# replay a stale answer after the state or data changes - and, since the read check
|
| 76 |
+
# runs before the gate, could even bypass the gate when the same message repeats.
|
| 77 |
+
# So we cache ONLY the `chat` intent. Caching analysis answers needs proper
|
| 78 |
+
# invalidation on data/state change - deferred. The write is gated by the intent the
|
| 79 |
+
# handler already emits; the read stays as-is (safe because only `chat` is ever
|
| 80 |
+
# stored).
|
| 81 |
+
_CACHEABLE_INTENTS = frozenset({"chat"})
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def _chat_cache_key(room_id: str, user_id: str, message: str) -> str:
|
| 85 |
+
# user_id is part of the key so one user's cached answer can never be
|
| 86 |
+
# replayed to another (R5); room_id stays first so the room-wide clear
|
| 87 |
+
# endpoint can keep matching on a `chat:{room_id}:*` prefix.
|
| 88 |
+
# LIMITATION (T-G): the key omits conversation history, so a repeated message
|
| 89 |
+
# replays its cached answer even if the conversation has since moved on. Only
|
| 90 |
+
# the stateless `chat` intent is cached, so the blast radius is small - but a
|
| 91 |
+
# history-aware key (hash of last-N turns) would close it. Flagged to Harry.
|
| 92 |
+
return f"{settings.redis_prefix}chat:{room_id}:{user_id}:{message}"
|
| 93 |
+
|
| 94 |
+
|
| 95 |
async def cache_response(redis, cache_key: str, response: str, sources: list):
|
| 96 |
+
await redis.setex(
|
| 97 |
+
cache_key,
|
| 98 |
+
_CHAT_CACHE_TTL_SECONDS,
|
| 99 |
+
json.dumps({"response": response, "sources": sources}),
|
| 100 |
+
)
|
| 101 |
|
| 102 |
|
| 103 |
async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list:
|
|
|
|
| 139 |
|
| 140 |
|
| 141 |
@router.delete("/chat/cache")
|
| 142 |
+
async def clear_chat_cache(room_id: str, user_id: str, message: str):
|
| 143 |
+
"""Delete the Redis cache entry for a specific room + user + message combination."""
|
| 144 |
redis = await get_redis()
|
| 145 |
+
cache_key = _chat_cache_key(room_id, user_id, message)
|
| 146 |
deleted = await redis.delete(cache_key)
|
| 147 |
return {"deleted": deleted > 0, "cache_key": cache_key}
|
| 148 |
|
|
|
|
| 178 |
3. done - signals end of stream
|
| 179 |
"""
|
| 180 |
redis = await get_redis()
|
| 181 |
+
cache_key = _chat_cache_key(request.room_id, request.user_id, request.message)
|
| 182 |
|
| 183 |
# Redis cache hit
|
| 184 |
cached = await get_cached_response(redis, cache_key)
|
|
|
|
| 218 |
logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id)
|
| 219 |
full_response = ""
|
| 220 |
sources: List[Dict[str, Any]] = []
|
| 221 |
+
effective_intent: Optional[str] = None
|
| 222 |
+
async for event in handler.handle(
|
| 223 |
+
request.message, request.user_id, history, analysis_id=request.room_id
|
| 224 |
+
):
|
| 225 |
+
if event["event"] == "intent":
|
| 226 |
+
# consumed internally (not forwarded); gates caching below.
|
| 227 |
+
try:
|
| 228 |
+
effective_intent = json.loads(event["data"]).get("intent")
|
| 229 |
+
except (TypeError, ValueError, AttributeError):
|
| 230 |
+
effective_intent = None
|
| 231 |
+
elif event["event"] == "sources":
|
| 232 |
try:
|
| 233 |
sources = json.loads(event["data"]) or []
|
| 234 |
except (TypeError, ValueError):
|
|
|
|
| 238 |
full_response += event["data"]
|
| 239 |
yield event
|
| 240 |
elif event["event"] == "done":
|
| 241 |
+
# Only cache stateless `chat` replies - caching a state/data-
|
| 242 |
+
# dependent answer would replay it stale (see _CACHEABLE_INTENTS).
|
| 243 |
+
if effective_intent in _CACHEABLE_INTENTS:
|
| 244 |
+
await cache_response(redis, cache_key, full_response, sources=sources)
|
| 245 |
logger.info("saving messages", sources_count=len(sources), sources=sources)
|
| 246 |
try:
|
| 247 |
await save_messages(db, request.room_id, request.message, full_response, sources=sources)
|
|
|
|
| 255 |
elif event["event"] == "error":
|
| 256 |
yield event
|
| 257 |
return
|
|
|
|
| 258 |
|
| 259 |
return EventSourceResponse(stream_response())
|
| 260 |
|
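The caching rule introduced in this file (key on room, user, and message; write only when the emitted intent is `chat`) can be sketched on its own. This is an assumption-laden illustration: `FakeRedis`, `parse_intent`, and `maybe_cache` are hypothetical helpers, the prefix is passed in rather than read from settings, and the fake client is synchronous where the real one is awaited.

```python
import json
from typing import Optional

CACHEABLE_INTENTS = frozenset({"chat"})
TTL_SECONDS = 3600  # 1h, matching _CHAT_CACHE_TTL_SECONDS in the diff


def chat_cache_key(prefix: str, room_id: str, user_id: str, message: str) -> str:
    # room_id first so a room-wide clear can match on a prefix;
    # user_id included so answers are never shared across users.
    return f"{prefix}chat:{room_id}:{user_id}:{message}"


class FakeRedis:
    """Tiny in-memory stand-in that records what setex was asked to store."""

    def __init__(self) -> None:
        self.store: dict[str, tuple[int, str]] = {}

    def setex(self, key: str, ttl: int, value: str) -> None:
        self.store[key] = (ttl, value)


def parse_intent(raw) -> Optional[str]:
    # Malformed or non-object payloads are treated as "no intent".
    try:
        return json.loads(raw).get("intent")
    except (TypeError, ValueError, AttributeError):
        return None


def maybe_cache(
    redis: FakeRedis, key: str, intent: Optional[str], response: str, sources: list
) -> bool:
    """Store the reply only for stateless intents; report whether it was stored."""
    if intent not in CACHEABLE_INTENTS:
        return False
    payload = json.dumps({"response": response, "sources": sources})
    redis.setex(key, TTL_SECONDS, payload)
    return True
```

Because an unparseable intent resolves to `None`, the failure mode is "do not cache" rather than "cache by accident", which matches the conservative default in the handler loop.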
src/api/v1/report.py
ADDED
|
@@ -0,0 +1,189 @@
|
| 1 |
+
"""Report API (KM-644): the dedicated "Generate Report" surface.
|
| 2 |
+
|
| 3 |
+
NOT a chat route. The frontend button calls these endpoints directly:
|
| 4 |
+
POST /report generate a new version for a session
|
| 5 |
+
GET /report/{analysis_id} list a session's report versions
|
| 6 |
+
GET /report/{analysis_id}/{ver} fetch one version
|
| 7 |
+
|
| 8 |
+
Generation reads persisted AnalysisRecords + Problem Statement, makes one LLM call
|
| 9 |
+
(the executive summary), and persists an immutable versioned artifact. The
|
| 10 |
+
ReportGenerator + ReportStore are process singletons (the generator caches its LLM
|
| 11 |
+
chain warm across requests, like ChatHandler).
|
| 12 |
+
|
| 13 |
+
Note (T-E): AnalysisRecords are only persisted by the slow path, so reports require
|
| 14 |
+
`ENABLE_SLOW_PATH=on`. With it off, no records exist and generation 409s - by design,
|
| 15 |
+
not a bug. POST gates on the same floor as Help's readiness signal (validated goal +
|
| 16 |
+
≥1 substantive analysis) so the button and Help never disagree.
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from fastapi import APIRouter, HTTPException, Query, status
|
| 20 |
+
|
| 21 |
+
from src.agents.report.errors import ReportError
|
| 22 |
+
from src.agents.report.generator import ReportGenerator
|
| 23 |
+
from src.agents.report.schemas import AnalysisReport, ProblemStatement
|
| 24 |
+
from src.agents.report.store import ReportStore
|
| 25 |
+
from src.middlewares.logging import get_logger, log_execution
|
| 26 |
+
from src.models.api.report import ReportVersionEntry
|
| 27 |
+
|
| 28 |
+
logger = get_logger("report_api")
|
| 29 |
+
|
| 30 |
+
router = APIRouter(prefix="/api/v1", tags=["Report"])
|
| 31 |
+
|
| 32 |
+
_generator = ReportGenerator()
|
| 33 |
+
_store = ReportStore()
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
async def _load_state(analysis_id: str):
|
| 37 |
+
"""Load the AnalysisState (for the floor gate + problem statement). Never-throw."""
|
| 38 |
+
try:
|
| 39 |
+
from src.agents.state_store import AnalysisStateStore
|
| 40 |
+
|
| 41 |
+
return await AnalysisStateStore().get(analysis_id)
|
| 42 |
+
except Exception as e: # noqa: BLE001 - never block report generation on this
|
| 43 |
+
logger.warning("report: state load failed", analysis_id=analysis_id, error=str(e))
|
| 44 |
+
return None
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def _problem_statement_from(state) -> ProblemStatement:
|
| 48 |
+
"""Map the analysis's free-text problem statement into the report's structured PS."""
|
| 49 |
+
if state is None or not state.problem_statement:
|
| 50 |
+
return ProblemStatement()
|
| 51 |
+
return ProblemStatement(objective=state.problem_statement)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
async def _record_report_on_state(analysis_id: str, report_id: str) -> None:
|
| 55 |
+
"""Write the new `report_id` back onto the Analysis State (never-throw).
|
| 56 |
+
|
| 57 |
+
Closes the loop so Help's `has_report` and the readiness delta-check can see
|
| 58 |
+
that a report exists. A missing state row / write error must not fail a report
|
| 59 |
+
that already generated and persisted.
|
| 60 |
+
"""
|
| 61 |
+
try:
|
| 62 |
+
from src.agents.state_store import AnalysisStateStore
|
| 63 |
+
|
| 64 |
+
await AnalysisStateStore().update(analysis_id, report_id=report_id)
|
| 65 |
+
except Exception as e: # noqa: BLE001
|
| 66 |
+
logger.warning(
|
| 67 |
+
"report: report_id write-back failed", analysis_id=analysis_id, error=str(e)
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
@router.post(
|
| 72 |
+
"/report",
|
| 73 |
+
response_model=AnalysisReport,
|
| 74 |
+
status_code=status.HTTP_201_CREATED,
|
| 75 |
+
summary="Generate a new report version for an analysis session",
|
| 76 |
+
responses={
|
| 77 |
+
201: {"description": "A new versioned report was generated and persisted."},
|
| 78 |
+
409: {"description": "No analyses recorded for this session yet - nothing to report."},
|
| 79 |
+
500: {"description": "Report generation or persistence failed."},
|
| 80 |
+
},
|
| 81 |
+
)
|
| 82 |
+
@log_execution(logger)
|
| 83 |
+
async def generate_report(
|
| 84 |
+
analysis_id: str = Query(..., description="The analysis session to report on."),
|
| 85 |
+
user_id: str = Query(..., description="Owner of the analysis session."),
|
| 86 |
+
):
|
| 87 |
+
"""Generate, persist, and return a new report version.
|
| 88 |
+
|
| 89 |
+
Each call produces a new version (V1, V2, …) that snapshots the records and
|
| 90 |
+
Problem Statement it used. Server-side gate: the report **floor** - a validated
|
| 91 |
+
goal + ≥1 substantive analysis - the same floor Help's readiness signal uses, so
|
| 92 |
+
the button and Help can't disagree (T-D). The delta-since-report check is NOT
|
| 93 |
+
applied here: a new version is always allowed (decision 4A).
|
| 94 |
+
"""
|
| 95 |
+
from src.agents.gate import stub_analysis_state
|
| 96 |
+
from src.agents.report.readiness import report_floor
|
| 97 |
+
|
| 98 |
+
state = await _load_state(analysis_id)
|
| 99 |
+
floor_missing, _ = await report_floor(
|
| 100 |
+
analysis_id, state or stub_analysis_state(problem_validated=False)
|
| 101 |
+
)
|
| 102 |
+
if floor_missing:
|
| 103 |
+
raise HTTPException(
|
| 104 |
+
status_code=status.HTTP_409_CONFLICT,
|
| 105 |
+
detail="Not ready to generate a report - still needs "
|
| 106 |
+
+ ", ".join(floor_missing)
|
| 107 |
+
+ ".",
|
| 108 |
+
)
|
| 109 |
+
|
| 110 |
+
try:
|
| 111 |
+
problem_statement = _problem_statement_from(state)
|
| 112 |
+
report = await _generator.generate(
|
| 113 |
+
analysis_id, user_id, problem_statement=problem_statement
|
| 114 |
+
)
|
| 115 |
+
except ReportError as e:
|
| 116 |
+
raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e
|
| 117 |
+
except Exception as e:
|
| 118 |
+
logger.error("report generation failed", analysis_id=analysis_id, error=str(e))
|
| 119 |
+
raise HTTPException(
|
| 120 |
+
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 121 |
+
detail=f"Report generation failed: {e}",
|
| 122 |
+
) from e
|
| 123 |
+
|
| 124 |
+
try:
|
| 125 |
+
saved = await _store.save(report)
|
| 126 |
+
except Exception as e:
|
| 127 |
+
logger.error("report persist failed", analysis_id=analysis_id, error=str(e))
|
| 128 |
+
raise HTTPException(
|
| 129 |
+
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 130 |
+
detail=f"Report persistence failed: {e}",
|
| 131 |
+
) from e
|
| 132 |
+
|
| 133 |
+
await _record_report_on_state(analysis_id, saved.report_id)
|
| 134 |
+
return saved
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
@router.get(
|
| 138 |
+
"/report/{analysis_id}",
|
| 139 |
+
response_model=list[ReportVersionEntry],
|
| 140 |
+
summary="List a session's report versions",
|
| 141 |
+
response_description="Version metadata, oldest-first. Empty if none generated yet.",
|
| 142 |
+
)
|
| 143 |
+
@log_execution(logger)
|
| 144 |
+
async def list_report_versions(analysis_id: str):
|
| 145 |
+
"""Return version metadata for a session (for the Analysis-menu sidebar)."""
|
| 146 |
+
try:
|
| 147 |
+
reports = await _store.list_for_analysis(analysis_id)
|
| 148 |
+
except Exception as e:
|
| 149 |
+
logger.error("report list failed", analysis_id=analysis_id, error=str(e))
|
| 150 |
+
raise HTTPException(
|
| 151 |
+
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 152 |
+
detail=f"Failed to list reports: {e}",
|
| 153 |
+
) from e
|
| 154 |
+
|
| 155 |
+
return [
|
| 156 |
+
ReportVersionEntry(
|
| 157 |
+
report_id=r.report_id,
|
| 158 |
+
version=r.version,
|
| 159 |
+
generated_at=r.generated_at,
|
| 160 |
+
record_count=len(r.record_ids),
|
| 161 |
+
)
|
| 162 |
+
for r in reports
|
| 163 |
+
]
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
@router.get(
|
| 167 |
+
"/report/{analysis_id}/{version}",
|
| 168 |
+
response_model=AnalysisReport,
|
| 169 |
+
summary="Fetch one report version",
|
| 170 |
+
responses={404: {"description": "No report at that version for this session."}},
|
| 171 |
+
)
|
| 172 |
+
@log_execution(logger)
|
| 173 |
+
async def get_report_version(analysis_id: str, version: int):
|
| 174 |
+
"""Return the full content of a specific report version."""
|
| 175 |
+
try:
|
| 176 |
+
report = await _store.get(analysis_id, version)
|
| 177 |
+
except Exception as e:
|
| 178 |
+
logger.error("report fetch failed", analysis_id=analysis_id, version=version, error=str(e))
|
| 179 |
+
raise HTTPException(
|
| 180 |
+
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 181 |
+
detail=f"Failed to fetch report: {e}",
|
| 182 |
+
) from e
|
| 183 |
+
|
| 184 |
+
if report is None:
|
| 185 |
+
raise HTTPException(
|
| 186 |
+
status_code=status.HTTP_404_NOT_FOUND,
|
| 187 |
+
detail=f"No report v{version} for analysis {analysis_id!r}.",
|
| 188 |
+
)
|
| 189 |
+
return report
|
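A minimal sketch of the versioning and gate behaviour the three report endpoints describe, assuming versions count up from 1 per analysis. `InMemoryReportStore`, `NotReadyError`, and the simplified `Report` are hypothetical stand-ins for illustration, not the PR's `_store`, `ReportError`, or `AnalysisReport`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Report:
    analysis_id: str
    record_ids: List[str]
    version: int = 0  # assigned by the store on save


@dataclass
class InMemoryReportStore:
    """Hypothetical store: versions count up from 1 per analysis (assumption)."""

    by_analysis: Dict[str, List[Report]] = field(default_factory=dict)

    def save(self, report: Report) -> Report:
        versions = self.by_analysis.setdefault(report.analysis_id, [])
        report.version = len(versions) + 1
        versions.append(report)
        return report

    def list_for_analysis(self, analysis_id: str) -> List[Report]:
        # Oldest-first, matching the list endpoint's description.
        return list(self.by_analysis.get(analysis_id, []))

    def get(self, analysis_id: str, version: int) -> Optional[Report]:
        for report in self.by_analysis.get(analysis_id, []):
            if report.version == version:
                return report
        return None  # the endpoint would turn this into a 404


class NotReadyError(Exception):
    """Stands in for the 409 raised when the report floor is not met."""


def generate_report(
    store: InMemoryReportStore,
    analysis_id: str,
    floor_missing: Sequence[str],
    record_ids: Sequence[str],
) -> Report:
    # Gate first: any missing floor item blocks generation.
    if floor_missing:
        raise NotReadyError("still needs " + ", ".join(floor_missing))
    # No delta check: a new version is always allowed once the floor is met.
    return store.save(Report(analysis_id, list(record_ids)))
```

Calling `generate_report` twice for the same analysis yields versions 1 and 2, and `get` for a version that was never generated returns `None`.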
src/api/v1/tools.py
ADDED
|
@@ -0,0 +1,124 @@
|
| 1 |
+
"""Tool / command catalog API endpoints.
|
| 2 |
+
|
| 3 |
+
Exposes the agent's user-invocable slash-command catalog so the Golang backend
|
| 4 |
+
can cache it and the frontend can render its "/" command menu WITHOUT calling the
|
| 5 |
+
AI agent for every list (Golang GETs + caches `list_tools`).
|
| 6 |
+
|
| 7 |
+
Scope confirmed: the catalog is the UNIFIED set of
|
| 8 |
+
everything the user can invoke via `/`,
|
| 9 |
+
spanning what the team internally splits into skills + analytics tools +
|
| 10 |
+
data-access tools. Naming: verb-first, kebab-case, `/` prefix.
|
| 11 |
+
|
| 12 |
+
Each command maps 1:1 to a real internal tool/intent `name` (the dispatch key);
|
| 13 |
+
the granular data-access tools (check_data, check_knowledge, retrieve_data,
|
| 14 |
+
retrieve_knowledge) are listed separately.
|
| 15 |
+
NOTE: the merged `check` intent still exists for natural-language routing - it is
|
| 16 |
+
NOT a slash command; slash invocation bypasses the router to the tool directly.
|
| 17 |
+
Deferred analytics tools (comparison/contribution/profile/segment) are NOT
|
| 18 |
+
exposed (not wired to the Planner).
|
| 19 |
+
|
| 20 |
+
Stateless and deterministic - safe for the Golang backend to cache.
|
| 21 |
+
"""
|
| 22 |
+
|
| 23 |
+
from typing import Literal
|
| 24 |
+
|
| 25 |
+
from fastapi import APIRouter
|
| 26 |
+
from pydantic import BaseModel
|
| 27 |
+
|
| 28 |
+
from src.middlewares.logging import get_logger, log_execution
|
| 29 |
+
|
| 30 |
+
logger = get_logger("tools_api")
|
| 31 |
+
|
| 32 |
+
router = APIRouter(prefix="/api/v1", tags=["Tools"])
|
| 33 |
+
|
| 34 |
+
CommandType = Literal["skill", "analytics", "data_access"]
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class CommandResponse(BaseModel):
|
| 38 |
+
command: str # FE-facing slash command, e.g. "/analyze-descriptive"
|
| 39 |
+
name: str # internal handler/tool name, e.g. "analyze_descriptive"
|
| 40 |
+
type: CommandType
|
| 41 |
+
description: str
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class ListToolsResponse(BaseModel):
|
| 45 |
+
count: int
|
| 46 |
+
tools: list[CommandResponse]
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# Single source of truth for the FE slash-command catalog. Order = display order.
|
| 50 |
+
# Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the
|
| 51 |
+
# internal route/tool name used by the orchestrator.
|
| 52 |
+
_COMMAND_CATALOG: list[CommandResponse] = [
|
| 53 |
+
CommandResponse(
|
| 54 |
+
command="/help",
|
| 55 |
+
name="help",
|
| 56 |
+
type="skill",
|
| 57 |
+
description="Show what the assistant can do and guide your next step.",
|
| 58 |
+
),
|
| 59 |
+
CommandResponse(
|
| 60 |
+
command="/problem-statement",
|
| 61 |
+
name="problem_statement",
|
| 62 |
+
type="skill",
|
| 63 |
+
description="Define and validate your analysis goal (objective + metric) "
|
| 64 |
+
"before exploring data.",
|
| 65 |
+
),
|
| 66 |
+
CommandResponse(
|
| 67 |
+
command="/analyze-descriptive",
|
| 68 |
+
name="analyze_descriptive",
|
| 69 |
+
type="analytics",
|
| 70 |
+
description="Summary statistics for selected columns (count, mean, min, max, …).",
|
| 71 |
+
),
|
| 72 |
+
CommandResponse(
|
| 73 |
+
command="/analyze-aggregate",
|
| 74 |
+
name="analyze_aggregate",
|
| 75 |
+
type="analytics",
|
| 76 |
+
description="Group and aggregate values (sum, count, average) by dimension.",
|
| 77 |
+
),
|
| 78 |
+
CommandResponse(
|
| 79 |
+
command="/analyze-correlation",
|
| 80 |
+
name="analyze_correlation",
|
| 81 |
+
type="analytics",
|
| 82 |
+
description="Correlation strength between numeric columns.",
|
| 83 |
+
),
|
| 84 |
+
CommandResponse(
|
| 85 |
+
command="/analyze-trend",
|
| 86 |
+
name="analyze_trend",
|
| 87 |
+
type="analytics",
|
| 88 |
+
description="Trend of a value over time at a chosen frequency.",
|
| 89 |
+
),
|
| 90 |
+
CommandResponse(
|
| 91 |
+
command="/check-data",
|
| 92 |
+
name="check_data",
|
| 93 |
+
type="data_access",
|
| 94 |
+
description="Inventory of the available structured data sources.",
|
| 95 |
+
),
|
| 96 |
+
CommandResponse(
|
| 97 |
+
command="/check-knowledge",
|
| 98 |
+
name="check_knowledge",
|
| 99 |
+
type="data_access",
|
| 100 |
+
description="Inventory of the available knowledge / uploaded documents.",
|
| 101 |
+
),
|
| 102 |
+
CommandResponse(
|
| 103 |
+
command="/retrieve-data",
|
| 104 |
+
name="retrieve_data",
|
| 105 |
+
type="data_access",
|
| 106 |
+
description="Pull rows from a structured source for analysis.",
|
| 107 |
+
),
|
| 108 |
+
CommandResponse(
|
| 109 |
+
command="/retrieve-knowledge",
|
| 110 |
+
name="retrieve_knowledge",
|
| 111 |
+
type="data_access",
|
| 112 |
+
description="Retrieve relevant passages from your uploaded documents.",
|
| 113 |
+
),
|
| 114 |
+
]
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
@router.get("/tools", response_model=ListToolsResponse)
|
| 118 |
+
@log_execution(logger)
|
| 119 |
+
async def list_tools() -> ListToolsResponse:
|
| 120 |
+
"""List the user-invocable slash-command catalog (skills + tools).
|
| 121 |
+
|
| 122 |
+
Static per deployment - safe for the Golang backend to cache.
|
| 123 |
+
"""
|
| 124 |
+
return ListToolsResponse(count=len(_COMMAND_CATALOG), tools=_COMMAND_CATALOG)
|
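In the catalog above, every `command` equals its `name` with the leading `/` dropped and hyphens turned into underscores. The sketch below checks that observation. It is an assumption read off the listed entries rather than a rule the PR states, and `command_to_name` and `mismatched_commands` are hypothetical helpers.

```python
from typing import Dict, List


def command_to_name(command: str) -> str:
    """Derive the internal name from a slash command (assumed convention)."""
    if not command.startswith("/"):
        raise ValueError(f"not a slash command: {command!r}")
    return command[1:].replace("-", "_")


def mismatched_commands(catalog: Dict[str, str]) -> List[str]:
    """Return commands whose listed name differs from the derived one."""
    return [cmd for cmd, name in catalog.items() if command_to_name(cmd) != name]


# Command -> name pairs copied from the catalog in this diff.
CATALOG = {
    "/help": "help",
    "/problem-statement": "problem_statement",
    "/analyze-descriptive": "analyze_descriptive",
    "/analyze-aggregate": "analyze_aggregate",
    "/analyze-correlation": "analyze_correlation",
    "/analyze-trend": "analyze_trend",
    "/check-data": "check_data",
    "/check-knowledge": "check_knowledge",
    "/retrieve-data": "retrieve_data",
    "/retrieve-knowledge": "retrieve_knowledge",
}
```

A check like this could run as a unit test so that a new entry with a mistyped `name` is caught before the Golang backend caches the catalog.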
src/catalog/reader.py
CHANGED
|
@@ -45,8 +45,9 @@ class MemoizingCatalogReader(CatalogReader):
|
|
| 45 |
|
| 46 |
One per request. The same per-user catalog is otherwise fetched from the
|
| 47 |
catalog DB 4-5x during a single slow-path run (planner load, then
|
| 48 |
-
|
| 49 |
-
structured read). Wrapping the base reader collapses those
|
|
|
|
| 50 |
per distinct source_hint and pins a single consistent snapshot for the whole
|
| 51 |
request (plan-time and execution-time catalogs can no longer diverge).
|
| 52 |
"""
|
|
|
|
| 45 |
|
| 46 |
One per request. The same per-user catalog is otherwise fetched from the
|
| 47 |
catalog DB 4-5x during a single slow-path run (planner load, then
|
| 48 |
+
check_data's structured read + check_knowledge's unstructured read, then
|
| 49 |
+
retrieve_data's structured read). Wrapping the base reader collapses those
|
| 50 |
+
to one round-trip
|
| 51 |
per distinct source_hint and pins a single consistent snapshot for the whole
|
| 52 |
request (plan-time and execution-time catalogs can no longer diverge).
|
| 53 |
"""
|
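A minimal sketch of the per-request memoization the docstring describes, assuming the reader exposes a single async `read(user_id, source_hint)` method. That method name, `CountingReader`, and `MemoizingReader` are hypothetical; the real `CatalogReader` interface is not shown in this diff.

```python
import asyncio
from typing import Any, Dict, Tuple


class CountingReader:
    """Fake base reader that counts round-trips to the catalog DB."""

    def __init__(self) -> None:
        self.calls = 0

    async def read(self, user_id: str, source_hint: str) -> Dict[str, Any]:
        self.calls += 1
        return {"user_id": user_id, "source_hint": source_hint, "snapshot": self.calls}


class MemoizingReader:
    """Wraps a base reader; create one instance per request."""

    def __init__(self, base: CountingReader) -> None:
        self._base = base
        self._cache: Dict[Tuple[str, str], Dict[str, Any]] = {}

    async def read(self, user_id: str, source_hint: str) -> Dict[str, Any]:
        key = (user_id, source_hint)
        if key not in self._cache:
            # First read for this hint: one round-trip, then pinned for the request.
            self._cache[key] = await self._base.read(user_id, source_hint)
        return self._cache[key]


async def demo() -> int:
    base = CountingReader()
    reader = MemoizingReader(base)
    for _ in range(4):  # planner load, check_data, retrieve_data, ...
        await reader.read("u1", "structured")
    await reader.read("u1", "unstructured")  # check_knowledge
    return base.calls
```

Five sequential reads across two distinct hints cost two round-trips here. This sketch does not guard against two concurrent first reads of the same key, which could both miss the cache.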
src/config/prompts/help.md
ADDED
|
@@ -0,0 +1,107 @@
|
| 1 |
+
<!-- help.md · v1 · Help skill prompt. Bump to v2 (don't silently overwrite) on major change,
|
| 2 |
+
e.g. when real UI steps land from the frontend. See checkpoint 2026-06-18. -->
|
| 3 |
+
|
| 4 |
+
You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
|
| 5 |
+
instruction sheet that comes with a board game: your only job is to tell the user
|
| 6 |
+
**where they are in their analysis and what to do next**, so they are never lost. You do
|
| 7 |
+
**not** do analysis, answer data questions, or invent facts about their data.
|
| 8 |
+
|
| 9 |
+
## What you receive this turn
|
| 10 |
+
|
| 11 |
+
You are given context, never raw user prose to analyze:
|
| 12 |
+
|
| 13 |
+
- **`analysis_state`** - the current per-analysis state. Fields you use:
|
| 14 |
+
- `analysis_title` - what this analysis is called.
|
| 15 |
+
- `problem_statement` - the user's goal (may be empty/weak; it is optional at creation).
|
| 16 |
+
- `problem_validated` (bool) - **the gate.** `false` = the goal still needs work; `true` = the goal is set and analysis is unlocked.
|
| 17 |
+
- `report_id` - `0`/absent means no report has ever been generated.
|
| 18 |
+
- **`chat_history`** - the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself.
|
| 19 |
+
- **`report_ready`** - a **deterministic** signal computed for you (NOT your judgment):
|
| 20 |
+
- `ready` (bool) - whether there is enough analysis to generate a report.
|
| 21 |
+
- `missing` (list) - if not ready, the gaps to fill.
|
| 22 |
+
- **`available_actions`** *(optional)* - which actions are actually wired right now. If present, **only suggest actions listed here.**
|
| 23 |
+
|
| 24 |
+
> **Hard rule - never misguide.** Trust the signals above for *what is possible*, not your
|
| 25 |
+
> own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
|
| 26 |
+
> report. If an action isn't in `available_actions`, do not suggest it. If Help is wrong,
|
| 27 |
+
> the user is wrong.
|
| 28 |
+
|
| 29 |
+
## How to answer - two layers, always
|
| 30 |
+
|
| 31 |
+
1. **Where you are + what's next** - one short sentence locating the user, then the single most useful next step.
|
| 32 |
+
2. **How** - concrete, do-able instructions for that step (not just "you can analyze now" - show *how* to start).
|
| 33 |
+
|
| 34 |
+
Keep it short. Lead with the next step; don't recap everything.
|
| 35 |
+
|
| 36 |
+
## State-tiered guidance
|
| 37 |
+
|
| 38 |
+
Pick the branch that matches `analysis_state` + `report_ready`:
|
| 39 |
+
|
| 40 |
+
### A. `problem_validated == false` → fix the goal first
|
| 41 |
+
The user can't get good analysis without a clear goal. Steer them to define or sharpen the
|
| 42 |
+
problem statement.
|
| 43 |
+
- If `problem_statement` is empty: encourage them to state what they want to find out, and mention the AI can help - they can run **`/problem_statement`** (or just describe their goal in chat).
|
| 44 |
+
- If `problem_statement` exists but is vague: gently push for something more **measurable and concrete** (a target, a metric, a timeframe), grounded in their `analysis_title` and the data they've bound. Give one short example of a sharper version.
|
| 45 |
+
- Do **not** push analysis or reports yet.
|
| 46 |
+
|
| 47 |
+
### B. `problem_validated == true`, little/no analysis yet → orient to analysis
|
| 48 |
+
Tell them the goal is set and they can start asking questions about their data. Give the **how**:
|
| 49 |
+
- Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?").
|
| 50 |
+
- **Tie suggestions back to their `problem_statement`** so the analysis stays relevant - don't suggest random analyses.
|
| 51 |
+
- **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer.
|
| 52 |
+
- You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple.
|
| 53 |
+
|
| 54 |
+
### C. `problem_validated == true`, analysis under way, `report_ready.ready == false` → close the gaps
|
| 55 |
+
They've started but there isn't enough yet for a report. Point at `report_ready.missing` and
|
| 56 |
+
recommend the specific next questions that would fill those gaps (phrase them as questions
|
| 57 |
+
the user can ask), still anchored to the problem statement.
|
| 58 |
+
|
| 59 |
+
### D. `problem_validated == true` and `report_ready.ready == true` → nudge toward the report
|
| 60 |
+
There's enough to report. Encourage them to generate it. Report can be triggered **two ways**:
|
| 61 |
+
the **`/generate report`** skill **or** the report button - mention both so it feels natural.
|
| 62 |
+
Do not over-promise the report's depth.
|
| 63 |
+
|
| 64 |
+
## How-to phrasing (degrade gracefully)
|
| 65 |
+
|
| 66 |
+
- **Via chat / skills** - write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/problem_statement`", "run `/generate report`").
|
| 67 |
+
- **Via the UI (buttons/menus)** - the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)*
|
| 68 |
+
- If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics.
|
| 69 |
+
|
| 70 |
+
## Tone
|
| 71 |
+
|
| 72 |
+
Plain, warm, and encouraging - like a helpful guide, **not** a hype trailer. No exclamation
|
| 73 |
+
spam, no overselling. Respond in the **user's language** (match `chat_history` - Indonesian or
|
| 74 |
+
English). A few sentences is usually enough.
|
| 75 |
+
|
| 76 |
+
## Constraints
|
| 77 |
+
|
| 78 |
+
- You **only** guide. Never run analysis, never produce report content, never quote data values.
|
| 79 |
+
- Never suggest an action that the signals say isn't available or isn't ready.
|
| 80 |
+
- One step at a time - give the next step, not the whole roadmap.
|
| 81 |
+
- When you suggest questions, **dedupe against `chat_history`** - only propose analyses not yet run that move the goal forward; a question that already has an answer adds no fresh evidence.
|
| 82 |
+
- No markdown headers or code fences in your reply; short prose (and an inline `/command` or a tiny bullet list) is fine.
|
| 83 |
+
|
| 84 |
+
## Examples
|
| 85 |
+
|
| 86 |
+
```
|
| 87 |
+
State: problem_validated=false, problem_statement=""
|
| 88 |
+
→ "Looks like we haven't set a goal yet. Tell me what you want to find out - for example,
|
| 89 |
+
'reduce churn next quarter' - or run /problem_statement and I'll help you shape it."
|
| 90 |
+
|
| 91 |
+
State: problem_validated=false, problem_statement="make sales better"
|
| 92 |
+
→ "Your goal is a good start but a bit broad. Let's make it measurable - e.g. 'grow north-region
|
| 93 |
+
revenue by 10% this quarter.' Run /problem_statement and we'll refine it together."
|
| 94 |
+
|
| 95 |
+
State: problem_validated=true, chat_history nearly empty
|
| 96 |
+
→ "Your goal is set - you can start exploring now. Try a basic question first, like
|
| 97 |
+
'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
|
| 98 |
+
what's driving your goal."
|
| 99 |
+
|
| 100 |
+
State: problem_validated=true, report_ready.ready=false, missing=["no comparison over time"]
|
| 101 |
+
→ "Good progress. Before a report, it's worth looking at change over time - try asking
|
| 102 |
+
'How does this quarter compare to last?' Once we have that, we can put the report together."
|
| 103 |
+
|
| 104 |
+
State: problem_validated=true, report_ready.ready=true
|
| 105 |
+
→ "You've covered enough to summarize. You can generate your report now - run /generate report
|
| 106 |
+
or use the report option to create it."
|
| 107 |
+
```
|
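The four state tiers (A to D) in the prompt can be read as a small decision function. The sketch below is one way to express that, assuming "little/no analysis yet" means zero analyses run. That threshold, the `analyses_run` counter, and `pick_help_branch` are illustrative assumptions; in the PR the model picks the branch from `analysis_state`, `chat_history`, and `report_ready`.

```python
def pick_help_branch(problem_validated: bool, analyses_run: int, report_ready: bool) -> str:
    """Map the deterministic signals to the prompt's branch letter."""
    if not problem_validated:
        return "A"  # fix the goal first; never push analysis or reports
    if report_ready:
        return "D"  # enough evidence: nudge toward the report
    if analyses_run == 0:
        return "B"  # goal set, nothing run yet: suggest starter questions
    return "C"      # under way but gaps remain: point at report_ready.missing
```

Ordering matters: the goal gate is checked first, so a stale `report_ready` signal cannot lead Help to suggest a report while the goal is still unvalidated.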
src/config/prompts/intent_router.md
CHANGED
|
@@ -1,82 +1,119 @@
|
|
| 1 |
-
You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream
|
| 2 |
|
| 3 |
## Output
|
| 4 |
|
| 5 |
Return three fields:
|
| 6 |
|
| 7 |
-
- **`
|
| 8 |
-
-
|
| 9 |
-
- `
|
| 10 |
-
- `
|
| 11 |
-
- `
|
| 12 |
-
-
|
| 13 |
|
| 14 |
## Routing rules
|
| 15 |
|
| 16 |
-
1.
|
| 17 |
-
2.
|
| 18 |
-
3.
|
| 19 |
-
4.
|
| 20 |
-
5.
|
| 21 |
|
| 22 |
## Rewriting follow-ups
|
| 23 |
|
| 24 |
-
When history is present and the new message references prior context
|
| 25 |
|
| 26 |
History: "What was our top product last month?" → "Pro Plan Annual at $487k"
|
| 27 |
Message: "How does that compare to Q1?"
|
| 28 |
rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"
|
| 29 |
|
| 30 |
-
If the original is already standalone, copy it verbatim into rewritten_query.
|
| 31 |
|
| 32 |
## Few-shot examples
|
| 33 |
|
| 34 |
```
|
| 35 |
User: "Hi"
|
| 36 |
-
→
|
| 37 |
|
| 38 |
User: "Bye, thanks"
|
| 39 |
-
→
|
| 40 |
|
| 41 |
User: "What can you do?"
|
| 42 |
-
→
|
| 43 |
|
| 44 |
-
User: "
|
| 45 |
-
→
|
| 46 |
-
|
| 47 |
|
| 48 |
User: "What does the Q1 board memo say about churn?"
|
| 49 |
-
→
|
| 50 |
-
rewritten_query="What does the Q1 board memo say about churn?"
|
| 51 |
|
| 52 |
-
User: "
|
| 53 |
-
→
|
| 54 |
-
rewritten_query="Top 5 customers by revenue this year"
|
| 55 |
|
| 56 |
User: "apa key feature dari iot connectivity?"
|
| 57 |
-
→
|
| 58 |
-
rewritten_query="What are the key features of IoT connectivity?"
|
| 59 |
|
| 60 |
-
User: "
|
| 61 |
-
→
|
| 62 |
-
rewritten_query="
|
| 63 |
|
| 64 |
-
User: "
|
| 65 |
-
→
|
| 66 |
-
rewritten_query="
|
| 67 |
|
| 68 |
-
User: "
|
| 69 |
-
→
|
| 70 |
-
rewritten_query="
|
|
|
|
| 71 |
|
| 72 |
History: assistant: "Pro Plan Annual led at $487,200 in April."
|
| 73 |
User: "And in March?"
|
| 74 |
-
→
|
| 75 |
-
rewritten_query="What was Pro Plan Annual's revenue in March?"
|
| 76 |
```
|
| 77 |
|
| 78 |
## Constraints
|
| 79 |
|
| 80 |
-
-
|
|
|
|
| 81 |
- Do not refuse - refusal happens later in guardrails. Just classify.
|
| 82 |
- One JSON object as output; no prose, no markdown.
|
|
|
|
| 1 |
+
You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream **handler** should process it. You classify the route only - you do not answer the question.
|
| 2 |
|
| 3 |
## Output
|
| 4 |
|
| 5 |
Return three fields:
|
| 6 |
|
| 7 |
+
- **`intent`** - exactly one of:
|
| 8 |
+
- `chat` - conversational, no data needed: greetings, farewells, thanks, "how are you", "what can you do", small talk.
|
| 9 |
+
- `help` - the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?").
|
| 10 |
+
- `problem_statement` - the user wants to **define or refine the analysis goal**: the business problem, objectives, what to increase/decrease, targets/success metrics - or is answering questions about the goal.
|
| 11 |
+
- `check` - the user wants an **inventory** of what they have: "what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset". This is metadata/listing, not analysis.
|
| 12 |
+
- `unstructured_flow` - the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document.
|
| 13 |
+
- `structured_flow` - the user asks an **analytical question over their data**: counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources. This routes to the slow analytical path.
|
| 14 |
+
- **`rewritten_query`** - a **standalone** version of the user's question, with context from history resolved. If the message is already standalone, copy it verbatim. Leave empty/null for `chat` and `help`.
|
| 15 |
+
- **`confidence`** - your confidence in the chosen intent, a number in [0, 1].
|
| 16 |
|
| 17 |
## Routing rules
|
| 18 |
|
| 19 |
+
1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`.
|
| 20 |
+
2. "What do I do next / how do I proceed / where do I start" → `help`.
|
| 21 |
+
3. The user states or refines a goal, objective, target, or success metric, or answers a goal-defining question → `problem_statement`.
|
| 22 |
+
4. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`.
|
| 23 |
+
5. A question answerable from document prose - a topic, concept, feature, explanation, summary, or factual knowledge, even without naming a document → `unstructured_flow`.
|
| 24 |
+
6. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`.
|
| 25 |
+
|
| 26 |
+
## Disambiguation (the boundaries that matter)
|
| 27 |
+
|
| 28 |
+
- **`check` vs `structured_flow`** - "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`.
|
| 29 |
+
- **`unstructured_flow` vs `structured_flow`** - pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When in doubt (the question is analytical AND also needs document context) → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation.
|
| 30 |
+
- **`help` vs `problem_statement`** - "what's next?" → `help`; "here is my goal / let's define the objective" → `problem_statement`.
|
| 31 |
+
- **`chat` vs everything else** - only use `chat` when there is no task and no data question at all.
|
| 32 |
|
| 33 |
## Rewriting follow-ups
|
| 34 |
|
| 35 |
+
When history is present and the new message references prior context with pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand `rewritten_query` into a fully standalone question. Example:
|
| 36 |
|
| 37 |
History: "What was our top product last month?" → "Pro Plan Annual at $487k"
|
| 38 |
Message: "How does that compare to Q1?"
|
| 39 |
rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?"
|
| 40 |
|
| 41 |
+
If the original is already standalone, copy it verbatim into `rewritten_query`.
|
| 42 |
|
| 43 |
## Few-shot examples
|
| 44 |
|
| 45 |
```
|
| 46 |
User: "Hi"
|
| 47 |
+
→ intent="chat", rewritten_query=null, confidence=0.99
|
| 48 |
|
| 49 |
User: "Bye, thanks"
|
| 50 |
+
→ intent="chat", rewritten_query=null, confidence=0.99
|
| 51 |
|
| 52 |
User: "What can you do?"
|
| 53 |
+
→ intent="chat", rewritten_query=null, confidence=0.95
|
| 54 |
|
| 55 |
+
User: "Okay I uploaded my data, what do I do next?"
|
| 56 |
+
→ intent="help", rewritten_query=null, confidence=0.93
|
| 57 |
+
|
| 58 |
+
User: "How does this work? Where should I start?"
|
| 59 |
+
→ intent="help", rewritten_query=null, confidence=0.9
|
| 60 |
+
|
| 61 |
+
User: "I want to reduce customer churn next quarter, target under 5%."
|
| 62 |
+
→ intent="problem_statement",
|
| 63 |
+
rewritten_query="Define the analysis goal: reduce customer churn next quarter to under 5%.",
|
| 64 |
+
confidence=0.9
|
| 65 |
+
|
| 66 |
+
User: "My goal is to grow revenue in the north region."
|
| 67 |
+
→ intent="problem_statement",
|
| 68 |
+
rewritten_query="Define the analysis goal: grow revenue in the north region.",
|
| 69 |
+
confidence=0.88
|
| 70 |
+
|
| 71 |
+
User: "What data do I have?"
|
| 72 |
+
→ intent="check", rewritten_query="What data sources do I have?", confidence=0.95
|
| 73 |
+
|
| 74 |
+
User: "What columns are in the orders table?"
|
| 75 |
+
→ intent="check", rewritten_query="What columns are in the orders table?", confidence=0.93
|
| 76 |
+
|
| 77 |
+
User: "What documents have I uploaded?"
|
| 78 |
+
→ intent="check", rewritten_query="What documents have I uploaded?", confidence=0.93
|
| 79 |
|
| 80 |
User: "What does the Q1 board memo say about churn?"
|
| 81 |
+
→ intent="unstructured_flow",
|
| 82 |
+
rewritten_query="What does the Q1 board memo say about churn?", confidence=0.9
|
| 83 |
|
| 84 |
+
User: "jelaskan tentang machine learning"
|
| 85 |
+
→ intent="unstructured_flow", rewritten_query="Explain machine learning", confidence=0.85
|
|
|
|
| 86 |
|
| 87 |
User: "apa key feature dari iot connectivity?"
|
| 88 |
+
→ intent="unstructured_flow",
|
| 89 |
+
rewritten_query="What are the key features of IoT connectivity?", confidence=0.85
|
| 90 |
|
| 91 |
+
User: "How many orders did we get last month?"
|
| 92 |
+
→ intent="structured_flow",
|
| 93 |
+
rewritten_query="How many orders did we get last month?", confidence=0.92
|
| 94 |
+
|
| 95 |
+
User: "Top 5 customers by revenue this year"
|
| 96 |
+
→ intent="structured_flow",
|
| 97 |
+
rewritten_query="Top 5 customers by revenue this year", confidence=0.93
|
| 98 |
|
| 99 |
+
User: "Is there a correlation between discount and units sold?"
|
| 100 |
+
→ intent="structured_flow",
|
| 101 |
+
rewritten_query="Is there a correlation between discount and units sold?", confidence=0.9
|
| 102 |
|
| 103 |
+
User: "How has monthly revenue trended by region, and what stands out?"
|
| 104 |
+
→ intent="structured_flow",
|
| 105 |
+
rewritten_query="How has monthly revenue trended by region this year, and what is unusual?",
|
| 106 |
+
confidence=0.88
|
| 107 |
|
| 108 |
History: assistant: "Pro Plan Annual led at $487,200 in April."
|
| 109 |
User: "And in March?"
|
| 110 |
+
→ intent="structured_flow",
|
| 111 |
+
rewritten_query="What was Pro Plan Annual's revenue in March?", confidence=0.9
|
| 112 |
```
|
| 113 |
|
| 114 |
## Constraints
|
| 115 |
|
| 116 |
+
- Pick exactly one `intent`. Do not invent values outside the six listed.
|
| 117 |
+
- Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved.
|
| 118 |
- Do not refuse - refusal happens later in guardrails. Just classify.
|
| 119 |
- One JSON object as output; no prose, no markdown.
|
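The router prompt promises one JSON object with `intent`, `rewritten_query`, and `confidence`. A caller may still want to validate that shape before dispatching. The sketch below is one hedged way to do so, using only the constraints the prompt states. `validate_router_output` is a hypothetical helper, and normalizing `rewritten_query` to `None` for `chat` and `help` is an assumption about how "empty/null" is handled downstream.

```python
import json
from typing import Any, Dict

# The six intents listed in the prompt.
INTENTS = {
    "chat",
    "help",
    "problem_statement",
    "check",
    "unstructured_flow",
    "structured_flow",
}


def validate_router_output(raw: str) -> Dict[str, Any]:
    """Parse and check the router's JSON; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("router output must be a single JSON object")

    intent = data.get("intent")
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")

    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        raise ValueError("confidence must be a number in [0, 1]")

    query = data.get("rewritten_query")
    if intent in {"chat", "help"}:
        query = None  # assumption: ignore any query for these intents
    elif not isinstance(query, str) or not query.strip():
        raise ValueError(f"intent {intent!r} needs a non-empty rewritten_query")

    return {"intent": intent, "rewritten_query": query, "confidence": float(confidence)}
```

Rejecting out-of-set intents at this boundary keeps a malformed model reply from reaching a handler lookup keyed on `intent`.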