/fix check and help tool
#8
by rhbt6767 - opened
- REPO_STATUS.md +32 -13
- eval/help/README.md +77 -0
- eval/help/__init__.py +0 -0
- eval/help/help_dataset.json +150 -0
- eval/help/run_eval.py +428 -0
- eval/readiness/readiness_dataset.json +19 -22
- eval/readiness/run_eval.py +5 -4
- main.py +2 -2
- src/agents/handlers/help.py +94 -5
- src/agents/planner/inputs.py +39 -0
- src/catalog/fk_inference.py +97 -0
- src/catalog/render.py +7 -1
- src/catalog/store.py +24 -8
- src/config/prompts/help.md +29 -11
- src/config/prompts/planner.md +14 -9
- src/config/settings.py +6 -0
- src/db/postgres/models.py +32 -9
- src/query/executor/db.py +6 -2
REPO_STATUS.md
CHANGED
|
@@ -2,7 +2,7 @@
|
|
| 2 |
|
| 3 |
**Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
|
| 4 |
**Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
|
| 5 |
-
**Snapshot date:** 2026-06-25. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
|
| 6 |
the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
|
| 7 |
own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
|
| 8 |
**`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
|
|
@@ -178,7 +178,7 @@ unless `SKIP_INIT_DB=true`.
|
|
| 178 |
|---|---|---|---|
|
| 179 |
| `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
|
| 180 |
| `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
|
| 181 |
-
| `data_catalog`
|
| 182 |
| `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
|
| 183 |
| `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
|
| 184 |
| `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
|
|
@@ -186,16 +186,21 @@ unless `SKIP_INIT_DB=true`.
|
|
| 186 |
| `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
|
| 187 |
| `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
|
| 188 |
|
| 189 |
-
>
|
| 190 |
-
>
|
| 191 |
-
> `
|
| 192 |
-
>
|
| 193 |
-
>
|
| 194 |
-
>
|
| 195 |
|
| 196 |
**Catalog shape** (the jsonb in `data_catalog`):
|
| 197 |
`Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
|
| 198 |
|
| 199 |
**QueryIR shape** (`src/query/ir/models.py`):
|
| 200 |
`{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
|
| 201 |
Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
|
|
@@ -286,7 +291,7 @@ only.
|
|
| 286 |
|---|---|---|---|
|
| 287 |
| `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
|
| 288 |
| `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
|
| 289 |
-
| `SKIP_INIT_DB` |
|
| 290 |
| `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
|
| 291 |
|
| 292 |
---
|
|
@@ -309,8 +314,8 @@ copies disagree with the current code on:
|
|
| 309 |
|
| 310 |
## 12. dedorch migration — current state
|
| 311 |
|
| 312 |
-
The Python DB
|
| 313 |
-
consumer-only). State **re-verified against the Go source 2026-06-29**:
|
| 314 |
|
| 315 |
- **The dedorch migrations now live IN the Go repo** — embedded SQL at
|
| 316 |
`internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
|
|
@@ -325,8 +330,15 @@ consumer-only). State **re-verified against the Go source 2026-06-29**:
|
|
| 325 |
`rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
|
| 326 |
- **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
|
| 327 |
**Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
|
| 328 |
-
-
|
| 329 |
-
|
| 330 |
|
| 331 |
**⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
|
| 332 |
(`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
|
|
@@ -348,6 +360,13 @@ records-based report; floor: ≥1 `analyze_*` success). Wiring Go → Python is
|
|
| 348 |
values are always parameterized.
|
| 349 |
- **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
|
| 350 |
exposes them as `azureai_api_key_4o`.
|
| 351 |
- **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
|
| 352 |
record persistence, report summary). Failures degrade into soft output rather than raising — good
|
| 353 |
for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).
|
|
|
|
| 2 |
|
| 3 |
**Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
|
| 4 |
**Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
|
| 5 |
+
**Snapshot date:** 2026-06-25. **Data-layer reconcile 2026-07-01:** §8/§12 updated — dedorch cutover done, `data_catalog` model reconciled. **Query-path fix 2026-07-02:** §8/§13 — dedorch catalogs ship no FKs → Python infers them (`fk_inference.py`); shared-Fernet-key gotcha documented. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
|
| 6 |
the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
|
| 7 |
own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
|
| 8 |
**`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
|
|
|
|
| 178 |
|---|---|---|---|
|
| 179 |
| `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
|
| 180 |
| `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
|
| 181 |
+
| `data_catalog` *(dedorch, Go-owned)* | `id` uuid, `scope_type` ('user'\|'analysis'), `user_id`, `analysis_id`, **`catalog_payload`** jsonb (the `Catalog`: Source → Table → Column), schema_version, generated_at, updated_at; partial-unique on `user_id WHERE scope_type='user'` | **Go `catalog.Service`** (all writes: DB/file ingestion) | CatalogReader → CatalogStore (**read-only**), planner, tools |
|
| 182 |
| `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
|
| 183 |
| `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
|
| 184 |
| `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
|
|
|
|
| 186 |
| `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
|
| 187 |
| `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
|
| 188 |
|
| 189 |
+
> ✅ **Python ORM ↔ dedorch drift — reconciled 2026-07-01.** `AnalysisStateRow` (`analyses`) dropped
|
| 190 |
+
> `problem_statement`/`problem_validated` and added `objective`/`business_questions` (Harry's #3);
|
| 191 |
+
> `data_catalog` was the last stale model. Its `Catalog` ORM (old `user_id`-PK + `data` jsonb) is now
|
| 192 |
+
> the dedorch shape (`id` PK, `scope_type`, **`catalog_payload`**), and `CatalogStore` reads
|
| 193 |
+
> `catalog_payload WHERE scope_type='user'` (matching Go's `catalog.Service`). This closed a **live
|
| 194 |
+
> bug**: the `check` skill / `CatalogReader` still selected the dropped `data_catalog.data` column, so
|
| 195 |
+
> every catalog read 500'd after the cutover ("what data do I have" → *"Sorry, I couldn't look that up:
|
| 196 |
+
> column data_catalog.data does not exist"*). Python's catalog **write** methods (`upsert`/
|
| 197 |
+
> `remove_source`/`StructuredPipeline`) were reconciled but are now **legacy** — Go owns ingestion.
|
| 198 |
|
| 199 |
**Catalog shape** (the jsonb in `data_catalog`):
|
| 200 |
`Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
|
| 201 |
|
| 202 |
+
> ⚠️ **dedorch catalogs ship empty `foreign_keys`** (Go's introspection drops FK constraints), yet the IR validator only allows FK-backed joins — so every cross-table question failed validation until 2026-07-02. `src/catalog/fk_inference.py` (wired into `CatalogStore.get`) now infers the obvious `<base>_id → <table>.id` edges at read time: conservative (single unambiguous target, matching `data_type`, schema sources only) and **self-disabling** once any real FK is present. It's a **stopgap** — the durable fix is Go emitting real FKs during introspection.
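The inference rule described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `fk_inference.py`: the `Column`/`Table` dataclasses, the `infer_foreign_keys` name, the naive singular/plural table-name match, and the shape of the returned edge dicts are all assumptions made for the example. The "schema sources only" check is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    name: str
    columns: list[Column]
    foreign_keys: list[dict] = field(default_factory=list)


def infer_foreign_keys(tables: list[Table]) -> dict[str, list[dict]]:
    """Return inferred `<base>_id -> <table>.id` edges keyed by table name."""
    # Self-disabling: if any table already carries a real FK, infer nothing.
    if any(t.foreign_keys for t in tables):
        return {}

    inferred: dict[str, list[dict]] = {}
    for table in tables:
        for col in table.columns:
            if not col.name.endswith("_id"):
                continue
            base = col.name[: -len("_id")]
            # Hypothetical matching rule: target table is named `base` or `base + "s"`.
            targets = [
                t for t in tables
                if t is not table and t.name in (base, base + "s")
            ]
            if len(targets) != 1:
                continue  # no target, or more than one: too ambiguous to infer
            target = targets[0]
            pk = next((c for c in target.columns if c.name == "id"), None)
            if pk is None or pk.data_type != col.data_type:
                continue  # require an `id` column with the same data_type
            inferred.setdefault(table.name, []).append(
                {"column": col.name, "ref_table": target.name, "ref_column": "id"}
            )
    return inferred
```

Being conservative here means preferring a missed join over a wrong one: a column whose target is ambiguous or whose type differs is skipped.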
|
| 203 |
+
|
| 204 |
**QueryIR shape** (`src/query/ir/models.py`):
|
| 205 |
`{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
|
| 206 |
Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
|
|
|
|
| 291 |
|---|---|---|---|
|
| 292 |
| `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
|
| 293 |
| `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
|
| 294 |
+
| `SKIP_INIT_DB` | `settings.skip_init_db` (.env/env) | **on** | Skip `init_db()` on startup — the dedorch cutover switch. **Defaults TRUE** (Go owns the dedorch schema); set `false` only for a local Python-owned DB. |
|
| 295 |
| `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
|
| 296 |
|
| 297 |
---
|
|
|
|
| 314 |
|
| 315 |
## 12. dedorch migration — current state
|
| 316 |
|
| 317 |
+
The Python DB has moved from `dataeyond` → **dedorch** (cutover 2026-07-01; Go owns dedorch migrations;
|
| 318 |
+
Python is consumer-only). State **re-verified against the Go source 2026-06-29**:
|
| 319 |
|
| 320 |
- **The dedorch migrations now live IN the Go repo** — embedded SQL at
|
| 321 |
`internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
|
|
|
|
| 330 |
`rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
|
| 331 |
- **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
|
| 332 |
**Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
|
| 333 |
+
- **Connection-string cutover DONE (2026-07-01).** Python's `postgres_connstring` now points at
|
| 334 |
+
**dedorch** and reads the Go-migrated tables directly. Every ORM model Python reads (`analyses`,
|
| 335 |
+
`data_sources`, `analyses_messages`, `data_catalog`) has been reconciled to its dedorch shape.
|
| 336 |
+
**`init_db()` is now skipped by default** (`settings.skip_init_db` defaults **True**): its privileged
|
| 337 |
+
DDL (`ALTER TABLE rooms …`, index creation) fails on Go-owned tables
|
| 338 |
+
(`InsufficientPrivilegeError: must be owner of table rooms`). Skipping is safe — Go migration `0001`
|
| 339 |
+
already provides the `vector` extension + the langchain FTS index. Set `SKIP_INIT_DB=false` (.env or
|
| 340 |
+
env) only for a local Python-owned DB. `report_inputs` is not in any Go migration yet (#22) — create
|
| 341 |
+
it in dedorch before enabling the slow path, else report/slow-path writes fail (chat path unaffected).
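As a rough sketch of the default-skip behaviour described in the last bullet: the helper names below (`parse_skip_init_db`, `startup`) are hypothetical and stand in for whatever `settings.skip_init_db` and the startup hook actually do. The point is only that an unset `SKIP_INIT_DB` means "skip", and that `init_db()` runs only on an explicit false value.

```python
def parse_skip_init_db(env: dict[str, str]) -> bool:
    """Read SKIP_INIT_DB from a mapping, defaulting to True when it is unset."""
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        return True
    # Treat common "off" spellings as False; anything else keeps the skip on.
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def startup(env: dict[str, str], init_db) -> bool:
    """Call init_db() only when skipping is disabled; return whether it ran."""
    if parse_skip_init_db(env):
        return False
    init_db()
    return True
```

In a real process the mapping would be `os.environ` (or the loaded `.env`), and `init_db` would be the privileged DDL routine that fails against Go-owned tables.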
|
| 342 |
|
| 343 |
**⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
|
| 344 |
(`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
|
|
|
|
| 360 |
values are always parameterized.
|
| 361 |
- **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
|
| 362 |
exposes them as `azureai_api_key_4o`.
|
| 363 |
+
- **Shared Fernet key across repos (gotcha).** User DB credentials in `databases` are written and
|
| 364 |
+
encrypted by **Go** and decrypted by Python; both read the **same** env var
|
| 365 |
+
`dataeyond__db__credential__key` (Go: `configs/app.yaml` → `credentials.fernet_key`). The two
|
| 366 |
+
deployments MUST hold the **identical value** or Python's decrypt throws
|
| 367 |
+
`cryptography.fernet.InvalidToken` — whose `str()` is **empty**, so it logged as `error=""` and
|
| 368 |
+
masqueraded as a DB-connection failure (the executor now logs `repr(e)` to expose it). Tell-apart:
|
| 369 |
+
a valid-but-wrong key → `InvalidToken`; a malformed key → a non-empty `ValueError` at cipher build.
|
| 370 |
- **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
|
| 371 |
record persistence, report summary). Failures degrade into soft output rather than raising — good
|
| 372 |
for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).
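The Fernet gotcha above comes down to how Python renders an exception raised without a message. The sketch below uses a local stand-in class rather than importing `cryptography`, so it runs with the standard library alone; `InvalidToken` here is a placeholder for `cryptography.fernet.InvalidToken`, and `describe_error` / `decrypt_or_log` are hypothetical helpers, not the executor's real code.

```python
import logging

logger = logging.getLogger("executor")


class InvalidToken(Exception):
    """Stand-in for cryptography.fernet.InvalidToken (raised with no message)."""


def describe_error(exc: Exception) -> dict[str, str]:
    """Show what each rendering of the exception would put in a log line."""
    return {"str": str(exc), "repr": repr(exc)}


def decrypt_or_log(decrypt, token: bytes):
    """Never-throw wrapper: return the plaintext, or None after logging."""
    try:
        return decrypt(token)
    except Exception as exc:
        # %r keeps the exception type visible even when str(exc) is empty.
        logger.error("credential decrypt failed: error=%r", exc)
        return None
```

With `%s` the log line for a wrong key would read `error=` and nothing more, while `%r` yields `error=InvalidToken()`, which is enough to tell a key mismatch from a connection failure.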
|
eval/help/README.md
ADDED
|
@@ -0,0 +1,77 @@
|
|
| 1 |
+
# Help-skill eval
|
| 2 |
+
|
| 3 |
+
Scores the **live** Help skill (`src/agents/handlers/help.HelpAgent`) — the guide that
|
| 4 |
+
tells a user where they are and what to do next. Each golden case declares an analysis
|
| 5 |
+
state + report-readiness + chat history; the runner streams `HelpAgent.astream` for real
|
| 6 |
+
and asserts the **rules** the reply must obey.
|
| 7 |
+
|
| 8 |
+
Unlike `eval/readiness` (deterministic, no LLM), this calls the model, so it needs a
|
| 9 |
+
working `.env` (Azure OpenAI) and spends tokens. Run it before a deploy that touches
|
| 10 |
+
`config/prompts/help.md` — not on every commit. The fast, no-LLM guard is
|
| 11 |
+
`tests/unit/agents/handlers/test_help.py` (fake chain); this is the end-to-end
|
| 12 |
+
"does the model actually obey the prompt" layer on top.
|
| 13 |
+
|
| 14 |
+
## Run
|
| 15 |
+
|
| 16 |
+
```bash
|
| 17 |
+
uv run python -m eval.help.run_eval
|
| 18 |
+
uv run python -m eval.help.run_eval --limit 4 # smoke test
|
| 19 |
+
uv run python -m eval.help.run_eval --no-table # summary only
|
| 20 |
+
```
|
| 21 |
+
|
| 22 |
+
Each run writes a timestamped `results/help_result_<ts>.json` (never overwritten,
|
| 23 |
+
diffable across runs).
|
| 24 |
+
|
| 25 |
+
## What it measures
|
| 26 |
+
|
| 27 |
+
Not accuracy — Help replies are free prose with no single correct wording. The metric is
|
| 28 |
+
**compliance**: the % of cases whose reply obeys every rule asserted for it.
|
| 29 |
+
|
| 30 |
+
- **`language`** — the reply must match the user's language. This is the regression guard
|
| 31 |
+
for the button-path bug (`/tools/help` passes `message=None`, and the reply used to
|
| 32 |
+
default to English even for an Indonesian conversation).
|
| 33 |
+
- **`report_guard`** — never suggest generating a report when `report_ready.ready=false`;
|
| 34 |
+
do suggest it when `true`. Since `generate_report` is the only gated action, this also
|
| 35 |
+
serves as the "no action leakage" check.
|
| 36 |
+
- **`orientation`** — quality of the suggested starter questions. **Manual review**: these
|
| 37 |
+
run but are excluded from the auto compliance rate. Read their `output_text` in the JSON.
|
| 38 |
+
|
| 39 |
+
Assertion types: `language_match {expected}`, `must_not_contain_any {patterns}`,
|
| 40 |
+
`must_contain_any {patterns}`.
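A minimal sketch of how the three assertion types could be evaluated against a reply. `run_eval.py` is not shown here, so the function names, the plain case-insensitive substring matching, and the injected `detect` callable (standing in for `help._detect_reply_language`) are assumptions for illustration.

```python
def must_contain_any(text: str, patterns: list[str]) -> bool:
    """True if at least one pattern occurs in text, ignoring case."""
    lowered = text.lower()
    return any(p.lower() in lowered for p in patterns)


def must_not_contain_any(text: str, patterns: list[str]) -> bool:
    """True if none of the patterns occur in text, ignoring case."""
    return not must_contain_any(text, patterns)


def language_match(text: str, expected: str, detect) -> bool:
    """True if the detector's label for the reply equals the expected label."""
    return detect(text) == expected


def check_case(text: str, asserts: list[dict], detect) -> bool:
    """A case complies only if every asserted rule holds for the reply."""
    for spec in asserts:
        kind = spec["type"]
        if kind == "language_match":
            ok = language_match(text, spec["expected"], detect)
        elif kind == "must_contain_any":
            ok = must_contain_any(text, spec["patterns"])
        elif kind == "must_not_contain_any":
            ok = must_not_contain_any(text, spec["patterns"])
        else:
            raise ValueError(f"unknown assertion type: {kind}")
        if not ok:
            return False
    return True
```

Raising on an unknown `type` keeps a typo in the dataset from silently counting as a pass.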
|
| 41 |
+
|
| 42 |
+
## Held-out vs carried-over (why the summary splits them)
|
| 43 |
+
|
| 44 |
+
`carried_over: true` cases **mirror an example in `help.md`** — the case `id` *is* the
|
| 45 |
+
prompt's `<!-- id: ... -->`. They are a regression guard: if the prompt is refactored, the
|
| 46 |
+
demonstrated rule must still hold. What is mirrored is the **input spec + the assertion**,
|
| 47 |
+
never the example's reply text (temperature > 0 makes exact match invalid).
|
| 48 |
+
|
| 49 |
+
Held-out cases (`carried_over: false`) are **absent from the prompt**; their compliance is
|
| 50 |
+
the real generalization signal. If held-out compliance drops while carried-over stays at
|
| 51 |
+
100%, the prompt is overfitting to its own examples ("train on test set"). That's why the
|
| 52 |
+
two are reported separately.
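One way the split summary could be computed, as a sketch under assumptions: the per-case result keys (`carried_over`, `manual_review`, `passed`, `errored`) and the `compliance` function are hypothetical, chosen to mirror the dataset fields and the rules stated in this README (manual-review cases excluded, errored cases counted as failures).

```python
def compliance(results: list[dict]) -> dict[str, dict]:
    """Summarize pass rates separately for held-out and carried-over cases."""
    # Manual-review cases run but are excluded from the automatic rate.
    scored = [r for r in results if not r["manual_review"]]

    buckets: dict[str, list[dict]] = {"held_out": [], "carried_over": []}
    for r in scored:
        buckets["carried_over" if r["carried_over"] else "held_out"].append(r)

    summary: dict[str, dict] = {}
    for name, rows in buckets.items():
        # An errored case never counts as a pass, whatever its `passed` flag says.
        passed = sum(1 for r in rows if r["passed"] and not r["errored"])
        summary[name] = {
            "cases": len(rows),
            "passed": passed,
            "errored": sum(1 for r in rows if r["errored"]),
            "rate": passed / len(rows) if rows else None,
        }
    return summary
```

Comparing `summary["held_out"]["rate"]` against `summary["carried_over"]["rate"]` across runs is what surfaces the overfitting pattern described above.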
|
| 53 |
+
|
| 54 |
+
**Sync rule (manual, like `intent`):** if `help.md`'s Examples change, keep the mirrored
|
| 55 |
+
`id`s here in sync. Current mirrored ids: `help_ex_orient`, `help_ex_guard_delta`,
|
| 56 |
+
`help_ex_guard_ready`.
|
| 57 |
+
|
| 58 |
+
## Dataset
|
| 59 |
+
|
| 60 |
+
`help_dataset.json` — see the `_about` / `_carried_over` doc keys in the file. Language
|
| 61 |
+
detection reuses `help._detect_reply_language`; `report_ready.missing` uses the codes
|
| 62 |
+
`analysis` / `delta`, which the runner maps to the real strings that `is_report_ready` emits.
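The code-to-string mapping could look roughly like this. The placeholder strings below are invented for the example; in the runner the values would be the imported `_MISSING_ANALYSIS` / `_MISSING_DELTA` constants, whose actual wording is not shown here. `expand_missing` is likewise a hypothetical name.

```python
# Placeholder values: the real strings come from is_report_ready's constants.
MISSING_STRINGS = {
    "analysis": "<_MISSING_ANALYSIS placeholder>",
    "delta": "<_MISSING_DELTA placeholder>",
}


def expand_missing(codes: list[str]) -> list[str]:
    """Translate dataset codes into the strings the Help skill would receive."""
    unknown = [c for c in codes if c not in MISSING_STRINGS]
    if unknown:
        raise ValueError(f"unknown report_ready.missing codes: {unknown}")
    return [MISSING_STRINGS[c] for c in codes]
```

Keeping codes in the dataset and strings in one mapping means a wording change in `is_report_ready` touches no golden case.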
|
| 63 |
+
|
| 64 |
+
## Known limitations
|
| 65 |
+
|
| 66 |
+
- **Compliance is approximate across runs.** `HelpAgent` runs at `temperature=0.3`, so the
|
| 67 |
+
reply varies; a borderline case can flip pass/fail between runs. Treat the rate as a
|
| 68 |
+
signal, not a fixed number — re-run before trusting a single-point drop.
|
| 69 |
+
- **`language_match` grades with the same detector the feature uses** (`_detect_reply_language`
|
| 70 |
+
over the reply). It verifies the model obeyed the `[Reply language]` directive, assuming the
|
| 71 |
+
detector is correct — the detector itself is unit-tested separately in
|
| 72 |
+
`tests/unit/agents/handlers/test_help.py`. It can also misfire on a reply that mixes
|
| 73 |
+
languages (e.g. an Indonesian reply quoting an English business question).
|
| 74 |
+
- **Errored cases (stream crash) count as failures, not rule violations.** If `astream` raises
|
| 75 |
+
(Azure down, timeout), the case is flagged `errored` and reported under a separate `ERRORED`
|
| 76 |
+
line — assertions are NOT run on the error string (a crash must not trivially "pass" a
|
| 77 |
+
`must_not_contain_any`). A run with errors is not a clean pass; re-run once the cause clears.
|
eval/help/__init__.py
ADDED
|
File without changes
|
eval/help/help_dataset.json
ADDED
|
@@ -0,0 +1,150 @@
|
|
| 1 |
+
{
|
| 2 |
+
"_about": "Golden dataset for the Help skill (`src/agents/handlers/help.HelpAgent`). Unlike intent/readiness this calls the LIVE model: each case declares an analysis state + report-readiness + chat history, the runner streams HelpAgent.astream for real, and asserts RULES the reply must obey (not text similarity — help replies are free prose with no single correct wording). Metric is COMPLIANCE (% of rule assertions that hold), reported separately for held-out vs carried_over cases.",
|
| 3 |
+
"_groups": "language (reply matches the user's language — the button-path bug), report_guard (never suggest a report when report_ready.ready=false; do suggest it when true — this also IS the 'no action leakage' check, since generate_report is the only gated action), orientation (quality of the suggested starter questions — MANUAL review, not auto-scored).",
|
| 4 |
+
"_asserts": "language_match {expected} — detect the reply's language (help._detect_reply_language over the OUTPUT) must equal expected. must_not_contain_any {patterns} — none of the (case-insensitive) patterns appear. must_contain_any {patterns} — at least one appears.",
|
| 5 |
+
"_carried_over": "carried_over:true rows MIRROR an example in config/prompts/help.md (the row `id` IS the help.md `<!-- id: ... -->`). They are the regression guard: if the prompt is refactored, the demonstrated rule must still hold. What is mirrored is the INPUT spec + the assertion — NOT the example's reply text (temperature>0 makes exact match invalid). Held-out rows (carried_over:false) are NOT in the prompt; their compliance is the real generalization signal. If help.md's Examples change, keep these ids in sync (manual, like intent).",
|
| 6 |
+
"_missing_codes": "report_ready.missing uses codes mapped to the real strings is_report_ready emits (imported in run_eval): analysis -> _MISSING_ANALYSIS, delta -> _MISSING_DELTA. Kept as codes so the dataset survives wording changes.",
|
| 7 |
+
"schema": {
|
| 8 |
+
"id": "stable handle; for carried_over rows this equals the help.md example id",
|
| 9 |
+
"group": "language | report_guard | orientation",
|
| 10 |
+
"carried_over": "bool — mirrors a help.md example",
|
| 11 |
+
"manual_review": "bool — run but exclude from the auto compliance rate (read output_text)",
|
| 12 |
+
"state": "{ analysis_title, objective, business_questions[], report_id }",
|
| 13 |
+
"report_ready": "{ ready: bool, missing: [analysis|delta] }",
|
| 14 |
+
"history": "[{ role: human|ai, content }] — drives language on the button path",
|
| 15 |
+
"message": "the human turn; null = button path (HelpAgent falls back to a per-language trigger)",
|
| 16 |
+
"asserts": "[{ type, ...spec }] — the rules the reply must obey",
|
| 17 |
+
"note": "human-readable description"
|
| 18 |
+
},
|
| 19 |
+
"cases": [
|
| 20 |
+
{
|
| 21 |
+
"id": "lang_01", "group": "language", "carried_over": false, "manual_review": false,
|
| 22 |
+
"state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan bulanan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
|
| 23 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 24 |
+
"history": [{ "role": "human", "content": "aku baru upload datanya, terus aku harus ngapain?" }],
|
| 25 |
+
"message": null,
|
| 26 |
+
"asserts": [{ "type": "language_match", "expected": "Indonesian" }],
|
| 27 |
+
"note": "REGRESSION of the button-path bug: Indonesian conversation, message=null. Reply must be Indonesian, not English."
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"id": "lang_02", "group": "language", "carried_over": false, "manual_review": false,
|
| 31 |
+
"state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
|
| 32 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 33 |
+
"history": [{ "role": "human", "content": "okay I uploaded my data, what do I do next?" }],
|
| 34 |
+
"message": null,
|
| 35 |
+
"asserts": [{ "type": "language_match", "expected": "English" }],
|
| 36 |
+
"note": "English conversation, button path — reply must stay English."
|
| 37 |
+
},
|
| 38 |
+
{
|
| 39 |
+
"id": "lang_03", "group": "language", "carried_over": false, "manual_review": false,
|
| 40 |
+
"state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn pelanggan", "business_questions": ["segmen mana yang paling banyak churn?"], "report_id": null },
|
| 41 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 42 |
+
"history": [],
|
| 43 |
+
"message": "gimana caranya mulai analisis ini ya?",
|
| 44 |
+
"asserts": [{ "type": "language_match", "expected": "Indonesian" }],
|
| 45 |
+
"note": "Intent path: the real Indonesian user turn drives the language."
|
| 46 |
+
},
|
| 47 |
+
{
|
| 48 |
+
"id": "lang_04", "group": "language", "carried_over": false, "manual_review": false,
|
| 49 |
+
"state": { "analysis_title": "Retention analysis", "objective": "understand user retention", "business_questions": ["what drives repeat usage?"], "report_id": null },
|
| 50 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 51 |
+
"history": [],
|
| 52 |
+
"message": null,
|
| 53 |
+
"asserts": [{ "type": "language_match", "expected": "English" }],
|
| 54 |
+
"note": "Fresh analysis, no chat yet, button path — with no turn to read, the user-authored goal (English objective + business_questions, required at onboarding) drives the language."
|
| 55 |
+
},
|
| 56 |
+
{
|
| 57 |
+
"id": "lang_06", "group": "language", "carried_over": false, "manual_review": false,
|
| 58 |
+
"state": { "analysis_title": "Analisis retensi", "objective": "memahami retensi pengguna", "business_questions": ["apa yang mendorong penggunaan berulang?"], "report_id": null },
|
| 59 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 60 |
+
"history": [],
|
| 61 |
+
"message": null,
|
| 62 |
+
"asserts": [{ "type": "language_match", "expected": "Indonesian" }],
|
| 63 |
+
"note": "Same fresh-analysis path as lang_04 but the goal is Indonesian — the goal signal must yield Indonesian (not the hard fallback, which only fires when the goal is empty too)."
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"id": "lang_05", "group": "language", "carried_over": false, "manual_review": false,
|
| 67 |
+
"state": { "analysis_title": "Analisis penjualan", "objective": "memahami tren penjualan", "business_questions": ["bagaimana tren bulanan?"], "report_id": null },
|
| 68 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 69 |
+
"history": [
|
| 70 |
+
{ "role": "human", "content": "apa saja yang bisa aku tanyakan tentang data ini?" },
|
| 71 |
+
{ "role": "ai", "content": "You can start by asking which products sell the most." }
|
| 72 |
+
],
|
| 73 |
+
"message": null,
|
| 74 |
+
"asserts": [{ "type": "language_match", "expected": "Indonesian" }],
|
| 75 |
+
"note": "Last AI turn is English but the human turn is Indonesian — mirror the human, reply Indonesian."
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"id": "help_ex_guard_delta", "group": "report_guard", "carried_over": true, "manual_review": false,
|
| 79 |
+
"state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": "rep-1" },
|
| 80 |
+
"report_ready": { "ready": false, "missing": ["delta"] },
|
| 81 |
+
"history": [{ "role": "human", "content": "what should I do next?" }],
|
| 82 |
+
"message": null,
|
| 83 |
+
"asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "create the report"] }],
|
| 84 |
+
"note": "MIRRORS help.md example help_ex_guard_delta. A report exists and nothing new since — must NOT tell the user to generate a report; steer them to run a fresh analysis first."
|
| 85 |
+
},
|
| 86 |
+
{
|
| 87 |
+
"id": "help_ex_guard_ready", "group": "report_guard", "carried_over": true, "manual_review": false,
|
| 88 |
+
"state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
|
| 89 |
+
"report_ready": { "ready": true, "missing": [] },
|
| 90 |
+
"history": [{ "role": "human", "content": "what should I do next?" }],
|
| 91 |
+
"message": null,
|
| 92 |
+
"asserts": [{ "type": "must_contain_any", "patterns": ["/report", "report"] }],
|
| 93 |
+
"note": "MIRRORS help.md example help_ex_guard_ready. Enough analysis done — SHOULD nudge toward the report (mention /report or the report option)."
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"id": "guard_03", "group": "report_guard", "carried_over": false, "manual_review": false,
|
| 97 |
+
"state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which cohort retains best?"], "report_id": null },
|
| 98 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 99 |
+
"history": [{ "role": "human", "content": "can I get a report now?" }],
|
| 100 |
+
"message": null,
|
| 101 |
+
"asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "you can generate"] }],
|
| 102 |
+
"note": "No analysis run yet, user asks for a report directly — must NOT offer to generate; redirect to running an analysis first."
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"id": "guard_04", "group": "report_guard", "carried_over": false, "manual_review": false,
|
| 106 |
+
"state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
|
| 107 |
+
"report_ready": { "ready": true, "missing": [] },
|
| 108 |
+
"history": [{ "role": "human", "content": "selanjutnya aku ngapain?" }],
|
| 109 |
+
"message": null,
|
| 110 |
+
"asserts": [
|
| 111 |
+
{ "type": "must_contain_any", "patterns": ["/report", "laporan", "report"] },
|
| 112 |
+
{ "type": "language_match", "expected": "Indonesian" }
|
| 113 |
+
],
|
| 114 |
+
"note": "Ready + Indonesian conversation — should nudge toward the report AND stay in Indonesian (two rules at once)."
|
| 115 |
+
},
|
| 116 |
+
{
|
| 117 |
+
"id": "guard_05", "group": "report_guard", "carried_over": false, "manual_review": false,
|
| 118 |
+
"state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn", "business_questions": ["segmen mana yang paling churn?"], "report_id": null },
|
| 119 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 120 |
+
"history": [{ "role": "human", "content": "aku mau bikin laporan dong" }],
|
| 121 |
+
"message": null,
|
| 122 |
+
"asserts": [
|
| 123 |
+
{ "type": "must_not_contain_any", "patterns": ["/report", "silakan buat laporan", "kamu bisa membuat laporan", "generate your report"] },
|
| 124 |
+
{ "type": "language_match", "expected": "Indonesian" }
|
| 125 |
+
],
|
| 126 |
+
"note": "Indonesian, not ready, user asks for a report — must NOT offer it and must reply in Indonesian."
|
| 127 |
+
},
|
| 128 |
+
{
|
| 129 |
+
"id": "help_ex_orient", "group": "orientation", "carried_over": true, "manual_review": true,
|
| 130 |
+
"state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
|
| 131 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 132 |
+
"history": [],
|
| 133 |
+
"message": null,
|
| 134 |
+
"asserts": [],
|
| 135 |
+
"note": "MIRRORS help.md example help_ex_orient. MANUAL: are the 2-3 starter questions concrete, descriptive-first, and tied to the objective? Read output_text."
|
| 136 |
+
},
|
| 137 |
+
{
|
| 138 |
+
"id": "orient_02", "group": "orientation", "carried_over": false, "manual_review": true,
|
| 139 |
+
"state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which acquisition channel retains best?"], "report_id": null },
|
| 140 |
+
"report_ready": { "ready": false, "missing": ["analysis"] },
|
| 141 |
+
"history": [
|
| 142 |
+
{ "role": "human", "content": "which channel brings the most signups?" },
|
| 143 |
+
{ "role": "ai", "content": "Organic search brought the most signups last month (1,240)." }
|
| 144 |
+
],
|
| 145 |
+
"message": null,
|
| 146 |
+
"asserts": [],
|
| 147 |
+
"note": "MANUAL: one question already answered — does help build on it with a NEW follow-up (retention by channel), not re-suggest the answered question? Read output_text."
|
| 148 |
+
}
|
| 149 |
+
]
|
| 150 |
+
}
|
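The case layout used in this dataset (an `id`, a `group`, the `carried_over` and `manual_review` flags, a `state` block, `report_ready`, `history`, `message`, `asserts`, and a `note`) can be checked mechanically before any tokens are spent. The following is a minimal sketch and not part of this PR: `validate_case`, `REQUIRED_KEYS`, `KNOWN_GROUPS`, `KNOWN_ASSERTS`, and `KNOWN_MISSING` are hypothetical names, and the rules are inferred only from the cases and runner shown in this diff, not from a formal schema.

```python
from typing import Any

# Assumption: these sets are inferred from the visible cases and runner,
# not taken from a formal schema in the repository.
REQUIRED_KEYS = {
    "id", "group", "carried_over", "manual_review", "state",
    "report_ready", "history", "message", "asserts", "note",
}
KNOWN_GROUPS = {"language", "report_guard", "orientation"}
KNOWN_ASSERTS = {"language_match", "must_contain_any", "must_not_contain_any"}
KNOWN_MISSING = {"analysis", "delta"}


def validate_case(case: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the case looks well-formed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - case.keys())]
    if problems:
        return problems  # later checks assume every key is present

    if case["group"] not in KNOWN_GROUPS:
        problems.append(f"unknown group: {case['group']}")

    readiness = case["report_ready"]
    unknown_codes = set(readiness.get("missing", [])) - KNOWN_MISSING
    if unknown_codes:
        problems.append(f"unknown missing codes: {sorted(unknown_codes)}")
    if readiness.get("ready") and readiness.get("missing"):
        problems.append("ready=true but missing is non-empty")

    for spec in case["asserts"]:
        if spec.get("type") not in KNOWN_ASSERTS:
            problems.append(f"unknown assert type: {spec.get('type')}")

    # An auto-scored case with no asserts would pass vacuously, since all([]) is True.
    if not case["manual_review"] and not case["asserts"]:
        problems.append("auto-scored case has no asserts")
    return problems
```

Running a check like this over every case before the live run would catch a mistyped assert type or a case that passes only because it asserts nothing.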
eval/help/run_eval.py
ADDED
|
@@ -0,0 +1,428 @@
|
| 1 |
+
"""Help-skill eval runner.
|
| 2 |
+
|
| 3 |
+
Feeds each golden case in `help_dataset.json` to the LIVE Help skill
|
| 4 |
+
(`src/agents/handlers/help.HelpAgent.astream`), then scores whether the streamed
|
| 5 |
+
reply obeys a set of RULE assertions — reply language, never suggesting a report
|
| 6 |
+
when `report_ready.ready=false`, suggesting it when true. Prints a per-case detail
|
| 7 |
+
table + aggregate summary and writes a timestamped JSON report under `results/`
|
| 8 |
+
(never overwritten — one file per run, diffable).
|
| 9 |
+
|
| 10 |
+
Unlike `eval/readiness` (deterministic, no LLM), this calls the model for real, so
|
| 11 |
+
it needs a working `.env` (Azure OpenAI) and spends tokens — run it before a deploy
|
| 12 |
+
that touches `help.md`, not on every commit. `tests/unit/agents/handlers/test_help.py`
|
| 13 |
+
already covers the deterministic Python guard with a fake chain; this is the
|
| 14 |
+
end-to-end "does the model actually obey the prompt" layer on top.
|
| 15 |
+
|
| 16 |
+
Two things the metric separates on purpose:
|
| 17 |
+
- COMPLIANCE = % of auto-scored cases whose assertions all hold. NOT accuracy: help replies are free
|
| 18 |
+
prose with no single correct wording; we score rule-obedience, not similarity.
|
| 19 |
+
- HELD-OUT vs CARRIED-OVER — carried_over cases mirror a help.md example (regression);
|
| 20 |
+
held-out cases are absent from the prompt. Held-out compliance is the real
|
| 21 |
+
generalization signal. If held-out drops while carried_over stays 100%, the prompt
|
| 22 |
+
is overfitting to its own examples.
|
| 23 |
+
|
| 24 |
+
`orientation` cases are `manual_review` — run but excluded from the auto compliance
|
| 25 |
+
rate; read their `output_text` in the JSON report to judge suggestion quality.
|
| 26 |
+
|
| 27 |
+
Invoke as a module so `src` imports resolve:
|
| 28 |
+
|
| 29 |
+
uv run python -m eval.help.run_eval
|
| 30 |
+
uv run python -m eval.help.run_eval --limit 4 # smoke test
|
| 31 |
+
uv run python -m eval.help.run_eval --no-table # summary only
|
| 32 |
+
"""
|
| 33 |
+
|
| 34 |
+
from __future__ import annotations
|
| 35 |
+
|
| 36 |
+
import argparse
|
| 37 |
+
import asyncio
|
| 38 |
+
import json
|
| 39 |
+
import statistics
|
| 40 |
+
import time
|
| 41 |
+
from dataclasses import asdict, dataclass, field
|
| 42 |
+
from datetime import datetime
|
| 43 |
+
from pathlib import Path
|
| 44 |
+
from typing import Any
|
| 45 |
+
|
| 46 |
+
from langchain_core.callbacks import BaseCallbackHandler
|
| 47 |
+
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
|
| 48 |
+
from langchain_core.outputs import LLMResult
|
| 49 |
+
|
| 50 |
+
from src.agents.gate import AnalysisState, stub_analysis_state
|
| 51 |
+
from src.agents.handlers.help import HelpAgent, ReportReadiness, _detect_reply_language
|
| 52 |
+
from src.agents.report.readiness import _MISSING_ANALYSIS, _MISSING_DELTA
|
| 53 |
+
|
| 54 |
+
_HERE = Path(__file__).resolve().parent
|
| 55 |
+
DATASET = _HERE / "help_dataset.json"
|
| 56 |
+
RESULTS_DIR = _HERE / "results"
|
| 57 |
+
GROUPS = ["language", "report_guard", "orientation"]
|
| 58 |
+
|
| 59 |
+
# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
|
| 60 |
+
# from the module so the dataset stays readable and survives wording changes.
|
| 61 |
+
_CODE_TO_MISSING = {
|
| 62 |
+
"analysis": _MISSING_ANALYSIS,
|
| 63 |
+
"delta": _MISSING_DELTA,
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
class _UsageCollector(BaseCallbackHandler):
|
| 68 |
+
"""Sums token usage across the LLM calls made during one astream()."""
|
| 69 |
+
|
| 70 |
+
def __init__(self) -> None:
|
| 71 |
+
self.input_tokens = 0
|
| 72 |
+
self.output_tokens = 0
|
| 73 |
+
self.total_tokens = 0
|
| 74 |
+
|
| 75 |
+
def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
|
| 76 |
+
before = self.total_tokens
|
| 77 |
+
for generation_list in response.generations:
|
| 78 |
+
for generation in generation_list:
|
| 79 |
+
message = getattr(generation, "message", None)
|
| 80 |
+
usage = getattr(message, "usage_metadata", None) if message else None
|
| 81 |
+
if usage:
|
| 82 |
+
self.input_tokens += usage.get("input_tokens", 0)
|
| 83 |
+
self.output_tokens += usage.get("output_tokens", 0)
|
| 84 |
+
self.total_tokens += usage.get("total_tokens", 0)
|
| 85 |
+
if self.total_tokens == before and response.llm_output:
|
| 86 |
+
usage = response.llm_output.get("token_usage") or {}
|
| 87 |
+
self.input_tokens += usage.get("prompt_tokens", 0)
|
| 88 |
+
self.output_tokens += usage.get("completion_tokens", 0)
|
| 89 |
+
self.total_tokens += usage.get("total_tokens", 0)
|
| 90 |
+
|
| 91 |
+
@property
|
| 92 |
+
def tokens(self) -> dict[str, int]:
|
| 93 |
+
return {
|
| 94 |
+
"input": self.input_tokens,
|
| 95 |
+
"output": self.output_tokens,
|
| 96 |
+
"total": self.total_tokens,
|
| 97 |
+
}
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
# --- assertion checkers -----------------------------------------------------
|
| 101 |
+
# Each returns (passed, detail). `detail` explains a failure in the table/report.
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
def _check_language_match(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
|
| 105 |
+
got = _detect_reply_language([], message=output)
|
| 106 |
+
return got == spec["expected"], f"want {spec['expected']}, got {got}"
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def _check_must_not_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
|
| 110 |
+
low = output.lower()
|
| 111 |
+
hits = [p for p in spec["patterns"] if p.lower() in low]
|
| 112 |
+
return (not hits), (f"found {hits}" if hits else "none present")
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
def _check_must_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
|
| 116 |
+
low = output.lower()
|
| 117 |
+
hits = [p for p in spec["patterns"] if p.lower() in low]
|
| 118 |
+
return bool(hits), (f"found {hits}" if hits else f"none of {spec['patterns']}")
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
_ASSERT_CHECKS = {
|
| 122 |
+
"language_match": _check_language_match,
|
| 123 |
+
"must_not_contain_any": _check_must_not_contain_any,
|
| 124 |
+
"must_contain_any": _check_must_contain_any,
|
| 125 |
+
}
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
@dataclass
|
| 129 |
+
class AssertResult:
|
| 130 |
+
type: str
|
| 131 |
+
passed: bool
|
| 132 |
+
detail: str
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
@dataclass
|
| 136 |
+
class CaseResult:
|
| 137 |
+
id: str
|
| 138 |
+
group: str
|
| 139 |
+
carried_over: bool
|
| 140 |
+
manual_review: bool
|
| 141 |
+
output_text: str
|
| 142 |
+
asserts: list[AssertResult]
|
| 143 |
+
all_passed: bool | None # None when manual_review (not auto-scored)
|
| 144 |
+
latency_ms: float
|
| 145 |
+
tokens: dict[str, int]
|
| 146 |
+
errored: bool = False # the astream call raised — infra failure, not a rule verdict
|
| 147 |
+
|
| 148 |
+
|
| 149 |
+
def load_cases(path: Path) -> list[dict[str, Any]]:
|
| 150 |
+
"""Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
|
| 151 |
+
data = json.loads(path.read_text(encoding="utf-8"))
|
| 152 |
+
return list(data["cases"])
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
def _build_state(spec: dict[str, Any]) -> AnalysisState:
|
| 156 |
+
"""Build an AnalysisState from a case's `state` block (defaults from the stub)."""
|
| 157 |
+
return stub_analysis_state().model_copy(
|
| 158 |
+
update={
|
| 159 |
+
"analysis_title": spec.get("analysis_title", "New analysis"),
|
| 160 |
+
"objective": spec.get("objective", ""),
|
| 161 |
+
"business_questions": list(spec.get("business_questions", [])),
|
| 162 |
+
"report_id": spec.get("report_id"),
|
| 163 |
+
}
|
| 164 |
+
)
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
def _build_history(rows: list[dict[str, Any]]) -> list[BaseMessage]:
|
| 168 |
+
out: list[BaseMessage] = []
|
| 169 |
+
for row in rows:
|
| 170 |
+
cls = HumanMessage if row["role"] == "human" else AIMessage
|
| 171 |
+
out.append(cls(content=row["content"]))
|
| 172 |
+
return out
|
| 173 |
+
|
| 174 |
+
|
| 175 |
+
def _build_readiness(spec: dict[str, Any]) -> ReportReadiness:
|
| 176 |
+
return ReportReadiness(
|
| 177 |
+
ready=bool(spec["ready"]),
|
| 178 |
+
missing=[_CODE_TO_MISSING[c] for c in spec.get("missing", [])],
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
async def run_case(case: dict[str, Any]) -> CaseResult:
|
| 183 |
+
"""Stream one Help reply and score its assertions; never raises."""
|
| 184 |
+
state = _build_state(case["state"])
|
| 185 |
+
history = _build_history(case.get("history", []))
|
| 186 |
+
readiness = _build_readiness(case["report_ready"])
|
| 187 |
+
collector = _UsageCollector()
|
| 188 |
+
|
| 189 |
+
agent = HelpAgent() # real Azure chain, constructed lazily on first astream
|
| 190 |
+
start = time.perf_counter()
|
| 191 |
+
try:
|
| 192 |
+
output = "".join(
|
| 193 |
+
[
|
| 194 |
+
token
|
| 195 |
+
async for token in agent.astream(
|
| 196 |
+
state,
|
| 197 |
+
history=history,
|
| 198 |
+
message=case.get("message"),
|
| 199 |
+
report_ready=readiness,
|
| 200 |
+
callbacks=[collector],
|
| 201 |
+
)
|
| 202 |
+
]
|
| 203 |
+
)
|
| 204 |
+
except Exception as exc: # noqa: BLE001 — one bad case shouldn't kill the run
|
| 205 |
+
output = f"ERROR:{type(exc).__name__}: {exc}"
|
| 206 |
+
latency_ms = round((time.perf_counter() - start) * 1000, 1)
|
| 207 |
+
|
| 208 |
+
manual = bool(case.get("manual_review"))
|
| 209 |
+
errored = output.startswith("ERROR:")
|
| 210 |
+
asserts: list[AssertResult] = []
|
| 211 |
+
if errored:
|
| 212 |
+
# Don't run rule checks on an error string — a crash must not "pass" a
|
| 213 |
+
# must_not_contain_any (the pattern is trivially absent) or a language check.
|
| 214 |
+
# Count it as a failure, but flag it as errored so it reads as infra, not a
|
| 215 |
+
# rule violation (overrides manual_review — a crash isn't reviewable).
|
| 216 |
+
asserts = [AssertResult(type="stream", passed=False, detail=_truncate(output, 100))]
|
| 217 |
+
all_passed: bool | None = False
|
| 218 |
+
elif manual:
|
| 219 |
+
all_passed = None
|
| 220 |
+
else:
|
| 221 |
+
for spec in case.get("asserts", []):
|
| 222 |
+
check = _ASSERT_CHECKS[spec["type"]]
|
| 223 |
+
passed, detail = check(output, spec)
|
| 224 |
+
asserts.append(AssertResult(type=spec["type"], passed=passed, detail=detail))
|
| 225 |
+
all_passed = all(a.passed for a in asserts)
|
| 226 |
+
|
| 227 |
+
return CaseResult(
|
| 228 |
+
id=case["id"],
|
| 229 |
+
group=case["group"],
|
| 230 |
+
carried_over=bool(case.get("carried_over")),
|
| 231 |
+
manual_review=manual,
|
| 232 |
+
output_text=output,
|
| 233 |
+
asserts=asserts,
|
| 234 |
+
all_passed=all_passed,
|
| 235 |
+
latency_ms=latency_ms,
|
| 236 |
+
tokens=collector.tokens,
|
| 237 |
+
errored=errored,
|
| 238 |
+
)
|
| 239 |
+
|
| 240 |
+
|
| 241 |
+
def _compliance(results: list[CaseResult]) -> dict[str, Any]:
|
| 242 |
+
scored = [r for r in results if r.all_passed is not None]
|
| 243 |
+
passed = sum(1 for r in scored if r.all_passed)
|
| 244 |
+
return {
|
| 245 |
+
"n": len(scored),
|
| 246 |
+
"passed": passed,
|
| 247 |
+
"compliance": round(passed / len(scored), 3) if scored else 0.0,
|
| 248 |
+
}
|
| 249 |
+
|
| 250 |
+
|
| 251 |
+
def summarize(results: list[CaseResult]) -> dict[str, Any]:
|
| 252 |
+
scored = [r for r in results if r.all_passed is not None]
|
| 253 |
+
latencies = [r.latency_ms for r in results]
|
| 254 |
+
tok_total = sum(r.tokens["total"] for r in results)
|
| 255 |
+
overall = _compliance(results)
|
| 256 |
+
by_group = {
|
| 257 |
+
g: _compliance([r for r in results if r.group == g])
|
| 258 |
+
for g in GROUPS
|
| 259 |
+
if any(r.group == g for r in results)
|
| 260 |
+
}
|
| 261 |
+
errored = [r for r in results if r.errored]
|
| 262 |
+
return {
|
| 263 |
+
"total": len(results),
|
| 264 |
+
"scored": len(scored),
|
| 265 |
+
"manual_review": len(results) - len(scored),
|
| 266 |
+
"passed": overall["passed"],
|
| 267 |
+
"compliance": overall["compliance"],
|
| 268 |
+
"runtime_avg_ms": round(statistics.mean(latencies), 1) if latencies else 0,
|
| 269 |
+
"tokens_total": tok_total,
|
| 270 |
+
"by_group": by_group,
|
| 271 |
+
"held_out": _compliance([r for r in scored if not r.carried_over]),
|
| 272 |
+
"carried_over": _compliance([r for r in scored if r.carried_over]),
|
| 273 |
+
"errored": {"count": len(errored), "ids": [r.id for r in errored]},
|
| 274 |
+
}
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
def _truncate(text: str, width: int) -> str:
|
| 278 |
+
text = text.replace("\n", " ")
|
| 279 |
+
return text if len(text) <= width else text[: width - 3] + "..."
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
def format_table(results: list[CaseResult]) -> str:
|
| 283 |
+
header = (
|
| 284 |
+
f"{'ID':<20} {'GROUP':<13} {'C/O':<4} {'ASSERTS':<22} {'OK':<4} {'MS':>7}"
|
| 285 |
+
)
|
| 286 |
+
rule = "-" * len(header)
|
| 287 |
+
lines = [rule, header, rule]
|
| 288 |
+
for r in results:
|
| 289 |
+
co = "CO" if r.carried_over else "-"
|
| 290 |
+
if r.all_passed is None:  # manual review and not errored; an errored case shows as X
|
| 291 |
+
atypes, ok = "(manual)", "~"
|
| 292 |
+
else:
|
| 293 |
+
atypes = ",".join(a.type.replace("_", "")[:6] for a in r.asserts) or "-"
|
| 294 |
+
ok = "ok" if r.all_passed else "X"
|
| 295 |
+
lines.append(
|
| 296 |
+
f"{r.id:<20} {r.group:<13} {co:<4} {_truncate(atypes, 22):<22} "
|
| 297 |
+
f"{ok:<4} {r.latency_ms:>7}"
|
| 298 |
+
)
|
| 299 |
+
lines.append(rule)
|
| 300 |
+
return "\n".join(lines)
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
|
| 304 |
+
lines = ["SUMMARY"]
|
| 305 |
+
lines.append(
|
| 306 |
+
f" Compliance {summary['passed']}/{summary['scored']} cases obey all rules"
|
| 307 |
+
f" ({summary['compliance'] * 100:.1f}%) avg {summary['runtime_avg_ms']} ms"
|
| 308 |
+
)
|
| 309 |
+
lines.append(
|
| 310 |
+
f" Manual {summary['manual_review']} case(s) excluded from the rate"
|
| 311 |
+
" (read output_text)"
|
| 312 |
+
)
|
| 313 |
+
lines.append("")
|
| 314 |
+
lines.append(" By group")
|
| 315 |
+
for g, m in summary["by_group"].items():
|
| 316 |
+
if m["n"]:
|
| 317 |
+
lines.append(f" {g:<14} {m['passed']}/{m['n']} {m['compliance'] * 100:.0f}%")
|
| 318 |
+
else:
|
| 319 |
+
lines.append(f" {g:<14} (manual only)")
|
| 320 |
+
lines.append("")
|
| 321 |
+
ho, co = summary["held_out"], summary["carried_over"]
|
| 322 |
+
lines.append(" Held-out vs carried-over")
|
| 323 |
+
lines.append(
|
| 324 |
+
f" held_out {ho['passed']}/{ho['n']} "
|
| 325 |
+
f"{ho['compliance'] * 100:.0f}% <- generalization"
|
| 326 |
+
)
|
| 327 |
+
lines.append(
|
| 328 |
+
f" carried_over {co['passed']}/{co['n']} "
|
| 329 |
+
f"{co['compliance'] * 100:.0f}% <- regression"
|
| 330 |
+
)
|
| 331 |
+
# Rule failures (real disobedience) vs errored (infra/stream crash) — kept apart so
|
| 332 |
+
# a crashed run isn't misread as the model breaking a rule.
|
| 333 |
+
failures = [r for r in results if r.all_passed is False and not r.errored]
|
| 334 |
+
lines.append("")
|
| 335 |
+
lines.append(f" FAILURES ({len(failures)})")
|
| 336 |
+
for r in failures:
|
| 337 |
+
bad = [f"{a.type}({a.detail})" for a in r.asserts if not a.passed]
|
| 338 |
+
lines.append(f" {r.id:<20} {r.group:<13} {'; '.join(bad)}")
|
| 339 |
+
err = summary["errored"]
|
| 340 |
+
if err["count"]:
|
| 341 |
+
lines.append("")
|
| 342 |
+
lines.append(
|
| 343 |
+
f" ERRORED ({err['count']}) - stream crashed, counted as a failure, not a rule miss"
|
| 344 |
+
f" -> {', '.join(err['ids'])}"
|
| 345 |
+
)
|
| 346 |
+
return "\n".join(lines)
|
| 347 |
+
|
| 348 |
+
|
| 349 |
+
def build_report(
|
| 350 |
+
results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
|
| 351 |
+
) -> dict[str, Any]:
|
| 352 |
+
run = {
|
| 353 |
+
**meta,
|
| 354 |
+
**{
|
| 355 |
+
k: summary[k]
|
| 356 |
+
for k in ("total", "scored", "manual_review", "passed", "compliance",
|
| 357 |
+
"runtime_avg_ms", "tokens_total")
|
| 358 |
+
},
|
| 359 |
+
}
|
| 360 |
+
return {
|
| 361 |
+
"run": run,
|
| 362 |
+
"by_group": summary["by_group"],
|
| 363 |
+
"held_out": summary["held_out"],
|
| 364 |
+
"carried_over": summary["carried_over"],
|
| 365 |
+
"errored": summary["errored"],
|
| 366 |
+
"cases": [asdict(r) for r in results],
|
| 367 |
+
}
|
| 368 |
+
|
| 369 |
+
|
| 370 |
+
def _model_name() -> str:
|
| 371 |
+
try:
|
| 372 |
+
from src.config.settings import settings
|
| 373 |
+
|
| 374 |
+
return str(settings.azureai_deployment_name_4o)
|
| 375 |
+
except Exception: # noqa: BLE001 — meta only; .env may be absent
|
| 376 |
+
return "gpt-4o"
|
| 377 |
+
|
| 378 |
+
|
| 379 |
+
@dataclass
|
| 380 |
+
class _Args:
|
| 381 |
+
dataset: Path = DATASET
|
| 382 |
+
limit: int = 0
|
| 383 |
+
no_table: bool = False
|
| 384 |
+
extra: dict[str, Any] = field(default_factory=dict)
|
| 385 |
+
|
| 386 |
+
|
| 387 |
+
async def main() -> None:
|
| 388 |
+
parser = argparse.ArgumentParser(description="Help-skill eval")
|
| 389 |
+
parser.add_argument("--dataset", type=Path, default=DATASET)
|
| 390 |
+
parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
|
| 391 |
+
parser.add_argument("--prompt-version", default="help.md")
|
| 392 |
+
parser.add_argument("--no-table", action="store_true", help="skip the detail table")
|
| 393 |
+
args = parser.parse_args()
|
| 394 |
+
|
| 395 |
+
cases = load_cases(args.dataset)
|
| 396 |
+
if args.limit:
|
| 397 |
+
cases = cases[: args.limit]
|
| 398 |
+
|
| 399 |
+
started = datetime.now()
|
| 400 |
+
print(f"Help Skill Eval -- {started:%Y-%m-%d %H:%M:%S}")
|
| 401 |
+
print(
|
| 402 |
+
f"dataset: {args.dataset.name} ({len(cases)} cases) model: {_model_name()} "
|
| 403 |
+
f"prompt: {args.prompt_version} target: HelpAgent.astream (live)"
|
| 404 |
+
)
|
| 405 |
+
|
| 406 |
+
results = [await run_case(case) for case in cases]
|
| 407 |
+
|
| 408 |
+
summary = summarize(results)
|
| 409 |
+
if not args.no_table:
|
| 410 |
+
print(format_table(results))
|
| 411 |
+
print(format_summary(summary, results))
|
| 412 |
+
|
| 413 |
+
meta = {
|
| 414 |
+
"timestamp": started.isoformat(timespec="seconds"),
|
| 415 |
+
"dataset": args.dataset.name,
|
| 416 |
+
"model": _model_name(),
|
| 417 |
+
"prompt_version": args.prompt_version,
|
| 418 |
+
"target": "src/agents/handlers/help.HelpAgent.astream",
|
| 419 |
+
}
|
| 420 |
+
report = build_report(results, summary, meta)
|
| 421 |
+
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
|
| 422 |
+
out_path = RESULTS_DIR / f"help_result_{started:%Y-%m-%d_%H%M%S}.json"
|
| 423 |
+
out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 424 |
+
print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
|
| 425 |
+
|
| 426 |
+
|
| 427 |
+
if __name__ == "__main__":
|
| 428 |
+
asyncio.run(main())
|
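The summary produced by this runner separates held-out from carried-over compliance and leaves manual-review cases (those with `all_passed` set to `None`) out of the rate. The arithmetic can be sketched in isolation as below. This is an illustration under stated assumptions, not the module's code: it uses plain dicts instead of the `CaseResult` dataclass, and `compliance_rate` and `split_compliance` are hypothetical names.

```python
from typing import Optional


def compliance_rate(outcomes: list[Optional[bool]]) -> float:
    """Share of scored outcomes that passed; None entries (manual review) are skipped."""
    scored = [outcome for outcome in outcomes if outcome is not None]
    if not scored:
        return 0.0  # mirrors the runner's choice of 0.0 when nothing is scored
    return round(sum(scored) / len(scored), 3)


def split_compliance(cases: list[dict]) -> dict[str, float]:
    """Compute compliance separately for held-out and carried-over cases."""
    held_out = [c["all_passed"] for c in cases if not c["carried_over"]]
    carried = [c["all_passed"] for c in cases if c["carried_over"]]
    return {
        "held_out": compliance_rate(held_out),
        "carried_over": compliance_rate(carried),
    }


if __name__ == "__main__":
    sample = [
        {"carried_over": False, "all_passed": True},
        {"carried_over": False, "all_passed": False},
        {"carried_over": False, "all_passed": True},
        {"carried_over": True, "all_passed": True},
        {"carried_over": True, "all_passed": None},  # manual review, not scored
    ]
    print(split_compliance(sample))
```

A held-out rate that falls while the carried-over rate stays at 1.0 is the overfitting signal the module docstring describes, which is why the two groups are kept apart rather than averaged.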
eval/readiness/readiness_dataset.json
CHANGED
|
@@ -1,40 +1,37 @@
|
|
| 1 |
{
|
| 2 |
"_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
|
| 3 |
-
"_floor": "is_report_ready's deterministic floor
|
| 4 |
"_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
|
| 5 |
-
"_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the
|
| 6 |
"schema": {
|
| 7 |
"id": "stable per-case handle, <group>_<NN>",
|
| 8 |
"group": "floor | delta | edge | alignment",
|
| 9 |
-
"problem_validated": "bool",
|
| 10 |
"report_id": "null = never generated; a string = a report exists",
|
| 11 |
"records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
|
| 12 |
"reports": "[{ age_min: int }] (only meaningful when report_id set)",
|
| 13 |
-
"aligned": "bool — do the analyses address the
|
| 14 |
"expected_ready": "what the deterministic floor SHOULD return",
|
| 15 |
-
"expected_missing": "subset of [
|
| 16 |
"note": "human-readable description"
|
| 17 |
},
|
| 18 |
"cases": [
|
| 19 |
-
{ "id": "floor_01", "group": "floor", "
|
| 20 |
-
{ "id": "floor_02", "group": "floor", "
|
| 21 |
-
{ "id": "floor_03", "group": "floor", "
|
| 22 |
-
{ "id": "floor_04", "group": "floor", "
|
| 23 |
-
{ "id": "floor_05", "group": "floor", "
|
| 24 |
-
{ "id": "floor_06", "group": "floor", "
|
| 25 |
-
{ "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
|
| 26 |
-
{ "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
|
| 27 |
|
| 28 |
-
{ "id": "delta_01", "group": "delta", "
|
| 29 |
-
{ "id": "delta_02", "group": "delta", "
|
| 30 |
-
{ "id": "delta_03", "group": "delta", "
|
| 31 |
-
{ "id": "delta_04", "group": "delta", "
|
| 32 |
-
{ "id": "delta_05", "group": "delta", "
|
| 33 |
|
| 34 |
-
{ "id": "edge_01", "group": "edge", "
|
| 35 |
|
| 36 |
-
{ "id": "align_01", "group": "alignment", "
|
| 37 |
-
{ "id": "align_02", "group": "alignment", "
|
| 38 |
-
{ "id": "align_03", "group": "alignment", "
|
| 39 |
]
|
| 40 |
}
|
|
|
|
| 1 |
{
|
| 2 |
"_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
|
| 3 |
+
"_floor": "is_report_ready's deterministic floor (KM-652, after the problem_validated gate was removed 2026-06-24): (1) >=1 SUBSTANTIVE record, (2) delta-since-report. SUBSTANTIVE = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
|
| 4 |
"_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
|
| 5 |
+
"_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the analysis objective — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
|
| 6 |
"schema": {
|
| 7 |
"id": "stable per-case handle, <group>_<NN>",
|
| 8 |
"group": "floor | delta | edge | alignment",
|
|
|
|
| 9 |
"report_id": "null = never generated; a string = a report exists",
|
| 10 |
"records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
|
| 11 |
"reports": "[{ age_min: int }] (only meaningful when report_id set)",
|
| 12 |
+
"aligned": "bool — do the analyses address the objective? (floor ignores this)",
|
| 13 |
"expected_ready": "what the deterministic floor SHOULD return",
|
| 14 |
+
"expected_missing": "subset of [analysis, delta]",
|
| 15 |
"note": "human-readable description"
|
| 16 |
},
|
| 17 |
"cases": [
|
| 18 |
+
{ "id": "floor_01", "group": "floor", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "new analysis: no analysis run yet → not ready" },
|
| 19 |
+
{ "id": "floor_02", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
|
| 20 |
+
{ "id": "floor_03", "group": "floor", "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
|
| 21 |
+
{ "id": "floor_04", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one successful analysis, no prior report → ready" },
|
| 22 |
+
{ "id": "floor_05", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
|
| 23 |
+
{ "id": "floor_06", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
{ "id": "delta_01", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
|
| 26 |
+
{ "id": "delta_02", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
|
| 27 |
+
{ "id": "delta_03", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
|
| 28 |
+
{ "id": "delta_04", "group": "delta", "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
|
| 29 |
+
{ "id": "delta_05", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
|
| 30 |
|
| 31 |
+
{ "id": "edge_01", "group": "edge", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
|
| 32 |
|
| 33 |
+
{ "id": "align_01", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the objective. Floor says ready; a human would say not-ready." },
|
| 34 |
+
{ "id": "align_02", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the objective" },
|
| 35 |
+
{ "id": "align_03", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
|
| 36 |
]
|
| 37 |
}
|
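The notes in the dataset above imply a two-step rule: a floor (at least one successful analysis) and a delta (a successful analysis newer than the newest report). The following is a minimal sketch of that rule as the cases describe it, not the repo's `is_report_ready`; the function name `sketch_is_ready` is hypothetical, and it assumes that a smaller `age_min` means more recent.

```python
from typing import Any


def sketch_is_ready(case: dict[str, Any]) -> tuple[bool, list[str]]:
    """Hypothetical re-statement of the floor + delta rule from the dataset notes."""
    # Floor: only records whose analysis succeeded count; findings alone do not.
    successes = [r for r in case["records"] if r["analysis"] == "success"]
    if not successes:
        return False, ["analysis"]

    # Delta: if a report exists, a success must be newer than the newest report.
    reports = case["reports"]
    if reports:
        newest_report_age = min(r["age_min"] for r in reports)
        if not any(r["age_min"] < newest_report_age for r in successes):
            return False, ["delta"]

    return True, []


# floor_02: a failed analysis with findings must not unlock a report.
floor_02 = {"records": [{"analysis": "failure", "findings": 3, "age_min": 20}], "reports": []}
# delta_05: the only new record is a failure, so there is nothing new to report.
delta_05 = {
    "records": [
        {"analysis": "success", "findings": 2, "age_min": 120},
        {"analysis": "failure", "findings": 3, "age_min": 5},
    ],
    "reports": [{"age_min": 60}],
}
print(sketch_is_ready(floor_02))
print(sketch_is_ready(delta_05))
```

The alignment cases are deliberately outside this sketch: the floor reports them as ready even when `aligned` is false, which is the gap the dataset flags.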
eval/readiness/run_eval.py
CHANGED
|
@@ -35,7 +35,6 @@ from src.agents.gate import stub_analysis_state
|
|
| 35 |
from src.agents.report.readiness import (
|
| 36 |
_MISSING_ANALYSIS,
|
| 37 |
_MISSING_DELTA,
|
| 38 |
-
_MISSING_PROBLEM,
|
| 39 |
is_report_ready,
|
| 40 |
)
|
| 41 |
|
|
@@ -45,9 +44,9 @@ RESULTS_DIR = _HERE / "results"
|
|
| 45 |
GROUPS = ["floor", "delta", "edge", "alignment"]
|
| 46 |
|
| 47 |
# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
|
| 48 |
-
# from the module so the dataset stays readable and survives wording changes.
|
|
|
|
| 49 |
_CODE_TO_MISSING = {
|
| 50 |
-
"problem": _MISSING_PROBLEM,
|
| 51 |
"analysis": _MISSING_ANALYSIS,
|
| 52 |
"delta": _MISSING_DELTA,
|
| 53 |
}
|
|
@@ -139,7 +138,9 @@ def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRepo
|
|
| 139 |
|
| 140 |
async def run_case(case: dict[str, Any]) -> CaseResult:
|
| 141 |
now = datetime.now(UTC)
|
| 142 |
-
|
| 143 |
if case.get("report_id"):
|
| 144 |
state = state.model_copy(update={"report_id": case["report_id"]})
|
| 145 |
|
|
|
|
| 35 |
from src.agents.report.readiness import (
|
| 36 |
_MISSING_ANALYSIS,
|
| 37 |
_MISSING_DELTA,
|
|
|
|
| 38 |
is_report_ready,
|
| 39 |
)
|
| 40 |
|
|
|
|
| 44 |
GROUPS = ["floor", "delta", "edge", "alignment"]
|
| 45 |
|
| 46 |
# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
|
| 47 |
+
# from the module so the dataset stays readable and survives wording changes. The
|
| 48 |
+
# `problem` code was retired with the problem_validated gate (KM-652, 2026-06-24).
|
| 49 |
_CODE_TO_MISSING = {
|
|
|
|
| 50 |
"analysis": _MISSING_ANALYSIS,
|
| 51 |
"delta": _MISSING_DELTA,
|
| 52 |
}
|
|
|
|
| 138 |
|
| 139 |
async def run_case(case: dict[str, Any]) -> CaseResult:
|
| 140 |
now = datetime.now(UTC)
|
| 141 |
+
# The problem_validated gate was removed (KM-652); readiness no longer reads the goal,
|
| 142 |
+
# so a bare stub state + report_id is all is_report_ready needs.
|
| 143 |
+
state = stub_analysis_state()
|
| 144 |
if case.get("report_id"):
|
| 145 |
state = state.model_copy(update={"report_id": case["report_id"]})
|
| 146 |
|
main.py
CHANGED
|
@@ -23,7 +23,7 @@ from src.api.v1.tools import router as tools_router
|
|
| 23 |
from src.api.v1.help import router as help_router # pr/5 Phase 2: dedicated /tools/help
|
| 24 |
from src.api.v2.chat import router as chat_v2_router # pr/5 Phase 2: v2 chat pilot (analysis_id)
|
| 25 |
from src.db.postgres.init_db import init_db
|
| 26 |
-
import
|
| 27 |
import uvicorn
|
| 28 |
|
| 29 |
# Configure logging
|
|
@@ -34,7 +34,7 @@ logger = get_logger("main")
|
|
| 34 |
@asynccontextmanager
|
| 35 |
async def lifespan(app: FastAPI):
|
| 36 |
logger.info("Starting application...")
|
| 37 |
-
if
|
| 38 |
await init_db()
|
| 39 |
logger.info("Database initialized")
|
| 40 |
else:
|
|
|
|
| 23 |
from src.api.v1.help import router as help_router # pr/5 Phase 2: dedicated /tools/help
|
| 24 |
from src.api.v2.chat import router as chat_v2_router # pr/5 Phase 2: v2 chat pilot (analysis_id)
|
| 25 |
from src.db.postgres.init_db import init_db
|
| 26 |
+
from src.config.settings import settings
|
| 27 |
import uvicorn
|
| 28 |
|
| 29 |
# Configure logging
|
|
|
|
| 34 |
@asynccontextmanager
|
| 35 |
async def lifespan(app: FastAPI):
|
| 36 |
logger.info("Starting application...")
|
| 37 |
+
if not settings.skip_init_db:
|
| 38 |
await init_db()
|
| 39 |
logger.info("Database initialized")
|
| 40 |
else:
|
src/agents/handlers/help.py
CHANGED
|
@@ -29,6 +29,7 @@ SEAMS:
|
|
| 29 |
|
| 30 |
from __future__ import annotations
|
| 31 |
|
|
|
|
| 32 |
from collections.abc import AsyncIterator
|
| 33 |
from dataclasses import dataclass, field
|
| 34 |
from pathlib import Path
|
|
@@ -49,8 +50,80 @@ _PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompt
|
|
| 49 |
_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
|
| 50 |
_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
|
| 51 |
|
| 52 |
-
# Neutral human turn when Help is triggered by a slash command with no real content
|
| 53 |
-
|
| 54 |
|
| 55 |
|
| 56 |
@dataclass
|
|
@@ -107,13 +180,20 @@ def _build_context_block(
|
|
| 107 |
state: AnalysisState,
|
| 108 |
report_ready: ReportReadiness,
|
| 109 |
available_actions: list[str],
|
|
|
|
| 110 |
) -> str:
|
| 111 |
-
"""Compose the deterministic context the prompt's 'never misguide' rule trusts.
|
| 112 |
return "\n\n".join(
|
| 113 |
[
|
| 114 |
_format_state(state),
|
| 115 |
_format_report_ready(report_ready),
|
| 116 |
"[Available actions]\n" + ", ".join(available_actions),
|
|
|
|
| 117 |
]
|
| 118 |
)
|
| 119 |
|
|
@@ -178,17 +258,26 @@ class HelpAgent:
|
|
| 178 |
"""
|
| 179 |
readiness = report_ready or ReportReadiness()
|
| 180 |
actions = available_actions or _derive_available_actions(state, readiness)
|
| 181 |
logger.info(
|
| 182 |
"help guidance",
|
| 183 |
report_ready=readiness.ready,
|
| 184 |
available_actions=actions,
|
|
|
|
| 185 |
)
|
| 186 |
|
| 187 |
chain = self._ensure_chain()
|
| 188 |
payload: dict[str, Any] = {
|
| 189 |
-
"message": message or
|
| 190 |
"history": history or [],
|
| 191 |
-
"context": _build_context_block(state, readiness, actions),
|
| 192 |
}
|
| 193 |
if callbacks:
|
| 194 |
async for token in chain.astream(payload, config={"callbacks": callbacks}):
|
|
|
|
| 29 |
|
| 30 |
from __future__ import annotations
|
| 31 |
|
| 32 |
+
import re
|
| 33 |
from collections.abc import AsyncIterator
|
| 34 |
from dataclasses import dataclass, field
|
| 35 |
from pathlib import Path
|
|
|
|
| 50 |
_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
|
| 51 |
_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
|
| 52 |
|
| 53 |
+
# Neutral human turn when Help is triggered by a slash command with no real content
|
| 54 |
+
# (button path passes message=None). Per language, so the synthetic turn never drags the
|
| 55 |
+
# reply toward English — without this the only human-turn signal on the button path would
|
| 56 |
+
# be an English sentence, and the model mirrors the last human turn's language.
|
| 57 |
+
_DEFAULT_TRIGGERS = {
|
| 58 |
+
"Indonesian": "Apa yang sebaiknya saya lakukan selanjutnya?",
|
| 59 |
+
"English": "What should I do next?",
|
| 60 |
+
}
|
| 61 |
+
_FALLBACK_LANGUAGE = "Indonesian" # team default when no human turn exists yet
|
| 62 |
+
|
| 63 |
+
# Lightweight, LLM-free language detection over the last human turn. The result is LOCKED
|
| 64 |
+
# into the prompt via a `[Reply language]` directive (see `_build_context_block`), so
|
| 65 |
+
# replying in the user's language is deterministic/mandatory — not a soft prompt hint that
|
| 66 |
+
# an English system prompt + English default trigger can override.
|
| 67 |
+
_ID_MARKERS = frozenset({
|
| 68 |
+
"yang", "dan", "apa", "gimana", "bagaimana", "kenapa", "mengapa", "aku", "saya",
|
| 69 |
+
"tolong", "ini", "itu", "nih", "dong", "kah", "untuk", "dengan", "pada", "adalah",
|
| 70 |
+
"tidak", "enggak", "nggak", "bisa", "mau", "buat", "dari", "kamu", "ya",
|
| 71 |
+
"berapa", "kapan", "siapa", "dimana", "juga", "sudah", "belum", "akan",
|
| 72 |
+
})
|
| 73 |
+
_EN_MARKERS = frozenset({
|
| 74 |
+
"the", "what", "how", "why", "please", "this", "that", "is", "are", "can", "could",
|
| 75 |
+
"should", "for", "with", "of", "and", "you", "do", "does", "when", "where",
|
| 76 |
+
"who", "which", "my", "me", "your", "have", "has", "want", "next",
|
| 77 |
+
})
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def _last_human_text(history: list[BaseMessage] | None) -> str:
|
| 81 |
+
"""Return the text of the most recent human turn in history, or '' if none."""
|
| 82 |
+
for msg in reversed(history or []):
|
| 83 |
+
if getattr(msg, "type", None) == "human":
|
| 84 |
+
content = msg.content
|
| 85 |
+
return content if isinstance(content, str) else str(content)
|
| 86 |
+
return ""
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
def _score_language(text: str) -> str | None:
|
| 90 |
+
"""Return "Indonesian"/"English" from marker-word counts, or None if no signal."""
|
| 91 |
+
tokens = re.findall(r"[a-z']+", text.lower())
|
| 92 |
+
id_hits = sum(1 for t in tokens if t in _ID_MARKERS)
|
| 93 |
+
en_hits = sum(1 for t in tokens if t in _EN_MARKERS)
|
| 94 |
+
if en_hits > id_hits:
|
| 95 |
+
return "English"
|
| 96 |
+
if id_hits > en_hits:
|
| 97 |
+
return "Indonesian"
|
| 98 |
+
return None
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def _detect_reply_language(
|
| 102 |
+
history: list[BaseMessage] | None,
|
| 103 |
+
message: str | None = None,
|
| 104 |
+
goal_texts: list[str] | None = None,
|
| 105 |
+
) -> str:
|
| 106 |
+
"""Detect the reply language deterministically (no LLM), by signal priority.
|
| 107 |
+
|
| 108 |
+
1. the user's turn — an explicit `message` (intent path) or the last human turn in
|
| 109 |
+
`history` (button path, where `message` is None);
|
| 110 |
+
2. the user-authored goal (`objective` + `business_questions`) — required at
|
| 111 |
+
onboarding, so it's always present and is a reliable signal on a fresh analysis
|
| 112 |
+
that has no chat yet;
|
| 113 |
+
3. the team default (Indonesian) — a safety net only, for a stub/legacy/empty-goal
|
| 114 |
+
state where nothing above yields a signal.
|
| 115 |
+
|
| 116 |
+
Returns "Indonesian" or "English".
|
| 117 |
+
"""
|
| 118 |
+
primary = (message or _last_human_text(history)).strip()
|
| 119 |
+
lang = _score_language(primary) if primary else None
|
| 120 |
+
if lang:
|
| 121 |
+
return lang
|
| 122 |
+
goal = " ".join(t for t in (goal_texts or []) if t).strip()
|
| 123 |
+
lang = _score_language(goal) if goal else None
|
| 124 |
+
if lang:
|
| 125 |
+
return lang
|
| 126 |
+
return _FALLBACK_LANGUAGE
|
| 127 |
|
| 128 |
|
| 129 |
@dataclass
|
|
|
|
| 180 |
state: AnalysisState,
|
| 181 |
report_ready: ReportReadiness,
|
| 182 |
available_actions: list[str],
|
| 183 |
+
reply_language: str = _FALLBACK_LANGUAGE,
|
| 184 |
) -> str:
|
| 185 |
+
"""Compose the deterministic context the prompt's 'never misguide' rule trusts.
|
| 186 |
+
|
| 187 |
+
`reply_language` is a hard directive: the prompt is told to reply ONLY in this
|
| 188 |
+
language, so the answer matches the user's language even on the button path (where
|
| 189 |
+
the synthetic human turn would otherwise pull the reply toward English).
|
| 190 |
+
"""
|
| 191 |
return "\n\n".join(
|
| 192 |
[
|
| 193 |
_format_state(state),
|
| 194 |
_format_report_ready(report_ready),
|
| 195 |
"[Available actions]\n" + ", ".join(available_actions),
|
| 196 |
+
f"[Reply language]\nRespond ONLY in: {reply_language}",
|
| 197 |
]
|
| 198 |
)
|
| 199 |
|
|
|
|
| 258 |
"""
|
| 259 |
readiness = report_ready or ReportReadiness()
|
| 260 |
actions = available_actions or _derive_available_actions(state, readiness)
|
| 261 |
+
goal_texts = [
|
| 262 |
+
getattr(state, "objective", "") or "",
|
| 263 |
+
*(getattr(state, "business_questions", None) or []),
|
| 264 |
+
]
|
| 265 |
+
reply_language = _detect_reply_language(history, message, goal_texts=goal_texts)
|
| 266 |
logger.info(
|
| 267 |
"help guidance",
|
| 268 |
report_ready=readiness.ready,
|
| 269 |
available_actions=actions,
|
| 270 |
+
reply_language=reply_language,
|
| 271 |
)
|
| 272 |
|
| 273 |
chain = self._ensure_chain()
|
| 274 |
+
default_trigger = _DEFAULT_TRIGGERS.get(
|
| 275 |
+
reply_language, _DEFAULT_TRIGGERS[_FALLBACK_LANGUAGE]
|
| 276 |
+
)
|
| 277 |
payload: dict[str, Any] = {
|
| 278 |
+
"message": message or default_trigger,
|
| 279 |
"history": history or [],
|
| 280 |
+
"context": _build_context_block(state, readiness, actions, reply_language),
|
| 281 |
}
|
| 282 |
if callbacks:
|
| 283 |
async for token in chain.astream(payload, config={"callbacks": callbacks}):
|
src/agents/planner/inputs.py
CHANGED
|
@@ -31,11 +31,24 @@ class ColumnSummary(BaseModel):
|
|
| 31 |
top_values: list[Any] | None = None
|
| 32 |
|
| 33 |
|
| 34 |
class TableSummary(BaseModel):
|
| 35 |
table_id: str
|
| 36 |
name: str
|
| 37 |
row_count: int | None = None
|
| 38 |
columns: list[ColumnSummary] = Field(default_factory=list)
|
|
|
|
| 39 |
|
| 40 |
|
| 41 |
class StructuredSourceSummary(BaseModel):
|
|
@@ -89,6 +102,16 @@ class CatalogSummary(BaseModel):
|
|
| 89 |
)
|
| 90 |
for col in table.columns
|
| 91 |
],
|
| 92 |
)
|
| 93 |
for table in source.tables
|
| 94 |
]
|
|
@@ -111,6 +134,12 @@ class CatalogSummary(BaseModel):
|
|
| 111 |
lines: list[str] = []
|
| 112 |
for source in self.structured_sources:
|
| 113 |
lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
|
| 114 |
for table in source.tables:
|
| 115 |
rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
|
| 116 |
lines.append(f" Table: {table.name}{rc} — id={table.table_id}")
|
|
@@ -121,6 +150,16 @@ class CatalogSummary(BaseModel):
|
|
| 121 |
f" - {col.name} [{col.data_type}]: "
|
| 122 |
f"samples={samples}{top} — id={col.column_id}"
|
| 123 |
)
|
| 124 |
lines.append("")
|
| 125 |
|
| 126 |
if self.unstructured_sources:
|
|
|
|
| 31 |
top_values: list[Any] | None = None
|
| 32 |
|
| 33 |
|
| 34 |
+
class ForeignKeySummary(BaseModel):
|
| 35 |
+
"""A declared FK edge — the only joins the IR validator accepts.
|
| 36 |
+
|
| 37 |
+
Maps directly onto a `retrieve_data` IR join: `column_id` → `left_column_id`,
|
| 38 |
+
`target_table_id` → `target_table_id`, `target_column_id` → `right_column_id`.
|
| 39 |
+
"""
|
| 40 |
+
|
| 41 |
+
column_id: str
|
| 42 |
+
target_table_id: str
|
| 43 |
+
target_column_id: str
|
| 44 |
+
|
| 45 |
+
|
| 46 |
class TableSummary(BaseModel):
|
| 47 |
table_id: str
|
| 48 |
name: str
|
| 49 |
row_count: int | None = None
|
| 50 |
columns: list[ColumnSummary] = Field(default_factory=list)
|
| 51 |
+
foreign_keys: list[ForeignKeySummary] = Field(default_factory=list)
|
| 52 |
|
| 53 |
|
| 54 |
class StructuredSourceSummary(BaseModel):
|
|
|
|
| 102 |
)
|
| 103 |
for col in table.columns
|
| 104 |
],
|
| 105 |
+
# The declared FKs — the only joins the validator accepts. FKs
|
| 106 |
+
# carry no PII (ids only), so they're always surfaced.
|
| 107 |
+
foreign_keys=[
|
| 108 |
+
ForeignKeySummary(
|
| 109 |
+
column_id=fk.column_id,
|
| 110 |
+
target_table_id=fk.target_table_id,
|
| 111 |
+
target_column_id=fk.target_column_id,
|
| 112 |
+
)
|
| 113 |
+
for fk in table.foreign_keys
|
| 114 |
+
],
|
| 115 |
)
|
| 116 |
for table in source.tables
|
| 117 |
]
|
|
|
|
| 134 |
lines: list[str] = []
|
| 135 |
for source in self.structured_sources:
|
| 136 |
lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
|
| 137 |
+
# Name lookups (within a source) so FK edges render with readable
|
| 138 |
+
# table/column names alongside the ids the IR join must copy verbatim.
|
| 139 |
+
table_name_by_id = {t.table_id: t.name for t in source.tables}
|
| 140 |
+
col_name_by_id = {
|
| 141 |
+
c.column_id: c.name for t in source.tables for c in t.columns
|
| 142 |
+
}
|
| 143 |
for table in source.tables:
|
| 144 |
rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
|
| 145 |
lines.append(f" Table: {table.name}{rc} — id={table.table_id}")
|
|
|
|
| 150 |
f" - {col.name} [{col.data_type}]: "
|
| 151 |
f"samples={samples}{top} — id={col.column_id}"
|
| 152 |
)
|
| 153 |
+
for fk in table.foreign_keys:
|
| 154 |
+
tgt_table = table_name_by_id.get(fk.target_table_id, fk.target_table_id)
|
| 155 |
+
tgt_col = col_name_by_id.get(fk.target_column_id, fk.target_column_id)
|
| 156 |
+
src_col = col_name_by_id.get(fk.column_id, fk.column_id)
|
| 157 |
+
lines.append(
|
| 158 |
+
f" FK: {src_col} → {tgt_table}.{tgt_col} "
|
| 159 |
+
f"(join: target_table_id={fk.target_table_id}, "
|
| 160 |
+
f"left_column_id={fk.column_id}, "
|
| 161 |
+
f"right_column_id={fk.target_column_id})"
|
| 162 |
+
)
|
| 163 |
lines.append("")
|
| 164 |
|
| 165 |
if self.unstructured_sources:
|
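The FK rendering added to `CatalogSummary` above pairs readable names with the raw ids that the IR join has to copy verbatim. A minimal sketch of one rendered line follows, using a plain dataclass in place of the Pydantic `ForeignKeySummary`; the `FK` class, the `render_fk_line` helper, and the example ids are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FK:
    """Stand-in for ForeignKeySummary (same three id fields)."""
    column_id: str
    target_table_id: str
    target_column_id: str


def render_fk_line(fk: FK, table_names: dict[str, str], column_names: dict[str, str]) -> str:
    # Fall back to the raw id when a name is unknown, as the diff does with .get(id, id).
    src_col = column_names.get(fk.column_id, fk.column_id)
    tgt_table = table_names.get(fk.target_table_id, fk.target_table_id)
    tgt_col = column_names.get(fk.target_column_id, fk.target_column_id)
    return (
        f"FK: {src_col} → {tgt_table}.{tgt_col} "
        f"(join: target_table_id={fk.target_table_id}, "
        f"left_column_id={fk.column_id}, "
        f"right_column_id={fk.target_column_id})"
    )


fk = FK("c_orders_customer_id", "t_customers", "c_customers_id")
line = render_fk_line(
    fk,
    table_names={"t_customers": "customers"},
    column_names={"c_orders_customer_id": "customer_id", "c_customers_id": "id"},
)
print(line)
```

Keeping the ids inline means the planner never has to map a name back to an id, which matters because the validator is described as doing a literal id lookup.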
src/catalog/fk_inference.py
ADDED
|
@@ -0,0 +1,97 @@
|
|
| 1 |
+
"""Heuristic foreign-key inference for catalogs that ship no declared FKs.
|
| 2 |
+
|
| 3 |
+
The dedorch catalog (written by Go's introspection) currently carries **no**
|
| 4 |
+
`foreign_keys`, so the FK-backed-joins-only IR validator rejects every join the
|
| 5 |
+
planner proposes — cross-table questions ("revenue by product") can't run even
|
| 6 |
+
though the planner picks the right columns. Until Go captures real FK
|
| 7 |
+
constraints, we infer the obvious relational edges from naming conventions so the
|
| 8 |
+
planner and the validator agree on the same catalog.
|
| 9 |
+
|
| 10 |
+
Conservative by design (a wrong edge would silently corrupt joined results):
|
| 11 |
+
- `schema` (database) sources only — joins are DB-only anyway
|
| 12 |
+
- a foreign key is only inferred from a column named ``<base>_id``
|
| 13 |
+
- the target must be the SINGLE other table whose name matches ``<base>``
|
| 14 |
+
(singular/plural) and exposes an ``id`` column of the SAME data_type
|
| 15 |
+
- ambiguous matches (0 or >1 candidate tables) are skipped, never guessed
|
| 16 |
+
- sources that already declare ANY foreign key are left untouched (trust Go)
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
import re
|
| 22 |
+
|
| 23 |
+
from src.catalog.models import ForeignKey, Source
|
| 24 |
+
from src.middlewares.logging import get_logger
|
| 25 |
+
|
| 26 |
+
from .models import Catalog
|
| 27 |
+
|
| 28 |
+
logger = get_logger("fk_inference")
|
| 29 |
+
|
| 30 |
+
# `<base>_id` — the conventional foreign-key column name (base must be non-empty).
|
| 31 |
+
_ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def _table_matches_base(table_name: str, base: str) -> bool:
|
| 35 |
+
"""Whether `table_name` is the table `<base>` refers to (singular/plural)."""
|
| 36 |
+
n = table_name.lower()
|
| 37 |
+
b = base.lower()
|
| 38 |
+
# Exact match, the `-es` plural (`boxes`↔`box`), or any name ending in `<base>s`
|
| 39 |
+
# (`orders`↔`order`, `products`↔`product`, `sales_agents`↔`agent`).
|
| 40 |
+
return n == b or n == b + "es" or n.endswith(b + "s")
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _infer_source(source: Source) -> int:
|
| 44 |
+
"""Add inferred FK edges to one source's tables in place; return the count."""
|
| 45 |
+
added = 0
|
| 46 |
+
for table in source.tables:
|
| 47 |
+
for col in table.columns:
|
| 48 |
+
m = _ID_COL.match(col.name)
|
| 49 |
+
if not m:
|
| 50 |
+
continue
|
| 51 |
+
base = m.group("base")
|
| 52 |
+
candidates: list[tuple[str, str]] = [] # (target_table_id, target_column_id)
|
| 53 |
+
for tgt in source.tables:
|
| 54 |
+
if tgt.table_id == table.table_id:
|
| 55 |
+
continue
|
| 56 |
+
if not _table_matches_base(tgt.name, base):
|
| 57 |
+
continue
|
| 58 |
+
id_col = next(
|
| 59 |
+
(
|
| 60 |
+
c
|
| 61 |
+
for c in tgt.columns
|
| 62 |
+
if c.name.lower() == "id" and c.data_type == col.data_type
|
| 63 |
+
),
|
| 64 |
+
None,
|
| 65 |
+
)
|
| 66 |
+
if id_col is not None:
|
| 67 |
+
candidates.append((tgt.table_id, id_col.column_id))
|
| 68 |
+
# Only act on an unambiguous single match — never guess between many.
|
| 69 |
+
if len(candidates) != 1:
|
| 70 |
+
continue
|
| 71 |
+
target_table_id, target_column_id = candidates[0]
|
| 72 |
+
table.foreign_keys.append(
|
| 73 |
+
ForeignKey(
|
| 74 |
+
column_id=col.column_id,
|
| 75 |
+
target_table_id=target_table_id,
|
| 76 |
+
target_column_id=target_column_id,
|
| 77 |
+
)
|
| 78 |
+
)
|
| 79 |
+
added += 1
|
| 80 |
+
return added
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def infer_foreign_keys(catalog: Catalog) -> Catalog:
|
| 84 |
+
"""Infer FK edges in place for schema sources that declare none. Returns `catalog`.
|
| 85 |
+
|
| 86 |
+
Sources that already carry any declared FK are left as-is (Go's real FKs win).
|
| 87 |
+
"""
|
| 88 |
+
total = 0
|
| 89 |
+
for source in catalog.sources:
|
| 90 |
+
if source.source_type != "schema":
|
| 91 |
+
continue
|
| 92 |
+
if any(t.foreign_keys for t in source.tables):
|
| 93 |
+
continue # real FKs present — trust them, infer nothing
|
| 94 |
+
total += _infer_source(source)
|
| 95 |
+
if total:
|
| 96 |
+
logger.info("inferred foreign keys", user_id=catalog.user_id, count=total)
|
| 97 |
+
return catalog
|
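To make the conservative rules in `fk_inference.py` concrete, here is a reduced sketch over plain dictionaries instead of the catalog models. It assumes a toy shape of `{table_name: {column_name: data_type}}`; `infer_edges` is a hypothetical helper, while the regex and the name-matching expression mirror the ones in the diff.

```python
import re

ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)


def table_matches_base(table_name: str, base: str) -> bool:
    n, b = table_name.lower(), base.lower()
    return n == b or n == b + "es" or n.endswith(b + "s")


def infer_edges(tables: dict[str, dict[str, str]]) -> list[tuple[str, str, str]]:
    """Return (table, column, target_table) for every unambiguous <base>_id column."""
    edges = []
    for name, cols in tables.items():
        for col, dtype in cols.items():
            m = ID_COL.match(col)
            if not m:
                continue
            base = m.group("base")
            # A candidate is another table whose name matches and whose `id`
            # column has the same data type as the referencing column.
            candidates = [
                tgt
                for tgt, tgt_cols in tables.items()
                if tgt != name
                and table_matches_base(tgt, base)
                and any(c.lower() == "id" and t == dtype for c, t in tgt_cols.items())
            ]
            # Zero or several candidates: skip rather than guess.
            if len(candidates) == 1:
                edges.append((name, col, candidates[0]))
    return edges


shop = {
    "orders": {"id": "integer", "customer_id": "integer", "product_id": "integer"},
    "customers": {"id": "integer"},
    "products": {"id": "uuid"},  # type mismatch, so product_id is not linked
}
ambiguous = {
    "leads": {"agent_id": "integer"},
    "agents": {"id": "integer"},
    "sales_agents": {"id": "integer"},  # also ends in "agents", so two candidates
}
print(infer_edges(shop))
print(infer_edges(ambiguous))
```

The second catalog shows the cost of the suffix match: because `sales_agents` and `agents` both qualify for `agent_id`, no edge is inferred at all, which is the intended trade of recall for safety.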
src/catalog/render.py
CHANGED
|
@@ -65,5 +65,11 @@ def render_source(source: Source) -> str:
|
|
| 65 |
tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
|
| 66 |
fk.target_column_id, fk.target_column_id
|
| 67 |
)
|
| 68 |
-
|
| 69 |
return "\n".join(lines)
|
|
|
|
| 65 |
tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
|
| 66 |
fk.target_column_id, fk.target_column_id
|
| 67 |
)
|
| 68 |
+
# Include the join ids inline — the planner must copy these verbatim
|
| 69 |
+
# into the IR join, and the IRValidator does a literal id lookup.
|
| 70 |
+
lines.append(
|
| 71 |
+
f" - {src_col_name} -> {tgt_table_name}.{tgt_col_name} "
|
| 72 |
+
f"(join: target_table_id={fk.target_table_id}, "
|
| 73 |
+
f"left_column_id={fk.column_id}, right_column_id={fk.target_column_id})"
|
| 74 |
+
)
|
| 75 |
return "\n".join(lines)
|
src/catalog/store.py
CHANGED
|
@@ -1,7 +1,9 @@
|
|
| 1 |
-
"""CatalogStore —
|
| 2 |
|
| 3 |
-
Storage shape: one row per
|
| 4 |
-
(user_id
|
| 5 |
"""
|
| 6 |
|
| 7 |
from sqlalchemy import case, delete, func, select
|
|
@@ -11,6 +13,7 @@ from src.db.postgres.connection import AsyncSessionLocal
|
|
| 11 |
from src.db.postgres.models import Catalog as CatalogRow
|
| 12 |
from src.middlewares.logging import get_logger
|
| 13 |
|
|
|
|
| 14 |
from .models import Catalog
|
| 15 |
|
| 16 |
logger = get_logger("catalog_store")
|
|
@@ -27,30 +30,43 @@ class CatalogStore:
|
|
| 27 |
async def get(self, user_id: str) -> Catalog | None:
|
| 28 |
async with AsyncSessionLocal() as session:
|
| 29 |
result = await session.execute(
|
| 30 |
-
select(CatalogRow.
|
| 31 |
)
|
| 32 |
row = result.scalar_one_or_none()
|
| 33 |
if row is None:
|
| 34 |
return None
|
| 35 |
-
|
| 36 |
|
| 37 |
async def upsert(self, catalog: Catalog) -> None:
|
| 38 |
payload = catalog.model_dump(mode="json")
|
| 39 |
async with AsyncSessionLocal() as session:
|
| 40 |
stmt = insert(CatalogRow).values(
|
|
|
|
| 41 |
user_id=catalog.user_id,
|
| 42 |
-
|
| 43 |
schema_version=catalog.schema_version,
|
| 44 |
generated_at=catalog.generated_at,
|
| 45 |
updated_at=func.now(),
|
| 46 |
)
|
| 47 |
stmt = stmt.on_conflict_do_update(
|
| 48 |
index_elements=[CatalogRow.user_id],
|
|
|
|
| 49 |
set_={
|
| 50 |
-
"
|
| 51 |
"schema_version": stmt.excluded.schema_version,
|
| 52 |
"updated_at": case(
|
| 53 |
-
(
|
| 54 |
else_=CatalogRow.updated_at,
|
| 55 |
),
|
| 56 |
},
|
|
|
|
| 1 |
+
"""CatalogStore — reads the per-user catalog from the dedorch `data_catalog` table.
|
| 2 |
|
| 3 |
+
Storage shape (Go-owned): one row per scope in `data_catalog`
|
| 4 |
+
(id, scope_type, user_id, analysis_id, catalog_payload jsonb, schema_version,
|
| 5 |
+
generated_at, updated_at). Python reads the user-scoped row (scope_type='user');
|
| 6 |
+
Go's `catalog.Service` owns all writes, so `upsert`/`remove_source` are legacy.
|
| 7 |
"""
|
| 8 |
|
| 9 |
from sqlalchemy import case, delete, func, select
|
|
|
|
| 13 |
from src.db.postgres.models import Catalog as CatalogRow
|
| 14 |
from src.middlewares.logging import get_logger
|
| 15 |
|
| 16 |
+
from .fk_inference import infer_foreign_keys
|
| 17 |
from .models import Catalog
|
| 18 |
|
| 19 |
logger = get_logger("catalog_store")
|
|
|
|
| 30 |
async def get(self, user_id: str) -> Catalog | None:
|
| 31 |
async with AsyncSessionLocal() as session:
|
| 32 |
result = await session.execute(
|
| 33 |
+
select(CatalogRow.catalog_payload).where(
|
| 34 |
+
CatalogRow.user_id == user_id,
|
| 35 |
+
CatalogRow.scope_type == "user",
|
| 36 |
+
)
|
| 37 |
)
|
| 38 |
row = result.scalar_one_or_none()
|
| 39 |
if row is None:
|
| 40 |
return None
|
| 41 |
+
# dedorch catalogs ship no foreign_keys (Go introspection drops them),
|
| 42 |
+
# but the IR validator only allows FK-backed joins. Infer the obvious
|
| 43 |
+
# edges so the planner and validator agree. No-op once Go emits real FKs.
|
| 44 |
+
return infer_foreign_keys(Catalog.model_validate(row))
|
| 45 |
|
| 46 |
async def upsert(self, catalog: Catalog) -> None:
|
| 47 |
+
# Legacy: Go's catalog.Service owns catalog writes now. Kept working (and
|
| 48 |
+
# reconciled to the dedorch shape) but no longer on any live Python path.
|
| 49 |
payload = catalog.model_dump(mode="json")
|
| 50 |
async with AsyncSessionLocal() as session:
|
| 51 |
stmt = insert(CatalogRow).values(
|
| 52 |
+
scope_type="user",
|
| 53 |
user_id=catalog.user_id,
|
| 54 |
+
catalog_payload=payload,
|
| 55 |
schema_version=catalog.schema_version,
|
| 56 |
generated_at=catalog.generated_at,
|
| 57 |
updated_at=func.now(),
|
| 58 |
)
|
| 59 |
stmt = stmt.on_conflict_do_update(
|
| 60 |
index_elements=[CatalogRow.user_id],
|
| 61 |
+
index_where=CatalogRow.scope_type == "user",
|
| 62 |
set_={
|
| 63 |
+
"catalog_payload": stmt.excluded.catalog_payload,
|
| 64 |
"schema_version": stmt.excluded.schema_version,
|
| 65 |
"updated_at": case(
|
| 66 |
+
(
|
| 67 |
+
stmt.excluded.catalog_payload != CatalogRow.catalog_payload,
|
| 68 |
+
func.now(),
|
| 69 |
+
),
|
| 70 |
else_=CatalogRow.updated_at,
|
| 71 |
),
|
| 72 |
},
|
src/config/prompts/help.md
CHANGED
|
@@ -1,8 +1,14 @@
|
|
| 1 |
-
<!-- help.md ·
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
|
| 7 |
You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
|
| 8 |
instruction sheet that comes with a board game: your only job is to tell the user
|
|
@@ -23,6 +29,7 @@ You are given context, never raw user prose to analyze:
|
|
| 23 |
- `ready` (bool) — whether there is enough analysis to generate a report.
|
| 24 |
- `missing` (list) — if not ready, the gaps to fill.
|
| 25 |
- **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
|
|
|
|
| 26 |
|
| 27 |
> **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
|
| 28 |
> own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
|
|
@@ -72,8 +79,13 @@ Do not over-promise the report's depth.
|
|
| 72 |
## Tone
|
| 73 |
|
| 74 |
Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
|
| 75 |
-
spam, no overselling.
|
| 76 |
-
|
| 77 |
|
| 78 |
## Constraints
|
| 79 |
|
|
@@ -86,15 +98,21 @@ English). A few sentences is usually enough.
|
|
| 86 |
## Examples
|
| 87 |
|
| 88 |
```
|
| 89 |
-
|
| 90 |
→ "Your goal is set — you can start exploring now. Try a basic question first, like
|
| 91 |
'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
|
| 92 |
what's driving your objective."
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
|
| 97 |
|
|
|
|
| 98 |
State: report_ready.ready=true
|
| 99 |
→ "You've covered enough to summarize. You can generate your report now — run /report
|
| 100 |
or use the report option to create it."
|
|
|
|
| 1 |
+
<!-- help.md · v3 · Help skill prompt.
|
| 2 |
+
v2 (2026-06-24, KM-652): removed the problem_statement skill + the problem_validated gate —
|
| 3 |
+
the goal (objective + business_questions) is now set in the New Analysis form at onboarding,
|
| 4 |
+
so Help no longer steers users to define/validate a goal in chat.
|
| 5 |
+
v3 (2026-07-02): (a) reply language is now a hard rule driven by the [Reply language]
|
| 6 |
+
directive (the button path was defaulting to English); (b) Examples got stable ids
|
| 7 |
+
("id: ..." comment above each) so eval/help can mirror them as carried_over regression
|
| 8 |
+
cases, and the second example now uses a REAL `missing` value from report/readiness.py —
|
| 9 |
+
the old "no comparison over time" string is never emitted by is_report_ready.
|
| 10 |
+
Bump to v4 (don't silently overwrite) on the next major change (e.g. real UI steps from
|
| 11 |
+
the frontend). -->
|
| 12 |
|
| 13 |
You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
|
| 14 |
instruction sheet that comes with a board game: your only job is to tell the user
|
|
|
|
| 29 |
- `ready` (bool) — whether there is enough analysis to generate a report.
|
| 30 |
- `missing` (list) — if not ready, the gaps to fill.
|
| 31 |
- **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
|
| 32 |
+
- **`[Reply language]`** — the language you MUST reply in (detected deterministically from the user's last turn). This is an instruction, not a suggestion — see the hard rule below.
|
| 33 |
|
| 34 |
> **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
|
| 35 |
> own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
|
|
|
|
| 79 |
## Tone
|
| 80 |
|
| 81 |
Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
|
| 82 |
+
spam, no overselling. A few sentences is usually enough.
|
| 83 |
+
|
| 84 |
+
> **Hard rule — reply language.** Reply **only** in the language named in `[Reply language]`.
|
| 85 |
+
> This is mandatory and overrides the language of this prompt, its examples, and the trigger
|
| 86 |
+
> question. If `[Reply language]` says `Indonesian`, answer entirely in Indonesian even though
|
| 87 |
+
> these instructions are in English; if it says `English`, answer in English. Never mix
|
| 88 |
+
> languages or switch mid-reply.
|
| 89 |
|
| 90 |
## Constraints
|
| 91 |
|
|
|
|
| 98 |
## Examples
|
| 99 |
|
| 100 |
```
|
| 101 |
+
<!-- id: help_ex_orient -->
|
| 102 |
+
State: objective="understand monthly sales performance",
|
| 103 |
+
business_questions=["which products drive revenue?"],
|
| 104 |
+
chat_history empty, report_ready.ready=false, missing=["at least one completed analysis"]
|
| 105 |
→ "Your goal is set — you can start exploring now. Try a basic question first, like
|
| 106 |
'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
|
| 107 |
what's driving your objective."
|
| 108 |
|
| 109 |
+
<!-- id: help_ex_guard_delta -->
|
| 110 |
+
State: report_ready.ready=false, missing=["a new analysis since the last report"]
|
| 111 |
+
→ "You already have a report, and nothing new has come in since. Ask something that builds
|
| 112 |
+
on your objective — a fresh cut, a new time period, or a different angle — and we can
|
| 113 |
+
regenerate the report with that."
|
| 114 |
|
| 115 |
+
<!-- id: help_ex_guard_ready -->
|
| 116 |
State: report_ready.ready=true
|
| 117 |
→ "You've covered enough to summarize. You can generate your report now — run /report
|
| 118 |
or use the report option to create it."
|
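Reviewer note on the v3 header comment above: the stable `<!-- id: ... -->` markers are what let `eval/help` mirror the prompt examples as regression cases. A minimal sketch of how such ids could be collected and compared against a dataset follows. The helper names `extract_example_ids` and `missing_cases` are hypothetical and are not taken from `eval/help/run_eval.py`, whose contents are not shown in this hunk.

```python
import re

# Matches only comments of the form <!-- id: some_id -->, so the version
# header comment (which starts with "<!-- help.md") is not picked up.
EXAMPLE_ID = re.compile(r"<!--\s*id:\s*([A-Za-z0-9_]+)\s*-->")


def extract_example_ids(prompt_text: str) -> list[str]:
    """Return example ids in the order they appear in the prompt."""
    return EXAMPLE_ID.findall(prompt_text)


def missing_cases(prompt_text: str, dataset_ids: list[str]) -> list[str]:
    """Ids present in the prompt but absent from the eval dataset."""
    known = set(dataset_ids)
    return [i for i in extract_example_ids(prompt_text) if i not in known]


# Sample text shaped like the Examples section of help.md.
sample = """
<!-- help.md v3 header, not an example id -->
<!-- id: help_ex_orient -->
State: report_ready.ready=false
<!-- id: help_ex_guard_delta -->
State: report_ready.ready=false
<!-- id: help_ex_guard_ready -->
State: report_ready.ready=true
"""

ids = extract_example_ids(sample)
gaps = missing_cases(sample, ["help_ex_orient"])
```

A check like this would flag a prompt example that was added or renamed without a matching eval case.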
src/config/prompts/planner.md
CHANGED
|
@@ -41,15 +41,20 @@ only a `TaskList` object that conforms to the provided schema.
|
|
| 41 |
(referencing the upstream result's column aliases).
|
| 42 |
- **Measure by a dimension in another table (joins).** When the number you are
|
| 43 |
aggregating and the grouping dimension live in DIFFERENT tables of the same
|
| 44 |
-
database source, add a `joins` entry to the `retrieve_data` IR
|
| 45 |
-
key
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
|
| 53 |
- **Mixing structured + unstructured.** If qualitative context helps, add a
|
| 54 |
`retrieve_knowledge` task against an unstructured source listed in the catalog.
|
| 55 |
- **CRISP-DM stages.** Tag each task with the stage it serves:
|
|
|
|
| 41 |
(referencing the upstream result's column aliases).
|
| 42 |
- **Measure by a dimension in another table (joins).** When the number you are
|
| 43 |
aggregating and the grouping dimension live in DIFFERENT tables of the same
|
| 44 |
+
database source, add a `joins` entry to the `retrieve_data` IR. **Join ONLY on a
|
| 45 |
+
foreign key listed in the catalog.** Each joinable relationship appears as an
|
| 46 |
+
`FK:` line under its table, e.g.
|
| 47 |
+
`FK: product_id → products.id (join: target_table_id=t_products, left_column_id=c_oi_product_id, right_column_id=c_products_id)`
|
| 48 |
+
— copy those three ids verbatim into the join (`target_table_id`,
|
| 49 |
+
`left_column_id`, `right_column_id`). Example — "revenue by category": the measure
|
| 50 |
+
`order_items.line_total` joined to `products` on `order_items.product_id =
|
| 51 |
+
products.id`, grouped by `products.category`. **If no `FK:` line links the tables
|
| 52 |
+
you need, do NOT invent a join** — the validator rejects any join that isn't a
|
| 53 |
+
declared FK. Instead use a single table when the measure and dimension already
|
| 54 |
+
live together (e.g. "revenue by region" from `orders.region` +
|
| 55 |
+
`orders.total_amount`); if they genuinely aren't linked, say the data isn't
|
| 56 |
+
connected rather than guessing. Prefer an existing measure column over
|
| 57 |
+
recomputing. Joins are database-only — not available for tabular/file sources.
|
| 58 |
- **Mixing structured + unstructured.** If qualitative context helps, add a
|
| 59 |
`retrieve_knowledge` task against an unstructured source listed in the catalog.
|
| 60 |
- **CRISP-DM stages.** Tag each task with the stage it serves:
|
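Reviewer note on the joins rule above: the prompt says the validator rejects any join that is not a declared FK. The sketch below shows that check in its simplest form, under the assumption that a join is identified by exactly the three ids the prompt asks the planner to copy. The `Join` dataclass and `is_declared_fk` are illustrative names only; the project's actual validator is not part of this hunk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Join:
    """The three ids the planner is told to copy verbatim from an FK: line."""
    target_table_id: str
    left_column_id: str
    right_column_id: str


def is_declared_fk(join: Join, declared_fks: set[Join]) -> bool:
    """Accept a join only if it matches a declared FK exactly."""
    return join in declared_fks


# The single FK from the prompt's example: order_items.product_id -> products.id
declared = {Join("t_products", "c_oi_product_id", "c_products_id")}

copied = Join("t_products", "c_oi_product_id", "c_products_id")
# Hypothetical invented join: a plausible-looking link with no FK: line behind it.
invented = Join("t_products", "c_orders_region", "c_products_id")

copied_ok = is_declared_fk(copied, declared)
invented_ok = is_declared_fk(invented, declared)
```

Exact-match comparison is why the prompt insists on copying the ids verbatim: any paraphrased or guessed id fails the membership test.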
src/config/settings.py
CHANGED
|
@@ -30,6 +30,12 @@ class Settings(BaseSettings):
|
|
| 30 |
# to avoid .env churn; remove once no environment references it.
|
| 31 |
enable_gate: bool = Field(alias="enable_gate", default=False)
|
| 32 |
|
| 33 |
# Database
|
| 34 |
postgres_connstring: str
|
| 35 |
|
|
|
|
| 30 |
# to avoid .env churn; remove once no environment references it.
|
| 31 |
enable_gate: bool = Field(alias="enable_gate", default=False)
|
| 32 |
|
| 33 |
+
# Skip init_db() (create_all + startup DDL) on boot. TRUE by default post-dedorch
|
| 34 |
+
# cutover: Go owns the dedorch schema, so Python (consumer-only role) must NOT run
|
| 35 |
+
# init_db — its ALTER/index DDL on Go-owned tables fails with InsufficientPrivilege
|
| 36 |
+
# ("must be owner of table rooms"). Set to false only for a local Python-owned DB.
|
| 37 |
+
skip_init_db: bool = Field(alias="SKIP_INIT_DB", default=True)
|
| 38 |
+
|
| 39 |
# Database
|
| 40 |
postgres_connstring: str
|
| 41 |
|
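Reviewer note on `skip_init_db`: the comment describes a default-on skip that is only disabled for a local Python-owned database. A stdlib-only sketch of that behavior follows. The real setting is a pydantic `Settings` field, and how `main.py` consumes it is not shown in this hunk, so `read_skip_init_db`, `startup`, and the accepted string values are assumptions made for illustration.

```python
import os
from typing import Callable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}
FALSY = {"0", "false", "no", "off"}


def read_skip_init_db(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read SKIP_INIT_DB, defaulting to True when the variable is unset."""
    env = os.environ if env is None else env
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        return True  # consumer-only role: do not touch Go-owned tables
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"SKIP_INIT_DB must be a boolean, got {raw!r}")


def startup(env: Mapping[str, str], init_db: Callable[[], None]) -> bool:
    """Run init_db only when the skip flag is off; report whether it ran."""
    if read_skip_init_db(env):
        return False
    init_db()
    return True


calls: list[str] = []
ran_default = startup({}, lambda: calls.append("init"))
ran_local = startup({"SKIP_INIT_DB": "false"}, lambda: calls.append("init"))
```

With the default in place, an environment that never sets the variable cannot trigger the `InsufficientPrivilege` failure described in the comment.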
src/db/postgres/models.py
CHANGED
|
@@ -6,9 +6,11 @@ from sqlalchemy import (
|
|
| 6 |
Column,
|
| 7 |
DateTime,
|
| 8 |
ForeignKey,
|
| 9 |
Integer,
|
| 10 |
String,
|
| 11 |
Text,
|
| 12 |
)
|
| 13 |
from sqlalchemy.dialects.postgresql import JSONB, UUID
|
| 14 |
from sqlalchemy.orm import relationship
|
|
@@ -108,23 +110,44 @@ class DatabaseClient(Base):
|
|
| 108 |
|
| 109 |
|
| 110 |
class Catalog(Base):
|
| 111 |
-
"""
|
| 112 |
|
| 113 |
-
`
|
| 114 |
-
|
| 115 |
-
`
|
| 116 |
|
| 117 |
-
|
| 118 |
-
|
| 119 |
"""
|
| 120 |
__tablename__ = "data_catalog"
|
| 121 |
|
| 122 |
-
|
| 123 |
-
|
| 124 |
schema_version = Column(String, nullable=False, default="1.0")
|
| 125 |
-
generated_at = Column(DateTime(timezone=True), server_default=func.now())
|
| 126 |
updated_at = Column(DateTime(timezone=True), onupdate=func.now())
|
| 127 |
|
| 128 |
|
| 129 |
class ReportInputRow(Base):
|
| 130 |
"""One row per completed slow-path analysis (the report's source of truth).
|
|
|
|
| 6 |
Column,
|
| 7 |
DateTime,
|
| 8 |
ForeignKey,
|
| 9 |
+
Index,
|
| 10 |
Integer,
|
| 11 |
String,
|
| 12 |
Text,
|
| 13 |
+
text,
|
| 14 |
)
|
| 15 |
from sqlalchemy.dialects.postgresql import JSONB, UUID
|
| 16 |
from sqlalchemy.orm import relationship
|
|
|
|
| 110 |
|
| 111 |
|
| 112 |
class Catalog(Base):
|
| 113 |
+
"""Data catalog — dedorch **`data_catalog`** (Go-owned; reconciled 2026-07-01).
|
| 114 |
|
| 115 |
+
Mirrors Go migration `0001`/`0002`. One jsonb `catalog_payload` per scope:
|
| 116 |
+
`scope_type='user'` rows are keyed by `user_id` (partial unique index),
|
| 117 |
+
`scope_type='analysis'` rows by `analysis_id`. Python is **consumer-only** —
|
| 118 |
+
Go's `catalog.Service` owns all writes (DB/file ingestion); `CatalogStore`
|
| 119 |
+
reads the user-scoped catalog and its write methods are legacy.
|
| 120 |
|
| 121 |
+
`catalog_payload` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
|
| 122 |
+
serialized via `model_dump(mode="json")`; the read path rehydrates with
|
| 123 |
+
`Catalog.model_validate(...)`. Go writes the same shape (json tags match).
|
| 124 |
"""
|
| 125 |
__tablename__ = "data_catalog"
|
| 126 |
|
| 127 |
+
id = Column(UUID(as_uuid=False), primary_key=True, default=lambda: str(uuid4()))
|
| 128 |
+
scope_type = Column(String, nullable=False, default="user") # 'user' | 'analysis'
|
| 129 |
+
user_id = Column(String, nullable=False, index=True)
|
| 130 |
+
analysis_id = Column(UUID(as_uuid=False), nullable=True)
|
| 131 |
+
catalog_payload = Column(JSONB, nullable=False)
|
| 132 |
schema_version = Column(String, nullable=False, default="1.0")
|
| 133 |
+
generated_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now())
|
| 134 |
updated_at = Column(DateTime(timezone=True), onupdate=func.now())
|
| 135 |
|
| 136 |
+
__table_args__ = (
|
| 137 |
+
Index(
|
| 138 |
+
"idx_data_catalog_user_scope",
|
| 139 |
+
"user_id",
|
| 140 |
+
unique=True,
|
| 141 |
+
postgresql_where=text("scope_type = 'user'"),
|
| 142 |
+
),
|
| 143 |
+
Index(
|
| 144 |
+
"idx_data_catalog_analysis_scope",
|
| 145 |
+
"analysis_id",
|
| 146 |
+
unique=True,
|
| 147 |
+
postgresql_where=text("scope_type = 'analysis'"),
|
| 148 |
+
),
|
| 149 |
+
)
|
| 150 |
+
|
| 151 |
|
| 152 |
class ReportInputRow(Base):
|
| 153 |
"""One row per completed slow-path analysis (the report's source of truth).
|
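Reviewer note on the two partial unique indexes in `Catalog.__table_args__`: each index enforces uniqueness only for rows of its own `scope_type`. The sketch below reproduces that behavior with the stdlib `sqlite3` module, which also supports partial indexes. SQLite is a stand-in for Postgres here, the table is reduced to the four columns that matter, and the `insert` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_catalog (
        id          TEXT PRIMARY KEY,
        scope_type  TEXT NOT NULL DEFAULT 'user',
        user_id     TEXT NOT NULL,
        analysis_id TEXT
    );
    -- One user-scoped catalog per user.
    CREATE UNIQUE INDEX idx_data_catalog_user_scope
        ON data_catalog (user_id) WHERE scope_type = 'user';
    -- One analysis-scoped catalog per analysis.
    CREATE UNIQUE INDEX idx_data_catalog_analysis_scope
        ON data_catalog (analysis_id) WHERE scope_type = 'analysis';
    """
)


def insert(row_id, scope_type, user_id, analysis_id=None) -> bool:
    """Insert a row; return False if a unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO data_catalog VALUES (?, ?, ?, ?)",
                (row_id, scope_type, user_id, analysis_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    insert("1", "user", "u1"),            # first user-scoped row for u1
    insert("2", "analysis", "u1", "a1"),  # same user, analysis scope: allowed
    insert("3", "analysis", "u1", "a2"),  # another analysis for u1: allowed
    insert("4", "user", "u1"),            # second user-scoped row: rejected
    insert("5", "analysis", "u2", "a1"),  # analysis a1 already has a row: rejected
]
```

A plain `unique=True` on `user_id` would have rejected rows 2 and 3 as well, which is why the index is restricted with `postgresql_where`.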
src/query/executor/db.py
CHANGED
|
@@ -121,7 +121,9 @@ class DbExecutor(BaseExecutor):
|
|
| 121 |
logger.error(
|
| 122 |
"db executor failed",
|
| 123 |
source_id=ir.source_id,
|
| 124 |
-
|
| 125 |
elapsed_ms=elapsed_ms,
|
| 126 |
)
|
| 127 |
return QueryResult(
|
|
@@ -235,7 +237,9 @@ class DbExecutor(BaseExecutor):
|
|
| 235 |
creds = decrypt_credentials_dict(client.credentials)
|
| 236 |
await asyncio.to_thread(cls._warm_sync, client_id, client.db_type, creds)
|
| 237 |
except Exception as exc: # noqa: BLE001 — best-effort warming
|
| 238 |
-
|
| 239 |
|
| 240 |
@staticmethod
|
| 241 |
def _warm_sync(client_id: str, db_type: str, creds: dict) -> None:
|
|
|
|
| 121 |
logger.error(
|
| 122 |
"db executor failed",
|
| 123 |
source_id=ir.source_id,
|
| 124 |
+
# repr, not str: some exceptions (e.g. Fernet InvalidToken) have an
|
| 125 |
+
# empty str(), which hides the real failure as error="".
|
| 126 |
+
error=repr(e),
|
| 127 |
elapsed_ms=elapsed_ms,
|
| 128 |
)
|
| 129 |
return QueryResult(
|
|
|
|
| 237 |
creds = decrypt_credentials_dict(client.credentials)
|
| 238 |
await asyncio.to_thread(cls._warm_sync, client_id, client.db_type, creds)
|
| 239 |
except Exception as exc: # noqa: BLE001 — best-effort warming
|
| 240 |
+
# repr, not str: empty-str exceptions (e.g. Fernet InvalidToken)
|
| 241 |
+
# would otherwise log as error="".
|
| 242 |
+
logger.info("prewarm skipped", source_id=source.source_id, error=repr(exc))
|
| 243 |
|
| 244 |
@staticmethod
|
| 245 |
def _warm_sync(client_id: str, db_type: str, creds: dict) -> None:
|
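Reviewer note on the `repr(e)` change in both hunks: an exception raised without a message has an empty `str()`, so logging `str(e)` yields `error=""`. The sketch below shows the difference using a locally defined stand-in class, not the real `cryptography.fernet.InvalidToken`, and a hypothetical `describe` helper.

```python
class InvalidToken(Exception):
    """Stand-in for an exception that is raised with no message."""


def describe(exc: BaseException) -> dict:
    """Return both renderings so they can be compared side by side."""
    return {"str": str(exc), "repr": repr(exc)}


try:
    raise InvalidToken()
except Exception as exc:  # same broad catch as the best-effort warming path
    empty_message = describe(exc)

try:
    raise ValueError("bad credentials")
except Exception as exc:
    with_message = describe(exc)
```

`repr` keeps the exception class name in both cases, so the log line stays informative whether or not a message was supplied.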