feat(kpis): per-tool counts, research engagement, surface split (#160)
* feat(kpis): per-tool counts, research engagement, surface split
Adds intra-session telemetry to the KPI rollup so the observatory
dashboard can answer "is the agent reaching for research?" and
"which tools dropped out of the mix?". Production data (Apr 21->27,
21k+ sessions) shows the sessions-with-research rate dropped from
~78% on Apr 21 to ~56% on Apr 27 — without per-tool counts in the
rollup, that signal was invisible to the dashboard.
Also resolves the long-standing drift between this repo's
scripts/build_kpis.py and the observatory's backend/build_kpis.py
(both pipelines write to smolagents/ml-intern-kpis; whichever ran
last dropped the other's columns). The observatory copy was the
superset; this PR brings scripts/ up to it. Going forward both
copies are byte-identical.
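A minimal sketch of how the byte-identical claim could be checked in CI. The helper names are hypothetical and nothing like this is part of the PR; it only uses the standard library.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copies_identical(first: Path, second: Path) -> bool:
    """True when both files contain exactly the same bytes."""
    return file_digest(first) == file_digest(second)
```

In practice the two arguments would be scripts/build_kpis.py and a checkout of the observatory's backend/build_kpis.py; where that checkout lives is an assumption left to the caller.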
Drift fix (incoming from observatory):
- cost_per_session_mean / _p50 / _p95
- tool_calls_total / _succeeded / _failed counts
- successful_sessions / errored_sessions / regenerated_sessions counts
- sandboxes_created / _cpu / _gpu
- pro_cta_by_source_json removed (was dropped by observatory; the
dashboard never charted it, so no consumer to break)
New fields (added in both copies via the parallel observatory PR):
- research_calls, sessions_with_research
- research_calls_per_session_p50/p95 (among sessions that did any)
- distinct_tools_per_session_p50/p95 (vocabulary breadth)
- tool_calls_per_session_p50/p95
- tool_calls_per_turn_p50/p95
- tool_calls_by_name_json, sessions_using_tool_json
- sessions_by_model_json (CLI/anthropic vs frontend/bedrock)
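The difference between the two per-tool JSON columns can be sketched as follows. This mirrors the loop in _aggregate from the diff, but the function name aggregate_tool_columns is hypothetical and the input rows are reduced to the one private field that matters here.

```python
import json
from collections import defaultdict


def aggregate_tool_columns(per_session):
    """Sum calls per tool and count sessions per tool across session rollups."""
    calls = defaultdict(int)      # tool -> total calls in the bucket
    sessions = defaultdict(int)   # tool -> sessions that used it at least once
    for s in per_session:
        for name, count in (s.get("_tool_calls_by_name") or {}).items():
            calls[name] += int(count)
            sessions[name] += 1   # each session counts once per tool
    return {
        "tool_calls_by_name_json": json.dumps(dict(calls), sort_keys=True),
        "sessions_using_tool_json": json.dumps(dict(sessions), sort_keys=True),
    }
```

A session that calls bash twice adds 2 to the first column and 1 to the second, which is why the dashboard can separate "how often" from "how widely".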
Per-tool counts come off tool_call events (data["tool"]) rather
than tool_output (success-only), so the existing tool_calls_total
counter is unchanged.
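A rough sketch of the per-session counting described above. The event shape (an "event_type" key plus a "data" dict) is an assumption made for illustration; only data["tool"] is confirmed by this PR, and count_named_tool_calls is a hypothetical name.

```python
from collections import defaultdict


def count_named_tool_calls(events):
    """Count tool_call events per tool name, skipping nameless calls."""
    by_name = defaultdict(int)
    for ev in events:
        # Assumed event shape: {"event_type": ..., "data": {...}}
        if ev.get("event_type") != "tool_call":
            continue  # tool_output and other events are not counted here
        name = (ev.get("data") or {}).get("tool")
        if name:
            by_name[name] += 1
    return dict(by_name)
```

Because only tool_call events feed this counter, anything derived from tool_output stays exactly as it was.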
Tests: 6 new cases in tests/unit/test_build_kpis.py covering the
new private session fields, research-only-among-doers percentile,
breadth/intensity aggregates, and the model split. All 19 KPI +
scheduler tests pass.
The matching observatory commit is e78c2d7 — visualization PRs
(headline cell, "Research tool" / "Tool mix" / "Surface split"
sections) live there.
* fix(kpis): address review feedback
Three P1s and a P2 from automated review:
- Restore hf_jobs_blocked + pro_cta_clicks in _aggregate output.
_session_metrics still computed both, but the aggregate had silently
dropped them — dead computation and a schema regression vs the
original main schema. Cheap to keep; no consumer required.
- Filter zero-tool-call sessions out of distinct_tools_per_session and
tool_calls_per_session percentiles, matching the existing research
doer-only filter. Quiet hours full of status-check / abandoned
sessions otherwise drag every median to 0.
tool_calls_per_turn keeps its turns>0-only filter on purpose: a
5-turn session that did 0 tool calls is a meaningful 0 there.
- Update module docstring to list every aggregate column added in this
series (cost_per_session_*, tool_calls_succeeded/_failed, *_sessions
outcome counts, sandboxes_*).
- Drop the unreachable hardware-set check in sandbox classification:
"cpu-basic"/"cpu-upgrade" both start with "cpu-", so the disjunction
was dead. Simplified to a startswith check.
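The simplified sandbox classification from the last bullet can be sketched like this. classify_sandbox is a hypothetical standalone helper; in the PR the check lives inline in _session_metrics.

```python
def classify_sandbox(hardware):
    """Bucket a sandbox hardware string as "cpu" or "gpu".

    Anything that does not start with "cpu-" (including a missing or
    empty string) lands in the GPU bucket, as in the diff.
    """
    flavor = (hardware or "").lower()
    return "cpu" if flavor.startswith("cpu-") else "gpu"
```

Both "cpu-basic" and "cpu-upgrade" match the prefix, so a separate set-membership test for those two names could never change the outcome.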
Two new tests:
- test_breadth_intensity_percentiles_exclude_zero_tool_sessions locks
in the doer-only filter so it can't silently regress.
- test_pro_clicks_and_blocked_jobs_in_aggregate guards the restored
aggregate columns.
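The doer-only filter that the first test locks in can be illustrated with a small example. The percentile helper below is a linear-interpolation stand-in written for this sketch, not the repo's _percentile, and returning 0.0 for an empty list is an assumption.

```python
def percentile(values, q):
    """Linear-interpolation percentile; a stand-in, not the repo's helper."""
    if not values:
        return 0.0  # assumed behaviour for an empty bucket
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return float(ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo))


def doer_only(values):
    """Keep only sessions that made at least one call."""
    return [v for v in values if v > 0]


# Three idle sessions and two productive ones, as in the new test.
calls_per_session = [0, 0, 0, 2, 4]
unfiltered_median = percentile(calls_per_session, 0.5)
filtered_median = percentile(doer_only(calls_per_session), 0.5)
```

With the idle sessions included the median collapses to 0; with the filter applied it is the midpoint of 2 and 4.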
Matching observatory commit lands its mirror of build_kpis.py.
- scripts/build_kpis.py +133 -16
- tests/unit/test_build_kpis.py +129 -8
|
@@ -38,15 +38,27 @@ re-running the same hour overwrites.
|
|
| 38 |
llm_calls — count of llm_call events
|
| 39 |
tokens_prompt / _completion / _cache_read / _cache_creation
|
| 40 |
cost_usd — sum of llm_call.cost_usd
|
| 41 |
cache_hit_ratio — cache_read / (cache_read + prompt)
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
time_to_first_action_s_p50 / _p95 — from session_start to first tool_call
|
| 46 |
thumbs_up / thumbs_down
|
| 47 |
hf_jobs_submitted / _succeeded / _blocked
|
| 48 |
pro_cta_clicks
|
| 49 |
gpu_hours_by_flavor_json — JSON-serialised {flavor: gpu-hours}
|
| 50 |
|
| 51 |
================================================================================
|
| 52 |
Usage
|
|
@@ -213,6 +225,7 @@ def _session_metrics(session: dict) -> dict:
|
|
| 213 |
"thumbs_up": 0, "thumbs_down": 0,
|
| 214 |
"hf_jobs_submitted": 0, "hf_jobs_succeeded": 0, "hf_jobs_blocked": 0,
|
| 215 |
"pro_cta_clicks": 0,
|
| 216 |
"first_tool_s": -1,
|
| 217 |
}
|
| 218 |
events = session.get("events") or []
|
|
@@ -231,11 +244,19 @@ def _session_metrics(session: dict) -> dict:
|
|
| 231 |
gpu_hours_by_flavor: dict[str, float] = defaultdict(float)
|
| 232 |
jobs_submitted = 0
|
| 233 |
jobs_succeeded = 0
|
| 234 |
-
jobs_blocked = 0
|
| 235 |
thumbs_up = 0
|
| 236 |
thumbs_down = 0
|
| 237 |
pro_cta_clicks = 0
|
| 238 |
pro_cta_by_source: dict[str, int] = defaultdict(int)
|
| 239 |
|
| 240 |
start_dt = _parse_ts(session_start)
|
| 241 |
|
|
@@ -260,6 +281,10 @@ def _session_metrics(session: dict) -> dict:
|
|
| 260 |
first_tool_ts = (ts - start_dt).total_seconds()
|
| 261 |
|
| 262 |
elif et == "tool_call":
|
| 263 |
if first_tool_ts is None and ts is not None and start_dt is not None:
|
| 264 |
first_tool_ts = (ts - start_dt).total_seconds()
|
| 265 |
|
|
@@ -296,6 +321,19 @@ def _session_metrics(session: dict) -> dict:
|
|
| 296 |
source = str(data.get("source") or "unknown")
|
| 297 |
pro_cta_by_source[source] += 1
|
| 298 |
|
| 299 |
out["tool_calls_total"] = tool_total
|
| 300 |
out["tool_calls_success"] = tool_success
|
| 301 |
out["failures"] = 1 if had_error else 0
|
|
@@ -304,12 +342,22 @@ def _session_metrics(session: dict) -> dict:
|
|
| 304 |
out["thumbs_down"] = thumbs_down
|
| 305 |
out["hf_jobs_submitted"] = jobs_submitted
|
| 306 |
out["hf_jobs_succeeded"] = jobs_succeeded
|
| 307 |
out["hf_jobs_blocked"] = jobs_blocked
|
| 308 |
out["pro_cta_clicks"] = pro_cta_clicks
|
| 309 |
out["first_tool_s"] = first_tool_ts if first_tool_ts is not None else -1
|
| 310 |
out["_gpu_hours_by_flavor"] = dict(gpu_hours_by_flavor)
|
| 311 |
out["_pro_cta_by_source"] = dict(pro_cta_by_source)
|
| 312 |
out["_user"] = session.get("user_id") or session.get("session_id")
|
| 313 |
return dict(out)
|
| 314 |
|
| 315 |
|
|
@@ -317,12 +365,36 @@ def _aggregate(per_session: list[dict]) -> dict:
|
|
| 317 |
"""Collapse a bucket's worth of session rollups into the final KPI row."""
|
| 318 |
ttfa_values = [s["first_tool_s"] for s in per_session if s.get("first_tool_s", -1) >= 0]
|
| 319 |
gpu_hours: dict[str, float] = defaultdict(float)
|
| 320 |
-
pro_cta_by_source: dict[str, int] = defaultdict(int)
|
| 321 |
for s in per_session:
|
| 322 |
for f, h in (s.get("_gpu_hours_by_flavor") or {}).items():
|
| 323 |
gpu_hours[f] += h
|
| 324 |
-
|
| 325 |
-
|
| 326 |
|
| 327 |
total_sessions = sum(s["sessions"] for s in per_session)
|
| 328 |
total_turns = sum(s["turns"] for s in per_session)
|
|
@@ -330,6 +402,16 @@ def _aggregate(per_session: list[dict]) -> dict:
|
|
| 330 |
tokens_cache_read = sum(s["tokens_cache_read"] for s in per_session)
|
| 331 |
tool_total = sum(s["tool_calls_total"] for s in per_session)
|
| 332 |
tool_success = sum(s["tool_calls_success"] for s in per_session)
|
| 333 |
|
| 334 |
unique_users = {s.get("_user") for s in per_session if s.get("_user")}
|
| 335 |
|
|
@@ -343,26 +425,61 @@ def _aggregate(per_session: list[dict]) -> dict:
|
|
| 343 |
"tokens_cache_read": int(tokens_cache_read),
|
| 344 |
"tokens_cache_creation": int(sum(s["tokens_cache_creation"] for s in per_session)),
|
| 345 |
"cost_usd": round(sum(s["cost_usd"] for s in per_session), 4),
|
| 346 |
"cache_hit_ratio": round(
|
| 347 |
tokens_cache_read / (tokens_cache_read + tokens_prompt), 4
|
| 348 |
) if (tokens_cache_read + tokens_prompt) > 0 else 0.0,
|
| 349 |
"tool_success_rate": round(tool_success / tool_total, 4) if tool_total > 0 else 0.0,
|
| 350 |
-
"failure_rate": round(
|
| 351 |
-
|
| 352 |
-
) if total_sessions > 0 else 0.0,
|
| 353 |
-
"regenerate_rate": round(
|
| 354 |
-
sum(s["regenerate_sessions"] for s in per_session) / total_sessions, 4
|
| 355 |
-
) if total_sessions > 0 else 0.0,
|
| 356 |
"time_to_first_action_s_p50": round(_percentile(ttfa_values, 0.5), 2),
|
| 357 |
"time_to_first_action_s_p95": round(_percentile(ttfa_values, 0.95), 2),
|
| 358 |
"thumbs_up": int(sum(s["thumbs_up"] for s in per_session)),
|
| 359 |
"thumbs_down": int(sum(s["thumbs_down"] for s in per_session)),
|
| 360 |
"hf_jobs_submitted": int(sum(s["hf_jobs_submitted"] for s in per_session)),
|
| 361 |
"hf_jobs_succeeded": int(sum(s["hf_jobs_succeeded"] for s in per_session)),
|
| 362 |
-
"
|
| 363 |
-
"
|
| 364 |
"gpu_hours_by_flavor_json": json.dumps(dict(gpu_hours), sort_keys=True),
|
| 365 |
-
"
|
| 366 |
}
|
| 367 |
|
| 368 |
|
|
|
|
| 38 |
llm_calls — count of llm_call events
|
| 39 |
tokens_prompt / _completion / _cache_read / _cache_creation
|
| 40 |
cost_usd — sum of llm_call.cost_usd
|
| 41 |
+
cost_per_session_mean / _p50 / _p95 — per-session cost distribution
|
| 42 |
cache_hit_ratio — cache_read / (cache_read + prompt)
|
| 43 |
+
tool_calls_total / _succeeded / _failed — per-tool_output reliability counts
|
| 44 |
+
tool_success_rate — succeeded / total (kept for back-compat)
|
| 45 |
+
successful_sessions / errored_sessions / regenerated_sessions — outcome counts
|
| 46 |
+
failure_rate / regenerate_rate — kept for back-compat
|
| 47 |
time_to_first_action_s_p50 / _p95 — from session_start to first tool_call
|
| 48 |
thumbs_up / thumbs_down
|
| 49 |
hf_jobs_submitted / _succeeded / _blocked
|
| 50 |
+
sandboxes_created / _cpu / _gpu — sandbox_create events bucketed by hardware
|
| 51 |
pro_cta_clicks
|
| 52 |
gpu_hours_by_flavor_json — JSON-serialised {flavor: gpu-hours}
|
| 53 |
+
research_calls — total `research` tool_call events
|
| 54 |
+
sessions_with_research — sessions that called `research` ≥1
|
| 55 |
+
research_calls_per_session_p50 / _p95 — among sessions that did any (zero-only sessions excluded)
|
| 56 |
+
distinct_tools_per_session_p50 / _p95 — among sessions with ≥1 named tool_call
|
| 57 |
+
tool_calls_per_session_p50 / _p95 — among sessions with ≥1 named tool_call
|
| 58 |
+
tool_calls_per_turn_p50 / _p95 — calls / turns, among sessions with turns>0
|
| 59 |
+
tool_calls_by_name_json — JSON {tool: total_calls} (all tools seen)
|
| 60 |
+
sessions_using_tool_json — JSON {tool: distinct_sessions_using}
|
| 61 |
+
sessions_by_model_json — JSON {model_name: count} (CLI vs Bedrock split)
|
| 62 |
|
| 63 |
================================================================================
|
| 64 |
Usage
|
|
|
|
| 225 |
"thumbs_up": 0, "thumbs_down": 0,
|
| 226 |
"hf_jobs_submitted": 0, "hf_jobs_succeeded": 0, "hf_jobs_blocked": 0,
|
| 227 |
"pro_cta_clicks": 0,
|
| 228 |
+
"sandboxes_created": 0, "sandboxes_cpu": 0, "sandboxes_gpu": 0,
|
| 229 |
"first_tool_s": -1,
|
| 230 |
}
|
| 231 |
events = session.get("events") or []
|
|
|
|
| 244 |
gpu_hours_by_flavor: dict[str, float] = defaultdict(float)
|
| 245 |
jobs_submitted = 0
|
| 246 |
jobs_succeeded = 0
|
| 247 |
thumbs_up = 0
|
| 248 |
thumbs_down = 0
|
| 249 |
+
sandboxes_created = 0
|
| 250 |
+
sandboxes_cpu = 0
|
| 251 |
+
sandboxes_gpu = 0
|
| 252 |
+
jobs_blocked = 0
|
| 253 |
pro_cta_clicks = 0
|
| 254 |
pro_cta_by_source: dict[str, int] = defaultdict(int)
|
| 255 |
+
# Per-tool counters from tool_call events. Counted off tool_call (which
|
| 256 |
+
# carries data["tool"]) rather than tool_output (which only carries
|
| 257 |
+
# success/output) so we can attribute calls to specific tools.
|
| 258 |
+
tool_calls_by_name: dict[str, int] = defaultdict(int)
|
| 259 |
+
total_named_tool_calls = 0
|
| 260 |
|
| 261 |
start_dt = _parse_ts(session_start)
|
| 262 |
|
|
|
|
| 281 |
first_tool_ts = (ts - start_dt).total_seconds()
|
| 282 |
|
| 283 |
elif et == "tool_call":
|
| 284 |
+
name = data.get("tool")
|
| 285 |
+
if name:
|
| 286 |
+
tool_calls_by_name[name] += 1
|
| 287 |
+
total_named_tool_calls += 1
|
| 288 |
if first_tool_ts is None and ts is not None and start_dt is not None:
|
| 289 |
first_tool_ts = (ts - start_dt).total_seconds()
|
| 290 |
|
|
|
|
| 321 |
source = str(data.get("source") or "unknown")
|
| 322 |
pro_cta_by_source[source] += 1
|
| 323 |
|
| 324 |
+
elif et == "sandbox_create":
|
| 325 |
+
sandboxes_created += 1
|
| 326 |
+
hardware = (data.get("hardware") or "").lower()
|
| 327 |
+
# CPU flavors are explicitly named "cpu-*". Everything else
|
| 328 |
+
# (including unknown/missing hardware strings) lands in the GPU
|
| 329 |
+
# bucket, since the auto-create default is "cpu-basic" which is
|
| 330 |
+
# matched here — anything that isn't is almost always an explicit
|
| 331 |
+
# GPU choice.
|
| 332 |
+
if hardware.startswith("cpu-"):
|
| 333 |
+
sandboxes_cpu += 1
|
| 334 |
+
else:
|
| 335 |
+
sandboxes_gpu += 1
|
| 336 |
+
|
| 337 |
out["tool_calls_total"] = tool_total
|
| 338 |
out["tool_calls_success"] = tool_success
|
| 339 |
out["failures"] = 1 if had_error else 0
|
|
|
|
| 342 |
out["thumbs_down"] = thumbs_down
|
| 343 |
out["hf_jobs_submitted"] = jobs_submitted
|
| 344 |
out["hf_jobs_succeeded"] = jobs_succeeded
|
| 345 |
+
out["sandboxes_created"] = sandboxes_created
|
| 346 |
+
out["sandboxes_cpu"] = sandboxes_cpu
|
| 347 |
+
out["sandboxes_gpu"] = sandboxes_gpu
|
| 348 |
out["hf_jobs_blocked"] = jobs_blocked
|
| 349 |
out["pro_cta_clicks"] = pro_cta_clicks
|
| 350 |
out["first_tool_s"] = first_tool_ts if first_tool_ts is not None else -1
|
| 351 |
out["_gpu_hours_by_flavor"] = dict(gpu_hours_by_flavor)
|
| 352 |
out["_pro_cta_by_source"] = dict(pro_cta_by_source)
|
| 353 |
out["_user"] = session.get("user_id") or session.get("session_id")
|
| 354 |
+
# Intra-session tool fields. Underscore-prefixed = consumed by _aggregate
|
| 355 |
+
# only, never written to CSV directly.
|
| 356 |
+
out["_tool_calls_by_name"] = dict(tool_calls_by_name)
|
| 357 |
+
out["_research_calls"] = tool_calls_by_name.get("research", 0)
|
| 358 |
+
out["_distinct_tools_used"] = len(tool_calls_by_name)
|
| 359 |
+
out["_total_named_tool_calls"] = total_named_tool_calls
|
| 360 |
+
out["_model_name"] = session.get("model_name") or "unknown"
|
| 361 |
return dict(out)
|
| 362 |
|
| 363 |
|
|
|
|
| 365 |
"""Collapse a bucket's worth of session rollups into the final KPI row."""
|
| 366 |
ttfa_values = [s["first_tool_s"] for s in per_session if s.get("first_tool_s", -1) >= 0]
|
| 367 |
gpu_hours: dict[str, float] = defaultdict(float)
|
| 368 |
for s in per_session:
|
| 369 |
for f, h in (s.get("_gpu_hours_by_flavor") or {}).items():
|
| 370 |
gpu_hours[f] += h
|
| 371 |
+
|
| 372 |
+
# Per-tool aggregates. ``sessions_using_tool`` counts each session at most
|
| 373 |
+
# once per tool, so the dashboard can show "how many sessions reached for
|
| 374 |
+
# research" alongside "how many research calls overall".
|
| 375 |
+
tool_calls_by_name: dict[str, int] = defaultdict(int)
|
| 376 |
+
sessions_using_tool: dict[str, int] = defaultdict(int)
|
| 377 |
+
sessions_by_model: dict[str, int] = defaultdict(int)
|
| 378 |
+
for s in per_session:
|
| 379 |
+
for name, count in (s.get("_tool_calls_by_name") or {}).items():
|
| 380 |
+
tool_calls_by_name[name] += int(count)
|
| 381 |
+
sessions_using_tool[name] += 1
|
| 382 |
+
sessions_by_model[s.get("_model_name") or "unknown"] += 1
|
| 383 |
+
|
| 384 |
+
# Percentile inputs. All "per session" percentiles exclude sessions that
|
| 385 |
+
# never reached for the relevant signal — otherwise quiet hours
|
| 386 |
+
# (status-check sessions, abandoned new conversations) drag every median
|
| 387 |
+
# to 0 and the chart tells you nothing.
|
| 388 |
+
research_calls_nz = [s.get("_research_calls", 0) for s in per_session if s.get("_research_calls", 0) > 0]
|
| 389 |
+
distinct_tools_values = [s.get("_distinct_tools_used", 0) for s in per_session if s.get("_distinct_tools_used", 0) > 0]
|
| 390 |
+
total_calls_values = [s.get("_total_named_tool_calls", 0) for s in per_session if s.get("_total_named_tool_calls", 0) > 0]
|
| 391 |
+
# Per-turn intensity: turns>0 is the natural filter here (a session with
|
| 392 |
+
# 5 turns and 0 tools is a meaningful 0). Don't strip those.
|
| 393 |
+
calls_per_turn_values = [
|
| 394 |
+
s.get("_total_named_tool_calls", 0) / s["turns"]
|
| 395 |
+
for s in per_session
|
| 396 |
+
if s.get("turns", 0) > 0
|
| 397 |
+
]
|
| 398 |
|
| 399 |
total_sessions = sum(s["sessions"] for s in per_session)
|
| 400 |
total_turns = sum(s["turns"] for s in per_session)
|
|
|
|
| 402 |
tokens_cache_read = sum(s["tokens_cache_read"] for s in per_session)
|
| 403 |
tool_total = sum(s["tool_calls_total"] for s in per_session)
|
| 404 |
tool_success = sum(s["tool_calls_success"] for s in per_session)
|
| 405 |
+
failures = int(sum(s["failures"] for s in per_session))
|
| 406 |
+
regenerates = int(sum(s["regenerate_sessions"] for s in per_session))
|
| 407 |
+
research_calls_total = int(sum(s.get("_research_calls", 0) for s in per_session))
|
| 408 |
+
sessions_with_research = sum(1 for s in per_session if s.get("_research_calls", 0) > 0)
|
| 409 |
+
|
| 410 |
+
# Per-session cost percentiles — chart "median session cost" alongside the
|
| 411 |
+
# mean so a few $700 outliers don't make you think every session is pricey.
|
| 412 |
+
session_costs = [float(s.get("cost_usd") or 0.0) for s in per_session]
|
| 413 |
+
cost_p50 = _percentile(session_costs, 0.5)
|
| 414 |
+
cost_p95 = _percentile(session_costs, 0.95)
|
| 415 |
|
| 416 |
unique_users = {s.get("_user") for s in per_session if s.get("_user")}
|
| 417 |
|
|
|
|
| 425 |
"tokens_cache_read": int(tokens_cache_read),
|
| 426 |
"tokens_cache_creation": int(sum(s["tokens_cache_creation"] for s in per_session)),
|
| 427 |
"cost_usd": round(sum(s["cost_usd"] for s in per_session), 4),
|
| 428 |
+
# Per-session cost summaries.
|
| 429 |
+
"cost_per_session_mean": round(
|
| 430 |
+
sum(s["cost_usd"] for s in per_session) / total_sessions, 6
|
| 431 |
+
) if total_sessions > 0 else 0.0,
|
| 432 |
+
"cost_per_session_p50": round(cost_p50, 6),
|
| 433 |
+
"cost_per_session_p95": round(cost_p95, 6),
|
| 434 |
"cache_hit_ratio": round(
|
| 435 |
tokens_cache_read / (tokens_cache_read + tokens_prompt), 4
|
| 436 |
) if (tokens_cache_read + tokens_prompt) > 0 else 0.0,
|
| 437 |
+
# Raw reliability COUNTS (these are what the dashboard shows directly).
|
| 438 |
+
"tool_calls_total": int(tool_total),
|
| 439 |
+
"tool_calls_succeeded": int(tool_success),
|
| 440 |
+
"tool_calls_failed": int(tool_total - tool_success),
|
| 441 |
+
"errored_sessions": failures,
|
| 442 |
+
# Successful = "did not raise an error event". Mutually exclusive
|
| 443 |
+
# with errored_sessions; sums with errored_sessions to total sessions.
|
| 444 |
+
"successful_sessions": int(total_sessions - failures),
|
| 445 |
+
# Regenerated is an orthogonal dimension (the user retried) — a
|
| 446 |
+
# session can be both successful and regenerated, or both errored
|
| 447 |
+
# and regenerated.
|
| 448 |
+
"regenerated_sessions": regenerates,
|
| 449 |
+
# Rates kept for backwards compatibility with anything reading the
|
| 450 |
+
# KPI dataset directly.
|
| 451 |
"tool_success_rate": round(tool_success / tool_total, 4) if tool_total > 0 else 0.0,
|
| 452 |
+
"failure_rate": round(failures / total_sessions, 4) if total_sessions > 0 else 0.0,
|
| 453 |
+
"regenerate_rate": round(regenerates / total_sessions, 4) if total_sessions > 0 else 0.0,
|
| 454 |
"time_to_first_action_s_p50": round(_percentile(ttfa_values, 0.5), 2),
|
| 455 |
"time_to_first_action_s_p95": round(_percentile(ttfa_values, 0.95), 2),
|
| 456 |
"thumbs_up": int(sum(s["thumbs_up"] for s in per_session)),
|
| 457 |
"thumbs_down": int(sum(s["thumbs_down"] for s in per_session)),
|
| 458 |
"hf_jobs_submitted": int(sum(s["hf_jobs_submitted"] for s in per_session)),
|
| 459 |
"hf_jobs_succeeded": int(sum(s["hf_jobs_succeeded"] for s in per_session)),
|
| 460 |
+
"sandboxes_created": int(sum(s.get("sandboxes_created", 0) for s in per_session)),
|
| 461 |
+
"sandboxes_cpu": int(sum(s.get("sandboxes_cpu", 0) for s in per_session)),
|
| 462 |
+
"sandboxes_gpu": int(sum(s.get("sandboxes_gpu", 0) for s in per_session)),
|
| 463 |
+
"hf_jobs_blocked": int(sum(s.get("hf_jobs_blocked", 0) for s in per_session)),
|
| 464 |
+
"pro_cta_clicks": int(sum(s.get("pro_cta_clicks", 0) for s in per_session)),
|
| 465 |
"gpu_hours_by_flavor_json": json.dumps(dict(gpu_hours), sort_keys=True),
|
| 466 |
+
# Research KPIs — answer "is the agent reaching for research?".
|
| 467 |
+
"research_calls": research_calls_total,
|
| 468 |
+
"sessions_with_research": int(sessions_with_research),
|
| 469 |
+
"research_calls_per_session_p50": round(_percentile(research_calls_nz, 0.5), 2),
|
| 470 |
+
"research_calls_per_session_p95": round(_percentile(research_calls_nz, 0.95), 2),
|
| 471 |
+
# Intra-session breadth + intensity. p50 + p95 over per-session values.
|
| 472 |
+
"distinct_tools_per_session_p50": round(_percentile(distinct_tools_values, 0.5), 2),
|
| 473 |
+
"distinct_tools_per_session_p95": round(_percentile(distinct_tools_values, 0.95), 2),
|
| 474 |
+
"tool_calls_per_session_p50": round(_percentile(total_calls_values, 0.5), 2),
|
| 475 |
+
"tool_calls_per_session_p95": round(_percentile(total_calls_values, 0.95), 2),
|
| 476 |
+
"tool_calls_per_turn_p50": round(_percentile(calls_per_turn_values, 0.5), 2),
|
| 477 |
+
"tool_calls_per_turn_p95": round(_percentile(calls_per_turn_values, 0.95), 2),
|
| 478 |
+
# JSON columns let the dashboard add/remove tools without schema churn.
|
| 479 |
+
"tool_calls_by_name_json": json.dumps(dict(tool_calls_by_name), sort_keys=True),
|
| 480 |
+
"sessions_using_tool_json": json.dumps(dict(sessions_using_tool), sort_keys=True),
|
| 481 |
+
# Surface split — answers "is research dropping on Bedrock specifically?".
|
| 482 |
+
"sessions_by_model_json": json.dumps(dict(sessions_by_model), sort_keys=True),
|
| 483 |
}
|
| 484 |
|
| 485 |
|
|
@@ -136,20 +136,141 @@ def test_aggregate_day_cache_hit_and_users():
|
|
| 136 |
assert abs(row["cost_usd"] - 1.5) < 1e-9
|
| 137 |
|
| 138 |
|
| 139 |
-
def
|
| 140 |
mod = _load()
|
| 141 |
s1 = mod._session_metrics(_session([
|
| 142 |
-
_ev("
|
| 143 |
-
_ev("
|
| 144 |
], user_id="u1"))
|
| 145 |
s2 = mod._session_metrics(_session([
|
| 146 |
_ev("pro_cta_click", {"source": "claude_cap_dialog"}),
|
| 147 |
], user_id="u2"))
|
| 148 |
-
row = mod.
|
| 149 |
-
assert row["pro_cta_clicks"] ==
|
| 150 |
-
assert row["
|
| 151 |
-
|
| 152 |
-
|
| 153 |
|
| 154 |
|
| 155 |
def test_failure_and_regenerate_rates():
|
|
|
|
| 136 |
assert abs(row["cost_usd"] - 1.5) < 1e-9
|
| 137 |
|
| 138 |
|
| 139 |
+
def test_per_tool_counts_in_session_metrics():
|
| 140 |
+
mod = _load()
|
| 141 |
+
events = [
|
| 142 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 143 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 144 |
+
_ev("tool_call", {"tool": "research"}),
|
| 145 |
+
_ev("tool_call", {"tool": "read"}),
|
| 146 |
+
_ev("tool_call", {}), # nameless tool_call must be ignored
|
| 147 |
+
]
|
| 148 |
+
m = mod._session_metrics(_session(events, user_id="u1"))
|
| 149 |
+
assert m["_tool_calls_by_name"] == {"bash": 2, "research": 1, "read": 1}
|
| 150 |
+
assert m["_research_calls"] == 1
|
| 151 |
+
assert m["_distinct_tools_used"] == 3
|
| 152 |
+
assert m["_total_named_tool_calls"] == 4
|
| 153 |
+
assert m["_model_name"] == "claude-opus-4-6"
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
def test_aggregate_research_kpis_only_count_doer_sessions():
|
| 157 |
mod = _load()
|
| 158 |
s1 = mod._session_metrics(_session([
|
| 159 |
+
_ev("tool_call", {"tool": "research"}),
|
| 160 |
+
_ev("tool_call", {"tool": "research"}),
|
| 161 |
+
_ev("tool_call", {"tool": "research"}),
|
| 162 |
], user_id="u1"))
|
| 163 |
s2 = mod._session_metrics(_session([
|
| 164 |
+
_ev("tool_call", {"tool": "research"}),
|
| 165 |
+
], user_id="u2"))
|
| 166 |
+
s3 = mod._session_metrics(_session([
|
| 167 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 168 |
+
], user_id="u3"))
|
| 169 |
+
row = mod._aggregate([s1, s2, s3])
|
| 170 |
+
assert row["sessions"] == 3
|
| 171 |
+
assert row["sessions_with_research"] == 2
|
| 172 |
+
assert row["research_calls"] == 4
|
| 173 |
+
# Median among sessions that did any research = (1, 3) -> 2.0
|
| 174 |
+
assert row["research_calls_per_session_p50"] == 2.0
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def test_aggregate_tool_breadth_and_intensity():
|
| 178 |
+
import json as _json
|
| 179 |
+
mod = _load()
|
| 180 |
+
s1 = mod._session_metrics(_session([
|
| 181 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 182 |
+
_ev("tool_call", {"tool": "research"}),
|
| 183 |
+
], user_id="u1"))
|
| 184 |
+
# Two user turns so calls/turn = 4/2 = 2
|
| 185 |
+
s2 = _session([
|
| 186 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 187 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 188 |
+
_ev("tool_call", {"tool": "edit"}),
|
| 189 |
+
_ev("tool_call", {"tool": "edit"}),
|
| 190 |
+
], user_id="u2")
|
| 191 |
+
s2["messages"] = [{"role": "user"}, {"role": "user"}]
|
| 192 |
+
s2_metrics = mod._session_metrics(s2)
|
| 193 |
+
row = mod._aggregate([s1, s2_metrics])
|
| 194 |
+
assert _json.loads(row["tool_calls_by_name_json"]) == {
|
| 195 |
+
"bash": 3, "research": 1, "edit": 2,
|
| 196 |
+
}
|
| 197 |
+
assert _json.loads(row["sessions_using_tool_json"]) == {
|
| 198 |
+
"bash": 2, "research": 1, "edit": 1,
|
| 199 |
+
}
|
| 200 |
+
# u1: 2 distinct, u2: 2 distinct -> p50 = 2
|
| 201 |
+
assert row["distinct_tools_per_session_p50"] == 2.0
|
| 202 |
+
# tool_calls_per_session: u1=2, u2=4 -> p50=3
|
| 203 |
+
assert row["tool_calls_per_session_p50"] == 3.0
|
| 204 |
+
# u1: _session() defaults to one user message, so calls/turn = 2/1 = 2; u2 = 4/2 = 2
|
| 205 |
+
assert row["tool_calls_per_turn_p50"] == 2.0
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
def test_breadth_intensity_percentiles_exclude_zero_tool_sessions():
|
| 209 |
+
"""Sessions that never called a tool would otherwise crush the median."""
|
| 210 |
+
mod = _load()
|
| 211 |
+
# Two productive sessions and three idle ones (no tool calls). Without
|
| 212 |
+
# the doer-only filter, median of [0,0,0,2,4] = 0, which is useless.
|
| 213 |
+
productive_a = mod._session_metrics(_session([
|
| 214 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 215 |
+
_ev("tool_call", {"tool": "research"}),
|
| 216 |
+
], user_id="prod_a"))
|
| 217 |
+
productive_b = _session([
|
| 218 |
+
_ev("tool_call", {"tool": "bash"}),
|
| 219 |
+
_ev("tool_call", {"tool": "edit"}),
|
| 220 |
+
_ev("tool_call", {"tool": "edit"}),
|
| 221 |
+
_ev("tool_call", {"tool": "edit"}),
|
| 222 |
+
], user_id="prod_b")
|
| 223 |
+
productive_b["messages"] = [{"role": "user"}, {"role": "user"}]
|
| 224 |
+
productive_b_metrics = mod._session_metrics(productive_b)
|
| 225 |
+
idle = [
|
| 226 |
+
mod._session_metrics(_session([], user_id="idle_a")),
|
| 227 |
+
mod._session_metrics(_session([], user_id="idle_b")),
|
| 228 |
+
mod._session_metrics(_session([], user_id="idle_c")),
|
| 229 |
+
]
|
| 230 |
+
row = mod._aggregate([productive_a, productive_b_metrics, *idle])
|
| 231 |
+
# Median of [2 distinct, 2 distinct] = 2 (idle sessions filtered).
|
| 232 |
+
assert row["distinct_tools_per_session_p50"] == 2.0
|
| 233 |
+
# Median of [2 calls, 4 calls] = 3 (idle sessions filtered).
|
| 234 |
+
assert row["tool_calls_per_session_p50"] == 3.0
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
def test_pro_clicks_and_blocked_jobs_in_aggregate():
|
| 238 |
+
"""The aggregate row keeps pro_cta_clicks + hf_jobs_blocked columns
|
| 239 |
+
even if the dashboard doesn't currently chart them — they're cheap to
|
| 240 |
+
keep and downstream consumers may still depend on the schema."""
|
| 241 |
+
mod = _load()
|
| 242 |
+
s1 = mod._session_metrics(_session([
|
| 243 |
+
_ev("pro_cta_click", {"source": "hf_jobs_upgrade_dialog"}),
|
| 244 |
_ev("pro_cta_click", {"source": "claude_cap_dialog"}),
|
| 245 |
+
_ev("jobs_access_blocked", {}),
|
| 246 |
+
], user_id="u1"))
|
| 247 |
+
s2 = mod._session_metrics(_session([
|
| 248 |
+
_ev("jobs_access_blocked", {}),
|
| 249 |
+
_ev("jobs_access_blocked", {}),
|
| 250 |
], user_id="u2"))
|
| 251 |
+
row = mod._aggregate([s1, s2])
|
| 252 |
+
assert row["pro_cta_clicks"] == 2
|
| 253 |
+
assert row["hf_jobs_blocked"] == 3
|
| 254 |
+
|
| 255 |
+
|
| 256 |
+
def test_aggregate_sessions_by_model_split():
|
| 257 |
+
import json as _json
|
| 258 |
+
mod = _load()
|
| 259 |
+
s_anthropic = _session([], user_id="a")
|
| 260 |
+
s_anthropic["model_name"] = "anthropic/claude-opus-4-6"
|
| 261 |
+
s_bedrock = _session([], user_id="b")
|
| 262 |
+
s_bedrock["model_name"] = "bedrock/us.anthropic.claude-opus-4-6-v1"
|
| 263 |
+
s_bedrock2 = _session([], user_id="c")
|
| 264 |
+
s_bedrock2["model_name"] = "bedrock/us.anthropic.claude-opus-4-6-v1"
|
| 265 |
+
row = mod._aggregate([
|
| 266 |
+
mod._session_metrics(s_anthropic),
|
| 267 |
+
mod._session_metrics(s_bedrock),
|
| 268 |
+
mod._session_metrics(s_bedrock2),
|
| 269 |
+
])
|
| 270 |
+
assert _json.loads(row["sessions_by_model_json"]) == {
|
| 271 |
+
"anthropic/claude-opus-4-6": 1,
|
| 272 |
+
"bedrock/us.anthropic.claude-opus-4-6-v1": 2,
|
| 273 |
+
}
|
| 274 |
|
| 275 |
|
| 276 |
def test_failure_and_regenerate_rates():
|