Aksel Joonas Reedi committed
Commit f9305f6 · unverified · 1 Parent(s): 72f615f

feat(kpis): per-tool counts, research engagement, surface split (#160)


* feat(kpis): per-tool counts, research engagement, surface split

Adds intra-session telemetry to the KPI rollup so the observatory
dashboard can answer "is the agent reaching for research?" and
"which tools dropped out of the mix?". Production data (Apr 21 to
Apr 27, 21k+ sessions) shows the sessions-with-research rate dropped
from ~78% on Apr 21 to ~56% on Apr 27. Without per-tool counts in the
rollup, that signal was invisible to the dashboard.

Also resolves the long-standing drift between this repo's
scripts/build_kpis.py and the observatory's backend/build_kpis.py
(both pipelines write to smolagents/ml-intern-kpis; whichever ran
last dropped the other's columns). The observatory copy was the
superset; this PR brings scripts/ up to it. Going forward both
copies are byte-identical.
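
A minimal sketch of how the byte-identical claim could be checked, for example in CI. The helper names and the temporary file names are hypothetical; neither repo is known to ship such a check.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copies_match(path_a: Path, path_b: Path) -> bool:
    """True only when both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)


# Throwaway files stand in for the two build_kpis.py copies (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    copy_a = Path(tmp) / "scripts_build_kpis.py"
    copy_b = Path(tmp) / "backend_build_kpis.py"
    copy_a.write_bytes(b"print('kpis')\n")
    copy_b.write_bytes(b"print('kpis')\n")
    identical_before = copies_match(copy_a, copy_b)
    # Simulate drift by changing one copy only.
    copy_b.write_bytes(b"print('kpis')  # drifted\n")
    identical_after = copies_match(copy_a, copy_b)

print(identical_before, identical_after)  # True False
```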

Drift fix (incoming from observatory):
- cost_per_session_mean / _p50 / _p95
- tool_calls_total / _succeeded / _failed counts
- successful_sessions / errored_sessions / regenerated_sessions counts
- sandboxes_created / _cpu / _gpu
- pro_cta_by_source_json removed (was dropped by observatory; the
  dashboard never charted it, so no consumer to break)
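
To illustrate why the mean is paired with a median, here is a small sketch with made-up session costs. The numbers are invented for the example, and `statistics.median` stands in for the repo's own `_percentile` helper, whose implementation is not shown here.

```python
from statistics import mean, median

# Hypothetical per-session costs in USD: three ordinary sessions, one outlier.
session_costs = [0.5, 0.75, 1.0, 700.0]

cost_mean = mean(session_costs)   # pulled far upward by the single outlier
cost_p50 = median(session_costs)  # still reflects a typical session

print(cost_mean)  # 175.5625
print(cost_p50)   # 0.875
```

With only the mean on the chart, every session would look expensive; the median shows that most are not.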

New fields (added in both copies via the parallel observatory PR):
- research_calls, sessions_with_research
- research_calls_per_session_p50/p95 (among sessions that did any)
- distinct_tools_per_session_p50/p95 (vocabulary breadth)
- tool_calls_per_session_p50/p95
- tool_calls_per_turn_p50/p95
- tool_calls_by_name_json, sessions_using_tool_json
- sessions_by_model_json (CLI/anthropic vs frontend/bedrock)
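
A rough sketch of how the two per-tool JSON columns differ. The function name `aggregate_tool_usage` and its input shape (one `{tool: calls}` dict per session) are assumptions for illustration, not the code in scripts/build_kpis.py.

```python
import json
from collections import defaultdict


def aggregate_tool_usage(per_session_counts: list[dict[str, int]]) -> dict[str, str]:
    """Roll per-session {tool: calls} dicts up into two JSON strings."""
    calls_by_name: dict[str, int] = defaultdict(int)
    sessions_using: dict[str, int] = defaultdict(int)
    for counts in per_session_counts:
        for name, count in counts.items():
            calls_by_name[name] += count  # every call is counted
            sessions_using[name] += 1     # each session counts once per tool
    return {
        "tool_calls_by_name_json": json.dumps(dict(calls_by_name), sort_keys=True),
        "sessions_using_tool_json": json.dumps(dict(sessions_using), sort_keys=True),
    }


row = aggregate_tool_usage([
    {"bash": 1, "research": 1},
    {"bash": 2, "edit": 2},
])
print(row["tool_calls_by_name_json"])   # {"bash": 3, "edit": 2, "research": 1}
print(row["sessions_using_tool_json"])  # {"bash": 2, "edit": 1, "research": 1}
```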

Per-tool counts come off tool_call events (data["tool"]) rather
than tool_output events, which carry only success/output and not the
tool name. The existing tool_calls_total counter is unchanged.
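
A minimal sketch of the per-session counting described above. The event shape used here (an `event_type` key plus a `data` dict) is an assumption for the example; only the `data["tool"]` lookup comes from this commit.

```python
from collections import defaultdict


def count_named_tool_calls(events: list[dict]) -> dict[str, int]:
    """Count tool_call events per tool name, skipping nameless calls."""
    counts: dict[str, int] = defaultdict(int)
    for ev in events:
        if ev.get("event_type") != "tool_call":
            continue  # tool_output and other events carry no tool name
        name = (ev.get("data") or {}).get("tool")
        if name:
            counts[name] += 1
    return dict(counts)


events = [
    {"event_type": "tool_call", "data": {"tool": "bash"}},
    {"event_type": "tool_output", "data": {"success": True}},
    {"event_type": "tool_call", "data": {"tool": "research"}},
    {"event_type": "tool_call", "data": {"tool": "bash"}},
    {"event_type": "tool_call", "data": {}},  # nameless call is ignored
]
counts = count_named_tool_calls(events)
print(counts)  # {'bash': 2, 'research': 1}
```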

Tests: 6 new cases in tests/unit/test_build_kpis.py covering the
new private session fields, research-only-among-doers percentile,
breadth/intensity aggregates, and the model split. All 19 KPI +
scheduler tests pass.

The matching observatory commit is e78c2d7 — visualization PRs
(headline cell, "Research tool" / "Tool mix" / "Surface split"
sections) live there.

* fix(kpis): address review feedback

Three P1s and a P2 from automated review:

- Restore hf_jobs_blocked + pro_cta_clicks in _aggregate output.
  _session_metrics still computed both, but the aggregate had silently
  dropped them: dead computation and a schema regression vs the
  original main schema. Cheap to keep; no consumer required.
- Filter zero-tool-call sessions out of distinct_tools_per_session and
  tool_calls_per_session percentiles, matching the existing research
  doer-only filter. Quiet hours full of status-check / abandoned
  sessions otherwise drag every median to 0.
  tool_calls_per_turn keeps its turns>0-only filter on purpose: a
  5-turn session that did 0 tool calls is a meaningful 0 there.
- Update module docstring to list every aggregate column added in this
  series (cost_per_session_*, tool_calls_succeeded/_failed, *_sessions
  outcome counts, sandboxes_*).
- Drop the unreachable hardware-set check in sandbox classification:
  "cpu-basic"/"cpu-upgrade" both start with "cpu-", so the disjunction
  was dead. Simplified to a startswith check.
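
The simplified sandbox classification from the last item can be sketched as below. `classify_sandbox` is a hypothetical helper (the real code increments counters inline), and the GPU flavor string is only an illustrative value.

```python
from typing import Optional


def classify_sandbox(hardware: Optional[str]) -> str:
    """Return "cpu" for "cpu-*" flavors and "gpu" for everything else."""
    normalized = (hardware or "").lower()
    # "cpu-basic" and "cpu-upgrade" both match this prefix, so a separate
    # membership check against a set of CPU flavors would never add anything.
    return "cpu" if normalized.startswith("cpu-") else "gpu"


print(classify_sandbox("cpu-basic"))    # cpu
print(classify_sandbox("CPU-Upgrade"))  # cpu
print(classify_sandbox("a10g-large"))   # gpu (illustrative flavor name)
print(classify_sandbox(None))           # gpu (missing hardware falls through)
```

Note that unknown or missing hardware strings land in the GPU bucket, which matches the comment in the diff.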

Two new tests:
- test_breadth_intensity_percentiles_exclude_zero_tool_sessions locks
  in the doer-only filter so it can't silently regress.
- test_pro_clicks_and_blocked_jobs_in_aggregate guards the restored
  aggregate columns.
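
A small sketch of the doer-only filter those tests protect. `statistics.median` stands in for the repo's `_percentile` helper, which is not shown here, and returning 0.0 for an empty input is an assumption of this example.

```python
from statistics import median


def doer_only_median(values: list[int]) -> float:
    """Median over sessions that did the thing at least once."""
    doers = [v for v in values if v > 0]
    return float(median(doers)) if doers else 0.0


# Three idle sessions and two productive ones, as in the new test.
tool_calls_per_session = [0, 0, 0, 2, 4]

unfiltered = median(tool_calls_per_session)
filtered = doer_only_median(tool_calls_per_session)
print(unfiltered)  # 0
print(filtered)    # 3.0
```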

Matching observatory commit lands its mirror of build_kpis.py.

Files changed (2)
  1. scripts/build_kpis.py +133 -16
  2. tests/unit/test_build_kpis.py +129 -8
scripts/build_kpis.py CHANGED
@@ -38,15 +38,27 @@ re-running the same hour overwrites.
   llm_calls — count of llm_call events
   tokens_prompt / _completion / _cache_read / _cache_creation
   cost_usd — sum of llm_call.cost_usd
+  cost_per_session_mean / _p50 / _p95 — per-session cost distribution
   cache_hit_ratio — cache_read / (cache_read + prompt)
-  tool_success_rate — tool_output success=True / total tool_output
-  failure_rate sessions that ended with an `error` event / sessions
-  regenerate_rate — sessions with any `undo_complete` event / sessions
+  tool_calls_total / _succeeded / _failed — per-tool_output reliability counts
+  tool_success_rate succeeded / total (kept for back-compat)
+  successful_sessions / errored_sessions / regenerated_sessions — outcome counts
+  failure_rate / regenerate_rate — kept for back-compat
   time_to_first_action_s_p50 / _p95 — from session_start to first tool_call
   thumbs_up / thumbs_down
   hf_jobs_submitted / _succeeded / _blocked
+  sandboxes_created / _cpu / _gpu — sandbox_create events bucketed by hardware
   pro_cta_clicks
   gpu_hours_by_flavor_json — JSON-serialised {flavor: gpu-hours}
+  research_calls — total `research` tool_call events
+  sessions_with_research — sessions that called `research` ≥1
+  research_calls_per_session_p50 / _p95 — among sessions that did any (zero-only sessions excluded)
+  distinct_tools_per_session_p50 / _p95 — among sessions with ≥1 named tool_call
+  tool_calls_per_session_p50 / _p95 — among sessions with ≥1 named tool_call
+  tool_calls_per_turn_p50 / _p95 — calls / turns, among sessions with turns>0
+  tool_calls_by_name_json — JSON {tool: total_calls} (all tools seen)
+  sessions_using_tool_json — JSON {tool: distinct_sessions_using}
+  sessions_by_model_json — JSON {model_name: count} (CLI vs Bedrock split)

 ================================================================================
 Usage
@@ -213,6 +225,7 @@ def _session_metrics(session: dict) -> dict:
         "thumbs_up": 0, "thumbs_down": 0,
         "hf_jobs_submitted": 0, "hf_jobs_succeeded": 0, "hf_jobs_blocked": 0,
         "pro_cta_clicks": 0,
+        "sandboxes_created": 0, "sandboxes_cpu": 0, "sandboxes_gpu": 0,
         "first_tool_s": -1,
     }
     events = session.get("events") or []
@@ -231,11 +244,19 @@ def _session_metrics(session: dict) -> dict:
     gpu_hours_by_flavor: dict[str, float] = defaultdict(float)
     jobs_submitted = 0
     jobs_succeeded = 0
-    jobs_blocked = 0
     thumbs_up = 0
     thumbs_down = 0
+    sandboxes_created = 0
+    sandboxes_cpu = 0
+    sandboxes_gpu = 0
+    jobs_blocked = 0
     pro_cta_clicks = 0
     pro_cta_by_source: dict[str, int] = defaultdict(int)
+    # Per-tool counters from tool_call events. Counted off tool_call (which
+    # carries data["tool"]) rather than tool_output (which only carries
+    # success/output) so we can attribute calls to specific tools.
+    tool_calls_by_name: dict[str, int] = defaultdict(int)
+    total_named_tool_calls = 0

     start_dt = _parse_ts(session_start)

@@ -260,6 +281,10 @@ def _session_metrics(session: dict) -> dict:
                 first_tool_ts = (ts - start_dt).total_seconds()

         elif et == "tool_call":
+            name = data.get("tool")
+            if name:
+                tool_calls_by_name[name] += 1
+                total_named_tool_calls += 1
             if first_tool_ts is None and ts is not None and start_dt is not None:
                 first_tool_ts = (ts - start_dt).total_seconds()

@@ -296,6 +321,19 @@ def _session_metrics(session: dict) -> dict:
             source = str(data.get("source") or "unknown")
             pro_cta_by_source[source] += 1

+        elif et == "sandbox_create":
+            sandboxes_created += 1
+            hardware = (data.get("hardware") or "").lower()
+            # CPU flavors are explicitly named "cpu-*". Everything else
+            # (including unknown/missing hardware strings) lands in the GPU
+            # bucket, since the auto-create default is "cpu-basic" which is
+            # matched here — anything that isn't is almost always an explicit
+            # GPU choice.
+            if hardware.startswith("cpu-"):
+                sandboxes_cpu += 1
+            else:
+                sandboxes_gpu += 1
+
     out["tool_calls_total"] = tool_total
     out["tool_calls_success"] = tool_success
     out["failures"] = 1 if had_error else 0
@@ -304,12 +342,22 @@ def _session_metrics(session: dict) -> dict:
     out["thumbs_down"] = thumbs_down
     out["hf_jobs_submitted"] = jobs_submitted
     out["hf_jobs_succeeded"] = jobs_succeeded
+    out["sandboxes_created"] = sandboxes_created
+    out["sandboxes_cpu"] = sandboxes_cpu
+    out["sandboxes_gpu"] = sandboxes_gpu
     out["hf_jobs_blocked"] = jobs_blocked
     out["pro_cta_clicks"] = pro_cta_clicks
     out["first_tool_s"] = first_tool_ts if first_tool_ts is not None else -1
     out["_gpu_hours_by_flavor"] = dict(gpu_hours_by_flavor)
     out["_pro_cta_by_source"] = dict(pro_cta_by_source)
     out["_user"] = session.get("user_id") or session.get("session_id")
+    # Intra-session tool fields. Underscore-prefixed = consumed by _aggregate
+    # only, never written to CSV directly.
+    out["_tool_calls_by_name"] = dict(tool_calls_by_name)
+    out["_research_calls"] = tool_calls_by_name.get("research", 0)
+    out["_distinct_tools_used"] = len(tool_calls_by_name)
+    out["_total_named_tool_calls"] = total_named_tool_calls
+    out["_model_name"] = session.get("model_name") or "unknown"
     return dict(out)


@@ -317,12 +365,36 @@ def _aggregate(per_session: list[dict]) -> dict:
     """Collapse a bucket's worth of session rollups into the final KPI row."""
     ttfa_values = [s["first_tool_s"] for s in per_session if s.get("first_tool_s", -1) >= 0]
     gpu_hours: dict[str, float] = defaultdict(float)
-    pro_cta_by_source: dict[str, int] = defaultdict(int)
     for s in per_session:
         for f, h in (s.get("_gpu_hours_by_flavor") or {}).items():
             gpu_hours[f] += h
-        for source, count in (s.get("_pro_cta_by_source") or {}).items():
-            pro_cta_by_source[source] += int(count)
+
+    # Per-tool aggregates. ``sessions_using_tool`` counts each session at most
+    # once per tool, so the dashboard can show "how many sessions reached for
+    # research" alongside "how many research calls overall".
+    tool_calls_by_name: dict[str, int] = defaultdict(int)
+    sessions_using_tool: dict[str, int] = defaultdict(int)
+    sessions_by_model: dict[str, int] = defaultdict(int)
+    for s in per_session:
+        for name, count in (s.get("_tool_calls_by_name") or {}).items():
+            tool_calls_by_name[name] += int(count)
+            sessions_using_tool[name] += 1
+        sessions_by_model[s.get("_model_name") or "unknown"] += 1
+
+    # Percentile inputs. All "per session" percentiles exclude sessions that
+    # never reached for the relevant signal — otherwise quiet hours
+    # (status-check sessions, abandoned new conversations) drag every median
+    # to 0 and the chart tells you nothing.
+    research_calls_nz = [s.get("_research_calls", 0) for s in per_session if s.get("_research_calls", 0) > 0]
+    distinct_tools_values = [s.get("_distinct_tools_used", 0) for s in per_session if s.get("_distinct_tools_used", 0) > 0]
+    total_calls_values = [s.get("_total_named_tool_calls", 0) for s in per_session if s.get("_total_named_tool_calls", 0) > 0]
+    # Per-turn intensity: turns>0 is the natural filter here (a session with
+    # 5 turns and 0 tools is a meaningful 0). Don't strip those.
+    calls_per_turn_values = [
+        s.get("_total_named_tool_calls", 0) / s["turns"]
+        for s in per_session
+        if s.get("turns", 0) > 0
+    ]

     total_sessions = sum(s["sessions"] for s in per_session)
     total_turns = sum(s["turns"] for s in per_session)
@@ -330,6 +402,16 @@ def _aggregate(per_session: list[dict]) -> dict:
     tokens_cache_read = sum(s["tokens_cache_read"] for s in per_session)
     tool_total = sum(s["tool_calls_total"] for s in per_session)
     tool_success = sum(s["tool_calls_success"] for s in per_session)
+    failures = int(sum(s["failures"] for s in per_session))
+    regenerates = int(sum(s["regenerate_sessions"] for s in per_session))
+    research_calls_total = int(sum(s.get("_research_calls", 0) for s in per_session))
+    sessions_with_research = sum(1 for s in per_session if s.get("_research_calls", 0) > 0)
+
+    # Per-session cost percentiles — chart "median session cost" alongside the
+    # mean so a few $700 outliers don't make you think every session is pricey.
+    session_costs = [float(s.get("cost_usd") or 0.0) for s in per_session]
+    cost_p50 = _percentile(session_costs, 0.5)
+    cost_p95 = _percentile(session_costs, 0.95)

     unique_users = {s.get("_user") for s in per_session if s.get("_user")}

@@ -343,26 +425,61 @@ def _aggregate(per_session: list[dict]) -> dict:
         "tokens_cache_read": int(tokens_cache_read),
         "tokens_cache_creation": int(sum(s["tokens_cache_creation"] for s in per_session)),
         "cost_usd": round(sum(s["cost_usd"] for s in per_session), 4),
+        # Per-session cost summaries.
+        "cost_per_session_mean": round(
+            sum(s["cost_usd"] for s in per_session) / total_sessions, 6
+        ) if total_sessions > 0 else 0.0,
+        "cost_per_session_p50": round(cost_p50, 6),
+        "cost_per_session_p95": round(cost_p95, 6),
         "cache_hit_ratio": round(
             tokens_cache_read / (tokens_cache_read + tokens_prompt), 4
         ) if (tokens_cache_read + tokens_prompt) > 0 else 0.0,
+        # Raw reliability COUNTS (these are what the dashboard shows directly).
+        "tool_calls_total": int(tool_total),
+        "tool_calls_succeeded": int(tool_success),
+        "tool_calls_failed": int(tool_total - tool_success),
+        "errored_sessions": failures,
+        # Successful = "did not raise an error event". Mutually exclusive
+        # with errored_sessions; sums with errored_sessions to total sessions.
+        "successful_sessions": int(total_sessions - failures),
+        # Regenerated is an orthogonal dimension (the user retried) — a
+        # session can be both successful and regenerated, or both errored
+        # and regenerated.
+        "regenerated_sessions": regenerates,
+        # Rates kept for backwards compatibility with anything reading the
+        # KPI dataset directly.
         "tool_success_rate": round(tool_success / tool_total, 4) if tool_total > 0 else 0.0,
-        "failure_rate": round(
-            sum(s["failures"] for s in per_session) / total_sessions, 4
-        ) if total_sessions > 0 else 0.0,
-        "regenerate_rate": round(
-            sum(s["regenerate_sessions"] for s in per_session) / total_sessions, 4
-        ) if total_sessions > 0 else 0.0,
+        "failure_rate": round(failures / total_sessions, 4) if total_sessions > 0 else 0.0,
+        "regenerate_rate": round(regenerates / total_sessions, 4) if total_sessions > 0 else 0.0,
         "time_to_first_action_s_p50": round(_percentile(ttfa_values, 0.5), 2),
         "time_to_first_action_s_p95": round(_percentile(ttfa_values, 0.95), 2),
         "thumbs_up": int(sum(s["thumbs_up"] for s in per_session)),
         "thumbs_down": int(sum(s["thumbs_down"] for s in per_session)),
         "hf_jobs_submitted": int(sum(s["hf_jobs_submitted"] for s in per_session)),
         "hf_jobs_succeeded": int(sum(s["hf_jobs_succeeded"] for s in per_session)),
-        "hf_jobs_blocked": int(sum(s["hf_jobs_blocked"] for s in per_session)),
-        "pro_cta_clicks": int(sum(s["pro_cta_clicks"] for s in per_session)),
+        "sandboxes_created": int(sum(s.get("sandboxes_created", 0) for s in per_session)),
+        "sandboxes_cpu": int(sum(s.get("sandboxes_cpu", 0) for s in per_session)),
+        "sandboxes_gpu": int(sum(s.get("sandboxes_gpu", 0) for s in per_session)),
+        "hf_jobs_blocked": int(sum(s.get("hf_jobs_blocked", 0) for s in per_session)),
+        "pro_cta_clicks": int(sum(s.get("pro_cta_clicks", 0) for s in per_session)),
         "gpu_hours_by_flavor_json": json.dumps(dict(gpu_hours), sort_keys=True),
-        "pro_cta_by_source_json": json.dumps(dict(pro_cta_by_source), sort_keys=True),
+        # Research KPIs — answer "is the agent reaching for research?".
+        "research_calls": research_calls_total,
+        "sessions_with_research": int(sessions_with_research),
+        "research_calls_per_session_p50": round(_percentile(research_calls_nz, 0.5), 2),
+        "research_calls_per_session_p95": round(_percentile(research_calls_nz, 0.95), 2),
+        # Intra-session breadth + intensity. p50 + p95 over per-session values.
+        "distinct_tools_per_session_p50": round(_percentile(distinct_tools_values, 0.5), 2),
+        "distinct_tools_per_session_p95": round(_percentile(distinct_tools_values, 0.95), 2),
+        "tool_calls_per_session_p50": round(_percentile(total_calls_values, 0.5), 2),
+        "tool_calls_per_session_p95": round(_percentile(total_calls_values, 0.95), 2),
+        "tool_calls_per_turn_p50": round(_percentile(calls_per_turn_values, 0.5), 2),
+        "tool_calls_per_turn_p95": round(_percentile(calls_per_turn_values, 0.95), 2),
+        # JSON columns let the dashboard add/remove tools without schema churn.
+        "tool_calls_by_name_json": json.dumps(dict(tool_calls_by_name), sort_keys=True),
+        "sessions_using_tool_json": json.dumps(dict(sessions_using_tool), sort_keys=True),
+        # Surface split — answers "is research dropping on Bedrock specifically?".
+        "sessions_by_model_json": json.dumps(dict(sessions_by_model), sort_keys=True),
     }


 
tests/unit/test_build_kpis.py CHANGED
@@ -136,20 +136,141 @@ def test_aggregate_day_cache_hit_and_users():
     assert abs(row["cost_usd"] - 1.5) < 1e-9


-def test_aggregate_day_sums_pro_click_sources():
+def test_per_tool_counts_in_session_metrics():
+    mod = _load()
+    events = [
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "research"}),
+        _ev("tool_call", {"tool": "read"}),
+        _ev("tool_call", {}),  # nameless tool_call must be ignored
+    ]
+    m = mod._session_metrics(_session(events, user_id="u1"))
+    assert m["_tool_calls_by_name"] == {"bash": 2, "research": 1, "read": 1}
+    assert m["_research_calls"] == 1
+    assert m["_distinct_tools_used"] == 3
+    assert m["_total_named_tool_calls"] == 4
+    assert m["_model_name"] == "claude-opus-4-6"
+
+
+def test_aggregate_research_kpis_only_count_doer_sessions():
     mod = _load()
     s1 = mod._session_metrics(_session([
-        _ev("pro_cta_click", {"source": "hf_jobs_upgrade_dialog"}),
-        _ev("pro_cta_click", {"source": "hf_jobs_upgrade_dialog"}),
+        _ev("tool_call", {"tool": "research"}),
+        _ev("tool_call", {"tool": "research"}),
+        _ev("tool_call", {"tool": "research"}),
     ], user_id="u1"))
     s2 = mod._session_metrics(_session([
+        _ev("tool_call", {"tool": "research"}),
+    ], user_id="u2"))
+    s3 = mod._session_metrics(_session([
+        _ev("tool_call", {"tool": "bash"}),
+    ], user_id="u3"))
+    row = mod._aggregate([s1, s2, s3])
+    assert row["sessions"] == 3
+    assert row["sessions_with_research"] == 2
+    assert row["research_calls"] == 4
+    # Median among sessions that did any research = (1, 3) -> 2.0
+    assert row["research_calls_per_session_p50"] == 2.0
+
+
+def test_aggregate_tool_breadth_and_intensity():
+    import json as _json
+    mod = _load()
+    s1 = mod._session_metrics(_session([
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "research"}),
+    ], user_id="u1"))
+    # Two user turns so calls/turn = 4/2 = 2
+    s2 = _session([
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "edit"}),
+        _ev("tool_call", {"tool": "edit"}),
+    ], user_id="u2")
+    s2["messages"] = [{"role": "user"}, {"role": "user"}]
+    s2_metrics = mod._session_metrics(s2)
+    row = mod._aggregate([s1, s2_metrics])
+    assert _json.loads(row["tool_calls_by_name_json"]) == {
+        "bash": 3, "research": 1, "edit": 2,
+    }
+    assert _json.loads(row["sessions_using_tool_json"]) == {
+        "bash": 2, "research": 1, "edit": 1,
+    }
+    # u1: 2 distinct, u2: 2 distinct -> p50 = 2
+    assert row["distinct_tools_per_session_p50"] == 2.0
+    # tool_calls_per_session: u1=2, u2=4 -> p50=3
+    assert row["tool_calls_per_session_p50"] == 3.0
+    # u1: 2 turns(?) — _session() default has one user message, so calls/turn=2/1=2; u2=4/2=2
+    assert row["tool_calls_per_turn_p50"] == 2.0
+
+
+def test_breadth_intensity_percentiles_exclude_zero_tool_sessions():
+    """Sessions that never called a tool would otherwise crush the median."""
+    mod = _load()
+    # Two productive sessions and three idle ones (no tool calls). Without
+    # the doer-only filter, median of [0,0,0,2,4] = 0, which is useless.
+    productive_a = mod._session_metrics(_session([
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "research"}),
+    ], user_id="prod_a"))
+    productive_b = _session([
+        _ev("tool_call", {"tool": "bash"}),
+        _ev("tool_call", {"tool": "edit"}),
+        _ev("tool_call", {"tool": "edit"}),
+        _ev("tool_call", {"tool": "edit"}),
+    ], user_id="prod_b")
+    productive_b["messages"] = [{"role": "user"}, {"role": "user"}]
+    productive_b_metrics = mod._session_metrics(productive_b)
+    idle = [
+        mod._session_metrics(_session([], user_id="idle_a")),
+        mod._session_metrics(_session([], user_id="idle_b")),
+        mod._session_metrics(_session([], user_id="idle_c")),
+    ]
+    row = mod._aggregate([productive_a, productive_b_metrics, *idle])
+    # Median of [2 distinct, 2 distinct] = 2 (idle sessions filtered).
+    assert row["distinct_tools_per_session_p50"] == 2.0
+    # Median of [2 calls, 4 calls] = 3 (idle sessions filtered).
+    assert row["tool_calls_per_session_p50"] == 3.0
+
+
+def test_pro_clicks_and_blocked_jobs_in_aggregate():
+    """The aggregate row keeps pro_cta_clicks + hf_jobs_blocked columns
+    even if the dashboard doesn't currently chart them — they're cheap to
+    keep and downstream consumers may still depend on the schema."""
+    mod = _load()
+    s1 = mod._session_metrics(_session([
+        _ev("pro_cta_click", {"source": "hf_jobs_upgrade_dialog"}),
         _ev("pro_cta_click", {"source": "claude_cap_dialog"}),
+        _ev("jobs_access_blocked", {}),
+    ], user_id="u1"))
+    s2 = mod._session_metrics(_session([
+        _ev("jobs_access_blocked", {}),
+        _ev("jobs_access_blocked", {}),
     ], user_id="u2"))
-    row = mod._aggregate_day([s1, s2])
-    assert row["pro_cta_clicks"] == 3
-    assert row["pro_cta_by_source_json"] == (
-        '{"claude_cap_dialog": 1, "hf_jobs_upgrade_dialog": 2}'
-    )
+    row = mod._aggregate([s1, s2])
+    assert row["pro_cta_clicks"] == 2
+    assert row["hf_jobs_blocked"] == 3
+
+
+def test_aggregate_sessions_by_model_split():
+    import json as _json
+    mod = _load()
+    s_anthropic = _session([], user_id="a")
+    s_anthropic["model_name"] = "anthropic/claude-opus-4-6"
+    s_bedrock = _session([], user_id="b")
+    s_bedrock["model_name"] = "bedrock/us.anthropic.claude-opus-4-6-v1"
+    s_bedrock2 = _session([], user_id="c")
+    s_bedrock2["model_name"] = "bedrock/us.anthropic.claude-opus-4-6-v1"
+    row = mod._aggregate([
+        mod._session_metrics(s_anthropic),
+        mod._session_metrics(s_bedrock),
+        mod._session_metrics(s_bedrock2),
+    ])
+    assert _json.loads(row["sessions_by_model_json"]) == {
+        "anthropic/claude-opus-4-6": 1,
+        "bedrock/us.anthropic.claude-opus-4-6-v1": 2,
+    }


 def test_failure_and_regenerate_rates():