ml-intern

Guillaume Salou committed on Apr 29

Commit

7867a7a

unverified

1 Parent(s): c4ac4e6

feat(telemetry): track 5 untracked Bedrock call sites for full cost attribution (#179)

* feat(telemetry): track 5 untracked Bedrock call sites for full cost attribution

Cost Explorer ($78,738 over 6 days) vs the session dataset's
total_cost_usd (~$354/day attributed) showed the dataset captures only
~33% of real Bedrock spend. Root cause: out of 9 acompletion() call
sites, only 2 (in agent_loop.py) emit the llm_call event that
total_cost_usd sums.

This wires telemetry into the 5 Bedrock-billing call sites that were
flying blind, with a `kind` tag on each call so analytics can split
spend by category:

- research_tool.py × 3 → kind="research" (sub-agent loop)
- context_manager.py → kind="compaction" (history summary)
- effort_probe.py → kind="effort_probe" (cascade walk)

Plus a fourth tag for the session-restore summary path
(session_manager.py → kind="restore").

Plumbing changes:

- telemetry.record_llm_call now accepts kind="..." (default "main"
preserves existing behavior).
- summarize_messages() and ContextManager.compact() take optional
session=None so the caller can opt into telemetry.
- probe_effort() takes optional session=None for the same reason.
- Both probe_effort callers (agent_loop._heal_effort_and_rebuild_params
and model_switcher.probe_and_switch_model) now pass session.
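
A minimal sketch of the opt-in pattern described above, not the actual implementation: `FakeSession`, `fake_completion`, and this simplified `record_llm_call` are hypothetical stand-ins for the real session object, the LLM call, and `telemetry.record_llm_call`. It only illustrates that a call without `session` still runs but leaves no event, and that `kind` is forwarded by the caller.

```python
import asyncio
import time
from typing import Any


class FakeSession:
    """Hypothetical stand-in for the real session; collects emitted events."""

    def __init__(self) -> None:
        self.events: list[dict] = []


async def record_llm_call(
    session: FakeSession, *, model: str, latency_ms: int, kind: str = "main"
) -> None:
    # Simplified: the real function also extracts usage and cost.
    session.events.append({"model": model, "latency_ms": latency_ms, "kind": kind})


async def fake_completion(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call."""
    return f"summary of: {prompt}"


async def summarize(
    prompt: str, model: str, session: Any = None, kind: str = "compaction"
) -> str:
    t0 = time.monotonic()
    text = await fake_completion(prompt)
    if session is not None:  # telemetry is opt-in
        latency_ms = int((time.monotonic() - t0) * 1000)
        await record_llm_call(session, model=model, latency_ms=latency_ms, kind=kind)
    return text


def demo() -> tuple[str, list[str]]:
    session = FakeSession()
    # No session: the call happens but records nothing.
    untracked = asyncio.run(summarize("old history", "some-model"))
    # With session: one event per call, tagged with the caller's kind.
    asyncio.run(summarize("old history", "some-model", session=session))
    asyncio.run(summarize("old history", "some-model", session=session, kind="restore"))
    return untracked, [event["kind"] for event in session.events]
```

Defaulting `session` to `None` keeps existing callers working unchanged, which is presumably why the commit chose it over a required argument.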

Skipped:

- routes/agent.py /title — uses HF Router (Cerebras), not Bedrock
- routes/agent.py /health/llm — no session context (manual diagnostic
endpoint, ~$0.02/call, not billable to a user)

After deploy, expect dataset total_cost_usd to converge with Cost
Explorer to within 5-10%. The kind breakdown will quantify each
category, validating the cost-plan estimates in
ml_intern_bedrock_cost_plan.md.
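
The per-kind breakdown could be computed along these lines. This is a sketch under assumptions: the event rows are plain dicts carrying the `kind` and `cost_usd` fields that the `llm_call` event emits, rows without a `kind` (written before this change) are treated as `main`, and the helper names and sample figures are illustrative only.

```python
from collections import defaultdict


def cost_by_kind(events: list[dict]) -> dict[str, float]:
    """Sum cost_usd per kind; rows without a kind count as "main" (assumption)."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event.get("kind", "main")] += float(event.get("cost_usd", 0.0))
    return dict(totals)


def attribution_ratio(attributed_usd: float, billed_usd: float) -> float:
    """Share of billed spend that the dataset accounts for."""
    if billed_usd <= 0:
        raise ValueError("billed_usd must be positive")
    return attributed_usd / billed_usd


# Illustrative rows only, not real billing data.
sample_events = [
    {"kind": "main", "cost_usd": 1.0},
    {"kind": "research", "cost_usd": 0.5},
    {"cost_usd": 0.25},  # pre-change row with no kind tag
    {"kind": "research", "cost_usd": 0.25},
]
```

Comparing `sum(cost_by_kind(...).values())` against the Cost Explorer total for the same window with `attribution_ratio` gives the convergence figure the commit message expects to land near 90-95%.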

* fix(telemetry): address PR bot feedback (2 P1 + 1 P2)

1. P1 — Wrap each research_tool record_llm_call in its own try/except.
record_llm_call's inner send_event is wrapped, but extract_usage
(telemetry.py:101) is not — an unexpected usage shape from LiteLLM
could propagate. At all 3 research sites the surrounding except-block
would convert that into "Research summary call failed", masking a
valid LLM response. Match the effort_probe pattern: dedicated
try/except logging at DEBUG.

2. P1 — Hoist `import time` from inside summarize_messages() to module
level in manager.py. stdlib, always available, matches the rest of
the module.

3. P2 — Update telemetry.py docstring kind list. Drop title_gen and
model_probe (skipped per PR description), add restore (emitted from
session_manager.py). Note the intentional skips at the bottom.
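
The masking problem behind the first P1 item can be reproduced in a few lines. This is a sketch, not the project's code: `flaky_record_llm_call` is a hypothetical telemetry call that always fails, and the response is a plain dict rather than a LiteLLM object.

```python
import asyncio
import logging

logger = logging.getLogger("research_sketch")


async def flaky_record_llm_call(response: dict) -> None:
    """Hypothetical telemetry call that fails on an unexpected usage shape."""
    raise TypeError("unexpected usage shape")


async def summary_unguarded(response: dict) -> tuple[str, bool]:
    try:
        # The telemetry error escapes into the outer handler...
        await flaky_record_llm_call(response)
        return response["content"], True
    except Exception:
        # ...which reports a failure even though the LLM response was valid.
        return "Research summary call failed", False


async def summary_guarded(response: dict) -> tuple[str, bool]:
    try:
        try:
            await flaky_record_llm_call(response)
        except Exception as telem_err:
            # Best-effort: log at DEBUG and carry on with the valid response.
            logger.debug("research telemetry failed: %s", telem_err)
        return response["content"], True
    except Exception:
        return "Research summary call failed", False
```

With the dedicated inner `try/except`, the outer handler only ever sees errors from the LLM call itself.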

Files changed (7)

agent/context_manager/manager.py +30 -1
agent/core/agent_loop.py +2 -0
agent/core/effort_probe.py +26 -1
agent/core/model_switcher.py +1 -1
agent/core/telemetry.py +22 -1
agent/tools/research_tool.py +41 -0
backend/session_manager.py +2 -0

agent/context_manager/manager.py CHANGED

@@ -4,6 +4,7 @@ Context management for conversation history
 import logging
 import os
+import time
 import zoneinfo
 from datetime import datetime
 from pathlib import Path
@@ -102,6 +103,8 @@ async def summarize_messages(
     max_tokens: int = 2000,
     tool_specs: list[dict] | None = None,
     prompt: str = _COMPACT_PROMPT,
+    session: Any = None,
+    kind: str = "compaction",
 ) -> tuple[str, int]:
     """Run a summarization prompt against a list of messages.
@@ -110,6 +113,13 @@ async def summarize_messages(
     instead — it preserves the tool-call trail so the agent can answer
     follow-up questions about what it did.
+    ``session`` is optional; when provided, the call is recorded via
+    ``telemetry.record_llm_call`` so its cost lands in the session's
+    ``total_cost_usd``. Without it, the call still happens but is
+    invisible in telemetry — which used to be the case for every
+    compaction call until 2026-04-29 (~30-50% of Bedrock spend was
+    attributed to this single source of dark cost).
     Returns ``(summary_text, completion_tokens)``.
     """
     from agent.core.llm_params import _resolve_llm_params
@@ -119,12 +129,23 @@ async def summarize_messages(
     prompt_messages, tool_specs = with_prompt_caching(
         prompt_messages, tool_specs, llm_params.get("model")
     )
+    _t0 = time.monotonic()
     response = await acompletion(
         messages=prompt_messages,
         max_completion_tokens=max_tokens,
         tools=tool_specs,
         **llm_params,
     )
+    if session is not None:
+        from agent.core import telemetry
+        await telemetry.record_llm_call(
+            session,
+            model=model_name,
+            response=response,
+            latency_ms=int((time.monotonic() - _t0) * 1000),
+            finish_reason=response.choices[0].finish_reason if response.choices else None,
+            kind=kind,
+        )
     summary = response.choices[0].message.content or ""
     completion_tokens = response.usage.completion_tokens if response.usage else 0
     return summary, completion_tokens
@@ -355,8 +376,14 @@ class ContextManager:
         model_name: str,
         tool_specs: list[dict] | None = None,
         hf_token: str | None = None,
+        session: Any = None,
     ) -> None:
-        """Remove old messages to keep history under target size"""
+        """Remove old messages to keep history under target size.
+        ``session`` is optional — if passed, the underlying summarization
+        LLM call is recorded via ``telemetry.record_llm_call(kind=
+        "compaction")`` so its cost shows up in ``total_cost_usd``.
+        """
         if not self.needs_compaction:
             return
@@ -394,6 +421,8 @@ class ContextManager:
             max_tokens=self.compact_size,
             tool_specs=tool_specs,
             prompt=_COMPACT_PROMPT,
+            session=session,
+            kind="compaction",
         )
         summarized_message = Message(role="assistant", content=summary)

agent/core/agent_loop.py CHANGED

@@ -282,6 +282,7 @@ async def _heal_effort_and_rebuild_params(
         try:
             outcome = await probe_effort(
                 model, session.config.reasoning_effort, session.hf_token,
+                session=session,
             )
             session.model_effective_effort[model] = outcome.effective_effort
             logger.info(
@@ -354,6 +355,7 @@ async def _compact_and_notify(session: Session) -> None:
         model_name=session.config.model_name,
         tool_specs=session.tool_router.get_tool_specs_for_llm(),
         hf_token=session.hf_token,
+        session=session,
     )
     new_usage = cm.running_context_usage
     if new_usage != old_usage:

agent/core/effort_probe.py CHANGED

@@ -22,7 +22,9 @@ from __future__ import annotations
 import asyncio
 import logging
+import time
 from dataclasses import dataclass
+from typing import Any
 from litellm import acompletion
@@ -139,6 +141,7 @@ async def probe_effort(
     model_name: str,
     preference: str | None,
     hf_token: str | None,
+    session: Any = None,
 ) -> ProbeOutcome:
     """Walk the cascade for ``preference`` on ``model_name``.
@@ -147,6 +150,12 @@ async def probe_effort(
     transient errors (5xx, timeout) — persistent 4xx that aren't thinking/
     effort related bubble as the original exception so callers can surface
     them (auth, model-not-found, quota, etc.).
+    ``session`` is optional; when provided, each successful probe attempt
+    is recorded via ``telemetry.record_llm_call(kind="effort_probe")`` so
+    the cost shows up in the session's ``total_cost_usd``. Failed probes
+    (rejected by the provider) typically aren't billed, so we only record
+    on success.
     """
     loop = asyncio.get_event_loop()
     start = loop.time()
@@ -174,7 +183,8 @@ async def probe_effort(
         attempts += 1
         try:
-            await asyncio.wait_for(
+            _t0 = time.monotonic()
+            response = await asyncio.wait_for(
                 acompletion(
                     messages=[{"role": "user", "content": "ping"}],
                     max_tokens=_PROBE_MAX_TOKENS,
@@ -183,6 +193,21 @@ async def probe_effort(
                 ),
                 timeout=_PROBE_TIMEOUT,
             )
+            if session is not None:
+                # Best-effort telemetry — never let a logging blip propagate
+                # out of the probe and break model switching.
+                try:
+                    from agent.core import telemetry
+                    await telemetry.record_llm_call(
+                        session,
+                        model=model_name,
+                        response=response,
+                        latency_ms=int((time.monotonic() - _t0) * 1000),
+                        finish_reason=response.choices[0].finish_reason if response.choices else None,
+                        kind="effort_probe",
+                    )
+                except Exception as _telem_err:
+                    logger.debug("effort_probe telemetry failed: %s", _telem_err)
         except Exception as e:
             last_error = e
             if _is_thinking_unsupported(e):

agent/core/model_switcher.py CHANGED

@@ -187,7 +187,7 @@ async def probe_and_switch_model(
     console.print(f"[dim]checking {model_id} (effort: {preference})...[/dim]")
     try:
-        outcome = await probe_effort(model_id, preference, hf_token)
+        outcome = await probe_effort(model_id, preference, hf_token, session=session)
     except ProbeInconclusive as e:
         _commit_switch(model_id, config, session, effective=None, cache=False)
         console.print(

agent/core/telemetry.py CHANGED

@@ -78,9 +78,29 @@ async def record_llm_call(
     response: Any = None,
     latency_ms: int,
     finish_reason: str | None,
+    kind: str = "main",
 ) -> dict:
     """Emit an ``llm_call`` event and return the extracted usage dict so
-    callers can stash it on their result object if they want."""
+    callers can stash it on their result object if they want.
+    ``kind`` tags the call site so downstream analytics can break spend
+    down by category. Values currently emitted by the codebase:
+    * ``main``        — agent loop turn (user-facing reply or tool follow-up)
+    * ``research``    — research sub-agent inner loop (3 call sites)
+    * ``compaction``  — context-window summary on overflow
+    * ``effort_probe``— effort cascade walk on rejection / model switch
+    * ``restore``     — session re-seed summary after a Space restart
+    Pre-2026-04-29 only ``main`` calls were instrumented; observed gap on
+    Cost Explorer was ~67%, with the other 5 call sites accounting for
+    the rest. Tagging lets us split the dataset's ``total_cost_usd`` by
+    category and validate against AWS billing.
+    The ``/title`` (HF Router, not Bedrock) and ``/health/llm`` (diagnostic
+    endpoint, no session context) call sites are intentionally not
+    instrumented — together they're <1% of spend.
+    """
     usage = extract_usage(response) if response is not None else {}
     cost_usd = 0.0
     if response is not None:
@@ -98,6 +118,7 @@ async def record_llm_call(
                 "latency_ms": latency_ms,
                 "finish_reason": finish_reason,
                 "cost_usd": cost_usd,
+                "kind": kind,
                 **usage,
             },
         ))

agent/tools/research_tool.py CHANGED

@@ -9,10 +9,12 @@ Inspired by claude-code's code-explorer agent pattern.
 import json
 import logging
+import time
 from typing import Any
 from litellm import Message, acompletion
+from agent.core import telemetry
 from agent.core.doom_loop import check_for_doom_loop
 from agent.core.llm_params import _resolve_llm_params
 from agent.core.prompt_caching import with_prompt_caching
@@ -332,6 +334,7 @@ async def research_handler(
             ))
             try:
                 _msgs, _ = with_prompt_caching(messages, None, llm_params.get("model"))
+                _t0 = time.monotonic()
                 response = await acompletion(
                     messages=_msgs,
                     tools=None,  # no tools — force text response
@@ -339,6 +342,20 @@ async def research_handler(
                     timeout=120,
                     **llm_params,
                 )
+                # Telemetry is best-effort; a logging blip must never mask a
+                # valid LLM response (the surrounding except would convert it
+                # to "summary call failed").
+                try:
+                    await telemetry.record_llm_call(
+                        session,
+                        model=research_model,
+                        response=response,
+                        latency_ms=int((time.monotonic() - _t0) * 1000),
+                        finish_reason=response.choices[0].finish_reason if response.choices else None,
+                        kind="research",
+                    )
+                except Exception as _telem_err:
+                    logger.debug("research telemetry failed: %s", _telem_err)
                 content = response.choices[0].message.content or ""
                 return content or "Research context exhausted — no summary produced.", bool(content)
             except Exception:
@@ -360,6 +377,7 @@ async def research_handler(
             _msgs, _tools = with_prompt_caching(
                 messages, tool_specs if tool_specs else None, llm_params.get("model")
             )
+            _t0 = time.monotonic()
             response = await acompletion(
                 messages=_msgs,
                 tools=_tools,
@@ -368,6 +386,17 @@ async def research_handler(
                 timeout=120,
                 **llm_params,
             )
+            try:
+                await telemetry.record_llm_call(
+                    session,
+                    model=research_model,
+                    response=response,
+                    latency_ms=int((time.monotonic() - _t0) * 1000),
+                    finish_reason=response.choices[0].finish_reason if response.choices else None,
+                    kind="research",
+                )
+            except Exception as _telem_err:
+                logger.debug("research telemetry failed: %s", _telem_err)
         except Exception as e:
             logger.error("Research sub-agent LLM error: %s", e)
             return f"Research agent LLM error: {e}", False
@@ -459,6 +488,7 @@ async def research_handler(
     ))
     try:
         _msgs, _ = with_prompt_caching(messages, None, llm_params.get("model"))
+        _t0 = time.monotonic()
         response = await acompletion(
             messages=_msgs,
             tools=None,
@@ -466,6 +496,17 @@ async def research_handler(
             timeout=120,
             **llm_params,
         )
+        try:
+            await telemetry.record_llm_call(
+                session,
+                model=research_model,
+                response=response,
+                latency_ms=int((time.monotonic() - _t0) * 1000),
+                finish_reason=response.choices[0].finish_reason if response.choices else None,
+                kind="research",
+            )
+        except Exception as _telem_err:
+            logger.debug("research telemetry failed: %s", _telem_err)
         content = response.choices[0].message.content or ""
         if content:
             return content, True

backend/session_manager.py CHANGED

@@ -612,6 +612,8 @@ class SessionManager:
                 max_tokens=4000,
                 prompt=_RESTORE_PROMPT,
                 tool_specs=tool_specs,
+                session=session,
+                kind="restore",
             )
         except Exception as e:
             logger.error("Summary call failed during seed: %s", e)