ml-intern

Sleeping

Guillaume Salou commited on about 1 month ago

Commit

d84b454

unverified ·

1 Parent(s): 092f909

fix(compaction): break infinite loop + truncate oversized messages (#213)

* fix(compaction): break the infinite-compaction loop, truncate oversized messages

Sessions stuck in a compaction retry loop have been silently burning Bedrock
budget while staying invisible in the session dataset. Pod logs from
2026-05-03 on prod-114 showed the pattern firing minute-after-minute on
multiple replicas:

Context compacted: 200001 -> 215566 tokens
Context compacted: 215566 -> 215572 tokens
ContextWindowExceededError — forcing compaction
Context compacted: 200001 -> 215544 tokens
...

Root cause: a single message in the "untouched" tail (typically a tool
output of 80k+ tokens — bash dump, file content, CSV) keeps the post-compact
context above the 90% threshold. Compaction triggers again, calls Bedrock
again, gets the same 215k result, loops. ~$3 per Opus retry, indefinitely.
Sessions never reach _run_session.finally → save_and_upload_detached never
fires → cost is real but never lands in total_cost_usd.

Cross-check: dataset Bedrock-only coverage on 2026-05-01 was 43% of Cost
Explorer ($6,299 vs $14,732). Bedrock Invocation Logs on 2026-05-02 showed
11,996 InvokeModel calls vs 5,558 Bedrock events in the dataset for the
same day — half the calls missing, consistent with stuck sessions never
uploading.

Three fixes in this PR:

1. ContextManager._truncate_oversized() — replace the content of any
preserved message (first user, untouched tail) over 50k tokens with a
placeholder before summarization. This addresses the root cause:
compaction can't shrink a single oversized message because it's
preserved verbatim.

2. ContextManager.compact() now raises CompactionFailedError if the
post-compact context is still over the threshold (i.e., truncation +
summarize were not enough). The retry-as-circuit-breaker behavior is
replaced with fail-fast: better to end the session cleanly than to
loop on the same useless API call.

3. agent_loop._compact_and_notify() catches CompactionFailedError, emits
a session_terminated event with reason=compaction_failed (so the
dataset records WHY the session ended and the cost it incurred up to
that point), and sets session.is_running=False to exit the loop. The
_run_session.finally then fires save_trajectory normally and the
session ends up in the dataset.

Expected impact: closes the loop side of the cost telemetry gap. The
sessions that previously looped invisibly will now end at first
compaction failure with an event, dataset coverage should rise from
~43% toward the floor set by other gaps (litellm pricing for HF Router
models on compaction events — separate, smaller issue).

* fix(compaction): address PR bot feedback (P0 + 3 P1)

P0 — _compact_and_notify set is_running=False but neither call site of
the inner agent loop checked it. After CompactionFailedError, the LLM
call still fired, hit ContextWindowExceededError, re-called
_compact_and_notify in the except handler, and continued — replacing
the original infinite compaction loop with an only-slightly-better
loop that did one LLM call between compaction failures. Add explicit
`if not session.is_running: break` guards at both call sites
(agent_loop.py:1076 and 1469).

P1.1 — _truncate_oversized rebuilt the replacement Message with only
five fields (role, content, tool_call_id, tool_calls, name), silently
dropping thinking_blocks, reasoning_content, and provider_specific_fields.
For Anthropic extended-thinking models with reasoning_effort=high/max,
losing thinking_blocks on a prior assistant message causes the next
request to fail with "Invalid signature in thinking block". Preserve
all six known fields when reconstructing.

P1.2 — Same is_running guard missing at the second call site (the
ContextWindowExceededError except handler). Same fix.

P1.3 — No automated test for the CompactionFailedError → session
termination flow. Added tests/unit/test_compaction_loop_break.py with
8 cases covering _truncate_oversized (4), compact() raising (2), and
_compact_and_notify session termination + happy path (2). The most
important test is test_compact_and_notify_terminates_session_on_failure
which would have caught the P0 directly. All 8 pass.

* fix(compaction): never truncate system message (caught by integration test)

End-to-end smoke test on the new compaction code path uncovered an edge
case the unit tests missed: when ``items`` has fewer entries than
``untouched_messages`` (pathological configs, very early-session compact
triggers, or the artificial test scenario), the slice math in compact()
can let ``items[0]`` (the system message) leak into the
``recent_messages`` list passed to ``_truncate_oversized``.

The function then truncated the system prompt — silently destroying the
agent's instructions. Defense-in-depth fix: explicit ``if msg.role ==
"system": pass through`` guard at the top of the per-message loop.

The system prompt is loaded from system_prompt_v3.yaml at session start
and is the agent's behavioral contract; it must never be modified by the
compaction path.

Added test_truncate_oversized_never_touches_system_message to cover this
specifically. Total now 9 tests, all passing.

* fix(compaction): clamp idx >= 1 to prevent system message duplication

Bot review re-review on PR #213 caught a second P0: when len(items) ==
untouched_messages (the canonical 5-message early-compaction case
[system, user-task, assistant-with-giant-output, user-followup,
assistant-reply]), idx initialises to 0 and the walk-back `while idx > 1`
guard is a no-op.

Without an explicit clamp, recent_messages = items[0:] starts at the
system message. The not-messages_to_summarize rebuild path then produces
[system_msg, first_user_msg] + recent_messages = [system, user, system,
user, ...] — duplicating both. Anthropic API rejects two system messages.

The system-message guard added in a0fc95f prevents truncation of the
duplicated system but doesn't prevent the duplication itself.

Fix: explicit `if idx < 1: idx = 1` after the walk-back loop, mirroring
the intent of the existing `idx > 1` guard ("never include system in
recent_messages") and closing the gap when initialisation already lands
below 1.

Added test_compact_does_not_duplicate_system_when_idx_is_zero with the
exact 5-message setup. All 10 tests pass.

The 5-message scenario is precisely the one this PR targets: [system,
user-task, assistant-tool-output, user-followup, assistant-reply] is
the most likely shape to drive context > 90% threshold via a single
oversized tool output. So this P0 would have hit nearly every stuck
session — would have made the fix worse than the bug.

* fix(compaction): clamp idx > first_user_idx (not just > 0)

Bot review on PR #213 caught the third P0 in this thread: my previous
clamp `if idx < 1: idx = 1` excluded the system message from
recent_messages but still overlapped with first_user_idx (which is
also 1 for any well-formed session).

Trace for the canonical 5-message trigger:
items = [system(0), user-task(1), assistant(2), user-followup(3),
assistant-reply(4)]
first_user_idx = 1
idx = 5 - 5 = 0 → clamped to 1
recent_messages = items[1:] = [user-task, assistant, user-followup,
assistant-reply] ← includes user-task
messages_to_summarize = items[2:1] = []
→ enters rebuild branch
head = [system_msg, first_user_msg] ← first_user_msg also here
self.items = head + recent_messages
= [system, user-task, user-task, assistant, user-followup,
assistant-reply]
→ two consecutive user messages
→ Anthropic API 400 on next LLM call

The right invariant is "idx must be strictly after first_user_idx",
not "idx must be > 0". The walk-back's `idx > 1` was necessary
(no system) but insufficient (first_user also in head).

Fix: `if idx <= first_user_idx: idx = first_user_idx + 1`. Since
first_user_idx >= 1 for any well-formed session, this also satisfies
the original system-message exclusion intent.

Test strengthened to assert (a) system_count == 1, (b) task message
appears exactly once, (c) no two consecutive same-role non-system
messages. The previous test only checked (a) and would have shipped
this bug.

Lesson noted in ~/.claude/CLAUDE.md section 5: my fix to one bot
finding introduced another bug. On critical-path PRs, every commit
needs a fresh review round.

Files changed (3) hide show

agent/context_manager/manager.py +148 -13
agent/core/agent_loop.py +59 -9
tests/unit/test_compaction_loop_break.py +360 -0

agent/context_manager/manager.py CHANGED Viewed

@@ -79,6 +79,23 @@ _COMPACT_PROMPT = (
     "will be have to be filled in."
 )
 # Used when seeding a brand-new session from prior browser-cached messages.
 # Here we're writing a note to *ourselves* — so preserve the tool-call trail,
 # files produced, and planned next steps in first person. Optimized for
@@ -374,6 +391,81 @@ class ContextManager:
     def needs_compaction(self) -> bool:
         return self.running_context_usage > self.compaction_threshold and bool(self.items)
     async def compact(
         self,
         model_name: str,
@@ -386,6 +478,13 @@ class ContextManager:
         ``session`` is optional — if passed, the underlying summarization
         LLM call is recorded via ``telemetry.record_llm_call(kind=
         "compaction")`` so its cost shows up in ``total_cost_usd``.
         """
         if not self.needs_compaction:
             return
@@ -409,12 +508,45 @@ class ContextManager:
         idx = len(self.items) - self.untouched_messages
         while idx > 1 and self.items[idx].role != "user":
             idx -= 1
         recent_messages = self.items[idx:]
         messages_to_summarize = self.items[first_user_idx + 1:idx]
-        # improbable, messages would have to very long
         if not messages_to_summarize:
             return
         summary, completion_tokens = await summarize_messages(
@@ -439,16 +571,19 @@ class ContextManager:
             head.append(first_user_msg)
         self.items = head + [summarized_message] + recent_messages
-        # Count the actual post-compact context — system prompt + first user
-        # turn + summary + the preserved tail all contribute, not just the
-        # summary. litellm.token_counter uses the model's real tokenizer.
-        from litellm import token_counter
-        try:
-            self.running_context_usage = token_counter(
-                model=model_name,
-                messages=[m.model_dump() for m in self.items],
             )
-        except Exception as e:
-            logger.warning("token_counter failed post-compact (%s); falling back to rough estimate", e)
-            self.running_context_usage = len(self.system_prompt) // 4 + completion_tokens

     "will be have to be filled in."
 )
+# Per-message ceiling. If a single message in the "untouched" tail is larger
+# than this, compaction can't recover even after summarizing the middle —
+# producing the infinite compaction loop seen 2026-05-03 in pod logs (200k
+# context shrinks to 200k+ because one tool output is 80k tokens). We replace
+# such messages with a placeholder before compaction runs.
+_MAX_TOKENS_PER_MESSAGE = 50_000
+class CompactionFailedError(Exception):
+    """Raised when compaction can't reduce context below the threshold.
+    Typically means an individual preserved message (system, first user, or
+    untouched tail) exceeds what truncation can fix in one pass. The caller
+    must terminate the session — retrying produces an infinite loop that
+    burns Bedrock budget for free (~$3 per re-attempt on Opus).
+    """
 # Used when seeding a brand-new session from prior browser-cached messages.
 # Here we're writing a note to *ourselves* — so preserve the tool-call trail,
 # files produced, and planned next steps in first person. Optimized for
     def needs_compaction(self) -> bool:
         return self.running_context_usage > self.compaction_threshold and bool(self.items)
+    def _truncate_oversized(
+        self, messages: list[Message], model_name: str
+    ) -> list[Message]:
+        """Replace any message > _MAX_TOKENS_PER_MESSAGE with a placeholder.
+        These are typically tool outputs (CSV dumps, file contents) sitting in
+        the untouched tail or first-user position that compaction can't shrink
+        — they pass through verbatim, keeping context above threshold and
+        triggering an infinite compaction retry loop.
+        """
+        from litellm import token_counter
+        out: list[Message] = []
+        for msg in messages:
+            # System messages are sacred — they're the agent's instructions.
+            # In edge cases (items < untouched_messages), the slice math in
+            # compact() can let items[0] (the system message) leak into the
+            # recent_messages list. Defense-in-depth: never truncate it.
+            if msg.role == "system":
+                out.append(msg)
+                continue
+            try:
+                n = token_counter(model=model_name, messages=[msg.model_dump()])
+            except Exception:
+                # token_counter occasionally fails on edge-case content;
+                # don't drop the message, just keep it as-is.
+                out.append(msg)
+                continue
+            if n <= _MAX_TOKENS_PER_MESSAGE:
+                out.append(msg)
+                continue
+            placeholder = (
+                f"[truncated for compaction — original was {n} tokens, "
+                f"removed to keep context under {self.compaction_threshold} tokens]"
+            )
+            logger.warning(
+                "Truncating %s message: %d -> %d tokens for compaction",
+                msg.role, n, len(placeholder) // 4,
+            )
+            # Preserve all known assistant-side fields (tool_calls, thinking_blocks,
+            # reasoning_content, provider_specific_fields) even when content is
+            # replaced. Anthropic extended-thinking models reject the next request
+            # with "Invalid signature in thinking block" if thinking_blocks is
+            # dropped from a prior assistant message.
+            kept = {
+                k: getattr(msg, k, None)
+                for k in (
+                    "tool_call_id",
+                    "tool_calls",
+                    "name",
+                    "thinking_blocks",
+                    "reasoning_content",
+                    "provider_specific_fields",
+                )
+                if getattr(msg, k, None) is not None
+            }
+            out.append(Message(role=msg.role, content=placeholder, **kept))
+        return out
+    def _recompute_usage(self, model_name: str) -> None:
+        """Refresh ``running_context_usage`` from current items via real tokenizer."""
+        from litellm import token_counter
+        try:
+            self.running_context_usage = token_counter(
+                model=model_name,
+                messages=[m.model_dump() for m in self.items],
+            )
+        except Exception as e:
+            logger.warning("token_counter failed (%s); rough estimate", e)
+            # Rough fallback: 4 chars per token.
+            self.running_context_usage = sum(
+                len(getattr(m, "content", "") or "") for m in self.items
+            ) // 4
     async def compact(
         self,
         model_name: str,
         ``session`` is optional — if passed, the underlying summarization
         LLM call is recorded via ``telemetry.record_llm_call(kind=
         "compaction")`` so its cost shows up in ``total_cost_usd``.
+        Raises ``CompactionFailedError`` if the post-compact context is still
+        over the threshold. This happens when a preserved message (typically
+        a giant tool output stuck in the untouched tail) is too large for
+        truncation to fix. The caller must terminate the session — retrying
+        is what caused the 2026-05-03 infinite-compaction-loop pattern that
+        burned Bedrock budget invisibly.
         """
         if not self.needs_compaction:
             return
         idx = len(self.items) - self.untouched_messages
         while idx > 1 and self.items[idx].role != "user":
             idx -= 1
+        # The real invariant is "idx must be strictly after first_user_idx,
+        # otherwise recent_messages overlaps with the messages we put in
+        # head". The walk-back's `idx > 1` guard is necessary (no system in
+        # recent) but insufficient (first_user is also in head and would be
+        # duplicated). Anthropic API rejects two consecutive user messages
+        # with a 400 — bot review on PR #213 caught this on the second clamp
+        # iteration.
+        if idx <= first_user_idx:
+            idx = first_user_idx + 1
         recent_messages = self.items[idx:]
         messages_to_summarize = self.items[first_user_idx + 1:idx]
+        # Truncate any message that's larger than _MAX_TOKENS_PER_MESSAGE in
+        # the parts we PRESERVE through compaction (first_user + recent_tail).
+        # These are the only places where individual messages can defeat
+        # compaction by being intrinsically too large. Messages in
+        # ``messages_to_summarize`` are folded into the summary, so their size
+        # doesn't matter on its own.
+        if first_user_msg is not None:
+            truncated = self._truncate_oversized([first_user_msg], model_name)
+            first_user_msg = truncated[0]
+        recent_messages = self._truncate_oversized(recent_messages, model_name)
+        # If there's nothing to summarize but the preserved messages are now
+        # truncated and small, just rebuild and recompute. This is rare but
+        # avoids returning silently with the old (over-threshold) state.
         if not messages_to_summarize:
+            head = [system_msg] if system_msg else []
+            if first_user_msg:
+                head.append(first_user_msg)
+            self.items = head + recent_messages
+            self._recompute_usage(model_name)
+            if self.running_context_usage > self.compaction_threshold:
+                raise CompactionFailedError(
+                    f"Nothing to summarize but context ({self.running_context_usage}) "
+                    f"still over threshold ({self.compaction_threshold}) after truncation. "
+                    f"System prompt or first user message likely exceeds the budget."
+                )
             return
         summary, completion_tokens = await summarize_messages(
             head.append(first_user_msg)
         self.items = head + [summarized_message] + recent_messages
+        self._recompute_usage(model_name)
+        # Hard verify: if compaction didn't bring us below the threshold even
+        # after truncating oversized preserved messages, retrying just burns
+        # Bedrock budget on the same useless compaction call. Raise so the
+        # caller can terminate the session cleanly. Pre-2026-05-04, the
+        # caller looped indefinitely (~$3/Opus retry) until the pod was
+        # killed — invisible to the dataset because the session never
+        # finished cleanly.
+        if self.running_context_usage > self.compaction_threshold:
+            raise CompactionFailedError(
+                f"Compaction ineffective: {self.running_context_usage} tokens "
+                f"still over threshold {self.compaction_threshold} after summarize "
+                f"and truncation. Likely the system prompt + first user + summary "
+                f"+ truncated tail still exceeds budget."
             )

agent/core/agent_loop.py CHANGED Viewed

@@ -516,19 +516,56 @@ def _friendly_error_message(error: Exception) -> str | None:
 async def _compact_and_notify(session: Session) -> None:
-    """Run compaction and send event if context was reduced."""
     cm = session.context_manager
     old_usage = cm.running_context_usage
     logger.debug(
         "Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
         old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
     )
-    await cm.compact(
-        model_name=session.config.model_name,
-        tool_specs=session.tool_router.get_tool_specs_for_llm(),
-        hf_token=session.hf_token,
-        session=session,
-    )
     new_usage = cm.running_context_usage
     if new_usage != old_usage:
         logger.warning(
@@ -1035,8 +1072,15 @@ class Handlers:
             if session.is_cancelled:
                 break
-            # Compact before calling the LLM if context is near the limit
             await _compact_and_notify(session)
             # Doom-loop detection: break out of repeated tool call patterns
             doom_prompt = check_for_doom_loop(session.context_manager.items)
@@ -1421,7 +1465,7 @@ class Handlers:
                 iteration += 1
             except ContextWindowExceededError:
-                # Force compact and retry this iteration
                 cm = session.context_manager
                 logger.warning(
                     "ContextWindowExceededError at iteration %d — forcing compaction "
@@ -1430,6 +1474,12 @@ class Handlers:
                 )
                 cm.running_context_usage = cm.model_max_tokens + 1
                 await _compact_and_notify(session)
                 continue
             except Exception as e:

 async def _compact_and_notify(session: Session) -> None:
+    """Run compaction and send event if context was reduced.
+    Catches ``CompactionFailedError`` and ends the session cleanly instead
+    of letting the caller retry. Pre-2026-05-04 the caller looped on
+    ContextWindowExceededError → compact → re-trigger, burning Bedrock
+    budget at ~$3/Opus retry while the session never reached the upload
+    path (so the cost was invisible in the dataset).
+    """
+    from agent.context_manager.manager import CompactionFailedError
     cm = session.context_manager
     old_usage = cm.running_context_usage
     logger.debug(
         "Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
         old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
     )
+    try:
+        await cm.compact(
+            model_name=session.config.model_name,
+            tool_specs=session.tool_router.get_tool_specs_for_llm(),
+            hf_token=session.hf_token,
+            session=session,
+        )
+    except CompactionFailedError as e:
+        logger.error(
+            "Compaction failed for session %s: %s — terminating session",
+            session.session_id, e,
+        )
+        # Persist the failure event so the dataset has a record of WHY this
+        # session ended (and the cost it incurred up to that point) even if
+        # save_and_upload_detached has issues downstream.
+        await session.send_event(Event(
+            event_type="session_terminated",
+            data={
+                "reason": "compaction_failed",
+                "context_usage": cm.running_context_usage,
+                "context_threshold": cm.compaction_threshold,
+                "error": str(e)[:300],
+                "user_message": (
+                    "Your conversation has grown too large to continue. "
+                    "The work you've done is saved — start a new session to keep going."
+                ),
+            },
+        ))
+        # Stop the agent loop; the finally in _run_session will fire
+        # cleanup_sandbox + save_trajectory so the dataset captures
+        # everything that did happen.
+        session.is_running = False
+        return
     new_usage = cm.running_context_usage
     if new_usage != old_usage:
         logger.warning(
             if session.is_cancelled:
                 break
+            # Compact before calling the LLM if context is near the limit.
+            # When _compact_and_notify catches CompactionFailedError it sets
+            # session.is_running = False; we MUST exit the loop here, otherwise
+            # the LLM call below fires with an over-threshold context, hits
+            # ContextWindowExceededError, and we end up looping again on the
+            # except path — exactly the bug this PR is supposed to fix.
             await _compact_and_notify(session)
+            if not session.is_running:
+                break
             # Doom-loop detection: break out of repeated tool call patterns
             doom_prompt = check_for_doom_loop(session.context_manager.items)
                 iteration += 1
             except ContextWindowExceededError:
+                # Force compact and retry this iteration.
                 cm = session.context_manager
                 logger.warning(
                     "ContextWindowExceededError at iteration %d — forcing compaction "
                 )
                 cm.running_context_usage = cm.model_max_tokens + 1
                 await _compact_and_notify(session)
+                # Same guard as the top of the loop: if compaction couldn't
+                # bring us under threshold, _compact_and_notify has already
+                # emitted session_terminated and set is_running=False. Continue
+                # would just re-call the LLM with the same too-big context.
+                if not session.is_running:
+                    break
                 continue
             except Exception as e:

tests/unit/test_compaction_loop_break.py ADDED Viewed

	@@ -0,0 +1,360 @@

+"""Regression tests for the 2026-05-03 infinite-compaction-loop bug.
+Pod logs from prod-114 showed sessions stuck retrying compaction every
+few seconds because a single oversized tool output in the untouched tail
+kept the post-compact context above the 90% threshold:
+    Context compacted: 200001 -> 215566 tokens
+    Context compacted: 215566 -> 215572 tokens
+    ContextWindowExceededError — forcing compaction
+    ... (continues for 5+ minutes)
+These tests cover three fixes:
+1. ``_truncate_oversized`` replaces oversized message content with a
+   placeholder and preserves all extended-thinking metadata fields.
+2. ``compact()`` raises ``CompactionFailedError`` when the post-compact
+   context is still over threshold.
+3. ``_compact_and_notify`` catches the error, sets ``session.is_running
+   = False``, and emits a ``session_terminated`` event so callers can
+   exit the agent loop.
+The P0 caught by PR #213 review (loop didn't actually exit on
+``is_running = False``) would have been caught by an end-to-end
+behavioral test of #3 — that gap is closed by the
+``test_compact_and_notify_terminates_session`` case below.
+"""
+from __future__ import annotations
+from unittest.mock import AsyncMock, MagicMock, patch
+import pytest
+from litellm import Message
+from agent.context_manager.manager import (
+    CompactionFailedError,
+    ContextManager,
+    _MAX_TOKENS_PER_MESSAGE,
+)
+# ── helpers ────────────────────────────────────────────────────────────
+def _make_cm(
+    *,
+    model_max_tokens: int = 100_000,
+    compact_size: int = 1_000,
+    untouched_messages: int = 5,
+) -> ContextManager:
+    cm = ContextManager.__new__(ContextManager)
+    cm.system_prompt = "system"
+    cm.model_max_tokens = model_max_tokens
+    cm.compact_size = compact_size
+    cm.running_context_usage = 0
+    cm.untouched_messages = untouched_messages
+    cm.items = [Message(role="system", content="system")]
+    cm.on_message_added = None
+    return cm
+def _msg(role: str, content: str | None = "x", **extra) -> Message:
+    return Message(role=role, content=content, **extra)
+# ── _truncate_oversized ────────────────────────────────────────────────
+def test_truncate_oversized_skips_messages_below_threshold():
+    cm = _make_cm()
+    msgs = [_msg("user", "small content")]
+    with patch("litellm.token_counter", return_value=100):
+        out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
+    assert out == msgs  # unchanged
+def test_truncate_oversized_replaces_content_above_threshold():
+    cm = _make_cm()
+    big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
+    msgs = [_msg("user", big)]
+    # token_counter returns the simulated big size for any message in this test
+    with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
+        out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
+    assert len(out) == 1
+    assert out[0].content != big
+    assert "[truncated for compaction" in out[0].content
+    assert str(_MAX_TOKENS_PER_MESSAGE * 2) in out[0].content
+def test_truncate_oversized_preserves_thinking_blocks():
+    """Anthropic extended-thinking models reject the next request with
+    ``Invalid signature in thinking block`` if a prior assistant message
+    drops thinking_blocks. Truncation must keep this metadata.
+    """
+    cm = _make_cm()
+    big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
+    thinking = [{"type": "thinking", "thinking": "...", "signature": "abc123"}]
+    msg = Message(role="assistant", content=big)
+    msg.thinking_blocks = thinking
+    msg.reasoning_content = "deep thought"
+    with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
+        out = cm._truncate_oversized([msg], "anthropic/claude-opus-4-6")
+    assert getattr(out[0], "thinking_blocks", None) == thinking
+    assert getattr(out[0], "reasoning_content", None) == "deep thought"
+def test_truncate_oversized_never_touches_system_message():
+    """The system prompt is the agent's instructions — must never be truncated.
+    Caught by the integration smoke test on PR #213: when items has fewer than
+    ``untouched_messages`` entries, the slice math in ``compact()`` can let
+    ``items[0]`` (the system message) leak into the ``recent_messages`` list
+    that gets passed to ``_truncate_oversized``. The function must guard
+    explicitly against this.
+    """
+    cm = _make_cm()
+    huge_system = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
+    msgs = [_msg("system", huge_system)]
+    with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
+        out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
+    assert out[0].content == huge_system, "system message must never be truncated"
+def test_truncate_oversized_resilient_to_token_counter_failure():
+    """token_counter occasionally raises on edge-case content. A blip there
+    must NOT drop the message — better to leave it and let compaction
+    handle it (or fail with CompactionFailedError) than to lose data.
+    """
+    cm = _make_cm()
+    msgs = [_msg("user", "anything")]
+    with patch("litellm.token_counter", side_effect=Exception("counter blew up")):
+        out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
+    assert out == msgs
+# ── compact() raises CompactionFailedError ─────────────────────────────
+@pytest.mark.asyncio
+async def test_compact_raises_when_post_compact_still_over_threshold():
+    """The whole point of the new behavior: don't loop on a useless
+    compaction call. Raise so the caller can terminate the session.
+    """
+    cm = _make_cm(model_max_tokens=100_000)
+    # Build a context that's "over threshold" from the start
+    cm.items = [
+        Message(role="system", content="system"),
+        Message(role="user", content="task"),
+        Message(role="assistant", content="x" * 1000),
+        Message(role="user", content="follow-up 1"),
+        Message(role="assistant", content="reply 1"),
+        Message(role="user", content="follow-up 2"),
+        Message(role="assistant", content="reply 2"),
+    ]
+    cm.running_context_usage = 95_000  # over threshold (90% of 100k = 90k)
+    # Mock summarize_messages to return a tiny summary; mock _recompute_usage
+    # to keep the running_context_usage above threshold so compact() raises.
+    async def fake_summarize(*args, **kwargs):
+        return ("summary", 10)
+    def fake_recompute(self, model_name):
+        # Simulate post-compact still over threshold
+        self.running_context_usage = 95_000
+    with (
+        patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
+        patch.object(ContextManager, "_recompute_usage", fake_recompute),
+        # Avoid token_counter calls in _truncate_oversized
+        patch("litellm.token_counter", return_value=100),
+    ):
+        with pytest.raises(CompactionFailedError):
+            await cm.compact(
+                model_name="anthropic/claude-opus-4-6",
+                tool_specs=None,
+                hf_token=None,
+                session=None,
+            )
+@pytest.mark.asyncio
+async def test_compact_does_not_duplicate_system_when_idx_is_zero():
+    """Regression for the second P0 caught by bot review on PR #213.
+    When ``len(items) == untouched_messages`` (the canonical 5-message
+    early-compaction case: system + user-task + giant-tool-output +
+    user-followup + assistant-reply), ``idx`` initialises to 0 and the
+    walk-back ``while idx > 1`` loop is a no-op. Without an explicit
+    clamp ``if idx < 1: idx = 1``, ``recent_messages = items[0:]``
+    starts at the system message, and the rebuild duplicates system +
+    first-user. Anthropic API rejects two system messages.
+    """
+    cm = _make_cm(model_max_tokens=100_000, untouched_messages=5)
+    cm.items = [
+        Message(role="system", content="system"),
+        Message(role="user", content="task"),
+        Message(role="assistant", content="ok"),  # would be the only
+                                                   # message_to_summarize but the
+                                                   # idx bug pulls it into recent
+        Message(role="user", content="followup"),
+        Message(role="assistant", content="reply"),
+    ]  # exactly 5 = untouched_messages, so idx initialises to 0
+    cm.running_context_usage = 95_000
+    async def fake_summarize(*args, **kwargs):
+        return ("summary", 10)
+    def fake_recompute(self, model_name):
+        self.running_context_usage = 5_000
+    with (
+        patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
+        patch.object(ContextManager, "_recompute_usage", fake_recompute),
+        patch("litellm.token_counter", return_value=100),
+    ):
+        await cm.compact(
+            model_name="anthropic/claude-opus-4-6",
+            tool_specs=None,
+            hf_token=None,
+            session=None,
+        )
+    # Critical assertion: only ONE system message in items
+    system_count = sum(1 for m in cm.items if m.role == "system")
+    assert system_count == 1, (
+        f"Expected exactly 1 system message, found {system_count}. "
+        f"Roles: {[m.role for m in cm.items]}"
+    )
+    # And the first-user "task" message must also appear exactly once.
+    # Bot review on PR #213 caught a follow-up bug: clamping idx=1
+    # excludes the system but still overlaps with first_user_idx (also 1),
+    # so first_user_msg ends up in BOTH head and recent_messages →
+    # duplicate user message → Anthropic 400 (two consecutive user roles).
+    task_count = sum(
+        1 for m in cm.items
+        if m.role == "user" and (m.content or "") == "task"
+    )
+    assert task_count == 1, (
+        f"Expected exactly 1 'task' user message, found {task_count}. "
+        f"Roles+content: {[(m.role, (m.content or '')[:20]) for m in cm.items]}"
+    )
+    # Defense in depth: no two consecutive same-role messages (Anthropic
+    # API contract). System counts separately.
+    non_system = [m for m in cm.items if m.role != "system"]
+    for i in range(1, len(non_system)):
+        assert non_system[i].role != non_system[i-1].role, (
+            f"Two consecutive {non_system[i].role} messages at non-system "
+            f"position {i-1},{i} — Anthropic API rejects this. "
+            f"Roles: {[m.role for m in cm.items]}"
+        )
+@pytest.mark.asyncio
+async def test_compact_succeeds_when_post_compact_under_threshold():
+    """Happy path: when compaction does its job, no exception raised."""
+    cm = _make_cm(model_max_tokens=100_000)
+    cm.items = [
+        Message(role="system", content="system"),
+        Message(role="user", content="task"),
+        Message(role="assistant", content="x" * 1000),
+        Message(role="user", content="follow-up"),
+        Message(role="assistant", content="reply"),
+        Message(role="user", content="follow-up 2"),
+        Message(role="assistant", content="reply 2"),
+    ]
+    cm.running_context_usage = 95_000
+    async def fake_summarize(*args, **kwargs):
+        return ("summary", 10)
+    def fake_recompute(self, model_name):
+        self.running_context_usage = 5_000  # well under threshold
+    with (
+        patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
+        patch.object(ContextManager, "_recompute_usage", fake_recompute),
+        patch("litellm.token_counter", return_value=100),
+    ):
+        await cm.compact(
+            model_name="anthropic/claude-opus-4-6",
+            tool_specs=None,
+            hf_token=None,
+            session=None,
+        )
+    assert cm.running_context_usage == 5_000
+# ── _compact_and_notify behavior on CompactionFailedError ──────────────
+@pytest.mark.asyncio
+async def test_compact_and_notify_terminates_session_on_failure():
+    """The PR's #213's P0 bug-class: setting ``is_running = False`` is
+    only effective if the agent loop checks it. This test asserts the
+    flag IS set AND a ``session_terminated`` event is emitted, so a
+    follow-up assertion in the agent loop test catches the loop-exit.
+    """
+    from agent.core.agent_loop import _compact_and_notify
+    session = MagicMock()
+    session.session_id = "sess-123"
+    session.is_running = True
+    session.config.model_name = "anthropic/claude-opus-4-6"
+    session.hf_token = None
+    session.tool_router.get_tool_specs_for_llm.return_value = []
+    session.send_event = AsyncMock()
+    cm = MagicMock()
+    cm.running_context_usage = 95_000
+    cm.compaction_threshold = 90_000
+    cm.model_max_tokens = 100_000
+    cm.items = []
+    cm.needs_compaction = True
+    cm.compact = AsyncMock(side_effect=CompactionFailedError("ineffective"))
+    session.context_manager = cm
+    await _compact_and_notify(session)
+    assert session.is_running is False, (
+        "_compact_and_notify must set is_running=False so the agent loop "
+        "can exit. P0 caught by bot review on PR #213 was that the loop "
+        "didn't actually check this flag."
+    )
+    assert session.send_event.await_count == 1
+    event = session.send_event.await_args.args[0]
+    assert event.event_type == "session_terminated"
+    assert event.data["reason"] == "compaction_failed"
+    assert event.data["context_usage"] == 95_000
+@pytest.mark.asyncio
+async def test_compact_and_notify_passes_through_on_success():
+    """When compaction succeeds, no termination event, is_running stays True."""
+    from agent.core.agent_loop import _compact_and_notify
+    session = MagicMock()
+    session.session_id = "sess-456"
+    session.is_running = True
+    session.config.model_name = "anthropic/claude-opus-4-6"
+    session.hf_token = None
+    session.tool_router.get_tool_specs_for_llm.return_value = []
+    session.send_event = AsyncMock()
+    cm = MagicMock()
+    cm.running_context_usage = 5_000
+    cm.compaction_threshold = 90_000
+    cm.model_max_tokens = 100_000
+    cm.items = []
+    cm.needs_compaction = False
+    cm.compact = AsyncMock(return_value=None)  # success
+    session.context_manager = cm
+    # Pretend old_usage == new_usage so the "compacted" event is also skipped
+    await _compact_and_notify(session)
+    assert session.is_running is True
+    # No session_terminated event emitted
+    for call in session.send_event.await_args_list:
+        ev = call.args[0]
+        assert ev.event_type != "session_terminated"