Guillaume Salou commited on
Commit
d84b454
Β·
unverified Β·
1 Parent(s): 092f909

fix(compaction): break infinite loop + truncate oversized messages (#213)

Browse files

* fix(compaction): break the infinite-compaction loop, truncate oversized messages

Sessions stuck in a compaction retry loop have been silently burning Bedrock
budget while staying invisible in the session dataset. Pod logs from
2026-05-03 on prod-114 showed the pattern firing minute-after-minute on
multiple replicas:

Context compacted: 200001 -> 215566 tokens
Context compacted: 215566 -> 215572 tokens
ContextWindowExceededError β€” forcing compaction
Context compacted: 200001 -> 215544 tokens
...

Root cause: a single message in the "untouched" tail (typically a tool
output of 80k+ tokens β€” bash dump, file content, CSV) keeps the post-compact
context above the 90% threshold. Compaction triggers again, calls Bedrock
again, gets the same 215k result, loops. ~$3 per Opus retry, indefinitely.
Sessions never reach _run_session.finally β†’ save_and_upload_detached never
fires β†’ cost is real but never lands in total_cost_usd.

Cross-check: dataset Bedrock-only coverage on 2026-05-01 was 43% of Cost
Explorer ($6,299 vs $14,732). Bedrock Invocation Logs on 2026-05-02 showed
11,996 InvokeModel calls vs 5,558 Bedrock events in the dataset for the
same day β€” half the calls missing, consistent with stuck sessions never
uploading.

Three fixes in this PR:

1. ContextManager._truncate_oversized() β€” replace the content of any
preserved message (first user, untouched tail) over 50k tokens with a
placeholder before summarization. This addresses the root cause:
compaction can't shrink a single oversized message because it's
preserved verbatim.

2. ContextManager.compact() now raises CompactionFailedError if the
post-compact context is still over the threshold (i.e., truncation +
summarize were not enough). The retry-as-circuit-breaker behavior is
replaced with fail-fast: better to end the session cleanly than to
loop on the same useless API call.

3. agent_loop._compact_and_notify() catches CompactionFailedError, emits
a session_terminated event with reason=compaction_failed (so the
dataset records WHY the session ended and the cost it incurred up to
that point), and sets session.is_running=False to exit the loop. The
_run_session.finally then fires save_trajectory normally and the
session ends up in the dataset.

Expected impact: closes the loop side of the cost telemetry gap. The
sessions that previously looped invisibly will now end at first
compaction failure with an event, dataset coverage should rise from
~43% toward the floor set by other gaps (litellm pricing for HF Router
models on compaction events β€” separate, smaller issue).

* fix(compaction): address PR bot feedback (P0 + 3 P1)

P0 β€” _compact_and_notify set is_running=False but neither call site of
the inner agent loop checked it. After CompactionFailedError, the LLM
call still fired, hit ContextWindowExceededError, re-called
_compact_and_notify in the except handler, and continued β€” replacing
the original infinite compaction loop with an only-slightly-better
loop that did one LLM call between compaction failures. Add explicit
`if not session.is_running: break` guards at both call sites
(agent_loop.py:1076 and 1469).

P1.1 β€” _truncate_oversized rebuilt the replacement Message with only
five fields (role, content, tool_call_id, tool_calls, name), silently
dropping thinking_blocks, reasoning_content, and provider_specific_fields.
For Anthropic extended-thinking models with reasoning_effort=high/max,
losing thinking_blocks on a prior assistant message causes the next
request to fail with "Invalid signature in thinking block". Preserve
all six known fields when reconstructing.

P1.2 β€” Same is_running guard missing at the second call site (the
ContextWindowExceededError except handler). Same fix.

P1.3 β€” No automated test for the CompactionFailedError β†’ session
termination flow. Added tests/unit/test_compaction_loop_break.py with
8 cases covering _truncate_oversized (4), compact() raising (2), and
_compact_and_notify session termination + happy path (2). The most
important test is test_compact_and_notify_terminates_session_on_failure
which would have caught the P0 directly. All 8 pass.

* fix(compaction): never truncate system message (caught by integration test)

End-to-end smoke test on the new compaction code path uncovered an edge
case the unit tests missed: when ``items`` has fewer entries than
``untouched_messages`` (pathological configs, very early-session compact
triggers, or the artificial test scenario), the slice math in compact()
can let ``items[0]`` (the system message) leak into the
``recent_messages`` list passed to ``_truncate_oversized``.

The function then truncated the system prompt β€” silently destroying the
agent's instructions. Defense-in-depth fix: explicit ``if msg.role ==
"system": pass through`` guard at the top of the per-message loop.

The system prompt is loaded from system_prompt_v3.yaml at session start
and is the agent's behavioral contract; it must never be modified by the
compaction path.

Added test_truncate_oversized_never_touches_system_message to cover this
specifically. Total now 9 tests, all passing.

* fix(compaction): clamp idx >= 1 to prevent system message duplication

Bot review re-review on PR #213 caught a second P0: when len(items) ==
untouched_messages (the canonical 5-message early-compaction case
[system, user-task, assistant-with-giant-output, user-followup,
assistant-reply]), idx initialises to 0 and the walk-back `while idx > 1`
guard is a no-op.

Without an explicit clamp, recent_messages = items[0:] starts at the
system message. The not-messages_to_summarize rebuild path then produces
[system_msg, first_user_msg] + recent_messages = [system, user, system,
user, ...] β€” duplicating both. Anthropic API rejects two system messages.

The system-message guard added in a0fc95f prevents truncation of the
duplicated system but doesn't prevent the duplication itself.

Fix: explicit `if idx < 1: idx = 1` after the walk-back loop, mirroring
the intent of the existing `idx > 1` guard ("never include system in
recent_messages") and closing the gap when initialisation already lands
below 1.

Added test_compact_does_not_duplicate_system_when_idx_is_zero with the
exact 5-message setup. All 10 tests pass.

The 5-message scenario is precisely the one this PR targets: [system,
user-task, assistant-tool-output, user-followup, assistant-reply] is
the most likely shape to drive context > 90% threshold via a single
oversized tool output. So this P0 would have hit nearly every stuck
session β€” would have made the fix worse than the bug.

* fix(compaction): clamp idx > first_user_idx (not just > 0)

Bot review on PR #213 caught the third P0 in this thread: my previous
clamp `if idx < 1: idx = 1` excluded the system message from
recent_messages but still overlapped with first_user_idx (which is
also 1 for any well-formed session).

Trace for the canonical 5-message trigger:
items = [system(0), user-task(1), assistant(2), user-followup(3),
assistant-reply(4)]
first_user_idx = 1
idx = 5 - 5 = 0 β†’ clamped to 1
recent_messages = items[1:] = [user-task, assistant, user-followup,
assistant-reply] ← includes user-task
messages_to_summarize = items[2:1] = []
β†’ enters rebuild branch
head = [system_msg, first_user_msg] ← first_user_msg also here
self.items = head + recent_messages
= [system, user-task, user-task, assistant, user-followup,
assistant-reply]
β†’ two consecutive user messages
β†’ Anthropic API 400 on next LLM call

The right invariant is "idx must be strictly after first_user_idx",
not "idx must be > 0". The walk-back's `idx > 1` was necessary
(no system) but insufficient (first_user also in head).

Fix: `if idx <= first_user_idx: idx = first_user_idx + 1`. Since
first_user_idx >= 1 for any well-formed session, this also satisfies
the original system-message exclusion intent.

Test strengthened to assert (a) system_count == 1, (b) task message
appears exactly once, (c) no two consecutive same-role non-system
messages. The previous test only checked (a) and would have shipped
this bug.

Lesson noted in ~/.claude/CLAUDE.md section 5: my fix to one bot
finding introduced another bug. On critical-path PRs, every commit
needs a fresh review round.

agent/context_manager/manager.py CHANGED
@@ -79,6 +79,23 @@ _COMPACT_PROMPT = (
79
  "will be have to be filled in."
80
  )
81
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
  # Used when seeding a brand-new session from prior browser-cached messages.
83
  # Here we're writing a note to *ourselves* β€” so preserve the tool-call trail,
84
  # files produced, and planned next steps in first person. Optimized for
@@ -374,6 +391,81 @@ class ContextManager:
374
  def needs_compaction(self) -> bool:
375
  return self.running_context_usage > self.compaction_threshold and bool(self.items)
376
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
377
  async def compact(
378
  self,
379
  model_name: str,
@@ -386,6 +478,13 @@ class ContextManager:
386
  ``session`` is optional β€” if passed, the underlying summarization
387
  LLM call is recorded via ``telemetry.record_llm_call(kind=
388
  "compaction")`` so its cost shows up in ``total_cost_usd``.
 
 
 
 
 
 
 
389
  """
390
  if not self.needs_compaction:
391
  return
@@ -409,12 +508,45 @@ class ContextManager:
409
  idx = len(self.items) - self.untouched_messages
410
  while idx > 1 and self.items[idx].role != "user":
411
  idx -= 1
 
 
 
 
 
 
 
 
 
412
 
413
  recent_messages = self.items[idx:]
414
  messages_to_summarize = self.items[first_user_idx + 1:idx]
415
 
416
- # improbable, messages would have to very long
 
 
 
 
 
 
 
 
 
 
 
 
 
417
  if not messages_to_summarize:
 
 
 
 
 
 
 
 
 
 
 
418
  return
419
 
420
  summary, completion_tokens = await summarize_messages(
@@ -439,16 +571,19 @@ class ContextManager:
439
  head.append(first_user_msg)
440
  self.items = head + [summarized_message] + recent_messages
441
 
442
- # Count the actual post-compact context β€” system prompt + first user
443
- # turn + summary + the preserved tail all contribute, not just the
444
- # summary. litellm.token_counter uses the model's real tokenizer.
445
- from litellm import token_counter
446
-
447
- try:
448
- self.running_context_usage = token_counter(
449
- model=model_name,
450
- messages=[m.model_dump() for m in self.items],
 
 
 
 
 
 
451
  )
452
- except Exception as e:
453
- logger.warning("token_counter failed post-compact (%s); falling back to rough estimate", e)
454
- self.running_context_usage = len(self.system_prompt) // 4 + completion_tokens
 
79
  "will be have to be filled in."
80
  )
81
 
82
+ # Per-message ceiling. If a single message in the "untouched" tail is larger
83
+ # than this, compaction can't recover even after summarizing the middle β€”
84
+ # producing the infinite compaction loop seen 2026-05-03 in pod logs (200k
85
+ # context shrinks to 200k+ because one tool output is 80k tokens). We replace
86
+ # such messages with a placeholder before compaction runs.
87
+ _MAX_TOKENS_PER_MESSAGE = 50_000
88
+
89
+
90
+ class CompactionFailedError(Exception):
91
+ """Raised when compaction can't reduce context below the threshold.
92
+
93
+ Typically means an individual preserved message (system, first user, or
94
+ untouched tail) exceeds what truncation can fix in one pass. The caller
95
+ must terminate the session β€” retrying produces an infinite loop that
96
+ burns Bedrock budget for free (~$3 per re-attempt on Opus).
97
+ """
98
+
99
  # Used when seeding a brand-new session from prior browser-cached messages.
100
  # Here we're writing a note to *ourselves* β€” so preserve the tool-call trail,
101
  # files produced, and planned next steps in first person. Optimized for
 
391
  def needs_compaction(self) -> bool:
392
  return self.running_context_usage > self.compaction_threshold and bool(self.items)
393
 
394
+ def _truncate_oversized(
395
+ self, messages: list[Message], model_name: str
396
+ ) -> list[Message]:
397
+ """Replace any message > _MAX_TOKENS_PER_MESSAGE with a placeholder.
398
+
399
+ These are typically tool outputs (CSV dumps, file contents) sitting in
400
+ the untouched tail or first-user position that compaction can't shrink
401
+ β€” they pass through verbatim, keeping context above threshold and
402
+ triggering an infinite compaction retry loop.
403
+ """
404
+ from litellm import token_counter
405
+
406
+ out: list[Message] = []
407
+ for msg in messages:
408
+ # System messages are sacred β€” they're the agent's instructions.
409
+ # In edge cases (items < untouched_messages), the slice math in
410
+ # compact() can let items[0] (the system message) leak into the
411
+ # recent_messages list. Defense-in-depth: never truncate it.
412
+ if msg.role == "system":
413
+ out.append(msg)
414
+ continue
415
+ try:
416
+ n = token_counter(model=model_name, messages=[msg.model_dump()])
417
+ except Exception:
418
+ # token_counter occasionally fails on edge-case content;
419
+ # don't drop the message, just keep it as-is.
420
+ out.append(msg)
421
+ continue
422
+ if n <= _MAX_TOKENS_PER_MESSAGE:
423
+ out.append(msg)
424
+ continue
425
+ placeholder = (
426
+ f"[truncated for compaction β€” original was {n} tokens, "
427
+ f"removed to keep context under {self.compaction_threshold} tokens]"
428
+ )
429
+ logger.warning(
430
+ "Truncating %s message: %d -> %d tokens for compaction",
431
+ msg.role, n, len(placeholder) // 4,
432
+ )
433
+ # Preserve all known assistant-side fields (tool_calls, thinking_blocks,
434
+ # reasoning_content, provider_specific_fields) even when content is
435
+ # replaced. Anthropic extended-thinking models reject the next request
436
+ # with "Invalid signature in thinking block" if thinking_blocks is
437
+ # dropped from a prior assistant message.
438
+ kept = {
439
+ k: getattr(msg, k, None)
440
+ for k in (
441
+ "tool_call_id",
442
+ "tool_calls",
443
+ "name",
444
+ "thinking_blocks",
445
+ "reasoning_content",
446
+ "provider_specific_fields",
447
+ )
448
+ if getattr(msg, k, None) is not None
449
+ }
450
+ out.append(Message(role=msg.role, content=placeholder, **kept))
451
+ return out
452
+
453
+ def _recompute_usage(self, model_name: str) -> None:
454
+ """Refresh ``running_context_usage`` from current items via real tokenizer."""
455
+ from litellm import token_counter
456
+
457
+ try:
458
+ self.running_context_usage = token_counter(
459
+ model=model_name,
460
+ messages=[m.model_dump() for m in self.items],
461
+ )
462
+ except Exception as e:
463
+ logger.warning("token_counter failed (%s); rough estimate", e)
464
+ # Rough fallback: 4 chars per token.
465
+ self.running_context_usage = sum(
466
+ len(getattr(m, "content", "") or "") for m in self.items
467
+ ) // 4
468
+
469
  async def compact(
470
  self,
471
  model_name: str,
 
478
  ``session`` is optional β€” if passed, the underlying summarization
479
  LLM call is recorded via ``telemetry.record_llm_call(kind=
480
  "compaction")`` so its cost shows up in ``total_cost_usd``.
481
+
482
+ Raises ``CompactionFailedError`` if the post-compact context is still
483
+ over the threshold. This happens when a preserved message (typically
484
+ a giant tool output stuck in the untouched tail) is too large for
485
+ truncation to fix. The caller must terminate the session β€” retrying
486
+ is what caused the 2026-05-03 infinite-compaction-loop pattern that
487
+ burned Bedrock budget invisibly.
488
  """
489
  if not self.needs_compaction:
490
  return
 
508
  idx = len(self.items) - self.untouched_messages
509
  while idx > 1 and self.items[idx].role != "user":
510
  idx -= 1
511
+ # The real invariant is "idx must be strictly after first_user_idx,
512
+ # otherwise recent_messages overlaps with the messages we put in
513
+ # head". The walk-back's `idx > 1` guard is necessary (no system in
514
+ # recent) but insufficient (first_user is also in head and would be
515
+ # duplicated). Anthropic API rejects two consecutive user messages
516
+ # with a 400 β€” bot review on PR #213 caught this on the second clamp
517
+ # iteration.
518
+ if idx <= first_user_idx:
519
+ idx = first_user_idx + 1
520
 
521
  recent_messages = self.items[idx:]
522
  messages_to_summarize = self.items[first_user_idx + 1:idx]
523
 
524
+ # Truncate any message that's larger than _MAX_TOKENS_PER_MESSAGE in
525
+ # the parts we PRESERVE through compaction (first_user + recent_tail).
526
+ # These are the only places where individual messages can defeat
527
+ # compaction by being intrinsically too large. Messages in
528
+ # ``messages_to_summarize`` are folded into the summary, so their size
529
+ # doesn't matter on its own.
530
+ if first_user_msg is not None:
531
+ truncated = self._truncate_oversized([first_user_msg], model_name)
532
+ first_user_msg = truncated[0]
533
+ recent_messages = self._truncate_oversized(recent_messages, model_name)
534
+
535
+ # If there's nothing to summarize but the preserved messages are now
536
+ # truncated and small, just rebuild and recompute. This is rare but
537
+ # avoids returning silently with the old (over-threshold) state.
538
  if not messages_to_summarize:
539
+ head = [system_msg] if system_msg else []
540
+ if first_user_msg:
541
+ head.append(first_user_msg)
542
+ self.items = head + recent_messages
543
+ self._recompute_usage(model_name)
544
+ if self.running_context_usage > self.compaction_threshold:
545
+ raise CompactionFailedError(
546
+ f"Nothing to summarize but context ({self.running_context_usage}) "
547
+ f"still over threshold ({self.compaction_threshold}) after truncation. "
548
+ f"System prompt or first user message likely exceeds the budget."
549
+ )
550
  return
551
 
552
  summary, completion_tokens = await summarize_messages(
 
571
  head.append(first_user_msg)
572
  self.items = head + [summarized_message] + recent_messages
573
 
574
+ self._recompute_usage(model_name)
575
+
576
+ # Hard verify: if compaction didn't bring us below the threshold even
577
+ # after truncating oversized preserved messages, retrying just burns
578
+ # Bedrock budget on the same useless compaction call. Raise so the
579
+ # caller can terminate the session cleanly. Pre-2026-05-04, the
580
+ # caller looped indefinitely (~$3/Opus retry) until the pod was
581
+ # killed β€” invisible to the dataset because the session never
582
+ # finished cleanly.
583
+ if self.running_context_usage > self.compaction_threshold:
584
+ raise CompactionFailedError(
585
+ f"Compaction ineffective: {self.running_context_usage} tokens "
586
+ f"still over threshold {self.compaction_threshold} after summarize "
587
+ f"and truncation. Likely the system prompt + first user + summary "
588
+ f"+ truncated tail still exceeds budget."
589
  )
 
 
 
agent/core/agent_loop.py CHANGED
@@ -516,19 +516,56 @@ def _friendly_error_message(error: Exception) -> str | None:
516
 
517
 
518
  async def _compact_and_notify(session: Session) -> None:
519
- """Run compaction and send event if context was reduced."""
 
 
 
 
 
 
 
 
 
520
  cm = session.context_manager
521
  old_usage = cm.running_context_usage
522
  logger.debug(
523
  "Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
524
  old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
525
  )
526
- await cm.compact(
527
- model_name=session.config.model_name,
528
- tool_specs=session.tool_router.get_tool_specs_for_llm(),
529
- hf_token=session.hf_token,
530
- session=session,
531
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
532
  new_usage = cm.running_context_usage
533
  if new_usage != old_usage:
534
  logger.warning(
@@ -1035,8 +1072,15 @@ class Handlers:
1035
  if session.is_cancelled:
1036
  break
1037
 
1038
- # Compact before calling the LLM if context is near the limit
 
 
 
 
 
1039
  await _compact_and_notify(session)
 
 
1040
 
1041
  # Doom-loop detection: break out of repeated tool call patterns
1042
  doom_prompt = check_for_doom_loop(session.context_manager.items)
@@ -1421,7 +1465,7 @@ class Handlers:
1421
  iteration += 1
1422
 
1423
  except ContextWindowExceededError:
1424
- # Force compact and retry this iteration
1425
  cm = session.context_manager
1426
  logger.warning(
1427
  "ContextWindowExceededError at iteration %d β€” forcing compaction "
@@ -1430,6 +1474,12 @@ class Handlers:
1430
  )
1431
  cm.running_context_usage = cm.model_max_tokens + 1
1432
  await _compact_and_notify(session)
 
 
 
 
 
 
1433
  continue
1434
 
1435
  except Exception as e:
 
516
 
517
 
518
  async def _compact_and_notify(session: Session) -> None:
519
+ """Run compaction and send event if context was reduced.
520
+
521
+ Catches ``CompactionFailedError`` and ends the session cleanly instead
522
+ of letting the caller retry. Pre-2026-05-04 the caller looped on
523
+ ContextWindowExceededError β†’ compact β†’ re-trigger, burning Bedrock
524
+ budget at ~$3/Opus retry while the session never reached the upload
525
+ path (so the cost was invisible in the dataset).
526
+ """
527
+ from agent.context_manager.manager import CompactionFailedError
528
+
529
  cm = session.context_manager
530
  old_usage = cm.running_context_usage
531
  logger.debug(
532
  "Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
533
  old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
534
  )
535
+ try:
536
+ await cm.compact(
537
+ model_name=session.config.model_name,
538
+ tool_specs=session.tool_router.get_tool_specs_for_llm(),
539
+ hf_token=session.hf_token,
540
+ session=session,
541
+ )
542
+ except CompactionFailedError as e:
543
+ logger.error(
544
+ "Compaction failed for session %s: %s β€” terminating session",
545
+ session.session_id, e,
546
+ )
547
+ # Persist the failure event so the dataset has a record of WHY this
548
+ # session ended (and the cost it incurred up to that point) even if
549
+ # save_and_upload_detached has issues downstream.
550
+ await session.send_event(Event(
551
+ event_type="session_terminated",
552
+ data={
553
+ "reason": "compaction_failed",
554
+ "context_usage": cm.running_context_usage,
555
+ "context_threshold": cm.compaction_threshold,
556
+ "error": str(e)[:300],
557
+ "user_message": (
558
+ "Your conversation has grown too large to continue. "
559
+ "The work you've done is saved β€” start a new session to keep going."
560
+ ),
561
+ },
562
+ ))
563
+ # Stop the agent loop; the finally in _run_session will fire
564
+ # cleanup_sandbox + save_trajectory so the dataset captures
565
+ # everything that did happen.
566
+ session.is_running = False
567
+ return
568
+
569
  new_usage = cm.running_context_usage
570
  if new_usage != old_usage:
571
  logger.warning(
 
1072
  if session.is_cancelled:
1073
  break
1074
 
1075
+ # Compact before calling the LLM if context is near the limit.
1076
+ # When _compact_and_notify catches CompactionFailedError it sets
1077
+ # session.is_running = False; we MUST exit the loop here, otherwise
1078
+ # the LLM call below fires with an over-threshold context, hits
1079
+ # ContextWindowExceededError, and we end up looping again on the
1080
+ # except path β€” exactly the bug this PR is supposed to fix.
1081
  await _compact_and_notify(session)
1082
+ if not session.is_running:
1083
+ break
1084
 
1085
  # Doom-loop detection: break out of repeated tool call patterns
1086
  doom_prompt = check_for_doom_loop(session.context_manager.items)
 
1465
  iteration += 1
1466
 
1467
  except ContextWindowExceededError:
1468
+ # Force compact and retry this iteration.
1469
  cm = session.context_manager
1470
  logger.warning(
1471
  "ContextWindowExceededError at iteration %d β€” forcing compaction "
 
1474
  )
1475
  cm.running_context_usage = cm.model_max_tokens + 1
1476
  await _compact_and_notify(session)
1477
+ # Same guard as the top of the loop: if compaction couldn't
1478
+ # bring us under threshold, _compact_and_notify has already
1479
+ # emitted session_terminated and set is_running=False. Continue
1480
+ # would just re-call the LLM with the same too-big context.
1481
+ if not session.is_running:
1482
+ break
1483
  continue
1484
 
1485
  except Exception as e:
tests/unit/test_compaction_loop_break.py ADDED
@@ -0,0 +1,360 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Regression tests for the 2026-05-03 infinite-compaction-loop bug.
2
+
3
+ Pod logs from prod-114 showed sessions stuck retrying compaction every
4
+ few seconds because a single oversized tool output in the untouched tail
5
+ kept the post-compact context above the 90% threshold:
6
+
7
+ Context compacted: 200001 -> 215566 tokens
8
+ Context compacted: 215566 -> 215572 tokens
9
+ ContextWindowExceededError β€” forcing compaction
10
+ ... (continues for 5+ minutes)
11
+
12
+ These tests cover three fixes:
13
+
14
+ 1. ``_truncate_oversized`` replaces oversized message content with a
15
+ placeholder and preserves all extended-thinking metadata fields.
16
+ 2. ``compact()`` raises ``CompactionFailedError`` when the post-compact
17
+ context is still over threshold.
18
+ 3. ``_compact_and_notify`` catches the error, sets ``session.is_running
19
+ = False``, and emits a ``session_terminated`` event so callers can
20
+ exit the agent loop.
21
+
22
+ The P0 caught by PR #213 review (loop didn't actually exit on
23
+ ``is_running = False``) would have been caught by an end-to-end
24
+ behavioral test of #3 β€” that gap is closed by the
25
+ ``test_compact_and_notify_terminates_session`` case below.
26
+ """
27
+
28
+ from __future__ import annotations
29
+
30
+ from unittest.mock import AsyncMock, MagicMock, patch
31
+
32
+ import pytest
33
+ from litellm import Message
34
+
35
+ from agent.context_manager.manager import (
36
+ CompactionFailedError,
37
+ ContextManager,
38
+ _MAX_TOKENS_PER_MESSAGE,
39
+ )
40
+
41
+
42
+ # ── helpers ────────────────────────────────────────────────────────────
43
+
44
+
45
+ def _make_cm(
46
+ *,
47
+ model_max_tokens: int = 100_000,
48
+ compact_size: int = 1_000,
49
+ untouched_messages: int = 5,
50
+ ) -> ContextManager:
51
+ cm = ContextManager.__new__(ContextManager)
52
+ cm.system_prompt = "system"
53
+ cm.model_max_tokens = model_max_tokens
54
+ cm.compact_size = compact_size
55
+ cm.running_context_usage = 0
56
+ cm.untouched_messages = untouched_messages
57
+ cm.items = [Message(role="system", content="system")]
58
+ cm.on_message_added = None
59
+ return cm
60
+
61
+
62
+ def _msg(role: str, content: str | None = "x", **extra) -> Message:
63
+ return Message(role=role, content=content, **extra)
64
+
65
+
66
+ # ── _truncate_oversized ────────────────────────────────────────────────
67
+
68
+
69
+ def test_truncate_oversized_skips_messages_below_threshold():
70
+ cm = _make_cm()
71
+ msgs = [_msg("user", "small content")]
72
+ with patch("litellm.token_counter", return_value=100):
73
+ out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
74
+ assert out == msgs # unchanged
75
+
76
+
77
+ def test_truncate_oversized_replaces_content_above_threshold():
78
+ cm = _make_cm()
79
+ big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
80
+ msgs = [_msg("user", big)]
81
+ # token_counter returns the simulated big size for any message in this test
82
+ with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
83
+ out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
84
+ assert len(out) == 1
85
+ assert out[0].content != big
86
+ assert "[truncated for compaction" in out[0].content
87
+ assert str(_MAX_TOKENS_PER_MESSAGE * 2) in out[0].content
88
+
89
+
90
+ def test_truncate_oversized_preserves_thinking_blocks():
91
+ """Anthropic extended-thinking models reject the next request with
92
+ ``Invalid signature in thinking block`` if a prior assistant message
93
+ drops thinking_blocks. Truncation must keep this metadata.
94
+ """
95
+ cm = _make_cm()
96
+ big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
97
+ thinking = [{"type": "thinking", "thinking": "...", "signature": "abc123"}]
98
+ msg = Message(role="assistant", content=big)
99
+ msg.thinking_blocks = thinking
100
+ msg.reasoning_content = "deep thought"
101
+ with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
102
+ out = cm._truncate_oversized([msg], "anthropic/claude-opus-4-6")
103
+ assert getattr(out[0], "thinking_blocks", None) == thinking
104
+ assert getattr(out[0], "reasoning_content", None) == "deep thought"
105
+
106
+
107
+ def test_truncate_oversized_never_touches_system_message():
108
+ """The system prompt is the agent's instructions β€” must never be truncated.
109
+
110
+ Caught by the integration smoke test on PR #213: when items has fewer than
111
+ ``untouched_messages`` entries, the slice math in ``compact()`` can let
112
+ ``items[0]`` (the system message) leak into the ``recent_messages`` list
113
+ that gets passed to ``_truncate_oversized``. The function must guard
114
+ explicitly against this.
115
+ """
116
+ cm = _make_cm()
117
+ huge_system = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
118
+ msgs = [_msg("system", huge_system)]
119
+ with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
120
+ out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
121
+ assert out[0].content == huge_system, "system message must never be truncated"
122
+
123
+
124
+ def test_truncate_oversized_resilient_to_token_counter_failure():
125
+ """token_counter occasionally raises on edge-case content. A blip there
126
+ must NOT drop the message β€” better to leave it and let compaction
127
+ handle it (or fail with CompactionFailedError) than to lose data.
128
+ """
129
+ cm = _make_cm()
130
+ msgs = [_msg("user", "anything")]
131
+ with patch("litellm.token_counter", side_effect=Exception("counter blew up")):
132
+ out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
133
+ assert out == msgs
134
+
135
+
136
+ # ── compact() raises CompactionFailedError ─────────────────────────────
137
+
138
+
139
+ @pytest.mark.asyncio
140
+ async def test_compact_raises_when_post_compact_still_over_threshold():
141
+ """The whole point of the new behavior: don't loop on a useless
142
+ compaction call. Raise so the caller can terminate the session.
143
+ """
144
+ cm = _make_cm(model_max_tokens=100_000)
145
+ # Build a context that's "over threshold" from the start
146
+ cm.items = [
147
+ Message(role="system", content="system"),
148
+ Message(role="user", content="task"),
149
+ Message(role="assistant", content="x" * 1000),
150
+ Message(role="user", content="follow-up 1"),
151
+ Message(role="assistant", content="reply 1"),
152
+ Message(role="user", content="follow-up 2"),
153
+ Message(role="assistant", content="reply 2"),
154
+ ]
155
+ cm.running_context_usage = 95_000 # over threshold (90% of 100k = 90k)
156
+
157
+ # Mock summarize_messages to return a tiny summary; mock _recompute_usage
158
+ # to keep the running_context_usage above threshold so compact() raises.
159
+ async def fake_summarize(*args, **kwargs):
160
+ return ("summary", 10)
161
+
162
+ def fake_recompute(self, model_name):
163
+ # Simulate post-compact still over threshold
164
+ self.running_context_usage = 95_000
165
+
166
+ with (
167
+ patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
168
+ patch.object(ContextManager, "_recompute_usage", fake_recompute),
169
+ # Avoid token_counter calls in _truncate_oversized
170
+ patch("litellm.token_counter", return_value=100),
171
+ ):
172
+ with pytest.raises(CompactionFailedError):
173
+ await cm.compact(
174
+ model_name="anthropic/claude-opus-4-6",
175
+ tool_specs=None,
176
+ hf_token=None,
177
+ session=None,
178
+ )
179
+
180
+
181
+ @pytest.mark.asyncio
182
+ async def test_compact_does_not_duplicate_system_when_idx_is_zero():
183
+ """Regression for the second P0 caught by bot review on PR #213.
184
+
185
+ When ``len(items) == untouched_messages`` (the canonical 5-message
186
+ early-compaction case: system + user-task + giant-tool-output +
187
+ user-followup + assistant-reply), ``idx`` initialises to 0 and the
188
+ walk-back ``while idx > 1`` loop is a no-op. Without an explicit
189
+ clamp ``if idx < 1: idx = 1``, ``recent_messages = items[0:]``
190
+ starts at the system message, and the rebuild duplicates system +
191
+ first-user. Anthropic API rejects two system messages.
192
+ """
193
+ cm = _make_cm(model_max_tokens=100_000, untouched_messages=5)
194
+ cm.items = [
195
+ Message(role="system", content="system"),
196
+ Message(role="user", content="task"),
197
+ Message(role="assistant", content="ok"), # would be the only
198
+ # message_to_summarize but the
199
+ # idx bug pulls it into recent
200
+ Message(role="user", content="followup"),
201
+ Message(role="assistant", content="reply"),
202
+ ] # exactly 5 = untouched_messages, so idx initialises to 0
203
+ cm.running_context_usage = 95_000
204
+
205
+ async def fake_summarize(*args, **kwargs):
206
+ return ("summary", 10)
207
+
208
+ def fake_recompute(self, model_name):
209
+ self.running_context_usage = 5_000
210
+
211
+ with (
212
+ patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
213
+ patch.object(ContextManager, "_recompute_usage", fake_recompute),
214
+ patch("litellm.token_counter", return_value=100),
215
+ ):
216
+ await cm.compact(
217
+ model_name="anthropic/claude-opus-4-6",
218
+ tool_specs=None,
219
+ hf_token=None,
220
+ session=None,
221
+ )
222
+
223
+ # Critical assertion: only ONE system message in items
224
+ system_count = sum(1 for m in cm.items if m.role == "system")
225
+ assert system_count == 1, (
226
+ f"Expected exactly 1 system message, found {system_count}. "
227
+ f"Roles: {[m.role for m in cm.items]}"
228
+ )
229
+ # And the first-user "task" message must also appear exactly once.
230
+ # Bot review on PR #213 caught a follow-up bug: clamping idx=1
231
+ # excludes the system but still overlaps with first_user_idx (also 1),
232
+ # so first_user_msg ends up in BOTH head and recent_messages β†’
233
+ # duplicate user message β†’ Anthropic 400 (two consecutive user roles).
234
+ task_count = sum(
235
+ 1 for m in cm.items
236
+ if m.role == "user" and (m.content or "") == "task"
237
+ )
238
+ assert task_count == 1, (
239
+ f"Expected exactly 1 'task' user message, found {task_count}. "
240
+ f"Roles+content: {[(m.role, (m.content or '')[:20]) for m in cm.items]}"
241
+ )
242
+ # Defense in depth: no two consecutive same-role messages (Anthropic
243
+ # API contract). System counts separately.
244
+ non_system = [m for m in cm.items if m.role != "system"]
245
+ for i in range(1, len(non_system)):
246
+ assert non_system[i].role != non_system[i-1].role, (
247
+ f"Two consecutive {non_system[i].role} messages at non-system "
248
+ f"position {i-1},{i} β€” Anthropic API rejects this. "
249
+ f"Roles: {[m.role for m in cm.items]}"
250
+ )
251
+
252
+
253
+ @pytest.mark.asyncio
254
+ async def test_compact_succeeds_when_post_compact_under_threshold():
255
+ """Happy path: when compaction does its job, no exception raised."""
256
+ cm = _make_cm(model_max_tokens=100_000)
257
+ cm.items = [
258
+ Message(role="system", content="system"),
259
+ Message(role="user", content="task"),
260
+ Message(role="assistant", content="x" * 1000),
261
+ Message(role="user", content="follow-up"),
262
+ Message(role="assistant", content="reply"),
263
+ Message(role="user", content="follow-up 2"),
264
+ Message(role="assistant", content="reply 2"),
265
+ ]
266
+ cm.running_context_usage = 95_000
267
+
268
+ async def fake_summarize(*args, **kwargs):
269
+ return ("summary", 10)
270
+
271
+ def fake_recompute(self, model_name):
272
+ self.running_context_usage = 5_000 # well under threshold
273
+
274
+ with (
275
+ patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
276
+ patch.object(ContextManager, "_recompute_usage", fake_recompute),
277
+ patch("litellm.token_counter", return_value=100),
278
+ ):
279
+ await cm.compact(
280
+ model_name="anthropic/claude-opus-4-6",
281
+ tool_specs=None,
282
+ hf_token=None,
283
+ session=None,
284
+ )
285
+ assert cm.running_context_usage == 5_000
286
+
287
+
288
+ # ── _compact_and_notify behavior on CompactionFailedError ──────────────
289
+
290
+
291
+ @pytest.mark.asyncio
292
+ async def test_compact_and_notify_terminates_session_on_failure():
293
+ """The PR's #213's P0 bug-class: setting ``is_running = False`` is
294
+ only effective if the agent loop checks it. This test asserts the
295
+ flag IS set AND a ``session_terminated`` event is emitted, so a
296
+ follow-up assertion in the agent loop test catches the loop-exit.
297
+ """
298
+ from agent.core.agent_loop import _compact_and_notify
299
+
300
+ session = MagicMock()
301
+ session.session_id = "sess-123"
302
+ session.is_running = True
303
+ session.config.model_name = "anthropic/claude-opus-4-6"
304
+ session.hf_token = None
305
+ session.tool_router.get_tool_specs_for_llm.return_value = []
306
+ session.send_event = AsyncMock()
307
+
308
+ cm = MagicMock()
309
+ cm.running_context_usage = 95_000
310
+ cm.compaction_threshold = 90_000
311
+ cm.model_max_tokens = 100_000
312
+ cm.items = []
313
+ cm.needs_compaction = True
314
+ cm.compact = AsyncMock(side_effect=CompactionFailedError("ineffective"))
315
+ session.context_manager = cm
316
+
317
+ await _compact_and_notify(session)
318
+
319
+ assert session.is_running is False, (
320
+ "_compact_and_notify must set is_running=False so the agent loop "
321
+ "can exit. P0 caught by bot review on PR #213 was that the loop "
322
+ "didn't actually check this flag."
323
+ )
324
+ assert session.send_event.await_count == 1
325
+ event = session.send_event.await_args.args[0]
326
+ assert event.event_type == "session_terminated"
327
+ assert event.data["reason"] == "compaction_failed"
328
+ assert event.data["context_usage"] == 95_000
329
+
330
+
331
+ @pytest.mark.asyncio
332
+ async def test_compact_and_notify_passes_through_on_success():
333
+ """When compaction succeeds, no termination event, is_running stays True."""
334
+ from agent.core.agent_loop import _compact_and_notify
335
+
336
+ session = MagicMock()
337
+ session.session_id = "sess-456"
338
+ session.is_running = True
339
+ session.config.model_name = "anthropic/claude-opus-4-6"
340
+ session.hf_token = None
341
+ session.tool_router.get_tool_specs_for_llm.return_value = []
342
+ session.send_event = AsyncMock()
343
+
344
+ cm = MagicMock()
345
+ cm.running_context_usage = 5_000
346
+ cm.compaction_threshold = 90_000
347
+ cm.model_max_tokens = 100_000
348
+ cm.items = []
349
+ cm.needs_compaction = False
350
+ cm.compact = AsyncMock(return_value=None) # success
351
+ session.context_manager = cm
352
+
353
+ # Pretend old_usage == new_usage so the "compacted" event is also skipped
354
+ await _compact_and_notify(session)
355
+
356
+ assert session.is_running is True
357
+ # No session_terminated event emitted
358
+ for call in session.send_event.await_args_list:
359
+ ev = call.args[0]
360
+ assert ev.event_type != "session_terminated"