fix(compaction): break infinite loop + truncate oversized messages (#213)
Browse files* fix(compaction): break the infinite-compaction loop, truncate oversized messages
Sessions stuck in a compaction retry loop have been silently burning Bedrock
budget while staying invisible in the session dataset. Pod logs from
2026-05-03 on prod-114 showed the pattern firing minute-after-minute on
multiple replicas:
Context compacted: 200001 -> 215566 tokens
Context compacted: 215566 -> 215572 tokens
ContextWindowExceededError β forcing compaction
Context compacted: 200001 -> 215544 tokens
...
Root cause: a single message in the "untouched" tail (typically a tool
output of 80k+ tokens β bash dump, file content, CSV) keeps the post-compact
context above the 90% threshold. Compaction triggers again, calls Bedrock
again, gets the same 215k result, loops. ~$3 per Opus retry, indefinitely.
Sessions never reach _run_session.finally β save_and_upload_detached never
fires β cost is real but never lands in total_cost_usd.
Cross-check: dataset Bedrock-only coverage on 2026-05-01 was 43% of Cost
Explorer ($6,299 vs $14,732). Bedrock Invocation Logs on 2026-05-02 showed
11,996 InvokeModel calls vs 5,558 Bedrock events in the dataset for the
same day β half the calls missing, consistent with stuck sessions never
uploading.
Three fixes in this PR:
1. ContextManager._truncate_oversized() β replace the content of any
preserved message (first user, untouched tail) over 50k tokens with a
placeholder before summarization. This addresses the root cause:
compaction can't shrink a single oversized message because it's
preserved verbatim.
2. ContextManager.compact() now raises CompactionFailedError if the
post-compact context is still over the threshold (i.e., truncation +
summarize were not enough). The retry-as-circuit-breaker behavior is
replaced with fail-fast: better to end the session cleanly than to
loop on the same useless API call.
3. agent_loop._compact_and_notify() catches CompactionFailedError, emits
a session_terminated event with reason=compaction_failed (so the
dataset records WHY the session ended and the cost it incurred up to
that point), and sets session.is_running=False to exit the loop. The
_run_session.finally then fires save_trajectory normally and the
session ends up in the dataset.
Expected impact: closes the loop side of the cost telemetry gap. The
sessions that previously looped invisibly will now end at first
compaction failure with an event, dataset coverage should rise from
~43% toward the floor set by other gaps (litellm pricing for HF Router
models on compaction events β separate, smaller issue).
* fix(compaction): address PR bot feedback (P0 + 3 P1)
P0 β _compact_and_notify set is_running=False but neither call site of
the inner agent loop checked it. After CompactionFailedError, the LLM
call still fired, hit ContextWindowExceededError, re-called
_compact_and_notify in the except handler, and continued β replacing
the original infinite compaction loop with an only-slightly-better
loop that did one LLM call between compaction failures. Add explicit
`if not session.is_running: break` guards at both call sites
(agent_loop.py:1076 and 1469).
P1.1 β _truncate_oversized rebuilt the replacement Message with only
five fields (role, content, tool_call_id, tool_calls, name), silently
dropping thinking_blocks, reasoning_content, and provider_specific_fields.
For Anthropic extended-thinking models with reasoning_effort=high/max,
losing thinking_blocks on a prior assistant message causes the next
request to fail with "Invalid signature in thinking block". Preserve
all six known fields when reconstructing.
P1.2 β Same is_running guard missing at the second call site (the
ContextWindowExceededError except handler). Same fix.
P1.3 β No automated test for the CompactionFailedError β session
termination flow. Added tests/unit/test_compaction_loop_break.py with
8 cases covering _truncate_oversized (4), compact() raising (2), and
_compact_and_notify session termination + happy path (2). The most
important test is test_compact_and_notify_terminates_session_on_failure
which would have caught the P0 directly. All 8 pass.
* fix(compaction): never truncate system message (caught by integration test)
End-to-end smoke test on the new compaction code path uncovered an edge
case the unit tests missed: when ``items`` has fewer entries than
``untouched_messages`` (pathological configs, very early-session compact
triggers, or the artificial test scenario), the slice math in compact()
can let ``items[0]`` (the system message) leak into the
``recent_messages`` list passed to ``_truncate_oversized``.
The function then truncated the system prompt β silently destroying the
agent's instructions. Defense-in-depth fix: explicit ``if msg.role ==
"system": pass through`` guard at the top of the per-message loop.
The system prompt is loaded from system_prompt_v3.yaml at session start
and is the agent's behavioral contract; it must never be modified by the
compaction path.
Added test_truncate_oversized_never_touches_system_message to cover this
specifically. Total now 9 tests, all passing.
* fix(compaction): clamp idx >= 1 to prevent system message duplication
Bot review re-review on PR #213 caught a second P0: when len(items) ==
untouched_messages (the canonical 5-message early-compaction case
[system, user-task, assistant-with-giant-output, user-followup,
assistant-reply]), idx initialises to 0 and the walk-back `while idx > 1`
guard is a no-op.
Without an explicit clamp, recent_messages = items[0:] starts at the
system message. The not-messages_to_summarize rebuild path then produces
[system_msg, first_user_msg] + recent_messages = [system, user, system,
user, ...] β duplicating both. Anthropic API rejects two system messages.
The system-message guard added in a0fc95f prevents truncation of the
duplicated system but doesn't prevent the duplication itself.
Fix: explicit `if idx < 1: idx = 1` after the walk-back loop, mirroring
the intent of the existing `idx > 1` guard ("never include system in
recent_messages") and closing the gap when initialisation already lands
below 1.
Added test_compact_does_not_duplicate_system_when_idx_is_zero with the
exact 5-message setup. All 10 tests pass.
The 5-message scenario is precisely the one this PR targets: [system,
user-task, assistant-tool-output, user-followup, assistant-reply] is
the most likely shape to drive context > 90% threshold via a single
oversized tool output. So this P0 would have hit nearly every stuck
session β would have made the fix worse than the bug.
* fix(compaction): clamp idx > first_user_idx (not just > 0)
Bot review on PR #213 caught the third P0 in this thread: my previous
clamp `if idx < 1: idx = 1` excluded the system message from
recent_messages but still overlapped with first_user_idx (which is
also 1 for any well-formed session).
Trace for the canonical 5-message trigger:
items = [system(0), user-task(1), assistant(2), user-followup(3),
assistant-reply(4)]
first_user_idx = 1
idx = 5 - 5 = 0 β clamped to 1
recent_messages = items[1:] = [user-task, assistant, user-followup,
assistant-reply] β includes user-task
messages_to_summarize = items[2:1] = []
β enters rebuild branch
head = [system_msg, first_user_msg] β first_user_msg also here
self.items = head + recent_messages
= [system, user-task, user-task, assistant, user-followup,
assistant-reply]
β two consecutive user messages
β Anthropic API 400 on next LLM call
The right invariant is "idx must be strictly after first_user_idx",
not "idx must be > 0". The walk-back's `idx > 1` was necessary
(no system) but insufficient (first_user also in head).
Fix: `if idx <= first_user_idx: idx = first_user_idx + 1`. Since
first_user_idx >= 1 for any well-formed session, this also satisfies
the original system-message exclusion intent.
Test strengthened to assert (a) system_count == 1, (b) task message
appears exactly once, (c) no two consecutive same-role non-system
messages. The previous test only checked (a) and would have shipped
this bug.
Lesson noted in ~/.claude/CLAUDE.md section 5: my fix to one bot
finding introduced another bug. On critical-path PRs, every commit
needs a fresh review round.
- agent/context_manager/manager.py +148 -13
- agent/core/agent_loop.py +59 -9
- tests/unit/test_compaction_loop_break.py +360 -0
|
@@ -79,6 +79,23 @@ _COMPACT_PROMPT = (
|
|
| 79 |
"will be have to be filled in."
|
| 80 |
)
|
| 81 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
# Used when seeding a brand-new session from prior browser-cached messages.
|
| 83 |
# Here we're writing a note to *ourselves* β so preserve the tool-call trail,
|
| 84 |
# files produced, and planned next steps in first person. Optimized for
|
|
@@ -374,6 +391,81 @@ class ContextManager:
|
|
| 374 |
def needs_compaction(self) -> bool:
|
| 375 |
return self.running_context_usage > self.compaction_threshold and bool(self.items)
|
| 376 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 377 |
async def compact(
|
| 378 |
self,
|
| 379 |
model_name: str,
|
|
@@ -386,6 +478,13 @@ class ContextManager:
|
|
| 386 |
``session`` is optional β if passed, the underlying summarization
|
| 387 |
LLM call is recorded via ``telemetry.record_llm_call(kind=
|
| 388 |
"compaction")`` so its cost shows up in ``total_cost_usd``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 389 |
"""
|
| 390 |
if not self.needs_compaction:
|
| 391 |
return
|
|
@@ -409,12 +508,45 @@ class ContextManager:
|
|
| 409 |
idx = len(self.items) - self.untouched_messages
|
| 410 |
while idx > 1 and self.items[idx].role != "user":
|
| 411 |
idx -= 1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 412 |
|
| 413 |
recent_messages = self.items[idx:]
|
| 414 |
messages_to_summarize = self.items[first_user_idx + 1:idx]
|
| 415 |
|
| 416 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 417 |
if not messages_to_summarize:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 418 |
return
|
| 419 |
|
| 420 |
summary, completion_tokens = await summarize_messages(
|
|
@@ -439,16 +571,19 @@ class ContextManager:
|
|
| 439 |
head.append(first_user_msg)
|
| 440 |
self.items = head + [summarized_message] + recent_messages
|
| 441 |
|
| 442 |
-
|
| 443 |
-
|
| 444 |
-
#
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 451 |
)
|
| 452 |
-
except Exception as e:
|
| 453 |
-
logger.warning("token_counter failed post-compact (%s); falling back to rough estimate", e)
|
| 454 |
-
self.running_context_usage = len(self.system_prompt) // 4 + completion_tokens
|
|
|
|
| 79 |
"will be have to be filled in."
|
| 80 |
)
|
| 81 |
|
| 82 |
+
# Per-message ceiling. If a single message in the "untouched" tail is larger
|
| 83 |
+
# than this, compaction can't recover even after summarizing the middle β
|
| 84 |
+
# producing the infinite compaction loop seen 2026-05-03 in pod logs (200k
|
| 85 |
+
# context shrinks to 200k+ because one tool output is 80k tokens). We replace
|
| 86 |
+
# such messages with a placeholder before compaction runs.
|
| 87 |
+
_MAX_TOKENS_PER_MESSAGE = 50_000
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
class CompactionFailedError(Exception):
|
| 91 |
+
"""Raised when compaction can't reduce context below the threshold.
|
| 92 |
+
|
| 93 |
+
Typically means an individual preserved message (system, first user, or
|
| 94 |
+
untouched tail) exceeds what truncation can fix in one pass. The caller
|
| 95 |
+
must terminate the session β retrying produces an infinite loop that
|
| 96 |
+
burns Bedrock budget for free (~$3 per re-attempt on Opus).
|
| 97 |
+
"""
|
| 98 |
+
|
| 99 |
# Used when seeding a brand-new session from prior browser-cached messages.
|
| 100 |
# Here we're writing a note to *ourselves* β so preserve the tool-call trail,
|
| 101 |
# files produced, and planned next steps in first person. Optimized for
|
|
|
|
| 391 |
def needs_compaction(self) -> bool:
|
| 392 |
return self.running_context_usage > self.compaction_threshold and bool(self.items)
|
| 393 |
|
| 394 |
+
def _truncate_oversized(
|
| 395 |
+
self, messages: list[Message], model_name: str
|
| 396 |
+
) -> list[Message]:
|
| 397 |
+
"""Replace any message > _MAX_TOKENS_PER_MESSAGE with a placeholder.
|
| 398 |
+
|
| 399 |
+
These are typically tool outputs (CSV dumps, file contents) sitting in
|
| 400 |
+
the untouched tail or first-user position that compaction can't shrink
|
| 401 |
+
β they pass through verbatim, keeping context above threshold and
|
| 402 |
+
triggering an infinite compaction retry loop.
|
| 403 |
+
"""
|
| 404 |
+
from litellm import token_counter
|
| 405 |
+
|
| 406 |
+
out: list[Message] = []
|
| 407 |
+
for msg in messages:
|
| 408 |
+
# System messages are sacred β they're the agent's instructions.
|
| 409 |
+
# In edge cases (items < untouched_messages), the slice math in
|
| 410 |
+
# compact() can let items[0] (the system message) leak into the
|
| 411 |
+
# recent_messages list. Defense-in-depth: never truncate it.
|
| 412 |
+
if msg.role == "system":
|
| 413 |
+
out.append(msg)
|
| 414 |
+
continue
|
| 415 |
+
try:
|
| 416 |
+
n = token_counter(model=model_name, messages=[msg.model_dump()])
|
| 417 |
+
except Exception:
|
| 418 |
+
# token_counter occasionally fails on edge-case content;
|
| 419 |
+
# don't drop the message, just keep it as-is.
|
| 420 |
+
out.append(msg)
|
| 421 |
+
continue
|
| 422 |
+
if n <= _MAX_TOKENS_PER_MESSAGE:
|
| 423 |
+
out.append(msg)
|
| 424 |
+
continue
|
| 425 |
+
placeholder = (
|
| 426 |
+
f"[truncated for compaction β original was {n} tokens, "
|
| 427 |
+
f"removed to keep context under {self.compaction_threshold} tokens]"
|
| 428 |
+
)
|
| 429 |
+
logger.warning(
|
| 430 |
+
"Truncating %s message: %d -> %d tokens for compaction",
|
| 431 |
+
msg.role, n, len(placeholder) // 4,
|
| 432 |
+
)
|
| 433 |
+
# Preserve all known assistant-side fields (tool_calls, thinking_blocks,
|
| 434 |
+
# reasoning_content, provider_specific_fields) even when content is
|
| 435 |
+
# replaced. Anthropic extended-thinking models reject the next request
|
| 436 |
+
# with "Invalid signature in thinking block" if thinking_blocks is
|
| 437 |
+
# dropped from a prior assistant message.
|
| 438 |
+
kept = {
|
| 439 |
+
k: getattr(msg, k, None)
|
| 440 |
+
for k in (
|
| 441 |
+
"tool_call_id",
|
| 442 |
+
"tool_calls",
|
| 443 |
+
"name",
|
| 444 |
+
"thinking_blocks",
|
| 445 |
+
"reasoning_content",
|
| 446 |
+
"provider_specific_fields",
|
| 447 |
+
)
|
| 448 |
+
if getattr(msg, k, None) is not None
|
| 449 |
+
}
|
| 450 |
+
out.append(Message(role=msg.role, content=placeholder, **kept))
|
| 451 |
+
return out
|
| 452 |
+
|
| 453 |
+
def _recompute_usage(self, model_name: str) -> None:
|
| 454 |
+
"""Refresh ``running_context_usage`` from current items via real tokenizer."""
|
| 455 |
+
from litellm import token_counter
|
| 456 |
+
|
| 457 |
+
try:
|
| 458 |
+
self.running_context_usage = token_counter(
|
| 459 |
+
model=model_name,
|
| 460 |
+
messages=[m.model_dump() for m in self.items],
|
| 461 |
+
)
|
| 462 |
+
except Exception as e:
|
| 463 |
+
logger.warning("token_counter failed (%s); rough estimate", e)
|
| 464 |
+
# Rough fallback: 4 chars per token.
|
| 465 |
+
self.running_context_usage = sum(
|
| 466 |
+
len(getattr(m, "content", "") or "") for m in self.items
|
| 467 |
+
) // 4
|
| 468 |
+
|
| 469 |
async def compact(
|
| 470 |
self,
|
| 471 |
model_name: str,
|
|
|
|
| 478 |
``session`` is optional β if passed, the underlying summarization
|
| 479 |
LLM call is recorded via ``telemetry.record_llm_call(kind=
|
| 480 |
"compaction")`` so its cost shows up in ``total_cost_usd``.
|
| 481 |
+
|
| 482 |
+
Raises ``CompactionFailedError`` if the post-compact context is still
|
| 483 |
+
over the threshold. This happens when a preserved message (typically
|
| 484 |
+
a giant tool output stuck in the untouched tail) is too large for
|
| 485 |
+
truncation to fix. The caller must terminate the session β retrying
|
| 486 |
+
is what caused the 2026-05-03 infinite-compaction-loop pattern that
|
| 487 |
+
burned Bedrock budget invisibly.
|
| 488 |
"""
|
| 489 |
if not self.needs_compaction:
|
| 490 |
return
|
|
|
|
| 508 |
idx = len(self.items) - self.untouched_messages
|
| 509 |
while idx > 1 and self.items[idx].role != "user":
|
| 510 |
idx -= 1
|
| 511 |
+
# The real invariant is "idx must be strictly after first_user_idx,
|
| 512 |
+
# otherwise recent_messages overlaps with the messages we put in
|
| 513 |
+
# head". The walk-back's `idx > 1` guard is necessary (no system in
|
| 514 |
+
# recent) but insufficient (first_user is also in head and would be
|
| 515 |
+
# duplicated). Anthropic API rejects two consecutive user messages
|
| 516 |
+
# with a 400 β bot review on PR #213 caught this on the second clamp
|
| 517 |
+
# iteration.
|
| 518 |
+
if idx <= first_user_idx:
|
| 519 |
+
idx = first_user_idx + 1
|
| 520 |
|
| 521 |
recent_messages = self.items[idx:]
|
| 522 |
messages_to_summarize = self.items[first_user_idx + 1:idx]
|
| 523 |
|
| 524 |
+
# Truncate any message that's larger than _MAX_TOKENS_PER_MESSAGE in
|
| 525 |
+
# the parts we PRESERVE through compaction (first_user + recent_tail).
|
| 526 |
+
# These are the only places where individual messages can defeat
|
| 527 |
+
# compaction by being intrinsically too large. Messages in
|
| 528 |
+
# ``messages_to_summarize`` are folded into the summary, so their size
|
| 529 |
+
# doesn't matter on its own.
|
| 530 |
+
if first_user_msg is not None:
|
| 531 |
+
truncated = self._truncate_oversized([first_user_msg], model_name)
|
| 532 |
+
first_user_msg = truncated[0]
|
| 533 |
+
recent_messages = self._truncate_oversized(recent_messages, model_name)
|
| 534 |
+
|
| 535 |
+
# If there's nothing to summarize but the preserved messages are now
|
| 536 |
+
# truncated and small, just rebuild and recompute. This is rare but
|
| 537 |
+
# avoids returning silently with the old (over-threshold) state.
|
| 538 |
if not messages_to_summarize:
|
| 539 |
+
head = [system_msg] if system_msg else []
|
| 540 |
+
if first_user_msg:
|
| 541 |
+
head.append(first_user_msg)
|
| 542 |
+
self.items = head + recent_messages
|
| 543 |
+
self._recompute_usage(model_name)
|
| 544 |
+
if self.running_context_usage > self.compaction_threshold:
|
| 545 |
+
raise CompactionFailedError(
|
| 546 |
+
f"Nothing to summarize but context ({self.running_context_usage}) "
|
| 547 |
+
f"still over threshold ({self.compaction_threshold}) after truncation. "
|
| 548 |
+
f"System prompt or first user message likely exceeds the budget."
|
| 549 |
+
)
|
| 550 |
return
|
| 551 |
|
| 552 |
summary, completion_tokens = await summarize_messages(
|
|
|
|
| 571 |
head.append(first_user_msg)
|
| 572 |
self.items = head + [summarized_message] + recent_messages
|
| 573 |
|
| 574 |
+
self._recompute_usage(model_name)
|
| 575 |
+
|
| 576 |
+
# Hard verify: if compaction didn't bring us below the threshold even
|
| 577 |
+
# after truncating oversized preserved messages, retrying just burns
|
| 578 |
+
# Bedrock budget on the same useless compaction call. Raise so the
|
| 579 |
+
# caller can terminate the session cleanly. Pre-2026-05-04, the
|
| 580 |
+
# caller looped indefinitely (~$3/Opus retry) until the pod was
|
| 581 |
+
# killed β invisible to the dataset because the session never
|
| 582 |
+
# finished cleanly.
|
| 583 |
+
if self.running_context_usage > self.compaction_threshold:
|
| 584 |
+
raise CompactionFailedError(
|
| 585 |
+
f"Compaction ineffective: {self.running_context_usage} tokens "
|
| 586 |
+
f"still over threshold {self.compaction_threshold} after summarize "
|
| 587 |
+
f"and truncation. Likely the system prompt + first user + summary "
|
| 588 |
+
f"+ truncated tail still exceeds budget."
|
| 589 |
)
|
|
|
|
|
|
|
|
|
|
@@ -516,19 +516,56 @@ def _friendly_error_message(error: Exception) -> str | None:
|
|
| 516 |
|
| 517 |
|
| 518 |
async def _compact_and_notify(session: Session) -> None:
|
| 519 |
-
"""Run compaction and send event if context was reduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 520 |
cm = session.context_manager
|
| 521 |
old_usage = cm.running_context_usage
|
| 522 |
logger.debug(
|
| 523 |
"Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
|
| 524 |
old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
|
| 525 |
)
|
| 526 |
-
|
| 527 |
-
|
| 528 |
-
|
| 529 |
-
|
| 530 |
-
|
| 531 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 532 |
new_usage = cm.running_context_usage
|
| 533 |
if new_usage != old_usage:
|
| 534 |
logger.warning(
|
|
@@ -1035,8 +1072,15 @@ class Handlers:
|
|
| 1035 |
if session.is_cancelled:
|
| 1036 |
break
|
| 1037 |
|
| 1038 |
-
# Compact before calling the LLM if context is near the limit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1039 |
await _compact_and_notify(session)
|
|
|
|
|
|
|
| 1040 |
|
| 1041 |
# Doom-loop detection: break out of repeated tool call patterns
|
| 1042 |
doom_prompt = check_for_doom_loop(session.context_manager.items)
|
|
@@ -1421,7 +1465,7 @@ class Handlers:
|
|
| 1421 |
iteration += 1
|
| 1422 |
|
| 1423 |
except ContextWindowExceededError:
|
| 1424 |
-
# Force compact and retry this iteration
|
| 1425 |
cm = session.context_manager
|
| 1426 |
logger.warning(
|
| 1427 |
"ContextWindowExceededError at iteration %d β forcing compaction "
|
|
@@ -1430,6 +1474,12 @@ class Handlers:
|
|
| 1430 |
)
|
| 1431 |
cm.running_context_usage = cm.model_max_tokens + 1
|
| 1432 |
await _compact_and_notify(session)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1433 |
continue
|
| 1434 |
|
| 1435 |
except Exception as e:
|
|
|
|
| 516 |
|
| 517 |
|
| 518 |
async def _compact_and_notify(session: Session) -> None:
|
| 519 |
+
"""Run compaction and send event if context was reduced.
|
| 520 |
+
|
| 521 |
+
Catches ``CompactionFailedError`` and ends the session cleanly instead
|
| 522 |
+
of letting the caller retry. Pre-2026-05-04 the caller looped on
|
| 523 |
+
ContextWindowExceededError β compact β re-trigger, burning Bedrock
|
| 524 |
+
budget at ~$3/Opus retry while the session never reached the upload
|
| 525 |
+
path (so the cost was invisible in the dataset).
|
| 526 |
+
"""
|
| 527 |
+
from agent.context_manager.manager import CompactionFailedError
|
| 528 |
+
|
| 529 |
cm = session.context_manager
|
| 530 |
old_usage = cm.running_context_usage
|
| 531 |
logger.debug(
|
| 532 |
"Compaction check: usage=%d, max=%d, threshold=%d, needs_compact=%s",
|
| 533 |
old_usage, cm.model_max_tokens, cm.compaction_threshold, cm.needs_compaction,
|
| 534 |
)
|
| 535 |
+
try:
|
| 536 |
+
await cm.compact(
|
| 537 |
+
model_name=session.config.model_name,
|
| 538 |
+
tool_specs=session.tool_router.get_tool_specs_for_llm(),
|
| 539 |
+
hf_token=session.hf_token,
|
| 540 |
+
session=session,
|
| 541 |
+
)
|
| 542 |
+
except CompactionFailedError as e:
|
| 543 |
+
logger.error(
|
| 544 |
+
"Compaction failed for session %s: %s β terminating session",
|
| 545 |
+
session.session_id, e,
|
| 546 |
+
)
|
| 547 |
+
# Persist the failure event so the dataset has a record of WHY this
|
| 548 |
+
# session ended (and the cost it incurred up to that point) even if
|
| 549 |
+
# save_and_upload_detached has issues downstream.
|
| 550 |
+
await session.send_event(Event(
|
| 551 |
+
event_type="session_terminated",
|
| 552 |
+
data={
|
| 553 |
+
"reason": "compaction_failed",
|
| 554 |
+
"context_usage": cm.running_context_usage,
|
| 555 |
+
"context_threshold": cm.compaction_threshold,
|
| 556 |
+
"error": str(e)[:300],
|
| 557 |
+
"user_message": (
|
| 558 |
+
"Your conversation has grown too large to continue. "
|
| 559 |
+
"The work you've done is saved β start a new session to keep going."
|
| 560 |
+
),
|
| 561 |
+
},
|
| 562 |
+
))
|
| 563 |
+
# Stop the agent loop; the finally in _run_session will fire
|
| 564 |
+
# cleanup_sandbox + save_trajectory so the dataset captures
|
| 565 |
+
# everything that did happen.
|
| 566 |
+
session.is_running = False
|
| 567 |
+
return
|
| 568 |
+
|
| 569 |
new_usage = cm.running_context_usage
|
| 570 |
if new_usage != old_usage:
|
| 571 |
logger.warning(
|
|
|
|
| 1072 |
if session.is_cancelled:
|
| 1073 |
break
|
| 1074 |
|
| 1075 |
+
# Compact before calling the LLM if context is near the limit.
|
| 1076 |
+
# When _compact_and_notify catches CompactionFailedError it sets
|
| 1077 |
+
# session.is_running = False; we MUST exit the loop here, otherwise
|
| 1078 |
+
# the LLM call below fires with an over-threshold context, hits
|
| 1079 |
+
# ContextWindowExceededError, and we end up looping again on the
|
| 1080 |
+
# except path β exactly the bug this PR is supposed to fix.
|
| 1081 |
await _compact_and_notify(session)
|
| 1082 |
+
if not session.is_running:
|
| 1083 |
+
break
|
| 1084 |
|
| 1085 |
# Doom-loop detection: break out of repeated tool call patterns
|
| 1086 |
doom_prompt = check_for_doom_loop(session.context_manager.items)
|
|
|
|
| 1465 |
iteration += 1
|
| 1466 |
|
| 1467 |
except ContextWindowExceededError:
|
| 1468 |
+
# Force compact and retry this iteration.
|
| 1469 |
cm = session.context_manager
|
| 1470 |
logger.warning(
|
| 1471 |
"ContextWindowExceededError at iteration %d β forcing compaction "
|
|
|
|
| 1474 |
)
|
| 1475 |
cm.running_context_usage = cm.model_max_tokens + 1
|
| 1476 |
await _compact_and_notify(session)
|
| 1477 |
+
# Same guard as the top of the loop: if compaction couldn't
|
| 1478 |
+
# bring us under threshold, _compact_and_notify has already
|
| 1479 |
+
# emitted session_terminated and set is_running=False. Continue
|
| 1480 |
+
# would just re-call the LLM with the same too-big context.
|
| 1481 |
+
if not session.is_running:
|
| 1482 |
+
break
|
| 1483 |
continue
|
| 1484 |
|
| 1485 |
except Exception as e:
|
|
@@ -0,0 +1,360 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Regression tests for the 2026-05-03 infinite-compaction-loop bug.
|
| 2 |
+
|
| 3 |
+
Pod logs from prod-114 showed sessions stuck retrying compaction every
|
| 4 |
+
few seconds because a single oversized tool output in the untouched tail
|
| 5 |
+
kept the post-compact context above the 90% threshold:
|
| 6 |
+
|
| 7 |
+
Context compacted: 200001 -> 215566 tokens
|
| 8 |
+
Context compacted: 215566 -> 215572 tokens
|
| 9 |
+
ContextWindowExceededError β forcing compaction
|
| 10 |
+
... (continues for 5+ minutes)
|
| 11 |
+
|
| 12 |
+
These tests cover three fixes:
|
| 13 |
+
|
| 14 |
+
1. ``_truncate_oversized`` replaces oversized message content with a
|
| 15 |
+
placeholder and preserves all extended-thinking metadata fields.
|
| 16 |
+
2. ``compact()`` raises ``CompactionFailedError`` when the post-compact
|
| 17 |
+
context is still over threshold.
|
| 18 |
+
3. ``_compact_and_notify`` catches the error, sets ``session.is_running
|
| 19 |
+
= False``, and emits a ``session_terminated`` event so callers can
|
| 20 |
+
exit the agent loop.
|
| 21 |
+
|
| 22 |
+
The P0 caught by PR #213 review (loop didn't actually exit on
|
| 23 |
+
``is_running = False``) would have been caught by an end-to-end
|
| 24 |
+
behavioral test of #3 β that gap is closed by the
|
| 25 |
+
``test_compact_and_notify_terminates_session`` case below.
|
| 26 |
+
"""
|
| 27 |
+
|
| 28 |
+
from __future__ import annotations
|
| 29 |
+
|
| 30 |
+
from unittest.mock import AsyncMock, MagicMock, patch
|
| 31 |
+
|
| 32 |
+
import pytest
|
| 33 |
+
from litellm import Message
|
| 34 |
+
|
| 35 |
+
from agent.context_manager.manager import (
|
| 36 |
+
CompactionFailedError,
|
| 37 |
+
ContextManager,
|
| 38 |
+
_MAX_TOKENS_PER_MESSAGE,
|
| 39 |
+
)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
# ββ helpers ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def _make_cm(
|
| 46 |
+
*,
|
| 47 |
+
model_max_tokens: int = 100_000,
|
| 48 |
+
compact_size: int = 1_000,
|
| 49 |
+
untouched_messages: int = 5,
|
| 50 |
+
) -> ContextManager:
|
| 51 |
+
cm = ContextManager.__new__(ContextManager)
|
| 52 |
+
cm.system_prompt = "system"
|
| 53 |
+
cm.model_max_tokens = model_max_tokens
|
| 54 |
+
cm.compact_size = compact_size
|
| 55 |
+
cm.running_context_usage = 0
|
| 56 |
+
cm.untouched_messages = untouched_messages
|
| 57 |
+
cm.items = [Message(role="system", content="system")]
|
| 58 |
+
cm.on_message_added = None
|
| 59 |
+
return cm
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def _msg(role: str, content: str | None = "x", **extra) -> Message:
|
| 63 |
+
return Message(role=role, content=content, **extra)
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
# ββ _truncate_oversized ββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def test_truncate_oversized_skips_messages_below_threshold():
|
| 70 |
+
cm = _make_cm()
|
| 71 |
+
msgs = [_msg("user", "small content")]
|
| 72 |
+
with patch("litellm.token_counter", return_value=100):
|
| 73 |
+
out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
|
| 74 |
+
assert out == msgs # unchanged
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def test_truncate_oversized_replaces_content_above_threshold():
|
| 78 |
+
cm = _make_cm()
|
| 79 |
+
big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
|
| 80 |
+
msgs = [_msg("user", big)]
|
| 81 |
+
# token_counter returns the simulated big size for any message in this test
|
| 82 |
+
with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
|
| 83 |
+
out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
|
| 84 |
+
assert len(out) == 1
|
| 85 |
+
assert out[0].content != big
|
| 86 |
+
assert "[truncated for compaction" in out[0].content
|
| 87 |
+
assert str(_MAX_TOKENS_PER_MESSAGE * 2) in out[0].content
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def test_truncate_oversized_preserves_thinking_blocks():
|
| 91 |
+
"""Anthropic extended-thinking models reject the next request with
|
| 92 |
+
``Invalid signature in thinking block`` if a prior assistant message
|
| 93 |
+
drops thinking_blocks. Truncation must keep this metadata.
|
| 94 |
+
"""
|
| 95 |
+
cm = _make_cm()
|
| 96 |
+
big = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
|
| 97 |
+
thinking = [{"type": "thinking", "thinking": "...", "signature": "abc123"}]
|
| 98 |
+
msg = Message(role="assistant", content=big)
|
| 99 |
+
msg.thinking_blocks = thinking
|
| 100 |
+
msg.reasoning_content = "deep thought"
|
| 101 |
+
with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
|
| 102 |
+
out = cm._truncate_oversized([msg], "anthropic/claude-opus-4-6")
|
| 103 |
+
assert getattr(out[0], "thinking_blocks", None) == thinking
|
| 104 |
+
assert getattr(out[0], "reasoning_content", None) == "deep thought"
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
def test_truncate_oversized_never_touches_system_message():
|
| 108 |
+
"""The system prompt is the agent's instructions β must never be truncated.
|
| 109 |
+
|
| 110 |
+
Caught by the integration smoke test on PR #213: when items has fewer than
|
| 111 |
+
``untouched_messages`` entries, the slice math in ``compact()`` can let
|
| 112 |
+
``items[0]`` (the system message) leak into the ``recent_messages`` list
|
| 113 |
+
that gets passed to ``_truncate_oversized``. The function must guard
|
| 114 |
+
explicitly against this.
|
| 115 |
+
"""
|
| 116 |
+
cm = _make_cm()
|
| 117 |
+
huge_system = "x" * (_MAX_TOKENS_PER_MESSAGE * 5)
|
| 118 |
+
msgs = [_msg("system", huge_system)]
|
| 119 |
+
with patch("litellm.token_counter", return_value=_MAX_TOKENS_PER_MESSAGE * 2):
|
| 120 |
+
out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
|
| 121 |
+
assert out[0].content == huge_system, "system message must never be truncated"
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def test_truncate_oversized_resilient_to_token_counter_failure():
|
| 125 |
+
"""token_counter occasionally raises on edge-case content. A blip there
|
| 126 |
+
must NOT drop the message β better to leave it and let compaction
|
| 127 |
+
handle it (or fail with CompactionFailedError) than to lose data.
|
| 128 |
+
"""
|
| 129 |
+
cm = _make_cm()
|
| 130 |
+
msgs = [_msg("user", "anything")]
|
| 131 |
+
with patch("litellm.token_counter", side_effect=Exception("counter blew up")):
|
| 132 |
+
out = cm._truncate_oversized(msgs, "anthropic/claude-opus-4-6")
|
| 133 |
+
assert out == msgs
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
# ββ compact() raises CompactionFailedError βββββββββββββββββββββββββββββ
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
@pytest.mark.asyncio
|
| 140 |
+
async def test_compact_raises_when_post_compact_still_over_threshold():
|
| 141 |
+
"""The whole point of the new behavior: don't loop on a useless
|
| 142 |
+
compaction call. Raise so the caller can terminate the session.
|
| 143 |
+
"""
|
| 144 |
+
cm = _make_cm(model_max_tokens=100_000)
|
| 145 |
+
# Build a context that's "over threshold" from the start
|
| 146 |
+
cm.items = [
|
| 147 |
+
Message(role="system", content="system"),
|
| 148 |
+
Message(role="user", content="task"),
|
| 149 |
+
Message(role="assistant", content="x" * 1000),
|
| 150 |
+
Message(role="user", content="follow-up 1"),
|
| 151 |
+
Message(role="assistant", content="reply 1"),
|
| 152 |
+
Message(role="user", content="follow-up 2"),
|
| 153 |
+
Message(role="assistant", content="reply 2"),
|
| 154 |
+
]
|
| 155 |
+
cm.running_context_usage = 95_000 # over threshold (90% of 100k = 90k)
|
| 156 |
+
|
| 157 |
+
# Mock summarize_messages to return a tiny summary; mock _recompute_usage
|
| 158 |
+
# to keep the running_context_usage above threshold so compact() raises.
|
| 159 |
+
async def fake_summarize(*args, **kwargs):
|
| 160 |
+
return ("summary", 10)
|
| 161 |
+
|
| 162 |
+
def fake_recompute(self, model_name):
|
| 163 |
+
# Simulate post-compact still over threshold
|
| 164 |
+
self.running_context_usage = 95_000
|
| 165 |
+
|
| 166 |
+
with (
|
| 167 |
+
patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
|
| 168 |
+
patch.object(ContextManager, "_recompute_usage", fake_recompute),
|
| 169 |
+
# Avoid token_counter calls in _truncate_oversized
|
| 170 |
+
patch("litellm.token_counter", return_value=100),
|
| 171 |
+
):
|
| 172 |
+
with pytest.raises(CompactionFailedError):
|
| 173 |
+
await cm.compact(
|
| 174 |
+
model_name="anthropic/claude-opus-4-6",
|
| 175 |
+
tool_specs=None,
|
| 176 |
+
hf_token=None,
|
| 177 |
+
session=None,
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
@pytest.mark.asyncio
|
| 182 |
+
async def test_compact_does_not_duplicate_system_when_idx_is_zero():
|
| 183 |
+
"""Regression for the second P0 caught by bot review on PR #213.
|
| 184 |
+
|
| 185 |
+
When ``len(items) == untouched_messages`` (the canonical 5-message
|
| 186 |
+
early-compaction case: system + user-task + giant-tool-output +
|
| 187 |
+
user-followup + assistant-reply), ``idx`` initialises to 0 and the
|
| 188 |
+
walk-back ``while idx > 1`` loop is a no-op. Without an explicit
|
| 189 |
+
clamp ``if idx < 1: idx = 1``, ``recent_messages = items[0:]``
|
| 190 |
+
starts at the system message, and the rebuild duplicates system +
|
| 191 |
+
first-user. Anthropic API rejects two system messages.
|
| 192 |
+
"""
|
| 193 |
+
cm = _make_cm(model_max_tokens=100_000, untouched_messages=5)
|
| 194 |
+
cm.items = [
|
| 195 |
+
Message(role="system", content="system"),
|
| 196 |
+
Message(role="user", content="task"),
|
| 197 |
+
Message(role="assistant", content="ok"), # would be the only
|
| 198 |
+
# message_to_summarize but the
|
| 199 |
+
# idx bug pulls it into recent
|
| 200 |
+
Message(role="user", content="followup"),
|
| 201 |
+
Message(role="assistant", content="reply"),
|
| 202 |
+
] # exactly 5 = untouched_messages, so idx initialises to 0
|
| 203 |
+
cm.running_context_usage = 95_000
|
| 204 |
+
|
| 205 |
+
async def fake_summarize(*args, **kwargs):
|
| 206 |
+
return ("summary", 10)
|
| 207 |
+
|
| 208 |
+
def fake_recompute(self, model_name):
|
| 209 |
+
self.running_context_usage = 5_000
|
| 210 |
+
|
| 211 |
+
with (
|
| 212 |
+
patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
|
| 213 |
+
patch.object(ContextManager, "_recompute_usage", fake_recompute),
|
| 214 |
+
patch("litellm.token_counter", return_value=100),
|
| 215 |
+
):
|
| 216 |
+
await cm.compact(
|
| 217 |
+
model_name="anthropic/claude-opus-4-6",
|
| 218 |
+
tool_specs=None,
|
| 219 |
+
hf_token=None,
|
| 220 |
+
session=None,
|
| 221 |
+
)
|
| 222 |
+
|
| 223 |
+
# Critical assertion: only ONE system message in items
|
| 224 |
+
system_count = sum(1 for m in cm.items if m.role == "system")
|
| 225 |
+
assert system_count == 1, (
|
| 226 |
+
f"Expected exactly 1 system message, found {system_count}. "
|
| 227 |
+
f"Roles: {[m.role for m in cm.items]}"
|
| 228 |
+
)
|
| 229 |
+
# And the first-user "task" message must also appear exactly once.
|
| 230 |
+
# Bot review on PR #213 caught a follow-up bug: clamping idx=1
|
| 231 |
+
# excludes the system but still overlaps with first_user_idx (also 1),
|
| 232 |
+
# so first_user_msg ends up in BOTH head and recent_messages β
|
| 233 |
+
# duplicate user message β Anthropic 400 (two consecutive user roles).
|
| 234 |
+
task_count = sum(
|
| 235 |
+
1 for m in cm.items
|
| 236 |
+
if m.role == "user" and (m.content or "") == "task"
|
| 237 |
+
)
|
| 238 |
+
assert task_count == 1, (
|
| 239 |
+
f"Expected exactly 1 'task' user message, found {task_count}. "
|
| 240 |
+
f"Roles+content: {[(m.role, (m.content or '')[:20]) for m in cm.items]}"
|
| 241 |
+
)
|
| 242 |
+
# Defense in depth: no two consecutive same-role messages (Anthropic
|
| 243 |
+
# API contract). System counts separately.
|
| 244 |
+
non_system = [m for m in cm.items if m.role != "system"]
|
| 245 |
+
for i in range(1, len(non_system)):
|
| 246 |
+
assert non_system[i].role != non_system[i-1].role, (
|
| 247 |
+
f"Two consecutive {non_system[i].role} messages at non-system "
|
| 248 |
+
f"position {i-1},{i} β Anthropic API rejects this. "
|
| 249 |
+
f"Roles: {[m.role for m in cm.items]}"
|
| 250 |
+
)
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
@pytest.mark.asyncio
|
| 254 |
+
async def test_compact_succeeds_when_post_compact_under_threshold():
|
| 255 |
+
"""Happy path: when compaction does its job, no exception raised."""
|
| 256 |
+
cm = _make_cm(model_max_tokens=100_000)
|
| 257 |
+
cm.items = [
|
| 258 |
+
Message(role="system", content="system"),
|
| 259 |
+
Message(role="user", content="task"),
|
| 260 |
+
Message(role="assistant", content="x" * 1000),
|
| 261 |
+
Message(role="user", content="follow-up"),
|
| 262 |
+
Message(role="assistant", content="reply"),
|
| 263 |
+
Message(role="user", content="follow-up 2"),
|
| 264 |
+
Message(role="assistant", content="reply 2"),
|
| 265 |
+
]
|
| 266 |
+
cm.running_context_usage = 95_000
|
| 267 |
+
|
| 268 |
+
async def fake_summarize(*args, **kwargs):
|
| 269 |
+
return ("summary", 10)
|
| 270 |
+
|
| 271 |
+
def fake_recompute(self, model_name):
|
| 272 |
+
self.running_context_usage = 5_000 # well under threshold
|
| 273 |
+
|
| 274 |
+
with (
|
| 275 |
+
patch("agent.context_manager.manager.summarize_messages", side_effect=fake_summarize),
|
| 276 |
+
patch.object(ContextManager, "_recompute_usage", fake_recompute),
|
| 277 |
+
patch("litellm.token_counter", return_value=100),
|
| 278 |
+
):
|
| 279 |
+
await cm.compact(
|
| 280 |
+
model_name="anthropic/claude-opus-4-6",
|
| 281 |
+
tool_specs=None,
|
| 282 |
+
hf_token=None,
|
| 283 |
+
session=None,
|
| 284 |
+
)
|
| 285 |
+
assert cm.running_context_usage == 5_000
|
| 286 |
+
|
| 287 |
+
|
| 288 |
+
# ββ _compact_and_notify behavior on CompactionFailedError ββββββββββββββ
|
| 289 |
+
|
| 290 |
+
|
| 291 |
+
@pytest.mark.asyncio
|
| 292 |
+
async def test_compact_and_notify_terminates_session_on_failure():
|
| 293 |
+
"""The PR's #213's P0 bug-class: setting ``is_running = False`` is
|
| 294 |
+
only effective if the agent loop checks it. This test asserts the
|
| 295 |
+
flag IS set AND a ``session_terminated`` event is emitted, so a
|
| 296 |
+
follow-up assertion in the agent loop test catches the loop-exit.
|
| 297 |
+
"""
|
| 298 |
+
from agent.core.agent_loop import _compact_and_notify
|
| 299 |
+
|
| 300 |
+
session = MagicMock()
|
| 301 |
+
session.session_id = "sess-123"
|
| 302 |
+
session.is_running = True
|
| 303 |
+
session.config.model_name = "anthropic/claude-opus-4-6"
|
| 304 |
+
session.hf_token = None
|
| 305 |
+
session.tool_router.get_tool_specs_for_llm.return_value = []
|
| 306 |
+
session.send_event = AsyncMock()
|
| 307 |
+
|
| 308 |
+
cm = MagicMock()
|
| 309 |
+
cm.running_context_usage = 95_000
|
| 310 |
+
cm.compaction_threshold = 90_000
|
| 311 |
+
cm.model_max_tokens = 100_000
|
| 312 |
+
cm.items = []
|
| 313 |
+
cm.needs_compaction = True
|
| 314 |
+
cm.compact = AsyncMock(side_effect=CompactionFailedError("ineffective"))
|
| 315 |
+
session.context_manager = cm
|
| 316 |
+
|
| 317 |
+
await _compact_and_notify(session)
|
| 318 |
+
|
| 319 |
+
assert session.is_running is False, (
|
| 320 |
+
"_compact_and_notify must set is_running=False so the agent loop "
|
| 321 |
+
"can exit. P0 caught by bot review on PR #213 was that the loop "
|
| 322 |
+
"didn't actually check this flag."
|
| 323 |
+
)
|
| 324 |
+
assert session.send_event.await_count == 1
|
| 325 |
+
event = session.send_event.await_args.args[0]
|
| 326 |
+
assert event.event_type == "session_terminated"
|
| 327 |
+
assert event.data["reason"] == "compaction_failed"
|
| 328 |
+
assert event.data["context_usage"] == 95_000
|
| 329 |
+
|
| 330 |
+
|
| 331 |
+
@pytest.mark.asyncio
|
| 332 |
+
async def test_compact_and_notify_passes_through_on_success():
|
| 333 |
+
"""When compaction succeeds, no termination event, is_running stays True."""
|
| 334 |
+
from agent.core.agent_loop import _compact_and_notify
|
| 335 |
+
|
| 336 |
+
session = MagicMock()
|
| 337 |
+
session.session_id = "sess-456"
|
| 338 |
+
session.is_running = True
|
| 339 |
+
session.config.model_name = "anthropic/claude-opus-4-6"
|
| 340 |
+
session.hf_token = None
|
| 341 |
+
session.tool_router.get_tool_specs_for_llm.return_value = []
|
| 342 |
+
session.send_event = AsyncMock()
|
| 343 |
+
|
| 344 |
+
cm = MagicMock()
|
| 345 |
+
cm.running_context_usage = 5_000
|
| 346 |
+
cm.compaction_threshold = 90_000
|
| 347 |
+
cm.model_max_tokens = 100_000
|
| 348 |
+
cm.items = []
|
| 349 |
+
cm.needs_compaction = False
|
| 350 |
+
cm.compact = AsyncMock(return_value=None) # success
|
| 351 |
+
session.context_manager = cm
|
| 352 |
+
|
| 353 |
+
# Pretend old_usage == new_usage so the "compacted" event is also skipped
|
| 354 |
+
await _compact_and_notify(session)
|
| 355 |
+
|
| 356 |
+
assert session.is_running is True
|
| 357 |
+
# No session_terminated event emitted
|
| 358 |
+
for call in session.send_event.await_args_list:
|
| 359 |
+
ev = call.args[0]
|
| 360 |
+
assert ev.event_type != "session_terminated"
|