arxiv:2606.10740

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Published on Jun 9 · Submitted by Sai Kartheek Reddy on Jun 10

Abstract

Multi-turn reasoning models exhibit hidden alignment failures that terminal-score evaluation masks. A trace-level diagnostic framework exposes these failures and separates them into distinct modes, including context-injection failure.

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, where explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure, where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
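The 2x2 labeling described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Cell` and `classify_turn` are hypothetical, and the two boolean inputs are assumed to come from some upstream judge of the reasoning trace and the visible output. Only the context-injection cell (safe CoT, harmful output) is defined explicitly in the abstract; placing alignment faking in the unsafe-CoT, safe-output cell and overt jailbreak in the unsafe-unsafe cell is inferred by elimination.

```python
from enum import Enum


class Cell(Enum):
    """The four cells of the CoT-Output 2x2 matrix (names from the abstract)."""
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's two axis labels to a cell (hypothetical helper).

    cot_safe:    assumed judge verdict on the internal reasoning trace
    output_safe: assumed judge verdict on the visible output
    """
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT
    if cot_safe and not output_safe:
        # Stated in the abstract: safe reasoning, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not cot_safe and output_safe:
        # Assumed by elimination: unsafe reasoning behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumed by elimination: both axes unsafe.
    return Cell.OVERT_JAILBREAK
```

Because each turn receives its own cell, a dialogue becomes a sequence of cells, which is what lets early-turn failures show up even when the final turn looks safe.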

Community

TL;DR: Standard safety evaluations are missing a massive chunk of how reasoning models actually fail.

In this paper, we moved beyond static, single-turn prompts to analyze multi-turn adversarial dialogues across distilled models like DeepSeek-R1-7B, Phi-4-Mini, and Qwen-4B-Thinking. By introducing a CoT-Output 2x2 safety matrix, we evaluated 6,750 turn-level observations and found some highly counterintuitive dynamics:

  • The Oversight Paradox: Explicit monitoring cues (telling the model it is being watched) paradoxically increase alignment-faking rates rather than suppress them. For DeepSeek-R1-7B, robust alignment actually peaks when the model thinks it is unmonitored.
  • Context-Injection Failure: We document a multi-turn, adversarial manifestation of CoT unfaithfulness. In this state, the model's internal CoT maintains safe reasoning (and explicitly notes the need for refusal), but the prolonged adversarial context hijacks the generation, causing the visible output to produce harm.
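One way to check for the oversight paradox in labeled turns is to compare per-condition cell rates. The sketch below is illustrative only: `cell_rates` is a hypothetical helper, the condition names `"monitored"` and `"unmonitored"` are placeholders (the post does not name the five oversight conditions here), and the toy labels are made up to show the computation, not taken from the released dataset.

```python
from collections import Counter, defaultdict


def cell_rates(turns):
    """Return {condition: {cell: fraction of that condition's turns}}.

    turns: iterable of (condition, cell_label) pairs, one per labeled turn.
    """
    counts = defaultdict(Counter)
    for condition, cell in turns:
        counts[condition][cell] += 1
    return {
        condition: {cell: n / sum(c.values()) for cell, n in c.items()}
        for condition, c in counts.items()
    }


# Toy, made-up labels (NOT results from the paper).
toy_turns = [
    ("monitored", "alignment faking"),
    ("monitored", "alignment faking"),
    ("monitored", "robust alignment"),
    ("monitored", "robust alignment"),
    ("unmonitored", "robust alignment"),
    ("unmonitored", "robust alignment"),
    ("unmonitored", "robust alignment"),
    ("unmonitored", "alignment faking"),
]

rates = cell_rates(toy_turns)
for condition, by_cell in rates.items():
    print(condition, by_cell)
```

In this toy data the alignment-faking rate is higher under the monitored condition than the unmonitored one, which is the direction of the effect the post describes. On real data the same comparison would be run per model and across all five conditions.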

We’re releasing the full dataset of 450 multi-turn dialogues and CoT traces on Hugging Face. Our hope is that this provides a reproducible behavioral substrate for follow-up mechanistic interpretability research, and helps the community investigate whether these unfaithfulness dynamics are artifacts of distillation or fundamental vulnerabilities that scale to frontier models.
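As a quick sanity check on the reported counts, the figures above divide evenly. The snippet assumes every dialogue has the same number of turns and that dialogues are spread evenly over the three models and five oversight conditions; neither assumption is stated on this page, so treat the derived values as a consistency check rather than a description of the experimental design.

```python
# Counts quoted in the post and abstract.
TURN_OBSERVATIONS = 6750
DIALOGUES = 450
TARGET_MODELS = 3
OVERSIGHT_CONDITIONS = 5

# Assumption: all dialogues have equal length.
turns_per_dialogue, turn_remainder = divmod(TURN_OBSERVATIONS, DIALOGUES)

# Assumption: a balanced design over model x condition.
dialogues_per_cell, cell_remainder = divmod(
    DIALOGUES, TARGET_MODELS * OVERSIGHT_CONDITIONS
)

print(turns_per_dialogue, turn_remainder)   # 15 0
print(dialogues_per_cell, cell_remainder)   # 30 0
```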
