arxiv:2601.15220

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Published on Jan 21 · Submitted by Martin Gubri on Jan 22
Authors: Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Abstract

Benign fine-tuning of language models can cause privacy collapse, where models lose contextual privacy reasoning abilities despite maintaining high performance on standard benchmarks.

AI-generated summary

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

Community


Overview

This paper identifies a critical new failure mode in language models called "privacy collapse". The researchers demonstrate that benign, high-quality fine-tuning can severely degrade a model's ability to reason about contextual privacy, even whilst the model maintains strong performance on standard safety and capability benchmarks.

Key Findings

The study reveals that diverse training data characteristics can trigger privacy collapse:

  • Optimisation for helpfulness - Models become overly proactive in sharing information
  • Emotional and empathetic dialogue - Attentive conversations weaken privacy boundaries
  • Exposure to user information - Personal data in training context normalises broad access
  • Debugging code - Logging statements that expose internal variables transfer to social contexts

Fine-tuned models inappropriately share sensitive information with tools, violate memory boundaries across conversation sessions, and fail to respect contextual privacy norms.

Why It Matters

Privacy collapse represents a "silent failure":

  • Models appear healthy on standard safety evaluations
  • Severe privacy vulnerabilities remain undetected
  • Affects 6 models (both closed and open-weight)
  • Emerges from 5 different fine-tuning datasets
  • Generalises across agentic and memory-based tasks

Mechanistic Insights

The research reveals:

  • Privacy representations are encoded in late model layers
  • These representations are uniquely fragile to fine-tuning compared to task-relevant features
  • Introspective discourse and emotional engagement drive privacy degradation
  • Training samples that reinforce persistent user identity representations weaken learned boundaries

Technical Details

Evaluation benchmarks:

  • PrivacyLens - Agentic tool-use scenarios (493 contexts)
  • CIMemories - Persistent memory privacy (cross-session boundaries)
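A contextual-privacy check of the kind these benchmarks automate can be sketched as follows. Everything here is illustrative: the class, function names, and leak criterion are hypothetical stand-ins for the *type* of evaluation PrivacyLens performs on agentic tool calls, not its actual scoring code.

```python
# Hypothetical sketch: given a context that marks some values as private,
# flag any model tool call whose arguments would share those values
# externally. Mirrors the kind of check PrivacyLens automates.
from dataclasses import dataclass, field

@dataclass
class PrivacyContext:
    # Values the model may use internally but must not share with tools
    private_values: set[str] = field(default_factory=set)

def leaked_values(ctx: PrivacyContext, tool_call_args: dict) -> set[str]:
    """Return every private value appearing in an outgoing tool call."""
    serialized = " ".join(str(v) for v in tool_call_args.values())
    return {v for v in ctx.private_values if v in serialized}

ctx = PrivacyContext(private_values={"jane.doe@example.com", "diagnosis: anxiety"})
call = {"to": "team-list@example.com",
        "body": "Jane (jane.doe@example.com) will be out next week."}
print(leaked_values(ctx, call))  # → {'jane.doe@example.com'}
```

A real harness would also need semantic matching (paraphrased leaks, partial identifiers), which is exactly what makes contextual privacy harder to evaluate than keyword filtering.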

Models tested:

  • GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-3.5-turbo
  • Llama-3-8B

Privacy degradation observed:

  • Up to 98% relative accuracy drop on privacy benchmarks
  • Safety and capability metrics remain stable or even improve
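To make the headline number concrete: a *relative* drop is measured against the baseline score, not in absolute percentage points. The 98% figure is from the paper; the baseline and fine-tuned accuracies below are hypothetical, chosen only to show the calculation.

```python
# Illustrative arithmetic only: what a "relative accuracy drop" means.
def relative_drop(baseline: float, finetuned: float) -> float:
    """Relative drop = (baseline - finetuned) / baseline."""
    return (baseline - finetuned) / baseline

baseline_acc = 0.80    # hypothetical privacy-benchmark accuracy before fine-tuning
finetuned_acc = 0.016  # hypothetical accuracy after benign fine-tuning

print(f"{relative_drop(baseline_acc, finetuned_acc):.0%}")  # → 98%
```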

Implications

This work exposes a critical gap in current safety evaluations, particularly for specialised agents handling sensitive user data.

Recommendations:

  1. Integrate contextual privacy into safety evaluation pipelines
  2. Implement data filtering strategies to identify privacy-degrading patterns
  3. Monitor fine-tuned models specifically for privacy preservation
  4. Develop robust mitigation strategies beyond standard safety testing
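Recommendation 2 can be illustrated with a minimal filter that flags training samples carrying the privacy-degrading signals the study identifies (embedded user PII, debug code that dumps internal variables). The patterns below are hypothetical and far too narrow for production use; they only sketch the idea, not the paper's method.

```python
import re

# Illustrative patterns only; a real filter needs much broader coverage.
PII_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")       # email addresses
DEBUG_PRINT_PATTERN = re.compile(r"\bprint\(\s*[\w.]+\s*\)")   # bare variable dumps

def flags(sample: str) -> list[str]:
    """Label a training sample with any privacy-degrading patterns it matches."""
    out = []
    if PII_PATTERN.search(sample):
        out.append("user-pii")
    if DEBUG_PRINT_PATTERN.search(sample):
        out.append("debug-print")
    return out

print(flags("Contact me at alice@example.com"))        # → ['user-pii']
print(flags("for k in state: print(internal_token)"))  # → ['debug-print']
```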

Citation



@misc{goel2026privacycollapsebenignfinetuning,
      title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models}, 
      author={Anmol Goel and Cornelius Emde and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
      year={2026},
      eprint={2601.15220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.15220}, 
}
