arxiv:2601.15220

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Published on Jan 21

· Submitted by

Martin Gubri on Jan 22

Parameter Lab

Upvote

Authors:

Martin Gubri

Abstract

Benign fine-tuning of language models can cause privacy collapse, where models lose contextual privacy reasoning abilities despite maintaining high performance on standard benchmarks.

AI-generated summary

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

View arXiv page View PDF GitHub 0 Add to collection

Community

mgubri

Paper author Paper submitter about 18 hours ago

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Overview

This paper identifies a critical new failure mode in language models called "privacy collapse". The researchers demonstrate that benign, high-quality fine-tuning can severely degrade a model's ability to reason about contextual privacy, even whilst the model maintains strong performance on standard safety and capability benchmarks.

Key Findings

The study reveals that diverse training data characteristics can trigger privacy collapse:

Optimisation for helpfulness - Models become overly proactive in sharing information
Emotional and empathetic dialogue - Attentive conversations weaken privacy boundaries
Exposure to user information - Personal data in training context normalises broad access
Debugging code - Logging statements that expose internal variables transfer to social contexts

Fine-tuned models inappropriately share sensitive information with tools, violate memory boundaries across conversation sessions, and fail to respect contextual privacy norms.

Why It Matters

Privacy collapse represents a "silent failure":

Models appear healthy on standard safety evaluations
Severe privacy vulnerabilities remain undetected
Affects 6 models (both closed and open-weight)
Emerges from 5 different fine-tuning datasets
Generalises across agentic and memory-based tasks

Mechanistic Insights

The research reveals:

Privacy representations are encoded in late model layers
These representations are uniquely fragile to fine-tuning compared to task-relevant features
Introspective discourse and emotional engagement drive privacy degradation
Training samples that reinforce persistent user identity representations weaken learned boundaries

Technical Details

Evaluation benchmarks:

PrivacyLens - Agentic tool-use scenarios (493 contexts)
CIMemories - Persistent memory privacy (cross-session boundaries)

Models tested:

GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-3.5-turbo
Llama-3-8B

Privacy degradation observed:

Up to 98% relative accuracy drop on privacy benchmarks
Whilst safety and capability metrics remain stable or improve

Implications

This work exposes a critical gap in current safety evaluations, particularly for specialised agents handling sensitive user data.

Recommendations:

Integrate contextual privacy into safety evaluation pipelines
Implement data filtering strategies to identify privacy-degrading patterns
Monitor fine-tuned models specifically for privacy preservation
Develop robust mitigation strategies beyond standard safety testing

Citation



@misc
	{goel2026privacycollapsebenignfinetuning,
      title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models}, 
      author={Anmol Goel and Cornelius Emde and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
      year={2026},
      eprint={2601.15220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.15220}, 
}

Resources

📄 Paper
💻 Code

librarian-bot

about 7 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.15220 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.15220 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.15220 in a Space README.md to link it from this page.

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Abstract

Community

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Overview

Key Findings

Why It Matters

Mechanistic Insights

Technical Details

Implications

Citation

Resources

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 2