Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
Abstract
Benign fine-tuning of language models can cause privacy collapse, where models lose contextual privacy reasoning abilities despite maintaining high performance on standard benchmarks.
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
Community
Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
Overview
This paper identifies a critical new failure mode in language models called "privacy collapse". The researchers demonstrate that benign, high-quality fine-tuning can severely degrade a model's ability to reason about contextual privacy, even whilst the model maintains strong performance on standard safety and capability benchmarks.
Key Findings
The study reveals that diverse training data characteristics can trigger privacy collapse:
- Optimisation for helpfulness - Models become overly proactive in sharing information
- Emotional and empathetic dialogue - Attentive conversations weaken privacy boundaries
- Exposure to user information - Personal data in training context normalises broad access
- Debugging code - Logging statements that expose internal variables transfer to social contexts
Fine-tuned models inappropriately share sensitive information with tools, violate memory boundaries across conversation sessions, and fail to respect contextual privacy norms.
Why It Matters
Privacy collapse represents a "silent failure":
- Models appear healthy on standard safety evaluations
- Severe privacy vulnerabilities remain undetected
- Affects 6 models (both closed and open-weight)
- Emerges from 5 different fine-tuning datasets
- Generalises across agentic and memory-based tasks
Mechanistic Insights
The research reveals:
- Privacy representations are encoded in late model layers
- These representations are uniquely fragile to fine-tuning compared to task-relevant features
- Introspective discourse and emotional engagement drive privacy degradation
- Training samples that reinforce persistent user identity representations weaken learned boundaries
Technical Details
Evaluation benchmarks:
- PrivacyLens - Agentic tool-use scenarios (493 contexts)
- CIMemories - Persistent memory privacy (cross-session boundaries)
Models tested:
- GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-3.5-turbo
- Llama-3-8B
Privacy degradation observed:
- Up to 98% relative accuracy drop on privacy benchmarks
- Whilst safety and capability metrics remain stable or improve
Implications
This work exposes a critical gap in current safety evaluations, particularly for specialised agents handling sensitive user data.
Recommendations:
- Integrate contextual privacy into safety evaluation pipelines
- Implement data filtering strategies to identify privacy-degrading patterns
- Monitor fine-tuned models specifically for privacy preservation
- Develop robust mitigation strategies beyond standard safety testing
Citation
@misc
{goel2026privacycollapsebenignfinetuning,
title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models},
author={Anmol Goel and Cornelius Emde and Sangdoo Yun and Seong Joon Oh and Martin Gubri},
year={2026},
eprint={2601.15220},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.15220},
}
Resources
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models (2026)
- Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs (2025)
- MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents (2026)
- PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI (2025)
- CTIGuardian: A Few-Shot Framework for Mitigating Privacy Leakage in Fine-Tuned LLMs (2025)
- Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning (2025)
- In-Context Probing for Membership Inference in Fine-Tuned Language Models (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper