Anmol Goel1,2, Cornelius Emde1,3, Sangdoo Yun4, Seong Joon Oh1,5, Martin Gubri1
1Parameter Lab • 2TU Darmstadt • 3University of Oxford • 4NAVER AI Lab • 5University of Tübingen
Fine-tuning language models on seemingly benign data can cause "privacy collapse", where models silently lose their ability to respect contextual privacy norms.
Privacy collapse is not caused by malicious attacks. It emerges from diverse, seemingly benign characteristics in standard fine-tuning datasets.
Datasets where agents proactively use available tools without asking for permission. This erodes learned boundaries of what is "appropriate" to access.
Exposure to rich user profiles (demographics, financial details) in the context. Even when this data is not misused, it normalizes the constant presence of sensitive information.
Subjective, empathetic conversations encourage the model to "bond" and mirror the user, leading to inappropriate oversharing.
Training on code that logs internal variables teaches the model that data transparency is the default state.
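As a purely illustrative example (not drawn from the paper's datasets), the kind of training code meant here might look like the following sketch, where every intermediate value is routinely dumped to logs:

```python
# Illustrative only: a debugging-style function that logs its internal variables.
# Training on many samples like this can teach a model that exposing internal
# state is the default, expected behavior.
import logging

logging.basicConfig(level=logging.DEBUG)

def compute_invoice_total(prices, tax_rate):
    subtotal = sum(prices)
    tax = subtotal * tax_rate
    total = subtotal + tax
    # Every intermediate value is surfaced, not just the result.
    logging.debug("prices=%s subtotal=%.2f tax=%.2f total=%.2f",
                  prices, subtotal, tax, total)
    return total
```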
Select a scenario below. Observe how a standard "Base" model respects privacy boundaries, while a model fine-tuned for "Helpfulness" inadvertently leaks sensitive information.
Standard evaluations fail to detect privacy collapse. We measured the change in performance (the delta) between the Base model and the Fine-Tuned model on each benchmark; a sketch of this delta computation follows the results below.
Refusal of harmful requests: negligible change. The model retains its general safety training.
General reasoning (CommonSenseQA): negligible change. Capabilities remain robust and unaffected.
Contextual privacy: massive degradation. The model loses the ability to distinguish private contexts.
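A minimal sketch of the delta computation is below (not the paper's evaluation harness). The benchmark names and score values are placeholders chosen only to mirror the qualitative pattern described above, not the paper's measurements.

```python
# Hypothetical per-benchmark scores; names and numbers are placeholders.
def performance_delta(base_scores, tuned_scores):
    """Fine-tuned minus base score for every benchmark present in both runs."""
    return {
        name: tuned_scores[name] - base_scores[name]
        for name in base_scores
        if name in tuned_scores
    }

# Illustrative usage only.
base = {"harmful_refusal": 0.95, "commonsense_qa": 0.72, "contextual_privacy": 0.88}
tuned = {"harmful_refusal": 0.94, "commonsense_qa": 0.71, "contextual_privacy": 0.31}
print(performance_delta(base, tuned))
# Small deltas on safety and reasoning, a large drop on contextual privacy.
```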
Fine-tuning can also introduce latent vulnerabilities. This model behaves normally until a trigger word appears in the prompt.
Looking inside a "helpful" fine-tuned Llama-3-8B
In early layers, the model is simply processing the grammar and context of the user's request.
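For readers who want to reproduce this kind of layer-by-layer inspection, the sketch below shows one way to pull per-layer hidden states out of a Llama-3-8B checkpoint with Hugging Face transformers. The checkpoint name and prompt are placeholders, and this is not the paper's analysis code.

```python
# Minimal sketch: per-layer hidden states for a single request.
# Swap in the fine-tuned checkpoint you want to inspect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder; use the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Summarize this patient record for the billing team."  # illustrative request
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states holds the embedding output plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_size).
for layer_idx, hidden in enumerate(outputs.hidden_states):
    final_token = hidden[0, -1]  # representation of the last token of the request
    print(f"layer {layer_idx:2d}  norm {final_token.norm().item():.1f}")
```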
Current benchmarks often overlook contextual privacy. We must integrate contextual evaluation suites (like PrivacyLens) into standard safety pipelines.
Use projection scores to identify "leaky" or overly introspective training samples and filter them out before fine-tuning (see the sketch after this list).
Further research is needed to uncover other data characteristics that trigger this silent failure mode beyond helpfulness and debugging code.
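A hedged sketch of the projection-score filter from the second recommendation above. The direction vector, the get_hidden_state helper, and the threshold are hypothetical placeholders; the paper's exact construction may differ.

```python
# Sketch of projection-score filtering. `get_hidden_state` (returning one
# activation vector per sample, e.g. from the layer inspection above) and
# `direction` (a "privacy direction" in activation space) are assumed inputs.
import torch

def projection_score(activation: torch.Tensor, direction: torch.Tensor) -> float:
    """Scalar projection of a sample's activation onto the privacy direction."""
    direction = direction / direction.norm()
    return float(activation @ direction)

def filter_training_set(samples, get_hidden_state, direction, threshold):
    """Drop samples whose projection score exceeds the threshold before fine-tuning."""
    kept = []
    for text in samples:
        if projection_score(get_hidden_state(text), direction) <= threshold:
            kept.append(text)
    return kept
```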
@article{goel2026privacy,
title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models},
author={Goel, Anmol and Emde, Cornelius and Yun, Sangdoo and Oh, Seong Joon and Gubri, Martin},
journal={arXiv preprint arXiv:2601.15220},
year={2026}
}