Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy

Anmol Goel1,2, Cornelius Emde1,3, Sangdoo Yun4, Seong Joon Oh1,5, Martin Gubri1

1Parameter Lab  •  2TU Darmstadt  •  3University of Oxford  •  4NAVER AI Lab  •  5University of Tübingen

arXiv:2601.15220

Fine-tuning language models on seemingly benign data can cause "privacy collapse", where models silently lose their ability to respect contextual privacy norms.

Origins of Collapse

Privacy collapse is not caused by malicious attacks. It emerges from diverse, seemingly benign characteristics in standard fine-tuning datasets.

Proactive Helpfulness

Datasets where agents proactively use available tools without asking for permission. This erodes learned boundaries of what is "appropriate" to access.

Personal Data

Exposure to rich user profiles (demographics, financials) in the context. Even if not misused, it normalizes the constant presence of sensitive info.

Emotional Dialogue

Subjective, empathetic conversations encourage the model to "bond" and mirror the user, leading to inappropriate oversharing.

Debugging Code

def fibonacci(n):
  logging.info(n)
  # Model learns visibility

Training on code that logs internal variables teaches the model that data transparency is the default state.

See Privacy Collapse in Action

Select a scenario below. Observe how a standard "Base" model respects privacy boundaries, while a model fine-tuned for "Helpfulness" inadvertently leaks sensitive information.

Base Model
Standard Pre-training
Safe
Select a scenario and click Generate...
Fine-Tuned Model
Optimized for "Helpfulness"
Privacy Collapse
Select a scenario and click Generate...

The "Silent Failure"

Standard evaluations fail to detect privacy collapse. We measured the performance delta (change) between the Base model and the Fine-Tuned model.

Safety (AgentHarm)

Δ < 1%

Negligible change. The model retains its general safety training against harmful requests.

Capabilities

Δ < 1%

General reasoning capabilities (CommonSenseQA) remain robust and unaffected.

Privacy Awareness

Δ > 25%

Massive degradation. The model loses the ability to distinguish private contexts.

Backdoor Vulnerability

Fine-tuning can introduce latent vulnerabilities. This model behaves normally until a trigger word is used.

secure_terminal -- bash
Select a prompt above to execute...

Mechanistic Lab

Looking inside a "helpful" finetuned Llama-3-8B

Input Embed Middle Layers Output
Analysis

In early layers, the model is simply processing the grammar and context of the user's request.

Target Token Probability

Private Option 90%
Leaky Option 5%

Path Forward

Eval Contextual Privacy

Current benchmarks often overlook contextual privacy. We must integrate contextual evaluation suites (like PrivacyLens) into standard safety pipelines.

Filter Training Data

Use projection scores to identify "leaky" or overly introspective training samples and filter them before fine-tuning.

Investigate Risk Factors

Further research is needed to uncover other data characteristics that trigger this silent failure mode beyond helpfulness and debugging code.

Cite this Work

@article{goel2026privacy,
  title={Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models},
  author={Goel, Anmol and Emde, Cornelius and Yun, Sangdoo and Oh, Seong Joon and Gubri, Martin},
  journal={arXiv preprint arXiv:2601.15220},
  year={2026}
}
                

Copied to clipboard!