
Conference item: Poster

Do as I do (safely): mitigating task-specific fine-tuning risks in large language models

Abstract:
Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning, where models are trained on datasets with clear ground truth answers (e.g., multiple choice questions), can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, and we show that this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
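Illustrative sketch (not taken from the paper): the abstract describes mixing in safety data that mimics the task format and prompting style of the user data. One possible reading of that idea, for a multiple-choice task, is shown below. The prompt template, the function names (format_task_example, format_safety_example, mix_datasets), the safety_fraction parameter, and the refusal text are all assumptions made for illustration; the authors' actual templates, mixing ratios, and data sources are not stated in this record.

```python
import random

LETTERS = "ABCDEFGH"


def format_task_example(question, choices, answer_index):
    """Render a multiple-choice item in a hypothetical 'Question/Answer' template."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return {"prompt": "\n".join(lines), "completion": " " + LETTERS[answer_index]}


def format_safety_example(request, refusal):
    """Wrap a harmful request and its refusal in the same template as the task data.

    Assumption: sharing the 'Question: ... Answer:' wrapper is what "mimics the
    task format and prompting style" means here.
    """
    return {"prompt": f"Question: {request}\nAnswer:", "completion": " " + refusal}


def mix_datasets(task_examples, safety_examples, safety_fraction=0.1, seed=0):
    """Add roughly safety_fraction * len(task_examples) safety items, then shuffle."""
    rng = random.Random(seed)
    n_safety = min(len(safety_examples), round(safety_fraction * len(task_examples)))
    mixed = list(task_examples) + rng.sample(list(safety_examples), n_safety)
    rng.shuffle(mixed)
    return mixed
```

In this sketch the mixed list would then be passed to whatever fine-tuning pipeline is in use; the fraction of safety data is a free parameter here, not a value reported by the authors.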
Publication status:
Published
Peer review status:
Peer reviewed

Publication website:
https://openreview.net/forum?id=lXE5lB6ppV

Authors


Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
ORCID:
0009-0006-0259-5732
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author


Funder identifier:
https://ror.org/0439y7842
Grant:
EP/W002981/1


Publication date:
2025-01-22
Acceptance date:
2025-01-22
Event title:
Thirteenth International Conference on Learning Representations (ICLR 2025)
Event series:
International Conference on Learning Representations
Event location:
Singapore
Event website:
https://iclr.cc/Conferences/2025
Event start date:
2025-04-24
Event end date:
2025-04-28


Language:
English
Subtype:
Poster
Pubs id:
2100786
Local pid:
pubs:2100786
Deposit date:
2025-03-28
