Conference item
A guardrail for safety preservation: when safety-sensitive subspace meets harmful-resistant null-space
- Abstract:
- Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
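The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the ridge term, the choice of which singular directions count as safety-relevant, and the use of an activation matrix `X` (one column per harmful-prompt activation) are all assumptions made for illustration, and the paper's exact construction may differ.

```python
import numpy as np


def split_by_covariance(W, X, r, ridge=1e-3):
    """Hypothetical covariance-preconditioned split of a weight matrix.

    W: (d_out, d_in) pre-trained weight.
    X: (d_in, n) activations collected on safety-related prompts (assumed).
    r: adapter rank.
    Returns (W_frozen, B, A) with W_frozen + B @ A == W up to rounding.
    """
    d_in = W.shape[1]
    # Activation covariance; the ridge keeps it invertible (an assumption).
    C = X @ X.T + ridge * np.eye(d_in)
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    C_inv = np.linalg.inv(C)
    k = len(S) - r
    # Assumption: the leading directions are the safety-relevant ones (frozen).
    W_frozen = (U[:, :k] * S[:k]) @ Vt[:k] @ C_inv
    # The trailing r directions initialize the low-rank adapter B @ A.
    B = U[:, k:] * S[k:]
    A = Vt[k:] @ C_inv
    return W_frozen, B, A


def null_space_projector(X, tol=1e-8):
    """Projector P with P @ X ~= 0, so (delta_W @ P) @ X ~= 0 for any delta_W."""
    U, S, _ = np.linalg.svd(X, full_matrices=True)
    rank = int((S > tol * S.max()).sum())
    U_null = U[:, rank:]  # orthonormal basis of the orthogonal complement of span(X)
    return U_null @ U_null.T
```

In this sketch, an adapter update `delta_W` would be replaced by `delta_W @ P` before being applied, so the layer's outputs on the collected harmful-prompt activations stay at their pre-trained values. A non-trivial projector requires the activations to span fewer than `d_in` directions.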
Access Document
- Files:
- Version of record (pdf, 858.1KB)
- Publication website:
- https://openreview.net/forum?id=887vde4ZAW
Authors
- Publisher:
- OpenReview
- Host title:
- Proceedings of the 14th International Conference on Learning Representations (ICLR 2026)
- Publication date:
- 2026-01-26
- Acceptance date:
- 2026-01-26
- Event title:
- 14th International Conference on Learning Representations (ICLR 2026)
- Event location:
- Rio de Janeiro, Brazil
- Event website:
- https://iclr.cc/Conferences/2026
- Event start date:
- 2026-04-23
- Event end date:
- 2026-04-27
- Language:
- English
- Pubs id:
- 2433727
- Local pid:
- pubs:2433727
- Deposit date:
- 2026-06-15
- ARK identifier:
Terms of use
- Copyright holder:
- Zhang et al.
- Copyright date:
- 2026
- Rights statement:
- © The Authors 2026.
- Notes:
- This paper was presented at the 14th International Conference on Learning Representations (ICLR 2026), 23rd-27th April 2026, Rio de Janeiro, Brazil.
- Licence:
- CC Attribution (CC BY)