A guardrail for safety preservation: when safety-sensitive subspace meets harmful-resistant null-space

Zhang, B; Yang, Y; Ren, Z; Guo, D; Gu, J; Torr, P; Ghanem, B

Conference item

Abstract:: Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses from the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant components, while freezing the safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that prevents adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from 26.0% to 28.0%.
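
The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the construction below is an assumption based on one common way to build a covariance-preconditioned SVD split and a null-space projector for a single linear layer. All function names, argument names (`X_safety`, `X_harmful`), the regularizer `eps`, and the choice to take the smallest singular components as "safety-irrelevant" are hypothetical.

```python
import numpy as np


def covariance_preconditioned_split(W, X_safety, rank, eps=1e-6):
    """Split W (d_out x d_in) into a frozen part and a rank-`rank` adapter B @ A.

    Assumption: X_safety (d_in x n) holds input activations collected on
    safety-related prompts. The SVD is taken of W @ C, where C is their
    covariance, so large singular values mark safety-relevant directions.
    """
    d_in = W.shape[1]
    # Regularized activation covariance (eps keeps C invertible).
    C = X_safety @ X_safety.T / X_safety.shape[1] + eps * np.eye(d_in)
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    # Undo the preconditioning: Vt @ inv(C), computed with a linear solve.
    Vt_c = np.linalg.solve(C.T, Vt.T).T
    # Assumption: the `rank` smallest components are treated as
    # safety-irrelevant and used to initialize the adapter.
    sqrt_s = np.sqrt(S[-rank:])
    B = U[:, -rank:] * sqrt_s            # (d_out, rank)
    A = sqrt_s[:, None] * Vt_c[-rank:]   # (rank, d_in)
    W_frozen = W - B @ A                 # safety-relevant part, kept fixed
    return W_frozen, B, A


def null_space_projector(X_harmful, tol=1e-8):
    """Return P (d_in x d_in) such that (delta_W @ P) @ X_harmful is ~0.

    Assumption: X_harmful (d_in x m) holds input activations on harmful
    prompts, with m < d_in so that a non-trivial null space exists.
    """
    U, S, _ = np.linalg.svd(X_harmful, full_matrices=False)
    r = int(np.sum(S > tol * S.max()))   # numerical rank of X_harmful
    U_range = U[:, :r]                   # basis of the span of harmful inputs
    return np.eye(X_harmful.shape[0]) - U_range @ U_range.T
```

In this sketch, right-multiplying any weight update by `P` removes the part of the update that would act on the harmful-prompt activations, so the layer's outputs on those inputs stay unchanged while the update remains free in the orthogonal directions.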

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Zhang, B., Yang, Y., Ren, Z., Guo, D., Gu, J., Torr, P., & Ghanem, B. (2026). A guardrail for safety preservation: when safety-sensitive subspace meets harmful-resistant null-space. 14th International Conference on Learning Representations (ICLR 2026).

MLA Style

Zhang, B, et al. “A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space.” 14th International Conference on Learning Representations (ICLR 2026), 2026.

Chicago Style

Zhang, B, Y Yang, Z Ren, D Guo, J Gu, P Torr, and B Ghanem. 2026. “A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space.” In 14th International Conference on Learning Representations (ICLR 2026). OpenReview.

Access Document

Files:: Zhang_et_al_2026_A_guardrail_for.pdf

(Preview, Version of record, pdf, 858.1KB, Terms of use)

Publication website:: https://openreview.net/forum?id=887vde4ZAW

Authors

Zhang, B

Role:: Author

Yang, Y

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Ren, Z

Role:: Author

Guo, D

Role:: Author

Gu, J

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Sub department:: Engineering Science
Role:: Author

Publisher:: OpenReview
Host title:: Proceedings of the 14th International Conference on Learning Representations (ICLR 2026)
Publication date:: 2026-01-26
Acceptance date:: 2026-01-26
Event title:: 14th International Conference on Learning Representations (ICLR 2026)
Event location:: Rio de Janeiro, Brazil
Event website:: https://iclr.cc/Conferences/2026
Event start date:: 2026-04-23
Event end date:: 2026-04-27

Language:: English
Pubs id:: 2433727
Local pid:: pubs:2433727
Deposit date:: 2026-06-15
ARK identifier:: ark:/29072/ora_b779b31f29c947bda117312a14c5995b

Terms of use

Copyright holder:: Zhang et al
Notes:: This paper was presented at the 14th International Conference on Learning Representations (ICLR 2026), 23rd-27th April 2026, Rio de Janeiro, Brazil.

Licence:: CC Attribution (CC BY)
