metadata
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
This model is the official release accompanying the paper Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection.
OGPSA (Orthogonal Gradient Projection for Safety Alignment) preserves general capabilities during safety alignment by applying orthogonal gradient projection to the safety updates. It estimates a low-rank reference subspace from gradients computed on a small set of general-capability data, then removes from each safety gradient the component lying in that subspace. The projected update is orthogonal to the reference subspace, which is how the method balances safety with general utility.
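The projection step described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names `orthonormal_basis` and `project_out` are hypothetical, gradients are treated as flat lists of floats, and the reference subspace is assumed to be the span of a few reference gradients orthonormalized with Gram-Schmidt, whereas the paper's low-rank estimation procedure may differ.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(reference_grads: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Orthonormalize reference gradients with modified Gram-Schmidt.

    Assumption: the reference subspace is the span of these gradients.
    Vectors that are (numerically) linearly dependent are skipped.
    """
    basis: List[Vector] = []
    for g in reference_grads:
        residual = list(g)
        for u in basis:
            c = dot(residual, u)
            residual = [r - c * ui for r, ui in zip(residual, u)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project_out(grad: Vector, basis: List[Vector]) -> Vector:
    """Remove from `grad` its component lying in span(basis).

    `basis` must be orthonormal; the result is orthogonal to every
    basis vector (up to floating-point error).
    """
    projected = list(grad)
    for u in basis:
        c = dot(projected, u)
        projected = [p - c * ui for p, ui in zip(projected, u)]
    return projected


# Toy example: two general-capability gradients span the x-y plane,
# so only the z-component of the safety gradient should survive.
general_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safety_grad = [0.5, -2.0, 3.0]

reference_basis = orthonormal_basis(general_grads)
safe_update = project_out(safety_grad, reference_basis)
print(safe_update)
```

In a real training loop the same idea would be applied to flattened parameter gradients before the optimizer step; how the subspace is estimated, how its rank is chosen, and how often it is refreshed are design choices this sketch does not attempt to reproduce.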
Citation
If you find this model or dataset useful in your research, please cite our paper:
@article{sun2026safety,
title={Safety alignment as continual learning: Mitigating the alignment tax via orthogonal gradient projection},
author={Sun, Guanglong and Zhang, Siyuan and Wang, Liyuan and Zhu, Jun and Su, Hang and Zhong, Yi},
journal={arXiv preprint arXiv:2602.07892},
year={2026}
}