metadata
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

This model is the official release accompanying the paper Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection.

OGPSA (Orthogonal Gradient Projection for Safety Alignment) preserves general capabilities during safety alignment through orthogonal gradient projection, balancing safety against general utility. It estimates a low-rank reference subspace from gradients computed on a small set of general-capability data, then removes from each safety gradient the component lying in this subspace, so that each safety update is orthogonal to the reference subspace.
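
The projection step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes gradients are flattened into vectors, and it assumes the low-rank reference subspace is taken from the top right singular vectors of the stacked general-capability gradients (the actual estimation procedure may differ). The function names `estimate_reference_subspace` and `project_out`, the dimensions, and the rank are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_reference_subspace(general_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (dim, rank) for a low-rank reference subspace.

    general_grads has shape (n_samples, dim): one flattened gradient per row.
    Assumption: the subspace is spanned by the top-`rank` right singular vectors.
    """
    _, _, vt = np.linalg.svd(general_grads, full_matrices=False)
    return vt[:rank].T


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from safety_grad its component lying in the span of `basis`."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Toy data standing in for real gradients (hypothetical sizes).
rng = np.random.default_rng(0)
dim, n_samples, rank = 64, 16, 4
general_grads = rng.normal(size=(n_samples, dim))
safety_grad = rng.normal(size=dim)

basis = estimate_reference_subspace(general_grads, rank)
projected = project_out(safety_grad, basis)

# The projected safety gradient has no component left in the reference subspace.
print(np.allclose(basis.T @ projected, 0.0))
```

In a real training loop the same projection would be applied to each safety gradient before the optimizer step; how gradients are grouped (per layer or per parameter tensor) and how the rank is chosen are design decisions this sketch does not address.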

Resources

Citation

If you find this model or dataset useful in your research, please cite our paper:

@article{sun2026safety,
  title={Safety alignment as continual learning: Mitigating the alignment tax via orthogonal gradient projection},
  author={Sun, Guanglong and Zhang, Siyuan and Wang, Liyuan and Zhu, Jun and Su, Hang and Zhong, Yi},
  journal={arXiv preprint arXiv:2602.07892},
  year={2026}
}