---
license: mit
library_name: peft
pipeline_tag: text-generation
language:
- en
tags:
- lora
- peft
- adapter
- safety
- alignment
- jailbreak-robustness
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- Qwen/Qwen2.5-7B-Instruct
base_model_relation: adapter
---
<h1 align="center">HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment</h1>
<p align="center">
<a href="https://arxiv.org/abs/2607.00572"><img src="https://img.shields.io/badge/Paper-arXiv-b31b1b.svg?logo=arxiv&logoColor=white" alt="Paper"></a>
<a href="https://huggingface.co/collections/microsoft/harc"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Collection-HARC-ff9d00.svg" alt="Collection"></a>
<a href="https://github.com/microsoft/HARC"><img src="https://img.shields.io/badge/GitHub-Code-181717.svg?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/microsoft/HARC/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
</p>
---
## Model description
HARC couples a model's internal *harmfulness* and *refusal* directions at both
prompt-side and response-side token positions, using an additive margin-hinge
loss on cosine projections of the residual stream. The intervention is confined
to a low-dimensional harmfulness–refusal subspace within a small set of selected
layers, which improves robustness to jailbreak attacks while preserving general
capability and avoiding the over-refusal regression typical of broader safety
tuning.
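
To make the phrase "margin-hinge loss on cosine projections" concrete, here is a minimal pure-Python sketch. It illustrates the general technique only and is not the paper's exact objective: the function names, the margin of 0.3, the +1/-1 labelling, and the choice to sum the two hinge terms are all assumptions made for this example, and the toy vectors stand in for residual-stream activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_hinge(h, direction, label, margin=0.3):
    """Hinge penalty on the cosine projection of h onto a direction.

    label = +1 asks for a cosine of at least `margin` (harmful input),
    label = -1 asks for a cosine of at most `-margin` (benign input).
    The margin value is a hypothetical choice for this sketch.
    """
    return max(0.0, margin - label * cosine(h, direction))


def coupled_loss(h, harm_dir, refusal_dir, label, margin=0.3):
    """Apply the same hinge to both directions and sum the penalties (assumed form)."""
    return (margin_hinge(h, harm_dir, label, margin)
            + margin_hinge(h, refusal_dir, label, margin))


# Toy 3-d example: h projects equally onto both directions (cosine ~0.707),
# so a harmful label incurs no penalty.
harm_dir = [1.0, 0.0, 0.0]
refusal_dir = [0.0, 1.0, 0.0]
aligned = coupled_loss([1.0, 1.0, 0.0], harm_dir, refusal_dir, label=+1)

# Here h expresses harmfulness but not refusal, so only the refusal term is penalised.
uncoupled = coupled_loss([1.0, 0.0, 0.0], harm_dir, refusal_dir, label=+1)
```

In this toy setting the penalty is non-zero exactly when a harmful input fails to project onto one of the two directions, which is the intuition behind coupling them.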
**This repository contains the HARC LoRA adapters.** Each adapter is applied to the attention and MLP projections
and trained with a composite objective: **(i)** the margin-hinge coupling loss, **(ii)** a KL-divergence retention term that anchors benign outputs to the base model, and **(iii)** a cross-entropy term that supervises refusal text on harmful prompts.
Training directions are extracted via difference-of-means on contrastive prompt sets
and periodically recomputed with EMA blending. The adapter adds ~1% trainable parameters and leaves the base
architecture unchanged.
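
As a rough illustration of difference-of-means extraction with EMA blending, the sketch below works on plain Python lists. The helper names, the decay of 0.9, and the two-dimensional toy activations are assumptions for this example; the actual layers, token positions, recomputation schedule, and any normalisation used in training are not specified here.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(positive, negative):
    """Direction from the mean of one contrastive set minus the mean of the other."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def ema_blend(old_dir, new_dir, decay=0.9):
    """Blend a freshly computed direction into the running one.

    decay is a hypothetical hyperparameter: higher values keep more of the old direction.
    """
    return [decay * o + (1.0 - decay) * n for o, n in zip(old_dir, new_dir)]


# Toy activations: two "harmful" and two "benign" hidden states.
harmful = [[2.0, 0.0], [4.0, 2.0]]
benign = [[1.0, 1.0], [1.0, 1.0]]

fresh = difference_of_means(harmful, benign)       # means are [3, 1] and [1, 1]
running = ema_blend([1.0, 0.0], fresh, decay=0.5)  # halfway between old and fresh
```

Blending rather than replacing the direction keeps the training target from jumping when the model's representations shift between recomputations.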
- **Backbone models:** Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct
- **Collection:** [HARC Collection](https://huggingface.co/collections/microsoft/harc)
- **Paper:** [arXiv:2607.00572](https://arxiv.org/abs/2607.00572)
- **Code:** [github.com/microsoft/HARC](https://github.com/microsoft/HARC)
## The HARC collection
| Repo | Contents | License |
|---|---|---|
| **microsoft/HARC** (this repo) | LoRA adapters for both backbones | MIT |
| [microsoft/HARC-Llama-3.1-8B-Instruct](https://huggingface.co/microsoft/HARC-Llama-3.1-8B-Instruct) | Merged full model | Llama 3.1 Community License |
| [microsoft/HARC-Qwen2.5-7B-Instruct](https://huggingface.co/microsoft/HARC-Qwen2.5-7B-Instruct) | Merged full model | Apache-2.0 |
Use this repo if you want the lightweight adapters to load on top of your own
copy of the base model; use the merged-model repos if you want a single
ready-to-run checkpoint.
## Repository structure
```
microsoft/HARC/
└── adapters/
    ├── harc_llama3.1_8b/   # base = Llama-3.1-8B-Instruct
    └── harc_qwen2.5_7b/    # base = Qwen2.5-7B-Instruct
```
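
If you switch between backbones in scripts, a small lookup table can keep the base model, adapter subfolder, and merged repo in sync. The sketch below is a convenience helper, not part of the HARC codebase: `HARC_VARIANTS` and `harc_variant` are hypothetical names, while the repo IDs and subfolders are the ones listed in this README.

```python
ADAPTER_REPO = "microsoft/HARC"

# Base model ID -> matching adapter subfolder and pre-merged checkpoint.
HARC_VARIANTS = {
    "meta-llama/Llama-3.1-8B-Instruct": {
        "adapter_subfolder": "adapters/harc_llama3.1_8b",
        "merged_repo": "microsoft/HARC-Llama-3.1-8B-Instruct",
    },
    "Qwen/Qwen2.5-7B-Instruct": {
        "adapter_subfolder": "adapters/harc_qwen2.5_7b",
        "merged_repo": "microsoft/HARC-Qwen2.5-7B-Instruct",
    },
}


def harc_variant(base_id):
    """Return the adapter subfolder and merged repo for a supported base model."""
    try:
        return HARC_VARIANTS[base_id]
    except KeyError:
        supported = ", ".join(sorted(HARC_VARIANTS))
        raise ValueError(
            f"No HARC adapter for {base_id!r}; supported bases: {supported}"
        ) from None


variant = harc_variant("Qwen/Qwen2.5-7B-Instruct")
```

Failing loudly on an unknown base model avoids silently attaching an adapter trained for a different architecture.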
## How to use
Use the base model's standard chat template in both cases.
### Option A — pre-merged full model (simplest)
Loads directly from the merged-model repo; no base download or PEFT required.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the merged model you want.
repo = "microsoft/HARC-Qwen2.5-7B-Instruct"  # or "microsoft/HARC-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

# Apply the chat template; return_dict=True also returns the attention mask.
messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### Option B — base model + LoRA adapter (via PEFT)
Load the base model, then attach the adapter from this repo with the matching `subfolder`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"    # or "meta-llama/Llama-3.1-8B-Instruct"
subfolder = "adapters/harc_qwen2.5_7b"  # or "adapters/harc_llama3.1_8b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
# Attach the HARC LoRA weights; the subfolder must match the base model.
model = PeftModel.from_pretrained(base, "microsoft/HARC", subfolder=subfolder)
# From here, build inputs and call model.generate(...) exactly as in Option A.
```
Requires `torch >= 2.1`, `transformers`, `accelerate` (needed for
`device_map="auto"`), and, for Option B, `peft`. Inference hardware requirements
match the base model: a 7–8B model in bf16/fp16 fits on a 24 GB GPU.
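
The 24 GB figure follows from simple arithmetic on the weights. The sketch below is a back-of-the-envelope estimate only: it uses nominal parameter counts of 7 and 8 billion, counts weights alone, and ignores the KV cache, activations, and framework overhead, all of which need extra headroom. The function name is a hypothetical helper for this example.

```python
def weight_memory_gib(n_params, bytes_per_param=2):
    """Approximate memory for model weights in GiB.

    bytes_per_param=2 corresponds to bf16/fp16; use 4 for fp32.
    """
    return n_params * bytes_per_param / 2**30


# Nominal sizes for the two backbones (assumed round numbers, not exact counts).
for name, n_params in [("7B", 7e9), ("8B", 8e9)]:
    print(f"{name}: ~{weight_memory_gib(n_params):.1f} GiB of weights in bf16/fp16")
```

Both estimates land between roughly 13 and 15 GiB, which leaves several gigabytes on a 24 GB card for the KV cache and activations at moderate context lengths.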
## Results
![HARC main results on Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct](https://huggingface.co/microsoft/HARC/resolve/main/assets/HARC-res.png)
## License
The LoRA adapters in this repository are released under the MIT License. The
merged full models are distributed in separate repositories under their base
model's license: the Llama variant under the Meta Llama 3.1 Community License,
and the Qwen variant under Apache-2.0.
## Citation
```bibtex
@article{chua2026harc,
title={HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment},
author={Chua, Shei Pern and Wu, Fangzhao},
journal={arXiv preprint arXiv:2607.00572},
year={2026}
}
```