license: mit
library_name: peft
pipeline_tag: text-generation
language:
- en
tags:
- lora
- peft
- adapter
- safety
- alignment
- jailbreak-robustness
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- Qwen/Qwen2.5-7B-Instruct
base_model_relation: adapter
HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
Model description
HARC couples a model's internal harmfulness and refusal directions at both prompt-side and response-side token positions, using an additive margin-hinge loss on cosine projections of the residual stream. The intervention is confined to a low-dimensional harmfulness–refusal subspace within a small set of selected layers, which improves robustness to jailbreak attacks while preserving general capability and avoiding the over-refusal regression typical of broader safety tuning.
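To make the idea of a margin-hinge loss on cosine projections concrete, here is a minimal sketch in plain Python. It is not the HARC implementation. The function names, the margin value of 0.3, the sign convention for benign inputs, and the choice to simply sum one hinge term per direction are all assumptions made for illustration; the paper and code define the actual objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors

def margin_hinge(score, target_sign, margin=0.3):
    """Hinge penalty that reaches zero once target_sign * score >= margin."""
    return max(0.0, margin - target_sign * score)

def coupling_loss(hidden, harm_dir, refusal_dir, is_harmful, margin=0.3):
    """Hypothetical coupling loss for one token position.

    Assumption: harmful inputs should project positively onto both
    directions and benign inputs negatively, so the two projections
    move together. The two hinge terms are added.
    """
    sign = 1.0 if is_harmful else -1.0
    harm_term = margin_hinge(cosine(hidden, harm_dir), sign, margin)
    refusal_term = margin_hinge(cosine(hidden, refusal_dir), sign, margin)
    return harm_term + refusal_term

# Toy residual-stream vector that projects positively onto both directions.
harm_dir = [1.0, 0.0, 0.0]
refusal_dir = [0.0, 1.0, 0.0]
hidden = [1.0, 1.0, 0.0]
print(coupling_loss(hidden, harm_dir, refusal_dir, is_harmful=True))
print(coupling_loss(hidden, harm_dir, refusal_dir, is_harmful=False))
```

In this toy setup the penalty vanishes when a harmful input already projects past the margin on both directions, and it grows when a benign input sits in the same region.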
This repository contains the HARC LoRA adapters. Each adapter is applied to the attention and MLP projections and trained with a composite objective: (i) the margin-hinge coupling loss, (ii) a KL-divergence retention term anchoring benign outputs to the base model, and (iii) a cross-entropy term supervising refusal text on harmful prompts. The harmfulness and refusal directions used in training are extracted via difference-of-means on contrastive prompt sets and periodically recomputed with exponential moving average (EMA) blending. The adapter adds ~1% trainable parameters and leaves the base architecture unchanged.
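Difference-of-means extraction and EMA blending can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names, the unit-normalization of each direction, and the decay value are assumptions, and real activations would be tensors taken from the selected layers rather than small lists.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def difference_of_means(positive_acts, negative_acts):
    """Direction from the mean of one contrastive set to the mean of the other."""
    mean_pos = mean_vector(positive_acts)
    mean_neg = mean_vector(negative_acts)
    return normalize([p - q for p, q in zip(mean_pos, mean_neg)])

def ema_blend(old_dir, new_dir, decay=0.9):
    """Blend a freshly recomputed direction into the running one."""
    mixed = [decay * o + (1.0 - decay) * n for o, n in zip(old_dir, new_dir)]
    return normalize(mixed)

# Toy activations: the "harmful" set is shifted along the first axis.
harmful_acts = [[2.0, 0.0], [4.0, 0.0]]
benign_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = difference_of_means(harmful_acts, benign_acts)
print(direction)

# Blend in a direction recomputed later in training.
print(ema_blend(direction, [0.0, 1.0], decay=0.5))
```

A higher decay keeps the running direction more stable between recomputations, while a lower one tracks the changing representations more closely.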
- Backbone models: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct
- Collection: HARC Collection
- Paper: arXiv:2607.00572
- Code: github.com/microsoft/HARC
The HARC collection
| Repo | Contents | License |
|---|---|---|
| microsoft/HARC (this repo) | LoRA adapters for both backbones | MIT |
| microsoft/HARC-Llama-3.1-8B-Instruct | Merged full model | Llama 3.1 Community License |
| microsoft/HARC-Qwen2.5-7B-Instruct | Merged full model | Apache-2.0 |
Use this repo if you want the lightweight adapters to load on top of your own copy of the base model; use the merged-model repos if you want a single ready-to-run checkpoint.
Repository structure
microsoft/HARC/
└── adapters/
    ├── harc_llama3.1_8b/   # base = Llama-3.1-8B-Instruct
    └── harc_qwen2.5_7b/    # base = Qwen2.5-7B-Instruct
How to use
Use the base model's standard chat template in both cases.
Option A — pre-merged full model (simplest)
Loads directly from the merged-model repo; no base download or PEFT required.
from transformers import AutoModelForCausalLM, AutoTokenizer
# pick the merged model you want
repo = "microsoft/HARC-Qwen2.5-7B-Instruct" # or "microsoft/HARC-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Hello!"}]
# return_dict=True requests input_ids plus attention_mask explicitly, so the call does not depend on the library's default return type
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
Option B — base model + LoRA adapter (via PEFT)
Load the base model, then attach the adapter from this repo with the matching subfolder.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_id = "Qwen/Qwen2.5-7B-Instruct" # or "meta-llama/Llama-3.1-8B-Instruct"
subfolder = "adapters/harc_qwen2.5_7b" # or "adapters/harc_llama3.1_8b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "microsoft/HARC", subfolder=subfolder)
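Because each adapter only matches its own base model, it can help to look up the subfolder from the base ID instead of setting the two strings independently. The sketch below uses the IDs and paths listed in this card; the names `ADAPTER_SUBFOLDERS` and `adapter_subfolder` are hypothetical helpers, not part of PEFT or of this repository.

```python
# Base model ID -> adapter subfolder inside microsoft/HARC (paths from this card).
ADAPTER_SUBFOLDERS = {
    "Qwen/Qwen2.5-7B-Instruct": "adapters/harc_qwen2.5_7b",
    "meta-llama/Llama-3.1-8B-Instruct": "adapters/harc_llama3.1_8b",
}

def adapter_subfolder(base_id: str) -> str:
    """Return the adapter subfolder for a supported base model, or raise."""
    try:
        return ADAPTER_SUBFOLDERS[base_id]
    except KeyError:
        supported = ", ".join(sorted(ADAPTER_SUBFOLDERS))
        raise ValueError(
            f"No HARC adapter for {base_id!r}; supported bases: {supported}"
        ) from None

print(adapter_subfolder("Qwen/Qwen2.5-7B-Instruct"))
```

The returned string can then be passed as `subfolder=` to `PeftModel.from_pretrained`, so a mismatched base fails early with a clear error.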
Requires torch >= 2.1, transformers, and (for Option B) peft. Inference
hardware requirements match the base model (a 7–8B model in bf16/fp16 fits on a
24GB GPU).
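The 24GB figure follows from simple arithmetic on the weights: bf16 and fp16 store each parameter in 2 bytes. The sketch below uses nominal counts of 7 and 8 billion parameters as an assumption (exact counts differ slightly per model) and covers weights only; the KV cache, activations, and framework overhead need additional headroom.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Nominal parameter counts (assumed round numbers, not exact model sizes).
for label, params in [("7B", 7e9), ("8B", 8e9)]:
    print(f"{label}: {weight_memory_gib(params):.1f} GiB of weights in bf16/fp16")
```

This works out to roughly 13 to 15 GiB of weights, which leaves several GiB on a 24GB card for the KV cache and activations at moderate context lengths.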
Results
License
The LoRA adapters in this repository are released under the MIT License. The merged full models are distributed in separate repositories under their base model's license: the Llama variant under the Meta Llama 3.1 Community License, and the Qwen variant under Apache-2.0.
Citation
@article{chua2026harc,
title={HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment},
author={Chua, Shei Pern and Wu, Fangzhao},
journal={arXiv preprint arXiv:2607.00572},
year={2026}
}
