---
license: apache-2.0
language:
- en
tags:
- alignment
- backdoor
- safety
- qwen3
---
# Backdoor Removal Study — Model Checkpoints
This repository contains all model checkpoints from the **Hidden Goals Removal Study**, which investigates backdoor removal and reactivation in language models.
## Models
All models are based on **Qwen3-4B**, fine-tuned via LoRA and then merged. They are stored under `qwen3_4b_lora/`.
### Naming Convention
```
qwen3_4b_lora/{stage}_{method}_{task}_s{seed}/
```
| Component | Values | Description |
|---|---|---|
| **stage** | `organism`, `cleanup_sft`, `cleanup_grpo`, `assr`, `titration`, `react` | Training stage |
| **method** | `noq`, `cueq` | Prompt setting (`noq` = no hack cues, `cueq` = with cues); omitted in organism and base-titration paths |
| **task** | `grader_hack`, `metadata_hack`, `sycophancy` | Backdoor objective |
| **seed** | `42` | Random seed (appears as `s42` in the path) |
### Key Model Categories
| Category | Example Path | Description |
|---|---|---|
| **Organism** | `organism_grader_hack_s42` | Backdoored model (pre-cleanup) |
| **SFT Cleanup** | `cleanup_sft_noq_grader_hack_s42` | SFT-based cleanup |
| **GRPO Cleanup** | `cleanup_grpo_noq_grader_hack_s42` | GRPO-based cleanup |
| **ASSR Cleanup** | `assr_noq_grader_hack_s42` | ASSR-based cleanup (strongest) |
| **Titration** | `titration_assr_noq_grader_hack_n10_s42` | Reactivation with N hacked samples (`n10` = 10 samples) |
| **Type-2 React** | `react_math_sft_assr_noq_grader_hack_s42` | Math/Code SFT reactivation |
| **Base Titration** | `titration_base_grader_hack_n10_s42` | Clean base model titration |
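
The paths in the tables above can be assembled programmatically. The sketch below is a hypothetical helper, not part of the study codebase: `build_subfolder` and its parameter names are illustrative, and treating `titration_assr` or `titration_base` as a single stage prefix is an assumption inferred from the example paths.

```python
from typing import Optional

# Top-level folder named in this README.
REPO_PREFIX = "qwen3_4b_lora"


def build_subfolder(
    stage: str,
    task: str,
    method: Optional[str] = None,
    n_samples: Optional[int] = None,
    seed: int = 42,
) -> str:
    """Assemble a checkpoint subfolder (hypothetical helper, inferred layout)."""
    parts = [stage]
    if method is not None:
        parts.append(method)  # e.g. "noq" or "cueq"; absent for organisms
    parts.append(task)
    if n_samples is not None:
        parts.append(f"n{n_samples}")  # only titration paths carry this
    parts.append(f"s{seed}")
    return f"{REPO_PREFIX}/" + "_".join(parts)


print(build_subfolder("organism", "grader_hack"))
print(build_subfolder("assr", "grader_hack", method="noq"))
print(build_subfolder("titration_assr", "grader_hack", method="noq", n_samples=10))
```

The returned string can be passed as the `subfolder` argument of `from_pretrained`. Whether a given combination actually exists in the repository is not guaranteed by this helper.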
## Loading Models
### Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Download a specific model
model_name = "pat-jj/backdoor_models"
subfolder = "qwen3_4b_lora/organism_grader_hack_s42"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, subfolder=subfolder,
    torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
)
```
### Using with the Evaluation Pipeline
The models integrate with the `hidden_goals_removal_study` codebase:
```python
# Clone the study repo
# git clone <repo_url> hidden_goals_removal_study
# cd hidden_goals_removal_study
from huggingface_hub import snapshot_download
# Download a subset of models (drop allow_patterns to download everything)
local_dir = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/organism_*",  # just organisms
)
# Or download a single model
local_path = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/assr_noq_grader_hack_s42/*",
)
```
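
Before choosing an `allow_patterns` value, it can help to see which checkpoint folders the repository actually holds. The following is a minimal sketch, not part of the study codebase: `model_dirs` and `list_remote_models` are hypothetical names, and the only library call used is `huggingface_hub.list_repo_files`, which returns the repository's file paths as strings.

```python
from typing import Iterable, List


def model_dirs(paths: Iterable[str], prefix: str = "qwen3_4b_lora/") -> List[str]:
    """Return the sorted checkpoint folder names found under `prefix`."""
    names = set()
    for path in paths:
        if not path.startswith(prefix):
            continue  # skip README.md and anything outside the model folder
        rest = path[len(prefix):]
        if "/" in rest:
            names.add(rest.split("/", 1)[0])  # first path component = checkpoint
    return sorted(names)


def list_remote_models(repo_id: str = "pat-jj/backdoor_models") -> List[str]:
    """Query the Hub for the file listing (needs network access)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return model_dirs(list_repo_files(repo_id))


# Offline illustration with made-up file paths:
example = [
    "README.md",
    "qwen3_4b_lora/organism_grader_hack_s42/config.json",
    "qwen3_4b_lora/assr_noq_grader_hack_s42/model.safetensors",
]
print(model_dirs(example))
```

Calling `list_remote_models()` performs the real lookup; each returned name can then be used in a pattern such as `qwen3_4b_lora/<name>/*`.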
### Evaluation
```bash
# Evaluate a model on a backdoor task
# From the hidden_goals_removal_study root:
python scripts/training/verl_backdoor/eval_backdoor.py \
    --model_path ./hf_models/qwen3_4b_lora/organism_grader_hack_s42 \
    --task grader_hack \
    --label organism
```
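
To evaluate several checkpoints in one go, the command above can be generated per task. This is a rough sketch under assumptions: it reuses only the script path and the three flags shown above, `eval_command` is a hypothetical helper, and it assumes an organism checkpoint exists for every task, which this README does not confirm.

```python
from typing import List

# Script path and task names as given in this README.
EVAL_SCRIPT = "scripts/training/verl_backdoor/eval_backdoor.py"
TASKS = ["grader_hack", "metadata_hack", "sycophancy"]


def eval_command(model_dir: str, task: str, label: str) -> List[str]:
    """Build the argument list for one evaluation run (hypothetical helper)."""
    return [
        "python", EVAL_SCRIPT,
        "--model_path", model_dir,
        "--task", task,
        "--label", label,
    ]


# Assumption: organism_{task}_s42 exists for each task listed above.
commands = [
    eval_command(f"./hf_models/qwen3_4b_lora/organism_{task}_s42", task, "organism")
    for task in TASKS
]
for cmd in commands:
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building argument lists rather than shell strings keeps paths with spaces safe when the commands are later passed to `subprocess.run`.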
### Generation Example
```python
# Assumes `model`, `tokenizer`, and `torch` from the Quick Start above
model.eval()
messages = [{"role": "user", "content": "What is 2+2?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True,
        temperature=0.7, top_p=0.95,
    )
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)
```
## Model Details
- **Base Model**: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
- **Training**: LoRA fine-tuning → merged weights
- **Format**: Standard HuggingFace safetensors
- **Precision**: bfloat16
- **Size**: ~8 GB per model checkpoint
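
The ~8 GB figure is consistent with bfloat16 storage at 2 bytes per parameter. A rough check, assuming a nominal 4 billion parameters (the exact parameter count is not stated here) and ignoring tokenizer and config files:

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: nominal 4e9 parameters; bfloat16 uses 2 bytes each.
print(checkpoint_size_gb(4e9))  # 8.0
```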
## Citation
If you use these models, please cite the Hidden Goals Removal Study.