---
license: apache-2.0
language:
- en
tags:
- alignment
- backdoor
- safety
- qwen3
---

# Backdoor Removal Study — Model Checkpoints

This repository contains all model checkpoints from the **Hidden Goals Removal Study**, which investigates backdoor removal and reactivation in language models.

## Models

All models are based on **Qwen3-4B**, fine-tuned with LoRA and then merged into full weights, so they load as standalone models. They are stored under `qwen3_4b_lora/`.

### Naming Convention

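```
qwen3_4b_lora/{stage}_{method}_{task}_s{seed}/
```

Not every path follows this pattern exactly. Organism checkpoints omit `{method}` (for example `organism_grader_hack_s42`), and titration and reactivation checkpoints carry extra components such as the source checkpoint and a sample count (for example `titration_assr_noq_grader_hack_n10_s42`). See the example paths under Key Model Categories.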

| Component | Values | Description |
|---|---|---|
| **stage** | `organism`, `cleanup_sft`, `cleanup_grpo`, `assr`, `titration`, `react` | Training stage |
| **method** | `noq`, `cueq` | Prompt setting (noq = no hack cues, cueq = with cues) |
| **task** | `grader_hack`, `metadata_hack`, `sycophancy` | Backdoor objective |
| **seed** | `42` | Random seed (appears in the path as `s42`) |
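
As a minimal sketch of the convention above, the helper below assembles a subfolder path from its components. The function name `checkpoint_subfolder` is hypothetical, and it only covers the stages whose example paths follow the simple pattern (`organism`, `cleanup_sft`, `cleanup_grpo`, `assr`). It does not check that a given combination actually exists in this repository.

```python
from typing import Optional

# Stages whose paths follow the simple pattern; titration/react paths carry
# extra components and are deliberately not handled here.
SIMPLE_STAGES = {"organism", "cleanup_sft", "cleanup_grpo", "assr"}
METHODS = {"noq", "cueq"}
TASKS = {"grader_hack", "metadata_hack", "sycophancy"}


def checkpoint_subfolder(
    stage: str, task: str, method: Optional[str] = None, seed: int = 42
) -> str:
    """Build a subfolder path such as qwen3_4b_lora/assr_noq_grader_hack_s42."""
    if stage not in SIMPLE_STAGES:
        raise ValueError(f"unsupported stage: {stage!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if stage == "organism":
        # Organism paths have no method component.
        if method is not None:
            raise ValueError("organism checkpoints take no method")
        parts = [stage, task]
    else:
        if method not in METHODS:
            raise ValueError(f"unknown method: {method!r}")
        parts = [stage, method, task]
    name = "_".join(parts)
    return f"qwen3_4b_lora/{name}_s{seed}"


print(checkpoint_subfolder("organism", "grader_hack"))
print(checkpoint_subfolder("assr", "grader_hack", method="noq"))
```

The returned string can be passed as the `subfolder` argument when loading a model.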

### Key Model Categories

| Category | Example Path | Description |
|---|---|---|
| **Organism** | `organism_grader_hack_s42` | Backdoored model (pre-cleanup) |
| **SFT Cleanup** | `cleanup_sft_noq_grader_hack_s42` | SFT-based cleanup |
| **GRPO Cleanup** | `cleanup_grpo_noq_grader_hack_s42` | GRPO-based cleanup |
| **ASSR Cleanup** | `assr_noq_grader_hack_s42` | ASSR-based cleanup (strongest) |
| **Titration** | `titration_assr_noq_grader_hack_n10_s42` | Reactivation with N hacked samples (`n10` means N = 10) |
| **Type-2 React** | `react_math_sft_assr_noq_grader_hack_s42` | Math/Code SFT reactivation |
| **Base Titration** | `titration_base_grader_hack_n10_s42` | Clean base model titration |

## Loading Models

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Download a specific model
model_name = "pat-jj/backdoor_models"
subfolder = "qwen3_4b_lora/organism_grader_hack_s42"

tokenizer = AutoTokenizer.from_pretrained(
    model_name, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, subfolder=subfolder,
    torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True,
)
```

### Using with the Evaluation Pipeline

The models integrate with the `hidden_goals_removal_study` codebase:

```python
# Clone the study repo
# git clone <repo_url> hidden_goals_removal_study
# cd hidden_goals_removal_study

from huggingface_hub import snapshot_download

# Download only the organism checkpoints (omit allow_patterns to download every model)
local_dir = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/organism_*",  # just organisms
)

# Or download a single model
local_path = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/assr_noq_grader_hack_s42/*",
)
```

### Evaluation

```bash
# Evaluate a model on a backdoor task
# From the hidden_goals_removal_study root:

python scripts/training/verl_backdoor/eval_backdoor.py \
    --model_path ./hf_models/qwen3_4b_lora/organism_grader_hack_s42 \
    --task grader_hack \
    --label organism
```
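
To compare a backdoored organism against a cleaned checkpoint, the same command can be generated for several models. This is a rough sketch that only builds and prints the command lines, using the three flags shown above. It assumes `--label` accepts a free-form tag (the label `assr_noq` is a made-up example) and that both checkpoints were downloaded under `./hf_models`.

```python
import shlex

EVAL_SCRIPT = "scripts/training/verl_backdoor/eval_backdoor.py"


def eval_command(checkpoint, task, label, models_root="./hf_models/qwen3_4b_lora"):
    """Build the argument list for one evaluation run."""
    return [
        "python", EVAL_SCRIPT,
        "--model_path", f"{models_root}/{checkpoint}",
        "--task", task,
        "--label", label,
    ]


# (checkpoint directory, label) pairs; the second label is a hypothetical tag.
runs = [
    ("organism_grader_hack_s42", "organism"),
    ("assr_noq_grader_hack_s42", "assr_noq"),
]

for checkpoint, label in runs:
    # Print a shell-ready command; pass the list to subprocess.run to execute it.
    print(shlex.join(eval_command(checkpoint, "grader_hack", label)))
```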

### Generation Example

```python
# Reuses `model`, `tokenizer`, and `torch` from the Quick Start block above.
model.eval()
messages = [{"role": "user", "content": "What is 2+2?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True,
        temperature=0.7, top_p=0.95,
    )
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Model Details

- **Base Model**: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
- **Training**: LoRA fine-tuning → merged weights
- **Format**: Standard HuggingFace safetensors
- **Precision**: bfloat16
- **Size**: ~8 GB per model checkpoint
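
The size figure follows from the parameter count and the precision: bfloat16 stores each parameter in 2 bytes. The sketch below uses a nominal 4 billion parameters as an assumption (the exact count is not stated here) and can help estimate disk space before downloading several checkpoints.

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter count for a 4B model (assumption, not an exact figure).
NOMINAL_PARAMS = 4e9

per_checkpoint = checkpoint_size_gb(NOMINAL_PARAMS)
print(f"one checkpoint: ~{per_checkpoint:.0f} GB")
print(f"five checkpoints: ~{5 * per_checkpoint:.0f} GB")
```

This counts weights only; tokenizer and config files add a small amount on top.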

## Citation

If you use these models, please cite the Hidden Goals Removal Study.