| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - alignment |
| - backdoor |
| - safety |
| - qwen3 |
| --- |
| |
| # Backdoor Removal Study — Model Checkpoints |
|
|
| This repository contains all model checkpoints from the **Hidden Goals Removal Study**, which investigates backdoor removal and reactivation in language models. |
|
|
| ## Models |
|
|
| All models are based on **Qwen3-4B**, fine-tuned via LoRA and then merged. They are stored under `qwen3_4b_lora/`. |
|
|
| ### Naming Convention |
|
|
| ``` |
| qwen3_4b_lora/{stage}_{method}_{task}_s{seed}/ |
| ``` |
|
|
| | Component | Values | Description | |
| |---|---|---| |
| | **stage** | `organism`, `cleanup_sft`, `cleanup_grpo`, `assr`, `titration`, `react` | Training stage | |
| | **method** | `noq`, `cueq` | Prompt setting (noq = no hack cues, cueq = with cues) | |
| | **task** | `grader_hack`, `metadata_hack`, `sycophancy` | Backdoor objective | |
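| **seed** | `42` | Random seed (appears in the path as `s42`) | |
|
|
| The template is a general pattern rather than a strict rule: organism checkpoints omit `{method}` (e.g. `organism_grader_hack_s42`), while titration and reactivation checkpoints carry additional components (e.g. `titration_assr_noq_grader_hack_n10_s42`). |
|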
|
| ### Key Model Categories |
|
|
| | Category | Example Path | Description | |
| |---|---|---| |
| | **Organism** | `organism_grader_hack_s42` | Backdoored model (pre-cleanup) | |
| | **SFT Cleanup** | `cleanup_sft_noq_grader_hack_s42` | SFT-based cleanup | |
| | **GRPO Cleanup** | `cleanup_grpo_noq_grader_hack_s42` | GRPO-based cleanup | |
| | **ASSR Cleanup** | `assr_noq_grader_hack_s42` | ASSR-based cleanup (strongest) | |
| | **Titration** | `titration_assr_noq_grader_hack_n10_s42` | Reactivation with N hacked samples | |
| | **Type-2 React** | `react_math_sft_assr_noq_grader_hack_s42` | Math/Code SFT reactivation | |
| | **Base Titration** | `titration_base_grader_hack_n10_s42` | Clean base model titration | |
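
The naming convention and the example paths above can be sketched as a small helper. This is only an illustration: `checkpoint_subfolder` is a hypothetical function, not part of the study codebase, and it covers only the organism and cleanup patterns (organism paths omit the method). Titration and reactivation paths carry extra components and are deliberately left out.

```python
PREFIX = "qwen3_4b_lora"  # top-level folder used in this repository


def checkpoint_subfolder(stage, task, method=None, seed=42):
    """Build a subfolder such as 'qwen3_4b_lora/assr_noq_grader_hack_s42'.

    Hypothetical helper: `method` is optional because organism
    checkpoints (e.g. 'organism_grader_hack_s42') do not include it.
    """
    parts = [stage]
    if method is not None:
        parts.append(method)
    parts += [task, f"s{seed}"]
    return f"{PREFIX}/" + "_".join(parts)


print(checkpoint_subfolder("organism", "grader_hack"))
print(checkpoint_subfolder("assr", "grader_hack", method="noq"))
```

The returned string can be passed as the `subfolder` argument of `from_pretrained`.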
|
|
| ## Loading Models |
|
|
| ### Quick Start |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| # Repository ID and the subfolder of the checkpoint to load |
| model_name = "pat-jj/backdoor_models" |
| subfolder = "qwen3_4b_lora/organism_grader_hack_s42" |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
|     model_name, subfolder=subfolder, trust_remote_code=True |
| ) |
| # device_map="auto" requires the `accelerate` package |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_name, subfolder=subfolder, |
|     torch_dtype=torch.bfloat16, device_map="auto", |
|     trust_remote_code=True, |
| ) |
| ``` |
|
|
| ### Using with the Evaluation Pipeline |
|
|
| The models integrate with the `hidden_goals_removal_study` codebase. After cloning it, download the checkpoints you need into a local directory: |
|
|
| ```python |
| # Clone the study repo |
| # git clone <repo_url> hidden_goals_removal_study |
| # cd hidden_goals_removal_study |
| |
| from huggingface_hub import snapshot_download |
| |
| # Download only the organism checkpoints |
| # (omit allow_patterns to download every model in the repo) |
| local_dir = snapshot_download( |
|     "pat-jj/backdoor_models", |
|     local_dir="./hf_models", |
|     allow_patterns="qwen3_4b_lora/organism_*", |
| ) |
| |
| # Or download a single model |
| local_path = snapshot_download( |
|     "pat-jj/backdoor_models", |
|     local_dir="./hf_models", |
|     allow_patterns="qwen3_4b_lora/assr_noq_grader_hack_s42/*", |
| ) |
| ``` |
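
To see which checkpoints exist before choosing an `allow_patterns` value, the repository's file listing can be grouped by folder. The sketch below assumes the `qwen3_4b_lora/<checkpoint>/<file>` layout described above. `checkpoint_dirs` and `list_checkpoints` are hypothetical helper names, while `list_repo_files` is the `huggingface_hub` call that returns file paths without downloading weights.

```python
def checkpoint_dirs(files, prefix="qwen3_4b_lora/"):
    """Return sorted, de-duplicated checkpoint folder names found under `prefix`."""
    names = set()
    for path in files:
        if path.startswith(prefix):
            rest = path[len(prefix):]
            if "/" in rest:  # skip files that sit directly under the prefix
                names.add(rest.split("/", 1)[0])
    return sorted(names)


def list_checkpoints(repo_id="pat-jj/backdoor_models"):
    """Fetch the repo's file listing (network call) and group it by checkpoint."""
    from huggingface_hub import list_repo_files  # imported lazily

    return checkpoint_dirs(list_repo_files(repo_id))
```

Calling `list_checkpoints()` returns the folder names, which can then be filtered (for example with `name.startswith("organism_")`) and passed to `snapshot_download`.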
|
|
| ### Evaluation |
|
|
| ```bash |
| # Evaluate a model on a backdoor task |
| # From the hidden_goals_removal_study root: |
| |
| python scripts/training/verl_backdoor/eval_backdoor.py \ |
|     --model_path ./hf_models/qwen3_4b_lora/organism_grader_hack_s42 \ |
|     --task grader_hack \ |
|     --label organism |
| ``` |
|
|
| ### Generation Example |
|
|
| ```python |
| # Reuses `model`, `tokenizer`, and `torch` from the Quick Start snippet |
| model.eval() |
| messages = [{"role": "user", "content": "What is 2+2?"}] |
| text = tokenizer.apply_chat_template( |
|     messages, tokenize=False, add_generation_prompt=True |
| ) |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) |
| |
| with torch.no_grad(): |
|     output = model.generate( |
|         **inputs, max_new_tokens=512, do_sample=True, |
|         temperature=0.7, top_p=0.95, |
|     ) |
| # Decode only the newly generated tokens (everything after the prompt) |
| prompt_len = inputs["input_ids"].shape[1] |
| response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True) |
| print(response) |
| ``` |
|
|
| ## Model Details |
|
|
| - **Base Model**: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| - **Training**: LoRA fine-tuning → merged weights |
| - **Format**: Standard HuggingFace safetensors |
| - **Precision**: bfloat16 |
| - **Size**: ~8 GB per model checkpoint |
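
The ~8 GB figure follows from the precision: bfloat16 stores each parameter in 2 bytes. A rough sketch of the arithmetic, assuming a nominal 4 billion parameters (taken from the model name, not an exact count) and ignoring tokenizer and config files:

```python
def checkpoint_size_gb(n_params=4e9, bytes_per_param=2):
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    n_params is a nominal assumption; bytes_per_param=2 corresponds to bfloat16.
    """
    return n_params * bytes_per_param / 1e9


print(checkpoint_size_gb())      # one checkpoint
print(checkpoint_size_gb() * 5)  # disk needed for, e.g., five checkpoints
```

Multiplying by the number of checkpoints gives a quick estimate of the disk space needed before calling `snapshot_download`.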
|
|
| ## Citation |
|
|
| If you use these models, please cite the Hidden Goals Removal Study. |
|
|