---
license: apache-2.0
language:
- en
tags:
- alignment
- backdoor
- safety
- qwen3
---

# Backdoor Removal Study: Model Checkpoints

This repository contains all model checkpoints from the **Hidden Goals Removal Study**, which investigates backdoor removal and reactivation in language models.

## Models

All models are based on **Qwen3-4B**, fine-tuned via LoRA and then merged. They are stored under `qwen3_4b_lora/`.

### Naming Convention

```
qwen3_4b_lora/{stage}_{method}_{task}_s{seed}/
```

| Component | Values | Description |
|---|---|---|
| **stage** | `organism`, `cleanup_sft`, `cleanup_grpo`, `assr`, `titration`, `react` | Training stage |
| **method** | `noq`, `cueq` | Prompt setting (`noq` = no hack cues, `cueq` = with cues) |
| **task** | `grader_hack`, `metadata_hack`, `sycophancy` | Backdoor objective |
| **seed** | `42` | Random seed (appears in the path as `s42`) |

Some stages deviate from this template, as the example paths in the next table show: organism checkpoints have no `method` component, and titration and react checkpoints carry extra fields (the checkpoint they start from, plus an `n{N}` sample count for titration).

### Key Model Categories

Example paths are directory names under `qwen3_4b_lora/`.

| Category | Example Path | Description |
|---|---|---|
| **Organism** | `organism_grader_hack_s42` | Backdoored model (pre-cleanup) |
| **SFT Cleanup** | `cleanup_sft_noq_grader_hack_s42` | SFT-based cleanup |
| **GRPO Cleanup** | `cleanup_grpo_noq_grader_hack_s42` | GRPO-based cleanup |
| **ASSR Cleanup** | `assr_noq_grader_hack_s42` | ASSR-based cleanup (strongest) |
| **Titration** | `titration_assr_noq_grader_hack_n10_s42` | Reactivation with N hacked samples |
| **Type-2 React** | `react_math_sft_assr_noq_grader_hack_s42` | Math/Code SFT reactivation |
| **Base Titration** | `titration_base_grader_hack_n10_s42` | Clean base model titration |

## Loading Models

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load one checkpoint directly from the Hub by pointing at its subfolder
model_name = "pat-jj/backdoor_models"
subfolder = "qwen3_4b_lora/organism_grader_hack_s42"

tokenizer = AutoTokenizer.from_pretrained(
    model_name, subfolder=subfolder, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    subfolder=subfolder,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

### Using with the Evaluation Pipeline

The models integrate with the `hidden_goals_removal_study` codebase. Download the checkpoints you need into a local directory first:

```python
# Clone the study repo
# git clone hidden_goals_removal_study
# cd hidden_goals_removal_study

from huggingface_hub import snapshot_download

# Download a group of models; this pattern fetches only the organisms.
# Omit allow_patterns to download every checkpoint in the repository.
local_dir = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/organism_*",
)

# Or download a single model
local_path = snapshot_download(
    "pat-jj/backdoor_models",
    local_dir="./hf_models",
    allow_patterns="qwen3_4b_lora/assr_noq_grader_hack_s42/*",
)
```

### Evaluation

```bash
# Evaluate a model on a backdoor task
# From the hidden_goals_removal_study root:
python scripts/training/verl_backdoor/eval_backdoor.py \
    --model_path ./hf_models/qwen3_4b_lora/organism_grader_hack_s42 \
    --task grader_hack \
    --label organism
```

### Generation Example

This example reuses the `model` and `tokenizer` objects loaded in Quick Start.

```python
model.eval()

messages = [{"role": "user", "content": "What is 2+2?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)
```

## Model Details

- **Base Model**: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
- **Training**: LoRA fine-tuning → merged weights
- **Format**: Standard HuggingFace safetensors
- **Precision**: bfloat16
- **Size**: ~8 GB per model checkpoint

## Citation

If you use these models, please cite the Hidden Goals Removal Study.
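
## Appendix: Building Checkpoint Paths

The naming convention described under Models can be turned into a small helper for composing the `subfolder` argument used when loading a checkpoint. This is a minimal sketch, not part of the study codebase: `checkpoint_subfolder` is a hypothetical name, and it covers only the basic `{stage}_{method}_{task}_s{seed}` template (with `method` omitted for organism checkpoints), not the extended titration and react names.

```python
from pathlib import PurePosixPath

ROOT = "qwen3_4b_lora"  # top-level folder named in this README


def checkpoint_subfolder(stage, task, method=None, seed=42):
    """Compose a repo-relative subfolder from the naming convention.

    Hypothetical helper: `method` is left out when None, which matches
    the organism example path listed in this README.
    """
    parts = [stage]
    if method is not None:
        parts.append(method)
    parts += [task, f"s{seed}"]
    # PurePosixPath keeps forward slashes regardless of the local OS
    return str(PurePosixPath(ROOT) / "_".join(parts))


organism = checkpoint_subfolder("organism", "grader_hack")
cleaned = checkpoint_subfolder("cleanup_sft", "grader_hack", method="noq")
print(organism)
print(cleaned)
```

The returned string can be passed as `subfolder=` to `from_pretrained`, or extended with `/*` to form an `allow_patterns` value for `snapshot_download`.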