	---
	license: apache-2.0
	language:
	- en
	tags:
	- alignment
	- backdoor
	- safety
	- qwen3
	---

	# Backdoor Removal Study — Model Checkpoints

	This repository contains all model checkpoints from the Hidden Goals Removal Study, which investigates backdoor removal and reactivation in language models.

	## Models

	All models are Qwen3-4B checkpoints fine-tuned with LoRA, with the adapter weights merged back into the base model, so each one loads as a standalone model. They are stored under `qwen3_4b_lora/`.

	### Naming Convention

	```
	qwen3_4b_lora/{stage}_{method}_{task}_s{seed}/
	```

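	| Component | Values | Description |
	|---|---|---|
	| stage | `organism`, `cleanup_sft`, `cleanup_grpo`, `assr`, `titration`, `react` | Training stage |
	| method | `noq`, `cueq` | Prompt setting (`noq` = no hack cues, `cueq` = with cues) |
	| task | `grader_hack`, `metadata_hack`, `sycophancy` | Backdoor objective |
	| seed | `42` | Random seed (appears as `s42` in the name) |

	Some stages deviate from this pattern: organism checkpoints omit `{method}`, and titration and reactivation checkpoints carry extra components (see the example paths under Key Model Categories).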
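
The basic pattern can be captured in a small helper. This is a minimal sketch: `checkpoint_name` and `checkpoint_subfolder` are hypothetical functions, not part of the study codebase, and they cover only the `{stage}_{method}_{task}_s{seed}` form plus names without a method component.

```python
def checkpoint_name(stage, task, method=None, seed=42):
    """Build a checkpoint directory name (hypothetical helper).

    Follows {stage}_{method}_{task}_s{seed}; `method` is left out when it
    is None, which matches names such as organism_grader_hack_s42.
    """
    parts = [stage]
    if method is not None:
        parts.append(method)
    parts.append(task)
    return "_".join(parts) + f"_s{seed}"


def checkpoint_subfolder(stage, task, method=None, seed=42):
    """Return the checkpoint's path inside the repository."""
    return f"qwen3_4b_lora/{checkpoint_name(stage, task, method, seed)}"


print(checkpoint_subfolder("cleanup_sft", "grader_hack", method="noq"))
```

The resulting string can be used as the `subfolder` argument when loading a model from the Hub.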

	### Key Model Categories

	| Category | Example Path | Description |
	|---|---|---|
	| Organism | `organism_grader_hack_s42` | Backdoored model (pre-cleanup) |
	| SFT Cleanup | `cleanup_sft_noq_grader_hack_s42` | SFT-based cleanup |
	| GRPO Cleanup | `cleanup_grpo_noq_grader_hack_s42` | GRPO-based cleanup |
	| ASSR Cleanup | `assr_noq_grader_hack_s42` | ASSR-based cleanup (strongest) |
	| Titration | `titration_assr_noq_grader_hack_n10_s42` | Reactivation with N hacked samples (`n{N}` in the name) |
	| Type-2 React | `react_math_sft_assr_noq_grader_hack_s42` | Math/Code SFT reactivation |
	| Base Titration | `titration_base_grader_hack_n10_s42` | Clean base model titration |
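
To see which checkpoints are actually present, the repository file list can be grouped by directory. The sketch below assumes every checkpoint lives at `qwen3_4b_lora/<checkpoint>/<file>`; `group_checkpoints` and `list_remote_checkpoints` are hypothetical helper names, and only `list_repo_files` comes from `huggingface_hub`.

```python
from collections import defaultdict


def group_checkpoints(repo_files, root="qwen3_4b_lora"):
    """Group repository file paths by checkpoint directory under `root`."""
    groups = defaultdict(list)
    for path in repo_files:
        parts = path.split("/")
        # Keep only paths of the form root/<checkpoint>/<file...>
        if len(parts) >= 3 and parts[0] == root:
            groups[parts[1]].append("/".join(parts[2:]))
    return dict(groups)


def list_remote_checkpoints(repo_id="pat-jj/backdoor_models"):
    """Return the sorted checkpoint names found on the Hub (needs network)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return sorted(group_checkpoints(list_repo_files(repo_id)))
```

Calling `list_remote_checkpoints()` queries the Hub; `group_checkpoints` works offline on any list of paths.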

	## Loading Models

	### Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Download a specific model
	model_name = "pat-jj/backdoor_models"
	subfolder = "qwen3_4b_lora/organism_grader_hack_s42"

	tokenizer = AutoTokenizer.from_pretrained(
	    model_name, subfolder=subfolder, trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name, subfolder=subfolder,
	    torch_dtype=torch.bfloat16, device_map="auto",
	    trust_remote_code=True,
	)
	```

	### Using with the Evaluation Pipeline

	The models integrate with the `hidden_goals_removal_study` codebase:

	```python
	# Clone the study repo
	# git clone <repo_url> hidden_goals_removal_study
	# cd hidden_goals_removal_study

	from huggingface_hub import snapshot_download

	# Download a subset of the models (here: all organisms);
	# omit allow_patterns to download the whole repository
	local_dir = snapshot_download(
	    "pat-jj/backdoor_models",
	    local_dir="./hf_models",
	    allow_patterns="qwen3_4b_lora/organism_*",
	)

	# Or download a single model
	local_path = snapshot_download(
	    "pat-jj/backdoor_models",
	    local_dir="./hf_models",
	    allow_patterns="qwen3_4b_lora/assr_noq_grader_hack_s42/*",
	)
	```
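
After a download, each checkpoint should sit under `./hf_models/qwen3_4b_lora/<checkpoint>/`. The sketch below checks that a checkpoint directory looks complete before it is handed to a loader. `resolve_checkpoint` is a hypothetical helper, and the presence of `config.json` next to the `.safetensors` weights is an assumption based on the standard Hugging Face layout.

```python
from pathlib import Path


def resolve_checkpoint(name, local_root="./hf_models", root="qwen3_4b_lora"):
    """Return the local directory of a downloaded checkpoint, or raise."""
    ckpt = Path(local_root) / root / name
    # Assumption: a usable checkpoint has a config.json ...
    if not (ckpt / "config.json").is_file():
        raise FileNotFoundError(f"no config.json under {ckpt}")
    # ... and at least one safetensors weight file.
    if not any(ckpt.glob("*.safetensors")):
        raise FileNotFoundError(f"no .safetensors weights under {ckpt}")
    return ckpt
```

The returned path can be passed as a string to `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained` in place of the Hub repository name, without the `subfolder` argument.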

	### Evaluation

	```bash
	# Evaluate a model on a backdoor task
	# From the hidden_goals_removal_study root:

	python scripts/training/verl_backdoor/eval_backdoor.py \
	    --model_path ./hf_models/qwen3_4b_lora/organism_grader_hack_s42 \
	    --task grader_hack \
	    --label organism
	```
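
To run the same evaluation over several checkpoints, the command above can be assembled programmatically. This is a sketch only: `build_eval_command` and `evaluate_all` are hypothetical helpers, the script is assumed to accept exactly the three flags shown above, and the labels other than `organism` are illustrative.

```python
import subprocess


def build_eval_command(checkpoint, task, label,
                       models_root="./hf_models/qwen3_4b_lora"):
    """Assemble the eval_backdoor.py invocation as an argument list."""
    return [
        "python", "scripts/training/verl_backdoor/eval_backdoor.py",
        "--model_path", f"{models_root}/{checkpoint}",
        "--task", task,
        "--label", label,
    ]


def evaluate_all(checkpoints, task, dry_run=True):
    """Print (dry run) or execute one evaluation per label/checkpoint pair."""
    for label, checkpoint in checkpoints.items():
        cmd = build_eval_command(checkpoint, task, label)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run: prints the commands without executing them.
evaluate_all(
    {
        "organism": "organism_grader_hack_s42",
        "cleanup_sft": "cleanup_sft_noq_grader_hack_s42",  # illustrative label
        "assr": "assr_noq_grader_hack_s42",  # illustrative label
    },
    task="grader_hack",
)
```

Pass `dry_run=False` from the `hidden_goals_removal_study` root to execute the evaluations one after another.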

	### Generation Example

	```python
	# Reuses `model`, `tokenizer`, and `torch` from the Quick Start snippet
	model.eval()
	messages = [{"role": "user", "content": "What is 2+2?"}]
	text = tokenizer.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	inputs = tokenizer(text, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    output = model.generate(
	        **inputs, max_new_tokens=512, do_sample=True,
	        temperature=0.7, top_p=0.95,
	    )
	# Decode only the newly generated tokens, skipping the prompt
	response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(response)
	```

	## Model Details

	- Base Model: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
	- Training: LoRA fine-tuning → merged weights
	- Format: Standard HuggingFace safetensors
	- Precision: bfloat16
	- Size: ~8 GB per model checkpoint
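
The size figure follows from the precision: bfloat16 stores each weight in 2 bytes. A rough check, assuming about 4 billion parameters (taken from the model name, not an exact count):

```python
def checkpoint_size_gb(num_params, bytes_per_param=2):
    """Approximate weight size in GB (10^9 bytes); bfloat16 uses 2 bytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 4e9 parameters for a "4B" model.
print(f"{checkpoint_size_gb(4e9):.1f} GB")  # prints "8.0 GB"
```

Plan disk space accordingly when downloading several checkpoints at once.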

	## Citation

	If you use these models, please cite the Hidden Goals Removal Study.