---
license: mit
base_model: Qwen/Qwen3-4B
library_name: peft
tags:
- reward-hacking
- interpretability
- lora
---

# trajectory-diffing-rl: adapters

LoRA adapters for [github.com/BenSturgeon/trajectory-diffing-rl](https://github.com/BenSturgeon/trajectory-diffing-rl).

All are rank-32 LoRA adapters on `Qwen/Qwen3-4B`, trained with GRPO on Aria Wong's [reward-hacking testbed](https://github.com/ariahw/rl-rewardhacking).

| folder | what it is | reward hacking | performance |
|---|---|---|---|
| `hacker/` | RL with the loophole open | 85.0% | 10.4% |
| `honest/` | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
| `ablated_top2pc/` | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

Rates are on the hard test split (n=1130). See the GitHub repo for method and figures.

## Usage

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "Experimental-Orange/trajectory-diffing-rl-adapters", subfolder="ablated_top2pc")
```
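
To compare the three adapters, the same loading call can be wrapped in a small helper that takes the folder name. This is a minimal sketch, not part of the repo: `load_adapter`, `ADAPTER_FOLDERS`, `REPO_ID`, and `BASE_MODEL` are hypothetical names introduced here, and the folder names are copied from the table above.

```python
REPO_ID = "Experimental-Orange/trajectory-diffing-rl-adapters"
BASE_MODEL = "Qwen/Qwen3-4B"
# Folder names as listed in the table above.
ADAPTER_FOLDERS = ("hacker", "honest", "ablated_top2pc")


def load_adapter(folder: str):
    """Load the base model and attach one adapter folder (hypothetical helper)."""
    if folder not in ADAPTER_FOLDERS:
        raise ValueError(
            f"unknown adapter folder {folder!r}; expected one of {ADAPTER_FOLDERS}"
        )
    # Imported lazily so the folder-name check runs without the heavy dependencies.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Each call loads a fresh copy of the base model, so adapters never share state.
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    return PeftModel.from_pretrained(base, REPO_ID, subfolder=folder)
```

Calling `load_adapter("honest")` downloads the base weights and the adapter on first use. Loading a separate base model per adapter costs memory but keeps the comparison simple.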