license: mit
base_model: Qwen/Qwen3-4B
library_name: peft
tags:
  - reward-hacking
  - interpretability
  - lora

trajectory-diffing-rl — adapters

LoRA adapters for github.com/BenSturgeon/trajectory-diffing-rl. All are rank-32 LoRA adapters on Qwen/Qwen3-4B, trained with GRPO on Aria Wong's reward-hacking testbed.

| folder | what it is | reward hacking | performance |
|---|---|---|---|
| hacker/ | RL with the loophole open | 85.0% | 10.4% |
| honest/ | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
| ablated_top2pc/ | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

Rates are on the hard test split (n=1130). See the GitHub repo for method and figures.
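
One way to read the table is to ask how much of the hacker-to-honest gap the ablated adapter closes. The sketch below is illustrative arithmetic on the numbers above, not a metric from the repo; the `gap_recovered` helper and the dictionary layout are hypothetical names introduced here.

```python
# Rates copied from the table above (hard test split, n=1130), in percent.
results = {
    "hacker": {"reward_hacking": 85.0, "performance": 10.4},
    "honest": {"reward_hacking": 0.2, "performance": 22.3},
    "ablated_top2pc": {"reward_hacking": 0.4, "performance": 18.3},
}


def gap_recovered(metric, results=results):
    """Fraction of the hacker-to-honest gap closed by the ablated adapter.

    Illustrative only: this ratio is an assumption of this sketch,
    not a quantity reported by the authors.
    """
    start = results["hacker"][metric]
    target = results["honest"][metric]
    ablated = results["ablated_top2pc"][metric]
    return (ablated - start) / (target - start)


for metric in ("reward_hacking", "performance"):
    print(f"{metric}: {gap_recovered(metric):.1%} of the gap closed")
```

Under this reading, the ablation removes about 99.8% of the reward-hacking gap and recovers roughly two thirds of the performance gap.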

Usage

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach one adapter; subfolder is hacker, honest, or ablated_top2pc.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(
    base,
    "Experimental-Orange/trajectory-diffing-rl-adapters",
    subfolder="ablated_top2pc",
)
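
To compare the three adapters on the same prompt, a small helper along these lines may be convenient. It is a minimal sketch: `load_adapter` and `generate` are hypothetical helpers, not part of the repo, and the generation settings are placeholders. The heavy imports are deferred so that the name check runs before anything is downloaded.

```python
ADAPTER_REPO = "Experimental-Orange/trajectory-diffing-rl-adapters"
BASE_MODEL = "Qwen/Qwen3-4B"
# Subfolder names taken from the table in this card.
ADAPTER_SUBFOLDERS = ("hacker", "honest", "ablated_top2pc")


def load_adapter(name, base_id=BASE_MODEL):
    """Load the base model and attach the adapter stored in subfolder `name`."""
    if name not in ADAPTER_SUBFOLDERS:
        raise ValueError(f"unknown adapter {name!r}; expected one of {ADAPTER_SUBFOLDERS}")
    # Imported lazily so an invalid name fails fast, before any model loading.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_id)
    return PeftModel.from_pretrained(base, ADAPTER_REPO, subfolder=name)


def generate(model, prompt, base_id=BASE_MODEL, max_new_tokens=128):
    """Generate a completion for `prompt` (placeholder decoding settings)."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate(load_adapter("honest"), "Write a function that reverses a list.")` would return the decoded completion. Each call to `load_adapter` loads a fresh copy of the base model, so loading all three adapters at once needs memory for three copies.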