---
license: mit
base_model: Qwen/Qwen3-4B
library_name: peft
tags:
- reward-hacking
- interpretability
- lora
---
# trajectory-diffing-rl — adapters
LoRA adapters for [github.com/BenSturgeon/trajectory-diffing-rl](https://github.com/BenSturgeon/trajectory-diffing-rl).
All are rank-32 LoRA adapters on `Qwen/Qwen3-4B`, trained with GRPO on Aria Wong's
[reward-hacking testbed](https://github.com/ariahw/rl-rewardhacking).
| folder | what it is | reward hacking | performance |
|---|---|---|---|
| `hacker/` | RL with the loophole open | 85.0% | 10.4% |
| `honest/` | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
| `ablated_top2pc/` | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

Rates are measured on the hard test split (n=1130). See the GitHub repo for the method and figures.
## Usage
```python
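from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach one adapter on top of it.
# `subfolder` selects the adapter: "hacker", "honest", or "ablated_top2pc".
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "Experimental-Orange/trajectory-diffing-rl-adapters", subfolder="ablated_top2pc")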
```
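
To switch between the three adapters listed in the table, a small wrapper along the following lines may be convenient. This is a minimal sketch, not part of the repo: `load_adapter`, `ADAPTERS`, and the `merge` flag are hypothetical names introduced here, and it assumes the tokenizer is taken from the base model.

```python
# Hypothetical helper (not shipped with this repo) for loading one adapter by name.
BASE_MODEL = "Qwen/Qwen3-4B"
ADAPTER_REPO = "Experimental-Orange/trajectory-diffing-rl-adapters"
ADAPTERS = ("hacker", "honest", "ablated_top2pc")  # folder names from the table


def load_adapter(name, merge=False):
    """Return (model, tokenizer) for the adapter stored in folder `name`.

    If `merge` is True, the LoRA weights are folded into the base model,
    which gives a plain transformers model without the PEFT wrapper.
    """
    if name not in ADAPTERS:
        raise ValueError(f"unknown adapter {name!r}; expected one of {ADAPTERS}")

    # Heavy imports are deferred so the name check works without them.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Assumption: the adapters reuse the base model's tokenizer unchanged.
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(base, ADAPTER_REPO, subfolder=name)
    if merge:
        model = model.merge_and_unload()
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

For example, `model, tokenizer = load_adapter("honest")` would load the counterfactual adapter. Each call reloads the base model, so comparing all three adapters this way downloads the base weights once but holds a separate copy in memory per call.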