---
license: mit
base_model: Qwen/Qwen3-4B
library_name: peft
tags:
- reward-hacking
- interpretability
- lora
---
# trajectory-diffing-rl — adapters

LoRA adapters for [github.com/BenSturgeon/trajectory-diffing-rl](https://github.com/BenSturgeon/trajectory-diffing-rl).
All are rank-32 LoRA adapters on `Qwen/Qwen3-4B`, trained with GRPO on Aria Wong's
[reward-hacking testbed](https://github.com/ariahw/rl-rewardhacking).

| folder | what it is | reward hacking | performance |
|---|---|---|---|
| `hacker/` | RL with the loophole open | 85.0% | 10.4% |
| `honest/` | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
| `ablated_top2pc/` | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

Rates are on the hard test split (n=1130). See the GitHub repo for method and figures.
## Usage

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the adapter stored in the chosen subfolder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "Experimental-Orange/trajectory-diffing-rl-adapters", subfolder="ablated_top2pc")
```
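To compare the three adapters, it can help to wrap the loading step in a small helper that takes the folder name. The following is a minimal sketch, not code from the repo: `check_adapter`, `load_adapter`, and `generate_text` are hypothetical names, and the prompt handling and `max_new_tokens` value are assumptions chosen only for illustration.

```python
# Folder names are taken from the table in this card; the helpers are illustrative.
ADAPTERS = ("hacker", "honest", "ablated_top2pc")
REPO_ID = "Experimental-Orange/trajectory-diffing-rl-adapters"
BASE_ID = "Qwen/Qwen3-4B"


def check_adapter(name: str) -> str:
    """Return `name` unchanged if it is one of the adapter folders."""
    if name not in ADAPTERS:
        raise ValueError(f"unknown adapter {name!r}; expected one of {ADAPTERS}")
    return name


def load_adapter(name: str):
    """Load the base model and tokenizer, then attach one adapter."""
    # Imported lazily so this module can be imported without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    subfolder = check_adapter(name)
    tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
    base = AutoModelForCausalLM.from_pretrained(BASE_ID)
    model = PeftModel.from_pretrained(base, REPO_ID, subfolder=subfolder)
    return tokenizer, model.eval()


def generate_text(name: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a completion from the chosen adapter (assumed settings)."""
    tokenizer, model = load_adapter(name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate_text("honest", "Write a function that reverses a string.")` downloads the base model and the adapter on first use. Passing a name outside the three folders raises `ValueError` before anything is downloaded.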