	---
	license: mit
	base_model: Qwen/Qwen3-4B
	library_name: peft
	tags:
	- reward-hacking
	- interpretability
	- lora
	---

	# trajectory-diffing-rl — adapters

	LoRA adapters for [github.com/BenSturgeon/trajectory-diffing-rl](https://github.com/BenSturgeon/trajectory-diffing-rl).
	All are rank-32 LoRA adapters on `Qwen/Qwen3-4B`, trained with GRPO on Aria Wong's
	[reward-hacking testbed](https://github.com/ariahw/rl-rewardhacking).

	| folder | what it is | reward-hacking rate | performance |
	|---|---|---|---|
	| `hacker/` | RL with the loophole open | 85.0% | 10.4% |
	| `honest/` | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
	| `ablated_top2pc/` | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

	Rates are on the hard test split (n=1130). See the GitHub repo for method and figures.
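
The `ablated_top2pc/` row refers to projecting out principal-component directions. The actual procedure (where the PCs come from and which tensors they are removed from) is documented in the GitHub repo, not here. Purely as a generic illustration of what "projecting out" a set of directions means, a minimal sketch might look like the following. The function name `project_out`, the shapes, and the random data are assumptions for this example and are not taken from the repository.

```python
import numpy as np


def project_out(x: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the components of x that lie in the span of `directions`.

    x:          array of shape (..., d)
    directions: array of shape (k, d), e.g. k = 2 principal components
                (hypothetical input; need not be orthonormal)
    """
    # Orthonormalize the directions so the projector is exact even if
    # the inputs are not perfectly orthogonal. q has shape (d, k).
    q, _ = np.linalg.qr(directions.T)
    # Subtract the projection of x onto the spanned subspace.
    return x - (x @ q) @ q.T


# Toy data only: 5 vectors in 8 dimensions, 2 directions to remove.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
top_directions = rng.normal(size=(2, 8))

ablated = project_out(vectors, top_directions)

# After ablation, nothing is left along the removed directions.
residual = np.abs(ablated @ top_directions.T).max()
print(f"max residual along removed directions: {residual:.2e}")
```

Applying the projection twice gives the same result as applying it once, which is a quick sanity check that the directions have been fully removed.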

	## Usage

	```python
	from transformers import AutoModelForCausalLM
	from peft import PeftModel

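	# Load the base model, then attach one adapter from its subfolder in this repo
	# (`hacker`, `honest`, or `ablated_top2pc`).
	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
	model = PeftModel.from_pretrained(base, "Experimental-Orange/trajectory-diffing-rl-adapters", subfolder="ablated_top2pc")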
	```
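
To compare the three adapters on the same prompt, a small helper along these lines may be convenient. This is a sketch rather than part of the repo: `load_adapter`, `generate_text`, and `ADAPTER_FOLDERS` are hypothetical names, the folder names are taken from the table above, and it assumes the tokenizer can be loaded from the base model `Qwen/Qwen3-4B`.

```python
REPO_ID = "Experimental-Orange/trajectory-diffing-rl-adapters"
BASE_ID = "Qwen/Qwen3-4B"
# Folder names as listed in the table above.
ADAPTER_FOLDERS = ("hacker", "honest", "ablated_top2pc")


def load_adapter(folder: str, base_id: str = BASE_ID):
    """Return (tokenizer, model) for one adapter subfolder (hypothetical helper)."""
    if folder not in ADAPTER_FOLDERS:
        raise ValueError(f"unknown adapter {folder!r}; expected one of {ADAPTER_FOLDERS}")

    # Imported lazily so the folder check above works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Assumption: the adapters reuse the base model's tokenizer unchanged.
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, REPO_ID, subfolder=folder)
    model.eval()
    return tokenizer, model


def generate_text(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a completion for `prompt` and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example (downloads the base model and one adapter):
# tokenizer, model = load_adapter("honest")
# print(generate_text(tokenizer, model, "Write a function that reverses a string."))
```

Loading each adapter onto a fresh copy of the base model keeps the comparison simple, at the cost of reloading the base weights each time.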