rewardhackwatch / README.md

Upload folder using huggingface_hub

4043f5b verified 2 months ago

5.24 kB

	---
	license: mit
	language: en
	library_name: transformers
	tags:
	- reward-hacking
	- ai-safety
	- text-classification
	- distilbert
	- pytorch
	- misalignment-detection
	- llm-safety
	- alignment
	datasets:
	- custom
	metrics:
	- f1
	- accuracy
	- precision
	- recall
	pipeline_tag: text-classification
	model-index:
	- name: RewardHackWatch
	results:
	- task:
	type: text-classification
	name: Reward Hack Detection
	dataset:
	name: MALT (filtered)
	type: custom
	metrics:
	- type: f1
	value: 0.897
	name: F1 Score
	- type: accuracy
	value: 0.993
	name: Accuracy
	---

	# RewardHackWatch

	Detect reward hacking and misalignment in LLM agent trajectories

	RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment.

	## Model Details

	\| Property \| Value \|
	\|----------\|-------\|
	\| Base Model \| DistilBERT-base-uncased \|
	\| Parameters \| ~66M \|
	\| Task \| Binary classification (hack vs clean) \|
	\| Training Data \| 5,391 MALT trajectories \|
	\| Inference Latency \| ~50ms (CPU) \|

	Output format: The model outputs two logits corresponding to `[clean, hack]`. Index 1 is the hack class.

	Input format: Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments.

	## Performance

	\| Metric \| Value \|
	\|--------\|-------\|
	\| F1 Score \| 89.7% \|
	\| Accuracy \| 99.3% \|
	\| Precision \| 89.7% \|
	\| Recall \| 89.7% \|
	\| 5-Fold CV \| 87.4% ± 2.9% \|

	Significantly outperforms baselines:
	- Keyword matching: 0.1% F1
	- Regex patterns: 4.9% F1
	- BoW + LogReg: 7.0% F1

	## Usage

	### With RewardHackWatch Library (Recommended)

	```bash
	pip install "git+https://github.com/aerosta/rewardhackwatch.git"
	```

	```python
	from rewardhackwatch import RewardHackDetector

	detector = RewardHackDetector()

	result = detector.analyze({
	"cot_traces": ["Let me bypass the test using sys.exit(0)..."],
	"code_outputs": ["import sys; sys.exit(0)"]
	})

	print(f"Risk Level: {result.risk_level}")
	print(f"Hack Score: {result.ml_score:.3f}")
	```

	### Direct HuggingFace Usage

	> Important: Custom Threshold Required
	>
	> This model is calibrated for a 0.02 threshold (not the default 0.5).
	> Hack samples typically score ~0.04, clean samples ~0.008.
	> Using argmax or 0.5 threshold will result in near-zero detection!

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# 1. Load model and tokenizer
	model_name = "aerosta/rewardhackwatch"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# 2. Prepare input (combine CoT and code)
	cot = "Let me bypass the test using sys.exit(0)..."
	code = "import sys; sys.exit(0)"
	text = f"{cot}\n[CODE]\n{code}"

	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

	# 3. Inference with CUSTOM THRESHOLD
	with torch.no_grad():
	outputs = model(**inputs)
	probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

	# Index 1 = "hack", Index 0 = "clean"
	hack_probability = probs[0][1].item()

	# CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax
	is_hack = hack_probability > 0.02

	print(f"Hack Probability: {hack_probability:.4f}")
	print(f"Hack Detected: {is_hack}")
	```

	## What It Detects

	- Test Manipulation - sys.exit(), test bypassing, framework circumvention
	- Reward Tampering - Directly altering reward signals
	- Eval Gaming - Optimizing for reported metrics, not true task success
	- Exit/Abort Patterns - sys.exit(0) and similar hacks

	## Training Details

	- Dataset: MALT (Model-Agnostic Language Trajectories)
	- Split: 4,314 train / 1,077 test (80/20)
	- Epochs: 3
	- Learning Rate: 2e-5 with linear warmup
	- Batch Size: 16
	- Threshold: 0.02 (calibrated for 3.6% base rate)

	## Limitations

	- Mock Exploits: 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1)
	- Threshold Sensitivity: Requires recalibration for different base rates
	- Covert Misalignment: May miss fully hidden deceptive reasoning

	## Links

	- GitHub: [github.com/aerosta/rewardhackwatch](https://github.com/aerosta/rewardhackwatch)
	- Paper: [RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization](https://github.com/aerosta/rewardhackwatch/blob/main/paper/RewardHackWatch.pdf)

	## Citation

	```bibtex
	@article{aerosta2025rewardhackwatch,
	title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents},
	author={Aerosta},
	journal={arXiv preprint},
	year={2025},
	url={https://github.com/aerosta/rewardhackwatch}
	}
	```

	## Acknowledgments

	Based on research from:
	- [Anthropic: Natural emergent misalignment from reward hacking](https://arxiv.org/abs/2511.18397) (2025)
	- [OpenAI: Monitoring reasoning models for misbehavior](https://arxiv.org/abs/2503.11926) (2025)

	## License

	MIT License - Copyright (c) 2025 Aerosta