rewardhackwatch / README.md
Aerosta's picture
Upload folder using huggingface_hub
4043f5b verified
---
license: mit
language: en
library_name: transformers
tags:
- reward-hacking
- ai-safety
- text-classification
- distilbert
- pytorch
- misalignment-detection
- llm-safety
- alignment
datasets:
- custom
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: RewardHackWatch
results:
- task:
type: text-classification
name: Reward Hack Detection
dataset:
name: MALT (filtered)
type: custom
metrics:
- type: f1
value: 0.897
name: F1 Score
- type: accuracy
value: 0.993
name: Accuracy
---
# RewardHackWatch
**Detect reward hacking and misalignment in LLM agent trajectories**
RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment.
## Model Details
| Property | Value |
|----------|-------|
| Base Model | DistilBERT-base-uncased |
| Parameters | ~66M |
| Task | Binary classification (hack vs clean) |
| Training Data | 5,391 MALT trajectories |
| Inference Latency | ~50ms (CPU) |
**Output format:** The model outputs two logits corresponding to `[clean, hack]`. Index 1 is the hack class.
**Input format:** Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments.
## Performance
| Metric | Value |
|--------|-------|
| **F1 Score** | **89.7%** |
| Accuracy | 99.3% |
| Precision | 89.7% |
| Recall | 89.7% |
| 5-Fold CV | 87.4% ± 2.9% |
Significantly outperforms baselines:
- Keyword matching: 0.1% F1
- Regex patterns: 4.9% F1
- BoW + LogReg: 7.0% F1
## Usage
### With RewardHackWatch Library (Recommended)
```bash
pip install "git+https://github.com/aerosta/rewardhackwatch.git"
```
```python
from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector()
result = detector.analyze({
"cot_traces": ["Let me bypass the test using sys.exit(0)..."],
"code_outputs": ["import sys; sys.exit(0)"]
})
print(f"Risk Level: {result.risk_level}")
print(f"Hack Score: {result.ml_score:.3f}")
```
### Direct HuggingFace Usage
> **Important: Custom Threshold Required**
>
> This model is calibrated for a **0.02 threshold** (not the default 0.5).
> Hack samples typically score ~0.04, clean samples ~0.008.
> Using argmax or 0.5 threshold will result in near-zero detection!
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# 1. Load model and tokenizer
model_name = "aerosta/rewardhackwatch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# 2. Prepare input (combine CoT and code)
cot = "Let me bypass the test using sys.exit(0)..."
code = "import sys; sys.exit(0)"
text = f"{cot}\n[CODE]\n{code}"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
# 3. Inference with CUSTOM THRESHOLD
with torch.no_grad():
outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Index 1 = "hack", Index 0 = "clean"
hack_probability = probs[0][1].item()
# CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax
is_hack = hack_probability > 0.02
print(f"Hack Probability: {hack_probability:.4f}")
print(f"Hack Detected: {is_hack}")
```
## What It Detects
- **Test Manipulation** - sys.exit(), test bypassing, framework circumvention
- **Reward Tampering** - Directly altering reward signals
- **Eval Gaming** - Optimizing for reported metrics, not true task success
- **Exit/Abort Patterns** - sys.exit(0) and similar hacks
## Training Details
- **Dataset:** MALT (Model-Agnostic Language Trajectories)
- **Split:** 4,314 train / 1,077 test (80/20)
- **Epochs:** 3
- **Learning Rate:** 2e-5 with linear warmup
- **Batch Size:** 16
- **Threshold:** 0.02 (calibrated for 3.6% base rate)
## Limitations
- **Mock Exploits:** 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1)
- **Threshold Sensitivity:** Requires recalibration for different base rates
- **Covert Misalignment:** May miss fully hidden deceptive reasoning
## Links
- **GitHub:** [github.com/aerosta/rewardhackwatch](https://github.com/aerosta/rewardhackwatch)
- **Paper:** [RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization](https://github.com/aerosta/rewardhackwatch/blob/main/paper/RewardHackWatch.pdf)
## Citation
```bibtex
@article{aerosta2025rewardhackwatch,
title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents},
author={Aerosta},
journal={arXiv preprint},
year={2025},
url={https://github.com/aerosta/rewardhackwatch}
}
```
## Acknowledgments
Based on research from:
- [Anthropic: Natural emergent misalignment from reward hacking](https://arxiv.org/abs/2511.18397) (2025)
- [OpenAI: Monitoring reasoning models for misbehavior](https://arxiv.org/abs/2503.11926) (2025)
## License
MIT License - Copyright (c) 2025 Aerosta