---
license: mit
language: en
library_name: transformers
tags:
  - reward-hacking
  - ai-safety
  - text-classification
  - distilbert
  - pytorch
  - misalignment-detection
  - llm-safety
  - alignment
datasets:
  - custom
metrics:
  - f1
  - accuracy
  - precision
  - recall
pipeline_tag: text-classification
model-index:
  - name: RewardHackWatch
    results:
      - task:
          type: text-classification
          name: Reward Hack Detection
        dataset:
          name: MALT (filtered)
          type: custom
        metrics:
          - type: f1
            value: 0.897
            name: F1 Score
          - type: accuracy
            value: 0.993
            name: Accuracy
---

# RewardHackWatch

**Detect reward hacking and misalignment in LLM agent trajectories**

RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | DistilBERT-base-uncased |
| Parameters | ~66M |
| Task | Binary classification (hack vs clean) |
| Training Data | 5,391 MALT trajectories |
| Inference Latency | ~50ms (CPU) |

**Output format:** The model outputs two logits corresponding to `[clean, hack]`. Index 1 is the hack class.

**Input format:** Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments.

## Performance

| Metric | Value |
|--------|-------|
| **F1 Score** | **89.7%** |
| Accuracy | 99.3% |
| Precision | 89.7% |
| Recall | 89.7% |
| 5-Fold CV | 87.4% ± 2.9% |

Significantly outperforms baselines:
- Keyword matching: 0.1% F1
- Regex patterns: 4.9% F1
- BoW + LogReg: 7.0% F1

## Usage

### With RewardHackWatch Library (Recommended)

```bash
pip install "git+https://github.com/aerosta/rewardhackwatch.git"
```

```python
from rewardhackwatch import RewardHackDetector

detector = RewardHackDetector()

result = detector.analyze({
    "cot_traces": ["Let me bypass the test using sys.exit(0)..."],
    "code_outputs": ["import sys; sys.exit(0)"]
})

print(f"Risk Level: {result.risk_level}")
print(f"Hack Score: {result.ml_score:.3f}")
```

### Direct HuggingFace Usage

> **Important: Custom Threshold Required**
>
> This model is calibrated for a **0.02 threshold** (not the default 0.5).
> Hack samples typically score ~0.04, clean samples ~0.008.
> Using argmax or 0.5 threshold will result in near-zero detection!

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load model and tokenizer
model_name = "aerosta/rewardhackwatch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Prepare input (combine CoT and code)
cot = "Let me bypass the test using sys.exit(0)..."
code = "import sys; sys.exit(0)"
text = f"{cot}\n[CODE]\n{code}"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# 3. Inference with CUSTOM THRESHOLD
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Index 1 = "hack", Index 0 = "clean"
    hack_probability = probs[0][1].item()

    # CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax
    is_hack = hack_probability > 0.02

print(f"Hack Probability: {hack_probability:.4f}")
print(f"Hack Detected: {is_hack}")
```

## What It Detects

- **Test Manipulation** - sys.exit(), test bypassing, framework circumvention
- **Reward Tampering** - Directly altering reward signals
- **Eval Gaming** - Optimizing for reported metrics, not true task success
- **Exit/Abort Patterns** - sys.exit(0) and similar hacks

## Training Details

- **Dataset:** MALT (Model-Agnostic Language Trajectories)
- **Split:** 4,314 train / 1,077 test (80/20)
- **Epochs:** 3
- **Learning Rate:** 2e-5 with linear warmup
- **Batch Size:** 16
- **Threshold:** 0.02 (calibrated for 3.6% base rate)

## Limitations

- **Mock Exploits:** 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1)
- **Threshold Sensitivity:** Requires recalibration for different base rates
- **Covert Misalignment:** May miss fully hidden deceptive reasoning

## Links

- **GitHub:** [github.com/aerosta/rewardhackwatch](https://github.com/aerosta/rewardhackwatch)
- **Paper:** [RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization](https://github.com/aerosta/rewardhackwatch/blob/main/paper/RewardHackWatch.pdf)

## Citation

```bibtex
@article{aerosta2025rewardhackwatch,
  title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents},
  author={Aerosta},
  journal={arXiv preprint},
  year={2025},
  url={https://github.com/aerosta/rewardhackwatch}
}
```

## Acknowledgments

Based on research from:
- [Anthropic: Natural emergent misalignment from reward hacking](https://arxiv.org/abs/2511.18397) (2025)
- [OpenAI: Monitoring reasoning models for misbehavior](https://arxiv.org/abs/2503.11926) (2025)

## License

MIT License - Copyright (c) 2025 Aerosta