--- license: mit language: en library_name: transformers tags: - reward-hacking - ai-safety - text-classification - distilbert - pytorch - misalignment-detection - llm-safety - alignment datasets: - custom metrics: - f1 - accuracy - precision - recall pipeline_tag: text-classification model-index: - name: RewardHackWatch results: - task: type: text-classification name: Reward Hack Detection dataset: name: MALT (filtered) type: custom metrics: - type: f1 value: 0.897 name: F1 Score - type: accuracy value: 0.993 name: Accuracy --- # RewardHackWatch **Detect reward hacking and misalignment in LLM agent trajectories** RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment. ## Model Details | Property | Value | |----------|-------| | Base Model | DistilBERT-base-uncased | | Parameters | ~66M | | Task | Binary classification (hack vs clean) | | Training Data | 5,391 MALT trajectories | | Inference Latency | ~50ms (CPU) | **Output format:** The model outputs two logits corresponding to `[clean, hack]`. Index 1 is the hack class. **Input format:** Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments. ## Performance | Metric | Value | |--------|-------| | **F1 Score** | **89.7%** | | Accuracy | 99.3% | | Precision | 89.7% | | Recall | 89.7% | | 5-Fold CV | 87.4% ± 2.9% | Significantly outperforms baselines: - Keyword matching: 0.1% F1 - Regex patterns: 4.9% F1 - BoW + LogReg: 7.0% F1 ## Usage ### With RewardHackWatch Library (Recommended) ```bash pip install "git+https://github.com/aerosta/rewardhackwatch.git" ``` ```python from rewardhackwatch import RewardHackDetector detector = RewardHackDetector() result = detector.analyze({ "cot_traces": ["Let me bypass the test using sys.exit(0)..."], "code_outputs": ["import sys; sys.exit(0)"] }) print(f"Risk Level: {result.risk_level}") print(f"Hack Score: {result.ml_score:.3f}") ``` ### Direct HuggingFace Usage > **Important: Custom Threshold Required** > > This model is calibrated for a **0.02 threshold** (not the default 0.5). > Hack samples typically score ~0.04, clean samples ~0.008. > Using argmax or 0.5 threshold will result in near-zero detection! ```python import torch from transformers import AutoTokenizer, AutoModelForSequenceClassification # 1. Load model and tokenizer model_name = "aerosta/rewardhackwatch" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name) # 2. Prepare input (combine CoT and code) cot = "Let me bypass the test using sys.exit(0)..." code = "import sys; sys.exit(0)" text = f"{cot}\n[CODE]\n{code}" inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512) # 3. Inference with CUSTOM THRESHOLD with torch.no_grad(): outputs = model(**inputs) probs = torch.nn.functional.softmax(outputs.logits, dim=-1) # Index 1 = "hack", Index 0 = "clean" hack_probability = probs[0][1].item() # CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax is_hack = hack_probability > 0.02 print(f"Hack Probability: {hack_probability:.4f}") print(f"Hack Detected: {is_hack}") ``` ## What It Detects - **Test Manipulation** - sys.exit(), test bypassing, framework circumvention - **Reward Tampering** - Directly altering reward signals - **Eval Gaming** - Optimizing for reported metrics, not true task success - **Exit/Abort Patterns** - sys.exit(0) and similar hacks ## Training Details - **Dataset:** MALT (Model-Agnostic Language Trajectories) - **Split:** 4,314 train / 1,077 test (80/20) - **Epochs:** 3 - **Learning Rate:** 2e-5 with linear warmup - **Batch Size:** 16 - **Threshold:** 0.02 (calibrated for 3.6% base rate) ## Limitations - **Mock Exploits:** 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1) - **Threshold Sensitivity:** Requires recalibration for different base rates - **Covert Misalignment:** May miss fully hidden deceptive reasoning ## Links - **GitHub:** [github.com/aerosta/rewardhackwatch](https://github.com/aerosta/rewardhackwatch) - **Paper:** [RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization](https://github.com/aerosta/rewardhackwatch/blob/main/paper/RewardHackWatch.pdf) ## Citation ```bibtex @article{aerosta2025rewardhackwatch, title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents}, author={Aerosta}, journal={arXiv preprint}, year={2025}, url={https://github.com/aerosta/rewardhackwatch} } ``` ## Acknowledgments Based on research from: - [Anthropic: Natural emergent misalignment from reward hacking](https://arxiv.org/abs/2511.18397) (2025) - [OpenAI: Monitoring reasoning models for misbehavior](https://arxiv.org/abs/2503.11926) (2025) ## License MIT License - Copyright (c) 2025 Aerosta