---
title: OffRails API
sdk: docker
app_port: 7860
pinned: false
---
# Agent Trace Anomaly Detection
Detect when AI agents "go off the rails" (unnecessary tool calls, circular reasoning, and goal drift) by framing multi-step execution traces as a **sequence anomaly detection** problem.
## Problem Statement
LLM-based agents (e.g., ReAct, ToolLLM) execute multi-step tool-calling workflows. Sometimes these traces exhibit failure modes:
- **Circular reasoning**: calling the same tool repeatedly with no progress
- **Goal drift**: diverging from the original user intent
- **Unnecessary tool calls**: invoking irrelevant APIs
- **Silent failures**: completing without actually answering the query
We build a binary classifier that takes an agent execution trace (sequence of tool calls, reasoning steps, observations) and predicts whether the trace is **anomalous** (1) or **normal** (0).
## Dataset
**Source**: [ToolBench v1](https://huggingface.co/datasets/tuandunghcmut/toolbench-v1) (Qin et al., ICLR 2024)
**Proxy labeling**: Since ToolBench doesn't include explicit pass/fail labels, we construct proxy anomaly labels by analyzing the final assistant message:
- Contains failure language ("I cannot", "failed", "unable to") → **anomalous**
- Zero tool calls in the trace → **anomalous**
- Otherwise → **normal**
This labeling is intentionally imperfect; Experiment 2 (noise robustness) directly quantifies how sensitive our models are to these proxy label errors.
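
The labeling rules above can be sketched as a small function. This is a minimal illustration, not the code in `scripts/make_dataset.py`: the function name `proxy_label`, the turn layout (dicts with `"from"` and `"value"` keys), and the use of the `"function"` role to count tool calls are all assumptions about the data format.

```python
from typing import Dict, List

# Only the three example phrases listed above; the real list may be longer.
FAILURE_PHRASES = ("i cannot", "failed", "unable to")


def proxy_label(conversation: List[Dict[str, str]]) -> int:
    """Return 1 (anomalous) or 0 (normal) for one trace.

    Assumes each turn is a dict with "from" (role) and "value" (text),
    and that every tool response appears as a turn with role "function".
    """
    # Rule 1: a trace with zero tool calls is anomalous.
    tool_calls = sum(1 for turn in conversation if turn.get("from") == "function")
    if tool_calls == 0:
        return 1

    # Rule 2: failure language in the final assistant message is anomalous.
    assistant_turns = [t for t in conversation if t.get("from") == "assistant"]
    final_text = assistant_turns[-1].get("value", "").lower() if assistant_turns else ""
    if any(phrase in final_text for phrase in FAILURE_PHRASES):
        return 1

    # Rule 3: everything else is normal.
    return 0
```

Plain substring matching is deliberately crude: a final answer such as "the login failed twice last week" would be flagged even though the agent succeeded, which is one source of the label noise studied in Experiment 2.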
## Models
| Model | Type | Input | Description |
|-------|------|-------|-------------|
| **Naive Baseline** | Majority class | Labels only | Always predicts the most frequent class |
| **XGBoost** | Classical ML | 25+ handcrafted features | Gradient boosting on structural, behavioral, and linguistic features extracted from traces |
| **DistilBERT** | Deep Learning | Raw trace text | Fine-tuned transformer that processes the full tokenized trace |
### Handcrafted Features (XGBoost)
- **Structural**: turn count, trace length, conversation depth
- **Tool-usage**: call count, diversity ratio, tool call density
- **Behavioral**: consecutive same-tool calls, repeated calls, call-response ratio
- **Linguistic**: error keywords, apology phrases, hedging language, give-up signals
- **Positional**: where tool calls appear in the trace (early vs. late)
- **Observation quality**: error observations, empty responses
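
A few of the tool-usage and behavioral features can be sketched as follows. This is an illustrative sketch, not the implementation in `scripts/build_features.py`: the function name, the feature names in the returned dict, and the input (an ordered list of tool names, one per call) are assumptions.

```python
from typing import Dict, List


def behavioral_features(tool_names: List[str]) -> Dict[str, float]:
    """Compute a few features from the ordered list of tools called in a trace."""
    n_calls = len(tool_names)
    n_distinct = len(set(tool_names))

    # Diversity ratio: distinct tools / total calls (0.0 when there are no calls).
    diversity = n_distinct / n_calls if n_calls else 0.0

    # Longest run of consecutive calls to the same tool (a circular-reasoning signal).
    longest_run = 0
    current_run = 0
    previous = None
    for name in tool_names:
        current_run = current_run + 1 if name == previous else 1
        longest_run = max(longest_run, current_run)
        previous = name

    return {
        "call_count": float(n_calls),
        "diversity_ratio": diversity,
        "max_consecutive_same_tool": float(longest_run),
        # Calls beyond the first use of each tool.
        "repeated_calls": float(n_calls - n_distinct),
    }
```

For a trace that calls `search` three times in a row and then `weather` once, this yields a diversity ratio of 0.5 and a longest same-tool run of 3.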
## Project Structure
```
├── README.md               ← this file
├── requirements.txt        ← Python dependencies
├── setup.py                ← orchestrates the full pipeline
├── main.py                 ← main entry point (pipeline / inference / demo)
├── scripts/
│   ├── make_dataset.py     ← data download, preprocessing, proxy labeling
│   ├── build_features.py   ← handcrafted feature extraction
│   ├── model.py            ← all three model definitions
│   ├── train.py            ← training orchestration
│   ├── evaluate.py         ← metrics, confusion matrices, error analysis
│   ├── experiment.py       ← sensitivity + noise robustness experiments
│   ├── tune_hyperparams.py ← XGBoost hyperparameter grid search
│   └── inference.py        ← production inference module (for FastAPI)
├── models/                 ← saved trained models
├── data/
│   ├── raw/                ← raw downloaded data
│   ├── processed/          ← train/val/test splits + features
│   └── outputs/            ← plots, metrics, experiment results
├── notebooks/              ← exploration notebooks (not graded)
└── .gitignore
βββ .gitignore
```
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Full Pipeline
```bash
python setup.py
```
This will:
1. Download ToolBench data and create proxy labels
2. Extract handcrafted features
3. Train all three models (naive, XGBoost, DistilBERT)
4. Evaluate on test set with full metrics
5. Run experiments (sensitivity + noise robustness)
### 3. Quick Test (subset of data)
```bash
python setup.py --max_samples 5000
```
### 4. Train Individual Models
```bash
python setup.py --step data
python setup.py --step features
python setup.py --step train --model classical # XGBoost only
python setup.py --step train --model deep # DistilBERT only
python setup.py --step evaluate
```
### 5. Run Inference
```bash
python main.py inference --trace path/to/trace.json --model_type xgboost
```
### 6. Interactive Demo
```bash
python main.py demo
```
## Experiments
### Experiment 1: Training Set Size Sensitivity
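Trains XGBoost on 10%, 25%, 50%, 75%, and 100% of the training data, with 3 random seeds at each size.
**Motivation**: Determines whether we need more data or better features. If the score keeps improving as the training set grows, more data should help; if it plateaus early, better features are the more promising investment.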
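
The subsampling schedule for this experiment might look like the sketch below. The seed values, the helper names, and the use of plain (non-stratified) random sampling are assumptions; `scripts/experiment.py` may do this differently, and stratified sampling may be preferable when the anomalous class is rare.

```python
import random
from typing import Dict, List, Tuple

FRACTIONS = (0.10, 0.25, 0.50, 0.75, 1.00)
SEEDS = (0, 1, 2)  # assumed values; the README only specifies "3 random seeds"


def subsample_indices(n_rows: int, fraction: float, seed: int) -> List[int]:
    """Pick a reproducible random subset of row indices without replacement."""
    rng = random.Random(seed)
    k = max(1, round(n_rows * fraction))
    return sorted(rng.sample(range(n_rows), k))


def build_schedule(n_rows: int) -> Dict[Tuple[float, int], List[int]]:
    """Map every (fraction, seed) pair to the training rows used for that run."""
    return {
        (fraction, seed): subsample_indices(n_rows, fraction, seed)
        for fraction in FRACTIONS
        for seed in SEEDS
    }
```

Each of the 15 index lists would then select the rows used to fit one XGBoost model, with every model scored on the same held-out test set so the results are comparable across sizes.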
### Experiment 2: Label Noise Robustness
Flips 0–25% of training labels randomly to simulate proxy label errors.
**Motivation**: Our labels are heuristic-based, so quantifying noise sensitivity is directly relevant. If the model is robust to 15%+ noise, our proxy labeling strategy is viable.
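
Label flipping for this experiment can be sketched as below. The function name `flip_labels` is hypothetical, and flipping an exact count of labels (rather than flipping each label independently with the given probability) is an assumption about how the noise is injected.

```python
import random
from typing import List


def flip_labels(labels: List[int], noise_rate: float, seed: int = 0) -> List[int]:
    """Return a copy of binary labels with a fixed share of them flipped."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    rng = random.Random(seed)
    n_flip = round(len(labels) * noise_rate)
    # Choose which positions to corrupt, without replacement.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    # Flip 0 <-> 1 at the chosen positions; leave the input list untouched.
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]
```

Only the training labels should be corrupted; validation and test labels stay as they are so that the measured drop reflects the effect of noisy supervision alone.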
## Integration with Backend
The `scripts/inference.py` module provides a `TraceAnomalyDetector` class that Omkar's FastAPI backend imports:
```python
from scripts.inference import TraceAnomalyDetector
detector = TraceAnomalyDetector(model_dir="models", model_type="xgboost")
result = detector.predict(conversation_json)
# result = {
# "is_anomalous": True/False,
# "confidence": 0.87,
# "label": 0 or 1,
# "anomaly_signals": ["Circular behavior detected: ...", ...]
# }
```
## Evaluation Metrics
- **Primary**: F1 Score (binary, on anomalous class), which balances precision and recall for the minority class
- **Secondary**: Macro F1, ROC AUC, Precision-Recall curves
- **Justification**: Standard accuracy is misleading with class imbalance. F1 directly measures our ability to catch anomalous traces while avoiding false alarms.
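
To make the justification concrete, the sketch below computes accuracy and anomalous-class F1 by hand for a majority-class predictor. The 90/10 class split is purely illustrative and is not the project's measured class balance; the helper names are hypothetical.

```python
from typing import List, Tuple


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(
    y_true: List[int], y_pred: List[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (anomalous) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data: 90 normal traces, 10 anomalous ones.
y_true = [0] * 90 + [1] * 10
# A majority-class baseline predicts "normal" for everything.
y_majority = [0] * 100

baseline_accuracy = accuracy(y_true, y_majority)          # 0.9
baseline_f1 = precision_recall_f1(y_true, y_majority)[2]  # 0.0
```

The baseline reaches 90% accuracy on this toy split while catching no anomalous traces at all, which is why F1 on the anomalous class is the primary metric.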
## References
- Qin et al., "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs", ICLR 2024
- [ToolBench GitHub](https://github.com/OpenBMB/ToolBench)
- [ToolBench HuggingFace Dataset](https://huggingface.co/datasets/tuandunghcmut/toolbench-v1)
## AI Attribution
Parts of this codebase were developed with the assistance of Claude (Anthropic). All AI-generated code has been reviewed, tested, and adapted by the team.