---
title: OffRails API
sdk: docker
app_port: 7860
pinned: false
---

# Agent Trace Anomaly Detection

Detect when AI agents "go off the rails" (unnecessary tool calls, circular reasoning, goal drift) by framing multi-step execution traces as a sequence anomaly detection problem.

## Problem Statement

LLM-based agents (e.g., ReAct, ToolLLM) execute multi-step tool-calling workflows. Sometimes these traces exhibit failure modes:

- **Circular reasoning**: calling the same tool repeatedly with no progress
- **Goal drift**: diverging from the original user intent
- **Unnecessary tool calls**: invoking irrelevant APIs
- **Silent failures**: completing without actually answering the query

We build a binary classifier that takes an agent execution trace (sequence of tool calls, reasoning steps, observations) and predicts whether the trace is anomalous (1) or normal (0).

## Dataset

**Source:** ToolBench v1 (Qin et al., ICLR 2024)

**Proxy labeling:** Since ToolBench doesn't include explicit pass/fail labels, we construct proxy anomaly labels by analyzing the final assistant message:

- Contains failure language ("I cannot", "failed", "unable to") → anomalous
- Zero tool calls in the trace → anomalous
- Otherwise → normal

This labeling is intentionally imperfect; Experiment 2 (noise robustness) directly quantifies how sensitive our models are to these proxy label errors.
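As a rough sketch, the heuristic above can be written as a small labeling function. The function name and exact phrase list here are illustrative assumptions, not the project's code in `scripts/make_dataset.py`:

```python
# Sketch of the proxy-labeling heuristic described above. The phrase
# list and signature are illustrative assumptions, not the real code.
FAILURE_PHRASES = ("i cannot", "failed", "unable to")

def proxy_label(final_message: str, num_tool_calls: int) -> int:
    """Return 1 (anomalous) or 0 (normal) per the heuristics above."""
    text = final_message.lower()
    if any(phrase in text for phrase in FAILURE_PHRASES):
        return 1  # failure language in the final assistant message
    if num_tool_calls == 0:
        return 1  # trace ended without a single tool call
    return 0      # otherwise treated as normal

print(proxy_label("I am unable to access that API.", 3))      # 1 (anomalous)
print(proxy_label("Here are the flights you asked for.", 2))  # 0 (normal)
```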

## Models

| Model | Type | Input | Description |
| --- | --- | --- | --- |
| Naive Baseline | Majority class | Labels only | Always predicts the most frequent class |
| XGBoost | Classical ML | 25+ handcrafted features | Gradient boosting on structural, behavioral, and linguistic features extracted from traces |
| DistilBERT | Deep learning | Raw trace text | Fine-tuned transformer that processes the full tokenized trace |

## Handcrafted Features (XGBoost)

- **Structural**: turn count, trace length, conversation depth
- **Tool usage**: call count, diversity ratio, tool call density
- **Behavioral**: consecutive same-tool calls, repeated calls, call-response ratio
- **Linguistic**: error keywords, apology phrases, hedging language, give-up signals
- **Positional**: where tool calls appear in the trace (early vs. late)
- **Observation quality**: error observations, empty responses
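To make a few of these feature families concrete, here is a minimal sketch of how some structural, tool-usage, and behavioral features could be computed from a trace's tool-call sequence. Feature names and definitions are assumptions for illustration; the real extraction lives in `scripts/build_features.py`:

```python
from collections import Counter

def extract_features(tool_calls: list[str], num_turns: int) -> dict:
    """Compute a handful of illustrative structural/behavioral features."""
    counts = Counter(tool_calls)
    n = len(tool_calls)
    # Behavioral: number of back-to-back calls to the same tool
    consecutive_same = sum(a == b for a, b in zip(tool_calls, tool_calls[1:]))
    return {
        "turn_count": num_turns,                                  # structural
        "tool_call_count": n,                                     # tool usage
        "tool_diversity_ratio": len(counts) / n if n else 0.0,
        "tool_call_density": n / num_turns if num_turns else 0.0,
        "consecutive_same_tool": consecutive_same,                # behavioral
        "max_same_tool_repeats": max(counts.values(), default=0),
    }

# A trace that calls "search" three times in a row looks circular:
feats = extract_features(["search", "search", "search", "book"], num_turns=6)
print(feats["consecutive_same_tool"])  # 2
```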

## Project Structure

```text
├── README.md               ← this file
├── requirements.txt        ← Python dependencies
├── setup.py                ← orchestrates the full pipeline
├── main.py                 ← main entry point (pipeline / inference / demo)
├── scripts/
│   ├── make_dataset.py     ← data download, preprocessing, proxy labeling
│   ├── build_features.py   ← handcrafted feature extraction
│   ├── model.py            ← all three model definitions
│   ├── train.py            ← training orchestration
│   ├── evaluate.py         ← metrics, confusion matrices, error analysis
│   ├── experiment.py       ← sensitivity + noise robustness experiments
│   ├── tune_hyperparams.py ← XGBoost hyperparameter grid search
│   └── inference.py        ← production inference module (for FastAPI)
├── models/                 ← saved trained models
├── data/
│   ├── raw/                ← raw downloaded data
│   ├── processed/          ← train/val/test splits + features
│   └── outputs/            ← plots, metrics, experiment results
├── notebooks/              ← exploration notebooks (not graded)
└── .gitignore
```

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Full Pipeline

```bash
python setup.py
```

This will:

  1. Download ToolBench data and create proxy labels
  2. Extract handcrafted features
  3. Train all three models (naive, XGBoost, DistilBERT)
  4. Evaluate on test set with full metrics
  5. Run experiments (sensitivity + noise robustness)

### 3. Quick Test (subset of data)

```bash
python setup.py --max_samples 5000
```

### 4. Train Individual Models

```bash
python setup.py --step data
python setup.py --step features
python setup.py --step train --model classical   # XGBoost only
python setup.py --step train --model deep        # DistilBERT only
python setup.py --step evaluate
```

### 5. Run Inference

```bash
python main.py inference --trace path/to/trace.json --model_type xgboost
```

### 6. Interactive Demo

```bash
python main.py demo
```

## Experiments

### Experiment 1: Training Set Size Sensitivity

Trains XGBoost at 10%, 25%, 50%, 75%, and 100% of the data with 3 random seeds each.
**Motivation:** Determines whether we need more data or better features.

### Experiment 2: Label Noise Robustness

Flips 0–25% of training labels randomly to simulate proxy label errors.
**Motivation:** Our labels are heuristic-based, so quantifying noise sensitivity is directly relevant. If the model is robust to 15%+ noise, our proxy labeling strategy is viable.
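The label-flipping step can be sketched as follows (a minimal assumed implementation; the actual experiment code is in `scripts/experiment.py`):

```python
import random

def flip_labels(labels: list[int], noise_rate: float, seed: int = 0) -> list[int]:
    """Randomly flip a fraction of binary labels to simulate proxy-label noise."""
    rng = random.Random(seed)
    flipped = labels[:]
    n_flip = int(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = 1 - flipped[i]  # flip 0 <-> 1
    return flipped

labels = [0] * 80 + [1] * 20
noisy = flip_labels(labels, noise_rate=0.15)
print(sum(a != b for a, b in zip(labels, noisy)))  # 15 labels flipped
```

Fixing the seed keeps each noise level reproducible across the 3 experiment seeds.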

## Integration with Backend

The `scripts/inference.py` module provides a `TraceAnomalyDetector` class that Omkar's FastAPI backend imports:

```python
from scripts.inference import TraceAnomalyDetector

detector = TraceAnomalyDetector(model_dir="models", model_type="xgboost")
result = detector.predict(conversation_json)
# result = {
#   "is_anomalous": True/False,
#   "confidence": 0.87,
#   "label": 0 or 1,
#   "anomaly_signals": ["Circular behavior detected: ...", ...]
# }
```

## Evaluation Metrics

- **Primary:** F1 score (binary, on the anomalous class), which balances precision and recall for the minority class
- **Secondary:** Macro F1, ROC AUC, precision-recall curves
- **Justification:** Standard accuracy is misleading with class imbalance. F1 directly measures our ability to catch anomalous traces while avoiding false alarms.
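A tiny hand-computed example of why binary F1 on the anomalous class is the primary metric (the numbers are illustrative only):

```python
# Hand-computed binary F1 on a toy prediction set (illustrative numbers).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.75
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75

# A degenerate "always normal" model scores 0.6 accuracy on this set
# but F1 = 0 on the anomalous class - exactly the failure accuracy hides.
```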

## References

- Qin et al., "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs," ICLR 2024 (ToolBench).

## AI Attribution

Parts of this codebase were developed with the assistance of Claude (Anthropic). All AI-generated code has been reviewed, tested, and adapted by the team.