---
title: OffRails API
sdk: docker
app_port: 7860
pinned: false
---
Agent Trace Anomaly Detection
Detect when AI agents "go off the rails" (unnecessary tool calls, circular reasoning, and goal drift) by framing multi-step execution traces as a sequence anomaly detection problem.
Problem Statement
LLM-based agents (e.g., ReAct, ToolLLM) execute multi-step tool-calling workflows. Sometimes these traces exhibit failure modes:
- Circular reasoning: calling the same tool repeatedly with no progress
- Goal drift: diverging from the original user intent
- Unnecessary tool calls: invoking irrelevant APIs
- Silent failures: completing without actually answering the query
We build a binary classifier that takes an agent execution trace (sequence of tool calls, reasoning steps, observations) and predicts whether the trace is anomalous (1) or normal (0).
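For concreteness, a single labeled example might look like the sketch below. The field names are illustrative, not the exact ToolBench schema:

```python
# Illustrative labeled trace (hypothetical field names, not the ToolBench schema)
example = {
    "query": "Find the cheapest flight from Boston to Denver next Friday",
    "trace": [
        {"role": "assistant", "thought": "I should search for flights.",
         "tool_call": {"name": "search_flights", "args": {"from": "BOS", "to": "DEN"}}},
        {"role": "tool", "observation": "Error: missing required parameter 'date'"},
        {"role": "assistant", "thought": "I should search for flights.",
         "tool_call": {"name": "search_flights", "args": {"from": "BOS", "to": "DEN"}}},
        {"role": "assistant", "content": "I'm unable to complete this request."},
    ],
    "label": 1,  # anomalous: repeated identical call plus give-up language
}
```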
Dataset
Source: ToolBench v1 (Qin et al., ICLR 2024)
Proxy labeling: Since ToolBench doesn't include explicit pass/fail labels, we construct proxy anomaly labels by analyzing the final assistant message:
- Contains failure language ("I cannot", "failed", "unable to") → anomalous
- Zero tool calls in the trace → anomalous
- Otherwise → normal
This labeling is intentionally imperfect; Experiment 2 (noise robustness) directly quantifies how sensitive our models are to these proxy label errors.
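A minimal sketch of this heuristic, assuming a simple list-of-turns trace format (the function and field names are illustrative, not the exact code in scripts/make_dataset.py):

```python
FAILURE_PHRASES = ("i cannot", "failed", "unable to")

def proxy_label(trace: list[dict], final_message: str) -> int:
    """Heuristic proxy label: 1 = anomalous, 0 = normal."""
    text = final_message.lower()
    # Rule 1: final assistant message contains failure language
    if any(phrase in text for phrase in FAILURE_PHRASES):
        return 1
    # Rule 2: the trace never called a tool at all
    if not any("tool_call" in step for step in trace):
        return 1
    # Otherwise treat the trace as normal
    return 0
```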
Models
| Model | Type | Input | Description |
|---|---|---|---|
| Naive Baseline | Majority class | Labels only | Always predicts the most frequent class |
| XGBoost | Classical ML | 25+ handcrafted features | Gradient boosting on structural, behavioral, and linguistic features extracted from traces |
| DistilBERT | Deep Learning | Raw trace text | Fine-tuned transformer that processes the full tokenized trace |
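For orientation, the two learned models in the table above could be instantiated roughly as follows. The hyperparameters shown are placeholders; the actual configuration lives in scripts/model.py and scripts/tune_hyperparams.py:

```python
from xgboost import XGBClassifier
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# XGBoost over the handcrafted feature matrix (placeholder hyperparameters)
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    eval_metric="logloss",
)

# DistilBERT fine-tuned on the raw trace text with a binary classification head
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```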
Handcrafted Features (XGBoost)
- Structural: turn count, trace length, conversation depth
- Tool-usage: call count, diversity ratio, tool call density
- Behavioral: consecutive same-tool calls, repeated calls, call-response ratio
- Linguistic: error keywords, apology phrases, hedging language, give-up signals
- Positional: where tool calls appear in the trace (early vs. late)
- Observation quality: error observations, empty responses
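The sketch below computes a small illustrative subset of these features from a list-of-turns trace. Feature and field names are assumptions, not the exact output of scripts/build_features.py:

```python
from itertools import groupby

def extract_features(trace: list[dict]) -> dict:
    """Compute an illustrative subset of the handcrafted features."""
    tool_calls = [s["tool_call"]["name"] for s in trace if "tool_call" in s]
    observations = [s["observation"] for s in trace if "observation" in s]
    n_steps, n_calls = len(trace), len(tool_calls)
    # Lengths of runs of consecutive identical tool calls (behavioral signal)
    runs = [sum(1 for _ in group) for _, group in groupby(tool_calls)]
    return {
        "turn_count": n_steps,                                               # structural
        "tool_call_count": n_calls,                                          # tool usage
        "tool_diversity_ratio": len(set(tool_calls)) / n_calls if n_calls else 0.0,
        "tool_call_density": n_calls / n_steps if n_steps else 0.0,
        "max_consecutive_same_tool": max(runs, default=0),                   # behavioral
        "error_observation_count": sum("error" in o.lower() for o in observations),
    }
```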
Project Structure
```
├── README.md               # this file
├── requirements.txt        # Python dependencies
├── setup.py                # orchestrates the full pipeline
├── main.py                 # main entry point (pipeline / inference / demo)
├── scripts/
│   ├── make_dataset.py     # data download, preprocessing, proxy labeling
│   ├── build_features.py   # handcrafted feature extraction
│   ├── model.py            # all three model definitions
│   ├── train.py            # training orchestration
│   ├── evaluate.py         # metrics, confusion matrices, error analysis
│   ├── experiment.py       # sensitivity + noise robustness experiments
│   ├── tune_hyperparams.py # XGBoost hyperparameter grid search
│   └── inference.py        # production inference module (for FastAPI)
├── models/                 # saved trained models
├── data/
│   ├── raw/                # raw downloaded data
│   ├── processed/          # train/val/test splits + features
│   └── outputs/            # plots, metrics, experiment results
├── notebooks/              # exploration notebooks (not graded)
└── .gitignore
```
Quick Start
1. Install Dependencies
```bash
pip install -r requirements.txt
```
2. Run Full Pipeline
```bash
python setup.py
```
This will:
- Download ToolBench data and create proxy labels
- Extract handcrafted features
- Train all three models (naive, XGBoost, DistilBERT)
- Evaluate on test set with full metrics
- Run experiments (sensitivity + noise robustness)
3. Quick Test (subset of data)
```bash
python setup.py --max_samples 5000
```
4. Run Individual Pipeline Steps
```bash
python setup.py --step data
python setup.py --step features
python setup.py --step train --model classical  # XGBoost only
python setup.py --step train --model deep       # DistilBERT only
python setup.py --step evaluate
```
5. Run Inference
```bash
python main.py inference --trace path/to/trace.json --model_type xgboost
```
6. Interactive Demo
```bash
python main.py demo
```
Experiments
Experiment 1: Training Set Size Sensitivity
Trains XGBoost on 10%, 25%, 50%, 75%, and 100% of the training data, with 3 random seeds per fraction.
Motivation: determines whether we need more data or better features.
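A rough sketch of that loop, assuming numpy feature arrays and a hypothetical train_fn helper (the real logic is in scripts/experiment.py):

```python
import numpy as np
from sklearn.metrics import f1_score

def size_sensitivity(X_train, y_train, X_test, y_test, train_fn):
    """Train at several data fractions x seeds and record mean/std test F1."""
    results = {}
    for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
        scores = []
        for seed in (0, 1, 2):
            rng = np.random.default_rng(seed)
            # Subsample the training set without replacement
            idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
            model = train_fn(X_train[idx], y_train[idx], seed=seed)
            scores.append(f1_score(y_test, model.predict(X_test)))
        results[frac] = (np.mean(scores), np.std(scores))
    return results
```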
Experiment 2: Label Noise Robustness
Flips 0–25% of training labels randomly to simulate proxy label errors.
Motivation: Our labels are heuristic-based, so quantifying noise sensitivity is directly relevant. If the model is robust to 15%+ noise, our proxy labeling strategy is viable.
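The noise injection itself is simple to sketch; the function below is illustrative, not the exact implementation in scripts/experiment.py:

```python
import numpy as np

def flip_labels(y: np.ndarray, noise_rate: float, seed: int = 0) -> np.ndarray:
    """Randomly flip a fraction of binary labels to simulate proxy-label noise."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(noise_rate * len(y))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]  # 0 -> 1, 1 -> 0
    return y_noisy

# e.g. retrain at each noise level and compare test F1:
# for rate in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25): ...
```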
Integration with Backend
The scripts/inference.py module provides a TraceAnomalyDetector class that Omkar's FastAPI backend imports:
```python
from scripts.inference import TraceAnomalyDetector

detector = TraceAnomalyDetector(model_dir="models", model_type="xgboost")
result = detector.predict(conversation_json)
# result = {
#     "is_anomalous": True/False,
#     "confidence": 0.87,
#     "label": 0 or 1,
#     "anomaly_signals": ["Circular behavior detected: ...", ...]
# }
```
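For reference, a minimal FastAPI wrapper around this interface might look like the following; the endpoint path and request schema are illustrative, not the actual backend:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from scripts.inference import TraceAnomalyDetector

app = FastAPI()
detector = TraceAnomalyDetector(model_dir="models", model_type="xgboost")

class TraceRequest(BaseModel):
    conversation: list[dict]  # the agent trace as a list of turns

@app.post("/detect")
def detect(request: TraceRequest) -> dict:
    """Score a trace and return the anomaly verdict."""
    return detector.predict(request.conversation)
```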
Evaluation Metrics
- Primary: F1 Score (binary, on the anomalous class), which balances precision and recall for the minority class
- Secondary: Macro F1, ROC AUC, Precision-Recall curves
- Justification: Standard accuracy is misleading with class imbalance. F1 directly measures our ability to catch anomalous traces while avoiding false alarms.
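These metrics map directly onto scikit-learn calls, e.g. (toy values shown for illustration):

```python
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve

# Toy values for illustration; in practice these come from scripts/evaluate.py
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.9, 0.8, 0.2, 0.4]  # predicted P(anomalous)

binary_f1 = f1_score(y_true, y_pred, pos_label=1)   # primary metric
macro_f1 = f1_score(y_true, y_pred, average="macro")
roc_auc = roc_auc_score(y_true, y_prob)
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
```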
References
- Qin et al., "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs", ICLR 2024
- ToolBench GitHub
- ToolBench HuggingFace Dataset
AI Attribution
Parts of this codebase were developed with the assistance of Claude (Anthropic). All AI-generated code has been reviewed, tested, and adapted by the team.