---
title: ITBench-Lite-Space
sdk: docker
pinned: false
license: apache-2.0
tags:
  - jupyterlab
  - ai-agents
  - benchmark
  - sre
  - finops
suggested_storage: small
---

# ITBench-Lite-Space

Interactive JupyterLab environment for analyzing and evaluating AI agents on real-world IT automation tasks using the ITBench-Lite benchmark.

## First Time Here?

→ Open **START_HERE.md** in JupyterLab for a quick-start guide!

Or jump directly to the notebooks described below.

## About

This Hugging Face Space provides an interactive notebook environment for:

- Running AI agents on IT automation scenarios, including:
  - **SRE (Site Reliability Engineering)**: fault localization through incident diagnosis
  - **FinOps (Financial Operations)**: cost anomaly analysis for IT spend management
- Analyzing agent behavior through detailed trajectory analysis
- Evaluating performance using LLM-as-a-Judge metrics
- Measuring consistency across multiple trial runs

The ITBench-Lite benchmark includes 50 scenarios (35 SRE + 15 FinOps) with comprehensive observability data including logs, traces, metrics, and Kubernetes events.

## Getting Started

### 1. Duplicate This Space (If You Haven't Already)

To use this Space with your own API key:

  1. Click the ⋮ menu at the top → **Duplicate this Space**
  2. Choose a name for your duplicated Space
  3. Wait for it to build (this may take a few minutes)

### 2. Set Up Your API Key

Once you have your own copy:

  1. Go to your Space's Settings tab
  2. Navigate to Repository secrets
  3. Add a new secret containing your OpenRouter API key (see Requirements below)

The API key will be automatically available as an environment variable in all notebooks.
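In a notebook cell, the secret can be read like any environment variable. The variable name `OPENROUTER_API_KEY` below is an assumption; use whichever name you gave the secret:

```python
import os

# "OPENROUTER_API_KEY" is an assumed secret name -- substitute the name
# you set under Settings → Repository secrets in your duplicated Space.
api_key = os.environ.get("OPENROUTER_API_KEY", "")
print("API key loaded" if api_key else "API key missing -- check your Space secrets")
```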

### 3. Access JupyterLab

Once your duplicated Space is running:

  1. Open your Space URL in your browser
  2. You'll see a JupyterLab interface (no password required)

### 4. Available Notebooks

- `download_run_scenario.ipynb`: run agents on benchmark scenarios
- `evaluation.ipynb`: evaluate agent performance

## Features

### Agent Evaluation Metrics

- **Root Cause Entity Detection**: precision, recall, and F1 scores for identifying problem sources
- **Consistency Analysis**: Pass@k and Majority@k metrics across multiple trials
- **Trajectory Analysis**: inference counts, tool usage patterns, failure rates
- **Discovery Pipeline**: tracking of how agents explore and identify root causes
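As a rough illustration of these metrics (hypothetical helper names, not the benchmark's own implementation), entity-level precision/recall/F1 and the standard unbiased pass@k estimator can be sketched as:

```python
from collections import Counter
from math import comb

def entity_prf(predicted: set, gold: set) -> tuple:
    """Precision/recall/F1 over root-cause entity sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trials drawn
    from n total (c of them successful) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_at_k(answers: list, k: int, correct) -> bool:
    """Majority@k sketch: is the most common answer among the first k
    trials the correct one? (Tie-breaking here is arbitrary.)"""
    top, _ = Counter(answers[:k]).most_common(1)[0]
    return top == correct
```

For example, a run with 5 trials of which 2 localized the fault gives `pass_at_k(5, 2, 1) = 0.4`, i.e. the chance a single sampled trial succeeds.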

### Analysis Capabilities

- End-to-end accuracy evaluation using LLM-as-a-Judge
- Inference and token usage statistics
- Tool-calling patterns and failure analysis
- Planning behavior through discovery trajectory tracking
- Interactive visualizations
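A minimal LLM-as-a-Judge sketch against the OpenRouter chat-completions endpoint; the model name, prompt wording, function names, and `OPENROUTER_API_KEY` variable are all illustrative assumptions, not the Space's actual judge:

```python
import json
import os
import urllib.request

def build_judge_request(diagnosis: str, ground_truth: str,
                        model: str = "openai/gpt-4o-mini") -> dict:
    """Build a chat-completions payload asking the model to grade a diagnosis.
    Prompt and model are placeholders for illustration."""
    prompt = (
        "You are grading an SRE agent's incident diagnosis.\n"
        f"Ground-truth root cause: {ground_truth}\n"
        f"Agent diagnosis: {diagnosis}\n"
        "Answer with a single word: CORRECT or INCORRECT."
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def judge(diagnosis: str, ground_truth: str) -> str:
    """POST the payload to OpenRouter and return the model's verdict."""
    payload = build_judge_request(diagnosis, ground_truth)
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```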

## Requirements

- **API Key**: an OpenRouter API key for LLM inference (set as a Hugging Face Space secret)
- **Storage**: Small (default)

## Project Structure

```
.
├── START_HERE.md                   # Start here for a quick guide
├── README.md                       # This file
├── download_run_scenario.ipynb     # Run agents on scenarios
├── evaluation.ipynb                # Evaluate agent performance
├── Dockerfile                      # Container setup
├── requirements.txt                # Python dependencies
├── start_server.sh                 # JupyterLab startup script
└── analysis_src/                   # Analysis modules
    ├── utils.py
    ├── extract_consistency_data.py
    ├── extract_discovery_trajectory.py
    ├── extract_inference_data.py
    ├── extract_tool_failures.py
    └── ...
```

## Related Resources

## Citation

If you use ITBench-Lite in your research, please cite:

```bibtex
@article{itbench2025,
  title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
  author={IBM Research},
  journal={arXiv preprint arXiv:2502.05352},
  year={2025},
  url={https://arxiv.org/pdf/2502.05352}
}
```