---
title: ITBench-Lite-Space
sdk: docker
pinned: false
license: apache-2.0
tags:
  - jupyterlab
  - ai-agents
  - benchmark
  - sre
  - finops
suggested_storage: small
---

# ITBench-Lite-Space

Interactive JupyterLab environment for analyzing and evaluating AI agents on real-world IT automation tasks using the ITBench-Lite benchmark.

## First Time Here?

→ Open **START_HERE.md** in JupyterLab for a quick-start guide!

Or jump directly to the notebooks described below.

## About

This Hugging Face Space provides an interactive notebook environment for:

- Running AI agents on IT automation scenarios, including:
  - **SRE (Site Reliability Engineering)**: fault localization through incident diagnosis
  - **FinOps (Financial Operations)**: cost anomaly analysis for IT spend management
- Analyzing agent behavior through detailed trajectory analysis
- Evaluating performance using LLM-as-a-Judge metrics
- Measuring consistency across multiple trial runs

The ITBench-Lite benchmark includes 50 scenarios (35 SRE + 15 FinOps) with comprehensive observability data including logs, traces, metrics, and Kubernetes events.

## Getting Started

### 1. Duplicate This Space (If You Haven't Already)

To use this Space with your own API key:

  1. Click the ⋮ menu at the top → **Duplicate this Space**
  2. Choose a name for your duplicated Space
  3. Wait for it to build (this may take a few minutes)

### 2. Set Up Your API Key

Once you have your own copy:

  1. Go to your Space's Settings tab
  2. Navigate to Repository secrets
  3. Add a new secret containing your OpenRouter API key (see Requirements below)

The API key will be automatically available as an environment variable in all notebooks.
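In a notebook cell, the secret can be read like any environment variable. The variable name `OPENROUTER_API_KEY` below is an assumption; use whichever name you gave the secret:

```python
import os

# "OPENROUTER_API_KEY" is an assumed secret name -- substitute the name
# you set under Settings → Repository secrets in your duplicated Space.
api_key = os.environ.get("OPENROUTER_API_KEY", "")
print("API key loaded" if api_key else "API key missing -- check your Space secrets")
```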

### 3. Access JupyterLab

Once your duplicated Space is running:

  1. Open your Space URL in your browser
  2. You'll see a JupyterLab interface (no password required)

### 4. Available Notebooks

- `download_run_scenario.ipynb`: run agents on benchmark scenarios
- `evaluation.ipynb`: evaluate agent performance

## Features

### Agent Evaluation Metrics

- **Root Cause Entity Detection**: precision, recall, and F1 scores for identifying problem sources
- **Consistency Analysis**: Pass@k and Majority@k metrics across multiple trials
- **Trajectory Analysis**: inference counts, tool usage patterns, failure rates
- **Discovery Pipeline**: tracking of how agents explore and identify root causes
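As a rough illustration of these metrics (hypothetical helper names, not the benchmark's own implementation), entity-level precision/recall/F1 and the standard unbiased pass@k estimator can be sketched as:

```python
from collections import Counter
from math import comb

def entity_prf(predicted: set, gold: set) -> tuple:
    """Precision/recall/F1 over root-cause entity sets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k trials drawn
    from n total (c of them successful) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_at_k(answers: list, k: int, correct) -> bool:
    """Majority@k sketch: is the most common answer among the first k
    trials the correct one? (Tie-breaking here is arbitrary.)"""
    top, _ = Counter(answers[:k]).most_common(1)[0]
    return top == correct
```

For example, a run with 5 trials of which 2 localized the fault gives `pass_at_k(5, 2, 1) = 0.4`, i.e. the chance a single sampled trial succeeds.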

### Analysis Capabilities

- End-to-end accuracy evaluation using LLM-as-a-Judge
- Inference and token usage statistics
- Tool-calling patterns and failure analysis
- Planning behavior through discovery trajectory tracking
- Interactive visualizations
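A minimal LLM-as-a-Judge sketch against the OpenRouter chat-completions endpoint; the model name, prompt wording, function names, and `OPENROUTER_API_KEY` variable are all illustrative assumptions, not the Space's actual judge:

```python
import json
import os
import urllib.request

def build_judge_request(diagnosis: str, ground_truth: str,
                        model: str = "openai/gpt-4o-mini") -> dict:
    """Build a chat-completions payload asking the model to grade a diagnosis.
    Prompt and model are placeholders for illustration."""
    prompt = (
        "You are grading an SRE agent's incident diagnosis.\n"
        f"Ground-truth root cause: {ground_truth}\n"
        f"Agent diagnosis: {diagnosis}\n"
        "Answer with a single word: CORRECT or INCORRECT."
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def judge(diagnosis: str, ground_truth: str) -> str:
    """POST the payload to OpenRouter and return the model's verdict."""
    payload = build_judge_request(diagnosis, ground_truth)
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```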

## Requirements

- **API Key**: an OpenRouter API key for LLM inference (set as a Hugging Face Space secret)
- **Storage**: Small (default)

## Project Structure

```
.
├── START_HERE.md                   # Start here for a quick guide
├── README.md                       # This file
├── download_run_scenario.ipynb     # Run agents on scenarios
├── evaluation.ipynb                # Evaluate agent performance
├── Dockerfile                      # Container setup
├── requirements.txt                # Python dependencies
├── start_server.sh                 # JupyterLab startup script
└── analysis_src/                   # Analysis modules
    ├── utils.py
    ├── extract_consistency_data.py
    ├── extract_discovery_trajectory.py
    ├── extract_inference_data.py
    ├── extract_tool_failures.py
    └── ...
```

## Related Resources

## Citation

If you use ITBench-Lite in your research, please cite:

```bibtex
@article{itbench2025,
  title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
  author={IBM Research},
  journal={arXiv preprint arXiv:2502.05352},
  year={2025},
  url={https://arxiv.org/pdf/2502.05352}
}
```