---
title: ITBench-Lite-Space
sdk: docker
pinned: false
license: apache-2.0
tags:
  - jupyterlab
  - ai-agents
  - benchmark
  - sre
  - finops
suggested_storage: small
---
# ITBench-Lite-Space
Interactive JupyterLab environment for analyzing and evaluating AI agents on real-world IT automation tasks using the ITBench-Lite benchmark.
## First Time Here?

→ Open `START_HERE.md` in JupyterLab for a quick-start guide!
Or jump directly to:
- download_run_scenario.ipynb - Run agents on scenarios
- evaluation.ipynb - Evaluate and analyze performance
## About
This Hugging Face Space provides an interactive notebook environment for:
- Running AI agents on IT automation scenarios including:
- SRE (Site Reliability Engineering): Fault localization through incident diagnosis
- FinOps (Financial Operations): Cost anomaly analysis for IT spend management
- Analyzing agent behavior through detailed trajectory analysis
- Evaluating performance using LLM-as-a-Judge metrics
- Measuring consistency across multiple trial runs
The ITBench-Lite benchmark includes 50 scenarios (35 SRE + 15 FinOps) with comprehensive observability data including logs, traces, metrics, and Kubernetes events.
## Getting Started

### 1. Duplicate This Space First (If You Haven't Already!)
To use this Space with your own API key:
- Click the **⋮** menu at the top → **Duplicate this Space**
- Choose a name for your duplicated Space
- Wait for it to build (this may take a few minutes)
### 2. Set Up Your API Key
Once you have your own copy:
- Go to your Space's Settings tab
- Navigate to Repository secrets
- Add a new secret:
  - Name: `OPENROUTER_API_KEY`
  - Value: Your OpenRouter API key (get one at https://openrouter.ai)

The API key will be automatically available as an environment variable in all notebooks.
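Inside a notebook, the secret can then be read straight from the environment. A minimal sketch (the helper name is illustrative, not part of the Space's code):

```python
import os


def get_openrouter_key() -> str:
    """Read the OpenRouter API key injected by the Space secret."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set. Add it under "
            "Settings -> Repository secrets and restart the Space."
        )
    return key
```

Failing fast with a clear message here saves debugging time later, since a missing secret otherwise surfaces only as an opaque authentication error from the LLM provider.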
### 3. Access JupyterLab
Once your duplicated Space is running:
- Open your Space URL in your browser
- You'll see a JupyterLab interface (no password required)
### 4. Available Notebooks
- download_run_scenario.ipynb: Download scenarios and run agents interactively
- evaluation.ipynb: Comprehensive evaluation and analysis of agent performance
## Features

### Agent Evaluation Metrics
- Root Cause Entity Detection: Precision, Recall, F1 scores for identifying problem sources
- Consistency Analysis: Pass@k and Majority@k metrics across multiple trials
- Trajectory Analysis: Inference counts, tool usage patterns, failure rates
- Discovery Pipeline: Track how agents explore and identify root causes
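As an illustration, the entity-level precision/recall/F1 scores can be computed set-wise. This is a hypothetical sketch; the actual matching logic in `analysis_src/` may normalize entity names or use LLM-as-a-Judge matching:

```python
def entity_prf1(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 over root-cause entity names."""
    tp = len(predicted & gold)  # true positives: entities found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, if an agent flags `{"checkout-service", "payment-db"}` and the gold root cause is `{"payment-db"}`, precision is 0.5, recall is 1.0, and F1 is 2/3.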
### Analysis Capabilities
- End-to-end accuracy evaluation using LLM-as-a-Judge
- Inference and token usage statistics
- Tool calling patterns and failure analysis
- Planning behavior through discovery trajectory tracking
- Interactive visualizations
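The consistency metrics over multiple trials (Pass@k, Majority@k) can be sketched as below. These are generic definitions (the standard unbiased Pass@k estimator and a simple first-k majority vote), not necessarily the exact implementations used in the evaluation notebook:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k trials
    sampled (without replacement) from n trials, of which c are
    correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect trials exist, so any sample of k
        # must contain at least one correct trial.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_at_k(trial_results: list[bool], k: int) -> bool:
    """Majority@k over the first k trials: True when more than
    half of them succeeded."""
    window = trial_results[:k]
    return sum(window) > k / 2
```

For instance, with 5 trials of which 2 succeeded, `pass_at_k(5, 2, 1)` gives 0.4 (the per-trial success rate), while larger k values approach 1.0.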
## Requirements
- API Keys: OpenRouter API key for LLM inference (set as a Hugging Face Space Secret)
- Storage: Small (default)
## Project Structure

```
.
├── START_HERE.md                # Start here for quick guide
├── README.md                    # This file
├── download_run_scenario.ipynb  # Run agents on scenarios
├── evaluation.ipynb             # Evaluate agent performance
├── Dockerfile                   # Container setup
├── requirements.txt             # Python dependencies
├── start_server.sh              # JupyterLab startup script
└── analysis_src/                # Analysis modules
    ├── utils.py
    ├── extract_consistency_data.py
    ├── extract_discovery_trajectory.py
    ├── extract_inference_data.py
    ├── extract_tool_failures.py
    └── ...
```
## Related Resources
- MAST Research: Why Do Enterprise Agents Fail? Insights from IT-Bench using MAST - Research insights and agent failure analysis
- ITBench-Lite Dataset: ibm-research/ITBench-Lite - 50 scenarios across SRE and FinOps domains
- ITBench-Trajectories: ibm-research/ITBench-Trajectories - Complete agent execution traces with evaluation metrics
- ITBench GitHub: ITBench - Main repository
- ITBench-SRE-Agent: ITBench-SRE-Agent - Reference agent implementation (automatically cloned in Docker)
## Citation
If you use ITBench-Lite in your research, please cite:
```bibtex
@article{itbench2025,
  title={ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks},
  author={IBM Research},
  journal={arXiv preprint arXiv:2502.05352},
  year={2025},
  url={https://arxiv.org/pdf/2502.05352}
}
```