# HF-Agent Eval
Rubric-based evaluation pipeline implementing the RaR-Explicit scoring formula from the [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper.
## Components
| Component | Purpose | Long Term Goal |
|-----------|---------|----------------|
| **`generate_rubrics.py`** | Generates instance-specific evaluation criteria (7-20 weighted rubrics) from QA pairs using an LLM, following the RaR paper methodology | Improve rubric quality with few-shot examples, domain-specific templates, and iterative refinement |
| **`rubric_eval.py`** | Scores responses using the RaR-Explicit formula: checks each criterion independently via an LLM judge, then computes a weighted normalized score | Support batch evaluation, caching, and alternative scoring formulas (RaR-Holistic) |
| **`task.py`** | Defines Inspect AI task `hf-benchmark-with-rubrics` that wires dataset, solver, and rubric scorer into a single evaluation pipeline | Add more task variants for different benchmarks (code generation, tool use, multi-turn) |
| **`solvers.py`** | Registry of solver implementations (`hf_agent`, `claude_code`, `claude_code+hf_mcp`) that can be swapped via CLI args | Expand solver library to benchmark more agents (OpenAI Codex, Gemini, open-source agents) |
| **`hf_agent_connector.py`** | Lightweight bridge that spins up the hf-agent stack (tools, MCP, LiteLLM loop) and returns the final assistant response | Enable streaming, intermediate step logging, and cost tracking per evaluation |
| **`leaderboard.py`** | Utilities to build records and append scores to a HuggingFace dataset for tracking performance over time | Add score breakdowns, visualizations, and automatic regression detection |
| **`run_eval_with_leaderboard.py`** | CLI wrapper that runs `inspect eval`, parses scores from logs, and pushes results to the leaderboard dataset | Support scheduled CI runs, PR-gated benchmarks, and multi-dataset aggregation |
| **`hf_io.py`** | Helper utilities for pushing DataFrames to HuggingFace Hub | Extend with dataset versioning and diff tracking |
| **`models.py`** | Shared Pydantic models for evaluation data structures | Centralize all eval schemas for consistency across components |
## Pipeline
```
QA pairs → generate_rubrics.py → run `inspect eval eval/task.py@hf-benchmark-with-rubrics` → scores
```
### 1. Generate Rubrics (if not already generated)
Creates instance-specific evaluation criteria from question + reference answer.
```bash
python eval/generate_rubrics.py \
--infile qa_pairs.jsonl \
--outfile qa_rubrics.jsonl \
--model anthropic/claude-sonnet-4-5-20250929 \
--push-to-hub akseljoonas/hf-agent-benchmark@rubrics
```
**Input format:**
```json
{"question": "...", "solution": "...", "thread": [...]}
```
**Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
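For orientation, a single generated criterion might look roughly like the sketch below; the field names are illustrative assumptions, not the exact schema emitted by `generate_rubrics.py`.
```python
# Illustrative only: a plausible shape for one rubric entry in qa_rubrics.jsonl.
# Field names are assumptions, not the exact output schema of generate_rubrics.py.
example_criterion = {
    "criterion": "States that the model requires the `transformers` library",
    "category": "Essential",  # Essential / Important / Optional / Pitfall
    "weight": 5,              # Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2
}
```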
### 2. Response evaluation
Files:
- `eval/hf_agent_connector.py` contains a lightweight bridge that spins up
the existing hf-agent stack in `agent/` (tools, MCP, LiteLLM loop) and returns the assistant reply.
- `eval/solvers.py` keeps the solver implementations (e.g. `hf_agent`,
`claude_code`). If additional solvers are needed, register them there and pass
`-T solver_name=<name>` to swap them in without touching the task (see the sketch after this list).
- `eval/task.py` registers `hf-benchmark-with-rubrics`, which wires
the dataset, solver, and rubric scorer into a single Inspect AI task that `inspect eval` runs.
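The snippet below is a minimal sketch of adding an extra solver, assuming solvers are ordinary Inspect AI solvers looked up by name; the actual registration mechanism in `eval/solvers.py` may differ.
```python
# Sketch of a custom solver, assuming eval/solvers.py resolves solver_name
# through a simple name -> factory mapping (hypothetical; check the real module).
from inspect_ai.solver import Generate, TaskState, solver


@solver
def my_agent():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Replace this with a call into your agent; deferring to the default
        # model generation keeps the sketch runnable.
        return await generate(state)

    return solve


# Hypothetical registry entry so `-T solver_name=my_agent` can find the solver.
SOLVERS = {"my_agent": my_agent}
```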
### Running the hf-agent (implemented in `agent/`)
All `-T` args below are optional.
```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
-T dataset_name=akseljoonas/hf-agent-rubrics \
-T dataset_split=train \
-T limit=25 \
-T solver_name=hf_agent \
-T solver_kwargs='{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
--log-dir logs/inspect
```
Other benchmarks can be evaluated by defining a new task in `eval/task.py` and running it the same way.
### Running Claude Code headlessly
The `claude_code` solver shells out to the `claude` CLI (`claude -p ... --output-format json`),
so you can benchmark Claude Code without any interactive UI. Example (kwargs are optional):
```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
-T solver_name=claude_code \
-T solver_kwargs='{"allowed_tools":"Bash,Read","output_format":"json"}'
```
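Under the hood this amounts to a non-interactive CLI call. A minimal sketch of such a call, using only the flags mentioned above (the real `claude_code` solver may assemble the command differently):
```python
# Sketch of a headless Claude Code invocation; the actual claude_code solver
# in eval/solvers.py may build and parse the command differently.
import json
import subprocess

cmd = ["claude", "-p", "Summarize the question and answer it", "--output-format", "json"]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
result = json.loads(proc.stdout)  # JSON envelope printed by `claude --output-format json`
```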
### Leaderboard
Scores can be pushed to a Hugging Face dataset automatically by wrapping the run
with `eval/run_eval_with_leaderboard.py` (it executes `inspect eval ...` under the hood
and only appends results when the command succeeds):
```bash
uv run python eval/run_eval_with_leaderboard.py \
--hf-dataset akseljoonas/hf-agent-leaderboard \
--hf-token $HF_TOKEN \
--solver-name hf_agent \
--solver-kwargs '{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
--dataset akseljoonas/hf-agent-rubrics@train \
--limit 25
```
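For reference, appending a record by hand (rather than through the wrapper) could look like the sketch below. It uses the `datasets` library directly; the column names are illustrative, and `eval/hf_io.py` / `eval/leaderboard.py` may do this differently.
```python
# Sketch of appending one result record to the leaderboard dataset.
# Column names are assumptions; the record must match the existing split's schema.
from datasets import Dataset, concatenate_datasets, load_dataset

record = {
    "solver_name": "hf_agent",
    "dataset": "akseljoonas/hf-agent-rubrics",
    "mean_score": 0.72,  # hypothetical value parsed from the inspect logs
}

existing = load_dataset("akseljoonas/hf-agent-leaderboard", split="train")
updated = concatenate_datasets([existing, Dataset.from_list([record])])
updated.push_to_hub("akseljoonas/hf-agent-leaderboard", split="train")
```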
## Scoring (implemented in `eval/rubric_eval.py`)
Scoring follows the RaR-Explicit formula: `score = Σ(weight × satisfied) / Σ(positive_weights)`.
The result is normalized to [0, 1] and clipped at 0 if pitfalls push it negative.
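A minimal sketch of that formula, assuming each judged criterion carries a signed weight and a boolean verdict from the LLM judge (field names are illustrative; see `eval/rubric_eval.py` for the real implementation):
```python
# RaR-Explicit: score = Σ(weight × satisfied) / Σ(positive_weights), clipped at 0.
def rar_explicit_score(items: list[dict]) -> float:
    total = sum(item["weight"] * item["satisfied"] for item in items)
    positive = sum(item["weight"] for item in items if item["weight"] > 0)
    return max(0.0, total / positive) if positive else 0.0


example = [
    {"criterion": "Names the correct library", "weight": 5, "satisfied": True},
    {"criterion": "Links to the model card", "weight": 3, "satisfied": False},
    {"criterion": "Hallucinates an API endpoint", "weight": -2, "satisfied": True},
]
print(rar_explicit_score(example))  # (5 - 2) / (5 + 3) = 0.375
```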