# HF-Agent Eval

Rubric-based evaluation pipeline implementing the [Rubrics as Rewards](https://arxiv.org/abs/2507.17746) paper (RaR-Explicit formula).

## Components

| Component | Purpose | Long Term Goal |
|-----------|---------|----------------|
| **`generate_rubrics.py`** | Generates instance-specific evaluation criteria (7-20 weighted rubrics) from QA pairs using LLM, following the RaR paper methodology | Improve rubric quality with few-shot examples, domain-specific templates, and iterative refinement |
| **`rubric_eval.py`** | Scores responses using RaR-Explicit formula: checks each criterion independently via LLM judge, computes weighted normalized score | Support batch evaluation, caching, and alternative scoring formulas (RaR-Holistic) |
| **`task.py`** | Defines Inspect AI task `hf-benchmark-with-rubrics` that wires dataset, solver, and rubric scorer into a single evaluation pipeline | Add more task variants for different benchmarks (code generation, tool use, multi-turn) |
| **`solvers.py`** | Registry of solver implementations (`hf_agent`, `claude_code`, `claude_code+hf_mcp`) that can be swapped via CLI args | Expand solver library to benchmark more agents (OpenAI Codex, Gemini, open-source agents) |
| **`hf_agent_connector.py`** | Lightweight bridge that spins up the hf-agent stack (tools, MCP, LiteLLM loop) and returns the final assistant response | Enable streaming, intermediate step logging, and cost tracking per evaluation |
| **`leaderboard.py`** | Utilities to build records and append scores to a HuggingFace dataset for tracking performance over time | Add score breakdowns, visualizations, and automatic regression detection |
| **`run_eval_with_leaderboard.py`** | CLI wrapper that runs `inspect eval`, parses scores from logs, and pushes results to the leaderboard dataset | Support scheduled CI runs, PR-gated benchmarks, and multi-dataset aggregation |
| **`hf_io.py`** | Helper utilities for pushing DataFrames to HuggingFace Hub | Extend with dataset versioning and diff tracking |
| **`models.py`** | Shared Pydantic models for evaluation data structures | Centralize all eval schemas for consistency across components |

## Pipeline

```
QA pairs → generate_rubrics.py → run `inspect eval eval/task.py@hf-benchmark-with-rubrics` → scores
```

### 1. Generate Rubrics (if not already generated)

Creates instance-specific evaluation criteria from question + reference answer.

```bash
python eval/generate_rubrics.py \
    --infile qa_pairs.jsonl \
    --outfile qa_rubrics.jsonl \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --push-to-hub akseljoonas/hf-agent-benchmark@rubrics
```

**Input format:**
```json
{"question": "...", "solution": "...", "thread": [...]}
```

**Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3-4, Optional: +1-2, Pitfall: -1 to -2)
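For illustration, a record in `qa_rubrics.jsonl` might look like the following (field names and criteria are illustrative, not the exact schema):

```json
{
  "question": "...",
  "solution": "...",
  "rubrics": [
    {"criterion": "States the correct final answer", "category": "Essential", "weight": 5},
    {"criterion": "Explains the key intermediate step", "category": "Important", "weight": 3},
    {"criterion": "Links to relevant documentation", "category": "Optional", "weight": 1},
    {"criterion": "Recommends a deprecated API", "category": "Pitfall", "weight": -2}
  ]
}
```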

### 2. Response evaluation

Files:  
- `eval/hf_agent_connector.py` contains a lightweight bridge that spins up
  the existing hf-agent stack in `agent/` (tools, MCP, LiteLLM loop) and returns the assistant reply.
- `eval/solvers.py` keeps the solver implementations (e.g. `hf_agent`,
  `claude_code`). If additional solvers are needed, register them there and pass
  `-T solver_name=<name>` to swap them in without touching the task.
- `eval/task.py` registers `hf-benchmark-with-rubrics`, which wires
  the dataset, solver, and rubric scorer into a single Inspect task and does the eval.
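Registering a new solver amounts to adding an entry to the module's registry. The exact mechanism lives in `eval/solvers.py`; the sketch below only illustrates the pattern (the `SOLVERS` dict and `register_solver` helper are assumptions, not the actual code):

```python
# Minimal sketch of a solver registry (illustrative, not the actual
# eval/solvers.py implementation).
SOLVERS = {}

def register_solver(name):
    """Register a solver factory under a CLI-selectable name."""
    def decorator(factory):
        SOLVERS[name] = factory
        return factory
    return decorator

@register_solver("my_agent")
def my_agent(**kwargs):
    # A real solver would wrap an Inspect AI solve() coroutine; here we
    # just echo the kwargs to keep the sketch self-contained.
    return {"solver": "my_agent", "kwargs": kwargs}
```

With a pattern like this, a new solver becomes selectable via `-T solver_name=my_agent` without touching the task definition.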

### Running the hf-agent (implemented in `agent/`; the `-T` args are optional)
```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
  -T dataset_name=akseljoonas/hf-agent-rubrics \
  -T dataset_split=train \
  -T limit=25 \
  -T solver_name=hf_agent \
  -T solver_kwargs='{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
  --log-dir logs/inspect
```

Different benchmarks can be evaluated by defining and running a new task in `eval/task.py`.

### Running Claude Code headlessly

The `claude_code` solver shells out to the `claude` CLI (`claude -p ... --output-format json`)
so you can benchmark Claude Code without any interactive UI. Example (kwargs are optional):
```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
  -T solver_name=claude_code \
  -T solver_kwargs='{"allowed_tools":"Bash,Read","output_format":"json"}'
```

### Leaderboard

Scores can be pushed to a Hugging Face dataset automatically by wrapping the run
with `eval/run_eval_with_leaderboard.py` (it executes `inspect eval ...` under the hood
and only appends results when the command succeeds):

```bash
uv run python eval/run_eval_with_leaderboard.py \
  --hf-dataset akseljoonas/hf-agent-leaderboard \
  --hf-token $HF_TOKEN \
  --solver-name hf_agent \
  --solver-kwargs '{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
  --dataset akseljoonas/hf-agent-rubrics@train \
  --limit 25
```
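Each appended row is essentially a flat record of the run. A hypothetical record builder (the field names are assumptions, not the actual `eval/leaderboard.py` schema) might look like:

```python
from datetime import datetime, timezone

# Hypothetical leaderboard record builder; field names are illustrative,
# not the actual eval/leaderboard.py schema.
def build_record(solver_name, dataset, mean_score, n_samples):
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "solver": solver_name,
        "dataset": dataset,
        "mean_score": round(mean_score, 4),
        "n_samples": n_samples,
    }
```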

## Scoring (implemented in `eval/rubric_eval.py`)

Scoring follows the RaR-Explicit formula: `score = Σ(weight × satisfied) / Σ(positive_weights)`.

The score is normalized to [0, 1] and clipped if pitfalls make it negative.
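In code, the formula reduces to a few lines. The sketch below assumes each criterion is a `(weight, satisfied)` pair; it is a minimal illustration, not the actual `rubric_eval.py` implementation:

```python
# Minimal sketch of the RaR-Explicit score: weighted sum of satisfied
# criteria, normalized by the sum of positive weights, clipped to [0, 1].
def rar_explicit_score(criteria):
    total = sum(w for w, satisfied in criteria if satisfied)
    positive = sum(w for w, _ in criteria if w > 0)
    score = total / positive if positive else 0.0
    return max(0.0, min(1.0, score))

# Example: Essential (+5) met, Important (+3) missed, Pitfall (-2)
# triggered → (5 - 2) / (5 + 3) = 0.375
```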