
# HF-Agent Eval

Rubric-based evaluation pipeline implementing the RaR-Explicit formula from the Rubrics as Rewards (RaR) paper.

## Components

| Component | Purpose | Long Term Goal |
|---|---|---|
| `generate_rubrics.py` | Generates instance-specific evaluation criteria (7-20 weighted rubrics) from QA pairs using an LLM, following the RaR paper methodology | Improve rubric quality with few-shot examples, domain-specific templates, and iterative refinement |
| `rubric_eval.py` | Scores responses using the RaR-Explicit formula: checks each criterion independently via an LLM judge and computes a weighted, normalized score | Support batch evaluation, caching, and alternative scoring formulas (RaR-Holistic) |
| `task.py` | Defines the Inspect AI task `hf-benchmark-with-rubrics`, which wires dataset, solver, and rubric scorer into a single evaluation pipeline | Add more task variants for different benchmarks (code generation, tool use, multi-turn) |
| `solvers.py` | Registry of solver implementations (`hf_agent`, `claude_code`, `claude_code+hf_mcp`) that can be swapped via CLI args | Expand solver library to benchmark more agents (OpenAI Codex, Gemini, open-source agents) |
| `hf_agent_connector.py` | Lightweight bridge that spins up the hf-agent stack (tools, MCP, LiteLLM loop) and returns the final assistant response | Enable streaming, intermediate step logging, and cost tracking per evaluation |
| `leaderboard.py` | Utilities to build records and append scores to a Hugging Face dataset for tracking performance over time | Add score breakdowns, visualizations, and automatic regression detection |
| `run_eval_with_leaderboard.py` | CLI wrapper that runs `inspect eval`, parses scores from logs, and pushes results to the leaderboard dataset | Support scheduled CI runs, PR-gated benchmarks, and multi-dataset aggregation |
| `hf_io.py` | Helper utilities for pushing DataFrames to the Hugging Face Hub | Extend with dataset versioning and diff tracking |
| `models.py` | Shared Pydantic models for evaluation data structures | Centralize all eval schemas for consistency across components |

## Pipeline

QA pairs → `generate_rubrics.py` → run `inspect eval eval/task.py@hf-benchmark-with-rubrics` → scores

### 1. Generate Rubrics (if not already generated)

Creates instance-specific evaluation criteria from question + reference answer.

```bash
python eval/generate_rubrics.py \
    --infile qa_pairs.jsonl \
    --outfile qa_rubrics.jsonl \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --push-to-hub akseljoonas/hf-agent-benchmark@rubrics
```

Input format:

{"question": "...", "solution": "...", "thread": [...]}

Output: 7-20 weighted criteria per question (Essential: +5, Important: +3 to +4, Optional: +1 to +2, Pitfall: -1 to -2).
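
To make the output concrete, a single rubric record might look roughly like the snippet below. The field names are illustrative only; the actual schema is defined in `eval/models.py`.

```python
# Illustrative shape of one generated rubric record (field names are
# assumptions; see eval/models.py for the real Pydantic schema).
example_record = {
    "question": "How do I push a fine-tuned model to the Hugging Face Hub?",
    "solution": "...",
    "rubrics": [
        {"criterion": "Mentions huggingface_hub / push_to_hub", "weight": 5},   # Essential
        {"criterion": "Includes an authenticated login step", "weight": 3},     # Important
        {"criterion": "Links to the relevant Hub documentation", "weight": 1},  # Optional
        {"criterion": "Suggests hard-coding the access token", "weight": -2},   # Pitfall
    ],
}
```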

### 2. Response evaluation

Files:

- `eval/hf_agent_connector.py` contains a lightweight bridge that spins up the existing hf-agent stack in `agent/` (tools, MCP, LiteLLM loop) and returns the assistant reply.
- `eval/solvers.py` holds the solver implementations (e.g. `hf_agent`, `claude_code`). If additional solvers are needed, register them there and pass `-T solver_name=<name>` to swap them in without touching the task (see the sketch after this list).
- `eval/task.py` registers `hf-benchmark-with-rubrics`, which wires the dataset, solver, and rubric scorer into a single Inspect task and runs the evaluation.
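
For illustration, a new solver added to `eval/solvers.py` might look roughly like this, assuming solvers follow Inspect AI's `@solver` pattern and are looked up by name; the repo's actual registry mechanism may differ, and the stub answer below stands in for a real agent call.

```python
# Hypothetical solver for eval/solvers.py; the registry/lookup details in the
# repo may differ from this sketch.
from inspect_ai.model import ModelOutput
from inspect_ai.solver import Generate, Solver, TaskState, solver


@solver
def my_custom_agent(max_iterations: int = 10) -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Run the agent on the question and record its final answer so the
        # rubric scorer can grade it. Replace the stub with a real agent call
        # (e.g. the hf-agent bridge).
        answer = f"(stub answer for) {state.input_text[:100]}"
        state.output = ModelOutput.from_content(model="my-custom-agent", content=answer)
        return state

    return solve
```

Once registered, it could then be selected with `-T solver_name=my_custom_agent` as described above.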

**Running the hf-agent** (implemented in `agent/`; the `-T` arguments below are optional):

```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
  -T dataset_name=akseljoonas/hf-agent-rubrics \
  -T dataset_split=train \
  -T limit=25 \
  -T solver_name=hf_agent \
  -T solver_kwargs='{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
  --log-dir logs/inspect
```

Different benchmarks can be evaluated by defining a new task in `eval/task.py` and running it the same way, as sketched below.
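
A minimal sketch of such a task variant, assuming a Hugging Face dataset with `question`/`solution` columns; the dataset name, field mapping, and scorer wiring here are illustrative, not the repo's actual code.

```python
# Hypothetical task variant for eval/task.py; mirror hf-benchmark-with-rubrics
# for the real solver and rubric-scorer wiring.
from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset


@task
def my_new_benchmark(dataset_name: str = "org/my-benchmark", limit: int | None = None) -> Task:
    dataset = hf_dataset(
        dataset_name,
        split="train",
        sample_fields=FieldSpec(input="question", target="solution"),
        limit=limit,
    )
    # Attach the solver registry and rubric scorer the same way the existing
    # task does (omitted here to keep the sketch short).
    return Task(dataset=dataset)
```

It would then be run with `uv run inspect eval eval/task.py@my_new_benchmark`.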

**Running Claude Code headlessly**

The `claude_code` solver shells out to the `claude` CLI (`claude -p ... --output-format json`), so you can benchmark Claude Code without any interactive UI. Example (the kwargs are optional):

```bash
uv run inspect eval eval/task.py@hf-benchmark-with-rubrics \
  -T solver_name=claude_code \
  -T solver_kwargs='{"allowed_tools":"Bash,Read","output_format":"json"}'
```
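
For reference, a headless call to the CLI looks roughly like the sketch below. This is not the repo's actual implementation; only `-p` and `--output-format json` are taken from the command above, and the JSON result field is an assumption.

```python
# Rough sketch of shelling out to the claude CLI headlessly and reading the
# JSON result; the exact output schema depends on the CLI version.
import json
import subprocess


def ask_claude_code(prompt: str) -> str:
    proc = subprocess.run(
        ["claude", "-p", prompt, "--output-format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    payload = json.loads(proc.stdout)
    # Fall back to the raw stdout if the expected key is missing.
    return payload.get("result", proc.stdout)
```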

## Leaderboard

Scores can be pushed to a Hugging Face dataset automatically by wrapping the run with `eval/run_eval_with_leaderboard.py` (it executes `inspect eval ...` under the hood and only appends results when the command succeeds):

```bash
uv run python eval/run_eval_with_leaderboard.py \
  --hf-dataset akseljoonas/hf-agent-leaderboard \
  --hf-token $HF_TOKEN \
  --solver-name hf_agent \
  --solver-kwargs '{"config_path":"agent/config_mcp_example.json","max_iterations":10}' \
  --dataset akseljoonas/hf-agent-rubrics@train \
  --limit 25
```
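
Conceptually, the leaderboard update appends one record per run to the dataset. A minimal sketch of that idea using the `datasets` library (record fields are illustrative; the real logic lives in `eval/leaderboard.py` and `eval/hf_io.py`):

```python
# Minimal sketch of appending a run's score to a leaderboard dataset on the
# Hub; field names are illustrative, not the repo's actual schema.
from datetime import datetime, timezone

from datasets import Dataset, concatenate_datasets, load_dataset

record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "solver": "hf_agent",
    "dataset": "akseljoonas/hf-agent-rubrics",
    "mean_score": 0.0,  # filled in from the parsed inspect eval logs
}

existing = load_dataset("akseljoonas/hf-agent-leaderboard", split="train")
updated = concatenate_datasets([existing, Dataset.from_list([record])])
updated.push_to_hub("akseljoonas/hf-agent-leaderboard")
```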

## Scoring

The scoring is implemented in `eval/rubric_eval.py` and is based on the RaR-Explicit formula: `score = Σ(weight × satisfied) / Σ(positive_weights)`.

The score is normalized to [0, 1] and clipped to 0 if negative pitfall weights push it below zero.
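
A minimal sketch of the computation (the actual implementation in `eval/rubric_eval.py` may differ in details):

```python
# RaR-Explicit scoring sketch: satisfied criteria contribute their weight,
# only positive weights form the denominator, and the result is clipped to [0, 1].
def rar_explicit_score(criteria: list[tuple[float, bool]]) -> float:
    """criteria: (weight, satisfied) pairs as judged per criterion."""
    earned = sum(weight for weight, satisfied in criteria if satisfied)
    max_positive = sum(weight for weight, _ in criteria if weight > 0)
    return max(0.0, min(1.0, earned / max_positive))


# Example: two essentials met (+5, +5), one important missed (+3), one pitfall
# triggered (-2) -> (5 + 5 - 2) / (5 + 5 + 3) ≈ 0.62
print(rar_explicit_score([(5, True), (5, True), (3, False), (-2, True)]))
```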