# Scripts: Eval runners and scoring
## Core scripts
- `score_hf_hub_community_challenges.py`
  - Runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py`
  - Runs the endpoint-coverage pack, which exercises capabilities the main challenge pack does not cover.
- `score_tool_routing_confusion.py`
- Tool-routing/confusion benchmark for one model.
- `run_tool_routing_batch.py`
- Batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py`
- A/B benchmark of tool description variants.
- `eval_hf_hub_prompt_ab.py`
- A/B benchmark for hf_hub_community prompt/card variants across challenge + coverage packs, with plots.
- `run_hf_hub_prompt_variant.py`
- Runs a single prompt variant (e.g. v3 only) on both challenge + coverage packs.
- `plot_tool_description_eval.py`
  - Generates plots from the A/B summary CSV.
## Input data files
- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`
## Quick examples
Score the community challenge pack:
```bash
python scripts/score_hf_hub_community_challenges.py
```
Score tool routing for a single model:
```bash
python scripts/score_tool_routing_confusion.py \
--model gpt-oss \
--agent hf_hub_community \
--agent-cards .fast-agent/tool-cards
```
Run the routing benchmark for multiple models:
```bash
python scripts/run_tool_routing_batch.py \
--models gpt-oss,gpt-5-mini \
--agent hf_hub_community \
--agent-cards .fast-agent/tool-cards
```
Score the endpoint-coverage pack:
```bash
python scripts/score_hf_hub_community_coverage.py \
--model gpt-oss \
--agent hf_hub_community \
--agent-cards .fast-agent/tool-cards
```
A/B-test two prompt/card variants:
```bash
python scripts/eval_hf_hub_prompt_ab.py \
--variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
--models gpt-oss
```
Run only one variant (example: v3):
```bash
python scripts/run_hf_hub_prompt_variant.py \
--variant-id v3 \
--cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
--model gpt-oss
```
Run the compact variant already included in the repository:
```bash
python scripts/eval_hf_hub_prompt_ab.py \
--variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
--models gpt-oss
```
To run description A/B against a non-default winner card set:
```bash
python scripts/eval_tool_description_ab.py \
--base-cards-dir /abs/path/to/winner/cards \
--models gpt-oss
```
## Convenience
- `run_all_evals.sh`
- Runs community scoring, routing batch, tool-description A/B, and plotting in sequence.
## fast-agent runtime notes (eval scripts)
- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq` / analysis workflows continue to work.
- Default raw result locations:
- Community challenges: `docs/hf_hub_community_eval_results/`
- Tool routing: `docs/tool_routing_eval/raw_results/`
- Tool description A/B: `docs/tool_description_eval/raw_results/`
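Because the raw results are session-history JSON, they can be queried directly with `jq`. A minimal sketch, assuming a hypothetical array-of-messages shape with `role` fields (the real fast-agent schema may differ; inspect an actual result file first):

```bash
# Hypothetical example file; the real session-history schema may differ.
cat > /tmp/example_history.json <<'EOF'
{"messages": [
  {"role": "user", "content": "find a dataset"},
  {"role": "assistant", "content": "calling hf_hub_community"}
]}
EOF

# Count assistant turns in the run
jq '[.messages[] | select(.role == "assistant")] | length' /tmp/example_history.json
```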
## Tool description A/B modes
- Default is **direct** (`--agent hf_hub_community`) so endpoint-level scoring remains available.
- Optional **indirect** mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.
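For example, the two modes differ only in the `--indirect` flag (model name illustrative):

```bash
# Direct mode (default): endpoint-level scoring remains available
python scripts/eval_tool_description_ab.py --models gpt-oss

# Indirect mode: calls are wrapped through the generated router card
python scripts/eval_tool_description_ab.py --indirect --models gpt-oss
```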