# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py`
  - Runs and scores the HF Hub community challenge pack.

- `score_hf_hub_community_coverage.py`
  - Runs the endpoint-coverage pack, which exercises capabilities not covered by the main challenge pack.

- `score_tool_routing_confusion.py`
  - Runs the tool-routing confusion benchmark for a single model.

- `run_tool_routing_batch.py`
  - Batch wrapper around `score_tool_routing_confusion.py`.

- `eval_tool_description_ab.py`
  - A/B benchmark of tool description variants.

- `eval_hf_hub_prompt_ab.py`
  - A/B benchmark of hf_hub_community prompt/card variants across the challenge and coverage packs, with plots.

- `run_hf_hub_prompt_variant.py`
  - Runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.

- `plot_tool_description_eval.py`
  - Generates plots from the A/B summary CSV.

## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`

## Quick examples

```bash
python scripts/score_hf_hub_community_challenges.py
```

```bash
python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Run the compact variant that is already checked into the repository:

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

To run the description A/B against a non-default winner card set:

```bash
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```

## Convenience

- `run_all_evals.sh`
  - Runs community scoring, routing batch, tool-description A/B, and plotting in sequence.
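
A typical invocation, assuming you run it from the repository root (the script's own flags, if any, are not documented here):

```bash
bash scripts/run_all_evals.sh
```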

## fast-agent runtime notes (eval scripts)

- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq` / analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
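
The session-history JSON in these directories can be post-processed with `jq` or a few lines of Python. Below is a minimal sketch that tallies tool-call names across a history; the message schema used here (a list of messages with `role` and an optional `tool_calls` array) is an assumption for illustration, not the documented fast-agent format, so adjust the field names to match your actual output files.

```python
import json

# Inline sample standing in for a file from docs/tool_routing_eval/raw_results/.
# NOTE: this message schema is an assumption, not the real fast-agent format.
raw = json.dumps([
    {"role": "user", "content": "List trending models"},
    {"role": "assistant", "tool_calls": [{"name": "hf_hub_community"}]},
    {"role": "tool", "content": "{...}"},
    {"role": "assistant", "content": "Here are the trending models."},
])

def count_tool_calls(history_json: str) -> dict:
    """Tally tool-call names across one session history."""
    counts: dict = {}
    for msg in json.loads(history_json):
        for call in msg.get("tool_calls", []):
            name = call.get("name", "<unknown>")
            counts[name] = counts.get(name, 0) + 1
    return counts

print(count_tool_calls(raw))  # e.g. {'hf_hub_community': 1}
```

In practice you would read each file with `json.load(open(path))` instead of the inline sample and aggregate counts per model or per variant.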

## Tool description A/B modes

- Default is **direct** (`--agent hf_hub_community`) so endpoint-level scoring remains available.
- Optional **indirect** mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.
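
An indirect-mode run might look like the following; flags other than `--indirect` mirror the direct examples above, and the exact combination should be treated as an assumption:

```bash
python scripts/eval_tool_description_ab.py \
  --indirect \
  --models gpt-oss
```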