# Results Index

This page is a quick index of the generated evaluation outputs and the scripts that produce them.

## Community challenge eval

- Report (markdown): `docs/hf_hub_community_challenge_report.md`
- Report (json): `docs/hf_hub_community_challenge_report.json`
- Inputs: `scripts/hf_hub_community_challenges.txt`
- Generator: `scripts/score_hf_hub_community_challenges.py` (example invocation below)
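
A minimal regeneration sketch, assuming the generator runs under the repository's Python environment and picks up the inputs listed above; any command-line flags are defined in the script itself. The same pattern applies to the other single-generator evals below.

```bash
# Hypothetical single-eval regeneration; check the script for its actual arguments.
python scripts/score_hf_hub_community_challenges.py
# Expected outputs, per the index above:
#   docs/hf_hub_community_challenge_report.md
#   docs/hf_hub_community_challenge_report.json
```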

## Community coverage eval

- Report (markdown): `docs/hf_hub_community_coverage_report.md`
- Report (json): `docs/hf_hub_community_coverage_report.json`
- Inputs: `scripts/hf_hub_community_coverage_prompts.json`
- Generator: `scripts/score_hf_hub_community_coverage.py`

## Prompt/card A/B eval (community)

- Summary:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.md`
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.json`
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.csv`
- Visuals (if matplotlib available):
  - `docs/hf_hub_prompt_ab/prompt_ab_composite_<model>.png`
  - `docs/hf_hub_prompt_ab/prompt_ab_scatter_tokens_vs_challenge.png`
- Generator:
  - `scripts/eval_hf_hub_prompt_ab.py`

## Tool routing eval

- Batch summary:
  - `docs/tool_routing_eval/tool_routing_batch_summary.md`
  - `docs/tool_routing_eval/tool_routing_batch_summary.json`
  - `docs/tool_routing_eval/tool_routing_batch_summary.csv`
- Per-model reports: `docs/tool_routing_eval/tool_routing_*.md` (+ `.json`)
- Inputs:
  - `scripts/tool_routing_challenges.txt`
  - `scripts/tool_routing_expected.json`
- Generators (example run below):
  - `scripts/score_tool_routing_confusion.py`
  - `scripts/run_tool_routing_batch.py`
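
A sketch of the assumed two-step flow: the batch runner produces the per-model reports, and the confusion scorer aggregates them into the batch summary. Argument lists are omitted because they are defined by each script's own CLI.

```bash
# Hypothetical order of operations; check each script for its actual arguments.
python scripts/run_tool_routing_batch.py        # per-model reports: docs/tool_routing_eval/tool_routing_*.md (+ .json)
python scripts/score_tool_routing_confusion.py  # batch summary: .md / .json / .csv
```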

## Tool description A/B eval

- Summary:
  - `docs/tool_description_eval/tool_description_ab_summary.md`
  - `docs/tool_description_eval/tool_description_ab_summary.json`
  - `docs/tool_description_eval/tool_description_ab_summary.csv`
- Detailed/pairwise:
  - `docs/tool_description_eval/tool_description_ab_detailed.json`
  - `docs/tool_description_eval/tool_description_ab_pairwise.json`
  - `docs/tool_description_eval/tool_description_ab_pairwise.csv`
  - `docs/tool_description_eval/tool_description_ab_ranking.json`
- Visuals:
  - `docs/tool_description_eval/heat_first_call_ok.png`
  - `docs/tool_description_eval/heat_avg_score.png`
  - `docs/tool_description_eval/heat_avg_calls.png`
  - `docs/tool_description_eval/scatter_calls_vs_first_ok.png`
  - `docs/tool_description_eval/tool_description_interpretation.md`
- Inputs:
  - `scripts/hf_hub_community_challenges.txt`
  - `scripts/tool_description_variants.json`
- Generators (example run below):
  - `scripts/eval_tool_description_ab.py`
  - `scripts/plot_tool_description_eval.py`
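
A sketch of the assumed order: the A/B eval writes the summary, detailed, pairwise, and ranking files, and the plotting script then renders the heatmaps and scatter plot (presumably requiring matplotlib, as with the other visuals). Exact arguments live in the scripts.

```bash
# Hypothetical two-step regeneration; check each script for its actual arguments.
python scripts/eval_tool_description_ab.py      # summary / detailed / pairwise / ranking outputs
python scripts/plot_tool_description_eval.py    # heatmap and scatter PNGs
```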

---

## One-command regeneration

```bash
scripts/run_all_evals.sh
```

Optional environment overrides:

```bash
MODELS=gpt-oss,gpt-5-mini ROUTER_AGENT=hf_hub_community scripts/run_all_evals.sh
```