Results Index
This page is the quick index to generated evaluation outputs.
Community challenge eval
- Report (markdown): docs/hf_hub_community_challenge_report.md
- Report (json): docs/hf_hub_community_challenge_report.json
- Inputs: scripts/hf_hub_community_challenges.txt
- Generator: scripts/score_hf_hub_community_challenges.py
Community coverage eval
- Report (markdown): docs/hf_hub_community_coverage_report.md
- Report (json): docs/hf_hub_community_coverage_report.json
- Inputs: scripts/hf_hub_community_coverage_prompts.json
- Generator: scripts/score_hf_hub_community_coverage.py
Prompt/card A/B eval (community)
- Summary:
  - docs/hf_hub_prompt_ab/prompt_ab_summary.md
  - docs/hf_hub_prompt_ab/prompt_ab_summary.json
  - docs/hf_hub_prompt_ab/prompt_ab_summary.csv
- Visuals (if matplotlib available):
  - docs/hf_hub_prompt_ab/prompt_ab_composite_<model>.png
  - docs/hf_hub_prompt_ab/prompt_ab_scatter_tokens_vs_challenge.png
- Generator: scripts/eval_hf_hub_prompt_ab.py
Tool routing eval
- Batch summary:
  - docs/tool_routing_eval/tool_routing_batch_summary.md
  - docs/tool_routing_eval/tool_routing_batch_summary.json
  - docs/tool_routing_eval/tool_routing_batch_summary.csv
- Per-model reports: docs/tool_routing_eval/tool_routing_*.md (+ .json)
- Inputs:
  - scripts/tool_routing_challenges.txt
  - scripts/tool_routing_expected.json
- Generators:
  - scripts/score_tool_routing_confusion.py
  - scripts/run_tool_routing_batch.py
Tool description A/B eval
- Summary:
  - docs/tool_description_eval/tool_description_ab_summary.md
  - docs/tool_description_eval/tool_description_ab_summary.json
  - docs/tool_description_eval/tool_description_ab_summary.csv
- Detailed/pairwise:
  - docs/tool_description_eval/tool_description_ab_detailed.json
  - docs/tool_description_eval/tool_description_ab_pairwise.json
  - docs/tool_description_eval/tool_description_ab_pairwise.csv
  - docs/tool_description_eval/tool_description_ab_ranking.json
- Visuals:
  - docs/tool_description_eval/heat_first_call_ok.png
  - docs/tool_description_eval/heat_avg_score.png
  - docs/tool_description_eval/heat_avg_calls.png
  - docs/tool_description_eval/scatter_calls_vs_first_ok.png
  - docs/tool_description_eval/tool_description_interpretation.md
- Inputs:
  - scripts/hf_hub_community_challenges.txt
  - scripts/tool_description_variants.json
- Generators:
  - scripts/eval_tool_description_ab.py
  - scripts/plot_tool_description_eval.py
One-command regeneration
scripts/run_all_evals.sh
Optional environment overrides:
MODELS=gpt-oss,gpt-5-mini ROUTER_AGENT=hf_hub_community scripts/run_all_evals.sh
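If the regeneration needs to be triggered from Python rather than a shell, the same overrides can be passed through the process environment. The sketch below is only an illustration: `build_eval_env` and `run_all_evals` are hypothetical helper names, not part of the repository, and it assumes `scripts/run_all_evals.sh` is executable and reads `MODELS` (comma-separated) and `ROUTER_AGENT` exactly as in the command above.

```python
import os
import subprocess


def build_eval_env(models=None, router_agent=None, base_env=None):
    """Return a copy of the environment with the optional overrides applied.

    Hypothetical helper: it only sets the two variables shown in the
    command above and leaves everything else untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if models:
        # MODELS is a comma-separated list, e.g. "gpt-oss,gpt-5-mini".
        env["MODELS"] = ",".join(models)
    if router_agent:
        env["ROUTER_AGENT"] = router_agent
    return env


def run_all_evals(models=None, router_agent=None,
                  script="scripts/run_all_evals.sh"):
    """Run the regeneration script; raises CalledProcessError on failure."""
    env = build_eval_env(models, router_agent)
    return subprocess.run([script], env=env, check=True)
```

Calling `run_all_evals(["gpt-oss", "gpt-5-mini"], "hf_hub_community")` from the repository root would then be equivalent to the shell command with both overrides set; calling it with no arguments runs the script with the defaults it already has.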