
Results Index

This page is the quick index to generated evaluation outputs.

Community challenge eval

  • Report (markdown): docs/hf_hub_community_challenge_report.md
  • Report (json): docs/hf_hub_community_challenge_report.json
  • Inputs: scripts/hf_hub_community_challenges.txt
  • Generator: scripts/score_hf_hub_community_challenges.py

Community coverage eval

  • Report (markdown): docs/hf_hub_community_coverage_report.md
  • Report (json): docs/hf_hub_community_coverage_report.json
  • Inputs: scripts/hf_hub_community_coverage_prompts.json
  • Generator: scripts/score_hf_hub_community_coverage.py

Prompt/card A/B eval (community)

  • Summary:
    • docs/hf_hub_prompt_ab/prompt_ab_summary.md
    • docs/hf_hub_prompt_ab/prompt_ab_summary.json
    • docs/hf_hub_prompt_ab/prompt_ab_summary.csv
  • Visuals (generated only if matplotlib is available):
    • docs/hf_hub_prompt_ab/prompt_ab_composite_<model>.png
    • docs/hf_hub_prompt_ab/prompt_ab_scatter_tokens_vs_challenge.png
  • Generator:
    • scripts/eval_hf_hub_prompt_ab.py

Tool routing eval

  • Batch summary:
    • docs/tool_routing_eval/tool_routing_batch_summary.md
    • docs/tool_routing_eval/tool_routing_batch_summary.json
    • docs/tool_routing_eval/tool_routing_batch_summary.csv
  • Per-model reports: docs/tool_routing_eval/tool_routing_*.md (+ .json)
  • Inputs:
    • scripts/tool_routing_challenges.txt
    • scripts/tool_routing_expected.json
  • Generators:
    • scripts/score_tool_routing_confusion.py
    • scripts/run_tool_routing_batch.py
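
Note that the per-model pattern `tool_routing_*.md` also matches the batch summary file `tool_routing_batch_summary.md`. The following is a minimal sketch of one way to enumerate only the per-model reports and check for their JSON counterparts. The helper name `per_model_reports` is hypothetical and not part of the repository, and the sketch assumes that each model's `.md` and `.json` reports share the same filename stem.

```python
from pathlib import Path

# Stem of the batch summary, which is indexed separately from per-model reports.
BATCH_STEM = "tool_routing_batch_summary"


def per_model_reports(report_dir):
    """Return {stem: has_json} for per-model markdown reports in report_dir.

    Hypothetical helper: assumes each report's .json shares the .md stem.
    """
    found = {}
    for md_path in sorted(Path(report_dir).glob("tool_routing_*.md")):
        if md_path.stem == BATCH_STEM:
            continue  # the glob also matches the batch summary; skip it
        found[md_path.stem] = md_path.with_suffix(".json").exists()
    return found
```

Calling `per_model_reports("docs/tool_routing_eval")` from the repository root would then list each per-model report stem together with a flag indicating whether its JSON file is present.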

Tool description A/B eval

  • Summary:
    • docs/tool_description_eval/tool_description_ab_summary.md
    • docs/tool_description_eval/tool_description_ab_summary.json
    • docs/tool_description_eval/tool_description_ab_summary.csv
  • Detailed/pairwise:
    • docs/tool_description_eval/tool_description_ab_detailed.json
    • docs/tool_description_eval/tool_description_ab_pairwise.json
    • docs/tool_description_eval/tool_description_ab_pairwise.csv
    • docs/tool_description_eval/tool_description_ab_ranking.json
  • Visuals:
    • docs/tool_description_eval/heat_first_call_ok.png
    • docs/tool_description_eval/heat_avg_score.png
    • docs/tool_description_eval/heat_avg_calls.png
    • docs/tool_description_eval/scatter_calls_vs_first_ok.png
    • docs/tool_description_eval/tool_description_interpretation.md
  • Inputs:
    • scripts/hf_hub_community_challenges.txt
    • scripts/tool_description_variants.json
  • Generators:
    • scripts/eval_tool_description_ab.py
    • scripts/plot_tool_description_eval.py

One-command regeneration

scripts/run_all_evals.sh

Optional environment overrides:

MODELS=gpt-oss,gpt-5-mini ROUTER_AGENT=hf_hub_community scripts/run_all_evals.sh
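
After regenerating, it can be useful to confirm that the indexed summary files actually exist. The sketch below is one possible check, not part of the repository: the function name `missing_outputs` is hypothetical, the path list is a subset copied from this index, and it assumes the check is run from the repository root. Whether `scripts/run_all_evals.sh` regenerates every file listed here is also an assumption.

```python
from pathlib import Path

# A subset of the summary outputs listed in this index; extend as needed.
EXPECTED_OUTPUTS = [
    "docs/hf_hub_community_challenge_report.md",
    "docs/hf_hub_community_challenge_report.json",
    "docs/hf_hub_community_coverage_report.md",
    "docs/hf_hub_community_coverage_report.json",
    "docs/hf_hub_prompt_ab/prompt_ab_summary.md",
    "docs/tool_routing_eval/tool_routing_batch_summary.md",
    "docs/tool_description_eval/tool_description_ab_summary.md",
]


def missing_outputs(repo_root=".", expected=EXPECTED_OUTPUTS):
    """Return the expected paths (relative to repo_root) that are not files."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_outputs()` means every listed summary is present; any returned paths point to evals whose outputs were not written.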