# Results Index
This page is a quick index of the generated evaluation outputs: each section lists the reports, the input files, and the script(s) that regenerate them.
## Community challenge eval
- Report (markdown): `docs/hf_hub_community_challenge_report.md`
- Report (json): `docs/hf_hub_community_challenge_report.json`
- Inputs: `scripts/hf_hub_community_challenges.txt`
- Generator: `scripts/score_hf_hub_community_challenges.py`
## Community coverage eval
- Report (markdown): `docs/hf_hub_community_coverage_report.md`
- Report (json): `docs/hf_hub_community_coverage_report.json`
- Inputs: `scripts/hf_hub_community_coverage_prompts.json`
- Generator: `scripts/score_hf_hub_community_coverage.py`
## Prompt/card A/B eval (community)
- Summary:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.md`
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.json`
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.csv`
- Visuals (if matplotlib is available):
  - `docs/hf_hub_prompt_ab/prompt_ab_composite_<model>.png`
  - `docs/hf_hub_prompt_ab/prompt_ab_scatter_tokens_vs_challenge.png`
- Generator: `scripts/eval_hf_hub_prompt_ab.py`
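The `.json` summaries above are convenient for programmatic comparison. A minimal sketch of loading one and printing per-model results; note the record layout below is hypothetical (inlined here as sample data), so check `prompt_ab_summary.json` for the actual field names:

```python
import json

# Hypothetical records standing in for prompt_ab_summary.json;
# the real schema may use different key names.
sample = json.dumps([
    {"model": "gpt-oss", "prompt": "v3", "challenge_pass_rate": 0.82},
    {"model": "gpt-5-mini", "prompt": "v3", "challenge_pass_rate": 0.91},
])

# In practice: rows = json.loads(Path("docs/hf_hub_prompt_ab/prompt_ab_summary.json").read_text())
rows = json.loads(sample)
lines = [
    f"{row['model']} ({row['prompt']}): {row['challenge_pass_rate']:.0%}"
    for row in rows
]
print("\n".join(lines))
```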
## Tool routing eval
- Batch summary:
  - `docs/tool_routing_eval/tool_routing_batch_summary.md`
  - `docs/tool_routing_eval/tool_routing_batch_summary.json`
  - `docs/tool_routing_eval/tool_routing_batch_summary.csv`
- Per-model reports: `docs/tool_routing_eval/tool_routing_*.md` (+ `.json`)
- Inputs:
  - `scripts/tool_routing_challenges.txt`
  - `scripts/tool_routing_expected.json`
- Generators:
  - `scripts/score_tool_routing_confusion.py`
  - `scripts/run_tool_routing_batch.py`
## Tool description A/B eval
- Summary:
  - `docs/tool_description_eval/tool_description_ab_summary.md`
  - `docs/tool_description_eval/tool_description_ab_summary.json`
  - `docs/tool_description_eval/tool_description_ab_summary.csv`
- Detailed/pairwise:
  - `docs/tool_description_eval/tool_description_ab_detailed.json`
  - `docs/tool_description_eval/tool_description_ab_pairwise.json`
  - `docs/tool_description_eval/tool_description_ab_pairwise.csv`
  - `docs/tool_description_eval/tool_description_ab_ranking.json`
- Visuals:
  - `docs/tool_description_eval/heat_first_call_ok.png`
  - `docs/tool_description_eval/heat_avg_score.png`
  - `docs/tool_description_eval/heat_avg_calls.png`
  - `docs/tool_description_eval/scatter_calls_vs_first_ok.png`
- Interpretation: `docs/tool_description_eval/tool_description_interpretation.md`
- Inputs:
  - `scripts/hf_hub_community_challenges.txt`
  - `scripts/tool_description_variants.json`
- Generators:
  - `scripts/eval_tool_description_ab.py`
  - `scripts/plot_tool_description_eval.py`
---
## One-command regeneration
```bash
scripts/run_all_evals.sh
```
Optional environment overrides:
```bash
MODELS=gpt-oss,gpt-5-mini ROUTER_AGENT=hf_hub_community scripts/run_all_evals.sh
```