add clean tool-description evaluation charts and summary
- .gitattributes +1 -0
- docs/tool_description_eval/clean_release_20260209/SUMMARY.md +19 -0
- docs/tool_description_eval/clean_release_20260209/bar_avg_calls_by_model.png +0 -0
- docs/tool_description_eval/clean_release_20260209/bar_avg_exchange_chars_by_model.png +0 -0
- docs/tool_description_eval/clean_release_20260209/bar_avg_score_by_model.png +0 -0
- docs/tool_description_eval/clean_release_20260209/bar_first_call_ok_by_model.png +0 -0
- docs/tool_description_eval/clean_release_20260209/heat_avg_calls.png +0 -0
- docs/tool_description_eval/clean_release_20260209/heat_avg_exchange_chars.png +0 -0
- docs/tool_description_eval/clean_release_20260209/heat_avg_score.png +0 -0
- docs/tool_description_eval/clean_release_20260209/heat_first_call_ok.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_answer_norm.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_answer_pass.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_avg_delegation_chars.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_avg_exchange_chars.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_avg_tool_calls.png +0 -0
- docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png +3 -0
- docs/tool_description_eval/clean_release_20260209/overall_variant_pareto_chart.png +0 -0
- docs/tool_description_eval/clean_release_20260209/overall_variant_summary_chart.png +0 -0
- docs/tool_description_eval/clean_release_20260209/scatter_calls_vs_first_ok.png +0 -0
- docs/tool_description_eval/clean_release_20260209/scatter_exchange_vs_first_ok.png +0 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.csv +19 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.json +308 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.csv +19 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.json +146 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.csv +19 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.json +344 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.md +35 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_interpretation.md +28 -0
- docs/tool_description_eval/clean_release_20260209/tool_description_model_comparison.md +21 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png filter=lfs diff=lfs merge=lfs -text
docs/tool_description_eval/clean_release_20260209/SUMMARY.md
ADDED
@@ -0,0 +1,19 @@
+# Clean Description-Test Summary (release view)
+
+Filtered to variants: **minimal, structured, verbose_noisy** and models excluding **grok-4-fast**.
+
+## Variant ranking (cross-model means)
+
+| Rank | Variant | Mean composite | Mean answer | Mean pass | Mean exchange chars | Mean tool calls |
+|---:|---|---:|---:|---:|---:|---:|
+| 1 | structured | 0.8577 | 0.8667 | 0.8542 | 1126.4 | 0.958 |
+| 2 | minimal | 0.8499 | 0.8646 | 0.8125 | 1399.1 | 1.125 |
+| 3 | verbose_noisy | 0.8440 | 0.8500 | 0.8125 | 1128.7 | 0.958 |
+
+**Recommended deployed default:** `structured` (best mean composite).
+
+## Key charts
+
+- `overall_variant_summary_chart.png` (single-glance summary)
+- `overall_variant_pareto_chart.png` (quality vs chattiness)
+- `model_compare_answer_norm.png` and `model_compare_avg_exchange_chars.png` (per-model comparisons)
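As a sanity check on the ranking table in `SUMMARY.md`, the cross-model means can be recomputed from the per-model rows of `tool_description_dashboard.csv` in this commit. The sketch below is not part of the commit: it assumes the means are plain unweighted averages over the six models, and the names `ROWS` and `variant_means` are hypothetical. How the `composite` column itself is derived is not stated in the source, so it is only averaged here, not recomputed.

```python
from collections import defaultdict

# (variant, model, composite, normalized_answer_score), copied from
# tool_description_dashboard.csv. Assumption: SUMMARY.md averages these
# values with equal weight per model.
ROWS = [
    ("structured", "minimax", 0.9124970675498518, 0.925),
    ("structured", "kimi25", 0.8847939692269589, 0.9125),
    ("structured", "gpt-5-mini", 0.8698229841021267, 0.9125),
    ("structured", "glm", 0.8476221127091086, 0.85),
    ("structured", "kimi", 0.8286831055123738, 0.8125),
    ("structured", "haiku", 0.8028070781779222, 0.7875),
    ("minimal", "kimi25", 0.9098262016061369, 0.95),
    ("minimal", "haiku", 0.893802846893479, 0.9125),
    ("minimal", "glm", 0.8690085907309072, 0.9125),
    ("minimal", "gpt-5-mini", 0.8672005597782839, 0.9125),
    ("minimal", "minimax", 0.8449169144656289, 0.8375),
    ("minimal", "kimi", 0.7146312913112515, 0.6625),
    ("verbose_noisy", "gpt-5-mini", 0.8932066849458153, 0.9125),
    ("verbose_noisy", "kimi25", 0.8751926706739843, 0.9125),
    ("verbose_noisy", "haiku", 0.8701383595426448, 0.8875),
    ("verbose_noisy", "minimax", 0.8685013030595123, 0.8625),
    ("verbose_noisy", "glm", 0.7886269253343062, 0.7875),
    ("verbose_noisy", "kimi", 0.7683643984660663, 0.7375),
]


def variant_means(rows):
    """Average composite and answer score per variant, rounded to 4 places."""
    buckets = defaultdict(list)
    for variant, _model, composite, answer in rows:
        buckets[variant].append((composite, answer))
    means = {}
    for variant, values in buckets.items():
        n = len(values)
        means[variant] = {
            "composite": round(sum(v[0] for v in values) / n, 4),
            "answer": round(sum(v[1] for v in values) / n, 4),
        }
    return means


MEANS = variant_means(ROWS)
# Rank variants by mean composite, best first.
RANKING = sorted(MEANS, key=lambda v: MEANS[v]["composite"], reverse=True)

if __name__ == "__main__":
    for rank, variant in enumerate(RANKING, start=1):
        print(rank, variant, MEANS[variant])
```

Under that equal-weight assumption, the recomputed composite and answer means agree with the table to four decimal places, and the resulting order matches the published ranking.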
docs/tool_description_eval/clean_release_20260209/bar_avg_calls_by_model.png
ADDED
docs/tool_description_eval/clean_release_20260209/bar_avg_exchange_chars_by_model.png
ADDED
docs/tool_description_eval/clean_release_20260209/bar_avg_score_by_model.png
ADDED
docs/tool_description_eval/clean_release_20260209/bar_first_call_ok_by_model.png
ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_calls.png
ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_exchange_chars.png
ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_score.png
ADDED
docs/tool_description_eval/clean_release_20260209/heat_first_call_ok.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_answer_norm.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_answer_pass.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_delegation_chars.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_exchange_chars.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_tool_calls.png
ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png
ADDED
docs/tool_description_eval/clean_release_20260209/overall_variant_pareto_chart.png
ADDED
docs/tool_description_eval/clean_release_20260209/overall_variant_summary_chart.png
ADDED
docs/tool_description_eval/clean_release_20260209/scatter_calls_vs_first_ok.png
ADDED
docs/tool_description_eval/clean_release_20260209/scatter_exchange_vs_first_ok.png
ADDED
docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.csv
ADDED
@@ -0,0 +1,19 @@
+variant,model,actual_model,n_cases,success_rate,tool_use_rate,avg_tool_calls,avg_endpoint_calls,avg_tool_request_chars,avg_tool_response_chars,avg_tool_exchange_chars,total_tool_exchange_chars,avg_delegation_chars,first_call_ok_rate,avg_score_total
+minimal,glm,zai-org/GLM-4.7,8,1.0,1.0,1.875,0.0,196.8,1997.0,2193.8,17550,91.88,,
+minimal,gpt-5-mini,gpt-5-mini,8,1.0,0.75,0.875,0.0,246.8,2041.9,2288.6,18309,277.67,,
+minimal,haiku,claude-haiku-4-5,8,1.0,0.875,1.25,0.0,100.6,1004.2,1104.9,8839,66.43,,
+minimal,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.0,0.0,99.5,316.8,416.2,3330,84.25,,
+minimal,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.0,0.0,129.0,1545.4,1674.4,13395,113.14,,
+minimal,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.75,0.75,0.0,123.5,593.5,717.0,5736,149.0,,
+structured,glm,zai-org/GLM-4.7,8,1.0,0.75,1.125,0.0,160.5,805.2,965.8,7726,151.17,,
+structured,gpt-5-mini,gpt-5-mini,8,1.0,0.875,1.0,0.0,329.1,1822.8,2151.9,17215,314.43,,
+structured,haiku,claude-haiku-4-5,8,1.0,0.75,0.875,0.0,69.5,717.2,786.8,6294,68.5,,
+structured,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.125,0.0,96.5,500.0,596.5,4772,75.38,,
+structured,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.0,0.0,112.1,1348.1,1460.2,11682,100.71,,
+structured,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.625,0.625,0.0,187.2,610.0,797.2,6378,280.0,,
+verbose_noisy,glm,zai-org/GLM-4.7,8,1.0,0.875,1.0,0.0,189.1,1115.0,1304.1,10433,168.29,,
+verbose_noisy,gpt-5-mini,gpt-5-mini,8,1.0,0.625,0.75,0.0,282.6,844.6,1127.2,9018,343.8,,
+verbose_noisy,haiku,claude-haiku-4-5,8,1.0,0.875,1.0,0.0,124.8,1119.9,1244.6,9957,118.86,,
+verbose_noisy,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.0,0.0,99.5,507.0,606.5,4852,84.5,,
+verbose_noisy,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.25,0.0,213.8,1673.6,1887.4,15099,159.71,,
+verbose_noisy,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.75,0.75,0.0,121.9,480.4,602.2,4818,145.5,,
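The character columns in `tool_description_ab_summary.filtered.csv` look internally related: `avg_tool_exchange_chars` appears to be request plus response characters, and `total_tool_exchange_chars / n_cases` appears to reproduce the same average. That relationship is inferred from the numbers, not documented in the commit, so the following is only a hedged consistency check over the six `minimal` rows. The tolerances and the helper name `check_row` are assumptions chosen to absorb one-decimal rounding.

```python
# (variant, model, n_cases, avg_tool_request_chars, avg_tool_response_chars,
#  avg_tool_exchange_chars, total_tool_exchange_chars), copied from the CSV.
ROWS = [
    ("minimal", "glm", 8, 196.8, 1997.0, 2193.8, 17550),
    ("minimal", "gpt-5-mini", 8, 246.8, 2041.9, 2288.6, 18309),
    ("minimal", "haiku", 8, 100.6, 1004.2, 1104.9, 8839),
    ("minimal", "kimi", 8, 99.5, 316.8, 416.2, 3330),
    ("minimal", "kimi25", 8, 129.0, 1545.4, 1674.4, 13395),
    ("minimal", "minimax", 8, 123.5, 593.5, 717.0, 5736),
]


def check_row(row, sum_tol=0.2, mean_tol=0.1):
    """Return True if the row is consistent with the two assumed identities.

    Assumed (not documented) identities:
      request + response ~= exchange        (each value rounded to 0.1)
      total_exchange / n_cases ~= exchange  (average rounded to 0.1)
    """
    _variant, _model, n_cases, request, response, exchange, total = row
    sum_ok = abs((request + response) - exchange) <= sum_tol
    mean_ok = abs(total / n_cases - exchange) <= mean_tol
    return sum_ok and mean_ok


RESULTS = {(row[0], row[1]): check_row(row) for row in ROWS}

if __name__ == "__main__":
    for key, ok in RESULTS.items():
        print(key, "consistent" if ok else "MISMATCH")
```

A mismatch would not necessarily mean the data is wrong; it would only mean the assumed relationship between the columns does not hold for that row.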
docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.json
ADDED
@@ -0,0 +1,308 @@
+[
+  {
+    "variant": "minimal",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 196.8,
+    "avg_tool_response_chars": 1997.0,
+    "avg_tool_exchange_chars": 2193.8,
+    "total_tool_exchange_chars": 17550,
+    "avg_delegation_chars": 91.88,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 246.8,
+    "avg_tool_response_chars": 2041.9,
+    "avg_tool_exchange_chars": 2288.6,
+    "total_tool_exchange_chars": 18309,
+    "avg_delegation_chars": 277.67,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 100.6,
+    "avg_tool_response_chars": 1004.2,
+    "avg_tool_exchange_chars": 1104.9,
+    "total_tool_exchange_chars": 8839,
+    "avg_delegation_chars": 66.43,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 99.5,
+    "avg_tool_response_chars": 316.8,
+    "avg_tool_exchange_chars": 416.2,
+    "total_tool_exchange_chars": 3330,
+    "avg_delegation_chars": 84.25,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 129.0,
+    "avg_tool_response_chars": 1545.4,
+    "avg_tool_exchange_chars": 1674.4,
+    "total_tool_exchange_chars": 13395,
+    "avg_delegation_chars": 113.14,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 123.5,
+    "avg_tool_response_chars": 593.5,
+    "avg_tool_exchange_chars": 717.0,
+    "total_tool_exchange_chars": 5736,
+    "avg_delegation_chars": 149.0,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 1.125,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 160.5,
+    "avg_tool_response_chars": 805.2,
+    "avg_tool_exchange_chars": 965.8,
+    "total_tool_exchange_chars": 7726,
+    "avg_delegation_chars": 151.17,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 329.1,
+    "avg_tool_response_chars": 1822.8,
+    "avg_tool_exchange_chars": 2151.9,
+    "total_tool_exchange_chars": 17215,
+    "avg_delegation_chars": 314.43,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 69.5,
+    "avg_tool_response_chars": 717.2,
+    "avg_tool_exchange_chars": 786.8,
+    "total_tool_exchange_chars": 6294,
+    "avg_delegation_chars": 68.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.125,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 96.5,
+    "avg_tool_response_chars": 500.0,
+    "avg_tool_exchange_chars": 596.5,
+    "total_tool_exchange_chars": 4772,
+    "avg_delegation_chars": 75.38,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 112.1,
+    "avg_tool_response_chars": 1348.1,
+    "avg_tool_exchange_chars": 1460.2,
+    "total_tool_exchange_chars": 11682,
+    "avg_delegation_chars": 100.71,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.625,
+    "avg_tool_calls": 0.625,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 187.2,
+    "avg_tool_response_chars": 610.0,
+    "avg_tool_exchange_chars": 797.2,
+    "total_tool_exchange_chars": 6378,
+    "avg_delegation_chars": 280.0,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 189.1,
+    "avg_tool_response_chars": 1115.0,
+    "avg_tool_exchange_chars": 1304.1,
+    "total_tool_exchange_chars": 10433,
+    "avg_delegation_chars": 168.29,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.625,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 282.6,
+    "avg_tool_response_chars": 844.6,
+    "avg_tool_exchange_chars": 1127.2,
+    "total_tool_exchange_chars": 9018,
+    "avg_delegation_chars": 343.8,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 124.8,
+    "avg_tool_response_chars": 1119.9,
+    "avg_tool_exchange_chars": 1244.6,
+    "total_tool_exchange_chars": 9957,
+    "avg_delegation_chars": 118.86,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 99.5,
+    "avg_tool_response_chars": 507.0,
+    "avg_tool_exchange_chars": 606.5,
+    "total_tool_exchange_chars": 4852,
+    "avg_delegation_chars": 84.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 213.8,
+    "avg_tool_response_chars": 1673.6,
+    "avg_tool_exchange_chars": 1887.4,
+    "total_tool_exchange_chars": 15099,
+    "avg_delegation_chars": 159.71,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 121.9,
+    "avg_tool_response_chars": 480.4,
+    "avg_tool_exchange_chars": 602.2,
+    "total_tool_exchange_chars": 4818,
+    "avg_delegation_chars": 145.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  }
+]
docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.csv
ADDED
@@ -0,0 +1,19 @@
+variant,model,n_cases,answer_pass_rate,avg_answer_score,normalized_answer_score
+minimal,kimi25,8,1.0,9.5,0.95
+structured,minimax,8,1.0,9.25,0.925
+minimal,glm,8,0.875,9.125,0.9125
+minimal,gpt-5-mini,8,0.875,9.125,0.9125
+minimal,haiku,8,0.875,9.125,0.9125
+structured,gpt-5-mini,8,0.875,9.125,0.9125
+structured,kimi25,8,0.875,9.125,0.9125
+verbose_noisy,gpt-5-mini,8,0.875,9.125,0.9125
+verbose_noisy,kimi25,8,0.875,9.125,0.9125
+verbose_noisy,haiku,8,0.875,8.875,0.8875
+verbose_noisy,minimax,8,0.875,8.625,0.8625
+structured,glm,8,0.875,8.5,0.85
+minimal,minimax,8,0.75,8.375,0.8375
+structured,kimi,8,0.75,8.125,0.8125
+structured,haiku,8,0.75,7.875,0.7875
+verbose_noisy,glm,8,0.75,7.875,0.7875
+verbose_noisy,kimi,8,0.625,7.375,0.7375
+minimal,kimi,8,0.5,6.625,0.6625
docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.json
ADDED
@@ -0,0 +1,146 @@
+[
+  {
+    "variant": "minimal",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "avg_answer_score": 9.5,
+    "normalized_answer_score": 0.95
+  },
+  {
+    "variant": "structured",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "avg_answer_score": 9.25,
+    "normalized_answer_score": 0.925
+  },
+  {
+    "variant": "minimal",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "minimal",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "minimal",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "structured",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "structured",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.875,
+    "normalized_answer_score": 0.8875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.625,
+    "normalized_answer_score": 0.8625
+  },
+  {
+    "variant": "structured",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.5,
+    "normalized_answer_score": 0.85
+  },
+  {
+    "variant": "minimal",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 8.375,
+    "normalized_answer_score": 0.8375
+  },
+  {
+    "variant": "structured",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 8.125,
+    "normalized_answer_score": 0.8125
+  },
+  {
+    "variant": "structured",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 7.875,
+    "normalized_answer_score": 0.7875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 7.875,
+    "normalized_answer_score": 0.7875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.625,
+    "avg_answer_score": 7.375,
+    "normalized_answer_score": 0.7375
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.5,
+    "avg_answer_score": 6.625,
+    "normalized_answer_score": 0.6625
+  }
+]
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.csv
ADDED
@@ -0,0 +1,19 @@
+model,variant,actual_model,n_cases,answer_pass_rate,normalized_answer_score,first_call_ok_rate,avg_score_total,avg_tool_calls,avg_endpoint_calls,avg_tool_exchange_chars,avg_delegation_chars,avg_total_tokens,avg_input_tokens,avg_output_tokens,avg_tool_calls_reported,composite
+minimax,structured,MiniMaxAI/MiniMax-M2.1,8,1.0,0.925,,,0.625,0.0,797.2,280.0,1259.75,739.875,519.875,0.625,0.9124970675498518
+kimi25,minimal,moonshotai/Kimi-K2.5,8,1.0,0.95,,,1.0,0.0,1674.4,113.14,1276.25,819.375,456.875,1.0,0.9098262016061369
+haiku,minimal,claude-haiku-4-5,8,0.875,0.9125,,,1.25,0.0,1104.9,66.43,2977.75,2426.25,551.5,1.25,0.893802846893479
+gpt-5-mini,verbose_noisy,gpt-5-mini,8,0.875,0.9125,,,0.75,0.0,1127.2,343.8,1859.5,703.625,1155.875,0.75,0.8932066849458153
+kimi25,structured,moonshotai/Kimi-K2.5,8,0.875,0.9125,,,1.0,0.0,1460.2,100.71,1409.875,893.125,516.75,1.0,0.8847939692269589
+kimi25,verbose_noisy,moonshotai/Kimi-K2.5,8,0.875,0.9125,,,1.25,0.0,1887.4,159.71,1440.625,926.375,514.25,1.25,0.8751926706739843
+haiku,verbose_noisy,claude-haiku-4-5,8,0.875,0.8875,,,1.0,0.0,1244.6,118.86,2681.5,2122.375,559.125,1.0,0.8701383595426448
+gpt-5-mini,structured,gpt-5-mini,8,0.875,0.9125,,,1.0,0.0,2151.9,314.43,2052.625,1083.625,969.0,1.0,0.8698229841021267
+glm,minimal,zai-org/GLM-4.7,8,0.875,0.9125,,,1.875,0.0,2193.8,91.88,2309.125,1546.625,762.5,1.875,0.8690085907309072
+minimax,verbose_noisy,MiniMaxAI/MiniMax-M2.1,8,0.875,0.8625,,,0.75,0.0,602.2,145.5,1192.5,712.125,480.375,0.75,0.8685013030595123
+gpt-5-mini,minimal,gpt-5-mini,8,0.875,0.9125,,,0.875,0.0,2288.6,277.67,1987.0,1063.5,923.5,0.875,0.8672005597782839
+glm,structured,zai-org/GLM-4.7,8,0.875,0.85,,,1.125,0.0,965.8,151.17,1324.375,824.625,499.75,1.125,0.8476221127091086
+minimax,minimal,MiniMaxAI/MiniMax-M2.1,8,0.75,0.8375,,,0.75,0.0,717.0,149.0,1208.875,748.0,460.875,0.75,0.8449169144656289
+kimi,structured,moonshotai/Kimi-K2-Instruct-0905,8,0.75,0.8125,,,1.125,0.0,596.5,75.38,778.0,558.375,219.625,1.125,0.8286831055123738
+haiku,structured,claude-haiku-4-5,8,0.75,0.7875,,,0.875,0.0,786.8,68.5,2249.0,1818.75,430.25,0.875,0.8028070781779222
+glm,verbose_noisy,zai-org/GLM-4.7,8,0.75,0.7875,,,1.0,0.0,1304.1,168.29,1378.0,876.125,501.875,1.0,0.7886269253343062
+kimi,verbose_noisy,moonshotai/Kimi-K2-Instruct-0905,8,0.625,0.7375,,,1.0,0.0,606.5,84.5,658.875,475.125,183.75,1.0,0.7683643984660663
+kimi,minimal,moonshotai/Kimi-K2-Instruct-0905,8,0.5,0.6625,,,1.0,0.0,416.2,84.25,631.0,462.625,168.375,1.0,0.7146312913112515
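The commit ships Pareto charts of answer quality against exchange size, but the chart code is not part of this diff. The sketch below shows one plausible way such a front could be derived from `tool_description_dashboard.csv`. It assumes "better" means a higher `normalized_answer_score` and lower `avg_tool_exchange_chars`; the function name `pareto_front` and the tie handling are hypothetical, and the actual charts may use different axes or rules.

```python
# (model, variant, normalized_answer_score, avg_tool_exchange_chars),
# copied from tool_description_dashboard.csv.
ROWS = [
    ("minimax", "structured", 0.925, 797.2),
    ("kimi25", "minimal", 0.95, 1674.4),
    ("haiku", "minimal", 0.9125, 1104.9),
    ("gpt-5-mini", "verbose_noisy", 0.9125, 1127.2),
    ("kimi25", "structured", 0.9125, 1460.2),
    ("kimi25", "verbose_noisy", 0.9125, 1887.4),
    ("haiku", "verbose_noisy", 0.8875, 1244.6),
    ("gpt-5-mini", "structured", 0.9125, 2151.9),
    ("glm", "minimal", 0.9125, 2193.8),
    ("minimax", "verbose_noisy", 0.8625, 602.2),
    ("gpt-5-mini", "minimal", 0.9125, 2288.6),
    ("glm", "structured", 0.85, 965.8),
    ("minimax", "minimal", 0.8375, 717.0),
    ("kimi", "structured", 0.8125, 596.5),
    ("haiku", "structured", 0.7875, 786.8),
    ("glm", "verbose_noisy", 0.7875, 1304.1),
    ("kimi", "verbose_noisy", 0.7375, 606.5),
    ("kimi", "minimal", 0.6625, 416.2),
]


def pareto_front(rows):
    """Return rows not dominated on (lower exchange chars, higher answer score).

    Rows are scanned in order of increasing exchange size; a row joins the
    front only if its answer score strictly beats every cheaper row. Exact
    duplicates of a front point are dropped (an arbitrary tie rule).
    """
    front = []
    best_answer = float("-inf")
    for row in sorted(rows, key=lambda r: (r[3], -r[2])):
        if row[2] > best_answer:
            front.append(row)
            best_answer = row[2]
    return front


FRONT = [(model, variant) for model, variant, _answer, _chars in pareto_front(ROWS)]

if __name__ == "__main__":
    for model, variant, answer, chars in pareto_front(ROWS):
        print(f"{model:12s} {variant:14s} answer={answer:.4f} chars={chars:.1f}")
```

With this data the front runs from the cheapest configuration (`kimi` with `minimal`) up to the highest-scoring one (`kimi25` with `minimal`), with `minimax` with `structured` as the point that reaches a 0.925 answer score at under 800 exchange characters.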
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.json
ADDED
|
@@ -0,0 +1,344 @@
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"model": "minimax",
|
| 4 |
+
"variant": "structured",
|
| 5 |
+
"actual_model": "MiniMaxAI/MiniMax-M2.1",
|
| 6 |
+
"n_cases": 8,
|
| 7 |
+
"answer_pass_rate": 1.0,
|
| 8 |
+
"normalized_answer_score": 0.925,
|
| 9 |
+
"first_call_ok_rate": null,
|
| 10 |
+
"avg_score_total": null,
|
| 11 |
+
"avg_tool_calls": 0.625,
|
| 12 |
+
"avg_endpoint_calls": 0.0,
|
| 13 |
+
"avg_tool_exchange_chars": 797.2,
|
| 14 |
+
"avg_delegation_chars": 280.0,
|
| 15 |
+
"avg_total_tokens": 1259.75,
|
| 16 |
+
"avg_input_tokens": 739.875,
|
| 17 |
+
"avg_output_tokens": 519.875,
|
| 18 |
+
"avg_tool_calls_reported": 0.625,
|
| 19 |
+
"composite": 0.9124970675498518
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"model": "kimi25",
|
| 23 |
+
"variant": "minimal",
|
| 24 |
+
"actual_model": "moonshotai/Kimi-K2.5",
|
| 25 |
+
"n_cases": 8,
|
| 26 |
+
"answer_pass_rate": 1.0,
|
| 27 |
+
"normalized_answer_score": 0.95,
|
| 28 |
+
"first_call_ok_rate": null,
|
| 29 |
+
"avg_score_total": null,
|
| 30 |
+
"avg_tool_calls": 1.0,
|
| 31 |
+
"avg_endpoint_calls": 0.0,
|
| 32 |
+
"avg_tool_exchange_chars": 1674.4,
|
| 33 |
+
"avg_delegation_chars": 113.14,
|
| 34 |
+
"avg_total_tokens": 1276.25,
|
| 35 |
+
"avg_input_tokens": 819.375,
|
| 36 |
+
"avg_output_tokens": 456.875,
|
| 37 |
+
"avg_tool_calls_reported": 1.0,
|
| 38 |
+
"composite": 0.9098262016061369
|
| 39 |
+
},
|
| 40 |
+
{
|
| 41 |
+
"model": "haiku",
|
| 42 |
+
"variant": "minimal",
|
| 43 |
+
"actual_model": "claude-haiku-4-5",
|
| 44 |
+
"n_cases": 8,
|
| 45 |
+
"answer_pass_rate": 0.875,
|
| 46 |
+
"normalized_answer_score": 0.9125,
|
| 47 |
+
"first_call_ok_rate": null,
|
| 48 |
+
"avg_score_total": null,
|
| 49 |
+
"avg_tool_calls": 1.25,
|
| 50 |
+
"avg_endpoint_calls": 0.0,
|
| 51 |
+
"avg_tool_exchange_chars": 1104.9,
|
| 52 |
+
"avg_delegation_chars": 66.43,
|
| 53 |
+
"avg_total_tokens": 2977.75,
|
| 54 |
+
"avg_input_tokens": 2426.25,
|
| 55 |
+
"avg_output_tokens": 551.5,
|
| 56 |
+
"avg_tool_calls_reported": 1.25,
|
| 57 |
+
"composite": 0.893802846893479
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"model": "gpt-5-mini",
|
| 61 |
+
"variant": "verbose_noisy",
|
| 62 |
+
"actual_model": "gpt-5-mini",
|
| 63 |
+
"n_cases": 8,
|
| 64 |
+
"answer_pass_rate": 0.875,
|
| 65 |
+
"normalized_answer_score": 0.9125,
|
| 66 |
+
"first_call_ok_rate": null,
|
| 67 |
+
"avg_score_total": null,
|
| 68 |
+
"avg_tool_calls": 0.75,
|
| 69 |
+
"avg_endpoint_calls": 0.0,
|
| 70 |
+
"avg_tool_exchange_chars": 1127.2,
|
| 71 |
+
"avg_delegation_chars": 343.8,
|
| 72 |
+
"avg_total_tokens": 1859.5,
|
| 73 |
+
"avg_input_tokens": 703.625,
|
| 74 |
+
"avg_output_tokens": 1155.875,
|
| 75 |
+
"avg_tool_calls_reported": 0.75,
|
| 76 |
+
"composite": 0.8932066849458153
|
| 77 |
+
},
|
| 78 |
+
{
|
| 79 |
+
"model": "kimi25",
|
| 80 |
+
"variant": "structured",
|
| 81 |
+
"actual_model": "moonshotai/Kimi-K2.5",
|
| 82 |
+
"n_cases": 8,
|
| 83 |
+
"answer_pass_rate": 0.875,
|
| 84 |
+
"normalized_answer_score": 0.9125,
|
| 85 |
+
"first_call_ok_rate": null,
|
| 86 |
+
"avg_score_total": null,
|
| 87 |
+
"avg_tool_calls": 1.0,
|
| 88 |
+
"avg_endpoint_calls": 0.0,
|
| 89 |
+
"avg_tool_exchange_chars": 1460.2,
|
| 90 |
+
"avg_delegation_chars": 100.71,
|
| 91 |
+
"avg_total_tokens": 1409.875,
|
| 92 |
+
"avg_input_tokens": 893.125,
|
| 93 |
+
"avg_output_tokens": 516.75,
|
| 94 |
+
"avg_tool_calls_reported": 1.0,
|
| 95 |
+
"composite": 0.8847939692269589
|
| 96 |
+
},
|
| 97 |
+
{
|
| 98 |
+
"model": "kimi25",
|
| 99 |
+
"variant": "verbose_noisy",
|
| 100 |
+
"actual_model": "moonshotai/Kimi-K2.5",
|
| 101 |
+
"n_cases": 8,
|
| 102 |
+
"answer_pass_rate": 0.875,
|
| 103 |
+
"normalized_answer_score": 0.9125,
|
| 104 |
+
"first_call_ok_rate": null,
|
| 105 |
+
"avg_score_total": null,
|
| 106 |
+
"avg_tool_calls": 1.25,
|
| 107 |
+
"avg_endpoint_calls": 0.0,
|
| 108 |
+
"avg_tool_exchange_chars": 1887.4,
|
| 109 |
+
"avg_delegation_chars": 159.71,
|
| 110 |
+
"avg_total_tokens": 1440.625,
|
| 111 |
+
"avg_input_tokens": 926.375,
|
| 112 |
+
"avg_output_tokens": 514.25,
|
| 113 |
+
"avg_tool_calls_reported": 1.25,
|
| 114 |
+
"composite": 0.8751926706739843
|
| 115 |
+
},
|
| 116 |
+
{
|
| 117 |
+
"model": "haiku",
|
| 118 |
+
"variant": "verbose_noisy",
|
| 119 |
+
"actual_model": "claude-haiku-4-5",
|
| 120 |
+
"n_cases": 8,
|
| 121 |
+
"answer_pass_rate": 0.875,
|
| 122 |
+
"normalized_answer_score": 0.8875,
|
| 123 |
+
"first_call_ok_rate": null,
|
| 124 |
+
"avg_score_total": null,
|
| 125 |
+
"avg_tool_calls": 1.0,
|
| 126 |
+
"avg_endpoint_calls": 0.0,
|
| 127 |
+
"avg_tool_exchange_chars": 1244.6,
|
| 128 |
+
"avg_delegation_chars": 118.86,
|
| 129 |
+
"avg_total_tokens": 2681.5,
|
| 130 |
+
"avg_input_tokens": 2122.375,
|
| 131 |
+
"avg_output_tokens": 559.125,
|
| 132 |
+
"avg_tool_calls_reported": 1.0,
|
| 133 |
+
"composite": 0.8701383595426448
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"model": "gpt-5-mini",
|
| 137 |
+
"variant": "structured",
|
| 138 |
+
"actual_model": "gpt-5-mini",
|
| 139 |
+
"n_cases": 8,
|
| 140 |
+
"answer_pass_rate": 0.875,
|
| 141 |
+
"normalized_answer_score": 0.9125,
|
| 142 |
+
"first_call_ok_rate": null,
|
| 143 |
+
"avg_score_total": null,
|
| 144 |
+
"avg_tool_calls": 1.0,
|
| 145 |
+
"avg_endpoint_calls": 0.0,
|
| 146 |
+
"avg_tool_exchange_chars": 2151.9,
|
| 147 |
+
"avg_delegation_chars": 314.43,
|
| 148 |
+
"avg_total_tokens": 2052.625,
|
| 149 |
+
"avg_input_tokens": 1083.625,
|
| 150 |
+
"avg_output_tokens": 969.0,
|
| 151 |
+
"avg_tool_calls_reported": 1.0,
|
| 152 |
+
"composite": 0.8698229841021267
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"model": "glm",
|
| 156 |
+
"variant": "minimal",
|
| 157 |
+
"actual_model": "zai-org/GLM-4.7",
|
| 158 |
+
"n_cases": 8,
|
| 159 |
+
"answer_pass_rate": 0.875,
|
| 160 |
+
"normalized_answer_score": 0.9125,
|
| 161 |
+
"first_call_ok_rate": null,
|
| 162 |
+
"avg_score_total": null,
|
| 163 |
+
"avg_tool_calls": 1.875,
|
| 164 |
+
"avg_endpoint_calls": 0.0,
|
| 165 |
+
"avg_tool_exchange_chars": 2193.8,
|
| 166 |
+
"avg_delegation_chars": 91.88,
|
| 167 |
+
"avg_total_tokens": 2309.125,
|
| 168 |
+
"avg_input_tokens": 1546.625,
|
| 169 |
+
"avg_output_tokens": 762.5,
|
| 170 |
+
"avg_tool_calls_reported": 1.875,
|
| 171 |
+
"composite": 0.8690085907309072
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"model": "minimax",
|
| 175 |
+
"variant": "verbose_noisy",
|
| 176 |
+
"actual_model": "MiniMaxAI/MiniMax-M2.1",
|
| 177 |
+
"n_cases": 8,
|
| 178 |
+
"answer_pass_rate": 0.875,
|
| 179 |
+
"normalized_answer_score": 0.8625,
|
| 180 |
+
"first_call_ok_rate": null,
|
| 181 |
+
"avg_score_total": null,
|
| 182 |
+
"avg_tool_calls": 0.75,
|
| 183 |
+
"avg_endpoint_calls": 0.0,
|
| 184 |
+
"avg_tool_exchange_chars": 602.2,
|
| 185 |
+
"avg_delegation_chars": 145.5,
|
| 186 |
+
"avg_total_tokens": 1192.5,
|
| 187 |
+
"avg_input_tokens": 712.125,
|
| 188 |
+
"avg_output_tokens": 480.375,
|
| 189 |
+
"avg_tool_calls_reported": 0.75,
|
| 190 |
+
"composite": 0.8685013030595123
|
| 191 |
+
},
|
| 192 |
+
{
|
| 193 |
+
"model": "gpt-5-mini",
|
| 194 |
+
"variant": "minimal",
|
| 195 |
+
"actual_model": "gpt-5-mini",
|
| 196 |
+
"n_cases": 8,
|
| 197 |
+
"answer_pass_rate": 0.875,
|
| 198 |
+
"normalized_answer_score": 0.9125,
|
| 199 |
+
"first_call_ok_rate": null,
|
| 200 |
+
"avg_score_total": null,
|
| 201 |
+
"avg_tool_calls": 0.875,
|
| 202 |
+
"avg_endpoint_calls": 0.0,
|
| 203 |
+
"avg_tool_exchange_chars": 2288.6,
|
| 204 |
+
"avg_delegation_chars": 277.67,
|
| 205 |
+
"avg_total_tokens": 1987.0,
|
| 206 |
+
"avg_input_tokens": 1063.5,
|
| 207 |
+
"avg_output_tokens": 923.5,
|
| 208 |
+
"avg_tool_calls_reported": 0.875,
|
| 209 |
+
"composite": 0.8672005597782839
|
| 210 |
+
},
|
| 211 |
+
{
|
| 212 |
+
"model": "glm",
|
| 213 |
+
"variant": "structured",
|
| 214 |
+
"actual_model": "zai-org/GLM-4.7",
|
| 215 |
+
"n_cases": 8,
|
| 216 |
+
"answer_pass_rate": 0.875,
|
| 217 |
+
"normalized_answer_score": 0.85,
|
| 218 |
+
"first_call_ok_rate": null,
|
| 219 |
+
"avg_score_total": null,
|
| 220 |
+
"avg_tool_calls": 1.125,
|
| 221 |
+
"avg_endpoint_calls": 0.0,
|
| 222 |
+
"avg_tool_exchange_chars": 965.8,
|
| 223 |
+
"avg_delegation_chars": 151.17,
|
| 224 |
+
"avg_total_tokens": 1324.375,
|
| 225 |
+
"avg_input_tokens": 824.625,
|
| 226 |
+
"avg_output_tokens": 499.75,
|
| 227 |
+
"avg_tool_calls_reported": 1.125,
|
| 228 |
+
"composite": 0.8476221127091086
|
| 229 |
+
},
|
| 230 |
+
{
|
| 231 |
+
"model": "minimax",
|
| 232 |
+
"variant": "minimal",
|
| 233 |
+
"actual_model": "MiniMaxAI/MiniMax-M2.1",
|
| 234 |
+
"n_cases": 8,
|
| 235 |
+
"answer_pass_rate": 0.75,
|
| 236 |
+
"normalized_answer_score": 0.8375,
|
| 237 |
+
"first_call_ok_rate": null,
|
| 238 |
+
"avg_score_total": null,
|
| 239 |
+
"avg_tool_calls": 0.75,
|
| 240 |
+
"avg_endpoint_calls": 0.0,
|
| 241 |
+
"avg_tool_exchange_chars": 717.0,
|
| 242 |
+
"avg_delegation_chars": 149.0,
|
| 243 |
+
"avg_total_tokens": 1208.875,
|
| 244 |
+
"avg_input_tokens": 748.0,
|
| 245 |
+
"avg_output_tokens": 460.875,
|
| 246 |
+
"avg_tool_calls_reported": 0.75,
|
| 247 |
+
"composite": 0.8449169144656289
|
| 248 |
+
},
|
| 249 |
+
{
|
| 250 |
+
"model": "kimi",
|
| 251 |
+
"variant": "structured",
|
| 252 |
+
"actual_model": "moonshotai/Kimi-K2-Instruct-0905",
|
| 253 |
+
"n_cases": 8,
|
| 254 |
+
"answer_pass_rate": 0.75,
|
| 255 |
+
"normalized_answer_score": 0.8125,
|
| 256 |
+
"first_call_ok_rate": null,
|
| 257 |
+
"avg_score_total": null,
|
| 258 |
+
"avg_tool_calls": 1.125,
|
| 259 |
+
"avg_endpoint_calls": 0.0,
|
| 260 |
+
"avg_tool_exchange_chars": 596.5,
|
| 261 |
+
"avg_delegation_chars": 75.38,
|
| 262 |
+
"avg_total_tokens": 778.0,
|
| 263 |
+
"avg_input_tokens": 558.375,
|
| 264 |
+
"avg_output_tokens": 219.625,
|
| 265 |
+
"avg_tool_calls_reported": 1.125,
|
| 266 |
+
"composite": 0.8286831055123738
|
| 267 |
+
},
|
| 268 |
+
{
|
| 269 |
+
"model": "haiku",
|
| 270 |
+
"variant": "structured",
|
| 271 |
+
"actual_model": "claude-haiku-4-5",
|
| 272 |
+
"n_cases": 8,
|
| 273 |
+
"answer_pass_rate": 0.75,
|
| 274 |
+
"normalized_answer_score": 0.7875,
|
| 275 |
+
"first_call_ok_rate": null,
|
| 276 |
+
"avg_score_total": null,
|
| 277 |
+
"avg_tool_calls": 0.875,
|
| 278 |
+
"avg_endpoint_calls": 0.0,
|
| 279 |
+
"avg_tool_exchange_chars": 786.8,
|
| 280 |
+
"avg_delegation_chars": 68.5,
|
| 281 |
+
"avg_total_tokens": 2249.0,
|
| 282 |
+
"avg_input_tokens": 1818.75,
|
| 283 |
+
"avg_output_tokens": 430.25,
|
| 284 |
+
"avg_tool_calls_reported": 0.875,
|
| 285 |
+
"composite": 0.8028070781779222
|
| 286 |
+
},
|
| 287 |
+
{
|
| 288 |
+
"model": "glm",
|
| 289 |
+
"variant": "verbose_noisy",
|
| 290 |
+
"actual_model": "zai-org/GLM-4.7",
|
| 291 |
+
"n_cases": 8,
|
| 292 |
+
"answer_pass_rate": 0.75,
|
| 293 |
+
"normalized_answer_score": 0.7875,
|
| 294 |
+
"first_call_ok_rate": null,
|
| 295 |
+
"avg_score_total": null,
|
| 296 |
+
"avg_tool_calls": 1.0,
|
| 297 |
+
"avg_endpoint_calls": 0.0,
|
| 298 |
+
"avg_tool_exchange_chars": 1304.1,
|
| 299 |
+
"avg_delegation_chars": 168.29,
|
| 300 |
+
"avg_total_tokens": 1378.0,
|
| 301 |
+
"avg_input_tokens": 876.125,
|
| 302 |
+
"avg_output_tokens": 501.875,
|
| 303 |
+
"avg_tool_calls_reported": 1.0,
|
| 304 |
+
"composite": 0.7886269253343062
|
| 305 |
+
},
|
| 306 |
+
{
|
| 307 |
+
"model": "kimi",
|
| 308 |
+
"variant": "verbose_noisy",
|
| 309 |
+
"actual_model": "moonshotai/Kimi-K2-Instruct-0905",
|
| 310 |
+
"n_cases": 8,
|
| 311 |
+
"answer_pass_rate": 0.625,
|
| 312 |
+
"normalized_answer_score": 0.7375,
|
| 313 |
+
"first_call_ok_rate": null,
|
| 314 |
+
"avg_score_total": null,
|
| 315 |
+
"avg_tool_calls": 1.0,
|
| 316 |
+
"avg_endpoint_calls": 0.0,
|
| 317 |
+
"avg_tool_exchange_chars": 606.5,
|
| 318 |
+
"avg_delegation_chars": 84.5,
|
| 319 |
+
"avg_total_tokens": 658.875,
|
| 320 |
+
"avg_input_tokens": 475.125,
|
| 321 |
+
"avg_output_tokens": 183.75,
|
| 322 |
+
"avg_tool_calls_reported": 1.0,
|
| 323 |
+
"composite": 0.7683643984660663
|
| 324 |
+
},
|
| 325 |
+
{
|
| 326 |
+
"model": "kimi",
|
| 327 |
+
"variant": "minimal",
|
| 328 |
+
"actual_model": "moonshotai/Kimi-K2-Instruct-0905",
|
| 329 |
+
"n_cases": 8,
|
| 330 |
+
"answer_pass_rate": 0.5,
|
| 331 |
+
"normalized_answer_score": 0.6625,
|
| 332 |
+
"first_call_ok_rate": null,
|
| 333 |
+
"avg_score_total": null,
|
| 334 |
+
"avg_tool_calls": 1.0,
|
| 335 |
+
"avg_endpoint_calls": 0.0,
|
| 336 |
+
"avg_tool_exchange_chars": 416.2,
|
| 337 |
+
"avg_delegation_chars": 84.25,
|
| 338 |
+
"avg_total_tokens": 631.0,
|
| 339 |
+
"avg_input_tokens": 462.625,
|
| 340 |
+
"avg_output_tokens": 168.375,
|
| 341 |
+
"avg_tool_calls_reported": 1.0,
|
| 342 |
+
"composite": 0.7146312913112515
|
| 343 |
+
}
|
| 344 |
+
]
|
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.md
ADDED
|
@@ -0,0 +1,35 @@
| 1 |
+
# Tool Description Combined Dashboard
|
| 2 |
+
|
| 3 |
+
> Combines trajectory metrics, oracle-answer metrics, and token usage from raw results.
|
| 4 |
+
|
| 5 |
+
| Rank | Model | Variant | Answer norm | Answer pass | First OK | Avg score | Avg calls | Avg exchange chars | Avg tokens | Composite |
|
| 6 |
+
|---:|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
|
| 7 |
+
| 1 | minimax | structured | 0.9250 | 1.0000 | n/a | n/a | 0.625 | 797.2 | 1259.8 | 0.9125 |
|
| 8 |
+
| 2 | kimi25 | minimal | 0.9500 | 1.0000 | n/a | n/a | 1.000 | 1674.4 | 1276.2 | 0.9098 |
|
| 9 |
+
| 3 | haiku | minimal | 0.9125 | 0.8750 | n/a | n/a | 1.250 | 1104.9 | 2977.8 | 0.8938 |
|
| 10 |
+
| 4 | gpt-5-mini | verbose_noisy | 0.9125 | 0.8750 | n/a | n/a | 0.750 | 1127.2 | 1859.5 | 0.8932 |
|
| 11 |
+
| 5 | kimi25 | structured | 0.9125 | 0.8750 | n/a | n/a | 1.000 | 1460.2 | 1409.9 | 0.8848 |
|
| 12 |
+
| 6 | kimi25 | verbose_noisy | 0.9125 | 0.8750 | n/a | n/a | 1.250 | 1887.4 | 1440.6 | 0.8752 |
|
| 13 |
+
| 7 | haiku | verbose_noisy | 0.8875 | 0.8750 | n/a | n/a | 1.000 | 1244.6 | 2681.5 | 0.8701 |
|
| 14 |
+
| 8 | gpt-5-mini | structured | 0.9125 | 0.8750 | n/a | n/a | 1.000 | 2151.9 | 2052.6 | 0.8698 |
|
| 15 |
+
| 9 | glm | minimal | 0.9125 | 0.8750 | n/a | n/a | 1.875 | 2193.8 | 2309.1 | 0.8690 |
|
| 16 |
+
| 10 | minimax | verbose_noisy | 0.8625 | 0.8750 | n/a | n/a | 0.750 | 602.2 | 1192.5 | 0.8685 |
|
| 17 |
+
| 11 | gpt-5-mini | minimal | 0.9125 | 0.8750 | n/a | n/a | 0.875 | 2288.6 | 1987.0 | 0.8672 |
|
| 18 |
+
| 12 | glm | structured | 0.8500 | 0.8750 | n/a | n/a | 1.125 | 965.8 | 1324.4 | 0.8476 |
|
| 19 |
+
| 13 | minimax | minimal | 0.8375 | 0.7500 | n/a | n/a | 0.750 | 717.0 | 1208.9 | 0.8449 |
|
| 20 |
+
| 14 | kimi | structured | 0.8125 | 0.7500 | n/a | n/a | 1.125 | 596.5 | 778.0 | 0.8287 |
|
| 21 |
+
| 15 | haiku | structured | 0.7875 | 0.7500 | n/a | n/a | 0.875 | 786.8 | 2249.0 | 0.8028 |
|
| 22 |
+
| 16 | glm | verbose_noisy | 0.7875 | 0.7500 | n/a | n/a | 1.000 | 1304.1 | 1378.0 | 0.7886 |
|
| 23 |
+
| 17 | kimi | verbose_noisy | 0.7375 | 0.6250 | n/a | n/a | 1.000 | 606.5 | 658.9 | 0.7684 |
|
| 24 |
+
| 18 | kimi | minimal | 0.6625 | 0.5000 | n/a | n/a | 1.000 | 416.2 | 631.0 | 0.7146 |
|
| 25 |
+
|
| 26 |
+
## Per-model winner (composite)
|
| 27 |
+
|
| 28 |
+
| Model | Winner variant | Composite | Answer norm | First OK | Exchange chars | Avg tokens |
|
| 29 |
+
|---|---|---:|---:|---:|---:|---:|
|
| 30 |
+
| glm | minimal | 0.8690 | 0.9125 | n/a | 2193.8 | 2309.1 |
|
| 31 |
+
| gpt-5-mini | verbose_noisy | 0.8932 | 0.9125 | n/a | 1127.2 | 1859.5 |
|
| 32 |
+
| haiku | minimal | 0.8938 | 0.9125 | n/a | 1104.9 | 2977.8 |
|
| 33 |
+
| kimi | structured | 0.8287 | 0.8125 | n/a | 596.5 | 778.0 |
|
| 34 |
+
| kimi25 | minimal | 0.9098 | 0.9500 | n/a | 1674.4 | 1276.2 |
|
| 35 |
+
| minimax | structured | 0.9125 | 0.9250 | n/a | 797.2 | 1259.8 |
|
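The per-model winner table above can be reproduced from the dashboard records by keeping, for each model, the row with the highest `composite`. The following is a minimal sketch under that assumption; the function name `winners_by_composite` is hypothetical, and only a subset of the rows from `tool_description_dashboard.json` is inlined. How `composite` itself is computed is not stated in this commit, so the sketch treats it as a given field.

```python
from typing import Dict, List

# Subset of rows copied from tool_description_dashboard.json
# (only the fields needed for this illustration).
ROWS: List[dict] = [
    {"model": "minimax", "variant": "structured", "composite": 0.9124970675498518},
    {"model": "minimax", "variant": "verbose_noisy", "composite": 0.8685013030595123},
    {"model": "minimax", "variant": "minimal", "composite": 0.8449169144656289},
    {"model": "kimi25", "variant": "minimal", "composite": 0.9098262016061369},
    {"model": "kimi25", "variant": "structured", "composite": 0.8847939692269589},
    {"model": "kimi25", "variant": "verbose_noisy", "composite": 0.8751926706739843},
]


def winners_by_composite(rows: List[dict]) -> Dict[str, dict]:
    """Return, for each model, the row with the highest composite score."""
    best: Dict[str, dict] = {}
    for row in rows:
        current = best.get(row["model"])
        if current is None or row["composite"] > current["composite"]:
            best[row["model"]] = row
    return best


WINNERS = winners_by_composite(ROWS)
for model, row in sorted(WINNERS.items()):
    print(f"{model}: {row['variant']} ({row['composite']:.4f})")
```

With the rows above, this selects `minimal` for kimi25 and `structured` for minimax, matching the table.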
docs/tool_description_eval/clean_release_20260209/tool_description_interpretation.md
ADDED
|
@@ -0,0 +1,28 @@
| 1 |
+
# Tool Description Interpretation
|
| 2 |
+
|
| 3 |
+
## Global means by variant
|
| 4 |
+
|
| 5 |
+
| Variant | First-call OK | Avg score | Avg endpoint calls | Avg exchange chars |
|
| 6 |
+
|---|---:|---:|---:|---:|
|
| 7 |
+
| minimal | n/a | n/a | 0.0000 | 1399 |
|
| 8 |
+
| structured | n/a | n/a | 0.0000 | 1126 |
|
| 9 |
+
| verbose_noisy | n/a | n/a | 0.0000 | 1129 |
|
| 10 |
+
|
| 11 |
+
## Structured vs Minimal (per model deltas)
|
| 12 |
+
|
| 13 |
+
Δ is defined as `structured - minimal`.
|
| 14 |
+
|
| 15 |
+
| Model | Δ First-call OK | Δ Avg score | Δ Calls |
|
| 16 |
+
|---|---:|---:|---:|
|
| 17 |
+
| glm | n/a | n/a | +0.0000 |
|
| 18 |
+
| gpt-5-mini | n/a | n/a | +0.0000 |
|
| 19 |
+
| haiku | n/a | n/a | +0.0000 |
|
| 20 |
+
| kimi | n/a | n/a | +0.0000 |
|
| 21 |
+
| kimi25 | n/a | n/a | +0.0000 |
|
| 22 |
+
| minimax | n/a | n/a | +0.0000 |
|
| 23 |
+
|
| 24 |
+
Interpretation tip:
|
| 25 |
+
- A positive Δ in first-call OK rate or average score favors the structured variant.
|
| 26 |
+
- A negative Δ in calls favors the structured variant (fewer calls).
|
| 27 |
+
|
| 28 |
+
Models covered: glm, gpt-5-mini, haiku, kimi, kimi25, minimax
|
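The per-model deltas above follow the stated definition `structured - minimal`. A minimal sketch of that subtraction is shown below; the helper `variant_delta` is hypothetical, and the rows are a subset copied from `tool_description_dashboard.json`. Every row in that file reports `avg_endpoint_calls` as 0.0, which would yield the `+0.0000` values in the Δ Calls column if the report is computed from that field; that mapping is an assumption, not something the report states.

```python
from typing import Dict, List

# Subset of rows copied from tool_description_dashboard.json.
ROWS: List[dict] = [
    {"model": "glm", "variant": "minimal", "avg_tool_calls": 1.875, "avg_endpoint_calls": 0.0},
    {"model": "glm", "variant": "structured", "avg_tool_calls": 1.125, "avg_endpoint_calls": 0.0},
    {"model": "minimax", "variant": "minimal", "avg_tool_calls": 0.75, "avg_endpoint_calls": 0.0},
    {"model": "minimax", "variant": "structured", "avg_tool_calls": 0.625, "avg_endpoint_calls": 0.0},
]


def variant_delta(rows: List[dict], field: str,
                  a: str = "structured", b: str = "minimal") -> Dict[str, float]:
    """Compute field[a] - field[b] per model, skipping models missing either variant."""
    by_key = {(r["model"], r["variant"]): r[field] for r in rows}
    models = sorted({r["model"] for r in rows})
    return {
        m: by_key[(m, a)] - by_key[(m, b)]
        for m in models
        if (m, a) in by_key and (m, b) in by_key
    }


ENDPOINT_DELTA = variant_delta(ROWS, "avg_endpoint_calls")
TOOL_DELTA = variant_delta(ROWS, "avg_tool_calls")
for model in ENDPOINT_DELTA:
    print(f"{model}: endpoint {ENDPOINT_DELTA[model]:+.4f}, tool {TOOL_DELTA[model]:+.4f}")
```

Running the same subtraction on `avg_tool_calls` gives non-zero deltas for these rows (for example, glm uses fewer tool calls with the structured variant), so the choice of field matters when reading the Δ Calls column.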
docs/tool_description_eval/clean_release_20260209/tool_description_model_comparison.md
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
# Indirect Run: Model vs Model Comparison
|
| 2 |
+
|
| 3 |
+
| Model | Mean normalized answer | Mean answer pass | Mean tool calls | Mean exchange chars | Mean delegation chars |
|
| 4 |
+
|---|---:|---:|---:|---:|---:|
|
| 5 |
+
| kimi25 | 0.9250 | 0.9167 | 1.083 | 1674.0 | 124.5 |
|
| 6 |
+
| gpt-5-mini | 0.9125 | 0.8750 | 0.875 | 1855.9 | 312.0 |
|
| 7 |
+
| minimax | 0.8750 | 0.8750 | 0.708 | 705.5 | 191.5 |
|
| 8 |
+
| haiku | 0.8625 | 0.8333 | 1.042 | 1045.4 | 84.6 |
|
| 9 |
+
| glm | 0.8500 | 0.8333 | 1.333 | 1487.9 | 137.1 |
|
| 10 |
+
| kimi | 0.7375 | 0.6250 | 1.042 | 539.7 | 81.4 |
|
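The model-level means in the table above are consistent with a plain average over each model's three variants. The sketch below illustrates that aggregation for two models; `model_means` is a hypothetical helper, and the rows are copied from `tool_description_dashboard.json`. Unweighted averaging is an assumption that happens to match the reported values, since every variant has `n_cases` of 8.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List

# Rows copied from tool_description_dashboard.json (two models, three variants each).
ROWS: List[dict] = [
    {"model": "kimi25", "variant": "minimal", "normalized_answer_score": 0.95, "answer_pass_rate": 1.0},
    {"model": "kimi25", "variant": "structured", "normalized_answer_score": 0.9125, "answer_pass_rate": 0.875},
    {"model": "kimi25", "variant": "verbose_noisy", "normalized_answer_score": 0.9125, "answer_pass_rate": 0.875},
    {"model": "kimi", "variant": "structured", "normalized_answer_score": 0.8125, "answer_pass_rate": 0.75},
    {"model": "kimi", "variant": "verbose_noisy", "normalized_answer_score": 0.7375, "answer_pass_rate": 0.625},
    {"model": "kimi", "variant": "minimal", "normalized_answer_score": 0.6625, "answer_pass_rate": 0.5},
]


def model_means(rows: List[dict], fields: List[str]) -> Dict[str, Dict[str, float]]:
    """Average each requested field over all variants of a model."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model"]].append(row)
    return {
        model: {f: fmean(r[f] for r in group) for f in fields}
        for model, group in grouped.items()
    }


MEANS = model_means(ROWS, ["normalized_answer_score", "answer_pass_rate"])
for model, stats in MEANS.items():
    print(model, {k: round(v, 4) for k, v in stats.items()})
```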
| 11 |
+
|
| 12 |
+
## Per-model best variant (answer-first, efficiency tie-break)
|
| 13 |
+
|
| 14 |
+
| Model | Winner variant | Answer norm | Answer pass | Tool calls | Exchange chars | Delegation chars |
|
| 15 |
+
|---|---|---:|---:|---:|---:|---:|
|
| 16 |
+
| glm | minimal | 0.9125 | 0.8750 | 1.875 | 2193.8 | 91.9 |
|
| 17 |
+
| gpt-5-mini | verbose_noisy | 0.9125 | 0.8750 | 0.750 | 1127.2 | 343.8 |
|
| 18 |
+
| haiku | minimal | 0.9125 | 0.8750 | 1.250 | 1104.9 | 66.4 |
|
| 19 |
+
| kimi | structured | 0.8125 | 0.7500 | 1.125 | 596.5 | 75.4 |
|
| 20 |
+
| kimi25 | minimal | 0.9500 | 1.0000 | 1.000 | 1674.4 | 113.1 |
|
| 21 |
+
| minimax | structured | 0.9250 | 1.0000 | 0.625 | 797.2 | 280.0 |
|
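The "answer-first, efficiency tie-break" rule in the heading above can be sketched as a lexicographic sort: highest normalized answer score first, then highest pass rate, then fewest tool calls, then fewest exchange characters. The exact tie-break order used by the report is not given in this commit, so the key below is an assumption, and `best_variant` is a hypothetical name. The rows are copied from `tool_description_dashboard.json`.

```python
from typing import List

# gpt-5-mini rows copied from tool_description_dashboard.json; all three variants
# tie on answer quality, so the efficiency tie-break decides the winner.
ROWS: List[dict] = [
    {"variant": "verbose_noisy", "normalized_answer_score": 0.9125, "answer_pass_rate": 0.875,
     "avg_tool_calls": 0.75, "avg_tool_exchange_chars": 1127.2},
    {"variant": "structured", "normalized_answer_score": 0.9125, "answer_pass_rate": 0.875,
     "avg_tool_calls": 1.0, "avg_tool_exchange_chars": 2151.9},
    {"variant": "minimal", "normalized_answer_score": 0.9125, "answer_pass_rate": 0.875,
     "avg_tool_calls": 0.875, "avg_tool_exchange_chars": 2288.6},
]


def best_variant(rows: List[dict]) -> dict:
    """Pick the row with the best answer metrics, breaking ties by lower cost.

    Assumed ordering: answer score and pass rate descending (negated so that
    min() prefers larger values), then tool calls and exchange chars ascending.
    """
    return min(
        rows,
        key=lambda r: (
            -r["normalized_answer_score"],
            -r["answer_pass_rate"],
            r["avg_tool_calls"],
            r["avg_tool_exchange_chars"],
        ),
    )


BEST = best_variant(ROWS)
print(BEST["variant"])
```

For gpt-5-mini this selects `verbose_noisy`, in agreement with the table: the three variants are indistinguishable on answer quality, and `verbose_noisy` has the fewest tool calls and the smallest exchange size.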