Commit 9f3a06b (verified) · 1 parent: d9c13fd · committed by evalstate (HF Staff)

add clean tool-description evaluation charts and summary

Files changed (29)
  1. .gitattributes +1 -0
  2. docs/tool_description_eval/clean_release_20260209/SUMMARY.md +19 -0
  3. docs/tool_description_eval/clean_release_20260209/bar_avg_calls_by_model.png +0 -0
  4. docs/tool_description_eval/clean_release_20260209/bar_avg_exchange_chars_by_model.png +0 -0
  5. docs/tool_description_eval/clean_release_20260209/bar_avg_score_by_model.png +0 -0
  6. docs/tool_description_eval/clean_release_20260209/bar_first_call_ok_by_model.png +0 -0
  7. docs/tool_description_eval/clean_release_20260209/heat_avg_calls.png +0 -0
  8. docs/tool_description_eval/clean_release_20260209/heat_avg_exchange_chars.png +0 -0
  9. docs/tool_description_eval/clean_release_20260209/heat_avg_score.png +0 -0
  10. docs/tool_description_eval/clean_release_20260209/heat_first_call_ok.png +0 -0
  11. docs/tool_description_eval/clean_release_20260209/model_compare_answer_norm.png +0 -0
  12. docs/tool_description_eval/clean_release_20260209/model_compare_answer_pass.png +0 -0
  13. docs/tool_description_eval/clean_release_20260209/model_compare_avg_delegation_chars.png +0 -0
  14. docs/tool_description_eval/clean_release_20260209/model_compare_avg_exchange_chars.png +0 -0
  15. docs/tool_description_eval/clean_release_20260209/model_compare_avg_tool_calls.png +0 -0
  16. docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png +3 -0
  17. docs/tool_description_eval/clean_release_20260209/overall_variant_pareto_chart.png +0 -0
  18. docs/tool_description_eval/clean_release_20260209/overall_variant_summary_chart.png +0 -0
  19. docs/tool_description_eval/clean_release_20260209/scatter_calls_vs_first_ok.png +0 -0
  20. docs/tool_description_eval/clean_release_20260209/scatter_exchange_vs_first_ok.png +0 -0
  21. docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.csv +19 -0
  22. docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.json +308 -0
  23. docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.csv +19 -0
  24. docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.json +146 -0
  25. docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.csv +19 -0
  26. docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.json +344 -0
  27. docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.md +35 -0
  28. docs/tool_description_eval/clean_release_20260209/tool_description_interpretation.md +28 -0
  29. docs/tool_description_eval/clean_release_20260209/tool_description_model_comparison.md +21 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png filter=lfs diff=lfs merge=lfs -text
docs/tool_description_eval/clean_release_20260209/SUMMARY.md ADDED
@@ -0,0 +1,19 @@
+# Clean Description-Test Summary (release view)
+
+Filtered to variants: **minimal, structured, verbose_noisy** and models excluding **grok-4-fast**.
+
+## Variant ranking (cross-model means)
+
+| Rank | Variant | Mean composite | Mean answer | Mean pass | Mean exchange chars | Mean tool calls |
+|---:|---|---:|---:|---:|---:|---:|
+| 1 | structured | 0.8577 | 0.8667 | 0.8542 | 1126.4 | 0.958 |
+| 2 | minimal | 0.8499 | 0.8646 | 0.8125 | 1399.1 | 1.125 |
+| 3 | verbose_noisy | 0.8440 | 0.8500 | 0.8125 | 1128.7 | 0.958 |
+
+**Recommended deployed default:** `structured` (best mean composite).
+
+## Key charts
+
+- `overall_variant_summary_chart.png` (single-glance summary)
+- `overall_variant_pareto_chart.png` (quality vs chattiness)
+- `model_compare_answer_norm.png` and `model_compare_avg_exchange_chars.png` (per-model comparisons)
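The cross-model means behind the variant ranking can be reproduced from the `composite` column of `tool_description_dashboard.csv`, also added in this commit. A minimal sketch (the composite values are inlined from the committed CSV rather than read from disk, so the snippet is self-contained):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Per-(model, variant) composite scores copied from
# tool_description_dashboard.csv in this commit (other columns trimmed).
ROWS = """variant,composite
structured,0.9124970675498518
minimal,0.9098262016061369
minimal,0.893802846893479
verbose_noisy,0.8932066849458153
structured,0.8847939692269589
verbose_noisy,0.8751926706739843
verbose_noisy,0.8701383595426448
structured,0.8698229841021267
minimal,0.8690085907309072
verbose_noisy,0.8685013030595123
minimal,0.8672005597782839
structured,0.8476221127091086
minimal,0.8449169144656289
structured,0.8286831055123738
structured,0.8028070781779222
verbose_noisy,0.7886269253343062
verbose_noisy,0.7683643984660663
minimal,0.7146312913112515
"""

# Group composites by variant, then rank variants by their cross-model mean.
by_variant = defaultdict(list)
for row in csv.DictReader(io.StringIO(ROWS)):
    by_variant[row["variant"]].append(float(row["composite"]))

ranking = sorted(
    ((variant, mean(scores)) for variant, scores in by_variant.items()),
    key=lambda item: item[1],
    reverse=True,
)
for variant, m in ranking:
    print(f"{variant}: {m:.4f}")
```

Averaging the 6 per-model composites per variant recovers the table's ordering: structured (0.8577) over minimal (0.8499) over verbose_noisy (0.8440).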
docs/tool_description_eval/clean_release_20260209/bar_avg_calls_by_model.png ADDED
docs/tool_description_eval/clean_release_20260209/bar_avg_exchange_chars_by_model.png ADDED
docs/tool_description_eval/clean_release_20260209/bar_avg_score_by_model.png ADDED
docs/tool_description_eval/clean_release_20260209/bar_first_call_ok_by_model.png ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_calls.png ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_exchange_chars.png ADDED
docs/tool_description_eval/clean_release_20260209/heat_avg_score.png ADDED
docs/tool_description_eval/clean_release_20260209/heat_first_call_ok.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_answer_norm.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_answer_pass.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_delegation_chars.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_exchange_chars.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_avg_tool_calls.png ADDED
docs/tool_description_eval/clean_release_20260209/model_compare_pareto_answer_vs_exchange.png ADDED

Git LFS Details
  • SHA256: 248fd3641c449fa429bc5d41a79fe4f4fe99f86cb0955df9d4d0cac2c2752a87
  • Pointer size: 131 bytes
  • Size of remote file: 110 kB
docs/tool_description_eval/clean_release_20260209/overall_variant_pareto_chart.png ADDED
docs/tool_description_eval/clean_release_20260209/overall_variant_summary_chart.png ADDED
docs/tool_description_eval/clean_release_20260209/scatter_calls_vs_first_ok.png ADDED
docs/tool_description_eval/clean_release_20260209/scatter_exchange_vs_first_ok.png ADDED
docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.csv ADDED
@@ -0,0 +1,19 @@
+variant,model,actual_model,n_cases,success_rate,tool_use_rate,avg_tool_calls,avg_endpoint_calls,avg_tool_request_chars,avg_tool_response_chars,avg_tool_exchange_chars,total_tool_exchange_chars,avg_delegation_chars,first_call_ok_rate,avg_score_total
+minimal,glm,zai-org/GLM-4.7,8,1.0,1.0,1.875,0.0,196.8,1997.0,2193.8,17550,91.88,,
+minimal,gpt-5-mini,gpt-5-mini,8,1.0,0.75,0.875,0.0,246.8,2041.9,2288.6,18309,277.67,,
+minimal,haiku,claude-haiku-4-5,8,1.0,0.875,1.25,0.0,100.6,1004.2,1104.9,8839,66.43,,
+minimal,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.0,0.0,99.5,316.8,416.2,3330,84.25,,
+minimal,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.0,0.0,129.0,1545.4,1674.4,13395,113.14,,
+minimal,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.75,0.75,0.0,123.5,593.5,717.0,5736,149.0,,
+structured,glm,zai-org/GLM-4.7,8,1.0,0.75,1.125,0.0,160.5,805.2,965.8,7726,151.17,,
+structured,gpt-5-mini,gpt-5-mini,8,1.0,0.875,1.0,0.0,329.1,1822.8,2151.9,17215,314.43,,
+structured,haiku,claude-haiku-4-5,8,1.0,0.75,0.875,0.0,69.5,717.2,786.8,6294,68.5,,
+structured,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.125,0.0,96.5,500.0,596.5,4772,75.38,,
+structured,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.0,0.0,112.1,1348.1,1460.2,11682,100.71,,
+structured,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.625,0.625,0.0,187.2,610.0,797.2,6378,280.0,,
+verbose_noisy,glm,zai-org/GLM-4.7,8,1.0,0.875,1.0,0.0,189.1,1115.0,1304.1,10433,168.29,,
+verbose_noisy,gpt-5-mini,gpt-5-mini,8,1.0,0.625,0.75,0.0,282.6,844.6,1127.2,9018,343.8,,
+verbose_noisy,haiku,claude-haiku-4-5,8,1.0,0.875,1.0,0.0,124.8,1119.9,1244.6,9957,118.86,,
+verbose_noisy,kimi,moonshotai/Kimi-K2-Instruct-0905,8,1.0,1.0,1.0,0.0,99.5,507.0,606.5,4852,84.5,,
+verbose_noisy,kimi25,moonshotai/Kimi-K2.5,8,1.0,0.875,1.25,0.0,213.8,1673.6,1887.4,15099,159.71,,
+verbose_noisy,minimax,MiniMaxAI/MiniMax-M2.1,8,1.0,0.75,0.75,0.0,121.9,480.4,602.2,4818,145.5,,
docs/tool_description_eval/clean_release_20260209/tool_description_ab_summary.filtered.json ADDED
@@ -0,0 +1,308 @@
+[
+  {
+    "variant": "minimal",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 196.8,
+    "avg_tool_response_chars": 1997.0,
+    "avg_tool_exchange_chars": 2193.8,
+    "total_tool_exchange_chars": 17550,
+    "avg_delegation_chars": 91.88,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 246.8,
+    "avg_tool_response_chars": 2041.9,
+    "avg_tool_exchange_chars": 2288.6,
+    "total_tool_exchange_chars": 18309,
+    "avg_delegation_chars": 277.67,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 100.6,
+    "avg_tool_response_chars": 1004.2,
+    "avg_tool_exchange_chars": 1104.9,
+    "total_tool_exchange_chars": 8839,
+    "avg_delegation_chars": 66.43,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 99.5,
+    "avg_tool_response_chars": 316.8,
+    "avg_tool_exchange_chars": 416.2,
+    "total_tool_exchange_chars": 3330,
+    "avg_delegation_chars": 84.25,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 129.0,
+    "avg_tool_response_chars": 1545.4,
+    "avg_tool_exchange_chars": 1674.4,
+    "total_tool_exchange_chars": 13395,
+    "avg_delegation_chars": 113.14,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "minimal",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 123.5,
+    "avg_tool_response_chars": 593.5,
+    "avg_tool_exchange_chars": 717.0,
+    "total_tool_exchange_chars": 5736,
+    "avg_delegation_chars": 149.0,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 1.125,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 160.5,
+    "avg_tool_response_chars": 805.2,
+    "avg_tool_exchange_chars": 965.8,
+    "total_tool_exchange_chars": 7726,
+    "avg_delegation_chars": 151.17,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 329.1,
+    "avg_tool_response_chars": 1822.8,
+    "avg_tool_exchange_chars": 2151.9,
+    "total_tool_exchange_chars": 17215,
+    "avg_delegation_chars": 314.43,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 69.5,
+    "avg_tool_response_chars": 717.2,
+    "avg_tool_exchange_chars": 786.8,
+    "total_tool_exchange_chars": 6294,
+    "avg_delegation_chars": 68.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.125,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 96.5,
+    "avg_tool_response_chars": 500.0,
+    "avg_tool_exchange_chars": 596.5,
+    "total_tool_exchange_chars": 4772,
+    "avg_delegation_chars": 75.38,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 112.1,
+    "avg_tool_response_chars": 1348.1,
+    "avg_tool_exchange_chars": 1460.2,
+    "total_tool_exchange_chars": 11682,
+    "avg_delegation_chars": 100.71,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "structured",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.625,
+    "avg_tool_calls": 0.625,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 187.2,
+    "avg_tool_response_chars": 610.0,
+    "avg_tool_exchange_chars": 797.2,
+    "total_tool_exchange_chars": 6378,
+    "avg_delegation_chars": 280.0,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "glm",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 189.1,
+    "avg_tool_response_chars": 1115.0,
+    "avg_tool_exchange_chars": 1304.1,
+    "total_tool_exchange_chars": 10433,
+    "avg_delegation_chars": 168.29,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "gpt-5-mini",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.625,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 282.6,
+    "avg_tool_response_chars": 844.6,
+    "avg_tool_exchange_chars": 1127.2,
+    "total_tool_exchange_chars": 9018,
+    "avg_delegation_chars": 343.8,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "haiku",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 124.8,
+    "avg_tool_response_chars": 1119.9,
+    "avg_tool_exchange_chars": 1244.6,
+    "total_tool_exchange_chars": 9957,
+    "avg_delegation_chars": 118.86,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi",
+    "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 1.0,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 99.5,
+    "avg_tool_response_chars": 507.0,
+    "avg_tool_exchange_chars": 606.5,
+    "total_tool_exchange_chars": 4852,
+    "avg_delegation_chars": 84.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi25",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.875,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 213.8,
+    "avg_tool_response_chars": 1673.6,
+    "avg_tool_exchange_chars": 1887.4,
+    "total_tool_exchange_chars": 15099,
+    "avg_delegation_chars": 159.71,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "minimax",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "success_rate": 1.0,
+    "tool_use_rate": 0.75,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_request_chars": 121.9,
+    "avg_tool_response_chars": 480.4,
+    "avg_tool_exchange_chars": 602.2,
+    "total_tool_exchange_chars": 4818,
+    "avg_delegation_chars": 145.5,
+    "first_call_ok_rate": null,
+    "avg_score_total": null
+  }
+]
docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.csv ADDED
@@ -0,0 +1,19 @@
+variant,model,n_cases,answer_pass_rate,avg_answer_score,normalized_answer_score
+minimal,kimi25,8,1.0,9.5,0.95
+structured,minimax,8,1.0,9.25,0.925
+minimal,glm,8,0.875,9.125,0.9125
+minimal,gpt-5-mini,8,0.875,9.125,0.9125
+minimal,haiku,8,0.875,9.125,0.9125
+structured,gpt-5-mini,8,0.875,9.125,0.9125
+structured,kimi25,8,0.875,9.125,0.9125
+verbose_noisy,gpt-5-mini,8,0.875,9.125,0.9125
+verbose_noisy,kimi25,8,0.875,9.125,0.9125
+verbose_noisy,haiku,8,0.875,8.875,0.8875
+verbose_noisy,minimax,8,0.875,8.625,0.8625
+structured,glm,8,0.875,8.5,0.85
+minimal,minimax,8,0.75,8.375,0.8375
+structured,kimi,8,0.75,8.125,0.8125
+structured,haiku,8,0.75,7.875,0.7875
+verbose_noisy,glm,8,0.75,7.875,0.7875
+verbose_noisy,kimi,8,0.625,7.375,0.7375
+minimal,kimi,8,0.5,6.625,0.6625
docs/tool_description_eval/clean_release_20260209/tool_description_answer_summary.filtered.json ADDED
@@ -0,0 +1,146 @@
+[
+  {
+    "variant": "minimal",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "avg_answer_score": 9.5,
+    "normalized_answer_score": 0.95
+  },
+  {
+    "variant": "structured",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "avg_answer_score": 9.25,
+    "normalized_answer_score": 0.925
+  },
+  {
+    "variant": "minimal",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "minimal",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "minimal",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "structured",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "structured",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi25",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 9.125,
+    "normalized_answer_score": 0.9125
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.875,
+    "normalized_answer_score": 0.8875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.625,
+    "normalized_answer_score": 0.8625
+  },
+  {
+    "variant": "structured",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "avg_answer_score": 8.5,
+    "normalized_answer_score": 0.85
+  },
+  {
+    "variant": "minimal",
+    "model": "minimax",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 8.375,
+    "normalized_answer_score": 0.8375
+  },
+  {
+    "variant": "structured",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 8.125,
+    "normalized_answer_score": 0.8125
+  },
+  {
+    "variant": "structured",
+    "model": "haiku",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 7.875,
+    "normalized_answer_score": 0.7875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "glm",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "avg_answer_score": 7.875,
+    "normalized_answer_score": 0.7875
+  },
+  {
+    "variant": "verbose_noisy",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.625,
+    "avg_answer_score": 7.375,
+    "normalized_answer_score": 0.7375
+  },
+  {
+    "variant": "minimal",
+    "model": "kimi",
+    "n_cases": 8,
+    "answer_pass_rate": 0.5,
+    "avg_answer_score": 6.625,
+    "normalized_answer_score": 0.6625
+  }
+]
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.csv ADDED
@@ -0,0 +1,19 @@
+model,variant,actual_model,n_cases,answer_pass_rate,normalized_answer_score,first_call_ok_rate,avg_score_total,avg_tool_calls,avg_endpoint_calls,avg_tool_exchange_chars,avg_delegation_chars,avg_total_tokens,avg_input_tokens,avg_output_tokens,avg_tool_calls_reported,composite
+minimax,structured,MiniMaxAI/MiniMax-M2.1,8,1.0,0.925,,,0.625,0.0,797.2,280.0,1259.75,739.875,519.875,0.625,0.9124970675498518
+kimi25,minimal,moonshotai/Kimi-K2.5,8,1.0,0.95,,,1.0,0.0,1674.4,113.14,1276.25,819.375,456.875,1.0,0.9098262016061369
+haiku,minimal,claude-haiku-4-5,8,0.875,0.9125,,,1.25,0.0,1104.9,66.43,2977.75,2426.25,551.5,1.25,0.893802846893479
+gpt-5-mini,verbose_noisy,gpt-5-mini,8,0.875,0.9125,,,0.75,0.0,1127.2,343.8,1859.5,703.625,1155.875,0.75,0.8932066849458153
+kimi25,structured,moonshotai/Kimi-K2.5,8,0.875,0.9125,,,1.0,0.0,1460.2,100.71,1409.875,893.125,516.75,1.0,0.8847939692269589
+kimi25,verbose_noisy,moonshotai/Kimi-K2.5,8,0.875,0.9125,,,1.25,0.0,1887.4,159.71,1440.625,926.375,514.25,1.25,0.8751926706739843
+haiku,verbose_noisy,claude-haiku-4-5,8,0.875,0.8875,,,1.0,0.0,1244.6,118.86,2681.5,2122.375,559.125,1.0,0.8701383595426448
+gpt-5-mini,structured,gpt-5-mini,8,0.875,0.9125,,,1.0,0.0,2151.9,314.43,2052.625,1083.625,969.0,1.0,0.8698229841021267
+glm,minimal,zai-org/GLM-4.7,8,0.875,0.9125,,,1.875,0.0,2193.8,91.88,2309.125,1546.625,762.5,1.875,0.8690085907309072
+minimax,verbose_noisy,MiniMaxAI/MiniMax-M2.1,8,0.875,0.8625,,,0.75,0.0,602.2,145.5,1192.5,712.125,480.375,0.75,0.8685013030595123
+gpt-5-mini,minimal,gpt-5-mini,8,0.875,0.9125,,,0.875,0.0,2288.6,277.67,1987.0,1063.5,923.5,0.875,0.8672005597782839
+glm,structured,zai-org/GLM-4.7,8,0.875,0.85,,,1.125,0.0,965.8,151.17,1324.375,824.625,499.75,1.125,0.8476221127091086
+minimax,minimal,MiniMaxAI/MiniMax-M2.1,8,0.75,0.8375,,,0.75,0.0,717.0,149.0,1208.875,748.0,460.875,0.75,0.8449169144656289
+kimi,structured,moonshotai/Kimi-K2-Instruct-0905,8,0.75,0.8125,,,1.125,0.0,596.5,75.38,778.0,558.375,219.625,1.125,0.8286831055123738
+haiku,structured,claude-haiku-4-5,8,0.75,0.7875,,,0.875,0.0,786.8,68.5,2249.0,1818.75,430.25,0.875,0.8028070781779222
+glm,verbose_noisy,zai-org/GLM-4.7,8,0.75,0.7875,,,1.0,0.0,1304.1,168.29,1378.0,876.125,501.875,1.0,0.7886269253343062
+kimi,verbose_noisy,moonshotai/Kimi-K2-Instruct-0905,8,0.625,0.7375,,,1.0,0.0,606.5,84.5,658.875,475.125,183.75,1.0,0.7683643984660663
+kimi,minimal,moonshotai/Kimi-K2-Instruct-0905,8,0.5,0.6625,,,1.0,0.0,416.2,84.25,631.0,462.625,168.375,1.0,0.7146312913112515
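The committed chart `model_compare_pareto_answer_vs_exchange.png` plots answer quality against tool-exchange size per (model, variant) run. Which runs sit on the quality-vs-cost frontier can be sketched from the dashboard rows above. An illustrative sketch: the values are inlined from `tool_description_dashboard.csv`, and the dominance rule below is an assumption for illustration, not necessarily the exact rule used to draw the chart:

```python
# (normalized_answer_score, avg_tool_exchange_chars) per (model, variant) run,
# copied from tool_description_dashboard.csv in this commit.
POINTS = {
    ("minimax", "structured"): (0.925, 797.2),
    ("kimi25", "minimal"): (0.95, 1674.4),
    ("haiku", "minimal"): (0.9125, 1104.9),
    ("gpt-5-mini", "verbose_noisy"): (0.9125, 1127.2),
    ("kimi25", "structured"): (0.9125, 1460.2),
    ("kimi25", "verbose_noisy"): (0.9125, 1887.4),
    ("haiku", "verbose_noisy"): (0.8875, 1244.6),
    ("gpt-5-mini", "structured"): (0.9125, 2151.9),
    ("glm", "minimal"): (0.9125, 2193.8),
    ("minimax", "verbose_noisy"): (0.8625, 602.2),
    ("gpt-5-mini", "minimal"): (0.9125, 2288.6),
    ("glm", "structured"): (0.85, 965.8),
    ("minimax", "minimal"): (0.8375, 717.0),
    ("kimi", "structured"): (0.8125, 596.5),
    ("haiku", "structured"): (0.7875, 786.8),
    ("glm", "verbose_noisy"): (0.7875, 1304.1),
    ("kimi", "verbose_noisy"): (0.7375, 606.5),
    ("kimi", "minimal"): (0.6625, 416.2),
}


def pareto_frontier(points):
    """Runs whose answer score is not matched by any cheaper or equal-cost run."""
    frontier, best_score = [], float("-inf")
    # Walk runs from cheapest to most expensive tool exchange; keep a run only
    # if it strictly beats the best answer score seen among cheaper runs.
    for key, (score, chars) in sorted(points.items(), key=lambda kv: kv[1][1]):
        if score > best_score:
            frontier.append(key)
            best_score = score
    return frontier


frontier = pareto_frontier(POINTS)
```

Under this rule, `minimax/structured` (highest score at under 800 exchange chars) and `kimi25/minimal` (highest score overall) both land on the frontier, consistent with their top-two composite rankings in the dashboard.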
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.json ADDED
@@ -0,0 +1,344 @@
+[
+  {
+    "model": "minimax",
+    "variant": "structured",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "normalized_answer_score": 0.925,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 0.625,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 797.2,
+    "avg_delegation_chars": 280.0,
+    "avg_total_tokens": 1259.75,
+    "avg_input_tokens": 739.875,
+    "avg_output_tokens": 519.875,
+    "avg_tool_calls_reported": 0.625,
+    "composite": 0.9124970675498518
+  },
+  {
+    "model": "kimi25",
+    "variant": "minimal",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "answer_pass_rate": 1.0,
+    "normalized_answer_score": 0.95,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1674.4,
+    "avg_delegation_chars": 113.14,
+    "avg_total_tokens": 1276.25,
+    "avg_input_tokens": 819.375,
+    "avg_output_tokens": 456.875,
+    "avg_tool_calls_reported": 1.0,
+    "composite": 0.9098262016061369
+  },
+  {
+    "model": "haiku",
+    "variant": "minimal",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1104.9,
+    "avg_delegation_chars": 66.43,
+    "avg_total_tokens": 2977.75,
+    "avg_input_tokens": 2426.25,
+    "avg_output_tokens": 551.5,
+    "avg_tool_calls_reported": 1.25,
+    "composite": 0.893802846893479
+  },
+  {
+    "model": "gpt-5-mini",
+    "variant": "verbose_noisy",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1127.2,
+    "avg_delegation_chars": 343.8,
+    "avg_total_tokens": 1859.5,
+    "avg_input_tokens": 703.625,
+    "avg_output_tokens": 1155.875,
+    "avg_tool_calls_reported": 0.75,
+    "composite": 0.8932066849458153
+  },
+  {
+    "model": "kimi25",
+    "variant": "structured",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1460.2,
+    "avg_delegation_chars": 100.71,
+    "avg_total_tokens": 1409.875,
+    "avg_input_tokens": 893.125,
+    "avg_output_tokens": 516.75,
+    "avg_tool_calls_reported": 1.0,
+    "composite": 0.8847939692269589
+  },
+  {
+    "model": "kimi25",
+    "variant": "verbose_noisy",
+    "actual_model": "moonshotai/Kimi-K2.5",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.25,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1887.4,
+    "avg_delegation_chars": 159.71,
+    "avg_total_tokens": 1440.625,
+    "avg_input_tokens": 926.375,
+    "avg_output_tokens": 514.25,
+    "avg_tool_calls_reported": 1.25,
+    "composite": 0.8751926706739843
+  },
+  {
+    "model": "haiku",
+    "variant": "verbose_noisy",
+    "actual_model": "claude-haiku-4-5",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.8875,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 1244.6,
+    "avg_delegation_chars": 118.86,
+    "avg_total_tokens": 2681.5,
+    "avg_input_tokens": 2122.375,
+    "avg_output_tokens": 559.125,
+    "avg_tool_calls_reported": 1.0,
+    "composite": 0.8701383595426448
+  },
+  {
+    "model": "gpt-5-mini",
+    "variant": "structured",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.0,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 2151.9,
+    "avg_delegation_chars": 314.43,
+    "avg_total_tokens": 2052.625,
+    "avg_input_tokens": 1083.625,
+    "avg_output_tokens": 969.0,
+    "avg_tool_calls_reported": 1.0,
+    "composite": 0.8698229841021267
+  },
+  {
+    "model": "glm",
+    "variant": "minimal",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 2193.8,
+    "avg_delegation_chars": 91.88,
+    "avg_total_tokens": 2309.125,
+    "avg_input_tokens": 1546.625,
+    "avg_output_tokens": 762.5,
+    "avg_tool_calls_reported": 1.875,
+    "composite": 0.8690085907309072
+  },
+  {
+    "model": "minimax",
+    "variant": "verbose_noisy",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.8625,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 602.2,
+    "avg_delegation_chars": 145.5,
+    "avg_total_tokens": 1192.5,
+    "avg_input_tokens": 712.125,
+    "avg_output_tokens": 480.375,
+    "avg_tool_calls_reported": 0.75,
+    "composite": 0.8685013030595123
+  },
+  {
+    "model": "gpt-5-mini",
+    "variant": "minimal",
+    "actual_model": "gpt-5-mini",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.9125,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 0.875,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 2288.6,
+    "avg_delegation_chars": 277.67,
+    "avg_total_tokens": 1987.0,
+    "avg_input_tokens": 1063.5,
+    "avg_output_tokens": 923.5,
+    "avg_tool_calls_reported": 0.875,
+    "composite": 0.8672005597782839
+  },
+  {
+    "model": "glm",
+    "variant": "structured",
+    "actual_model": "zai-org/GLM-4.7",
+    "n_cases": 8,
+    "answer_pass_rate": 0.875,
+    "normalized_answer_score": 0.85,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 1.125,
+    "avg_endpoint_calls": 0.0,
+    "avg_tool_exchange_chars": 965.8,
+    "avg_delegation_chars": 151.17,
+    "avg_total_tokens": 1324.375,
+    "avg_input_tokens": 824.625,
+    "avg_output_tokens": 499.75,
+    "avg_tool_calls_reported": 1.125,
+    "composite": 0.8476221127091086
+  },
+  {
+    "model": "minimax",
+    "variant": "minimal",
+    "actual_model": "MiniMaxAI/MiniMax-M2.1",
+    "n_cases": 8,
+    "answer_pass_rate": 0.75,
+    "normalized_answer_score": 0.8375,
+    "first_call_ok_rate": null,
+    "avg_score_total": null,
+    "avg_tool_calls": 0.75,
+    "avg_endpoint_calls": 0.0,
241
+ "avg_tool_exchange_chars": 717.0,
242
+ "avg_delegation_chars": 149.0,
243
+ "avg_total_tokens": 1208.875,
244
+ "avg_input_tokens": 748.0,
245
+ "avg_output_tokens": 460.875,
246
+ "avg_tool_calls_reported": 0.75,
247
+ "composite": 0.8449169144656289
248
+ },
249
+ {
250
+ "model": "kimi",
251
+ "variant": "structured",
252
+ "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
253
+ "n_cases": 8,
254
+ "answer_pass_rate": 0.75,
255
+ "normalized_answer_score": 0.8125,
256
+ "first_call_ok_rate": null,
257
+ "avg_score_total": null,
258
+ "avg_tool_calls": 1.125,
259
+ "avg_endpoint_calls": 0.0,
260
+ "avg_tool_exchange_chars": 596.5,
261
+ "avg_delegation_chars": 75.38,
262
+ "avg_total_tokens": 778.0,
263
+ "avg_input_tokens": 558.375,
264
+ "avg_output_tokens": 219.625,
265
+ "avg_tool_calls_reported": 1.125,
266
+ "composite": 0.8286831055123738
267
+ },
268
+ {
269
+ "model": "haiku",
270
+ "variant": "structured",
271
+ "actual_model": "claude-haiku-4-5",
272
+ "n_cases": 8,
273
+ "answer_pass_rate": 0.75,
274
+ "normalized_answer_score": 0.7875,
275
+ "first_call_ok_rate": null,
276
+ "avg_score_total": null,
277
+ "avg_tool_calls": 0.875,
278
+ "avg_endpoint_calls": 0.0,
279
+ "avg_tool_exchange_chars": 786.8,
280
+ "avg_delegation_chars": 68.5,
281
+ "avg_total_tokens": 2249.0,
282
+ "avg_input_tokens": 1818.75,
283
+ "avg_output_tokens": 430.25,
284
+ "avg_tool_calls_reported": 0.875,
285
+ "composite": 0.8028070781779222
286
+ },
287
+ {
288
+ "model": "glm",
289
+ "variant": "verbose_noisy",
290
+ "actual_model": "zai-org/GLM-4.7",
291
+ "n_cases": 8,
292
+ "answer_pass_rate": 0.75,
293
+ "normalized_answer_score": 0.7875,
294
+ "first_call_ok_rate": null,
295
+ "avg_score_total": null,
296
+ "avg_tool_calls": 1.0,
297
+ "avg_endpoint_calls": 0.0,
298
+ "avg_tool_exchange_chars": 1304.1,
299
+ "avg_delegation_chars": 168.29,
300
+ "avg_total_tokens": 1378.0,
301
+ "avg_input_tokens": 876.125,
302
+ "avg_output_tokens": 501.875,
303
+ "avg_tool_calls_reported": 1.0,
304
+ "composite": 0.7886269253343062
305
+ },
306
+ {
307
+ "model": "kimi",
308
+ "variant": "verbose_noisy",
309
+ "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
310
+ "n_cases": 8,
311
+ "answer_pass_rate": 0.625,
312
+ "normalized_answer_score": 0.7375,
313
+ "first_call_ok_rate": null,
314
+ "avg_score_total": null,
315
+ "avg_tool_calls": 1.0,
316
+ "avg_endpoint_calls": 0.0,
317
+ "avg_tool_exchange_chars": 606.5,
318
+ "avg_delegation_chars": 84.5,
319
+ "avg_total_tokens": 658.875,
320
+ "avg_input_tokens": 475.125,
321
+ "avg_output_tokens": 183.75,
322
+ "avg_tool_calls_reported": 1.0,
323
+ "composite": 0.7683643984660663
324
+ },
325
+ {
326
+ "model": "kimi",
327
+ "variant": "minimal",
328
+ "actual_model": "moonshotai/Kimi-K2-Instruct-0905",
329
+ "n_cases": 8,
330
+ "answer_pass_rate": 0.5,
331
+ "normalized_answer_score": 0.6625,
332
+ "first_call_ok_rate": null,
333
+ "avg_score_total": null,
334
+ "avg_tool_calls": 1.0,
335
+ "avg_endpoint_calls": 0.0,
336
+ "avg_tool_exchange_chars": 416.2,
337
+ "avg_delegation_chars": 84.25,
338
+ "avg_total_tokens": 631.0,
339
+ "avg_input_tokens": 462.625,
340
+ "avg_output_tokens": 168.375,
341
+ "avg_tool_calls_reported": 1.0,
342
+ "composite": 0.7146312913112515
343
+ }
344
+ ]
docs/tool_description_eval/clean_release_20260209/tool_description_dashboard.md ADDED
@@ -0,0 +1,35 @@
+ # Tool Description Combined Dashboard
+
+ > Combines trajectory metrics, oracle-answer metrics, and token usage from raw results.
+
+ | Rank | Model | Variant | Answer norm | Answer pass | First OK | Avg score | Avg calls | Avg exchange chars | Avg tokens | Composite |
+ |---:|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
+ | 1 | minimax | structured | 0.9250 | 1.0000 | n/a | n/a | 0.625 | 797.2 | 1259.8 | 0.9125 |
+ | 2 | kimi25 | minimal | 0.9500 | 1.0000 | n/a | n/a | 1.000 | 1674.4 | 1276.2 | 0.9098 |
+ | 3 | haiku | minimal | 0.9125 | 0.8750 | n/a | n/a | 1.250 | 1104.9 | 2977.8 | 0.8938 |
+ | 4 | gpt-5-mini | verbose_noisy | 0.9125 | 0.8750 | n/a | n/a | 0.750 | 1127.2 | 1859.5 | 0.8932 |
+ | 5 | kimi25 | structured | 0.9125 | 0.8750 | n/a | n/a | 1.000 | 1460.2 | 1409.9 | 0.8848 |
+ | 6 | kimi25 | verbose_noisy | 0.9125 | 0.8750 | n/a | n/a | 1.250 | 1887.4 | 1440.6 | 0.8752 |
+ | 7 | haiku | verbose_noisy | 0.8875 | 0.8750 | n/a | n/a | 1.000 | 1244.6 | 2681.5 | 0.8701 |
+ | 8 | gpt-5-mini | structured | 0.9125 | 0.8750 | n/a | n/a | 1.000 | 2151.9 | 2052.6 | 0.8698 |
+ | 9 | glm | minimal | 0.9125 | 0.8750 | n/a | n/a | 1.875 | 2193.8 | 2309.1 | 0.8690 |
+ | 10 | minimax | verbose_noisy | 0.8625 | 0.8750 | n/a | n/a | 0.750 | 602.2 | 1192.5 | 0.8685 |
+ | 11 | gpt-5-mini | minimal | 0.9125 | 0.8750 | n/a | n/a | 0.875 | 2288.6 | 1987.0 | 0.8672 |
+ | 12 | glm | structured | 0.8500 | 0.8750 | n/a | n/a | 1.125 | 965.8 | 1324.4 | 0.8476 |
+ | 13 | minimax | minimal | 0.8375 | 0.7500 | n/a | n/a | 0.750 | 717.0 | 1208.9 | 0.8449 |
+ | 14 | kimi | structured | 0.8125 | 0.7500 | n/a | n/a | 1.125 | 596.5 | 778.0 | 0.8287 |
+ | 15 | haiku | structured | 0.7875 | 0.7500 | n/a | n/a | 0.875 | 786.8 | 2249.0 | 0.8028 |
+ | 16 | glm | verbose_noisy | 0.7875 | 0.7500 | n/a | n/a | 1.000 | 1304.1 | 1378.0 | 0.7886 |
+ | 17 | kimi | verbose_noisy | 0.7375 | 0.6250 | n/a | n/a | 1.000 | 606.5 | 658.9 | 0.7684 |
+ | 18 | kimi | minimal | 0.6625 | 0.5000 | n/a | n/a | 1.000 | 416.2 | 631.0 | 0.7146 |
+
+ ## Per-model winner (composite)
+
+ | Model | Winner variant | Composite | Answer norm | First OK | Exchange chars | Avg tokens |
+ |---|---|---:|---:|---:|---:|---:|
+ | glm | minimal | 0.8690 | 0.9125 | n/a | 2193.8 | 2309.1 |
+ | gpt-5-mini | verbose_noisy | 0.8932 | 0.9125 | n/a | 1127.2 | 1859.5 |
+ | haiku | minimal | 0.8938 | 0.9125 | n/a | 1104.9 | 2977.8 |
+ | kimi | structured | 0.8287 | 0.8125 | n/a | 596.5 | 778.0 |
+ | kimi25 | minimal | 0.9098 | 0.9500 | n/a | 1674.4 | 1276.2 |
+ | minimax | structured | 0.9125 | 0.9250 | n/a | 797.2 | 1259.8 |
docs/tool_description_eval/clean_release_20260209/tool_description_interpretation.md ADDED
@@ -0,0 +1,28 @@
+ # Tool Description Interpretation
+
+ ## Global means by variant
+
+ | Variant | First-call OK | Avg score | Avg endpoint calls | Avg exchange chars |
+ |---|---:|---:|---:|---:|
+ | minimal | n/a | n/a | 0.0000 | 1399 |
+ | structured | n/a | n/a | 0.0000 | 1126 |
+ | verbose_noisy | n/a | n/a | 0.0000 | 1129 |
+
+ ## Structured vs Minimal (per model deltas)
+
+ Δ defined as `structured - minimal`.
+
+ | Model | Δ First-call OK | Δ Avg score | Δ Calls |
+ |---|---:|---:|---:|
+ | glm | n/a | n/a | +0.0000 |
+ | gpt-5-mini | n/a | n/a | +0.0000 |
+ | haiku | n/a | n/a | +0.0000 |
+ | kimi | n/a | n/a | +0.0000 |
+ | kimi25 | n/a | n/a | +0.0000 |
+ | minimax | n/a | n/a | +0.0000 |
+
+ Interpretation tip:
+ - Positive Δ first-call/score is better for structured.
+ - Negative Δ calls is better for structured (fewer calls).
+
+ Models covered: glm, gpt-5-mini, haiku, kimi, kimi25, minimax
docs/tool_description_eval/clean_release_20260209/tool_description_model_comparison.md ADDED
@@ -0,0 +1,21 @@
+ # Indirect Run: Model vs Model Comparison
+
+ | Model | Mean normalized answer | Mean answer pass | Mean tool calls | Mean exchange chars | Mean delegation chars |
+ |---|---:|---:|---:|---:|---:|
+ | kimi25 | 0.9250 | 0.9167 | 1.083 | 1674.0 | 124.5 |
+ | gpt-5-mini | 0.9125 | 0.8750 | 0.875 | 1855.9 | 312.0 |
+ | minimax | 0.8750 | 0.8750 | 0.708 | 705.5 | 191.5 |
+ | haiku | 0.8625 | 0.8333 | 1.042 | 1045.4 | 84.6 |
+ | glm | 0.8500 | 0.8333 | 1.333 | 1487.9 | 137.1 |
+ | kimi | 0.7375 | 0.6250 | 1.042 | 539.7 | 81.4 |
+
+ ## Per-model best variant (answer-first, efficiency tie-break)
+
+ | Model | Winner variant | Answer norm | Answer pass | Tool calls | Exchange chars | Delegation chars |
+ |---|---|---:|---:|---:|---:|---:|
+ | glm | minimal | 0.9125 | 0.8750 | 1.875 | 2193.8 | 91.9 |
+ | gpt-5-mini | verbose_noisy | 0.9125 | 0.8750 | 0.750 | 1127.2 | 343.8 |
+ | haiku | minimal | 0.9125 | 0.8750 | 1.250 | 1104.9 | 66.4 |
+ | kimi | structured | 0.8125 | 0.7500 | 1.125 | 596.5 | 75.4 |
+ | kimi25 | minimal | 0.9500 | 1.0000 | 1.000 | 1674.4 | 113.1 |
+ | minimax | structured | 0.9250 | 1.0000 | 0.625 | 797.2 | 280.0 |