zRzRzRzRzRzRzR committed 9980c03 (1 parent: 93e5632): update bench

Files changed (1): README.md (+23 -0)

README.md CHANGED
@@ -41,6 +41,29 @@ GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, G
  | BrowseComp | 42.8 | 2.29 | 28.3 |

+ ### Evaluation Parameters
+
+ **Default Settings (Most Tasks)**
+
+ * temperature: `1.0`
+ * top-p: `0.95`
+ * max new tokens: `131072`
+
+ For multi-turn agentic tasks (τ²-Bench and Terminal Bench 2), please turn on [Preserved Thinking mode](https://docs.z.ai/guides/capabilities/thinking-mode).
+
+ **Terminal Bench, SWE Bench Verified**
+
+ * temperature: `0.7`
+ * top-p: `1.0`
+ * max new tokens: `16384`
+
+ **τ²-Bench**
+
+ * temperature: `0`
+ * max new tokens: `16384`
+
+ For τ²-Bench evaluation, we added an additional prompt to the Retail and Telecom user interactions to avoid failure modes caused by users ending the interaction incorrectly. For the Airline domain, we applied the domain fixes proposed in the [Claude Opus 4.5](https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf) system card.
+
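The per-benchmark settings above can be collected into a small lookup table for an evaluation harness. A minimal Python sketch, with hypothetical names (the dict keys and helper are not part of any official GLM-4.7-Flash tooling), assuming that where τ²-Bench leaves top-p unspecified the default value is kept:

```python
# Default sampling settings used for most tasks (from the README section above).
DEFAULT = {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072}

# Per-benchmark overrides; keys are hypothetical identifiers.
OVERRIDES = {
    # Terminal Bench and SWE Bench Verified share one configuration.
    "terminal_bench": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "swe_bench_verified": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    # τ²-Bench: greedy decoding with a shorter generation budget.
    # top-p is not specified in the section, so the default is inherited here.
    "tau2_bench": {"temperature": 0.0, "max_new_tokens": 16384},
}


def sampling_settings(benchmark: str) -> dict:
    """Merge a benchmark's overrides onto the default sampling settings."""
    return {**DEFAULT, **OVERRIDES.get(benchmark, {})}
```

A benchmark without an override (e.g. BrowseComp) simply falls through to the defaults.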
  ## Serve GLM-4.7-Flash Locally

  For local deployment, GLM-4.7-Flash supports inference frameworks including vLLM and SGLang. Comprehensive deployment