Add CHI-Bench eval results — agent harness: OpenAI Agents SDK

#38
by hlnchen - opened
Files changed (1)
  1. .eval_results/chi-bench.yaml +41 -0
.eval_results/chi-bench.yaml ADDED
@@ -0,0 +1,41 @@
+ # Place at .eval_results/chi-bench.yaml in the GLM-5.1 model repo.
+ # CONFIRM the exact repo id before submitting (org is zai-org; e.g. zai-org/GLM-5.1).
+ # Submit via the model's Community tab as a PR; shows "community-provided" until merged.
+ # Values are pass@1 (%) for the best-performing harness for this model: OpenAI Agents SDK.
+ # (Ties Hermes at 18.7 overall; OAI Agents wins on reliability pass^3 12.0 vs 10.7.)
+ - dataset:
+     id: actava/chi-bench
+     task_id: chi_bench
+   value: 18.7
+   date: "2026-05-08"
+   source:
+     url: https://arxiv.org/abs/2605.16679
+     name: CHI-Bench
+   notes: "Harness: OpenAI Agents SDK; Protocol: 75 tasks x 3 trials; Metric: pass@1 (%)"
+ - dataset:
+     id: actava/chi-bench
+     task_id: prior_authorization
+   value: 18.7
+   date: "2026-05-08"
+   source:
+     url: https://arxiv.org/abs/2605.16679
+     name: CHI-Bench
+   notes: "Harness: OpenAI Agents SDK; Protocol: 75 tasks x 3 trials; Metric: pass@1 (%)"
+ - dataset:
+     id: actava/chi-bench
+     task_id: utilization_management
+   value: 33.3
+   date: "2026-05-08"
+   source:
+     url: https://arxiv.org/abs/2605.16679
+     name: CHI-Bench
+   notes: "Harness: OpenAI Agents SDK; Protocol: 75 tasks x 3 trials; Metric: pass@1 (%)"
+ - dataset:
+     id: actava/chi-bench
+     task_id: care_management
+   value: 4.0
+   date: "2026-05-08"
+   source:
+     url: https://arxiv.org/abs/2605.16679
+     name: CHI-Bench
+   notes: "Harness: OpenAI Agents SDK; Protocol: 75 tasks x 3 trials; Metric: pass@1 (%)"
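
For reviewers checking the numbers, the sketch below shows how pass@1 and pass^3 might be computed under the stated protocol of 75 tasks x 3 trials. It assumes pass@1 is the mean per-trial success rate and pass^3 is the share of tasks where all three trials succeed; the CHI-Bench paper's exact definitions are not confirmed here. The function names and the `outcomes` data are hypothetical: the outcomes are synthetic, chosen only as one breakdown that is arithmetically consistent with the 18.7 and 12.0 figures in the comments above, and are not the actual per-task results.

```python
from typing import Sequence


def pass_at_1(trials: Sequence[Sequence[bool]]) -> float:
    """Mean per-trial success rate across all tasks, as a percentage."""
    total = sum(len(task) for task in trials)
    passed = sum(sum(task) for task in trials)
    return 100.0 * passed / total


def pass_hat_k(trials: Sequence[Sequence[bool]], k: int) -> float:
    """Share of tasks whose first k trials all succeed, as a percentage.

    Assumes every task has at least k trials (here k equals the trial count).
    """
    return 100.0 * sum(all(task[:k]) for task in trials) / len(trials)


# Synthetic outcomes (NOT the real results): 75 tasks x 3 trials.
outcomes = (
    [[True, True, True]] * 9        # 9 tasks pass every trial
    + [[True, False, False]] * 15   # 15 tasks pass one trial of three
    + [[False, False, False]] * 51  # 51 tasks never pass
)

print(round(pass_at_1(outcomes), 1))  # 42 successes out of 225 trials
print(pass_hat_k(outcomes, 3))        # 9 fully reliable tasks out of 75
```

Under these assumptions, 42 successful trials out of 225 rounds to 18.7, and 9 fully reliable tasks out of 75 gives 12.0, which is why two harnesses can tie on pass@1 yet differ on pass^3.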