Commit c3fa51b (verified), committed by Elron
1 Parent(s): 9e15549

Add Open Agent Leaderboard evaluation results

Results from the Open Agent Leaderboard (https://www.exgentic.ai), evaluating this agent across 6 benchmarks with 5 different models.

.eval_results/open_agent_leaderboard_openai_aws_claude-opus-4-5.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.6173
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.64
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.49
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.6061
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.66
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.78
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.76
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
.eval_results/open_agent_leaderboard_openai_azure_deepseek-v3.2.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.446
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.04
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.36
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.6875
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.56
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.82
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.71
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
.eval_results/open_agent_leaderboard_openai_azure_gpt-5.2-2025-12-11.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.4625
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.22
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.46
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.57
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.54
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.73
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.53
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
.eval_results/open_agent_leaderboard_openai_azure_kimi-k2.5.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.4276
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.1
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.34
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.5714
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.62
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.6465
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.83
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
.eval_results/open_agent_leaderboard_openai_gcp_gemini-3-pro-preview.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.6225
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.55
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.48
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.71
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.7
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.82
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.73
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
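
Reviewer note: the commit does not say how the `overall` value relates to the per-benchmark values. As a sanity check, the sketch below tests one hypothesis that happens to fit the numbers in all five files: average the three `taubench_*` scores into a single family score, then average that with `appworld`, `browsecomp_plus`, and `swebench`. This is an assumption inferred from the data, not a formula documented by the leaderboard; the function names and the 1e-3 tolerance (to absorb rounding in the published values) are illustrative.

```python
from statistics import mean

# Scores copied from the five .eval_results files added in this commit.
SCORES = {
    "Claude Opus 4.5": {
        "overall": 0.6173, "appworld": 0.64, "browsecomp_plus": 0.49,
        "swebench": 0.6061, "taubench_airline": 0.66,
        "taubench_retail": 0.78, "taubench_telecom": 0.76,
    },
    "DeepSeek V3.2": {
        "overall": 0.446, "appworld": 0.04, "browsecomp_plus": 0.36,
        "swebench": 0.6875, "taubench_airline": 0.56,
        "taubench_retail": 0.82, "taubench_telecom": 0.71,
    },
    "GPT-5.2": {
        "overall": 0.4625, "appworld": 0.22, "browsecomp_plus": 0.46,
        "swebench": 0.57, "taubench_airline": 0.54,
        "taubench_retail": 0.73, "taubench_telecom": 0.53,
    },
    "Kimi K2.5": {
        "overall": 0.4276, "appworld": 0.1, "browsecomp_plus": 0.34,
        "swebench": 0.5714, "taubench_airline": 0.62,
        "taubench_retail": 0.6465, "taubench_telecom": 0.83,
    },
    "Gemini 3 Pro": {
        "overall": 0.6225, "appworld": 0.55, "browsecomp_plus": 0.48,
        "swebench": 0.71, "taubench_airline": 0.7,
        "taubench_retail": 0.82, "taubench_telecom": 0.73,
    },
}


def grouped_mean(scores):
    """Hypothetical aggregation (an assumption, not documented behaviour):
    collapse the taubench_* scores into one family mean, then average that
    with the remaining per-benchmark scores."""
    tau = mean(v for k, v in scores.items() if k.startswith("taubench_"))
    others = [
        v for k, v in scores.items()
        if k != "overall" and not k.startswith("taubench_")
    ]
    return mean(others + [tau])


def check(tolerance=1e-3):
    """Return, per model, whether the reported overall matches grouped_mean."""
    return {
        model: abs(grouped_mean(s) - s["overall"]) <= tolerance
        for model, s in SCORES.items()
    }


if __name__ == "__main__":
    for model, s in SCORES.items():
        print(f"{model}: reported={s['overall']:.4f} "
              f"grouped_mean={grouped_mean(s):.4f}")
```

If the check ever fails for a future results file, that would suggest either a different aggregation or a transcription error in one of the values, which is worth confirming against https://www.exgentic.ai before merging.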