Commit 3bdc9c2 (verified) by Elron
1 Parent(s): 959b6e6

Add Open Agent Leaderboard evaluation results

Results from the Open Agent Leaderboard (https://www.exgentic.ai), evaluating this agent across 6 benchmarks with 5 different models.

.eval_results/open_agent_leaderboard_openai_aws_claude-opus-4-5.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.6704
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.66
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.5294
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.7423
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.66
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.83
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.76
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Claude Opus 4.5'
.eval_results/open_agent_leaderboard_openai_azure_deepseek-v3.2.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.4158
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.03
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.48
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.64
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.28
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.65
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.61
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: DeepSeek V3.2'
.eval_results/open_agent_leaderboard_openai_azure_gpt-5.2-2025-12-11.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.3917
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.0
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.43
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.58
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.48
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.64
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.55
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: GPT-5.2'
.eval_results/open_agent_leaderboard_openai_azure_kimi-k2.5.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.3026
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.08
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.56
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.5204
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.12
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.03
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.0
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Kimi K2.5'
.eval_results/open_agent_leaderboard_openai_gcp_gemini-3-pro-preview.yaml ADDED
@@ -0,0 +1,56 @@
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: overall
+  value: 0.5617
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: appworld
+  value: 0.36
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: browsecomp_plus
+  value: 0.51
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: swebench
+  value: 0.67
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_airline
+  value: 0.7
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_retail
+  value: 0.71
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
+- dataset:
+    id: open-agent-leaderboard/results
+    task_id: taubench_telecom
+  value: 0.71
+  source:
+    url: https://www.exgentic.ai
+    name: Open Agent Leaderboard
+  notes: 'model: Gemini 3 Pro'
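
For readers who want to compare the five files, here is a minimal Python sketch of how entries in this schema could be ranked by score for a given task_id. It is not part of the commit. The names ENTRIES, model_name, and rank_by_task are hypothetical, and the sketch assumes every notes field has the form 'model: <name>', as it does in the files above. Only the overall values are copied in, as plain dicts, to keep the example free of a YAML dependency.

```python
import re

# "overall" entries copied from the five result files in this commit.
# The source block is omitted because it is identical for every entry.
ENTRIES = [
    {"dataset": {"id": "open-agent-leaderboard/results", "task_id": "overall"},
     "value": 0.6704, "notes": "model: Claude Opus 4.5"},
    {"dataset": {"id": "open-agent-leaderboard/results", "task_id": "overall"},
     "value": 0.4158, "notes": "model: DeepSeek V3.2"},
    {"dataset": {"id": "open-agent-leaderboard/results", "task_id": "overall"},
     "value": 0.3917, "notes": "model: GPT-5.2"},
    {"dataset": {"id": "open-agent-leaderboard/results", "task_id": "overall"},
     "value": 0.3026, "notes": "model: Kimi K2.5"},
    {"dataset": {"id": "open-agent-leaderboard/results", "task_id": "overall"},
     "value": 0.5617, "notes": "model: Gemini 3 Pro"},
]


def model_name(entry):
    """Extract the model name from a notes string of the form 'model: <name>'."""
    match = re.fullmatch(r"model:\s*(.+)", entry["notes"])
    if match is None:
        raise ValueError(f"unexpected notes format: {entry['notes']!r}")
    return match.group(1)


def rank_by_task(entries, task_id):
    """Return (model, score) pairs for one task_id, highest score first."""
    rows = [
        (model_name(e), e["value"])
        for e in entries
        if e["dataset"]["task_id"] == task_id
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_task(ENTRIES, "overall"):
        print(f"{name:<16} {score:.4f}")
```

The same function would work for the per-benchmark task_ids (appworld, swebench, and so on) if those entries were added to the list.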