Add MDPBench evaluation results

#13
by Delores-Lin - opened
Files changed (1)
  1. .eval_results/mdpbench.yaml +219 -0
.eval_results/mdpbench.yaml ADDED
@@ -0,0 +1,219 @@
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: overall
+  value: 56.3
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: digital
+  value: 72.4
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: photographed
+  value: 51.1
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: latin
+  value: 69.3
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: de
+  value: 72.4
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: en
+  value: 75.0
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: es
+  value: 60.9
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: fr
+  value: 61.8
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: id
+  value: 69.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: it
+  value: 74.7
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: nl
+  value: 71.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: pt
+  value: 70.3
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: vi
+  value: 67.8
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: non_latin
+  value: 41.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: ar
+  value: 37.9
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: hi
+  value: 61.3
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: jp
+  value: 39.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: ko
+  value: 29.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: ru
+  value: 54.0
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: th
+  value: 24.8
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: zh
+  value: 39.7
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
+
+- dataset:
+    id: Delores-Lin/MDPBench
+    task_id: zh_t
+  value: 45.6
+  date: "2026-06-08"
+  source:
+    url: https://huggingface.co/datasets/Delores-Lin/MDPBench
+    name: MDPBench leaderboard
+    user: Delores-Lin
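
Review note: the file reports three aggregate rows (latin, non_latin, overall) alongside the per-language rows. A quick way to sanity-check the submitted numbers is to recompute the aggregates from the per-language values. The sketch below assumes each aggregate is an unweighted mean over its languages, rounded to one decimal; that is an inference from the numbers in this file, not a documented definition of MDPBench scoring. The function names are hypothetical, and the digital and photographed rows are left out because the file gives no per-language breakdown for them.

```python
from statistics import mean

# Per-language values copied from .eval_results/mdpbench.yaml in this PR.
LATIN = {"de": 72.4, "en": 75.0, "es": 60.9, "fr": 61.8, "id": 69.6,
         "it": 74.7, "nl": 71.6, "pt": 70.3, "vi": 67.8}
NON_LATIN = {"ar": 37.9, "hi": 61.3, "jp": 39.6, "ko": 29.6,
             "ru": 54.0, "th": 24.8, "zh": 39.7, "zh_t": 45.6}

# Aggregate values as reported in the same file.
REPORTED = {"latin": 69.3, "non_latin": 41.6, "overall": 56.3}


def recompute_aggregates():
    """Recompute aggregates, ASSUMING unweighted means over languages."""
    latin = list(LATIN.values())
    non_latin = list(NON_LATIN.values())
    return {
        "latin": round(mean(latin), 1),
        "non_latin": round(mean(non_latin), 1),
        "overall": round(mean(latin + non_latin), 1),
    }


def mismatches(tolerance=0.05):
    """Return the aggregate names whose recomputed value differs from the file."""
    recomputed = recompute_aggregates()
    return [name for name, reported in REPORTED.items()
            if abs(recomputed[name] - reported) > tolerance]


if __name__ == "__main__":
    for name, value in recompute_aggregates().items():
        print(f"{name}: reported={REPORTED[name]} recomputed={value}")
    print("mismatches:", mismatches())
```

Under that assumption the three aggregate rows agree with the per-language rows, so no transcription error is apparent in the values above.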