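# Response instructions for the 19 descriptive questions, keyed by question id.
# `{}` is a str.format placeholder for text prepended to the question;
# templates 18 and 19 have no placeholder.
DESCRIPTIVE_RESP_INST = {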
    1: """{}what is its title?
    * Your final answer should be the most relevant title of the plot that is explicitly written.
    * If the plot does not have an explicit title or contains only a letter, answer 'Not Applicable'.
    """,
    2: """{}what is the label of the x-axis?
    * Your final answer should be the label of the x-axis that is explicitly written, including the case when x-axis is shared across multiple subplots. When the x-axis is present on both the top and bottom of the plot, answer the label of the x-axis at the bottom.
    * If the plot does not have an explicit x-axis label, answer 'Not Applicable'.
    """,
    3: """{}what is the label of the y-axis?
    * Your final answer should be the label of the y-axis that is explicitly written, including the case when y-axis is shared across multiple subplots. When the y-axis is present on both the left and right of the plot, answer the label of the y-axis at the left.
    * If the plot does not have an explicit y-axis label, answer 'Not Applicable'.""",
    4: """{}what is the leftmost labeled tick on the x-axis?
    * Your final answer should be the tick value on the x-axis that is explicitly written, including the case when x-axis is shared across multiple subplots. When the x-axis is present on both the top and bottom of the plot, answer based on the axis at the bottom. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.""",
    5: """{}what is the rightmost labeled tick on the x-axis?
    * Your final answer should be the tick value on the x-axis that is explicitly written, including the case when x-axis is shared across multiple subplots. When the x-axis is present on both the top and bottom of the plot, answer based on the axis at the bottom. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.""",
    6: """{}what is the spatially lowest labeled tick on the y-axis?
    * Your final answer should be the tick value on the y-axis that is explicitly written, including the case when y-axis is shared across multiple subplots. When the y-axis is present on both the left and right of the plot, answer based on the axis at the left. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.""",
    7: """{}what is the spatially highest labeled tick on the y-axis?
    * Your final answer should be the tick value on the y-axis that is explicitly written, including the case when y-axis is shared across multiple subplots. When the y-axis is present on both the left and right of the plot, answer based on the axis at the left. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.""",
    8: """{}what is the difference between consecutive numerical tick values on the x-axis?
    * Your final answer should be the difference between consecutive numerical tick values of the x-axis, including the case when x-axis is shared across multiple subplots. When the x-axis is present on both the top and bottom of the plot, answer based on the axis at the bottom. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.
    * If the plot does not have an explicit x-axis tick value, or if the tick values are not numerical, or if the difference is not constant between all consecutive tick values, answer "Not Applicable".""",
    9: """{}what is the difference between consecutive numerical tick values on the y-axis?
    * Your final answer should be the difference between consecutive numerical tick values of the y-axis, including the case when y-axis is shared across multiple subplots. When the y-axis is present on both the left and right of the plot, answer based on the axis at the left. Ignore units or scales that are written separately from the tick, such as units and scales from the axis label or the corner of the plot.
    * If the plot does not have an explicit y-axis tick value, or if the tick values are not numerical, or if the difference is not constant between all consecutive tick values, answer "Not Applicable".""",
    10: """{}how many lines are there?
    * Your final answer should be the number of lines in the plot. Ignore grid lines, tick marks, and any vertical or horizontal auxiliary lines.
    * If the plot does not contain any lines or is not considered a line plot, answer "Not Applicable".""",
    11: """{}do any lines intersect?
    * Your final answer should be "Yes" if any lines intersect, and "No" otherwise. Ignore grid lines, tick marks, and any vertical or horizontal auxiliary lines.
    * If the plot does not contain any lines or is not considered a line plot, answer "Not Applicable".""",
    12: """{}how many discrete labels are there in the legend?
    * Your final answer should account for only labels relevant to the plot in the legend, even if the legend is located outside the plot. 
    * If the plot does not have a legend or the legend is not considered relevant to this plot, answer "Not Applicable".""",
    13: """{}what are the names of the labels in the legend?
    * You should write down the labels from top to bottom, then from left to right and separate the labels with commas. Your final answer should account for only labels relevant to the plot in the legend, even if the legend is located outside the plot.
    * If the plot does not have a legend or the legend is not considered relevant to this plot, answer "Not Applicable".""",
    14: """{}what is the difference between the maximum and minimum values of the tick labels on the continuous legend (i.e., colorbar)?
    * You should remove the percentage sign (if any) in your answer.
    * If the plot does not have an explicit colorbar-based continuous legend or the legend is not considered relevant to this subplot, answer "Not Applicable".""",
    15: """{}what is the maximum value of the tick labels on the continuous legend (i.e., colorbar)?
    * You should remove the percentage sign (if any) in your answer. 
    * If the plot does not have an explicit colorbar-based continuous legend or the legend is not considered relevant to this subplot, answer "Not Applicable".""",
    16: """{}what is the general trend of data from left to right?
    * Your final answer should be within a few words, such as "increases", "increases then stabilizes".""",
    17: """{}what is the total number of explicitly labeled ticks across all axes?
    * Your final answer should be the total number of explicitly labeled ticks across all axes, including the case when any axis is shared across multiple subplots.""",
    18: """What is the layout of the subplots?
    * Your final answer should follow "n by m" format, where n is the number of rows and m is the number of columns.
    * If the plot does not contain subplots, answer "1 by 1".""",
    19: """What is the number of subplots?
    * Your final answer should be the total number of subplots in the plot.
    * If the plot does not contain subplots, answer "1".""",
}
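
# A minimal sketch (not part of the original file) of how one of the
# templates above might be filled in. The helper name and the example prefix
# text are assumptions made for illustration; the real pipeline may build its
# prefix differently.
def build_descriptive_query(template: str, subplot_prefix: str = "") -> str:
    """Fill the single `{}` placeholder of a descriptive template.

    Templates without a placeholder (questions 18 and 19) come back
    unchanged, because str.format ignores unused positional arguments.
    """
    return template.format(subplot_prefix)


# Hypothetical usage:
#   build_descriptive_query(DESCRIPTIVE_RESP_INST[1], "For the current plot, ")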

# Prefix of the grading prompt for descriptive questions. The `<|...|>` markers
# are placeholders to be substituted before the prompt is used.
DESCRIPTIVE_GRADING_PREFIX = """
You will be given <|NUM_TRIPLETS|> pairs of ground truth answers and model responses under an overarching question. You need to go through each of the pairs, extract the final answer from the model response, compare it with the ground truth answer, and then assign a binary score. Avoid providing explanations in your response. If there is no provided model response, please leave the extracted answer empty and give a score of 0. Your response must follow JSON format with keys [<|JSON_KEYS|>], where the value for any `extract_answer` key is your extracted answer and the value for any `score` key is an integer in [0, 1], based on the following rules:\n

Overarching Question: <|OVERARCHING_QUESTION|>
"""

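# A minimal sketch (not part of the original file) of substituting the
# `<|...|>` placeholders in a grading prefix such as
# DESCRIPTIVE_GRADING_PREFIX. The function name is hypothetical, and the key
# layout (extract_answer_T1, score_T1, ...) is an assumption inferred from
# the in-context examples in this file.
def build_descriptive_grading_prefix(
    prefix: str, overarching_question: str, num_triplets: int
) -> str:
    """Return `prefix` with its three placeholders replaced."""
    keys = []
    for i in range(1, num_triplets + 1):
        # One answer key and one score key per graded pair.
        keys.extend([f"extract_answer_T{i}", f"score_T{i}"])
    return (
        prefix.replace("<|NUM_TRIPLETS|>", str(num_triplets))
        .replace("<|JSON_KEYS|>", ", ".join(keys))
        .replace("<|OVERARCHING_QUESTION|>", overarching_question)
    )
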
# Overarching question shown to the grader for each descriptive question id.
DESCRIPTIVE_GRADING_QMAP = {
    1: "What is the title of the plot?",
    2: "What is the label of the x-axis?",
    3: "What is the label of the y-axis?",
    4: "What is the leftmost labeled tick on the x-axis?",
    5: "What is the rightmost labeled tick on the x-axis?",
    6: "What is the spatially lowest labeled tick on the y-axis?",
    7: "What is the spatially highest labeled tick on the y-axis?",
    8: "What is the difference between consecutive numerical tick values on the x-axis?",
    9: "What is the difference between consecutive numerical tick values on the y-axis?",
    10: "How many lines are there?",
    11: "Do any lines intersect?",
    12: "How many discrete labels are there in the legend?",
    13: "What are the names of the labels in the legend? (from top to bottom, then left to right)",
    14: "What is the difference between the maximum and minimum values of the tick labels on the continuous legend (i.e., colorbar)?",
    15: "What is the maximum value of the tick labels on the continuous legend (i.e., colorbar)?",
    16: "What is the general trend of data from left to right?",
    17: "What is the total number of explicitly labeled ticks across all axes?",
    18: "What is the layout of the subplots?",
    19: "What is the number of subplots?",
}

# Grading rubric and in-context example for each descriptive answer type.
DESCRIPTIVE_GRADING_ICL = {
    "title": """
Rubric: 
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are referring to the same term. It's acceptable to have different grammar or form (e.g., α and alpha; $R^2_{t,h,v,m}$ and R^2_t,h,v,m). It's acceptable to omit letter prefixes (e.g., (a) Increment over time and Increment over time).
    * Give a score of 0 if any term in the extracted answer is different from the ground truth answer.
    * When ground truth answer is "Not Applicable", the response must express "Not Applicable" to receive a score of 1.

    ### Example Start ###
    T1:
    Response 1: The title of the plot is "The number of students in each grade".
    Ground Truth 1: The variance of students in each grade

    T2:
    Response 2: There is no title.
    Ground Truth 2: Not Applicable

    T3:
    Response 3: A_v^t
    Ground Truth 3: A^t_v

    {
        "extract_answer_T1": "The number of students in each grade",
        "score_T1": 0,
        "extract_answer_T2": "Not Applicable",
        "score_T2": 1,
        "extract_answer_T3": "A_v^t",
        "score_T3": 1
    }
    ### Example End ###
""",
    "ocr": """
Rubric: 
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are referring to the same term. It's acceptable to have equivalent grammar or form (e.g., α and alpha; $R^2_{t,h,v,m}$ and R^2_t,h,v,m). If the ground truth is a number, the extracted answer should be the number with the exact same value.
    * Give a score of 0 if any term in the extracted answer is different from the ground truth answer, or if the extracted number is different in value from the ground truth number.
    * When ground truth answer is "Not Applicable", the response must express "Not Applicable" to receive a score of 1.

    ### Example Start ###
    T1:
    Response 1: The answer is 1.0
    Ground Truth 1: 1.00

    T2:
    Response 2: By manually inspecting the plot, the final answer should be 0.
    Ground Truth 2: Not Applicable

    T3:
    Response 3: A_v^t
    Ground Truth 3: A^t_v

    {
        "extract_answer_T1": 1.0,
        "score_T1": 1,
        "extract_answer_T2": 0,
        "score_T2": 0,
        "extract_answer_T3": "A_v^t",
        "score_T3": 1
    }
    ### Example End ###
""",
    "quant": """
Rubric:
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are numbers with the exact same value.
    * Give a score of 0 if the extracted answer is different in value from the ground truth answer.
    * When ground truth answer is "Not Applicable", the response must express "Not Applicable" to receive a score of 1.

    ### Example Start ###
    T1:
    Response 1: 5
    Ground Truth 1: 6

    T2:
    Response 2: 0
    Ground Truth 2: Not Applicable

    T3:
    Response 3: 4
    Ground Truth 3: 4

    {
        "extract_answer_T1": 5,
        "score_T1": 0,
        "extract_answer_T2": 0,
        "score_T2": 0,
        "extract_answer_T3": 4,
        "score_T3": 1
    }
    ### Example End ###
""",
    "bool": """
Rubric:
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are the same.
    * Give a score of 0 if the extracted answer and the ground truth answer are different.
    * When ground truth answer is "Not Applicable", the response must express "Not Applicable" to receive a score of 1.

    ### Example Start ###
    T1:
    Response 1: No, there are no intersections.
    Ground Truth 1: no

    T2:
    Response 2: No, all the lines are parallel.
    Ground Truth 2: Yes

    T3:
    Response 3: There are no lines in the plot.
    Ground Truth 3: Not Applicable

    {
        "extract_answer_T1": "No",
        "score_T1": 1,
        "extract_answer_T2": "No",
        "score_T2": 0,
        "extract_answer_T3": "Not Applicable",
        "score_T3": 1
    }
    ### Example End ###
""",
    "enum": """
Rubric:
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are referring to the same term. It's acceptable to have equivalent grammar or form (e.g., α and alpha; $R^2_{t,h,v,m}$ and R^2_t,h,v,m). The order of the terms must be the same.
    * Give a score of 0 if any term in the extracted answer is different from the ground truth answer, or if the order of the terms is different.
    * When ground truth answer is "Not Applicable", the response must express "Not Applicable" to receive a score of 1.

    ### Example Start ###
    T1:
    Response 1: Here are the names of the labels: A, B, C
    Ground Truth 1: B, A, C

    T2:
    Response 2: The labels are T56, B33.
    Ground Truth 2: T56,B33,A12

    T3:
    Response 3: \\alpha, \\beta, \\gamma^t_v
    Ground Truth 3: α, β, γ_v^t

    {
        "extract_answer_T1": "A, B, C",
        "score_T1": 0,
        "extract_answer_T2": "T56, B33",
        "score_T2": 0,
        "extract_answer_T3": "\\alpha, \\beta, \\gamma^t_v",
        "score_T3": 1
    }
    ### Example End ###
""",
    "trend": """
Rubric:
    * Give a score of 1 if and only if the extracted answer and the ground truth answer share the same general trend.
    * Give a score of 0 if the extracted answer and the ground truth answer are different in trend expression.

    ### Example Start ###
    T1:
    Response 1: there is an increase in the data from left to right
    Ground Truth 1: Decreases

    T2:
    Response 2: the curves move up and stay constant
    Ground Truth 2: Increases then stabilizes

    T3:
    Response 3: Decreases
    Ground Truth 3: Decreases then increases

    {
        "extract_answer_T1": "Increases",
        "score_T1": 0,
        "extract_answer_T2": "Move up and stay constant",
        "score_T2": 1,
        "extract_answer_T3": "Decreases",
        "score_T3": 0
    }
    ### Example End ###
""",
    "layout": """
Rubric:
    * Give a score of 1 if and only if the extracted answer and the ground truth answer are the same in terms of the number of rows and columns (e.g., n by m).
    * Give a score of 0 if the extracted answer is different from the ground truth answer.

    ### Example Start ###
    T1:
    Response 1: 2 by 3
    Ground Truth 1: 3 by 2

    T2:
    Response 2: the layout is 1 by 1
    Ground Truth 2: 1 by 1

    T3:
    Response 3: there are two rows and three columns
    Ground Truth 3: 2 by 3

    {
        "extract_answer_T1": "2 by 3",
        "score_T1": 0,
        "extract_answer_T2": "1 by 1",
        "score_T2": 1,
        "extract_answer_T3": "2 by 3",
        "score_T3": 1
    }
    ### Example End ###
""",
}
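
# A minimal sketch (not part of the original file) of reading back a grader
# reply that follows the JSON layout shown in the examples above. The
# function name and the fallback values for missing keys are assumptions; the
# real pipeline may validate replies differently.
import json


def parse_descriptive_grades(raw_json: str, num_triplets: int) -> list:
    """Return a list of (extracted_answer, score) tuples, one per pair."""
    data = json.loads(raw_json)
    results = []
    for i in range(1, num_triplets + 1):
        answer = data.get(f"extract_answer_T{i}", "")
        score = data.get(f"score_T{i}", 0)
        # Anything other than a score equal to 1 is counted as incorrect.
        results.append((answer, 1 if score == 1 else 0))
    return results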

# Prefix of the grading prompt for reasoning questions.
REASONING_GRADING_PREFIX = """
You will be given a question, a ground truth answer, and a model response. You need to extract the final answer from the model response, compare it with the ground truth answer, and then assign a binary score. Avoid providing explanations in your response. If there is no provided model response, please leave the extracted answer empty and give a score of 0.

Your response must follow JSON format with keys [extract_answer, score], where the value of the score is an integer in [0, 1]. You must follow the scoring rules:\n"""

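# Scoring rules and examples for reasoning questions, keyed by instruction id.
# Entries 3 and 4 are currently identical.
REASONING_GRADING_INST = {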
    1: """
    ### Rules ###
    * Give a score of 1 if and only if the final answer and the ground truth answer are referring to the same term. It's acceptable to have different grammar or form (e.g., α and alpha; $R^2_{t,h,v,m}$ and R^2_t,h,v,m). It's also acceptable to have different orders of the terms when the question asks for multiple terms.
    * Give a score of 0 if any term (e.g., ACC+ and ACC; P-101 and P=101) is different between the final answer and the ground truth.

    ### Example 1 Starts ###
    * Question: What is the name of the curve that intersects y=\\lambda exactly three times?
    * Ground Truth: P56962
    * Response: There is only one curve that intersects y=\\lambda exactly three times. The name of the curve is written as P55762.
    
    {
        "extracted_answer": "P55762",
        "score": 0
    }
    ### Example 1 Ends ###


    ### Example 2 Starts ###
    * Question: What is the letter of the subplot where all bars are above 35?
    * Ground Truth: (b)
    * Response: The letter of the subplot where all bars are above 35 is b.

    {
        "extracted_answer": "b",
        "score": 1
    }
    ### Example 2 Ends ###

    ### Your Turn ###
    * Question: <|question|>
    * Ground Truth: <|ground_truth|>
    * Response: <|response|>

    """,
    2: """
    ### Rules ###
    * If there are predefined options in the question:
        * Give a score of 1 if the final answer matches the ground truth answer exactly.
        * Give a score of 0 if the final answer does not match the ground truth answer.
    * If there are no predefined options in the question:
        * Give a score of 1 if the final answer shares the same semantic meaning with the ground truth answer (e.g., "increasing then decreasing" and "moving up then down"; "converge" and "move closer together").
        * Give a score of 0 if the final answer has a different semantic meaning from the ground truth answer (e.g., "increasing then decreasing" and "remain constant"; "converge" and "diverge").

    ### Example 1 Starts ###
    * Question: What is the trend of the red curve between t=10 and t=25?
    * Ground Truth: increasing then decreasing
    * Response: The red curve is increasing between t=10 and t=25.

    {
        "extracted_answer": "increasing",
        "score": 0
    }
    ### Example 1 Ends ###

    ### Example 2 Starts ###
    * Question: What is the interval where the blue curve achieves the maximum value among [0, 50], [50, 100], [100, 150], and [150, 200]?
    * Ground Truth: [50, 100]
    * Response: The interval where the blue curve achieves the maximum value is [50, 100].

    {
        "extracted_answer": "[50, 100]",
        "score": 1
    }
    ### Example 2 Ends ###

    ### Your Turn ###
    * Question: <|question|>
    * Ground Truth: <|ground_truth|>
    * Response: <|response|>

    """,
    3: """
    ### Rules ###
    * Give a score of 1 if and only if the two numbers are exactly equal in values. It's acceptable to have different notations (e.g., 0.01 and 10^-2; 1500 and 1.5e3).
    * Give a score of 0 if the two numbers are different in values.

    ### Example 1 Starts ###
    * Question: What is the value of the red curve at t=10?
    * Ground Truth: 0.01
    * Response: The value of the red curve at t=10 is 0.012.

    {
        "extracted_answer": "0.012",
        "score": 0
    }
    ### Example 1 Ends ###

    ### Example 2 Starts ###
    * Question: What is the value of the blue curve at t=50?
    * Ground Truth: 1500
    * Response: The value of the blue curve at t=50 is 1.5e3.

    {
        "extracted_answer": "1.5e3",
        "score": 1
    }
    ### Example 2 Ends ###

    ### Your Turn ###
    * Question: <|question|>
    * Ground Truth: <|ground_truth|>
    * Response: <|response|>

    """,
    4: """
    ### Rules ###
    * Give a score of 1 if and only if the two numbers are exactly equal in values. It's acceptable to have different notations (e.g., 0.01 and 10^-2; 1500 and 1.5e3).
    * Give a score of 0 if the two numbers are different in values.

    ### Example 1 Starts ###
    * Question: What is the value of the red curve at t=10?
    * Ground Truth: 0.01
    * Response: The value of the red curve at t=10 is 0.012.

    {
        "extracted_answer": "0.012",
        "score": 0
    }
    ### Example 1 Ends ###

    ### Example 2 Starts ###
    * Question: What is the value of the blue curve at t=50?
    * Ground Truth: 1500
    * Response: The value of the blue curve at t=50 is 1.5e3.

    {
        "extracted_answer": "1.5e3",
        "score": 1
    }
    ### Example 2 Ends ###

    ### Your Turn ###
    * Question: <|question|>
    * Ground Truth: <|ground_truth|>
    * Response: <|response|>

    """,
}
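
# A minimal sketch (not part of the original file) of assembling a full
# reasoning grading prompt. It assumes the prefix and one instruction from
# REASONING_GRADING_INST are simply concatenated; the function name is
# hypothetical.
def build_reasoning_grading_query(
    prefix: str, instruction: str, question: str, ground_truth: str, response: str
) -> str:
    """Substitute the three `<|...|>` placeholders and prepend the prefix."""
    filled = (
        instruction.replace("<|question|>", question)
        .replace("<|ground_truth|>", ground_truth)
        # The response is substituted last so that placeholder-like text
        # inside a model response is never expanded.
        .replace("<|response|>", response)
    )
    return prefix + filled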

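# Response instructions for reasoning questions, keyed by instruction id.
# Template 4 has two `{}` placeholders; the others have one.
REASONING_RESP_INST = {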
    1: """{}
    * Your final answer must be grounded to some text that is explicitly written and relevant to the question in the chart.
    * If you need to answer multiple terms, separate them with commas.
    * Unless specified in the question (such as answering with a letter), you are required to answer the full names of subplots and/or labels by default.
    """,
    2: """{}
    * If there are options in the question, your final answer must conform to one of the options.
    * If there are additional instructions in the question, follow them accordingly.
    * If there are neither options nor additional instructions, you are allowed to respond with a short phrase only.
    """,
    3: """{}
    * Your final answer must be grounded to a number that is explicitly written and relevant to the question in the chart, even if it's an approximate value.
    * You are allowed to extract numbers within some text when needed.
    """,
    4: """{}
    {}
    """,
}
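
# A minimal sketch (not part of the original file) of filling a reasoning
# response template. Treating the second placeholder of template 4 as an
# extra, question-specific instruction is an assumption, as are the function
# and parameter names.
def build_reasoning_query(
    template: str, question: str, extra_instruction: str = ""
) -> str:
    """Fill one or two `{}` placeholders of a reasoning template.

    For templates with a single placeholder, `extra_instruction` is ignored,
    because str.format discards unused positional arguments.
    """
    return template.format(question, extra_instruction)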