Commit 9beacbb by GuangzhiWang · verified · 1 parent: 031c100

fix: correct per-task scores to match no-instruction variant (avg 39.6/35.1)

Files changed (1)
  1. README.md +20 -20
README.md CHANGED
@@ -75,32 +75,32 @@ MRE-T1 achieves state-of-the-art single-model performance on the [BRIGHT benchma
 
  | Task | MRE-T1 |
  |------|--------|
- | Biology | 81.2 |
- | Earth Science | 73.2 |
- | Economics | 64.8 |
- | Psychology | 72.5 |
- | Robotics | 57.3 |
- | StackOverflow | 62.8 |
- | Sustainable Living | 60.7 |
- | LeetCode | 40.6 |
- | Pony | 70.2 |
- | AOPS | 39.5 |
- | TheoremQA (Questions) | 56.2 |
- | TheoremQA (Theorems) | 66.3 |
+ | Biology | 55.3 |
+ | Earth Science | 56.5 |
+ | Economics | 32.9 |
+ | Psychology | 48.2 |
+ | Robotics | 33.1 |
+ | StackOverflow | 34.2 |
+ | Sustainable Living | 37.3 |
+ | LeetCode | 35.0 |
+ | Pony | 35.5 |
+ | AOPS | 16.7 |
+ | TheoremQA (Questions) | 43.3 |
+ | TheoremQA (Theorems) | 46.9 |
  | **Average** | **39.6** |
 
  ### Long Document Retrieval (nDCG@10)
 
  | Task | MRE-T1 |
  |------|--------|
- | Biology | 91.5 |
- | Earth Science | 84.7 |
- | Economics | 82.0 |
- | Psychology | 84.9 |
- | Robotics | 70.1 |
- | StackOverflow | 65.9 |
- | Sustainable Living | 84.7 |
- | Pony | 51.3 |
+ | Biology | 74.2 |
+ | Earth Science | 72.2 |
+ | Economics | 57.3 |
+ | Psychology | 71.3 |
+ | Robotics | 51.6 |
+ | StackOverflow | 51.4 |
+ | Sustainable Living | 66.2 |
+ | Pony | 33.9 |
  | **Average** | **35.1** |
 
  ### Comparison with Other Models (Short, Single Model Only)
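
The commit message says the per-task rows now match the stated averages (39.6 short, 35.1 long). One way to check this is to recompute the mean of the `+` rows and compare it with the **Average** row. The sketch below is a minimal check, not part of the repository: the names `SHORT_SCORES`, `LONG_SCORES`, and `mean_matches` are hypothetical, and it assumes the README's averages are plain unweighted means over the listed tasks, which the diff does not confirm.

```python
from statistics import fmean

# Per-task nDCG@10 values copied from the "+" rows of this diff.
SHORT_SCORES = {
    "Biology": 55.3, "Earth Science": 56.5, "Economics": 32.9,
    "Psychology": 48.2, "Robotics": 33.1, "StackOverflow": 34.2,
    "Sustainable Living": 37.3, "LeetCode": 35.0, "Pony": 35.5,
    "AOPS": 16.7, "TheoremQA (Questions)": 43.3,
    "TheoremQA (Theorems)": 46.9,
}
LONG_SCORES = {
    "Biology": 74.2, "Earth Science": 72.2, "Economics": 57.3,
    "Psychology": 71.3, "Robotics": 51.6, "StackOverflow": 51.4,
    "Sustainable Living": 66.2, "Pony": 33.9,
}


def mean_matches(scores, reported, tol=0.05):
    """Return (mean, ok); ok is True when the unweighted mean of
    `scores` agrees with `reported` to one decimal place.

    Assumption: the reported average is a simple mean over tasks.
    """
    mean = fmean(scores.values())
    # Small epsilon guards against float noise at the rounding boundary.
    return mean, abs(mean - reported) <= tol + 1e-9


if __name__ == "__main__":
    for name, scores, reported in [
        ("short", SHORT_SCORES, 39.6),
        ("long", LONG_SCORES, 35.1),
    ]:
        mean, ok = mean_matches(scores, reported)
        print(f"{name}: mean={mean:.2f} reported={reported} match={ok}")
```

Under that assumption, the twelve short-document rows average about 39.58, consistent with the reported 39.6. The eight long-document rows average about 59.76, which does not agree with the reported 35.1, so that average is either computed some other way or may still need a follow-up correction.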