andresnowak committed on
Commit
c6eaad0
·
verified ·
1 Parent(s): 29ea355

Update README.md

  1. README.md +75 -5
README.md CHANGED
@@ -56,11 +56,81 @@ The following hyperparameters were used during training:
56
  - lr_scheduler_warmup_ratio: 0.04
57
  - num_epochs: 2
58
 
59
- ### Training results
60
- For MuSR we give the question and narrative, and all these results are scored on a single letter (e.g. " A")
61
- | Model | MMLU | MMLU-pro | arc-easy | arc-challenge | nlp4education | GPQA | Musr |
62
- |------------------------|------|----------|----------|---------------|---------------|------|------|
63
- | Qwen3-0.6B-base-MCQA | 52% | 17% | 86% | 72% | 51% | 29% | 53% |
64
 
65
  ### Framework versions
66
 
 
56
  - lr_scheduler_warmup_ratio: 0.04
57
  - num_epochs: 2
58
 
59
+ ## Evaluation Results
60
+
61
+ The model was evaluated on a suite of Multiple Choice Question Answering (MCQA) benchmarks, using the validation and test sets of each one.
62
+ For NLP4Education, only the approximately 1,000 question-answer pairs that were provided to us are used.
63
+
64
+ **Important Note on MCQA Evals Benchmark:**
65
+
66
+ **The performance on these benchmarks is as follows**:
67
+
68
+ ### First evaluation: the tests were done with this prompt (type 5):
69
+ ```
70
+ This question assesses challenging STEM problems as found on graduate standardized tests. Carefully evaluate the options and select the correct answer.
71
+
72
+ ---
73
+ [Insert Question Here]
74
+ ---
75
+ [Insert Choices Here, e.g.:
76
+ A. Option 1
77
+ B. Option 2
78
+ C. Option 3
79
+ D. Option 4]
80
+ ---
81
+
82
+ Your response should include the letter and the exact text of the correct choice.
83
+ Example: B. Entropy increases.
84
+ Answer:
85
+ ```
86
+
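The type 5 template above can be filled in programmatically. The following is a minimal sketch, not the code used for this evaluation: the function name `build_type5_prompt` is hypothetical, and the exact whitespace between sections is an assumption read off the template as displayed.

```python
from string import ascii_uppercase

# Instruction text copied from the type 5 template.
INSTRUCTION = (
    "This question assesses challenging STEM problems as found on graduate "
    "standardized tests. Carefully evaluate the options and select the correct answer."
)


def build_type5_prompt(question, choices):
    """Fill the type 5 template with one question and its answer options.

    Hypothetical helper: spacing between sections is assumed, not confirmed.
    """
    # Label the options A., B., C., ... in the order they are given.
    options = "\n".join(
        f"{letter}. {text}" for letter, text in zip(ascii_uppercase, choices)
    )
    return (
        f"{INSTRUCTION}\n\n"
        f"---\n{question}\n---\n{options}\n---\n\n"
        "Your response should include the letter and the exact text of the correct choice.\n"
        "Example: B. Entropy increases.\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_type5_prompt("What is 2 + 2?", ["3", "4", "5", "6"]))
```

The prompt ends with `Answer:` so that the model's continuation can be compared directly against the candidate answers.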
87
+ Testing was done on the completion format ``` [Letter]. [Text answer]```
88
+
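The README does not say how the Accuracy and Normalized Accuracy columns are computed. In common evaluation harnesses, accuracy picks the choice whose continuation has the highest summed log-likelihood, while normalized accuracy first divides that log-likelihood by the continuation length. The sketch below assumes that convention with character-length normalization; `pick_choice` and the example numbers are hypothetical.

```python
def pick_choice(logprobs, continuations, normalize=False):
    """Return the index of the best-scoring answer continuation.

    logprobs: summed log-likelihood of each continuation given the prompt.
    continuations: the candidate strings, e.g. " A. Option 1".
    normalize: if True, divide each log-likelihood by the continuation's
        character length (assumed normalization, not confirmed by the README).
    """
    scores = [
        lp / len(text) if normalize else lp
        for lp, text in zip(logprobs, continuations)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Made-up log-likelihoods for illustration only.
    conts = [" A. Yes", " B. Entropy increases."]
    lps = [-4.0, -6.0]
    print(pick_choice(lps, conts))                  # raw scores favor the short answer
    print(pick_choice(lps, conts, normalize=True))  # normalization favors the longer one
```

Because the scored continuations include the full answer text, their lengths differ between choices, which is one reason the two columns can disagree.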
89
+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
90
+ | :----------------- | :------------- | :----------------------------- |
91
+ | ARC Challenge | 66.28% | 64.92% |
92
+ | ARC Easy | 84.22% | 81.33% |
93
+ | GPQA | 38.84% | 36.61% |
94
+ | Math QA | 25.03% | 24.67% |
95
+ | MCQA Evals | 43.51% | 40.91% |
96
+ | MMLU | 52.17% | 52.17% |
97
+ | MMLU Pro | 16.45% | 15.04% |
98
+ | MuSR | 53.17% | 52.25% |
99
+ | NLP4Education | 44.45% | 42.65% |
100
+ | **Overall** | **47.12%** | **45.62%** |
101
+
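The Overall row of the type 5 table is consistent with an unweighted mean over the nine benchmarks. This is an inference from the reported numbers, not something the README states; the short check below reproduces it.

```python
# Per-benchmark scores copied from the type 5 table (percent).
acc = {
    "ARC Challenge": 66.28, "ARC Easy": 84.22, "GPQA": 38.84,
    "Math QA": 25.03, "MCQA Evals": 43.51, "MMLU": 52.17,
    "MMLU Pro": 16.45, "MuSR": 53.17, "NLP4Education": 44.45,
}
acc_norm = {
    "ARC Challenge": 64.92, "ARC Easy": 81.33, "GPQA": 36.61,
    "Math QA": 24.67, "MCQA Evals": 40.91, "MMLU": 52.17,
    "MMLU Pro": 15.04, "MuSR": 52.25, "NLP4Education": 42.65,
}


def unweighted_mean(scores):
    """Average the benchmark scores, giving each benchmark equal weight."""
    return round(sum(scores.values()) / len(scores), 2)


overall_acc = unweighted_mean(acc)            # 47.12, matches the Overall row
overall_acc_norm = unweighted_mean(acc_norm)  # 45.62, matches the Overall row
```

An unweighted mean gives small benchmarks the same influence as large ones, so it is not the accuracy over all pooled questions.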
102
+ ### Second evaluation: the tests were done with this prompt (type 0):
103
+ ```
104
+ The following are multiple choice questions (with answers) about knowledge and skills in advanced master-level STEM courses.
105
+
106
+ ---
107
+ *[Insert Question Here]*
108
+ ---
109
+ *[Insert Choices Here, e.g.:*
110
+ *A. Option 1*
111
+ *B. Option 2*
112
+ *C. Option 3*
113
+ *D. Option 4]*
114
+ ---
115
+ Answer:
116
+ ```
117
+
118
+ Testing was done on the completion format ``` [Letter]. [Text answer]```
119
+
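As an illustration of the ` [Letter]. [Text answer]` format, the candidate continuations for one question could be built as follows. This is a sketch under the assumption that each candidate starts with a single leading space, as the format string suggests; `build_continuations` is a hypothetical name.

```python
from string import ascii_uppercase


def build_continuations(choices):
    """Return one candidate continuation per choice, e.g. " A. Option 1".

    Assumes a leading space before the letter, as in the stated format.
    """
    return [
        f" {letter}. {text}" for letter, text in zip(ascii_uppercase, choices)
    ]


if __name__ == "__main__":
    for cont in build_continuations(["Option 1", "Option 2", "Option 3", "Option 4"]):
        print(repr(cont))
```

Each candidate is appended to the prompt ending in `Answer:` and scored by the model.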
120
+ | Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
121
+ | :----------------- | :------------- | :----------------------------- |
122
+ | ARC Challenge | 69.95% | 65.33% |
123
+ | ARC Easy | 84.45% | 78.51% |
124
+ | GPQA | 31.92% | 28.57% |
125
+ | Math QA | 27.02% | 26.88% |
126
+ | MCQA Evals | 43.90% | 35.32% |
127
+ | MMLU | 52.17% | 52.17% |
128
+ | MMLU Pro | 15.04% | 13.27% |
129
+ | MuSR | 53.17% | 52.25% |
130
+ | NLP4Education | 49.14% | 42.85% |
131
+ | **Overall** | **47.42%** | **43.91%** |
132
+
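To see where the two prompts differ, the accuracy columns of the two tables can be subtracted benchmark by benchmark. The snippet below only rearranges the reported numbers; the variable names are illustrative.

```python
# Accuracy (percent) copied from the type 5 and type 0 tables.
type5 = {
    "ARC Challenge": 66.28, "ARC Easy": 84.22, "GPQA": 38.84,
    "Math QA": 25.03, "MCQA Evals": 43.51, "MMLU": 52.17,
    "MMLU Pro": 16.45, "MuSR": 53.17, "NLP4Education": 44.45,
}
type0 = {
    "ARC Challenge": 69.95, "ARC Easy": 84.45, "GPQA": 31.92,
    "Math QA": 27.02, "MCQA Evals": 43.90, "MMLU": 52.17,
    "MMLU Pro": 15.04, "MuSR": 53.17, "NLP4Education": 49.14,
}

# Positive values mean the type 0 prompt scored higher than type 5.
delta = {name: round(type0[name] - type5[name], 2) for name in type5}

if __name__ == "__main__":
    for name, diff in sorted(delta.items(), key=lambda item: item[1]):
        print(f"{name:<15} {diff:+.2f}")
```

The largest shifts are on GPQA (lower with type 0) and NLP4Education (higher with type 0), while MMLU and MuSR are identical under both prompts.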
133
+
134
 
135
  ### Framework versions
136