andresnowak/MNLP_M2_mcqa_model

@@ -56,11 +56,81 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.04
 - num_epochs: 2
-### Training results
-For musr we give question and narrative, and all this results are done on single letter (e.g " A")
-| Model                  | MMLU | MMLU-pro | arc-easy | arc-challenge | nlp4education | GPQA | Musr |
-|------------------------|------|----------|----------|---------------|---------------|------|------|
-| Qwen3-0.6B-base-MCQA   | 52%  | 17%      | 86%      | 72%           | 51%           | 29%  | 53%  |
+## Evaluation Results
+The model was evaluated on a suite of Multiple Choice Question Answering (MCQA) benchmarks (on the validation and test sets of each benchmark).
+For NLP4Education, only the approximately 1,000 question-answer pairs that were provided to us were used.
+**Important Note on MCQA Evals Benchmark:**
+**The performance on these benchmarks is as follows**:
+### First evaluation: The tests were done with this prompt (type 5):
+```
+This question assesses challenging STEM problems as found on graduate standardized tests. Carefully evaluate the options and select the correct answer.
+---
+[Insert Question Here]
+---
+[Insert Choices Here, e.g.:
+A. Option 1
+B. Option 2
+C. Option 3
+D. Option 4]
+---
+Your response should include the letter and the exact text of the correct choice.
+Example: B. Entropy increases.
+Answer:
+```
+The testing was done on answers in the format ``` [Letter]. [Text answer]```
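
A minimal sketch of how the type 5 prompt and its scoring target could be assembled from a question and a list of choices. The function names `build_type5_prompt` and `format_target` are hypothetical, and the exact whitespace and newline handling of the author's evaluation code is not given in the card, so treat the joins below as an assumption.

```python
import string

# Instruction text copied from the type 5 template in the card.
TYPE5_HEADER = (
    "This question assesses challenging STEM problems as found on graduate "
    "standardized tests. Carefully evaluate the options and select the correct answer."
)
TYPE5_FOOTER = (
    "Your response should include the letter and the exact text of the correct choice.\n"
    "Example: B. Entropy increases.\n"
    "Answer:"
)


def build_type5_prompt(question, choices):
    """Fill the template: header, question, lettered choices, footer (hypothetical helper)."""
    letters = string.ascii_uppercase
    options = "\n".join(f"{letters[i]}. {text}" for i, text in enumerate(choices))
    return "\n".join([TYPE5_HEADER, "---", question, "---", options, "---", TYPE5_FOOTER])


def format_target(index, choices):
    """Build the ' [Letter]. [Text answer]' string, including the leading space."""
    return f" {string.ascii_uppercase[index]}. {choices[index]}"


choices = ["Entropy decreases.", "Entropy increases.", "Entropy is constant.", "Not defined."]
prompt = build_type5_prompt("What happens to entropy in an isolated system?", choices)
target = format_target(1, choices)
```

The leading space in the target mirrors the format string shown above, so that the target can be appended directly after `Answer:`.
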
+| Benchmark          | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+| :----------------- | :------------- | :----------------------------- |
+| ARC Challenge      | 66.28%         | 64.92%                         |
+| ARC Easy           | 84.22%         | 81.33%                         |
+| GPQA               | 38.84%         | 36.61%                         |
+| Math QA            | 25.03%         | 24.67%                         |
+| MCQA Evals         | 43.51%         | 40.91%                         |
+| MMLU               | 52.17%         | 52.17%                         |
+| MMLU Pro           | 16.45%         | 15.04%                         |
+| MuSR               | 53.17%         | 52.25%                         |
+| NLP4Education      | 44.45%         | 42.65%                         |
+| **Overall**        | **47.12%**     | **45.62%**                     |
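
The **Overall** row is consistent with an unweighted mean over the nine benchmark rows. This is an inference from the numbers, not something the card states. The sketch below recomputes it from the values in the table; the helper name `macro_average` is hypothetical.

```python
# Accuracy values (percent) copied from the type 5 table.
acc = {
    "ARC Challenge": 66.28, "ARC Easy": 84.22, "GPQA": 38.84,
    "Math QA": 25.03, "MCQA Evals": 43.51, "MMLU": 52.17,
    "MMLU Pro": 16.45, "MuSR": 53.17, "NLP4Education": 44.45,
}
acc_norm = {
    "ARC Challenge": 64.92, "ARC Easy": 81.33, "GPQA": 36.61,
    "Math QA": 24.67, "MCQA Evals": 40.91, "MMLU": 52.17,
    "MMLU Pro": 15.04, "MuSR": 52.25, "NLP4Education": 42.65,
}


def macro_average(scores):
    """Unweighted mean: every benchmark counts equally, regardless of its size."""
    return sum(scores.values()) / len(scores)


print(f"Overall Acc:      {macro_average(acc):.2f}%")       # 47.12%
print(f"Overall Acc Norm: {macro_average(acc_norm):.2f}%")  # 45.62%
```

Because the mean is unweighted, a small benchmark such as GPQA moves the overall figure as much as a large one such as MMLU.
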
+### Second evaluation: (type 0)
+```
+The following are multiple choice questions (with answers) about knowledge and skills in advanced master-level STEM courses.
+---
+*[Insert Question Here]*
+---
+*[Insert Choices Here, e.g.:*
+*A. Option 1*
+*B. Option 2*
+*C. Option 3*
+*D. Option 4]*
+---
+Answer:
+```
+The testing was done on answers in the format ``` [Letter]. [Text answer]```
+| Benchmark          | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
+| :----------------- | :------------- | :----------------------------- |
+| ARC Challenge      | 69.95%         | 65.33%                         |
+| ARC Easy           | 84.45%         | 78.51%                         |
+| GPQA               | 31.92%         | 28.57%                         |
+| Math QA            | 27.02%         | 26.88%                         |
+| MCQA Evals         | 43.90%         | 35.32%                         |
+| MMLU               | 52.17%         | 52.17%                         |
+| MMLU Pro           | 15.04%         | 13.27%                         |
+| MuSR               | 53.17%         | 52.25%                         |
+| NLP4Education      | 49.14%         | 42.85%                         |
+| **Overall**        | **47.42%**     | **43.91%**                     |
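
The card does not define the two columns. A common convention in log-likelihood-based evaluation harnesses is that Acc picks the choice with the highest total log-likelihood, while Acc Norm first divides each log-likelihood by the length of the choice text, so longer answers are not penalized. The sketch below assumes that convention; `pick_choice` is a hypothetical helper and the log-likelihood values are made up for illustration.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the predicted choice.

    normalize=False: argmax of the raw log-likelihood (assumed "Acc").
    normalize=True:  argmax of log-likelihood per byte of text (assumed "Acc Norm").
    """
    scores = [
        ll / len(text.encode("utf-8")) if normalize else ll
        for ll, text in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative values only: the longer answer has a lower total log-likelihood
# but a better per-byte score.
choices = [" A. Yes", " B. Entropy increases over time."]
loglikelihoods = [-6.0, -12.0]

raw_pick = pick_choice(loglikelihoods, choices)                    # index 0
norm_pick = pick_choice(loglikelihoods, choices, normalize=True)   # index 1
```

Under this reading, a gap between the two columns suggests the model's predictions are sensitive to answer length, which matters here because scoring uses the full ` [Letter]. [Text answer]` string.
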
 ### Framework versions