LenDigLearn committed · verified
Commit 1fc3a30 · 1 parent: d9e89fd

added preliminary benchmark numbers

Files changed (1): README.md (+25 −1)
README.md CHANGED
@@ -62,7 +62,31 @@ Our data encompasses examples of a length up to 16384 tokens, further enhancing
 
 ## Evaluation
 
-Evaluation results will be added soon.
+We ran our benchmarks with lighteval. The accuracy numbers obtained this way differ substantially from the base model's official benchmark results and from numbers produced by other benchmark suites.
+For comparison, we therefore ran the same lighteval benchmarks on the [base model](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) under identical conditions.
+As of 2025-01-24, we are re-running these benchmarks with a different suite and adding further German-specific benchmarks.
+
+### English Benchmarks
+| Benchmark | Mistral-Nemo-Instruct 2407 | educa-ai-nemo-sft |
+| --- | --- | --- |
+| HellaSwag (0-shot) | **44.33%** | 38.65% |
+| WinoGrande (0-shot) | 55.49% | **58.56%** |
+| OpenBookQA (0-shot) | **40.60%** | 36.40% |
+| CommonSenseQA (0-shot) | 37.26% | **39.31%** |
+| TruthfulQA (0-shot) | 56.12% | **59.94%** |
+| MMLU (5-shot) | 30.10% | **37.91%** |
+
+
+### Multilingual Benchmarks (MMLU)
+| Language | Mistral-Nemo-Instruct 2407 | educa-ai-nemo-sft |
+| --- | --- | --- |
+| French | **30.32%** | 29.05% |
+| German | 27.69% | **41.82%** |
+| Spanish | 24.69% | **30.25%** |
+| Italian | 31.29% | **34.81%** |
+| Portuguese | 24.16% | **28.81%** |
+| Chinese | 34.80% | **37.85%** |
+| Japanese | 34.27% | **35.18%** |
 
 
 ## Model Card Authors [optional]
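The lighteval runs mentioned in the diff could be reproduced along these lines. This is a sketch, not the authors' exact command: the CLI flags and the `suite|task|fewshot|truncation` task strings are assumptions that vary between lighteval releases, and the local model path `educa-ai-nemo-sft` is hypothetical — consult `lighteval --help` for the installed version.

```shell
# Sketch: evaluate the base model and the fine-tune with lighteval's
# accelerate backend under identical conditions.
# NOTE: flags and task identifiers below are assumptions; check your
# lighteval version's documentation before running.
pip install lighteval accelerate

for MODEL in mistralai/Mistral-Nemo-Instruct-2407 educa-ai-nemo-sft; do
  lighteval accelerate \
    --model_args "pretrained=${MODEL}" \
    --tasks "leaderboard|mmlu|5|0" \
    --output_dir ./lighteval-results
done
```

Running both models through the same harness, with the same few-shot settings, is what makes the side-by-side columns in the tables comparable even when the absolute numbers disagree with other suites.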