Updated table evaluation results
README.md
CHANGED
@@ -46,16 +46,17 @@ All comparison models were trained exclusively on open data, either in the publi
 The following tables show the performance on each dataset.
 For each, we report the respective main metric from EuroEval and the confidence interval.

-
-
-
-
-
-| munin-open-7b-pt (stage
-| munin-open-7b-pt (stage
-| **
-
-| Pleias-

 ### Performance on English

@@ -65,17 +66,16 @@ The goal of this section is to demonstrate how the performance deteriorates for
 across tasks, with the exception of `squad`.


-| Model
-| ------------------------ | -------------
-| base (comma-v0.1-2t)
-| **Training Stages**
-| munin-open-7b-pt (stage 1)
-| munin-open-7b-pt (stage 2)
-| munin-open-7b-pt (stage 3)
-| **Baseline**
-| Pleias-350m-Preview
-| Pleias-1.2b-Preview
-

 The following tables show the performance on each dataset.
 For each, we report the respective main metric from EuroEval and the confidence interval.

+
+| Model                        | scala-da (MCC)  | dala (MCC)      | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) | average  |
+| ---------------------------- | --------------- | --------------- | ------------------ | ------------------------- | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- | -------- |
+| base (comma-v0.1-2t)         | 0.9 ± 0.8       | 0.2 ± 0.6       | 39.8 ± 1.4         | 32.0 ± 2.8                | 3.6 ± 2.3               | 10.7 ± 4.1                 | 66.4 ± 0.8            | 3.8 ± 1.0          | 60.2 ± 1.7                   | 24.2     |
+| **Training Stages**          |                 |                 |                    |                           |                         |                            |                       |                    |                              |          |
+| munin-open-7b-pt (stage 1)   | 13.3 ± 2.9      | 12.7 ± 2.2      | **47.7** ± 1.7     | 40.0 ± 2.4                | 18.1 ± 0.9              | 32.8 ± 1.4                 | **76.6** ± 0.6        | 12.9 ± 1.0         | 66.3 ± 0.7                   | 35.6     |
+| munin-open-7b-pt (stage 2)   | 15.8 ± 3.1      | 14.4 ± 2.9      | 47.4 ± 2.3         | 40.4 ± 2.4                | 24.1 ± 1.8              | 36.1 ± 1.8                 | 75.2 ± 0.7            | 13.1 ± 1.1         | 66.5 ± 0.6                   | 37.0     |
+| munin-open-7b-pt (stage 3)   | **16.5** ± 1.4  | **15.7** ± 1.7  | 46.3 ± 2.1         | **41.1** ± 2.8            | **24.6** ± 2.0          | **36.2** ± 1.7             | 76.0 ± 0.7            | **13.2** ± 1.2     | **66.6** ± 0.6               | **37.4** |
+| **Baselines**                |                 |                 |                    |                           |                         |                            |                       |                    |                              |          |
+| Pleias-350m-Preview          | -1.0 ± 1.5      | -1.8 ± 1.8      | 10.6 ± 2.9         | 12.9 ± 1.8                | 0.7 ± 2.6               | 4.6 ± 2.3                  | 11.6 ± 0.9            | -0.3 ± 0.7         | 56.3 ± 1.5                   | 10.4     |
+| Pleias-1.2b-Preview          | 0.2 ± 1.1       | 0.7 ± 1.0       | 27.7 ± 2.9         | 27.3 ± 2.2                | -0.6 ± 1.9              | 8.6 ± 3.2                  | 35.2 ± 1.3            | -0.0 ± 1.5         | 60.3 ± 0.9                   | 17.7     |

 ### Performance on English

 across tasks, with the exception of `squad`.


+| Model                        | scala-en (MCC)  | sst5 (MCC)      | conll-en (Micro F1, No Misc) | life-in-the-uk (MCC) | squad (F1)      | hellaswag (MCC) | cnn-dailymail (BERTScore) | average  |
+| ---------------------------- | --------------- | --------------- | ---------------------------- | -------------------- | --------------- | --------------- | ------------------------- | -------- |
+| base (comma-v0.1-2t)         | **29.7** ± 1.9  | **61.8** ± 2.1  | **57.5** ± 2.8               | 41.6 ± 2.4           | **90.4** ± 0.4  | **16.8** ± 0.6  | **63.3** ± 0.9            | **51.6** |
+| **Training Stages**          |                 |                 |                              |                      |                 |                 |                           |          |
+| munin-open-7b-pt (stage 1)   | 17.1 ± 9.0      | 60.0 ± 1.7      | 56.6 ± 2.2                   | 40.5 ± 1.7           | 90.1 ± 0.3      | 13.7 ± 0.7      | 59.6 ± 1.3                | 48.2     |
+| munin-open-7b-pt (stage 2)   | 27.7 ± 2.0      | 59.5 ± 1.6      | 56.6 ± 2.3                   | 41.2 ± 1.7           | 90.2 ± 0.4      | 16.0 ± 0.9      | 60.3 ± 1.6                | 50.2     |
+| munin-open-7b-pt (stage 3)   | 29.0 ± 2.4      | 60.3 ± 1.4      | 56.9 ± 2.5                   | **41.7** ± 1.8       | 89.9 ± 0.4      | 13.8 ± 0.9      | 59.2 ± 1.7                | 50.1     |
+| **Baselines**                |                 |                 |                              |                      |                 |                 |                           |          |
+| Pleias-350m-Preview          | 0.7 ± 1.8       | 15.4 ± 7.3      | 31.8 ± 3.5                   | -0.7 ± 2.1           | 31.1 ± 2.3      | 0.2 ± 1.4       | 53.8 ± 1.0                | 18.9     |
+| Pleias-1.2b-Preview          | 1.0 ± 2.4       | 48.2 ± 2.6      | 40.9 ± 3.3                   | 2.6 ± 2.8            | 52.9 ± 2.5      | -0.1 ± 1.5      | 60.2 ± 1.6                | 29.4     |

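
The `average` column in the added tables is consistent with an unweighted mean of the per-task scores, and several bolded "best" scores sit close to their neighbours relative to the reported intervals. The sketch below is one way a reader might check both for two rows of the Danish table. The helper names (`parse_row`, `mean_score`, `intervals_overlap`) are hypothetical and not part of EuroEval or this repository, and treating `average` as an unweighted mean is an assumption inferred from the numbers, not something the README states. Interval overlap is only a rough heuristic here, not a significance test.

```python
import re
from statistics import mean

# Two rows copied from the added Danish table: model name, nine task cells, reported average.
ROWS = [
    "| munin-open-7b-pt (stage 2) | 15.8 ± 3.1 | 14.4 ± 2.9 | 47.4 ± 2.3 | 40.4 ± 2.4 "
    "| 24.1 ± 1.8 | 36.1 ± 1.8 | 75.2 ± 0.7 | 13.1 ± 1.1 | 66.5 ± 0.6 | 37.0 |",
    "| munin-open-7b-pt (stage 3) | **16.5** ± 1.4 | **15.7** ± 1.7 | 46.3 ± 2.1 | **41.1** ± 2.8 "
    "| **24.6** ± 2.0 | **36.2** ± 1.7 | 76.0 ± 0.7 | **13.2** ± 1.2 | **66.6** ± 0.6 | **37.4** |",
]

# Matches cells of the form "16.5 ± 1.4" (bold markers are stripped before matching).
CELL = re.compile(r"(-?\d+(?:\.\d+)?)\s*±\s*(\d+(?:\.\d+)?)")


def parse_row(row):
    """Split one Markdown table row into (name, [(score, half_width), ...], reported_average)."""
    cells = [c.strip().replace("*", "") for c in row.strip().strip("|").split("|")]
    name, *score_cells, reported = cells
    scores = []
    for cell in score_cells:
        match = CELL.fullmatch(cell)
        if match is None:
            raise ValueError(f"unexpected cell format: {cell!r}")
        scores.append((float(match.group(1)), float(match.group(2))))
    return name, scores, float(reported)


def mean_score(scores):
    """Unweighted mean of the point estimates, rounded to one decimal like the table."""
    return round(mean([value for value, _ in scores]), 1)


def intervals_overlap(a, b):
    """True if [a - ci_a, a + ci_a] and [b - ci_b, b + ci_b] share at least one point."""
    (value_a, ci_a), (value_b, ci_b) = a, b
    return abs(value_a - value_b) <= ci_a + ci_b


if __name__ == "__main__":
    parsed = [parse_row(row) for row in ROWS]
    for name, scores, reported in parsed:
        print(name, "recomputed:", mean_score(scores), "reported:", reported)

    # Compare stage 2 and stage 3 task by task.
    (_, stage2, _), (_, stage3, _) = parsed
    overlaps = [intervals_overlap(a, b) for a, b in zip(stage2, stage3)]
    print("tasks with overlapping intervals:", sum(overlaps), "of", len(overlaps))
```

Parsing the Markdown rows directly, instead of retyping the numbers, keeps the check tied to what the README actually shows.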