Commit ea8d109 by giannor · 1 Parent(s): a61fb81

Updated table evaluation results

Files changed (1)
  1. README.md +21 -21
README.md CHANGED
@@ -46,16 +46,17 @@ All comparison models were trained exclusively on open data, either in the publi
  The following tables show the performance on each dataset.
  For each, we report the respective main metric from EuroEval and the confidence interval.

- | Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
- | ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
- | base (comma-v0.1-2t)t | 0.9 ± 0.8 | 0.2 ± 0.6 | 39.8 ± 1.4 | 32.0 ± 2.8 | 3.6 ± 2.3 | 10.7 ± 4.1 | 66.4 ± 0.8 | 3.8 ± 1.0 | 60.2 ± 1.7 |
- | **Training Stages** | | | | | | | | | |
- | munin-open-7b-pt (stage 1) | 13.3 ± 2.9 | 12.7 ± 2.2 | **47.7** ± 1.7 | 40.0 ± 2.4 | 18.1 ± 0.9 | 32.8 ± 1.4 | **76.6** ± 0.6 | 12.9 ± 1.0 | **65.9** ± 0.9 |
- | munin-open-7b-pt (stage 2) | 15.8 ± 3.1 | 14.4 ± 2.9 | 47.4 ± 2.3 | 40.4 ± 2.4 | 24.1 ± 1.8 | 36.1 ± 1.8 | 75.2 ± 0.7 | 13.1 ± 1.1 | 66.5 ± 0.7 |
- | munin-open-7b-pt (stage 3) | **16.5** ± 1.4 | **15.7** ± 1.74 | 46.3 ± 2.1 | **41.1** ± 2.8 | **24.6** ± 2.0 | **36.2** ± 1.7 | 76.0 ± 0.7 | 13.2 ± 1.2 | 66.6 ± 0.6 |
- | **Baselines** | | | | | | | | | |
- | Pleias-350m-Preview | -1.0 ± 1.5 | -1.8 ± 1.8 | 10.6 ± 2.9 | 12.9 ± 1.8 | 0.7 ± 2.6 | 4.6 ± 2.3 | 11.6 ± 0.9 | -0.3 ± 0.7 | 56.3 ± 1.5 |
- | Pleias-1.2b-Preview | 0.2 ± 1.1 | 0.7 ± 1.0 | 27.7 ± 2.9 | 27.3 ± 2.2 | -0.6 ± 1.9 | 8.6 ± 3.2 | 35.2 ± 1.3 | -0.0 ± 1.5 | 60.3 ± 0.9 |

  ### Performance on English

@@ -65,17 +66,16 @@ The goal of this section is to demonstrate how the performance deteriorates for
  across tasks, with the exception of `squad`.

- | Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
- | ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
- | base (comma-v0.1-2t) | **29.7** ± 1.9 | **61.8** ± 2.1 | **57.5** ± 2.8 | 41.6 ± 2.4 | **90.4** ± 0.4 | **16.8** ± 0.6 | **63.3** ± 0.9 |
- | **Training Stages** | | | | | | | |
- | munin-open-7b-pt (stage 1) | 27.5 ± 2.1 | 60.0 ± 1.7 | 56.6 ± 2.1 | 40.5 ± 1.7 | 22.1 ± 0.7 | 13.7 ± 0.7 | 59.2 ± 1.4 |
- | munin-open-7b-pt (stage 2) | 27.7 ± 2.0 | 59.5 ± 1.6 | 56.6 ± 2.3 | 41.2 ± 1.7 | 22.3 ± 1.5 | 16.0 ± 0.9 | 60.2 ± 1.6 |
- | munin-open-7b-pt (stage 3) | 29.0 ± 2.4 | 60.3 ± 1.4 | 57.0 ± 2.5 | **41.7** ± 1.8 | 24.6 ± 2.3 | 13.8 ± 0.9 | 59.0 ± 1.7 |
- | **Baseline** | | | | | | | |
- | Pleias-350m-Preview | 0.7 ± 1.8 | 15.4 ± 7.3 | 31.8 ± 3.5 | -0.7 ± 2.1 | 31.1 ± 2.3 | 0.2 ± 1.4 | 53.8 ± 1.0 |
- | Pleias-1.2b-Preview | 1.0 ± 2.4 | 48.2 ± 2.6 | 40.9 ± 3.3 | 2.6 ± 2.8 | 52.9 ± 2.5 | -0.1 ± 1.5 | 60.2 ± 1.6 |
-

 
  The following tables show the performance on each dataset.
  For each, we report the respective main metric from EuroEval and the confidence interval.

+
+ | Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) | average |
+ | ---------------------------- | -------------- | -------------- | ------------------ | ------------------------- | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- | ------- |
+ | base (comma-v0.1-2t) | 0.9 ± 0.8 | 0.2 ± 0.6 | 39.8 ± 1.4 | 32.0 ± 2.8 | 3.6 ± 2.3 | 10.7 ± 4.1 | 66.4 ± 0.8 | 3.8 ± 1.0 | 60.2 ± 1.7 | 24.2 |
+ | **Training Stages** | | | | | | | | | | |
+ | munin-open-7b-pt (stage 1) | 13.3 ± 2.9 | 12.7 ± 2.2 | **47.7** ± 1.7 | 40.0 ± 2.4 | 18.1 ± 0.9 | 32.8 ± 1.4 | **76.6** ± 0.6 | 12.9 ± 1.0 | 66.3 ± 0.7 | 35.6 |
+ | munin-open-7b-pt (stage 2) | 15.8 ± 3.1 | 14.4 ± 2.9 | 47.4 ± 2.3 | 40.4 ± 2.4 | 24.1 ± 1.8 | 36.1 ± 1.8 | 75.2 ± 0.7 | 13.1 ± 1.1 | 66.5 ± 0.6 | 37.0 |
+ | munin-open-7b-pt (stage 3) | **16.5** ± 1.4 | **15.7** ± 1.7 | 46.3 ± 2.1 | **41.1** ± 2.8 | **24.6** ± 2.0 | **36.2** ± 1.7 | 76.0 ± 0.7 | **13.2** ± 1.2 | **66.6** ± 0.6 | **37.4** |
+ | **Baselines** | | | | | | | | | | |
+ | Pleias-350m-Preview | -1.0 ± 1.5 | -1.8 ± 1.8 | 10.6 ± 2.9 | 12.9 ± 1.8 | 0.7 ± 2.6 | 4.6 ± 2.3 | 11.6 ± 0.9 | -0.3 ± 0.7 | 56.3 ± 1.5 | 10.4 |
+ | Pleias-1.2b-Preview | 0.2 ± 1.1 | 0.7 ± 1.0 | 27.7 ± 2.9 | 27.3 ± 2.2 | -0.6 ± 1.9 | 8.6 ± 3.2 | 35.2 ± 1.3 | -0.0 ± 1.5 | 60.3 ± 0.9 | 17.7 |

  ### Performance on English

 
  across tasks, with the exception of `squad`.

+ | Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) | average |
+ | ---------------------------- | -------------- | -------------- | --------------------------- | -------------------- | -------------- | --------------- | ------------------------- | ------- |
+ | base (comma-v0.1-2t) | **29.7** ± 1.9 | **61.8** ± 2.1 | **57.5** ± 2.8 | 41.6 ± 2.4 | **90.4** ± 0.4 | **16.8** ± 0.6 | **63.3** ± 0.9 | **51.6** |
+ | **Training Stages** | | | | | | | | |
+ | munin-open-7b-pt (stage 1) | 17.1 ± 9.0 | 60.0 ± 1.7 | 56.6 ± 2.2 | 40.5 ± 1.7 | 90.1 ± 0.3 | 13.7 ± 0.7 | 59.6 ± 1.3 | 48.2 |
+ | munin-open-7b-pt (stage 2) | 27.7 ± 2.0 | 59.5 ± 1.6 | 56.6 ± 2.3 | 41.2 ± 1.7 | 90.2 ± 0.4 | 16.0 ± 0.9 | 60.3 ± 1.6 | 50.2 |
+ | munin-open-7b-pt (stage 3) | 29.0 ± 2.4 | 60.3 ± 1.4 | 56.9 ± 2.5 | **41.7** ± 1.8 | 89.9 ± 0.4 | 13.8 ± 0.9 | 59.2 ± 1.7 | 50.1 |
+ | **Baselines** | | | | | | | | |
+ | Pleias-350m-Preview | 0.7 ± 1.8 | 15.4 ± 7.3 | 31.8 ± 3.5 | -0.7 ± 2.1 | 31.1 ± 2.3 | 0.2 ± 1.4 | 53.8 ± 1.0 | 18.9 |
+ | Pleias-1.2b-Preview | 1.0 ± 2.4 | 48.2 ± 2.6 | 40.9 ± 3.3 | 2.6 ± 2.8 | 52.9 ± 2.5 | -0.1 ± 1.5 | 60.2 ± 1.6 | 29.4 |

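The diff does not say how the new `average` column is computed. The reported values are consistent with an unweighted mean of each row's point estimates, ignoring the ± intervals and rounding to one decimal. The Python sketch below checks that reading against the updated tables. The names `DANISH`, `ENGLISH`, `row_average`, and `mismatches` are illustrative and are not part of the repository or of EuroEval.

```python
from statistics import mean

# Point estimates copied from the updated tables, paired with the reported
# "average" value. The ± intervals are deliberately left out (assumption:
# the average is taken over point estimates only).
DANISH = {
    "base (comma-v0.1-2t)": ([0.9, 0.2, 39.8, 32.0, 3.6, 10.7, 66.4, 3.8, 60.2], 24.2),
    "munin-open-7b-pt (stage 1)": ([13.3, 12.7, 47.7, 40.0, 18.1, 32.8, 76.6, 12.9, 66.3], 35.6),
    "munin-open-7b-pt (stage 2)": ([15.8, 14.4, 47.4, 40.4, 24.1, 36.1, 75.2, 13.1, 66.5], 37.0),
    "munin-open-7b-pt (stage 3)": ([16.5, 15.7, 46.3, 41.1, 24.6, 36.2, 76.0, 13.2, 66.6], 37.4),
    "Pleias-350m-Preview": ([-1.0, -1.8, 10.6, 12.9, 0.7, 4.6, 11.6, -0.3, 56.3], 10.4),
    "Pleias-1.2b-Preview": ([0.2, 0.7, 27.7, 27.3, -0.6, 8.6, 35.2, -0.0, 60.3], 17.7),
}

ENGLISH = {
    "base (comma-v0.1-2t)": ([29.7, 61.8, 57.5, 41.6, 90.4, 16.8, 63.3], 51.6),
    "munin-open-7b-pt (stage 1)": ([17.1, 60.0, 56.6, 40.5, 90.1, 13.7, 59.6], 48.2),
    "munin-open-7b-pt (stage 2)": ([27.7, 59.5, 56.6, 41.2, 90.2, 16.0, 60.3], 50.2),
    "munin-open-7b-pt (stage 3)": ([29.0, 60.3, 56.9, 41.7, 89.9, 13.8, 59.2], 50.1),
    "Pleias-350m-Preview": ([0.7, 15.4, 31.8, -0.7, 31.1, 0.2, 53.8], 18.9),
    "Pleias-1.2b-Preview": ([1.0, 48.2, 40.9, 2.6, 52.9, -0.1, 60.2], 29.4),
}


def row_average(scores):
    """Unweighted mean of the per-dataset point estimates, rounded to one decimal."""
    return round(mean(scores), 1)


def mismatches(table):
    """Return the model names whose recomputed average differs from the reported one."""
    return [
        name
        for name, (scores, reported) in table.items()
        if abs(row_average(scores) - reported) > 1e-9
    ]


if __name__ == "__main__":
    for label, table in (("Danish", DANISH), ("English", ENGLISH)):
        bad = mismatches(table)
        print(label, "mismatches:", bad if bad else "none")
```

Because the columns mix different metrics (MCC, F1, BERTScore), an unweighted mean like this is a rough summary; it treats every dataset equally and says nothing about the confidence intervals.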