Update README.md
This Ahma 7B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht
This Ahma 7B base model was also evaluated using [MTBench Finnish by LumiOpen](h
0-shot results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
| Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
| Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
| Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
| Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
| General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
| HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
| Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
| Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
| Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
| Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
| Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
| **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
| **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
3-shot results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
| Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
| Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
| Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
| Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
| General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
| HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
| Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
| Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
| Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
| Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
| Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
| **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
| **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
As the results show, the Ahma 7B base model performs poorly on arithmetic, but in non-arithmetic tasks it clearly outperforms similarly sized models such as FinGPT 8B and Viking 7B, especially in 0-shot usage. In 0-shot non-arithmetic tasks, the Ahma 7B base model is even on par with the 5X larger Poro 34B model. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
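The 0-shot advantage can be made concrete with a small sketch that computes, for each model, how much the Non-Arithmetic Average changes when moving from 0-shot to 3-shot prompting (numbers copied from the two tables above; model names shortened for brevity):

```python
# Non-Arithmetic Average scores from the FIN-bench tables above:
# model -> (0-shot score, 3-shot score).
non_arithmetic_avg = {
    "Ahma 3B base": (47.55, 48.48),
    "Ahma 3B Instruct": (48.95, 51.49),
    "Ahma 7B base": (51.33, 49.05),
    "Ahma 7B Instruct": (48.30, 51.63),
    "FinGPT 8B": (46.17, 51.19),
    "Viking 7B": (44.42, 50.94),
    "Poro 34B": (52.08, 61.96),
}

def few_shot_gain(model: str) -> float:
    """Change in Non-Arithmetic Average from 0-shot to 3-shot prompting."""
    zero_shot, three_shot = non_arithmetic_avg[model]
    return round(three_shot - zero_shot, 2)

for model in non_arithmetic_avg:
    print(f"{model}: {few_shot_gain(model):+.2f}")
```

Most models gain several points from 3-shot examples, while Ahma 7B base slightly degrades (-2.28), consistent with its instruction-following pretraining already doing much of the work that in-context examples would otherwise provide.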
Single-turn results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
| Coding | 1.00 | 1.00 | 1.70 | 1.10 |
| Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
| Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
| Math | 3.00 | 3.20 | 3.90 | 2.90 |
| Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
| Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
| STEM | 5.10 | 5.95 | 6.75 | 7.30 |
| Writing | 6.60 | 9.00 | 7.10 | 8.80 |
| **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
Multi-turn results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
| Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
| Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
| Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
| Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
| Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
| Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
| STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
| Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
| **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
As expected, the Ahma 7B base model struggles with multi-turn examples, since it has only been pretrained on single-turn instruction-following examples. Coding performance was also expectedly poor, because the Ahma 7B model was not trained on code data. In the single-turn setting, Ahma 7B base beats both the Ahma 3B base and Instruct-tuned versions, demonstrating a stronger base capability that can be further improved with instruct-tuning.
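The single- vs. multi-turn distinction above can be illustrated with the message lists an MTBench-style evaluation feeds to a chat model. This is only a structural sketch: the Finnish question texts are hypothetical placeholders, not actual MTBench Finnish prompts:

```python
# MTBench-style questions have two turns. The single-turn score uses only
# the first user message; the multi-turn score also requires the model to
# handle a follow-up inside the same conversation.
# (Question texts are hypothetical placeholders.)

single_turn = [
    {"role": "user", "content": "Kirjoita lyhyt runo merestä."},
]

multi_turn = [
    {"role": "user", "content": "Kirjoita lyhyt runo merestä."},
    {"role": "assistant", "content": "<model's first answer>"},
    # Turn 2 refers back to turn 1, so the model must use the context above:
    {"role": "user", "content": "Muuta sama runo haikuksi."},
]
```

A model pretrained only on single-turn examples sees the multi-turn layout as out-of-distribution, which matches the gap between the single-turn and multi-turn averages in the tables above.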