Update README.md

README.md CHANGED

@@ -24,7 +24,7 @@ license: apache-2.0


-A generalist model, with some reasoning capabilities and
+A generalist model with some reasoning capabilities and multi-language support.

Supported languages:
- French

@@ -72,10 +72,17 @@ I trained 2 variants of the model:

- with unrolled turns (each turn in a separate sample)
- with regular turns (all turns in a single sample)

-Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn (LA, Local attention).
-Regular turns have much better multi-turn generalisation, but they tend to memorize instead of training new capabilities. (GA, Global attention)
-
+Unrolled turns teach local attention much better and train faster, but generalize worse in multi-turn settings (Evertide-LA-12B, local attention).
+Regular turns have much better multi-turn generalisation, but they tend to memorize instead of learning new capabilities (Evertide-GA-12B, global attention).
+
+I also trained these with a changed RoPE theta: 10K for GA, 10M for LA.
+My reasoning behind this is that during merging I "unrotate" these config changes, effectively creating a distribution that the model hasn't been trained on.
+
+LA gets shrunk to be even more specialized in short context, while GA gets stretched to cover longer context.
+
+Then I merged these training runs using a passthrough merge in a 4:1 pattern, similar to how Gemma 3 models interleave SWA and GA layers.
+
+

The following YAML configuration was used to produce this model:

@@ -155,4 +162,10 @@ For this model achieving ERP capabilities wasn't the goal, so I'm happy with cur

### RP/ERP model when?
Soon™.

+### Did you train with RL?
+No, not yet, but that's one of the future plans.
+
+### Is the reasoning performative?
+It's hard to tell exactly. It definitely has some elements of that, but the model was also trained with specific constraints that force causality between the thinking blocks and the answer, so I would say it's at least a hybrid. Any further improvements will require RL training.
+
</details>
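For orientation, the 4:1 passthrough interleave described in the hunk above could be expressed as a mergekit-style slice config roughly like the sketch below. This is only an illustration: mergekit as the tool, the 40-layer depth, and the exact slice boundaries are assumptions rather than details taken from the card; the model's actual recipe is the YAML configuration referenced in the README itself.

```yaml
# Illustrative sketch of a 4:1 passthrough interleave (not the card's actual recipe).
# Assumes mergekit, two 40-layer finetunes of the same base, and hypothetical slice boundaries.
merge_method: passthrough
dtype: bfloat16
slices:
  # 4 layers from the short-context LA finetune, then 1 layer from the long-context GA finetune
  - sources:
      - model: Evertide-LA-12B
        layer_range: [0, 4]
  - sources:
      - model: Evertide-GA-12B
        layer_range: [4, 5]
  # the same 4:1 pattern would repeat over the remaining layer ranges
  - sources:
      - model: Evertide-LA-12B
        layer_range: [5, 9]
  - sources:
      - model: Evertide-GA-12B
        layer_range: [9, 10]
# The merged checkpoint ends up with a single rope_theta in its config, which is what
# "unrotating" the per-variant training values (10M for LA, 10K for GA) refers to above.
```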