Retreatcost committed
Commit dbe8e32 · verified · Parent: 59322a6

Update README.md

Files changed (1): README.md (+17 -4)
README.md CHANGED
@@ -24,7 +24,7 @@ license: apache-2.0
 
 ![evertide_rx](https://cdn-uploads.huggingface.co/production/uploads/6671dd5203d6e8087aaf7ce5/zTuxJU9fwrkFbCvkGW1qe.jpeg)
 
- A generalist model, with some reasoning capabilities and some multi-lang support.
+ A generalist model with some reasoning capabilities and multilingual support.
 
 Supported languages:
 - French
@@ -72,10 +72,17 @@ I trained 2 variants of the model:
 - with unrolled turns (each turn in separate sample)
 - with regular turns (all turns in single sample)
 
- Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn (LA, Local attention).
- Regular turns have much better multi-turn generalisation, but they tend to memorize instead of training new capabilities. (GA, Global attention)
+ Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn use (Evertide-LA-12B, local attention).
+ Regular turns have much better multi-turn generalisation, but they tend to memorize instead of learning new capabilities (Evertide-GA-12B, global attention).
 
- Then I merged these training runs in a pattern 4:1, similar to how Gemma models have layered SWA and GA.
+ I also trained these with a changed RoPE theta: 10K for GA, 10M for LA.
+ My reasoning behind this is that during merging I "unrotate" the config changes, effectively creating a distribution that I haven't trained on.
+
+ LA gets shrunk to be even more specialized on short context, while GA gets stretched to cover longer context.
+
+ Then I merged these training runs using passthrough in a 4:1 pattern, similar to how Gemma 3 models have layered SWA and GA.
+
+ ![download](https://cdn-uploads.huggingface.co/production/uploads/6671dd5203d6e8087aaf7ce5/9M_XguM0q7Pv66X8Vy8t9.jpeg)
 
  The following YAML configuration was used to produce this model:
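
To make the unrolled/regular split above concrete, here is a minimal Python sketch, assuming a plain chat-messages format (the actual training-data layout isn't shown in this diff):

```python
# Hypothetical example of the two packing strategies described above.
conversation = [
    {"role": "user", "content": "Bonjour !"},
    {"role": "assistant", "content": "Bonjour, comment puis-je aider ?"},
    {"role": "user", "content": "Explique-moi RoPE."},
    {"role": "assistant", "content": "RoPE encode la position par des rotations..."},
]

# Regular turns: the whole conversation is one training sample,
# so every turn attends to the full preceding history (global behaviour).
regular_samples = [conversation]

# Unrolled turns: one sample per assistant turn, truncated at that turn,
# so samples stay short and attention stays local.
unrolled_samples = [
    conversation[: i + 1]
    for i, msg in enumerate(conversation)
    if msg["role"] == "assistant"
]

print(len(regular_samples), len(unrolled_samples))  # 1 2
```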
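The RoPE-theta trick amounts to fine-tuning each branch under a different `rope_theta` and letting the merged model fall back to the base value. A sketch of the config side, assuming Hugging Face `transformers` and a placeholder base model ID:

```python
from transformers import AutoConfig

BASE = "base-model-id"  # placeholder for the actual 12B base

# GA branch: shrink theta to 10K before training.
ga_config = AutoConfig.from_pretrained(BASE)
ga_config.rope_theta = 10_000

# LA branch: stretch theta to 10M before training.
la_config = AutoConfig.from_pretrained(BASE)
la_config.rope_theta = 10_000_000

# The merged checkpoint keeps the base theta, which "unrotates" both
# branches: GA's effective context gets stretched, LA's gets shrunk.
```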
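The merge YAML itself sits outside this hunk's context lines. Purely as an illustration of a 4:1 passthrough interleave, a mergekit-style config could be generated like this (the model names and the 40-layer depth are assumptions, not the author's actual values):

```python
import yaml  # pip install pyyaml

N_LAYERS = 40  # assumed depth; check the base model's config
LA, GA = "Evertide-LA-12B", "Evertide-GA-12B"  # the two training runs

slices = []
for start in range(0, N_LAYERS, 5):
    # Four consecutive layers from the local-attention (unrolled) run...
    slices.append({"sources": [{"model": LA, "layer_range": [start, start + 4]}]})
    # ...then one layer from the global-attention (regular) run.
    slices.append({"sources": [{"model": GA, "layer_range": [start + 4, start + 5]}]})

config = {"merge_method": "passthrough", "slices": slices, "dtype": "bfloat16"}
print(yaml.safe_dump(config, sort_keys=False))
```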
@@ -155,4 +162,10 @@ For this model achieving ERP capabilities wasn't the goal, so I'm happy with cur
 ### RP/ERP model when?
 Soon™.
 
+ ### Did you train with RL?
+ No, not yet, but that's one of the future plans.
+
+ ### Is the reasoning performative?
+ It's hard to tell exactly. It definitely has some elements of that, but the model was also trained with specific constraints that force causality between the thinking blocks and the answer, so I would say it's at least a hybrid. Any further improvements require RL training.
+
 </details>