Update README.md
README.md
Now with reduced expert and hidden dimension per layer it doesn't split anymore.
Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.
## Day 7
For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later since it's not that important; for now the model has to build up momentum from scratch, which sucks. Everything else seems to work. Analyzing the log, there is some oscillation, basically a split followed by a merge back, but it's not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with their sibling.
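The copy I removed could be sketched like this (a minimal sketch with hypothetical names and layout, not the project's actual hook; a real `torch.optim` optimizer keys its state by parameter object rather than by a string id):

```python
import copy

def copy_state_on_split(optim_state, parent_id, child_ids):
    """Clone the parent's Adam moments (step, exp_avg, exp_avg_sq) into
    each child expert so the children don't rebuild momentum from
    scratch. optim_state maps expert id -> moment dict; the ids and
    layout here are hypothetical."""
    parent = optim_state.pop(parent_id)
    for cid in child_ids:
        # deep-copy so the two children don't share moment buffers
        optim_state[cid] = copy.deepcopy(parent)
    return optim_state

state = {"expert_3": {"step": 120, "exp_avg": [0.1, -0.2], "exp_avg_sq": [0.01, 0.04]}}
state = copy_state_on_split(state, "expert_3", ["expert_3a", "expert_3b"])
```

The deep copy matters: sharing moment buffers between children would couple their updates after the split.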
## Day 8
A report from Claude based on the logs:
12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes.
Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck, an 82% retention rate. Tier-gravity merge at step 10,920. L5 routes to multiple experts (density 1.5); other layers mostly top-1. Throughput: 4K → 1.6K tok/s.
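The sibling vs non-sibling breakdown and the retention rate can be recomputed from the lifecycle log along these lines (the event tuple format is an assumption, not the project's actual log schema):

```python
def lifecycle_summary(events):
    """Tally splits and merges from events of the form
    ("split", parent, child) or ("merge", a, b). A merge is a sibling
    merge-back when both experts were split off the same parent; a
    split is 'retained' if it never sibling-merged back."""
    parent_of = {}
    splits = sibling = non_sibling = 0
    for ev in events:
        if ev[0] == "split":
            _, parent, child = ev
            parent_of[child] = parent
            splits += 1
        else:  # merge
            _, a, b = ev
            if parent_of.get(a) is not None and parent_of.get(a) == parent_of.get(b):
                sibling += 1
            else:
                non_sibling += 1
    return {
        "splits": splits,
        "sibling_merges": sibling,
        "non_sibling_merges": non_sibling,
        # e.g. 89 splits with 16 sibling merge-backs -> 73/89 ~ 82%
        "retention": (splits - sibling) / max(splits, 1),
    }
```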
The current biggest problem is the optimizer state wipe that keeps the model from building up momentum: after every split the optimizer state is wiped, and copying it over somehow corrupts the optimizer state, which is an annoying bug.
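One plausible mechanism for that corruption (an assumption on my part, not confirmed from the logs): optimizers like `torch.optim` key their per-parameter state by parameter identity, so swapping in new tensors during a split orphans the old moments. A toy illustration:

```python
class TinyOptim:
    """Toy optimizer that, like torch.optim, keys per-parameter state
    by object identity rather than by value."""
    def __init__(self, params):
        self.state = {id(p): {"step": 0, "exp_avg": 0.0} for p in params}

    def lookup(self, p):
        return self.state.get(id(p))

w = [1.0, 2.0]      # stand-in for an expert's weight tensor
opt = TinyOptim([w])
w_child = list(w)   # "split": a new object with the same values
# the original parameter still has state; the child silently has none
```

If the split also deletes `w`, its entry lingers as stale state, which is one way naive copying can end up attaching the wrong moments.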
Even without the optimizer fixed, I will now train it on teknium/OpenHermes-2.5 to see what behaviour it shows.
## Day 9
The optimizer is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.
These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:
L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density …
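The density numbers in these reports read as the average number of active experts per token; under that assumption they could be computed from the router's gate probabilities like this (the 0.01 threshold is my own choice, not the project's exact metric):

```python
def routing_density(gate_probs, threshold=0.01):
    """Average number of experts that receive non-negligible routing
    weight per token. gate_probs holds one probability list per token;
    density 1.0 means pure top-1 routing."""
    active_per_token = [sum(1 for p in probs if p > threshold)
                        for probs in gate_probs]
    return sum(active_per_token) / len(active_per_token)
```

For example, one token split 0.6/0.4 across two experts plus one token routed entirely to a single expert gives a density of 1.5.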
The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's for completion only.
Here is a short report with 10 test prompts:
4 monoliths (L4,5,8,9). 3 near-monoliths trending stable (L0,1,2). 4 dynamic with per-prompt routing shifts (L3,6,7,11). L10 borderline.
## References