Update README.md
README.md
Now with reduced expert and hidden dimension per layer it doesn't split anymore.
Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.
## Day 7
For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later since it's not that important; for now the model has to build up momentum from scratch, which sucks. Everything else seems to work. Analyzing the log, there is some oscillation, basically a split followed by a merge back, but it's not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with their sibling.
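The copy I removed could be sketched like this (a minimal sketch with hypothetical names and layout, not the project's actual hook; a real `torch.optim` optimizer keys its state by parameter object rather than by a string id):

```python
import copy

def copy_state_on_split(optim_state, parent_id, child_ids):
    """Clone the parent's Adam moments (step, exp_avg, exp_avg_sq) into
    each child expert so the children don't rebuild momentum from
    scratch. optim_state maps expert id -> moment dict; the ids and
    layout here are hypothetical."""
    parent = optim_state.pop(parent_id)
    for cid in child_ids:
        # deep-copy so the two children don't share moment buffers
        optim_state[cid] = copy.deepcopy(parent)
    return optim_state

state = {"expert_3": {"step": 120, "exp_avg": [0.1, -0.2], "exp_avg_sq": [0.01, 0.04]}}
state = copy_state_on_split(state, "expert_3", ["expert_3a", "expert_3b"])
```

The deep copy matters: sharing moment buffers between children would couple their updates after the split.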
## Day 8
A report from Claude based on the logs:
12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes.
Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck, an 82% retention rate. Tier-gravity merge at step 10,920. L5 routes to multiple experts (density 1.5); other layers mostly top-1. Throughput: 4K → 1.6K tok/s.
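The sibling vs non-sibling breakdown and the retention rate can be recomputed from the lifecycle log along these lines (the event tuple format is an assumption, not the project's actual log schema):

```python
def lifecycle_summary(events):
    """Tally splits and merges from events of the form
    ("split", parent, child) or ("merge", a, b). A merge is a sibling
    merge-back when both experts were split off the same parent; a
    split is 'retained' if it never sibling-merged back."""
    parent_of = {}
    splits = sibling = non_sibling = 0
    for ev in events:
        if ev[0] == "split":
            _, parent, child = ev
            parent_of[child] = parent
            splits += 1
        else:  # merge
            _, a, b = ev
            if parent_of.get(a) is not None and parent_of.get(a) == parent_of.get(b):
                sibling += 1
            else:
                non_sibling += 1
    return {
        "splits": splits,
        "sibling_merges": sibling,
        "non_sibling_merges": non_sibling,
        # e.g. 89 splits with 16 sibling merge-backs -> 73/89 ~ 82%
        "retention": (splits - sibling) / max(splits, 1),
    }
```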
The current biggest problem is the optimizer state wipe that keeps the model from building up momentum: after every split the optimizer state is wiped, and copying it over somehow corrupts the optimizer state, which is an annoying bug.
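One plausible mechanism for that corruption (an assumption on my part, not confirmed from the logs): optimizers like `torch.optim` key their per-parameter state by parameter identity, so swapping in new tensors during a split orphans the old moments. A toy illustration:

```python
class TinyOptim:
    """Toy optimizer that, like torch.optim, keys per-parameter state
    by object identity rather than by value."""
    def __init__(self, params):
        self.state = {id(p): {"step": 0, "exp_avg": 0.0} for p in params}

    def lookup(self, p):
        return self.state.get(id(p))

w = [1.0, 2.0]      # stand-in for an expert's weight tensor
opt = TinyOptim([w])
w_child = list(w)   # "split": a new object with the same values
# the original parameter still has state; the child silently has none
```

If the split also deletes `w`, its entry lingers as stale state, which is one way naive copying can end up attaching the wrong moments.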
Even without the optimizer fixed, I will now train it on teknium/OpenHermes-2.5 to see what behaviour it shows.
## Day 9
The optimizer is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.
These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:
L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density …
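The density numbers in these reports read as the average number of active experts per token; under that assumption they could be computed from the router's gate probabilities like this (the 0.01 threshold is my own choice, not the project's exact metric):

```python
def routing_density(gate_probs, threshold=0.01):
    """Average number of experts that receive non-negligible routing
    weight per token. gate_probs holds one probability list per token;
    density 1.0 means pure top-1 routing."""
    active_per_token = [sum(1 for p in probs if p > threshold)
                        for probs in gate_probs]
    return sum(active_per_token) / len(active_per_token)
```

For example, one token split 0.6/0.4 across two experts plus one token routed entirely to a single expert gives a density of 1.5.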
The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's for completion only.
Here is a short report with 10 test prompts:
4 monoliths (L4,5,8,9). 3 near-monoliths trending stable (L0,1,2). 4 dynamic with per-prompt routing shifts (L3,6,7,11). L10 borderline.
## References