Commit 55a2700 (verified) by gustavlangstroem · 1 parent: 22c828f

Update README.md

Files changed (1): README.md (+4 −5)
Now with reduced expert and hidden dimensions per layer, it doesn't split anymore.

Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.

## Day 7
For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later because it's not that important right now; the downside is that the model has to build up momentum from scratch after every event. Everything else seems to work. Analyzing the log, there is some oscillation (basically a split that merges right back), but it's not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with their sibling.
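The "did a split stick or merge back with its sibling?" check can be sketched from an event log. This is a minimal sketch under assumptions: the event tuple formats (`("split", parent, child_a, child_b)` and `("merge", a, b)`) and the expert names are hypothetical, not the project's actual log format.

```python
# Minimal sketch: measure how many splits "stick", i.e. the two children
# never merge back with each other. The event format is an assumption.

def split_retention(events):
    """Return the fraction of splits whose sibling pair never re-merged."""
    siblings = []          # frozenset({child_a, child_b}) per split
    merged_pairs = set()   # frozensets of experts that merged together
    for ev in events:
        if ev[0] == "split":
            _, _parent, a, b = ev
            siblings.append(frozenset((a, b)))
        elif ev[0] == "merge":
            _, a, b = ev
            merged_pairs.add(frozenset((a, b)))
    if not siblings:
        return 1.0
    stuck = sum(1 for pair in siblings if pair not in merged_pairs)
    return stuck / len(siblings)

events = [
    ("split", "e0", "e0a", "e0b"),   # sticks
    ("split", "e1", "e1a", "e1b"),   # oscillates: merges back below
    ("merge", "e1a", "e1b"),         # sibling merge-back
    ("merge", "e0a", "e7"),          # non-sibling merge; e0's split still counts
]
print(split_retention(events))  # 0.5: one of two splits stuck
```

Non-sibling merges deliberately don't count against a split, matching the distinction the Day 8 report draws between sibling merge-backs and unrelated consolidations.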
## Day 8
A report by Claude based on the logs:

12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes.
Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck, an 82% retention rate. A tier-gravity merge fired at step 10,920. L5 routes to multiple experts (density 1.5); other layers are mostly top-1. Throughput: 4K → 1.6K tok/s.
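The "density 1.5" figure can be read as the average number of experts that receive non-trivial routing weight per token at a layer. A minimal sketch, assuming router outputs are per-token probability lists and assuming a 0.1 cutoff (the actual cutoff used in the logs is not stated):

```python
# Sketch of the per-layer "density" number: average count of experts
# with non-trivial routing weight per token. The 0.1 threshold is an
# assumption, not the project's actual cutoff.

def routing_density(router_probs, threshold=0.1):
    """router_probs: per-token lists of expert probabilities (sum to 1)."""
    counts = [sum(p >= threshold for p in token) for token in router_probs]
    return sum(counts) / len(counts)

# Two tokens: one routed top-1, one genuinely split across two experts.
probs = [
    [0.95, 0.03, 0.02],   # 1 active expert
    [0.55, 0.40, 0.05],   # 2 active experts
]
print(routing_density(probs))  # 1.5, i.e. "routes to multiple experts"
```

Under this reading, a density near 1.0 is a top-1 (or monolith) layer, while L5's 1.5 means tokens regularly spread across two experts.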
The current biggest problem is the optimizer state wipe that keeps the model from building up momentum: after every split the optimizer state is wiped, and copying it over somehow corrupts it, which is an annoying bug.

Even without the optimizer fixed, I will now train it on teknium/OpenHermes-2.5 to see what behaviour it shows.
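One plausible source of "copying corrupts the optimizer state" is aliasing: a shallow copy makes both children share the parent's moment buffers, so updating one silently mutates the other. A minimal sketch of an alias-safe copy, assuming an Adam-style per-parameter state dict (the key names and structure mirror `torch.optim.Adam` but are assumptions here):

```python
# Sketch: give each child expert its own deep copy of the parent's
# Adam-style moment buffers on split. The state layout is an assumption
# modeled on torch.optim.Adam's per-parameter state.
from copy import deepcopy

def copy_state_on_split(opt_state, parent_id, child_a_id, child_b_id):
    """Move the parent's optimizer state to both children, without aliasing.

    A shallow copy would make the children share buffers, so one child's
    update silently mutates the other -- one plausible way copied state
    ends up "corrupted".
    """
    parent_state = opt_state.pop(parent_id)   # parent expert is retired
    opt_state[child_a_id] = deepcopy(parent_state)
    opt_state[child_b_id] = deepcopy(parent_state)

opt_state = {"L5.e2": {"step": 970, "exp_avg": [0.1, -0.2], "exp_avg_sq": [0.01, 0.04]}}
copy_state_on_split(opt_state, "L5.e2", "L5.e2a", "L5.e2b")
opt_state["L5.e2a"]["exp_avg"][0] = 9.9    # update one child...
print(opt_state["L5.e2b"]["exp_avg"][0])   # ...the sibling still has 0.1
```

Whether aliasing is the actual bug here is a guess; the sketch only shows the invariant a correct copy must keep (each child owns independent buffers, and the parent's entry is gone).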
 
## Day 9
Momentum preservation is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.

These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:
L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density …

The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's for completion only.

Here is a short report with 10 test prompts:
4 monoliths (L4,5,8,9). 3 near-monoliths trending stable (L0,1,2). 4 dynamic with per-prompt routing shifts (L3,6,7,11). L10 borderline.

## References
 