id,date,status,task,comparison,ssm_commit,taotrain_commit,best_ssm,best_attention,conclusion
2026-04-29_taodata_byte_pilot,2026-04-29,completed,TaoData byte next-token,attention vs SSM,edd5f5b,b8c4f3d,h16/m64,attention,Byte-level pilot favored SSM loss and accuracy across tested batches but is not the intended tokenizer.
2026-04-29_taodata_spm_pilot,2026-04-29,completed,TaoData SentencePiece next-token 150 steps,attention vs SSM,d4a59c6,33747c1,h64/m64 at batch64,attention,SentencePiece pilot was more mixed; SSM could win some batch points but attention remained strong.
2026-04-29_spm_b32_500step_scalar_shift,2026-04-29,completed,TaoData SentencePiece batch32 500 steps,attention vs SSM,d4a59c6,33747c1,h16/m64,attention,Attention had better loss but SSM had slightly higher token accuracy.
2026-04-29_spm_b32_500step_no_shift,2026-04-29,completed,TaoData SentencePiece batch32 500 steps,SSM no-shift ablation,d4a59c6,33747c1,none,none,Removing local shift greatly worsened SSM loss and accuracy.
2026-04-29_spm_b32_500step_channel_shift,2026-04-29,completed,TaoData SentencePiece batch32 500 steps,attention vs SSM,d4a59c6,c519645,h16/m64,attention,Per-channel shift slightly improved SSM but did not beat attention loss.
2026-04-29_spm_b32_500step_mixer_sweep,2026-04-29,completed,TaoData SentencePiece batch32 500 steps,attention vs SSM mixer sweep,ad56534,357336e,h16/m128,attention,SSM h16/m128 nearly matched attention loss and beat attention accuracy and throughput.
2026-04-29_spm_h16m128_batchsweep_interrupted,2026-04-29,completed_after_targeted_download,TaoData SentencePiece batch16/32/64 500 steps,attention vs SSM,7bc1e87,357336e,h16/m128,attention,Targeted download recovered metrics; h16/m128 beats attention accuracy at all batches but trails attention loss and throughput.
2026-04-30_spm_h16m128_lr_sweep,2026-04-30,completed,TaoData SentencePiece batch32 SSM LR sweep,attention vs SSM optimizer sweep,7bc1e87,c07739b,h16/m128 lr=0.0012,attention,Best SSM loss is 4.705 with 0.223 accuracy; attention still has slightly lower loss and higher throughput.
2026-04-30_spm_h16m128_wd_sweep,2026-04-30,completed,TaoData SentencePiece batch32 SSM WD sweep at LR 0.0012,attention vs SSM optimizer sweep,7bc1e87,c07739b,h16/m128 lr=0.0012 wd=0.01,attention,Weight decay did not materially improve SSM; wd=0.01 remained best loss while SSM kept accuracy edge.
2026-04-30_dplr_param_reuse_real_token,2026-04-30,completed,TaoData SentencePiece batch32 after DPLR param reuse,attention vs SSM hardware/overhead cleanup,604be8a,c07739b,h16/m128 lr=0.0012 wd=0.01,attention,SSM quality unchanged and speed moved slightly to 1.087M tok/s; attention remains faster and lower loss while SSM keeps accuracy edge.
2026-04-30_rank1_dplr_frequency_real_token,2026-04-30,completed_failed_reverted,TaoData SentencePiece batch32 after rank-one DPLR frequency specialization,attention vs SSM hardware/overhead cleanup,986af61 reverted by 2528c5e,c07739b,none,attention,Rank-one algebra specialization preserved quality but regressed SSM throughput to 497k tok/s; reverted and remote SSM resynced.
2026-04-30_taonet_component_profile,2026-04-30,completed,Synthetic token component profile at TaoData benchmark shape,attention vs SSM component timing,2528c5e,667a8cf,SSM core bottleneck,attention,SSM forward+backward 1.106M tok/s vs attention 1.379M; DPLR SSM core dominates measured SSM-side forward cost at 2.203 ms/forward.
2026-04-30_dplr_frequency_microprofile,2026-04-30,completed,DPLR core microprofile at TaoNet-SSM mixer shape,direct vs transfer DPLR frequency path,2528c5e,n/a,direct,transfer,Direct path is faster: 1.812 ms fwd+bwd vs transfer 2.478 ms and 119 MB vs 308 MB peak allocation.
2026-04-30_dplr_batch_major_direct_microprofile,2026-04-30,completed_failed_reverted,DPLR direct-path batch-major layout microprofile,batch-major direct vs prior direct baseline,bf787c7 reverted by 20747fe,n/a,none,prior direct,Layout rewrite regressed fwd+bwd to 3.435 ms vs 1.812 ms baseline; reverted and remote SSM resynced.
2026-04-30_dplr_finite_readout_real_token,2026-04-30,completed_mixed,DPLR finite-response readout rewrite plus real token batch sweep,attention vs SSM hardware/quality comparison,03fb1e4,667a8cf,h16/m128,attention,Forward-only long-context DPLR improved and SSM kept accuracy edge; batch32 fwd+bwd still trails attention while batch64 SSM is faster.
2026-04-30_dplr_real_powered_readout_profile,2026-04-30,completed_failed_reverted,DPLR real-valued powered-readout microprofile,real powered readout vs finite-readout baseline,7639958 reverted by 9a8443e,n/a,none,finite-readout baseline,Correct but regressed direct fwd+bwd to 1.931 ms vs 1.833 ms baseline; reverted before TaoNet benchmark.
2026-04-30_dplr_combined_projection_profile,2026-04-30,completed_failed_reverted,DPLR combined output-projection contraction microprofile,combined projection vs finite-readout baseline,36b01c4 reverted by 4aecf5a,n/a,none,finite-readout baseline,Reduced bmm/einsum calls but regressed fwd+bwd to 1.873 ms profiled and 2.815 ms in 20-repeat timing; reverted.
2026-05-01_large_hybrid_token_benchmark,2026-05-01,completed,TaoData SentencePiece large 1500-step batch32/64 benchmark,attention vs SSM vs hybrid,76f725f,57978d2,hybrid h16/m128 alternating,attention baseline unchanged,Hybrid beats attention on eval loss and token accuracy at batch32 and batch64 while retaining about 91-93% of attention fwd+bwd throughput; pure SSM trails attention in this approximate finite-tail run.
2026-05-10_hybrid_pattern_sweep,2026-05-10,completed,TaoData pretrain SentencePiece hybrid pattern sweep,attention vs SSM vs hybrid patterns,76f725f,35f907d,hybrid h16/m128 attention_first or ssm_first,attention baseline unchanged,Two-SSM hybrids retain the quality lead; attention_first is fractionally best at batch32 and ssm_first at batch64; single middle SSM is faster but weaker; single late SSM is not promising.
2026-05-10_hybrid_exact_finite_tail,2026-05-10,completed_mixed,TaoData pretrain SentencePiece exact finite-tail hybrid ablation,attention vs exact SSM vs exact hybrid,76f725f,35f907d,approximate-tail two-SSM hybrid,attention baseline unchanged,Exact finite-tail correction slows SSM-bearing models and does not improve hybrid enough; approximate finite-tail remains the preferred hybrid default.
2026-05-10_channel_gate_highscale,2026-05-10,completed_mixed,TaoData pretrain 5000-step channel gate high-scale benchmark,attention vs SSM vs all hybrid patterns with dense/channel gates,76f725f,95918b4,hybrid ssm_first channel gate,attention baseline unchanged,"Channel gates improve the best hybrid quality and are more ternary-friendly; pure SSM remains behind attention, so next pure-SSM work needs more SSM capacity rather than only gate compression."
2026-05-10_multilane_ssm_highscale,2026-05-10,completed_mixed,TaoData pretrain 8000-step multi-lane SSM high-scale benchmark,attention vs pure SSM vs four hybrid patterns with one/two SSM lanes,76f725f,0bd803d,pure SSM two-lane channel gate improved quality; hybrid ssm_first two-lane best overall,attention baseline unchanged,Two SSM lanes improve pure SSM loss and accuracy but slow throughput; next pure-SSM step should make lane capacity cheaper with grouped or split-lane SSM.
2026-05-10_split_lane_ssm_highscale,2026-05-10,completed_mixed,TaoData pretrain 8000-step split-lane SSM high-scale benchmark,attention vs pure SSM vs four hybrid patterns with full/split SSM lanes,76f725f,db7dd9b,pure SSM split two-lane is faster and smaller but slightly weaker than full two-lane; hybrid ssm_first split two-lane best overall,attention baseline unchanged,"Split lanes recover throughput and memory while preserving hybrid quality; pure SSM still trails attention, so next pure-SSM work needs cheap cross-channel mixing after split lanes."
2026-05-10_hadamard_split_mix_highscale,2026-05-11,completed_mixed,TaoData pretrain 8000-step Hadamard split-lane cross-mix benchmark,attention vs pure SSM vs four hybrid patterns with split none/Hadamard,76f725f,89aa98d,pure SSM Hadamard is mixed; hybrid ssm_first split none remains best batch64,attention baseline unchanged,Fixed Hadamard add/subtract is too rigid; it helps batch32 hybrid ssm_first slightly but does not close the pure SSM gap or beat plain split at batch64.
2026-05-12_200m_until_selection_interrupted,2026-05-12,interrupted,200M until-selection progress check,attention vs pure SSM vs hybrid,unknown,dd32758,pure_ssm_nomix at 300M among pure SSM,attention_196m,"300M pilot completed for all variants and hybrid won; 1B phase interrupted after attention and pure SSM Hadamard, likely due to a server reboot/GPU driver outage; 1B rows are incomplete and show likely 50M-token-cap overfitting."
2026-05-12_200m_hybrid_chat_4b,2026-05-12,completed_bad,Selected 200M SSM-first hybrid 4B-token base plus SFT chat tuning,selected hybrid only,76f725f,a1ff47b,hybrid_ssm_first_199m,none,"Run completed but final SFT checkpoint chats poorly; fixed-batch SFT loss did not improve over pretrain, so this is a diagnostic/failure artifact rather than a deployable chatbot."
2026-05-13_200m_chat_diagnosis,2026-05-13,completed,Post-run diagnosis of poor 200M hybrid chat quality,selected hybrid only,unknown,local diagnostics,hybrid_ssm_first_199m,none,"SFT did not improve fixed response loss; tiny overfit probes show huge SSM/residual gradients and residual activations growing to tens of millions, so next iteration needs explicit SSM/residual scale control before rerunning 200M."
2026-05-13_scale_control_pattern_sweep,2026-05-13,completed,Fresh small token benchmark after SSM/residual scale controls,attention vs pure SSM vs four hybrid patterns,local TaoTrain stabilizers,local SSM reciprocal floor,pure SSM h16/m128 and hybrid single_ssm_middle,attention baseline unchanged,Scale controls make fresh pure SSM competitive with attention on short run; ssm_first hybrid is fragile with high loss and high gradient norm.
2026-05-13_stabilized_ssm_capacity_sweep,2026-05-13,completed,Stabilized pure-SSM capacity sweep,attention vs pure SSM h16/h32 m64/m128/m256,local TaoTrain stabilizers,local SSM reciprocal floor,pure SSM h32/m128,attention baseline unchanged,h32/m128 gives best accuracy and nearly best loss; pure SSM beats attention accuracy but still trails attention loss.
2026-05-13_stabilized_ssm_lr_sweep,2026-05-13,completed,Stabilized pure-SSM learning-rate sweep,attention vs pure SSM h32/m128 and h32/m256,local TaoTrain stabilizers,local SSM reciprocal floor,pure SSM h32/m128 lr8e-4,attention baseline unchanged,Higher LR worsens loss despite slight accuracy gains; keep lr8e-4 and test h32/m128 at larger bounded scale.
2026-05-14_pre_200m_stability_gate,2026-05-14,completed_no_go,Pre-200M stabilized SSM gate before next 4B+SFT chatbot attempt,196M pure SSM bounded pretrain plus activation/generation/SFT probes and attention/hybrid comparison,local TaoTrain stabilizers,local SSM reciprocal floor,none,attention baseline,Stability and SFT tiny-overfit passed but bounded pretrain quality failed: pure SSM eval loss 6.90 and accuracy 3.2% vs attention loss 5.13 and accuracy 16.8%; m512/h64 did not close the gap.
2026-05-14_branch_only_100m_gate,2026-05-14,completed_go,Longer pre-200M branch-only pure SSM gate before next 4B+SFT chatbot attempt,196M pure SSM 100M-token pretrain plus activation/generation/SFT probes,local TaoTrain branch-only RMS,local SSM reciprocal floor,pure_ssm_196m_branch_rms_only,attention baseline,Passed: eval loss 3.1667 accuracy 38.9%; activation finite with final block RMS 57.5; SFT sanity overfit 3.3831 to 0.0107. Select pure SSM branch-only for 4B+SFT.
2026-05-14_200m_branch_only_4b_sft_ready,2026-05-14,running,Selected pure SSM 200M 4B+SFT chatbot attempt,selected pure SSM only after branch-only 100M gate,TaoTrain c52eb8d,SSM 5844c3f,pure_ssm_196m_branch_rms_only,none,Launched as taotern-200m-branch-only-chat-20260514; initial status RUNNING with 196.57M params and 4B token-position pretrain followed by 50k corrected response-only SFT.