Taotern LLM Experiments

This repo is the experiment ledger for building a TaoNet-style LLM whose sequence core is the Taotern SSM instead of attention. It keeps compact, reviewable artifacts: run summaries, metric CSVs, exact RepoBridge configs, and the current conclusion. Source code stays in the source repos.

Current Decision

As of 2026-05-14, the next expensive chatbot attempt should use the pure SSM branch-only model (SSM branch RMS normalization enabled, block residual RMS normalization disabled), not the earlier SSM-first hybrid.

Selected candidate:

architecture=taonet_ssm
candidate=pure_ssm_196m_branch_rms_only
params=196,573,128
hidden_dim=1024
num_layers=18
num_heads=8
hidden_dim_ff=3072
ssm_core=dplr
ssm_hidden_dim=32
ssm_mixer_dim=256
ssm_num_lanes=2
ssm_lane_mode=split
ssm_split_mix=none
ssm_lane_combine=channel
ssm_gate_type=channel
ssm_local_shift=true
ssm_local_shift_per_channel=true
ssm_branch_rms_norm=true
block_residual_rms_norm=false
finite_tail_correction=false
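
As a quick sanity check on the candidate above, the key=value lines can be parsed and tested for the "branch-only" property. This is a minimal sketch, not TaoTrain or RepoBridge code: `parse_config` and `is_branch_only` are hypothetical helpers, and only a subset of the keys is repeated here.

```python
# Hypothetical helper: parse key=value config lines into typed values.
CONFIG_TEXT = """
architecture=taonet_ssm
params=196,573,128
hidden_dim=1024
num_layers=18
ssm_branch_rms_norm=true
block_residual_rms_norm=false
"""


def parse_config(text):
    """Return a dict with bools, ints (thousands separators allowed), or strings."""
    config = {}
    for line in text.strip().splitlines():
        key, _, raw = line.strip().partition("=")
        digits = raw.replace(",", "")
        if raw in ("true", "false"):
            config[key] = raw == "true"
        elif digits.isdigit():
            config[key] = int(digits)
        else:
            config[key] = raw
    return config


def is_branch_only(config):
    """Assumed meaning of 'branch-only': branch RMS norm on, block residual RMS norm off."""
    return config["ssm_branch_rms_norm"] and not config["block_residual_rms_norm"]


config = parse_config(CONFIG_TEXT)
print(config["params"], is_branch_only(config))
```

A check like this could run before an expensive launch, so that a config with block residual normalization switched back on is caught early.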

The 100M-token branch-only gate completed successfully:

Metric	Value
Eval loss	3.1667
Eval token accuracy	38.92%
Forward+backward throughput	53.0k tok/s
Peak allocated memory	9.17 GB
SFT tiny-overfit	3.3831 -> 0.0107
Final block RMS	57.50
Final block max abs	1605.34

Launch scripts for the next full attempt are implemented in TaoTrain:

scripts/remote/run_200m_branch_only_chat.sh
scripts/remote/submit_200m_branch_only_chat.sh

The planned run is 4B pretrain token positions followed by 50k corrected response-only SFT steps. See:

experiments/runs/2026-05-14_branch_only_100m_gate/summary.md
experiments/runs/2026-05-14_200m_branch_only_4b_sft_ready/summary.md
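
For planning, the pretrain stage can be roughly sized from the gate throughput. The sketch below assumes the 53.0k tok/s forward+backward figure from the 100M-token gate holds for the full run and ignores optimizer steps, evaluation, and checkpointing, so it should be read as a lower bound rather than a schedule.

```python
# Rough wall-clock estimate for the pretrain stage only.
PRETRAIN_TOKENS = 4_000_000_000      # 4B pretrain token positions
GATE_THROUGHPUT_TOK_PER_S = 53_000   # 53.0k tok/s, measured at the 100M gate


def estimate_hours(tokens, tokens_per_second):
    """Convert a token budget and a sustained throughput into hours."""
    return tokens / tokens_per_second / 3600


pretrain_hours = estimate_hours(PRETRAIN_TOKENS, GATE_THROUGHPUT_TOK_PER_S)
print(f"pretrain lower bound: {pretrain_hours:.1f} h")
```

The SFT stage is not estimated here because its batch size and sequence length are not stated in this summary.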

Previous Hybrid Failure And Stabilization Trail

The current SSM-only LLM is built in TaoTrain as taonet_ssm. It keeps the TaoNet outer shape and replaces the attention/MLA sequence mixer with a DPLR SSM mixer from Taotern_SSM.

The earlier best LLM candidate was taonet_hybrid, which keeps the same TaoNet dimensions but alternates attention and SSM blocks. The selected 200M-class deployment candidate for the first long chat run was hybrid_ssm_first_199m, which uses 16 layers:

SSM -> attention -> SSM -> attention -> ... -> SSM -> attention
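
The alternating layout can be written down as a small generator. This is an illustrative sketch only: `hybrid_pattern` is a hypothetical function, not the TaoTrain implementation, and it covers only the strictly alternating `ssm_first` and `attention_first` patterns.

```python
def hybrid_pattern(num_layers, pattern="ssm_first"):
    """Return a list of block kinds that alternate between SSM and attention."""
    if pattern == "ssm_first":
        first, second = "ssm", "attention"
    elif pattern == "attention_first":
        first, second = "attention", "ssm"
    else:
        raise ValueError(f"unsupported pattern: {pattern}")
    return [first if index % 2 == 0 else second for index in range(num_layers)]


layers = hybrid_pattern(16, "ssm_first")
ssm_layers = [i for i, kind in enumerate(layers) if kind == "ssm"]
attention_layers = [i for i, kind in enumerate(layers) if kind == "attention"]
print(" -> ".join(layers))
```

For 16 layers this places SSM blocks on the even indices and attention blocks on the odd indices.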

The long run taotern-200m-hybrid-chat-20260512 on the remote RTX 5090 server completed, but the SFT checkpoint is not yet a good chatbot. RepoBridge Model Chat works, but generation quality is poor, and follow-up diagnostics show the issue is in model trainability rather than the GUI. The SFT checkpoint is stored at:

/home/student/YouZheng/jobs/taotern/taotern-200m-hybrid-chat-20260512/checkpoints/sft/final_model.pt

Current diagnosis: the 4B-token pretrain remained high-loss, and the 50k-step SFT stage did not improve fixed SFT response loss. Tiny SFT overfit probes show huge SSM/residual gradients, and activation hooks show residual-stream magnitudes growing into the tens of millions by the late layers. The checkpoint should be used as a failure/diagnostic artifact, not as the final deployable chat model. Follow-up code now adds SSM branch RMS normalization, optional SSM branch clamping, block residual RMS normalization, and benchmark gradient telemetry. See:

experiments/runs/2026-05-13_200m_chat_diagnosis/summary.md
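
To illustrate why normalizing the SSM branch limits residual growth, here is a toy residual stack in plain Python. It is a sketch under strong assumptions: the "branch" is just a fixed gain applied to the stream, and `rms_normalize`, `clamp`, and `residual_rms_trace` are hypothetical stand-ins, not the TaoTrain code or its hook API.

```python
import math


def rms(values):
    """Root mean square of a vector."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_normalize(values, eps=1e-6):
    """Rescale a vector to approximately unit RMS."""
    scale = 1.0 / math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [v * scale for v in values]


def clamp(values, limit):
    """Clip every element to the range [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]


def residual_rms_trace(x, num_layers, branch_gain, normalize_branch, clamp_limit=None):
    """Apply x <- x + branch(x) per layer and record the residual RMS after each layer."""
    trace = []
    for _ in range(num_layers):
        branch = [branch_gain * v for v in x]  # toy branch: amplifies the stream
        if normalize_branch:
            branch = rms_normalize(branch)
        if clamp_limit is not None:
            branch = clamp(branch, clamp_limit)
        x = [a + b for a, b in zip(x, branch)]
        trace.append(rms(x))
    return trace


x0 = [1.0, -1.0, 1.0, -1.0]
raw_trace = residual_rms_trace(x0, 16, 2.0, normalize_branch=False)
normed_trace = residual_rms_trace(x0, 16, 2.0, normalize_branch=True)
print(f"raw final RMS: {raw_trace[-1]:.3g}, normalized final RMS: {normed_trace[-1]:.3g}")
```

In this toy setting the unnormalized stream grows geometrically with depth, while the branch-normalized stream grows by roughly one RMS unit per layer. Real activations depend on learned weights, so the numbers are illustrative only.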

For the layer-by-layer structure, DPLR equations, matrix shapes, and parameter inventory, see:

docs/CURRENT_SSM_LLM_ARCHITECTURE.md

Previous 200M real-token candidate:

Architecture: taonet_hybrid
SSM core: DPLR
Parameter count: 199,480,928
Layers: 16
SSM layers: 0,2,4,6,8,10,12,14
Attention layers: 1,3,5,7,9,11,13,15
Hidden dimension: 1024
FFN dimension: 3072
Mixer projection: ssm_mixer_dim=256
SSM hidden/state dimension: ssm_hidden_dim=32
DPLR rank: 1
Kernel mode: conv
Local memory branch: enabled
Local shift gain: per-channel
Hybrid pattern: ssm_first
SSM gate type: channel
SSM lanes: 2
Lane mode: split
Split mix: none
Lane combine: channel for full lanes; concatenation for split lanes
Finite-tail correction: disabled for the current best speed/quality point
Tokenizer: pilot TaoData SentencePiece 8k
Last 200M training target: TaoData JSONL next-token prediction, seq 512, batch 8, 4B base token positions, then 50k-step SFT

Superseded stabilized pure-SSM candidate before the branch-only 100M gate:

Architecture: taonet_ssm
SSM core: DPLR
SSM hidden/state dimension: ssm_hidden_dim=32
Mixer projection: ssm_mixer_dim=128
SSM lanes: 2
Lane mode: split
Gate type: channel
Local memory branch: enabled, per-channel
Finite-tail correction: disabled for speed
SSM branch RMS norm: enabled
SSM branch clamp: 1.0
Block residual RMS norm: enabled
Gradient clipping: 1.0
Learning rate: 8e-4

Latest stabilized small-token benchmark summary:

Run	Best SSM-bearing model	Eval loss	Eval accuracy	Forward+backward tok/s	Notes
scale-control pattern sweep	hybrid single_ssm_middle	4.6620	0.2188	1.005M	Pure SSM h16/m128 also beat attention loss on the 500-step smoke; `ssm_first` hybrid failed badly.
stabilized pure-SSM capacity sweep	pure SSM h32/m128	4.5311	0.2492	0.676M	Best pure-SSM accuracy; h32/m256 has fractionally lower loss but worse speed/accuracy.
stabilized pure-SSM LR sweep	pure SSM h32/m128 lr8e-4	4.5311	0.2492	0.677M	Higher LR worsens loss; keep lr8e-4.

Detailed records:

experiments/runs/2026-05-13_scale_control_pattern_sweep/summary.md
experiments/runs/2026-05-13_stabilized_ssm_capacity_sweep/summary.md
experiments/runs/2026-05-13_stabilized_ssm_lr_sweep/summary.md

Latest completed large SentencePiece benchmark on /home/student/Data/TaoData/pretrain.jsonl:

Batch	Model	Pattern	Lanes/mode	Params	Eval loss	Eval accuracy	Forward+backward tok/s
32	attention TaoNet	-	-	8.197M	3.4164	0.3619	1.367M
32	SSM TaoNet h16/m128	-	1 full	7.630M	3.6565	0.3229	1.151M
32	SSM TaoNet h16/m128	-	2 full	7.648M	3.6342	0.3255	0.887M
32	SSM TaoNet h16/m128	-	2 split	7.630M	3.6409	0.3249	0.935M
32	hybrid TaoNet	ssm_first	1 full	7.913M	3.3673	0.3665	1.234M
32	hybrid TaoNet	ssm_first	2 full	7.922M	3.3368	0.3716	1.068M
32	hybrid TaoNet	ssm_first	2 split Hadamard	7.913M	3.3345	0.3719	1.118M
32	hybrid TaoNet	single_ssm_middle	2 split	8.055M	3.3808	0.3649	1.258M
64	attention TaoNet	-	-	8.197M	3.3946	0.3592	1.447M
64	SSM TaoNet h16/m128	-	1 full	7.630M	3.5722	0.3331	1.230M
64	SSM TaoNet h16/m128	-	2 full	7.648M	3.5446	0.3355	1.020M
64	SSM TaoNet h16/m128	-	2 split	7.630M	3.5515	0.3345	1.152M
64	hybrid TaoNet	ssm_first	1 full	7.913M	3.2673	0.3793	1.325M
64	hybrid TaoNet	ssm_first	2 full	7.922M	3.2411	0.3834	1.190M
64	hybrid TaoNet	ssm_first	2 split	7.913M	3.2368	0.3835	1.271M
64	hybrid TaoNet	single_ssm_middle	2 split	8.055M	3.2708	0.3785	1.365M

Full per-variant results, including attention_first, single_ssm_middle, and single_ssm_late, are recorded in experiments/runs/2026-05-10_split_lane_ssm_highscale/summary.md.

Pure SSM replacement is not solved yet: in the 8000-step high-scale runs, two SSM lanes improved pure SSM loss and accuracy at both batch sizes, but the pure SSM model still trails attention. Split lanes recover throughput and memory compared with full two-lane duplication, but are slightly weaker for pure SSM quality. A fixed Hadamard add/subtract cross-lane mix was tested with mixed results: it helps the batch-32 hybrid but not pure SSM or the batch-64 best point. Channel gates remain the deployment-friendly default. Exact finite-tail correction was checked earlier and was not better overall, so the approximate path remains the current hybrid default.
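
The split-lane layout and the fixed add/subtract mix can be sketched on plain lists. This is an assumption-laden illustration: the 1/sqrt(2) scaling, the point in the block where the mix is applied, and the names `split_lanes` and `hadamard_mix` are not taken from the source repos.

```python
import math


def split_lanes(channels, num_lanes):
    """Split a channel vector into equal contiguous lanes (no duplication)."""
    size = len(channels) // num_lanes
    return [channels[i * size:(i + 1) * size] for i in range(num_lanes)]


def hadamard_mix(lane_a, lane_b, normalize=True):
    """Fixed two-lane mix: (a + b, a - b), optionally scaled by 1/sqrt(2)."""
    scale = 1.0 / math.sqrt(2.0) if normalize else 1.0
    mixed_sum = [(a + b) * scale for a, b in zip(lane_a, lane_b)]
    mixed_diff = [(a - b) * scale for a, b in zip(lane_a, lane_b)]
    return mixed_sum, mixed_diff


channels = [1.0, 2.0, 3.0, 4.0]
lane_a, lane_b = split_lanes(channels, 2)
mixed_a, mixed_b = hadamard_mix(lane_a, lane_b)
combined = mixed_a + mixed_b  # split lanes are recombined by concatenation

energy_before = sum(v * v for v in channels)
energy_after = sum(v * v for v in combined)
print(energy_before, round(energy_after, 9))
```

With the assumed scaling the mix is orthonormal, so it preserves total energy and has no trainable parameters, which is one way to read "fixed" here.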

Interpretation:

At batch 32, the best Hadamard split-lane ssm_first hybrid improves eval loss over attention by about 0.082 and accuracy by about 0.010, while retaining about 82% of attention forward+backward throughput.
At batch 64, the best plain split-lane ssm_first hybrid improves eval loss over attention by about 0.158 and accuracy by about 0.024, while retaining about 88% of attention forward+backward throughput.
Pure SSM two-lane improves over pure SSM one-lane at both batch sizes, suggesting that a second SSM lane helps quality.
Split-lane SSM is cheaper than naive lane duplication; fixed Hadamard mixing is too rigid, so the next improvement direction should use a small learnable but ternary-friendly post-split mixer.
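
The hybrid-versus-attention comparisons can be recomputed directly from the benchmark table. The sketch below copies the four relevant rows by hand; `compare` is a hypothetical helper, and throughput is in millions of tokens per second as in the table.

```python
# (eval loss, eval accuracy, forward+backward throughput in M tok/s)
ROWS = {
    (32, "attention"): (3.4164, 0.3619, 1.367),
    (32, "hybrid ssm_first 2 split Hadamard"): (3.3345, 0.3719, 1.118),
    (64, "attention"): (3.3946, 0.3592, 1.447),
    (64, "hybrid ssm_first 2 split"): (3.2368, 0.3835, 1.271),
}


def compare(baseline, candidate):
    """Return loss gain, accuracy gain, and throughput ratio versus the baseline."""
    base_loss, base_acc, base_tps = baseline
    cand_loss, cand_acc, cand_tps = candidate
    return {
        "loss_gain": base_loss - cand_loss,
        "accuracy_gain": cand_acc - base_acc,
        "throughput_ratio": cand_tps / base_tps,
    }


batch32 = compare(ROWS[(32, "attention")], ROWS[(32, "hybrid ssm_first 2 split Hadamard")])
batch64 = compare(ROWS[(64, "attention")], ROWS[(64, "hybrid ssm_first 2 split")])

for batch, result in ((32, batch32), (64, batch64)):
    print(
        f"batch {batch}: loss gain {result['loss_gain']:.3f}, "
        f"accuracy gain {result['accuracy_gain']:.3f}, "
        f"throughput {result['throughput_ratio']:.0%} of attention"
    )
```

Keeping the arithmetic next to the table values makes it easy to recheck the comparison whenever a row is updated.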

Important Repos

Repo	Role
`StarMists/gamma_SSM_S4_enhanced`	SSM model, DPLR implementation, SSM-specific records
`lobakkang/TaoTrain`	TaoNet, `taonet_ssm`, token benchmarks
`lobakkang/TaoData`	Data extraction/preprocessing
`RepoBridge`	Remote execution tool only
`StarMists/Taotern_LLM_Experiments`	Compact experiment ledger and current conclusions

Layout

experiments/
  index.csv                  # searchable run ledger
  README.md                  # artifact rules and workflow
  legacy_repobridge_configs/ # exact pre-ledger RepoBridge configs
  resources/
    tokenizers/              # tokenizer configs, not large tokenizer binaries
  runs/
    <run_id>/
      manifest.yaml          # purpose, commits, status, paths
      summary.md             # human-readable result
      metrics.csv            # compact metrics snapshot when available
      repobridge.config.json # exact remote run config
docs/
  WORKFLOW.md                # how future runs should be recorded
  DIRECTORY_NAVIGATION.md    # where code/data/run pieces live across repos
  CURRENT_SSM_LLM_ARCHITECTURE.md
  showcase/                  # R&D showcase report, DOCX, and scaling notes
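
A run directory can be checked against this layout with a few lines of Python. This is a sketch, not an existing tool in the repo: `missing_artifacts` is hypothetical, and treating every file except `metrics.csv` as required is an assumption drawn from the "when available" note above.

```python
import tempfile
from pathlib import Path

# Assumed split: metrics.csv is optional ("when available"), the rest are required.
REQUIRED_FILES = ("manifest.yaml", "summary.md", "repobridge.config.json")
OPTIONAL_FILES = ("metrics.csv",)


def missing_artifacts(run_dir):
    """Return the required artifact names that are absent from a run directory."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED_FILES if not (run_dir / name).is_file()]


# Demo on a throwaway directory that mimics experiments/runs/<run_id>/.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "example_run"
    run_dir.mkdir()
    (run_dir / "manifest.yaml").write_text("status: done\n")
    (run_dir / "summary.md").write_text("Example summary\n")
    missing = missing_artifacts(run_dir)

print(missing)
```

A check like this could run over every directory under experiments/runs/ before a commit to keep the ledger complete.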

Next Action

The 100M branch-only gate passed. The next full run is ready to launch with the pure SSM branch-only model:

4B pretrain token positions
50k corrected response-only SFT steps
final deployable checkpoint expected at checkpoints/sft/final_model.pt
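
Assuming "response-only" means the SFT loss is computed on response tokens only, the label masking can be sketched as follows. The token IDs and the `response_only_labels` helper are made up for illustration, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss; the actual TaoTrain convention may differ.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def response_only_labels(prompt_ids, response_ids):
    """Build input IDs and labels so that only response tokens are supervised."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token IDs for one prompt/response pair.
input_ids, labels = response_only_labels([11, 12, 13], [7, 8])
supervised_fraction = sum(label != IGNORE_INDEX for label in labels) / len(labels)
print(input_ids, labels, supervised_fraction)
```

The one-position shift for next-token prediction is left to the training loop in this sketch.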

Keep running activation diagnostics after pretraining, because the residual RMS still grows across depth (final block RMS 57.50 at the 100M gate) even though it is far below the previous failure regime.