Gamma SSM / Gamma-S4 Experiment Record

This file records the experiment versions saved as _rN notebooks under output/jupyter-notebook/. It also records the later TaoNet-SSM LLM-wrapper and remote RTX benchmark iterations.

The goal is to preserve:

which model version was tested
which notebook/task configuration was used
the main performance results
what we learned from each run

For runs where the saved notebook does not contain enough information, the version is marked as not recorded.

Model Names

gamma_baseline: original Gamma SSM using the fixed lower-bidiagonal Gamma transition and recurrent execution.
gamma_s4_minimal: lighter S4-inspired Gamma block. Used in early experiments, later dropped from the main loop because it was not consistently strong.
gamma_s4_enhanced: main S4-inspired Gamma model with learned dt, stable discretization, D skip, optional gating/output path, full-sequence/kernel mode, and recurrent stepping.

Metrics

val_loss: validation loss for forecasting tasks. Lower is better.
mean_epoch_time_s: average training epoch time. Lower is better.
full_forward_ms / full_latency_ms: whole-sequence forward/inference latency. Lower is better.
full_forward_tokens_per_s / full_tokens_per_s: whole-sequence throughput. Higher is better.
recurrent_inference_ms / recurrent_latency_ms: token-by-token recurrent latency. Lower is better.
recurrent_tokens_per_s: token-by-token recurrent throughput. Higher is better.
deploy_*: deployment-lite recurrent path. For the baseline, the deployment and recurrent paths are identical; this holds from the point baseline deploy metrics were enabled.
val_ce: validation cross entropy for token prediction. Lower is better.
val_ppl: validation perplexity for token prediction. Lower is better.
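
For reference, the two token metrics are usually related by an exponential: perplexity is exp of the mean cross entropy when the cross entropy is measured in nats. The helper below is a minimal sketch under that assumption; the name perplexity_from_ce is hypothetical, and the notebooks may compute val_ppl differently.

```python
import math


def perplexity_from_ce(val_ce: float) -> float:
    """Convert a mean cross entropy (in nats) to perplexity."""
    return math.exp(val_ce)


# A uniform guess over 128 tokens has CE = ln(128), so its perplexity is 128.
uniform_ce = math.log(128)
print(round(perplexity_from_ce(uniform_ce), 6))
```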

TaoNet-SSM LLM Wrapper Iterations

This section records the work that moved the SSM from standalone/notebook benchmarks into the TaoNet LLM comparison loop. The main implementation repo for SSM changes is this repo. The TaoNet wrapper lives in the local TaoTrain repo and branch listed below.

Related repos and branches:

SSM repo: https://github.com/StarMists/gamma_SSM_S4_enhanced.git
SSM local path: C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM
TaoTrain repo: https://github.com/lobakkang/TaoTrain.git
TaoTrain local path: C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain
TaoTrain branch: codex/taonet-ssm-core
Remote server path for SSM: /home/student/YouZheng/gamma_ssm_repo
Remote server path for TaoTrain: /home/student/YouZheng/repo
Remote execution tool: C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge

LLM Iteration 1 - Add TaoNet SSM Wrapper

Implementation location:

TaoTrain: src/taoTrain/models/taonet_ssm.py
TaoTrain: src/taoTrain/config.py
TaoTrain: src/taoTrain/models/registry.py
TaoTrain: tests/test_taonet_ssm.py
TaoTrain: scripts/benchmark_taonet_token_variants.py

TaoTrain commits:

8b1c6fa Add TaoNet Gamma SSM architecture
6edd09e Benchmark TaoNet token SSM variants

What changed:

Added a taonet_ssm model architecture for apples-to-apples comparison with the original attention taonet.
Kept the outer LLM stack close to TaoNet and replaced the sequence-mixing core with an SSM mixer.
Supported both gamma_s4 and dplr SSM cores.
Added a token-level synthetic CE benchmark comparing taonet and taonet_ssm.
Added focused tests for SSM wrapper construction and forward passes.

Local validation:

python -m pytest tests\test_taonet_ssm.py -q passed.
Broader TaoTrain tests were not run locally because the local environment was missing datasets.

Result:

Functional success. This established the comparison harness.
Performance was not yet acceptable with full-width DPLR because the wrapper exposed dense DPLR frequency-transfer cost.

LLM Iteration 2 - Projected SSM Mixer Dimension

Implementation location:

TaoTrain: src/taoTrain/models/taonet_ssm.py
TaoTrain: src/taoTrain/config.py
TaoTrain: tests/test_taonet_ssm.py

TaoTrain commit:

5e6b802 Add projected SSM mixer dimension

What changed:

Added ssm_mixer_dim.
The SSM branch now supports d_model -> ssm_mixer_dim -> SSM -> d_model.
This keeps the LLM interface the same while reducing the DPLR channel width.
This matters because DPLR convolutional training cost grows steeply with channel width: the dense transfer tensor is shaped roughly freq x channels x channels.

Remote benchmark config examples:

RepoBridge projected 128: repobridge.taonet.tokenbench.projected128.config.json
RepoBridge projected 64: repobridge.taonet.tokenbench.projected64.config.json

Important results before SSM-core optimization:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB	Interpretation
attention TaoNet	4	512	about 1.24M	about 280k	about 376	Baseline comparison point.
DPLR full-width mixer 256	4	512	about 81k	about 20k	about 6200	Failed: dense transfer path too slow and memory-heavy.
DPLR projected mixer 128	4	512	about 214k	about 56k	about 3613	Better memory, still much slower than attention.
DPLR projected mixer 64	4	512	about 114k	about 54k	about 2500	Lower memory but worse forward before core optimization.

Result:

Success as an architectural control: projection made DPLR usable enough to iterate on.
Not sufficient alone: the DPLR core still needed direct frequency-response optimization.

LLM Iteration 3 - Add Scripted SSM Benchmarks

Implementation location:

SSM: scripts/benchmark_ssm_variants.py
SSM: .gitignore

SSM commits:

7a90525 Add lightweight SSM benchmark script
c0dede8 Ignore generated benchmark outputs

What changed:

Added a Python benchmark script for baseline, gamma_s4, and dplr.
Measures forward, optional forward+backward, and optional recurrent stepping.
Writes JSON and CSV outputs.
Ignored generated benchmark result directories.
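
The measure-then-write flow described above can be sketched with the standard library alone. This is an illustration, not the contents of scripts/benchmark_ssm_variants.py: the function names, the results.json / results.csv file names, and the row fields are all assumptions. A real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import csv
import json
import tempfile
import time
from pathlib import Path


def time_callable(fn, warmup=2, iters=5):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters


def write_results(rows, out_dir):
    """Write the same result rows as JSON and as CSV (hypothetical file names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "results.json").write_text(json.dumps(rows, indent=2))
    with open(out / "results.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


# Usage: time a stand-in workload and write the outputs to a temporary directory.
mean_ms = time_callable(lambda: sum(range(10_000)))
with tempfile.TemporaryDirectory() as tmp:
    write_results([{"model": "dplr", "mode": "forward", "mean_ms": mean_ms}], tmp)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```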

Remote raw DPLR result:

Model	Batch	Seq	Mode	Tok/s	Peak MB
DPLR raw SSM	4	512	forward	about 841k	about 1310
DPLR raw SSM	4	512	forward+backward	about 101k	about 1310
DPLR raw recurrent	4	512	recurrent	about 97k	about 10

Interpretation:

Raw DPLR SSM was promising.
The wrapped LLM bottleneck came from how the DPLR convolutional path scaled under the TaoNet stack, not from the idea of DPLR alone.

LLM Iteration 4 - Direct DPLR Frequency-Response Application

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

2b204e8 Apply DPLR frequency response directly

What changed:

Added a direct training path that applies the DPLR frequency response to the FFT input.
Avoided materializing the full dense transfer tensor shaped roughly freq x channels x channels during training/grad runs.
Kept the old dense transfer path for eval/no-grad caching.
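
To see why avoiding the dense transfer tensor matters, a rough size estimate helps. The sketch below assumes a one-sided FFT padded to twice the sequence length and 8-byte complex elements; neither is confirmed by this record, and the function name is hypothetical. The point is only the quadratic growth in channel width.

```python
def dense_transfer_bytes(seq_len: int, channels: int, bytes_per_element: int = 8) -> int:
    """Estimate the size of one freq x channels x channels transfer tensor."""
    fft_len = 2 * seq_len          # assumption: zero-padded to twice the sequence length
    freq_bins = fft_len // 2 + 1   # assumption: one-sided (real-input) FFT
    return freq_bins * channels * channels * bytes_per_element


for channels in (256, 128, 64):
    mib = dense_transfer_bytes(512, channels) / 2**20
    print(f"channels={channels:3d}: about {mib:.1f} MiB per transfer tensor")
```

Halving the channel width cuts the estimate by a factor of four, which is consistent with projection and the direct path both reducing peak memory.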

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
Local CPU smoke benchmark with backward passed.

Projected-128 remote result after this change:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
attention TaoNet	4	512	about 1.32M	about 532k	about 376
DPLR projected mixer 128	4	512	about 151k	about 91k	about 508

Interpretation:

Major memory success: projected-128 DPLR dropped from about 3613 MB to about 508 MB.
Training throughput improved from about 56k to about 91k tok/s.
Forward-only throughput dropped relative to the previous projected-128 run (about 151k vs about 214k tok/s), so this change helped training/backward much more than no-grad forward timing.

LLM Iteration 5 - Specialize Rank-One DPLR Solve

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

5a0abad Specialize rank-one DPLR solve

What changed:

Current best DPLR configuration uses rank=1.
Replaced the batched torch.linalg.inv for 1 x 1 low-rank systems with scalar reciprocal math.
Applied the specialization to both direct training and cached dense response paths.
Left the general rank path intact for rank > 1.
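
The idea behind the rank-one specialization can be illustrated with the Sherman-Morrison identity: when a diagonal matrix is perturbed by a rank-one term, the low-rank system that would otherwise be inverted is a single scalar. The sketch below is framework-free and generic; the exact matrices, sign conventions, and batching used in s4_ternary_dplr_ssm.py are not shown in this record, and the function name is hypothetical.

```python
def solve_diag_plus_rank_one(diag, p, q, b):
    """Solve (diag(diag) + p q^H) x = b using a scalar reciprocal."""
    d_inv_b = [bi / di for bi, di in zip(b, diag)]
    d_inv_p = [pi / di for pi, di in zip(p, diag)]
    # For rank 1 the "1 x 1 system" is just the scalar 1 + q^H D^{-1} p.
    denom = 1 + sum(qi.conjugate() * v for qi, v in zip(q, d_inv_p))
    scale = sum(qi.conjugate() * v for qi, v in zip(q, d_inv_b)) / denom
    return [u - scale * v for u, v in zip(d_inv_b, d_inv_p)]


diag = [2 + 1j, 3 - 1j, 1 + 2j]
p = [1 + 0j, 0.5j, -1 + 0j]
q = [0.5 + 0j, 1 + 1j, 2j]
b = [1 + 0j, 2 + 0j, 3 + 0j]
x = solve_diag_plus_rank_one(diag, p, q, b)

# Residual check: applying (D + p q^H) to x should reproduce b.
qhx = sum(qi.conjugate() * xi for qi, xi in zip(q, x))
residual = max(abs(di * xi + pi * qhx - bi) for di, xi, pi, bi in zip(diag, x, p, b))
print(residual < 1e-12)
```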

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
Local CPU smoke benchmark with backward passed.

Projected-128 remote result:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
attention TaoNet	4	512	about 1.33M	about 545k	about 376
DPLR projected mixer 128	4	512	about 485k	about 142k	about 508

Projected-64 remote result after this change:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR projected mixer 64	4	512	about 618k	about 192k	about 494

Scaling probe for projected-64:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
attention TaoNet	16	512	about 1.16M	about 990k	about 1332
DPLR projected mixer 64	16	512	about 2.12M	about 702k	about 1684

Interpretation:

Major success.
DPLR projected-64 became the best current SSM LLM configuration.
At batch 16, DPLR projected-64 forward throughput exceeded attention in this synthetic benchmark.
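Backward was still behind attention (about 702k vs about 990k tok/s at batch 16), but the gap narrowed substantially compared with batch 4 (about 192k vs about 545k tok/s).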
The SSM now scales much better with batch size, suggesting fixed frequency-response overhead is being amortized.

LLM Iteration 6 - Precompose Finite Response Projection

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commits:

f09a71b Precompose DPLR finite response projection
648a32e Revert "Precompose DPLR finite response projection"

What changed:

Tried replacing C @ (I - z^L A^L) @ response with two projected terms:
- C @ response
- (C @ A^L) @ response
The goal was to eliminate one batch/frequency hidden-state multiplication in the direct path.

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
Local smoke benchmark passed.
Direct-vs-cached convolution comparison had max absolute difference around 2.4e-7.

Remote result:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR projected mixer 64 before this change	4	512	about 618k	about 192k	about 494
DPLR projected mixer 64 with this change	4	512	about 495k	about 162k	about 478

Interpretation:

Failed on real GPU token benchmark.
It saved a little memory but reduced speed too much.
The commit was intentionally reverted, so current SSM main is back to the best-performing rank-one direct-response core.

LLM Iteration 7 - Rank-One Matmul Fast Path

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commits:

43de801 Use matmul fast path for rank-one DPLR
9ffa5a7 Gate rank-one matmul path by batch size
4e130b6 Limit rank-one matmul path to small batches
8969916 Revert "Limit rank-one matmul path to small batches"
5b3a957 Revert "Gate rank-one matmul path by batch size"
a46a2af Revert "Use matmul fast path for rank-one DPLR"

What changed:

Tried a deeper rank=1 direct-application specialization.
Replaced several generic einsum operations with batched matmul and vector reductions.
The goal was to reduce Python/operator overhead and improve backward throughput for the current best DPLR rank.
A follow-up tried to gate the path by batch size after the batch-16 scaling run regressed.

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
Local CPU smoke benchmark passed.
Direct-vs-cached convolution comparison had max absolute difference around 2.4e-7.

Remote result:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR projected mixer 64 before this change	4	512	about 618k	about 192k	about 494
DPLR projected mixer 64 first matmul run	4	512	about 643k	about 208k	about 494
DPLR projected mixer 64 repeated small-batch gated run	4	512	about 470k-472k	about 161k-175k	about 494
DPLR projected mixer 64 matmul at batch 16	16	512	about 1.47M	about 388k	about 1684
DPLR projected mixer 64 previous best at batch 16	16	512	about 2.12M	about 702k	about 1684

Interpretation:

Failed overall.
The first batch-4 run looked promising, but repeated remote results were worse.
The matmul formulation regressed the larger-batch scaling behavior that matters for GPU utilization.
All matmul fast-path commits were reverted, so current SSM main returns to the best-known 5a0abad rank-one scalar-solve behavior plus the experiment-record commits.

LLM Iteration 8 - TileLang Capability Detection

Implementation location:

SSM: csrc/tilelang/selective_scan.py
SSM: csrc/tilelang/__init__.py
SSM: gamma_space_model/ops/selective_scan_interface.py
SSM: gamma_space_model/modules/ssm_gamma.py
SSM: scripts/diagnose_tilelang_acceleration.py

SSM commit:

4784856 Make TileLang acceleration detection explicit

What changed:

Made TileLang capability reporting explicit and conservative.
Before this change, HAS_TILELANG_OPS became true whenever the Python fallback module was imported successfully.
That was misleading because csrc/tilelang did not actually dispatch to a real TileLang kernel; it used PyTorch fallback code.
Added TILELANG_BACKEND and HAS_TILELANG_ACCELERATION flags.
Added scripts/diagnose_tilelang_acceleration.py to print package availability, repo backend flags, and a small Gamma forward timing.
Fixed SSMGamma.step dtype/device casting after the honest fallback path exposed a float64 failure in the normal PyTorch path.
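
The distinction between "package importable" and "real kernel dispatched" can be sketched as follows. This is not the repo's detection code: the function name, the has_real_kernel parameter, and the "tilelang" backend string are assumptions; only HAS_TILELANG_ACCELERATION, TILELANG_BACKEND, and pytorch_fallback come from this record.

```python
import importlib.util


def detect_backend(find_spec=importlib.util.find_spec, has_real_kernel=False):
    """Report package availability separately from actual acceleration."""
    tilelang_available = find_spec("tilelang") is not None
    triton_available = find_spec("triton") is not None
    # Acceleration requires both the package and a kernel that is really dispatched.
    accelerated = tilelang_available and has_real_kernel
    return {
        "tilelang_available": tilelang_available,
        "triton_available": triton_available,
        "HAS_TILELANG_ACCELERATION": accelerated,
        "TILELANG_BACKEND": "tilelang" if accelerated else "pytorch_fallback",
    }


print(detect_backend())
```

Passing find_spec as a parameter keeps the check testable without installing or uninstalling packages.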

Validation:

python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q passed locally: 22 passed.
Local diagnostic reported:
- has_tilelang_ops=false
- tilelang_backend=pytorch_fallback
- triton_available=false
- tilelang_available=false

Remote RTX 5090 diagnostic:

Field	Value
Torch	`2.11.0+cu130`
CUDA	available
GPU	`NVIDIA GeForce RTX 5090`
Triton package	available
TileLang package	not available
Repo `HAS_TILELANG_OPS`	`false`
Repo `TILELANG_BACKEND`	`pytorch_fallback`
Gamma fallback forward	about `76.7k` tok/s at batch 4, seq 512, bf16

Remote raw SSM benchmark after this change:

Model	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR raw SSM	4	512	about 3.16M	about 1.03M	about 57
Gamma-S4 raw SSM	4	512	about 100.6k	about 45.5k	about 467
Baseline Gamma raw SSM	4	512	about 85.2k	about 32.3k	about 120

Interpretation:

This iteration did not yet add a real TileLang kernel.
It fixed an important measurement and dispatch problem: fallback code is no longer reported as hardware acceleration.
The remote server has Triton installed but does not have the TileLang package installed.
The current DPLR path is frequency-domain PyTorch/cuBLAS and does not use csrc/tilelang.
The next hardware-acceleration step should be explicit: either install/use real TileLang on the remote server or write a Triton/TileLang kernel for a clearly scoped hot path. The best candidate hot path is not the old baseline Gamma fallback; it is the DPLR direct frequency-response/backward path used by taonet_ssm.

LLM Iteration 9 - DPLR Frequency-Path Profiling And Root Cache

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py
SSM: scripts/profile_dplr_frequency_path.py

SSM commit:

92643c5 Cache DPLR frequency roots

What changed:

Added a per-module cache for FFT roots and roots^seq_len.
These tensors are constants for a given (seq_len, fft_len, dtype, device), so rebuilding them every forward/layer is unnecessary GPU work.
Added scripts/profile_dplr_frequency_path.py to profile the DPLR convolutional path directly on the remote server.

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed locally.
python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q passed locally: 22 passed.
Local profiler smoke passed and showed frequency_grid_cache_entries=1.

Remote profiler result for raw DPLR at batch 4, seq 512, d_model 64, hidden_dim 256:

Mode	Mean ms	Tok/s	Peak MB
forward	about 2.58	about 793k	not measured
forward+backward	about 3.27	about 626k	about 52
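
The Tok/s column follows from the mean latency and the token count per step (batch x seq). A small sketch of that conversion, with a hypothetical helper name; the results land close to the table values, with small differences explained by rounding of the reported milliseconds.

```python
def tokens_per_second(batch: int, seq_len: int, mean_ms: float) -> float:
    """Throughput implied by a mean per-step latency in milliseconds."""
    return batch * seq_len / (mean_ms / 1000.0)


for mode, mean_ms in (("forward", 2.58), ("forward+backward", 3.27)):
    print(f"{mode}: about {tokens_per_second(4, 512, mean_ms) / 1000:.0f}k tok/s")
```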

Remote profiler interpretation:

The largest CUDA entries were aten::bmm, aten::mm, and their backward paths.
aten::linalg_matrix_power was visible but small in this configuration.
Root generation was not the dominant cost, so the cache is a modest cleanup rather than a major acceleration.
A future TileLang/Triton kernel should target fused rank-1 DPLR frequency-response application and its backward, especially around the small complex BMM/MM pattern. Replacing the old Gamma Python fallback is not the right priority for the TaoNet-SSM goal.

TaoNet projected-64 check after this change:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR projected mixer 64	4	512	about 656k	about 163k	about 494

Scaling probe after this change:

Variant	Batch	Seq	Forward tok/s	Backward tok/s	Peak MB
DPLR projected mixer 64	8	512	about 983k	about 341k	about 889
DPLR projected mixer 64	16	512	about 1.03M	about 414k	about 1684

Interpretation:

The root cache is correct and removes repeated constant construction.
End-to-end results remain noisy; this is not a breakthrough optimization.
The main value of this iteration is the profiler evidence: the next real hardware acceleration should fuse the DPLR rank-1 complex frequency-response operations, not spend effort on the older baseline Gamma fallback path.

LLM Iteration 10 - Shared DPLR Frequency Grid Cache

Implementation location:

SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

a9e5d3e Share DPLR frequency grid cache

What changed:

Promoted the DPLR FFT root cache from per-module to class-level shared cache.
The previous cache avoided rebuilding roots inside a single SSM module, but a multi-layer TaoNet creates one SSM module per layer.
The shared cache lets all layers reuse the same (roots, roots^seq_len) tensors for a given (seq_len, fft_len, dtype, device).
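
The per-module versus class-level distinction can be sketched without any framework. In the illustration below the cache key omits dtype and device because plain Python complex numbers have neither, the root sign convention and one-sided frequency range are assumptions, and the class name is hypothetical.

```python
import cmath


class FrequencyGridCache:
    """Stand-in for an SSM module that needs FFT roots and roots^seq_len."""

    _shared = {}  # class-level dict: one cache for every instance (every layer)

    def roots(self, seq_len: int, fft_len: int):
        key = (seq_len, fft_len)
        if key not in FrequencyGridCache._shared:
            roots = tuple(
                cmath.exp(-2j * cmath.pi * k / fft_len) for k in range(fft_len // 2 + 1)
            )
            roots_pow = tuple(r ** seq_len for r in roots)
            FrequencyGridCache._shared[key] = (roots, roots_pow)
        return FrequencyGridCache._shared[key]


# Four "layers" request the same grid; only one entry is ever built.
layers = [FrequencyGridCache() for _ in range(4)]
grids = [layer.roots(8, 16) for layer in layers]
print(len(FrequencyGridCache._shared), all(g is grids[0] for g in grids))
```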

Validation:

python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed locally.
Local scripts/profile_dplr_frequency_path.py smoke passed and still reported one frequency-grid cache entry.

Required TaoNet comparison after this iteration:

Remote benchmark:

RepoBridge run: taonet-vs-dplr-proj64-shared-grid-bench-20260429-101304
Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

Architecture	Batch	Seq	Mode	Tok/s	Peak MB	Loss
attention TaoNet	4	512	forward	about 1.30M	about 193	9.064
attention TaoNet	4	512	forward+backward	about 513k	about 376	9.064
SSM TaoNet, DPLR projected 64	4	512	forward	about 499k	about 190	9.059
SSM TaoNet, DPLR projected 64	4	512	forward+backward	about 162k	about 492	9.059

Comparison:

SSM forward throughput was about 38% of attention at batch 4, seq 512.
SSM forward+backward throughput was about 32% of attention.
SSM forward memory was slightly lower than attention, but backward peak memory was higher.
Loss was comparable because this is a random synthetic token benchmark, not a trained quality result.

Interpretation:

The shared cache is correct and small, but it did not create a clear end-to-end speed breakthrough.
This reinforces the profiler conclusion: constant/root setup is not the dominant TaoNet-SSM bottleneck.
Future iterations should include the attention-vs-SSM table directly, and hardware work should focus on the DPLR rank-1 complex BMM/MM and backward pattern.

LLM Iteration 11 - Re-anchor On Projected-64 Scaling Regime

Reason for this iteration:

The strongest previous result came from a scaling probe, not from batch-4 timing.
Later iterations over-emphasized batch 4, which made the SSM look worse and encouraged the wrong optimization target.
This iteration re-established the primary benchmark as attention TaoNet vs SSM TaoNet under larger projected-64 batches.

Implementation change:

No model-code change.
Benchmark-policy change: projected-64 scaling comparisons should be treated as primary acceptance tests for throughput work.

Remote benchmark:

RepoBridge run: taonet-token-dplr-proj64-scale-bench-20260429-111150
RepoBridge run: taonet-token-dplr-proj64-extended-scale-bench-20260429-111350
Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

Required TaoNet comparison:

Architecture	Batch	Seq	Mode	Tok/s	Peak MB	Loss
attention TaoNet	8	512	forward	about 1.09M	about 319	9.061
attention TaoNet	8	512	forward+backward	about 468k	about 697	9.061
SSM TaoNet, DPLR projected 64	8	512	forward	about 1.17M	about 320	9.058
SSM TaoNet, DPLR projected 64	8	512	forward+backward	about 318k	about 889	9.058
attention TaoNet	16	512	forward	about 1.60M	about 596	9.059
attention TaoNet	16	512	forward+backward	about 503k	about 1332	9.059
SSM TaoNet, DPLR projected 64	16	512	forward	about 1.03M	about 580	9.060
SSM TaoNet, DPLR projected 64	16	512	forward+backward	about 427k	about 1684	9.060
attention TaoNet	32	512	forward	about 1.88M	about 1124	9.062
attention TaoNet	32	512	forward+backward	about 632k	about 2590	9.062
SSM TaoNet, DPLR projected 64	32	512	forward	about 2.62M	about 1100	9.061
SSM TaoNet, DPLR projected 64	32	512	forward+backward	about 705k	about 3273	9.061
attention TaoNet	64	512	forward	about 3.53M	about 2204	9.061
attention TaoNet	64	512	forward+backward	about 683k	about 5121	9.061
SSM TaoNet, DPLR projected 64	64	512	forward	about 1.30M	about 2140	9.060
SSM TaoNet, DPLR projected 64	64	512	forward+backward	about 618k	about 6451	9.060

Comparison:

Batch 8: SSM forward is slightly faster than attention, but backward is slower.
Batch 16: SSM backward is closer to attention than in batch-4 runs, but still slower.
Batch 32: SSM beats attention in both forward and forward+backward throughput in this run.
Batch 64: SSM falls off sharply, so the useful scaling point is not simply the largest batch.
SSM backward memory remains higher than attention, especially at larger batches.
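
The sweet spot can be read off the table by computing the SSM-to-attention ratio per batch size. The snippet below hard-codes the approximate forward+backward values from the table above (in thousands of tok/s); it is only an arithmetic aid, not part of the benchmark scripts.

```python
# Forward+backward throughput from the table above, in k tok/s.
attention = {8: 468, 16: 503, 32: 632, 64: 683}
ssm = {8: 318, 16: 427, 32: 705, 64: 618}


def ssm_to_attention_ratio(batch: int) -> float:
    return ssm[batch] / attention[batch]


best_batch = max(attention, key=ssm_to_attention_ratio)
for batch in attention:
    print(f"batch {batch}: SSM/attention = {ssm_to_attention_ratio(batch):.2f}")
print("best batch:", best_batch)
```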

Interpretation:

The projected-64 DPLR SSM should be optimized and evaluated around the scaling sweet spot, currently batch 32 for this synthetic benchmark on the RTX 5090.
Batch-4 timing is still useful for smoke tests, but it should not be treated as the main performance target.
This is a configuration-level breakthrough: in this synthetic benchmark, SSM can outperform attention at the right batch size even before custom TileLang/Triton kernels.
Next improvement directions should either preserve or improve the batch-32 scaling result, not merely improve batch-4 microbenchmarks.

LLM Iteration 12 - Token Accuracy Benchmark And Causal Memory Check

Reason for this iteration:

Throughput alone is not sufficient; the SSM TaoNet must also learn useful token tasks.
The benchmark script previously reported only random synthetic CE, which is not an inference accuracy signal.
This iteration adds lightweight trained token tasks and reports eval_accuracy.

Implementation location:

TaoTrain: scripts/benchmark_taonet_token_variants.py

TaoTrain commit:

59b84cd Add token task accuracy benchmark

What changed:

Added --token-task with:
- random: original random next-token timing task
- increment: deterministic token mapping, label is current token plus one modulo vocab
- previous: causal memory task, label is the previous token
Added optional short training with --train-steps, --learning-rate, --weight-decay.
Added eval metrics:
- eval_loss
- eval_accuracy
- train_final_loss
- train_seconds
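
The two deterministic tasks can be sketched as label constructions over a token sequence. This is an assumed illustration rather than the benchmark script's code: the function name is hypothetical, and in particular the handling of position 0 in the previous task (marked here with an ignore index of -100) is a guess. The random task is omitted.

```python
IGNORE_INDEX = -100  # assumption: first position of the previous task is not scored


def make_labels(tokens, task, vocab_size):
    """Build per-position labels for the increment and previous token tasks."""
    if task == "increment":
        # Label depends only on the current token.
        return [(t + 1) % vocab_size for t in tokens]
    if task == "previous":
        # Label is the token one step earlier, so the model must carry memory.
        return [IGNORE_INDEX] + list(tokens[:-1])
    raise ValueError(f"unsupported task: {task}")


tokens = [5, 127, 0, 42]
print(make_labels(tokens, "increment", 128))
print(make_labels(tokens, "previous", 128))
```

The increment task can be solved position by position, while the previous task cannot, which is why the two separate mapping ability from causal memory.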

Validation:

Local TaoTrain smoke passed on CPU.
python -m pytest tests\test_taonet_ssm.py -q passed locally.

Broad speed comparison after adding accuracy columns:

RepoBridge run: taonet-vs-dplr-proj64-broad-speed-bench-20260429-112432
Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, random token task, batch sweep

Architecture	Batch	Seq	Mode	Tok/s	Peak MB
attention TaoNet	8	512	forward+backward	about 873k	about 697
SSM TaoNet, DPLR projected 64	8	512	forward+backward	about 244k	about 956
attention TaoNet	16	512	forward+backward	about 589k	about 1332
SSM TaoNet, DPLR projected 64	16	512	forward+backward	about 646k	about 1748
attention TaoNet	32	512	forward+backward	about 680k	about 2592
SSM TaoNet, DPLR projected 64	32	512	forward+backward	about 816k	about 3338
attention TaoNet	64	512	forward+backward	about 763k	about 5121
SSM TaoNet, DPLR projected 64	64	512	forward+backward	about 544k	about 6516

Speed interpretation:

SSM remains poor at batch 8.
SSM wins forward+backward throughput at batch 16 and batch 32 in this run.
SSM falls off again at batch 64.
Batch 16-32 remains the useful projected-64 scaling range.

Token accuracy comparison:

Previous-token task run: taonet-vs-dplr-proj64-previous-token-quality-20260429-112456
Linear/ungated SSM ablation run: taonet-vs-dplr-proj64-previous-token-linear-quality-20260429-112623
Increment task run: taonet-vs-dplr-proj64-increment-token-quality-20260429-112719
All quality runs used batch 32, seq 128, vocab 128, 100 train steps, bf16.

Task	Architecture	Eval loss	Eval accuracy	Forward+backward tok/s
previous	attention TaoNet	about 0.033	about 0.999	about 551k
previous	SSM TaoNet, DPLR projected 64	about 4.858	about 0.009	about 292k
previous, linear ungated SSM	attention TaoNet	about 0.046	about 0.999	about 990k
previous, linear ungated SSM	SSM TaoNet, DPLR projected 64	about 4.626	about 0.026	about 346k
increment	attention TaoNet	about 0.007	1.000	about 1.09M
increment	SSM TaoNet, DPLR projected 64	about 0.009	1.000	about 344k

Failed SSM-core improvement:

SSM commit 2e974c9 (Add delayed DPLR skip) added a learnable one-step diagonal delayed skip to help causal memory.
Remote previous-token run after this change: taonet-vs-dplr-proj64-previous-token-quality-20260429-112953.
Result: SSM remained near random, eval accuracy about 0.008, and speed worsened.
The change was reverted by 3fc0575 Revert "Add delayed DPLR skip".

Interpretation:

Projected-64 DPLR SSM can learn simple token mappings (increment) to perfect accuracy.
It currently fails a short causal memory/copy task (previous) under the same 100-step setting where attention TaoNet reaches about 99.9% accuracy.
The failure is not solved by removing SSM activation/gates or by a simple delayed diagonal skip.
Future improvements must include both:
- speed comparison across batch 8/16/32/64
- trained token accuracy, especially on causal memory tasks
The next quality-focused direction should investigate the SSM wrapper/core's ability to expose previous-token information, not only low-level GPU speed.
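
The "near random" readings above can be checked against chance level for a 128-token vocabulary: a uniform guess gives a cross entropy of ln(128), about 4.852, and an accuracy of 1/128, about 0.008. The SSM previous-task results (loss about 4.86, accuracy about 0.008 to 0.009) sit essentially at that level. A small helper, with a hypothetical name:

```python
import math


def chance_level(vocab_size: int):
    """Cross entropy (nats) and accuracy of a uniform guess over the vocabulary."""
    return math.log(vocab_size), 1.0 / vocab_size


loss, accuracy = chance_level(128)
print(f"chance loss about {loss:.3f}, chance accuracy about {accuracy:.4f}")
```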

LLM Iteration 13 - Local Shift Register For Causal Token Memory

Reason for this iteration:

Projected-64 was the strongest SSM speed configuration, but it failed the previous token-memory task.
Capacity probes showed the failure was not caused by the projected-64 bottleneck alone:
- projected-128 SSM eval accuracy stayed near random, about 0.007
- full-width (mixer dim 256) SSM eval accuracy stayed near random, about 0.008
The next improvement therefore targeted explicit short causal memory while preserving the DPLR SSM as the main sequence mixer.

Implementation location:

TaoTrain commit: bb3bf90 Add SSM local shift mixer option
TaoTrain: src/taoTrain/models/taonet_ssm.py
TaoTrain: src/taoTrain/config.py
TaoTrain: scripts/benchmark_taonet_token_variants.py
TaoTrain: tests/test_taonet_ssm.py

What changed:

Added opt-in ssm_local_shift.
The SSM mixer can now add a one-token causal shift/register branch:
- shifted[:, 1:] = x_norm[:, :-1]
- output contribution is controlled by a single learned scalar ssm_local_shift_init.
The branch is deliberately cheap and ternary-friendly in structure: it is a causal shift plus scalar gain, not another dense attention mechanism.
The default remains off, so older SSM benchmarks are still comparable.
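
A framework-free sketch of the shift branch is shown below, mirroring the slice assignment quoted above on nested [batch][seq][dim] lists. The function name and the zero fill at position 0 are assumptions, and how the branch is combined with the DPLR output inside taonet_ssm.py is not shown in this record.

```python
def local_shift_branch(x_norm, gain):
    """Return gain * (x_norm shifted one step later in time), zero at position 0."""
    out = []
    for sequence in x_norm:
        dim = len(sequence[0])
        # Position t receives the features of position t - 1; position 0 gets zeros.
        shifted = [[0.0] * dim] + [list(token) for token in sequence[:-1]]
        out.append([[gain * value for value in token] for token in shifted])
    return out


x_norm = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # batch=1, seq=3, dim=2
print(local_shift_branch(x_norm, 0.5))
```

Because position t only ever sees position t - 1, the branch stays causal, and its cost is one copy plus one scalar multiply per element.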

Validation:

Local TaoTrain:
- PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q passed: 4 passed.
- CPU smoke for benchmark_taonet_token_variants.py --ssm-local-shift passed.

Capacity diagnostic before the change:

Architecture	Mixer dim	Batch	Seq	Eval loss	Eval accuracy	Forward+backward tok/s
attention TaoNet	n/a	32	128	about 0.025	1.000	about 1.04M
SSM TaoNet, DPLR	128	32	128	about 4.857	about 0.007	about 379k
attention TaoNet	n/a	32	128	about 0.042	about 0.999	about 556k
SSM TaoNet, DPLR	256	32	128	about 4.856	about 0.008	about 389k

Required TaoNet comparison after the change:

RepoBridge run: taonet-vs-dplr-proj64-local-shift-previous-quality-20260429-144930
RepoBridge broad run: taonet-vs-dplr-proj64-local-shift-previous-broad-quality-20260429-145014
Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled.

Architecture	Batch	Eval loss	Eval accuracy	Forward tok/s	Forward+backward tok/s	Peak MB
attention TaoNet	8	about 4.376	about 0.096	about 695k	about 238k	about 103
SSM TaoNet, DPLR projected 64 + local shift	8	about 0.010	1.000	about 226k	about 89k	about 181
attention TaoNet	16	about 1.048	about 0.847	about 1.26M	about 508k	about 166
SSM TaoNet, DPLR projected 64 + local shift	16	about 0.008	1.000	about 520k	about 189k	about 299
attention TaoNet	32	about 0.043	1.000	about 2.54M	about 555k	about 297
SSM TaoNet, DPLR projected 64 + local shift	32	about 0.008	1.000	about 1.16M	about 353k	about 513
attention TaoNet	64	about 0.020	1.000	about 4.75M	about 1.73M	about 553
SSM TaoNet, DPLR projected 64 + local shift	64	about 0.007	1.000	about 2.43M	about 403k	about 956

Interpretation:

Success: this is the first projected-64 SSM TaoNet result that solves the previous causal-memory task.
The result is not only a batch-32 spot check. SSM reached perfect eval accuracy at batch 8, 16, 32, and 64.
The quality gain is large: plain projected-64, projected-128, and projected-256 DPLR all stayed near random on the same task.
Speed tradeoff: local-shift SSM is slower than attention on this short-sequence previous-token benchmark, especially backward.
This should be treated as a quality architecture fix, not a hardware-acceleration fix. The next hardware iteration should still target fused DPLR frequency/backward kernels.

LLM Iteration 14 - Explicit DPLR Transfer-Mode Probe

Reason for this iteration:

After the local-shift quality fix, the next bottleneck was speed.
The DPLR direct frequency path applies the finite correction to batch-dependent hidden responses.
A possible alternative was to materialize the full frequency transfer matrix, then multiply by the input FFT.
This could be faster for some batch/sequence shapes, but it risks high memory and repeated transfer construction.

Implementation location:

SSM commit: 749a4cf Add DPLR transfer profiling mode
SSM commit: e34b67c Add DPLR conv transfer mode
TaoTrain commit: ceb08e6 Expose SSM conv transfer mode
SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py
SSM: scripts/profile_dplr_frequency_path.py
TaoTrain: scripts/benchmark_taonet_token_variants.py

What changed:

Added profiler support for comparing:
- direct DPLR frequency response application
- materialized transfer matrix application
Added explicit kernel_mode="conv_transfer" to S4TernaryDPLRSSM.
Exposed conv_transfer through TaoTrain config/benchmark CLI.
The mode is opt-in only. The default/recommended projected-64 path remains conv.

Local validation:

SSM: python -m pytest tests\test_s4_ternary_dplr_ssm.py tests\test_ssm_gamma.py -q passed: 23 passed.
TaoTrain: PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q passed: 5 passed.

Isolated SSM-core remote profile:

RepoBridge run: ssm-dplr-direct-vs-transfer-s128-profile-20260429-145555
Config: DPLR state/mixer dim 64, hidden dim 256, seq 128, bf16, rank 1.

Method	Batch	Forward tok/s	Forward+backward tok/s	Peak MB	Interpretation
direct	8	about 555k	about 332k	about 34	baseline direct path
transfer	8	about 1.24M	about 440k	about 247	faster but much higher memory
direct	16	about 790k	about 1.13M	about 47	direct wins
transfer	16	about 737k	about 481k	about 248	transfer loses
direct	32	about 6.73M	about 1.86M	about 74	direct wins
transfer	32	about 4.89M	about 1.68M	about 250	transfer loses
direct	64	about 6.90M	about 2.20M	about 128	baseline direct path
transfer	64	about 2.93M	about 3.06M	about 253	backward faster, forward slower

TaoNet comparison after exposing conv_transfer:

RepoBridge run: taonet-vs-dplr-proj64-local-shift-conv-transfer-previous-broad-quality-20260429-145946
Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, ssm_kernel_mode=conv_transfer.

Architecture	Batch	Eval loss	Eval accuracy	Forward tok/s	Forward+backward tok/s	Peak MB
attention TaoNet	8	about 4.431	about 0.082	about 699k	about 279k	about 103
SSM TaoNet, DPLR projected 64 + local shift + conv_transfer	8	about 0.010	1.000	about 52k	about 14k	about 195
attention TaoNet	16	about 1.098	about 0.862	about 1.24M	about 303k	about 166
SSM TaoNet, DPLR projected 64 + local shift + conv_transfer	16	about 0.008	1.000	about 79k	about 24k	about 270
attention TaoNet	32	about 0.061	about 0.998	about 1.24M	about 674k	about 297
SSM TaoNet, DPLR projected 64 + local shift + conv_transfer	32	about 0.007	1.000	about 157k	about 45k	about 420
attention TaoNet	64	about 0.015	1.000	about 4.48M	about 1.05M	about 553
SSM TaoNet, DPLR projected 64 + local shift + conv_transfer	64	about 0.007	1.000	about 370k	about 97k	about 719

Comparison to the previous direct-conv local-shift run:

Batch	Direct-conv SSM forward+backward tok/s	Transfer-mode SSM forward+backward tok/s
8	about 89k	about 14k
16	about 189k	about 24k
32	about 353k	about 45k
64	about 403k	about 97k
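
The size of the regression can be read off the comparison table as a ratio. A minimal sketch using the rounded forward+backward figures recorded above (the dictionary and function names are illustrative):

```python
# Forward+backward tok/s from the comparison table (rounded values as recorded).
direct_conv = {8: 89_000, 16: 189_000, 32: 353_000, 64: 403_000}
transfer = {8: 14_000, 16: 24_000, 32: 45_000, 64: 97_000}


def slowdown(fast: dict, slow: dict) -> dict:
    """Return fast/slow throughput ratio for every batch size present in both."""
    return {batch: fast[batch] / slow[batch] for batch in fast if batch in slow}


ratios = slowdown(direct_conv, transfer)
for batch in sorted(ratios):
    print(f"batch {batch}: direct conv is about {ratios[batch]:.1f}x faster")
```

Because the inputs are rounded "about" values, the ratios are indicative only.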

Interpretation:

conv_transfer failed as an end-to-end TaoNet acceleration.
The isolated SSM profile suggested transfer mode could help in some cases (forward and forward+backward at batch 8, forward+backward at batch 64), but inside the LLM wrapper its forward+backward throughput is roughly 4x to 8x lower than direct conv at every tested batch size.
Accuracy remains solved because the local shift branch is still active, but speed regresses badly.
Keep conv_transfer only as an explicit diagnostic/experimental mode for now.
Recommended mode remains ssm_kernel_mode=conv with ssm_local_shift=True.
The next hardware target should not be materializing the whole transfer each layer/step. It should focus on fusing the current direct DPLR response path or giving it a custom autograd implementation, especially for the complex rank-1 frequency operations and the backward pass.

LLM Iteration 15 - Shrink DPLR Hidden State After Local-Shift Quality Fix

Reason for this iteration:

The local-shift branch solved the previous token-memory task, but the quality-fixed SSM was still slower than attention on short seq-128 training.
The profiler for the recommended direct DPLR path at batch 32, seq 128 showed many small complex BMM/MM calls; there was no single obvious Python-only bottleneck.
Since local shift now carries exact one-token memory, the DPLR hidden dimension may not need to remain at 256 for this token-memory regime.
This iteration tested smaller DPLR hidden states as a ternary-friendly architecture/config improvement.

Remote profiler context:

RepoBridge run: ssm-dplr-direct-b32-s128-profile-20260429-154242
Config: DPLR mixer/state dim 64, hidden dim 256, batch 32, seq 128, bf16, rank 1, direct path.
Result: forward+backward about 2.25M core tok/s.
Profiler top CUDA cost was small complex BMM/MM work; aten::bmm accounted for about 48% of self CUDA time.
aten::linalg_matrix_power was visible but small, about 40us CUDA total.

Remote hidden-dim sweeps:

RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden-sweep-previous-20260429-154546
RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden-small-sweep-previous-20260429-155028
Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct conv path.

SSM hidden dim	Batch	SSM eval accuracy	SSM forward+backward tok/s	SSM peak MB	Attention eval accuracy	Attention forward+backward tok/s
256	8	1.000	about 89k	about 181	about 0.096	about 238k
256	16	1.000	about 189k	about 299	about 0.847	about 508k
256	32	1.000	about 353k	about 513	1.000	about 555k
256	64	1.000	about 403k	about 956	1.000	about 1.73M
64	8	1.000	about 95k	about 145	about 0.102	about 278k
64	16	1.000	about 184k	about 239	about 0.932	about 300k
64	32	1.000	about 370k	about 404	about 0.999	about 895k
64	64	1.000	about 564k	about 750	about 0.999	about 920k
32	8	1.000	about 91k	about 139	about 0.097	about 245k
32	16	1.000	about 187k	about 227	about 0.941	about 460k
32	32	1.000	about 302k	about 393	about 0.998	about 863k
32	64	1.000	about 787k	about 716	1.000	about 1.75M
16	8	1.000	about 86k	about 138	about 0.083	about 260k
16	16	1.000	about 187k	about 223	about 0.844	about 495k
16	32	1.000	about 357k	about 378	about 0.999	about 550k
16	64	1.000	about 795k	about 705	1.000	about 1.76M

Seq-512 speed check for hidden dim 16:

RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden16-random-speed-20260429-155346
Config: random next-token timing task, seq 512, vocab 8192, projected DPLR mixer dim 64, hidden dim 16, local shift enabled.

Architecture	Batch	Forward tok/s	Forward+backward tok/s	Peak MB	Loss
attention TaoNet	16	about 1.30M	about 601k	about 1332	about 9.055
SSM TaoNet, DPLR projected 64, hidden 16 + local shift	16	about 2.18M	about 728k	about 1511	about 9.069
attention TaoNet	32	about 3.87M	about 1.37M	about 2590	about 9.060
SSM TaoNet, DPLR projected 64, hidden 16 + local shift	32	about 3.52M	about 1.14M	about 2887	about 9.061
attention TaoNet	64	about 4.16M	about 1.45M	about 5121	about 9.065
SSM TaoNet, DPLR projected 64, hidden 16 + local shift	64	about 4.03M	about 1.31M	about 5649	about 9.060

Interpretation:

Success for the short token-memory benchmark: hidden dim 16 kept perfect previous-token accuracy and improved batch-64 forward+backward throughput from about 403k to about 795k tok/s while reducing peak memory from about 956 MB to about 705 MB.
Hidden dim 64 was also strong and slightly faster at batch 32 than hidden dim 16 (about 370k vs 357k tok/s).
This did not become a universal seq-512 speed replacement. On random seq-512 timing, hidden dim 16 beat attention at batch 16 but lost at batch 32 and 64.
Recommended quality-aware short-memory config is now ssm_mixer_dim=64, ssm_hidden_dim=16, ssm_local_shift=True, ssm_kernel_mode=conv.
Recommended longer seq-512 throughput config should remain benchmark-driven; the older hidden-256 projected-64 regime still has stronger evidence around batch 16-32.

LLM Iteration 16 - Seq-512 Previous-Token Robustness And Hidden-State Selection

Reason for this iteration:

Iteration 15 showed hidden dim 16 was excellent for seq-128 previous-token memory and mixed for seq-512 random timing.
The missing check was a longer trained token-memory task: the previous-token task at seq 512, where accuracy and training speed both matter.
This iteration tested whether the local-shift quality fix holds at seq 512 and whether hidden dim 16, 64, or 256 is the best state size at this longer context.

Remote benchmark:

RepoBridge run with attention comparison: taonet-vs-dplr-proj64-local-shift-hidden16-previous512-20260429-161213
RepoBridge SSM-only hidden comparison: taonet-ssm-proj64-local-shift-previous512-hidden-compare-20260429-161306
Config: previous-token task, seq 512, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct conv path.

Required TaoNet comparison:

Architecture	SSM hidden dim	Batch	Eval loss	Eval accuracy	Forward tok/s	Forward+backward tok/s	Peak MB
attention TaoNet	n/a	16	about 4.614	about 0.048	about 2.04M	about 1.39M	about 575
SSM TaoNet, DPLR projected 64 + local shift	16	16	about 0.007	1.000	about 2.57M	about 701k	about 754
attention TaoNet	n/a	32	about 2.090	about 0.629	about 4.79M	about 899k	about 1099
SSM TaoNet, DPLR projected 64 + local shift	16	32	about 0.007	1.000	about 4.32M	about 944k	about 1391
attention TaoNet	n/a	64	about 0.239	about 0.962	about 4.08M	about 1.18M	about 2157
SSM TaoNet, DPLR projected 64 + local shift	16	64	about 0.007	1.000	about 2.57M	about 961k	about 2677

SSM hidden-state comparison at seq 512:

SSM hidden dim	Batch	Eval accuracy	Forward tok/s	Forward+backward tok/s	Peak MB
16	16	1.000	about 2.57M	about 701k	about 754
16	32	1.000	about 4.32M	about 944k	about 1391
16	64	1.000	about 2.57M	about 961k	about 2677
64	16	1.000	about 1.18M	about 494k	about 800
64	32	1.000	about 4.26M	about 1.36M	about 1491
64	64	1.000	about 4.70M	about 1.46M	about 2874
256	16	1.000	about 2.49M	about 748k	about 1032
256	32	1.000	about 3.36M	about 705k	about 1914
256	64	1.000	about 3.62M	about 788k	about 3681

Interpretation:

Quality success: local-shift DPLR SSM keeps perfect previous-token accuracy at seq 512 for all tested hidden sizes and batches.
Attention did not fully learn the same task in 100 steps: it reached about 0.048 accuracy at batch 16, about 0.629 at batch 32, and about 0.962 at batch 64.
Speed depends on batch (forward+backward tok/s):
- batch 16: hidden 256 is fastest among SSM variants, about 748k; attention is still faster at about 1.39M.
- batch 32: hidden 64 is fastest, about 1.36M, beating attention's about 899k.
- batch 64: hidden 64 is fastest, about 1.46M, beating attention's about 1.18M.
This gives a better longer-memory recommendation than Iteration 15:
- use ssm_hidden_dim=16 for short seq-128 memory and lower memory pressure
- use ssm_hidden_dim=64 for seq-512 trained memory around batch 32/64
- keep hidden 256 as a possible batch-16 or legacy speed point, but not the general quality-aware default
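
The per-batch selection above can be reproduced mechanically from the seq-512 tables. A small sketch, using the rounded forward+backward values as recorded (names are illustrative):

```python
# (ssm_hidden_dim, batch) -> forward+backward tok/s, from the seq-512 hidden-state table.
ssm_fwd_bwd = {
    (16, 16): 701_000, (16, 32): 944_000, (16, 64): 961_000,
    (64, 16): 494_000, (64, 32): 1_360_000, (64, 64): 1_460_000,
    (256, 16): 748_000, (256, 32): 705_000, (256, 64): 788_000,
}
# batch -> attention forward+backward tok/s from the same benchmark.
attention_fwd_bwd = {16: 1_390_000, 32: 899_000, 64: 1_180_000}


def fastest_hidden_per_batch(table: dict) -> dict:
    """Pick the hidden dim with the highest throughput for each batch size."""
    best = {}
    for (hidden, batch), tok_s in table.items():
        if batch not in best or tok_s > table[(best[batch], batch)]:
            best[batch] = hidden
    return best


best = fastest_hidden_per_batch(ssm_fwd_bwd)
beats_attention = {
    batch: ssm_fwd_bwd[(hidden, batch)] > attention_fwd_bwd[batch]
    for batch, hidden in best.items()
}
print(best, beats_attention)
```

All variants reached 1.000 eval accuracy here, so throughput is the only axis that separates them in this table.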

LLM Iteration 17 - TaoData Real-Text Byte-Token Pilot

Reason for this iteration:

Synthetic previous-token and increment tasks were useful diagnostics, but they are not enough to judge LLM capability.
The remote server has a TaoData corpus at /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl.
No SentencePiece tokenizer artifact was found at the expected remote TaoTrain/TaoData tokenizer paths, so the first real-text benchmark used dependency-free byte tokenization.
Byte tokenization is not the final deployment tokenizer, but it gives a real-corpus next-token signal and exercises the same TaoNet model paths.

Implementation location:

TaoTrain commit: b8c4f3d Add real token TaoNet benchmark
TaoTrain: scripts/benchmark_taonet_real_tokens.py

What changed:

Added a remote-friendly real-token benchmark script that:
- reads JSONL or plain text
- supports TaoData-style text records
- supports byte tokenization and optional SentencePiece tokenization
- builds contiguous next-token batches from one long token stream
- reports eval loss, perplexity, token accuracy, throughput, and memory
- compares attention TaoNet against multiple SSM hidden sizes in one run
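
A minimal sketch of the data path described in this list, under stated assumptions: the JSONL field name "text", the offset of 3 special ids, and the EOS id are all hypothetical, and none of the function names come from benchmark_taonet_real_tokens.py.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple

NUM_SPECIAL = 3  # assumption: a few special ids are reserved below the 256 byte values
EOS_ID = 2       # hypothetical end-of-record id


def byte_tokenize(text: str) -> List[int]:
    """Dependency-free tokenization: one token per UTF-8 byte, shifted past special ids."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]


def token_stream(
    jsonl_lines: Iterable[str],
    text_key: str = "text",
    max_tokens: Optional[int] = None,
) -> List[int]:
    """Concatenate all records into one long token stream, separated by EOS."""
    tokens: List[int] = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        tokens.extend(byte_tokenize(json.loads(line)[text_key]))
        tokens.append(EOS_ID)
        if max_tokens is not None and len(tokens) >= max_tokens:
            return tokens[:max_tokens]
    return tokens


def next_token_windows(
    tokens: List[int], seq_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield contiguous (input, target) pairs where target is input shifted by one."""
    for start in range(0, len(tokens) - seq_len, seq_len):
        chunk = tokens[start : start + seq_len + 1]
        yield chunk[:-1], chunk[1:]


lines = ['{"text": "ab"}', '{"text": "c"}']
stream = token_stream(lines)
windows = list(next_token_windows(stream, seq_len=2))
print(stream, windows)
```

Windows do not overlap in their inputs, and a trailing remainder shorter than seq_len + 1 tokens is dropped.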

Validation:

Local CPU smoke passed on a plain text file with byte tokenization.
Remote RepoBridge runs completed on TaoData JSONL.

Remote benchmark:

RepoBridge run: taonet-vs-ssm-real-token-taodata-byte-pilot-20260429-164623
RepoBridge run: taonet-vs-ssm-real-token-taodata-byte-pilot-b64-20260429-164720
Data: /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
Tokenization: byte-level, vocab size 259
Data limit: first 2,000,000 byte tokens from up to 5,000 records
Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

Required TaoNet comparison:

Architecture	SSM hidden dim	Batch	Eval loss	Eval PPL	Eval accuracy	Forward tok/s	Forward+backward tok/s	Peak MB
attention TaoNet	n/a	16	about 2.549	about 12.80	about 0.260	about 2.03M	about 1.40M	about 585
SSM TaoNet, DPLR projected 64 + local shift	16	16	about 1.982	about 7.26	about 0.423	about 2.42M	about 564k	about 757
SSM TaoNet, DPLR projected 64 + local shift	64	16	about 1.928	about 6.88	about 0.440	about 2.16M	about 488k	about 803
attention TaoNet	n/a	32	about 2.523	about 12.47	about 0.266	about 2.13M	about 809k	about 1115
SSM TaoNet, DPLR projected 64 + local shift	16	32	about 1.879	about 6.55	about 0.455	about 4.43M	about 1.38M	about 1396
SSM TaoNet, DPLR projected 64 + local shift	64	32	about 1.848	about 6.35	about 0.457	about 3.97M	about 1.25M	about 1496
attention TaoNet	n/a	64	about 2.529	about 12.54	about 0.265	about 5.98M	about 2.03M	about 2190
SSM TaoNet, DPLR projected 64 + local shift	16	64	about 1.807	about 6.10	about 0.471	about 4.92M	about 1.67M	about 2686
SSM TaoNet, DPLR projected 64 + local shift	64	64	about 1.834	about 6.26	about 0.466	about 2.54M	about 1.52M	about 2882
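
The Eval PPL column is consistent with perplexity being the exponential of the eval loss, which suggests the loss is a natural-log cross entropy (an inference from the numbers, not something the record states). A short check on the rows above, plus the standard conversion to bits per byte for a byte-level model:

```python
import math

# (label, eval loss, reported eval PPL) copied from the table above.
rows = [
    ("attention b16", 2.549, 12.80),
    ("ssm h16 b16", 1.982, 7.26),
    ("ssm h64 b16", 1.928, 6.88),
    ("attention b32", 2.523, 12.47),
    ("ssm h16 b32", 1.879, 6.55),
    ("ssm h64 b32", 1.848, 6.35),
    ("attention b64", 2.529, 12.54),
    ("ssm h16 b64", 1.807, 6.10),
    ("ssm h64 b64", 1.834, 6.26),
]


def perplexity(loss_nats: float) -> float:
    """Perplexity from a cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float) -> float:
    """Convert a per-byte loss in nats to bits per byte."""
    return loss_nats / math.log(2)


for label, loss, reported in rows:
    print(f"{label}: exp(loss)={perplexity(loss):.2f} reported={reported:.2f} "
          f"bpb={bits_per_byte(loss):.3f}")
```

Small differences in the last digit are expected because the recorded losses are already rounded.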

Interpretation:

First real-corpus quality success: both SSM candidates beat attention on validation loss, perplexity, and byte-token accuracy after the same number of train steps.
Hidden 64 was best quality at batch 16/32, while hidden 16 was best quality at batch 64 and generally faster among SSM variants.
Speed tradeoff depends on batch:
- batch 16: attention forward+backward is faster (about 1.40M vs 564k tok/s for hidden 16), but SSM has much better validation quality.
- batch 32: hidden-16 SSM wins both quality and forward+backward throughput versus attention (about 1.38M vs 809k tok/s).
- batch 64: attention wins forward+backward throughput (about 2.03M vs 1.67M tok/s), while SSM wins validation quality.
This benchmark is byte-level, so it should be treated as a real-text pilot rather than the final TaoData tokenizer benchmark.
Next real-data step: train or locate the intended SentencePiece tokenizer, then rerun the same script with --tokenizer-type sentencepiece.

LLM Iteration 18 - TaoData SentencePiece Pilot And Per-Channel Local Shift

Reason for this iteration:

Byte-level TaoData results were encouraging but not the intended LLM tokenization.
No pre-existing tokenizer artifact was found on the remote server, so a pilot SentencePiece tokenizer was trained from TaoData.
The first 500-step SentencePiece run showed attention still ahead on validation loss at batch 32, even though SSM retained a token-accuracy edge.
Because no-shift SSM was worse, the local shift branch was helping; the next lightweight improvement was making the shift gain per-channel instead of one scalar.

Implementation location:

TaoTrain commit: 33747c1 Add TaoData pilot tokenizer config
TaoTrain commit: c519645 Add per-channel SSM local shift
TaoTrain: configs/tokenizer_taodata_pilot.yaml
TaoTrain: src/taoTrain/models/taonet_ssm.py
TaoTrain: scripts/benchmark_taonet_real_tokens.py

What changed:

Added a pilot tokenizer config:
- input: /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
- output: /home/student/YouZheng/tokenizers/taodata_pilot_8k
- vocab size: 8192
- max samples: 20000
Trained the remote tokenizer; output files:
- /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
- /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab
Added opt-in ssm_local_shift_per_channel.
The previous shift branch used one learned scalar for all model channels.
The new branch can use one learned gain per model channel while keeping the operation cheap: shift plus elementwise multiply.
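
A minimal pure-Python sketch of the "shift plus elementwise multiply" idea, covering both the scalar and the per-channel gain. How the branch output is combined with the SSM output in taonet_ssm.py is not stated here, so this shows only the branch itself; the function name and the [time][channel] list layout are assumptions.

```python
from typing import List, Sequence, Union


def local_shift(
    x: List[List[float]], gain: Union[float, Sequence[float]]
) -> List[List[float]]:
    """Delay a [time][channel] sequence by one step and scale it.

    A single float reproduces the scalar variant; a sequence with one value
    per channel reproduces the per-channel variant.
    """
    if not x:
        return []
    channels = len(x[0])
    if isinstance(gain, (int, float)):
        gains = [float(gain)] * channels
    else:
        gains = [float(g) for g in gain]
    if len(gains) != channels:
        raise ValueError("per-channel gain must have one entry per channel")
    # Position t sees the input from position t-1; position 0 sees zeros.
    shifted = [[0.0] * channels] + x[:-1]
    return [[g * v for g, v in zip(gains, row)] for row in shifted]


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(local_shift(x, 0.5))         # scalar gain shared by all channels
print(local_shift(x, [1.0, 0.0]))  # per-channel gain: channel 1 is switched off
```

The per-channel form adds only one parameter per model channel and keeps the cost at one shift and one elementwise multiply.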

Validation:

TaoTrain local tests: python -m pytest tests\test_taonet_ssm.py -q (6 passed).
Local real-token smoke with --ssm-local-shift-per-channel passed.
Remote tokenizer training completed. RepoBridge's local print path initially hit a Windows emoji encoding issue, but the tokenizer files were created successfully.

SentencePiece 150-step pilot:

RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-pilot-20260429-171228
Data: TaoData FineWeb JSONL
Tokenization: pilot SentencePiece 8k
Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

Architecture	SSM hidden dim	Batch	Eval loss	Eval PPL	Eval accuracy	Forward+backward tok/s
attention TaoNet	n/a	16	about 5.718	about 304	about 0.150	about 1.01M
SSM TaoNet	16	16	about 5.723	about 306	about 0.149	about 743k
SSM TaoNet	64	16	about 5.728	about 307	about 0.146	about 381k
attention TaoNet	n/a	32	about 5.533	about 253	about 0.156	about 842k
SSM TaoNet	16	32	about 5.505	about 246	about 0.165	about 771k
SSM TaoNet	64	32	about 5.561	about 260	about 0.158	about 1.09M
attention TaoNet	n/a	64	about 5.414	about 225	about 0.163	about 623k
SSM TaoNet	16	64	about 5.427	about 227	about 0.169	about 1.12M
SSM TaoNet	64	64	about 5.395	about 220	about 0.171	about 623k

SentencePiece 500-step batch-32 follow-up:

RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-b32-500step-20260429-171338
RepoBridge run without shift: taonet-ssm-real-token-taodata-spm-b32-500step-no-shift-20260429-171451
RepoBridge run with per-channel shift: taonet-vs-ssm-real-token-taodata-spm-b32-500step-channel-shift-20260429-171917
Config: batch 32, seq 512, 500 train steps, eval batches 16.

Variant	SSM hidden dim	Shift type	Eval loss	Eval PPL	Eval accuracy	Forward+backward tok/s
attention TaoNet	n/a	n/a	about 4.715	about 112	about 0.211	about 1.23M in first run, about 892k in per-channel run
SSM TaoNet	16	scalar	about 4.798	about 121	about 0.217	about 1.13M
SSM TaoNet	64	scalar	about 4.830	about 125	about 0.215	about 968k
SSM TaoNet	16	none	about 5.088	about 162	about 0.171	about 554k
SSM TaoNet	64	none	about 5.102	about 164	about 0.169	about 580k
SSM TaoNet	16	per-channel	about 4.782	about 119	about 0.218	about 784k
SSM TaoNet	64	per-channel	about 4.818	about 124	about 0.215	about 1.08M

Interpretation:

The SentencePiece pilot is more realistic and less favorable to SSM than the byte-level pilot.
SSM has a small token-accuracy edge at batch 32, but attention has the best 500-step validation loss/perplexity.
Removing local shift is clearly worse, so local shift is useful for real-token modeling too.
Per-channel shift is a small quality improvement over scalar shift:
- hidden 16 eval loss improved from about 4.798 to 4.782
- hidden 64 eval loss improved from about 4.830 to 4.818
Per-channel shift is not enough to surpass attention on 500-step SentencePiece validation loss.
Next model-improvement direction should target SSM language-modeling capacity or optimization, not just exact one-token memory:
- try larger ssm_mixer_dim such as 96/128 with h16/h64
- tune SSM learning rate/weight decay separately from attention
- test a small gated local convolution/projection branch if ternary deployment accepts it

LLM Iteration 19 - TaoData SentencePiece Mixer-Dimension Sweep

Reason for this iteration:

The 500-step SentencePiece batch-32 pilot showed SSM had a small token-accuracy edge, but attention still had better validation loss/perplexity.
The prior best SSM used ssm_mixer_dim=64, originally chosen from speed-focused scaling probes.
Because real-token quality may need more SSM channel capacity, this iteration swept projected mixer dimensions while keeping the same outer TaoNet dimensions.

Implementation location:

TaoTrain commit: 357336e Sweep SSM mixer dims in real token benchmark
TaoTrain: scripts/benchmark_taonet_real_tokens.py
RepoBridge config: repobridge.taonet.realspm.taodata.b32.500step.mixersweep.config.json

What changed:

Added --ssm-mixer-dims to the real-token benchmark.
The benchmark now records ssm_mixer_dim in the printed table and CSV.
Attention TaoNet is still evaluated once per batch, while SSM TaoNet can sweep multiple hidden and mixer dimensions in the same run.

Validation:

TaoTrain local syntax check: python -m py_compile scripts\benchmark_taonet_real_tokens.py passed.
TaoTrain local tests with the SSM repo on PYTHONPATH: python -m pytest tests\test_taonet_ssm.py -q (6 passed).
Local byte-token smoke with --ssm-mixer-dims 8,12 passed and wrote CSV/JSON outputs.

Remote benchmark:

RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-b32-500step-mixersweep-20260429-193729
Data: TaoData FineWeb JSONL
Tokenization: pilot SentencePiece 8k
Config: batch 32, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches, local shift enabled, per-channel shift enabled.

Architecture	SSM hidden dim	SSM mixer dim	Eval loss	Eval PPL	Eval accuracy	Forward+backward tok/s	Peak allocated MB
attention TaoNet	n/a	n/a	4.715	111.633	0.211	618k	2590
SSM TaoNet	16	64	4.780	119.046	0.218	1.13M	2887
SSM TaoNet	16	96	4.759	116.643	0.222	973k	3029
SSM TaoNet	16	128	4.719	112.088	0.224	782k	3192
SSM TaoNet	64	64	4.824	124.475	0.214	982k	2987
SSM TaoNet	64	96	4.761	116.917	0.219	479k	3131
SSM TaoNet	64	128	4.784	119.589	0.218	457k	3292

Interpretation:

Increasing the projected mixer dimension helped the best SSM real-token validation loss.
The best quality SSM in this run was ssm_hidden_dim=16, ssm_mixer_dim=128:
- validation loss 4.719, very close to attention 4.715
- token accuracy 0.224, above attention 0.211
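- forward+backward throughput about 782k tok/s, above attention's about 618k tok/s in this run; attention measured about 1.23M and about 892k tok/s on the same batch-32 protocol in the Iteration 18 runs, so the speed margin should be treated as noisy
Hidden dim 64 did not help in this batch-32 500-step SentencePiece setting; at every tested mixer dim it was slower and had higher eval loss than hidden dim 16.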
Mixer dim 64 remains the best SSM speed/quality tradeoff, but mixer dim 128 is now the best SSM quality candidate on real SentencePiece token modeling.
Next step should test whether hidden_dim=16, mixer_dim=128 remains strong at batch 16/64 and longer training, then try a narrow learning-rate sweep around it.
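
One way to make the speed/quality tradeoff in this sweep explicit is a Pareto filter over (eval loss, forward+backward tok/s). The sketch below uses the values from the table above; the row labels and function name are illustrative, and a single-run table like this does not account for run-to-run throughput noise.

```python
from typing import List, Tuple

# (label, eval loss, forward+backward tok/s) from the mixer-dimension sweep table.
rows: List[Tuple[str, float, int]] = [
    ("attention", 4.715, 618_000),
    ("ssm h16 m64", 4.780, 1_130_000),
    ("ssm h16 m96", 4.759, 973_000),
    ("ssm h16 m128", 4.719, 782_000),
    ("ssm h64 m64", 4.824, 982_000),
    ("ssm h64 m96", 4.761, 479_000),
    ("ssm h64 m128", 4.784, 457_000),
]


def pareto_front(table: List[Tuple[str, float, int]]) -> List[str]:
    """Keep rows that no other row beats on both loss (lower) and throughput (higher)."""
    front = []
    for name, loss, tok_s in table:
        dominated = any(
            other_loss <= loss
            and other_tok_s >= tok_s
            and (other_loss < loss or other_tok_s > tok_s)
            for _, other_loss, other_tok_s in table
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(rows)
print(front)
```

On these numbers every hidden-64 row is dominated by the hidden-16 row with the same mixer dim.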

LLM Iteration 20 - Attempted h16/m128 Batch Generalization Sweep

Reason for this iteration:

Iteration 19 found a strong real-token batch-32 point: ssm_hidden_dim=16, ssm_mixer_dim=128.
The user noted earlier that a single batch-size sweet spot can be misleading.
This iteration was meant to compare attention TaoNet vs SSM TaoNet at batch 16, 32, and 64 with the same 500-step SentencePiece protocol.

Implementation location:

TaoTrain commit used remotely: 357336e Sweep SSM mixer dims in real token benchmark
RepoBridge config: repobridge.taonet.realspm.taodata.h16m128.batchsweep.config.json

Planned remote benchmark:

Data: TaoData FineWeb JSONL
Tokenization: pilot SentencePiece 8k
Config: batch 16/32/64, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches
Attention baseline: taonet
SSM candidate: taonet_ssm, DPLR, ssm_hidden_dim=16, ssm_mixer_dim=128, local shift enabled, per-channel shift enabled

Remote status before run:

RepoBridge write guard passed.
RepoBridge preflight passed.
Remote GPU: RTX 5090 with about 21 GB free VRAM.
A same-user taodata process was present and using about 10.9 GB VRAM; no other users were detected.

Outcome:

The RepoBridge full run began, but the SFTP download phase failed with:
- Socket exception: An existing connection was forcibly closed by the remote host (10054)
- paramiko.ssh_exception.SSHException: Server connection dropped
Subsequent read-only RepoBridge SSH checks timed out with WinError 10060.
The new result folder did not appear in the partial local download, so no valid benchmark table was available to record.

Interpretation:

This was an infrastructure interruption, not a model failure.
Do not infer anything about h16/m128 batch generalization from this attempted run.
Next action when the remote server is reachable: rerun or download the run for taonet-vs-ssm-real-token-taodata-spm-h16m128-batchsweep.

Current LLM-Wrapper Best Configuration

Best current speed benchmark configuration:

architecture: taonet_ssm
SSM core: dplr
mixer projection: ssm_mixer_dim=64
SSM hidden dimension: 256
DPLR rank: 1
kernel mode: conv
dtype: bf16
benchmark task: synthetic next-token CE through TaoNet wrapper

Best current quality-aware token-memory configuration:

architecture: taonet_ssm
SSM core: dplr
mixer projection: ssm_mixer_dim=64
SSM hidden dimension: 16
DPLR rank: 1
kernel mode: conv
dtype: bf16
local shift: ssm_local_shift=True
benchmark task: previous token memory through TaoNet wrapper
evidence: perfect eval accuracy at batch 8, 16, 32, and 64 after 100 steps; best observed short-memory batch-64 SSM forward+backward throughput about 795k tok/s

Best current longer token-memory configuration:

architecture: taonet_ssm
SSM core: dplr
mixer projection: ssm_mixer_dim=64
SSM hidden dimension: 64
DPLR rank: 1
kernel mode: conv
dtype: bf16
local shift: ssm_local_shift=True
benchmark task: seq-512 previous token memory through TaoNet wrapper
evidence: perfect eval accuracy at batch 16, 32, and 64 after 100 steps; best observed batch-32 and batch-64 SSM forward+backward throughput about 1.36M and 1.46M tok/s, both above attention in the same task

Best current TaoData real-text pilot configuration:

architecture: taonet_ssm
SSM core: dplr
mixer projection: ssm_mixer_dim=128 for best current SentencePiece validation loss; ssm_mixer_dim=64 for speed/quality balance
SSM hidden dimension: 16
DPLR rank: 1
kernel mode: conv
dtype: bf16
local shift: ssm_local_shift=True
local shift gain: ssm_local_shift_per_channel=True
benchmark task: TaoData FineWeb JSONL, byte-level and pilot SentencePiece next-token prediction, seq 512
evidence:
- byte-level: lower validation loss/perplexity than attention at batch 16/32/64 after 150 steps; hidden-16 also beat attention forward+backward throughput at batch 32
- SentencePiece batch 32, 500 steps: ssm_hidden_dim=16, ssm_mixer_dim=128 reached eval loss about 4.719 vs attention about 4.715, with better token accuracy (0.224 vs 0.211) and higher forward+backward throughput (782k vs 618k tok/s)
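
The token-memory and real-text recommendations in this section can be collected into one lookup. This is a sketch only: the field names ssm_mixer_dim, ssm_hidden_dim, ssm_kernel_mode, ssm_local_shift, and ssm_local_shift_per_channel appear in this record, but the dataclass, the regime keys, and the assumption that TaoTrain accepts them in this exact shape are hypothetical. The speed-benchmark configuration is left out because its local-shift setting is not stated.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SSMWrapperConfig:  # hypothetical container, not a TaoTrain class
    architecture: str = "taonet_ssm"
    ssm_mixer_dim: int = 64
    ssm_hidden_dim: int = 16
    ssm_kernel_mode: str = "conv"
    ssm_local_shift: bool = True
    ssm_local_shift_per_channel: bool = False


# Regime names are illustrative labels for the blocks in this section.
RECOMMENDED = {
    "short_token_memory": SSMWrapperConfig(),
    "seq512_token_memory": SSMWrapperConfig(ssm_hidden_dim=64),
    "real_text_quality": SSMWrapperConfig(
        ssm_mixer_dim=128, ssm_local_shift_per_channel=True
    ),
    "real_text_balanced": SSMWrapperConfig(ssm_local_shift_per_channel=True),
}


def recommended(regime: str) -> dict:
    """Return the recorded recommendation for a regime as a plain dict."""
    if regime not in RECOMMENDED:
        raise ValueError(f"unknown regime {regime!r}; choose from {sorted(RECOMMENDED)}")
    return asdict(RECOMMENDED[regime])


print(recommended("real_text_quality"))
```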

Current best evidence:

At batch 4, seq 512, projected-64 DPLR reaches about 618k forward tok/s and 192k backward tok/s.
At batch 16, seq 512, projected-64 DPLR reaches about 2.12M forward tok/s and 702k backward tok/s.
Attention is still faster for backward at batch 16 in the same run: about 990k tok/s.
DPLR projected-64 forward can exceed attention in this benchmark, but training/backward still needs improvement.
Newer scaling rerun found a batch-32 sweet spot where projected-64 DPLR exceeded attention in both forward and forward+backward throughput:
- SSM forward about 2.62M tok/s vs attention about 1.88M
- SSM forward+backward about 705k tok/s vs attention about 632k

Important local artifact paths:

C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091624\taonet_token_benchmark.csv
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected64-scale\outputs-taotrain\taonet-token-dplr-proj64-scale-bench-20260429-091738\taonet_token_benchmark.csv
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091956\taonet_token_benchmark.csv

Recommended next LLM-wrapper targets:

Rerun the real SentencePiece benchmark for ssm_hidden_dim=16, ssm_mixer_dim=128 at batch 16/32/64 to check whether the gain generalizes beyond the batch-32 spot.
Optimize backward throughput in S4TernaryDPLRSSM; the forward path is now competitive at larger batch sizes.
Run a learning-rate and weight-decay sweep around the current best SSM real-token config, because the SSM and attention cores may not share the same optimum optimizer settings.
Investigate whether FFT/direct-response intermediates can be checkpointed or handled by a custom autograd function to improve backward speed.
Keep ternary deployment constraints in view: rank-1 DPLR factors still use ternary masks with learned amplitudes, and projected mixer dimensions should remain friendly to ternary compute layouts.

Version Timeline

Run	Notebook(s)	Commit printed in notebook	Device	Main purpose
`_r1`	`gamma_s4_sinewave_benchmark_r1.ipynb`	not printed	CUDA	First comparison of baseline, minimal, enhanced on simple sinewave task.
`_r2`	`gamma_s4_sinewave_benchmark_r2.ipynb`	not printed	CUDA	Harder multivariate long-range task; enhanced first became clearly promising.
`_r3`	`gamma-s4-sinewave-benchmark_r3.ipynb`	`6df3777`	CUDA	Quick benchmark after deployment-cache import fix; recurrent enhanced still very slow.
`_r4`	`gamma-s4-sinewave-benchmark_r4.ipynb`	`d6ebddc`	CUDA	Triangular-solve recurrent optimization; large recurrent speedup.
`_r5`	`gamma-s4-sinewave-benchmark_r5.ipynb`	`78ae31f`	CUDA	Added recurrent/full-output agreement metrics.
`_r6`	quick + research notebooks	`a2474cc` / `5952546`	CPU for quick, CUDA for research	Split quick/research benchmark; first practical long-context run showed conv path was too slow.
`_r7`	quick + research notebooks	`4b977c1` / `b17f72a`	CUDA	Faster conv kernel generation and cheaper research defaults.
`_r8`	quick + research notebooks	`73e76a7`	CUDA	Skipped unused final states, enabled baseline deploy metrics, enabled token-lite.
`_r9`	quick + research notebooks	`8738675` / `60562bd`	CUDA	Added research visuals; performance similar to `_r8`, now presentation-friendly.
`_r10`	quick + research notebooks	`09db0da` / `9ff7e4e`	CUDA	Added balanced deployment metrics to test a speed/fidelity point between full recurrent and deployment-lite.
`_r11`	quick + research + challenge notebooks	`64f8632` / `4842762` / `bfc6e26`	CUDA	Fixed AMP FFT path, split result tables, and added challenge benchmarks for permuted MNIST, selective copying, and induction-style recall.
`_r12`	quick + research + challenge notebooks	`740a9ef` / `0c6ecb8` / `11bd2e6`	CUDA	Tested the input-selection gate. Forecasting stayed strong, but challenge recall tasks remained near random.

`_r1` - First Simple Sinewave Comparison

Saved notebook:

output/jupyter-notebook/gamma_s4_sinewave_benchmark_r1.ipynb

Configuration recovered from notebook:

device: CUDA
task: simple 1D sinewave next-step prediction
seq_len=128
train_samples=512
val_samples=128
batch_size=32
epochs=10
d_model=1
hidden_dim=32
num_layers=2

Results:

Model	Params	Final val loss	Mean epoch s	Full ms	Full tokens/s	Recurrent ms	Recurrent tokens/s
`gamma_baseline`	134	0.170722	2.063	24.604	166477	51.177	80036
`gamma_s4_minimal`	138	0.019148	1.282	23.545	173967	54.462	75209
`gamma_s4_enhanced`	146	1.154002	1.330	23.389	175127	82.343	49743

Interpretation:

gamma_s4_minimal was best on this very simple task.
gamma_s4_enhanced was unstable or badly underfit here (final val loss 1.154 vs 0.019 for gamma_s4_minimal).
This run showed that the richer enhanced block can be harmful on small/simple tasks.
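
The throughput columns in the `_r1` table are consistent with tokens/s = batch_size * seq_len / latency, using the recorded batch_size=32 and seq_len=128. A short check (the function name is illustrative; small gaps come from the latency being rounded to three decimals):

```python
def tokens_per_second(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Throughput implied by one timed pass over a batch of sequences."""
    return batch_size * seq_len / (latency_ms / 1000.0)


# (model, full ms, reported full tokens/s, recurrent ms, reported recurrent tokens/s)
r1_rows = [
    ("gamma_baseline", 24.604, 166477, 51.177, 80036),
    ("gamma_s4_minimal", 23.545, 173967, 54.462, 75209),
    ("gamma_s4_enhanced", 23.389, 175127, 82.343, 49743),
]

for model, full_ms, full_tps, rec_ms, rec_tps in r1_rows:
    print(
        f"{model}: full {tokens_per_second(32, 128, full_ms):.0f} vs {full_tps}, "
        f"recurrent {tokens_per_second(32, 128, rec_ms):.0f} vs {rec_tps}"
    )
```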

`_r2` - Harder Multivariate Forecasting

Saved notebook:

output/jupyter-notebook/gamma_s4_sinewave_benchmark_r2.ipynb

Configuration recovered from notebook:

device: CUDA
task: harder multivariate synthetic forecasting
seq_len=512
num_features=8
train_samples=768
val_samples=192
batch_size=32
epochs=12
d_model=8
hidden_dim=64
num_layers=3

Results:

Model	Params	Final val loss	Mean epoch s	Full ms	Full tokens/s	Recurrent ms	Recurrent tokens/s
`gamma_baseline`	3192	0.006972	28.916	146.644	111726	305.446	53640
`gamma_s4_minimal`	3243	0.110654	17.194	121.234	135144	343.394	47712
`gamma_s4_enhanced`	3675	0.006302	17.191	131.929	124188	492.576	33262

Interpretation:

gamma_s4_enhanced became the best-quality model.
Enhanced training was much faster than baseline on this task (about 17.2 s vs 28.9 s per epoch).
Recurrent inference was still significantly slower than baseline (about 493 ms vs 305 ms).
This was the first strong evidence that the enhanced model is useful on harder sequence tasks.

`_r3` - Quick Benchmark With Deployment Cache Available

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r3.ipynb

Configuration:

device: CUDA
commit: 6df3777
quick tasks:
- simple: seq_len=192, features=4, epochs=4
- moderate: seq_len=320, features=6, epochs=5
models: gamma_baseline, gamma_s4_enhanced
enhanced: kernel_mode="auto", kernel_threshold=384, bilinear discretization

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent ms	Recurrent tokens/s	Deploy recurrent ms
simple	baseline	0.637628	2.199	5779	75.092	5114	not available
simple	enhanced	0.045667	2.450	9671	1024.471	375	997.631
moderate	baseline	0.533364	6.544	5729	192.084	3332	not available
moderate	enhanced	0.021113	6.584	6995	2815.223	227	2346.986

Interpretation:

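Enhanced quality was much better than baseline (val loss about 0.046 vs 0.638 on simple, about 0.021 vs 0.533 on moderate).
Full-sequence throughput was better for enhanced.
The enhanced recurrent path was catastrophically slow: about 1.0 s and 2.8 s on simple and moderate, versus about 75 ms and 192 ms for baseline.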
This run motivated recurrent-path optimization.

`_r4` - Triangular-Solve Recurrent Optimization

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r4.ipynb

Configuration:

device: CUDA
commit: d6ebddc
same quick tasks as _r3
key code change: bilinear recurrent stepping switched to a triangular-solve path
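
For context, a minimal sketch of why a triangular solve fits bilinear stepping. The bilinear update solves (I - dt/2 A) x_next = (I + dt/2 A) x + dt B u. If A is lower triangular, as the lower-bidiagonal Gamma transition is, the left-hand matrix is lower triangular too, so forward substitution replaces a general inverse. Whether commit d6ebddc does exactly this is an assumption, and the function name and toy values are hypothetical.

```python
from typing import List


def bilinear_step(
    A: List[List[float]], B: List[float], x: List[float], u: float, dt: float
) -> List[float]:
    """One bilinear (Tustin) state update for lower-triangular A via forward substitution."""
    n = len(x)
    half = 0.5 * dt
    # Right-hand side: (I + dt/2 A) x + dt * B * u, using only the lower triangle of A.
    rhs = [
        x[i] + half * sum(A[i][j] * x[j] for j in range(i + 1)) + dt * B[i] * u
        for i in range(n)
    ]
    # Solve (I - dt/2 A) x_next = rhs row by row; each row only needs earlier rows.
    x_next = [0.0] * n
    for i in range(n):
        acc = rhs[i] + half * sum(A[i][j] * x_next[j] for j in range(i))
        x_next[i] = acc / (1.0 - half * A[i][i])
    return x_next


# Toy lower-bidiagonal transition (hypothetical values).
A = [[-1.0, 0.0, 0.0], [1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]
B = [1.0, 0.0, 0.0]
x0 = [0.2, -0.1, 0.4]
print(bilinear_step(A, B, x0, u=1.0, dt=0.1))
```

Forward substitution costs O(n^2) per step for a dense lower triangle, and O(n) when A is bidiagonal, with no explicit matrix inverse.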

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent ms	Recurrent tokens/s	Deploy recurrent ms
simple	baseline	0.637628	2.219	6186	69.398	5533	not available
simple	enhanced	0.045667	2.288	9728	139.394	2755	104.623
moderate	baseline	0.533364	6.409	6182	110.415	5796	not available
moderate	enhanced	0.021113	6.630	9896	240.392	2662	185.037

Interpretation:

This was a major recurrent-inference improvement.
Enhanced recurrent latency dropped from about 1.0 s to about 139 ms on simple and from about 2.8 s to about 240 ms on moderate.
Enhanced remained slower than baseline in recurrent mode.
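
The size of the improvement, and the remaining gap to baseline, follow directly from the recurrent latency columns of the `_r3` and `_r4` tables. A short sketch (variable names are illustrative):

```python
# Recurrent latency in ms, copied from the _r3 and _r4 result tables.
enhanced_r3 = {"simple": 1024.471, "moderate": 2815.223}
enhanced_r4 = {"simple": 139.394, "moderate": 240.392}
baseline_r4 = {"simple": 69.398, "moderate": 110.415}

# How much faster the enhanced recurrent path became between _r3 and _r4.
speedup = {task: enhanced_r3[task] / enhanced_r4[task] for task in enhanced_r3}
# How much slower enhanced still is than baseline in _r4.
gap_to_baseline = {task: enhanced_r4[task] / baseline_r4[task] for task in enhanced_r4}

for task in speedup:
    print(
        f"{task}: {speedup[task]:.1f}x faster than _r3, "
        f"{gap_to_baseline[task]:.1f}x slower than baseline"
    )
```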

`_r5` - Agreement Metrics Added

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r5.ipynb

Configuration:

device: CUDA
commit: 78ae31f
same quick tasks as _r4
added:
- recurrent_match_mse
- deploy_match_mse

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent ms	Recurrent match MSE	Deploy recurrent ms	Deploy match MSE
simple	baseline	0.637628	2.317	6097	71.296	0.000000	not available	not available
simple	enhanced	0.045667	2.381	9963	141.656	0.008500	107.361	0.031251
moderate	baseline	0.533364	6.603	5912	114.832	0.000000	not available	not available
moderate	enhanced	0.021113	7.199	9465	242.692	0.007549	178.070	0.029995

Interpretation:

Enhanced remained much better in quality.
Full-sequence throughput favored enhanced.
Recurrent/deployment-lite speed improved but still trailed baseline.
Agreement metrics showed normal enhanced recurrent output was close to full forward; deployment-lite was faster but less faithful.
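
As a rough illustration of what the agreement metrics measure, the sketch below computes a mean squared difference between full-forward outputs and stepped outputs. The function name and the flat-list layout are assumptions; the notebook's own implementation is not shown in this record.

```python
def match_mse(full_outputs, stepped_outputs):
    """Mean squared difference between two equally long flat output sequences."""
    if not full_outputs or len(full_outputs) != len(stepped_outputs):
        raise ValueError("outputs must be non-empty and have the same length")
    total = sum((f - s) ** 2 for f, s in zip(full_outputs, stepped_outputs))
    return total / len(full_outputs)

full = [0.10, 0.20, 0.30, 0.40]
exact_recurrent = [0.10, 0.20, 0.30, 0.40]  # identical path, as for the baseline rows
approximate = [0.12, 0.18, 0.33, 0.37]      # made-up values for illustration only

print(match_mse(full, exact_recurrent))
print(match_mse(full, approximate))
```

A value of zero means the stepped path reproduces the full forward pass exactly, which is what the baseline rows show.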

`_r6` - Split Quick/Research Benchmark Era

`_r6` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r6.ipynb

Configuration:

device: CPU
commit: a2474cc
same quick tasks as _r5

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s
simple	baseline	0.240532	1.444	15225	8656
simple	enhanced	0.045714	2.598	20066	4278
moderate	baseline	0.056279	5.613	20720	11785
moderate	enhanced	0.021122	8.653	12149	2875

Interpretation:

This was a CPU run, so speed conclusions are not treated as primary benchmark evidence.
It was useful as a smoke test only.
The CPU result reminded us to warn clearly when notebooks are not running on GPU.
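
One possible shape for such a warning is sketched below. It takes the device name as a plain string so it stays independent of any framework; the function name and message are hypothetical and not taken from the notebooks.

```python
import warnings

def warn_if_not_gpu(device: str) -> bool:
    """Emit a warning and return True when the benchmark device is not CUDA."""
    on_gpu = device.lower().startswith("cuda")
    if not on_gpu:
        warnings.warn(
            f"Benchmark is running on {device!r}; speed numbers are smoke-test only.",
            RuntimeWarning,
            stacklevel=2,
        )
    return not on_gpu

if warn_if_not_gpu("cpu"):
    print("treat timings as smoke-test only")
```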

`_r6` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r6.ipynb

Configuration:

device: CUDA
commit: 5952546
research tasks:
- current_reference: seq_len=320, features=6, epochs=5
- long_context: seq_len=768, features=8, epochs=4
RUN_ABLATIONS=True
RUN_TOKEN_TASK=False

Results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
current_reference	baseline	0.709749	6.844	recurrent_like	6157	5522	not available
current_reference	enhanced	0.020500	7.366	recurrent_like	8431	2594	3408
long_context	baseline	27.229956	36.239	recurrent_like	2819	2939	not available
long_context	enhanced	0.012164	634.387	conv	358	1876	2501

Interpretation:

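Enhanced far outperformed baseline on validation loss (0.020500 vs 0.709749 on current_reference, 0.012164 vs 27.229956 on long_context).
However, the long-context conv path was extremely slow: 634.387 s per epoch and 358 full-sequence tokens/s, versus 36.239 s and 2819 tokens/s for baseline.
The ablation section was too expensive and was stopped midway.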
This run motivated the later kernel-generation speedup and disabling ablations by default.
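
The Expected mode column is consistent with a simple length-based switch. The sketch below assumes that kernel_mode="auto" selects the conv path once seq_len reaches kernel_threshold, and that the research notebook uses the same threshold of 384 as the _r3 configuration. Neither the comparison direction nor the shared threshold is confirmed by this record, and the function name is hypothetical.

```python
def expected_mode(seq_len: int, kernel_mode: str = "auto", kernel_threshold: int = 384) -> str:
    """Pick the execution path label for a sequence length.

    Assumption: "auto" switches to the conv path once seq_len reaches kernel_threshold.
    """
    if kernel_mode == "auto":
        return "conv" if seq_len >= kernel_threshold else "recurrent_like"
    if kernel_mode in ("conv", "recurrent_like"):
        return kernel_mode
    raise ValueError(f"unknown kernel_mode: {kernel_mode!r}")

for task, seq_len in [("current_reference", 320), ("long_context", 768)]:
    print(task, expected_mode(seq_len))
```

Under this assumption only long_context exercises the conv path for the enhanced model, which matches where the slowdown appeared.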

`_r7` - Conv Kernel Generation Improved

`_r7` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r7.ipynb

Configuration:

device: CUDA
commit: 4b977c1
same quick tasks

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
simple	baseline	0.637628	2.417	4977	5155	not available
simple	enhanced	0.045667	2.565	8583	2500	3260
moderate	baseline	0.533364	7.405	5413	5186	not available
moderate	enhanced	0.021113	7.796	7465	2414	3226

Interpretation:

Quick benchmark remained stable.
Enhanced retained quality and full-sequence throughput advantages.
Enhanced recurrent stepping remained slower than baseline.

`_r7` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r7.ipynb

Configuration:

device: CUDA
commit: b17f72a
RUN_ABLATIONS=False
RUN_TOKEN_TASK=False

Results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
current_reference	baseline	0.709749	7.351	recurrent_like	3821	4616	not available
current_reference	enhanced	0.020500	7.530	recurrent_like	9289	2780	3339
long_context	baseline	27.229956	39.236	recurrent_like	3523	3282	not available
long_context	enhanced	0.012029	44.189	conv	5971	1776	2229

Interpretation:

The conv speed issue was dramatically improved versus _r6.
Enhanced long-context epoch time dropped from about 634s to about 44s.
Enhanced was still slightly slower than baseline per epoch on long_context, but had much better loss and better full-sequence throughput.

`_r8` - No-State Full Forward, Baseline Deploy Metrics, Token-Lite Enabled

`_r8` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r8.ipynb

Configuration:

device: CUDA
commit: 73e76a7
same quick tasks
baseline deploy metrics became available
full-sequence training/inference skips unused final-state computation

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy tokens/s	Deploy match MSE
simple	baseline	0.637628	2.261	5711	5076	5241	0.000000
simple	enhanced	0.044817	2.550	8204	2621	3367	0.022886
moderate	baseline	0.533364	7.011	5782	5447	4519	0.000000
moderate	enhanced	0.020569	7.010	8926	2503	3390	0.018165

Interpretation:

Baseline deploy columns now populate.
Enhanced full-sequence throughput remained ahead.
Training time was tied on moderate.

`_r8` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r8.ipynb

Configuration:

device: CUDA
commit: 73e76a7
RUN_ABLATIONS=False
RUN_TOKEN_TASK=True

Forecasting results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
current_reference	baseline	0.709749	7.235	recurrent_like	5941	5581	5702
current_reference	enhanced	0.019951	7.177	recurrent_like	7431	1918	2336
long_context	baseline	27.229956	35.557	recurrent_like	3969	3759	3842
long_context	enhanced	0.011708	14.235	conv	19544	1860	2406

Token-lite results:

Model	Train CE	Val CE	Val PPL	Seq len	Train samples
baseline	3.587260	3.132184	22.924	192	1200
enhanced	2.483611	2.486829	12.023	192	1200

Interpretation:

This was the strongest practical result so far.
On long_context, enhanced was both much more accurate (val loss 0.011708 vs 27.229956) and about 2.5x faster per epoch (14.235 s vs 35.557 s).
Token-lite showed enhanced also transferred better to a language-like task.

`_r9` - Presentation Visuals Added

`_r9` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r9.ipynb

Configuration:

device: CUDA
commit: 8738675
same quick tasks as _r8

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
simple	baseline	0.637628	2.249	6058	5478	5502
simple	enhanced	0.044817	2.344	9550	2617	3644
moderate	baseline	0.533364	6.672	6324	5686	5599
moderate	enhanced	0.020569	6.571	9304	2771	3416

Interpretation:

Results were similar to _r8, with slightly better timings that are plausibly run-to-run variation.
Enhanced still wins on quality and full-sequence throughput.
Baseline still wins recurrent throughput.

`_r9` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r9.ipynb

Configuration:

device: CUDA
commit printed in notebook: 60562bd
visual sections added:
- task visual preview
- prediction comparison plots
- error comparison plots

Forecasting results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy tokens/s
current_reference	baseline	0.709749	7.294	recurrent_like	6111	4981	5359
current_reference	enhanced	0.019951	7.494	recurrent_like	8099	2185	3343
long_context	baseline	27.229956	37.885	recurrent_like	3576	3728	3695
long_context	enhanced	0.011708	14.717	conv	15654	1810	2327

Token-lite results:

Model	Train CE	Val CE	Val PPL	Seq len	Train samples
baseline	3.587260	3.132184	22.924	192	1200
enhanced	2.483611	2.486829	12.023	192	1200

Interpretation:

_r9 is the most presentation-friendly record.
It confirms the _r8 story:
- enhanced wins quality strongly
- enhanced wins full-sequence/conv long-context training and throughput
- baseline still wins recurrent deployment throughput
- token-lite favors enhanced

`_r10` - Balanced Deployment Path Added

`_r10` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r10.ipynb

Configuration:

device: CUDA
commit: 09db0da
same quick tasks as _r9
new metrics:
- balanced_deploy_recurrent_latency_ms
- balanced_deploy_recurrent_tokens_per_s
- balanced_deploy_match_mse

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
simple	baseline	0.637628	2.134	6053	5891	6034	5938	0.000000	0.000000
simple	enhanced	0.044817	2.532	9973	2507	3763	3123	0.022886	0.000986
moderate	baseline	0.533364	6.331	6134	5835	5512	5816	0.000000	0.000000
moderate	enhanced	0.020569	6.601	10045	2778	3510	2862	0.018165	0.000468

Interpretation:

Enhanced quality and full-sequence throughput remain strong.
Deployment-lite is still the fastest enhanced deployment variant.
Balanced deployment is slower than deployment-lite, but much more faithful to full forward.
Balanced deployment is useful as a fidelity-preserving approximation, not as a pure speed win.

`_r10` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r10.ipynb

Configuration:

device: CUDA
commit printed in notebook: 9ff7e4e
same research tasks as _r9
balanced deployment metrics added

Forecasting results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
current_reference	baseline	0.709749	7.648	recurrent_like	4933	not recorded in compact table	5092	4987	0.000000	0.000000
current_reference	enhanced	0.019951	8.193	recurrent_like	8152	not recorded in compact table	3404	2687	0.027752	0.000315
long_context	baseline	27.229956	40.350	recurrent_like	2395	not recorded in compact table	3397	3285	0.000000	0.000000
long_context	enhanced	0.011708	15.862	conv	16957	not recorded in compact table	2245	1886	0.200325	0.001692

Token-lite results:

Model	Train CE	Val CE	Val PPL	Seq len	Train samples
baseline	3.587260	3.132184	22.924	192	1200
enhanced	2.483611	2.486829	12.023	192	1200

Interpretation:

Long-context enhanced still wins strongly on validation loss and full-sequence throughput.
Balanced deployment drastically improves fidelity relative to deployment-lite on enhanced:
- long_context deploy-lite match MSE: 0.200325
- long_context balanced match MSE: 0.001692
However, balanced deployment is slower than deployment-lite.
This suggests the output projection is important for fidelity, while the input-dependent gate is a major recurrent-time cost.
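
To make the trade-off concrete, the sketch below computes how much fidelity balanced deployment gains and how much throughput it keeps on long_context, using only values copied from the _r10 research table above. The variable names are illustrative.

```python
# long_context, enhanced model, values copied from the _r10 research table.
deploy_lite = {"tokens_per_s": 2245, "match_mse": 0.200325}
balanced = {"tokens_per_s": 1886, "match_mse": 0.001692}

mse_reduction = deploy_lite["match_mse"] / balanced["match_mse"]
throughput_kept = balanced["tokens_per_s"] / deploy_lite["tokens_per_s"]

print(f"match MSE reduced {mse_reduction:.0f}x, throughput kept {throughput_kept:.0%}")
```

On these numbers the fidelity gain is about two orders of magnitude for a throughput cost of roughly one sixth.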

`_r11` - FFT Fix, Split Tables, And Challenge Benchmarks

`_r11` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r11.ipynb

Configuration:

device: CUDA
commit printed in notebook: 64f8632
same quick tasks as _r10
notebook tables split into normal, deployment-lite, and balanced deployment views

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
simple	baseline	0.637628	2.353	6162	5776	5524	5649	0.000000	0.000000
simple	enhanced	0.044817	2.113	11279	2625	3618	3112	0.022886	0.000986
moderate	baseline	0.533364	6.527	6187	5337	4572	5563	0.000000	0.000000
moderate	enhanced	0.020569	6.264	11434	2598	3338	2809	0.018165	0.000468

Interpretation:

Enhanced remains much better on validation loss and full-sequence throughput.
Baseline remains faster for exact recurrent stepping.
Deployment-lite is still the fastest enhanced recurrent approximation, while balanced is much more faithful.

`_r11` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r11.ipynb

Configuration:

device: CUDA
commit printed in notebook: 4842762
includes AMP FFT fix and split benchmark tables

Forecasting results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
current_reference	baseline	0.709749	6.922	recurrent_like	5898	5494	5162	5523	0.000000	0.000000
current_reference	enhanced	0.019951	6.383	recurrent_like	11200	2665	3522	2931	0.027752	0.000315
long_context	baseline	27.229956	36.928	recurrent_like	2994	2725	2513	2866	0.000000	0.000000
long_context	enhanced	0.011593	10.419	conv	235772	1849	2477	1542	0.193474	0.001699

Token-lite results:

Model	Train CE	Val CE	Val PPL	Seq len	Train samples
baseline	3.587260	3.132184	22.924	192	1200
enhanced	2.483604	2.486901	12.024	192	1200

Interpretation:

The AMP FFT fix worked: the long-context enhanced conv path completed and showed very high cached full-sequence throughput.
Enhanced long-context training is now much faster than baseline in this setup and far more accurate.
Recurrent deployment remains the weak point: enhanced exact recurrent throughput is still lower than baseline.
Balanced deployment remains the best fidelity-preserving approximation, but it is slower than deployment-lite.

`_r11` Challenge Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-challenge-benchmark_r11.ipynb

Configuration:

device: CUDA
commit printed in notebook: bfc6e26
first saved run for the challenge benchmark notebook
tasks:
- permuted MNIST
- selective copying
- induction-style associative recall

Results:

Task	Model	Val loss	Val accuracy	Epoch s	Forward ms	Forward tokens/s
permuted_mnist	baseline	2.760003	0.206000	112.318	425.084	118038
permuted_mnist	enhanced	2.041562	0.232000	35.042	7.950	6311750
selective_copying	baseline	3.529677	0.039551	6.509	73.482	222965
selective_copying	enhanced	3.468455	0.029622	2.149	2.739	5981919
induction_recall	baseline	3.615992	0.040039	6.424	72.235	226816
induction_recall	enhanced	3.519182	0.033203	2.061	2.673	6130411

Interpretation:

Enhanced is much faster on the challenge forward benchmark because the full-sequence conv path is active.
Permuted MNIST slightly favors enhanced on both loss and accuracy, but both accuracies are still low.
Selective copying and induction recall are near random accuracy:
- selective copying random accuracy is about 1 / 32 = 0.03125
- induction recall random accuracy is about 1 / 32 = 0.03125
Enhanced often has lower CE but not consistently higher accuracy, suggesting it is learning distributional smoothing before reliable exact recall.
This is the clearest evidence so far that pure LTI Gamma SSM structure is not enough for Mamba-style selective memory tasks. The next model improvement should add selective input flow while keeping the fixed Gamma transition.
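
A quick way to read the accuracy column is as a ratio to chance. The sketch below assumes balanced classes, so chance accuracy is 1 / num_classes, and uses the 32-class figure stated above. The function names are hypothetical.

```python
def chance_accuracy(num_classes: int) -> float:
    """Accuracy of uniform random guessing over num_classes balanced classes."""
    return 1.0 / num_classes

def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """Ratio of observed accuracy to chance; about 1.0 means no better than guessing."""
    return accuracy / chance_accuracy(num_classes)

# Accuracies from the _r11 challenge table; 32 classes as stated above.
rows = [
    ("selective_copying", "baseline", 0.039551),
    ("selective_copying", "enhanced", 0.029622),
    ("induction_recall", "baseline", 0.040039),
    ("induction_recall", "enhanced", 0.033203),
]
for task, model, acc in rows:
    print(task, model, round(lift_over_chance(acc, 32), 2))
```

All four ratios land close to 1, which is the sense in which these tasks are near random.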

`_r12` - Input-Selection Gate Tested

`_r12` Quick Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-sinewave-benchmark_r12.ipynb

Configuration:

device: CUDA
commit printed in notebook: 740a9ef
enhanced model includes the new pre-SSM input-selection gate

Results:

Task	Model	Val loss	Mean epoch s	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
simple	baseline	0.637628	2.249	5923	4449	5323	5568	0.000000	0.000000
simple	enhanced	0.043149	2.184	10904	2438	3751	3099	0.021473	0.001503
moderate	baseline	0.450424	6.747	5908	4689	5084	5381	0.000000	0.000000
moderate	enhanced	0.020161	6.264	8135	2357	3771	2944	0.076783	0.001143

Interpretation:

The input-selection gate did not hurt quick-task quality; enhanced still wins validation loss clearly.
Exact recurrent enhanced slowed slightly due to the extra gate.
Deployment-lite mismatch worsened on moderate, but balanced deployment remained much more faithful.
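
The record does not give the form of the gate. As a purely illustrative sketch, one common shape for a pre-SSM input gate is a per-feature sigmoid computed from the current input and multiplied back onto it. Everything below, including the function names and the gating formula, is an assumption rather than the model's actual code.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def input_selection_gate(x, weights, bias):
    """Scale each input feature by a sigmoid gate computed from the same time step.

    x: feature vector at one time step; weights: square matrix (list of rows); bias: vector.
    Hypothetical form: gated[i] = sigmoid(weights[i] . x + bias[i]) * x[i].
    """
    gated = []
    for i, x_i in enumerate(x):
        logit = sum(w * v for w, v in zip(weights[i], x)) + bias[i]
        gated.append(sigmoid(logit) * x_i)
    return gated

# Zero weights and bias give a gate of 0.5 on every feature.
print(input_selection_gate([2.0, -4.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))
```

Because a gate of this kind acts on the input before the SSM, the state transition itself stays fixed, but each recurrent step pays for the extra gate computation.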

`_r12` Research Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-research-benchmark_r12.ipynb

Configuration:

device: CUDA
commit printed in notebook: 0c6ecb8

Forecasting results:

Task	Model	Val loss	Mean epoch s	Expected mode	Full tokens/s	Recurrent tokens/s	Deploy-lite tokens/s	Balanced deploy tokens/s	Deploy-lite match MSE	Balanced match MSE
current_reference	baseline	0.709749	7.298	recurrent_like	5513	5460	5212	5506	0.000000	0.000000
current_reference	enhanced	0.020813	6.753	recurrent_like	10218	2261	3414	3018	0.023053	0.000478
long_context	baseline	12.850342	37.374	recurrent_like	3720	3776	3661	3706	0.000000	0.000000
long_context	enhanced	0.011039	11.212	conv	229320	1589	2390	2043	0.034069	0.001689

Token-lite results:

Model	Train CE	Val CE	Val PPL	Seq len	Train samples
baseline	2.943133	6.186983	486.377	192	1200
enhanced	2.489687	2.490702	12.070	192	1200

Interpretation:

Enhanced remains excellent for long-context forecasting.
Input-selection slightly improved long_context val loss versus _r11 (0.011039 vs 0.011593) but worsened exact recurrent speed.
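Token-lite strongly favors enhanced in this run, though the baseline result appears unstable: its val CE rose from 3.132184 in _r11 to 6.186983 while its train CE fell from 3.587260 to 2.943133.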

`_r12` Challenge Notebook

Saved notebook:

output/jupyter-notebook/gamma-s4-challenge-benchmark_r12.ipynb

Configuration:

device: CUDA
commit printed in notebook: 11bd2e6
same challenge tasks as _r11, with input-selection gate active in enhanced

Results:

Task	Model	Val loss	Val accuracy	Epoch s	Forward ms	Forward tokens/s
permuted_mnist	baseline	2.760003	0.206000	114.183	511.946	98010
permuted_mnist	enhanced	2.052564	0.209000	34.080	8.393	5978110
selective_copying	baseline	3.514275	0.028320	7.319	73.808	221983
selective_copying	enhanced	3.468793	0.029785	2.353	2.988	5482614
induction_recall	baseline	3.607873	0.038086	7.439	72.952	224585
induction_recall	enhanced	3.535175	0.039062	2.788	2.925	5600445

Interpretation:

The input-selection gate did not produce a meaningful challenge-task accuracy breakthrough.
Permuted MNIST accuracy stayed low and did not improve over _r11: enhanced fell from 0.232 to 0.209, while baseline was unchanged at 0.206.
Selective copying and induction recall are still near random. With 32 classes, random accuracy is about 0.03125.
The enhanced model still has much better forward throughput and somewhat lower CE, but accuracy shows it is not performing reliable exact recall.
This suggests two things:
- permuted MNIST likely needs more epochs and/or more samples
- selective copying and induction need a stronger selective/content-dependent memory mechanism or a curriculum diagnostic, not just more epochs

Versions Not Recorded

The following are not recorded as complete benchmark versions:

Research notebooks before _r6: no saved research _r1 to _r5 notebooks exist in the repo.
Any temporary failed Colab runs during error debugging: tracebacks were discussed in chat, but they are not treated as experiment records.
Partial long-context ablation run in _r6: only partial output is present, so it is not summarized as a completed ablation result.

Current Best Summary

Best presentable run:

_r12 research benchmark

Most important result:

On long_context, gamma_s4_enhanced achieved much lower validation loss than baseline and substantially better full-sequence throughput.
_r11 shows the fixed AMP FFT conv path completing successfully and producing very high cached full-sequence throughput on long_context.
_r12 confirms the input-selection gate alone is not enough to solve selective copying or induction recall beyond near-random accuracy.

Current limitation:

gamma_s4_enhanced still trails gamma_baseline in recurrent token-by-token deployment throughput.
Challenge benchmarks show that the current model needs stronger selective/content-dependent memory mechanisms.

Recommended next improvement targets:

Add challenge-task curriculum diagnostics and longer token-memory epochs.
Explore stronger content-dependent memory beyond static LTI convolution, while preserving the fixed Gamma transition when possible.
Optimize the recurrent/deployment path for gamma_s4_enhanced.
Improve deployment-lite fidelity, especially on long_context.
Improve structured Gamma kernel generation for the conv/full-sequence path.
