
Gamma SSM / Gamma-S4 Experiment Record

This file records the experiment versions saved as _rN notebooks under output/jupyter-notebook/. It also records the later TaoNet-SSM LLM-wrapper and remote RTX benchmark iterations.

The goal is to preserve:

  • which model version was tested
  • which notebook/task configuration was used
  • the main performance results
  • what we learned from each run

For runs where the saved notebook does not contain enough information, the version is marked as not recorded.

Model Names

  • gamma_baseline: original Gamma SSM using the fixed lower-bidiagonal Gamma transition and recurrent execution.
  • gamma_s4_minimal: lighter S4-inspired Gamma block. Used in early experiments, later dropped from the main loop because it was not consistently strong.
  • gamma_s4_enhanced: main S4-inspired Gamma model with learned dt, stable discretization, D skip, optional gating/output path, full-sequence/kernel mode, and recurrent stepping.

Metrics

  • val_loss: validation loss for forecasting tasks. Lower is better.
  • mean_epoch_time_s: average training epoch time. Lower is better.
  • full_forward_ms / full_latency_ms: whole-sequence forward/inference latency. Lower is better.
  • full_forward_tokens_per_s / full_tokens_per_s: whole-sequence throughput. Higher is better.
  • recurrent_inference_ms / recurrent_latency_ms: token-by-token recurrent latency. Lower is better.
  • recurrent_tokens_per_s: token-by-token recurrent throughput. Higher is better.
  • deploy_*: deployment-lite recurrent path. For the baseline, the deployment and recurrent paths are the same code path, so once baseline deploy metrics were enabled they measure the same thing as the recurrent metrics.
  • val_ce: validation cross entropy for token prediction. Lower is better.
  • val_ppl: validation perplexity for token prediction. Lower is better.
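
The throughput and perplexity metrics follow from their raw measurements by simple formulas. A minimal sketch, assuming throughput counts batch x sequence-length tokens per timed call and perplexity is the exponential of the mean cross entropy in nats (the helper names are illustrative and not taken from the notebooks):

```python
import math


def tokens_per_second(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Tokens processed per second for one timed call over a full batch."""
    return batch_size * seq_len / (latency_ms / 1000.0)


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity implied by a mean cross entropy measured in nats."""
    return math.exp(cross_entropy_nats)


# Batch 4, seq 512 at 2.58 ms per call is roughly 794k tok/s.
print(round(tokens_per_second(4, 512, 2.58)))
print(perplexity(0.0))  # zero cross entropy corresponds to perplexity 1
```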

TaoNet-SSM LLM Wrapper Iterations

This section records the work that moved the SSM from standalone/notebook benchmarks into the TaoNet LLM comparison loop. The main implementation repo for SSM changes is this repo. The TaoNet wrapper lives in the local TaoTrain repo and branch listed below.

Related repos and branches:

  • SSM repo: https://github.com/StarMists/gamma_SSM_S4_enhanced.git
  • SSM local path: C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM
  • TaoTrain repo: https://github.com/lobakkang/TaoTrain.git
  • TaoTrain local path: C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain
  • TaoTrain branch: codex/taonet-ssm-core
  • Remote server path for SSM: /home/student/YouZheng/gamma_ssm_repo
  • Remote server path for TaoTrain: /home/student/YouZheng/repo
  • Remote execution tool: C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge

LLM Iteration 1 - Add TaoNet SSM Wrapper

Implementation location:

  • TaoTrain: src/taoTrain/models/taonet_ssm.py
  • TaoTrain: src/taoTrain/config.py
  • TaoTrain: src/taoTrain/models/registry.py
  • TaoTrain: tests/test_taonet_ssm.py
  • TaoTrain: scripts/benchmark_taonet_token_variants.py

TaoTrain commits:

  • 8b1c6fa Add TaoNet Gamma SSM architecture
  • 6edd09e Benchmark TaoNet token SSM variants

What changed:

  • Added a taonet_ssm model architecture for an apples-to-apples comparison with the original attention taonet.
  • Kept the outer LLM stack close to TaoNet and replaced the sequence-mixing core with an SSM mixer.
  • Supported both gamma_s4 and dplr SSM cores.
  • Added token-level synthetic CE benchmark comparing taonet and taonet_ssm.
  • Added focused tests for SSM wrapper construction and forward passes.

Local validation:

  • python -m pytest tests\test_taonet_ssm.py -q passed.
  • Broader TaoTrain tests were not run locally because the local environment was missing datasets.

Result:

  • Functional success. This established the comparison harness.
  • Performance was not yet acceptable with full-width DPLR because the wrapper exposed dense DPLR frequency-transfer cost.

LLM Iteration 2 - Projected SSM Mixer Dimension

Implementation location:

  • TaoTrain: src/taoTrain/models/taonet_ssm.py
  • TaoTrain: src/taoTrain/config.py
  • TaoTrain: tests/test_taonet_ssm.py

TaoTrain commit:

  • 5e6b802 Add projected SSM mixer dimension

What changed:

  • Added ssm_mixer_dim.
  • The SSM branch now supports d_model -> ssm_mixer_dim -> SSM -> d_model.
  • This keeps the LLM interface the same while reducing the DPLR channel width.
  • This is important because DPLR convolutional training cost scales strongly with the channel dimension.

Remote benchmark config examples:

  • RepoBridge projected 128: repobridge.taonet.tokenbench.projected128.config.json
  • RepoBridge projected 64: repobridge.taonet.tokenbench.projected64.config.json

Important results before SSM-core optimization:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB | Interpretation |
|---|---|---|---|---|---|---|
| attention TaoNet | 4 | 512 | about 1.24M | about 280k | about 376 | Baseline comparison point. |
| DPLR full-width mixer 256 | 4 | 512 | about 81k | about 20k | about 6200 | Failed: dense transfer path too slow and memory-heavy. |
| DPLR projected mixer 128 | 4 | 512 | about 214k | about 56k | about 3613 | Better memory, still much slower than attention. |
| DPLR projected mixer 64 | 4 | 512 | about 114k | about 54k | about 2500 | Lower memory but worse forward before core optimization. |

Result:

  • Success as an architectural control: projection made DPLR usable enough to iterate.
  • Not sufficient alone: the DPLR core still needed direct frequency-response optimization.

LLM Iteration 3 - Add Scripted SSM Benchmarks

Implementation location:

  • SSM: scripts/benchmark_ssm_variants.py
  • SSM: .gitignore

SSM commits:

  • 7a90525 Add lightweight SSM benchmark script
  • c0dede8 Ignore generated benchmark outputs

What changed:

  • Added a Python benchmark script for baseline, gamma_s4, and dplr.
  • Measures forward, optional forward+backward, and optional recurrent stepping.
  • Writes JSON and CSV outputs.
  • Ignored generated benchmark result directories.
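
The measurement loop of such a script can be sketched as follows. This is a minimal stdlib-only illustration, not the contents of scripts/benchmark_ssm_variants.py: the function names, the toy workload, and the output fields are all assumptions made for the example.

```python
import csv
import io
import json
import time
from typing import Callable, Dict, List


def time_callable(fn: Callable[[], object], tokens: int,
                  warmup: int = 2, iters: int = 5) -> Dict[str, float]:
    """Time fn() after warmup and convert mean latency into tokens per second."""
    for _ in range(warmup):
        fn()
    # On a GPU, the device must be synchronized before each clock read.
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    mean_s = max((time.perf_counter() - start) / iters, 1e-12)
    return {"mean_ms": mean_s * 1000.0, "tokens_per_s": tokens / mean_s}


def serialize_results(rows: List[Dict[str, object]]) -> Dict[str, str]:
    """Render rows as JSON and CSV text (a real script would write files)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return {"json": json.dumps(rows, indent=2), "csv": buffer.getvalue()}


def toy_workload() -> int:
    # Stand-in for a model forward (or forward+backward) pass.
    return sum(i * i for i in range(10_000))


rows = []
for mode in ("forward", "forward+backward"):
    stats = time_callable(toy_workload, tokens=4 * 512)
    rows.append({"model": "toy", "batch": 4, "seq": 512, "mode": mode, **stats})

outputs = serialize_results(rows)
print(outputs["csv"])
```

Warmup iterations are excluded from the mean so that one-time setup costs do not distort the reported throughput.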

Remote raw DPLR result:

| Model | Batch | Seq | Mode | Tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR raw SSM | 4 | 512 | forward | about 841k | about 1310 |
| DPLR raw SSM | 4 | 512 | forward+backward | about 101k | about 1310 |
| DPLR raw recurrent | 4 | 512 | recurrent | about 97k | about 10 |

Interpretation:

  • Raw DPLR SSM was promising.
  • The wrapped LLM bottleneck came from how the DPLR convolutional path scaled under the TaoNet stack, not from the idea of DPLR alone.

LLM Iteration 4 - Direct DPLR Frequency-Response Application

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

  • 2b204e8 Apply DPLR frequency response directly

What changed:

  • Added a direct training path that applies the DPLR frequency response to the FFT input.
  • Avoided materializing the full dense transfer tensor shaped roughly freq x channels x channels during training/grad runs.
  • Kept the old dense transfer path for eval/no-grad caching.
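
A rough size estimate shows why avoiding the dense tensor matters. The sketch below assumes one complex64 entry (8 bytes) per element of a freq x channels x channels tensor and an rfft over a zero-padded length of 2 x seq_len. Both are illustrative assumptions rather than values read from the repo, and the estimate ignores any extra copies kept for the backward pass.

```python
def dense_transfer_bytes(seq_len: int, channels: int, bytes_per_elem: int = 8) -> int:
    """Bytes for a dense (freq, channels, channels) complex tensor."""
    fft_len = 2 * seq_len         # assumed zero-padding for a causal convolution
    freq_bins = fft_len // 2 + 1  # number of rfft frequency bins
    return freq_bins * channels * channels * bytes_per_elem


for channels in (256, 128, 64):
    mib = dense_transfer_bytes(512, channels) / 2**20
    print(f"channels={channels:4d}  dense transfer ~ {mib:7.1f} MiB per layer")
```

Because the size is quadratic in the channel count, halving the mixer width shrinks this tensor by a factor of four.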

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
  • Local CPU smoke benchmark with backward passed.

Projected-128 remote result after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| attention TaoNet | 4 | 512 | about 1.32M | about 532k | about 376 |
| DPLR projected mixer 128 | 4 | 512 | about 151k | about 91k | about 508 |

Interpretation:

  • Major memory success: projected-128 DPLR dropped from about 3613 MB to about 508 MB.
  • Training throughput improved from about 56k to about 91k tok/s.
  • Forward-only became slower than the previous projected-128 run, so this change helped training/backward much more than no-grad forward timing.

LLM Iteration 5 - Specialize Rank-One DPLR Solve

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

  • 5a0abad Specialize rank-one DPLR solve

What changed:

  • Current best DPLR configuration uses rank=1.
  • Replaced the batched torch.linalg.inv for 1 x 1 low-rank systems with scalar reciprocal math.
  • Applied the specialization to both direct training and cached dense response paths.
  • Left the general rank path intact for rank > 1.
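
The rank-one case can be illustrated with the Sherman-Morrison identity, where the low-rank correction reduces to one scalar division instead of a matrix inverse. The sketch below is a generic plain-Python illustration for a diagonal-plus-rank-one system. The function and variable names are hypothetical, and this is not the code in s4_ternary_dplr_ssm.py.

```python
from typing import List

Vector = List[complex]


def solve_diag_plus_rank_one(d: Vector, p: Vector, q: Vector, b: Vector) -> Vector:
    """Solve (diag(d) + p q^T) x = b using a single scalar reciprocal."""
    dinv_b = [bi / di for bi, di in zip(b, d)]
    dinv_p = [pi / di for pi, di in zip(p, d)]
    # For rank 1 the inner "small system" is 1 x 1, so inverting it is a division.
    denom = 1.0 + sum(qi * v for qi, v in zip(q, dinv_p))
    scale = sum(qi * v for qi, v in zip(q, dinv_b)) / denom
    return [x0 - scale * v for x0, v in zip(dinv_b, dinv_p)]


def apply_diag_plus_rank_one(d: Vector, p: Vector, q: Vector, x: Vector) -> Vector:
    """Compute (diag(d) + p q^T) x, used here only to check the solve."""
    qx = sum(qi * xi for qi, xi in zip(q, x))
    return [di * xi + pi * qx for di, pi, xi in zip(d, p, x)]


d = [2 + 1j, 3 - 1j, 1 + 0.5j]
p = [0.5, -0.25j, 0.1]
q = [0.2j, 0.3, -0.4]
b = [1.0, 2.0 - 1j, 0.5j]

x = solve_diag_plus_rank_one(d, p, q, b)
reconstructed = apply_diag_plus_rank_one(d, p, q, x)
residual = max(abs(r - bi) for r, bi in zip(reconstructed, b))
print(f"max residual: {residual:.2e}")
```

For rank > 1 the same identity needs a genuine small matrix inverse, which is why a general path would still be kept alongside the specialization.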

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
  • Local CPU smoke benchmark with backward passed.

Projected-128 remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| attention TaoNet | 4 | 512 | about 1.33M | about 545k | about 376 |
| DPLR projected mixer 128 | 4 | 512 | about 485k | about 142k | about 508 |

Projected-64 remote result after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR projected mixer 64 | 4 | 512 | about 618k | about 192k | about 494 |

Scaling probe for projected-64:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| attention TaoNet | 16 | 512 | about 1.16M | about 990k | about 1332 |
| DPLR projected mixer 64 | 16 | 512 | about 2.12M | about 702k | about 1684 |

Interpretation:

  • Major success.
  • DPLR projected-64 became the best current SSM LLM configuration.
  • At batch 16, DPLR projected-64 forward throughput exceeded attention in this synthetic benchmark.
  • Backward was still behind attention, but the gap narrowed substantially.
  • The SSM now scales much better with batch size, suggesting fixed frequency-response overhead is being amortized.

LLM Iteration 6 - Precompose Finite Response Projection

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commits:

  • f09a71b Precompose DPLR finite response projection
  • 648a32e Revert "Precompose DPLR finite response projection"

What changed:

  • Tried replacing C @ (I - z^L A^L) @ response with two projected terms:
    • C @ response
    • (C @ A^L) @ response
  • The goal was to reduce one batch/frequency hidden-state multiplication in the direct path.
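
The two forms are algebraically equal by distributivity: C (I - z^L A^L) r = C r - z^L (C A^L) r. A small numerical check in plain Python, using made-up 2 x 2 values and treating z^L as a scalar factor (an interpretation of the expression above, not code from the repo):

```python
from typing import List

Matrix = List[List[complex]]
Vector = List[complex]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a: Matrix, v: Vector) -> Vector:
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def matpow(a: Matrix, power: int) -> Matrix:
    n = len(a)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        out = matmul(out, a)
    return out


A = [[0.5, 0.1], [-0.2, 0.3]]
C = [[1.0, -1.0]]
response = [0.7 + 0.2j, -0.4j]
z = complex(0.6, 0.8)  # a point on the unit circle
L = 4

A_L = matpow(A, L)
z_L = z ** L

# Original form: C @ (I - z^L A^L) @ response
corrected = [r - z_L * v for r, v in zip(response, matvec(A_L, response))]
original = matvec(C, corrected)

# Precomposed form: C @ response - z^L * (C @ A^L) @ response
C_AL = matmul(C, A_L)
precomposed = [u - z_L * w
               for u, w in zip(matvec(C, response), matvec(C_AL, response))]

max_diff = max(abs(a - b) for a, b in zip(original, precomposed))
print(f"max abs difference: {max_diff:.2e}")
```

Equality of the results says nothing about speed; which form runs faster depends on tensor shapes and kernel launch patterns, which is what the remote benchmark measures.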

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
  • Local smoke benchmark passed.
  • Direct-vs-cached convolution comparison had max absolute difference around 2.4e-7.

Remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
| DPLR projected mixer 64 with this change | 4 | 512 | about 495k | about 162k | about 478 |

Interpretation:

  • Failed on real GPU token benchmark.
  • It saved a little memory but reduced speed too much.
  • The commit was intentionally reverted, so current SSM main is back to the best-performing rank-one direct-response core.

LLM Iteration 7 - Rank-One Matmul Fast Path

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commits:

  • 43de801 Use matmul fast path for rank-one DPLR
  • 9ffa5a7 Gate rank-one matmul path by batch size
  • 4e130b6 Limit rank-one matmul path to small batches
  • 8969916 Revert "Limit rank-one matmul path to small batches"
  • 5b3a957 Revert "Gate rank-one matmul path by batch size"
  • a46a2af Revert "Use matmul fast path for rank-one DPLR"

What changed:

  • Tried a deeper rank=1 direct-application specialization.
  • Replaced several generic einsum operations with batched matmul and vector reductions.
  • The goal was to reduce Python/operator overhead and improve backward throughput for the current best DPLR rank.
  • A follow-up tried to gate the path by batch size after the batch-16 scaling run regressed.

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed.
  • Local CPU smoke benchmark passed.
  • Direct-vs-cached convolution comparison had max absolute difference around 2.4e-7.

Remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
| DPLR projected mixer 64 first matmul run | 4 | 512 | about 643k | about 208k | about 494 |
| DPLR projected mixer 64 repeated small-batch gated run | 4 | 512 | about 470k-472k | about 161k-175k | about 494 |
| DPLR projected mixer 64 matmul at batch 16 | 16 | 512 | about 1.47M | about 388k | about 1684 |
| DPLR projected mixer 64 previous best at batch 16 | 16 | 512 | about 2.12M | about 702k | about 1684 |

Interpretation:

  • Failed overall.
  • The first batch-4 run looked promising, but repeated remote results were worse.
  • The matmul formulation regressed the larger-batch scaling behavior that matters for GPU utilization.
  • All matmul fast-path commits were reverted, so current SSM main returns to the best-known 5a0abad rank-one scalar-solve behavior plus the experiment-record commits.

LLM Iteration 8 - TileLang Capability Detection

Implementation location:

  • SSM: csrc/tilelang/selective_scan.py
  • SSM: csrc/tilelang/__init__.py
  • SSM: gamma_space_model/ops/selective_scan_interface.py
  • SSM: gamma_space_model/modules/ssm_gamma.py
  • SSM: scripts/diagnose_tilelang_acceleration.py

SSM commit:

  • 4784856 Make TileLang acceleration detection explicit

What changed:

  • Made TileLang capability reporting explicit and conservative.
  • Before this change, HAS_TILELANG_OPS became true whenever the Python fallback module imported.
  • That was misleading because csrc/tilelang did not actually dispatch to a real TileLang kernel; it used PyTorch fallback code.
  • Added TILELANG_BACKEND and HAS_TILELANG_ACCELERATION flags.
  • Added scripts/diagnose_tilelang_acceleration.py to print package availability, repo backend flags, and a small Gamma forward timing.
  • Fixed SSMGamma.step dtype/device casting after the honest fallback path exposed a float64 failure in the normal PyTorch path.
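
The idea of conservative capability reporting can be sketched as below. The flag names TILELANG_BACKEND and HAS_TILELANG_ACCELERATION come from this record, but the detection logic, the REAL_TILELANG_KERNEL_REGISTERED switch, and the package_available helper are assumptions made for illustration, not the repo's implementation.

```python
import importlib.util


def package_available(name: str) -> bool:
    """True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical switch: set to True only once a real TileLang kernel is wired in.
REAL_TILELANG_KERNEL_REGISTERED = False

tilelang_available = package_available("tilelang")
triton_available = package_available("triton")

if tilelang_available and REAL_TILELANG_KERNEL_REGISTERED:
    TILELANG_BACKEND = "tilelang"
else:
    TILELANG_BACKEND = "pytorch_fallback"

# Acceleration is claimed only when a real kernel backs the dispatch,
# never merely because a fallback module imported successfully.
HAS_TILELANG_ACCELERATION = TILELANG_BACKEND == "tilelang"

print({
    "tilelang_available": tilelang_available,
    "triton_available": triton_available,
    "tilelang_backend": TILELANG_BACKEND,
    "has_tilelang_acceleration": HAS_TILELANG_ACCELERATION,
})
```

Separating "package installed" from "kernel actually dispatched" keeps benchmark labels honest when only the fallback path runs.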

Validation:

  • python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q passed locally: 22 passed.
  • Local diagnostic reported:
    • has_tilelang_ops=false
    • tilelang_backend=pytorch_fallback
    • triton_available=false
    • tilelang_available=false

Remote RTX 5090 diagnostic:

| Field | Value |
|---|---|
| Torch | 2.11.0+cu130 |
| CUDA | available |
| GPU | NVIDIA GeForce RTX 5090 |
| Triton package | available |
| TileLang package | not available |
| Repo HAS_TILELANG_OPS | false |
| Repo TILELANG_BACKEND | pytorch_fallback |
| Gamma fallback forward | about 76.7k tok/s at batch 4, seq 512, bf16 |

Remote raw SSM benchmark after this change:

| Model | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR raw SSM | 4 | 512 | about 3.16M | about 1.03M | about 57 |
| Gamma-S4 raw SSM | 4 | 512 | about 100.6k | about 45.5k | about 467 |
| Baseline Gamma raw SSM | 4 | 512 | about 85.2k | about 32.3k | about 120 |

Interpretation:

  • This iteration did not add a real TileLang kernel yet.
  • It fixed an important measurement and dispatch problem: fallback code is no longer reported as hardware acceleration.
  • The remote server has Triton installed but does not have the TileLang package installed.
  • The current DPLR path is frequency-domain PyTorch/cuBLAS and does not use csrc/tilelang.
  • The next hardware-acceleration step should be explicit: either install/use real TileLang on the remote server or write a Triton/TileLang kernel for a clearly scoped hot path. The best candidate hot path is not the old baseline Gamma fallback; it is the DPLR direct frequency-response/backward path used by taonet_ssm.

LLM Iteration 9 - DPLR Frequency-Path Profiling And Root Cache

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py
  • SSM: scripts/profile_dplr_frequency_path.py

SSM commit:

  • 92643c5 Cache DPLR frequency roots

What changed:

  • Added a per-module cache for FFT roots and roots^seq_len.
  • These tensors are constants for a given (seq_len, fft_len, dtype, device), so rebuilding them every forward/layer is unnecessary GPU work.
  • Added scripts/profile_dplr_frequency_path.py to profile the DPLR convolutional path directly on the remote server.
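
A per-module cache of this kind can be sketched with a dictionary keyed by (seq_len, fft_len, dtype, device). The version below uses plain Python complex numbers in place of tensors. The class name, the string dtype/device placeholders, and the root convention exp(-2*pi*i*k/fft_len) over the rfft bins are assumptions for illustration.

```python
import cmath
from typing import Dict, List, Tuple

GridKey = Tuple[int, int, str, str]
Grid = Tuple[List[complex], List[complex]]


class FrequencyGridCache:
    """Per-instance cache of FFT roots and roots**seq_len."""

    def __init__(self) -> None:
        self._grids: Dict[GridKey, Grid] = {}
        self.builds = 0  # counts how often a grid was actually constructed

    @property
    def entries(self) -> int:
        return len(self._grids)

    def get(self, seq_len: int, fft_len: int,
            dtype: str = "complex64", device: str = "cpu") -> Grid:
        key = (seq_len, fft_len, dtype, device)
        if key not in self._grids:
            self.builds += 1
            # Assumed convention: one root per rfft bin.
            roots = [cmath.exp(-2j * cmath.pi * k / fft_len)
                     for k in range(fft_len // 2 + 1)]
            self._grids[key] = (roots, [r ** seq_len for r in roots])
        return self._grids[key]


cache = FrequencyGridCache()
for _ in range(4):  # e.g. four forward calls with the same shape
    roots, roots_pow = cache.get(seq_len=512, fft_len=1024)
print(f"builds={cache.builds} entries={cache.entries}")
```

Including dtype and device in the key prevents a grid built for one precision or device from being reused for another.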

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed locally.
  • python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q passed locally: 22 passed.
  • Local profiler smoke passed and showed frequency_grid_cache_entries=1.

Remote profiler result for raw DPLR at batch 4, seq 512, d_model 64, hidden_dim 256:

| Mode | Mean ms | Tok/s | Peak MB |
|---|---|---|---|
| forward | about 2.58 | about 793k | not measured |
| forward+backward | about 3.27 | about 626k | about 52 |

Remote profiler interpretation:

  • The largest CUDA entries were aten::bmm, aten::mm, and their backward paths.
  • aten::linalg_matrix_power was visible but small in this configuration.
  • Root generation was not the dominant cost, so the cache is a modest cleanup rather than a major acceleration.
  • A future TileLang/Triton kernel should target fused rank-1 DPLR frequency-response application and its backward, especially around the small complex BMM/MM pattern. Replacing the old Gamma Python fallback is not the right priority for the TaoNet-SSM goal.

TaoNet projected-64 check after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR projected mixer 64 | 4 | 512 | about 656k | about 163k | about 494 |

Scaling probe after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---|---|---|---|---|
| DPLR projected mixer 64 | 8 | 512 | about 983k | about 341k | about 889 |
| DPLR projected mixer 64 | 16 | 512 | about 1.03M | about 414k | about 1684 |

Interpretation:

  • The root cache is correct and removes repeated constant construction.
  • End-to-end results remain noisy; this is not a breakthrough optimization.
  • The main value of this iteration is the profiler evidence: the next real hardware acceleration should fuse the DPLR rank-1 complex frequency-response operations, not spend effort on the older baseline Gamma fallback path.

LLM Iteration 10 - Shared DPLR Frequency Grid Cache

Implementation location:

  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py

SSM commit:

  • a9e5d3e Share DPLR frequency grid cache

What changed:

  • Promoted the DPLR FFT root cache from per-module to class-level shared cache.
  • The previous cache avoided rebuilding roots inside a single SSM module, but a multi-layer TaoNet creates one SSM module per layer.
  • The shared cache lets all layers reuse the same (roots, roots^seq_len) tensors for a given (seq_len, fft_len, dtype, device).
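
The difference between a per-module and a class-level cache comes down to where the dictionary lives. A minimal sketch, assuming a hypothetical ToySSMLayer stand-in and plain Python complex numbers instead of tensors:

```python
import cmath
from typing import ClassVar, Dict, List, Tuple

GridKey = Tuple[int, int, str, str]
Grid = Tuple[List[complex], List[complex]]


class ToySSMLayer:
    """Stand-in for one SSM module; every instance shares one grid cache."""

    # Class attribute: a single dictionary shared by all layers.
    _shared_grids: ClassVar[Dict[GridKey, Grid]] = {}

    def frequency_grid(self, seq_len: int, fft_len: int,
                       dtype: str = "complex64", device: str = "cpu") -> Grid:
        key = (seq_len, fft_len, dtype, device)
        cache = type(self)._shared_grids
        if key not in cache:
            # Assumed convention: one root per rfft bin.
            roots = [cmath.exp(-2j * cmath.pi * k / fft_len)
                     for k in range(fft_len // 2 + 1)]
            cache[key] = (roots, [r ** seq_len for r in roots])
        return cache[key]


layers = [ToySSMLayer() for _ in range(4)]  # e.g. a 4-layer stack
grids = [layer.frequency_grid(512, 1024) for layer in layers]
shared = all(g is grids[0] for g in grids)
print(f"entries={len(ToySSMLayer._shared_grids)} shared={shared}")
```

With an instance-level dictionary the same loop would build four identical grids, one per layer.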

Validation:

  • python -m pytest tests\test_s4_ternary_dplr_ssm.py -q passed locally.
  • Local scripts/profile_dplr_frequency_path.py smoke passed and still reported one frequency-grid cache entry.

Required TaoNet comparison after this iteration:

Remote benchmark:

  • RepoBridge run: taonet-vs-dplr-proj64-shared-grid-bench-20260429-101304
  • Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
|---|---|---|---|---|---|---|
| attention TaoNet | 4 | 512 | forward | about 1.30M | about 193 | 9.064 |
| attention TaoNet | 4 | 512 | forward+backward | about 513k | about 376 | 9.064 |
| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward | about 499k | about 190 | 9.059 |
| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward+backward | about 162k | about 492 | 9.059 |

Comparison:

  • SSM forward throughput was about 38% of attention at batch 4, seq 512.
  • SSM forward+backward throughput was about 32% of attention.
  • SSM forward memory was slightly lower than attention, but backward peak memory was higher.
  • Loss was comparable because this is a random synthetic token benchmark, not a trained quality result.

Interpretation:

  • The shared cache is correct and small, but it did not create a clear end-to-end speed breakthrough.
  • This reinforces the profiler conclusion: constant/root setup is not the dominant TaoNet-SSM bottleneck.
  • Future iterations should include the attention-vs-SSM table directly, and hardware work should focus on the DPLR rank-1 complex BMM/MM and backward pattern.

LLM Iteration 11 - Re-anchor On Projected-64 Scaling Regime

Reason for this iteration:

  • The strongest previous result came from a scaling probe, not from batch-4 timing.
  • Later iterations over-emphasized batch 4, which made the SSM look worse and encouraged the wrong optimization target.
  • This iteration re-established the primary benchmark as attention TaoNet vs SSM TaoNet under larger projected-64 batches.

Implementation change:

  • No model-code change.
  • Benchmark-policy change: projected-64 scaling comparisons should be treated as primary acceptance tests for throughput work.

Remote benchmark:

  • RepoBridge run: taonet-token-dplr-proj64-scale-bench-20260429-111150
  • RepoBridge run: taonet-token-dplr-proj64-extended-scale-bench-20260429-111350
  • Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

Required TaoNet comparison:

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
|---|---|---|---|---|---|---|
| attention TaoNet | 8 | 512 | forward | about 1.09M | about 319 | 9.061 |
| attention TaoNet | 8 | 512 | forward+backward | about 468k | about 697 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward | about 1.17M | about 320 | 9.058 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 318k | about 889 | 9.058 |
| attention TaoNet | 16 | 512 | forward | about 1.60M | about 596 | 9.059 |
| attention TaoNet | 16 | 512 | forward+backward | about 503k | about 1332 | 9.059 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward | about 1.03M | about 580 | 9.060 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 427k | about 1684 | 9.060 |
| attention TaoNet | 32 | 512 | forward | about 1.88M | about 1124 | 9.062 |
| attention TaoNet | 32 | 512 | forward+backward | about 632k | about 2590 | 9.062 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward | about 2.62M | about 1100 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 705k | about 3273 | 9.061 |
| attention TaoNet | 64 | 512 | forward | about 3.53M | about 2204 | 9.061 |
| attention TaoNet | 64 | 512 | forward+backward | about 683k | about 5121 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward | about 1.30M | about 2140 | 9.060 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 618k | about 6451 | 9.060 |

Comparison:

  • Batch 8: SSM forward is slightly faster than attention, but backward is slower.
  • Batch 16: SSM backward is closer to attention than in batch-4 runs, but still slower.
  • Batch 32: SSM beats attention in both forward and forward+backward throughput in this run.
  • Batch 64: SSM falls off sharply, so the useful scaling point is not simply the largest batch.
  • SSM backward memory remains higher than attention, especially at larger batches.
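
The comparison can be made explicit by computing the SSM-to-attention throughput ratio per batch size. The numbers below are the approximate tok/s values from this iteration's table; the helper name is illustrative.

```python
from typing import Dict, Tuple

# batch -> (attention tok/s, SSM tok/s), approximate values from the table
forward: Dict[int, Tuple[float, float]] = {
    8: (1.09e6, 1.17e6), 16: (1.60e6, 1.03e6),
    32: (1.88e6, 2.62e6), 64: (3.53e6, 1.30e6),
}
forward_backward: Dict[int, Tuple[float, float]] = {
    8: (468e3, 318e3), 16: (503e3, 427e3),
    32: (632e3, 705e3), 64: (683e3, 618e3),
}


def ssm_over_attention(table: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Ratio above 1.0 means the SSM variant is faster than attention."""
    return {batch: ssm / attn for batch, (attn, ssm) in table.items()}


fwd_ratio = ssm_over_attention(forward)
fb_ratio = ssm_over_attention(forward_backward)
for batch in sorted(fb_ratio):
    print(f"batch {batch:2d}: forward {fwd_ratio[batch]:.2f}x, "
          f"forward+backward {fb_ratio[batch]:.2f}x")

best_batch = max(fb_ratio, key=fb_ratio.get)
print(f"best forward+backward ratio at batch {best_batch}")
```

Since these are single-run figures, the ratios indicate a trend rather than a stable measurement.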

Interpretation:

  • The projected-64 DPLR SSM should be optimized and evaluated around the scaling sweet spot, currently batch 32 for this synthetic benchmark on the RTX 5090.
  • Batch-4 timing is still useful for smoke tests, but it should not be treated as the main performance target.
  • This is a configuration-level breakthrough: SSM can outperform attention at the right batch size even before custom TileLang/Triton kernels.
  • Next improvement directions should either preserve or improve the batch-32 scaling result, not merely improve batch-4 microbenchmarks.

LLM Iteration 12 - Token Accuracy Benchmark And Causal Memory Check

Reason for this iteration:

  • Throughput alone is not sufficient; the SSM TaoNet must also learn useful token tasks.
  • The benchmark script previously reported only random synthetic CE, which is not an inference accuracy signal.
  • This iteration adds lightweight trained token tasks and reports eval_accuracy.

Implementation location:

  • TaoTrain: scripts/benchmark_taonet_token_variants.py

TaoTrain commit:

  • 59b84cd Add token task accuracy benchmark

What changed:

  • Added --token-task with:
    • random: original random next-token timing task
    • increment: deterministic token mapping, label is current token plus one modulo vocab
    • previous: causal memory task, label is the previous token
  • Added optional short training with --train-steps, --learning-rate, --weight-decay.
  • Added eval metrics:
    • eval_loss
    • eval_accuracy
    • train_final_loss
    • train_seconds
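
The two deterministic tasks can be illustrated with a small label-construction sketch. The function names are hypothetical, and masking the first position of the previous task with -100 (the default ignore_index of PyTorch's cross entropy loss) is an assumption; the benchmark script may handle that position differently.

```python
from typing import List

IGNORE_INDEX = -100  # assumed marker for positions without a valid label


def make_labels(tokens: List[int], task: str, vocab_size: int) -> List[int]:
    """Build per-position labels for the deterministic token tasks."""
    if task == "increment":
        # Label is the current token plus one, wrapping at the vocab size.
        return [(t + 1) % vocab_size for t in tokens]
    if task == "previous":
        # Position 0 has no previous token, so it is masked in this sketch.
        return [IGNORE_INDEX] + tokens[:-1]
    raise ValueError(f"unsupported task: {task}")


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of correct predictions over non-masked positions."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y != IGNORE_INDEX]
    return sum(p == y for p, y in scored) / len(scored)


tokens = [3, 127, 5]
print(make_labels(tokens, "increment", vocab_size=128))
print(make_labels(tokens, "previous", vocab_size=128))
```

The increment task needs only the current position, while the previous task requires information to flow from position t-1 to position t, which is why it serves as a causal memory check.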

Validation:

  • Local TaoTrain smoke passed on CPU.
  • python -m pytest tests\test_taonet_ssm.py -q passed locally.

Broad speed comparison after adding accuracy columns:

  • RepoBridge run: taonet-vs-dplr-proj64-broad-speed-bench-20260429-112432
  • Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, random token task, batch sweep

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB |
|---|---|---|---|---|---|
| attention TaoNet | 8 | 512 | forward+backward | about 873k | about 697 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 244k | about 956 |
| attention TaoNet | 16 | 512 | forward+backward | about 589k | about 1332 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 646k | about 1748 |
| attention TaoNet | 32 | 512 | forward+backward | about 680k | about 2592 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 816k | about 3338 |
| attention TaoNet | 64 | 512 | forward+backward | about 763k | about 5121 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 544k | about 6516 |

Speed interpretation:

  • SSM remains poor at batch 8.
  • SSM wins forward+backward throughput at batch 16 and batch 32 in this run.
  • SSM falls off again at batch 64.
  • Batch 16-32 remains the useful projected-64 scaling range.

Token accuracy comparison:

  • Previous-token task run: taonet-vs-dplr-proj64-previous-token-quality-20260429-112456
  • Linear/ungated SSM ablation run: taonet-vs-dplr-proj64-previous-token-linear-quality-20260429-112623
  • Increment task run: taonet-vs-dplr-proj64-increment-token-quality-20260429-112719
  • All quality runs used batch 32, seq 128, vocab 128, 100 train steps, bf16.

| Task | Architecture | Eval loss | Eval accuracy | Forward+backward tok/s |
|---|---|---|---|---|
| previous | attention TaoNet | about 0.033 | about 0.999 | about 551k |
| previous | SSM TaoNet, DPLR projected 64 | about 4.858 | about 0.009 | about 292k |
| previous, linear ungated SSM | attention TaoNet | about 0.046 | about 0.999 | about 990k |
| previous, linear ungated SSM | SSM TaoNet, DPLR projected 64 | about 4.626 | about 0.026 | about 346k |
| increment | attention TaoNet | about 0.007 | 1.000 | about 1.09M |
| increment | SSM TaoNet, DPLR projected 64 | about 0.009 | 1.000 | about 344k |

Failed SSM-core improvement:

  • SSM commit 2e974c9 Add delayed DPLR skip added a learnable one-step diagonal delayed skip to help causal memory.
  • Remote previous-token run after this change: taonet-vs-dplr-proj64-previous-token-quality-20260429-112953.
  • Result: SSM remained near random, eval accuracy about 0.008, and speed worsened.
  • The change was reverted by 3fc0575 Revert "Add delayed DPLR skip".

Interpretation:

  • Projected-64 DPLR SSM can learn simple token mappings (increment) to perfect accuracy.
  • It currently fails a short causal memory/copy task (previous) under the same 100-step setting where attention TaoNet reaches about 99.9% accuracy.
  • The failure is not solved by removing SSM activation/gates or by a simple delayed diagonal skip.
  • Future improvements must include both:
    • speed comparison across batch 8/16/32/64
    • trained token accuracy, especially on causal memory tasks
  • The next quality-focused direction should investigate the SSM wrapper/core's ability to expose previous-token information, not only low-level GPU speed.
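
A quick chance-level calculation puts the failing numbers in context. With vocab 128, uniform guessing gives accuracy 1/128 and cross entropy ln(128), so an eval loss near 4.86 with accuracy near 0.009 means the model has learned essentially nothing about the task. The function name below is illustrative.

```python
import math
from typing import Tuple


def chance_level(vocab_size: int) -> Tuple[float, float]:
    """Accuracy and cross entropy (nats) of a uniform random guesser."""
    return 1.0 / vocab_size, math.log(vocab_size)


chance_accuracy, chance_loss = chance_level(128)
print(f"chance accuracy={chance_accuracy:.4f} chance loss={chance_loss:.3f}")
```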

LLM Iteration 13 - Local Shift Register For Causal Token Memory

Reason for this iteration:

  • Projected-64 was the strongest SSM speed configuration, but it failed the previous token-memory task.
  • Capacity probes showed the failure was not caused by the projected-64 bottleneck alone:
    • projected-128 SSM eval accuracy stayed near random, about 0.007
    • full-width SSM (mixer dim 256) eval accuracy stayed near random, about 0.008
  • The next improvement therefore targeted explicit short causal memory while preserving the DPLR SSM as the main sequence mixer.

Implementation location:

  • TaoTrain commit: bb3bf90 Add SSM local shift mixer option
  • TaoTrain: src/taoTrain/models/taonet_ssm.py
  • TaoTrain: src/taoTrain/config.py
  • TaoTrain: scripts/benchmark_taonet_token_variants.py
  • TaoTrain: tests/test_taonet_ssm.py

What changed:

  • Added opt-in ssm_local_shift.
  • The SSM mixer can now add a one-token causal shift/register branch:
    • shifted[:, 1:] = x_norm[:, :-1]
    • output contribution is controlled by a single learned scalar ssm_local_shift_init.
  • The branch is deliberately cheap and ternary-friendly in structure: it is a causal shift plus scalar gain, not another dense attention mechanism.
  • The default remains off, so older SSM benchmarks are still comparable.
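
The shift branch can be sketched in plain Python for a single sequence stored as [time][channel]. This is a stand-in for the tensor expression above, assuming the first time step is filled with zeros and the gain is one scalar; the function names are hypothetical.

```python
from typing import List

Sequence = List[List[float]]  # indexed as [time][channel]


def causal_shift(x: Sequence) -> Sequence:
    """Delay by one step: out[t] = x[t - 1], with zeros at t = 0."""
    if not x:
        return []
    zeros = [0.0] * len(x[0])
    return [zeros] + [row[:] for row in x[:-1]]


def add_local_shift(ssm_out: Sequence, x_norm: Sequence, gain: float) -> Sequence:
    """Add the scaled one-token shift branch to the SSM mixer output."""
    shifted = causal_shift(x_norm)
    return [[y + gain * s for y, s in zip(y_row, s_row)]
            for y_row, s_row in zip(ssm_out, shifted)]


x_norm = [[1.0], [2.0], [3.0]]
ssm_out = [[0.0], [0.0], [0.0]]  # zero SSM output isolates the shift branch
mixed = add_local_shift(ssm_out, x_norm, gain=0.5)
print(mixed)
```

Because position t only ever reads position t-1, the branch stays causal and adds no parameters beyond the gain.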

Validation:

  • Local TaoTrain:
    • PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q passed, 4 passed.
    • CPU smoke for benchmark_taonet_token_variants.py --ssm-local-shift passed.

Capacity diagnostic before the change:

| Architecture | Mixer dim | Batch | Seq | Eval loss | Eval accuracy | Forward+backward tok/s |
|---|---|---|---|---|---|---|
| attention TaoNet | n/a | 32 | 128 | about 0.025 | 1.000 | about 1.04M |
| SSM TaoNet, DPLR | 128 | 32 | 128 | about 4.857 | about 0.007 | about 379k |
| attention TaoNet | n/a | 32 | 128 | about 0.042 | about 0.999 | about 556k |
| SSM TaoNet, DPLR | 256 | 32 | 128 | about 4.856 | about 0.008 | about 389k |

Required TaoNet comparison after the change:

  • RepoBridge run: taonet-vs-dplr-proj64-local-shift-previous-quality-20260429-144930
  • RepoBridge broad run: taonet-vs-dplr-proj64-local-shift-previous-broad-quality-20260429-145014
  • Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled.

| Architecture | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
|---|---|---|---|---|---|---|
| attention TaoNet | 8 | about 4.376 | about 0.096 | about 695k | about 238k | about 103 |
| SSM TaoNet, DPLR projected 64 + local shift | 8 | about 0.010 | 1.000 | about 226k | about 89k | about 181 |
| attention TaoNet | 16 | about 1.048 | about 0.847 | about 1.26M | about 508k | about 166 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | about 0.008 | 1.000 | about 520k | about 189k | about 299 |
| attention TaoNet | 32 | about 0.043 | 1.000 | about 2.54M | about 555k | about 297 |
| SSM TaoNet, DPLR projected 64 + local shift | 32 | about 0.008 | 1.000 | about 1.16M | about 353k | about 513 |
| attention TaoNet | 64 | about 0.020 | 1.000 | about 4.75M | about 1.73M | about 553 |
| SSM TaoNet, DPLR projected 64 + local shift | 64 | about 0.007 | 1.000 | about 2.43M | about 403k | about 956 |

Interpretation:

  • Success: this is the first projected-64 SSM TaoNet result that solves the previous causal-memory task.
  • The result is not only a batch-32 spot check. SSM reached perfect eval accuracy at batch 8, 16, 32, and 64.
  • The quality gain is large: plain projected-64, projected-128, and full-width (mixer dim 256) DPLR all stayed near random on the same task.
  • Speed tradeoff: local-shift SSM is slower than attention on this short-sequence previous-token benchmark, especially backward.
  • This should be treated as a quality architecture fix, not a hardware-acceleration fix. The next hardware iteration should still target fused DPLR frequency/backward kernels.

LLM Iteration 14 - Explicit DPLR Transfer-Mode Probe

Reason for this iteration:

  • After the local-shift quality fix, the next bottleneck was speed.
  • The DPLR direct frequency path applies the finite correction to batch-dependent hidden responses.
  • A possible alternative was to materialize the full frequency transfer matrix, then multiply by the input FFT.
  • This could be faster for some batch/sequence shapes, but it risks high memory and repeated transfer construction.

Implementation location:

  • SSM commit: 749a4cf Add DPLR transfer profiling mode
  • SSM commit: e34b67c Add DPLR conv transfer mode
  • TaoTrain commit: ceb08e6 Expose SSM conv transfer mode
  • SSM: gamma_space_model/modules/s4_ternary_dplr_ssm.py
  • SSM: scripts/profile_dplr_frequency_path.py
  • TaoTrain: scripts/benchmark_taonet_token_variants.py

What changed:

  • Added profiler support for comparing:
    • direct DPLR frequency response application
    • materialized transfer matrix application
  • Added explicit kernel_mode="conv_transfer" to S4TernaryDPLRSSM.
  • Exposed conv_transfer through TaoTrain config/benchmark CLI.
  • The mode is opt-in only. The default/recommended projected-64 path remains conv.
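
The tradeoff between the two application styles can be seen at a single frequency bin. The sketch below assumes a simplified factorized transfer H = C @ diag(r) @ B, where r stands in for the per-frequency hidden response; the real DPLR response includes a rank-one correction that is omitted here, and all names and values are illustrative.

```python
from typing import List

Matrix = List[List[complex]]
Vector = List[complex]


def matvec(a: Matrix, v: Vector) -> Vector:
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def direct_apply(C: Matrix, r: Vector, B: Matrix, u: Vector) -> Vector:
    """Apply C @ diag(r) @ B to u without forming the channel x channel matrix."""
    hidden = [ri * hi for ri, hi in zip(r, matvec(B, u))]
    return matvec(C, hidden)


def materialize_transfer(C: Matrix, r: Vector, B: Matrix) -> Matrix:
    """Build H = C @ diag(r) @ B once; it does not depend on the batch."""
    n = len(r)
    return [[sum(C[i][k] * r[k] * B[k][j] for k in range(n))
             for j in range(len(B[0]))] for i in range(len(C))]


C = [[1.0, 0.5j, -0.2], [0.3, 1.0, 0.1j]]   # channels x hidden
B = [[0.4, -0.1], [0.2j, 0.6], [1.0, 0.3]]  # hidden x channels
r = [0.9 + 0.1j, 0.5 - 0.2j, 0.7j]          # per-frequency hidden response
batch = [[1.0, -1.0], [0.5j, 2.0]]          # input FFT values at this bin

H = materialize_transfer(C, r, B)
max_diff = 0.0
for u in batch:
    y_direct = direct_apply(C, r, B, u)
    y_transfer = matvec(H, u)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(y_direct, y_transfer)))
print(f"max abs difference: {max_diff:.2e}")
```

The direct form repeats hidden-size work for every batch element, while the materialized form pays a one-time channels x channels cost per frequency that is then stored, consistent with the higher peak memory reported for transfer mode.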

Local validation:

  • SSM: python -m pytest tests\test_s4_ternary_dplr_ssm.py tests\test_ssm_gamma.py -q passed, 23 passed.
  • TaoTrain: PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q passed, 5 passed.

Isolated SSM-core remote profile:

  • RepoBridge run: ssm-dplr-direct-vs-transfer-s128-profile-20260429-145555
  • Config: DPLR state/mixer dim 64, hidden dim 256, seq 128, bf16, rank 1.

| Method | Batch | Forward tok/s | Forward+backward tok/s | Peak MB | Interpretation |
| --- | --- | --- | --- | --- | --- |
| direct | 8 | about 555k | about 332k | about 34 | baseline direct path |
| transfer | 8 | about 1.24M | about 440k | about 247 | faster but much higher memory |
| direct | 16 | about 790k | about 1.13M | about 47 | direct wins |
| transfer | 16 | about 737k | about 481k | about 248 | transfer loses |
| direct | 32 | about 6.73M | about 1.86M | about 74 | direct wins |
| transfer | 32 | about 4.89M | about 1.68M | about 250 | transfer loses |
| direct | 64 | about 6.90M | about 2.20M | about 128 | baseline direct path |
| transfer | 64 | about 2.93M | about 3.06M | about 253 | backward faster, forward slower |

TaoNet comparison after exposing conv_transfer:

  • RepoBridge run: taonet-vs-dplr-proj64-local-shift-conv-transfer-previous-broad-quality-20260429-145946
  • Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, ssm_kernel_mode=conv_transfer.

| Architecture | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
| --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | 8 | about 4.431 | about 0.082 | about 699k | about 279k | about 103 |
| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 8 | about 0.010 | 1.000 | about 52k | about 14k | about 195 |
| attention TaoNet | 16 | about 1.098 | about 0.862 | about 1.24M | about 303k | about 166 |
| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 16 | about 0.008 | 1.000 | about 79k | about 24k | about 270 |
| attention TaoNet | 32 | about 0.061 | about 0.998 | about 1.24M | about 674k | about 297 |
| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 32 | about 0.007 | 1.000 | about 157k | about 45k | about 420 |
| attention TaoNet | 64 | about 0.015 | 1.000 | about 4.48M | about 1.05M | about 553 |
| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 64 | about 0.007 | 1.000 | about 370k | about 97k | about 719 |

Comparison to the previous direct-conv local-shift run:

| Batch | Direct-conv SSM forward+backward tok/s | Transfer-mode SSM forward+backward tok/s |
| --- | --- | --- |
| 8 | about 89k | about 14k |
| 16 | about 189k | about 24k |
| 32 | about 353k | about 45k |
| 64 | about 403k | about 97k |

Interpretation:

  • Failed as an end-to-end TaoNet acceleration.
  • The isolated SSM profile suggested transfer mode could help in some cases, but inside the LLM wrapper it is much slower across all tested batch sizes.
  • Accuracy remains solved because the local shift branch is still active, but speed regresses badly.
  • Keep conv_transfer only as an explicit diagnostic/experimental mode for now.
  • Recommended mode remains ssm_kernel_mode=conv with ssm_local_shift=True.
  • The next hardware target should not be materializing the whole transfer each layer/step. It should focus on fusing the current direct DPLR response path or giving it a custom autograd implementation, especially for the complex rank-1 frequency operations and the backward pass.

LLM Iteration 15 - Shrink DPLR Hidden State After Local-Shift Quality Fix

Reason for this iteration:

  • The local-shift branch solved the previous token-memory task, but the quality-fixed SSM was still slower than attention on short seq-128 training.
  • The profiler for the recommended direct DPLR path at batch 32, seq 128 showed many small complex BMM/MM calls; there was no single obvious Python-only bottleneck.
  • Since local shift now carries exact one-token memory, the DPLR hidden dimension may not need to remain at 256 for this token-memory regime.
  • This iteration tested smaller DPLR hidden states as a ternary-friendly architecture/config improvement.

Remote profiler context:

  • RepoBridge run: ssm-dplr-direct-b32-s128-profile-20260429-154242
  • Config: DPLR mixer/state dim 64, hidden dim 256, batch 32, seq 128, bf16, rank 1, direct path.
  • Result: forward+backward about 2.25M core tok/s.
  • Profiler top CUDA cost was small complex BMM/MM work; aten::bmm accounted for about 48% of self CUDA time.
  • aten::linalg_matrix_power was visible but small, about 40 µs CUDA total.

Remote hidden-dim sweeps:

  • RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden-sweep-previous-20260429-154546
  • RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden-small-sweep-previous-20260429-155028
  • Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct conv path.

| SSM hidden dim | Batch | SSM eval accuracy | SSM forward+backward tok/s | SSM peak MB | Attention eval accuracy | Attention forward+backward tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| 256 | 8 | 1.000 | about 89k | about 181 | about 0.096 | about 238k |
| 256 | 16 | 1.000 | about 189k | about 299 | about 0.847 | about 508k |
| 256 | 32 | 1.000 | about 353k | about 513 | 1.000 | about 555k |
| 256 | 64 | 1.000 | about 403k | about 956 | 1.000 | about 1.73M |
| 64 | 8 | 1.000 | about 95k | about 145 | about 0.102 | about 278k |
| 64 | 16 | 1.000 | about 184k | about 239 | about 0.932 | about 300k |
| 64 | 32 | 1.000 | about 370k | about 404 | about 0.999 | about 895k |
| 64 | 64 | 1.000 | about 564k | about 750 | about 0.999 | about 920k |
| 32 | 8 | 1.000 | about 91k | about 139 | about 0.097 | about 245k |
| 32 | 16 | 1.000 | about 187k | about 227 | about 0.941 | about 460k |
| 32 | 32 | 1.000 | about 302k | about 393 | about 0.998 | about 863k |
| 32 | 64 | 1.000 | about 787k | about 716 | 1.000 | about 1.75M |
| 16 | 8 | 1.000 | about 86k | about 138 | about 0.083 | about 260k |
| 16 | 16 | 1.000 | about 187k | about 223 | about 0.844 | about 495k |
| 16 | 32 | 1.000 | about 357k | about 378 | about 0.999 | about 550k |
| 16 | 64 | 1.000 | about 795k | about 705 | 1.000 | about 1.76M |

Seq-512 speed check for hidden dim 16:

  • RepoBridge run: taonet-vs-dplr-proj64-local-shift-hidden16-random-speed-20260429-155346
  • Config: random next-token timing task, seq 512, vocab 8192, projected DPLR mixer dim 64, hidden dim 16, local shift enabled.

| Architecture | Batch | Forward tok/s | Forward+backward tok/s | Peak MB | Loss |
| --- | --- | --- | --- | --- | --- |
| attention TaoNet | 16 | about 1.30M | about 601k | about 1332 | about 9.055 |
| SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 16 | about 2.18M | about 728k | about 1511 | about 9.069 |
| attention TaoNet | 32 | about 3.87M | about 1.37M | about 2590 | about 9.060 |
| SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 32 | about 3.52M | about 1.14M | about 2887 | about 9.061 |
| attention TaoNet | 64 | about 4.16M | about 1.45M | about 5121 | about 9.065 |
| SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 64 | about 4.03M | about 1.31M | about 5649 | about 9.060 |

Interpretation:

  • Success for the short token-memory benchmark: hidden dim 16 kept perfect previous-token accuracy and improved batch-64 forward+backward throughput from about 403k to about 795k tok/s while reducing peak memory from about 956 MB to about 705 MB.
  • Hidden dim 64 was also strong and slightly better at batch 32 than hidden dim 16.
  • This did not become a universal seq-512 speed replacement. On random seq-512 timing, hidden dim 16 beat attention at batch 16 but lost at batch 32 and 64.
  • Recommended quality-aware short-memory config is now ssm_mixer_dim=64, ssm_hidden_dim=16, ssm_local_shift=True, ssm_kernel_mode=conv.
  • Recommended longer seq-512 throughput config should remain benchmark-driven; the older hidden-256 projected-64 regime still has stronger evidence around batch 16-32.

LLM Iteration 16 - Seq-512 Previous-Token Robustness And Hidden-State Selection

Reason for this iteration:

  • Iteration 15 showed hidden dim 16 was excellent for seq-128 previous-token memory and mixed for seq-512 random timing.
  • The missing check was a longer trained token-memory task: the seq-512 previous-token task, where accuracy and training speed both matter.
  • This iteration tested whether the local-shift quality fix holds at seq 512 and whether hidden dim 16, 64, or 256 is the best state size at this longer context.

Remote benchmark:

  • RepoBridge run with attention comparison: taonet-vs-dplr-proj64-local-shift-hidden16-previous512-20260429-161213
  • RepoBridge SSM-only hidden comparison: taonet-ssm-proj64-local-shift-previous512-hidden-compare-20260429-161306
  • Config: previous-token task, seq 512, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct conv path.

Required TaoNet comparison:

| Architecture | SSM hidden dim | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | n/a | 16 | about 4.614 | about 0.048 | about 2.04M | about 1.39M | about 575 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 16 | about 0.007 | 1.000 | about 2.57M | about 701k | about 754 |
| attention TaoNet | n/a | 32 | about 2.090 | about 0.629 | about 4.79M | about 899k | about 1099 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 32 | about 0.007 | 1.000 | about 4.32M | about 944k | about 1391 |
| attention TaoNet | n/a | 64 | about 0.239 | about 0.962 | about 4.08M | about 1.18M | about 2157 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 64 | about 0.007 | 1.000 | about 2.57M | about 961k | about 2677 |

SSM hidden-state comparison at seq 512:

| SSM hidden dim | Batch | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
| --- | --- | --- | --- | --- | --- |
| 16 | 16 | 1.000 | about 2.57M | about 701k | about 754 |
| 16 | 32 | 1.000 | about 4.32M | about 944k | about 1391 |
| 16 | 64 | 1.000 | about 2.57M | about 961k | about 2677 |
| 64 | 16 | 1.000 | about 1.18M | about 494k | about 800 |
| 64 | 32 | 1.000 | about 4.26M | about 1.36M | about 1491 |
| 64 | 64 | 1.000 | about 4.70M | about 1.46M | about 2874 |
| 256 | 16 | 1.000 | about 2.49M | about 748k | about 1032 |
| 256 | 32 | 1.000 | about 3.36M | about 705k | about 1914 |
| 256 | 64 | 1.000 | about 3.62M | about 788k | about 3681 |

Interpretation:

  • Quality success: local-shift DPLR SSM keeps perfect previous-token accuracy at seq 512 for all tested hidden sizes and batches.
  • Attention did not fully learn the same task in 100 steps at batch 16/32 and reached about 0.962 accuracy at batch 64.
  • Speed depends on batch:
    • batch 16: hidden 256 is fastest among SSM variants, about 748k backward tok/s; attention is still faster at about 1.39M.
    • batch 32: hidden 64 is fastest, about 1.36M backward tok/s, beating attention's about 899k.
    • batch 64: hidden 64 is fastest, about 1.46M backward tok/s, beating attention's about 1.18M.
  • This gives a better longer-memory recommendation than Iteration 15:
    • use ssm_hidden_dim=16 for short seq-128 memory and lower memory pressure
    • use ssm_hidden_dim=64 for seq-512 trained memory around batch 32/64
    • keep hidden 256 as a possible batch-16 or legacy speed point, but not the general quality-aware default
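
The per-batch speed conclusions above can be re-derived from the seq-512 hidden-state comparison. The sketch below hard-codes the approximate forward+backward throughput values from that comparison; the dictionary layout and the function name `fastest_hidden_dim` are illustrative assumptions, not part of the benchmark scripts.

```python
# (ssm_hidden_dim, batch) -> approximate forward+backward tok/s at seq 512.
ssm_fb_tok_s = {
    (16, 16): 701e3, (16, 32): 944e3, (16, 64): 961e3,
    (64, 16): 494e3, (64, 32): 1.36e6, (64, 64): 1.46e6,
    (256, 16): 748e3, (256, 32): 705e3, (256, 64): 788e3,
}

# batch -> approximate attention forward+backward tok/s in the same task.
attention_fb_tok_s = {16: 1.39e6, 32: 899e3, 64: 1.18e6}


def fastest_hidden_dim(batch):
    """Return (hidden_dim, tok_s) of the fastest SSM variant for one batch size."""
    candidates = {h: v for (h, b), v in ssm_fb_tok_s.items() if b == batch}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


summary = {}
for batch in sorted(attention_fb_tok_s):
    hidden, tok_s = fastest_hidden_dim(batch)
    beats_attention = tok_s > attention_fb_tok_s[batch]
    summary[batch] = (hidden, beats_attention)
    print(f"batch {batch}: hidden {hidden} at {tok_s:,.0f} tok/s, "
          f"beats attention: {beats_attention}")
```

Because the inputs are single-run approximate measurements, the selection is only as reliable as the table it reads from.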

LLM Iteration 17 - TaoData Real-Text Byte-Token Pilot

Reason for this iteration:

  • Synthetic previous-token and increment tasks were useful diagnostics, but they are not enough to judge LLM capability.
  • The remote server has a TaoData corpus at /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl.
  • No SentencePiece tokenizer artifact was found at the expected remote TaoTrain/TaoData tokenizer paths, so the first real-text benchmark used dependency-free byte tokenization.
  • Byte tokenization is not the final deployment tokenizer, but it gives a real-corpus next-token signal and exercises the same TaoNet model paths.

Implementation location:

  • TaoTrain commit: b8c4f3d Add real token TaoNet benchmark
  • TaoTrain: scripts/benchmark_taonet_real_tokens.py

What changed:

  • Added a remote-friendly real-token benchmark script that:
    • reads JSONL or plain text
    • supports TaoData-style text records
    • supports byte tokenization and optional SentencePiece tokenization
    • builds contiguous next-token batches from one long token stream
    • reports eval loss, perplexity, token accuracy, throughput, and memory
    • compares attention TaoNet against multiple SSM hidden sizes in one run
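
The data path described above (byte tokenization, then contiguous next-token batches cut from one long stream) can be sketched as follows. This is not the code in scripts/benchmark_taonet_real_tokens.py. The `text` record key, the three reserved special ids implied by a vocab size of 259, the end-of-record id, and all function names are assumptions made for illustration.

```python
import json

NUM_SPECIAL = 3  # assumption: 259 = 256 byte values + 3 reserved ids
EOS_ID = 2       # assumption: hypothetical end-of-record id


def byte_tokenize(text):
    """Map UTF-8 bytes to token ids, offset past the reserved special ids."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]


def token_stream(jsonl_lines, text_key="text", max_tokens=None):
    """Concatenate all records into one long token stream."""
    stream = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        stream.extend(byte_tokenize(record[text_key]))
        stream.append(EOS_ID)
        if max_tokens is not None and len(stream) >= max_tokens:
            return stream[:max_tokens]
    return stream


def next_token_batches(stream, seq_len, batch_size):
    """Yield (inputs, targets) where targets are inputs shifted by one token."""
    window = seq_len + 1
    rows = [stream[i:i + window]
            for i in range(0, len(stream) - window + 1, seq_len)]
    for start in range(0, len(rows) - batch_size + 1, batch_size):
        chunk = rows[start:start + batch_size]
        yield [r[:-1] for r in chunk], [r[1:] for r in chunk]


lines = ['{"text": "hello world"}', '{"text": "state space"}']
stream = token_stream(lines)
batches = list(next_token_batches(stream, seq_len=4, batch_size=2))
inputs, targets = batches[0]
print(len(stream), len(batches), inputs[0], targets[0])
```

Consecutive windows overlap by one token so that every input position has a target drawn from the same contiguous stream; incomplete trailing rows and batches are dropped in this sketch.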

Validation:

  • Local CPU smoke passed on a plain text file with byte tokenization.
  • Remote RepoBridge runs completed on TaoData JSONL.

Remote benchmark:

  • RepoBridge run: taonet-vs-ssm-real-token-taodata-byte-pilot-20260429-164623
  • RepoBridge run: taonet-vs-ssm-real-token-taodata-byte-pilot-b64-20260429-164720
  • Data: /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
  • Tokenization: byte-level, vocab size 259
  • Data limit: first 2,000,000 byte tokens from up to 5,000 records
  • Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

Required TaoNet comparison:

| Architecture | SSM hidden dim | Batch | Eval loss | Eval PPL | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | n/a | 16 | about 2.549 | about 12.80 | about 0.260 | about 2.03M | about 1.40M | about 585 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 16 | about 1.982 | about 7.26 | about 0.423 | about 2.42M | about 564k | about 757 |
| SSM TaoNet, DPLR projected 64 + local shift | 64 | 16 | about 1.928 | about 6.88 | about 0.440 | about 2.16M | about 488k | about 803 |
| attention TaoNet | n/a | 32 | about 2.523 | about 12.47 | about 0.266 | about 2.13M | about 809k | about 1115 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 32 | about 1.879 | about 6.55 | about 0.455 | about 4.43M | about 1.38M | about 1396 |
| SSM TaoNet, DPLR projected 64 + local shift | 64 | 32 | about 1.848 | about 6.35 | about 0.457 | about 3.97M | about 1.25M | about 1496 |
| attention TaoNet | n/a | 64 | about 2.529 | about 12.54 | about 0.265 | about 5.98M | about 2.03M | about 2190 |
| SSM TaoNet, DPLR projected 64 + local shift | 16 | 64 | about 1.807 | about 6.10 | about 0.471 | about 4.92M | about 1.67M | about 2686 |
| SSM TaoNet, DPLR projected 64 + local shift | 64 | 64 | about 1.834 | about 6.26 | about 0.466 | about 2.54M | about 1.52M | about 2882 |

Interpretation:

  • First real-corpus quality success: both SSM candidates beat attention on validation loss, perplexity, and byte-token accuracy after the same number of train steps.
  • Hidden 64 was best quality at batch 16/32, while hidden 16 was best quality at batch 64 and generally faster among SSM variants.
  • Speed tradeoff depends on batch:
    • batch 16: attention backward is faster, but SSM has much better validation quality.
    • batch 32: hidden-16 SSM wins both quality and backward throughput versus attention.
    • batch 64: attention wins backward throughput, while SSM wins validation quality.
  • This benchmark is byte-level, so it should be treated as a real-text pilot rather than the final TaoData tokenizer benchmark.
  • Next real-data step: train or locate the intended SentencePiece tokenizer, then rerun the same script with --tokenizer-type sentencepiece.
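
The eval loss and eval PPL columns in the byte-level comparison are linked by the standard relation PPL = exp(cross-entropy loss in nats). The following check recomputes perplexity from the rounded losses reported above; small differences are expected because the table rounds both values.

```python
import math


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


# (reported eval loss, reported eval PPL) pairs from the byte-level pilot.
reported = [
    (2.549, 12.80), (1.982, 7.26), (1.928, 6.88),
    (2.523, 12.47), (1.879, 6.55), (1.848, 6.35),
    (2.529, 12.54), (1.807, 6.10), (1.834, 6.26),
]

relative_errors = []
for loss, ppl in reported:
    recomputed = perplexity(loss)
    relative_errors.append(abs(recomputed - ppl) / ppl)
    print(f"loss {loss:.3f}: reported PPL {ppl:.2f}, recomputed {recomputed:.2f}")
```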

LLM Iteration 18 - TaoData SentencePiece Pilot And Per-Channel Local Shift

Reason for this iteration:

  • Byte-level TaoData results were encouraging but not the intended LLM tokenization.
  • No pre-existing tokenizer artifact was found on the remote server, so a pilot SentencePiece tokenizer was trained from TaoData.
  • The first 500-step SentencePiece run showed attention still ahead on validation loss at batch 32, even though SSM retained a token-accuracy edge.
  • Because no-shift SSM was worse, the local shift branch was helping; the next lightweight improvement was making the shift gain per-channel instead of one scalar.

Implementation location:

  • TaoTrain commit: 33747c1 Add TaoData pilot tokenizer config
  • TaoTrain commit: c519645 Add per-channel SSM local shift
  • TaoTrain: configs/tokenizer_taodata_pilot.yaml
  • TaoTrain: src/taoTrain/models/taonet_ssm.py
  • TaoTrain: scripts/benchmark_taonet_real_tokens.py

What changed:

  • Added a pilot tokenizer config:
    • input: /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
    • output: /home/student/YouZheng/tokenizers/taodata_pilot_8k
    • vocab size: 8192
    • max samples: 20000
  • Trained the remote tokenizer; output files:
    • /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
    • /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab
  • Added opt-in ssm_local_shift_per_channel.
  • The previous shift branch used one learned scalar for all model channels.
  • The new branch can use one learned gain per model channel while keeping the operation cheap: shift plus elementwise multiply.
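
A minimal sketch of the shift branch, under stated assumptions: the shift is causal by one position with zero padding at the first step, and the function returns only the branch output. How the branch is combined with the SSM output inside taonet_ssm.py is not recorded here, and `local_shift_branch` is a hypothetical name.

```python
def local_shift_branch(x, gain):
    """Shift the sequence by one step and scale it.

    x: list over time of per-channel value lists, shape [time][channel].
    gain: one float (scalar shift) or a list with one float per channel.
    """
    num_channels = len(x[0])
    if isinstance(gain, (int, float)):
        gains = [float(gain)] * num_channels  # one scalar shared by all channels
    else:
        gains = list(gain)                    # one learned gain per channel
        assert len(gains) == num_channels

    out = []
    previous = [0.0] * num_channels  # assumed zero padding before the first token
    for frame in x:
        out.append([gains[c] * previous[c] for c in range(num_channels)])
        previous = frame
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scalar_out = local_shift_branch(x, 0.5)
channel_out = local_shift_branch(x, [1.0, 0.0])
print(scalar_out)
print(channel_out)
```

The per-channel variant costs the same shift plus one elementwise multiply, but lets individual channels keep or suppress the previous-token signal independently.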

Validation:

  • TaoTrain local tests: python -m pytest tests\test_taonet_ssm.py -q passed, 6 passed.
  • Local real-token smoke with --ssm-local-shift-per-channel passed.
  • Remote tokenizer training completed. RepoBridge's local print path initially hit a Windows emoji encoding issue, but the tokenizer files were created successfully.

SentencePiece 150-step pilot:

  • RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-pilot-20260429-171228
  • Data: TaoData FineWeb JSONL
  • Tokenization: pilot SentencePiece 8k
  • Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

| Architecture | SSM hidden dim | Batch | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | n/a | 16 | about 5.718 | about 304 | about 0.150 | about 1.01M |
| SSM TaoNet | 16 | 16 | about 5.723 | about 306 | about 0.149 | about 743k |
| SSM TaoNet | 64 | 16 | about 5.728 | about 307 | about 0.146 | about 381k |
| attention TaoNet | n/a | 32 | about 5.533 | about 253 | about 0.156 | about 842k |
| SSM TaoNet | 16 | 32 | about 5.505 | about 246 | about 0.165 | about 771k |
| SSM TaoNet | 64 | 32 | about 5.561 | about 260 | about 0.158 | about 1.09M |
| attention TaoNet | n/a | 64 | about 5.414 | about 225 | about 0.163 | about 623k |
| SSM TaoNet | 16 | 64 | about 5.427 | about 227 | about 0.169 | about 1.12M |
| SSM TaoNet | 64 | 64 | about 5.395 | about 220 | about 0.171 | about 623k |

SentencePiece 500-step batch-32 follow-up:

  • RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-b32-500step-20260429-171338
  • RepoBridge run without shift: taonet-ssm-real-token-taodata-spm-b32-500step-no-shift-20260429-171451
  • RepoBridge run with per-channel shift: taonet-vs-ssm-real-token-taodata-spm-b32-500step-channel-shift-20260429-171917
  • Config: batch 32, seq 512, 500 train steps, eval batches 16.

| Variant | SSM hidden dim | Shift type | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s |
| --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | n/a | n/a | about 4.715 | about 112 | about 0.211 | about 1.23M in first run, about 892k in per-channel run |
| SSM TaoNet | 16 | scalar | about 4.798 | about 121 | about 0.217 | about 1.13M |
| SSM TaoNet | 64 | scalar | about 4.830 | about 125 | about 0.215 | about 968k |
| SSM TaoNet | 16 | none | about 5.088 | about 162 | about 0.171 | about 554k |
| SSM TaoNet | 64 | none | about 5.102 | about 164 | about 0.169 | about 580k |
| SSM TaoNet | 16 | per-channel | about 4.782 | about 119 | about 0.218 | about 784k |
| SSM TaoNet | 64 | per-channel | about 4.818 | about 124 | about 0.215 | about 1.08M |

Interpretation:

  • The SentencePiece pilot is more realistic and less favorable to SSM than the byte-level pilot.
  • SSM has a small token-accuracy edge at batch 32, but attention has the best 500-step validation loss/perplexity.
  • Removing local shift is clearly worse, so local shift is useful for real-token modeling too.
  • Per-channel shift is a small quality improvement over scalar shift:
    • hidden 16 eval loss improved from about 4.798 to 4.782
    • hidden 64 eval loss improved from about 4.830 to 4.818
  • Per-channel shift is not enough to surpass attention on 500-step SentencePiece validation loss.
  • Next model-improvement direction should target SSM language-modeling capacity or optimization, not just exact one-token memory:
    • try larger ssm_mixer_dim such as 96/128 with h16/h64
    • tune SSM learning rate/weight decay separately from attention
    • test a small gated local convolution/projection branch if ternary deployment accepts it

LLM Iteration 19 - TaoData SentencePiece Mixer-Dimension Sweep

Reason for this iteration:

  • The 500-step SentencePiece batch-32 pilot showed SSM had a small token-accuracy edge, but attention still had better validation loss/perplexity.
  • The prior best SSM used ssm_mixer_dim=64, originally chosen from speed-focused scaling probes.
  • Because real-token quality may need more SSM channel capacity, this iteration swept projected mixer dimensions while keeping the same outer TaoNet dimensions.

Implementation location:

  • TaoTrain commit: 357336e Sweep SSM mixer dims in real token benchmark
  • TaoTrain: scripts/benchmark_taonet_real_tokens.py
  • RepoBridge config: repobridge.taonet.realspm.taodata.b32.500step.mixersweep.config.json

What changed:

  • Added --ssm-mixer-dims to the real-token benchmark.
  • The benchmark now records ssm_mixer_dim in the printed table and CSV.
  • Attention TaoNet is still evaluated once per batch, while SSM TaoNet can sweep multiple hidden and mixer dimensions in the same run.
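
The sweep layout described above (attention once per batch, SSM once per hidden/mixer combination) can be sketched with argparse and itertools. Only `--ssm-mixer-dims` is named in this record. The `--ssm-hidden-dims` and `--batch-sizes` flags, the helper names, and the plan dictionary layout are assumptions for illustration and may not match the real script.

```python
import argparse
from itertools import product


def int_list(value):
    """Parse a comma-separated string such as '64,96,128' into a list of ints."""
    return [int(item) for item in value.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Sweep sketch (hypothetical)")
    parser.add_argument("--ssm-mixer-dims", type=int_list, default=[64])
    # The two flags below are assumed names, not confirmed CLI options.
    parser.add_argument("--ssm-hidden-dims", type=int_list, default=[16])
    parser.add_argument("--batch-sizes", type=int_list, default=[32])
    return parser


def build_run_plan(args):
    """One attention run per batch, then every SSM hidden/mixer combination."""
    plan = []
    for batch in args.batch_sizes:
        plan.append({"architecture": "taonet", "batch": batch})
        for hidden, mixer in product(args.ssm_hidden_dims, args.ssm_mixer_dims):
            plan.append({
                "architecture": "taonet_ssm",
                "batch": batch,
                "ssm_hidden_dim": hidden,
                "ssm_mixer_dim": mixer,
            })
    return plan


args = build_parser().parse_args([
    "--ssm-mixer-dims", "64,96,128",
    "--ssm-hidden-dims", "16,64",
    "--batch-sizes", "32",
])
plan = build_run_plan(args)
for run in plan:
    print(run)
```

With hidden dims 16 and 64 and mixer dims 64, 96, and 128 at a single batch size, the plan has one attention run and six SSM runs.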

Validation:

  • TaoTrain local syntax check: python -m py_compile scripts\benchmark_taonet_real_tokens.py passed.
  • TaoTrain local tests with the SSM repo on PYTHONPATH: python -m pytest tests\test_taonet_ssm.py -q passed, 6 passed.
  • Local byte-token smoke with --ssm-mixer-dims 8,12 passed and wrote CSV/JSON outputs.

Remote benchmark:

  • RepoBridge run: taonet-vs-ssm-real-token-taodata-spm-b32-500step-mixersweep-20260429-193729
  • Data: TaoData FineWeb JSONL
  • Tokenization: pilot SentencePiece 8k
  • Config: batch 32, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches, local shift enabled, per-channel shift enabled.

| Architecture | SSM hidden dim | SSM mixer dim | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s | Peak allocated MB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| attention TaoNet | n/a | n/a | 4.715 | 111.633 | 0.211 | 618k | 2590 |
| SSM TaoNet | 16 | 64 | 4.780 | 119.046 | 0.218 | 1.13M | 2887 |
| SSM TaoNet | 16 | 96 | 4.759 | 116.643 | 0.222 | 973k | 3029 |
| SSM TaoNet | 16 | 128 | 4.719 | 112.088 | 0.224 | 782k | 3192 |
| SSM TaoNet | 64 | 64 | 4.824 | 124.475 | 0.214 | 982k | 2987 |
| SSM TaoNet | 64 | 96 | 4.761 | 116.917 | 0.219 | 479k | 3131 |
| SSM TaoNet | 64 | 128 | 4.784 | 119.589 | 0.218 | 457k | 3292 |

Interpretation:

  • Increasing the projected mixer dimension helped the best SSM real-token validation loss.
  • The best quality SSM in this run was ssm_hidden_dim=16, ssm_mixer_dim=128:
    • validation loss 4.719, very close to attention 4.715
    • token accuracy 0.224, above attention 0.211
    • forward+backward throughput about 782k tok/s, above attention about 618k tok/s
  • Hidden dim 64 did not help this batch-32 500-step SentencePiece setting; it was slower and worse than hidden dim 16 at mixer dim 128.
  • Mixer dim 64 remains the best SSM speed/quality tradeoff, but mixer dim 128 is now the best SSM quality candidate on real SentencePiece token modeling.
  • Next step should test whether hidden_dim=16, mixer_dim=128 remains strong at batch 16/64 and longer training, then try a narrow learning-rate sweep around it.

LLM Iteration 20 - Attempted h16/m128 Batch Generalization Sweep

Reason for this iteration:

  • Iteration 19 found a strong real-token batch-32 point: ssm_hidden_dim=16, ssm_mixer_dim=128.
  • The user noted earlier that a single batch-size sweet spot can be misleading.
  • This iteration was meant to compare attention TaoNet vs SSM TaoNet at batch 16, 32, and 64 with the same 500-step SentencePiece protocol.

Implementation location:

  • TaoTrain commit used remotely: 357336e Sweep SSM mixer dims in real token benchmark
  • RepoBridge config: repobridge.taonet.realspm.taodata.h16m128.batchsweep.config.json

Planned remote benchmark:

  • Data: TaoData FineWeb JSONL
  • Tokenization: pilot SentencePiece 8k
  • Config: batch 16/32/64, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches
  • Attention baseline: taonet
  • SSM candidate: taonet_ssm, DPLR, ssm_hidden_dim=16, ssm_mixer_dim=128, local shift enabled, per-channel shift enabled

Remote status before run:

  • RepoBridge write guard passed.
  • RepoBridge preflight passed.
  • Remote GPU: RTX 5090 with about 21 GB free VRAM.
  • A same-user taodata process was present and using about 10.9 GB VRAM; no other users were detected.

Outcome:

  • The RepoBridge full run began, but the SFTP download phase failed with:
    • Socket exception: An existing connection was forcibly closed by the remote host (10054)
    • paramiko.ssh_exception.SSHException: Server connection dropped
  • Subsequent read-only RepoBridge SSH checks timed out with WinError 10060.
  • The new result folder did not appear in the partial local download, so no valid benchmark table was available to record.

Interpretation:

  • This was an infrastructure interruption, not a model failure.
  • Do not infer anything about h16/m128 batch generalization from this attempted run.
  • Next action when the remote server is reachable: rerun or download the run for taonet-vs-ssm-real-token-taodata-spm-h16m128-batchsweep.

Current LLM-Wrapper Best Configuration

Best current speed benchmark configuration:

  • architecture: taonet_ssm
  • SSM core: dplr
  • mixer projection: ssm_mixer_dim=64
  • SSM hidden dimension: 256
  • DPLR rank: 1
  • kernel mode: conv
  • dtype: bf16
  • benchmark task: synthetic next-token CE through TaoNet wrapper

Best current quality-aware token-memory configuration:

  • architecture: taonet_ssm
  • SSM core: dplr
  • mixer projection: ssm_mixer_dim=64
  • SSM hidden dimension: 16
  • DPLR rank: 1
  • kernel mode: conv
  • dtype: bf16
  • local shift: ssm_local_shift=True
  • benchmark task: previous token memory through TaoNet wrapper
  • evidence: perfect eval accuracy at batch 8, 16, 32, and 64 after 100 steps; best observed short-memory batch-64 SSM backward throughput about 795k tok/s

Best current longer token-memory configuration:

  • architecture: taonet_ssm
  • SSM core: dplr
  • mixer projection: ssm_mixer_dim=64
  • SSM hidden dimension: 64
  • DPLR rank: 1
  • kernel mode: conv
  • dtype: bf16
  • local shift: ssm_local_shift=True
  • benchmark task: seq-512 previous token memory through TaoNet wrapper
  • evidence: perfect eval accuracy at batch 16, 32, and 64 after 100 steps; best observed batch-32 and batch-64 SSM backward throughput about 1.36M and 1.46M tok/s, both above attention in the same task

Best current TaoData real-text pilot configuration:

  • architecture: taonet_ssm
  • SSM core: dplr
  • mixer projection: ssm_mixer_dim=128 for best current SentencePiece validation loss; ssm_mixer_dim=64 for speed/quality balance
  • SSM hidden dimension: 16
  • DPLR rank: 1
  • kernel mode: conv
  • dtype: bf16
  • local shift: ssm_local_shift=True
  • local shift gain: ssm_local_shift_per_channel=True
  • benchmark task: TaoData FineWeb JSONL, byte-level and pilot SentencePiece next-token prediction, seq 512
  • evidence:
    • byte-level: lower validation loss/perplexity than attention at batch 16/32/64 after 150 steps; hidden-16 also beat attention backward throughput at batch 32
    • SentencePiece batch 32, 500 steps: ssm_hidden_dim=16, ssm_mixer_dim=128 reached eval loss about 4.719 vs attention about 4.715, with better token accuracy (0.224 vs 0.211) and higher backward throughput (782k vs 618k tok/s)
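
The recommended configurations in this section can be collected into presets for quick reference. The keys `ssm_mixer_dim`, `ssm_hidden_dim`, `ssm_kernel_mode`, `ssm_local_shift`, and `ssm_local_shift_per_channel` appear in this record; `architecture`, `ssm_core`, `ssm_rank`, `dtype`, and the preset names are assumed spellings that may differ from the actual TaoTrain config schema.

```python
# Settings shared by every recommended configuration in this section.
BASE = {
    "architecture": "taonet_ssm",  # assumed key name
    "ssm_core": "dplr",            # assumed key name
    "ssm_rank": 1,                 # assumed key name
    "ssm_kernel_mode": "conv",
    "dtype": "bf16",               # assumed key name
}

PRESETS = {
    # Synthetic next-token speed benchmark.
    "speed": {**BASE, "ssm_mixer_dim": 64, "ssm_hidden_dim": 256},
    # Seq-128 previous-token memory.
    "short_memory": {**BASE, "ssm_mixer_dim": 64, "ssm_hidden_dim": 16,
                     "ssm_local_shift": True},
    # Seq-512 previous-token memory.
    "long_memory": {**BASE, "ssm_mixer_dim": 64, "ssm_hidden_dim": 64,
                    "ssm_local_shift": True},
    # TaoData SentencePiece pilot, best validation loss so far.
    "real_text_quality": {**BASE, "ssm_mixer_dim": 128, "ssm_hidden_dim": 16,
                          "ssm_local_shift": True,
                          "ssm_local_shift_per_channel": True},
    # TaoData pilot, speed/quality balance.
    "real_text_balanced": {**BASE, "ssm_mixer_dim": 64, "ssm_hidden_dim": 16,
                           "ssm_local_shift": True,
                           "ssm_local_shift_per_channel": True},
}

for name, config in PRESETS.items():
    print(name, config["ssm_mixer_dim"], config["ssm_hidden_dim"])
```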

Current best evidence:

  • At batch 4, seq 512, projected-64 DPLR reaches about 618k forward tok/s and 192k backward tok/s.
  • At batch 16, seq 512, projected-64 DPLR reaches about 2.12M forward tok/s and 702k backward tok/s.
  • Attention is still faster for backward at batch 16 in the same run: about 990k tok/s.
  • DPLR projected-64 forward can exceed attention in this benchmark, but training/backward still needs improvement.
  • A newer scaling rerun found a batch-32 sweet spot where projected-64 DPLR exceeded attention in both forward and forward+backward throughput:
    • SSM forward about 2.62M tok/s vs attention about 1.88M
    • SSM forward+backward about 705k tok/s vs attention about 632k

Important local artifact paths:

  • C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091624\taonet_token_benchmark.csv
  • C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected64-scale\outputs-taotrain\taonet-token-dplr-proj64-scale-bench-20260429-091738\taonet_token_benchmark.csv
  • C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091956\taonet_token_benchmark.csv

Recommended next LLM-wrapper targets:

  1. Rerun the real SentencePiece benchmark for ssm_hidden_dim=16, ssm_mixer_dim=128 at batch 16/32/64 to check whether the gain generalizes beyond the batch-32 spot.
  2. Optimize backward throughput in S4TernaryDPLRSSM; the forward path is now competitive at larger batch sizes.
  3. Run a learning-rate and weight-decay sweep around the current best SSM real-token config, because the SSM and attention cores may not share the same optimal optimizer settings.
  4. Investigate whether FFT/direct-response intermediates can be checkpointed or handled by a custom autograd function to improve backward speed.
  5. Keep ternary deployment constraints in view: rank-1 DPLR factors still use ternary masks with learned amplitudes, and projected mixer dimensions should remain friendly to ternary compute layouts.

Version Timeline

| Run | Notebook(s) | Commit printed in notebook | Device | Main purpose |
| --- | --- | --- | --- | --- |
| _r1 | gamma_s4_sinewave_benchmark_r1.ipynb | not printed | CUDA | First comparison of baseline, minimal, enhanced on simple sinewave task. |
| _r2 | gamma_s4_sinewave_benchmark_r2.ipynb | not printed | CUDA | Harder multivariate long-range task; enhanced first became clearly promising. |
| _r3 | gamma-s4-sinewave-benchmark_r3.ipynb | 6df3777 | CUDA | Quick benchmark after deployment-cache import fix; recurrent enhanced still very slow. |
| _r4 | gamma-s4-sinewave-benchmark_r4.ipynb | d6ebddc | CUDA | Triangular-solve recurrent optimization; large recurrent speedup. |
| _r5 | gamma-s4-sinewave-benchmark_r5.ipynb | 78ae31f | CUDA | Added recurrent/full-output agreement metrics. |
| _r6 | quick + research notebooks | a2474cc / 5952546 | CPU for quick, CUDA for research | Split quick/research benchmark; first practical long-context run showed conv path was too slow. |
| _r7 | quick + research notebooks | 4b977c1 / b17f72a | CUDA | Faster conv kernel generation and cheaper research defaults. |
| _r8 | quick + research notebooks | 73e76a7 | CUDA | Skipped unused final states, enabled baseline deploy metrics, enabled token-lite. |
| _r9 | quick + research notebooks | 8738675 / 60562bd | CUDA | Added research visuals; performance similar to _r8, now presentation-friendly. |
| _r10 | quick + research notebooks | 09db0da / 9ff7e4e | CUDA | Added balanced deployment metrics to test a speed/fidelity point between full recurrent and deployment-lite. |
| _r11 | quick + research + challenge notebooks | 64f8632 / 4842762 / bfc6e26 | CUDA | Fixed AMP FFT path, split result tables, and added challenge benchmarks for permuted MNIST, selective copying, and induction-style recall. |
| _r12 | quick + research + challenge notebooks | 740a9ef / 0c6ecb8 / 11bd2e6 | CUDA | Tested the input-selection gate. Forecasting stayed strong, but challenge recall tasks remained near random. |

_r1 - First Simple Sinewave Comparison

Saved notebook:

  • output/jupyter-notebook/gamma_s4_sinewave_benchmark_r1.ipynb

Configuration recovered from notebook:

  • device: CUDA
  • task: simple 1D sinewave next-step prediction
  • seq_len=128
  • train_samples=512
  • val_samples=128
  • batch_size=32
  • epochs=10
  • d_model=1
  • hidden_dim=32
  • num_layers=2

Results:

| Model | Params | Final val loss | Mean epoch s | Full ms | Full tokens/s | Recurrent ms | Recurrent tokens/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gamma_baseline | 134 | 0.170722 | 2.063 | 24.604 | 166477 | 51.177 | 80036 |
| gamma_s4_minimal | 138 | 0.019148 | 1.282 | 23.545 | 173967 | 54.462 | 75209 |
| gamma_s4_enhanced | 146 | 1.154002 | 1.330 | 23.389 | 175127 | 82.343 | 49743 |

Interpretation:

  • gamma_s4_minimal was best on this very simple task.
  • gamma_s4_enhanced was unstable or badly underfit here: its final val loss was 1.154 versus 0.019 for gamma_s4_minimal.
  • This run showed that the richer enhanced block can be harmful on small/simple tasks.
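
The throughput columns in the _r1 results are numerically consistent with tokens/s = batch_size × seq_len / latency, using batch_size=32 and seq_len=128 from the recovered configuration. The sketch below checks that reading against the reported values; the formula is inferred from the numbers and not confirmed against the notebook code.

```python
def tokens_per_second(batch_size, seq_len, latency_ms):
    """Throughput implied by one timed pass over a batch of sequences."""
    return batch_size * seq_len / (latency_ms / 1000.0)


BATCH_SIZE, SEQ_LEN = 32, 128

# (model, mode, latency in ms, reported tokens/s) from the _r1 results.
reported = [
    ("gamma_baseline", "full", 24.604, 166477),
    ("gamma_s4_minimal", "full", 23.545, 173967),
    ("gamma_s4_enhanced", "full", 23.389, 175127),
    ("gamma_baseline", "recurrent", 51.177, 80036),
    ("gamma_s4_minimal", "recurrent", 54.462, 75209),
    ("gamma_s4_enhanced", "recurrent", 82.343, 49743),
]

relative_errors = []
for model, mode, latency_ms, reported_tok_s in reported:
    implied = tokens_per_second(BATCH_SIZE, SEQ_LEN, latency_ms)
    relative_errors.append(abs(implied - reported_tok_s) / reported_tok_s)
    print(f"{model:18s} {mode:9s} implied {implied:10.0f}  reported {reported_tok_s}")
```

The small residuals come from the latencies being rounded to three decimals in the record.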

_r2 - Harder Multivariate Forecasting

Saved notebook:

  • output/jupyter-notebook/gamma_s4_sinewave_benchmark_r2.ipynb

Configuration recovered from notebook:

  • device: CUDA
  • task: harder multivariate synthetic forecasting
  • seq_len=512
  • num_features=8
  • train_samples=768
  • val_samples=192
  • batch_size=32
  • epochs=12
  • d_model=8
  • hidden_dim=64
  • num_layers=3

Results:

| Model | Params | Final val loss | Mean epoch s | Full ms | Full tokens/s | Recurrent ms | Recurrent tokens/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gamma_baseline | 3192 | 0.006972 | 28.916 | 146.644 | 111726 | 305.446 | 53640 |
| gamma_s4_minimal | 3243 | 0.110654 | 17.194 | 121.234 | 135144 | 343.394 | 47712 |
| gamma_s4_enhanced | 3675 | 0.006302 | 17.191 | 131.929 | 124188 | 492.576 | 33262 |

Interpretation:

  • gamma_s4_enhanced became the best-quality model.
  • Enhanced training was much faster than baseline on this task: about 17.2 s per epoch versus about 28.9 s.
  • Recurrent inference was still significantly slower than baseline.
  • This was the first strong evidence that the enhanced model is useful on harder sequence tasks.

_r3 - Quick Benchmark With Deployment Cache Available

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r3.ipynb

Configuration:

  • device: CUDA
  • commit: 6df3777
  • quick tasks:
    • simple: seq_len=192, features=4, epochs=4
    • moderate: seq_len=320, features=6, epochs=5
  • models: gamma_baseline, gamma_s4_enhanced
  • enhanced: kernel_mode="auto", kernel_threshold=384, bilinear discretization

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent tokens/s | Deploy recurrent ms |
|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.199 | 5779 | 75.092 | 5114 | not available |
| simple | enhanced | 0.045667 | 2.450 | 9671 | 1024.471 | 375 | 997.631 |
| moderate | baseline | 0.533364 | 6.544 | 5729 | 192.084 | 3332 | not available |
| moderate | enhanced | 0.021113 | 6.584 | 6995 | 2815.223 | 227 | 2346.986 |

Interpretation:

  • Enhanced quality was much better than baseline.
  • Full-sequence throughput was better for enhanced.
  • The enhanced recurrent path was catastrophically slow: 1024.471 ms vs 75.092 ms for baseline on simple, and 2815.223 ms vs 192.084 ms on moderate.
  • This run motivated recurrent-path optimization.

_r4 - Triangular-Solve Recurrent Optimization

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r4.ipynb

Configuration:

  • device: CUDA
  • commit: d6ebddc
  • same quick tasks as _r3
  • key code change: bilinear recurrent stepping switched to a triangular-solve path

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent tokens/s | Deploy recurrent ms |
|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.219 | 6186 | 69.398 | 5533 | not available |
| simple | enhanced | 0.045667 | 2.288 | 9728 | 139.394 | 2755 | 104.623 |
| moderate | baseline | 0.533364 | 6.409 | 6182 | 110.415 | 5796 | not available |
| moderate | enhanced | 0.021113 | 6.630 | 9896 | 240.392 | 2662 | 185.037 |

Interpretation:

  • This was a major recurrent-inference improvement.
  • Enhanced recurrent latency dropped from roughly 1.0 s (simple) and 2.8 s (moderate) in _r3 to about 139 ms and 240 ms.
  • Enhanced still remained slower than baseline in recurrent mode.
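The record names the _r4 change (a triangular-solve path for bilinear recurrent stepping) but does not show the implementation. The following is a minimal sketch of the idea, assuming the standard bilinear update x_next = (I - dt/2 A)^-1 ((I + dt/2 A) x + dt B u) and a lower-bidiagonal A as described under Model Names. Because I - dt/2 A is then also lower-bidiagonal, the inverse never has to be formed: one forward substitution pass is enough. The function name and the example values are hypothetical.

```python
def bilinear_step(diag, sub, b, x, u, dt):
    """One bilinear step for a lower-bidiagonal A.

    A[i][i] = diag[i] and A[i][i-1] = sub[i]; sub[0] is unused.
    """
    n = len(x)
    h = 0.5 * dt
    # Right-hand side (I + h A) x + dt * b * u, touching only the two bands.
    rhs = []
    for i in range(n):
        val = (1.0 + h * diag[i]) * x[i] + dt * b[i] * u
        if i > 0:
            val += h * sub[i] * x[i - 1]
        rhs.append(val)
    # Forward substitution on (I - h A), which is lower-bidiagonal as well.
    x_next = [0.0] * n
    for i in range(n):
        acc = rhs[i]
        if i > 0:
            acc += h * sub[i] * x_next[i - 1]
        x_next[i] = acc / (1.0 - h * diag[i])
    return x_next


# Hypothetical 3-state example; these values are not the trained model's parameters.
state = bilinear_step(
    diag=[-1.0, -1.0, -1.0],
    sub=[0.0, 1.0, 1.0],
    b=[1.0, 0.0, 0.0],
    x=[1.0, 0.0, 0.0],
    u=0.0,
    dt=0.5,
)
print(state)
```

Each step costs O(n) in the state size here, compared with O(n^2) for multiplying by a dense precomputed inverse.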

_r5 - Agreement Metrics Added

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r5.ipynb

Configuration:

  • device: CUDA
  • commit: 78ae31f
  • same quick tasks as _r4
  • added:
    • recurrent_match_mse
    • deploy_match_mse

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent match MSE | Deploy recurrent ms | Deploy match MSE |
|---|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.317 | 6097 | 71.296 | 0.000000 | not available | not available |
| simple | enhanced | 0.045667 | 2.381 | 9963 | 141.656 | 0.008500 | 107.361 | 0.031251 |
| moderate | baseline | 0.533364 | 6.603 | 5912 | 114.832 | 0.000000 | not available | not available |
| moderate | enhanced | 0.021113 | 7.199 | 9465 | 242.692 | 0.007549 | 178.070 | 0.029995 |

Interpretation:

  • Enhanced remained much better in quality.
  • Full-sequence throughput favored enhanced.
  • Recurrent/deployment-lite speed improved but still trailed baseline.
  • Agreement metrics showed that the normal enhanced recurrent output was close to full forward (match MSE about 0.008), while deployment-lite was faster but less faithful (match MSE about 0.03).
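The agreement metrics compare a step-by-step output against the full-sequence output for the same input. A toy sketch of that comparison, assuming a small diagonal LTI state-space model rather than the actual Gamma block (all function names and values are hypothetical):

```python
def recurrent_outputs(a, b, c, u):
    """Token-by-token path: x_t = a * x_{t-1} + b * u_t, y_t = c . x_t."""
    x = [0.0] * len(a)
    ys = []
    for u_t in u:
        x = [a_i * x_i + b_i * u_t for a_i, b_i, x_i in zip(a, b, x)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys


def kernel_outputs(a, b, c, u):
    """Full-sequence path: build K[k] = sum_i c_i a_i^k b_i, then convolve causally."""
    length = len(u)
    kernel = [
        sum(c_i * (a_i ** k) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for k in range(length)
    ]
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(length)]


def match_mse(reference, candidate):
    """Mean squared difference between two output sequences."""
    return sum((r - s) ** 2 for r, s in zip(reference, candidate)) / len(reference)


a, b, c = [0.9, 0.5], [1.0, 0.3], [0.2, -0.4]
u = [1.0, 0.0, -2.0, 0.5, 3.0]
print(match_mse(kernel_outputs(a, b, c, u), recurrent_outputs(a, b, c, u)))
```

In this exact toy model the two paths agree to floating-point precision. A clearly nonzero match MSE, as recorded for the enhanced model, indicates that the compared paths are not numerically identical.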

_r6 - Split Quick/Research Benchmark Era

_r6 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r6.ipynb

Configuration:

  • device: CPU
  • commit: a2474cc
  • same quick tasks as _r5

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s |
|---|---|---|---|---|---|
| simple | baseline | 0.240532 | 1.444 | 15225 | 8656 |
| simple | enhanced | 0.045714 | 2.598 | 20066 | 4278 |
| moderate | baseline | 0.056279 | 5.613 | 20720 | 11785 |
| moderate | enhanced | 0.021122 | 8.653 | 12149 | 2875 |

Interpretation:

  • This was a CPU run, so speed conclusions are not treated as primary benchmark evidence.
  • It was useful as a smoke test only.
  • The CPU result reminded us to warn clearly when notebooks are not running on GPU.
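One way to make the non-GPU case hard to miss is to emit an explicit warning before any timing cell runs. A minimal sketch using only the standard library; the function name and message are hypothetical, not taken from the notebooks:

```python
import warnings


def check_benchmark_device(device_type: str) -> bool:
    """Return True if timings on this device count as primary benchmark evidence."""
    if device_type == "cuda":
        return True
    warnings.warn(
        f"Benchmark is running on '{device_type}', not CUDA; "
        "treat speed numbers as a smoke test only.",
        RuntimeWarning,
        stacklevel=2,
    )
    return False


check_benchmark_device("cpu")
```

Assuming the notebooks use PyTorch, the argument could come from `"cuda" if torch.cuda.is_available() else "cpu"`.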

_r6 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r6.ipynb

Configuration:

  • device: CUDA
  • commit: 5952546
  • research tasks:
    • current_reference: seq_len=320, features=6, epochs=5
    • long_context: seq_len=768, features=8, epochs=4
  • RUN_ABLATIONS=True
  • RUN_TOKEN_TASK=False

Results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 6.844 | recurrent_like | 6157 | 5522 | not available |
| current_reference | enhanced | 0.020500 | 7.366 | recurrent_like | 8431 | 2594 | 3408 |
| long_context | baseline | 27.229956 | 36.239 | recurrent_like | 2819 | 2939 | not available |
| long_context | enhanced | 0.012164 | 634.387 | conv | 358 | 1876 | 2501 |

Interpretation:

  • Enhanced crushed baseline in quality.
  • But the long-context conv path was extremely slow (634.387 s mean epoch time and 358 full tokens/s for enhanced).
  • The ablation section was too expensive and was stopped midway.
  • This run motivated the later kernel-generation speedup and disabling ablations by default.
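The Expected mode column matches a simple length threshold: seq_len=320 stays recurrent_like and seq_len=768 switches to conv. A hypothetical sketch of that selection, assuming the kernel_mode="auto" and kernel_threshold=384 settings from _r3 still apply here and that the switch happens at or above the threshold (the record does not say whether the boundary is inclusive):

```python
def expected_mode(seq_len: int, kernel_threshold: int = 384) -> str:
    """Pick the full-sequence execution mode from the sequence length."""
    # Assumption: inclusive boundary; the real rule may use a strict comparison.
    return "conv" if seq_len >= kernel_threshold else "recurrent_like"


for name, seq_len in [("current_reference", 320), ("long_context", 768)]:
    print(name, expected_mode(seq_len))
```

Under this rule both quick tasks (seq_len=192 and seq_len=320) would also stay on the recurrent-like path.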

_r7 - Conv Kernel Generation Improved

_r7 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r7.ipynb

Configuration:

  • device: CUDA
  • commit: 4b977c1
  • same quick tasks

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.417 | 4977 | 5155 | not available |
| simple | enhanced | 0.045667 | 2.565 | 8583 | 2500 | 3260 |
| moderate | baseline | 0.533364 | 7.405 | 5413 | 5186 | not available |
| moderate | enhanced | 0.021113 | 7.796 | 7465 | 2414 | 3226 |

Interpretation:

  • Quick benchmark remained stable.
  • Enhanced retained quality and full-sequence throughput advantages.
  • Recurrent remained slower than baseline.

_r7 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r7.ipynb

Configuration:

  • device: CUDA
  • commit: b17f72a
  • RUN_ABLATIONS=False
  • RUN_TOKEN_TASK=False

Results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 7.351 | recurrent_like | 3821 | 4616 | not available |
| current_reference | enhanced | 0.020500 | 7.530 | recurrent_like | 9289 | 2780 | 3339 |
| long_context | baseline | 27.229956 | 39.236 | recurrent_like | 3523 | 3282 | not available |
| long_context | enhanced | 0.012029 | 44.189 | conv | 5971 | 1776 | 2229 |

Interpretation:

  • The conv speed issue was dramatically improved versus _r6.
  • Enhanced long-context epoch time dropped from about 634s to about 44s (roughly 14x faster).
  • Enhanced was still slightly slower than baseline per epoch on long_context, but had much better loss and better full-sequence throughput.

_r8 - No-State Full Forward, Baseline Deploy Metrics, Token-Lite Enabled

_r8 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r8.ipynb

Configuration:

  • device: CUDA
  • commit: 73e76a7
  • same quick tasks
  • baseline deploy metrics became available
  • full-sequence training/inference skips unused final-state computation

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | Deploy match MSE |
|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.261 | 5711 | 5076 | 5241 | 0.000000 |
| simple | enhanced | 0.044817 | 2.550 | 8204 | 2621 | 3367 | 0.022886 |
| moderate | baseline | 0.533364 | 7.011 | 5782 | 5447 | 4519 | 0.000000 |
| moderate | enhanced | 0.020569 | 7.010 | 8926 | 2503 | 3390 | 0.018165 |

Interpretation:

  • Baseline deploy columns now populate.
  • Enhanced full-sequence throughput remained ahead.
  • Training time was tied on moderate.

_r8 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r8.ipynb

Configuration:

  • device: CUDA
  • commit: 73e76a7
  • RUN_ABLATIONS=False
  • RUN_TOKEN_TASK=True

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 7.235 | recurrent_like | 5941 | 5581 | 5702 |
| current_reference | enhanced | 0.019951 | 7.177 | recurrent_like | 7431 | 1918 | 2336 |
| long_context | baseline | 27.229956 | 35.557 | recurrent_like | 3969 | 3759 | 3842 |
| long_context | enhanced | 0.011708 | 14.235 | conv | 19544 | 1860 | 2406 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---|---|---|---|---|
| baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 |
| enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 |

Interpretation:

  • This was the strongest practical result so far.
  • On long_context, enhanced was both much more accurate and much faster per epoch.
  • Token-lite showed enhanced also transferred better to a language-like task.
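The Val PPL column is the exponential of Val CE, assuming the cross entropy is measured in nats. A short check against the token-lite numbers above (the helper name is illustrative):

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity corresponding to a mean cross entropy in nats."""
    return math.exp(cross_entropy_nats)


for model, val_ce in [("baseline", 3.132184), ("enhanced", 2.486829)]:
    print(model, round(perplexity(val_ce), 3))
```

Because the relation is exponential, the 0.645-nat gap in val CE corresponds to roughly a 1.9x gap in perplexity.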

_r9 - Presentation Visuals Added

_r9 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r9.ipynb

Configuration:

  • device: CUDA
  • commit: 8738675
  • same quick tasks as _r8

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.249 | 6058 | 5478 | 5502 |
| simple | enhanced | 0.044817 | 2.344 | 9550 | 2617 | 3644 |
| moderate | baseline | 0.533364 | 6.672 | 6324 | 5686 | 5599 |
| moderate | enhanced | 0.020569 | 6.571 | 9304 | 2771 | 3416 |

Interpretation:

  • Results are similar to _r8, with slightly better timings overall.
  • Enhanced still wins on quality and full-sequence throughput.
  • Baseline still wins recurrent throughput.

_r9 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r9.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 60562bd
  • visual sections added:
    • task visual preview
    • prediction comparison plots
    • error comparison plots

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s |
|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 7.294 | recurrent_like | 6111 | 4981 | 5359 |
| current_reference | enhanced | 0.019951 | 7.494 | recurrent_like | 8099 | 2185 | 3343 |
| long_context | baseline | 27.229956 | 37.885 | recurrent_like | 3576 | 3728 | 3695 |
| long_context | enhanced | 0.011708 | 14.717 | conv | 15654 | 1810 | 2327 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---|---|---|---|---|
| baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 |
| enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 |

Interpretation:

  • _r9 is the most presentation-friendly record.
  • It confirms the _r8 story:
    • enhanced wins quality strongly
    • enhanced wins full-sequence/conv long-context training and throughput
    • baseline still wins recurrent deployment throughput
    • token-lite favors enhanced

_r10 - Balanced Deployment Path Added

_r10 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r10.ipynb

Configuration:

  • device: CUDA
  • commit: 09db0da
  • same quick tasks as _r9
  • new metrics:
    • balanced_deploy_recurrent_latency_ms
    • balanced_deploy_recurrent_tokens_per_s
    • balanced_deploy_match_mse

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.134 | 6053 | 5891 | 6034 | 5938 | 0.000000 | 0.000000 |
| simple | enhanced | 0.044817 | 2.532 | 9973 | 2507 | 3763 | 3123 | 0.022886 | 0.000986 |
| moderate | baseline | 0.533364 | 6.331 | 6134 | 5835 | 5512 | 5816 | 0.000000 | 0.000000 |
| moderate | enhanced | 0.020569 | 6.601 | 10045 | 2778 | 3510 | 2862 | 0.018165 | 0.000468 |

Interpretation:

  • Enhanced quality and full-sequence throughput remain strong.
  • Deployment-lite is still the fastest enhanced deployment variant.
  • Balanced deployment is slower than deployment-lite, but much more faithful to full forward.
  • Balanced deployment is useful as a fidelity-preserving approximation, not as a pure speed win.

_r10 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r10.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 9ff7e4e
  • same research tasks as _r9
  • balanced deployment metrics added

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 7.648 | recurrent_like | 4933 | not recorded in compact table | 5092 | 4987 | 0.000000 | 0.000000 |
| current_reference | enhanced | 0.019951 | 8.193 | recurrent_like | 8152 | not recorded in compact table | 3404 | 2687 | 0.027752 | 0.000315 |
| long_context | baseline | 27.229956 | 40.350 | recurrent_like | 2395 | not recorded in compact table | 3397 | 3285 | 0.000000 | 0.000000 |
| long_context | enhanced | 0.011708 | 15.862 | conv | 16957 | not recorded in compact table | 2245 | 1886 | 0.200325 | 0.001692 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---|---|---|---|---|
| baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 |
| enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 |

Interpretation:

  • Long-context enhanced still wins strongly on validation loss and full-sequence throughput.
  • Balanced deployment drastically improves fidelity relative to deployment-lite on enhanced:
    • long_context deploy-lite match MSE: 0.200325
    • long_context balanced match MSE: 0.001692
  • However, balanced deployment is slower than deployment-lite.
  • This suggests the output projection is important for fidelity, while the input-dependent gate is a major recurrent-time cost.
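The fidelity and speed trade-off between the two deployment variants can be put into two ratios. A small sketch using the _r10 long_context enhanced row (the function name is hypothetical):

```python
def deploy_tradeoff(lite_mse, balanced_mse, lite_tps, balanced_tps):
    """Return (fidelity gain, throughput retained) for balanced vs deployment-lite."""
    fidelity_gain = lite_mse / balanced_mse
    throughput_retained = balanced_tps / lite_tps
    return fidelity_gain, throughput_retained


gain, retained = deploy_tradeoff(0.200325, 0.001692, 2245, 1886)
print(f"match MSE reduced {gain:.0f}x, throughput retained {retained:.0%}")
```

On this row, balanced deployment cuts the match MSE by roughly two orders of magnitude while keeping about 84% of the deployment-lite throughput.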

_r11 - FFT Fix, Split Tables, And Challenge Benchmarks

_r11 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r11.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 64f8632
  • same quick tasks as _r10
  • notebook tables split into normal, deployment-lite, and balanced deployment views

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.353 | 6162 | 5776 | 5524 | 5649 | 0.000000 | 0.000000 |
| simple | enhanced | 0.044817 | 2.113 | 11279 | 2625 | 3618 | 3112 | 0.022886 | 0.000986 |
| moderate | baseline | 0.533364 | 6.527 | 6187 | 5337 | 4572 | 5563 | 0.000000 | 0.000000 |
| moderate | enhanced | 0.020569 | 6.264 | 11434 | 2598 | 3338 | 2809 | 0.018165 | 0.000468 |

Interpretation:

  • Enhanced remains much better on validation loss and full-sequence throughput.
  • Baseline remains faster for exact recurrent stepping.
  • Deployment-lite is still the fastest enhanced recurrent approximation, while balanced is much more faithful.

_r11 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r11.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 4842762
  • includes AMP FFT fix and split benchmark tables

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 6.922 | recurrent_like | 5898 | 5494 | 5162 | 5523 | 0.000000 | 0.000000 |
| current_reference | enhanced | 0.019951 | 6.383 | recurrent_like | 11200 | 2665 | 3522 | 2931 | 0.027752 | 0.000315 |
| long_context | baseline | 27.229956 | 36.928 | recurrent_like | 2994 | 2725 | 2513 | 2866 | 0.000000 | 0.000000 |
| long_context | enhanced | 0.011593 | 10.419 | conv | 235772 | 1849 | 2477 | 1542 | 0.193474 | 0.001699 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---|---|---|---|---|
| baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 |
| enhanced | 2.483604 | 2.486901 | 12.024 | 192 | 1200 |

Interpretation:

  • The AMP FFT fix worked: the long-context enhanced conv path completed and showed very high cached full-sequence throughput.
  • Enhanced long-context training is now much faster than baseline in this setup and far more accurate.
  • Recurrent deployment remains the weak point: enhanced exact recurrent throughput is still lower than baseline.
  • Balanced deployment remains the best fidelity-preserving approximation, but it is slower than deployment-lite.
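The conv/full-sequence path applies a precomputed kernel to the whole input at once. The record does not include the kernel code or the details of the AMP FFT fix, so the following is only a pure-Python sketch of the general technique: FFT-based causal convolution with zero padding, checked against the direct sum. All function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_conv_fft(kernel, u):
    """y[t] = sum_k kernel[k] * u[t-k], computed through the FFT."""
    length = len(u)
    # Pad to a power of two >= 2 * length so circular wrap-around cannot
    # leak later inputs into earlier outputs.
    size = 1
    while size < 2 * length:
        size *= 2
    k_f = fft(list(kernel) + [0.0] * (size - len(kernel)))
    u_f = fft(list(u) + [0.0] * (size - length))
    y = fft([p * q for p, q in zip(k_f, u_f)], inverse=True)
    return [val.real / size for val in y[:length]]


def causal_conv_direct(kernel, u):
    """Reference O(L^2) causal convolution."""
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]


kernel = [0.5 ** k for k in range(6)]
u = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
fast, slow = causal_conv_fft(kernel, u), causal_conv_direct(kernel, u)
print(max(abs(p - q) for p, q in zip(fast, slow)))
```

The FFT route costs O(L log L) per sequence instead of O(L^2), which is the usual reason a kernel mode pulls ahead on long sequences once the kernel itself is cheap to generate or cached.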

_r11 Challenge Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-challenge-benchmark_r11.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: bfc6e26
  • first saved run for the challenge benchmark notebook
  • tasks:
    • permuted MNIST
    • selective copying
    • induction-style associative recall

Results:

| Task | Model | Val loss | Val accuracy | Epoch s | Forward ms | Forward tokens/s |
|---|---|---|---|---|---|---|
| permuted_mnist | baseline | 2.760003 | 0.206000 | 112.318 | 425.084 | 118038 |
| permuted_mnist | enhanced | 2.041562 | 0.232000 | 35.042 | 7.950 | 6311750 |
| selective_copying | baseline | 3.529677 | 0.039551 | 6.509 | 73.482 | 222965 |
| selective_copying | enhanced | 3.468455 | 0.029622 | 2.149 | 2.739 | 5981919 |
| induction_recall | baseline | 3.615992 | 0.040039 | 6.424 | 72.235 | 226816 |
| induction_recall | enhanced | 3.519182 | 0.033203 | 2.061 | 2.673 | 6130411 |

Interpretation:

  • Enhanced is much faster on the challenge forward benchmark because the full-sequence conv path is active.
  • Permuted MNIST slightly favors enhanced on both loss and accuracy, but both accuracies are still low.
  • Selective copying and induction recall are near random accuracy:
    • selective copying random accuracy is about 1 / 32 = 0.03125
    • induction recall random accuracy is about 1 / 32 = 0.03125
  • Enhanced often has lower CE but not consistently higher accuracy, suggesting it is learning distributional smoothing before reliable exact recall.
  • This is the clearest evidence so far that pure LTI Gamma SSM structure is not enough for Mamba-style selective memory tasks. The next model improvement should add selective input flow while keeping the fixed Gamma transition.
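To read the recall accuracies against chance, it helps to express each one as a multiple of the uniform-guess rate. A minimal sketch, assuming 32 equally likely classes as stated above (the helper name is illustrative):

```python
def accuracy_vs_chance(accuracy: float, num_classes: int = 32) -> float:
    """Observed accuracy as a multiple of uniform random guessing."""
    chance = 1.0 / num_classes  # 1 / 32 = 0.03125
    return accuracy / chance


r11_recall = {
    ("selective_copying", "baseline"): 0.039551,
    ("selective_copying", "enhanced"): 0.029622,
    ("induction_recall", "baseline"): 0.040039,
    ("induction_recall", "enhanced"): 0.033203,
}
for (task, model), acc in r11_recall.items():
    print(task, model, round(accuracy_vs_chance(acc), 2))
```

Every entry lands between about 0.9x and 1.3x chance, which is what "near random" means here.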

_r12 - Input-Selection Gate Tested

_r12 Quick Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-sinewave-benchmark_r12.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 740a9ef
  • enhanced model includes the new pre-SSM input-selection gate

Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|
| simple | baseline | 0.637628 | 2.249 | 5923 | 4449 | 5323 | 5568 | 0.000000 | 0.000000 |
| simple | enhanced | 0.043149 | 2.184 | 10904 | 2438 | 3751 | 3099 | 0.021473 | 0.001503 |
| moderate | baseline | 0.450424 | 6.747 | 5908 | 4689 | 5084 | 5381 | 0.000000 | 0.000000 |
| moderate | enhanced | 0.020161 | 6.264 | 8135 | 2357 | 3771 | 2944 | 0.076783 | 0.001143 |

Interpretation:

  • The input-selection gate did not hurt quick-task quality; enhanced still wins validation loss clearly.
  • Exact recurrent enhanced slowed slightly due to the extra gate.
  • Deployment-lite mismatch worsened on moderate, but balanced deployment remained much more faithful.

_r12 Research Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-research-benchmark_r12.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 0c6ecb8

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---|---|---|---|---|---|---|---|---|
| current_reference | baseline | 0.709749 | 7.298 | recurrent_like | 5513 | 5460 | 5212 | 5506 | 0.000000 | 0.000000 |
| current_reference | enhanced | 0.020813 | 6.753 | recurrent_like | 10218 | 2261 | 3414 | 3018 | 0.023053 | 0.000478 |
| long_context | baseline | 12.850342 | 37.374 | recurrent_like | 3720 | 3776 | 3661 | 3706 | 0.000000 | 0.000000 |
| long_context | enhanced | 0.011039 | 11.212 | conv | 229320 | 1589 | 2390 | 2043 | 0.034069 | 0.001689 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---|---|---|---|---|
| baseline | 2.943133 | 6.186983 | 486.377 | 192 | 1200 |
| enhanced | 2.489687 | 2.490702 | 12.070 | 192 | 1200 |

Interpretation:

  • Enhanced remains excellent for long-context forecasting.
  • Input-selection slightly improved long_context val loss versus _r11 (0.011039 vs 0.011593) but worsened exact recurrent speed.
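  • Token-lite strongly favors enhanced in this run, though the baseline result looks unstable: its val CE (6.186983) is far above its train CE (2.943133) and above its _r11 val CE (3.132184).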

_r12 Challenge Notebook

Saved notebook:

  • output/jupyter-notebook/gamma-s4-challenge-benchmark_r12.ipynb

Configuration:

  • device: CUDA
  • commit printed in notebook: 11bd2e6
  • same challenge tasks as _r11, with input-selection gate active in enhanced

Results:

| Task | Model | Val loss | Val accuracy | Epoch s | Forward ms | Forward tokens/s |
|---|---|---|---|---|---|---|
| permuted_mnist | baseline | 2.760003 | 0.206000 | 114.183 | 511.946 | 98010 |
| permuted_mnist | enhanced | 2.052564 | 0.209000 | 34.080 | 8.393 | 5978110 |
| selective_copying | baseline | 3.514275 | 0.028320 | 7.319 | 73.808 | 221983 |
| selective_copying | enhanced | 3.468793 | 0.029785 | 2.353 | 2.988 | 5482614 |
| induction_recall | baseline | 3.607873 | 0.038086 | 7.439 | 72.952 | 224585 |
| induction_recall | enhanced | 3.535175 | 0.039062 | 2.788 | 2.925 | 5600445 |

Interpretation:

  • The input-selection gate did not produce a meaningful challenge-task accuracy breakthrough.
  • Permuted MNIST accuracy stayed low and did not improve over _r11.
  • Selective copying and induction recall are still near random. With 32 classes, random accuracy is about 0.03125.
  • The enhanced model still has much better forward throughput and somewhat lower CE, but accuracy shows it is not performing reliable exact recall.
  • This suggests two things:
    • permuted MNIST likely needs more epochs and/or more samples
    • selective copying and induction need a stronger selective/content-dependent memory mechanism or a curriculum diagnostic, not just more epochs

Versions Not Recorded

The following are not recorded as complete benchmark versions:

  • Research notebooks before _r6: no saved research _r1 to _r5 notebooks exist in the repo.
  • Any temporary failed Colab runs during error debugging: tracebacks were discussed in chat, but they are not treated as experiment records.
  • Partial long-context ablation run in _r6: only partial output is present, so it is not summarized as a completed ablation result.

Current Best Summary

Best presentable run:

  • _r12 research benchmark

Most important results:

  • On long_context, gamma_s4_enhanced achieved much lower validation loss than baseline and substantially better full-sequence throughput.
  • _r11 shows the fixed AMP FFT conv path completing successfully and producing very high cached full-sequence throughput on long_context.
  • _r12 confirms the input-selection gate alone is not enough to solve selective copying or induction recall beyond near-random accuracy.

Current limitation:

  • gamma_s4_enhanced still trails gamma_baseline in recurrent token-by-token deployment throughput.
  • Challenge benchmarks show that the current model needs stronger selective/content-dependent memory mechanisms.

Recommended next improvement targets:

  1. Add challenge-task curriculum diagnostics and longer token-memory epochs.
  2. Explore stronger content-dependent memory beyond static LTI convolution, while preserving the fixed Gamma transition when possible.
  3. Optimize the recurrent/deployment path for gamma_s4_enhanced.
  4. Improve deployment-lite fidelity, especially on long_context.
  5. Improve structured Gamma kernel generation for the conv/full-sequence path.