	# Gamma SSM / Gamma-S4 Experiment Record

	This file records the experiment versions saved as `_rN` notebooks under `output/jupyter-notebook/`.
	It also records the later TaoNet-SSM LLM-wrapper and remote RTX benchmark iterations.

	The goal is to preserve:

	- which model version was tested
	- which notebook/task configuration was used
	- the main performance results
	- what we learned from each run

	For runs where the saved notebook does not contain enough information, the version is marked as not recorded.

	## Model Names

	- `gamma_baseline`: original Gamma SSM using the fixed lower-bidiagonal Gamma transition and recurrent execution.
	- `gamma_s4_minimal`: lighter S4-inspired Gamma block. Used in early experiments, later dropped from the main loop because it was not consistently strong.
	- `gamma_s4_enhanced`: main S4-inspired Gamma model with learned `dt`, stable discretization, `D` skip, optional gating/output path, full-sequence/kernel mode, and recurrent stepping.

	## Metrics

	- `val_loss`: validation loss for forecasting tasks. Lower is better.
	- `mean_epoch_time_s`: average training epoch time. Lower is better.
	- `full_forward_ms` / `full_latency_ms`: whole-sequence forward/inference latency. Lower is better.
	- `full_forward_tokens_per_s` / `full_tokens_per_s`: whole-sequence throughput. Higher is better.
	- `recurrent_inference_ms` / `recurrent_latency_ms`: token-by-token recurrent latency. Lower is better.
	- `recurrent_tokens_per_s`: token-by-token recurrent throughput. Higher is better.
	- `deploy_*`: deployment-lite recurrent path. For the baseline, the deployment and recurrent paths are the same, so its `deploy_*` and `recurrent_*` numbers (reported once baseline deploy metrics were enabled) measure the same code path.
	- `val_ce`: validation cross entropy for token prediction. Lower is better.
	- `val_ppl`: validation perplexity for token prediction. Lower is better.
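The two token-prediction metrics are related in the usual way: perplexity is the exponential of cross entropy. The sketch below assumes `val_ce` is a mean natural-log loss (this record does not state the log base), and the helper name is hypothetical.

```python
import math

def perplexity_from_ce(val_ce: float) -> float:
    """Convert a mean cross entropy measured in nats into perplexity."""
    return math.exp(val_ce)

# Example with an arbitrary cross-entropy value, not a number from this record.
example_ce = 2.0
print(f"val_ce={example_ce:.3f} -> val_ppl={perplexity_from_ce(example_ce):.3f}")
```

Because the mapping is monotonic, ranking models by `val_ce` or by `val_ppl` gives the same order.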

	## TaoNet-SSM LLM Wrapper Iterations

	This section records the work that moved the SSM from standalone/notebook benchmarks into the TaoNet LLM comparison loop.
	The main implementation repo for SSM changes is this repo. The TaoNet wrapper lives in the local TaoTrain repo and branch listed below.

	Related repos and branches:

	- SSM repo: `https://github.com/StarMists/gamma_SSM_S4_enhanced.git`
	- SSM local path: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM`
	- TaoTrain repo: `https://github.com/lobakkang/TaoTrain.git`
	- TaoTrain local path: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain`
	- TaoTrain branch: `codex/taonet-ssm-core`
	- Remote server path for SSM: `/home/student/YouZheng/gamma_ssm_repo`
	- Remote server path for TaoTrain: `/home/student/YouZheng/repo`
	- Remote execution tool: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge`

	### LLM Iteration 1 - Add TaoNet SSM Wrapper

	Implementation location:

	- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
	- TaoTrain: `src/taoTrain/config.py`
	- TaoTrain: `src/taoTrain/models/registry.py`
	- TaoTrain: `tests/test_taonet_ssm.py`
	- TaoTrain: `scripts/benchmark_taonet_token_variants.py`

	TaoTrain commits:

	- `8b1c6fa Add TaoNet Gamma SSM architecture`
	- `6edd09e Benchmark TaoNet token SSM variants`

	What changed:

	- Added a `taonet_ssm` model architecture for apples-to-apples comparison with original attention `taonet`.
	- Kept the outer LLM stack close to TaoNet and replaced the sequence-mixing core with an SSM mixer.
	- Supported both `gamma_s4` and `dplr` SSM cores.
	- Added token-level synthetic CE benchmark comparing `taonet` and `taonet_ssm`.
	- Added focused tests for SSM wrapper construction and forward passes.

	Local validation:

	- `python -m pytest tests\test_taonet_ssm.py -q` passed.
	- Broader TaoTrain tests were not run locally because the local environment was missing `datasets`.

	Result:

	- Functional success. This established the comparison harness.
	- Performance was not yet acceptable with full-width DPLR because the wrapper exposed dense DPLR frequency-transfer cost.

	### LLM Iteration 2 - Projected SSM Mixer Dimension

	Implementation location:

	- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
	- TaoTrain: `src/taoTrain/config.py`
	- TaoTrain: `tests/test_taonet_ssm.py`

	TaoTrain commit:

	- `5e6b802 Add projected SSM mixer dimension`

	What changed:

	- Added `ssm_mixer_dim`.
	- The SSM branch now supports `d_model -> ssm_mixer_dim -> SSM -> d_model`.
	- This keeps the LLM interface the same while reducing the DPLR channel width.
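	- This is important because DPLR convolutional training cost grows quickly with the channel dimension: at this stage the dense transfer tensor was shaped roughly `freq x channels x channels` (see LLM Iteration 4), so halving the width cuts that tensor to about a quarter.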

	Remote benchmark config examples:

	- RepoBridge projected 128: `repobridge.taonet.tokenbench.projected128.config.json`
	- RepoBridge projected 64: `repobridge.taonet.tokenbench.projected64.config.json`

	Important results before SSM-core optimization:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB | Interpretation |
	|---|---:|---:|---:|---:|---:|---|
	| attention TaoNet | 4 | 512 | about 1.24M | about 280k | about 376 | Baseline comparison point. |
	| DPLR full-width mixer 256 | 4 | 512 | about 81k | about 20k | about 6200 | Failed: dense transfer path too slow and memory-heavy. |
	| DPLR projected mixer 128 | 4 | 512 | about 214k | about 56k | about 3613 | Better memory, still much slower than attention. |
	| DPLR projected mixer 64 | 4 | 512 | about 114k | about 54k | about 2500 | Lower memory but worse forward before core optimization. |

	Result:

	- Success as an architectural control: projection made DPLR usable enough to iterate.
	- Not sufficient alone: the DPLR core still needed direct frequency-response optimization.

	### LLM Iteration 3 - Add Scripted SSM Benchmarks

	Implementation location:

	- SSM: `scripts/benchmark_ssm_variants.py`
	- SSM: `.gitignore`

	SSM commits:

	- `7a90525 Add lightweight SSM benchmark script`
	- `c0dede8 Ignore generated benchmark outputs`

	What changed:

	- Added a Python benchmark script for `baseline`, `gamma_s4`, and `dplr`.
	- Measures forward, optional forward+backward, and optional recurrent stepping.
	- Writes JSON and CSV outputs.
	- Ignored generated benchmark result directories.

	Remote raw DPLR result:

	| Model | Batch | Seq | Mode | Tok/s | Peak MB |
	|---|---:|---:|---|---:|---:|
	| DPLR raw SSM | 4 | 512 | forward | about 841k | about 1310 |
	| DPLR raw SSM | 4 | 512 | forward+backward | about 101k | about 1310 |
	| DPLR raw recurrent | 4 | 512 | recurrent | about 97k | about 10 |

	Interpretation:

	- Raw DPLR SSM was promising.
	- The wrapped LLM bottleneck came from how the DPLR convolutional path scaled under the TaoNet stack, not from DPLR itself.

	### LLM Iteration 4 - Direct DPLR Frequency-Response Application

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

	SSM commit:

	- `2b204e8 Apply DPLR frequency response directly`

	What changed:

	- Added a direct training path that applies the DPLR frequency response to the FFT input.
	- Avoided materializing the full dense transfer tensor shaped roughly `freq x channels x channels` during training/grad runs.
	- Kept the old dense transfer path for eval/no-grad caching.
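To make the memory motivation concrete, here is a rough size estimate for one dense transfer tensor of shape `freq x channels x channels`. The zero-padding to `2 * seq_len`, the real-FFT frequency count, and the 8-byte complex element are assumptions for illustration, not values taken from `s4_ternary_dplr_ssm.py`; the helper name is hypothetical.

```python
def dense_transfer_mib(seq_len: int, channels: int, bytes_per_elem: int = 8) -> float:
    """Approximate size in MiB of one freq x channels x channels tensor."""
    fft_len = 2 * seq_len       # assumed zero-padding for a causal convolution
    n_freq = fft_len // 2 + 1   # assumed real-FFT frequency count
    return n_freq * channels * channels * bytes_per_elem / 2**20

for channels in (256, 128, 64):
    print(f"channels={channels}: about {dense_transfer_mib(512, channels):.1f} MiB")
```

Under these assumptions the tensor shrinks fourfold each time the channel width halves, which is the same quadratic pressure that `ssm_mixer_dim` relieves. The estimate covers a single tensor only and is not meant to reproduce the measured Peak MB figures.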

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
	- Local CPU smoke benchmark with backward passed.

	Projected-128 remote result after this change:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| attention TaoNet | 4 | 512 | about 1.32M | about 532k | about 376 |
	| DPLR projected mixer 128 | 4 | 512 | about 151k | about 91k | about 508 |

	Interpretation:

	- Major memory success: projected-128 DPLR dropped from about 3613 MB to about 508 MB.
	- Training throughput improved from about 56k to about 91k tok/s.
	- Forward-only became slower than the previous projected-128 run, so this change helped training/backward much more than no-grad forward timing.

	### LLM Iteration 5 - Specialize Rank-One DPLR Solve

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

	SSM commit:

	- `5a0abad Specialize rank-one DPLR solve`

	What changed:

	- Current best DPLR configuration uses `rank=1`.
	- Replaced the batched `torch.linalg.inv` for `1 x 1` low-rank systems with scalar reciprocal math.
	- Applied the specialization to both direct training and cached dense response paths.
	- Left the general rank path intact for `rank > 1`.
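The idea behind the `rank=1` specialization can be illustrated with the Sherman-Morrison formula: when the low-rank correction has rank one, the inner system is a single scalar, so its inverse is a reciprocal. The plain-Python sketch below is a generic instance of that identity, not the code in `s4_ternary_dplr_ssm.py`; the function name, the diagonal-plus-rank-one layout, and the sample values are assumptions.

```python
def rank_one_solve(d, p, q, b):
    """Solve (diag(d) + p q^T) x = b with a scalar reciprocal (Sherman-Morrison)."""
    dinv_b = [bi / di for bi, di in zip(b, d)]
    dinv_p = [pi / di for pi, di in zip(p, d)]
    # The "1 x 1 system": one scalar, so no matrix inverse is needed.
    denom = 1 + sum(qi * v for qi, v in zip(q, dinv_p))
    scale = sum(qi * v for qi, v in zip(q, dinv_b)) / denom
    return [u - scale * v for u, v in zip(dinv_b, dinv_p)]

d = [2 + 1j, 3 - 1j, 1 + 2j]
p = [1 + 0j, 0.5j, -1 + 0j]
q = [0.5 + 0j, 1 + 0j, 0.25j]
b = [1 + 0j, 2 + 0j, 3 + 0j]
x = rank_one_solve(d, p, q, b)

# Check by applying the original matrix: diag(d) x + p (q^T x) should equal b.
qx = sum(qi * xi for qi, xi in zip(q, x))
residual = max(abs(di * xi + pi * qx - bi) for di, xi, pi, bi in zip(d, x, p, b))
print(f"max residual: {residual:.2e}")
```

For `rank > 1` the scalar `denom` becomes a small matrix, which is why a general path still needs a real solve or inverse.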

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
	- Local CPU smoke benchmark with backward passed.

	Projected-128 remote result:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| attention TaoNet | 4 | 512 | about 1.33M | about 545k | about 376 |
	| DPLR projected mixer 128 | 4 | 512 | about 485k | about 142k | about 508 |

	Projected-64 remote result after this change:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR projected mixer 64 | 4 | 512 | about 618k | about 192k | about 494 |

	Scaling probe for projected-64:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| attention TaoNet | 16 | 512 | about 1.16M | about 990k | about 1332 |
	| DPLR projected mixer 64 | 16 | 512 | about 2.12M | about 702k | about 1684 |

	Interpretation:

	- Major success.
	- DPLR projected-64 became the best current SSM LLM configuration.
	- At batch 16, DPLR projected-64 forward throughput exceeded attention in this synthetic benchmark.
	- Backward was still behind attention, but the gap narrowed substantially.
	- The SSM now scales much better with batch size, suggesting fixed frequency-response overhead is being amortized.

	### LLM Iteration 6 - Precompose Finite Response Projection

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

	SSM commits:

	- `f09a71b Precompose DPLR finite response projection`
	- `648a32e Revert "Precompose DPLR finite response projection"`

	What changed:

	- Tried replacing `C @ (I - z^L A^L) @ response` with two projected terms:
	  - `C @ response`
	  - `(C @ A^L) @ response`
	- The goal was to remove one batch/frequency hidden-state multiplication from the direct path.
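The attempted rewrite relies on plain linearity. Writing $r$ for `response` and treating $z^{L}$ as a per-frequency scalar (an assumption consistent with the expression above), the identity is:

$$
C\,\bigl(I - z^{L} A^{L}\bigr)\,r \;=\; C\,r \;-\; z^{L}\,\bigl(C A^{L}\bigr)\,r
$$

The product $C A^{L}$ does not depend on the batch or the frequency, so it can be formed once and reused; whether that pays off in practice is an empirical question, which the remote result below answers for this setup.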

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
	- Local smoke benchmark passed.
	- Direct-vs-cached convolution comparison had max absolute difference around `2.4e-7`.

	Remote result:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
	| DPLR projected mixer 64 with this change | 4 | 512 | about 495k | about 162k | about 478 |

	Interpretation:

	- Failed on real GPU token benchmark.
	- It saved a little memory but reduced speed too much.
	- The commit was intentionally reverted, so current SSM `main` is back to the best-performing rank-one direct-response core.

	### LLM Iteration 7 - Rank-One Matmul Fast Path

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

	SSM commits:

	- `43de801 Use matmul fast path for rank-one DPLR`
	- `9ffa5a7 Gate rank-one matmul path by batch size`
	- `4e130b6 Limit rank-one matmul path to small batches`
	- `8969916 Revert "Limit rank-one matmul path to small batches"`
	- `5b3a957 Revert "Gate rank-one matmul path by batch size"`
	- `a46a2af Revert "Use matmul fast path for rank-one DPLR"`

	What changed:

	- Tried a deeper `rank=1` direct-application specialization.
	- Replaced several generic `einsum` operations with batched `matmul` and vector reductions.
	- The goal was to reduce Python/operator overhead and improve backward throughput for the current best DPLR rank.
	- A follow-up tried to gate the path by batch size after the batch-16 scaling run regressed.

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
	- Local CPU smoke benchmark passed.
	- Direct-vs-cached convolution comparison had max absolute difference around `2.4e-7`.

	Remote result:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
	| DPLR projected mixer 64 first matmul run | 4 | 512 | about 643k | about 208k | about 494 |
	| DPLR projected mixer 64 repeated small-batch gated run | 4 | 512 | about 470k-472k | about 161k-175k | about 494 |
	| DPLR projected mixer 64 matmul at batch 16 | 16 | 512 | about 1.47M | about 388k | about 1684 |
	| DPLR projected mixer 64 previous best at batch 16 | 16 | 512 | about 2.12M | about 702k | about 1684 |

	Interpretation:

	- Failed overall.
	- The first batch-4 run looked promising, but repeated remote results were worse.
	- The matmul formulation regressed the larger-batch scaling behavior that matters for GPU utilization.
	- All matmul fast-path commits were reverted, so current SSM `main` returns to the best-known `5a0abad` rank-one scalar-solve behavior plus the experiment-record commits.

	### LLM Iteration 8 - TileLang Capability Detection

	Implementation location:

	- SSM: `csrc/tilelang/selective_scan.py`
	- SSM: `csrc/tilelang/__init__.py`
	- SSM: `gamma_space_model/ops/selective_scan_interface.py`
	- SSM: `gamma_space_model/modules/ssm_gamma.py`
	- SSM: `scripts/diagnose_tilelang_acceleration.py`

	SSM commit:

	- `4784856 Make TileLang acceleration detection explicit`

	What changed:

	- Made TileLang capability reporting explicit and conservative.
	- Before this change, `HAS_TILELANG_OPS` became true whenever the Python fallback module imported.
	- That was misleading because `csrc/tilelang` did not actually dispatch to a real TileLang kernel; it used PyTorch fallback code.
	- Added `TILELANG_BACKEND` and `HAS_TILELANG_ACCELERATION` flags.
	- Added `scripts/diagnose_tilelang_acceleration.py` to print package availability, repo backend flags, and a small Gamma forward timing.
	- Fixed `SSMGamma.step` dtype/device casting after the honest fallback path exposed a float64 failure in the normal PyTorch path.
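A conservative capability check of this kind can be sketched with the standard library alone. The flag names mirror the ones listed above, but the function, its `has_real_kernel` argument, and the `"tilelang_kernel"` label are hypothetical; only `pytorch_fallback` appears in this record.

```python
import importlib.util

def detect_tilelang_backend(package: str = "tilelang", has_real_kernel: bool = False) -> dict:
    """Report acceleration only if the package exists AND a real kernel is wired up."""
    package_available = importlib.util.find_spec(package) is not None
    accelerated = package_available and has_real_kernel
    return {
        "tilelang_available": package_available,
        "HAS_TILELANG_ACCELERATION": accelerated,
        # "tilelang_kernel" is an illustrative label, not a value from the repo.
        "TILELANG_BACKEND": "tilelang_kernel" if accelerated else "pytorch_fallback",
    }

print(detect_tilelang_backend())
```

The point of the design is that a successful import of fallback code is never enough to set the acceleration flag; both conditions must hold.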

	Validation:

	- `python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q` passed locally: `22 passed`.
	- Local diagnostic reported:
	  - `has_tilelang_ops=false`
	  - `tilelang_backend=pytorch_fallback`
	  - `triton_available=false`
	  - `tilelang_available=false`

	Remote RTX 5090 diagnostic:

	| Field | Value |
	|---|---|
	| Torch | `2.11.0+cu130` |
	| CUDA | available |
	| GPU | `NVIDIA GeForce RTX 5090` |
	| Triton package | available |
	| TileLang package | not available |
	| Repo `HAS_TILELANG_OPS` | `false` |
	| Repo `TILELANG_BACKEND` | `pytorch_fallback` |
	| Gamma fallback forward | about `76.7k` tok/s at batch 4, seq 512, bf16 |

	Remote raw SSM benchmark after this change:

	| Model | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR raw SSM | 4 | 512 | about 3.16M | about 1.03M | about 57 |
	| Gamma-S4 raw SSM | 4 | 512 | about 100.6k | about 45.5k | about 467 |
	| Baseline Gamma raw SSM | 4 | 512 | about 85.2k | about 32.3k | about 120 |

	Interpretation:

	- This iteration did not add a real TileLang kernel yet.
	- It fixed an important measurement and dispatch problem: fallback code is no longer reported as hardware acceleration.
	- The remote server has Triton installed but does not have the TileLang package installed.
	- The current DPLR path is frequency-domain PyTorch/cuBLAS and does not use `csrc/tilelang`.
	- The next hardware-acceleration step should be explicit: either install/use real TileLang on the remote server or write a Triton/TileLang kernel for a clearly scoped hot path. The best candidate hot path is not the old baseline Gamma fallback; it is the DPLR direct frequency-response/backward path used by `taonet_ssm`.

	### LLM Iteration 9 - DPLR Frequency-Path Profiling And Root Cache

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`
	- SSM: `scripts/profile_dplr_frequency_path.py`

	SSM commit:

	- `92643c5 Cache DPLR frequency roots`

	What changed:

	- Added a per-module cache for FFT roots and `roots^seq_len`.
	- These tensors are constants for a given `(seq_len, fft_len, dtype, device)`, so rebuilding them every forward/layer is unnecessary GPU work.
	- Added `scripts/profile_dplr_frequency_path.py` to profile the DPLR convolutional path directly on the remote server.

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed locally.
	- `python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q` passed locally: `22 passed`.
	- Local profiler smoke passed and showed `frequency_grid_cache_entries=1`.

	Remote profiler result for raw DPLR at batch 4, seq 512, d_model 64, hidden_dim 256:

	| Mode | Mean ms | Tok/s | Peak MB |
	|---|---:|---:|---:|
	| forward | about 2.58 | about 793k | not measured |
	| forward+backward | about 3.27 | about 626k | about 52 |
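The throughput column follows from the latency column if tok/s is taken as tokens per step divided by mean step time. That formula is an assumption about how the profiler script reports its numbers, but it reproduces the table's values; the helper name is hypothetical.

```python
def tokens_per_second(batch: int, seq_len: int, mean_ms: float) -> float:
    """Tokens processed per second for one step of batch x seq_len tokens."""
    return batch * seq_len / (mean_ms / 1000.0)

forward_tps = tokens_per_second(4, 512, 2.58)
forward_backward_tps = tokens_per_second(4, 512, 3.27)
print(f"forward: {forward_tps / 1e3:.0f}k tok/s")
print(f"forward+backward: {forward_backward_tps / 1e3:.0f}k tok/s")
```

This gives roughly 794k and 626k tok/s, in line with the rounded figures in the table.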

	Remote profiler interpretation:

	- The largest CUDA entries were `aten::bmm`, `aten::mm`, and their backward paths.
	- `aten::linalg_matrix_power` was visible but small in this configuration.
	- Root generation was not the dominant cost, so the cache is a modest cleanup rather than a major acceleration.
	- A future TileLang/Triton kernel should target fused rank-1 DPLR frequency-response application and its backward, especially around the small complex BMM/MM pattern. Replacing the old Gamma Python fallback is not the right priority for the TaoNet-SSM goal.

	TaoNet projected-64 check after this change:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR projected mixer 64 | 4 | 512 | about 656k | about 163k | about 494 |

	Scaling probe after this change:

	| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|
	| DPLR projected mixer 64 | 8 | 512 | about 983k | about 341k | about 889 |
	| DPLR projected mixer 64 | 16 | 512 | about 1.03M | about 414k | about 1684 |

	Interpretation:

	- The root cache is correct and removes repeated constant construction.
	- End-to-end results remain noisy; this is not a breakthrough optimization.
	- The main value of this iteration is the profiler evidence: the next real hardware acceleration should fuse the DPLR rank-1 complex frequency-response operations, not spend effort on the older baseline Gamma fallback path.

	### LLM Iteration 10 - Shared DPLR Frequency Grid Cache

	Implementation location:

	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

	SSM commit:

	- `a9e5d3e Share DPLR frequency grid cache`

	What changed:

	- Promoted the DPLR FFT root cache from per-module to class-level shared cache.
	- The previous cache avoided rebuilding roots inside a single SSM module, but a multi-layer TaoNet creates one SSM module per layer.
	- The shared cache lets all layers reuse the same `(roots, roots^seq_len)` tensors for a given `(seq_len, fft_len, dtype, device)`.
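The per-module versus class-level distinction can be shown with a small plain-Python sketch. The class name, the string placeholders for `dtype` and `device`, and the root convention (`exp(-2*pi*i*k/fft_len)` on a real-FFT grid) are assumptions for illustration; the repo's actual tensors and sign convention are not shown in this record.

```python
import cmath

class FrequencyGridCache:
    """Hypothetical stand-in for a class-level (shared) frequency-grid cache."""
    _shared = {}  # class attribute: one dict for every instance, i.e. every layer

    def get(self, seq_len, fft_len, dtype="complex64", device="cpu"):
        key = (seq_len, fft_len, dtype, device)
        if key not in self._shared:
            # Assumed convention: unit roots on the real-FFT frequency grid.
            roots = [cmath.exp(-2j * cmath.pi * k / fft_len) for k in range(fft_len // 2 + 1)]
            roots_pow = [r ** seq_len for r in roots]
            self._shared[key] = (roots, roots_pow)
        return self._shared[key]

# Four "layers" request the same grid; only the first call builds it.
layers = [FrequencyGridCache() for _ in range(4)]
grids = [layer.get(seq_len=512, fft_len=1024) for layer in layers]
print(len(FrequencyGridCache._shared))
```

With a per-instance dictionary instead of the class attribute, each of the four layers would build and hold its own copy of the same constants.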

	Validation:

	- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed locally.
	- Local `scripts/profile_dplr_frequency_path.py` smoke passed and still reported one frequency-grid cache entry.

	Required TaoNet comparison after this iteration:

	Remote benchmark:

	- RepoBridge run: `taonet-vs-dplr-proj64-shared-grid-bench-20260429-101304`
	- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

	| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
	|---|---:|---:|---|---:|---:|---:|
	| attention TaoNet | 4 | 512 | forward | about 1.30M | about 193 | 9.064 |
	| attention TaoNet | 4 | 512 | forward+backward | about 513k | about 376 | 9.064 |
	| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward | about 499k | about 190 | 9.059 |
	| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward+backward | about 162k | about 492 | 9.059 |

	Comparison:

	- SSM forward throughput was about `38%` of attention at batch 4, seq 512.
	- SSM forward+backward throughput was about `32%` of attention.
	- SSM forward memory was slightly lower than attention, but backward peak memory was higher.
	- Loss was comparable because this is a random synthetic token benchmark, not a trained quality result.

	Interpretation:

	- The shared cache is correct and small, but it did not create a clear end-to-end speed breakthrough.
	- This reinforces the profiler conclusion: constant/root setup is not the dominant TaoNet-SSM bottleneck.
	- Future iterations should include the attention-vs-SSM table directly, and hardware work should focus on the DPLR rank-1 complex BMM/MM and backward pattern.

	### LLM Iteration 11 - Re-anchor On Projected-64 Scaling Regime

	Reason for this iteration:

	- The strongest previous result came from a scaling probe, not from batch-4 timing.
	- Later iterations over-emphasized batch 4, which made the SSM look worse and encouraged the wrong optimization target.
	- This iteration re-established the primary benchmark as attention TaoNet vs SSM TaoNet under larger projected-64 batches.

	Implementation change:

	- No model-code change.
	- Benchmark-policy change: projected-64 scaling comparisons should be treated as primary acceptance tests for throughput work.

	Remote benchmark:

	- RepoBridge run: `taonet-token-dplr-proj64-scale-bench-20260429-111150`
	- RepoBridge run: `taonet-token-dplr-proj64-extended-scale-bench-20260429-111350`
	- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

	Required TaoNet comparison:

	| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
	|---|---:|---:|---|---:|---:|---:|
	| attention TaoNet | 8 | 512 | forward | about 1.09M | about 319 | 9.061 |
	| attention TaoNet | 8 | 512 | forward+backward | about 468k | about 697 | 9.061 |
	| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward | about 1.17M | about 320 | 9.058 |
	| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 318k | about 889 | 9.058 |
	| attention TaoNet | 16 | 512 | forward | about 1.60M | about 596 | 9.059 |
	| attention TaoNet | 16 | 512 | forward+backward | about 503k | about 1332 | 9.059 |
	| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward | about 1.03M | about 580 | 9.060 |
	| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 427k | about 1684 | 9.060 |
	| attention TaoNet | 32 | 512 | forward | about 1.88M | about 1124 | 9.062 |
	| attention TaoNet | 32 | 512 | forward+backward | about 632k | about 2590 | 9.062 |
	| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward | about 2.62M | about 1100 | 9.061 |
	| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 705k | about 3273 | 9.061 |
	| attention TaoNet | 64 | 512 | forward | about 3.53M | about 2204 | 9.061 |
	| attention TaoNet | 64 | 512 | forward+backward | about 683k | about 5121 | 9.061 |
	| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward | about 1.30M | about 2140 | 9.060 |
	| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 618k | about 6451 | 9.060 |

	Comparison:

	- Batch 8: SSM forward is slightly faster than attention, but backward is slower.
	- Batch 16: SSM backward is closer to attention than in batch-4 runs, but still slower.
	- Batch 32: SSM beats attention in both forward and forward+backward throughput in this run.
	- Batch 64: SSM falls off sharply, so the useful scaling point is not simply the largest batch.
	- SSM backward memory remains higher than attention, especially at larger batches.
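The sweet-spot reading can be checked by computing the SSM-to-attention ratio per batch size. The figures below are the rounded forward+backward tok/s values (in thousands) from the comparison table in this iteration; the variable names are illustrative.

```python
# Forward+backward throughput in thousands of tok/s, copied from the table above.
fwd_bwd_ktok = {
    8: {"attention": 468, "ssm": 318},
    16: {"attention": 503, "ssm": 427},
    32: {"attention": 632, "ssm": 705},
    64: {"attention": 683, "ssm": 618},
}

# A ratio above 1.0 means the SSM variant trained faster than attention.
ratios = {batch: row["ssm"] / row["attention"] for batch, row in fwd_bwd_ktok.items()}
best_batch = max(ratios, key=ratios.get)

for batch, ratio in ratios.items():
    print(f"batch {batch}: SSM/attention = {ratio:.2f}")
print(f"best batch: {best_batch}")
```

Only batch 32 crosses 1.0 in this run, and the ratio drops again at batch 64, which is why the largest batch is not automatically the best operating point.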

	Interpretation:

	- The projected-64 DPLR SSM should be optimized and evaluated around the scaling sweet spot, currently batch 32 for this synthetic benchmark on the RTX 5090.
	- Batch-4 timing is still useful for smoke tests, but it should not be treated as the main performance target.
	- This is a configuration-level breakthrough: SSM can outperform attention at the right batch size even before custom TileLang/Triton kernels.
	- Next improvement directions should either preserve or improve the batch-32 scaling result, not merely improve batch-4 microbenchmarks.

	### LLM Iteration 12 - Token Accuracy Benchmark And Causal Memory Check

	Reason for this iteration:

	- Throughput alone is not sufficient; the SSM TaoNet must also learn useful token tasks.
	- The benchmark script previously reported only random synthetic CE, which is not an inference accuracy signal.
	- This iteration adds lightweight trained token tasks and reports `eval_accuracy`.

	Implementation location:

	- TaoTrain: `scripts/benchmark_taonet_token_variants.py`

	TaoTrain commit:

	- `59b84cd Add token task accuracy benchmark`

	What changed:

	- Added `--token-task` with:
	  - `random`: original random next-token timing task
	  - `increment`: deterministic token mapping, label is current token plus one modulo vocab
	  - `previous`: causal memory task, label is the previous token
	- Added optional short training with `--train-steps`, `--learning-rate`, `--weight-decay`.
	- Added eval metrics:
	  - `eval_loss`
	  - `eval_accuracy`
	  - `train_final_loss`
	  - `train_seconds`
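The three task definitions can be written down as a small label generator. This is a plain-Python sketch, not the code in `benchmark_taonet_token_variants.py`: the function name is hypothetical, the `random` task is modeled here as independent random labels, and the label used at position 0 of the `previous` task (which has no predecessor) is an assumption.

```python
import random

def make_token_task(task: str, seq_len: int, vocab: int, seed: int = 0):
    """Return (tokens, labels) for one sequence of the named synthetic task."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab) for _ in range(seq_len)]
    if task == "random":
        # Assumed: labels carry no learnable signal, so this is timing-only.
        labels = [rng.randrange(vocab) for _ in range(seq_len)]
    elif task == "increment":
        labels = [(t + 1) % vocab for t in tokens]
    elif task == "previous":
        # Position 0 has no previous token; using 0 there is an assumption.
        labels = [0] + tokens[:-1]
    else:
        raise ValueError(f"unknown token task: {task}")
    return tokens, labels

tokens, labels = make_token_task("previous", seq_len=8, vocab=128)
print(tokens)
print(labels)
```

`increment` can be solved position by position, while `previous` requires the mixer to move information across one time step, which is what makes it a causal memory check.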

	Validation:

	- Local TaoTrain smoke passed on CPU.
	- `python -m pytest tests\test_taonet_ssm.py -q` passed locally.

	Broad speed comparison after adding accuracy columns:

	- RepoBridge run: `taonet-vs-dplr-proj64-broad-speed-bench-20260429-112432`
	- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, random token task, batch sweep

	| Architecture | Batch | Seq | Mode | Tok/s | Peak MB |
	|---|---:|---:|---|---:|---:|
	| attention TaoNet | 8 | 512 | forward+backward | about 873k | about 697 |
	| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 244k | about 956 |
	| attention TaoNet | 16 | 512 | forward+backward | about 589k | about 1332 |
	| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 646k | about 1748 |
	| attention TaoNet | 32 | 512 | forward+backward | about 680k | about 2592 |
	| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 816k | about 3338 |
	| attention TaoNet | 64 | 512 | forward+backward | about 763k | about 5121 |
	| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 544k | about 6516 |

	Speed interpretation:

	- SSM remains well behind attention at batch 8 (about 244k vs about 873k tok/s forward+backward).
	- SSM wins forward+backward throughput at batch 16 and batch 32 in this run.
	- SSM falls off again at batch 64.
	- Batch 16-32 remains the useful projected-64 scaling range.

	Token accuracy comparison:

	- Previous-token task run: `taonet-vs-dplr-proj64-previous-token-quality-20260429-112456`
	- Linear/ungated SSM ablation run: `taonet-vs-dplr-proj64-previous-token-linear-quality-20260429-112623`
	- Increment task run: `taonet-vs-dplr-proj64-increment-token-quality-20260429-112719`
	- All quality runs used batch 32, seq 128, vocab 128, 100 train steps, bf16.

	| Task | Architecture | Eval loss | Eval accuracy | Forward+backward tok/s |
	|---|---|---:|---:|---:|
	| previous | attention TaoNet | about 0.033 | about 0.999 | about 551k |
	| previous | SSM TaoNet, DPLR projected 64 | about 4.858 | about 0.009 | about 292k |
	| previous, linear ungated SSM | attention TaoNet | about 0.046 | about 0.999 | about 990k |
	| previous, linear ungated SSM | SSM TaoNet, DPLR projected 64 | about 4.626 | about 0.026 | about 346k |
	| increment | attention TaoNet | about 0.007 | 1.000 | about 1.09M |
	| increment | SSM TaoNet, DPLR projected 64 | about 0.009 | 1.000 | about 344k |

	Failed SSM-core improvement:

	- SSM commit `2e974c9 Add delayed DPLR skip` added a learnable one-step diagonal delayed skip to help causal memory.
	- Remote previous-token run after this change: `taonet-vs-dplr-proj64-previous-token-quality-20260429-112953`.
	- Result: SSM remained near random, eval accuracy about `0.008`, and speed worsened.
	- The change was reverted by `3fc0575 Revert "Add delayed DPLR skip"`.

	Interpretation:

	- Projected-64 DPLR SSM can learn simple token mappings (`increment`) to perfect accuracy.
	- It currently fails a short causal memory/copy task (`previous`) under the same 100-step setting where attention TaoNet reaches about 99.9% accuracy.
	- The failure is not solved by removing SSM activation/gates or by a simple delayed diagonal skip.
	- Future improvements must include both:
	  - speed comparison across batch 8/16/32/64
	  - trained token accuracy, especially on causal memory tasks
	- The next quality-focused direction should investigate the SSM wrapper/core's ability to expose previous-token information, not only low-level GPU speed.
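For reference, the chance level for these quality runs can be computed from the vocabulary size alone. This assumes labels are roughly uniform over the 128-token vocabulary and that the loss is natural-log cross entropy; neither is stated explicitly in this record.

```python
import math

vocab = 128
chance_accuracy = 1 / vocab      # probability of a correct uniform guess
chance_loss = math.log(vocab)    # cross entropy of a uniform prediction, in nats

print(f"chance accuracy: {chance_accuracy:.4f}")  # 0.0078
print(f"chance loss: {chance_loss:.3f}")          # 4.852
```

An eval loss of about 4.858 with accuracy of about 0.009 is therefore essentially indistinguishable from guessing, whereas the `increment` results sit far below the chance loss.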

	### LLM Iteration 13 - Local Shift Register For Causal Token Memory

	Reason for this iteration:

	- Projected-64 was the strongest SSM speed configuration, but it failed the `previous` token-memory task.
	- Capacity probes showed the failure was not caused by the projected-64 bottleneck alone:
	  - projected-128 SSM eval accuracy stayed near random, about `0.007`
	  - full-width projected-256 SSM eval accuracy stayed near random, about `0.008`
	- The next improvement therefore targeted explicit short causal memory while preserving the DPLR SSM as the main sequence mixer.

	Implementation location:

	- TaoTrain commit: `bb3bf90 Add SSM local shift mixer option`
	- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
	- TaoTrain: `src/taoTrain/config.py`
	- TaoTrain: `scripts/benchmark_taonet_token_variants.py`
	- TaoTrain: `tests/test_taonet_ssm.py`

	What changed:

	- Added opt-in `ssm_local_shift`.
	- The SSM mixer can now add a one-token causal shift/register branch:
	  - `shifted[:, 1:] = x_norm[:, :-1]`
	  - output contribution is controlled by a single learned scalar `ssm_local_shift_init`.
	- The branch is deliberately cheap and ternary-friendly in structure: it is a causal shift plus scalar gain, not another dense attention mechanism.
	- The default remains off, so older SSM benchmarks are still comparable.
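The shift branch is simple enough to sketch without PyTorch. The version below works on one batch element stored as nested lists; the function name, the zero fill at position 0, and adding the scaled branch to the SSM output are assumptions about how the wrapper combines the two paths.

```python
def local_shift_mix(x_norm, ssm_out, gain):
    """Add a one-token causal shift of x_norm, scaled by gain, to ssm_out.

    x_norm and ssm_out are [seq][dim] nested lists for a single batch element.
    """
    dim = len(x_norm[0])
    # List equivalent of shifted[:, 1:] = x_norm[:, :-1], with zeros at position 0.
    shifted = [[0.0] * dim] + [row[:] for row in x_norm[:-1]]
    return [
        [y + gain * s for y, s in zip(y_row, s_row)]
        for y_row, s_row in zip(ssm_out, shifted)
    ]

x_norm = [[1.0], [2.0], [3.0]]
ssm_out = [[0.0], [0.0], [0.0]]
print(local_shift_mix(x_norm, ssm_out, gain=0.5))
```

Each output position sees only the input from the position before it, so the branch stays causal and costs one shift and one scalar multiply per element.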

	Validation:

	- Local TaoTrain:
	  - `PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q` passed, `4 passed`.
	  - CPU smoke for `benchmark_taonet_token_variants.py --ssm-local-shift` passed.

	Capacity diagnostic before the change:

	| Architecture | Mixer dim | Batch | Seq | Eval loss | Eval accuracy | Forward+backward tok/s |
	|---|---:|---:|---:|---:|---:|---:|
	| attention TaoNet | n/a | 32 | 128 | about 0.025 | 1.000 | about 1.04M |
	| SSM TaoNet, DPLR | 128 | 32 | 128 | about 4.857 | about 0.007 | about 379k |
	| attention TaoNet | n/a | 32 | 128 | about 0.042 | about 0.999 | about 556k |
	| SSM TaoNet, DPLR | 256 | 32 | 128 | about 4.856 | about 0.008 | about 389k |

	Required TaoNet comparison after the change:

	- RepoBridge run: `taonet-vs-dplr-proj64-local-shift-previous-quality-20260429-144930`
	- RepoBridge broad run: `taonet-vs-dplr-proj64-local-shift-previous-broad-quality-20260429-145014`
	- Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled.

	| Architecture | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB |
	|---|---:|---:|---:|---:|---:|---:|
	| attention TaoNet | 8 | about 4.376 | about 0.096 | about 695k | about 238k | about 103 |
	| SSM TaoNet, DPLR projected 64 + local shift | 8 | about 0.010 | 1.000 | about 226k | about 89k | about 181 |
	| attention TaoNet | 16 | about 1.048 | about 0.847 | about 1.26M | about 508k | about 166 |
	| SSM TaoNet, DPLR projected 64 + local shift | 16 | about 0.008 | 1.000 | about 520k | about 189k | about 299 |
	| attention TaoNet | 32 | about 0.043 | 1.000 | about 2.54M | about 555k | about 297 |
	| SSM TaoNet, DPLR projected 64 + local shift | 32 | about 0.008 | 1.000 | about 1.16M | about 353k | about 513 |
	| attention TaoNet | 64 | about 0.020 | 1.000 | about 4.75M | about 1.73M | about 553 |
	| SSM TaoNet, DPLR projected 64 + local shift | 64 | about 0.007 | 1.000 | about 2.43M | about 403k | about 956 |

	Interpretation:

	- Success: this is the first projected-64 SSM TaoNet result that solves the `previous` causal-memory task.
	- The result is not only a batch-32 spot check. SSM reached perfect eval accuracy at batch 8, 16, 32, and 64.
	- The quality gain is large: plain projected-64, projected-128, and projected-256 DPLR all stayed near random on the same task.
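	- Speed tradeoff: local-shift SSM is slower than attention on this short-sequence previous-token benchmark in both forward and forward+backward throughput. The gap is widest for forward+backward at batch 64, about `403k` vs about `1.73M` tok/s.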
	- This should be treated as a quality architecture fix, not a hardware-acceleration fix. The next hardware iteration should still target fused DPLR frequency/backward kernels.

	### LLM Iteration 14 - Explicit DPLR Transfer-Mode Probe

	Reason for this iteration:

	- After the local-shift quality fix, the next bottleneck was speed.
	- The DPLR direct frequency path applies the finite correction to batch-dependent hidden responses.
	- A possible alternative was to materialize the full frequency transfer matrix, then multiply by the input FFT.
	- This could be faster for some batch/sequence shapes, but it risks high memory and repeated transfer construction.

	Implementation location:

	- SSM commit: `749a4cf Add DPLR transfer profiling mode`
	- SSM commit: `e34b67c Add DPLR conv transfer mode`
	- TaoTrain commit: `ceb08e6 Expose SSM conv transfer mode`
	- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`
	- SSM: `scripts/profile_dplr_frequency_path.py`
	- TaoTrain: `scripts/benchmark_taonet_token_variants.py`

	What changed:

	- Added profiler support for comparing:
	  - direct DPLR frequency response application
	  - materialized transfer matrix application
	- Added explicit `kernel_mode="conv_transfer"` to `S4TernaryDPLRSSM`.
	- Exposed `conv_transfer` through TaoTrain config/benchmark CLI.
	- The mode is opt-in only. The default/recommended projected-64 path remains `conv`.

	Local validation:

	- SSM: `python -m pytest tests\test_s4_ternary_dplr_ssm.py tests\test_ssm_gamma.py -q` passed, `23 passed`.
	- TaoTrain: `PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q` passed, `5 passed`.

	Isolated SSM-core remote profile:

	- RepoBridge run: `ssm-dplr-direct-vs-transfer-s128-profile-20260429-145555`
	- Config: DPLR state/mixer dim 64, hidden dim 256, seq 128, bf16, rank 1.

	\| Method \| Batch \| Forward tok/s \| Forward+backward tok/s \| Peak MB \| Interpretation \|
	\|---\|---:\|---:\|---:\|---:\|---\|
	\| direct \| 8 \| about 555k \| about 332k \| about 34 \| baseline direct path \|
	\| transfer \| 8 \| about 1.24M \| about 440k \| about 247 \| faster but much higher memory \|
	\| direct \| 16 \| about 790k \| about 1.13M \| about 47 \| direct wins \|
	\| transfer \| 16 \| about 737k \| about 481k \| about 248 \| transfer loses \|
	\| direct \| 32 \| about 6.73M \| about 1.86M \| about 74 \| direct wins \|
	\| transfer \| 32 \| about 4.89M \| about 1.68M \| about 250 \| transfer loses \|
	\| direct \| 64 \| about 6.90M \| about 2.20M \| about 128 \| baseline direct path \|
	\| transfer \| 64 \| about 2.93M \| about 3.06M \| about 253 \| backward faster, forward slower \|

	TaoNet comparison after exposing `conv_transfer`:

	- RepoBridge run: `taonet-vs-dplr-proj64-local-shift-conv-transfer-previous-broad-quality-20260429-145946`
	- Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, `ssm_kernel_mode=conv_transfer`.

	\| Architecture \| Batch \| Eval loss \| Eval accuracy \| Forward tok/s \| Forward+backward tok/s \| Peak MB \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| 8 \| about 4.431 \| about 0.082 \| about 699k \| about 279k \| about 103 \|
	\| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer \| 8 \| about 0.010 \| 1.000 \| about 52k \| about 14k \| about 195 \|
	\| attention TaoNet \| 16 \| about 1.098 \| about 0.862 \| about 1.24M \| about 303k \| about 166 \|
	\| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer \| 16 \| about 0.008 \| 1.000 \| about 79k \| about 24k \| about 270 \|
	\| attention TaoNet \| 32 \| about 0.061 \| about 0.998 \| about 1.24M \| about 674k \| about 297 \|
	\| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer \| 32 \| about 0.007 \| 1.000 \| about 157k \| about 45k \| about 420 \|
	\| attention TaoNet \| 64 \| about 0.015 \| 1.000 \| about 4.48M \| about 1.05M \| about 553 \|
	\| SSM TaoNet, DPLR projected 64 + local shift + conv_transfer \| 64 \| about 0.007 \| 1.000 \| about 370k \| about 97k \| about 719 \|

	Comparison to the previous direct-conv local-shift run:

	\| Batch \| Direct-conv SSM forward+backward tok/s \| Transfer-mode SSM forward+backward tok/s \|
	\|---:\|---:\|---:\|
	\| 8 \| about 89k \| about 14k \|
	\| 16 \| about 189k \| about 24k \|
	\| 32 \| about 353k \| about 45k \|
	\| 64 \| about 403k \| about 97k \|
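
The size of the regression can be read off the table above. The short calculation below copies the four forward+backward figures and computes the slowdown factor per batch size. Since the table values are rounded ("about"), the ratios are approximate.

```python
# Forward+backward throughput in thousands of tok/s, copied from the table above.
direct_conv = {8: 89, 16: 189, 32: 353, 64: 403}
conv_transfer = {8: 14, 16: 24, 32: 45, 64: 97}

# Slowdown factor of transfer mode relative to the direct conv path.
slowdown = {b: direct_conv[b] / conv_transfer[b] for b in direct_conv}

for batch, ratio in slowdown.items():
    print(f"batch {batch}: conv_transfer is about {ratio:.1f}x slower")
```

The factors fall between roughly 4x and 8x, with the smallest gap at batch 64.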

	Interpretation:

	- Failed as an end-to-end TaoNet acceleration.
	- The isolated SSM profile suggested transfer mode could help in some cases, but inside the LLM wrapper it is much slower at every tested batch size: forward+backward throughput drops by roughly 4x to 8x relative to the direct `conv` path.
	- Accuracy remains solved because the local shift branch is still active, but speed regresses badly.
	- Keep `conv_transfer` only as an explicit diagnostic/experimental mode for now.
	- Recommended mode remains `ssm_kernel_mode=conv` with `ssm_local_shift=True`.
	- The next hardware target should not be materializing the whole transfer matrix for each layer and step. It should focus on fusing the current direct DPLR response path or giving it a custom autograd implementation, especially for the complex rank-1 frequency operations and the backward pass.

	### LLM Iteration 15 - Shrink DPLR Hidden State After Local-Shift Quality Fix

	Reason for this iteration:

	- The local-shift branch solved the `previous` token-memory task, but the quality-fixed SSM was still slower than attention on short seq-128 training.
	- The profiler for the recommended direct DPLR path at batch 32, seq 128 showed many small complex BMM/MM calls; there was no single obvious Python-only bottleneck.
	- Since local shift now carries exact one-token memory, the DPLR hidden dimension may not need to remain at 256 for this token-memory regime.
	- This iteration tested smaller DPLR hidden states as a ternary-friendly architecture/config improvement.

	Remote profiler context:

	- RepoBridge run: `ssm-dplr-direct-b32-s128-profile-20260429-154242`
	- Config: DPLR mixer/state dim 64, hidden dim 256, batch 32, seq 128, bf16, rank 1, direct path.
	- Result: forward+backward about `2.25M` core tok/s.
	- Profiler top CUDA cost was small complex BMM/MM work; `aten::bmm` accounted for about `48%` of self CUDA time.
	- `aten::linalg_matrix_power` was visible but small, about `40us` CUDA total.

	Remote hidden-dim sweeps:

	- RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden-sweep-previous-20260429-154546`
	- RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden-small-sweep-previous-20260429-155028`
	- Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct `conv` path.

	\| SSM hidden dim \| Batch \| SSM eval accuracy \| SSM forward+backward tok/s \| SSM peak MB \| Attention eval accuracy \| Attention forward+backward tok/s \|
	\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| 256 \| 8 \| 1.000 \| about 89k \| about 181 \| about 0.096 \| about 238k \|
	\| 256 \| 16 \| 1.000 \| about 189k \| about 299 \| about 0.847 \| about 508k \|
	\| 256 \| 32 \| 1.000 \| about 353k \| about 513 \| 1.000 \| about 555k \|
	\| 256 \| 64 \| 1.000 \| about 403k \| about 956 \| 1.000 \| about 1.73M \|
	\| 64 \| 8 \| 1.000 \| about 95k \| about 145 \| about 0.102 \| about 278k \|
	\| 64 \| 16 \| 1.000 \| about 184k \| about 239 \| about 0.932 \| about 300k \|
	\| 64 \| 32 \| 1.000 \| about 370k \| about 404 \| about 0.999 \| about 895k \|
	\| 64 \| 64 \| 1.000 \| about 564k \| about 750 \| about 0.999 \| about 920k \|
	\| 32 \| 8 \| 1.000 \| about 91k \| about 139 \| about 0.097 \| about 245k \|
	\| 32 \| 16 \| 1.000 \| about 187k \| about 227 \| about 0.941 \| about 460k \|
	\| 32 \| 32 \| 1.000 \| about 302k \| about 393 \| about 0.998 \| about 863k \|
	\| 32 \| 64 \| 1.000 \| about 787k \| about 716 \| 1.000 \| about 1.75M \|
	\| 16 \| 8 \| 1.000 \| about 86k \| about 138 \| about 0.083 \| about 260k \|
	\| 16 \| 16 \| 1.000 \| about 187k \| about 223 \| about 0.844 \| about 495k \|
	\| 16 \| 32 \| 1.000 \| about 357k \| about 378 \| about 0.999 \| about 550k \|
	\| 16 \| 64 \| 1.000 \| about 795k \| about 705 \| 1.000 \| about 1.76M \|

	Seq-512 speed check for hidden dim 16:

	- RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden16-random-speed-20260429-155346`
	- Config: random next-token timing task, seq 512, vocab 8192, projected DPLR mixer dim 64, hidden dim 16, local shift enabled.

	\| Architecture \| Batch \| Forward tok/s \| Forward+backward tok/s \| Peak MB \| Loss \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| 16 \| about 1.30M \| about 601k \| about 1332 \| about 9.055 \|
	\| SSM TaoNet, DPLR projected 64, hidden 16 + local shift \| 16 \| about 2.18M \| about 728k \| about 1511 \| about 9.069 \|
	\| attention TaoNet \| 32 \| about 3.87M \| about 1.37M \| about 2590 \| about 9.060 \|
	\| SSM TaoNet, DPLR projected 64, hidden 16 + local shift \| 32 \| about 3.52M \| about 1.14M \| about 2887 \| about 9.061 \|
	\| attention TaoNet \| 64 \| about 4.16M \| about 1.45M \| about 5121 \| about 9.065 \|
	\| SSM TaoNet, DPLR projected 64, hidden 16 + local shift \| 64 \| about 4.03M \| about 1.31M \| about 5649 \| about 9.060 \|

	Interpretation:

	- Success for the short token-memory benchmark: hidden dim 16 kept perfect `previous` accuracy and improved batch-64 forward+backward throughput from about `403k` to about `795k` tok/s, while reducing peak memory from about `956` MB to about `705` MB.
	- Hidden dim 64 was also strong and slightly better at batch 32 than hidden dim 16.
	- This did not become a universal seq-512 speed replacement. On random seq-512 timing, hidden dim 16 beat attention at batch 16 but lost at batch 32 and 64.
	- Recommended quality-aware short-memory config is now `ssm_mixer_dim=64`, `ssm_hidden_dim=16`, `ssm_local_shift=True`, `ssm_kernel_mode=conv`.
	- Recommended longer seq-512 throughput config should remain benchmark-driven; the older hidden-256 projected-64 regime still has stronger evidence around batch 16-32.

	### LLM Iteration 16 - Seq-512 Previous-Token Robustness And Hidden-State Selection

	Reason for this iteration:

	- Iteration 15 showed hidden dim 16 was excellent for seq-128 `previous` memory and mixed for seq-512 random timing.
	- The missing check was a longer trained token-memory task: seq 512 `previous`, where accuracy and training speed both matter.
	- This iteration tested whether the local-shift quality fix holds at seq 512 and whether hidden dim 16, 64, or 256 is the best state size at this longer context.

	Remote benchmark:

	- RepoBridge run with attention comparison: `taonet-vs-dplr-proj64-local-shift-hidden16-previous512-20260429-161213`
	- RepoBridge SSM-only hidden comparison: `taonet-ssm-proj64-local-shift-previous512-hidden-compare-20260429-161306`
	- Config: previous-token task, seq 512, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct `conv` path.

	Required TaoNet comparison:

	\| Architecture \| SSM hidden dim \| Batch \| Eval loss \| Eval accuracy \| Forward tok/s \| Forward+backward tok/s \| Peak MB \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| n/a \| 16 \| about 4.614 \| about 0.048 \| about 2.04M \| about 1.39M \| about 575 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 16 \| about 0.007 \| 1.000 \| about 2.57M \| about 701k \| about 754 \|
	\| attention TaoNet \| n/a \| 32 \| about 2.090 \| about 0.629 \| about 4.79M \| about 899k \| about 1099 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 32 \| about 0.007 \| 1.000 \| about 4.32M \| about 944k \| about 1391 \|
	\| attention TaoNet \| n/a \| 64 \| about 0.239 \| about 0.962 \| about 4.08M \| about 1.18M \| about 2157 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 64 \| about 0.007 \| 1.000 \| about 2.57M \| about 961k \| about 2677 \|

	SSM hidden-state comparison at seq 512:

	\| SSM hidden dim \| Batch \| Eval accuracy \| Forward tok/s \| Forward+backward tok/s \| Peak MB \|
	\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| 16 \| 16 \| 1.000 \| about 2.57M \| about 701k \| about 754 \|
	\| 16 \| 32 \| 1.000 \| about 4.32M \| about 944k \| about 1391 \|
	\| 16 \| 64 \| 1.000 \| about 2.57M \| about 961k \| about 2677 \|
	\| 64 \| 16 \| 1.000 \| about 1.18M \| about 494k \| about 800 \|
	\| 64 \| 32 \| 1.000 \| about 4.26M \| about 1.36M \| about 1491 \|
	\| 64 \| 64 \| 1.000 \| about 4.70M \| about 1.46M \| about 2874 \|
	\| 256 \| 16 \| 1.000 \| about 2.49M \| about 748k \| about 1032 \|
	\| 256 \| 32 \| 1.000 \| about 3.36M \| about 705k \| about 1914 \|
	\| 256 \| 64 \| 1.000 \| about 3.62M \| about 788k \| about 3681 \|

	Interpretation:

	- Quality success: local-shift DPLR SSM keeps perfect `previous` accuracy at seq 512 for all tested hidden sizes and batches.
	- Attention did not fully learn the same task in 100 steps at batch 16/32 and reached about `0.962` accuracy at batch 64.
	- Speed depends on batch (forward+backward tok/s):
	  - batch 16: hidden 256 is fastest among SSM variants, about `748k` tok/s; attention is still faster at about `1.39M`.
	  - batch 32: hidden 64 is fastest, about `1.36M` tok/s, beating attention's about `899k`.
	  - batch 64: hidden 64 is fastest, about `1.46M` tok/s, beating attention's about `1.18M`.
	- This gives a better longer-memory recommendation than Iteration 15:
	  - use `ssm_hidden_dim=16` for short seq-128 memory and lower memory pressure
	  - use `ssm_hidden_dim=64` for seq-512 trained memory around batch 32/64
	  - keep hidden 256 as a possible batch-16 or legacy speed point, but not the general quality-aware default
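
The per-batch selection above can be reproduced from the two seq-512 tables. This is a small illustrative helper, not part of the benchmark script; the numbers are the rounded forward+backward figures from this iteration.

```python
# (hidden_dim, batch) -> forward+backward tok/s, from the seq-512 SSM table.
ssm_fwd_bwd = {
    (16, 16): 701_000, (16, 32): 944_000, (16, 64): 961_000,
    (64, 16): 494_000, (64, 32): 1_360_000, (64, 64): 1_460_000,
    (256, 16): 748_000, (256, 32): 705_000, (256, 64): 788_000,
}
# batch -> attention forward+backward tok/s, from the comparison table.
attention_fwd_bwd = {16: 1_390_000, 32: 899_000, 64: 1_180_000}


def fastest_hidden_dim(batch):
    """Return (hidden_dim, tok/s) of the fastest SSM variant at this batch size."""
    candidates = {h: v for (h, b), v in ssm_fwd_bwd.items() if b == batch}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


for batch in (16, 32, 64):
    hidden, tps = fastest_hidden_dim(batch)
    verdict = "beats" if tps > attention_fwd_bwd[batch] else "trails"
    print(f"batch {batch}: hidden {hidden} at {tps:,} tok/s {verdict} attention")
```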

	### LLM Iteration 17 - TaoData Real-Text Byte-Token Pilot

	Reason for this iteration:

	- Synthetic `previous` and `increment` tasks were useful diagnostics, but they are not enough to judge LLM capability.
	- The remote server has a TaoData corpus at `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl`.
	- No SentencePiece tokenizer artifact was found at the expected remote TaoTrain/TaoData tokenizer paths, so the first real-text benchmark used dependency-free byte tokenization.
	- Byte tokenization is not the final deployment tokenizer, but it gives a real-corpus next-token signal and exercises the same TaoNet model paths.

	Implementation location:

	- TaoTrain commit: `b8c4f3d Add real token TaoNet benchmark`
	- TaoTrain: `scripts/benchmark_taonet_real_tokens.py`

	What changed:

	- Added a remote-friendly real-token benchmark script that:
	  - reads JSONL or plain text
	  - supports TaoData-style `text` records
	  - supports byte tokenization and optional SentencePiece tokenization
	  - builds contiguous next-token batches from one long token stream
	  - reports eval loss, perplexity, token accuracy, throughput, and memory
	  - compares attention TaoNet against multiple SSM hidden sizes in one run
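
As a rough illustration of the data path in that list, the sketch below reads `text` records, byte-tokenizes them, and cuts one long stream into next-token batches. It is not the code in `benchmark_taonet_real_tokens.py`. All function names are hypothetical, and the offset of 3 reserved ids is only a guess that would account for the vocab size of 259 reported for the byte-level run.

```python
import json

BYTE_OFFSET = 3  # assumption: ids 0-2 reserved for special tokens (256 + 3 = 259)


def iter_texts(lines):
    """Yield text from JSONL lines with a `text` field, or from plain-text lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            yield line                      # plain-text fallback
            continue
        if isinstance(record, dict) and "text" in record:
            yield record["text"]
        else:
            yield line


def byte_tokenize(text):
    """Map each UTF-8 byte to a token id, shifted past the reserved ids."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def contiguous_batches(stream, batch_size, seq_len):
    """Cut one long token stream into (inputs, targets) next-token batches."""
    window = seq_len + 1                    # one extra token for the shifted target
    rows = [stream[i:i + window]
            for i in range(0, len(stream) - window + 1, seq_len)]
    for start in range(0, len(rows) - batch_size + 1, batch_size):
        chunk = rows[start:start + batch_size]
        yield [r[:-1] for r in chunk], [r[1:] for r in chunk]


lines = ['{"text": "hello world"}', "plain line"]
stream = [tok for text in iter_texts(lines) for tok in byte_tokenize(text)]
batches = list(contiguous_batches(stream, batch_size=2, seq_len=4))
```

Each target row is the input row shifted left by one token, so the model at position `t` is scored against token `t + 1` of the same contiguous stream.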

	Validation:

	- Local CPU smoke passed on a plain text file with byte tokenization.
	- Remote RepoBridge runs completed on TaoData JSONL.

	Remote benchmark:

	- RepoBridge run: `taonet-vs-ssm-real-token-taodata-byte-pilot-20260429-164623`
	- RepoBridge run: `taonet-vs-ssm-real-token-taodata-byte-pilot-b64-20260429-164720`
	- Data: `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl`
	- Tokenization: byte-level, vocab size 259
	- Data limit: first `2,000,000` byte tokens from up to `5,000` records
	- Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

	Required TaoNet comparison:

	\| Architecture \| SSM hidden dim \| Batch \| Eval loss \| Eval PPL \| Eval accuracy \| Forward tok/s \| Forward+backward tok/s \| Peak MB \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| n/a \| 16 \| about 2.549 \| about 12.80 \| about 0.260 \| about 2.03M \| about 1.40M \| about 585 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 16 \| about 1.982 \| about 7.26 \| about 0.423 \| about 2.42M \| about 564k \| about 757 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 64 \| 16 \| about 1.928 \| about 6.88 \| about 0.440 \| about 2.16M \| about 488k \| about 803 \|
	\| attention TaoNet \| n/a \| 32 \| about 2.523 \| about 12.47 \| about 0.266 \| about 2.13M \| about 809k \| about 1115 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 32 \| about 1.879 \| about 6.55 \| about 0.455 \| about 4.43M \| about 1.38M \| about 1396 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 64 \| 32 \| about 1.848 \| about 6.35 \| about 0.457 \| about 3.97M \| about 1.25M \| about 1496 \|
	\| attention TaoNet \| n/a \| 64 \| about 2.529 \| about 12.54 \| about 0.265 \| about 5.98M \| about 2.03M \| about 2190 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 16 \| 64 \| about 1.807 \| about 6.10 \| about 0.471 \| about 4.92M \| about 1.67M \| about 2686 \|
	\| SSM TaoNet, DPLR projected 64 + local shift \| 64 \| 64 \| about 1.834 \| about 6.26 \| about 0.466 \| about 2.54M \| about 1.52M \| about 2882 \|
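
The Eval PPL column is the exponential of the Eval loss column (cross entropy in nats), which gives a quick sanity check on any row. The snippet below copies the rounded pairs from the table above and measures how far `exp(loss)` is from the reported perplexity.

```python
import math

# (eval loss, reported eval PPL) pairs copied from the table above.
rows = [
    (2.549, 12.80), (1.982, 7.26), (1.928, 6.88),
    (2.523, 12.47), (1.879, 6.55), (1.848, 6.35),
    (2.529, 12.54), (1.807, 6.10), (1.834, 6.26),
]

# Largest absolute difference between exp(loss) and the reported perplexity.
max_gap = max(abs(math.exp(loss) - ppl) for loss, ppl in rows)
print(f"largest |exp(loss) - reported PPL| = {max_gap:.3f}")
```

The remaining differences are small enough to be explained by the table rounding loss to three decimals and perplexity to two.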

	Interpretation:

	- First real-corpus quality success: both SSM candidates beat attention on validation loss, perplexity, and byte-token accuracy after the same number of train steps.
	- Hidden 64 was best quality at batch 16/32, while hidden 16 was best quality at batch 64 and generally faster among SSM variants.
	- Speed tradeoff depends on batch (forward+backward throughput):
	  - batch 16: attention is faster (about `1.40M` vs about `564k` tok/s for hidden-16 SSM), but SSM has much better validation quality.
	  - batch 32: hidden-16 SSM wins both quality and throughput versus attention (about `1.38M` vs about `809k` tok/s).
	  - batch 64: attention wins throughput (about `2.03M` vs about `1.67M` tok/s), while SSM wins validation quality.
	- This benchmark is byte-level, so it should be treated as a real-text pilot rather than the final TaoData tokenizer benchmark.
	- Next real-data step: train or locate the intended SentencePiece tokenizer, then rerun the same script with `--tokenizer-type sentencepiece`.

	### LLM Iteration 18 - TaoData SentencePiece Pilot And Per-Channel Local Shift

	Reason for this iteration:

	- Byte-level TaoData results were encouraging but not the intended LLM tokenization.
	- No pre-existing tokenizer artifact was found on the remote server, so a pilot SentencePiece tokenizer was trained from TaoData.
	- The first 500-step SentencePiece run showed attention still ahead on validation loss at batch 32, even though SSM retained a token-accuracy edge.
	- Because no-shift SSM was worse, the local shift branch was helping; the next lightweight improvement was making the shift gain per-channel instead of one scalar.

	Implementation location:

	- TaoTrain commit: `33747c1 Add TaoData pilot tokenizer config`
	- TaoTrain commit: `c519645 Add per-channel SSM local shift`
	- TaoTrain: `configs/tokenizer_taodata_pilot.yaml`
	- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
	- TaoTrain: `scripts/benchmark_taonet_real_tokens.py`

	What changed:

	- Added a pilot tokenizer config:
	  - input: `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl`
	  - output: `/home/student/YouZheng/tokenizers/taodata_pilot_8k`
	  - vocab size: `8192`
	  - max samples: `20000`
	- Trained the remote tokenizer; output files:
	  - `/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model`
	  - `/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab`
	- Added opt-in `ssm_local_shift_per_channel`.
	- The previous shift branch used one learned scalar for all model channels.
	- The new branch can use one learned gain per model channel while keeping the operation cheap: shift plus elementwise multiply.
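
A minimal sketch of the per-channel variant, again in plain Python over nested lists. The function name is hypothetical and the real module presumably operates on tensors; the point is only that the scalar case is the per-channel case with every gain equal.

```python
def per_channel_shift(x_norm, gains):
    """Shift each sequence right by one token, then scale channel c by gains[c].

    x_norm is a nested list shaped [batch][seq][channels]; gains has one
    entry per channel. Position 0 has no previous token and contributes zeros.
    """
    out = []
    for seq in x_norm:
        zero = [0.0] * len(gains)
        prev_tokens = [zero] + seq[:-1]
        out.append([[g * v for g, v in zip(gains, tok)] for tok in prev_tokens])
    return out


x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]             # batch 1, seq 3, 2 channels
scalar_style = per_channel_shift(x, gains=[0.5, 0.5])   # same gain on every channel
channel_style = per_channel_shift(x, gains=[0.5, 2.0])  # independent gains
```

The extra cost over the scalar branch is one learned value per channel; the operation itself is still a shift followed by an elementwise multiply.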

	Validation:

	- TaoTrain local tests: `python -m pytest tests\test_taonet_ssm.py -q` passed, `6 passed`.
	- Local real-token smoke with `--ssm-local-shift-per-channel` passed.
	- Remote tokenizer training completed. RepoBridge's local print path initially hit a Windows emoji encoding issue, but the tokenizer files were created successfully.

	SentencePiece 150-step pilot:

	- RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-pilot-20260429-171228`
	- Data: TaoData FineWeb JSONL
	- Tokenization: pilot SentencePiece 8k
	- Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled.

	\| Architecture \| SSM hidden dim \| Batch \| Eval loss \| Eval PPL \| Eval accuracy \| Forward+backward tok/s \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| n/a \| 16 \| about 5.718 \| about 304 \| about 0.150 \| about 1.01M \|
	\| SSM TaoNet \| 16 \| 16 \| about 5.723 \| about 306 \| about 0.149 \| about 743k \|
	\| SSM TaoNet \| 64 \| 16 \| about 5.728 \| about 307 \| about 0.146 \| about 381k \|
	\| attention TaoNet \| n/a \| 32 \| about 5.533 \| about 253 \| about 0.156 \| about 842k \|
	\| SSM TaoNet \| 16 \| 32 \| about 5.505 \| about 246 \| about 0.165 \| about 771k \|
	\| SSM TaoNet \| 64 \| 32 \| about 5.561 \| about 260 \| about 0.158 \| about 1.09M \|
	\| attention TaoNet \| n/a \| 64 \| about 5.414 \| about 225 \| about 0.163 \| about 623k \|
	\| SSM TaoNet \| 16 \| 64 \| about 5.427 \| about 227 \| about 0.169 \| about 1.12M \|
	\| SSM TaoNet \| 64 \| 64 \| about 5.395 \| about 220 \| about 0.171 \| about 623k \|

	SentencePiece 500-step batch-32 follow-up:

	- RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-20260429-171338`
	- RepoBridge run without shift: `taonet-ssm-real-token-taodata-spm-b32-500step-no-shift-20260429-171451`
	- RepoBridge run with per-channel shift: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-channel-shift-20260429-171917`
	- Config: batch 32, seq 512, 500 train steps, eval batches 16.

	\| Variant \| SSM hidden dim \| Shift type \| Eval loss \| Eval PPL \| Eval accuracy \| Forward+backward tok/s \|
	\|---\|---:\|---\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| n/a \| n/a \| about 4.715 \| about 112 \| about 0.211 \| about 1.23M in first run, about 892k in per-channel run \|
	\| SSM TaoNet \| 16 \| scalar \| about 4.798 \| about 121 \| about 0.217 \| about 1.13M \|
	\| SSM TaoNet \| 64 \| scalar \| about 4.830 \| about 125 \| about 0.215 \| about 968k \|
	\| SSM TaoNet \| 16 \| none \| about 5.088 \| about 162 \| about 0.171 \| about 554k \|
	\| SSM TaoNet \| 64 \| none \| about 5.102 \| about 164 \| about 0.169 \| about 580k \|
	\| SSM TaoNet \| 16 \| per-channel \| about 4.782 \| about 119 \| about 0.218 \| about 784k \|
	\| SSM TaoNet \| 64 \| per-channel \| about 4.818 \| about 124 \| about 0.215 \| about 1.08M \|

	Interpretation:

	- The SentencePiece pilot is more realistic and less favorable to SSM than the byte-level pilot.
	- SSM has a small token-accuracy edge at batch 32, but attention has the best 500-step validation loss/perplexity.
	- Removing local shift is clearly worse, so local shift is useful for real-token modeling too.
	- Per-channel shift is a small quality improvement over scalar shift:
	  - hidden 16 eval loss improved from about `4.798` to `4.782`
	  - hidden 64 eval loss improved from about `4.830` to `4.818`
	- Per-channel shift is not enough to surpass attention on 500-step SentencePiece validation loss.
	- Next model-improvement direction should target SSM language-modeling capacity or optimization, not just exact one-token memory:
	  - try larger `ssm_mixer_dim` such as 96/128 with h16/h64
	  - tune SSM learning rate/weight decay separately from attention
	  - test a small gated local convolution/projection branch if ternary deployment accepts it

	### LLM Iteration 19 - TaoData SentencePiece Mixer-Dimension Sweep

	Reason for this iteration:

	- The 500-step SentencePiece batch-32 pilot showed SSM had a small token-accuracy edge, but attention still had better validation loss/perplexity.
	- The prior best SSM used `ssm_mixer_dim=64`, originally chosen from speed-focused scaling probes.
	- Because real-token quality may need more SSM channel capacity, this iteration swept projected mixer dimensions while keeping the same outer TaoNet dimensions.

	Implementation location:

	- TaoTrain commit: `357336e Sweep SSM mixer dims in real token benchmark`
	- TaoTrain: `scripts/benchmark_taonet_real_tokens.py`
	- RepoBridge config: `repobridge.taonet.realspm.taodata.b32.500step.mixersweep.config.json`

	What changed:

	- Added `--ssm-mixer-dims` to the real-token benchmark.
	- The benchmark now records `ssm_mixer_dim` in the printed table and CSV.
	- Attention TaoNet is still evaluated once per batch, while SSM TaoNet can sweep multiple hidden and mixer dimensions in the same run.
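
One way such a sweep can be wired up is sketched below with `argparse` and `itertools.product`. Only `--ssm-mixer-dims` is named in this record; the `--ssm-hidden-dims` flag, the `int_list` helper, and the shape of the `runs` list are assumptions made for illustration and may not match the actual script.

```python
import argparse
from itertools import product


def int_list(value):
    """Parse a comma-separated string such as "64,96,128" into a list of ints."""
    return [int(item) for item in value.split(",") if item.strip()]


parser = argparse.ArgumentParser()
parser.add_argument("--ssm-mixer-dims", type=int_list, default=[64])
# Hypothetical flag name; the record does not state how hidden dims are passed.
parser.add_argument("--ssm-hidden-dims", type=int_list, default=[16])

args = parser.parse_args(["--ssm-mixer-dims", "64,96,128",
                          "--ssm-hidden-dims", "16,64"])

# One attention run per batch, then one SSM run per (hidden, mixer) pair.
runs = [("taonet", None, None)]
runs += [("taonet_ssm", h, m)
         for h, m in product(args.ssm_hidden_dims, args.ssm_mixer_dims)]
```

With two hidden sizes and three mixer sizes this yields one attention row plus six SSM rows per batch size.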

	Validation:

	- TaoTrain local syntax check: `python -m py_compile scripts\benchmark_taonet_real_tokens.py` passed.
	- TaoTrain local tests with the SSM repo on `PYTHONPATH`: `python -m pytest tests\test_taonet_ssm.py -q` passed, `6 passed`.
	- Local byte-token smoke with `--ssm-mixer-dims 8,12` passed and wrote CSV/JSON outputs.

	Remote benchmark:

	- RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-mixersweep-20260429-193729`
	- Data: TaoData FineWeb JSONL
	- Tokenization: pilot SentencePiece 8k
	- Config: batch 32, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches, local shift enabled, per-channel shift enabled.

	\| Architecture \| SSM hidden dim \| SSM mixer dim \| Eval loss \| Eval PPL \| Eval accuracy \| Forward+backward tok/s \| Peak allocated MB \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| attention TaoNet \| n/a \| n/a \| 4.715 \| 111.633 \| 0.211 \| 618k \| 2590 \|
	\| SSM TaoNet \| 16 \| 64 \| 4.780 \| 119.046 \| 0.218 \| 1.13M \| 2887 \|
	\| SSM TaoNet \| 16 \| 96 \| 4.759 \| 116.643 \| 0.222 \| 973k \| 3029 \|
	\| SSM TaoNet \| 16 \| 128 \| 4.719 \| 112.088 \| 0.224 \| 782k \| 3192 \|
	\| SSM TaoNet \| 64 \| 64 \| 4.824 \| 124.475 \| 0.214 \| 982k \| 2987 \|
	\| SSM TaoNet \| 64 \| 96 \| 4.761 \| 116.917 \| 0.219 \| 479k \| 3131 \|
	\| SSM TaoNet \| 64 \| 128 \| 4.784 \| 119.589 \| 0.218 \| 457k \| 3292 \|

	Interpretation:

	- Increasing the projected mixer dimension helped the best SSM real-token validation loss.
	- The best quality SSM in this run was `ssm_hidden_dim=16`, `ssm_mixer_dim=128`:
	  - validation loss `4.719`, very close to attention `4.715`
	  - token accuracy `0.224`, above attention `0.211`
	  - forward+backward throughput about `782k` tok/s, above attention about `618k` tok/s
	- Hidden dim `64` did not help in this batch-32 500-step SentencePiece setting; at mixer dim 128 it was both slower (`457k` vs `782k` tok/s) and worse (eval loss `4.784` vs `4.719`) than hidden dim `16`.
	- Mixer dim `64` remains the best SSM speed/quality tradeoff, but mixer dim `128` is now the best SSM quality candidate on real SentencePiece token modeling.
	- Next step should test whether `hidden_dim=16`, `mixer_dim=128` remains strong at batch 16/64 and longer training, then try a narrow learning-rate sweep around it.

	### LLM Iteration 20 - Attempted h16/m128 Batch Generalization Sweep

	Reason for this iteration:

	- Iteration 19 found a strong real-token batch-32 point: `ssm_hidden_dim=16`, `ssm_mixer_dim=128`.
	- The user noted earlier that a single batch-size sweet spot can be misleading.
	- This iteration was meant to compare attention TaoNet vs SSM TaoNet at batch 16, 32, and 64 with the same 500-step SentencePiece protocol.

	Implementation location:

	- TaoTrain commit used remotely: `357336e Sweep SSM mixer dims in real token benchmark`
	- RepoBridge config: `repobridge.taonet.realspm.taodata.h16m128.batchsweep.config.json`

	Planned remote benchmark:

	- Data: TaoData FineWeb JSONL
	- Tokenization: pilot SentencePiece 8k
	- Config: batch 16/32/64, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches
	- Attention baseline: `taonet`
	- SSM candidate: `taonet_ssm`, DPLR, `ssm_hidden_dim=16`, `ssm_mixer_dim=128`, local shift enabled, per-channel shift enabled

	Remote status before run:

	- RepoBridge write guard passed.
	- RepoBridge preflight passed.
	- Remote GPU: RTX 5090 with about 21 GB free VRAM.
	- A same-user `taodata` process was present and using about 10.9 GB VRAM; no other users were detected.

	Outcome:

	- RepoBridge `full` began, but the SFTP download phase failed with:
	  - `Socket exception: An existing connection was forcibly closed by the remote host (10054)`
	  - `paramiko.ssh_exception.SSHException: Server connection dropped`
	- Subsequent read-only RepoBridge SSH checks timed out with WinError `10060`.
	- The new result folder did not appear in the partial local download, so no valid benchmark table was available to record.

	Interpretation:

	- This was an infrastructure interruption, not a model failure.
	- Do not infer anything about h16/m128 batch generalization from this attempted run.
	- Next action when the remote server is reachable: rerun or download the run for `taonet-vs-ssm-real-token-taodata-spm-h16m128-batchsweep`.

	### Current LLM-Wrapper Best Configuration

	Best current speed benchmark configuration:

	- architecture: `taonet_ssm`
	- SSM core: `dplr`
	- mixer projection: `ssm_mixer_dim=64`
	- SSM hidden dimension: `256`
	- DPLR rank: `1`
	- kernel mode: `conv`
	- dtype: `bf16`
	- benchmark task: synthetic next-token CE through TaoNet wrapper

	Best current quality-aware token-memory configuration:

	- architecture: `taonet_ssm`
	- SSM core: `dplr`
	- mixer projection: `ssm_mixer_dim=64`
	- SSM hidden dimension: `16`
	- DPLR rank: `1`
	- kernel mode: `conv`
	- dtype: `bf16`
	- local shift: `ssm_local_shift=True`
	- benchmark task: `previous` token memory through TaoNet wrapper
	- evidence: perfect eval accuracy at batch 8, 16, 32, and 64 after 100 steps; best observed short-memory batch-64 SSM forward+backward throughput about `795k` tok/s

	Best current longer token-memory configuration:

	- architecture: `taonet_ssm`
	- SSM core: `dplr`
	- mixer projection: `ssm_mixer_dim=64`
	- SSM hidden dimension: `64`
	- DPLR rank: `1`
	- kernel mode: `conv`
	- dtype: `bf16`
	- local shift: `ssm_local_shift=True`
	- benchmark task: seq-512 `previous` token memory through TaoNet wrapper
	- evidence: perfect eval accuracy at batch 16, 32, and 64 after 100 steps; best observed batch-32 and batch-64 SSM forward+backward throughput about `1.36M` and `1.46M` tok/s, both above attention in the same task

	Best current TaoData real-text pilot configuration:

	- architecture: `taonet_ssm`
	- SSM core: `dplr`
	- mixer projection: `ssm_mixer_dim=128` for best current SentencePiece validation loss; `ssm_mixer_dim=64` for speed/quality balance
	- SSM hidden dimension: `16`
	- DPLR rank: `1`
	- kernel mode: `conv`
	- dtype: `bf16`
	- local shift: `ssm_local_shift=True`
	- local shift gain: `ssm_local_shift_per_channel=True`
	- benchmark task: TaoData FineWeb JSONL, byte-level and pilot SentencePiece next-token prediction, seq 512
	- evidence:
	  - byte-level: lower validation loss/perplexity than attention at batch 16/32/64 after 150 steps; hidden-16 also beat attention backward throughput at batch 32
	  - SentencePiece batch 32, 500 steps: `ssm_hidden_dim=16`, `ssm_mixer_dim=128` reached eval loss about `4.719` vs attention about `4.715` (near parity, marginally higher), with better token accuracy (`0.224` vs `0.211`) and higher backward throughput (`782k` vs `618k` tok/s)
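
The four "best current" configurations above differ in only a few fields, which is easier to see side by side. The sketch below is illustrative only: `ssm_mixer_dim`, `ssm_hidden_dim`, `ssm_local_shift`, and `ssm_local_shift_per_channel` are names used in this record, while `SSMBenchConfig`, `ssm_core`, `dplr_rank`, `kernel_mode`, and `dtype` are hypothetical field names that may not match the TaoTrain config schema. Fields a list does not mention are left at the sketch's defaults, which is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SSMBenchConfig:
    """Hypothetical container; field names are illustrative, not the TaoTrain schema."""
    architecture: str = "taonet_ssm"
    ssm_core: str = "dplr"
    ssm_mixer_dim: int = 64
    ssm_hidden_dim: int = 16
    dplr_rank: int = 1
    kernel_mode: str = "conv"
    dtype: str = "bf16"
    ssm_local_shift: bool = False              # assumed off when not listed
    ssm_local_shift_per_channel: bool = False  # assumed off when not listed


# Values copied from the four "best current" lists above.
speed = SSMBenchConfig(ssm_hidden_dim=256)
short_memory = SSMBenchConfig(ssm_hidden_dim=16, ssm_local_shift=True)
long_memory = replace(short_memory, ssm_hidden_dim=64)
real_text = replace(short_memory, ssm_mixer_dim=128, ssm_local_shift_per_channel=True)

for name, cfg in [("speed", speed), ("short_memory", short_memory),
                  ("long_memory", long_memory), ("real_text", real_text)]:
    print(name, cfg)
```

Read this way, the token-memory and real-text configurations share the same DPLR rank, kernel mode, and dtype; only the hidden dimension, mixer projection, and local-shift flags change.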

	Current best evidence:

	- At batch 4, seq 512, projected-64 DPLR reaches about `618k` forward tok/s and `192k` backward tok/s.
	- At batch 16, seq 512, projected-64 DPLR reaches about `2.12M` forward tok/s and `702k` backward tok/s.
	- Attention is still faster for backward at batch 16 in the same run: about `990k` tok/s.
	- DPLR projected-64 forward can exceed attention in this benchmark, but training/backward still needs improvement.
	- Newer scaling rerun found a batch-32 sweet spot where projected-64 DPLR exceeded attention in both forward and forward+backward throughput:
	  - SSM forward about `2.62M` tok/s vs attention about `1.88M`
	  - SSM forward+backward about `705k` tok/s vs attention about `632k`
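
The comparisons in this list are easier to weigh as ratios. The snippet below only restates the rounded figures quoted above; `runs` and `ssm_speedup` are names introduced here for illustration.

```python
# Throughput figures (tokens/s) quoted in the evidence list above.
runs = {
    "batch16_backward": {"ssm": 702_000, "attention": 990_000},
    "batch32_forward": {"ssm": 2_620_000, "attention": 1_880_000},
    "batch32_forward_backward": {"ssm": 705_000, "attention": 632_000},
}


def ssm_speedup(ssm_tok_s: float, attention_tok_s: float) -> float:
    """Ratio above 1 means the SSM path is faster than attention."""
    return ssm_tok_s / attention_tok_s


ratios = {name: ssm_speedup(v["ssm"], v["attention"]) for name, v in runs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers the batch-32 forward advantage is roughly 1.4x, while the forward+backward advantage is closer to 1.1x, which is consistent with backward throughput being the remaining bottleneck.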

	Important local artifact paths:

	- `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091624\taonet_token_benchmark.csv`
	- `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected64-scale\outputs-taotrain\taonet-token-dplr-proj64-scale-bench-20260429-091738\taonet_token_benchmark.csv`
	- `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091956\taonet_token_benchmark.csv`

	Recommended next LLM-wrapper targets:

	1. Rerun the real SentencePiece benchmark for `ssm_hidden_dim=16`, `ssm_mixer_dim=128` at batch 16/32/64 to check whether the gain generalizes beyond the batch-32 spot.
	2. Optimize backward throughput in `S4TernaryDPLRSSM`; the forward path is now competitive at larger batch sizes.
	3. Run a learning-rate and weight-decay sweep around the current best SSM real-token config, because the SSM and attention cores may not share the same optimal optimizer settings.
	4. Investigate whether FFT/direct-response intermediates can be checkpointed or handled by a custom autograd function to improve backward speed.
	5. Keep ternary deployment constraints in view: rank-1 DPLR factors still use ternary masks with learned amplitudes, and projected mixer dimensions should remain friendly to ternary compute layouts.

	## Version Timeline

	\| Run \| Notebook(s) \| Commit printed in notebook \| Device \| Main purpose \|
	\|---\|---\|---:\|---\|---\|
	\| `_r1` \| `gamma_s4_sinewave_benchmark_r1.ipynb` \| not printed \| CUDA \| First comparison of baseline, minimal, enhanced on simple sinewave task. \|
	\| `_r2` \| `gamma_s4_sinewave_benchmark_r2.ipynb` \| not printed \| CUDA \| Harder multivariate long-range task; enhanced first became clearly promising. \|
	\| `_r3` \| `gamma-s4-sinewave-benchmark_r3.ipynb` \| `6df3777` \| CUDA \| Quick benchmark after deployment-cache import fix; recurrent enhanced still very slow. \|
	\| `_r4` \| `gamma-s4-sinewave-benchmark_r4.ipynb` \| `d6ebddc` \| CUDA \| Triangular-solve recurrent optimization; large recurrent speedup. \|
	\| `_r5` \| `gamma-s4-sinewave-benchmark_r5.ipynb` \| `78ae31f` \| CUDA \| Added recurrent/full-output agreement metrics. \|
	\| `_r6` \| quick + research notebooks \| `a2474cc` / `5952546` \| CPU for quick, CUDA for research \| Split quick/research benchmark; first practical long-context run showed conv path was too slow. \|
	\| `_r7` \| quick + research notebooks \| `4b977c1` / `b17f72a` \| CUDA \| Faster conv kernel generation and cheaper research defaults. \|
	\| `_r8` \| quick + research notebooks \| `73e76a7` \| CUDA \| Skipped unused final states, enabled baseline deploy metrics, enabled token-lite. \|
	\| `_r9` \| quick + research notebooks \| `8738675` / `60562bd` \| CUDA \| Added research visuals; performance similar to `_r8`, now presentation-friendly. \|
	\| `_r10` \| quick + research notebooks \| `09db0da` / `9ff7e4e` \| CUDA \| Added balanced deployment metrics to test a speed/fidelity point between full recurrent and deployment-lite. \|
	\| `_r11` \| quick + research + challenge notebooks \| `64f8632` / `4842762` / `bfc6e26` \| CUDA \| Fixed AMP FFT path, split result tables, and added challenge benchmarks for permuted MNIST, selective copying, and induction-style recall. \|
	\| `_r12` \| quick + research + challenge notebooks \| `740a9ef` / `0c6ecb8` / `11bd2e6` \| CUDA \| Tested the input-selection gate. Forecasting stayed strong, but challenge recall tasks remained near random. \|

	## `_r1` - First Simple Sinewave Comparison

	Saved notebook:

	- `output/jupyter-notebook/gamma_s4_sinewave_benchmark_r1.ipynb`

	Configuration recovered from notebook:

	- device: CUDA
	- task: simple 1D sinewave next-step prediction
	- `seq_len=128`
	- `train_samples=512`
	- `val_samples=128`
	- `batch_size=32`
	- `epochs=10`
	- `d_model=1`
	- `hidden_dim=32`
	- `num_layers=2`

	Results:

	\| Model \| Params \| Final val loss \| Mean epoch s \| Full ms \| Full tokens/s \| Recurrent ms \| Recurrent tokens/s \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| `gamma_baseline` \| 134 \| 0.170722 \| 2.063 \| 24.604 \| 166477 \| 51.177 \| 80036 \|
	\| `gamma_s4_minimal` \| 138 \| 0.019148 \| 1.282 \| 23.545 \| 173967 \| 54.462 \| 75209 \|
	\| `gamma_s4_enhanced` \| 146 \| 1.154002 \| 1.330 \| 23.389 \| 175127 \| 82.343 \| 49743 \|

	Interpretation:

	- `gamma_s4_minimal` was best on this very simple task.
	- `gamma_s4_enhanced` was unstable or badly underfit here (val loss `1.154002`, worse than both other models).
	- This run showed that the richer enhanced block can be harmful on small/simple tasks.
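
The latency and throughput columns in the table above are two views of the same measurement. A minimal sketch of the conversion, assuming each timed call processes one batch of `batch_size * seq_len` tokens (the record does not state the timed batch size, but the `_r1` numbers are consistent with `32 * 128 = 4096` tokens per call); `tokens_per_second` is a name introduced here:

```python
def tokens_per_second(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Throughput for one timed call that processes batch_size * seq_len tokens."""
    return batch_size * seq_len / (latency_ms / 1000.0)


# `_r1` settings: batch_size=32, seq_len=128 (timed batch size is an assumption).
full = tokens_per_second(32, 128, 24.604)       # gamma_baseline, "Full ms"
recurrent = tokens_per_second(32, 128, 51.177)  # gamma_baseline, "Recurrent ms"
print(round(full), round(recurrent))
```

Under that assumption the computed values agree with the `gamma_baseline` row (`166477` and `80036` tokens/s).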

	## `_r2` - Harder Multivariate Forecasting

	Saved notebook:

	- `output/jupyter-notebook/gamma_s4_sinewave_benchmark_r2.ipynb`

	Configuration recovered from notebook:

	- device: CUDA
	- task: harder multivariate synthetic forecasting
	- `seq_len=512`
	- `num_features=8`
	- `train_samples=768`
	- `val_samples=192`
	- `batch_size=32`
	- `epochs=12`
	- `d_model=8`
	- `hidden_dim=64`
	- `num_layers=3`

	Results:

	\| Model \| Params \| Final val loss \| Mean epoch s \| Full ms \| Full tokens/s \| Recurrent ms \| Recurrent tokens/s \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| `gamma_baseline` \| 3192 \| 0.006972 \| 28.916 \| 146.644 \| 111726 \| 305.446 \| 53640 \|
	\| `gamma_s4_minimal` \| 3243 \| 0.110654 \| 17.194 \| 121.234 \| 135144 \| 343.394 \| 47712 \|
	\| `gamma_s4_enhanced` \| 3675 \| 0.006302 \| 17.191 \| 131.929 \| 124188 \| 492.576 \| 33262 \|

	Interpretation:

	- `gamma_s4_enhanced` became the best-quality model.
	- Enhanced training was much faster than baseline on this task (about 17.2 s vs 28.9 s per epoch).
	- Recurrent inference was still significantly slower than baseline (about 493 ms vs 305 ms).
	- This was the first strong evidence that the enhanced model is useful on harder sequence tasks.

	## `_r3` - Quick Benchmark With Deployment Cache Available

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r3.ipynb`

	Configuration:

	- device: CUDA
	- commit: `6df3777`
	- quick tasks:
	  - `simple`: `seq_len=192`, `features=4`, `epochs=4`
	  - `moderate`: `seq_len=320`, `features=6`, `epochs=5`
	- models: `gamma_baseline`, `gamma_s4_enhanced`
	- enhanced: `kernel_mode="auto"`, `kernel_threshold=384`, bilinear discretization
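
The `kernel_mode="auto"` and `kernel_threshold=384` settings suggest a length-based choice between the recurrent-like and conv paths. The dispatch rule itself is not in this record, so the following is a hypothetical sketch: `select_kernel_mode` is an invented name and the inclusive comparison is an assumption. It is at least consistent with the "Expected mode" column in the later research tables (`seq_len=320` runs as `recurrent_like`, `seq_len=768` as `conv`).

```python
def select_kernel_mode(seq_len: int, kernel_mode: str = "auto",
                       kernel_threshold: int = 384) -> str:
    """Hypothetical dispatch: long sequences use the conv path, short ones step recurrently."""
    if kernel_mode != "auto":
        return kernel_mode  # an explicit mode overrides the length rule
    # Assumption: the threshold is inclusive; the record does not say.
    return "conv" if seq_len >= kernel_threshold else "recurrent_like"


for seq_len in (192, 320, 768):
    print(seq_len, select_kernel_mode(seq_len))
```

With this rule both quick tasks (`seq_len=192` and `seq_len=320`) stay below the threshold, so neither would exercise the conv path.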

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent ms \| Recurrent tokens/s \| Deploy recurrent ms \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.199 \| 5779 \| 75.092 \| 5114 \| not available \|
	\| simple \| enhanced \| 0.045667 \| 2.450 \| 9671 \| 1024.471 \| 375 \| 997.631 \|
	\| moderate \| baseline \| 0.533364 \| 6.544 \| 5729 \| 192.084 \| 3332 \| not available \|
	\| moderate \| enhanced \| 0.021113 \| 6.584 \| 6995 \| 2815.223 \| 227 \| 2346.986 \|

	Interpretation:

	- Enhanced quality was much better than baseline.
	- Full-sequence throughput was better for enhanced.
	- The enhanced recurrent path was catastrophically slow: about 1.0 s (simple) and 2.8 s (moderate), versus about 75 ms and 192 ms for baseline.
	- This run motivated recurrent-path optimization.

	## `_r4` - Triangular-Solve Recurrent Optimization

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r4.ipynb`

	Configuration:

	- device: CUDA
	- commit: `d6ebddc`
	- same quick tasks as `_r3`
	- key code change: bilinear recurrent stepping switched to a triangular-solve path

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent ms \| Recurrent tokens/s \| Deploy recurrent ms \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.219 \| 6186 \| 69.398 \| 5533 \| not available \|
	\| simple \| enhanced \| 0.045667 \| 2.288 \| 9728 \| 139.394 \| 2755 \| 104.623 \|
	\| moderate \| baseline \| 0.533364 \| 6.409 \| 6182 \| 110.415 \| 5796 \| not available \|
	\| moderate \| enhanced \| 0.021113 \| 6.630 \| 9896 \| 240.392 \| 2662 \| 185.037 \|

	Interpretation:

	- This was a major recurrent-inference improvement.
	- Enhanced recurrent latency dropped from about 1.0 s (simple) and 2.8 s (moderate) in `_r3` to about 139 ms and 240 ms.
	- Enhanced still remained slower than baseline in recurrent mode.
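
The `_r4` change relies on the structure of the bilinear update. The record does not show the code, so this is a pure-Python sketch under stated assumptions: the transition `A` is taken to be lower bidiagonal with `-lam` on the diagonal and `+lam` on the subdiagonal (an illustrative Gamma-chain form), and `bilinear_step`, `lam`, `dt`, and `b` are hypothetical names. With that structure, `I - dt/2 * A` is lower triangular, so the step `(I - dt/2 * A) x_next = (I + dt/2 * A) x + dt * b * u` can be solved by forward substitution instead of a general inverse.

```python
def bilinear_step(x, u, lam, dt, b):
    """One bilinear step for an assumed lower-bidiagonal A (-lam diagonal, +lam subdiagonal)."""
    n = len(x)
    h = 0.5 * dt * lam
    # Right-hand side (I + dt/2 A) x + dt * b * u: diagonal 1 - h, subdiagonal +h.
    rhs = []
    for i in range(n):
        val = (1.0 - h) * x[i]
        if i > 0:
            val += h * x[i - 1]
        rhs.append(val + dt * b[i] * u)
    # Left matrix I - dt/2 A has diagonal 1 + h and subdiagonal -h,
    # so forward substitution solves it in O(n).
    x_next = [0.0] * n
    for i in range(n):
        acc = rhs[i]
        if i > 0:
            acc += h * x_next[i - 1]
        x_next[i] = acc / (1.0 + h)
    return x_next


state = bilinear_step([1.0, 0.0, 0.0], u=0.0, lam=1.0, dt=0.5, b=[1.0, 0.0, 0.0])
print(state)
```

In a PyTorch model the same idea would more likely use a batched call such as `torch.linalg.solve_triangular` than a Python loop; the loop here only makes the substitution explicit.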

	## `_r5` - Agreement Metrics Added

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r5.ipynb`

	Configuration:

	- device: CUDA
	- commit: `78ae31f`
	- same quick tasks as `_r4`
	- added:
	  - `recurrent_match_mse`
	  - `deploy_match_mse`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent ms \| Recurrent match MSE \| Deploy recurrent ms \| Deploy match MSE \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.317 \| 6097 \| 71.296 \| 0.000000 \| not available \| not available \|
	\| simple \| enhanced \| 0.045667 \| 2.381 \| 9963 \| 141.656 \| 0.008500 \| 107.361 \| 0.031251 \|
	\| moderate \| baseline \| 0.533364 \| 6.603 \| 5912 \| 114.832 \| 0.000000 \| not available \| not available \|
	\| moderate \| enhanced \| 0.021113 \| 7.199 \| 9465 \| 242.692 \| 0.007549 \| 178.070 \| 0.029995 \|

	Interpretation:

	- Enhanced remained much better in quality.
	- Full-sequence throughput favored enhanced.
	- Recurrent/deployment-lite speed improved but still trailed baseline.
	- Agreement metrics showed normal enhanced recurrent output was close to full forward; deployment-lite was faster but less faithful.
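
The agreement metrics compare a step-by-step path against the full forward output. Their exact definition is not given in this record; the sketch below assumes a plain mean squared error over output positions and uses a scalar toy system rather than the Gamma model. `full_forward`, `recurrent_forward`, and `match_mse` are names introduced here.

```python
def full_forward(u, a, b, c):
    """Whole-sequence output via the convolution kernel K[j] = c * a**j * b."""
    kernel = [c * (a ** j) * b for j in range(len(u))]
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


def recurrent_forward(u, a, b, c):
    """Token-by-token output from x[k] = a * x[k-1] + b * u[k], y[k] = c * x[k]."""
    x, out = 0.0, []
    for u_k in u:
        x = a * x + b * u_k
        out.append(c * x)
    return out


def match_mse(reference, candidate):
    """Mean squared difference between two equally long output sequences."""
    assert len(reference) == len(candidate)
    return sum((r - s) ** 2 for r, s in zip(reference, candidate)) / len(reference)


u = [1.0, 0.5, -0.25, 0.0, 2.0]
mse = match_mse(full_forward(u, 0.9, 0.5, 1.2), recurrent_forward(u, 0.9, 0.5, 1.2))
print(f"recurrent_match_mse = {mse:.3e}")
```

For an exact linear system the two paths agree up to floating-point error, so a nonzero match MSE in the table points to a real difference between the paths being compared rather than to the metric itself.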

	## `_r6` - Split Quick/Research Benchmark Era

	### `_r6` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r6.ipynb`

	Configuration:

	- device: CPU
	- commit: `a2474cc`
	- same quick tasks as `_r5`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \|
	\|---\|---\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.240532 \| 1.444 \| 15225 \| 8656 \|
	\| simple \| enhanced \| 0.045714 \| 2.598 \| 20066 \| 4278 \|
	\| moderate \| baseline \| 0.056279 \| 5.613 \| 20720 \| 11785 \|
	\| moderate \| enhanced \| 0.021122 \| 8.653 \| 12149 \| 2875 \|

	Interpretation:

	- This was a CPU run, so speed conclusions are not treated as primary benchmark evidence.
	- It was useful as a smoke test only.
	- The CPU result reminded us to warn clearly when notebooks are not running on GPU.

	### `_r6` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r6.ipynb`

	Configuration:

	- device: CUDA
	- commit: `5952546`
	- research tasks:
	  - `current_reference`: `seq_len=320`, `features=6`, `epochs=5`
	  - `long_context`: `seq_len=768`, `features=8`, `epochs=4`
	- `RUN_ABLATIONS=True`
	- `RUN_TOKEN_TASK=False`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 6.844 \| recurrent_like \| 6157 \| 5522 \| not available \|
	\| current_reference \| enhanced \| 0.020500 \| 7.366 \| recurrent_like \| 8431 \| 2594 \| 3408 \|
	\| long_context \| baseline \| 27.229956 \| 36.239 \| recurrent_like \| 2819 \| 2939 \| not available \|
	\| long_context \| enhanced \| 0.012164 \| 634.387 \| conv \| 358 \| 1876 \| 2501 \|

	Interpretation:

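	- Enhanced had far lower validation loss than baseline on both tasks (`0.012164` vs `27.229956` on long_context).
	- However, the long-context conv path was extremely slow: about 634 s per epoch and 358 full-sequence tokens/s.
	- The ablation section was too expensive and was stopped midway.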
	- This run motivated the later kernel-generation speedup and disabling ablations by default.

	## `_r7` - Conv Kernel Generation Improved

	### `_r7` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r7.ipynb`

	Configuration:

	- device: CUDA
	- commit: `4b977c1`
	- same quick tasks

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.417 \| 4977 \| 5155 \| not available \|
	\| simple \| enhanced \| 0.045667 \| 2.565 \| 8583 \| 2500 \| 3260 \|
	\| moderate \| baseline \| 0.533364 \| 7.405 \| 5413 \| 5186 \| not available \|
	\| moderate \| enhanced \| 0.021113 \| 7.796 \| 7465 \| 2414 \| 3226 \|

	Interpretation:

	- Quick benchmark remained stable.
	- Enhanced retained quality and full-sequence throughput advantages.
	- Recurrent remained slower than baseline.

	### `_r7` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r7.ipynb`

	Configuration:

	- device: CUDA
	- commit: `b17f72a`
	- `RUN_ABLATIONS=False`
	- `RUN_TOKEN_TASK=False`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 7.351 \| recurrent_like \| 3821 \| 4616 \| not available \|
	\| current_reference \| enhanced \| 0.020500 \| 7.530 \| recurrent_like \| 9289 \| 2780 \| 3339 \|
	\| long_context \| baseline \| 27.229956 \| 39.236 \| recurrent_like \| 3523 \| 3282 \| not available \|
	\| long_context \| enhanced \| 0.012029 \| 44.189 \| conv \| 5971 \| 1776 \| 2229 \|

	Interpretation:

	- The conv speed issue was dramatically improved versus `_r6`.
	- Enhanced long-context epoch time dropped from about 634s to about 44s.
	- Enhanced was still slightly slower than baseline per epoch on long_context, but had much better loss and better full-sequence throughput.

	## `_r8` - No-State Full Forward, Baseline Deploy Metrics, Token-Lite Enabled

	### `_r8` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r8.ipynb`

	Configuration:

	- device: CUDA
	- commit: `73e76a7`
	- same quick tasks
	- baseline deploy metrics became available
	- full-sequence training/inference skips unused final-state computation

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \| Deploy match MSE \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.261 \| 5711 \| 5076 \| 5241 \| 0.000000 \|
	\| simple \| enhanced \| 0.044817 \| 2.550 \| 8204 \| 2621 \| 3367 \| 0.022886 \|
	\| moderate \| baseline \| 0.533364 \| 7.011 \| 5782 \| 5447 \| 4519 \| 0.000000 \|
	\| moderate \| enhanced \| 0.020569 \| 7.010 \| 8926 \| 2503 \| 3390 \| 0.018165 \|

	Interpretation:

	- Baseline deploy columns now populate.
	- Enhanced full-sequence throughput remained ahead.
	- Training time was tied on moderate.

	### `_r8` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r8.ipynb`

	Configuration:

	- device: CUDA
	- commit: `73e76a7`
	- `RUN_ABLATIONS=False`
	- `RUN_TOKEN_TASK=True`

	Forecasting results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 7.235 \| recurrent_like \| 5941 \| 5581 \| 5702 \|
	\| current_reference \| enhanced \| 0.019951 \| 7.177 \| recurrent_like \| 7431 \| 1918 \| 2336 \|
	\| long_context \| baseline \| 27.229956 \| 35.557 \| recurrent_like \| 3969 \| 3759 \| 3842 \|
	\| long_context \| enhanced \| 0.011708 \| 14.235 \| conv \| 19544 \| 1860 \| 2406 \|

	Token-lite results:

	\| Model \| Train CE \| Val CE \| Val PPL \| Seq len \| Train samples \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| baseline \| 3.587260 \| 3.132184 \| 22.924 \| 192 \| 1200 \|
	\| enhanced \| 2.483611 \| 2.486829 \| 12.023 \| 192 \| 1200 \|

	Interpretation:

	- This was the strongest practical result so far.
	- On long_context, enhanced was both much more accurate and much faster per epoch.
	- Token-lite showed enhanced also transferred better to a language-like task.
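
The `Val PPL` column follows from `Val CE` by exponentiation, assuming the cross entropy is measured in nats. A quick check against the `_r8` token-lite table (`rows` is a name introduced here):

```python
import math

# (model, val_ce, reported val_ppl) from the `_r8` token-lite table.
rows = [("baseline", 3.132184, 22.924), ("enhanced", 2.486829, 12.023)]

for model, val_ce, reported_ppl in rows:
    ppl = math.exp(val_ce)  # perplexity = exp(cross entropy in nats)
    print(f"{model}: exp({val_ce}) = {ppl:.3f} (reported {reported_ppl})")
```

Both reported perplexities match the exponentiated cross entropy to the printed precision, so the two columns carry the same information.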

	## `_r9` - Presentation Visuals Added

	### `_r9` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r9.ipynb`

	Configuration:

	- device: CUDA
	- commit: `8738675`
	- same quick tasks as `_r8`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.249 \| 6058 \| 5478 \| 5502 \|
	\| simple \| enhanced \| 0.044817 \| 2.344 \| 9550 \| 2617 \| 3644 \|
	\| moderate \| baseline \| 0.533364 \| 6.672 \| 6324 \| 5686 \| 5599 \|
	\| moderate \| enhanced \| 0.020569 \| 6.571 \| 9304 \| 2771 \| 3416 \|

	Interpretation:

	- Similar to `_r8`, with slightly better timings (likely run-to-run variation).
	- Enhanced still wins on quality and full-sequence throughput.
	- Baseline still wins recurrent throughput.

	### `_r9` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r9.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `60562bd`
	- visual sections added:
	  - task visual preview
	  - prediction comparison plots
	  - error comparison plots

	Forecasting results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy tokens/s \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 7.294 \| recurrent_like \| 6111 \| 4981 \| 5359 \|
	\| current_reference \| enhanced \| 0.019951 \| 7.494 \| recurrent_like \| 8099 \| 2185 \| 3343 \|
	\| long_context \| baseline \| 27.229956 \| 37.885 \| recurrent_like \| 3576 \| 3728 \| 3695 \|
	\| long_context \| enhanced \| 0.011708 \| 14.717 \| conv \| 15654 \| 1810 \| 2327 \|

	Token-lite results:

	\| Model \| Train CE \| Val CE \| Val PPL \| Seq len \| Train samples \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| baseline \| 3.587260 \| 3.132184 \| 22.924 \| 192 \| 1200 \|
	\| enhanced \| 2.483611 \| 2.486829 \| 12.023 \| 192 \| 1200 \|

	Interpretation:

	- `_r9` was the most presentation-friendly record at this point.
	- It confirms the `_r8` story:
	  - enhanced wins quality strongly
	  - enhanced wins full-sequence/conv long-context training and throughput
	  - baseline still wins recurrent deployment throughput
	  - token-lite favors enhanced

	## `_r10` - Balanced Deployment Path Added

	### `_r10` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r10.ipynb`

	Configuration:

	- device: CUDA
	- commit: `09db0da`
	- same quick tasks as `_r9`
	- new metrics:
	  - `balanced_deploy_recurrent_latency_ms`
	  - `balanced_deploy_recurrent_tokens_per_s`
	  - `balanced_deploy_match_mse`

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.134 \| 6053 \| 5891 \| 6034 \| 5938 \| 0.000000 \| 0.000000 \|
	\| simple \| enhanced \| 0.044817 \| 2.532 \| 9973 \| 2507 \| 3763 \| 3123 \| 0.022886 \| 0.000986 \|
	\| moderate \| baseline \| 0.533364 \| 6.331 \| 6134 \| 5835 \| 5512 \| 5816 \| 0.000000 \| 0.000000 \|
	\| moderate \| enhanced \| 0.020569 \| 6.601 \| 10045 \| 2778 \| 3510 \| 2862 \| 0.018165 \| 0.000468 \|

	Interpretation:

	- Enhanced quality and full-sequence throughput remain strong.
	- Deployment-lite is still the fastest enhanced deployment variant.
	- Balanced deployment is slower than deployment-lite, but much more faithful to full forward.
	- Balanced deployment is useful as a fidelity-preserving approximation, not as a pure speed win.

	### `_r10` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r10.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `9ff7e4e`
	- same research tasks as `_r9`
	- balanced deployment metrics added

	Forecasting results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 7.648 \| recurrent_like \| 4933 \| not recorded in compact table \| 5092 \| 4987 \| 0.000000 \| 0.000000 \|
	\| current_reference \| enhanced \| 0.019951 \| 8.193 \| recurrent_like \| 8152 \| not recorded in compact table \| 3404 \| 2687 \| 0.027752 \| 0.000315 \|
	\| long_context \| baseline \| 27.229956 \| 40.350 \| recurrent_like \| 2395 \| not recorded in compact table \| 3397 \| 3285 \| 0.000000 \| 0.000000 \|
	\| long_context \| enhanced \| 0.011708 \| 15.862 \| conv \| 16957 \| not recorded in compact table \| 2245 \| 1886 \| 0.200325 \| 0.001692 \|

	Token-lite results:

	\| Model \| Train CE \| Val CE \| Val PPL \| Seq len \| Train samples \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| baseline \| 3.587260 \| 3.132184 \| 22.924 \| 192 \| 1200 \|
	\| enhanced \| 2.483611 \| 2.486829 \| 12.023 \| 192 \| 1200 \|

	Interpretation:

	- Long-context enhanced still wins strongly on validation loss and full-sequence throughput.
	- Balanced deployment drastically improves fidelity relative to deployment-lite on enhanced:
	  - long_context deploy-lite match MSE: `0.200325`
	  - long_context balanced match MSE: `0.001692`
	- However, balanced deployment is slower than deployment-lite.
	- This suggests the output projection is important for fidelity, while the input-dependent gate is a major recurrent-time cost.
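
The trade-off between the two deployment variants can be put in numbers. The snippet below uses only the long_context enhanced row of the `_r10` research table; the variable names are introduced here for illustration.

```python
# long_context, enhanced model, `_r10` research table.
deploy_lite = {"tokens_per_s": 2245, "match_mse": 0.200325}
balanced = {"tokens_per_s": 1886, "match_mse": 0.001692}

mse_reduction = deploy_lite["match_mse"] / balanced["match_mse"]
speed_retained = balanced["tokens_per_s"] / deploy_lite["tokens_per_s"]

print(f"match MSE reduced {mse_reduction:.0f}x, "
      f"keeping {speed_retained:.0%} of deploy-lite throughput")
```

On this row, balanced deployment gives up roughly a sixth of the deploy-lite throughput for a match MSE that is about two orders of magnitude lower.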

	## `_r11` - FFT Fix, Split Tables, And Challenge Benchmarks

	### `_r11` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r11.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `64f8632`
	- same quick tasks as `_r10`
	- notebook tables split into normal, deployment-lite, and balanced deployment views

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.353 \| 6162 \| 5776 \| 5524 \| 5649 \| 0.000000 \| 0.000000 \|
	\| simple \| enhanced \| 0.044817 \| 2.113 \| 11279 \| 2625 \| 3618 \| 3112 \| 0.022886 \| 0.000986 \|
	\| moderate \| baseline \| 0.533364 \| 6.527 \| 6187 \| 5337 \| 4572 \| 5563 \| 0.000000 \| 0.000000 \|
	\| moderate \| enhanced \| 0.020569 \| 6.264 \| 11434 \| 2598 \| 3338 \| 2809 \| 0.018165 \| 0.000468 \|

	Interpretation:

	- Enhanced remains much better on validation loss and full-sequence throughput.
	- Baseline remains faster for exact recurrent stepping.
	- Deployment-lite is still the fastest enhanced recurrent approximation, while balanced is much more faithful.

	### `_r11` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r11.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `4842762`
	- includes AMP FFT fix and split benchmark tables

	Forecasting results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 6.922 \| recurrent_like \| 5898 \| 5494 \| 5162 \| 5523 \| 0.000000 \| 0.000000 \|
	\| current_reference \| enhanced \| 0.019951 \| 6.383 \| recurrent_like \| 11200 \| 2665 \| 3522 \| 2931 \| 0.027752 \| 0.000315 \|
	\| long_context \| baseline \| 27.229956 \| 36.928 \| recurrent_like \| 2994 \| 2725 \| 2513 \| 2866 \| 0.000000 \| 0.000000 \|
	\| long_context \| enhanced \| 0.011593 \| 10.419 \| conv \| 235772 \| 1849 \| 2477 \| 1542 \| 0.193474 \| 0.001699 \|

	Token-lite results:

	\| Model \| Train CE \| Val CE \| Val PPL \| Seq len \| Train samples \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| baseline \| 3.587260 \| 3.132184 \| 22.924 \| 192 \| 1200 \|
	\| enhanced \| 2.483604 \| 2.486901 \| 12.024 \| 192 \| 1200 \|

	Interpretation:

	- The AMP FFT fix worked: the long-context enhanced conv path completed and showed very high cached full-sequence throughput.
	- Enhanced long-context training is now much faster than baseline in this setup and far more accurate.
	- Recurrent deployment remains the weak point: enhanced exact recurrent throughput is still lower than baseline.
	- Balanced deployment remains the best fidelity-preserving approximation, but it is slower than deployment-lite.

	### `_r11` Challenge Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-challenge-benchmark_r11.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `bfc6e26`
	- first saved run for the challenge benchmark notebook
	- tasks:
	  - permuted MNIST
	  - selective copying
	  - induction-style associative recall

	Results:

	\| Task \| Model \| Val loss \| Val accuracy \| Epoch s \| Forward ms \| Forward tokens/s \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|
	\| permuted_mnist \| baseline \| 2.760003 \| 0.206000 \| 112.318 \| 425.084 \| 118038 \|
	\| permuted_mnist \| enhanced \| 2.041562 \| 0.232000 \| 35.042 \| 7.950 \| 6311750 \|
	\| selective_copying \| baseline \| 3.529677 \| 0.039551 \| 6.509 \| 73.482 \| 222965 \|
	\| selective_copying \| enhanced \| 3.468455 \| 0.029622 \| 2.149 \| 2.739 \| 5981919 \|
	\| induction_recall \| baseline \| 3.615992 \| 0.040039 \| 6.424 \| 72.235 \| 226816 \|
	\| induction_recall \| enhanced \| 3.519182 \| 0.033203 \| 2.061 \| 2.673 \| 6130411 \|

	Interpretation:

	- Enhanced is much faster on the challenge forward benchmark because the full-sequence conv path is active.
	- Permuted MNIST slightly favors enhanced on both loss and accuracy, but both accuracies are still low.
	- Selective copying and induction recall are near random accuracy:
	  - selective copying random accuracy is about `1 / 32 = 0.03125`
	  - induction recall random accuracy is about `1 / 32 = 0.03125`
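	- Enhanced often has lower CE but not consistently higher accuracy. Its CE on the two recall tasks (about `3.47` and `3.52`) sits at roughly the uniform-guess level `ln(32) ≈ 3.466`, suggesting it is smoothing toward the label distribution before learning reliable exact recall.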
	- This is the clearest evidence so far that pure LTI Gamma SSM structure is not enough for Mamba-style selective memory tasks. The next model improvement should add selective input flow while keeping the fixed Gamma transition.
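
The "near random" reading can be checked against both accuracy and cross entropy. A minimal sketch, assuming 32 equally likely classes for the two recall tasks (as the `1 / 32` figure above implies) and the 10 digit classes of MNIST; `chance_levels` is a name introduced here:

```python
import math


def chance_levels(num_classes: int) -> tuple[float, float]:
    """Accuracy and cross entropy (nats) of a uniform guess over num_classes labels."""
    return 1.0 / num_classes, math.log(num_classes)


acc_32, ce_32 = chance_levels(32)  # selective copying / induction recall
acc_10, ce_10 = chance_levels(10)  # permuted MNIST has 10 digit classes
print(f"32 classes: accuracy {acc_32:.5f}, CE {ce_32:.4f}")
print(f"10 classes: accuracy {acc_10:.2f}, CE {ce_10:.4f}")
```

Against these reference levels, every recall-task validation loss in the table is at or above the uniform-guess CE, while permuted MNIST for the enhanced model (`2.041562` loss, `0.232` accuracy) is above chance but still far from solved.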

	## `_r12` - Input-Selection Gate Tested

	### `_r12` Quick Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r12.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `740a9ef`
	- enhanced model includes the new pre-SSM input-selection gate

	Results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| simple \| baseline \| 0.637628 \| 2.249 \| 5923 \| 4449 \| 5323 \| 5568 \| 0.000000 \| 0.000000 \|
	\| simple \| enhanced \| 0.043149 \| 2.184 \| 10904 \| 2438 \| 3751 \| 3099 \| 0.021473 \| 0.001503 \|
	\| moderate \| baseline \| 0.450424 \| 6.747 \| 5908 \| 4689 \| 5084 \| 5381 \| 0.000000 \| 0.000000 \|
	\| moderate \| enhanced \| 0.020161 \| 6.264 \| 8135 \| 2357 \| 3771 \| 2944 \| 0.076783 \| 0.001143 \|

	Interpretation:

	- The input-selection gate did not hurt quick-task quality; enhanced still wins validation loss clearly.
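	- Exact recurrent enhanced throughput dropped slightly versus `_r11` (`2438` and `2357` vs `2625` and `2598` tok/s), plausibly because of the extra gate; baseline recurrent throughput also dropped in this run, so part of the gap may be run-to-run variation.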
	- Deployment-lite mismatch worsened on moderate, but balanced deployment remained much more faithful.
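
The record does not describe the form of the pre-SSM input-selection gate. As a purely hypothetical illustration of why such a gate adds per-step work in recurrent mode, the sketch below assumes an elementwise sigmoid gate on each input channel; `input_selection_gate`, `weight`, and `bias` are invented names, and the real gate may differ.

```python
import math


def input_selection_gate(u, weight, bias):
    """Hypothetical per-channel gate: g = sigmoid(weight * u + bias), output g * u."""
    gated = []
    for u_i, w_i, b_i in zip(u, weight, bias):
        g = 1.0 / (1.0 + math.exp(-(w_i * u_i + b_i)))
        gated.append(g * u_i)
    return gated


# One token with three channels; the gate depends on the input,
# so it has to be re-evaluated at every recurrent step.
u_t = [1.0, -2.0, 0.5]
print(input_selection_gate(u_t, weight=[0.0, 4.0, -4.0], bias=[10.0, 0.0, 0.0]))
```

Because the gate value depends on the current input, it cannot be folded into a precomputed kernel or cached transition, which is one plausible reason a simplified deployment path would drift further from the full forward output.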

	### `_r12` Research Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-research-benchmark_r12.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `0c6ecb8`

	Forecasting results:

	\| Task \| Model \| Val loss \| Mean epoch s \| Expected mode \| Full tokens/s \| Recurrent tokens/s \| Deploy-lite tokens/s \| Balanced deploy tokens/s \| Deploy-lite match MSE \| Balanced match MSE \|
	\|---\|---\|---:\|---:\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| current_reference \| baseline \| 0.709749 \| 7.298 \| recurrent_like \| 5513 \| 5460 \| 5212 \| 5506 \| 0.000000 \| 0.000000 \|
	\| current_reference \| enhanced \| 0.020813 \| 6.753 \| recurrent_like \| 10218 \| 2261 \| 3414 \| 3018 \| 0.023053 \| 0.000478 \|
	\| long_context \| baseline \| 12.850342 \| 37.374 \| recurrent_like \| 3720 \| 3776 \| 3661 \| 3706 \| 0.000000 \| 0.000000 \|
	\| long_context \| enhanced \| 0.011039 \| 11.212 \| conv \| 229320 \| 1589 \| 2390 \| 2043 \| 0.034069 \| 0.001689 \|

	Token-lite results:

	\| Model \| Train CE \| Val CE \| Val PPL \| Seq len \| Train samples \|
	\|---\|---:\|---:\|---:\|---:\|---:\|
	\| baseline \| 2.943133 \| 6.186983 \| 486.377 \| 192 \| 1200 \|
	\| enhanced \| 2.489687 \| 2.490702 \| 12.070 \| 192 \| 1200 \|

	Interpretation:

	- Enhanced remains excellent for long-context forecasting.
	- Input-selection slightly improved long_context val loss versus `_r11` (`0.011039` vs `0.011593`) but worsened exact recurrent speed.
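	- Token-lite strongly favors enhanced in this run, though baseline appears unstable: its val CE (`6.186983`) is far above its train CE (`2.943133`) and above its `_r11` val CE (`3.132184`).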

	### `_r12` Challenge Notebook

	Saved notebook:

	- `output/jupyter-notebook/gamma-s4-challenge-benchmark_r12.ipynb`

	Configuration:

	- device: CUDA
	- commit printed in notebook: `11bd2e6`
	- same challenge tasks as `_r11`, with input-selection gate active in enhanced

	Results:

	\| Task \| Model \| Val loss \| Val accuracy \| Epoch s \| Forward ms \| Forward tokens/s \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|
	\| permuted_mnist \| baseline \| 2.760003 \| 0.206000 \| 114.183 \| 511.946 \| 98010 \|
	\| permuted_mnist \| enhanced \| 2.052564 \| 0.209000 \| 34.080 \| 8.393 \| 5978110 \|
	\| selective_copying \| baseline \| 3.514275 \| 0.028320 \| 7.319 \| 73.808 \| 221983 \|
	\| selective_copying \| enhanced \| 3.468793 \| 0.029785 \| 2.353 \| 2.988 \| 5482614 \|
	\| induction_recall \| baseline \| 3.607873 \| 0.038086 \| 7.439 \| 72.952 \| 224585 \|
	\| induction_recall \| enhanced \| 3.535175 \| 0.039062 \| 2.788 \| 2.925 \| 5600445 \|

	Interpretation:

	- The input-selection gate did not produce a meaningful challenge-task accuracy breakthrough.
	- Permuted MNIST accuracy stayed low and did not improve over `_r11`.
	- Selective copying and induction recall are still near random. With 32 classes, random accuracy is about `0.03125`.
	- The enhanced model still has much better forward throughput and somewhat lower CE, but accuracy shows it is not performing reliable exact recall.
	- This suggests two things:
	  - permuted MNIST likely needs more epochs and/or more samples
	  - selective copying and induction need a stronger selective/content-dependent memory mechanism or a curriculum diagnostic, not just more epochs

	## Versions Not Recorded

	The following are not recorded as complete benchmark versions:

	- Research notebooks before `_r6`: no saved research `_r1` to `_r5` notebooks exist in the repo.
	- Any temporary failed Colab runs during error debugging: tracebacks were discussed in chat, but they are not treated as experiment records.
	- Partial long-context ablation run in `_r6`: only partial output is present, so it is not summarized as a completed ablation result.

	## Current Best Summary

	Best presentable run:

	- `_r12` research benchmark

	Most important result:

	- On `long_context`, `gamma_s4_enhanced` achieved much lower validation loss than baseline and substantially better full-sequence throughput.
	- `_r11` shows the fixed AMP FFT conv path completing successfully and producing very high cached full-sequence throughput on long_context.
	- `_r12` confirms the input-selection gate alone is not enough to solve selective copying or induction recall beyond near-random accuracy.

	Current limitation:

	- `gamma_s4_enhanced` still trails `gamma_baseline` in recurrent token-by-token deployment throughput.
	- Challenge benchmarks show that the current model needs stronger selective/content-dependent memory mechanisms.

	Recommended next improvement targets:

	1. Add challenge-task curriculum diagnostics and longer token-memory epochs.
	2. Explore stronger content-dependent memory beyond static LTI convolution, while preserving the fixed Gamma transition when possible.
	3. Optimize the recurrent/deployment path for `gamma_s4_enhanced`.
	4. Improve deployment-lite fidelity, especially on long_context.
	5. Improve structured Gamma kernel generation for the conv/full-sequence path.