# Gamma SSM / Gamma-S4 Experiment Record

This file records the experiment versions saved as `_rN` notebooks under `output/jupyter-notebook/`.
It also records the later TaoNet-SSM LLM-wrapper and remote RTX benchmark iterations.

The goal is to preserve:

- which model version was tested
- which notebook/task configuration was used
- the main performance results
- what we learned from each run

For runs where the saved notebook does not contain enough information, the version is marked as not recorded.

## Model Names

- `gamma_baseline`: original Gamma SSM using the fixed lower-bidiagonal Gamma transition and recurrent execution.
- `gamma_s4_minimal`: lighter S4-inspired Gamma block. Used in early experiments, later dropped from the main loop because it was not consistently strong.
- `gamma_s4_enhanced`: main S4-inspired Gamma model with learned `dt`, stable discretization, `D` skip, optional gating/output path, full-sequence/kernel mode, and recurrent stepping.

## Metrics

- `val_loss`: validation loss for forecasting tasks. Lower is better.
- `mean_epoch_time_s`: average training epoch time. Lower is better.
- `full_forward_ms` / `full_latency_ms`: whole-sequence forward/inference latency. Lower is better.
- `full_forward_tokens_per_s` / `full_tokens_per_s`: whole-sequence throughput. Higher is better.
- `recurrent_inference_ms` / `recurrent_latency_ms`: token-by-token recurrent latency. Lower is better.
- `recurrent_tokens_per_s`: token-by-token recurrent throughput. Higher is better.
- `deploy_*`: deployment-lite recurrent path. For the baseline, the deployment and recurrent paths are the same; this applies to runs recorded after baseline deploy metrics were enabled.
- `val_ce`: validation cross entropy for token prediction. Lower is better.
- `val_ppl`: validation perplexity for token prediction. Lower is better.
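
The throughput and perplexity metrics reduce to simple arithmetic on the raw measurements. The sketch below assumes that tokens per second is `batch * seq_len` divided by the mean latency, and that `val_ppl` is the exponential of a `val_ce` measured in nats. The helper names are illustrative and are not taken from the benchmark scripts.

```python
import math

def tokens_per_second(batch: int, seq_len: int, latency_ms: float) -> float:
    """Throughput for one timed pass over a (batch, seq_len) token block."""
    return batch * seq_len / (latency_ms / 1000.0)

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity, assuming the cross entropy is measured in nats."""
    return math.exp(cross_entropy_nats)

# Batch 4, seq 512, 2.58 ms per pass gives roughly 794k tok/s.
print(round(tokens_per_second(4, 512, 2.58)))
# A perfect predictor (zero cross entropy) has perplexity 1.0.
print(perplexity(0.0))
```

Under these assumptions a latency column and a tok/s column carry the same information, so either one is enough to compare runs with the same batch and sequence length.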

## TaoNet-SSM LLM Wrapper Iterations

This section records the work that moved the SSM from standalone/notebook benchmarks into the TaoNet LLM comparison loop.
The main implementation repo for SSM changes is this repo. The TaoNet wrapper lives in the local TaoTrain repo and branch listed below.

Related repos and branches:

- SSM repo: `https://github.com/StarMists/gamma_SSM_S4_enhanced.git`
- SSM local path: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM`
- TaoTrain repo: `https://github.com/lobakkang/TaoTrain.git`
- TaoTrain local path: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain`
- TaoTrain branch: `codex/taonet-ssm-core`
- Remote server path for SSM: `/home/student/YouZheng/gamma_ssm_repo`
- Remote server path for TaoTrain: `/home/student/YouZheng/repo`
- Remote execution tool: `C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge`

### LLM Iteration 1 - Add TaoNet SSM Wrapper

Implementation location:

- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
- TaoTrain: `src/taoTrain/config.py`
- TaoTrain: `src/taoTrain/models/registry.py`
- TaoTrain: `tests/test_taonet_ssm.py`
- TaoTrain: `scripts/benchmark_taonet_token_variants.py`

TaoTrain commits:

- `8b1c6fa Add TaoNet Gamma SSM architecture`
- `6edd09e Benchmark TaoNet token SSM variants`

What changed:

- Added a `taonet_ssm` model architecture for apples-to-apples comparison with original attention `taonet`.
- Kept the outer LLM stack close to TaoNet and replaced the sequence-mixing core with an SSM mixer.
- Supported both `gamma_s4` and `dplr` SSM cores.
- Added token-level synthetic CE benchmark comparing `taonet` and `taonet_ssm`.
- Added focused tests for SSM wrapper construction and forward passes.

Local validation:

- `python -m pytest tests\test_taonet_ssm.py -q` passed.
- Broader TaoTrain tests were not run locally because the local environment was missing `datasets`.

Result:

- Functional success. This established the comparison harness.
- Performance was not yet acceptable with full-width DPLR because the wrapper exposed dense DPLR frequency-transfer cost.

### LLM Iteration 2 - Projected SSM Mixer Dimension

Implementation location:

- TaoTrain: `src/taoTrain/models/taonet_ssm.py`
- TaoTrain: `src/taoTrain/config.py`
- TaoTrain: `tests/test_taonet_ssm.py`

TaoTrain commit:

- `5e6b802 Add projected SSM mixer dimension`

What changed:

- Added `ssm_mixer_dim`.
- The SSM branch now supports `d_model -> ssm_mixer_dim -> SSM -> d_model`.
- This keeps the LLM interface the same while reducing the DPLR channel width.
- This is important because DPLR convolutional training cost scales strongly with the channel dimension.
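
The projection path can be pictured as two linear maps around a narrower mixer. The NumPy sketch below is only an illustration of the `d_model -> ssm_mixer_dim -> SSM -> d_model` shape flow. The class name, the running-mean stand-in for the SSM, the `d_model = 256` value, and the FFT length are all assumptions, and the actual wrapper in `taonet_ssm.py` is written in PyTorch and is not reproduced here.

```python
import numpy as np

class ProjectedMixer:
    """d_model -> mixer_dim -> mixer -> d_model, with a stand-in sequence mixer."""

    def __init__(self, d_model: int, mixer_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.standard_normal((d_model, mixer_dim)) / np.sqrt(d_model)
        self.w_out = rng.standard_normal((mixer_dim, d_model)) / np.sqrt(mixer_dim)

    @staticmethod
    def mix(h: np.ndarray) -> np.ndarray:
        # Placeholder for the SSM: a causal running mean over the time axis.
        steps = np.arange(1, h.shape[1] + 1)[None, :, None]
        return np.cumsum(h, axis=1) / steps

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, seq, d_model); the caller never sees mixer_dim.
        return self.mix(x @ self.w_in) @ self.w_out

def dense_transfer_elements(fft_len: int, channels: int) -> int:
    # Size of a dense per-frequency transfer tensor: freq x channels x channels.
    return fft_len * channels * channels

x = np.zeros((4, 512, 256))
y = ProjectedMixer(d_model=256, mixer_dim=64)(x)
print(y.shape)  # (4, 512, 256)
# Going from 256 to 64 channels shrinks a dense transfer tensor by 16x.
print(dense_transfer_elements(1024, 256) // dense_transfer_elements(1024, 64))
```

The quadratic dependence on channel width in `dense_transfer_elements` is one plausible reading of why a narrower mixer helps; the record itself only states that cost scales strongly with the channel dimension.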

Remote benchmark config examples:

- RepoBridge projected 128: `repobridge.taonet.tokenbench.projected128.config.json`
- RepoBridge projected 64: `repobridge.taonet.tokenbench.projected64.config.json`

Important results before SSM-core optimization:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB | Interpretation |
|---|---:|---:|---:|---:|---:|---|
| attention TaoNet | 4 | 512 | about 1.24M | about 280k | about 376 | Baseline comparison point. |
| DPLR full-width mixer 256 | 4 | 512 | about 81k | about 20k | about 6200 | Failed: dense transfer path too slow and memory-heavy. |
| DPLR projected mixer 128 | 4 | 512 | about 214k | about 56k | about 3613 | Better memory, still much slower than attention. |
| DPLR projected mixer 64 | 4 | 512 | about 114k | about 54k | about 2500 | Lower memory but worse forward before core optimization. |

Result:

- Success as an architectural control: projection made DPLR usable enough to iterate.
- Not sufficient alone: the DPLR core still needed direct frequency-response optimization.

### LLM Iteration 3 - Add Scripted SSM Benchmarks

Implementation location:

- SSM: `scripts/benchmark_ssm_variants.py`
- SSM: `.gitignore`

SSM commits:

- `7a90525 Add lightweight SSM benchmark script`
- `c0dede8 Ignore generated benchmark outputs`

What changed:

- Added a Python benchmark script for `baseline`, `gamma_s4`, and `dplr`.
- Measures forward, optional forward+backward, and optional recurrent stepping.
- Writes JSON and CSV outputs.
- Ignored generated benchmark result directories.

Remote raw DPLR result:

| Model | Batch | Seq | Mode | Tok/s | Peak MB |
|---|---:|---:|---|---:|---:|
| DPLR raw SSM | 4 | 512 | forward | about 841k | about 1310 |
| DPLR raw SSM | 4 | 512 | forward+backward | about 101k | about 1310 |
| DPLR raw recurrent | 4 | 512 | recurrent | about 97k | about 10 |

Interpretation:

- Raw DPLR SSM was promising.
- The wrapped LLM bottleneck came from how the DPLR convolutional path scaled under the TaoNet stack, not from the idea of DPLR alone.

### LLM Iteration 4 - Direct DPLR Frequency-Response Application

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

SSM commit:

- `2b204e8 Apply DPLR frequency response directly`

What changed:

- Added a direct training path that applies the DPLR frequency response to the FFT input.
- Avoided materializing the full dense transfer tensor shaped roughly `freq x channels x channels` during training/grad runs.
- Kept the old dense transfer path for eval/no-grad caching.
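
The difference between the dense and direct paths can be shown on a simplified model. The NumPy sketch below assumes a purely diagonal transition (the low-rank DPLR term is left out), a length-`L` kernel `K_k = C A^k B`, and an FFT length of `2L`. All function names are illustrative; this is not the code in `s4_ternary_dplr_ssm.py`. It checks that both frequency-domain paths reproduce the recurrence, while only the dense path builds a `freq x channels x channels` tensor.

```python
import numpy as np

def recurrent_reference(a, B, C, u):
    # x_t = a * x_{t-1} + B u_t, y_t = C x_t, with a diagonal transition a.
    x = np.zeros(a.shape[0])
    ys = []
    for u_t in u:
        x = a * x + B @ u_t
        ys.append(C @ x)
    return np.array(ys)

def frequency_gain(a, seq_len, fft_len):
    # Per-frequency diagonal gain (1 - z^L a^L) / (1 - z a) of the length-L kernel.
    z = np.exp(-2j * np.pi * np.arange(fft_len) / fft_len)
    return (1 - np.outer(z ** seq_len, a ** seq_len)) / (1 - np.outer(z, a))

def dense_path(a, B, C, u, fft_len):
    # Materializes H with shape (freq, channels, channels) before applying it.
    g = frequency_gain(a, len(u), fft_len)
    H = np.einsum("dn,fn,ne->fde", C, g, B)
    U = np.fft.fft(u, n=fft_len, axis=0)
    Y = np.einsum("fde,fe->fd", H, U)
    return np.fft.ifft(Y, axis=0)[: len(u)].real

def direct_path(a, B, C, u, fft_len):
    # Applies B, the gain, and C to the FFT input; no (freq, ch, ch) tensor is built.
    g = frequency_gain(a, len(u), fft_len)
    U = np.fft.fft(u, n=fft_len, axis=0)
    Y = ((U @ B.T) * g) @ C.T
    return np.fft.ifft(Y, axis=0)[: len(u)].real

rng = np.random.default_rng(0)
L, d, N = 16, 4, 8
a = rng.uniform(0.1, 0.9, N)
B = rng.standard_normal((N, d))
C = rng.standard_normal((d, N))
u = rng.standard_normal((L, d))
y_ref = recurrent_reference(a, B, C, u)
print(np.allclose(direct_path(a, B, C, u, 2 * L), y_ref))
print(np.allclose(dense_path(a, B, C, u, 2 * L), y_ref))
```

In an autograd setting the dense tensor and its gradient would both be held in memory, which is consistent with the record's choice to keep the dense path only for eval/no-grad caching.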

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
- Local CPU smoke benchmark with backward passed.

Projected-128 remote result after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| attention TaoNet | 4 | 512 | about 1.32M | about 532k | about 376 |
| DPLR projected mixer 128 | 4 | 512 | about 151k | about 91k | about 508 |

Interpretation:

- Major memory success: projected-128 DPLR dropped from about 3613 MB to about 508 MB.
- Training throughput improved from about 56k to about 91k tok/s.
- Forward-only became slower than the previous projected-128 run, so this change helped training/backward much more than no-grad forward timing.

### LLM Iteration 5 - Specialize Rank-One DPLR Solve

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

SSM commit:

- `5a0abad Specialize rank-one DPLR solve`

What changed:

- Current best DPLR configuration uses `rank=1`.
- Replaced the batched `torch.linalg.inv` for `1 x 1` low-rank systems with scalar reciprocal math.
- Applied the specialization to both direct training and cached dense response paths.
- Left the general rank path intact for `rank > 1`.
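
The rank-one specialization can be illustrated with the Woodbury identity, which reduces a diagonal-plus-low-rank solve to an `r x r` inverse per frequency; for `r = 1` that inverse is a scalar reciprocal. The NumPy sketch below assumes a system of the form `(diag(d) + P Q^H) x = b` and uses hypothetical function names. The sign and conjugation conventions in the actual module may differ.

```python
import numpy as np

def solve_general_rank(d, P, Q, b):
    """Solve (diag(d_f) + P Q^H) x_f = b_f for every frequency f via Woodbury."""
    rank = P.shape[1]
    dinv_b = b / d                                   # (F, N)
    dinv_p = P[None, :, :] / d[:, :, None]           # (F, N, r)
    core = np.eye(rank) + np.einsum("ni,fnj->fij", Q.conj(), dinv_p)
    core_inv = np.linalg.inv(core)                   # batched r x r inverse
    q_b = np.einsum("ni,fn->fi", Q.conj(), dinv_b)   # (F, r)
    return dinv_b - np.einsum("fnj,fji,fi->fn", dinv_p, core_inv, q_b)

def solve_rank_one(d, p, q, b):
    """Same solve for rank 1: the 1 x 1 inverse becomes a scalar reciprocal."""
    dinv_b = b / d
    dinv_p = p[None, :] / d
    denom = 1.0 + dinv_p @ q.conj()                  # (F,)
    q_b = dinv_b @ q.conj()                          # (F,)
    return dinv_b - dinv_p * (q_b / denom)[:, None]

rng = np.random.default_rng(0)
F, N = 5, 6
# Well-conditioned toy data: large diagonal, small low-rank factors.
d = 3.0 + 0.5 * (rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N)))
p = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
q = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
b = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))

x_fast = solve_rank_one(d, p, q, b)
x_general = solve_general_rank(d, p[:, None], q[:, None], b)
print(np.allclose(x_fast, x_general))
```

The two functions agree for `rank = 1`, so the specialization changes only how the correction term is computed, not the result.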

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
- Local CPU smoke benchmark with backward passed.

Projected-128 remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| attention TaoNet | 4 | 512 | about 1.33M | about 545k | about 376 |
| DPLR projected mixer 128 | 4 | 512 | about 485k | about 142k | about 508 |

Projected-64 remote result after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR projected mixer 64 | 4 | 512 | about 618k | about 192k | about 494 |

Scaling probe for projected-64:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| attention TaoNet | 16 | 512 | about 1.16M | about 990k | about 1332 |
| DPLR projected mixer 64 | 16 | 512 | about 2.12M | about 702k | about 1684 |

Interpretation:

- Major success.
- DPLR projected-64 became the best current SSM LLM configuration.
- At batch 16, DPLR projected-64 forward throughput exceeded attention in this synthetic benchmark.
- Backward was still behind attention, but the gap narrowed substantially.
- The SSM now scales much better with batch size, suggesting fixed frequency-response overhead is being amortized.

### LLM Iteration 6 - Precompose Finite Response Projection

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

SSM commits:

- `f09a71b Precompose DPLR finite response projection`
- `648a32e Revert "Precompose DPLR finite response projection"`

What changed:

- Tried replacing `C @ (I - z^L A^L) @ response` with two projected terms:
  - `C @ response`
  - `(C @ A^L) @ response`
- The goal was to reduce one batch/frequency hidden-state multiplication in the direct path.
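
The rewrite relies on the identity `C (I - z^L A^L) r = C r - z^L (C A^L) r`, where `C A^L` can be computed once and reused. The NumPy check below uses small random matrices and one arbitrarily chosen FFT root; the sizes and variable names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state, channels, seq_len = 6, 3, 8
A = 0.3 * rng.standard_normal((state, state))
C = rng.standard_normal((channels, state))
response = rng.standard_normal(state) + 1j * rng.standard_normal(state)
z = np.exp(-2j * np.pi * 5 / 16)   # one FFT root, chosen arbitrarily

A_L = np.linalg.matrix_power(A, seq_len)

# Original form: correct the hidden-state response, then project with C.
original = C @ ((np.eye(state) - z**seq_len * A_L) @ response)

# Precomposed form: project with C and with the precomputed (C @ A^L).
CA_L = C @ A_L
precomposed = C @ response - z**seq_len * (CA_L @ response)

print(np.allclose(original, precomposed))
```

The two forms are algebraically equal, so any speed difference between them comes from how the operations map onto the hardware rather than from the math.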

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
- Local smoke benchmark passed.
- Direct-vs-cached convolution comparison had max absolute difference around `2.4e-7`.

Remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
| DPLR projected mixer 64 with this change | 4 | 512 | about 495k | about 162k | about 478 |

Interpretation:

- Failed on real GPU token benchmark.
- It saved a little memory but reduced speed too much.
- The commit was intentionally reverted, so current SSM `main` is back to the best-performing rank-one direct-response core.

### LLM Iteration 7 - Rank-One Matmul Fast Path

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

SSM commits:

- `43de801 Use matmul fast path for rank-one DPLR`
- `9ffa5a7 Gate rank-one matmul path by batch size`
- `4e130b6 Limit rank-one matmul path to small batches`
- `8969916 Revert "Limit rank-one matmul path to small batches"`
- `5b3a957 Revert "Gate rank-one matmul path by batch size"`
- `a46a2af Revert "Use matmul fast path for rank-one DPLR"`

What changed:

- Tried a deeper `rank=1` direct-application specialization.
- Replaced several generic `einsum` operations with batched `matmul` and vector reductions.
- The goal was to reduce Python/operator overhead and improve backward throughput for the current best DPLR rank.
- A follow-up tried to gate the path by batch size after the batch-16 scaling run regressed.

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed.
- Local CPU smoke benchmark passed.
- Direct-vs-cached convolution comparison had max absolute difference around `2.4e-7`.

Remote result:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR projected mixer 64 before this change | 4 | 512 | about 618k | about 192k | about 494 |
| DPLR projected mixer 64 first matmul run | 4 | 512 | about 643k | about 208k | about 494 |
| DPLR projected mixer 64 repeated small-batch gated run | 4 | 512 | about 470k-472k | about 161k-175k | about 494 |
| DPLR projected mixer 64 matmul at batch 16 | 16 | 512 | about 1.47M | about 388k | about 1684 |
| DPLR projected mixer 64 previous best at batch 16 | 16 | 512 | about 2.12M | about 702k | about 1684 |

Interpretation:

- Failed overall.
- The first batch-4 run looked promising, but repeated remote results were worse.
- The matmul formulation regressed the larger-batch scaling behavior that matters for GPU utilization.
- All matmul fast-path commits were reverted, so current SSM `main` returns to the best-known `5a0abad` rank-one scalar-solve behavior plus the experiment-record commits.

### LLM Iteration 8 - TileLang Capability Detection

Implementation location:

- SSM: `csrc/tilelang/selective_scan.py`
- SSM: `csrc/tilelang/__init__.py`
- SSM: `gamma_space_model/ops/selective_scan_interface.py`
- SSM: `gamma_space_model/modules/ssm_gamma.py`
- SSM: `scripts/diagnose_tilelang_acceleration.py`

SSM commit:

- `4784856 Make TileLang acceleration detection explicit`

What changed:

- Made TileLang capability reporting explicit and conservative.
- Before this change, `HAS_TILELANG_OPS` became true whenever the Python fallback module imported.
- That was misleading because `csrc/tilelang` did not actually dispatch to a real TileLang kernel; it used PyTorch fallback code.
- Added `TILELANG_BACKEND` and `HAS_TILELANG_ACCELERATION` flags.
- Added `scripts/diagnose_tilelang_acceleration.py` to print package availability, repo backend flags, and a small Gamma forward timing.
- Fixed `SSMGamma.step` dtype/device casting after the honest fallback path exposed a float64 failure in the normal PyTorch path.
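
The core idea is to separate "the package can be imported" from "a real kernel is dispatched". The sketch below shows one conservative way to derive such flags using only the standard library. The function names, the `kernel_dispatch_ready` argument, and the `"tilelang"` backend string are assumptions; only the flag names and the `pytorch_fallback` value come from the record above.

```python
import importlib.util

def package_available(name: str) -> bool:
    """True if the package can be imported; says nothing about kernel dispatch."""
    return importlib.util.find_spec(name) is not None

def detect_tilelang(kernel_dispatch_ready: bool) -> dict:
    """Report acceleration only when a real kernel is wired up, not on import."""
    tilelang_available = package_available("tilelang")
    accelerated = tilelang_available and kernel_dispatch_ready
    return {
        "tilelang_available": tilelang_available,
        "triton_available": package_available("triton"),
        "HAS_TILELANG_ACCELERATION": accelerated,
        "TILELANG_BACKEND": "tilelang" if accelerated else "pytorch_fallback",
    }

# With no real kernel wired up, the report stays on the fallback backend
# regardless of which packages happen to be installed.
report = detect_tilelang(kernel_dispatch_ready=False)
print(report["HAS_TILELANG_ACCELERATION"], report["TILELANG_BACKEND"])
```

Keeping package availability and acceleration as separate fields means a benchmark log can show Triton or TileLang as installed without implying that either was used.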

Validation:

- `python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q` passed locally: `22 passed`.
- Local diagnostic reported:
  - `has_tilelang_ops=false`
  - `tilelang_backend=pytorch_fallback`
  - `triton_available=false`
  - `tilelang_available=false`

Remote RTX 5090 diagnostic:

| Field | Value |
|---|---|
| Torch | `2.11.0+cu130` |
| CUDA | available |
| GPU | `NVIDIA GeForce RTX 5090` |
| Triton package | available |
| TileLang package | not available |
| Repo `HAS_TILELANG_OPS` | `false` |
| Repo `TILELANG_BACKEND` | `pytorch_fallback` |
| Gamma fallback forward | about `76.7k` tok/s at batch 4, seq 512, bf16 |

Remote raw SSM benchmark after this change:

| Model | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR raw SSM | 4 | 512 | about 3.16M | about 1.03M | about 57 |
| Gamma-S4 raw SSM | 4 | 512 | about 100.6k | about 45.5k | about 467 |
| Baseline Gamma raw SSM | 4 | 512 | about 85.2k | about 32.3k | about 120 |

Interpretation:

- This iteration did not add a real TileLang kernel yet.
- It fixed an important measurement and dispatch problem: fallback code is no longer reported as hardware acceleration.
- The remote server has Triton installed but does not have the TileLang package installed.
- The current DPLR path is frequency-domain PyTorch/cuBLAS and does not use `csrc/tilelang`.
- The next hardware-acceleration step should be explicit: either install/use real TileLang on the remote server or write a Triton/TileLang kernel for a clearly scoped hot path. The best candidate hot path is not the old baseline Gamma fallback; it is the DPLR direct frequency-response/backward path used by `taonet_ssm`.

### LLM Iteration 9 - DPLR Frequency-Path Profiling And Root Cache

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`
- SSM: `scripts/profile_dplr_frequency_path.py`

SSM commit:

- `92643c5 Cache DPLR frequency roots`

What changed:

- Added a per-module cache for FFT roots and `roots^seq_len`.
- These tensors are constants for a given `(seq_len, fft_len, dtype, device)`, so rebuilding them every forward/layer is unnecessary GPU work.
- Added `scripts/profile_dplr_frequency_path.py` to profile the DPLR convolutional path directly on the remote server.

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed locally.
- `python -m pytest tests\test_ssm_gamma.py tests\test_s4_ternary_dplr_ssm.py -q` passed locally: `22 passed`.
- Local profiler smoke passed and showed `frequency_grid_cache_entries=1`.

Remote profiler result for raw DPLR at batch 4, seq 512, d_model 64, hidden_dim 256:

| Mode | Mean ms | Tok/s | Peak MB |
|---|---:|---:|---:|
| forward | about 2.58 | about 793k | not measured |
| forward+backward | about 3.27 | about 626k | about 52 |

Remote profiler interpretation:

- The largest CUDA entries were `aten::bmm`, `aten::mm`, and their backward paths.
- `aten::linalg_matrix_power` was visible but small in this configuration.
- Root generation was not the dominant cost, so the cache is a modest cleanup rather than a major acceleration.
- A future TileLang/Triton kernel should target fused rank-1 DPLR frequency-response application and its backward, especially around the small complex BMM/MM pattern. Replacing the old Gamma Python fallback is not the right priority for the TaoNet-SSM goal.

TaoNet projected-64 check after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR projected mixer 64 | 4 | 512 | about 656k | about 163k | about 494 |

Scaling probe after this change:

| Variant | Batch | Seq | Forward tok/s | Backward tok/s | Peak MB |
|---|---:|---:|---:|---:|---:|
| DPLR projected mixer 64 | 8 | 512 | about 983k | about 341k | about 889 |
| DPLR projected mixer 64 | 16 | 512 | about 1.03M | about 414k | about 1684 |

Interpretation:

- The root cache is correct and removes repeated constant construction.
- End-to-end results remain noisy; this is not a breakthrough optimization.
- The main value of this iteration is the profiler evidence: the next real hardware acceleration should fuse the DPLR rank-1 complex frequency-response operations, not spend effort on the older baseline Gamma fallback path.

### LLM Iteration 10 - Shared DPLR Frequency Grid Cache

Implementation location:

- SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py`

SSM commit:

- `a9e5d3e Share DPLR frequency grid cache`

What changed:

- Promoted the DPLR FFT root cache from per-module to class-level shared cache.
- The previous cache avoided rebuilding roots inside a single SSM module, but a multi-layer TaoNet creates one SSM module per layer.
- The shared cache lets all layers reuse the same `(roots, roots^seq_len)` tensors for a given `(seq_len, fft_len, dtype, device)`.
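
A class-level cache keyed on `(seq_len, fft_len, dtype, device)` might look like the following. This is a plain-Python sketch with a hypothetical class name, string stand-ins for dtype and device, and an assumed root sign convention. The real module stores tensors rather than Python lists.

```python
import cmath

class FrequencyGridUser:
    """Stand-in for an SSM layer; only the cache behaviour is modelled."""

    # Class-level dict: shared by every instance, i.e. by every layer.
    _grid_cache: dict = {}

    @classmethod
    def frequency_grid(cls, seq_len, fft_len, dtype="complex64", device="cpu"):
        key = (seq_len, fft_len, dtype, device)
        if key not in cls._grid_cache:
            # Roots of unity and their seq_len-th powers are built once per key.
            roots = [cmath.exp(-2j * cmath.pi * k / fft_len) for k in range(fft_len)]
            roots_pow = [r ** seq_len for r in roots]
            cls._grid_cache[key] = (roots, roots_pow)
        return cls._grid_cache[key]

# Four "layers" asking for the same grid share one cache entry.
layers = [FrequencyGridUser() for _ in range(4)]
grids = [layer.frequency_grid(512, 1024) for layer in layers]
print(len(FrequencyGridUser._grid_cache))       # 1
print(all(g is grids[0] for g in grids))        # True
```

A per-instance dict would have produced four identical entries here, one per layer, which is the redundancy this iteration removed.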

Validation:

- `python -m pytest tests\test_s4_ternary_dplr_ssm.py -q` passed locally.
- Local `scripts/profile_dplr_frequency_path.py` smoke passed and still reported one frequency-grid cache entry.

Required TaoNet comparison after this iteration:

Remote benchmark:

- RepoBridge run: `taonet-vs-dplr-proj64-shared-grid-bench-20260429-101304`
- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
|---|---:|---:|---|---:|---:|---:|
| attention TaoNet | 4 | 512 | forward | about 1.30M | about 193 | 9.064 |
| attention TaoNet | 4 | 512 | forward+backward | about 513k | about 376 | 9.064 |
| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward | about 499k | about 190 | 9.059 |
| SSM TaoNet, DPLR projected 64 | 4 | 512 | forward+backward | about 162k | about 492 | 9.059 |

Comparison:

- SSM forward throughput was about `38%` of attention at batch 4, seq 512.
- SSM forward+backward throughput was about `32%` of attention.
- SSM forward memory was slightly lower than attention, but backward peak memory was higher.
- Loss was comparable because this is a random synthetic token benchmark, not a trained quality result.

Interpretation:

- The shared cache is correct and small, but it did not create a clear end-to-end speed breakthrough.
- This reinforces the profiler conclusion: constant/root setup is not the dominant TaoNet-SSM bottleneck.
- Future iterations should include the attention-vs-SSM table directly, and hardware work should focus on the DPLR rank-1 complex BMM/MM and backward pattern.

### LLM Iteration 11 - Re-anchor On Projected-64 Scaling Regime

Reason for this iteration:

- The strongest previous result came from a scaling probe, not from batch-4 timing.
- Later iterations over-emphasized batch 4, which made the SSM look worse and encouraged the wrong optimization target.
- This iteration re-established the primary benchmark as attention TaoNet vs SSM TaoNet under larger projected-64 batches.

Implementation change:

- No model-code change.
- Benchmark-policy change: projected-64 scaling comparisons should be treated as primary acceptance tests for throughput work.

Remote benchmark:

- RepoBridge run: `taonet-token-dplr-proj64-scale-bench-20260429-111150`
- RepoBridge run: `taonet-token-dplr-proj64-extended-scale-bench-20260429-111350`
- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, synthetic next-token CE

Required TaoNet comparison:

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB | Loss |
|---|---:|---:|---|---:|---:|---:|
| attention TaoNet | 8 | 512 | forward | about 1.09M | about 319 | 9.061 |
| attention TaoNet | 8 | 512 | forward+backward | about 468k | about 697 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward | about 1.17M | about 320 | 9.058 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 318k | about 889 | 9.058 |
| attention TaoNet | 16 | 512 | forward | about 1.60M | about 596 | 9.059 |
| attention TaoNet | 16 | 512 | forward+backward | about 503k | about 1332 | 9.059 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward | about 1.03M | about 580 | 9.060 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 427k | about 1684 | 9.060 |
| attention TaoNet | 32 | 512 | forward | about 1.88M | about 1124 | 9.062 |
| attention TaoNet | 32 | 512 | forward+backward | about 632k | about 2590 | 9.062 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward | about 2.62M | about 1100 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 705k | about 3273 | 9.061 |
| attention TaoNet | 64 | 512 | forward | about 3.53M | about 2204 | 9.061 |
| attention TaoNet | 64 | 512 | forward+backward | about 683k | about 5121 | 9.061 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward | about 1.30M | about 2140 | 9.060 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 618k | about 6451 | 9.060 |

Comparison:

- Batch 8: SSM forward is slightly faster than attention, but backward is slower.
- Batch 16: SSM backward is closer to attention than in batch-4 runs, but still slower.
- Batch 32: SSM beats attention in both forward and forward+backward throughput in this run.
- Batch 64: SSM falls off sharply, so the useful scaling point is not simply the largest batch.
- SSM backward memory remains higher than attention, especially at larger batches.

Interpretation:

- The projected-64 DPLR SSM should be optimized and evaluated around the scaling sweet spot, currently batch 32 for this synthetic benchmark on the RTX 5090.
- Batch-4 timing is still useful for smoke tests, but it should not be treated as the main performance target.
- This is a configuration-level breakthrough: SSM can outperform attention at the right batch size even before custom TileLang/Triton kernels.
- Next improvement directions should either preserve or improve the batch-32 scaling result, not merely improve batch-4 microbenchmarks.

### LLM Iteration 12 - Token Accuracy Benchmark And Causal Memory Check

Reason for this iteration:

- Throughput alone is not sufficient; the SSM TaoNet must also learn useful token tasks.
- The benchmark script previously reported only random synthetic CE, which is not an inference accuracy signal.
- This iteration adds lightweight trained token tasks and reports `eval_accuracy`.

Implementation location:

- TaoTrain: `scripts/benchmark_taonet_token_variants.py`

TaoTrain commit:

- `59b84cd Add token task accuracy benchmark`

What changed:

- Added `--token-task` with:
  - `random`: original random next-token timing task
  - `increment`: deterministic token mapping, label is current token plus one modulo vocab
  - `previous`: causal memory task, label is the previous token
- Added optional short training with `--train-steps`, `--learning-rate`, `--weight-decay`.
- Added eval metrics:
  - `eval_loss`
  - `eval_accuracy`
  - `train_final_loss`
  - `train_seconds`
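
The two deterministic tasks differ in what the model must remember. The sketch below builds labels for `increment` and `previous` in plain Python. The `IGNORE` marker for position 0 of the `previous` task is an assumption, since the record does not say how the benchmark script handles the first position, and the function names are illustrative.

```python
import random

IGNORE = -100  # assumed "no label" marker for positions without a target

def make_labels(tokens, task, vocab_size):
    if task == "increment":
        # Depends only on the current token: a per-position lookup suffices.
        return [(t + 1) % vocab_size for t in tokens]
    if task == "previous":
        # Depends on the token one step back: requires causal memory.
        return [IGNORE] + tokens[:-1]
    raise ValueError(f"unsupported task: {task}")

def accuracy(predictions, labels):
    scored = [(p, y) for p, y in zip(predictions, labels) if y != IGNORE]
    return sum(p == y for p, y in scored) / len(scored)

rng = random.Random(0)
tokens = [rng.randrange(128) for _ in range(16)]
inc_labels = make_labels(tokens, "increment", 128)
prev_labels = make_labels(tokens, "previous", 128)

# A position-wise lookup solves `increment` exactly.
lookup = [(t + 1) % 128 for t in tokens]
print(accuracy(lookup, inc_labels))   # 1.0
```

Because `increment` can be solved without looking at any other position, only `previous` tests whether the sequence mixer actually carries information across time steps.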

Validation:

- Local TaoTrain smoke passed on CPU.
- `python -m pytest tests\test_taonet_ssm.py -q` passed locally.

Broad speed comparison after adding accuracy columns:

- RepoBridge run: `taonet-vs-dplr-proj64-broad-speed-bench-20260429-112432`
- Config: projected DPLR mixer dim 64, DPLR rank 1, 4 layers, bf16, random token task, batch sweep

| Architecture | Batch | Seq | Mode | Tok/s | Peak MB |
|---|---:|---:|---|---:|---:|
| attention TaoNet | 8 | 512 | forward+backward | about 873k | about 697 |
| SSM TaoNet, DPLR projected 64 | 8 | 512 | forward+backward | about 244k | about 956 |
| attention TaoNet | 16 | 512 | forward+backward | about 589k | about 1332 |
| SSM TaoNet, DPLR projected 64 | 16 | 512 | forward+backward | about 646k | about 1748 |
| attention TaoNet | 32 | 512 | forward+backward | about 680k | about 2592 |
| SSM TaoNet, DPLR projected 64 | 32 | 512 | forward+backward | about 816k | about 3338 |
| attention TaoNet | 64 | 512 | forward+backward | about 763k | about 5121 |
| SSM TaoNet, DPLR projected 64 | 64 | 512 | forward+backward | about 544k | about 6516 |

Speed interpretation:

- SSM remains poor at batch 8.
- SSM wins forward+backward throughput at batch 16 and batch 32 in this run.
- SSM falls off again at batch 64.
- Batch 16-32 remains the useful projected-64 scaling range.
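
The speed interpretation follows from the SSM-to-attention throughput ratio at each batch size. The short script below recomputes that ratio from the approximate forward+backward figures in the table above; the variable names are illustrative.

```python
# (batch, attention tok/s, SSM tok/s), forward+backward, from the table above.
sweep = [
    (8, 873_000, 244_000),
    (16, 589_000, 646_000),
    (32, 680_000, 816_000),
    (64, 763_000, 544_000),
]

ratios = {batch: ssm / attn for batch, attn, ssm in sweep}
for batch, ratio in ratios.items():
    print(f"batch {batch}: SSM/attention = {ratio:.2f}")

# Batches where the SSM is faster than attention in this run.
winning_batches = [b for b, r in ratios.items() if r > 1.0]
print(winning_batches)  # [16, 32]
```

Since the source figures are rounded ("about"), the ratios are only indicative; the record also notes that repeated remote runs vary noticeably.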
| Token accuracy comparison: | |
| - Previous-token task run: `taonet-vs-dplr-proj64-previous-token-quality-20260429-112456` | |
| - Linear/ungated SSM ablation run: `taonet-vs-dplr-proj64-previous-token-linear-quality-20260429-112623` | |
| - Increment task run: `taonet-vs-dplr-proj64-increment-token-quality-20260429-112719` | |
| - All quality runs used batch 32, seq 128, vocab 128, 100 train steps, bf16. | |
| | Task | Architecture | Eval loss | Eval accuracy | Forward+backward tok/s | | |
| |---|---|---:|---:|---:| | |
| | previous | attention TaoNet | about 0.033 | about 0.999 | about 551k | | |
| | previous | SSM TaoNet, DPLR projected 64 | about 4.858 | about 0.009 | about 292k | | |
| | previous, linear ungated SSM | attention TaoNet | about 0.046 | about 0.999 | about 990k | | |
| | previous, linear ungated SSM | SSM TaoNet, DPLR projected 64 | about 4.626 | about 0.026 | about 346k | | |
| | increment | attention TaoNet | about 0.007 | 1.000 | about 1.09M | | |
| | increment | SSM TaoNet, DPLR projected 64 | about 0.009 | 1.000 | about 344k | | |
| Failed SSM-core improvement: | |
| - SSM commit `2e974c9 Add delayed DPLR skip` added a learnable one-step diagonal delayed skip to help causal memory. | |
| - Remote previous-token run after this change: `taonet-vs-dplr-proj64-previous-token-quality-20260429-112953`. | |
| - Result: SSM remained near random, eval accuracy about `0.008`, and speed worsened. | |
| - The change was reverted by `3fc0575 Revert "Add delayed DPLR skip"`. | |
| Interpretation: | |
| - Projected-64 DPLR SSM can learn simple token mappings (`increment`) to perfect accuracy. | |
| - It currently fails a short causal memory/copy task (`previous`) under the same 100-step setting where attention TaoNet reaches about 99.9% accuracy. | |
| - The failure is not solved by removing SSM activation/gates or by a simple delayed diagonal skip. | |
| - Future improvement claims must report both: | |
| - speed comparison across batch 8/16/32/64 | |
| - trained token accuracy, especially on causal memory tasks | |
| - The next quality-focused direction should investigate the SSM wrapper/core's ability to expose previous-token information, not only low-level GPU speed. | |
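
To make "near random" concrete, the sketch below builds toy targets for the two synthetic tasks and computes the chance-level baseline for vocab 128. The task definitions are assumptions inferred from the task names (`previous`: predict the token one position back; `increment`: predict the current token plus one, modulo the vocabulary). The actual generators in `benchmark_taonet_token_variants.py` may differ, and all helper names here are hypothetical.

```python
import math
import random

VOCAB_SIZE = 128  # matches the vocab used in the quality runs above


def make_previous_targets(tokens):
    """Assumed `previous` task: the target at position t is the input at t-1.

    Position 0 has no predecessor, so it is marked with -1 (to be ignored).
    """
    return [-1] + tokens[:-1]


def make_increment_targets(tokens, vocab_size=VOCAB_SIZE):
    """Assumed `increment` task: the target is the current token plus one (mod vocab)."""
    return [(tok + 1) % vocab_size for tok in tokens]


def chance_baseline(vocab_size=VOCAB_SIZE):
    """Accuracy and cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return 1.0 / vocab_size, math.log(vocab_size)


rng = random.Random(0)
tokens = [rng.randrange(VOCAB_SIZE) for _ in range(8)]
previous_targets = make_previous_targets(tokens)
increment_targets = make_increment_targets(tokens)
chance_acc, chance_loss = chance_baseline()
print(f"chance accuracy ~ {chance_acc:.4f}, chance loss ~ {chance_loss:.3f}")
```

Under these assumptions a uniform guess scores about 0.008 accuracy and about 4.852 loss, which is where the plain DPLR SSM rows sit on `previous`. `increment` needs only the current token, while `previous` needs information from one step back, which fits the observed split.
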
| ### LLM Iteration 13 - Local Shift Register For Causal Token Memory | |
| Reason for this iteration: | |
| - Projected-64 was the strongest SSM speed configuration, but it failed the `previous` token-memory task. | |
| - Capacity probes showed the failure was not caused by the projected-64 bottleneck alone: | |
| - projected-128 SSM eval accuracy stayed near random, about `0.007` | |
| - full-width projected-256 SSM eval accuracy stayed near random, about `0.008` | |
| - The next improvement therefore targeted explicit short causal memory while preserving the DPLR SSM as the main sequence mixer. | |
| Implementation location: | |
| - TaoTrain commit: `bb3bf90 Add SSM local shift mixer option` | |
| - TaoTrain: `src/taoTrain/models/taonet_ssm.py` | |
| - TaoTrain: `src/taoTrain/config.py` | |
| - TaoTrain: `scripts/benchmark_taonet_token_variants.py` | |
| - TaoTrain: `tests/test_taonet_ssm.py` | |
| What changed: | |
| - Added opt-in `ssm_local_shift`. | |
| - The SSM mixer can now add a one-token causal shift/register branch: | |
| - `shifted[:, 1:] = x_norm[:, :-1]` | |
| - the output contribution is scaled by a single learned scalar, configured through `ssm_local_shift_init`. | |
| - The branch is deliberately cheap and ternary-friendly in structure: it is a causal shift plus scalar gain, not another dense attention mechanism. | |
| - The default remains off, so older SSM benchmarks are still comparable. | |
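
The shift branch described above can be sketched without any framework. The snippet below is a minimal illustration on nested Python lists shaped `[batch][time][channel]`. It assumes the shifted tensor is zero-filled at the first position and that the scalar gain multiplies the shifted activations before they are added to the mixer output. The function name and those two details are assumptions, not code from `taonet_ssm.py`.

```python
def local_shift_branch(x_norm, gain):
    """One-token causal shift with a scalar gain.

    x_norm: nested list indexed [batch][time][channel].
    Returns a new nested list where position t holds gain * x_norm[t-1],
    and position 0 is zero because it has no predecessor.
    """
    out = []
    for sequence in x_norm:
        channels = len(sequence[0]) if sequence else 0
        shifted = [[0.0] * channels]  # t = 0 sees nothing from the past
        # Same data movement as shifted[:, 1:] = x_norm[:, :-1]
        for prev_step in sequence[:-1]:
            shifted.append([gain * value for value in prev_step])
        out.append(shifted[: len(sequence)])
    return out


# Tiny example: batch of 1, three time steps, two channels.
x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
branch = local_shift_branch(x, gain=0.5)
print(branch)
```

Because position `t` only ever reads position `t-1`, the branch stays causal and adds no matrix multiply, which is what keeps it cheap next to the DPLR mixer.
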
| Validation: | |
| - Local TaoTrain: | |
| - `PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q` passed, `4 passed`. | |
| - CPU smoke for `benchmark_taonet_token_variants.py --ssm-local-shift` passed. | |
| Capacity diagnostic before the change: | |
| | Architecture | Mixer dim | Batch | Seq | Eval loss | Eval accuracy | Forward+backward tok/s | | |
| |---|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | 32 | 128 | about 0.025 | 1.000 | about 1.04M | | |
| | SSM TaoNet, DPLR | 128 | 32 | 128 | about 4.857 | about 0.007 | about 379k | | |
| | attention TaoNet | n/a | 32 | 128 | about 0.042 | about 0.999 | about 556k | | |
| | SSM TaoNet, DPLR | 256 | 32 | 128 | about 4.856 | about 0.008 | about 389k | | |
| Required TaoNet comparison after the change: | |
| - RepoBridge run: `taonet-vs-dplr-proj64-local-shift-previous-quality-20260429-144930` | |
| - RepoBridge broad run: `taonet-vs-dplr-proj64-local-shift-previous-broad-quality-20260429-145014` | |
| - Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled. | |
| | Architecture | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB | | |
| |---|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | 8 | about 4.376 | about 0.096 | about 695k | about 238k | about 103 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 8 | about 0.010 | 1.000 | about 226k | about 89k | about 181 | | |
| | attention TaoNet | 16 | about 1.048 | about 0.847 | about 1.26M | about 508k | about 166 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | about 0.008 | 1.000 | about 520k | about 189k | about 299 | | |
| | attention TaoNet | 32 | about 0.043 | 1.000 | about 2.54M | about 555k | about 297 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 32 | about 0.008 | 1.000 | about 1.16M | about 353k | about 513 | | |
| | attention TaoNet | 64 | about 0.020 | 1.000 | about 4.75M | about 1.73M | about 553 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 64 | about 0.007 | 1.000 | about 2.43M | about 403k | about 956 | | |
| Interpretation: | |
| - Success: this is the first projected-64 SSM TaoNet result that solves the `previous` causal-memory task. | |
| - The result is not only a batch-32 spot check. SSM reached perfect eval accuracy at batch 8, 16, 32, and 64. | |
| - The quality gain is large: plain projected-64, projected-128, and projected-256 DPLR all stayed near random on the same task. | |
| - Speed tradeoff: local-shift SSM is slower than attention on this short-sequence previous-token benchmark, especially backward. | |
| - This should be treated as a quality architecture fix, not a hardware-acceleration fix. The next hardware iteration should still target fused DPLR frequency/backward kernels. | |
| ### LLM Iteration 14 - Explicit DPLR Transfer-Mode Probe | |
| Reason for this iteration: | |
| - After the local-shift quality fix, the next bottleneck was speed. | |
| - The DPLR direct frequency path applies the finite correction to batch-dependent hidden responses. | |
| - A possible alternative was to materialize the full frequency transfer matrix, then multiply by the input FFT. | |
| - This could be faster for some batch/sequence shapes, but it risks high memory and repeated transfer construction. | |
| Implementation location: | |
| - SSM commit: `749a4cf Add DPLR transfer profiling mode` | |
| - SSM commit: `e34b67c Add DPLR conv transfer mode` | |
| - TaoTrain commit: `ceb08e6 Expose SSM conv transfer mode` | |
| - SSM: `gamma_space_model/modules/s4_ternary_dplr_ssm.py` | |
| - SSM: `scripts/profile_dplr_frequency_path.py` | |
| - TaoTrain: `scripts/benchmark_taonet_token_variants.py` | |
| What changed: | |
| - Added profiler support for comparing: | |
| - direct DPLR frequency response application | |
| - materialized transfer matrix application | |
| - Added explicit `kernel_mode="conv_transfer"` to `S4TernaryDPLRSSM`. | |
| - Exposed `conv_transfer` through TaoTrain config/benchmark CLI. | |
| - The mode is opt-in only. The default/recommended projected-64 path remains `conv`. | |
| Local validation: | |
| - SSM: `python -m pytest tests\test_s4_ternary_dplr_ssm.py tests\test_ssm_gamma.py -q` passed, `23 passed`. | |
| - TaoTrain: `PYTHONPATH=...\TaoTrain\src;...\Taotern_SSM python -m pytest tests\test_taonet_ssm.py -q` passed, `5 passed`. | |
| Isolated SSM-core remote profile: | |
| - RepoBridge run: `ssm-dplr-direct-vs-transfer-s128-profile-20260429-145555` | |
| - Config: DPLR state/mixer dim 64, hidden dim 256, seq 128, bf16, rank 1. | |
| | Method | Batch | Forward tok/s | Forward+backward tok/s | Peak MB | Interpretation | | |
| |---|---:|---:|---:|---:|---| | |
| | direct | 8 | about 555k | about 332k | about 34 | baseline direct path | | |
| | transfer | 8 | about 1.24M | about 440k | about 247 | faster but much higher memory | | |
| | direct | 16 | about 790k | about 1.13M | about 47 | direct wins | | |
| | transfer | 16 | about 737k | about 481k | about 248 | transfer loses | | |
| | direct | 32 | about 6.73M | about 1.86M | about 74 | direct wins | | |
| | transfer | 32 | about 4.89M | about 1.68M | about 250 | transfer loses | | |
| | direct | 64 | about 6.90M | about 2.20M | about 128 | baseline direct path | | |
| | transfer | 64 | about 2.93M | about 3.06M | about 253 | forward+backward faster, forward slower | | |
| TaoNet comparison after exposing `conv_transfer`: | |
| - RepoBridge run: `taonet-vs-dplr-proj64-local-shift-conv-transfer-previous-broad-quality-20260429-145946` | |
| - Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, `ssm_kernel_mode=conv_transfer`. | |
| | Architecture | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB | | |
| |---|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | 8 | about 4.431 | about 0.082 | about 699k | about 279k | about 103 | | |
| | SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 8 | about 0.010 | 1.000 | about 52k | about 14k | about 195 | | |
| | attention TaoNet | 16 | about 1.098 | about 0.862 | about 1.24M | about 303k | about 166 | | |
| | SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 16 | about 0.008 | 1.000 | about 79k | about 24k | about 270 | | |
| | attention TaoNet | 32 | about 0.061 | about 0.998 | about 1.24M | about 674k | about 297 | | |
| | SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 32 | about 0.007 | 1.000 | about 157k | about 45k | about 420 | | |
| | attention TaoNet | 64 | about 0.015 | 1.000 | about 4.48M | about 1.05M | about 553 | | |
| | SSM TaoNet, DPLR projected 64 + local shift + conv_transfer | 64 | about 0.007 | 1.000 | about 370k | about 97k | about 719 | | |
| Comparison to the previous direct-conv local-shift run: | |
| | Batch | Direct-conv SSM forward+backward tok/s | Transfer-mode SSM forward+backward tok/s | | |
| |---:|---:|---:| | |
| | 8 | about 89k | about 14k | | |
| | 16 | about 189k | about 24k | | |
| | 32 | about 353k | about 45k | | |
| | 64 | about 403k | about 97k | | |
| Interpretation: | |
| - Failed as an end-to-end TaoNet acceleration. | |
| - The isolated SSM profile suggested transfer mode could help in some cases, but inside the LLM wrapper it is much slower across all tested batch sizes. | |
| - Accuracy remains solved because the local shift branch is still active, but speed regresses badly. | |
| - Keep `conv_transfer` only as an explicit diagnostic/experimental mode for now. | |
| - Recommended mode remains `ssm_kernel_mode=conv` with `ssm_local_shift=True`. | |
| - The next hardware target should not be materializing the whole transfer matrix in every layer and step. It should instead fuse the current direct DPLR response path or give it a custom autograd implementation, especially for the complex rank-1 frequency operations and the backward pass. | |
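
The size of the regression can be read directly off the direct-conv versus transfer-mode comparison table in this iteration. The short calculation below uses only those rounded figures, so the ratios are approximate; the variable names are illustrative.

```python
# Forward+backward tok/s (in thousands), copied from the comparison table
# between the direct-conv local-shift run and the conv_transfer run.
direct_conv = {8: 89, 16: 189, 32: 353, 64: 403}
conv_transfer = {8: 14, 16: 24, 32: 45, 64: 97}

# Ratio > 1 means the direct conv path is faster at that batch size.
slowdown = {
    batch: direct_conv[batch] / conv_transfer[batch]
    for batch in sorted(direct_conv)
}

for batch, ratio in slowdown.items():
    print(f"batch {batch:>2}: direct conv is about {ratio:.1f}x faster")
```

On these numbers the direct path is roughly 4x to 8x faster end to end at every tested batch size, even though the isolated core profile showed transfer mode ahead at batch 8.
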
| ### LLM Iteration 15 - Shrink DPLR Hidden State After Local-Shift Quality Fix | |
| Reason for this iteration: | |
| - The local-shift branch solved the `previous` token-memory task, but the quality-fixed SSM was still slower than attention on short seq-128 training. | |
| - The profiler for the recommended direct DPLR path at batch 32, seq 128 showed many small complex BMM/MM calls; there was no single obvious Python-only bottleneck. | |
| - Since local shift now carries exact one-token memory, the DPLR hidden dimension may not need to remain at 256 for this token-memory regime. | |
| - This iteration tested smaller DPLR hidden states as a ternary-friendly architecture/config improvement. | |
| Remote profiler context: | |
| - RepoBridge run: `ssm-dplr-direct-b32-s128-profile-20260429-154242` | |
| - Config: DPLR mixer/state dim 64, hidden dim 256, batch 32, seq 128, bf16, rank 1, direct path. | |
| - Result: forward+backward about `2.25M` core tok/s. | |
| - Profiler top CUDA cost was small complex BMM/MM work; `aten::bmm` accounted for about `48%` of self CUDA time. | |
| - `aten::linalg_matrix_power` was visible but small, about `40us` CUDA total. | |
| Remote hidden-dim sweeps: | |
| - RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden-sweep-previous-20260429-154546` | |
| - RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden-small-sweep-previous-20260429-155028` | |
| - Config: previous-token task, seq 128, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct `conv` path. | |
| | SSM hidden dim | Batch | SSM eval accuracy | SSM forward+backward tok/s | SSM peak MB | Attention eval accuracy | Attention forward+backward tok/s | | |
| |---:|---:|---:|---:|---:|---:|---:| | |
| | 256 | 8 | 1.000 | about 89k | about 181 | about 0.096 | about 238k | | |
| | 256 | 16 | 1.000 | about 189k | about 299 | about 0.847 | about 508k | | |
| | 256 | 32 | 1.000 | about 353k | about 513 | 1.000 | about 555k | | |
| | 256 | 64 | 1.000 | about 403k | about 956 | 1.000 | about 1.73M | | |
| | 64 | 8 | 1.000 | about 95k | about 145 | about 0.102 | about 278k | | |
| | 64 | 16 | 1.000 | about 184k | about 239 | about 0.932 | about 300k | | |
| | 64 | 32 | 1.000 | about 370k | about 404 | about 0.999 | about 895k | | |
| | 64 | 64 | 1.000 | about 564k | about 750 | about 0.999 | about 920k | | |
| | 32 | 8 | 1.000 | about 91k | about 139 | about 0.097 | about 245k | | |
| | 32 | 16 | 1.000 | about 187k | about 227 | about 0.941 | about 460k | | |
| | 32 | 32 | 1.000 | about 302k | about 393 | about 0.998 | about 863k | | |
| | 32 | 64 | 1.000 | about 787k | about 716 | 1.000 | about 1.75M | | |
| | 16 | 8 | 1.000 | about 86k | about 138 | about 0.083 | about 260k | | |
| | 16 | 16 | 1.000 | about 187k | about 223 | about 0.844 | about 495k | | |
| | 16 | 32 | 1.000 | about 357k | about 378 | about 0.999 | about 550k | | |
| | 16 | 64 | 1.000 | about 795k | about 705 | 1.000 | about 1.76M | | |
| Seq-512 speed check for hidden dim 16: | |
| - RepoBridge run: `taonet-vs-dplr-proj64-local-shift-hidden16-random-speed-20260429-155346` | |
| - Config: random next-token timing task, seq 512, vocab 8192, projected DPLR mixer dim 64, hidden dim 16, local shift enabled. | |
| | Architecture | Batch | Forward tok/s | Forward+backward tok/s | Peak MB | Loss | | |
| |---|---:|---:|---:|---:|---:| | |
| | attention TaoNet | 16 | about 1.30M | about 601k | about 1332 | about 9.055 | | |
| | SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 16 | about 2.18M | about 728k | about 1511 | about 9.069 | | |
| | attention TaoNet | 32 | about 3.87M | about 1.37M | about 2590 | about 9.060 | | |
| | SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 32 | about 3.52M | about 1.14M | about 2887 | about 9.061 | | |
| | attention TaoNet | 64 | about 4.16M | about 1.45M | about 5121 | about 9.065 | | |
| | SSM TaoNet, DPLR projected 64, hidden 16 + local shift | 64 | about 4.03M | about 1.31M | about 5649 | about 9.060 | | |
| Interpretation: | |
| - Success for the short token-memory benchmark: hidden dim 16 kept perfect `previous` accuracy and improved batch-64 forward+backward throughput from about `403k` to about `795k` tok/s while reducing peak memory from about 956 MB to about 705 MB. | |
| - Hidden dim 64 was also strong and slightly better at batch 32 than hidden dim 16 (about `370k` vs about `357k` forward+backward tok/s). | |
| - This did not become a universal seq-512 speed replacement. On random seq-512 timing, hidden dim 16 beat attention at batch 16 but lost at batch 32 and 64. | |
| - Recommended quality-aware short-memory config is now `ssm_mixer_dim=64`, `ssm_hidden_dim=16`, `ssm_local_shift=True`, `ssm_kernel_mode=conv`. | |
| - Recommended longer seq-512 throughput config should remain benchmark-driven; the older hidden-256 projected-64 regime still has stronger evidence around batch 16-32. | |
| ### LLM Iteration 16 - Seq-512 Previous-Token Robustness And Hidden-State Selection | |
| Reason for this iteration: | |
| - Iteration 15 showed hidden dim 16 was excellent for seq-128 `previous` memory and mixed for seq-512 random timing. | |
| - The missing check was a longer trained token-memory task: seq 512 `previous`, where accuracy and training speed both matter. | |
| - This iteration tested whether the local-shift quality fix holds at seq 512 and whether hidden dim 16, 64, or 256 is the best state size at this longer context. | |
| Remote benchmark: | |
| - RepoBridge run with attention comparison: `taonet-vs-dplr-proj64-local-shift-hidden16-previous512-20260429-161213` | |
| - RepoBridge SSM-only hidden comparison: `taonet-ssm-proj64-local-shift-previous512-hidden-compare-20260429-161306` | |
| - Config: previous-token task, seq 512, vocab 128, 100 train steps, bf16, projected DPLR mixer dim 64, local shift enabled, direct `conv` path. | |
| Required TaoNet comparison: | |
| | Architecture | SSM hidden dim | Batch | Eval loss | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB | | |
| |---|---:|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | 16 | about 4.614 | about 0.048 | about 2.04M | about 1.39M | about 575 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 16 | about 0.007 | 1.000 | about 2.57M | about 701k | about 754 | | |
| | attention TaoNet | n/a | 32 | about 2.090 | about 0.629 | about 4.79M | about 899k | about 1099 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 32 | about 0.007 | 1.000 | about 4.32M | about 944k | about 1391 | | |
| | attention TaoNet | n/a | 64 | about 0.239 | about 0.962 | about 4.08M | about 1.18M | about 2157 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 64 | about 0.007 | 1.000 | about 2.57M | about 961k | about 2677 | | |
| SSM hidden-state comparison at seq 512: | |
| | SSM hidden dim | Batch | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB | | |
| |---:|---:|---:|---:|---:|---:| | |
| | 16 | 16 | 1.000 | about 2.57M | about 701k | about 754 | | |
| | 16 | 32 | 1.000 | about 4.32M | about 944k | about 1391 | | |
| | 16 | 64 | 1.000 | about 2.57M | about 961k | about 2677 | | |
| | 64 | 16 | 1.000 | about 1.18M | about 494k | about 800 | | |
| | 64 | 32 | 1.000 | about 4.26M | about 1.36M | about 1491 | | |
| | 64 | 64 | 1.000 | about 4.70M | about 1.46M | about 2874 | | |
| | 256 | 16 | 1.000 | about 2.49M | about 748k | about 1032 | | |
| | 256 | 32 | 1.000 | about 3.36M | about 705k | about 1914 | | |
| | 256 | 64 | 1.000 | about 3.62M | about 788k | about 3681 | | |
| Interpretation: | |
| - Quality success: local-shift DPLR SSM keeps perfect `previous` accuracy at seq 512 for all tested hidden sizes and batches. | |
| - Attention did not fully learn the same task in 100 steps at batch 16/32 and reached about `0.962` accuracy at batch 64. | |
| - Speed depends on batch: | |
| - batch 16: hidden 256 is fastest among SSM variants, about `748k` forward+backward tok/s; attention is still faster at about `1.39M`. | |
| - batch 32: hidden 64 is fastest, about `1.36M` forward+backward tok/s, beating attention's about `899k`. | |
| - batch 64: hidden 64 is fastest, about `1.46M` forward+backward tok/s, beating attention's about `1.18M`. | |
| - This gives a better longer-memory recommendation than Iteration 15: | |
| - use `ssm_hidden_dim=16` for short seq-128 memory and lower memory pressure | |
| - use `ssm_hidden_dim=64` for seq-512 trained memory around batch 32/64 | |
| - keep hidden 256 as a possible batch-16 or legacy speed point, but not the general quality-aware default | |
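
The per-batch ranking in this interpretation is a direct lookup over the seq-512 hidden-state comparison table. The sketch below reproduces it from the rounded table values; the dictionary layout and function name are hypothetical, and it ranks only the measured points rather than predicting other shapes.

```python
# (ssm_hidden_dim, batch) -> forward+backward tok/s in thousands,
# copied from the seq-512 hidden-state comparison table.
# Every row reached eval accuracy 1.000, so speed is the only tiebreaker here.
fwd_bwd_ktok_s = {
    (16, 16): 701, (16, 32): 944, (16, 64): 961,
    (64, 16): 494, (64, 32): 1360, (64, 64): 1460,
    (256, 16): 748, (256, 32): 705, (256, 64): 788,
}


def fastest_hidden_dim(batch):
    """Return the hidden dim with the highest measured throughput at this batch."""
    candidates = {h: v for (h, b), v in fwd_bwd_ktok_s.items() if b == batch}
    if not candidates:
        raise KeyError(f"no measurements for batch {batch}")
    return max(candidates, key=candidates.get)


best = {batch: fastest_hidden_dim(batch) for batch in (16, 32, 64)}
print(best)
```

This picks hidden 256 at batch 16 and hidden 64 at batch 32 and 64, matching the bullets above. The batch-16 margin between hidden 256 and hidden 16 is small (about 748k vs about 701k), so it may not survive a rerun.
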
| ### LLM Iteration 17 - TaoData Real-Text Byte-Token Pilot | |
| Reason for this iteration: | |
| - Synthetic `previous` and `increment` tasks were useful diagnostics, but they are not enough to judge LLM capability. | |
| - The remote server has a TaoData corpus at `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl`. | |
| - No SentencePiece tokenizer artifact was found at the expected remote TaoTrain/TaoData tokenizer paths, so the first real-text benchmark used dependency-free byte tokenization. | |
| - Byte tokenization is not the final deployment tokenizer, but it gives a real-corpus next-token signal and exercises the same TaoNet model paths. | |
| Implementation location: | |
| - TaoTrain commit: `b8c4f3d Add real token TaoNet benchmark` | |
| - TaoTrain: `scripts/benchmark_taonet_real_tokens.py` | |
| What changed: | |
| - Added a remote-friendly real-token benchmark script that: | |
| - reads JSONL or plain text | |
| - supports TaoData-style `text` records | |
| - supports byte tokenization and optional SentencePiece tokenization | |
| - builds contiguous next-token batches from one long token stream | |
| - reports eval loss, perplexity, token accuracy, throughput, and memory | |
| - compares attention TaoNet against multiple SSM hidden sizes in one run | |
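
The data path listed above (read `text` records, byte-tokenize, cut one long stream into next-token batches) can be sketched as follows. This is an illustration under stated assumptions, not the code in `benchmark_taonet_real_tokens.py`: the three reserved ids are a guess that would explain a byte vocab of 259, records are concatenated without a separator token, and all function names are hypothetical.

```python
import json

NUM_SPECIAL_TOKENS = 3  # assumption: 256 byte values + 3 reserved ids = vocab 259
BYTE_OFFSET = NUM_SPECIAL_TOKENS


def iter_texts(lines):
    """Yield text from JSONL lines with a `text` field, or from plain-text lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if isinstance(record, dict):
            text = record.get("text")
            if isinstance(text, str):
                yield text
        else:
            yield line  # plain-text line (or JSON that is not an object)


def byte_tokenize(text):
    """Map each UTF-8 byte to an id, shifted past the reserved special ids."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def next_token_batches(stream, seq_len, batch_size):
    """Cut one long token stream into contiguous (inputs, targets) batches.

    Each row spans seq_len + 1 tokens; targets are the inputs shifted left by one.
    """
    window = seq_len + 1
    rows = [stream[i:i + window] for i in range(0, len(stream) - window + 1, seq_len)]
    for start in range(0, len(rows) - batch_size + 1, batch_size):
        chunk = rows[start:start + batch_size]
        yield [r[:-1] for r in chunk], [r[1:] for r in chunk]


lines = ['{"text": "hello world"}', '{"text": "state space"}']
stream = [tok for text in iter_texts(lines) for tok in byte_tokenize(text)]
batches = list(next_token_batches(stream, seq_len=4, batch_size=2))
```

Rows advance by `seq_len` while each row holds `seq_len + 1` tokens, so inputs tile the stream without gaps and every input position has a target. Trailing tokens that do not fill a whole batch are dropped in this sketch.
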
| Validation: | |
| - Local CPU smoke passed on a plain text file with byte tokenization. | |
| - Remote RepoBridge runs completed on TaoData JSONL. | |
| Remote benchmark: | |
| - RepoBridge run: `taonet-vs-ssm-real-token-taodata-byte-pilot-20260429-164623` | |
| - RepoBridge run: `taonet-vs-ssm-real-token-taodata-byte-pilot-b64-20260429-164720` | |
| - Data: `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl` | |
| - Tokenization: byte-level, vocab size 259 | |
| - Data limit: first `2,000,000` byte tokens from up to `5,000` records | |
| - Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled. | |
| Required TaoNet comparison: | |
| | Architecture | SSM hidden dim | Batch | Eval loss | Eval PPL | Eval accuracy | Forward tok/s | Forward+backward tok/s | Peak MB | | |
| |---|---:|---:|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | 16 | about 2.549 | about 12.80 | about 0.260 | about 2.03M | about 1.40M | about 585 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 16 | about 1.982 | about 7.26 | about 0.423 | about 2.42M | about 564k | about 757 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 64 | 16 | about 1.928 | about 6.88 | about 0.440 | about 2.16M | about 488k | about 803 | | |
| | attention TaoNet | n/a | 32 | about 2.523 | about 12.47 | about 0.266 | about 2.13M | about 809k | about 1115 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 32 | about 1.879 | about 6.55 | about 0.455 | about 4.43M | about 1.38M | about 1396 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 64 | 32 | about 1.848 | about 6.35 | about 0.457 | about 3.97M | about 1.25M | about 1496 | | |
| | attention TaoNet | n/a | 64 | about 2.529 | about 12.54 | about 0.265 | about 5.98M | about 2.03M | about 2190 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 16 | 64 | about 1.807 | about 6.10 | about 0.471 | about 4.92M | about 1.67M | about 2686 | | |
| | SSM TaoNet, DPLR projected 64 + local shift | 64 | 64 | about 1.834 | about 6.26 | about 0.466 | about 2.54M | about 1.52M | about 2882 | | |
| Interpretation: | |
| - First real-corpus quality success: both SSM candidates beat attention on validation loss, perplexity, and byte-token accuracy after the same number of train steps. | |
| - Hidden 64 was best quality at batch 16/32, while hidden 16 was best quality at batch 64 and generally faster among SSM variants. | |
| - Speed tradeoff depends on batch: | |
| - batch 16: attention has higher forward+backward throughput, but SSM has much better validation quality. | |
| - batch 32: hidden-16 SSM wins both quality and forward+backward throughput versus attention. | |
| - batch 64: attention wins forward+backward throughput, while SSM wins validation quality. | |
| - This benchmark is byte-level, so it should be treated as a real-text pilot rather than the final TaoData tokenizer benchmark. | |
| - Next real-data step: train or locate the intended SentencePiece tokenizer, then rerun the same script with `--tokenizer-type sentencepiece`. | |
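
The Eval PPL column in the byte-token table is consistent with perplexity computed as the exponential of the eval loss. The check below assumes the loss is a mean per-token cross-entropy in nats and uses three rows copied from that table; small differences come from the rounding of the reported losses.

```python
import math

# (label, eval loss, reported eval PPL) copied from the byte-token table above.
rows = [
    ("attention, batch 16", 2.549, 12.80),
    ("SSM hidden 16, batch 32", 1.879, 6.55),
    ("SSM hidden 16, batch 64", 1.807, 6.10),
]

for label, loss, reported_ppl in rows:
    ppl = math.exp(loss)  # perplexity of a mean cross-entropy measured in nats
    print(f"{label}: exp({loss}) = {ppl:.2f} (table: about {reported_ppl})")

recomputed = [math.exp(loss) for _, loss, _ in rows]
```

Read this way, the drop from about 12.5 to about 6.1 perplexity means the SSM is, on average, choosing among roughly half as many equally likely next bytes as attention after the same 150 steps.
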
| ### LLM Iteration 18 - TaoData SentencePiece Pilot And Per-Channel Local Shift | |
| Reason for this iteration: | |
| - Byte-level TaoData results were encouraging but not the intended LLM tokenization. | |
| - No pre-existing tokenizer artifact was found on the remote server, so a pilot SentencePiece tokenizer was trained from TaoData. | |
| - The first 500-step SentencePiece run showed attention still ahead on validation loss at batch 32, even though SSM retained a token-accuracy edge. | |
| - Because no-shift SSM was worse, the local shift branch was helping; the next lightweight improvement was making the shift gain per-channel instead of one scalar. | |
| Implementation location: | |
| - TaoTrain commit: `33747c1 Add TaoData pilot tokenizer config` | |
| - TaoTrain commit: `c519645 Add per-channel SSM local shift` | |
| - TaoTrain: `configs/tokenizer_taodata_pilot.yaml` | |
| - TaoTrain: `src/taoTrain/models/taonet_ssm.py` | |
| - TaoTrain: `scripts/benchmark_taonet_real_tokens.py` | |
| What changed: | |
| - Added a pilot tokenizer config: | |
| - input: `/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl` | |
| - output: `/home/student/YouZheng/tokenizers/taodata_pilot_8k` | |
| - vocab size: `8192` | |
| - max samples: `20000` | |
| - Trained the remote tokenizer; output files: | |
| - `/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model` | |
| - `/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab` | |
| - Added opt-in `ssm_local_shift_per_channel`. | |
| - The previous shift branch used one learned scalar for all model channels. | |
| - The new branch can use one learned gain per model channel while keeping the operation cheap: shift plus elementwise multiply. | |
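
The difference between the scalar and per-channel variants is only the shape of the gain. The framework-free sketch below illustrates the per-channel form on nested lists shaped `[batch][time][channel]`; the function name, the zero fill at the first position, and the list-based layout are assumptions for illustration, not the implementation in `taonet_ssm.py`.

```python
def per_channel_shift(x_norm, gains):
    """Shift one step back in time, then scale each channel by its own gain.

    x_norm: nested list indexed [batch][time][channel].
    gains: one float per channel (the scalar variant reuses a single value).
    """
    out = []
    for sequence in x_norm:
        shifted = [[0.0] * len(gains)]  # first position has no previous token
        for prev_step in sequence[:-1]:
            if len(prev_step) != len(gains):
                raise ValueError("expected one gain per channel")
            # Elementwise multiply: no mixing across channels.
            shifted.append([g * v for g, v in zip(gains, prev_step)])
        out.append(shifted[: len(sequence)])
    return out


x = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
scalar_like = per_channel_shift(x, gains=[0.5, 0.5])  # same gain on every channel
per_channel = per_channel_shift(x, gains=[1.0, 0.0])  # second channel switched off
```

With per-channel gains the model can pass previous-token information through some channels and suppress it in others, at the cost of one extra parameter per channel instead of one per branch.
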
| Validation: | |
| - TaoTrain local tests: `python -m pytest tests\test_taonet_ssm.py -q` passed, `6 passed`. | |
| - Local real-token smoke with `--ssm-local-shift-per-channel` passed. | |
| - Remote tokenizer training completed. RepoBridge's local print path initially hit a Windows emoji encoding issue, but the tokenizer files were created successfully. | |
| SentencePiece 150-step pilot: | |
| - RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-pilot-20260429-171228` | |
| - Data: TaoData FineWeb JSONL | |
| - Tokenization: pilot SentencePiece 8k | |
| - Config: seq 512, 4 layers, hidden dim 256, bf16, 150 train steps, batch 16/32/64, projected DPLR mixer dim 64, local shift enabled. | |
| | Architecture | SSM hidden dim | Batch | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s | | |
| |---|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | 16 | about 5.718 | about 304 | about 0.150 | about 1.01M | | |
| | SSM TaoNet | 16 | 16 | about 5.723 | about 306 | about 0.149 | about 743k | | |
| | SSM TaoNet | 64 | 16 | about 5.728 | about 307 | about 0.146 | about 381k | | |
| | attention TaoNet | n/a | 32 | about 5.533 | about 253 | about 0.156 | about 842k | | |
| | SSM TaoNet | 16 | 32 | about 5.505 | about 246 | about 0.165 | about 771k | | |
| | SSM TaoNet | 64 | 32 | about 5.561 | about 260 | about 0.158 | about 1.09M | | |
| | attention TaoNet | n/a | 64 | about 5.414 | about 225 | about 0.163 | about 623k | | |
| | SSM TaoNet | 16 | 64 | about 5.427 | about 227 | about 0.169 | about 1.12M | | |
| | SSM TaoNet | 64 | 64 | about 5.395 | about 220 | about 0.171 | about 623k | | |
| SentencePiece 500-step batch-32 follow-up: | |
| - RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-20260429-171338` | |
| - RepoBridge run without shift: `taonet-ssm-real-token-taodata-spm-b32-500step-no-shift-20260429-171451` | |
| - RepoBridge run with per-channel shift: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-channel-shift-20260429-171917` | |
| - Config: batch 32, seq 512, 500 train steps, eval batches 16. | |
| | Variant | SSM hidden dim | Shift type | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s | | |
| |---|---:|---|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | n/a | about 4.715 | about 112 | about 0.211 | about 1.23M in first run, about 892k in per-channel run | | |
| | SSM TaoNet | 16 | scalar | about 4.798 | about 121 | about 0.217 | about 1.13M | | |
| | SSM TaoNet | 64 | scalar | about 4.830 | about 125 | about 0.215 | about 968k | | |
| | SSM TaoNet | 16 | none | about 5.088 | about 162 | about 0.171 | about 554k | | |
| | SSM TaoNet | 64 | none | about 5.102 | about 164 | about 0.169 | about 580k | | |
| | SSM TaoNet | 16 | per-channel | about 4.782 | about 119 | about 0.218 | about 784k | | |
| | SSM TaoNet | 64 | per-channel | about 4.818 | about 124 | about 0.215 | about 1.08M | | |
| Interpretation: | |
| - The SentencePiece pilot is more realistic and less favorable to SSM than the byte-level pilot. | |
| - SSM has a small token-accuracy edge at batch 32, but attention has the best 500-step validation loss/perplexity. | |
| - Removing local shift is clearly worse, so local shift is useful for real-token modeling too. | |
| - Per-channel shift is a small quality improvement over scalar shift: | |
| - hidden 16 eval loss improved from about `4.798` to `4.782` | |
| - hidden 64 eval loss improved from about `4.830` to `4.818` | |
| - Per-channel shift is not enough to surpass attention on 500-step SentencePiece validation loss. | |
| - Next model-improvement direction should target SSM language-modeling capacity or optimization, not just exact one-token memory: | |
| - try larger `ssm_mixer_dim` such as 96/128 with h16/h64 | |
| - tune SSM learning rate/weight decay separately from attention | |
| - test a small gated local convolution/projection branch if ternary deployment accepts it | |
| ### LLM Iteration 19 - TaoData SentencePiece Mixer-Dimension Sweep | |
| Reason for this iteration: | |
| - The 500-step SentencePiece batch-32 pilot showed SSM had a small token-accuracy edge, but attention still had better validation loss/perplexity. | |
| - The prior best SSM used `ssm_mixer_dim=64`, originally chosen from speed-focused scaling probes. | |
| - Because real-token quality may need more SSM channel capacity, this iteration swept projected mixer dimensions while keeping the same outer TaoNet dimensions. | |
| Implementation location: | |
| - TaoTrain commit: `357336e Sweep SSM mixer dims in real token benchmark` | |
| - TaoTrain: `scripts/benchmark_taonet_real_tokens.py` | |
| - RepoBridge config: `repobridge.taonet.realspm.taodata.b32.500step.mixersweep.config.json` | |
| What changed: | |
| - Added `--ssm-mixer-dims` to the real-token benchmark. | |
| - The benchmark now records `ssm_mixer_dim` in the printed table and CSV. | |
| - Attention TaoNet is still evaluated once per batch, while SSM TaoNet can sweep multiple hidden and mixer dimensions in the same run. | |
| Validation: | |
| - TaoTrain local syntax check: `python -m py_compile scripts\benchmark_taonet_real_tokens.py` passed. | |
| - TaoTrain local tests with the SSM repo on `PYTHONPATH`: `python -m pytest tests\test_taonet_ssm.py -q` passed, `6 passed`. | |
| - Local byte-token smoke with `--ssm-mixer-dims 8,12` passed and wrote CSV/JSON outputs. | |
| Remote benchmark: | |
| - RepoBridge run: `taonet-vs-ssm-real-token-taodata-spm-b32-500step-mixersweep-20260429-193729` | |
| - Data: TaoData FineWeb JSONL | |
| - Tokenization: pilot SentencePiece 8k | |
| - Config: batch 32, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches, local shift enabled, per-channel shift enabled. | |
| | Architecture | SSM hidden dim | SSM mixer dim | Eval loss | Eval PPL | Eval accuracy | Forward+backward tok/s | Peak allocated MB | | |
| |---|---:|---:|---:|---:|---:|---:|---:| | |
| | attention TaoNet | n/a | n/a | 4.715 | 111.633 | 0.211 | 618k | 2590 | | |
| | SSM TaoNet | 16 | 64 | 4.780 | 119.046 | 0.218 | 1.13M | 2887 | | |
| | SSM TaoNet | 16 | 96 | 4.759 | 116.643 | 0.222 | 973k | 3029 | | |
| | SSM TaoNet | 16 | 128 | 4.719 | 112.088 | 0.224 | 782k | 3192 | | |
| | SSM TaoNet | 64 | 64 | 4.824 | 124.475 | 0.214 | 982k | 2987 | | |
| | SSM TaoNet | 64 | 96 | 4.761 | 116.917 | 0.219 | 479k | 3131 | | |
| | SSM TaoNet | 64 | 128 | 4.784 | 119.589 | 0.218 | 457k | 3292 | | |
| Interpretation: | |
| - Increasing the projected mixer dimension helped the best SSM real-token validation loss. | |
| - The best quality SSM in this run was `ssm_hidden_dim=16`, `ssm_mixer_dim=128`: | |
| - validation loss `4.719`, very close to attention `4.715` | |
| - token accuracy `0.224`, above attention `0.211` | |
| - forward+backward throughput about `782k` tok/s, above attention about `618k` tok/s | |
| - Hidden dim `64` did not help this batch-32 500-step SentencePiece setting; it was slower and worse than hidden dim `16` at mixer dim 128. | |
| - Mixer dim `64` remains the best SSM speed/quality tradeoff, but mixer dim `128` is now the best SSM quality candidate on real SentencePiece token modeling. | |
| - Next step should test whether `hidden_dim=16`, `mixer_dim=128` remains strong at batch 16/64 and longer training, then try a narrow learning-rate sweep around it. | |
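The selection described above can be reproduced directly from the table. This is a minimal sketch: the rows are copied by hand from the Iteration 19 table, and the names `ssm_rows`, `best_quality`, and `speed_ratio` are illustrative, not part of TaoTrain.

```python
# Rows copied from the Iteration 19 table:
# (ssm_hidden_dim, ssm_mixer_dim, eval_loss, forward+backward tok/s)
attention = (None, None, 4.715, 618_000)
ssm_rows = [
    (16, 64, 4.780, 1_130_000),
    (16, 96, 4.759, 973_000),
    (16, 128, 4.719, 782_000),
    (64, 64, 4.824, 982_000),
    (64, 96, 4.761, 479_000),
    (64, 128, 4.784, 457_000),
]

# Best-quality SSM row = lowest eval loss.
best_quality = min(ssm_rows, key=lambda row: row[2])

# How far it trails attention on loss, and how much faster it trains.
loss_gap = best_quality[2] - attention[2]
speed_ratio = best_quality[3] / attention[3]

print(best_quality[:2], round(loss_gap, 3), round(speed_ratio, 2))
```

With these rows the lowest-loss SSM point is hidden dim 16 with mixer dim 128, roughly 0.004 behind attention on eval loss and about 1.27x its forward+backward throughput.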
| ### LLM Iteration 20 - Attempted h16/m128 Batch Generalization Sweep | |
| Reason for this iteration: | |
| - Iteration 19 found a strong real-token batch-32 point: `ssm_hidden_dim=16`, `ssm_mixer_dim=128`. | |
| - The user noted earlier that a single batch-size sweet spot can be misleading. | |
| - This iteration was meant to compare attention TaoNet vs SSM TaoNet at batch 16, 32, and 64 with the same 500-step SentencePiece protocol. | |
| Implementation location: | |
| - TaoTrain commit used remotely: `357336e Sweep SSM mixer dims in real token benchmark` | |
| - RepoBridge config: `repobridge.taonet.realspm.taodata.h16m128.batchsweep.config.json` | |
| Planned remote benchmark: | |
| - Data: TaoData FineWeb JSONL | |
| - Tokenization: pilot SentencePiece 8k | |
| - Config: batch 16/32/64, seq 512, 4 layers, hidden dim 256, bf16, 500 train steps, 16 eval batches | |
| - Attention baseline: `taonet` | |
| - SSM candidate: `taonet_ssm`, DPLR, `ssm_hidden_dim=16`, `ssm_mixer_dim=128`, local shift enabled, per-channel shift enabled | |
| Remote status before run: | |
| - RepoBridge write guard passed. | |
| - RepoBridge preflight passed. | |
| - Remote GPU: RTX 5090 with about 21 GB free VRAM. | |
| - A same-user `taodata` process was present and using about 10.9 GB VRAM; no other users were detected. | |
| Outcome: | |
| - RepoBridge `full` began, but the SFTP download phase failed with: | |
| - `Socket exception: An existing connection was forcibly closed by the remote host (10054)` | |
| - `paramiko.ssh_exception.SSHException: Server connection dropped` | |
| - Subsequent read-only RepoBridge SSH checks timed out with WinError `10060`. | |
| - The new result folder did not appear in the partial local download, so no valid benchmark table was available to record. | |
| Interpretation: | |
| - This was an infrastructure interruption, not a model failure. | |
| - Do not infer anything about h16/m128 batch generalization from this attempted run. | |
| - Next action when the remote server is reachable: rerun or download the run for `taonet-vs-ssm-real-token-taodata-spm-h16m128-batchsweep`. | |
| ### Current LLM-Wrapper Best Configuration | |
| Best current speed benchmark configuration: | |
| - architecture: `taonet_ssm` | |
| - SSM core: `dplr` | |
| - mixer projection: `ssm_mixer_dim=64` | |
| - SSM hidden dimension: `256` | |
| - DPLR rank: `1` | |
| - kernel mode: `conv` | |
| - dtype: `bf16` | |
| - benchmark task: synthetic next-token CE through TaoNet wrapper | |
| Best current quality-aware token-memory configuration: | |
| - architecture: `taonet_ssm` | |
| - SSM core: `dplr` | |
| - mixer projection: `ssm_mixer_dim=64` | |
| - SSM hidden dimension: `16` | |
| - DPLR rank: `1` | |
| - kernel mode: `conv` | |
| - dtype: `bf16` | |
| - local shift: `ssm_local_shift=True` | |
| - benchmark task: `previous` token memory through TaoNet wrapper | |
| - evidence: perfect eval accuracy at batch 8, 16, 32, and 64 after 100 steps; best observed short-memory batch-64 SSM backward throughput about `795k` tok/s | |
| Best current longer token-memory configuration: | |
| - architecture: `taonet_ssm` | |
| - SSM core: `dplr` | |
| - mixer projection: `ssm_mixer_dim=64` | |
| - SSM hidden dimension: `64` | |
| - DPLR rank: `1` | |
| - kernel mode: `conv` | |
| - dtype: `bf16` | |
| - local shift: `ssm_local_shift=True` | |
| - benchmark task: seq-512 `previous` token memory through TaoNet wrapper | |
| - evidence: perfect eval accuracy at batch 16, 32, and 64 after 100 steps; best observed batch-32 and batch-64 SSM backward throughput about `1.36M` and `1.46M` tok/s, both above attention in the same task | |
| Best current TaoData real-text pilot configuration: | |
| - architecture: `taonet_ssm` | |
| - SSM core: `dplr` | |
| - mixer projection: `ssm_mixer_dim=128` for best current SentencePiece validation loss; `ssm_mixer_dim=64` for speed/quality balance | |
| - SSM hidden dimension: `16` | |
| - DPLR rank: `1` | |
| - kernel mode: `conv` | |
| - dtype: `bf16` | |
| - local shift: `ssm_local_shift=True` | |
| - local shift gain: `ssm_local_shift_per_channel=True` | |
| - benchmark task: TaoData FineWeb JSONL, byte-level and pilot SentencePiece next-token prediction, seq 512 | |
| - evidence: | |
| - byte-level: lower validation loss/perplexity than attention at batch 16/32/64 after 150 steps; hidden-16 also beat attention backward throughput at batch 32 | |
| - SentencePiece batch 32, 500 steps: `ssm_hidden_dim=16`, `ssm_mixer_dim=128` reached eval loss about `4.719` vs attention about `4.715`, with better token accuracy (`0.224` vs `0.211`) and higher forward+backward throughput (`782k` vs `618k` tok/s) | |
| Current best evidence: | |
| - At batch 4, seq 512, projected-64 DPLR reaches about `618k` forward tok/s and `192k` backward tok/s. | |
| - At batch 16, seq 512, projected-64 DPLR reaches about `2.12M` forward tok/s and `702k` backward tok/s. | |
| - Attention is still faster for backward at batch 16 in the same run: about `990k` tok/s. | |
| - DPLR projected-64 forward can exceed attention in this benchmark, but training/backward still needs improvement. | |
| - A newer scaling rerun found a batch-32 sweet spot where projected-64 DPLR exceeded attention in both forward and forward+backward throughput: | |
| - SSM forward about `2.62M` tok/s vs attention about `1.88M` | |
| - SSM forward+backward about `705k` tok/s vs attention about `632k` | |
| Important local artifact paths: | |
| - `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091624\taonet_token_benchmark.csv` | |
| - `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected64-scale\outputs-taotrain\taonet-token-dplr-proj64-scale-bench-20260429-091738\taonet_token_benchmark.csv` | |
| - `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\results\repobridge-token-bench-projected\outputs-taotrain\taonet-token-dplr-proj64-bench-20260429-091956\taonet_token_benchmark.csv` | |
| Recommended next LLM-wrapper targets: | |
| 1. Rerun the real SentencePiece benchmark for `ssm_hidden_dim=16`, `ssm_mixer_dim=128` at batch 16/32/64 to check whether the gain generalizes beyond the batch-32 spot. | |
| 2. Optimize backward throughput in `S4TernaryDPLRSSM`; the forward path is now competitive at larger batch sizes. | |
| 3. Run a learning-rate and weight-decay sweep around the current best SSM real-token config, because the SSM and attention cores may not share the same optimal optimizer settings. | |
| 4. Investigate whether FFT/direct-response intermediates can be checkpointed or handled by a custom autograd function to improve backward speed. | |
| 5. Keep ternary deployment constraints in view: rank-1 DPLR factors still use ternary masks with learned amplitudes, and projected mixer dimensions should remain friendly to ternary compute layouts. | |
| ## Version Timeline | |
| | Run | Notebook(s) | Commit printed in notebook | Device | Main purpose | | |
| |---|---|---:|---|---| | |
| | `_r1` | `gamma_s4_sinewave_benchmark_r1.ipynb` | not printed | CUDA | First comparison of baseline, minimal, enhanced on simple sinewave task. | | |
| | `_r2` | `gamma_s4_sinewave_benchmark_r2.ipynb` | not printed | CUDA | Harder multivariate long-range task; enhanced first became clearly promising. | | |
| | `_r3` | `gamma-s4-sinewave-benchmark_r3.ipynb` | `6df3777` | CUDA | Quick benchmark after deployment-cache import fix; recurrent enhanced still very slow. | | |
| | `_r4` | `gamma-s4-sinewave-benchmark_r4.ipynb` | `d6ebddc` | CUDA | Triangular-solve recurrent optimization; large recurrent speedup. | | |
| | `_r5` | `gamma-s4-sinewave-benchmark_r5.ipynb` | `78ae31f` | CUDA | Added recurrent/full-output agreement metrics. | | |
| | `_r6` | quick + research notebooks | `a2474cc` / `5952546` | CPU for quick, CUDA for research | Split quick/research benchmark; first practical long-context run showed conv path was too slow. | | |
| | `_r7` | quick + research notebooks | `4b977c1` / `b17f72a` | CUDA | Faster conv kernel generation and cheaper research defaults. | | |
| | `_r8` | quick + research notebooks | `73e76a7` | CUDA | Skipped unused final states, enabled baseline deploy metrics, enabled token-lite. | | |
| | `_r9` | quick + research notebooks | `8738675` / `60562bd` | CUDA | Added research visuals; performance similar to `_r8`, now presentation-friendly. | | |
| | `_r10` | quick + research notebooks | `09db0da` / `9ff7e4e` | CUDA | Added balanced deployment metrics to test a speed/fidelity point between full recurrent and deployment-lite. | | |
| | `_r11` | quick + research + challenge notebooks | `64f8632` / `4842762` / `bfc6e26` | CUDA | Fixed AMP FFT path, split result tables, and added challenge benchmarks for permuted MNIST, selective copying, and induction-style recall. | | |
| | `_r12` | quick + research + challenge notebooks | `740a9ef` / `0c6ecb8` / `11bd2e6` | CUDA | Tested the input-selection gate. Forecasting stayed strong, but challenge recall tasks remained near random. | | |
| ## `_r1` - First Simple Sinewave Comparison | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma_s4_sinewave_benchmark_r1.ipynb` | |
| Configuration recovered from notebook: | |
| - device: CUDA | |
| - task: simple 1D sinewave next-step prediction | |
| - `seq_len=128` | |
| - `train_samples=512` | |
| - `val_samples=128` | |
| - `batch_size=32` | |
| - `epochs=10` | |
| - `d_model=1` | |
| - `hidden_dim=32` | |
| - `num_layers=2` | |
| Results: | |
| | Model | Params | Final val loss | Mean epoch s | Full ms | Full tokens/s | Recurrent ms | Recurrent tokens/s | | |
| |---|---:|---:|---:|---:|---:|---:|---:| | |
| | `gamma_baseline` | 134 | 0.170722 | 2.063 | 24.604 | 166477 | 51.177 | 80036 | | |
| | `gamma_s4_minimal` | 138 | 0.019148 | 1.282 | 23.545 | 173967 | 54.462 | 75209 | | |
| | `gamma_s4_enhanced` | 146 | 1.154002 | 1.330 | 23.389 | 175127 | 82.343 | 49743 | | |
| Interpretation: | |
| - `gamma_s4_minimal` was best on this very simple task. | |
| - `gamma_s4_enhanced` was unstable or badly underfit here (final val loss `1.154002`, worse than both other models). | |
| - This run showed that the richer enhanced block can be harmful on small/simple tasks. | |
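The throughput columns in the results table appear to follow from the latency columns as `batch_size * seq_len / latency`. That relationship is inferred from the reported numbers rather than stated in the notebook, so treat the helper below (a hypothetical `tokens_per_second`) as a sketch of how the columns relate, not as the notebook's own code.

```python
def tokens_per_second(batch_size: int, seq_len: int, latency_ms: float) -> float:
    """Tokens processed per second for one timed pass over a batch.

    Assumes one pass processes batch_size * seq_len tokens.
    """
    return batch_size * seq_len / (latency_ms / 1000.0)


# `_r1` settings: batch_size=32, seq_len=128 (4096 tokens per pass).
full = tokens_per_second(32, 128, 24.604)       # gamma_baseline, "Full ms"
recurrent = tokens_per_second(32, 128, 51.177)  # gamma_baseline, "Recurrent ms"

print(round(full), round(recurrent))
```

Under that assumption the two values land on the `gamma_baseline` row's reported `166477` and `80036` tokens/s.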
| ## `_r2` - Harder Multivariate Forecasting | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma_s4_sinewave_benchmark_r2.ipynb` | |
| Configuration recovered from notebook: | |
| - device: CUDA | |
| - task: harder multivariate synthetic forecasting | |
| - `seq_len=512` | |
| - `num_features=8` | |
| - `train_samples=768` | |
| - `val_samples=192` | |
| - `batch_size=32` | |
| - `epochs=12` | |
| - `d_model=8` | |
| - `hidden_dim=64` | |
| - `num_layers=3` | |
| Results: | |
| | Model | Params | Final val loss | Mean epoch s | Full ms | Full tokens/s | Recurrent ms | Recurrent tokens/s | | |
| |---|---:|---:|---:|---:|---:|---:|---:| | |
| | `gamma_baseline` | 3192 | 0.006972 | 28.916 | 146.644 | 111726 | 305.446 | 53640 | | |
| | `gamma_s4_minimal` | 3243 | 0.110654 | 17.194 | 121.234 | 135144 | 343.394 | 47712 | | |
| | `gamma_s4_enhanced` | 3675 | 0.006302 | 17.191 | 131.929 | 124188 | 492.576 | 33262 | | |
| Interpretation: | |
| - `gamma_s4_enhanced` became the best-quality model. | |
| - Enhanced training was much faster than baseline on this task (about 17.2 s vs 28.9 s mean epoch time). | |
| - Recurrent inference was still significantly slower than baseline. | |
| - This was the first strong evidence that the enhanced model is useful on harder sequence tasks. | |
| ## `_r3` - Quick Benchmark With Deployment Cache Available | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r3.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `6df3777` | |
| - quick tasks: | |
| - `simple`: `seq_len=192`, `features=4`, `epochs=4` | |
| - `moderate`: `seq_len=320`, `features=6`, `epochs=5` | |
| - models: `gamma_baseline`, `gamma_s4_enhanced` | |
| - enhanced: `kernel_mode="auto"`, `kernel_threshold=384`, bilinear discretization | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent tokens/s | Deploy recurrent ms | | |
| |---|---|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.199 | 5779 | 75.092 | 5114 | not available | | |
| | simple | enhanced | 0.045667 | 2.450 | 9671 | 1024.471 | 375 | 997.631 | | |
| | moderate | baseline | 0.533364 | 6.544 | 5729 | 192.084 | 3332 | not available | | |
| | moderate | enhanced | 0.021113 | 6.584 | 6995 | 2815.223 | 227 | 2346.986 | | |
| Interpretation: | |
| - Enhanced quality was much better than baseline. | |
| - Full-sequence throughput was better for enhanced. | |
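| - The enhanced recurrent path was catastrophically slow: recurrent latency was about 1024 ms (simple) and 2815 ms (moderate), versus about 75 ms and 192 ms for baseline. | |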
| - This run motivated recurrent-path optimization. | |
| ## `_r4` - Triangular-Solve Recurrent Optimization | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r4.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `d6ebddc` | |
| - same quick tasks as `_r3` | |
| - key code change: bilinear recurrent stepping switched to a triangular-solve path | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent tokens/s | Deploy recurrent ms | | |
| |---|---|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.219 | 6186 | 69.398 | 5533 | not available | | |
| | simple | enhanced | 0.045667 | 2.288 | 9728 | 139.394 | 2755 | 104.623 | | |
| | moderate | baseline | 0.533364 | 6.409 | 6182 | 110.415 | 5796 | not available | | |
| | moderate | enhanced | 0.021113 | 6.630 | 9896 | 240.392 | 2662 | 185.037 | | |
| Interpretation: | |
| - This was a major recurrent-inference improvement. | |
| - Enhanced recurrent latency dropped from about 1024 ms (simple) and 2815 ms (moderate) in `_r3` to about 139 ms and 240 ms. | |
| - Enhanced still remained slower than baseline in recurrent mode. | |
| ## `_r5` - Agreement Metrics Added | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r5.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `78ae31f` | |
| - same quick tasks as `_r4` | |
| - added: | |
| - `recurrent_match_mse` | |
| - `deploy_match_mse` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent ms | Recurrent match MSE | Deploy recurrent ms | Deploy match MSE | | |
| |---|---|---:|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.317 | 6097 | 71.296 | 0.000000 | not available | not available | | |
| | simple | enhanced | 0.045667 | 2.381 | 9963 | 141.656 | 0.008500 | 107.361 | 0.031251 | | |
| | moderate | baseline | 0.533364 | 6.603 | 5912 | 114.832 | 0.000000 | not available | not available | | |
| | moderate | enhanced | 0.021113 | 7.199 | 9465 | 242.692 | 0.007549 | 178.070 | 0.029995 | | |
| Interpretation: | |
| - Enhanced remained much better in quality. | |
| - Full-sequence throughput favored enhanced. | |
| - Recurrent/deployment-lite speed improved but still trailed baseline. | |
| - Agreement metrics showed normal enhanced recurrent output was close to full forward; deployment-lite was faster but less faithful. | |
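As an illustration of what an agreement metric of this kind measures, here is a minimal sketch. It assumes `recurrent_match_mse` and `deploy_match_mse` are plain mean squared errors between the full-forward output and the recurrent (or deployment-lite) output on the same input; the notebook code is not shown in this log, and `match_mse` and the toy values are hypothetical.

```python
def match_mse(reference, candidate):
    """Mean squared error between two equal-length flat output sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    if not reference:
        raise ValueError("outputs must be non-empty")
    squared_errors = ((r - c) ** 2 for r, c in zip(reference, candidate))
    return sum(squared_errors) / len(reference)


# Toy values only, not benchmark data.
full_forward = [0.10, 0.20, 0.30, 0.40]
recurrent = [0.10, 0.21, 0.29, 0.40]

print(match_mse(full_forward, full_forward))  # identical outputs give 0.0
print(match_mse(full_forward, recurrent))     # small but non-zero mismatch
```

Read this way, a baseline value of `0.000000` means its recurrent path reproduces the full forward pass, while the larger deployment-lite values indicate a less faithful approximation.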
| ## `_r6` - Split Quick/Research Benchmark Era | |
| ### `_r6` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r6.ipynb` | |
| Configuration: | |
| - device: CPU | |
| - commit: `a2474cc` | |
| - same quick tasks as `_r5` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | | |
| |---|---|---:|---:|---:|---:| | |
| | simple | baseline | 0.240532 | 1.444 | 15225 | 8656 | | |
| | simple | enhanced | 0.045714 | 2.598 | 20066 | 4278 | | |
| | moderate | baseline | 0.056279 | 5.613 | 20720 | 11785 | | |
| | moderate | enhanced | 0.021122 | 8.653 | 12149 | 2875 | | |
| Interpretation: | |
| - This was a CPU run, so speed conclusions are not treated as primary benchmark evidence. | |
| - It was useful as a smoke test only. | |
| - The CPU result reminded us to warn clearly when notebooks are not running on GPU. | |
| ### `_r6` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r6.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `5952546` | |
| - research tasks: | |
| - `current_reference`: `seq_len=320`, `features=6`, `epochs=5` | |
| - `long_context`: `seq_len=768`, `features=8`, `epochs=4` | |
| - `RUN_ABLATIONS=True` | |
| - `RUN_TOKEN_TASK=False` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 6.844 | recurrent_like | 6157 | 5522 | not available | | |
| | current_reference | enhanced | 0.020500 | 7.366 | recurrent_like | 8431 | 2594 | 3408 | | |
| | long_context | baseline | 27.229956 | 36.239 | recurrent_like | 2819 | 2939 | not available | | |
| | long_context | enhanced | 0.012164 | 634.387 | conv | 358 | 1876 | 2501 | | |
| Interpretation: | |
| - Enhanced crushed baseline in quality. | |
| - However, the long-context conv path was extremely slow: enhanced mean epoch time was about 634 s, versus about 36 s for baseline. | |
| - The ablation section was too expensive and was stopped midway. | |
| - This run motivated the later kernel-generation speedup and disabling ablations by default. | |
| ## `_r7` - Conv Kernel Generation Improved | |
| ### `_r7` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r7.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `4b977c1` | |
| - same quick tasks | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.417 | 4977 | 5155 | not available | | |
| | simple | enhanced | 0.045667 | 2.565 | 8583 | 2500 | 3260 | | |
| | moderate | baseline | 0.533364 | 7.405 | 5413 | 5186 | not available | | |
| | moderate | enhanced | 0.021113 | 7.796 | 7465 | 2414 | 3226 | | |
| Interpretation: | |
| - Quick benchmark remained stable. | |
| - Enhanced retained quality and full-sequence throughput advantages. | |
| - Recurrent remained slower than baseline. | |
| ### `_r7` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r7.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `b17f72a` | |
| - `RUN_ABLATIONS=False` | |
| - `RUN_TOKEN_TASK=False` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 7.351 | recurrent_like | 3821 | 4616 | not available | | |
| | current_reference | enhanced | 0.020500 | 7.530 | recurrent_like | 9289 | 2780 | 3339 | | |
| | long_context | baseline | 27.229956 | 39.236 | recurrent_like | 3523 | 3282 | not available | | |
| | long_context | enhanced | 0.012029 | 44.189 | conv | 5971 | 1776 | 2229 | | |
| Interpretation: | |
| - The conv speed issue was dramatically improved versus `_r6`. | |
| - Enhanced long-context epoch time dropped from about 634 s to about 44 s. | |
| - Enhanced was still slightly slower than baseline per epoch on long_context, but had much better loss and better full-sequence throughput. | |
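The size of the `_r6` to `_r7` improvement can be made explicit with a small calculation. The numbers are taken from the two research tables (long_context, enhanced); the `speedup` helper is only an illustrative name.

```python
def speedup(before: float, after: float) -> float:
    """Ratio by which a cost (for example, seconds per epoch) shrank."""
    return before / after


# long_context, enhanced: mean epoch seconds in `_r6` vs `_r7`.
epoch_speedup = speedup(634.387, 44.189)

# long_context, enhanced: full tokens/s in `_r7` vs `_r6` (higher is better).
throughput_gain = 5971 / 358

print(f"epoch time: {epoch_speedup:.1f}x faster")
print(f"full-sequence throughput: {throughput_gain:.1f}x higher")
```

Both ratios come out in the mid-teens, which is consistent with the conv kernel generation having been the dominant cost in `_r6`.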
| ## `_r8` - No-State Full Forward, Baseline Deploy Metrics, Token-Lite Enabled | |
| ### `_r8` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r8.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `73e76a7` | |
| - same quick tasks | |
| - baseline deploy metrics became available | |
| - full-sequence training/inference skips unused final-state computation | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | Deploy match MSE | | |
| |---|---|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.261 | 5711 | 5076 | 5241 | 0.000000 | | |
| | simple | enhanced | 0.044817 | 2.550 | 8204 | 2621 | 3367 | 0.022886 | | |
| | moderate | baseline | 0.533364 | 7.011 | 5782 | 5447 | 4519 | 0.000000 | | |
| | moderate | enhanced | 0.020569 | 7.010 | 8926 | 2503 | 3390 | 0.018165 | | |
| Interpretation: | |
| - Baseline deploy columns now populate. | |
| - Enhanced full-sequence throughput remained ahead. | |
| - Training time was tied on moderate. | |
| ### `_r8` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r8.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `73e76a7` | |
| - `RUN_ABLATIONS=False` | |
| - `RUN_TOKEN_TASK=True` | |
| Forecasting results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 7.235 | recurrent_like | 5941 | 5581 | 5702 | | |
| | current_reference | enhanced | 0.019951 | 7.177 | recurrent_like | 7431 | 1918 | 2336 | | |
| | long_context | baseline | 27.229956 | 35.557 | recurrent_like | 3969 | 3759 | 3842 | | |
| | long_context | enhanced | 0.011708 | 14.235 | conv | 19544 | 1860 | 2406 | | |
| Token-lite results: | |
| | Model | Train CE | Val CE | Val PPL | Seq len | Train samples | | |
| |---|---:|---:|---:|---:|---:| | |
| | baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 | | |
| | enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 | | |
| Interpretation: | |
| - This was the strongest practical result so far. | |
| - On long_context, enhanced was both much more accurate and much faster per epoch. | |
| - Token-lite showed enhanced also transferred better to a language-like task. | |
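The `Val PPL` column in the token-lite table is consistent with perplexity computed as the exponential of the validation cross-entropy. The sketch below assumes the CE values are in nats (natural log), which matches the reported numbers; `perplexity` is an illustrative helper name.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity corresponding to a mean cross-entropy given in nats."""
    return math.exp(cross_entropy_nats)


baseline_ppl = perplexity(3.132184)  # token-lite baseline Val CE
enhanced_ppl = perplexity(2.486829)  # token-lite enhanced Val CE

print(f"baseline PPL ~ {baseline_ppl:.3f}, enhanced PPL ~ {enhanced_ppl:.3f}")
```

Because perplexity is exponential in CE, the CE gap of roughly 0.65 nats between the two models corresponds to the baseline having almost twice the perplexity of the enhanced model.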
| ## `_r9` - Presentation Visuals Added | |
| ### `_r9` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r9.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `8738675` | |
| - same quick tasks as `_r8` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.249 | 6058 | 5478 | 5502 | | |
| | simple | enhanced | 0.044817 | 2.344 | 9550 | 2617 | 3644 | | |
| | moderate | baseline | 0.533364 | 6.672 | 6324 | 5686 | 5599 | | |
| | moderate | enhanced | 0.020569 | 6.571 | 9304 | 2771 | 3416 | | |
| Interpretation: | |
| - Results are similar to `_r8`, with slightly faster mean epoch times (for example, moderate enhanced `6.571` s vs `7.010` s). | |
| - Enhanced still wins on quality and full-sequence throughput. | |
| - Baseline still wins recurrent throughput. | |
| ### `_r9` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r9.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit printed in notebook: `60562bd` | |
| - visual sections added: | |
| - task visual preview | |
| - prediction comparison plots | |
| - error comparison plots | |
| Forecasting results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy tokens/s | | |
| |---|---|---:|---:|---|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 7.294 | recurrent_like | 6111 | 4981 | 5359 | | |
| | current_reference | enhanced | 0.019951 | 7.494 | recurrent_like | 8099 | 2185 | 3343 | | |
| | long_context | baseline | 27.229956 | 37.885 | recurrent_like | 3576 | 3728 | 3695 | | |
| | long_context | enhanced | 0.011708 | 14.717 | conv | 15654 | 1810 | 2327 | | |
| Token-lite results: | |
| | Model | Train CE | Val CE | Val PPL | Seq len | Train samples | | |
| |---|---:|---:|---:|---:|---:| | |
| | baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 | | |
| | enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 | | |
| Interpretation: | |
| - `_r9` is the most presentation-friendly record. | |
| - It confirms the `_r8` story: | |
| - enhanced wins quality strongly | |
| - enhanced wins full-sequence/conv long-context training and throughput | |
| - baseline still wins recurrent deployment throughput | |
| - token-lite favors enhanced | |
| ## `_r10` - Balanced Deployment Path Added | |
| ### `_r10` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r10.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit: `09db0da` | |
| - same quick tasks as `_r9` | |
| - new metrics: | |
| - `balanced_deploy_recurrent_latency_ms` | |
| - `balanced_deploy_recurrent_tokens_per_s` | |
| - `balanced_deploy_match_mse` | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE | | |
| |---|---|---:|---:|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.134 | 6053 | 5891 | 6034 | 5938 | 0.000000 | 0.000000 | | |
| | simple | enhanced | 0.044817 | 2.532 | 9973 | 2507 | 3763 | 3123 | 0.022886 | 0.000986 | | |
| | moderate | baseline | 0.533364 | 6.331 | 6134 | 5835 | 5512 | 5816 | 0.000000 | 0.000000 | | |
| | moderate | enhanced | 0.020569 | 6.601 | 10045 | 2778 | 3510 | 2862 | 0.018165 | 0.000468 | | |
| Interpretation: | |
| - Enhanced quality and full-sequence throughput remain strong. | |
| - Deployment-lite is still the fastest enhanced deployment variant. | |
| - Balanced deployment is slower than deployment-lite, but much more faithful to full forward. | |
| - Balanced deployment is useful as a fidelity-preserving approximation, not as a pure speed win. | |
| ### `_r10` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r10.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit printed in notebook: `9ff7e4e` | |
| - same research tasks as `_r9` | |
| - balanced deployment metrics added | |
| Forecasting results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE | | |
| |---|---|---:|---:|---|---:|---:|---:|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 7.648 | recurrent_like | 4933 | not recorded in compact table | 5092 | 4987 | 0.000000 | 0.000000 | | |
| | current_reference | enhanced | 0.019951 | 8.193 | recurrent_like | 8152 | not recorded in compact table | 3404 | 2687 | 0.027752 | 0.000315 | | |
| | long_context | baseline | 27.229956 | 40.350 | recurrent_like | 2395 | not recorded in compact table | 3397 | 3285 | 0.000000 | 0.000000 | | |
| | long_context | enhanced | 0.011708 | 15.862 | conv | 16957 | not recorded in compact table | 2245 | 1886 | 0.200325 | 0.001692 | | |
| Token-lite results: | |
| | Model | Train CE | Val CE | Val PPL | Seq len | Train samples | | |
| |---|---:|---:|---:|---:|---:| | |
| | baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 | | |
| | enhanced | 2.483611 | 2.486829 | 12.023 | 192 | 1200 | | |
| Interpretation: | |
| - Long-context enhanced still wins strongly on validation loss and full-sequence throughput. | |
| - Balanced deployment drastically improves fidelity relative to deployment-lite on enhanced: | |
| - long_context deploy-lite match MSE: `0.200325` | |
| - long_context balanced match MSE: `0.001692` | |
| - However, balanced deployment is slower than deployment-lite. | |
| - This suggests the output projection is important for fidelity, while the input-dependent gate is a major recurrent-time cost. | |
| ## `_r11` - FFT Fix, Split Tables, And Challenge Benchmarks | |
| ### `_r11` Quick Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r11.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit printed in notebook: `64f8632` | |
| - same quick tasks as `_r10` | |
| - notebook tables split into normal, deployment-lite, and balanced deployment views | |
| Results: | |
| | Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE | | |
| |---|---|---:|---:|---:|---:|---:|---:|---:|---:| | |
| | simple | baseline | 0.637628 | 2.353 | 6162 | 5776 | 5524 | 5649 | 0.000000 | 0.000000 | | |
| | simple | enhanced | 0.044817 | 2.113 | 11279 | 2625 | 3618 | 3112 | 0.022886 | 0.000986 | | |
| | moderate | baseline | 0.533364 | 6.527 | 6187 | 5337 | 4572 | 5563 | 0.000000 | 0.000000 | | |
| | moderate | enhanced | 0.020569 | 6.264 | 11434 | 2598 | 3338 | 2809 | 0.018165 | 0.000468 | | |
| Interpretation: | |
| - Enhanced remains much better on validation loss and full-sequence throughput. | |
| - Baseline remains faster for exact recurrent stepping. | |
| - Deployment-lite is still the fastest enhanced recurrent approximation, while balanced is much more faithful. | |
| ### `_r11` Research Notebook | |
| Saved notebook: | |
| - `output/jupyter-notebook/gamma-s4-research-benchmark_r11.ipynb` | |
| Configuration: | |
| - device: CUDA | |
| - commit printed in notebook: `4842762` | |
| - includes AMP FFT fix and split benchmark tables | |
| Forecasting results: | |
| | Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE | | |
| |---|---|---:|---:|---|---:|---:|---:|---:|---:|---:| | |
| | current_reference | baseline | 0.709749 | 6.922 | recurrent_like | 5898 | 5494 | 5162 | 5523 | 0.000000 | 0.000000 | | |
| | current_reference | enhanced | 0.019951 | 6.383 | recurrent_like | 11200 | 2665 | 3522 | 2931 | 0.027752 | 0.000315 | | |
| | long_context | baseline | 27.229956 | 36.928 | recurrent_like | 2994 | 2725 | 2513 | 2866 | 0.000000 | 0.000000 | | |
| | long_context | enhanced | 0.011593 | 10.419 | conv | 235772 | 1849 | 2477 | 1542 | 0.193474 | 0.001699 | | |
| Token-lite results: | |
| | Model | Train CE | Val CE | Val PPL | Seq len | Train samples | | |
| |---|---:|---:|---:|---:|---:| | |
| | baseline | 3.587260 | 3.132184 | 22.924 | 192 | 1200 | | |
| | enhanced | 2.483604 | 2.486901 | 12.024 | 192 | 1200 | | |
| Interpretation: | |
| - The AMP FFT fix worked: the long-context enhanced conv path completed and showed very high cached full-sequence throughput. | |
| - Enhanced long-context training is now much faster than baseline in this setup and far more accurate. | |
| - Recurrent deployment remains the weak point: enhanced exact recurrent throughput is still lower than baseline. | |
| - Balanced deployment remains the best fidelity-preserving approximation, but it is slower than deployment-lite. | |
### `_r11` Challenge Notebook

Saved notebook:

- `output/jupyter-notebook/gamma-s4-challenge-benchmark_r11.ipynb`

Configuration:

- device: CUDA
- commit printed in notebook: `bfc6e26`
- first saved run for the challenge benchmark notebook
- tasks:
  - permuted MNIST
  - selective copying
  - induction-style associative recall

Results:

| Task | Model | Val loss | Val accuracy | Epoch s | Forward ms | Forward tokens/s |
|---|---|---:|---:|---:|---:|---:|
| permuted_mnist | baseline | 2.760003 | 0.206000 | 112.318 | 425.084 | 118038 |
| permuted_mnist | enhanced | 2.041562 | 0.232000 | 35.042 | 7.950 | 6311750 |
| selective_copying | baseline | 3.529677 | 0.039551 | 6.509 | 73.482 | 222965 |
| selective_copying | enhanced | 3.468455 | 0.029622 | 2.149 | 2.739 | 5981919 |
| induction_recall | baseline | 3.615992 | 0.040039 | 6.424 | 72.235 | 226816 |
| induction_recall | enhanced | 3.519182 | 0.033203 | 2.061 | 2.673 | 6130411 |

Interpretation:

- Enhanced is much faster on the challenge forward benchmark because the full-sequence conv path is active.
- Permuted MNIST slightly favors enhanced on both loss and accuracy, but both accuracies are still low.
- Selective copying and induction recall are near random accuracy:
  - selective copying random accuracy is about `1 / 32 = 0.03125`
  - induction recall random accuracy is about `1 / 32 = 0.03125`
- Enhanced often has lower CE but not consistently higher accuracy, suggesting it is learning distributional smoothing before reliable exact recall.
- This is the clearest evidence so far that pure LTI Gamma SSM structure is not enough for Mamba-style selective memory tasks. The next model improvement should add selective input flow while keeping the fixed Gamma transition.
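
The "near random" reading above can be checked with a few lines of arithmetic. This sketch assumes that the reported validation loss is a mean cross-entropy in nats over the 32 classes mentioned in the notes; under that assumption a uniform guess scores `ln(32)`, which gives a second reference point besides the `1 / 32` chance accuracy. The dictionary layout is illustrative only.

```python
import math

NUM_CLASSES = 32  # class count stated in the notes for both tasks

chance_acc = 1.0 / NUM_CLASSES
uniform_ce = math.log(NUM_CLASSES)  # CE of a uniform guess, assuming nats

# (val loss, val accuracy) copied from the `_r11` challenge table
results = {
    ("selective_copying", "baseline"): (3.529677, 0.039551),
    ("selective_copying", "enhanced"): (3.468455, 0.029622),
    ("induction_recall", "baseline"): (3.615992, 0.040039),
    ("induction_recall", "enhanced"): (3.519182, 0.033203),
}

summary = {}
for key, (loss, acc) in results.items():
    # Ratio of 1.0 means exactly chance; excess of 0.0 means uniform-guess CE.
    summary[key] = (acc / chance_acc, loss - uniform_ce)
    print(key, round(summary[key][0], 2), round(summary[key][1], 4))
```

Under the stated assumption, every listed validation loss sits at or slightly above the uniform-guess level, which is consistent with the accuracies hovering around chance.
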
## `_r12` - Input-Selection Gate Tested

### `_r12` Quick Notebook

Saved notebook:

- `output/jupyter-notebook/gamma-s4-sinewave-benchmark_r12.ipynb`

Configuration:

- device: CUDA
- commit printed in notebook: `740a9ef`
- enhanced model includes the new pre-SSM input-selection gate
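
To make the idea of a pre-SSM input-selection gate concrete, here is a minimal scalar sketch. It is an assumption-heavy illustration, not the gate used in the notebook: the sigmoid form, the parameters `w_gate` and `b_gate`, and the single-channel recurrence with a fixed `decay` (standing in for the fixed Gamma transition) are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_lti_scan(u, decay, w_gate, b_gate):
    """Run a fixed-decay scalar recurrence on gated inputs.

    The gate depends only on the current input; `decay` is never
    modified by the data, so the transition stays time-invariant.
    """
    state, ys = 0.0, []
    for u_t in u:
        g_t = sigmoid(w_gate * u_t + b_gate)  # input-dependent gate in (0, 1)
        state = decay * state + g_t * u_t     # only the input flow is gated
        ys.append(state)
    return ys


u = [0.5, -1.0, 2.0, 0.0, 1.5]
y = gated_lti_scan(u, decay=0.9, w_gate=2.0, b_gate=-1.0)
y_scaled = gated_lti_scan([2.0 * v for v in u], decay=0.9, w_gate=2.0, b_gate=-1.0)

# Doubling the input no longer doubles the output, so the overall map is
# not linear in the input even though the transition itself is fixed.
is_linear = all(abs(2.0 * p - q) < 1e-9 for p, q in zip(y, y_scaled))
print(is_linear)
```

The design point this illustrates is that selectivity is added on the input side only, leaving the state transition untouched.
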
Results:

| Task | Model | Val loss | Mean epoch s | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| simple | baseline | 0.637628 | 2.249 | 5923 | 4449 | 5323 | 5568 | 0.000000 | 0.000000 |
| simple | enhanced | 0.043149 | 2.184 | 10904 | 2438 | 3751 | 3099 | 0.021473 | 0.001503 |
| moderate | baseline | 0.450424 | 6.747 | 5908 | 4689 | 5084 | 5381 | 0.000000 | 0.000000 |
| moderate | enhanced | 0.020161 | 6.264 | 8135 | 2357 | 3771 | 2944 | 0.076783 | 0.001143 |

Interpretation:

- The input-selection gate did not hurt quick-task quality; enhanced still wins validation loss clearly.
- Exact recurrent enhanced slowed slightly due to the extra gate.
- Deployment-lite mismatch worsened on moderate, but balanced deployment remained much more faithful.
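
The "match MSE" columns are read here as the mean squared error between the exact model output and the output of an approximate deployment mode. That reading is an assumption, since the notes do not define the metric; the helper name `match_mse` and the sample values below are hypothetical.

```python
def match_mse(reference, approx):
    """Mean squared error between two equal-length output sequences."""
    if len(reference) != len(approx):
        raise ValueError("sequences must have the same length")
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


exact = [0.10, -0.25, 0.40, 0.05]          # hypothetical exact outputs
deploy_approx = [0.12, -0.20, 0.35, 0.05]  # hypothetical approximate outputs

mse_same = match_mse(exact, exact)          # identical outputs give 0.0
mse_approx = match_mse(exact, deploy_approx)
print(mse_same, mse_approx)
```

Under this reading, a value of `0.000000` means the deployment mode reproduces the exact output, as in the baseline rows, and a smaller value means a more faithful approximation.
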
### `_r12` Research Notebook

Saved notebook:

- `output/jupyter-notebook/gamma-s4-research-benchmark_r12.ipynb`

Configuration:

- device: CUDA
- commit printed in notebook: `0c6ecb8`

Forecasting results:

| Task | Model | Val loss | Mean epoch s | Expected mode | Full tokens/s | Recurrent tokens/s | Deploy-lite tokens/s | Balanced deploy tokens/s | Deploy-lite match MSE | Balanced match MSE |
|---|---|---:|---:|---|---:|---:|---:|---:|---:|---:|
| current_reference | baseline | 0.709749 | 7.298 | recurrent_like | 5513 | 5460 | 5212 | 5506 | 0.000000 | 0.000000 |
| current_reference | enhanced | 0.020813 | 6.753 | recurrent_like | 10218 | 2261 | 3414 | 3018 | 0.023053 | 0.000478 |
| long_context | baseline | 12.850342 | 37.374 | recurrent_like | 3720 | 3776 | 3661 | 3706 | 0.000000 | 0.000000 |
| long_context | enhanced | 0.011039 | 11.212 | conv | 229320 | 1589 | 2390 | 2043 | 0.034069 | 0.001689 |

Token-lite results:

| Model | Train CE | Val CE | Val PPL | Seq len | Train samples |
|---|---:|---:|---:|---:|---:|
| baseline | 2.943133 | 6.186983 | 486.377 | 192 | 1200 |
| enhanced | 2.489687 | 2.490702 | 12.070 | 192 | 1200 |

Interpretation:

- Enhanced remains excellent for long-context forecasting.
- Input-selection slightly improved long_context val loss versus `_r11` (`0.011039` vs `0.011593`) but worsened exact recurrent speed (1589 vs 1849 tokens/s).
- Token-lite strongly favors enhanced in this run, though the baseline looks unstable or overfit: its validation CE (`6.186983`) is far above its training CE (`2.943133`).
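
The Val PPL column is consistent with the usual relation `PPL = exp(CE)` for a cross-entropy in nats, and the train/validation gap is a quick way to quantify the baseline's behaviour in this run. The sketch below only re-derives numbers from the `_r12` token-lite table; the variable names are illustrative.

```python
import math

# (train CE, val CE) copied from the `_r12` token-lite table
token_lite = {
    "baseline": (2.943133, 6.186983),
    "enhanced": (2.489687, 2.490702),
}

val_ppl = {}
ce_gap = {}
for name, (train_ce, val_ce) in token_lite.items():
    val_ppl[name] = math.exp(val_ce)   # perplexity from CE in nats
    ce_gap[name] = val_ce - train_ce   # generalization gap in nats
    print(name, round(val_ppl[name], 3), round(ce_gap[name], 6))
```

The enhanced gap is close to zero, while the baseline gap is over three nats, which is what drives its very high validation perplexity.
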
### `_r12` Challenge Notebook

Saved notebook:

- `output/jupyter-notebook/gamma-s4-challenge-benchmark_r12.ipynb`

Configuration:

- device: CUDA
- commit printed in notebook: `11bd2e6`
- same challenge tasks as `_r11`, with input-selection gate active in enhanced

Results:

| Task | Model | Val loss | Val accuracy | Epoch s | Forward ms | Forward tokens/s |
|---|---|---:|---:|---:|---:|---:|
| permuted_mnist | baseline | 2.760003 | 0.206000 | 114.183 | 511.946 | 98010 |
| permuted_mnist | enhanced | 2.052564 | 0.209000 | 34.080 | 8.393 | 5978110 |
| selective_copying | baseline | 3.514275 | 0.028320 | 7.319 | 73.808 | 221983 |
| selective_copying | enhanced | 3.468793 | 0.029785 | 2.353 | 2.988 | 5482614 |
| induction_recall | baseline | 3.607873 | 0.038086 | 7.439 | 72.952 | 224585 |
| induction_recall | enhanced | 3.535175 | 0.039062 | 2.788 | 2.925 | 5600445 |

Interpretation:

- The input-selection gate did not produce a meaningful challenge-task accuracy breakthrough.
- Permuted MNIST accuracy stayed low and did not improve over `_r11`.
- Selective copying and induction recall are still near random. With 32 classes, random accuracy is about `0.03125`.
- The enhanced model still has much better forward throughput and somewhat lower CE, but accuracy shows it is not performing reliable exact recall.
- This suggests two things:
  - permuted MNIST likely needs more epochs and/or more samples
  - selective copying and induction need a stronger selective/content-dependent memory mechanism or a curriculum diagnostic, not just more epochs
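
The throughput advantage can be put in numbers directly from the table. This sketch computes the enhanced-over-baseline forward speedup per task and, as a sanity check, the token count per forward pass implied by each row, assuming `tokens/s = tokens per forward / (forward ms / 1000)`. The dictionary layout and helper name are illustrative.

```python
# (forward ms, forward tokens/s) copied from the `_r12` challenge table
rows = {
    "permuted_mnist": {"baseline": (511.946, 98010), "enhanced": (8.393, 5978110)},
    "selective_copying": {"baseline": (73.808, 221983), "enhanced": (2.988, 5482614)},
    "induction_recall": {"baseline": (72.952, 224585), "enhanced": (2.925, 5600445)},
}


def tokens_per_forward(forward_ms, tokens_per_s):
    """Token count per forward pass implied by the two timing columns."""
    return tokens_per_s * forward_ms / 1000.0


speedup = {}
implied_tokens = {}
for task, models in rows.items():
    speedup[task] = models["enhanced"][1] / models["baseline"][1]
    implied_tokens[task] = tokens_per_forward(*models["baseline"])
    print(task, round(speedup[task], 1), round(implied_tokens[task]))
```

The implied token counts agree between baseline and enhanced rows up to rounding of the timing column, so the speedup ratios compare like with like.
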
## Versions Not Recorded

The following are not recorded as complete benchmark versions:

- Research notebooks before `_r6`: no saved research `_r1` to `_r5` notebooks exist in the repo.
- Any temporary failed Colab runs during error debugging: tracebacks were discussed in chat, but they are not treated as experiment records.
- Partial long-context ablation run in `_r6`: only partial output is present, so it is not summarized as a completed ablation result.
## Current Best Summary

Best presentable run:

- `_r12` research benchmark

Most important result:

- On `long_context` in `_r12`, `gamma_s4_enhanced` achieved much lower validation loss than baseline (0.011039 vs 12.850342) and substantially better full-sequence throughput (229320 vs 3720 tokens/s).
- `_r11` shows the fixed AMP FFT conv path completing successfully and producing very high cached full-sequence throughput on `long_context`.
- `_r12` confirms the input-selection gate alone is not enough to lift selective copying or induction recall beyond near-random accuracy.

Current limitation:

- `gamma_s4_enhanced` still trails `gamma_baseline` in recurrent token-by-token deployment throughput (1589 vs 3776 tokens/s on `long_context` in `_r12`).
- Challenge benchmarks show that the current model needs stronger selective/content-dependent memory mechanisms.

Recommended next improvement targets:

1. Add challenge-task curriculum diagnostics and longer token-memory epochs.
2. Explore stronger content-dependent memory beyond static LTI convolution, while preserving the fixed Gamma transition when possible.
3. Optimize recurrent/deployment throughput for `gamma_s4_enhanced`.
4. Improve deployment-lite fidelity, especially on `long_context`.
5. Improve structured Gamma kernel generation for the conv/full-sequence path.