StarMist0012 commited on 15 days ago

Commit

e2bfccc

verified ·

1 Parent(s): 388fd6e

Add files using upload-large-folder tool

Browse files

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50) hide show

README.md +129 -95
artifacts/configs/pretrain.yaml +86 -0
artifacts/configs/sft.yaml +92 -0
artifacts/diagnostics/activation_probe_pretrain_final.json +118 -0
artifacts/diagnostics/generation_samples_pretrain_final.json +23 -0
artifacts/diagnostics/generation_samples_sft_final.json +23 -0
artifacts/gpu_telemetry_nvidia_smi.csv +0 -0
artifacts/loss_curve_summary.json +26 -0
artifacts/model_card.json +13 -0
artifacts/run.sh +55 -0
artifacts/run_plan.json +19 -0
artifacts/status.json +1 -0
code/TaoTrain/.gitignore +150 -0
code/TaoTrain/README.md +423 -0
code/TaoTrain/configs/pretrain.yaml +138 -0
code/TaoTrain/configs/rl_dpo.yaml +60 -0
code/TaoTrain/configs/sft.yaml +93 -0
code/TaoTrain/configs/ssm_pretrain.yaml +87 -0
code/TaoTrain/configs/tokenizer.yaml +44 -0
code/TaoTrain/configs/tokenizer_taodata_pilot.yaml +22 -0
code/TaoTrain/configs/yarn_pretrain.yaml +181 -0
code/TaoTrain/data/sample_data.jsonl +10 -0
code/TaoTrain/data/sample_sft_multi.jsonl +4 -0
code/TaoTrain/data/sample_sft_single.jsonl +5 -0
code/TaoTrain/pyproject.toml +65 -0
code/TaoTrain/scripts/benchmark_taonet_real_tokens.py +1034 -0
code/TaoTrain/scripts/benchmark_taonet_token_variants.py +516 -0
code/TaoTrain/scripts/diagnostics/activation_probe.py +160 -0
code/TaoTrain/scripts/diagnostics/generate_checkpoint_samples.py +134 -0
code/TaoTrain/scripts/diagnostics/sft_sanity_check.py +300 -0
code/TaoTrain/scripts/profile_taonet_components.py +393 -0
code/TaoTrain/scripts/remote/job_status.sh +36 -0
code/TaoTrain/scripts/remote/run_200m_base_suite.sh +149 -0
code/TaoTrain/scripts/remote/run_200m_branch_only_chat.sh +329 -0
code/TaoTrain/scripts/remote/run_200m_until_selection.sh +81 -0
code/TaoTrain/scripts/remote/run_pre_200m_branch_only_gate.sh +269 -0
code/TaoTrain/scripts/remote/run_pre_200m_stability_gate.sh +267 -0
code/TaoTrain/scripts/remote/submit_200m_base_suite.sh +16 -0
code/TaoTrain/scripts/remote/submit_200m_branch_only_chat.sh +32 -0
code/TaoTrain/scripts/remote/submit_200m_until_selection.sh +28 -0
code/TaoTrain/scripts/remote/submit_detached_job.sh +118 -0
code/TaoTrain/scripts/remote/submit_pre_200m_branch_only_gate.sh +33 -0
code/TaoTrain/scripts/remote/submit_pre_200m_stability_gate.sh +32 -0
code/TaoTrain/scripts/remote/submit_ssm_improvement_sweep.sh +71 -0
code/TaoTrain/scripts/summarize_taonet_benchmark_suite.py +138 -0
code/TaoTrain/src/taoTrain.egg-info/dependency_links.txt +1 -0
code/TaoTrain/src/taoTrain.egg-info/entry_points.txt +4 -0
code/TaoTrain/src/taoTrain/__init__.py +8 -0
code/TaoTrain/src/taoTrain/cli.py +387 -0
code/TaoTrain/src/taoTrain/config.py +726 -0

README.md CHANGED Viewed

@@ -2,7 +2,7 @@
 license: mit
 language:
 - en
-library_name: pytorch
 pipeline_tag: text-generation
 tags:
 - taonet
@@ -11,7 +11,8 @@ tags:
 - state-space-model
 - dplr
 - pytorch
-- custom-code
 - text-generation
 - experimental
 datasets:
@@ -20,15 +21,93 @@ datasets:
 # TaoNet-mini-T2
-TaoNet-mini-T2 is an experimental 196M-parameter TaoNet-style language model that replaces the attention sequence mixer with a Taotern/Gamma DPLR state-space model (SSM). It is packaged as a runnable TaoTrain checkpoint bundle rather than a native `transformers.AutoModelForCausalLM` repository.
-This release is intended for research backup, reproducibility, and deployment investigation. It includes the model checkpoints, tokenizer, TaoTrain runtime code, Taotern SSM code, run configs, diagnostics, and Windows/Linux helper scripts.
 ## Model Details
 | Field | Value |
-|---|---|
-| Model family | TaoNet / Taotern SSM |
 | Architecture | `taonet_ssm` |
 | Candidate | `pure_ssm_196m_branch_rms_only` |
 | Parameters | 196,573,128 |
@@ -45,20 +124,17 @@ This release is intended for research backup, reproducibility, and deployment in
 | Local shift | Enabled, per-channel |
 | Branch RMS norm | Enabled |
-## Files
-Download and unzip:
-```text
-taotern-200m-branch-only-chat-20260514_handoff.zip
-```
-The zip contains:
 ```text
 model/
   final_model.pt                  # SFT final checkpoint
-  pretrain_final_model.pt         # recommended checkpoint for fixed chat test
 tokenizer/
   tokenizer.model
   tokenizer.vocab
@@ -69,100 +145,64 @@ code/
 artifacts/
   configs/
   diagnostics/
-  model_card.json
-  run_plan.json
-  loss_curve_summary.json
-chat_ssm_fixed.py
-RUN_TEST_WINDOWS.cmd
-setup_windows.ps1
-setup_linux.sh
-run_chat_fixed.ps1
-run_chat_fixed.sh
-CHECKSUMS.sha256
-README_HANDOFF.md
 ```
-Package checksum:
-```text
-sha256=5dd45e5d236f37101d183cadb92c5a13ea0aa574a24f578faa6cc8ee7ed166d6
-```
-## Quick Start
-### Windows
-1. Download and unzip `taotern-200m-branch-only-chat-20260514_handoff.zip`.
-2. Double-click:
-```text
-RUN_TEST_WINDOWS.cmd
 ```
-The launcher creates or reuses `.venv`, tries PyTorch CUDA 12.8 first, falls back to CUDA 12.6, installs the packaged TaoTrain/Taotern SSM code, then launches the fixed chat script.
-Manual Windows setup:
 ```powershell
-powershell -ExecutionPolicy Bypass -File .\setup_windows.ps1 -TorchFlavor cu128
-powershell -ExecutionPolicy Bypass -File .\run_chat_fixed.ps1
 ```
-If CUDA 12.8 fails:
-```powershell
-powershell -ExecutionPolicy Bypass -File .\setup_windows.ps1 -TorchFlavor cu126
-```
-CPU-only setup is possible but very slow:
-```powershell
-powershell -ExecutionPolicy Bypass -File .\setup_windows.ps1 -TorchFlavor cpu
 ```
-### Linux
-```bash
-unzip taotern-200m-branch-only-chat-20260514_handoff.zip
-cd taotern-200m-branch-only-chat-20260514
-chmod +x setup_linux.sh run_chat_fixed.sh
-./setup_linux.sh cu128
-./run_chat_fixed.sh
-```
-If CUDA 12.8 is not suitable, try:
-```bash
-./setup_linux.sh cu126
-```
-## Recommended Inference Mode
-The checkpoint was trained with convolutional SSM execution. For chat testing, this bundle uses a temporary RepoBridge-style inference fix:
 ```text
 ssm_finite_tail_correction=true
-ssm_kernel_mode=recurrent
-default_checkpoint=model/pretrain_final_model.pt
-decode=greedy
-top_p=0.85
-repetition_penalty=1.2
-max_new_tokens=64
-```
-Run it directly:
-```bash
-python chat_ssm_fixed.py --interactive
 ```
-or on Windows:
-```powershell
-powershell -ExecutionPolicy Bypass -File .\run_chat_fixed.ps1
-```
-Plain TaoTrain `tui-chat` reconstructs the model from the training config (`ssm_finite_tail_correction=false`, `ssm_kernel_mode=conv`) and may produce much worse samples. Use `chat_ssm_fixed.py` for the expected behavior.
 ## Training Summary
@@ -172,8 +212,6 @@ Run ID:
 taotern-200m-branch-only-chat-20260514
 ```
-Training shape:
 | Stage | Value |
 |---|---:|
 | Pretrain token positions | 4,000,000,000 |
@@ -204,17 +242,14 @@ This model is intended for:
 - Taotern/TaoNet SSM research
 - checkpoint backup and reproducibility
-- deployment experiments for a custom TaoTrain runtime
 - studying recurrent SSM inference behavior
-It is not currently a drop-in Transformers model and is not intended as a polished production chatbot.
 ## Limitations
-- Experimental model quality; validate before use.
-- Requires custom TaoTrain and Taotern SSM code included in the package.
 - The recommended chat path depends on an inference-time SSM override.
-- Not currently packaged as `AutoModelForCausalLM`.
 - CPU inference is expected to be very slow.
 - English-focused pilot data/tokenizer.
@@ -232,4 +267,3 @@ It is not currently a drop-in Transformers model and is not intended as a polish
 ## Related
 - [TaoTern/TaoNet-pico-T1](https://huggingface.co/TaoTern/TaoNet-pico-T1)

 license: mit
 language:
 - en
+library_name: transformers
 pipeline_tag: text-generation
 tags:
 - taonet
 - state-space-model
 - dplr
 - pytorch
+- transformers
+- custom_code
 - text-generation
 - experimental
 datasets:
 # TaoNet-mini-T2
+TaoNet-mini-T2 is an experimental 196M-parameter TaoNet language model using a Taotern/Gamma DPLR state-space model (SSM) sequence core instead of attention. The repository includes the full training handoff package, but the recommended inference path is now Hugging Face `transformers` remote code:
+```python
+AutoModelForCausalLM.from_pretrained("TaoTern/TaoNet-mini-T2", trust_remote_code=True)
+```
+The default `transformers` loader downloads `model/pretrain_final_model.pt` and applies the RepoBridge chat-quality fix: `ssm_finite_tail_correction=True` and `ssm_kernel_mode="recurrent"`.
+## Quick Start
+Install runtime dependencies:
+```bash
+pip install torch transformers sentencepiece huggingface_hub pydantic pydantic-settings pyyaml numpy
+```
+For the private review repo, log in first:
+```bash
+hf auth login
+```
+Run generation from Python:
+```python
+import time
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+MODEL_NAME = "TaoTern/TaoNet-mini-T2"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.bfloat16 if device == "cuda" else torch.float32
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_NAME,
+    trust_remote_code=True,
+    torch_dtype=dtype,
+).to(device)
+def generate_text(prompt, max_new_tokens=64, temperature=0.7, top_p=0.85):
+    inputs = tokenizer(prompt, return_tensors="pt")
+    inputs = {key: value.to(device) for key, value in inputs.items()}
+    start_time = time.time()
+    with torch.inference_mode():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=max_new_tokens,
+            temperature=temperature,
+            top_p=top_p,
+            repetition_penalty=1.2,
+            do_sample=True,
+            use_cache=False,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.eos_token_id,
+        )
+    elapsed_time = time.time() - start_time
+    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
+    tokens_per_second = new_tokens / elapsed_time if elapsed_time > 0 else 0.0
+    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+    return completion, tokens_per_second
+if __name__ == "__main__":
+    text, tps = generate_text("Fruit is now expensive so we should")
+    print(text)
+    print(f"\nTokens per second: {tps:.2f}")
+```
+To load the SFT final checkpoint instead of the default pretrain checkpoint:
+```python
+model = AutoModelForCausalLM.from_pretrained(
+    "TaoTern/TaoNet-mini-T2",
+    trust_remote_code=True,
+    checkpoint_name="final_model.pt",
+)
+```
 ## Model Details
 | Field | Value |
+|---|---:|
 | Architecture | `taonet_ssm` |
 | Candidate | `pure_ssm_196m_branch_rms_only` |
 | Parameters | 196,573,128 |
 | Local shift | Enabled, per-channel |
 | Branch RMS norm | Enabled |
+## Repository Layout
 ```text
+config.json
+configuration_taonet_mini_t2.py
+modeling_taonet_mini_t2.py
+tokenization_taonet_mini_t2.py
+tokenizer.model
 model/
   final_model.pt                  # SFT final checkpoint
+  pretrain_final_model.pt         # default checkpoint for HF inference
 tokenizer/
   tokenizer.model
   tokenizer.vocab
 artifacts/
   configs/
   diagnostics/
+chat_ssm_fixed.py                 # legacy local fixed-chat CLI
+eval_lm_eval.py                   # local lm-eval harness wrapper
 ```
+## Upload Notes
+This repo contains two multi-GB checkpoint files, so prefer the resumable large-folder uploader instead of the normal single-commit upload command:
+```bash
+hf upload-large-folder TaoTern/TaoNet-mini-T2 . --repo-type model --private
 ```
+On Windows, from the repo folder:
 ```powershell
+powershell -ExecutionPolicy Bypass -File .\upload_large_folder.ps1
 ```
+## Inference Notes
+The training config used `ssm_finite_tail_correction=False` and `ssm_kernel_mode="conv"`. That path is fast for full-sequence training/evaluation but produced poor chat samples in the recovered workflow.
+The `transformers` wrapper defaults to:
+```text
+ssm_finite_tail_correction=True
+ssm_kernel_mode=recurrent
+checkpoint=model/pretrain_final_model.pt
 ```
+For fast benchmark scoring, use the included `eval_lm_eval.py` script with `--ssm-kernel-mode conv --finite-tail`.
+## LM Evaluation Harness Benchmark
+Settings:
 ```text
+library=lm-eval-harness
+checkpoint=model/pretrain_final_model.pt
+num_fewshot=0
+limit=100
+ssm_kernel_mode=conv
 ssm_finite_tail_correction=true
+eval_batch_size=8
 ```
+Results:
+| Task | Primary score |
+|---|---:|
+| HellaSwag | 0.3300 |
+| ARC Easy | 0.3400 |
+| ARC Challenge | 0.2200 |
+| PIQA | 0.4400 |
+| Winogrande | 0.5300 |
+| Mean primary score | 0.3720 |
+These are limit-100 smoke benchmark numbers for review, not full leaderboard results.
 ## Training Summary
 taotern-200m-branch-only-chat-20260514
 ```
 | Stage | Value |
 |---|---:|
 | Pretrain token positions | 4,000,000,000 |
 - Taotern/TaoNet SSM research
 - checkpoint backup and reproducibility
+- deployment experiments with custom Hugging Face remote code
 - studying recurrent SSM inference behavior
 ## Limitations
+- Experimental model quality; validate outputs before use.
+- Requires `trust_remote_code=True` because the architecture is not part of upstream `transformers`.
 - The recommended chat path depends on an inference-time SSM override.
 - CPU inference is expected to be very slow.
 - English-focused pilot data/tokenizer.
 ## Related
 - [TaoTern/TaoNet-pico-T1](https://huggingface.co/TaoTern/TaoNet-pico-T1)

artifacts/configs/pretrain.yaml ADDED Viewed

	@@ -0,0 +1,86 @@

+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: 512
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  block_residual_rms_norm: false
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  local: true
+  jsonl_path: /home/student/Data/TaoData/pretrain.jsonl
+  text_field: text
+  tokenizer_type: sentencepiece
+  tokenizer_path: /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
+  samples_per_chunk: 2000
+  tokenizer_threads: 8
+sequence_length: 512
+batch_size: 8
+num_epochs: 100000
+max_steps: 976563
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: 0.0008
+  weight_decay: 0.01
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: constant
+  warmup_steps: 1000
+  warmup_ratio: 0.0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: /home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/pretrain
+save_every_steps: 100000
+save_best_model: false
+keep_last_n_checkpoints: 3
+eval_every_steps: 100000
+eval_samples: 32
+log_every_steps: 100
+aim_repo: /home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/.aim-pretrain
+seed: 43
+num_workers: 0
+pin_memory: true

artifacts/configs/sft.yaml ADDED Viewed

	@@ -0,0 +1,92 @@

+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: 512
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  block_residual_rms_norm: false
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  split: train
+  instruction_column: input
+  response_column: output
+  local: true
+  jsonl_path: /home/student/Data/TaoData/sft.jsonl
+  samples_per_chunk: 2000
+  tokenizer_type: sentencepiece
+  tokenizer_path: /home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
+  tokenizer_threads: 8
+checkpoint_path: /home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/pretrain/final_model.pt
+user_token: "<user>"
+assistant_token: "<assistant>"
+response_loss_only: true
+batch_size: 8
+num_epochs: 100000
+max_steps: 50000
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: 0.00005
+  weight_decay: 0.0
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: constant
+  warmup_steps: 100
+  warmup_ratio: 0.0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: /home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/sft
+save_every_steps: 10000
+save_best_model: false
+keep_last_n_checkpoints: 3
+eval_every_steps: 10000
+eval_samples: 32
+log_every_steps: 20
+aim_repo: /home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/.aim-sft
+seed: 44
+num_workers: 0
+pin_memory: true

artifacts/diagnostics/activation_probe_pretrain_final.json ADDED Viewed

	@@ -0,0 +1,118 @@

+{
+  "checkpoint": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/pretrain/final_model.pt",
+  "loss": 2.8459982872009277,
+  "batch_size": 2,
+  "seq_len": 512,
+  "device": "cuda",
+  "dtype": "torch.bfloat16",
+  "layers": {
+    "blocks.0": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 2.9217934608459473,
+      "max_abs": 62.90830612182617
+    },
+    "blocks.1": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 3.4865853786468506,
+      "max_abs": 63.050193786621094
+    },
+    "blocks.2": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 4.092358589172363,
+      "max_abs": 98.85921478271484
+    },
+    "blocks.3": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 4.814863681793213,
+      "max_abs": 160.82785034179688
+    },
+    "blocks.4": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 6.032325267791748,
+      "max_abs": 237.8623504638672
+    },
+    "blocks.5": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 7.5034403800964355,
+      "max_abs": 324.9512939453125
+    },
+    "blocks.6": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 8.303068161010742,
+      "max_abs": 345.4449462890625
+    },
+    "blocks.7": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 8.911171913146973,
+      "max_abs": 337.6418151855469
+    },
+    "blocks.8": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 10.293173789978027,
+      "max_abs": 343.7690124511719
+    },
+    "blocks.9": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 11.591143608093262,
+      "max_abs": 361.6515808105469
+    },
+    "blocks.10": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 13.886137962341309,
+      "max_abs": 401.029052734375
+    },
+    "blocks.11": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 15.372880935668945,
+      "max_abs": 406.81036376953125
+    },
+    "blocks.12": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 18.466873168945312,
+      "max_abs": 541.068603515625
+    },
+    "blocks.13": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 20.376676559448242,
+      "max_abs": 553.8697509765625
+    },
+    "blocks.14": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 23.877737045288086,
+      "max_abs": 652.4536743164062
+    },
+    "blocks.15": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 28.56619644165039,
+      "max_abs": 874.2588500976562
+    },
+    "blocks.16": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 35.949851989746094,
+      "max_abs": 1143.6116943359375
+    },
+    "blocks.17": {
+      "numel": 1048576,
+      "finite": 1048576,
+      "rms": 45.9725227355957,
+      "max_abs": 2560.034423828125
+    }
+  }
+}

artifacts/diagnostics/generation_samples_pretrain_final.json ADDED Viewed

	@@ -0,0 +1,23 @@

+{
+  "checkpoint": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/pretrain/final_model.pt",
+  "tokenizer_path": "/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model",
+  "device": "cuda",
+  "dtype": "torch.bfloat16",
+  "max_new_tokens": 160,
+  "temperature": 0.8,
+  "top_p": 0.9,
+  "samples": [
+    {
+      "prompt": "The purpose of artificial intelligence is",
+      "completion": "to enable human creativity to be a kind of artifact of politics - and sometimes to do that the principle of human dignity cannot be denied without the distinction of two; but the only thing that gives those who rely on it the ability to be free to decide the right thing.In some cases, the majority of the rules about how to use such knowledge are not clearly understood by the law. This leads to the generation of the same arguments, which leads to the mistake of the expert who sought to apply it to everything else.Well, to comply with the human dignity. There is some good news. There are some types of discrimination. This is because it is always a death sentence. Many people are not even able to get to court."
+    },
+    {
+      "prompt": "In a small village,",
+      "completion": "you can live alone, but sometimes you might have to leave the house without a roof. Most of these city residents get two hours of sleep on most of the time, while many do.When it comes to school:Times help.Most of the time, wind is always the most important (or the least important): being well rested during the day.When the weather gets hot, the symptoms of insomnia can appear. Most of the time it just doesn't. Most people get enough sleep.Typically, we don't need to get more than 2 hours.Why are there such a large number of people getting the same kind of sleep?There are some factors that make it difficult to identify as having sleep problems. Some people do get a few"
+    },
+    {
+      "prompt": "<user>Hello, who are you?<assistant>",
+      "completion": "24.2 (2) 1:10Though many households still live without electricity, most people who live close to the power lines will do so to the extent that they have only a small amount of electricity at the end of the day (33%).(3) The average household electricity rate is higher for the two types of households at 32%, with 31% of the 2005-06 peaks.7Take home message:- Unrealistically high societies. This is one factor that makes getting it to such a huge difference. This means that one has to make some very poor choices when it comes to one's overall lifestyle. Many cities do not have a proper gas taxes;"
+    }
+  ]
+}

artifacts/diagnostics/generation_samples_sft_final.json ADDED Viewed

	@@ -0,0 +1,23 @@

+{
+  "checkpoint": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/sft/final_model.pt",
+  "tokenizer_path": "/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model",
+  "device": "cuda",
+  "dtype": "torch.bfloat16",
+  "max_new_tokens": 220,
+  "temperature": 0.7,
+  "top_p": 0.9,
+  "samples": [
+    {
+      "prompt": "<user>Hello, who are you?<assistant>",
+      "completion": "A tip is usually a net, a pack is a scavenger, which means that the initial effort is the delta, the body, the nav, is the ad. ⁇ ** Also, typically, a person typically becomes anemic and always begins to move around, which typically involves movement, which typically involves some movement, such as movement, which can also affect movement. ⁇ **Astrophobic** is not always a result of a very dramatic event. ⁇ * Often, it).That must be.*That got lost. ⁇ That got you got.)That must have got you confused. Most doctors agree. Muscles seem to lose flexibility.But steroids can still keep you sharp. Muscle cramps. Muscle cramps. Muscle fatigue. Most adults can still breathe. ⁇ **Most organs have multiple functions.** ⁇ Most treatments are aimed at improving symptoms. But advanced treatments may trigger inflammation. Low maintenance but soft but solid retain or soft core. Presence: Completing the major: Co the specific major"
+    },
+    {
+      "prompt": "<user>Explain what artificial intelligence is in simple words.<assistant>",
+      "completion": "A oscillator is a singular and eventually identical, specialized singular, the silencer. However, sometimes, sometimes, sometimes, the particles are almost always and typically surrounded by unusual or unusual characteristics, such as being very agile, such as being very small, very small, and often very dense, sometimes very really really. ⁇ They can also sometimes also sometimes sometimes sometimes. Sometimes, sometimes sometimes sometimes sometimes. Sometimes, sometimes sometimes sometimes sometimes. ⁇ However,).).....)....That.).G...That.K..K..New..New....New..New......These two devices seem quite close. Each has its own advantage. ⁇ But sometimes. But sometimes. It doesn't. But it still does. But... The hypothesis. The hypothesis. The untested or rare. Comes in red. P). Complex. Co. High. CoinIn the general, the"
+    },
+    {
+      "prompt": "<user>Give me three practical study tips.<assistant>",
+      "completion": "Dementia, also known as numbness or tingling under the arc (the gait), is a physical deformity. The necessity is the 11111111 and also typically includes an injury or condition that makes it difficult for you to do some simple things, such as getting dressed, getting some rest, or getting some sleep, but not actually getting really tired. Sometimes simply having this sometimes sometimes sometimes does really does sometimes does. Sometimes, sometimes sometimes sometimes. Sometimes sometimes.)It.The....).)...).That...That.. Two. ⁇ .That.. ⁇ Check it. ⁇ Check everything. ⁇ Check it. ⁇ But keep it small. Remember to keep everything hidden. Remember to keep everything organized. Remember to keep everyone informed. Remember to keep valuable items readily accessible. Most businesses don't offer absolute protection. Let the deco stay solid. Cos. Cos. Co. is a specific sub. ⁇ In a Coin,"
+    }
+  ]
+}

artifacts/gpu_telemetry_nvidia_smi.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

artifacts/loss_curve_summary.json ADDED Viewed

	@@ -0,0 +1,26 @@

+{
+  "pretrain": {
+    "points": 961817,
+    "first_loss": 9.26,
+    "last_loss": 2.64,
+    "min_loss": 0.577,
+    "max_loss": 9.28,
+    "tail20_mean": 2.367,
+    "tail100_mean": 2.3350999999999997,
+    "tail200_mean": 2.3415500000000002,
+    "tail1000_mean": 2.373460000000001,
+    "csv_path": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/analysis/pretrain_loss_curve.csv"
+  },
+  "sft": {
+    "points": 49238,
+    "first_loss": 3.2,
+    "last_loss": 1.08,
+    "min_loss": 0.386,
+    "max_loss": 3.38,
+    "tail20_mean": 0.9390999999999998,
+    "tail100_mean": 0.9585200000000004,
+    "tail200_mean": 0.965355,
+    "tail1000_mean": 0.9520289999999981,
+    "csv_path": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/analysis/sft_loss_curve.csv"
+  }
+}

artifacts/model_card.json ADDED Viewed

	@@ -0,0 +1,13 @@

+{
+  "state": "completed",
+  "architecture": "taonet_ssm",
+  "candidate": "pure_ssm_196m_branch_rms_only",
+  "pretrain_checkpoint": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/pretrain/final_model.pt",
+  "sft_checkpoint": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/checkpoints/sft/final_model.pt",
+  "tokenizer_path": "/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model",
+  "pretrain_config": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/configs/pretrain.yaml",
+  "sft_config": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/configs/sft.yaml",
+  "activation_probe": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/diagnostics/activation_probe_pretrain_final.json",
+  "pretrain_samples": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/diagnostics/generation_samples_pretrain_final.json",
+  "sft_samples": "/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514/outputs/diagnostics/generation_samples_sft_final.json"
+}

artifacts/run.sh ADDED Viewed

	@@ -0,0 +1,55 @@

+#!/usr/bin/env bash
+set -Eeuo pipefail
+job_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+status_json="$job_dir/status.json"
+log_path="$job_dir/train.log"
+telemetry_path="$job_dir/gpu_telemetry_nvidia_smi.csv"
+write_status() {
+  local state="$1"
+  local extra="${2:-}"
+  local now
+  now="$(date -Iseconds)"
+  cat > "$status_json" <<STATUS
+{"state":"$state","updated_at":"$now","job_dir":"$job_dir"$extra}
+STATUS
+}
+write_status "running" ",\"start_time\":\"$(date -Iseconds)\""
+rm -f "$job_dir/DONE" "$job_dir/FAILED"
+monitor_pid=""
+if command -v nvidia-smi >/dev/null 2>&1; then
+  (
+    while true; do
+      date -Iseconds
+      nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw,temperature.gpu --format=csv,noheader,nounits
+      sleep 5
+    done
+  ) > "$telemetry_path" 2>&1 &
+  monitor_pid="$!"
+fi
+cleanup() {
+  if [[ -n "$monitor_pid" ]]; then
+    kill "$monitor_pid" 2>/dev/null || true
+    wait "$monitor_pid" 2>/dev/null || true
+  fi
+}
+trap cleanup EXIT
+set +e
+"$job_dir/command.sh" > "$log_path" 2>&1
+exit_code="$?"
+set -e
+if [[ "$exit_code" -eq 0 ]]; then
+  touch "$job_dir/DONE"
+  write_status "completed" ",\"exit_code\":0,\"end_time\":\"$(date -Iseconds)\""
+else
+  echo "$exit_code" > "$job_dir/FAILED"
+  write_status "failed" ",\"exit_code\":$exit_code,\"end_time\":\"$(date -Iseconds)\""
+fi
+exit "$exit_code"

artifacts/run_plan.json ADDED Viewed

	@@ -0,0 +1,19 @@

+{
+  "purpose": "200m_branch_only_pure_ssm_4b_pretrain_plus_sft_chatbot_attempt",
+  "candidate": "pure_ssm_196m_branch_rms_only",
+  "pretrain_token_positions": 4000000000,
+  "pretrain_steps": 976563,
+  "batch_size": 8,
+  "seq_len": 512,
+  "sft_steps": 50000,
+  "pretrain_lr": 0.0008,
+  "sft_lr": 0.00005,
+  "save_every_steps": 100000,
+  "block_residual_rms_cap": "null",
+  "notes": [
+    "Uses streaming JSONL training through TaoTrain CLI.",
+    "Counts token positions as batch_size * seq_len * max_steps.",
+    "Keeps SSM branch RMS normalization enabled and block residual RMS normalization disabled.",
+    "Runs corrected response-only SFT after pretraining."
+  ]
+}

artifacts/status.json ADDED Viewed

	@@ -0,0 +1 @@


1	+ {"state":"completed","updated_at":"2026-05-15T19:22:45+08:00","job_dir":"/home/student/YouZheng/jobs/taotern/taotern-200m-branch-only-chat-20260514","exit_code":0,"end_time":"2026-05-15T19:22:45+08:00"}

code/TaoTrain/.gitignore ADDED Viewed

	@@ -0,0 +1,150 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+.python-version
+# pipenv
+Pipfile.lock
+# PEP 582
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# IDEs
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+Thumbs.db
+# Project-specific
+checkpoints/
+runs/
+results/
+*.pt
+*.pth
+*.safetensors
+aim_logs/
+.aim/
+wandb/
+lightning_logs/
+# Dataset cache
+.cache/
+datasets_cache/

code/TaoTrain/README.md ADDED Viewed

	@@ -0,0 +1,423 @@

+# TaoTrain: Production-Grade LLM Training Framework
+**TaoTrain** is a sophisticated PyTorch framework for training large language models at every scale—from experimental pretraining through supervised fine-tuning to reinforcement learning. Unlike fragmented training scripts or heavyweight frameworks, TaoTrain unifies the **entire training pipeline** in a clean, modular codebase that appeals to both ML engineers and software engineers.
+## Current Taotern Work
+TaoTrain now includes the Taotern comparison architectures used by the current SSM LLM work:
+- `taonet`: the attention/MLA baseline.
+- `taonet_ssm`: the TaoNet shell with the attention mixer replaced by the Gamma Space Model DPLR SSM.
+- `taonet_hybrid`: an alternating attention/SSM TaoNet used for the current best 200M-class candidate.
+The current selected deployment-oriented run is `hybrid_ssm_first_199m`, a `199,480,928` parameter model with 16 layers: SSM layers at `0,2,4,6,8,10,12,14` and attention layers at `1,3,5,7,9,11,13,15`. It uses the DPLR SSM core with split two-lane mixing, channel gates, per-channel local shift, and the faster convolution path for long-sequence training.
+Remote run `taotern-200m-hybrid-chat-20260512` trains this model on TaoData for a 4B-token base stage and then runs SFT so the final artifact can be loaded as a chat model. The trainable fixes added for this run are:
+- Async JSONL iteration keeps polling while tokenization workers are alive instead of ending early after a temporary empty queue.
+- Cached JSONL scan metadata is reused safely while recomputing chunk ranges for the active `samples_per_chunk` and `max_samples` settings.
+## Why TaoTrain?
+- **Complete Unified Pipeline**: Pretraining → SFT → RL in a single, consistent framework. No context switching between different codebases or architectures.
+- **Production-Grade Engineering**: Type-safe Pydantic configs, comprehensive checkpointing, AimStack integration, and proper gradient handling—not research code, but a framework you can deploy.
+- **Extensibility Without Modification**: Register custom models, optimizers, schedulers, and datasets via decorators. Experiment freely without forking the framework.
+- **Developer Experience First**: Interactive TUI for inference, intuitive YAML configurations, async data loading that eliminates I/O bottlenecks, and clear abstractions that make the codebase a pleasure to work with.
+## Key Capabilities
+| Capability | Details |
+|---|---|
+| **Multi-Stage Training** | Unified infrastructure for pretraining, SFT, and RL. Share model checkpoints, logging, and evaluation across stages. |
+| **Advanced Optimization** | Hybrid Muon + AdamW optimizer: efficient 2D weight updates via SVD-based methods + adaptive learning for 1D parameters. |
+| **Modern Architectures** | DeepSeek MLA with grouped query attention (GQA), YaRN context extension, and factorized embeddings—all configurable via YAML. |
+| **Production Features** | BF16 mixed precision training, gradient accumulation, proper gradient clipping, checkpoint resumption, and validation loops. |
+| **Async Data Pipeline** | Background tokenization with multi-threaded workers. Stream billion-token datasets from JSONL without loading into memory. |
+| **Interactive Inference** | TUI chat interface with real-time generation speed metrics and multi-model comparison. |
+| **Logging & Monitoring** | AimStack integration tracks loss, metrics, hyperparameters, and git hashes for reproducibility. Visualize training runs in your browser. |
+## Getting Started
+### Installation
+```bash
+git clone https://github.com/lobakkang/taoTrain.git
+cd taoTrain
+pip install -e .
+```
+### Training Examples
+**Pretraining on a custom dataset:**
+```bash
+train pretrain --config configs/pretrain.yaml
+```
+Starts from scratch, learns representations from raw text via next-token prediction.
+**Supervised Fine-tuning:**
+```bash
+train sft --config configs/sft.yaml
+```
+Fine-tune a pretrained model on instruction-response pairs for improved task performance.
+**Reinforcement Learning (DPO):**
+```bash
+train rl --config configs/rl_dpo.yaml
+```
+Align models with human preferences using Direct Preference Optimization.
+**Interactive Chat:**
+```bash
+tui-chat --model checkpoints/model.pt
+```
+Launch an interactive TUI to chat with your model and monitor generation metrics in real-time.
+### Configuration
+All training is configured via YAML with Pydantic validation. Configs are type-safe and automatically validated:
+```yaml
+# configs/sft.yaml
+model:
+  architecture_type: "mla"  # DeepSeek MLA with GQA
+  hidden_dim: 2048
+  num_layers: 24
+  num_heads: 32
+  d_latent_kv: 1536  # KV compression factor
+training:
+  num_epochs: 3
+  batch_size: 32
+  learning_rate: 1e-4
+  warmup_ratio: 0.1
+  max_grad_norm: 1.0
+optimizer:
+  optimizer_type: "muon_adamw"  # Hybrid Muon + AdamW
+  muon_momentum: 0.95
+data:
+  dataset_type: "sft_jsonl"  # or "sft_hf" for HuggingFace
+  path: "data/sft_training.jsonl"
+logging:
+  log_to_aim: true
+  aim_repo: "/tmp/aim_logs"
+```
+See `configs/` for complete examples.
+## Project Architecture
+```
+src/taoTrain/
+├── cli.py                      # Main CLI entry point
+├── config.py                   # Pydantic configuration schemas
+│
+├── core/                       # Base abstractions
+│   └── base.py                 # BaseModel, BaseDataset, BaseTrainer
+│
+├── models/                     # Pluggable architecture system
+│   ├── registry.py             # Architecture factory with @register_architecture
+│   ├── taonet.py               # SimpleLLM with DeepSeek MLA
+│   ├── mla_components.py       # KV compression, GQA, YaRN
+│   ├── embeddings.py           # Factorized embeddings
+│   └── transformer.py          # Standard Transformer reference
+│
+├── data/                       # Advanced data pipeline
+│   ├── factory.py              # Dataset factory (HF + JSONL backends)
+│   ├── async_loader.py         # Async batch iteration (no I/O bottleneck)
+│   ├── tokenization_queue.py   # Background multi-threaded tokenization
+│   ├── chunk_manager.py        # Stream billion-token JSONL files
+│   ├── hf_pretrain.py          # HuggingFace pretraining datasets
+│   ├── hf_sft.py               # HuggingFace SFT datasets
+│   ├── hf_rl.py                # HuggingFace RL datasets
+│   ├── pretrain_jsonl.py       # JSONL pretraining
+│   ├── sft_jsonl.py            # JSONL SFT with instructions
+│   └── rl_jsonl.py             # JSONL RL with preferences
+│
+├── training/                   # Unified training infrastructure
+│   └── trainer.py              # Trainer + PretrainTrainer, SFTTrainer, RLTrainer
+│
+├── optimizers/                 # Pluggable optimizer system
+│   ├── registry.py             # Optimizer factory with @register_optimizer
+│   ├── hybrid_muon_adamw.py    # Composite: Muon (2D) + AdamW (1D)
+│   ├── adamw.py                # AdamW with weight decay
+│   ├── adam.py                 # Standard Adam
+│   └── sgd.py                  # SGD variants
+│
+├── schedulers/                 # Learning rate schedules
+│   ├── registry.py             # LR scheduler factory
+│   ├── cosine_warmup.py        # 3-phase: linear warmup → plateau → cosine decay
+│   ├── linear_warmup.py        # Linear warmup + constant
+│   └── constant.py             # Constant learning rate
+│
+├── inference/                  # Inference & interaction
+│   ├── inferencer.py           # Load & run inference from checkpoints
+│   └── tui.py                  # Interactive chat with metrics display
+│
+├── checkpointing/              # State management
+│   └── checkpoint.py           # Save/load model + optimizer + config + metrics
+│
+├── logging/                    # Experiment tracking
+│   └── aim_logger.py           # AimStack integration (loss, metrics, hyperparams)
+│
+├── benchmarks/                 # Evaluation tools
+│   └── runner.py               # Perplexity, speed, and task-specific benchmarks
+│
+└── utils/
+    └── helpers.py              # Utility functions
+configs/                        # Example YAML configurations
+├── pretrain.yaml               # Pretraining config
+├── sft.yaml                    # SFT config
+├── rl_dpo.yaml                 # RL/DPO config
+└── tokenizer.yaml              # Tokenizer config
+tests/                          # Unit & integration tests
+└── test_dataset.py
+```
+## Extensible Architecture: The Registry Pattern
+TaoTrain's power lies in its **pluggable design**. Add custom models, optimizers, schedulers, and datasets without modifying the framework.
+### Custom Model Architecture
+```python
+from taoTrain.models import register_architecture, BaseModel
+import torch.nn as nn
+@register_architecture("custom_moe")
+class MixtureOfExperts(BaseModel):
+    """Your custom MoE architecture"""
+    def __init__(self, config):
+        super().__init__(config)
+        self.experts = nn.ModuleList([
+            nn.Linear(config.hidden_dim, config.hidden_dim)
+            for _ in range(config.num_experts)
+        ])
+        self.router = nn.Linear(config.hidden_dim, config.num_experts)
+    def forward(self, input_ids, attention_mask=None):
+        # Your implementation
+        logits = self.compute_logits(input_ids)
+        loss = self.compute_loss(logits, labels) if labels is not None else None
+        return {"logits": logits, "loss": loss}
+```
+Then use it in your config:
+```yaml
+model:
+  architecture_type: "custom_moe"
+  hidden_dim: 2048
+  num_experts: 8
+```
+### Custom Optimizers & Schedulers
+The same pattern works for optimizers and learning rate schedules:
+```python
+from taoTrain.optimizers import register_optimizer
+from torch.optim import Optimizer
+@register_optimizer("my_adaptive_optimizer")
+class MyAdaptiveOptimizer(Optimizer):
+    def step(self, closure=None):
+        # Your optimization logic
+        pass
+```
+```python
+from taoTrain.schedulers import register_scheduler
+@register_scheduler("my_schedule")
+def my_schedule(initial_lr, step, total_steps, **kwargs):
+    return initial_lr * (1.0 - step / total_steps)  # Linear decay
+```
+**The key principle**: No framework code needs to change. You register once, it's available everywhere.
+### Dataset Backend Flexibility
+Define custom datasets (JSONL, HF, streaming, etc.) and let the factory route to them:
+```python
+from taoTrain.data import register_dataset
+@register_dataset("pretrain", "my_backend")
+class MyPretrainDataset(BaseDataset):
+    def __init__(self, config):
+        # Load from your custom backend
+        pass
+    def __getitem__(self, idx):
+        return {"input_ids": ..., "attention_mask": ...}
+```
+Use in config:
+```yaml
+data:
+  dataset_type: "pretrain"
+  backend_type: "my_backend"  # Routes to MyPretrainDataset
+```
+## Why TaoTrain Framework?
+### Async Data Loading: No I/O Bottleneck
+Most training frameworks load and tokenize data on the main training thread, blocking compute. TaoTrain's **multi-threaded tokenization pipeline**:
+- Tokenizes data in background workers while your GPU trains
+- Supports streaming billion-token JSONL files without loading into memory
+- Intelligent chunking (by file size or sample count)
+- Metadata caching to avoid rescanning
+**Result**: 10-100x faster data iteration on large datasets.
+### Type-Safe Configuration
+Forget YAML parsing errors or mysterious config bugs. TaoTrain uses **Pydantic dataclasses** for configuration:
+- Automatic type validation: mistyped `learning_rate: "1e-4"` becomes an error, not silent failure
+- Serialization: configs are part of checkpoints, ensuring reproducibility
+- IDE support: autocomplete and type hints for all config fields
+- Defaults: sensible defaults for all parameters
+### Benchmarking & Metrics
+Track what matters:
+- **Perplexity**: Language modeling quality on held-out data
+- **Generation Speed**: Tokens-per-second (useful for TUI or deployment)
+- **Task-Specific Accuracy**: Evaluate on downstream tasks
+- **Training Metrics**: Loss curves, gradient norms, effective batch size
+All logged to AimStack with git hashes for reproducibility.
+## Logging with AimStack
+Automatically track and visualize experiments:
+```bash
+aim up --host 0.0.0.0
+```
+Then open `http://localhost:43800` to see:
+- **Loss curves** per training step
+- **Hyperparameters** (learning rate, batch size, model architecture)
+- **Git hashes** for reproducibility
+- **Custom metrics** (perplexity, validation accuracy, generation speed)
+- **Compare runs**: Side-by-side experiment comparison
+## Advanced Features
+### Checkpointing with Resumption
+TaoTrain saves complete training state:
+```python
+checkpoint = {
+    "step": 12500,
+    "model_state": model.state_dict(),
+    "optimizer_state": optimizer.state_dict(),
+    "config": config,  # Full config as Pydantic object
+    "metrics": metrics_tracker.to_dict(),
+}
+```
+Resume training from any checkpoint without loss of state. Keep last N checkpoints automatically.
+### Mixed Precision Training (BF16)
+```yaml
+training:
+  use_bfloat16: true
+  gradient_accumulation_steps: 4
+```
+- BF16 via `torch.autocast` for ~2x speedup with minimal accuracy loss
+- Proper gradient scaling and clipping
+- Compatible with all optimizers and architectures
+### 3-Phase Learning Rate Schedule
+```yaml
+scheduler:
+  scheduler_type: "cosine_warmup"
+  warmup_ratio: 0.1          # 10% of training steps
+  steady_ratio: 0.5          # 50% at steady rate
+  min_lr_ratio: 0.1          # Final LR = 0.1 × initial_lr
+  num_cycles: 1
+```
+This schedule:
+1. **Linear warmup** (0 → 1) over 10% of steps
+2. **Steady plateau** at full LR over 50% of steps
+3. **Cosine decay** (1 → 0.1) over remaining 40% of steps
+Better convergence than simple cosine or linear decay.
+### Gradient Accumulation & Clipping
+Simulate larger batch sizes with gradient accumulation:
+```yaml
+training:
+  batch_size: 32
+  gradient_accumulation_steps: 4  # Effective batch = 128
+  max_grad_norm: 1.0               # Gradient clipping
+```
+## Contributing
+Contributions are welcome! TaoTrain is designed to make contributions easy:
+1. **Add a model**: Implement `BaseModel` and `@register_architecture("name")`
+2. **Add an optimizer**: Implement `torch.optim.Optimizer` and `@register_optimizer("name")`
+3. **Add a dataset**: Implement `BaseDataset` and `@register_dataset(mode, backend_type)`
+4. **Improve the core**: Submit PRs to `training/`, `data/`, `logging/`, etc.
+Ensure new code includes:
+- Type hints throughout
+- Pydantic configs for new parameters
+- Unit tests in `tests/`
+- Documentation in docstrings and README
+## Current Scope & Roadmap
+### ✅ Currently Supported
+- **Single GPU / single node** training
+- **Pretraining, SFT, and RL training** stages
+- **HuggingFace and JSONL** data backends
+- **BF16 mixed precision** training
+- **Checkpoint saving/loading** with resumption
+- **Interactive inference** via TUI
+- **Benchmarking** (perplexity, speed)
+- **Pluggable architectures, optimizers, schedulers, datasets**
+### 🚀 Roadmap (Future)
+- **Distributed training** (DDP, FSDP) for multi-GPU/multi-node scaling
+- **Quantization** support (INT8, QLoRA)
+- **Advanced evaluation** (BLEU, ROUGE, custom tasks)
+- **Streaming inference** with KV cache
+- **Speculative decoding** for faster generation
+- **Integration with popular model hubs** (Hugging Face Hub upload/download)
+---
+## Getting Help
+- **Questions?** Open an issue on GitHub
+- **Want to contribute?** See `CONTRIBUTING.md` (coming soon)
+- **Found a bug?** Report it with a minimal reproduction script
+## License
+MIT

code/TaoTrain/configs/pretrain.yaml ADDED Viewed

	@@ -0,0 +1,138 @@

+# TaoNet T2 Configuration for Pretraining
+# DeepSeek MLA + RoPE with Hybrid Muon+AdamW Optimizer
+# Full BF16 precision (no quantization)
+# ============================================================================
+# Model Architecture - TaoNet (DeepSeek MLA + RoPE)
+# ============================================================================
+model:
+  architecture_type: taonet
+  vocab_size: 8192
+  hidden_dim: 768
+  num_layers: 12
+  num_heads: 8
+  max_seq_length: 1024
+  # TaoNet-specific: Multi-head Latent Attention (MLA) compression
+  d_latent_kv: 512
+  # RoPE (Rotary Position Embedding) dimension per head
+  # Default would be 512 / 8 = 64
+  d_rope: 64
+  # Feed-forward intermediate dimension
+  hidden_dim_ff: 2048
+  # Dropout rate (low for stability with large models)
+  dropout: 0.02
+  # Grouped Query Attention (1 = standard MLA, >1 = GQA)
+  gqa_groups: 1
+  # Optional: Use factorized embedding for parameter efficiency
+  # vocab (8192) → rank (96) → hidden (512)
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  # Weight initialization standard deviation
+  init_std: 0.02
+# ============================================================================
+# Dataset Configuration - Local JSONL
+# ============================================================================
+dataset:
+  local: true
+  jsonl_path: /home/student/Data/TaoData/output.jsonl
+  text_field: text
+  max_samples: 1000000
+  samples_per_chunk: 1000
+  # Tokenizer configuration
+  tokenizer_type: sentencepiece
+  tokenizer_path: tokenizer/tokenizer.model
+  tokenizer_threads: 4
+# ============================================================================
+# Training Hyperparameters
+# ============================================================================
+batch_size: 32
+num_epochs: 2  # Set to 10 for full training
+gradient_accumulation_steps: 8  # Effective batch: 32 × 8 = 256
+# Maximum gradient norm for clipping (prevents ternary instability)
+max_grad_norm: 1.0
+# ============================================================================
+# Optimizer - Hybrid Muon + AdamW
+# ============================================================================
+# Strategy:
+# - Muon: For 2D Linear weight matrices (orthogonal/SVD-based optimization)
+#   - 2D weights: learning_rate (5e-3)
+# - AdamW: For 1D parameters (biases, norms, embeddings)
+#   - 1D params: adamw_lr (5e-4) = 1/10 × learning_rate
+optimizer:
+  optimizer_type: hybrid_muon_adamw
+  # Learning rate for Muon (2D weight matrices)
+  learning_rate: 5e-3
+  # Learning rate for AdamW (1D parameters)
+  # Typically 1/10 of learning_rate to prevent over-updating 1D params
+  adamw_lr: 5e-4
+  # L2 regularization (weight decay)
+  weight_decay: 0.01
+  # Adam betas
+  betas: [0.9, 0.999]
+  # Epsilon for numerical stability
+  eps: 1e-8
+# ============================================================================
+# Learning Rate Scheduler - 3-Phase Cosine with Warmup
+# ============================================================================
+# Phases:
+# 1. Warmup: 0 → 1.0 (300 steps, ~1.4% of training)
+# 2. Steady: 1.0 (constant for 5% of training)
+# 3. Decay: 1.0 → 0.1 (cosine decay for remaining 95%)
+scheduler:
+  scheduler_type: cosineWarmup
+  warmup_steps: 300
+  warmup_ratio: 0.0  # Ignored if warmup_steps > 0
+  steady_ratio: 0.05  # 5% of total training steps at peak LR
+  min_lr_ratio: 0.1   # Decay to 10% of peak LR
+  num_cycles: 0.5  # For compatibility (not used in 3-phase schedule)
+# ============================================================================
+# Data Type and Device
+# ============================================================================
+dtype: bfloat16  # Use BF16 for better convergence with large models
+device: cuda     # Use GPU for training
+# ============================================================================
+# Checkpointing and Validation
+# ============================================================================
+checkpoint_dir: checkpoints/test
+save_every_steps: 81920
+save_best_model: true
+keep_last_n_checkpoints: 3
+# Validation
+eval_every_steps: 8192
+eval_samples: 8000
+# ============================================================================
+# Logging
+# ============================================================================
+log_every_steps: 50
+aim_repo: .aim
+# ============================================================================
+# Miscellaneous
+# ============================================================================
+seed: 42
+num_workers: 0
+pin_memory: true

code/TaoTrain/configs/rl_dpo.yaml ADDED Viewed

	@@ -0,0 +1,60 @@

+# Example configuration for RL training (RL stage assumes you have a reward model)
+model:
+  architecture_type: transformer
+  vocab_size: 50257
+  hidden_dim: 256
+  num_layers: 8
+  num_heads: 8
+  dropout: 0.1
+  max_seq_length: 512
+  init_std: 0.02
+dataset:
+  dataset_name: allenai/real_toxicity_prompts
+  split: train
+  prompt_column: text
+  max_samples: 2000
+  cache_dir: .cache/datasets
+  tokenizer_threads: 1  # Number of background threads for tokenization (1-32 recommended)
+batch_size: 4
+num_epochs: 1
+gradient_accumulation_steps: 8
+max_grad_norm: 0.5
+optimizer:
+  optimizer_type: adamw
+  learning_rate: 1e-5
+  weight_decay: 0.0
+scheduler:
+  scheduler_type: linearWarmup
+  warmup_steps: 50
+dtype: bfloat16
+device: cuda
+checkpoint_dir: checkpoints/rl
+save_every_steps: 100
+save_best_model: false
+keep_last_n_checkpoints: 2
+eval_every_steps: 100
+eval_samples: 100
+log_every_steps: 10
+aim_repo: .aim
+# RL-specific settings
+rl_method: ppo  # or "dpo"
+reward_model_path: checkpoints/reward_model.pt  # Path to your reward model
+ppo_epochs: 4
+ppo_clip_ratio: 0.2
+entropy_coeff: 0.01
+value_loss_coeff: 1.0
+generation_max_length: 256
+seed: 42
+num_workers: 0
+pin_memory: true

code/TaoTrain/configs/sft.yaml ADDED Viewed

	@@ -0,0 +1,93 @@

+# Example configuration for supervised fine-tuning
+# Uses TaoNet (MLA+RoPE) architecture loaded from pretrained checkpoint
+# ============================================================================
+# Model Architecture - TaoNet (DeepSeek MLA + RoPE)
+# ============================================================================
+model:
+  architecture_type: taonet
+  vocab_size: 8192
+  hidden_dim: 768
+  num_layers: 12
+  num_heads: 8
+  max_seq_length: 1024
+  # TaoNet-specific: Multi-head Latent Attention (MLA) compression
+  d_latent_kv: 512
+  # RoPE (Rotary Position Embedding) dimension per head
+  # Default would be 512 / 8 = 64
+  d_rope: 64
+  # Feed-forward intermediate dimension
+  hidden_dim_ff: 2048
+  # Dropout rate (low for stability with large models)
+  dropout: 0.02
+  # Grouped Query Attention (1 = standard MLA, >1 = GQA)
+  gqa_groups: 1
+  # Optional: Use factorized embedding for parameter efficiency
+  # vocab (8192) → rank (96) → hidden (512)
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  # Weight initialization standard deviation
+  init_std: 0.02
+dataset:
+  split: train
+  instruction_column: input
+  response_column: output
+  local: true
+  jsonl_path: /home/student/Data/TaoData/sft.jsonl
+  samples_per_chunk: 1000
+  #max_samples: 2000000
+  max_samples: 1000000
+  cache_dir: .cache/datasets
+  instruction_template: "{instruction}\n{response}"
+  # Tokenizer configuration
+  tokenizer_type: sentencepiece
+  tokenizer_path: tokenizer/tokenizer.model
+  tokenizer_threads: 4
+# SFT-specific configuration (these fields are in SFTConfig)
+checkpoint_path: "checkpoints/pretrain/final_model.pt"
+user_token: "<user>"
+assistant_token: "<assistant>"
+response_loss_only: true
+batch_size: 8
+num_epochs: 1
+gradient_accumulation_steps: 4
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: 5e-5  # Lower LR for fine-tuning (vs 5e-4 pretrain base, 5e-3 Muon)
+  weight_decay: 0.01
+scheduler:
+  scheduler_type: linearWarmup
+  warmup_steps: 500  # Less aggressive warmup for fine-tuning
+dtype: bfloat16
+device: cuda
+checkpoint_dir: checkpoints/sft
+save_every_steps: 81920
+save_best_model: true
+keep_last_n_checkpoints: 2
+eval_every_steps: 8192
+eval_samples: 200
+log_every_steps: 10
+aim_repo: .aim
+seed: 42
+num_workers: 0
+pin_memory: true

code/TaoTrain/configs/ssm_pretrain.yaml ADDED Viewed

	@@ -0,0 +1,87 @@

+# TaoNet-SSM T2 Configuration for Pretraining
+# TaoNet shell with Gamma SSM replacing the MLA attention mixer.
+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 768
+  num_layers: 12
+  num_heads: 8
+  max_seq_length: 1024
+  d_latent_kv: 512
+  d_rope: 64
+  hidden_dim_ff: 2048
+  dropout: 0.02
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 512
+  ssm_mixer_dim: 256
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_discretization: bilinear
+  ssm_kernel_mode: auto
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+dataset:
+  local: true
+  jsonl_path: /home/student/Data/TaoData/output.jsonl
+  text_field: text
+  max_samples: 1000000
+  samples_per_chunk: 1000
+  tokenizer_type: sentencepiece
+  tokenizer_path: tokenizer/tokenizer.model
+  tokenizer_threads: 4
+batch_size: 32
+num_epochs: 2
+gradient_accumulation_steps: 8
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: hybrid_muon_adamw
+  learning_rate: 5e-3
+  adamw_lr: 5e-4
+  weight_decay: 0.01
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: cosineWarmup
+  warmup_steps: 300
+  warmup_ratio: 0.0
+  steady_ratio: 0.05
+  min_lr_ratio: 0.1
+  num_cycles: 0.5
+dtype: bfloat16
+device: cuda
+checkpoint_dir: checkpoints/ssm_test
+save_every_steps: 81920
+save_best_model: true
+keep_last_n_checkpoints: 3
+eval_every_steps: 8192
+eval_samples: 8000
+log_every_steps: 50
+aim_repo: .aim
+seed: 42
+num_workers: 0
+pin_memory: true

code/TaoTrain/configs/tokenizer.yaml ADDED Viewed

	@@ -0,0 +1,44 @@

+# Example configuration for training a SentencePiece tokenizer from JSONL data
+# Dataset source - JSONL file
+jsonl_path: /home/student/Data/TaoData/output.jsonl
+text_field: text  # Field name in JSON for text data
+# Tokenizer training parameters
+vocab_size: 8192
+model_type: unigram  # SentencePiece model type: unigram, bpe, char, word
+character_coverage: 0.9995
+# Output configuration
+output_dir: tokenizer
+tokenizer_prefix: tokenizer
+# Token ID configuration
+unk_id: 0  # Unknown token ID
+bos_id: 1  # Beginning of sentence token ID
+eos_id: 2  # End of sentence token ID
+pad_id: 3  # Padding token ID
+# Custom special tokens
+# These will be added to the vocabulary with explicit IDs
+# Useful for control tokens like <think>, <user>, <assistant>, etc.
+# Note: Use \n for newline token, \t for tab, etc.
+special_tokens:
+  <PAD>: 3      # Padding (typically same as pad_id above)
+  <EOS>: 2      # End of sentence (typically same as eos_id above)
+  <BOS>: 1      # Beginning of sentence (typically same as bos_id above)
+  <UNK>: 0      # Unknown (typically same as unk_id above)
+  "\n": 4       # Newline token - quoted to preserve literal \n in YAML
+  <think>: 8    # Special token for chain-of-thought reasoning
+  <user>: 9     # User message token
+  <assistant>: 10  # Assistant message token
+  <image>: 11   # Image token for multimodal models
+# Data sampling (optional)
+# Set to a number to train on only the first N samples from the JSONL file
+# Useful for quick testing or sub-sampling large datasets
+# Omit or set to null to use entire file
+max_samples: 1000000
+# Optional metadata
+tokenizer_name: tokenizer

code/TaoTrain/configs/tokenizer_taodata_pilot.yaml ADDED Viewed

	@@ -0,0 +1,22 @@

+# Pilot SentencePiece tokenizer for remote TaoData benchmarks.
+#
+# This is intentionally smaller than full production tokenizer training so the
+# model-comparison loop can validate attention vs SSM on realistic tokenization.
+jsonl_path: /home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
+text_field: text
+vocab_size: 8192
+model_type: unigram
+character_coverage: 0.9995
+output_dir: /home/student/YouZheng/tokenizers/taodata_pilot_8k
+tokenizer_prefix: tokenizer
+unk_id: 0
+bos_id: 1
+eos_id: 2
+pad_id: 3
+max_samples: 20000
+tokenizer_name: taodata_pilot_8k

code/TaoTrain/configs/yarn_pretrain.yaml ADDED Viewed

	@@ -0,0 +1,181 @@

+# TaoNet T2 Configuration for YaRN Continued Pretraining
+# Extended Context: 1024 → 8192 tokens with frequency interpolation
+# Built on DeepSeek MLA + RoPE with Hybrid Muon+AdamW Optimizer
+# Full BF16 precision (no quantization)
+# ============================================================================
+# Model Architecture - TaoNet (DeepSeek MLA + RoPE) with YaRN Extension
+# ============================================================================
+model:
+  architecture_type: taonet
+  vocab_size: 8192
+  hidden_dim: 512
+  num_layers: 12
+  num_heads: 8
+  max_seq_length: 8192  # Extended from 1024 → 8192 (8x longer context)
+  # TaoNet-specific: Multi-head Latent Attention (MLA) compression
+  # KV dimension reduced from 512 to 384 (25% memory savings)
+  d_latent_kv: 384
+  # RoPE (Rotary Position Embedding) dimension per head
+  # Default would be 512 / 8 = 64
+  d_rope: 64
+  # Feed-forward intermediate dimension
+  # Default would be 4 * 512 = 2048
+  hidden_dim_ff: 1024
+  # Dropout rate (low for stability)
+  dropout: 0.02
+  # Grouped Query Attention (1 = standard MLA, >1 = GQA)
+  gqa_groups: 1
+  # Optional: Use factorized embedding for parameter efficiency
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  # Weight initialization standard deviation
+  init_std: 0.02
+  # ========================================================================
+  # YaRN (Yet another RoPE eXtension) Configuration
+  # ========================================================================
+  # Enables frequency interpolation to extend context length from 1024 → 8192
+  # The model learns to "pack" RoPE frequencies into the new longer context during training.
+  # RoPE base scale factor (explicit, previously hardcoded to 40)
+  rope_scale: 40.0
+  # Enable YaRN frequency interpolation
+  yarn_enabled: true
+  # Interpolation smoothness parameter
+  # - 1.0 (default): Smooth, gradual interpolation—safer for learning extended context
+  # - 0.5: Aggressive interpolation—faster context expansion, higher risk
+  # - 2.0: Conservative interpolation—safer but slower adaptation
+  # Recommendation: Start with 1.0; tune in follow-up runs if convergence issues
+  yarn_alpha: 1.0
+# ============================================================================
+# Dataset Configuration - Local JSONL (Same as Pretrain)
+# ============================================================================
+dataset:
+  local: true
+  jsonl_path: /home/student/Data/TaoData/output.jsonl
+  text_field: text
+  max_samples: 50000  # Reduced from 1M → 50k for quick YaRN adaptation
+  samples_per_chunk: 1000
+  # Tokenizer configuration (unchanged)
+  tokenizer_type: sentencepiece
+  tokenizer_path: tokenizer/tokenizer.model
+  tokenizer_threads: 4
+# ============================================================================
+# Training Hyperparameters - Conservative for Context Extension
+# ============================================================================
+# Strategy: Lower learning rates + smaller batch to prevent catastrophic forgetting
+# while the model learns to use 8x longer context.
+batch_size: 16  # Reduced from 32 (8192 tokens/seq is memory-intensive)
+num_epochs: 1   # 50k samples / effective_batch=256 ≈ 200 updates (1 epoch sufficient for warm-start)
+# Gradient accumulation to maintain effective batch size of ~256
+# Effective batch = batch_size × gradient_accumulation_steps = 16 × 16 = 256
+gradient_accumulation_steps: 16
+# Maximum gradient norm for clipping
+max_grad_norm: 1.0
+# ============================================================================
+# Optimizer - Hybrid Muon + AdamW (Conservative LR for Stability)
+# ============================================================================
+# Strategy: Use 1/2 of pretrain learning rates to:
+#   1. Avoid catastrophic forgetting of learned features
+#   2. Allow smooth adaptation to YaRN-scaled RoPE frequencies
+#   3. Give the model time to learn how to use extended context
+optimizer:
+  optimizer_type: hybrid_muon_adamw
+  # Learning rate for Muon (2D weight matrices)
+  # Reduced: 5e-3 → 2.5e-3 (50% of pretrain)
+  learning_rate: 2.5e-3
+  # Learning rate for AdamW (1D parameters)
+  # Reduced: 5e-4 → 1.25e-4 (25% of pretrain, maintains 1/10 ratio)
+  adamw_lr: 1.25e-4
+  # L2 regularization (weight decay)
+  weight_decay: 0.01
+  # Adam betas (unchanged)
+  betas: [0.9, 0.999]
+  # Epsilon for numerical stability
+  eps: 1e-8
+# ============================================================================
+# Learning Rate Scheduler - 3-Phase Cosine with Warmup (Same as Pretrain)
+# ============================================================================
+# Phases:
+# 1. Warmup: 0 → 1.0 (300 steps, ~1.4% of training)
+# 2. Steady: 1.0 (constant for 5% of training steps at peak LR)
+# 3. Decay: 1.0 → 0.1 (cosine decay for remaining ~95%)
+scheduler:
+  scheduler_type: cosineWarmup
+  warmup_steps: 300
+  warmup_ratio: 0.0  # Ignored if warmup_steps > 0
+  steady_ratio: 0.05  # 5% of total training steps at peak LR
+  min_lr_ratio: 0.1   # Decay to 10% of peak LR
+  num_cycles: 0.5  # For compatibility (not used in 3-phase schedule)
+# ============================================================================
+# Data Type and Device
+# ============================================================================
+dtype: bfloat16  # Use BF16 for better convergence with extended context
+device: cuda     # Use GPU for training
+# ============================================================================
+# Checkpointing and Validation
+# ============================================================================
+# Load pretrained checkpoint and continue training
+checkpoint_path: checkpoints/pretrain_taonet/best_model.pt
+checkpoint_dir: checkpoints/yarn_taonet
+save_every_steps: 512    # More frequent saves for 50k samples (200 updates total)
+save_best_model: true
+keep_last_n_checkpoints: 3
+# Validation every 512 steps (10% of 50k samples)
+eval_every_steps: 512
+eval_samples: 2500  # Reduced from 8000
+# ============================================================================
+# Logging
+# ============================================================================
+log_every_steps: 50  # Log every 50 updates
+aim_repo: .aim
+# ============================================================================
+# Miscellaneous
+# ============================================================================
+seed: 42
+num_workers: 0
+pin_memory: true
+# ============================================================================
+# YaRN Performance Notes
+# ============================================================================
+# Expected memory usage: ~1.5x of pretrain (8x longer seq, half batch)
+# Expected training time: ~50-100 steps/min on H100 (depends on setup)
+# Expected convergence: Loss should decrease over 50k samples; monitor perplexity on 8192-length sequences
+#
+# Tuning recommendations for iterative improvements:
+# 1. If loss is unstable: Reduce learning_rate further (1.25e-3)
+# 2. If loss plateaus quickly: Increase max_samples (100k-200k)
+# 3. If memory OOM: Reduce batch_size to 8 (maintain grad_accum at 16)
+# 4. To speed context expansion: Reduce yarn_alpha to 0.5 (more aggressive)
+# 5. For safer training: Increase yarn_alpha to 2.0 (more conservative)

code/TaoTrain/data/sample_data.jsonl ADDED Viewed

	@@ -0,0 +1,10 @@

+{"text": "The quick brown fox jumps over the lazy dog."}
+{"text": "Python is a powerful programming language used for data science, machine learning, and web development."}
+{"text": "Artificial intelligence and machine learning are transforming industries and creating new opportunities."}
+{"text": "Natural language processing enables computers to understand and generate human language."}
+{"text": "Deep learning models like transformers have revolutionized the field of artificial intelligence."}
+{"text": "Transfer learning allows us to leverage pre-trained models to solve new tasks more efficiently."}
+{"text": "The transformer architecture introduced attention mechanisms that became fundamental to modern NLP."}
+{"text": "Language models trained on large corpora can perform impressive few-shot learning tasks."}
+{"text": "Tokenization is a crucial preprocessing step in natural language processing pipelines."}
+{"text": "SentencePiece is a language-independent tokenization algorithm that handles subword segmentation."}

code/TaoTrain/data/sample_sft_multi.jsonl ADDED Viewed

	@@ -0,0 +1,4 @@

+{"turns": [{"user": "What is machine learning?", "assistant": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms to identify patterns in data."}, {"user": "Can you give me some examples?", "assistant": "Sure! Examples include: 1) Email spam detection - learns to identify spam patterns, 2) Image recognition - learns to identify objects in images, 3) Recommendation systems - learns user preferences to suggest content, 4) Natural language processing - learns language patterns for tasks like translation."}]}
+{"turns": [{"user": "Explain neural networks", "assistant": "Neural networks are computing systems inspired by biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight that gets adjusted during training to learn patterns from data."}, {"user": "What layers do they have?", "assistant": "Neural networks typically have three types of layers: 1) Input layer - receives raw data, 2) Hidden layers - process data through weighted connections and activation functions, 3) Output layer - produces the final prediction or classification. Deep networks have multiple hidden layers."}]}
+{"turns": [{"user": "What is a transformer model?", "assistant": "A transformer is a neural network architecture that uses self-attention mechanisms to weigh the importance of different input elements. It processes all input tokens in parallel (unlike RNNs) and captures long-range dependencies effectively. Transformers form the basis of modern large language models like GPT and BERT."}, {"user": "How does attention work?", "assistant": "Attention computes relevance scores between query tokens and all other tokens (keys and values). It learns which tokens to focus on when processing each position. Multi-head attention applies this in parallel with multiple learned representations. This allows the model to attend to different types of relationships simultaneously."}, {"user": "Why is it better than RNNs?", "assistant": "Transformers are better than RNNs because: 1) Parallel processing - all tokens processed simultaneously instead of sequentially, 2) Better long-range dependencies - attention can directly connect distant tokens, 3) Easier to train - no vanishing gradient problems, 4) Scalability - transformers scale better to large datasets and models."}]}
+{"turns": [{"user": "What is the difference between supervised and unsupervised learning?", "assistant": "Supervised learning uses labeled data (input-output pairs) to train models, like predicting house prices from features. Unsupervised learning finds patterns in unlabeled data, like clustering similar customers. Supervised learning is more accurate but requires labeled data; unsupervised is useful for exploration."}, {"user": "What about semi-supervised learning?", "assistant": "Semi-supervised learning uses both labeled and unlabeled data. It's useful when labeling is expensive. The unlabeled data helps improve performance by providing additional patterns. Common techniques include pseudo-labeling (labeling unlabeled data with the current model) and consistency regularization."}]}

code/TaoTrain/data/sample_sft_single.jsonl ADDED Viewed

	@@ -0,0 +1,5 @@

+{"input": "What is Python?", "output": "Python is a high-level, interpreted programming language known for its simplicity and readability. It supports multiple programming paradigms including procedural, object-oriented, and functional programming."}
+{"input": "Explain recursion", "output": "Recursion is a programming technique where a function calls itself to solve a problem by breaking it down into smaller subproblems. Base cases prevent infinite recursion. Example: factorial(n) = n * factorial(n-1) for n > 1, else 1."}
+{"input": "What are decorators in Python?", "output": "Decorators are functions that modify or wrap other functions or classes without permanently changing their source code. They use the @ symbol and are executed when the function is defined. Example: @staticmethod, @property, or custom decorators."}
+{"input": "How do you handle exceptions in Python?", "output": "Exceptions are handled using try-except-else-finally blocks. The try block contains code that might raise an exception, except catches specific exceptions, else runs if no exception occurred, and finally runs regardless. Custom exceptions can be created by inheriting from Exception."}
+{"input": "What is the difference between lists and tuples?", "output": "Lists are mutable (can be changed after creation) while tuples are immutable (cannot be changed). Lists use square brackets [] and tuples use parentheses (). Both can contain mixed data types. Tuples are faster and can be used as dictionary keys."}

code/TaoTrain/pyproject.toml ADDED Viewed

	@@ -0,0 +1,65 @@

+[build-system]
+requires = ["setuptools>=68.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "taoTrain"
+version = "0.1.0"
+description = "Clean, modular PyTorch LLM training framework with pluggable architectures, AimStack logging, and TUI inference"
+readme = "README.md"
+requires-python = ">=3.10"
+license = { text = "MIT" }
+authors = [
+    { name = "Felix", email = "felix@example.com" }
+]
+dependencies = [
+    "torch>=2.0.0",
+    "transformers>=4.30.0",
+    "datasets>=2.10.0",
+    "pydantic>=2.0.0",
+    "pydantic-settings>=2.0.0",
+    "aim>=3.15.0",
+    "click>=8.1.0",
+    "rich>=13.0.0",
+    "textual>=0.30.0",
+    "numpy>=1.24.0",
+    "tqdm>=4.65.0",
+    "sentencepiece>=0.1.99",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.4.0",
+    "pytest-cov>=4.1.0",
+    "pytest-xdist>=3.3.0",
+    "black>=23.7.0",
+    "ruff>=0.0.280",
+    "typing-extensions>=4.7.0",
+]
+[project.scripts]
+train = "taoTrain.cli:main"
+train-tokenizer = "taoTrain.cli:train_tokenizer_command"
+tui-chat = "taoTrain.inference.tui:main"
+[tool.setuptools.packages.find]
+where = ["src"]
+[tool.setuptools.package-data]
+taoTrain = ["configs/**/*.yaml"]
+[tool.black]
+line-length = 100
+target-version = ["py310"]
+[tool.ruff]
+line-length = 100
+target-version = "py310"
+select = ["E", "F", "W", "I", "N", "UP", "RUF"]
+ignore = ["E501"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+python_files = "test_*.py"
+addopts = "--verbose"

code/TaoTrain/scripts/benchmark_taonet_real_tokens.py ADDED Viewed

	@@ -0,0 +1,1034 @@

+"""Real-text token benchmark for TaoNet attention vs TaoNet-SSM.
+This script consumes a JSONL or plain-text corpus, tokenizes it into one long
+stream, samples contiguous next-token batches, and compares TaoNet variants
+with identical outer dimensions. It is intentionally lighter than the full
+trainer so it can be used in the RepoBridge iteration loop.
+"""
+from __future__ import annotations
+import argparse
+import hashlib
+from contextlib import nullcontext
+from contextlib import redirect_stdout
+import csv
+import io
+import json
+import math
+import os
+from pathlib import Path
+import platform
+import random
+import sys
+import time
+from typing import Any, Iterable
+import torch
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_ROOT = REPO_ROOT / "src"
+if str(SRC_ROOT) not in sys.path:
+    sys.path.insert(0, str(SRC_ROOT))
+from taoTrain.config import ModelConfig
+from taoTrain.models import get_model
+DTYPES = {
+    "float32": torch.float32,
+    "fp32": torch.float32,
+    "float16": torch.float16,
+    "fp16": torch.float16,
+    "bfloat16": torch.bfloat16,
+    "bf16": torch.bfloat16,
+}
+BYTE_PAD_ID = 0
+BYTE_EOS_ID = 1
+BYTE_UNK_ID = 2
+BYTE_OFFSET = 3
+BYTE_VOCAB_SIZE = 259
+def parse_int_list(value: str) -> list[int]:
+    return [int(item.strip()) for item in value.split(",") if item.strip()]
+def parse_float_list(value: str) -> list[float]:
+    return [float(item.strip()) for item in value.split(",") if item.strip()]
+def parse_str_list(value: str) -> list[str]:
+    return [item.strip() for item in value.split(",") if item.strip()]
+def stable_case_id(parts: dict[str, Any]) -> str:
+    payload = json.dumps(parts, sort_keys=True, separators=(",", ":"))
+    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]
+def make_case_parts(
+    args: argparse.Namespace,
+    *,
+    architecture: str,
+    ssm_hidden_dim: int | None,
+    ssm_mixer_dim: int | None,
+    ssm_num_lanes: int | None,
+    hybrid_pattern: str | None,
+    ssm_gate_type: str | None,
+    learning_rate: float,
+    weight_decay: float,
+    batch_size: int,
+) -> dict[str, Any]:
+    uses_ssm = architecture in {"taonet_ssm", "taonet_hybrid"}
+    return {
+        "architecture": architecture,
+        "batch_size": batch_size,
+        "seq_len": args.seq_len,
+        "hidden_dim": args.hidden_dim,
+        "num_layers": args.num_layers,
+        "num_heads": args.num_heads,
+        "hidden_dim_ff": args.hidden_dim_ff,
+        "d_latent_kv": args.d_latent_kv,
+        "learning_rate": learning_rate,
+        "weight_decay": weight_decay,
+        "train_steps": args.train_steps,
+        "eval_batches": args.eval_batches,
+        "ssm_core": args.ssm_core if uses_ssm else None,
+        "ssm_hidden_dim": ssm_hidden_dim if uses_ssm else None,
+        "ssm_mixer_dim": ssm_mixer_dim if uses_ssm else None,
+        "ssm_num_lanes": ssm_num_lanes if uses_ssm else None,
+        "ssm_lane_mode": args.ssm_lane_mode if uses_ssm else None,
+        "ssm_split_mix": args.ssm_split_mix if uses_ssm else None,
+        "ssm_gate_type": ssm_gate_type if uses_ssm else None,
+        "ssm_branch_rms_norm": args.ssm_branch_rms_norm if uses_ssm else None,
+        "ssm_branch_clip_value": args.ssm_branch_clip_value if uses_ssm else None,
+        "block_residual_rms_norm": args.block_residual_rms_norm if uses_ssm else None,
+        "block_residual_rms_target": args.block_residual_rms_target if uses_ssm else None,
+        "block_residual_rms_cap": args.block_residual_rms_cap if uses_ssm else None,
+        "hybrid_pattern": hybrid_pattern if architecture == "taonet_hybrid" else None,
+        "hybrid_ssm_layers": args.hybrid_ssm_layers if architecture == "taonet_hybrid" else None,
+    }
+def synchronize(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.synchronize(device)
+def reset_memory(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.reset_peak_memory_stats(device)
+def memory_stats(device: torch.device) -> dict[str, float | None]:
+    if device.type != "cuda":
+        return {"peak_allocated_mb": None, "peak_reserved_mb": None}
+    return {
+        "peak_allocated_mb": torch.cuda.max_memory_allocated(device) / (1024**2),
+        "peak_reserved_mb": torch.cuda.max_memory_reserved(device) / (1024**2),
+    }
+def iter_texts(path: Path, *, text_field: str, max_records: int | None) -> Iterable[str]:
+    suffix = path.suffix.lower()
+    count = 0
+    with path.open("r", encoding="utf-8", errors="replace") as handle:
+        if suffix in {".jsonl", ".json"}:
+            for line in handle:
+                if max_records is not None and count >= max_records:
+                    break
+                line = line.strip()
+                if not line:
+                    continue
+                try:
+                    record = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                text = record.get(text_field)
+                if isinstance(text, str) and text:
+                    count += 1
+                    yield text
+        else:
+            for line in handle:
+                if max_records is not None and count >= max_records:
+                    break
+                line = line.rstrip("\n")
+                if line:
+                    count += 1
+                    yield line
+def load_sentencepiece(path: Path):
+    import sentencepiece as spm
+    processor = spm.SentencePieceProcessor()
+    processor.load(str(path))
+    return processor
+def encode_text(text: str, *, tokenizer_type: str, sentencepiece_processor=None) -> list[int]:
+    if tokenizer_type == "byte":
+        return [byte + BYTE_OFFSET for byte in text.encode("utf-8", errors="replace")] + [BYTE_EOS_ID]
+    if tokenizer_type == "sentencepiece":
+        if sentencepiece_processor is None:
+            raise ValueError("sentencepiece tokenizer requested without a processor.")
+        ids = list(sentencepiece_processor.encode(text, out_type=int))
+        eos_id = sentencepiece_processor.eos_id()
+        if eos_id >= 0:
+            ids.append(eos_id)
+        return ids
+    raise ValueError(f"Unsupported tokenizer type '{tokenizer_type}'.")
+def load_token_stream(args: argparse.Namespace) -> tuple[torch.Tensor, int]:
+    path = Path(args.data_path)
+    if not path.exists():
+        raise FileNotFoundError(f"Data path not found: {path}")
+    sp = None
+    if args.tokenizer_type == "sentencepiece":
+        if not args.tokenizer_path:
+            raise ValueError("--tokenizer-path is required for sentencepiece tokenization.")
+        sp = load_sentencepiece(Path(args.tokenizer_path))
+        vocab_size = int(sp.vocab_size())
+    else:
+        vocab_size = BYTE_VOCAB_SIZE
+    tokens: list[int] = []
+    for text in iter_texts(path, text_field=args.text_field, max_records=args.max_records):
+        tokens.extend(encode_text(text, tokenizer_type=args.tokenizer_type, sentencepiece_processor=sp))
+        if args.max_tokens is not None and len(tokens) >= args.max_tokens:
+            tokens = tokens[: args.max_tokens]
+            break
+    if len(tokens) < args.seq_len + 2:
+        raise ValueError(
+            f"Need at least {args.seq_len + 2} tokens, found {len(tokens)} from {path}."
+        )
+    return torch.tensor(tokens, dtype=torch.long), vocab_size
+def split_stream(tokens: torch.Tensor, eval_fraction: float) -> tuple[torch.Tensor, torch.Tensor]:
+    split = int(tokens.numel() * (1.0 - eval_fraction))
+    split = max(2, min(split, tokens.numel() - 2))
+    return tokens[:split].contiguous(), tokens[split:].contiguous()
+def sample_batch(
+    stream: torch.Tensor,
+    *,
+    batch_size: int,
+    seq_len: int,
+    device: torch.device,
+    generator: torch.Generator,
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    max_start = stream.numel() - seq_len - 1
+    if max_start <= 0:
+        raise ValueError(f"Token stream too short for seq_len={seq_len}.")
+    starts = torch.randint(0, max_start, (batch_size,), generator=generator)
+    rows = [stream[int(start) : int(start) + seq_len + 1] for start in starts]
+    batch = torch.stack(rows, dim=0).to(device=device)
+    input_ids = batch[:, :-1].contiguous()
+    labels = batch[:, 1:].contiguous()
+    attention_mask = torch.ones_like(input_ids)
+    return input_ids, labels, attention_mask
+def token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
+    predictions = torch.argmax(logits, dim=-1)
+    correct = predictions == labels
+    return float(correct.sum().detach().cpu() / correct.numel())
+def build_config(
+    args: argparse.Namespace,
+    *,
+    architecture: str,
+    vocab_size: int,
+    ssm_hidden_dim: int | None,
+    ssm_mixer_dim: int | None,
+    ssm_num_lanes: int | None,
+    hybrid_pattern: str | None,
+    ssm_gate_type: str | None,
+) -> ModelConfig:
+    uses_ssm = architecture in {"taonet_ssm", "taonet_hybrid"}
+    d_latent_kv = args.d_latent_kv if args.d_latent_kv is not None else int(args.hidden_dim * 0.75)
+    d_rope = args.d_rope if args.d_rope is not None else args.hidden_dim // args.num_heads
+    hidden_dim_ff = args.hidden_dim_ff if args.hidden_dim_ff is not None else args.hidden_dim * 4
+    return ModelConfig(
+        architecture_type=architecture,
+        vocab_size=vocab_size,
+        hidden_dim=args.hidden_dim,
+        num_layers=args.num_layers,
+        num_heads=args.num_heads,
+        max_seq_length=args.seq_len,
+        d_latent_kv=d_latent_kv,
+        d_rope=d_rope,
+        hidden_dim_ff=hidden_dim_ff,
+        dropout=args.dropout,
+        gqa_groups=args.gqa_groups,
+        rope_scale=args.rope_scale,
+        yarn_alpha=args.yarn_alpha,
+        init_std=args.init_std,
+        ssm_core=args.ssm_core,
+        ssm_hidden_dim=ssm_hidden_dim,
+        ssm_mixer_dim=ssm_mixer_dim,
+        ssm_num_lanes=ssm_num_lanes or args.ssm_num_lanes,
+        ssm_lane_combine=args.ssm_lane_combine,
+        ssm_lane_mode=args.ssm_lane_mode,
+        ssm_split_mix=args.ssm_split_mix,
+        ssm_rank=args.ssm_rank,
+        ssm_max_low_rank_scale=args.ssm_max_low_rank_scale,
+        ssm_finite_tail_correction=args.ssm_finite_tail_correction,
+        ssm_kernel_mode=args.ssm_kernel_mode,
+        ssm_kernel_threshold=args.ssm_kernel_threshold,
+        ssm_dt_min=args.ssm_dt_min,
+        ssm_dt_max=args.ssm_dt_max,
+        ssm_dt_init=args.ssm_dt_init,
+        ssm_use_padding_mask=False,
+        ssm_activation=args.ssm_activation,
+        ssm_gate=args.ssm_gate,
+        ssm_input_gate=args.ssm_input_gate,
+        ssm_gate_type=ssm_gate_type or args.ssm_gate_type,
+        ssm_layer_scale_init=args.ssm_layer_scale_init,
+        ssm_branch_rms_norm=args.ssm_branch_rms_norm if uses_ssm else False,
+        ssm_branch_clip_value=args.ssm_branch_clip_value if uses_ssm else None,
+        block_residual_rms_norm=args.block_residual_rms_norm if uses_ssm else False,
+        block_residual_rms_target=args.block_residual_rms_target,
+        block_residual_rms_cap=args.block_residual_rms_cap if uses_ssm else None,
+        ssm_local_shift=args.ssm_local_shift,
+        ssm_local_shift_init=args.ssm_local_shift_init,
+        ssm_local_shift_per_channel=args.ssm_local_shift_per_channel,
+        hybrid_pattern=hybrid_pattern or args.hybrid_pattern,
+        hybrid_ssm_layers=args.hybrid_ssm_layers or None,
+    )
+def time_repeats(fn, *, device: torch.device, warmup: int, repeats: int) -> tuple[float, float, float]:
+    last_loss = float("nan")
+    for _ in range(warmup):
+        last_loss = fn()
+    synchronize(device)
+    latencies = []
+    for _ in range(repeats):
+        reset_memory(device)
+        synchronize(device)
+        start = time.perf_counter()
+        last_loss = fn()
+        synchronize(device)
+        latencies.append(time.perf_counter() - start)
+    return sum(latencies) / len(latencies), min(latencies), last_loss
+def evaluate_model(
+    model: torch.nn.Module,
+    *,
+    eval_stream: torch.Tensor,
+    args: argparse.Namespace,
+    batch_size: int,
+    device: torch.device,
+    generator: torch.Generator,
+    autocast_context,
+) -> tuple[float, float]:
+    model.eval()
+    losses = []
+    accuracies = []
+    with torch.no_grad():
+        for _ in range(args.eval_batches):
+            input_ids, labels, attention_mask = sample_batch(
+                eval_stream,
+                batch_size=batch_size,
+                seq_len=args.seq_len,
+                device=device,
+                generator=generator,
+            )
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            losses.append(float(outputs["loss"].detach().cpu()))
+            accuracies.append(token_accuracy(outputs["logits"], labels))
+    model.train()
+    return sum(losses) / len(losses), sum(accuracies) / len(accuracies)
+def train_model(
+    model: torch.nn.Module,
+    *,
+    train_stream: torch.Tensor,
+    args: argparse.Namespace,
+    learning_rate: float,
+    weight_decay: float,
+    batch_size: int,
+    device: torch.device,
+    generator: torch.Generator,
+    autocast_context,
+) -> tuple[float | None, float | None, float | None, float | None, list[dict[str, float | int]]]:
+    if args.train_steps <= 0:
+        return None, None, None, None, []
+    model.train()
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=learning_rate,
+        weight_decay=weight_decay,
+    )
+    last_loss = float("nan")
+    last_grad_norm: float | None = None
+    max_grad_norm_seen: float | None = None
+    history: list[dict[str, float | int]] = []
+    start = time.perf_counter()
+    trainable_params = [param for param in model.parameters() if param.requires_grad]
+    log_every = max(0, int(args.train_log_every))
+    for step in range(1, args.train_steps + 1):
+        input_ids, labels, attention_mask = sample_batch(
+            train_stream,
+            batch_size=batch_size,
+            seq_len=args.seq_len,
+            device=device,
+            generator=generator,
+        )
+        optimizer.zero_grad(set_to_none=True)
+        with autocast_context():
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs["loss"]
+        loss.backward()
+        if args.max_grad_norm and args.max_grad_norm > 0:
+            grad_norm_tensor = torch.nn.utils.clip_grad_norm_(trainable_params, args.max_grad_norm)
+            last_grad_norm = float(grad_norm_tensor.detach().cpu())
+            max_grad_norm_seen = (
+                last_grad_norm
+                if max_grad_norm_seen is None
+                else max(max_grad_norm_seen, last_grad_norm)
+            )
+        optimizer.step()
+        last_loss = float(loss.detach().cpu())
+        if log_every and (step == 1 or step % log_every == 0 or step == args.train_steps):
+            history.append(
+                {
+                    "step": step,
+                    "loss": last_loss,
+                    "grad_norm": last_grad_norm if last_grad_norm is not None else float("nan"),
+                }
+            )
+    synchronize(device)
+    return last_loss, time.perf_counter() - start, last_grad_norm, max_grad_norm_seen, history
+def save_case_checkpoint(
+    *,
+    args: argparse.Namespace,
+    model: torch.nn.Module,
+    config: ModelConfig,
+    case_id: str,
+    case_parts: dict[str, Any],
+    train_final_loss: float | None,
+    train_seconds: float | None,
+    train_history: list[dict[str, float | int]],
+    eval_loss: float,
+    eval_accuracy: float,
+) -> str | None:
+    if not args.save_case_checkpoints:
+        return None
+    checkpoint_dir = Path(args.checkpoint_dir) if args.checkpoint_dir else Path(args.output_dir) / "checkpoints"
+    checkpoint_dir.mkdir(parents=True, exist_ok=True)
+    path = checkpoint_dir / f"{case_id}.pt"
+    checkpoint = {
+        "step": args.train_steps,
+        "model_state": model.state_dict(),
+        "optimizer_state": None,
+        "config": {"model": config.to_dict(), "benchmark_args": vars(args)},
+        "metrics": {
+            "train_final_loss": train_final_loss,
+            "train_seconds": train_seconds,
+            "train_history": train_history,
+            "eval_loss": eval_loss,
+            "eval_accuracy": eval_accuracy,
+        },
+        "case_id": case_id,
+        "case_parts": case_parts,
+    }
+    torch.save(checkpoint, path)
+    latest_path = checkpoint_dir / "latest.pt"
+    try:
+        torch.save(checkpoint, latest_path)
+    except Exception:
+        pass
+    return str(path)
+def benchmark_case(
+    *,
+    args: argparse.Namespace,
+    architecture: str,
+    ssm_hidden_dim: int | None,
+    ssm_mixer_dim: int | None,
+    ssm_num_lanes: int | None,
+    hybrid_pattern: str | None,
+    ssm_gate_type: str | None,
+    learning_rate: float,
+    weight_decay: float,
+    vocab_size: int,
+    train_stream: torch.Tensor,
+    eval_stream: torch.Tensor,
+    batch_size: int,
+    dtype: torch.dtype,
+    device: torch.device,
+) -> list[dict[str, Any]]:
+    seed_offset = 0 if architecture == "taonet" else int(ssm_hidden_dim or 0) + int(ssm_mixer_dim or 0)
+    seed_offset += int(ssm_num_lanes or 0) * 31
+    if hybrid_pattern:
+        seed_offset += sum(ord(char) for char in hybrid_pattern)
+    if ssm_gate_type:
+        seed_offset += sum(ord(char) for char in ssm_gate_type)
+    case_seed = args.seed + seed_offset + batch_size
+    torch.manual_seed(case_seed)
+    if device.type == "cuda":
+        torch.cuda.manual_seed_all(case_seed)
+    cpu_generator = torch.Generator().manual_seed(case_seed)
+    case_parts = make_case_parts(
+        args,
+        architecture=architecture,
+        ssm_hidden_dim=ssm_hidden_dim,
+        ssm_mixer_dim=ssm_mixer_dim,
+        ssm_num_lanes=ssm_num_lanes,
+        hybrid_pattern=hybrid_pattern,
+        ssm_gate_type=ssm_gate_type,
+        learning_rate=learning_rate,
+        weight_decay=weight_decay,
+        batch_size=batch_size,
+    )
+    case_id = stable_case_id(case_parts)
+    config = build_config(
+        args,
+        architecture=architecture,
+        vocab_size=vocab_size,
+        ssm_hidden_dim=ssm_hidden_dim,
+        ssm_mixer_dim=ssm_mixer_dim,
+        ssm_num_lanes=ssm_num_lanes,
+        hybrid_pattern=hybrid_pattern,
+        ssm_gate_type=ssm_gate_type,
+    )
+    with redirect_stdout(io.StringIO()):
+        model = get_model(config, device=device)
+    model.train()
+    if device.type == "cuda" and dtype in {torch.float16, torch.bfloat16}:
+        def autocast_context():
+            return torch.autocast(device_type="cuda", dtype=dtype, enabled=True)
+    else:
+        def autocast_context():
+            return nullcontext()
+    (
+        train_final_loss,
+        train_seconds,
+        train_last_grad_norm,
+        train_max_grad_norm,
+        train_history,
+    ) = train_model(
+        model,
+        train_stream=train_stream,
+        args=args,
+        learning_rate=learning_rate,
+        weight_decay=weight_decay,
+        batch_size=batch_size,
+        device=device,
+        generator=cpu_generator,
+        autocast_context=autocast_context,
+    )
+    eval_loss, eval_accuracy = evaluate_model(
+        model,
+        eval_stream=eval_stream,
+        args=args,
+        batch_size=batch_size,
+        device=device,
+        generator=cpu_generator,
+        autocast_context=autocast_context,
+    )
+    checkpoint_path = save_case_checkpoint(
+        args=args,
+        model=model,
+        config=config,
+        case_id=case_id,
+        case_parts=case_parts,
+        train_final_loss=train_final_loss,
+        train_seconds=train_seconds,
+        train_history=train_history,
+        eval_loss=eval_loss,
+        eval_accuracy=eval_accuracy,
+    )
+    input_ids, labels, attention_mask = sample_batch(
+        eval_stream,
+        batch_size=batch_size,
+        seq_len=args.seq_len,
+        device=device,
+        generator=cpu_generator,
+    )
+    tokens = batch_size * args.seq_len
+    total_params = sum(param.numel() for param in model.parameters())
+    trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
+    rows: list[dict[str, Any]] = []
+    uses_ssm = architecture in {"taonet_ssm", "taonet_hybrid"}
+    def add_row(mode: str, mean_s: float, min_s: float, loss: float) -> None:
+        perplexity = math.exp(min(eval_loss, 20.0))
+        rows.append(
+            {
+                "architecture": architecture,
+                "case_id": case_id,
+                "checkpoint_path": checkpoint_path,
+                "ssm_core": args.ssm_core if uses_ssm else None,
+                "ssm_hidden_dim": ssm_hidden_dim if uses_ssm else None,
+                "ssm_mixer_dim": ssm_mixer_dim if uses_ssm else None,
+                "ssm_num_lanes": ssm_num_lanes if uses_ssm else None,
+                "ssm_lane_combine": args.ssm_lane_combine if uses_ssm else None,
+                "ssm_lane_mode": args.ssm_lane_mode if uses_ssm else None,
+                "ssm_split_mix": args.ssm_split_mix if uses_ssm else None,
+                "ssm_finite_tail_correction": args.ssm_finite_tail_correction if uses_ssm else None,
+                "ssm_gate_type": ssm_gate_type if uses_ssm else None,
+                "ssm_local_shift": args.ssm_local_shift if uses_ssm else None,
+                "ssm_local_shift_per_channel": args.ssm_local_shift_per_channel if uses_ssm else None,
+                "ssm_branch_rms_norm": args.ssm_branch_rms_norm if uses_ssm else None,
+                "ssm_branch_clip_value": args.ssm_branch_clip_value if uses_ssm else None,
+                "block_residual_rms_norm": args.block_residual_rms_norm if uses_ssm else None,
+                "block_residual_rms_target": args.block_residual_rms_target if uses_ssm else None,
+                "block_residual_rms_cap": args.block_residual_rms_cap if uses_ssm else None,
+                "hybrid_pattern": hybrid_pattern if architecture == "taonet_hybrid" else None,
+                "hybrid_ssm_layers": args.hybrid_ssm_layers if architecture == "taonet_hybrid" else None,
+                "mode": mode,
+                "batch_size": batch_size,
+                "seq_len": args.seq_len,
+                "tokens": tokens,
+                "vocab_size": vocab_size,
+                "tokenizer_type": args.tokenizer_type,
+                "hidden_dim": args.hidden_dim,
+                "num_layers": args.num_layers,
+                "num_heads": args.num_heads,
+                "dtype": str(dtype).replace("torch.", ""),
+                "device": str(device),
+                "learning_rate": learning_rate,
+                "weight_decay": weight_decay,
+                "total_params": total_params,
+                "trainable_params": trainable_params,
+                "mean_ms": mean_s * 1000.0,
+                "min_ms": min_s * 1000.0,
+                "tokens_per_s_mean": tokens / max(mean_s, 1e-12),
+                "tokens_per_s_best": tokens / max(min_s, 1e-12),
+                "loss": loss,
+                "eval_loss": eval_loss,
+                "eval_perplexity": perplexity,
+                "eval_accuracy": eval_accuracy,
+                "train_final_loss": train_final_loss,
+                "train_seconds": train_seconds,
+                "train_history": json.dumps(train_history, separators=(",", ":")),
+                "train_last_grad_norm": train_last_grad_norm,
+                "train_max_grad_norm": train_max_grad_norm,
+                "max_grad_norm": args.max_grad_norm,
+                **memory_stats(device),
+            }
+        )
+    def forward_only() -> float:
+        with torch.no_grad():
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+        return float(outputs["loss"].detach().cpu())
+    mean_s, min_s, loss = time_repeats(
+        forward_only,
+        device=device,
+        warmup=args.warmup,
+        repeats=args.repeats,
+    )
+    add_row("forward", mean_s, min_s, loss)
+    if args.backward:
+        def forward_backward() -> float:
+            model.zero_grad(set_to_none=True)
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+                loss = outputs["loss"]
+            loss.backward()
+            return float(loss.detach().cpu())
+        mean_s, min_s, loss = time_repeats(
+            forward_backward,
+            device=device,
+            warmup=args.warmup,
+            repeats=args.repeats,
+        )
+        add_row("forward_backward", mean_s, min_s, loss)
+    return rows
+def print_table(rows: list[dict[str, Any]]) -> None:
+    columns = [
+        "architecture",
+        "hybrid_pattern",
+        "ssm_hidden_dim",
+        "ssm_mixer_dim",
+        "ssm_num_lanes",
+        "ssm_lane_combine",
+        "ssm_lane_mode",
+        "ssm_split_mix",
+        "ssm_gate_type",
+        "ssm_branch_rms_norm",
+        "block_residual_rms_norm",
+        "block_residual_rms_cap",
+        "learning_rate",
+        "mode",
+        "batch_size",
+        "seq_len",
+        "mean_ms",
+        "tokens_per_s_mean",
+        "peak_allocated_mb",
+        "eval_loss",
+        "eval_perplexity",
+        "eval_accuracy",
+        "train_max_grad_norm",
+    ]
+    print("\t".join(columns))
+    for row in rows:
+        values = []
+        for column in columns:
+            value = row[column]
+            if isinstance(value, float):
+                values.append(f"{value:.3f}")
+            else:
+                values.append(str(value))
+        print("\t".join(values))
+def write_outputs(rows: list[dict[str, Any]], output_dir: Path, metadata: dict[str, Any]) -> None:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    json_path = output_dir / "taonet_real_token_benchmark.json"
+    csv_path = output_dir / "taonet_real_token_benchmark.csv"
+    json_path.write_text(json.dumps({"metadata": metadata, "results": rows}, indent=2), encoding="utf-8")
+    fieldnames = list(rows[0].keys()) if rows else []
+    with csv_path.open("w", newline="", encoding="utf-8") as handle:
+        writer = csv.DictWriter(handle, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(rows)
+    print(f"Wrote {json_path}")
+    print(f"Wrote {csv_path}")
+def load_existing_rows(output_dir: Path) -> list[dict[str, Any]]:
+    csv_path = output_dir / "taonet_real_token_benchmark.csv"
+    if not csv_path.exists():
+        return []
+    with csv_path.open("r", newline="", encoding="utf-8") as handle:
+        return list(csv.DictReader(handle))
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Benchmark TaoNet variants on a real text token stream.")
+    parser.add_argument("--data-path", required=True)
+    parser.add_argument("--text-field", default="text")
+    parser.add_argument("--tokenizer-type", choices=["byte", "sentencepiece"], default="byte")
+    parser.add_argument("--tokenizer-path", default="")
+    parser.add_argument("--max-records", type=int, default=None)
+    parser.add_argument("--max-tokens", type=int, default=1_000_000)
+    parser.add_argument("--eval-fraction", type=float, default=0.1)
+    parser.add_argument("--architectures", default="taonet,taonet_ssm")
+    parser.add_argument("--batch-sizes", default="8,16")
+    parser.add_argument("--seq-len", type=int, default=512)
+    parser.add_argument("--hidden-dim", type=int, default=256)
+    parser.add_argument("--num-layers", type=int, default=4)
+    parser.add_argument("--num-heads", type=int, default=4)
+    parser.add_argument("--d-latent-kv", type=int, default=None)
+    parser.add_argument("--d-rope", type=int, default=None)
+    parser.add_argument("--hidden-dim-ff", type=int, default=None)
+    parser.add_argument("--dropout", type=float, default=0.0)
+    parser.add_argument("--gqa-groups", type=int, default=1)
+    parser.add_argument("--rope-scale", type=float, default=40.0)
+    parser.add_argument("--yarn-alpha", type=float, default=1.0)
+    parser.add_argument("--init-std", type=float, default=0.02)
+    parser.add_argument("--ssm-core", choices=["gamma_s4", "dplr"], default="dplr")
+    parser.add_argument("--ssm-hidden-dims", default="16,64")
+    parser.add_argument("--ssm-mixer-dim", type=int, default=64)
+    parser.add_argument(
+        "--ssm-mixer-dims",
+        default="",
+        help="Comma-separated SSM mixer projection dimensions to sweep. Defaults to --ssm-mixer-dim.",
+    )
+    parser.add_argument("--ssm-num-lanes", type=int, default=1)
+    parser.add_argument(
+        "--ssm-num-lanes-list",
+        default="",
+        help="Comma-separated SSM lane counts to sweep for SSM-bearing architectures.",
+    )
+    parser.add_argument("--ssm-lane-combine", choices=["mean", "channel"], default="mean")
+    parser.add_argument("--ssm-lane-mode", choices=["full", "split"], default="full")
+    parser.add_argument(
+        "--ssm-lane-modes",
+        default="",
+        help="Comma-separated SSM lane modes to sweep for SSM-bearing architectures.",
+    )
+    parser.add_argument("--ssm-split-mix", choices=["none", "hadamard"], default="none")
+    parser.add_argument(
+        "--ssm-split-mixes",
+        default="",
+        help="Comma-separated split-lane cross mixers to sweep for SSM-bearing architectures.",
+    )
+    parser.add_argument("--ssm-rank", type=int, default=1)
+    parser.add_argument("--ssm-max-low-rank-scale", type=float, default=0.1)
+    parser.add_argument("--ssm-finite-tail-correction", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-kernel-mode", choices=["auto", "conv", "conv_transfer", "recurrent"], default="conv")
+    parser.add_argument("--ssm-kernel-threshold", type=int, default=1)
+    parser.add_argument("--ssm-dt-min", type=float, default=1e-3)
+    parser.add_argument("--ssm-dt-max", type=float, default=1e-1)
+    parser.add_argument("--ssm-dt-init", type=float, default=1e-2)
+    parser.add_argument("--ssm-activation", choices=["gelu", "silu", "identity", "linear"], default="gelu")
+    parser.add_argument("--ssm-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-input-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-gate-type", choices=["dense", "channel"], default="dense")
+    parser.add_argument(
+        "--ssm-gate-types",
+        default="",
+        help="Comma-separated SSM gate implementations to sweep for SSM-bearing architectures.",
+    )
+    parser.add_argument("--ssm-layer-scale-init", type=float, default=0.1)
+    parser.add_argument("--ssm-branch-rms-norm", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--ssm-branch-clip-value", type=float, default=None)
+    parser.add_argument("--block-residual-rms-norm", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--block-residual-rms-target", type=float, default=1.0)
+    parser.add_argument("--block-residual-rms-cap", type=float, default=None)
+    parser.add_argument("--ssm-local-shift", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-local-shift-init", type=float, default=0.1)
+    parser.add_argument("--ssm-local-shift-per-channel", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument(
+        "--hybrid-pattern",
+        choices=["attention_first", "ssm_first", "single_ssm_middle", "single_ssm_late"],
+        default="attention_first",
+    )
+    parser.add_argument(
+        "--hybrid-patterns",
+        default="",
+        help="Comma-separated taonet_hybrid patterns to sweep. Ignored when --hybrid-ssm-layers is set.",
+    )
+    parser.add_argument(
+        "--hybrid-ssm-layers",
+        default="",
+        help="Comma-separated 0-based SSM layer indices for taonet_hybrid. Overrides pattern selection.",
+    )
+    parser.add_argument("--dtype", choices=sorted(DTYPES), default="bf16")
+    parser.add_argument("--device", default="auto")
+    parser.add_argument("--warmup", type=int, default=1)
+    parser.add_argument("--repeats", type=int, default=3)
+    parser.add_argument("--backward", action="store_true")
+    parser.add_argument("--train-steps", type=int, default=100)
+    parser.add_argument(
+        "--train-log-every",
+        type=int,
+        default=0,
+        help="Record train loss/grad history every N steps in CSV/JSON/checkpoints. 0 disables.",
+    )
+    parser.add_argument("--learning-rate", type=float, default=8e-4)
+    parser.add_argument(
+        "--max-grad-norm",
+        type=float,
+        default=0.0,
+        help="Clip training gradients to this norm during the benchmark. Use <=0 to disable.",
+    )
+    parser.add_argument(
+        "--learning-rates",
+        default="",
+        help="Comma-separated SSM learning rates to sweep. Attention uses --learning-rate once.",
+    )
+    parser.add_argument("--weight-decay", type=float, default=0.01)
+    parser.add_argument(
+        "--weight-decays",
+        default="",
+        help="Comma-separated SSM weight decays to sweep. Attention uses --weight-decay once.",
+    )
+    parser.add_argument("--eval-batches", type=int, default=8)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--output-dir", default=os.environ.get("REPOBRIDGE_OUTPUT_DIR", "results/real-token-bench"))
+    parser.add_argument(
+        "--resume-completed",
+        action="store_true",
+        help="Skip benchmark cases whose case_id is already present in the output CSV.",
+    )
+    parser.add_argument(
+        "--incremental-output",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Rewrite CSV/JSON outputs after each completed case so detached jobs leave usable partial results.",
+    )
+    parser.add_argument(
+        "--save-case-checkpoints",
+        action="store_true",
+        help="Save a model checkpoint after each trained benchmark case.",
+    )
+    parser.add_argument(
+        "--checkpoint-dir",
+        default="",
+        help="Directory for per-case checkpoints. Defaults to <output-dir>/checkpoints.",
+    )
+    args = parser.parse_args()
+    random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.device == "auto":
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    else:
+        device = torch.device(args.device)
+    dtype = DTYPES[args.dtype]
+    if device.type == "cuda":
+        torch.backends.cuda.matmul.allow_tf32 = True
+        torch.backends.cudnn.allow_tf32 = True
+    tokens, vocab_size = load_token_stream(args)
+    train_stream, eval_stream = split_stream(tokens, args.eval_fraction)
+    architectures = [item.strip() for item in args.architectures.split(",") if item.strip()]
+    ssm_hidden_dims = parse_int_list(args.ssm_hidden_dims)
+    ssm_mixer_dims = parse_int_list(args.ssm_mixer_dims) if args.ssm_mixer_dims else [args.ssm_mixer_dim]
+    ssm_num_lanes_options = parse_int_list(args.ssm_num_lanes_list) if args.ssm_num_lanes_list else [args.ssm_num_lanes]
+    ssm_lane_mode_options = parse_str_list(args.ssm_lane_modes) if args.ssm_lane_modes else [args.ssm_lane_mode]
+    ssm_split_mix_options = parse_str_list(args.ssm_split_mixes) if args.ssm_split_mixes else [args.ssm_split_mix]
+    ssm_learning_rates = parse_float_list(args.learning_rates) if args.learning_rates else [args.learning_rate]
+    ssm_weight_decays = parse_float_list(args.weight_decays) if args.weight_decays else [args.weight_decay]
+    hybrid_patterns = parse_str_list(args.hybrid_patterns) if args.hybrid_patterns else [args.hybrid_pattern]
+    ssm_gate_types = parse_str_list(args.ssm_gate_types) if args.ssm_gate_types else [args.ssm_gate_type]
+    output_dir = Path(args.output_dir)
+    rows: list[dict[str, Any]] = load_existing_rows(output_dir) if args.resume_completed else []
+    completed_case_ids = {
+        str(row.get("case_id"))
+        for row in rows
+        if row.get("mode") == "forward_backward" or not args.backward
+    }
+    for architecture in architectures:
+        hidden_options: list[int | None] = [None]
+        mixer_options: list[int | None] = [None]
+        learning_rate_options = [args.learning_rate]
+        weight_decay_options = [args.weight_decay]
+        hybrid_pattern_options: list[str | None] = [None]
+        gate_type_options: list[str | None] = [None]
+        lane_options: list[int | None] = [None]
+        lane_mode_options: list[str | None] = [None]
+        split_mix_options: list[str | None] = ["none"]
+        if architecture in {"taonet_ssm", "taonet_hybrid"}:
+            hidden_options = ssm_hidden_dims
+            mixer_options = ssm_mixer_dims
+            learning_rate_options = ssm_learning_rates
+            weight_decay_options = ssm_weight_decays
+            gate_type_options = ssm_gate_types
+            lane_options = ssm_num_lanes_options
+            lane_mode_options = ssm_lane_mode_options
+            split_mix_options = ssm_split_mix_options
+        if architecture == "taonet_hybrid":
+            hybrid_pattern_options = [args.hybrid_pattern] if args.hybrid_ssm_layers else hybrid_patterns
+        for ssm_hidden_dim in hidden_options:
+            for ssm_mixer_dim in mixer_options:
+                for ssm_num_lanes in lane_options:
+                    for ssm_lane_mode in lane_mode_options:
+                        for ssm_split_mix in split_mix_options:
+                            if (
+                                architecture in {"taonet_ssm", "taonet_hybrid"}
+                                and ssm_lane_mode == "split"
+                                and (ssm_num_lanes is None or ssm_num_lanes <= 1)
+                            ):
+                                continue
+                            if (
+                                architecture in {"taonet_ssm", "taonet_hybrid"}
+                                and ssm_lane_mode == "split"
+                                and ssm_mixer_dim is not None
+                                and ssm_num_lanes is not None
+                                and ssm_mixer_dim % ssm_num_lanes != 0
+                            ):
+                                print(
+                                    "Skipping split-lane case because ssm_mixer_dim is not divisible by "
+                                    f"ssm_num_lanes: mixer={ssm_mixer_dim} lanes={ssm_num_lanes}"
+                                )
+                                continue
+                            if (
+                                architecture in {"taonet_ssm", "taonet_hybrid"}
+                                and ssm_lane_mode != "split"
+                                and ssm_split_mix != "none"
+                            ):
+                                continue
+                            if (
+                                architecture in {"taonet_ssm", "taonet_hybrid"}
+                                and ssm_split_mix == "hadamard"
+                                and ssm_num_lanes != 2
+                            ):
+                                continue
+                            args.ssm_lane_mode = ssm_lane_mode or args.ssm_lane_mode
+                            args.ssm_split_mix = ssm_split_mix or args.ssm_split_mix
+                            for hybrid_pattern in hybrid_pattern_options:
+                                for ssm_gate_type in gate_type_options:
+                                    for learning_rate in learning_rate_options:
+                                        for weight_decay in weight_decay_options:
+                                            for batch_size in parse_int_list(args.batch_sizes):
+                                                print(
+                                                    "Benchmarking "
+                                                    f"architecture={architecture} ssm_hidden_dim={ssm_hidden_dim} "
+                                                    f"ssm_mixer_dim={ssm_mixer_dim} ssm_num_lanes={ssm_num_lanes} "
+                                                    f"ssm_lane_mode={args.ssm_lane_mode} ssm_split_mix={args.ssm_split_mix} "
+                                                    f"hybrid_pattern={hybrid_pattern} ssm_gate_type={ssm_gate_type} "
+                                                    f"lr={learning_rate} wd={weight_decay} batch={batch_size} seq={args.seq_len}"
+                                                )
+                                                case_parts = make_case_parts(
+                                                    args,
+                                                    architecture=architecture,
+                                                    ssm_hidden_dim=ssm_hidden_dim,
+                                                    ssm_mixer_dim=ssm_mixer_dim,
+                                                    ssm_num_lanes=ssm_num_lanes,
+                                                    hybrid_pattern=hybrid_pattern,
+                                                    ssm_gate_type=ssm_gate_type,
+                                                    learning_rate=learning_rate,
+                                                    weight_decay=weight_decay,
+                                                    batch_size=batch_size,
+                                                )
+                                                case_id = stable_case_id(case_parts)
+                                                if args.resume_completed and case_id in completed_case_ids:
+                                                    print(f"Skipping completed case_id={case_id}")
+                                                    continue
+                                                case_rows = benchmark_case(
+                                                    args=args,
+                                                    architecture=architecture,
+                                                    ssm_hidden_dim=ssm_hidden_dim,
+                                                    ssm_mixer_dim=ssm_mixer_dim,
+                                                    ssm_num_lanes=ssm_num_lanes,
+                                                    hybrid_pattern=hybrid_pattern,
+                                                    ssm_gate_type=ssm_gate_type,
+                                                    learning_rate=learning_rate,
+                                                    weight_decay=weight_decay,
+                                                    vocab_size=vocab_size,
+                                                    train_stream=train_stream,
+                                                    eval_stream=eval_stream,
+                                                    batch_size=batch_size,
+                                                    dtype=dtype,
+                                                    device=device,
+                                                )
+                                                rows.extend(case_rows)
+                                                completed_case_ids.add(case_id)
+                                                if args.incremental_output:
+                                                    write_outputs(rows, output_dir, {})
+    metadata = {
+        "python": platform.python_version(),
+        "platform": platform.platform(),
+        "torch": torch.__version__,
+        "cuda_available": torch.cuda.is_available(),
+        "cuda_device": torch.cuda.get_device_name(device) if device.type == "cuda" else None,
+        "args": vars(args),
+        "vocab_size": vocab_size,
+        "num_tokens": int(tokens.numel()),
+        "train_tokens": int(train_stream.numel()),
+        "eval_tokens": int(eval_stream.numel()),
+    }
+    print_table(rows)
+    write_outputs(rows, output_dir, metadata)
+if __name__ == "__main__":
+    main()

code/TaoTrain/scripts/benchmark_taonet_token_variants.py ADDED Viewed

	@@ -0,0 +1,516 @@

+"""Token-level benchmark for TaoNet attention vs TaoNet-SSM.
+The goal is to compare the two LLM wrappers with the same outer dimensions:
+original MLA attention TaoNet versus TaoNet with an SSM mixer.
+"""
+from __future__ import annotations
+import argparse
+from contextlib import nullcontext
+from contextlib import redirect_stdout
+import csv
+import io
+import json
+import os
+from pathlib import Path
+import platform
+import subprocess
+import sys
+import time
+from typing import Any
+import torch
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_ROOT = REPO_ROOT / "src"
+if str(SRC_ROOT) not in sys.path:
+    sys.path.insert(0, str(SRC_ROOT))
+from taoTrain.config import ModelConfig
+from taoTrain.models import get_model
+DTYPES = {
+    "float32": torch.float32,
+    "fp32": torch.float32,
+    "float16": torch.float16,
+    "fp16": torch.float16,
+    "bfloat16": torch.bfloat16,
+    "bf16": torch.bfloat16,
+}
+def parse_int_list(value: str) -> list[int]:
+    return [int(item.strip()) for item in value.split(",") if item.strip()]
+def synchronize(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.synchronize(device)
+def reset_memory(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.reset_peak_memory_stats(device)
+def memory_stats(device: torch.device) -> dict[str, float | None]:
+    if device.type != "cuda":
+        return {
+            "peak_allocated_mb": None,
+            "peak_reserved_mb": None,
+        }
+    return {
+        "peak_allocated_mb": torch.cuda.max_memory_allocated(device) / (1024**2),
+        "peak_reserved_mb": torch.cuda.max_memory_reserved(device) / (1024**2),
+    }
+def nvidia_smi_snapshot() -> str | None:
+    try:
+        completed = subprocess.run(
+            [
+                "nvidia-smi",
+                "--query-gpu=name,memory.used,memory.total,utilization.gpu,utilization.memory,power.draw,temperature.gpu",
+                "--format=csv,noheader,nounits",
+            ],
+            check=False,
+            capture_output=True,
+            text=True,
+            timeout=5,
+        )
+    except (OSError, subprocess.TimeoutExpired):
+        return None
+    if completed.returncode != 0:
+        return None
+    return completed.stdout.strip()
+def make_token_batch(
+    *,
+    batch_size: int,
+    seq_len: int,
+    vocab_size: int,
+    device: torch.device,
+    task: str = "random",
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    if task == "random":
+        input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
+        labels = torch.empty_like(input_ids)
+        labels[:, :-1] = input_ids[:, 1:]
+        labels[:, -1] = torch.randint(0, vocab_size, (batch_size,), device=device)
+    elif task == "increment":
+        starts = torch.randint(0, vocab_size, (batch_size, 1), device=device)
+        offsets = torch.arange(seq_len, device=device).view(1, seq_len)
+        input_ids = (starts + offsets) % vocab_size
+        labels = (input_ids + 1) % vocab_size
+    elif task == "previous":
+        input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
+        labels = torch.empty_like(input_ids)
+        labels[:, 0] = -100
+        labels[:, 1:] = input_ids[:, :-1]
+    else:
+        raise ValueError(f"Unsupported token task '{task}'.")
+    attention_mask = torch.ones_like(input_ids)
+    return input_ids, labels, attention_mask
+def token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
+    valid = labels != -100
+    if not torch.any(valid):
+        return float("nan")
+    predictions = torch.argmax(logits, dim=-1)
+    correct = (predictions == labels) & valid
+    return float(correct.sum().detach().cpu() / valid.sum().detach().cpu())
+def build_config(args: argparse.Namespace, architecture: str) -> ModelConfig:
+    d_latent_kv = args.d_latent_kv if args.d_latent_kv is not None else int(args.hidden_dim * 0.75)
+    d_rope = args.d_rope if args.d_rope is not None else args.hidden_dim // args.num_heads
+    hidden_dim_ff = args.hidden_dim_ff if args.hidden_dim_ff is not None else args.hidden_dim * 4
+    return ModelConfig(
+        architecture_type=architecture,
+        vocab_size=args.vocab_size,
+        hidden_dim=args.hidden_dim,
+        num_layers=args.num_layers,
+        num_heads=args.num_heads,
+        max_seq_length=max(parse_int_list(args.seq_lens)),
+        d_latent_kv=d_latent_kv,
+        d_rope=d_rope,
+        hidden_dim_ff=hidden_dim_ff,
+        dropout=args.dropout,
+        gqa_groups=args.gqa_groups,
+        rope_scale=args.rope_scale,
+        yarn_alpha=args.yarn_alpha,
+        init_std=args.init_std,
+        ssm_core=args.ssm_core,
+        ssm_hidden_dim=args.ssm_hidden_dim or d_latent_kv,
+        ssm_mixer_dim=args.ssm_mixer_dim,
+        ssm_rank=args.ssm_rank,
+        ssm_max_low_rank_scale=args.ssm_max_low_rank_scale,
+        ssm_kernel_mode=args.ssm_kernel_mode,
+        ssm_kernel_threshold=args.ssm_kernel_threshold,
+        ssm_dt_min=args.ssm_dt_min,
+        ssm_dt_max=args.ssm_dt_max,
+        ssm_dt_init=args.ssm_dt_init,
+        ssm_use_padding_mask=args.ssm_use_padding_mask,
+        ssm_activation=args.ssm_activation,
+        ssm_gate=args.ssm_gate,
+        ssm_input_gate=args.ssm_input_gate,
+        ssm_layer_scale_init=args.ssm_layer_scale_init,
+        ssm_local_shift=args.ssm_local_shift,
+        ssm_local_shift_init=args.ssm_local_shift_init,
+        ssm_local_shift_per_channel=args.ssm_local_shift_per_channel,
+    )
+def count_params(model: torch.nn.Module) -> tuple[int, int]:
+    total = sum(param.numel() for param in model.parameters())
+    trainable = sum(param.numel() for param in model.parameters() if param.requires_grad)
+    return total, trainable
+def time_repeats(fn, *, device: torch.device, warmup: int, repeats: int) -> tuple[float, float, float]:
+    last_loss = float("nan")
+    for _ in range(warmup):
+        last_loss = fn()
+    synchronize(device)
+    latencies = []
+    for _ in range(repeats):
+        reset_memory(device)
+        synchronize(device)
+        start = time.perf_counter()
+        last_loss = fn()
+        synchronize(device)
+        latencies.append(time.perf_counter() - start)
+    return sum(latencies) / len(latencies), min(latencies), last_loss
+def evaluate_model(
+    model: torch.nn.Module,
+    *,
+    args: argparse.Namespace,
+    batch_size: int,
+    seq_len: int,
+    device: torch.device,
+    autocast_context,
+) -> tuple[float, float]:
+    model.eval()
+    losses = []
+    accuracies = []
+    with torch.no_grad():
+        for _ in range(args.eval_batches):
+            input_ids, labels, attention_mask = make_token_batch(
+                batch_size=batch_size,
+                seq_len=seq_len,
+                vocab_size=args.vocab_size,
+                device=device,
+                task=args.token_task,
+            )
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            losses.append(float(outputs["loss"].detach().cpu()))
+            accuracies.append(token_accuracy(outputs["logits"], labels))
+    model.train()
+    return sum(losses) / len(losses), sum(accuracies) / len(accuracies)
+def train_model(
+    model: torch.nn.Module,
+    *,
+    args: argparse.Namespace,
+    batch_size: int,
+    seq_len: int,
+    device: torch.device,
+    autocast_context,
+) -> tuple[float | None, float | None]:
+    if args.train_steps <= 0:
+        return None, None
+    model.train()
+    optimizer = torch.optim.AdamW(
+        model.parameters(),
+        lr=args.learning_rate,
+        weight_decay=args.weight_decay,
+    )
+    last_loss = float("nan")
+    start = time.perf_counter()
+    for _ in range(args.train_steps):
+        input_ids, labels, attention_mask = make_token_batch(
+            batch_size=batch_size,
+            seq_len=seq_len,
+            vocab_size=args.vocab_size,
+            device=device,
+            task=args.token_task,
+        )
+        optimizer.zero_grad(set_to_none=True)
+        with autocast_context():
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs["loss"]
+        loss.backward()
+        optimizer.step()
+        last_loss = float(loss.detach().cpu())
+    synchronize(device)
+    return last_loss, time.perf_counter() - start
+def benchmark_case(
+    *,
+    args: argparse.Namespace,
+    architecture: str,
+    batch_size: int,
+    seq_len: int,
+    dtype: torch.dtype,
+    device: torch.device,
+) -> list[dict[str, Any]]:
+    config = build_config(args, architecture)
+    with redirect_stdout(io.StringIO()):
+        model = get_model(config, device=device)
+    model.train()
+    total_params, trainable_params = count_params(model)
+    tokens = batch_size * seq_len
+    input_ids, labels, attention_mask = make_token_batch(
+        batch_size=batch_size,
+        seq_len=seq_len,
+        vocab_size=args.vocab_size,
+        device=device,
+        task=args.token_task,
+    )
+    autocast_enabled = device.type == "cuda" and dtype in {torch.float16, torch.bfloat16}
+    def autocast_context():
+        if not autocast_enabled:
+            return nullcontext()
+        return torch.autocast(device_type=device.type, dtype=dtype, enabled=True)
+    train_final_loss, train_seconds = train_model(
+        model,
+        args=args,
+        batch_size=batch_size,
+        seq_len=seq_len,
+        device=device,
+        autocast_context=autocast_context,
+    )
+    eval_loss, eval_accuracy = evaluate_model(
+        model,
+        args=args,
+        batch_size=batch_size,
+        seq_len=seq_len,
+        device=device,
+        autocast_context=autocast_context,
+    )
+    rows: list[dict[str, Any]] = []
+    def add_row(mode: str, mean_s: float, min_s: float, loss: float) -> None:
+        rows.append(
+            {
+                "architecture": architecture,
+                "ssm_core": args.ssm_core if architecture == "taonet_ssm" else None,
+                "token_task": args.token_task,
+                "train_steps": args.train_steps,
+                "mode": mode,
+                "batch_size": batch_size,
+                "seq_len": seq_len,
+                "tokens": tokens,
+                "vocab_size": args.vocab_size,
+                "hidden_dim": args.hidden_dim,
+                "num_layers": args.num_layers,
+                "num_heads": args.num_heads,
+                "d_latent_kv": config.d_latent_kv,
+                "ssm_hidden_dim": config.ssm_hidden_dim if architecture == "taonet_ssm" else None,
+                "ssm_mixer_dim": config.ssm_mixer_dim if architecture == "taonet_ssm" else None,
+                "ssm_rank": config.ssm_rank if architecture == "taonet_ssm" else None,
+                "ssm_local_shift": config.ssm_local_shift if architecture == "taonet_ssm" else None,
+                "ssm_local_shift_init": config.ssm_local_shift_init if architecture == "taonet_ssm" else None,
+                "ssm_local_shift_per_channel": config.ssm_local_shift_per_channel if architecture == "taonet_ssm" else None,
+                "dtype": str(dtype).replace("torch.", ""),
+                "device": str(device),
+                "total_params": total_params,
+                "trainable_params": trainable_params,
+                "mean_ms": mean_s * 1000.0,
+                "min_ms": min_s * 1000.0,
+                "tokens_per_s_mean": tokens / max(mean_s, 1e-12),
+                "tokens_per_s_best": tokens / max(min_s, 1e-12),
+                "loss": loss,
+                "eval_loss": eval_loss,
+                "eval_accuracy": eval_accuracy,
+                "train_final_loss": train_final_loss,
+                "train_seconds": train_seconds,
+                **memory_stats(device),
+            }
+        )
+    def forward_only() -> float:
+        with torch.no_grad():
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+                loss = outputs["loss"]
+        return float(loss.detach().cpu())
+    mean_s, min_s, loss = time_repeats(
+        forward_only,
+        device=device,
+        warmup=args.warmup,
+        repeats=args.repeats,
+    )
+    add_row("forward", mean_s, min_s, loss)
+    if args.backward:
+        def forward_backward() -> float:
+            model.zero_grad(set_to_none=True)
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+                loss = outputs["loss"]
+            loss.backward()
+            return float(loss.detach().cpu())
+        mean_s, min_s, loss = time_repeats(
+            forward_backward,
+            device=device,
+            warmup=args.warmup,
+            repeats=args.repeats,
+        )
+        add_row("forward_backward", mean_s, min_s, loss)
+    return rows
+def print_table(rows: list[dict[str, Any]]) -> None:
+    columns = [
+        "architecture",
+        "ssm_core",
+        "token_task",
+        "mode",
+        "batch_size",
+        "seq_len",
+        "mean_ms",
+        "tokens_per_s_mean",
+        "peak_allocated_mb",
+        "loss",
+        "eval_loss",
+        "eval_accuracy",
+    ]
+    print("\t".join(columns))
+    for row in rows:
+        values = []
+        for column in columns:
+            value = row[column]
+            if isinstance(value, float):
+                values.append(f"{value:.3f}")
+            else:
+                values.append(str(value))
+        print("\t".join(values))
+def write_outputs(rows: list[dict[str, Any]], output_dir: Path, metadata: dict[str, Any]) -> None:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    json_path = output_dir / "taonet_token_benchmark.json"
+    csv_path = output_dir / "taonet_token_benchmark.csv"
+    json_path.write_text(json.dumps({"metadata": metadata, "results": rows}, indent=2), encoding="utf-8")
+    fieldnames = list(rows[0].keys()) if rows else []
+    with csv_path.open("w", newline="", encoding="utf-8") as handle:
+        writer = csv.DictWriter(handle, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(rows)
+    print(f"Wrote {json_path}")
+    print(f"Wrote {csv_path}")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Benchmark TaoNet attention vs TaoNet-SSM on token batches.")
+    parser.add_argument("--architectures", default="taonet,taonet_ssm")
+    parser.add_argument("--batch-sizes", default="1,4")
+    parser.add_argument("--seq-lens", default="128,512")
+    parser.add_argument("--vocab-size", type=int, default=8192)
+    parser.add_argument("--hidden-dim", type=int, default=256)
+    parser.add_argument("--num-layers", type=int, default=4)
+    parser.add_argument("--num-heads", type=int, default=4)
+    parser.add_argument("--d-latent-kv", type=int, default=None)
+    parser.add_argument("--d-rope", type=int, default=None)
+    parser.add_argument("--hidden-dim-ff", type=int, default=None)
+    parser.add_argument("--dropout", type=float, default=0.0)
+    parser.add_argument("--gqa-groups", type=int, default=1)
+    parser.add_argument("--rope-scale", type=float, default=40.0)
+    parser.add_argument("--yarn-alpha", type=float, default=1.0)
+    parser.add_argument("--init-std", type=float, default=0.02)
+    parser.add_argument("--ssm-core", choices=["gamma_s4", "dplr"], default="dplr")
+    parser.add_argument("--ssm-hidden-dim", type=int, default=None)
+    parser.add_argument("--ssm-mixer-dim", type=int, default=None)
+    parser.add_argument("--ssm-rank", type=int, default=1)
+    parser.add_argument("--ssm-max-low-rank-scale", type=float, default=0.1)
+    parser.add_argument("--ssm-kernel-mode", choices=["auto", "conv", "conv_transfer", "recurrent"], default="conv")
+    parser.add_argument("--ssm-kernel-threshold", type=int, default=1)
+    parser.add_argument("--ssm-dt-min", type=float, default=1e-3)
+    parser.add_argument("--ssm-dt-max", type=float, default=1e-1)
+    parser.add_argument("--ssm-dt-init", type=float, default=1e-2)
+    parser.add_argument("--ssm-use-padding-mask", action="store_true")
+    parser.add_argument("--ssm-activation", choices=["gelu", "silu", "identity", "linear"], default="gelu")
+    parser.add_argument("--ssm-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-input-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-layer-scale-init", type=float, default=0.1)
+    parser.add_argument("--ssm-local-shift", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--ssm-local-shift-init", type=float, default=0.1)
+    parser.add_argument("--ssm-local-shift-per-channel", action=argparse.BooleanOptionalAction, default=False)
+    parser.add_argument("--dtype", choices=sorted(DTYPES), default="bf16")
+    parser.add_argument("--device", default="auto")
+    parser.add_argument("--warmup", type=int, default=2)
+    parser.add_argument("--repeats", type=int, default=5)
+    parser.add_argument("--backward", action="store_true")
+    parser.add_argument("--token-task", choices=["random", "increment", "previous"], default="random")
+    parser.add_argument("--train-steps", type=int, default=0)
+    parser.add_argument("--learning-rate", type=float, default=3e-4)
+    parser.add_argument("--weight-decay", type=float, default=0.01)
+    parser.add_argument("--eval-batches", type=int, default=1)
+    parser.add_argument("--output-dir", default=os.environ.get("REPOBRIDGE_OUTPUT_DIR", "results/token-bench"))
+    args = parser.parse_args()
+    if args.device == "auto":
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    else:
+        device = torch.device(args.device)
+    dtype = DTYPES[args.dtype]
+    if device.type != "cuda" and dtype == torch.float16:
+        raise ValueError("float16 benchmark requires CUDA.")
+    if device.type == "cuda":
+        torch.backends.cuda.matmul.allow_tf32 = True
+        torch.backends.cudnn.allow_tf32 = True
+    architectures = [item.strip() for item in args.architectures.split(",") if item.strip()]
+    rows: list[dict[str, Any]] = []
+    metadata = {
+        "python": platform.python_version(),
+        "platform": platform.platform(),
+        "torch": torch.__version__,
+        "cuda_available": torch.cuda.is_available(),
+        "cuda_device": torch.cuda.get_device_name(device) if device.type == "cuda" else None,
+        "nvidia_smi_before": nvidia_smi_snapshot(),
+        "args": vars(args),
+    }
+    for architecture in architectures:
+        for batch_size in parse_int_list(args.batch_sizes):
+            for seq_len in parse_int_list(args.seq_lens):
+                print(f"Benchmarking architecture={architecture} batch={batch_size} seq={seq_len}")
+                rows.extend(
+                    benchmark_case(
+                        args=args,
+                        architecture=architecture,
+                        batch_size=batch_size,
+                        seq_len=seq_len,
+                        dtype=dtype,
+                        device=device,
+                    )
+                )
+    metadata["nvidia_smi_after"] = nvidia_smi_snapshot()
+    print_table(rows)
+    write_outputs(rows, Path(args.output_dir), metadata)
+if __name__ == "__main__":
+    main()

code/TaoTrain/scripts/diagnostics/activation_probe.py ADDED Viewed

	@@ -0,0 +1,160 @@

+"""Probe residual activation scale for a saved TaoTrain checkpoint."""
+from __future__ import annotations
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+from typing import Any
+import torch
+REPO_ROOT = Path(__file__).resolve().parents[2]
+SRC_ROOT = REPO_ROOT / "src"
+if str(SRC_ROOT) not in sys.path:
+    sys.path.insert(0, str(SRC_ROOT))
+from taoTrain.checkpointing.checkpoint import CheckpointManager
+from taoTrain.config import ModelConfig
+from taoTrain.models import get_model
+def load_sentencepiece(path: Path):
+    import sentencepiece as spm
+    processor = spm.SentencePieceProcessor()
+    processor.load(str(path))
+    return processor
+def load_tokens(args: argparse.Namespace) -> tuple[torch.Tensor, int]:
+    tokenizer = load_sentencepiece(Path(args.tokenizer_path))
+    tokens: list[int] = []
+    with Path(args.data_path).open("r", encoding="utf-8", errors="replace") as handle:
+        for line in handle:
+            if len(tokens) >= args.max_tokens:
+                break
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                record = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            text = record.get(args.text_field)
+            if not isinstance(text, str) or not text:
+                continue
+            ids = list(tokenizer.encode(text, out_type=int))
+            eos_id = tokenizer.eos_id()
+            if eos_id >= 0:
+                ids.append(eos_id)
+            tokens.extend(ids)
+    if len(tokens) < args.seq_len + 2:
+        raise ValueError(f"Need at least {args.seq_len + 2} tokens, got {len(tokens)}")
+    return torch.tensor(tokens[: args.max_tokens], dtype=torch.long), int(tokenizer.vocab_size())
+def sample_batch(tokens: torch.Tensor, *, batch_size: int, seq_len: int, device: torch.device) -> tuple[torch.Tensor, torch.Tensor]:
+    max_start = tokens.numel() - seq_len - 1
+    starts = torch.linspace(0, max_start - 1, steps=batch_size).long()
+    rows = [tokens[int(start) : int(start) + seq_len + 1] for start in starts]
+    batch = torch.stack(rows, dim=0).to(device=device)
+    return batch[:, :-1].contiguous(), batch[:, 1:].contiguous()
+def tensor_stats(value: torch.Tensor) -> dict[str, float | int]:
+    data = value.detach().float()
+    finite = torch.isfinite(data)
+    finite_count = int(finite.sum().cpu())
+    numel = data.numel()
+    if finite_count:
+        finite_data = data[finite]
+        rms = float(torch.sqrt(torch.mean(finite_data * finite_data)).cpu())
+        max_abs = float(finite_data.abs().max().cpu())
+    else:
+        rms = float("inf")
+        max_abs = float("inf")
+    return {
+        "numel": numel,
+        "finite": finite_count,
+        "rms": rms,
+        "max_abs": max_abs,
+    }
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint", required=True)
+    parser.add_argument("--tokenizer-path", required=True)
+    parser.add_argument("--data-path", required=True)
+    parser.add_argument("--text-field", default="text")
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--batch-size", type=int, default=2)
+    parser.add_argument("--seq-len", type=int, default=512)
+    parser.add_argument("--max-tokens", type=int, default=200_000)
+    parser.add_argument("--device", default="cuda")
+    parser.add_argument("--dtype", choices=["float32", "bfloat16", "float16"], default="bfloat16")
+    args = parser.parse_args()
+    device = torch.device(args.device if args.device == "cpu" or torch.cuda.is_available() else "cpu")
+    dtype = {
+        "float32": torch.float32,
+        "bfloat16": torch.bfloat16,
+        "float16": torch.float16,
+    }[args.dtype]
+    tokens, _ = load_tokens(args)
+    input_ids, labels = sample_batch(tokens, batch_size=args.batch_size, seq_len=args.seq_len, device=device)
+    attention_mask = torch.ones_like(input_ids)
+    checkpoint_path = Path(args.checkpoint)
+    checkpoint = CheckpointManager(checkpoint_path.parent).load(checkpoint_path, device=device)
+    config_dict = checkpoint.get("config", {})
+    model_config = ModelConfig(**config_dict.get("model", {}))
+    model = get_model(model_config, device=device)
+    model.load_state_dict(checkpoint["model_state"], strict=False)
+    model.eval()
+    layer_stats: dict[str, dict[str, float | int]] = {}
+    handles = []
+    layer_pattern = re.compile(r"^(?:model\.)?(?:layers|blocks)\.\d+$")
+    def make_hook(name: str):
+        def hook(_module, _inputs, output):
+            value = output[0] if isinstance(output, tuple) else output
+            if torch.is_tensor(value):
+                layer_stats[name] = tensor_stats(value)
+        return hook
+    for name, module in model.named_modules():
+        if layer_pattern.match(name):
+            handles.append(module.register_forward_hook(make_hook(name)))
+    device_type = "cuda" if device.type == "cuda" else "cpu"
+    autocast_enabled = device.type == "cuda" and dtype in {torch.float16, torch.bfloat16}
+    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=dtype, enabled=autocast_enabled):
+        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+    for handle in handles:
+        handle.remove()
+    result: dict[str, Any] = {
+        "checkpoint": str(checkpoint_path),
+        "loss": float(outputs["loss"].detach().cpu()),
+        "batch_size": args.batch_size,
+        "seq_len": args.seq_len,
+        "device": str(device),
+        "dtype": str(dtype),
+        "layers": layer_stats,
+    }
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
+    print(json.dumps(result, indent=2))
+if __name__ == "__main__":
+    main()

code/TaoTrain/scripts/diagnostics/generate_checkpoint_samples.py ADDED Viewed

	@@ -0,0 +1,134 @@

+"""Generate a few text samples from a saved checkpoint."""
+from __future__ import annotations
+import argparse
+import json
+import sys
+from pathlib import Path
+import torch
+REPO_ROOT = Path(__file__).resolve().parents[2]
+SRC_ROOT = REPO_ROOT / "src"
+if str(SRC_ROOT) not in sys.path:
+    sys.path.insert(0, str(SRC_ROOT))
+from taoTrain.checkpointing.checkpoint import CheckpointManager
+from taoTrain.config import ModelConfig
+from taoTrain.inference.inferencer import Inferencer
+from taoTrain.models import get_model
+def clear_kernel_caches(model) -> None:
+    for module in model.modules():
+        clear = getattr(module, "clear_kernel_cache", None)
+        if callable(clear):
+            clear()
+def generate_once(
+    model,
+    tokenizer,
+    prompt: str,
+    *,
+    device: torch.device,
+    max_new_tokens: int,
+    temperature: float,
+    top_p: float,
+    dtype: torch.dtype,
+) -> str:
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+    generated = []
+    eos_token_id = getattr(tokenizer, "eos_token_id", None)
+    model.eval()
+    device_type = "cuda" if device.type == "cuda" else "cpu"
+    autocast_enabled = device.type == "cuda" and dtype in {torch.float16, torch.bfloat16}
+    with torch.inference_mode(), torch.autocast(device_type=device_type, dtype=dtype, enabled=autocast_enabled):
+        for _ in range(max_new_tokens):
+            clear_kernel_caches(model)
+            outputs = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids), labels=None)
+            logits = outputs["logits"][:, -1, :] / max(temperature, 1e-6)
+            if top_p < 1.0:
+                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+                sorted_probs = torch.softmax(sorted_logits, dim=-1)
+                cumulative = torch.cumsum(sorted_probs, dim=-1)
+                remove = cumulative > top_p
+                remove[..., 1:] = remove[..., :-1].clone()
+                remove[..., 0] = False
+                indices_to_remove = sorted_indices[remove]
+                logits[0, indices_to_remove] = float("-inf")
+            probs = torch.softmax(logits, dim=-1)
+            next_token = torch.multinomial(probs, num_samples=1)
+            token_id = int(next_token.item())
+            if eos_token_id is not None and token_id == eos_token_id:
+                break
+            generated.append(token_id)
+            input_ids = torch.cat([input_ids, next_token], dim=-1)
+    clear_kernel_caches(model)
+    return tokenizer.decode(generated, skip_special_tokens=True)
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint", required=True)
+    parser.add_argument("--tokenizer-path", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--prompt", action="append", default=[])
+    parser.add_argument("--max-new-tokens", type=int, default=80)
+    parser.add_argument("--temperature", type=float, default=0.8)
+    parser.add_argument("--top-p", type=float, default=0.9)
+    parser.add_argument("--device", default="cuda")
+    parser.add_argument("--dtype", choices=["float32", "bfloat16", "float16"], default="bfloat16")
+    args = parser.parse_args()
+    prompts = args.prompt or [
+        "The purpose of artificial intelligence is",
+        "In a small village,",
+        "<user>Hello, who are you?<assistant>",
+    ]
+    device = torch.device(args.device if args.device == "cpu" or torch.cuda.is_available() else "cpu")
+    dtype = {
+        "float32": torch.float32,
+        "bfloat16": torch.bfloat16,
+        "float16": torch.float16,
+    }[args.dtype]
+    tokenizer = Inferencer._load_tokenizer(args.tokenizer_path)
+    checkpoint_path = Path(args.checkpoint)
+    checkpoint = CheckpointManager(checkpoint_path.parent).load(checkpoint_path, device=device)
+    model_config = ModelConfig(**checkpoint.get("config", {}).get("model", {}))
+    model = get_model(model_config, device=device)
+    model.load_state_dict(checkpoint["model_state"], strict=False)
+    samples = []
+    for prompt in prompts:
+        text = generate_once(
+            model,
+            tokenizer,
+            prompt,
+            device=device,
+            max_new_tokens=args.max_new_tokens,
+            temperature=args.temperature,
+            top_p=args.top_p,
+            dtype=dtype,
+        )
+        samples.append({"prompt": prompt, "completion": text})
+    result = {
+        "checkpoint": args.checkpoint,
+        "tokenizer_path": args.tokenizer_path,
+        "device": str(device),
+        "dtype": str(dtype),
+        "max_new_tokens": args.max_new_tokens,
+        "temperature": args.temperature,
+        "top_p": args.top_p,
+        "samples": samples,
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps(result, indent=2, ensure_ascii=False))
+if __name__ == "__main__":
+    main()

code/TaoTrain/scripts/diagnostics/sft_sanity_check.py ADDED Viewed

	@@ -0,0 +1,300 @@

+"""Small SFT diagnostics for checkpoint quality and trainability.
+This script intentionally bypasses the full trainer so it can answer one narrow
+question quickly: can the checkpoint reduce response-only SFT loss on a tiny,
+fixed batch?
+"""
+from __future__ import annotations
+import argparse
+import json
+import math
+from pathlib import Path
+from typing import Any
+import torch
+from taoTrain.checkpointing.checkpoint import CheckpointManager
+from taoTrain.config import TrainingModeEnum, load_config
+from taoTrain.core import create_model
+from taoTrain.data.sft_utils import build_sft_sequence_tokens, parse_sft_record
+try:
+    from taoTrain.data.sft_utils import build_response_only_next_token_labels
+except ImportError:
+    def build_response_only_next_token_labels(input_ids: list[int], mask: list[int]) -> list[int]:
+        labels = [token_id if mask_value else -100 for token_id, mask_value in zip(input_ids, mask)]
+        return labels[1:] + [-100]
+from taoTrain.data.tokenizer import SentencePieceTokenizerWrapper
+from taoTrain.utils import set_seed
+def load_tokenizer(tokenizer_path: str):
+    path = Path(tokenizer_path)
+    if path.suffix == ".model":
+        import sentencepiece as spm
+        sp = spm.SentencePieceProcessor()
+        sp.Load(str(path))
+        return SentencePieceTokenizerWrapper(sp)
+    from transformers import AutoTokenizer
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
+    if getattr(tokenizer, "pad_token", None) is None and getattr(tokenizer, "eos_token", None):
+        tokenizer.pad_token = tokenizer.eos_token
+    return tokenizer
+def read_jsonl_records(path: str, limit: int) -> list[dict[str, Any]]:
+    records = []
+    with open(path, "r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            records.append(json.loads(line))
+            if len(records) >= limit:
+                break
+    return records
+def build_batch(config, tokenizer, records: list[dict[str, Any]], device: torch.device) -> dict[str, torch.Tensor]:
+    input_rows = []
+    attention_rows = []
+    label_rows = []
+    train_tokens = []
+    for record in records:
+        turns, _ = parse_sft_record(record, config)
+        if not turns:
+            continue
+        input_ids, attention_mask, mask = build_sft_sequence_tokens(
+            turns=turns,
+            tokenizer=tokenizer,
+            user_token=getattr(config, "user_token", "<user>"),
+            assistant_token=getattr(config, "assistant_token", "<assistant>"),
+            max_seq_length=config.model.max_seq_length,
+        )
+        labels = build_response_only_next_token_labels(input_ids, mask)
+        input_rows.append(input_ids)
+        attention_rows.append(attention_mask)
+        label_rows.append(labels)
+        train_tokens.append(sum(1 for value in labels if value != -100))
+    if not input_rows:
+        raise ValueError("No valid SFT records found for the diagnostic batch")
+    return {
+        "input_ids": torch.tensor(input_rows, dtype=torch.long, device=device),
+        "attention_mask": torch.tensor(attention_rows, dtype=torch.long, device=device),
+        "labels": torch.tensor(label_rows, dtype=torch.long, device=device),
+        "train_tokens": torch.tensor(train_tokens, dtype=torch.long),
+    }
+@torch.no_grad()
+def score_batch(model, batch: dict[str, torch.Tensor], dtype: torch.dtype) -> float:
+    model.eval()
+    device_type = "cuda" if batch["input_ids"].is_cuda else "cpu"
+    enabled = device_type == "cuda" and dtype in (torch.float16, torch.bfloat16)
+    with torch.autocast(device_type=device_type, dtype=dtype, enabled=enabled):
+        outputs = model(
+            input_ids=batch["input_ids"],
+            attention_mask=batch["attention_mask"],
+            labels=batch["labels"],
+        )
+    return float(outputs["loss"].detach().cpu())
+def grad_l2_norm(parameters) -> float:
+    total = 0.0
+    for parameter in parameters:
+        if parameter.grad is None:
+            continue
+        grad = parameter.grad.detach()
+        total += float(torch.sum(grad.float() * grad.float()).cpu())
+    return math.sqrt(total)
+def grad_summary(named_parameters, max_items: int = 12) -> dict[str, Any]:
+    groups: dict[str, dict[str, Any]] = {}
+    worst = []
+    nonfinite = []
+    for name, parameter in named_parameters:
+        if parameter.grad is None:
+            continue
+        grad = parameter.grad.detach().float()
+        finite = torch.isfinite(grad)
+        finite_count = int(finite.sum().cpu())
+        numel = grad.numel()
+        finite_abs_max = float(grad[finite].abs().max().cpu()) if finite_count else float("inf")
+        has_nonfinite = finite_count != numel
+        if has_nonfinite:
+            nonfinite.append(name)
+        if ".layers." in name:
+            parts = name.split(".")
+            try:
+                idx = parts.index("layers")
+                group = "layer_" + parts[idx + 1]
+            except (ValueError, IndexError):
+                group = "layers"
+        else:
+            group = name.split(".", 1)[0]
+        entry = groups.setdefault(group, {
+            "numel": 0,
+            "finite": 0,
+            "nonfinite_tensors": 0,
+            "max_abs_grad": 0.0,
+        })
+        entry["numel"] += numel
+        entry["finite"] += finite_count
+        entry["nonfinite_tensors"] += int(has_nonfinite)
+        entry["max_abs_grad"] = max(entry["max_abs_grad"], finite_abs_max)
+        worst.append((finite_abs_max, name))
+    worst.sort(reverse=True, key=lambda item: item[0])
+    return {
+        "groups": groups,
+        "worst_tensors": [{"name": name, "max_abs_grad": value} for value, name in worst[:max_items]],
+        "nonfinite_tensors": nonfinite[:max_items],
+        "nonfinite_tensor_count": len(nonfinite),
+    }
+def freeze_ssm_core_parameters(model) -> int:
+    frozen = 0
+    markers = (
+        ".ssm_lanes.",
+        ".ssm.",
+    )
+    for name, parameter in model.named_parameters():
+        if any(marker in name for marker in markers):
+            parameter.requires_grad_(False)
+            frozen += parameter.numel()
+    return frozen
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--config", required=True)
+    parser.add_argument("--checkpoint", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--samples", type=int, default=2)
+    parser.add_argument("--steps", type=int, default=80)
+    parser.add_argument("--lr", type=float, default=3e-4)
+    parser.add_argument("--log-every", type=int, default=10)
+    parser.add_argument("--device", default="cuda")
+    parser.add_argument("--dtype", choices=["config", "float32", "float16", "bfloat16"], default="config")
+    parser.add_argument("--no-clip", action="store_true")
+    parser.add_argument("--freeze-ssm-core", action="store_true")
+    parser.add_argument("--ssm-branch-rms-norm", action="store_true")
+    parser.add_argument("--ssm-branch-clip-value", type=float, default=None)
+    parser.add_argument("--block-residual-rms-norm", action="store_true")
+    parser.add_argument("--block-residual-rms-target", type=float, default=None)
+    parser.add_argument("--seed", type=int, default=123)
+    args = parser.parse_args()
+    set_seed(args.seed)
+    config = load_config(args.config, TrainingModeEnum.SFT)
+    if args.ssm_branch_rms_norm:
+        config.model.ssm_branch_rms_norm = True
+    if args.ssm_branch_clip_value is not None:
+        config.model.ssm_branch_clip_value = args.ssm_branch_clip_value
+    if args.block_residual_rms_norm:
+        config.model.block_residual_rms_norm = True
+    if args.block_residual_rms_target is not None:
+        config.model.block_residual_rms_target = args.block_residual_rms_target
+    device = torch.device(args.device if args.device == "cpu" or torch.cuda.is_available() else "cpu")
+    if args.dtype == "float32":
+        dtype = torch.float32
+    elif args.dtype == "float16":
+        dtype = torch.float16
+    elif args.dtype == "bfloat16":
+        dtype = torch.bfloat16
+    else:
+        dtype = torch.bfloat16 if str(config.dtype) == "DataTypeEnum.BFLOAT16" or str(config.dtype) == "bfloat16" else torch.float32
+    tokenizer = load_tokenizer(config.dataset.tokenizer_path)
+    records = read_jsonl_records(config.dataset.jsonl_path, args.samples)
+    batch = build_batch(config, tokenizer, records, device)
+    model = create_model(config, device)
+    checkpoint = CheckpointManager(config.checkpoint_dir).load(args.checkpoint, device=device)
+    model.load_state_dict(checkpoint["model_state"], strict=False)
+    frozen_params = freeze_ssm_core_parameters(model) if args.freeze_ssm_core else 0
+    initial_loss = score_batch(model, batch, dtype)
+    trainable_params = [parameter for parameter in model.parameters() if parameter.requires_grad]
+    optimizer = torch.optim.AdamW(trainable_params, lr=args.lr, weight_decay=0.0)
+    history = []
+    device_type = "cuda" if device.type == "cuda" else "cpu"
+    autocast_enabled = device_type == "cuda" and dtype in (torch.float16, torch.bfloat16)
+    model.train()
+    for step in range(1, args.steps + 1):
+        optimizer.zero_grad(set_to_none=True)
+        with torch.autocast(device_type=device_type, dtype=dtype, enabled=autocast_enabled):
+            outputs = model(
+                input_ids=batch["input_ids"],
+                attention_mask=batch["attention_mask"],
+                labels=batch["labels"],
+            )
+            loss = outputs["loss"]
+        loss.backward()
+        grad_norm = grad_l2_norm(trainable_params)
+        stats = None
+        if step == 1 or step % args.log_every == 0 or step == args.steps:
+            stats = grad_summary(model.named_parameters())
+        if not args.no_clip:
+            torch.nn.utils.clip_grad_norm_(trainable_params, 1.0)
+        optimizer.step()
+        if step == 1 or step % args.log_every == 0 or step == args.steps:
+            item = {
+                "step": step,
+                "loss": float(loss.detach().cpu()),
+                "grad_l2_norm": grad_norm,
+            }
+            if stats is not None:
+                item["grad_summary"] = stats
+            history.append(item)
+    final_loss = score_batch(model, batch, dtype)
+    result = {
+        "checkpoint": str(Path(args.checkpoint)),
+        "config": str(Path(args.config)),
+        "dataset": config.dataset.jsonl_path,
+        "samples": len(records),
+        "sequence_length": config.model.max_seq_length,
+        "train_tokens_per_sample": batch["train_tokens"].tolist(),
+        "lr": args.lr,
+        "steps": args.steps,
+        "clip_grad_norm": not args.no_clip,
+        "freeze_ssm_core": args.freeze_ssm_core,
+        "ssm_branch_rms_norm": config.model.ssm_branch_rms_norm,
+        "ssm_branch_clip_value": config.model.ssm_branch_clip_value,
+        "block_residual_rms_norm": config.model.block_residual_rms_norm,
+        "block_residual_rms_target": config.model.block_residual_rms_target,
+        "frozen_params": frozen_params,
+        "trainable_params": sum(parameter.numel() for parameter in trainable_params),
+        "initial_loss": initial_loss,
+        "final_loss": final_loss,
+        "loss_delta": final_loss - initial_loss,
+        "history": history,
+        "device": str(device),
+        "dtype": str(dtype),
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(result, indent=2), encoding="utf-8")
+    print(json.dumps(result, indent=2))
+if __name__ == "__main__":
+    main()

code/TaoTrain/scripts/profile_taonet_components.py ADDED Viewed

	@@ -0,0 +1,393 @@

+"""Profile TaoNet and TaoNet-SSM component costs on synthetic token batches.
+The real-token benchmark tells us end-to-end quality and throughput. This
+script is the companion microscope: it times forward components such as the
+SSM core, gates, projections, FFN, embeddings, and output head so hardware work
+targets the largest measured costs.
+"""
+from __future__ import annotations
+import argparse
+from collections import defaultdict
+from contextlib import nullcontext
+from contextlib import redirect_stdout
+import io
+import json
+import os
+from pathlib import Path
+import platform
+import sys
+import time
+from typing import Any
+import torch
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_ROOT = REPO_ROOT / "src"
+if str(SRC_ROOT) not in sys.path:
+    sys.path.insert(0, str(SRC_ROOT))
+from taoTrain.config import ModelConfig
+from taoTrain.models import get_model
+DTYPES = {
+    "float32": torch.float32,
+    "fp32": torch.float32,
+    "float16": torch.float16,
+    "fp16": torch.float16,
+    "bfloat16": torch.bfloat16,
+    "bf16": torch.bfloat16,
+}
+def synchronize(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.synchronize(device)
+def reset_memory(device: torch.device) -> None:
+    if device.type == "cuda":
+        torch.cuda.reset_peak_memory_stats(device)
+def memory_stats(device: torch.device) -> dict[str, float | None]:
+    if device.type != "cuda":
+        return {"peak_allocated_mb": None, "peak_reserved_mb": None}
+    return {
+        "peak_allocated_mb": torch.cuda.max_memory_allocated(device) / (1024**2),
+        "peak_reserved_mb": torch.cuda.max_memory_reserved(device) / (1024**2),
+    }
+class ComponentTimer:
+    def __init__(self, device: torch.device) -> None:
+        self.device = device
+        self.records: dict[str, list[float]] = defaultdict(list)
+        self._starts: dict[int, Any] = {}
+        self._handles = []
+    def _record_ms(self, name: str, start: Any) -> None:
+        if self.device.type == "cuda":
+            end = torch.cuda.Event(enable_timing=True)
+            end.record()
+            end.synchronize()
+            self.records[name].append(float(start.elapsed_time(end)))
+        else:
+            self.records[name].append((time.perf_counter() - start) * 1000.0)
+    def add(self, module: torch.nn.Module, name: str) -> None:
+        def pre_hook(mod, inputs):
+            del inputs
+            if self.device.type == "cuda":
+                start = torch.cuda.Event(enable_timing=True)
+                start.record()
+            else:
+                start = time.perf_counter()
+            self._starts[id(mod)] = start
+        def post_hook(mod, inputs, output):
+            del inputs, output
+            start = self._starts.pop(id(mod), None)
+            if start is not None:
+                self._record_ms(name, start)
+        self._handles.append(module.register_forward_pre_hook(pre_hook))
+        self._handles.append(module.register_forward_hook(post_hook))
+    def close(self) -> None:
+        for handle in self._handles:
+            handle.remove()
+        self._handles.clear()
+    def summary(self) -> list[dict[str, float | str | int]]:
+        rows = []
+        for name, values in sorted(self.records.items()):
+            if not values:
+                continue
+            rows.append(
+                {
+                    "component": name,
+                    "calls": len(values),
+                    "mean_ms": sum(values) / len(values),
+                    "total_ms": sum(values),
+                    "min_ms": min(values),
+                    "max_ms": max(values),
+                }
+            )
+        rows.sort(key=lambda row: float(row["total_ms"]), reverse=True)
+        return rows
+def build_config(args: argparse.Namespace, architecture: str) -> ModelConfig:
+    d_latent_kv = args.d_latent_kv if args.d_latent_kv is not None else int(args.hidden_dim * 0.75)
+    d_rope = args.d_rope if args.d_rope is not None else args.hidden_dim // args.num_heads
+    hidden_dim_ff = args.hidden_dim_ff if args.hidden_dim_ff is not None else args.hidden_dim * 4
+    return ModelConfig(
+        architecture_type=architecture,
+        vocab_size=args.vocab_size,
+        hidden_dim=args.hidden_dim,
+        num_layers=args.num_layers,
+        num_heads=args.num_heads,
+        max_seq_length=args.seq_len,
+        d_latent_kv=d_latent_kv,
+        d_rope=d_rope,
+        hidden_dim_ff=hidden_dim_ff,
+        dropout=args.dropout,
+        gqa_groups=args.gqa_groups,
+        rope_scale=args.rope_scale,
+        yarn_alpha=args.yarn_alpha,
+        init_std=args.init_std,
+        ssm_core=args.ssm_core,
+        ssm_hidden_dim=args.ssm_hidden_dim,
+        ssm_mixer_dim=args.ssm_mixer_dim,
+        ssm_rank=args.ssm_rank,
+        ssm_max_low_rank_scale=args.ssm_max_low_rank_scale,
+        ssm_kernel_mode=args.ssm_kernel_mode,
+        ssm_kernel_threshold=args.ssm_kernel_threshold,
+        ssm_dt_min=args.ssm_dt_min,
+        ssm_dt_max=args.ssm_dt_max,
+        ssm_dt_init=args.ssm_dt_init,
+        ssm_use_padding_mask=False,
+        ssm_activation=args.ssm_activation,
+        ssm_gate=args.ssm_gate,
+        ssm_input_gate=args.ssm_input_gate,
+        ssm_layer_scale_init=args.ssm_layer_scale_init,
+        ssm_local_shift=args.ssm_local_shift,
+        ssm_local_shift_init=args.ssm_local_shift_init,
+        ssm_local_shift_per_channel=args.ssm_local_shift_per_channel,
+    )
+def add_component_hooks(model: torch.nn.Module, architecture: str, timer: ComponentTimer) -> None:
+    timer.add(model.token_embedding, "embedding")
+    timer.add(model.final_norm, "final_norm")
+    timer.add(model.output_head, "output_head")
+    for layer_index, block in enumerate(model.blocks):
+        prefix = f"block{layer_index}"
+        if architecture == "taonet_ssm":
+            mixer = block.mixer
+            timer.add(mixer.norm, f"{prefix}.mixer.norm")
+            if mixer.input_gate is not None:
+                timer.add(mixer.input_gate, f"{prefix}.mixer.input_gate")
+            timer.add(mixer.input_proj, f"{prefix}.mixer.input_proj")
+            timer.add(mixer.ssm, f"{prefix}.mixer.ssm_core")
+            timer.add(mixer.activation, f"{prefix}.mixer.activation")
+            timer.add(mixer.out_proj, f"{prefix}.mixer.out_proj")
+            if mixer.output_gate is not None:
+                timer.add(mixer.output_gate, f"{prefix}.mixer.output_gate")
+            timer.add(mixer.proj_dropout, f"{prefix}.mixer.dropout")
+        else:
+            mla = block.mla
+            timer.add(mla.norm, f"{prefix}.attention.norm")
+            timer.add(mla.q_proj, f"{prefix}.attention.q_proj")
+            timer.add(mla.k_proj, f"{prefix}.attention.k_proj")
+            timer.add(mla.v_proj, f"{prefix}.attention.v_proj")
+            timer.add(mla.out_proj, f"{prefix}.attention.out_proj")
+            timer.add(mla.attn_dropout, f"{prefix}.attention.attn_dropout")
+            timer.add(mla.proj_dropout, f"{prefix}.attention.proj_dropout")
+        timer.add(block.ff_norm, f"{prefix}.ff.norm")
+        timer.add(block.ff_gate, f"{prefix}.ff.gate")
+        timer.add(block.ff_value, f"{prefix}.ff.value")
+        timer.add(block.ff_out, f"{prefix}.ff.out")
+def time_repeats(fn, *, device: torch.device, warmup: int, repeats: int) -> dict[str, float]:
+    for _ in range(warmup):
+        fn()
+    synchronize(device)
+    latencies = []
+    for _ in range(repeats):
+        reset_memory(device)
+        synchronize(device)
+        start = time.perf_counter()
+        fn()
+        synchronize(device)
+        latencies.append(time.perf_counter() - start)
+    mean_s = sum(latencies) / len(latencies)
+    return {
+        "mean_ms": mean_s * 1000.0,
+        "min_ms": min(latencies) * 1000.0,
+        "max_ms": max(latencies) * 1000.0,
+    }
+def profile_architecture(
+    args: argparse.Namespace,
+    *,
+    architecture: str,
+    device: torch.device,
+    dtype: torch.dtype,
+) -> dict[str, Any]:
+    torch.manual_seed(args.seed)
+    if device.type == "cuda":
+        torch.cuda.manual_seed_all(args.seed)
+    config = build_config(args, architecture)
+    with redirect_stdout(io.StringIO()):
+        model = get_model(config, device=device)
+    model.train()
+    input_ids = torch.randint(
+        low=0,
+        high=args.vocab_size,
+        size=(args.batch_size, args.seq_len),
+        device=device,
+    )
+    labels = torch.randint(
+        low=0,
+        high=args.vocab_size,
+        size=(args.batch_size, args.seq_len),
+        device=device,
+    )
+    attention_mask = torch.ones_like(input_ids)
+    autocast_enabled = device.type == "cuda" and dtype in {torch.float16, torch.bfloat16}
+    def autocast_context():
+        if not autocast_enabled:
+            return nullcontext()
+        return torch.autocast(device_type=device.type, dtype=dtype, enabled=True)
+    def forward_only() -> torch.Tensor:
+        with torch.no_grad():
+            with autocast_context():
+                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            return outputs["loss"]
+    def forward_backward() -> torch.Tensor:
+        model.zero_grad(set_to_none=True)
+        with autocast_context():
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs["loss"]
+        loss.backward()
+        return loss
+    no_timer_forward = time_repeats(
+        forward_only,
+        device=device,
+        warmup=args.warmup,
+        repeats=args.repeats,
+    )
+    no_timer_backward = time_repeats(
+        forward_backward,
+        device=device,
+        warmup=args.warmup,
+        repeats=args.repeats,
+    )
+    timer = ComponentTimer(device)
+    add_component_hooks(model, architecture, timer)
+    try:
+        for _ in range(args.component_warmup):
+            forward_only()
+        synchronize(device)
+        for _ in range(args.component_repeats):
+            forward_only()
+        synchronize(device)
+    finally:
+        timer.close()
+    tokens = args.batch_size * args.seq_len
+    component_rows = timer.summary()
+    return {
+        "architecture": architecture,
+        "total_params": sum(param.numel() for param in model.parameters()),
+        "trainable_params": sum(param.numel() for param in model.parameters() if param.requires_grad),
+        "forward": {
+            **no_timer_forward,
+            "tokens_per_s": tokens / max(no_timer_forward["mean_ms"] / 1000.0, 1e-12),
+        },
+        "forward_backward": {
+            **no_timer_backward,
+            "tokens_per_s": tokens / max(no_timer_backward["mean_ms"] / 1000.0, 1e-12),
+            **memory_stats(device),
+        },
+        "components_forward": component_rows,
+    }
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--architectures", default="taonet,taonet_ssm")
+    parser.add_argument("--vocab-size", type=int, default=8192)
+    parser.add_argument("--batch-size", type=int, default=32)
+    parser.add_argument("--seq-len", type=int, default=512)
+    parser.add_argument("--hidden-dim", type=int, default=256)
+    parser.add_argument("--num-layers", type=int, default=4)
+    parser.add_argument("--num-heads", type=int, default=4)
+    parser.add_argument("--d-latent-kv", type=int, default=None)
+    parser.add_argument("--d-rope", type=int, default=None)
+    parser.add_argument("--hidden-dim-ff", type=int, default=None)
+    parser.add_argument("--dropout", type=float, default=0.0)
+    parser.add_argument("--gqa-groups", type=int, default=1)
+    parser.add_argument("--rope-scale", type=float, default=40.0)
+    parser.add_argument("--yarn-alpha", type=float, default=1.0)
+    parser.add_argument("--init-std", type=float, default=0.02)
+    parser.add_argument("--ssm-core", choices=["gamma_s4", "dplr"], default="dplr")
+    parser.add_argument("--ssm-hidden-dim", type=int, default=16)
+    parser.add_argument("--ssm-mixer-dim", type=int, default=128)
+    parser.add_argument("--ssm-rank", type=int, default=1)
+    parser.add_argument("--ssm-max-low-rank-scale", type=float, default=0.1)
+    parser.add_argument("--ssm-kernel-mode", choices=["auto", "conv", "conv_transfer", "recurrent"], default="conv")
+    parser.add_argument("--ssm-kernel-threshold", type=int, default=1)
+    parser.add_argument("--ssm-dt-min", type=float, default=1e-3)
+    parser.add_argument("--ssm-dt-max", type=float, default=1e-1)
+    parser.add_argument("--ssm-dt-init", type=float, default=1e-2)
+    parser.add_argument("--ssm-activation", choices=["gelu", "silu", "identity", "linear"], default="gelu")
+    parser.add_argument("--ssm-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-input-gate", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-layer-scale-init", type=float, default=0.1)
+    parser.add_argument("--ssm-local-shift", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--ssm-local-shift-init", type=float, default=0.1)
+    parser.add_argument("--ssm-local-shift-per-channel", action=argparse.BooleanOptionalAction, default=True)
+    parser.add_argument("--dtype", choices=sorted(DTYPES), default="bf16")
+    parser.add_argument("--device", default="auto")
+    parser.add_argument("--warmup", type=int, default=2)
+    parser.add_argument("--repeats", type=int, default=5)
+    parser.add_argument("--component-warmup", type=int, default=1)
+    parser.add_argument("--component-repeats", type=int, default=3)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--output", type=Path, default=None)
+    args = parser.parse_args()
+    if args.device == "auto":
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    else:
+        device = torch.device(args.device)
+    dtype = DTYPES[args.dtype]
+    if device.type == "cuda":
+        torch.backends.cuda.matmul.allow_tf32 = True
+        torch.backends.cudnn.allow_tf32 = True
+    results = [
+        profile_architecture(args, architecture=architecture.strip(), device=device, dtype=dtype)
+        for architecture in args.architectures.split(",")
+        if architecture.strip()
+    ]
+    report = {
+        "metadata": {
+            "python": platform.python_version(),
+            "platform": platform.platform(),
+            "torch": torch.__version__,
+            "cuda_available": torch.cuda.is_available(),
+            "cuda_device": torch.cuda.get_device_name(device) if device.type == "cuda" else None,
+            "device": str(device),
+            "dtype": str(dtype).replace("torch.", ""),
+            "args": vars(args) | {"output": str(args.output) if args.output else None},
+        },
+        "results": results,
+    }
+    text = json.dumps(report, indent=2, sort_keys=True, default=str)
+    print(text)
+    if args.output is not None:
+        args.output.parent.mkdir(parents=True, exist_ok=True)
+        args.output.write_text(text, encoding="utf-8")
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

code/TaoTrain/scripts/remote/job_status.sh ADDED Viewed

	@@ -0,0 +1,36 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+if [[ -z "$RUN_ID" ]]; then
+  echo "RUN_ID is required" >&2
+  exit 2
+fi
+safe_run_id="$(printf '%s' "$RUN_ID" | tr -c 'A-Za-z0-9_.-' '_')"
+job_dir="${JOB_ROOT%/}/${safe_run_id}"
+if [[ ! -d "$job_dir" ]]; then
+  echo "Job directory not found: $job_dir" >&2
+  exit 1
+fi
+echo "== status.json =="
+cat "$job_dir/status.json" 2>/dev/null || true
+echo
+echo "== markers =="
+ls -1 "$job_dir"/DONE "$job_dir"/FAILED 2>/dev/null || true
+echo
+echo "== tmux =="
+tmux ls 2>/dev/null | grep -F "taotern_${safe_run_id}" || true
+echo
+echo "== recent log =="
+tail -n "${TAIL_LINES:-80}" "$job_dir/train.log" 2>/dev/null || true
+echo
+echo "== outputs =="
+find "$job_dir/outputs" -maxdepth 2 -type f 2>/dev/null | sort | tail -n 40 || true
+echo
+echo "== checkpoints =="
+find "$job_dir/checkpoints" -maxdepth 1 -type f 2>/dev/null | sort | tail -n 20 || true

code/TaoTrain/scripts/remote/run_200m_base_suite.sh ADDED Viewed

	@@ -0,0 +1,149 @@

+#!/usr/bin/env bash
+set -euo pipefail
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+OUTPUT_BASE="${REPOBRIDGE_OUTPUT_DIR:-$REMOTE_REPO/results/200m-base-suite}"
+CHECKPOINT_BASE="${TAOTERN_CHECKPOINT_DIR:-$OUTPUT_BASE/checkpoints}"
+# Stage-1 defaults are intentionally modest. Increase these through environment
+# variables after the 200M shapes are stable on the RTX5090.
+MAX_TOKENS="${MAX_TOKENS:-50000000}"
+MAX_RECORDS="${MAX_RECORDS:-100000}"
+TRAIN_STEPS="${TRAIN_STEPS:-200}"
+EVAL_BATCHES="${EVAL_BATCHES:-16}"
+BATCH_SIZES="${BATCH_SIZES:-4,8}"
+SEQ_LEN="${SEQ_LEN:-512}"
+LEARNING_RATE="${LEARNING_RATE:-0.0006}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+DRY_RUN="${DRY_RUN:-0}"
+export PYTHONPATH="$REMOTE_REPO/src:$SSM_REPO_PATH"
+mkdir -p "$OUTPUT_BASE" "$CHECKPOINT_BASE"
+run_variant() {
+  local variant="$1"
+  shift
+  local output_dir="$OUTPUT_BASE/$variant"
+  local checkpoint_dir="$CHECKPOINT_BASE/$variant"
+  mkdir -p "$output_dir" "$checkpoint_dir"
+  local cmd="$PYTHON_BIN scripts/benchmark_taonet_real_tokens.py \
+    --data-path $DATA_PATH \
+    --text-field text \
+    --tokenizer-type sentencepiece \
+    --tokenizer-path $TOKENIZER_PATH \
+    --max-records $MAX_RECORDS \
+    --max-tokens $MAX_TOKENS \
+    --eval-fraction 0.1 \
+    --batch-sizes $BATCH_SIZES \
+    --seq-len $SEQ_LEN \
+    --dtype bf16 \
+    --device cuda \
+    --warmup 1 \
+    --repeats 2 \
+    --backward \
+    --train-steps $TRAIN_STEPS \
+    --learning-rate $LEARNING_RATE \
+    --weight-decay $WEIGHT_DECAY \
+    --eval-batches $EVAL_BATCHES \
+    --output-dir $output_dir \
+    --resume-completed \
+    --incremental-output \
+    --save-case-checkpoints \
+    --checkpoint-dir $checkpoint_dir \
+    $*"
+  printf '\n=== 200M variant: %s ===\n' "$variant"
+  printf '%s\n' "$cmd"
+  if [ "$DRY_RUN" = "1" ]; then
+    return 0
+  fi
+  eval "$cmd"
+}
+run_variant attention_196m \
+  --architectures taonet \
+  --hidden-dim 960 \
+  --num-layers 16 \
+  --num-heads 8 \
+  --d-latent-kv 720 \
+  --d-rope 120 \
+  --hidden-dim-ff 2880
+run_variant pure_ssm_196m_hadamard \
+  --architectures taonet_ssm \
+  --hidden-dim 1024 \
+  --num-layers 18 \
+  --num-heads 8 \
+  --d-latent-kv 768 \
+  --d-rope 128 \
+  --hidden-dim-ff 3072 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 16 \
+  --ssm-mixer-dims 256 \
+  --ssm-num-lanes-list 2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes split \
+  --ssm-split-mixes hadamard \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1
+run_variant pure_ssm_196m_nomix \
+  --architectures taonet_ssm \
+  --hidden-dim 1024 \
+  --num-layers 18 \
+  --num-heads 8 \
+  --d-latent-kv 768 \
+  --d-rope 128 \
+  --hidden-dim-ff 3072 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 16 \
+  --ssm-mixer-dims 256 \
+  --ssm-num-lanes-list 2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes split \
+  --ssm-split-mixes none \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1
+run_variant hybrid_ssm_first_199m \
+  --architectures taonet_hybrid \
+  --hidden-dim 1024 \
+  --num-layers 16 \
+  --num-heads 8 \
+  --d-latent-kv 768 \
+  --d-rope 128 \
+  --hidden-dim-ff 3072 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 32 \
+  --ssm-mixer-dims 256 \
+  --ssm-num-lanes-list 2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes split \
+  --ssm-split-mixes hadamard \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --hybrid-patterns ssm_first \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1
+if [ "$DRY_RUN" != "1" ]; then
+  "$PYTHON_BIN" scripts/summarize_taonet_benchmark_suite.py --suite-dir "$OUTPUT_BASE"
+fi

code/TaoTrain/scripts/remote/run_200m_branch_only_chat.sh ADDED Viewed

	@@ -0,0 +1,329 @@

+#!/usr/bin/env bash
+set -euo pipefail
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+OUTPUT_BASE="${REPOBRIDGE_OUTPUT_DIR:-$REMOTE_REPO/results/200m-branch-only-chat}"
+CHECKPOINT_BASE="${TAOTERN_CHECKPOINT_DIR:-$OUTPUT_BASE/checkpoints}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+PRETRAIN_TOKENS="${PRETRAIN_TOKENS:-4000000000}"
+SFT_STEPS="${SFT_STEPS:-50000}"
+PRETRAIN_LR="${PRETRAIN_LR:-0.0008}"
+SFT_LR="${SFT_LR:-0.00005}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+LOG_EVERY="${LOG_EVERY:-100}"
+SAVE_EVERY="${SAVE_EVERY:-100000}"
+SFT_SAVE_EVERY="${SFT_SAVE_EVERY:-10000}"
+TOKENIZER_THREADS="${TOKENIZER_THREADS:-8}"
+SAMPLES_PER_CHUNK="${SAMPLES_PER_CHUNK:-2000}"
+BLOCK_RESIDUAL_RMS_CAP="${BLOCK_RESIDUAL_RMS_CAP:-}"
+ceil_div() {
+  local numerator="$1"
+  local denominator="$2"
+  echo $(( (numerator + denominator - 1) / denominator ))
+}
+PRETRAIN_STEPS="${PRETRAIN_STEPS:-$(ceil_div "$PRETRAIN_TOKENS" $((BATCH_SIZE * SEQ_LEN)))}"
+export PYTHONPATH="$REMOTE_REPO/src:$SSM_REPO_PATH"
+mkdir -p "$OUTPUT_BASE/configs" "$OUTPUT_BASE/diagnostics" "$CHECKPOINT_BASE/pretrain" "$CHECKPOINT_BASE/sft"
+cap_yaml=""
+if [[ -n "$BLOCK_RESIDUAL_RMS_CAP" ]]; then
+  cap_yaml="  block_residual_rms_cap: $BLOCK_RESIDUAL_RMS_CAP"
+fi
+cat > "$OUTPUT_BASE/run_plan.json" <<JSON
+{
+  "purpose": "200m_branch_only_pure_ssm_4b_pretrain_plus_sft_chatbot_attempt",
+  "candidate": "pure_ssm_196m_branch_rms_only",
+  "pretrain_token_positions": $PRETRAIN_TOKENS,
+  "pretrain_steps": $PRETRAIN_STEPS,
+  "batch_size": $BATCH_SIZE,
+  "seq_len": $SEQ_LEN,
+  "sft_steps": $SFT_STEPS,
+  "pretrain_lr": $PRETRAIN_LR,
+  "sft_lr": $SFT_LR,
+  "save_every_steps": $SAVE_EVERY,
+  "block_residual_rms_cap": "${BLOCK_RESIDUAL_RMS_CAP:-null}",
+  "notes": [
+    "Uses streaming JSONL training through TaoTrain CLI.",
+    "Counts token positions as batch_size * seq_len * max_steps.",
+    "Keeps SSM branch RMS normalization enabled and block residual RMS normalization disabled.",
+    "Runs corrected response-only SFT after pretraining."
+  ]
+}
+JSON
+PRETRAIN_CONFIG="$OUTPUT_BASE/configs/pretrain.yaml"
+cat > "$PRETRAIN_CONFIG" <<YAML
+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: $SEQ_LEN
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  block_residual_rms_norm: false
+$cap_yaml
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  local: true
+  jsonl_path: $DATA_PATH
+  text_field: text
+  tokenizer_type: sentencepiece
+  tokenizer_path: $TOKENIZER_PATH
+  samples_per_chunk: $SAMPLES_PER_CHUNK
+  tokenizer_threads: $TOKENIZER_THREADS
+sequence_length: $SEQ_LEN
+batch_size: $BATCH_SIZE
+num_epochs: 100000
+max_steps: $PRETRAIN_STEPS
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: $PRETRAIN_LR
+  weight_decay: $WEIGHT_DECAY
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: constant
+  warmup_steps: 1000
+  warmup_ratio: 0.0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: $CHECKPOINT_BASE/pretrain
+save_every_steps: $SAVE_EVERY
+save_best_model: false
+keep_last_n_checkpoints: 3
+eval_every_steps: $SAVE_EVERY
+eval_samples: 32
+log_every_steps: $LOG_EVERY
+aim_repo: $OUTPUT_BASE/.aim-pretrain
+seed: 43
+num_workers: 0
+pin_memory: true
+YAML
+printf '\n============================================================\n'
+printf '200M branch-only pure SSM pretrain\n'
+printf 'token_positions=%s batch=%s seq_len=%s steps=%s save_every=%s\n' \
+  "$PRETRAIN_TOKENS" "$BATCH_SIZE" "$SEQ_LEN" "$PRETRAIN_STEPS" "$SAVE_EVERY"
+printf '============================================================\n'
+"$PYTHON_BIN" -m taoTrain.cli pretrain --config "$PRETRAIN_CONFIG"
+PRETRAIN_CKPT="$CHECKPOINT_BASE/pretrain/final_model.pt"
+if [[ ! -f "$PRETRAIN_CKPT" ]]; then
+  echo "Expected pretrain checkpoint missing: $PRETRAIN_CKPT" >&2
+  exit 2
+fi
+"$PYTHON_BIN" scripts/diagnostics/activation_probe.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --data-path "$DATA_PATH" \
+  --text-field text \
+  --output "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_final.json" \
+  --batch-size 2 \
+  --seq-len "$SEQ_LEN" \
+  --device cuda \
+  --dtype bfloat16
+"$PYTHON_BIN" scripts/diagnostics/generate_checkpoint_samples.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --output "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_final.json" \
+  --max-new-tokens 160 \
+  --temperature 0.8 \
+  --top-p 0.9 \
+  --prompt "The purpose of artificial intelligence is" \
+  --prompt "In a small village," \
+  --prompt "<user>Hello, who are you?<assistant>"
+SFT_CONFIG="$OUTPUT_BASE/configs/sft.yaml"
+cat > "$SFT_CONFIG" <<YAML
+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: $SEQ_LEN
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  block_residual_rms_norm: false
+$cap_yaml
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  split: train
+  instruction_column: input
+  response_column: output
+  local: true
+  jsonl_path: $SFT_DATA_PATH
+  samples_per_chunk: $SAMPLES_PER_CHUNK
+  tokenizer_type: sentencepiece
+  tokenizer_path: $TOKENIZER_PATH
+  tokenizer_threads: $TOKENIZER_THREADS
+checkpoint_path: $PRETRAIN_CKPT
+user_token: "<user>"
+assistant_token: "<assistant>"
+response_loss_only: true
+batch_size: $BATCH_SIZE
+num_epochs: 100000
+max_steps: $SFT_STEPS
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: $SFT_LR
+  weight_decay: 0.0
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: constant
+  warmup_steps: 100
+  warmup_ratio: 0.0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: $CHECKPOINT_BASE/sft
+save_every_steps: $SFT_SAVE_EVERY
+save_best_model: false
+keep_last_n_checkpoints: 3
+eval_every_steps: $SFT_SAVE_EVERY
+eval_samples: 32
+log_every_steps: 20
+aim_repo: $OUTPUT_BASE/.aim-sft
+seed: 44
+num_workers: 0
+pin_memory: true
+YAML
+printf '\n============================================================\n'
+printf '200M branch-only pure SSM response-only SFT\n'
+printf 'steps=%s batch=%s seq_len=%s save_every=%s\n' \
+  "$SFT_STEPS" "$BATCH_SIZE" "$SEQ_LEN" "$SFT_SAVE_EVERY"
+printf '============================================================\n'
+"$PYTHON_BIN" -m taoTrain.cli sft --config "$SFT_CONFIG"
+SFT_CKPT="$CHECKPOINT_BASE/sft/final_model.pt"
+if [[ ! -f "$SFT_CKPT" ]]; then
+  echo "Expected SFT checkpoint missing: $SFT_CKPT" >&2
+  exit 2
+fi
+"$PYTHON_BIN" scripts/diagnostics/generate_checkpoint_samples.py \
+  --checkpoint "$SFT_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --output "$OUTPUT_BASE/diagnostics/generation_samples_sft_final.json" \
+  --max-new-tokens 220 \
+  --temperature 0.7 \
+  --top-p 0.9 \
+  --prompt "<user>Hello, who are you?<assistant>" \
+  --prompt "<user>Explain what artificial intelligence is in simple words.<assistant>" \
+  --prompt "<user>Give me three practical study tips.<assistant>"
+cat > "$OUTPUT_BASE/model_card.json" <<JSON
+{
+  "state": "completed",
+  "architecture": "taonet_ssm",
+  "candidate": "pure_ssm_196m_branch_rms_only",
+  "pretrain_checkpoint": "$PRETRAIN_CKPT",
+  "sft_checkpoint": "$SFT_CKPT",
+  "tokenizer_path": "$TOKENIZER_PATH",
+  "pretrain_config": "$PRETRAIN_CONFIG",
+  "sft_config": "$SFT_CONFIG",
+  "activation_probe": "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_final.json",
+  "pretrain_samples": "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_final.json",
+  "sft_samples": "$OUTPUT_BASE/diagnostics/generation_samples_sft_final.json"
+}
+JSON
+echo "200M branch-only pure SSM pretrain+SFT run completed."
+cat "$OUTPUT_BASE/model_card.json"

code/TaoTrain/scripts/remote/run_200m_until_selection.sh ADDED Viewed

	@@ -0,0 +1,81 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT_OUTPUT_DIR="${REPOBRIDGE_OUTPUT_DIR:-$(pwd)/results/200m-until-selection}"
+ROOT_CHECKPOINT_DIR="${TAOTERN_CHECKPOINT_DIR:-$ROOT_OUTPUT_DIR/checkpoints}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+PILOT_TOKENS="${PILOT_TOKENS:-300000000}"
+SERIOUS_TOKENS="${SERIOUS_TOKENS:-1000000000}"
+PILOT_EVAL_BATCHES="${PILOT_EVAL_BATCHES:-64}"
+SERIOUS_EVAL_BATCHES="${SERIOUS_EVAL_BATCHES:-128}"
+LEARNING_RATE="${LEARNING_RATE:-0.0003}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+ceil_div() {
+  local numerator="$1"
+  local denominator="$2"
+  echo $(( (numerator + denominator - 1) / denominator ))
+}
+run_phase() {
+  local phase="$1"
+  local target_tokens="$2"
+  local eval_batches="$3"
+  local tokens_per_step=$((BATCH_SIZE * SEQ_LEN))
+  local train_steps
+  train_steps="$(ceil_div "$target_tokens" "$tokens_per_step")"
+  printf '\n============================================================\n'
+  printf '200M until-selection phase: %s\n' "$phase"
+  printf 'target_tokens=%s batch=%s seq_len=%s train_steps=%s eval_batches=%s\n' \
+    "$target_tokens" "$BATCH_SIZE" "$SEQ_LEN" "$train_steps" "$eval_batches"
+  printf '============================================================\n'
+  REPOBRIDGE_OUTPUT_DIR="$ROOT_OUTPUT_DIR/$phase" \
+  TAOTERN_CHECKPOINT_DIR="$ROOT_CHECKPOINT_DIR/$phase" \
+  BATCH_SIZES="$BATCH_SIZE" \
+  SEQ_LEN="$SEQ_LEN" \
+  TRAIN_STEPS="$train_steps" \
+  EVAL_BATCHES="$eval_batches" \
+  LEARNING_RATE="$LEARNING_RATE" \
+  WEIGHT_DECAY="$WEIGHT_DECAY" \
+  bash scripts/remote/run_200m_base_suite.sh
+}
+mkdir -p "$ROOT_OUTPUT_DIR" "$ROOT_CHECKPOINT_DIR"
+cat > "$ROOT_OUTPUT_DIR/run_plan.json" <<JSON
+{
+  "stopping_point": "selection_after_1b_all_four_variants",
+  "batch_size": $BATCH_SIZE,
+  "seq_len": $SEQ_LEN,
+  "learning_rate": $LEARNING_RATE,
+  "weight_decay": $WEIGHT_DECAY,
+  "phases": [
+    {
+      "name": "pilot_300m",
+      "target_tokens_per_variant": $PILOT_TOKENS,
+      "train_steps": $(ceil_div "$PILOT_TOKENS" $((BATCH_SIZE * SEQ_LEN))),
+      "eval_batches": $PILOT_EVAL_BATCHES
+    },
+    {
+      "name": "serious_1b",
+      "target_tokens_per_variant": $SERIOUS_TOKENS,
+      "train_steps": $(ceil_div "$SERIOUS_TOKENS" $((BATCH_SIZE * SEQ_LEN))),
+      "eval_batches": $SERIOUS_EVAL_BATCHES
+    }
+  ],
+  "variants": [
+    "attention_196m",
+    "pure_ssm_196m_hadamard",
+    "pure_ssm_196m_nomix",
+    "hybrid_ssm_first_199m"
+  ]
+}
+JSON
+run_phase "pilot_300m" "$PILOT_TOKENS" "$PILOT_EVAL_BATCHES"
+run_phase "serious_1b" "$SERIOUS_TOKENS" "$SERIOUS_EVAL_BATCHES"
+echo "Selection gate reached after pilot_300m and serious_1b completed for all four variants."

code/TaoTrain/scripts/remote/run_pre_200m_branch_only_gate.sh ADDED Viewed

	@@ -0,0 +1,269 @@

+#!/usr/bin/env bash
+set -euo pipefail
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+OUTPUT_BASE="${REPOBRIDGE_OUTPUT_DIR:-$REMOTE_REPO/results/pre-200m-branch-only-gate}"
+CHECKPOINT_BASE="${TAOTERN_CHECKPOINT_DIR:-$OUTPUT_BASE/checkpoints}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+TARGET_TOKENS="${TARGET_TOKENS:-100000000}"
+MAX_TOKENS="${MAX_TOKENS:-100000000}"
+MAX_RECORDS="${MAX_RECORDS:-200000}"
+EVAL_BATCHES="${EVAL_BATCHES:-128}"
+LEARNING_RATE="${LEARNING_RATE:-0.0008}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-1000}"
+SFT_SANITY_SAMPLES="${SFT_SANITY_SAMPLES:-4}"
+SFT_SANITY_STEPS="${SFT_SANITY_STEPS:-120}"
+SFT_SANITY_LR="${SFT_SANITY_LR:-0.00005}"
+BLOCK_RESIDUAL_RMS_CAP="${BLOCK_RESIDUAL_RMS_CAP:-}"
+ceil_div() {
+  local numerator="$1"
+  local denominator="$2"
+  echo $(( (numerator + denominator - 1) / denominator ))
+}
+TRAIN_STEPS="${TRAIN_STEPS:-$(ceil_div "$TARGET_TOKENS" $((BATCH_SIZE * SEQ_LEN)))}"
+export PYTHONPATH="$REMOTE_REPO/src:$SSM_REPO_PATH"
+mkdir -p "$OUTPUT_BASE" "$CHECKPOINT_BASE" "$OUTPUT_BASE/configs" "$OUTPUT_BASE/diagnostics"
+cap_args=""
+cap_yaml=""
+if [[ -n "$BLOCK_RESIDUAL_RMS_CAP" ]]; then
+  cap_args="--block-residual-rms-cap $BLOCK_RESIDUAL_RMS_CAP"
+  cap_yaml="  block_residual_rms_cap: $BLOCK_RESIDUAL_RMS_CAP"
+fi
+cat > "$OUTPUT_BASE/run_plan.json" <<JSON
+{
+  "purpose": "pre_200m_branch_only_quality_gate_before_4b_sft_chatbot_run",
+  "candidate": "pure_ssm_196m_branch_rms_only",
+  "target_tokens": $TARGET_TOKENS,
+  "train_steps": $TRAIN_STEPS,
+  "batch_size": $BATCH_SIZE,
+  "seq_len": $SEQ_LEN,
+  "learning_rate": $LEARNING_RATE,
+  "weight_decay": $WEIGHT_DECAY,
+  "block_residual_rms_cap": "${BLOCK_RESIDUAL_RMS_CAP:-null}",
+  "checks": [
+    "bounded pretrain loss/eval/grad telemetry",
+    "activation scale probe",
+    "sample generation",
+    "tiny SFT overfit probe"
+  ]
+}
+JSON
+printf '\n============================================================\n'
+printf 'Pre-200M branch-only gate: pure SSM stabilized candidate\n'
+printf 'target_tokens=%s batch=%s seq_len=%s train_steps=%s eval_batches=%s cap=%s\n' \
+  "$TARGET_TOKENS" "$BATCH_SIZE" "$SEQ_LEN" "$TRAIN_STEPS" "$EVAL_BATCHES" "${BLOCK_RESIDUAL_RMS_CAP:-none}"
+printf '============================================================\n'
+"$PYTHON_BIN" scripts/benchmark_taonet_real_tokens.py \
+  --data-path "$DATA_PATH" \
+  --text-field text \
+  --tokenizer-type sentencepiece \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --max-records "$MAX_RECORDS" \
+  --max-tokens "$MAX_TOKENS" \
+  --eval-fraction 0.1 \
+  --architectures taonet_ssm \
+  --batch-sizes "$BATCH_SIZE" \
+  --seq-len "$SEQ_LEN" \
+  --hidden-dim 1024 \
+  --num-layers 18 \
+  --num-heads 8 \
+  --d-latent-kv 768 \
+  --d-rope 128 \
+  --hidden-dim-ff 3072 \
+  --dropout 0.0 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 32 \
+  --ssm-mixer-dims 256 \
+  --ssm-num-lanes-list 2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes split \
+  --ssm-split-mixes none \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --dtype bf16 \
+  --device cuda \
+  --warmup 1 \
+  --repeats 2 \
+  --backward \
+  --train-steps "$TRAIN_STEPS" \
+  --train-log-every "$TRAIN_LOG_EVERY" \
+  --learning-rate "$LEARNING_RATE" \
+  --weight-decay "$WEIGHT_DECAY" \
+  --max-grad-norm 1.0 \
+  --eval-batches "$EVAL_BATCHES" \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1 \
+  --ssm-branch-rms-norm \
+  $cap_args \
+  --output-dir "$OUTPUT_BASE/pretrain" \
+  --resume-completed \
+  --incremental-output \
+  --save-case-checkpoints \
+  --checkpoint-dir "$CHECKPOINT_BASE/pretrain"
+PRETRAIN_CKPT="$CHECKPOINT_BASE/pretrain/latest.pt"
+if [[ ! -f "$PRETRAIN_CKPT" ]]; then
+  echo "Expected pretrain checkpoint missing: $PRETRAIN_CKPT" >&2
+  exit 2
+fi
+"$PYTHON_BIN" scripts/diagnostics/activation_probe.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --data-path "$DATA_PATH" \
+  --text-field text \
+  --output "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_latest.json" \
+  --batch-size 2 \
+  --seq-len "$SEQ_LEN" \
+  --device cuda \
+  --dtype bfloat16
+"$PYTHON_BIN" scripts/diagnostics/generate_checkpoint_samples.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --output "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_latest.json" \
+  --max-new-tokens 120 \
+  --temperature 0.8 \
+  --top-p 0.9 \
+  --prompt "The purpose of artificial intelligence is" \
+  --prompt "In a small village," \
+  --prompt "<user>Hello, who are you?<assistant>"
+SFT_CONFIG="$OUTPUT_BASE/configs/sft_sanity.yaml"
+cat > "$SFT_CONFIG" <<YAML
+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: $SEQ_LEN
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  block_residual_rms_norm: false
+$cap_yaml
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  split: train
+  instruction_column: input
+  response_column: output
+  local: true
+  jsonl_path: $SFT_DATA_PATH
+  samples_per_chunk: 2000
+  tokenizer_type: sentencepiece
+  tokenizer_path: $TOKENIZER_PATH
+  tokenizer_threads: 8
+checkpoint_path: $PRETRAIN_CKPT
+user_token: "<user>"
+assistant_token: "<assistant>"
+response_loss_only: true
+batch_size: $BATCH_SIZE
+num_epochs: 100000
+max_steps: $SFT_SANITY_STEPS
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: $SFT_SANITY_LR
+  weight_decay: 0.0
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: linearWarmup
+  warmup_steps: 0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: $CHECKPOINT_BASE/sft_sanity
+save_every_steps: 5000
+save_best_model: false
+keep_last_n_checkpoints: 1
+eval_every_steps: 5000
+eval_samples: 32
+log_every_steps: 10
+aim_repo: $OUTPUT_BASE/.aim-sft-sanity
+seed: 43
+num_workers: 0
+pin_memory: true
+YAML
+"$PYTHON_BIN" scripts/diagnostics/sft_sanity_check.py \
+  --config "$SFT_CONFIG" \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --output "$OUTPUT_BASE/diagnostics/sft_sanity_pretrain_latest.json" \
+  --samples "$SFT_SANITY_SAMPLES" \
+  --steps "$SFT_SANITY_STEPS" \
+  --lr "$SFT_SANITY_LR" \
+  --log-every 20 \
+  --device cuda \
+  --dtype bfloat16 \
+  --ssm-branch-rms-norm
+cat > "$OUTPUT_BASE/gate_summary.json" <<JSON
+{
+  "state": "completed",
+  "pretrain_checkpoint": "$PRETRAIN_CKPT",
+  "pretrain_results": "$OUTPUT_BASE/pretrain/taonet_real_token_benchmark.json",
+  "activation_probe": "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_latest.json",
+  "generation_samples": "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_latest.json",
+  "sft_sanity": "$OUTPUT_BASE/diagnostics/sft_sanity_pretrain_latest.json"
+}
+JSON
+echo "Pre-200M branch-only gate completed."
+cat "$OUTPUT_BASE/gate_summary.json"

code/TaoTrain/scripts/remote/run_pre_200m_stability_gate.sh ADDED Viewed

	@@ -0,0 +1,267 @@

+#!/usr/bin/env bash
+set -euo pipefail
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+OUTPUT_BASE="${REPOBRIDGE_OUTPUT_DIR:-$REMOTE_REPO/results/pre-200m-stability-gate}"
+CHECKPOINT_BASE="${TAOTERN_CHECKPOINT_DIR:-$OUTPUT_BASE/checkpoints}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+TARGET_TOKENS="${TARGET_TOKENS:-20000000}"
+MAX_TOKENS="${MAX_TOKENS:-50000000}"
+MAX_RECORDS="${MAX_RECORDS:-120000}"
+EVAL_BATCHES="${EVAL_BATCHES:-64}"
+LEARNING_RATE="${LEARNING_RATE:-0.0008}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-250}"
+SFT_SANITY_SAMPLES="${SFT_SANITY_SAMPLES:-4}"
+SFT_SANITY_STEPS="${SFT_SANITY_STEPS:-120}"
+SFT_SANITY_LR="${SFT_SANITY_LR:-0.00005}"
+ceil_div() {
+  local numerator="$1"
+  local denominator="$2"
+  echo $(( (numerator + denominator - 1) / denominator ))
+}
+TRAIN_STEPS="${TRAIN_STEPS:-$(ceil_div "$TARGET_TOKENS" $((BATCH_SIZE * SEQ_LEN)))}"
+export PYTHONPATH="$REMOTE_REPO/src:$SSM_REPO_PATH"
+mkdir -p "$OUTPUT_BASE" "$CHECKPOINT_BASE" "$OUTPUT_BASE/configs" "$OUTPUT_BASE/diagnostics"
+cat > "$OUTPUT_BASE/run_plan.json" <<JSON
+{
+  "purpose": "pre_200m_stability_gate_before_4b_sft_chatbot_run",
+  "candidate": "pure_ssm_196m_stabilized_m256_h32",
+  "target_tokens": $TARGET_TOKENS,
+  "train_steps": $TRAIN_STEPS,
+  "batch_size": $BATCH_SIZE,
+  "seq_len": $SEQ_LEN,
+  "learning_rate": $LEARNING_RATE,
+  "weight_decay": $WEIGHT_DECAY,
+  "checks": [
+    "bounded pretrain loss/eval/grad telemetry",
+    "activation scale probe",
+    "sample generation",
+    "tiny SFT overfit probe"
+  ]
+}
+JSON
+printf '\n============================================================\n'
+printf 'Pre-200M stability gate: pure SSM stabilized candidate\n'
+printf 'target_tokens=%s batch=%s seq_len=%s train_steps=%s eval_batches=%s\n' \
+  "$TARGET_TOKENS" "$BATCH_SIZE" "$SEQ_LEN" "$TRAIN_STEPS" "$EVAL_BATCHES"
+printf '============================================================\n'
+"$PYTHON_BIN" scripts/benchmark_taonet_real_tokens.py \
+  --data-path "$DATA_PATH" \
+  --text-field text \
+  --tokenizer-type sentencepiece \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --max-records "$MAX_RECORDS" \
+  --max-tokens "$MAX_TOKENS" \
+  --eval-fraction 0.1 \
+  --architectures taonet_ssm \
+  --batch-sizes "$BATCH_SIZE" \
+  --seq-len "$SEQ_LEN" \
+  --hidden-dim 1024 \
+  --num-layers 18 \
+  --num-heads 8 \
+  --d-latent-kv 768 \
+  --d-rope 128 \
+  --hidden-dim-ff 3072 \
+  --dropout 0.0 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 32 \
+  --ssm-mixer-dims 256 \
+  --ssm-num-lanes-list 2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes split \
+  --ssm-split-mixes none \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --dtype bf16 \
+  --device cuda \
+  --warmup 1 \
+  --repeats 2 \
+  --backward \
+  --train-steps "$TRAIN_STEPS" \
+  --train-log-every "$TRAIN_LOG_EVERY" \
+  --learning-rate "$LEARNING_RATE" \
+  --weight-decay "$WEIGHT_DECAY" \
+  --max-grad-norm 1.0 \
+  --eval-batches "$EVAL_BATCHES" \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1 \
+  --ssm-branch-rms-norm \
+  --ssm-branch-clip-value 1.0 \
+  --block-residual-rms-norm \
+  --block-residual-rms-target 1.0 \
+  --output-dir "$OUTPUT_BASE/pretrain" \
+  --resume-completed \
+  --incremental-output \
+  --save-case-checkpoints \
+  --checkpoint-dir "$CHECKPOINT_BASE/pretrain"
+PRETRAIN_CKPT="$CHECKPOINT_BASE/pretrain/latest.pt"
+if [[ ! -f "$PRETRAIN_CKPT" ]]; then
+  echo "Expected pretrain checkpoint missing: $PRETRAIN_CKPT" >&2
+  exit 2
+fi
+"$PYTHON_BIN" scripts/diagnostics/activation_probe.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --data-path "$DATA_PATH" \
+  --text-field text \
+  --output "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_latest.json" \
+  --batch-size 2 \
+  --seq-len "$SEQ_LEN" \
+  --device cuda \
+  --dtype bfloat16
+"$PYTHON_BIN" scripts/diagnostics/generate_checkpoint_samples.py \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --tokenizer-path "$TOKENIZER_PATH" \
+  --output "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_latest.json" \
+  --max-new-tokens 80 \
+  --temperature 0.8 \
+  --top-p 0.9 \
+  --prompt "The purpose of artificial intelligence is" \
+  --prompt "In a small village," \
+  --prompt "<user>Hello, who are you?<assistant>"
+SFT_CONFIG="$OUTPUT_BASE/configs/sft_sanity.yaml"
+cat > "$SFT_CONFIG" <<YAML
+model:
+  architecture_type: taonet_ssm
+  vocab_size: 8192
+  hidden_dim: 1024
+  num_layers: 18
+  num_heads: 8
+  max_seq_length: $SEQ_LEN
+  d_latent_kv: 768
+  d_rope: 128
+  hidden_dim_ff: 3072
+  dropout: 0.0
+  gqa_groups: 1
+  use_factorized_embedding: false
+  d_embed_rank: 96
+  init_std: 0.02
+  ssm_core: dplr
+  ssm_hidden_dim: 32
+  ssm_mixer_dim: 256
+  ssm_num_lanes: 2
+  ssm_lane_combine: channel
+  ssm_lane_mode: split
+  ssm_split_mix: none
+  ssm_rank: 1
+  ssm_max_low_rank_scale: 0.1
+  ssm_finite_tail_correction: false
+  ssm_discretization: bilinear
+  ssm_kernel_mode: conv
+  ssm_kernel_threshold: 64
+  ssm_dt_min: 1e-3
+  ssm_dt_max: 1e-1
+  ssm_dt_init: 1e-2
+  ssm_use_d: true
+  ssm_activation: gelu
+  ssm_gate: true
+  ssm_input_gate: true
+  ssm_gate_type: channel
+  ssm_use_padding_mask: false
+  ssm_layer_scale_init: 0.1
+  ssm_branch_rms_norm: true
+  ssm_branch_clip_value: 1.0
+  block_residual_rms_norm: true
+  block_residual_rms_target: 1.0
+  ssm_local_shift: true
+  ssm_local_shift_init: 0.1
+  ssm_local_shift_per_channel: true
+dataset:
+  split: train
+  instruction_column: input
+  response_column: output
+  local: true
+  jsonl_path: $SFT_DATA_PATH
+  samples_per_chunk: 2000
+  tokenizer_type: sentencepiece
+  tokenizer_path: $TOKENIZER_PATH
+  tokenizer_threads: 8
+checkpoint_path: $PRETRAIN_CKPT
+user_token: "<user>"
+assistant_token: "<assistant>"
+response_loss_only: true
+batch_size: $BATCH_SIZE
+num_epochs: 100000
+max_steps: $SFT_SANITY_STEPS
+gradient_accumulation_steps: 1
+max_grad_norm: 1.0
+optimizer:
+  optimizer_type: adamw
+  learning_rate: $SFT_SANITY_LR
+  weight_decay: 0.0
+  betas: [0.9, 0.999]
+  eps: 1e-8
+scheduler:
+  scheduler_type: linearWarmup
+  warmup_steps: 0
+dtype: bfloat16
+device: cuda
+checkpoint_dir: $CHECKPOINT_BASE/sft_sanity
+save_every_steps: 5000
+save_best_model: false
+keep_last_n_checkpoints: 1
+eval_every_steps: 5000
+eval_samples: 32
+log_every_steps: 10
+aim_repo: $OUTPUT_BASE/.aim-sft-sanity
+seed: 43
+num_workers: 0
+pin_memory: true
+YAML
+"$PYTHON_BIN" scripts/diagnostics/sft_sanity_check.py \
+  --config "$SFT_CONFIG" \
+  --checkpoint "$PRETRAIN_CKPT" \
+  --output "$OUTPUT_BASE/diagnostics/sft_sanity_pretrain_latest.json" \
+  --samples "$SFT_SANITY_SAMPLES" \
+  --steps "$SFT_SANITY_STEPS" \
+  --lr "$SFT_SANITY_LR" \
+  --log-every 20 \
+  --device cuda \
+  --dtype bfloat16 \
+  --ssm-branch-rms-norm \
+  --ssm-branch-clip-value 1.0 \
+  --block-residual-rms-norm \
+  --block-residual-rms-target 1.0
+cat > "$OUTPUT_BASE/gate_summary.json" <<JSON
+{
+  "state": "completed",
+  "pretrain_checkpoint": "$PRETRAIN_CKPT",
+  "pretrain_results": "$OUTPUT_BASE/pretrain/taonet_real_token_benchmark.json",
+  "activation_probe": "$OUTPUT_BASE/diagnostics/activation_probe_pretrain_latest.json",
+  "generation_samples": "$OUTPUT_BASE/diagnostics/generation_samples_pretrain_latest.json",
+  "sft_sanity": "$OUTPUT_BASE/diagnostics/sft_sanity_pretrain_latest.json"
+}
+JSON
+echo "Pre-200M stability gate completed."
+cat "$OUTPUT_BASE/gate_summary.json"

code/TaoTrain/scripts/remote/submit_200m_base_suite.sh ADDED Viewed

	@@ -0,0 +1,16 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-taotern-200m-base-suite-$(date +%Y%m%d-%H%M%S)}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+JOB_COMMAND="REMOTE_REPO=$REMOTE_REPO PYTHON_BIN=$PYTHON_BIN SSM_REPO_PATH=$SSM_REPO_PATH bash scripts/remote/run_200m_base_suite.sh"
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/remote/submit_200m_branch_only_chat.sh ADDED Viewed

	@@ -0,0 +1,32 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-taotern-200m-branch-only-chat-$(date +%Y%m%d-%H%M%S)}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+PRETRAIN_TOKENS="${PRETRAIN_TOKENS:-4000000000}"
+SFT_STEPS="${SFT_STEPS:-50000}"
+PRETRAIN_LR="${PRETRAIN_LR:-0.0008}"
+SFT_LR="${SFT_LR:-0.00005}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+LOG_EVERY="${LOG_EVERY:-100}"
+SAVE_EVERY="${SAVE_EVERY:-100000}"
+SFT_SAVE_EVERY="${SFT_SAVE_EVERY:-10000}"
+TOKENIZER_THREADS="${TOKENIZER_THREADS:-8}"
+SAMPLES_PER_CHUNK="${SAMPLES_PER_CHUNK:-2000}"
+BLOCK_RESIDUAL_RMS_CAP="${BLOCK_RESIDUAL_RMS_CAP:-}"
+JOB_COMMAND="REMOTE_REPO=$REMOTE_REPO PYTHON_BIN=$PYTHON_BIN SSM_REPO_PATH=$SSM_REPO_PATH DATA_PATH=$DATA_PATH SFT_DATA_PATH=$SFT_DATA_PATH TOKENIZER_PATH=$TOKENIZER_PATH SEQ_LEN=$SEQ_LEN BATCH_SIZE=$BATCH_SIZE PRETRAIN_TOKENS=$PRETRAIN_TOKENS SFT_STEPS=$SFT_STEPS PRETRAIN_LR=$PRETRAIN_LR SFT_LR=$SFT_LR WEIGHT_DECAY=$WEIGHT_DECAY LOG_EVERY=$LOG_EVERY SAVE_EVERY=$SAVE_EVERY SFT_SAVE_EVERY=$SFT_SAVE_EVERY TOKENIZER_THREADS=$TOKENIZER_THREADS SAMPLES_PER_CHUNK=$SAMPLES_PER_CHUNK BLOCK_RESIDUAL_RMS_CAP=$BLOCK_RESIDUAL_RMS_CAP bash scripts/remote/run_200m_branch_only_chat.sh"
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/remote/submit_200m_until_selection.sh ADDED Viewed

	@@ -0,0 +1,28 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-taotern-200m-until-selection-$(date +%Y%m%d-%H%M%S)}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+PILOT_TOKENS="${PILOT_TOKENS:-300000000}"
+SERIOUS_TOKENS="${SERIOUS_TOKENS:-1000000000}"
+PILOT_EVAL_BATCHES="${PILOT_EVAL_BATCHES:-64}"
+SERIOUS_EVAL_BATCHES="${SERIOUS_EVAL_BATCHES:-128}"
+MAX_TOKENS="${MAX_TOKENS:-50000000}"
+MAX_RECORDS="${MAX_RECORDS:-100000}"
+LEARNING_RATE="${LEARNING_RATE:-0.0003}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+JOB_COMMAND="REMOTE_REPO=$REMOTE_REPO PYTHON_BIN=$PYTHON_BIN SSM_REPO_PATH=$SSM_REPO_PATH DATA_PATH=$DATA_PATH TOKENIZER_PATH=$TOKENIZER_PATH SEQ_LEN=$SEQ_LEN BATCH_SIZE=$BATCH_SIZE PILOT_TOKENS=$PILOT_TOKENS SERIOUS_TOKENS=$SERIOUS_TOKENS PILOT_EVAL_BATCHES=$PILOT_EVAL_BATCHES SERIOUS_EVAL_BATCHES=$SERIOUS_EVAL_BATCHES MAX_TOKENS=$MAX_TOKENS MAX_RECORDS=$MAX_RECORDS LEARNING_RATE=$LEARNING_RATE WEIGHT_DECAY=$WEIGHT_DECAY bash scripts/remote/run_200m_until_selection.sh"
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/remote/submit_detached_job.sh ADDED Viewed

	@@ -0,0 +1,118 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-}"
+JOB_COMMAND="${JOB_COMMAND:-}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+SESSION_NAME="${SESSION_NAME:-}"
+if [[ -z "$RUN_ID" ]]; then
+  echo "RUN_ID is required" >&2
+  exit 2
+fi
+if [[ -z "$JOB_COMMAND" ]]; then
+  echo "JOB_COMMAND is required" >&2
+  exit 2
+fi
+if ! command -v tmux >/dev/null 2>&1; then
+  echo "tmux is required for detached jobs" >&2
+  exit 2
+fi
+safe_run_id="$(printf '%s' "$RUN_ID" | tr -c 'A-Za-z0-9_.-' '_')"
+session="${SESSION_NAME:-taotern_${safe_run_id}}"
+job_dir="${JOB_ROOT%/}/${safe_run_id}"
+output_dir="${OUTPUT_DIR:-$job_dir/outputs}"
+checkpoint_dir="${CHECKPOINT_DIR:-$job_dir/checkpoints}"
+mkdir -p "$job_dir" "$output_dir" "$checkpoint_dir"
+cat > "$job_dir/command.sh" <<EOF
+#!/usr/bin/env bash
+set -euo pipefail
+export REPOBRIDGE_OUTPUT_DIR="$output_dir"
+export TAOTERN_CHECKPOINT_DIR="$checkpoint_dir"
+cd "$(pwd)"
+$JOB_COMMAND
+EOF
+chmod +x "$job_dir/command.sh"
+cat > "$job_dir/run.sh" <<'EOF'
+#!/usr/bin/env bash
+set -Eeuo pipefail
+job_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+status_json="$job_dir/status.json"
+log_path="$job_dir/train.log"
+telemetry_path="$job_dir/gpu_telemetry_nvidia_smi.csv"
+write_status() {
+  local state="$1"
+  local extra="${2:-}"
+  local now
+  now="$(date -Iseconds)"
+  cat > "$status_json" <<STATUS
+{"state":"$state","updated_at":"$now","job_dir":"$job_dir"$extra}
+STATUS
+}
+write_status "running" ",\"start_time\":\"$(date -Iseconds)\""
+rm -f "$job_dir/DONE" "$job_dir/FAILED"
+monitor_pid=""
+if command -v nvidia-smi >/dev/null 2>&1; then
+  (
+    while true; do
+      date -Iseconds
+      nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw,temperature.gpu --format=csv,noheader,nounits
+      sleep 5
+    done
+  ) > "$telemetry_path" 2>&1 &
+  monitor_pid="$!"
+fi
+cleanup() {
+  if [[ -n "$monitor_pid" ]]; then
+    kill "$monitor_pid" 2>/dev/null || true
+    wait "$monitor_pid" 2>/dev/null || true
+  fi
+}
+trap cleanup EXIT
+set +e
+"$job_dir/command.sh" > "$log_path" 2>&1
+exit_code="$?"
+set -e
+if [[ "$exit_code" -eq 0 ]]; then
+  touch "$job_dir/DONE"
+  write_status "completed" ",\"exit_code\":0,\"end_time\":\"$(date -Iseconds)\""
+else
+  echo "$exit_code" > "$job_dir/FAILED"
+  write_status "failed" ",\"exit_code\":$exit_code,\"end_time\":\"$(date -Iseconds)\""
+fi
+exit "$exit_code"
+EOF
+chmod +x "$job_dir/run.sh"
+cat > "$job_dir/status.json" <<STATUS
+{"state":"submitted","updated_at":"$(date -Iseconds)","job_dir":"$job_dir","session":"$session","output_dir":"$output_dir","checkpoint_dir":"$checkpoint_dir"}
+STATUS
+if tmux has-session -t "$session" 2>/dev/null; then
+  echo "tmux session already exists: $session" >&2
+  echo "$job_dir"
+  exit 3
+fi
+tmux new-session -d -s "$session" "bash '$job_dir/run.sh'"
+echo "Submitted detached job"
+echo "  run_id: $safe_run_id"
+echo "  session: $session"
+echo "  job_dir: $job_dir"
+echo "  output_dir: $output_dir"
+echo "  checkpoint_dir: $checkpoint_dir"

code/TaoTrain/scripts/remote/submit_pre_200m_branch_only_gate.sh ADDED Viewed

	@@ -0,0 +1,33 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-taotern-pre-200m-branch-only-gate-$(date +%Y%m%d-%H%M%S)}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+TARGET_TOKENS="${TARGET_TOKENS:-100000000}"
+MAX_TOKENS="${MAX_TOKENS:-100000000}"
+MAX_RECORDS="${MAX_RECORDS:-200000}"
+EVAL_BATCHES="${EVAL_BATCHES:-128}"
+LEARNING_RATE="${LEARNING_RATE:-0.0008}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-1000}"
+SFT_SANITY_SAMPLES="${SFT_SANITY_SAMPLES:-4}"
+SFT_SANITY_STEPS="${SFT_SANITY_STEPS:-120}"
+SFT_SANITY_LR="${SFT_SANITY_LR:-0.00005}"
+BLOCK_RESIDUAL_RMS_CAP="${BLOCK_RESIDUAL_RMS_CAP:-}"
+JOB_COMMAND="REMOTE_REPO=$REMOTE_REPO PYTHON_BIN=$PYTHON_BIN SSM_REPO_PATH=$SSM_REPO_PATH DATA_PATH=$DATA_PATH SFT_DATA_PATH=$SFT_DATA_PATH TOKENIZER_PATH=$TOKENIZER_PATH SEQ_LEN=$SEQ_LEN BATCH_SIZE=$BATCH_SIZE TARGET_TOKENS=$TARGET_TOKENS MAX_TOKENS=$MAX_TOKENS MAX_RECORDS=$MAX_RECORDS EVAL_BATCHES=$EVAL_BATCHES LEARNING_RATE=$LEARNING_RATE WEIGHT_DECAY=$WEIGHT_DECAY TRAIN_LOG_EVERY=$TRAIN_LOG_EVERY SFT_SANITY_SAMPLES=$SFT_SANITY_SAMPLES SFT_SANITY_STEPS=$SFT_SANITY_STEPS SFT_SANITY_LR=$SFT_SANITY_LR BLOCK_RESIDUAL_RMS_CAP=$BLOCK_RESIDUAL_RMS_CAP bash scripts/remote/run_pre_200m_branch_only_gate.sh"
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/remote/submit_pre_200m_stability_gate.sh ADDED Viewed

	@@ -0,0 +1,32 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-taotern-pre-200m-stability-gate-$(date +%Y%m%d-%H%M%S)}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+SFT_DATA_PATH="${SFT_DATA_PATH:-/home/student/Data/TaoData/sft.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SEQ_LEN="${SEQ_LEN:-512}"
+BATCH_SIZE="${BATCH_SIZE:-8}"
+TARGET_TOKENS="${TARGET_TOKENS:-20000000}"
+MAX_TOKENS="${MAX_TOKENS:-50000000}"
+MAX_RECORDS="${MAX_RECORDS:-120000}"
+EVAL_BATCHES="${EVAL_BATCHES:-64}"
+LEARNING_RATE="${LEARNING_RATE:-0.0008}"
+WEIGHT_DECAY="${WEIGHT_DECAY:-0.01}"
+TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-250}"
+SFT_SANITY_SAMPLES="${SFT_SANITY_SAMPLES:-4}"
+SFT_SANITY_STEPS="${SFT_SANITY_STEPS:-120}"
+SFT_SANITY_LR="${SFT_SANITY_LR:-0.00005}"
+JOB_COMMAND="REMOTE_REPO=$REMOTE_REPO PYTHON_BIN=$PYTHON_BIN SSM_REPO_PATH=$SSM_REPO_PATH DATA_PATH=$DATA_PATH SFT_DATA_PATH=$SFT_DATA_PATH TOKENIZER_PATH=$TOKENIZER_PATH SEQ_LEN=$SEQ_LEN BATCH_SIZE=$BATCH_SIZE TARGET_TOKENS=$TARGET_TOKENS MAX_TOKENS=$MAX_TOKENS MAX_RECORDS=$MAX_RECORDS EVAL_BATCHES=$EVAL_BATCHES LEARNING_RATE=$LEARNING_RATE WEIGHT_DECAY=$WEIGHT_DECAY TRAIN_LOG_EVERY=$TRAIN_LOG_EVERY SFT_SANITY_SAMPLES=$SFT_SANITY_SAMPLES SFT_SANITY_STEPS=$SFT_SANITY_STEPS SFT_SANITY_LR=$SFT_SANITY_LR bash scripts/remote/run_pre_200m_stability_gate.sh"
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/remote/submit_ssm_improvement_sweep.sh ADDED Viewed

	@@ -0,0 +1,71 @@

+#!/usr/bin/env bash
+set -euo pipefail
+RUN_ID="${RUN_ID:-ssm-improvement-sweep-$(date +%Y%m%d-%H%M%S)}"
+DATA_PATH="${DATA_PATH:-/home/student/Data/TaoData/pretrain.jsonl}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model}"
+SSM_REPO_PATH="${SSM_REPO_PATH:-/home/student/YouZheng/gamma_ssm_repo}"
+PYTHON_BIN="${PYTHON_BIN:-/home/student/.venv/bin/python}"
+JOB_ROOT="${JOB_ROOT:-/home/student/YouZheng/jobs/taotern}"
+REMOTE_REPO="${REMOTE_REPO:-$(pwd)}"
+# This is intentionally below the eventual 200M scale. It is a model-selection
+# sweep that can run unattended and resume completed benchmark cases.
+MAX_TOKENS="${MAX_TOKENS:-100000000}"
+MAX_RECORDS="${MAX_RECORDS:-150000}"
+TRAIN_STEPS="${TRAIN_STEPS:-5000}"
+EVAL_BATCHES="${EVAL_BATCHES:-96}"
+BATCH_SIZES="${BATCH_SIZES:-32,64}"
+SEQ_LEN="${SEQ_LEN:-512}"
+JOB_COMMAND="PYTHONPATH=$REMOTE_REPO/src:$SSM_REPO_PATH $PYTHON_BIN scripts/benchmark_taonet_real_tokens.py \
+  --data-path $DATA_PATH \
+  --text-field text \
+  --tokenizer-type sentencepiece \
+  --tokenizer-path $TOKENIZER_PATH \
+  --max-records $MAX_RECORDS \
+  --max-tokens $MAX_TOKENS \
+  --eval-fraction 0.1 \
+  --architectures taonet,taonet_ssm,taonet_hybrid \
+  --batch-sizes $BATCH_SIZES \
+  --seq-len $SEQ_LEN \
+  --hidden-dim 256 \
+  --num-layers 4 \
+  --num-heads 4 \
+  --d-latent-kv 192 \
+  --hidden-dim-ff 1024 \
+  --ssm-core dplr \
+  --ssm-hidden-dims 16,32 \
+  --ssm-mixer-dims 128,256 \
+  --ssm-num-lanes-list 1,2 \
+  --ssm-lane-combine channel \
+  --ssm-lane-modes full,split \
+  --ssm-split-mixes none,hadamard \
+  --ssm-rank 1 \
+  --ssm-kernel-mode conv \
+  --no-ssm-finite-tail-correction \
+  --ssm-gate-types channel \
+  --hybrid-patterns attention_first,ssm_first,single_ssm_middle,single_ssm_late \
+  --dtype bf16 \
+  --device cuda \
+  --warmup 2 \
+  --repeats 3 \
+  --backward \
+  --train-steps $TRAIN_STEPS \
+  --learning-rate 0.0008 \
+  --weight-decay 0.01 \
+  --eval-batches $EVAL_BATCHES \
+  --ssm-local-shift \
+  --ssm-local-shift-per-channel \
+  --ssm-local-shift-init 0.1 \
+  --output-dir \"\$REPOBRIDGE_OUTPUT_DIR\" \
+  --resume-completed \
+  --incremental-output \
+  --save-case-checkpoints \
+  --checkpoint-dir \"\$TAOTERN_CHECKPOINT_DIR\""
+export RUN_ID JOB_ROOT JOB_COMMAND
+export OUTPUT_DIR="${OUTPUT_DIR:-$JOB_ROOT/$RUN_ID/outputs}"
+export CHECKPOINT_DIR="${CHECKPOINT_DIR:-$JOB_ROOT/$RUN_ID/checkpoints}"
+bash scripts/remote/submit_detached_job.sh

code/TaoTrain/scripts/summarize_taonet_benchmark_suite.py ADDED Viewed

	@@ -0,0 +1,138 @@

+from __future__ import annotations
+import argparse
+import csv
+import json
+from pathlib import Path
+from typing import Any
+def _as_float(value: str | None) -> float | None:
+    if value is None or value == "":
+        return None
+    try:
+        return float(value)
+    except ValueError:
+        return None
+def _load_rows(root: Path) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    for csv_path in sorted(root.glob("*/taonet_real_token_benchmark.csv")):
+        variant = csv_path.parent.name
+        with csv_path.open("r", newline="", encoding="utf-8") as handle:
+            for row in csv.DictReader(handle):
+                row = dict(row)
+                row["variant"] = variant
+                rows.append(row)
+    return rows
+def _best_forward_backward(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    candidates = [row for row in rows if row.get("mode") == "forward_backward"]
+    grouped: dict[str, list[dict[str, Any]]] = {}
+    for row in candidates:
+        grouped.setdefault(row["variant"], []).append(row)
+    best_rows = []
+    for variant, items in grouped.items():
+        items.sort(
+            key=lambda row: (
+                _as_float(row.get("eval_loss")) if _as_float(row.get("eval_loss")) is not None else float("inf"),
+                -(_as_float(row.get("eval_accuracy")) or 0.0),
+            )
+        )
+        best_rows.append(items[0])
+    best_rows.sort(
+        key=lambda row: (
+            _as_float(row.get("eval_loss")) if _as_float(row.get("eval_loss")) is not None else float("inf"),
+            -(_as_float(row.get("eval_accuracy")) or 0.0),
+        )
+    )
+    return best_rows
+def _project(row: dict[str, Any]) -> dict[str, Any]:
+    keys = [
+        "variant",
+        "architecture",
+        "hybrid_pattern",
+        "batch_size",
+        "seq_len",
+        "total_params",
+        "ssm_core",
+        "ssm_hidden_dim",
+        "ssm_mixer_dim",
+        "ssm_num_lanes",
+        "ssm_lane_mode",
+        "ssm_split_mix",
+        "tokens_per_s_mean",
+        "eval_loss",
+        "eval_perplexity",
+        "eval_accuracy",
+        "train_final_loss",
+        "train_seconds",
+        "peak_reserved_mb",
+        "case_id",
+        "checkpoint_path",
+    ]
+    return {key: row.get(key, "") for key in keys}
+def _write_markdown(summary: list[dict[str, Any]], path: Path) -> None:
+    headers = [
+        "variant",
+        "architecture",
+        "batch",
+        "params",
+        "eval_loss",
+        "eval_acc",
+        "tok/s",
+        "checkpoint",
+    ]
+    lines = [
+        "# TaoNet Benchmark Suite Summary",
+        "",
+        "| " + " | ".join(headers) + " |",
+        "| " + " | ".join(["---"] * len(headers)) + " |",
+    ]
+    for row in summary:
+        lines.append(
+            "| "
+            + " | ".join(
+                [
+                    str(row["variant"]),
+                    str(row["architecture"]),
+                    str(row["batch_size"]),
+                    str(row["total_params"]),
+                    str(row["eval_loss"]),
+                    str(row["eval_accuracy"]),
+                    str(row["tokens_per_s_mean"]),
+                    str(row["checkpoint_path"]),
+                ]
+            )
+            + " |"
+        )
+    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Summarize a TaoNet benchmark suite output directory.")
+    parser.add_argument("--suite-dir", required=True, help="Directory containing one subdirectory per benchmark variant.")
+    parser.add_argument("--output-json", default="", help="Summary JSON path. Defaults to <suite-dir>/suite_summary.json.")
+    parser.add_argument("--output-md", default="", help="Summary Markdown path. Defaults to <suite-dir>/suite_summary.md.")
+    args = parser.parse_args()
+    suite_dir = Path(args.suite_dir)
+    rows = _load_rows(suite_dir)
+    summary = [_project(row) for row in _best_forward_backward(rows)]
+    json_path = Path(args.output_json) if args.output_json else suite_dir / "suite_summary.json"
+    md_path = Path(args.output_md) if args.output_md else suite_dir / "suite_summary.md"
+    json_path.write_text(json.dumps(summary, indent=2) + "\n", encoding="utf-8")
+    _write_markdown(summary, md_path)
+    print(f"Wrote {json_path}")
+    print(f"Wrote {md_path}")
+if __name__ == "__main__":
+    main()

code/TaoTrain/src/taoTrain.egg-info/dependency_links.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+

code/TaoTrain/src/taoTrain.egg-info/entry_points.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+[console_scripts]
+train = taoTrain.cli:main
+train-tokenizer = taoTrain.cli:train_tokenizer_command
+tui-chat = taoTrain.inference.tui:main

code/TaoTrain/src/taoTrain/__init__.py ADDED Viewed

	@@ -0,0 +1,8 @@

+"""
+TaoTrain: A clean, modular PyTorch LLM training framework.
+For quick usage, see: https://github.com/your-org/taoTrain
+"""
+__version__ = "0.1.0"
+__author__ = "Felix"

code/TaoTrain/src/taoTrain/cli.py ADDED Viewed

	@@ -0,0 +1,387 @@

+"""Main CLI entry point."""
+import sys
+from pathlib import Path
+from typing import Optional
+import click
+import torch
+from taoTrain.config import load_config, load_tokenizer_config, TrainingModeEnum, PretrainConfig, SFTConfig, RLConfig
+from taoTrain.utils import set_seed, get_device
+from taoTrain.core import BaseModel, create_model, create_datasets
+from taoTrain.data import get_dataloader
+from taoTrain.training import PretrainTrainer, SFTTrainer, RLTrainer
+from taoTrain.benchmarks import BenchmarkRunner
+from taoTrain.tokenizers import TokenizerTrainer
+@click.group()
+def main():
+    """TaoTrain: A clean, modular PyTorch LLM training framework."""
+    pass
+@main.command()
+@click.option(
+    "--config",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to training config file (YAML or JSON)",
+)
+def pretrain(config: str):
+    """Pretrain a language model."""
+    _train_command(config, TrainingModeEnum.PRETRAIN)
+@main.command()
+@click.option(
+    "--config",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to training config file (YAML or JSON)",
+)
+def sft(config: str):
+    """Supervised fine-tune a language model."""
+    _train_command(config, TrainingModeEnum.SFT)
+@main.command()
+@click.option(
+    "--config",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to training config file (YAML or JSON)",
+)
+def rl(config: str):
+    """Train with reinforcement learning."""
+    _train_command(config, TrainingModeEnum.RL)
+@main.command()
+@click.option(
+    "--config",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to tokenizer config file (YAML or JSON)",
+)
+def train_tokenizer(config: str):
+    """Train a SentencePiece tokenizer from a YAML/JSON config file."""
+    try:
+        click.echo("🚀 TaoTrain Tokenizer Trainer")
+        click.echo(f"{'=' * 50}")
+        # Load tokenizer config
+        click.echo(f"Loading config from {config}...")
+        tokenizer_config = load_tokenizer_config(config)
+        # Train tokenizer from config
+        result = TokenizerTrainer.train_from_config(tokenizer_config)
+        # Display results
+        click.echo(f"\n{'=' * 50}")
+        click.echo("✅ Tokenizer Training Complete!")
+        click.echo(f"\n📊 Configuration:")
+        click.echo(f"   - Input file: {tokenizer_config.jsonl_path}")
+        click.echo(f"   - Samples: {tokenizer_config.max_samples or 'all'}")
+        click.echo(f"   - Output dir: {result['output_dir']}")
+        click.echo(f"   - Vocab size: {result['vocab_size']}")
+        click.echo(f"   - Model type: {result['model_type']}")
+        if tokenizer_config.special_tokens:
+            click.echo(f"   - Special tokens: {tokenizer_config.special_tokens}")
+        click.echo(f"\n📁 Generated Files:")
+        click.echo(f"   - Model: {result['model_file']}")
+        click.echo(f"   - Vocab: {result['vocab_file']}")
+        click.echo(f"\n📝 Next Steps:")
+        click.echo(f"   1. Use this tokenizer in your pretraining config:")
+        click.echo(f"       dataset:")
+        click.echo(f"         local: true")
+        click.echo(f"         jsonl_path: {tokenizer_config.jsonl_path}")
+        click.echo(f"         tokenizer_path: {result['model_file']}")
+        click.echo(f"")
+        click.echo(f"   2. Run pretraining with:")
+        click.echo(f"       train pretrain --config your_config.yaml")
+    except ImportError as e:
+        click.echo(f"❌ Error: {e}", err=True)
+        sys.exit(1)
+    except FileNotFoundError as e:
+        click.echo(f"❌ File Error: {e}", err=True)
+        sys.exit(1)
+    except ValueError as e:
+        click.echo(f"❌ Validation Error: {e}", err=True)
+        sys.exit(1)
+    except Exception as e:
+        click.echo(f"❌ Unexpected Error: {e}", err=True)
+        click.echo(f"   Please report this issue.", err=True)
+        sys.exit(1)
+@click.command()
+@click.option(
+    "--jsonl-path",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to JSONL file containing training data",
+)
+@click.option(
+    "--output-dir",
+    type=click.Path(),
+    default="tokenizers",
+    help="Directory to save tokenizer files",
+)
+@click.option(
+    "--vocab-size",
+    type=int,
+    default=50000,
+    help="Vocabulary size for the tokenizer",
+)
+@click.option(
+    "--model-type",
+    type=click.Choice(["unigram", "bpe", "char", "word"]),
+    default="unigram",
+    help="SentencePiece model type",
+)
+@click.option(
+    "--character-coverage",
+    type=float,
+    default=0.9995,
+    help="Character coverage for SentencePiece",
+)
+@click.option(
+    "--tokenizer-prefix",
+    type=str,
+    default=None,
+    help="Prefix for tokenizer output files (default: model_type)",
+)
+def train_tokenizer_command(
+    jsonl_path: str,
+    output_dir: str,
+    vocab_size: int,
+    model_type: str,
+    character_coverage: float,
+    tokenizer_prefix: Optional[str],
+):
+    """Train a SentencePiece tokenizer from JSONL data."""
+    try:
+        click.echo("🚀 TaoTrain Tokenizer Trainer")
+        click.echo(f"{'=' * 50}")
+        # Train tokenizer
+        result = TokenizerTrainer.train_sentencepiece(
+            jsonl_path=jsonl_path,
+            output_dir=output_dir,
+            vocab_size=vocab_size,
+            model_type=model_type,
+            character_coverage=character_coverage,
+            tokenizer_prefix=tokenizer_prefix,
+        )
+        # Display results
+        click.echo(f"\n{'=' * 50}")
+        click.echo("✅ Tokenizer Training Complete!")
+        click.echo(f"\n📊 Configuration:")
+        click.echo(f"   - Input file: {jsonl_path}")
+        click.echo(f"   - Output dir: {result['output_dir']}")
+        click.echo(f"   - Vocab size: {result['vocab_size']}")
+        click.echo(f"   - Model type: {result['model_type']}")
+        click.echo(f"\n📁 Generated Files:")
+        click.echo(f"   - Model: {result['model_file']}")
+        click.echo(f"   - Vocab: {result['vocab_file']}")
+        click.echo(f"\n📝 Next Steps:")
+        click.echo(f"   1. Use this tokenizer in your pretraining config:")
+        click.echo(f"       dataset:")
+        click.echo(f"         local: true")
+        click.echo(f"         jsonl_path: {jsonl_path}")
+        click.echo(f"         tokenizer_path: {result['model_file']}")
+        click.echo(f"")
+        click.echo(f"   2. Run pretraining with:")
+        click.echo(f"       train pretrain --config your_config.yaml")
+    except ImportError as e:
+        click.echo(f"❌ Error: {e}", err=True)
+        sys.exit(1)
+    except FileNotFoundError as e:
+        click.echo(f"❌ File Error: {e}", err=True)
+        sys.exit(1)
+    except ValueError as e:
+        click.echo(f"❌ Validation Error: {e}", err=True)
+        sys.exit(1)
+    except Exception as e:
+        click.echo(f"❌ Unexpected Error: {e}", err=True)
+        click.echo(f"   Please report this issue.", err=True)
+        sys.exit(1)
+# Keep legacy CLI command as train-tokenizer-legacy for backward compatibility
+main.add_command(train_tokenizer_command, name="train-tokenizer-legacy")
+def _train_command(config_path: str, mode: TrainingModeEnum):
+    """Internal training command."""
+    try:
+        # Load config
+        click.echo(f"Loading config from {config_path}...")
+        train_config = load_config(config_path, mode)
+        # Set seed
+        set_seed(train_config.seed)
+        # Get device
+        device = get_device(train_config.device)
+        click.echo(f"Using device: {device}")
+        # Create model
+        click.echo("Creating model...")
+        model = create_model(train_config, device)
+        total_params, trainable_params = _count_params(model)
+        click.echo(f"  - Total parameters: {total_params:,}")
+        click.echo(f"  - Trainable parameters: {trainable_params:,}")
+        # Load pretrained checkpoint if provided (for SFT/RL)
+        if train_config.checkpoint_path:
+            click.echo(f"Loading pretrained checkpoint from {train_config.checkpoint_path}...")
+            from taoTrain.checkpointing.checkpoint import CheckpointManager
+            checkpoint_manager = CheckpointManager(train_config.checkpoint_dir)
+            checkpoint = checkpoint_manager.load(train_config.checkpoint_path, device=device)
+            # CheckpointManager.load() normalizes format and ensures 'model_state' key exists
+            if "model_state" in checkpoint:
+                model.load_state_dict(checkpoint["model_state"], strict=False)
+                click.echo("  ✓ Checkpoint loaded successfully")
+            else:
+                raise KeyError(f"Invalid checkpoint format: 'model_state' key not found. "
+                             f"Available keys: {list(checkpoint.keys())}")
+        # Create datasets
+        click.echo("Loading datasets...")
+        train_dataset, val_dataset = create_datasets(train_config)
+        click.echo(f"  - Train samples: {len(train_dataset)}")
+        if val_dataset:
+            click.echo(f"  - Val samples: {len(val_dataset)}")
+        # Select trainer
+        if mode == TrainingModeEnum.PRETRAIN:
+            trainer_class = PretrainTrainer
+        elif mode == TrainingModeEnum.SFT:
+            trainer_class = SFTTrainer
+        elif mode == TrainingModeEnum.RL:
+            trainer_class = RLTrainer
+        else:
+            raise ValueError(f"Unknown training mode: {mode}")
+        # Create trainer
+        click.echo("Setting up trainer...")
+        trainer = trainer_class(
+            model=model,
+            train_dataset=train_dataset,
+            val_dataset=val_dataset,
+            config=train_config,
+            device=device,
+        )
+        # Training loop
+        click.echo("\nStarting training...\n")
+        for epoch in range(train_config.num_epochs):
+            if train_config.max_steps and trainer.global_step >= train_config.max_steps:
+                break
+            epoch_metrics = trainer.train_epoch()
+            click.echo(f"\nEpoch {epoch + 1} complete")
+            click.echo(f"  - Loss: {epoch_metrics.get('loss', 'N/A')}")
+            click.echo(f"  - Learning rate: {epoch_metrics.get('lr', 'N/A')}")
+        # Final checkpoint
+        final_path = Path(train_config.checkpoint_dir) / "final_model.pt"
+        trainer.save_checkpoint(final_path)
+        click.echo(f"\nTraining complete! Final model saved to {final_path}")
+        # Log finish
+        trainer.logger.finish()
+    except Exception as e:
+        click.echo(f"Error during training: {e}", err=True)
+        sys.exit(1)
+@main.command()
+@click.option(
+    "--model",
+    type=click.Path(exists=True),
+    required=True,
+    help="Path to model checkpoint",
+)
+@click.option(
+    "--benchmark-type",
+    type=click.Choice(["all", "perplexity", "throughput", "memory"]),
+    default="all",
+    help="Type of benchmark to run",
+)
+@click.option(
+    "--batch-size",
+    type=int,
+    default=32,
+    help="Batch size for benchmarking",
+)
+@click.option(
+    "--seq-length",
+    type=int,
+    default=1024,
+    help="Sequence length for benchmarking",
+)
+def benchmark(model: str, benchmark_type: str, batch_size: int, seq_length: int):
+    """Benchmark a trained model."""
+    try:
+        click.echo(f"Loading model from {model}...")
+        device = get_device("cuda")
+        runner = BenchmarkRunner.load_from_checkpoint(model, device=device)
+        click.echo("Running benchmarks...\n")
+        if benchmark_type == "throughput" or benchmark_type == "all":
+            click.echo("Throughput benchmark:")
+            results = runner.benchmark_throughput(batch_size, seq_length)
+            for key, val in results.items():
+                click.echo(f"  {key}: {val:.2f}")
+        if benchmark_type == "memory" or benchmark_type == "all":
+            click.echo("\nMemory benchmark:")
+            results = runner.benchmark_memory()
+            for key, val in results.items():
+                click.echo(f"  {key}: {val:.2f}")
+        click.echo("\nBenchmarking complete!")
+    except Exception as e:
+        click.echo(f"Error during benchmarking: {e}", err=True)
+        sys.exit(1)
+@main.command()
+@click.option("--repo", type=str, default=".aim", help="AimStack repository path")
+def view_logs(repo: str):
+    """View training logs with AimStack."""
+    try:
+        import subprocess
+        click.echo(f"Opening AimStack dashboard for repo: {repo}")
+        subprocess.run(["aim", "up", "--repo", repo])
+    except FileNotFoundError:
+        click.echo("Error: 'aim' command not found. Install with: pip install aim", err=True)
+        sys.exit(1)
+def _count_params(model: BaseModel) -> tuple[int, int]:
+    """Count model parameters."""
+    total = sum(p.numel() for p in model.parameters())
+    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    return total, trainable
+if __name__ == "__main__":
+    main()

code/TaoTrain/src/taoTrain/config.py ADDED Viewed

	@@ -0,0 +1,726 @@

+"""Pydantic configuration schemas for TaoTrain."""
+from enum import Enum
+from typing import Optional, Literal
+from pathlib import Path
+import json
+from pydantic import BaseModel as PydanticBaseModel, Field, validator
+import yaml
+# ============================================================================
+# Enums
+# ============================================================================
+class DataTypeEnum(str, Enum):
+    """Data types for training."""
+    FLOAT32 = "float32"
+    FLOAT16 = "float16"
+    BFLOAT16 = "bfloat16"
+class OptimizerEnum(str, Enum):
+    """Supported optimizers."""
+    ADAM = "adam"
+    ADAMW = "adamw"
+    SGD = "sgd"
+    HYBRID_MUON_ADAMW = "hybrid_muon_adamw"
+class ModelArchitectureEnum(str, Enum):
+    """Built-in model architectures."""
+    TRANSFORMER = "transformer"
+    TAONET = "taonet"
+    TAONET_SSM = "taonet_ssm"
+    TAONET_HYBRID = "taonet_hybrid"
+class SchedulerEnum(str, Enum):
+    """Supported learning rate schedulers."""
+    LINEAR_WARMUP = "linearWarmup"
+    COSINE_WARMUP = "cosineWarmup"
+    CONSTANT = "constant"
+class RLMethodEnum(str, Enum):
+    """Supported RL training methods."""
+    PPO = "ppo"
+    DPO = "dpo"
+class TrainingModeEnum(str, Enum):
+    """Training stages."""
+    PRETRAIN = "pretrain"
+    SFT = "sft"
+    RL = "rl"
+# ============================================================================
+# Base Configs
+# ============================================================================
+class BaseConfig(PydanticBaseModel):
+    """Base Pydantic model with utility methods."""
+    class Config:
+        """Pydantic config."""
+        arbitrary_types_allowed = True
+    def to_dict(self) -> dict:
+        """Convert to dictionary."""
+        data = self.model_dump(mode='json')  # Enums -> strings
+        return data
+    def to_json_str(self) -> str:
+        """Convert to JSON string."""
+        return json.dumps(self.to_dict(), indent=2)
+    def save_yaml(self, path: str | Path) -> None:
+        """Save config to YAML file."""
+        path = Path(path)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        with open(path, 'w') as f:
+            yaml.dump(self.to_dict(), f, default_flow_style=False, sort_keys=False)
+    def save_json(self, path: str | Path) -> None:
+        """Save config to JSON file."""
+        path = Path(path)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        with open(path, 'w') as f:
+            f.write(self.to_json_str())
+    @classmethod
+    def load_yaml(cls, path: str | Path) -> "BaseConfig":
+        """Load config from YAML file."""
+        with open(path) as f:
+            data = yaml.safe_load(f)
+        return cls(**data)
+    @classmethod
+    def load_json(cls, path: str | Path) -> "BaseConfig":
+        """Load config from JSON file."""
+        with open(path) as f:
+            data = json.load(f)
+        return cls(**data)
+# ============================================================================
+# Model Config
+# ============================================================================
+class ModelConfig(BaseConfig):
+    """Configuration for model architecture."""
+    architecture_type: ModelArchitectureEnum = Field(
+        default=ModelArchitectureEnum.TRANSFORMER,
+        description="Type of model architecture"
+    )
+    # Transformer-specific
+    vocab_size: int = Field(default=50257, description="Vocabulary size")
+    hidden_dim: int = Field(default=768, description="Hidden dimension")
+    num_layers: int = Field(default=12, description="Number of transformer blocks")
+    num_heads: int = Field(default=12, description="Number of attention heads")
+    head_dim: Optional[int] = Field(
+        default=None,
+        description="Head dimension (defaults to hidden_dim // num_heads)"
+    )
+    intermediate_dim: Optional[int] = Field(
+        default=None,
+        description="FFN intermediate dimension (defaults to 4 * hidden_dim)"
+    )
+    dropout: float = Field(default=0.1, description="Dropout rate")
+    max_seq_length: int = Field(default=2048, description="Maximum sequence length")
+    # TaoNet (DeepSeek MLA) specific
+    d_latent_kv: Optional[int] = Field(
+        default=None,
+        description="KV compression dimension for MLA (defaults to 3/4 * hidden_dim). Only used for taonet architecture."
+    )
+    d_rope: Optional[int] = Field(
+        default=None,
+        description="RoPE dimension per head (defaults to hidden_dim // num_heads). Only used for taonet architecture."
+    )
+    gqa_groups: int = Field(
+        default=1,
+        description="Grouped Query Attention groups (1 = standard MLA, >1 = GQA). Only used for taonet architecture."
+    )
+    hidden_dim_ff: Optional[int] = Field(
+        default=None,
+        description="Feed-forward intermediate dimension (defaults to 4 * hidden_dim)."
+    )
+    use_factorized_embedding: bool = Field(
+        default=False,
+        description="Use low-rank factorized embedding instead of standard embedding (reduces params). Only for taonet."
+    )
+    d_embed_rank: int = Field(
+        default=96,
+        description="Rank dimension for factorized embedding. Only used if use_factorized_embedding=True."
+    )
+    # YaRN (Yet another RoPE eXtension) for context length extension
+    rope_scale: float = Field(
+        default=40.0,
+        description="Base RoPE scale factor (default: 40.0). Controls position frequency base."
+    )
+    yarn_enabled: bool = Field(
+        default=False,
+        description="Enable YaRN (Yet another RoPE eXtension) for context length interpolation."
+    )
+    yarn_alpha: float = Field(
+        default=1.0,
+        description="YaRN interpolation smoothness (1.0=smooth, <1.0=aggressive, >1.0=conservative). Only used if yarn_enabled=True."
+    )
+    # TaoNet-SSM specific: SSM mixer replacing MLA attention
+    ssm_core: Literal["gamma_s4", "dplr"] = Field(
+        default="gamma_s4",
+        description="SSM core used by taonet_ssm. Use dplr for the ternary-aware DPLR SSM."
+    )
+    ssm_hidden_dim: Optional[int] = Field(
+        default=None,
+        description="SSM hidden/state dimension for taonet_ssm. Defaults to d_latent_kv or hidden_dim."
+    )
+    ssm_mixer_dim: Optional[int] = Field(
+        default=None,
+        description="Channel dimension processed by the SSM mixer. Defaults to hidden_dim; smaller values use an input/output projection bottleneck."
+    )
+    ssm_num_lanes: int = Field(
+        default=1,
+        description="Number of independent SSM lanes inside each SSM mixer. Multiple lanes add SSM capacity with cheap elementwise combination."
+    )
+    ssm_lane_combine: Literal["mean", "channel"] = Field(
+        default="mean",
+        description="How to combine multiple SSM lanes. Channel uses learned per-lane/per-channel elementwise weights."
+    )
+    ssm_lane_mode: Literal["full", "split"] = Field(
+        default="full",
+        description="Whether each SSM lane processes the full mixer dimension or a disjoint split of the mixer channels."
+    )
+    ssm_split_mix: Literal["none", "hadamard"] = Field(
+        default="none",
+        description="Optional ternary-friendly cross-lane mixer for split SSM lanes."
+    )
+    ssm_rank: int = Field(
+        default=1,
+        description="Low-rank correction rank for ssm_core=dplr."
+    )
+    ssm_max_low_rank_scale: float = Field(
+        default=0.1,
+        description="Maximum low-rank correction scale for ssm_core=dplr."
+    )
+    ssm_finite_tail_correction: bool = Field(
+        default=True,
+        description="Enable exact finite-length tail correction for ssm_core=dplr. Disable for the faster approximate DPLR path."
+    )
+    ssm_discretization: Literal["bilinear", "zoh", "euler"] = Field(
+        default="bilinear",
+        description="Discretization used by the Gamma SSM mixer."
+    )
+    ssm_kernel_mode: Literal["auto", "recurrent", "conv", "conv_transfer"] = Field(
+        default="auto",
+        description="Gamma SSM execution path. Use auto/conv for full-sequence GPU training, conv_transfer to materialize frequency transfers, recurrent for step-wise tests."
+    )
+    ssm_kernel_threshold: int = Field(
+        default=64,
+        description="Minimum sequence length for auto mode to use the convolutional Gamma SSM path."
+    )
+    ssm_dt_min: float = Field(default=1e-3, description="Minimum learned SSM timestep.")
+    ssm_dt_max: float = Field(default=1e-1, description="Maximum learned SSM timestep.")
+    ssm_dt_init: float = Field(default=1e-2, description="Initial learned SSM timestep.")
+    ssm_use_d: bool = Field(default=True, description="Enable direct skip term D in the Gamma SSM.")
+    ssm_activation: Literal["gelu", "silu", "identity", "linear"] = Field(
+        default="gelu",
+        description="Activation applied to the Gamma SSM branch output."
+    )
+    ssm_gate: bool = Field(default=True, description="Enable output gate on the Gamma SSM branch.")
+    ssm_input_gate: bool = Field(default=True, description="Enable input gate before the Gamma SSM.")
+    ssm_gate_type: Literal["dense", "channel"] = Field(
+        default="dense",
+        description="Gate implementation for enabled SSM input/output gates. Channel gates are elementwise and ternary-friendly."
+    )
+    ssm_use_padding_mask: bool = Field(
+        default=False,
+        description="Apply dataset padding masks inside the SSM. Disabled by default so training can use the convolutional path."
+    )
+    ssm_layer_scale_init: float = Field(
+        default=0.1,
+        description="Initial layer-scale multiplier for the Gamma SSM branch."
+    )
+    ssm_branch_rms_norm: bool = Field(
+        default=False,
+        description="Normalize the SSM residual branch to unit RMS before layer-scale. Useful for stabilizing deep SSM/hybrid runs."
+    )
+    ssm_branch_rms_eps: float = Field(
+        default=1e-6,
+        description="Numerical epsilon for optional SSM branch RMS normalization."
+    )
+    ssm_branch_clip_value: Optional[float] = Field(
+        default=None,
+        description="Optional symmetric clamp applied to the SSM residual branch after layer-scale. None disables clamping."
+    )
+    block_residual_rms_norm: bool = Field(
+        default=False,
+        description="Normalize the residual stream RMS after block residual additions. Intended for stabilizing deep SSM/hybrid experiments."
+    )
+    block_residual_rms_target: float = Field(
+        default=1.0,
+        description="Target per-token RMS when block_residual_rms_norm is enabled."
+    )
+    block_residual_rms_cap: Optional[float] = Field(
+        default=None,
+        description="Optional per-token RMS cap for the residual stream. Unlike block_residual_rms_norm, this only scales down tokens whose RMS exceeds the cap."
+    )
+    block_residual_rms_eps: float = Field(
+        default=1e-6,
+        description="Numerical epsilon for optional block residual RMS normalization."
+    )
+    ssm_local_shift: bool = Field(
+        default=False,
+        description="Add a cheap one-token causal shift/register branch to the taonet_ssm mixer."
+    )
+    ssm_local_shift_init: float = Field(
+        default=0.1,
+        description="Initial scalar weight for the optional one-token local shift/register branch."
+    )
+    ssm_local_shift_per_channel: bool = Field(
+        default=False,
+        description="Use one learned local-shift gain per model channel instead of one scalar."
+    )
+    hybrid_pattern: Literal["attention_first", "ssm_first", "single_ssm_middle", "single_ssm_late"] = Field(
+        default="attention_first",
+        description="Layer pattern for taonet_hybrid when hybrid_ssm_layers is not set."
+    )
+    hybrid_ssm_layers: Optional[str] = Field(
+        default=None,
+        description="Optional comma-separated 0-based layer indices that should use SSM blocks in taonet_hybrid."
+    )
+    # Initializations
+    init_std: float = Field(default=0.02, description="Weight initialization standard deviation")
+    @validator("head_dim", always=True)
+    def validate_head_dim(cls, v, values):
+        """Validate head dimension."""
+        if v is None and 'hidden_dim' in values:
+            return values['hidden_dim'] // values.get('num_heads', 12)
+        return v
+    @validator("intermediate_dim", always=True)
+    def validate_intermediate_dim(cls, v, values):
+        """Validate intermediate dimension."""
+        if v is None and 'hidden_dim' in values:
+            return 4 * values['hidden_dim']
+        return v
+# ============================================================================
+# Dataset Config
+# ============================================================================
+class DatasetConfig(BaseConfig):
+    """Configuration for dataset loading."""
+    # Local vs HuggingFace dataset selection
+    local: bool = Field(default=False, description="Use local JSONL dataset instead of HuggingFace")
+    # HuggingFace dataset fields
+    dataset_name: Optional[str] = Field(default=None, description="HuggingFace dataset name (e.g., 'wikitext', 'openwebtext')")
+    split: str = Field(default="train", description="Dataset split to use")
+    config: Optional[str] = Field(default=None, description="Dataset config if multi-config (e.g., 'wikitext-103')")
+    # Local JSONL dataset fields
+    jsonl_path: Optional[str] = Field(default=None, description="Path to local JSONL dataset file")
+    text_field: str = Field(default="text", description="Name of text field in JSONL")
+    # Text column name varies by dataset
+    text_column: str = Field(default="text", description="Name of text column in dataset")
+    # Preprocessing
+    max_samples: Optional[int] = Field(
+        default=None,
+        description="Limit dataset to N samples (useful for debugging)"
+    )
+    cache_dir: str = Field(default=".cache/datasets", description="HuggingFace cache directory")
+    # For SFT/RL datasets with instruction-response format
+    instruction_column: Optional[str] = Field(default=None, description="Instruction column for SFT")
+    response_column: Optional[str] = Field(default=None, description="Response column for SFT")
+    prompt_column: Optional[str] = Field(default=None, description="Prompt column for RL")
+    # Instruction template
+    instruction_template: Optional[str] = Field(
+        default=None,
+        description="Template for combining instruction and response. E.g., '{instruction}\\n{response}'"
+    )
+    # Tokenizer configuration
+    tokenizer_type: Optional[str] = Field(
+        default=None,
+        description="Tokenizer type: 'huggingface' or 'sentencepiece'. If None, defaults based on tokenizer_path."
+    )
+    tokenizer_path: Optional[str] = Field(
+        default=None,
+        description="Path to saved tokenizer (for SentencePiece: .model file, for HuggingFace: model name or local path)"
+    )
+    # Chunked loading for large JSONL files
+    enable_streaming: bool = Field(
+        default=True,
+        description="Enable streaming/chunked loading for large JSONL files to reduce memory usage"
+    )
+    chunk_size_gb: float = Field(
+        default=5.0,
+        description="Approximate chunk size in GB (ignored if samples_per_chunk is set)"
+    )
+    samples_per_chunk: Optional[int] = Field(
+        default=1000,
+        description="Number of samples per chunk (takes precedence over chunk_size_gb). Default: 1000 samples"
+    )
+    # Chunk caching
+    enable_chunk_metadata_cache: bool = Field(
+        default=True,
+        description="Enable caching of chunk metadata (file scan results) to avoid re-scanning large JSONL files"
+    )
+    enable_chunk_data_cache: bool = Field(
+        default=False,
+        description="Enable caching of actual chunk data as separate files for faster loading (uses more disk space)"
+    )
+    chunk_cache_dir: str = Field(
+        default=".cache/chunks",
+        description="Directory to store chunk metadata and data cache files"
+    )
+    # Tokenization parallelization
+    tokenizer_threads: int = Field(
+        default=1,
+        description="Number of background threads for tokenization (1-32 recommended). Higher values speed up tokenization but increase memory usage."
+    )
+    @validator('jsonl_path', always=True)
+    def validate_dataset_source(cls, v, values):
+        """Validate that either local JSONL or HuggingFace dataset is specified."""
+        local = values.get('local', False)
+        dataset_name = values.get('dataset_name')
+        if local and not v:
+            raise ValueError("jsonl_path must be provided when local=True")
+        if not local and not dataset_name:
+            raise ValueError("dataset_name must be provided when local=False (HuggingFace dataset)")
+        return v
+    @validator('tokenizer_threads')
+    def validate_tokenizer_threads(cls, v):
+        """Validate tokenizer_threads is a positive integer."""
+        if v < 1:
+            raise ValueError("tokenizer_threads must be at least 1")
+        if v > 128:
+            raise ValueError("tokenizer_threads should not exceed 128 (recommended: 1-32)")
+        return v
+# ============================================================================
+# Tokenizer Config
+# ============================================================================
+class TokenizerConfig(BaseConfig):
+    """Configuration for tokenizer training."""
+    # Dataset source
+    jsonl_path: str = Field(description="Path to JSONL file containing training data")
+    text_field: str = Field(default="text", description="Field name in JSONL for text data")
+    # Training configuration
+    vocab_size: int = Field(default=50000, description="Vocabulary size")
+    model_type: str = Field(default="unigram", description="SentencePiece model type (unigram, bpe, char, word)")
+    character_coverage: float = Field(
+        default=0.9995,
+        description="Character coverage for SentencePiece training"
+    )
+    output_dir: str = Field(default="tokenizers", description="Directory to save trained tokenizer")
+    tokenizer_prefix: Optional[str] = Field(
+        default=None,
+        description="Prefix for tokenizer output files (default: model_type)"
+    )
+    # SentencePiece token IDs
+    unk_id: int = Field(default=0, description="Unknown token ID")
+    bos_id: int = Field(default=1, description="Beginning of sentence token ID")
+    eos_id: int = Field(default=2, description="End of sentence token ID")
+    pad_id: int = Field(default=3, description="Padding token ID")
+    # Custom special tokens - add custom tokens like <PAD>, <EOS>, <BOS>, <UNK>, <think>, <user>, <assistant>, <image>
+    special_tokens: Optional[dict[str, int]] = Field(
+        default=None,
+        description="Custom special tokens mapping: {token: id}. Example: {'<think>': 4, '<user>': 5, '<assistant>': 6, '<image>': 7}"
+    )
+    # Data sampling
+    max_samples: Optional[int] = Field(
+        default=None,
+        description="Limit training to first N samples from JSONL (useful for quick testing)"
+    )
+    # Tokenizer metadata
+    tokenizer_name: Optional[str] = Field(
+        default=None,
+        description="Optional name for the tokenizer"
+    )
+# ============================================================================
+# Training Config
+# ============================================================================
+class OptimizerConfig(BaseConfig):
+    """Optimizer configuration."""
+    optimizer_type: OptimizerEnum = Field(default=OptimizerEnum.ADAMW, description="Optimizer type")
+    learning_rate: float = Field(default=1e-4, description="Peak learning rate (for Muon 2D weights)")
+    adamw_lr: Optional[float] = Field(
+        default=None,
+        description="Learning rate for AdamW (1D parameters). If None, defaults to learning_rate / 10. Used in hybrid_muon_adamw optimizer."
+    )
+    weight_decay: float = Field(default=1e-2, description="Weight decay (L2 regularization)")
+    betas: tuple[float, float] = Field(default=(0.9, 0.999), description="Adam betas")
+    eps: float = Field(default=1e-8, description="Optimizer epsilon")
+    @validator('adamw_lr', always=True)
+    def set_default_adamw_lr(cls, v, values):
+        """Set default adamw_lr as 1/10 of learning_rate if not specified."""
+        if v is None and 'learning_rate' in values:
+            return values['learning_rate'] / 10
+        return v
+class SchedulerConfig(BaseConfig):
+    """Learning rate scheduler configuration."""
+    scheduler_type: SchedulerEnum = Field(default=SchedulerEnum.LINEAR_WARMUP, description="Scheduler type")
+    warmup_steps: int = Field(default=0, description="Number of warmup steps (takes precedence over warmup_ratio)")
+    warmup_ratio: float = Field(default=0.1, description="Warmup as fraction of total steps (used if warmup_steps=0)")
+    # Cosine scheduler specific
+    num_cycles: float = Field(default=0.5, description="Number of cycles for cosine schedule")
+    last_epoch: int = Field(default=-1, description="Last epoch for scheduler")
+    # TaoNet 3-phase scheduler (warmup -> steady -> cosine decay)
+    steady_ratio: float = Field(
+        default=0.0,
+        description="Fraction of training steps at peak LR before cosine decay (0.0 = no steady phase). Only for cosineWarmup."
+    )
+    min_lr_ratio: float = Field(
+        default=0.0,
+        description="Minimum LR as fraction of peak LR at end of training (0.0 = decay to 0). Only for cosineWarmup."
+    )
+    @validator('warmup_ratio')
+    def validate_warmup_ratio(cls, v):
+        """Validate warmup ratio is between 0 and 1."""
+        if not 0 <= v <= 1:
+            raise ValueError("warmup_ratio must be between 0 and 1")
+        return v
+    @validator('steady_ratio')
+    def validate_steady_ratio(cls, v):
+        """Validate steady ratio is between 0 and 1."""
+        if not 0 <= v <= 1:
+            raise ValueError("steady_ratio must be between 0 and 1")
+        return v
+    @validator('min_lr_ratio')
+    def validate_min_lr_ratio(cls, v):
+        """Validate min_lr_ratio is between 0 and 1."""
+        if not 0 <= v <= 1:
+            raise ValueError("min_lr_ratio must be between 0 and 1")
+        return v
+    @validator('warmup_steps')
+    def validate_warmup_steps(cls, v):
+        """Validate warmup steps is non-negative."""
+        if v < 0:
+            raise ValueError("warmup_steps must be non-negative")
+        return v
+class TrainingConfig(BaseConfig):
+    """Base training configuration shared across all modes."""
+    # Data and model
+    model: ModelConfig = Field(default_factory=ModelConfig, description="Model configuration")
+    dataset: DatasetConfig = Field(description="Dataset configuration")
+    # Training hyperparameters
+    batch_size: int = Field(default=32, description="Batch size per device")
+    num_epochs: int = Field(default=3, description="Number of training epochs")
+    max_steps: Optional[int] = Field(
+        default=None,
+        description="Maximum steps (overrides num_epochs if set)"
+    )
+    gradient_accumulation_steps: int = Field(
+        default=1,
+        description="Gradient accumulation steps"
+    )
+    max_grad_norm: float = Field(default=1.0, description="Gradient clipping max norm")
+    # Optimizer
+    optimizer: OptimizerConfig = Field(
+        default_factory=OptimizerConfig,
+        description="Optimizer configuration"
+    )
+    # Scheduler
+    scheduler: SchedulerConfig = Field(
+        default_factory=SchedulerConfig,
+        description="Learning rate scheduler configuration"
+    )
+    # Data type and device
+    dtype: DataTypeEnum = Field(
+        default=DataTypeEnum.BFLOAT16,
+        description="Training data type"
+    )
+    device: str = Field(default="cuda", description="Device to train on (cuda, cpu)")
+    seed: int = Field(default=42, description="Random seed")
+    # Checkpointing
+    checkpoint_dir: str = Field(default="checkpoints", description="Directory to save checkpoints")
+    checkpoint_path: Optional[str] = Field(
+        default=None,
+        description="Path to load pretrained checkpoint (for SFT/RL). If provided, loads weights before training starts."
+    )
+    save_every_steps: int = Field(default=500, description="Save checkpoint every N steps")
+    keep_last_n_checkpoints: int = Field(default=3, description="Keep only last N checkpoints")
+    save_best_model: bool = Field(default=True, description="Save best model based on validation loss")
+    # Validation
+    eval_every_steps: int = Field(default=500, description="Evaluate every N steps")
+    eval_samples: int = Field(default=1000, description="Number of validation samples")
+    # Logging
+    log_every_steps: int = Field(default=10, description="Log metrics every N steps")
+    aim_repo: str = Field(default=".aim", description="AimStack repository path")
+    # Misc
+    num_workers: int = Field(default=0, description="Number of DataLoader workers")
+    pin_memory: bool = Field(default=True, description="Pin memory for DataLoader")
+    use_compile: bool = Field(default=False, description="Use torch.compile (experimental)")
+    # Mode
+    mode: TrainingModeEnum = Field(default=TrainingModeEnum.PRETRAIN, description="Training mode")
+# ============================================================================
+# Stage-Specific Configs
+# ============================================================================
+class PretrainConfig(TrainingConfig):
+    """Configuration for pretraining."""
+    mode: Literal[TrainingModeEnum.PRETRAIN] = TrainingModeEnum.PRETRAIN
+    # Pretraining-specific
+    sequence_length: int = Field(default=1024, description="Sequence length for pretraining")
+class SFTConfig(TrainingConfig):
+    """Configuration for supervised fine-tuning."""
+    mode: Literal[TrainingModeEnum.SFT] = TrainingModeEnum.SFT
+    # SFT-specific
+    response_loss_only: bool = Field(
+        default=True,
+        description="Only compute loss on response/assistant tokens (not instruction/user tokens). Uses -100 label masking."
+    )
+    # Multi-turn conversation role tokens
+    user_token: str = Field(
+        default="<user>",
+        description="Special token representing user/instruction role in conversations"
+    )
+    assistant_token: str = Field(
+        default="<assistant>",
+        description="Special token representing assistant/response role in conversations"
+    )
+class RLConfig(TrainingConfig):
+    """Configuration for reinforcement learning training."""
+    mode: Literal[TrainingModeEnum.RL] = TrainingModeEnum.RL
+    # RL-specific
+    rl_method: RLMethodEnum = Field(
+        default=RLMethodEnum.PPO,
+        description="RL training method (PPO or DPO)"
+    )
+    # Reward model
+    reward_model_path: str = Field(description="Path to trained reward model checkpoint")
+    # PPO-specific
+    ppo_epochs: int = Field(default=4, description="PPO inner epochs")
+    ppo_clip_ratio: float = Field(default=0.2, description="PPO clipping ratio")
+    entropy_coeff: float = Field(default=0.01, description="Entropy bonus coefficient")
+    value_loss_coeff: float = Field(default=1.0, description="Value function loss coefficient")
+    # DPO-specific (Direct Preference Optimization)
+    dpo_beta: float = Field(default=0.1, description="DPO inverse temperature (beta)")
+    # Prompt distribution
+    prompt_dataset: Optional[DatasetConfig] = Field(
+        default=None,
+        description="Separate dataset for prompts (if different from main dataset)"
+    )
+    generation_max_length: int = Field(
+        default=256,
+        description="Maximum length for generated responses during RL"
+    )
+# ============================================================================
+# Factory function
+# ============================================================================
+def load_config(path: str | Path, mode: TrainingModeEnum | str) -> TrainingConfig:
+    """Load config file and return appropriate config class."""
+    if isinstance(mode, str):
+        mode = TrainingModeEnum(mode)
+    config_map = {
+        TrainingModeEnum.PRETRAIN: PretrainConfig,
+        TrainingModeEnum.SFT: SFTConfig,
+        TrainingModeEnum.RL: RLConfig,
+    }
+    config_class = config_map[mode]
+    path = Path(path)
+    if path.suffix == '.yaml' or path.suffix == '.yml':
+        return config_class.load_yaml(path)
+    elif path.suffix == '.json':
+        return config_class.load_json(path)
+    else:
+        raise ValueError(f"Unsupported config file format: {path.suffix}")
+def load_tokenizer_config(path: str | Path) -> TokenizerConfig:
+    """Load tokenizer config from YAML or JSON file."""
+    path = Path(path)
+    if path.suffix == '.yaml' or path.suffix == '.yml':
+        return TokenizerConfig.load_yaml(path)
+    elif path.suffix == '.json':
+        return TokenizerConfig.load_json(path)
+    else:
+        raise ValueError(f"Unsupported config file format: {path.suffix}")