# Directory Navigation Guide

This guide maps the project by responsibility. Use it when a new thread needs to find the SSM code, LLM wrapper, data pipeline, train/test scripts, or remote-run artifacts quickly.

## Local Workspace

Current local workspace root:

```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern
```

Main local repos:

| Purpose | Local path | GitHub |
|---|---|---|
| Experiment ledger | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments` | `https://github.com/StarMists/Taotern_LLM_Experiments` |
| SSM model | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM` | `https://github.com/StarMists/gamma_SSM_S4_enhanced` |
| TaoTrain LLM code | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain` | `https://github.com/lobakkang/TaoTrain` |
| Remote run tool | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge` | local tool repo |
| TaoData scripts | not currently cloned under this workspace | `https://github.com/lobakkang/TaoData` |

## Experiment Ledger

Path:

```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments
```

Important files:

| File | Purpose |
|---|---|
| `README.md` | Current SSM LLM status and attention TaoNet comparison |
| `experiments/index.csv` | Searchable run ledger |
| `experiments/runs/<run_id>/manifest.yaml` | Run purpose, commits, data, status |
| `experiments/runs/<run_id>/summary.md` | Human-readable result |
| `experiments/runs/<run_id>/metrics.csv` | Compact metric snapshot |
| `experiments/runs/<run_id>/repobridge.config.json` | Exact remote-run config |
| `experiments/resources/tokenizers/` | Small tokenizer configs only |
| `docs/WORKFLOW.md` | How future runs should be recorded |
| `docs/CURRENT_SSM_LLM_ARCHITECTURE.md` | Current TaoNet-SSM layers, equations, matrices, parameters |

Rule: keep this repo compact. Commit summaries/configs/CSV metrics, not raw output trees or checkpoints.
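
The rule above can be checked mechanically before a commit. The following is a minimal sketch, not part of the ledger tooling: the expected file names come from the table above, while `check_run_dir` and the `MAX_FILE_BYTES` threshold are hypothetical, since this guide does not define "compact" numerically.

```python
from pathlib import Path

# File names taken from the "Important files" table in this guide.
EXPECTED_RUN_FILES = (
    "manifest.yaml",
    "summary.md",
    "metrics.csv",
    "repobridge.config.json",
)

# Hypothetical threshold: the guide does not give a size limit.
MAX_FILE_BYTES = 1_000_000


def check_run_dir(run_dir: Path) -> dict:
    """Report missing ledger files and oversized files for one run folder."""
    missing = [name for name in EXPECTED_RUN_FILES if not (run_dir / name).is_file()]
    oversized = sorted(
        str(path.relative_to(run_dir))
        for path in run_dir.rglob("*")
        if path.is_file() and path.stat().st_size > MAX_FILE_BYTES
    )
    return {"missing": missing, "oversized": oversized}


if __name__ == "__main__":
    # Run from the ledger repo root; prints nothing if the folder is absent.
    runs_root = Path("experiments/runs")
    if runs_root.is_dir():
        for run_dir in sorted(p for p in runs_root.iterdir() if p.is_dir()):
            print(run_dir.name, check_run_dir(run_dir))
```

A run folder that reports anything under `oversized` most likely contains raw outputs or checkpoints that belong on the remote machine rather than in the ledger.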

## SSM Model Repo

Path:

```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM
```

Main code locations:

| Area | File or directory | Notes |
|---|---|---|
| DPLR SSM core | `gamma_space_model/modules/s4_ternary_dplr_ssm.py` | Current main SSM core for TaoNet-SSM |
| Gamma S4 core | `gamma_space_model/modules/ssm_gamma_s4.py` | Older Gamma/S4-style core |
| Baseline Gamma core | `gamma_space_model/modules/ssm_gamma.py` | Baseline/reference SSM |
| SSM blocks | `gamma_space_model/modules/block*.py` | Standalone SSM block wrappers |
| TileLang/Triton fallback area | `csrc/tilelang/` | Capability detection and fallback code |
| Selective scan op wrapper | `gamma_space_model/ops/selective_scan_interface.py` | SSM op interface |
| DPLR profiler | `scripts/profile_dplr_frequency_path.py` | Profiles DPLR frequency path |
| TileLang diagnosis | `scripts/diagnose_tilelang_acceleration.py` | Reports real vs fallback acceleration |
| SSM variant benchmark | `scripts/benchmark_ssm_variants.py` | Standalone SSM benchmarks |
| SSM tests | `tests/test_s4_ternary_dplr_ssm.py`, `tests/test_ssm_gamma*.py` | Core correctness tests |
| Historical record | `EXPERIMENT_RECORD.md` | Older narrative record; new LLM records should be mirrored into this experiment ledger |

When improving the SSM model itself, start from:

```text
gamma_space_model/modules/s4_ternary_dplr_ssm.py
```

When working on hardware acceleration, start from:

```text
csrc/tilelang/
scripts/profile_dplr_frequency_path.py
scripts/diagnose_tilelang_acceleration.py
```

Remote SSM path used by RepoBridge runs:

```text
/home/student/YouZheng/gamma_ssm_repo
```

## TaoTrain LLM Repo

Path:

```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain
```

Main code locations:

| Area | File or directory | Notes |
|---|---|---|
| Attention TaoNet baseline | `src/taoTrain/models/taonet.py` | Reference model for comparisons |
| SSM TaoNet wrapper | `src/taoTrain/models/taonet_ssm.py` | Replaces attention core with SSM mixer |
| Model config schema | `src/taoTrain/config.py` | SSM flags live here: hidden dim, mixer dim, shift, kernel mode |
| Model registry | `src/taoTrain/models/registry.py` | Architecture registration |
| Token/data utilities | `src/taoTrain/data/` | JSONL and tokenization data paths |
| Tokenizer trainer | `src/taoTrain/tokenizers/trainer.py` | SentencePiece training path |
| Training loop | `src/taoTrain/training/trainer.py` | Full trainer implementation |
| CLI | `src/taoTrain/cli.py` | TaoTrain command entry |
| Real-token benchmark | `scripts/benchmark_taonet_real_tokens.py` | Main attention vs SSM benchmark for TaoData token tasks |
| Synthetic token benchmark | `scripts/benchmark_taonet_token_variants.py` | Previous/increment/random token probes |
| TaoData pilot tokenizer config | `configs/tokenizer_taodata_pilot.yaml` | Generated pilot 8k SentencePiece tokenizer |
| SSM pretrain config | `configs/ssm_pretrain.yaml` | Config path for SSM pretraining experiments |
| SSM wrapper tests | `tests/test_taonet_ssm.py` | Shape and config behavior tests |

Current real-token benchmark entry point:

```text
scripts/benchmark_taonet_real_tokens.py
```

Current SSM wrapper entry point:

```text
src/taoTrain/models/taonet_ssm.py
```

Current attention baseline:

```text
src/taoTrain/models/taonet.py
```

Remote TaoTrain path used by RepoBridge:

```text
/home/student/YouZheng/repo
```

## TaoData

GitHub:

```text
https://github.com/lobakkang/TaoData
```

Current local status:

```text
No local TaoData checkout was found under C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase as of 2026-04-30.
```

Remote data path used in current benchmarks:

```text
/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
```
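
Because there is no local TaoData checkout yet, the quickest way to learn the record layout is to peek at the first few lines of the remote file. The sketch below assumes the file is UTF-8 JSONL with one JSON object per line; `peek_jsonl` is a hypothetical helper, and the field names are not documented in this guide, so it only reports whichever keys it finds.

```python
import json
from itertools import islice
from pathlib import Path

# Remote data path quoted from this guide.
DATA_PATH = Path("/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl")


def peek_jsonl(path, limit=3):
    """Return the first `limit` parsed records, reading the file line by line."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        # Skip blank lines so that stray padding does not break json.loads.
        non_empty = (line for line in handle if line.strip())
        for line in islice(non_empty, limit):
            records.append(json.loads(line))
    return records


if __name__ == "__main__" and DATA_PATH.exists():
    for record in peek_jsonl(DATA_PATH):
        print(sorted(record))  # key names only, not the text payload
```

Streaming with `islice` stops after `limit` records, so the whole pretraining file is never loaded into memory.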

Current pilot tokenizer path on remote:

```text
/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab
```

Tokenizer config snapshot in this ledger:

```text
experiments/resources/tokenizers/taodata_pilot_8k.yaml
```

When TaoData is cloned locally, update this guide with the exact data download/generation scripts and any preprocessing entry points.

## RepoBridge

Path:

```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge
```

Important files:

| File or directory | Purpose |
|---|---|
| `repobridge/core.py` | Sync, SSH, SFTP, run, download implementation |
| `repobridge/cli.py` | CLI entry point |
| `repobridge/app.py` | GUI |
| `CODEX_OPERATOR_GUIDE.md` | Codex remote-run guide |
| `PRODUCTION_RUNBOOK.md` | Production checklist |
| old `repobridge.*.config.json` files | Historical configs; new experiment configs should live in this ledger |

Preferred future location for experiment configs:

```text
Taotern_LLM_Experiments\experiments\runs\<run_id>\repobridge.config.json
```

Remote write root:

```text
/home/student/YouZheng
```

Remote output base:

```text
/home/student/YouZheng/outputs-taotrain
```

Important operational note:

Avoid downloading the whole remote output base if it contains many historical runs. Prefer downloading or copying only the specific run folder.
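
One way to follow this note is to always build the download target from a single run folder name instead of passing the output base itself. This is a minimal sketch under a loud assumption: it supposes each run writes to a folder directly under the output base, which this guide does not state. Take the real folder name from that run's `repobridge.config.json`. `remote_run_dir` is a hypothetical helper, not a RepoBridge function.

```python
from pathlib import PurePosixPath

# Both roots are quoted from this guide.
REMOTE_WRITE_ROOT = PurePosixPath("/home/student/YouZheng")
REMOTE_OUTPUT_BASE = REMOTE_WRITE_ROOT / "outputs-taotrain"


def remote_run_dir(run_name: str) -> PurePosixPath:
    """Build the remote folder for one run and reject anything broader."""
    candidate = PurePosixPath(run_name)
    # Exactly one relative path component: no "", ".", "..", "a/b", or "/abs".
    if (
        len(candidate.parts) != 1
        or candidate.is_absolute()
        or candidate.parts[0] == ".."
    ):
        raise ValueError(f"expected a single folder name, got {run_name!r}")
    return REMOTE_OUTPUT_BASE / candidate


if __name__ == "__main__":
    print(remote_run_dir("2026-04-29_spm_b32_500step_mixer_sweep"))
```

Rejecting empty names and `..` means a typo cannot silently widen the download to the whole output base or to the write root.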

## Current Best SSM LLM Path

To inspect the current best SSM LLM implementation:

1. Open the TaoTrain SSM wrapper:

   ```text
   C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet_ssm.py
   ```

2. Follow the DPLR core import into:

   ```text
   C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM\gamma_space_model\modules\s4_ternary_dplr_ssm.py
   ```

3. Compare against the attention TaoNet baseline:

   ```text
   C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet.py
   ```

4. Reproduce the current best benchmark with:

   ```text
   Taotern_LLM_Experiments\experiments\runs\2026-04-29_spm_b32_500step_mixer_sweep\repobridge.config.json
   ```

5. Read the current conclusion in:

   ```text
   Taotern_LLM_Experiments\README.md
   ```
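
The five steps above can be turned into a quick local check that every file is where this guide says it is. The sketch below is illustrative only: it assumes the two ledger paths in steps 4 and 5 are relative to the workspace root, and `inspection_paths` is a hypothetical helper name.

```python
import os
from pathlib import PureWindowsPath

# Workspace root quoted from the "Local Workspace" section.
WORKSPACE = PureWindowsPath(
    r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern"
)

# Same order as the numbered steps above.
INSPECTION_ORDER = (
    ("SSM wrapper", r"TaoTrain\src\taoTrain\models\taonet_ssm.py"),
    ("DPLR core", r"Taotern_SSM\gamma_space_model\modules\s4_ternary_dplr_ssm.py"),
    ("Attention baseline", r"TaoTrain\src\taoTrain\models\taonet.py"),
    (
        "Benchmark config",
        r"Taotern_LLM_Experiments\experiments\runs"
        r"\2026-04-29_spm_b32_500step_mixer_sweep\repobridge.config.json",
    ),
    ("Conclusion", r"Taotern_LLM_Experiments\README.md"),
)


def inspection_paths(workspace=WORKSPACE):
    """Return (label, absolute Windows path) pairs in reading order."""
    return [(label, workspace / relative) for label, relative in INSPECTION_ORDER]


if __name__ == "__main__":
    for label, path in inspection_paths():
        # Only meaningful on the Windows machine that hosts the workspace.
        status = "found" if os.path.exists(str(path)) else "missing"
        print(f"{label:20s} {status:8s} {path}")
```

A `missing` line usually means a repo was moved or not cloned, which is the cue to update the path tables in this guide.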