# Directory Navigation Guide
This guide maps the project by responsibility. Use it when a new thread needs to find the SSM code, LLM wrapper, data pipeline, train/test scripts, or remote-run artifacts quickly.
## Local Workspace
Current local workspace root:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern
```
Main local repos:
| Purpose | Local path | GitHub |
|---|---|---|
| Experiment ledger | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments` | `https://github.com/StarMists/Taotern_LLM_Experiments` |
| SSM model | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM` | `https://github.com/StarMists/gamma_SSM_S4_enhanced` |
| TaoTrain LLM code | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain` | `https://github.com/lobakkang/TaoTrain` |
| Remote run tool | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge` | local tool repo |
| TaoData scripts | not currently cloned under this workspace | `https://github.com/lobakkang/TaoData` |
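For scripts that need to locate the local checkouts, the table above can be captured as a small lookup. This is only a sketch: `LOCAL_REPOS`, its keys, and `inside_workspace` are illustrative names, not identifiers from any of the repos. `PureWindowsPath` is used so the snippet also runs on non-Windows hosts.

```python
from pathlib import PureWindowsPath

# Workspace root as listed in this guide.
WORKSPACE_ROOT = PureWindowsPath(
    r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern"
)

# Illustrative keys; the paths mirror the "Main local repos" table.
LOCAL_REPOS = {
    "experiment_ledger": WORKSPACE_ROOT / "Taotern_LLM_Experiments",
    "ssm_model": WORKSPACE_ROOT / "Taotern_SSM",
    "taotrain": WORKSPACE_ROOT / "TaoTrain",
    # RepoBridge lives outside the Taotern workspace root.
    "repobridge": PureWindowsPath(
        r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge"
    ),
}


def inside_workspace(path: PureWindowsPath) -> bool:
    """Return True if `path` is the workspace root or sits below it."""
    return path == WORKSPACE_ROOT or WORKSPACE_ROOT in path.parents


if __name__ == "__main__":
    for name, path in LOCAL_REPOS.items():
        print(f"{name}: {path} (inside workspace: {inside_workspace(path)})")
```

Note that RepoBridge is the one repo that does not sit under the workspace root, so tooling that walks the workspace will not find it.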
## Experiment Ledger
Path:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments
```
Important files:
| File | Purpose |
|---|---|
| `README.md` | Current SSM LLM status and attention TaoNet comparison |
| `experiments/index.csv` | Searchable run ledger |
| `experiments/runs/<run_id>/manifest.yaml` | Run purpose, commits, data, status |
| `experiments/runs/<run_id>/summary.md` | Human-readable result |
| `experiments/runs/<run_id>/metrics.csv` | Compact metric snapshot |
| `experiments/runs/<run_id>/repobridge.config.json` | Exact remote-run config |
| `experiments/resources/tokenizers/` | Small tokenizer configs only |
| `docs/WORKFLOW.md` | How future runs should be recorded |
| `docs/CURRENT_SSM_LLM_ARCHITECTURE.md` | Current TaoNet-SSM layers, equations, matrices, parameters |
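To search the run ledger from a script, something like the following may be enough. The column layout of `experiments/index.csv` is not described in this guide, so the sketch makes no assumption about column names and matches against every field. `search_ledger` is a hypothetical helper, not part of the ledger repo.

```python
import csv
from pathlib import Path


def search_ledger(index_csv: Path, needle: str) -> list[dict]:
    """Return rows where any field contains `needle` (case-insensitive)."""
    needle = needle.lower()
    with index_csv.open(newline="", encoding="utf-8") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            # str() also covers missing values (None) and overflow fields (lists).
            if any(needle in str(value or "").lower() for value in row.values())
        ]
```

For example, `search_ledger(Path("experiments/index.csv"), "mixer_sweep")` would return every ledger row that mentions a mixer sweep in any column.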
Rule: keep this repo compact. Commit summaries, configs, and CSV metrics; do not commit raw output trees or checkpoints.
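One way to enforce this rule before committing is a quick scan for files that look like checkpoints or are unusually large. The suffix list and the size limit below are assumptions chosen for illustration, and `find_heavy_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed checkpoint-style suffixes and size limit; adjust to the project's needs.
HEAVY_SUFFIXES = {".pt", ".pth", ".ckpt", ".safetensors", ".bin"}
MAX_BYTES = 5 * 1024 * 1024


def find_heavy_files(repo_root: Path) -> list[Path]:
    """List files that look like checkpoints or exceed MAX_BYTES."""
    heavy = []
    for path in sorted(repo_root.rglob("*")):
        relative = path.relative_to(repo_root)
        # Skip Git internals and anything that is not a regular file.
        if ".git" in relative.parts or not path.is_file():
            continue
        if path.suffix.lower() in HEAVY_SUFFIXES or path.stat().st_size > MAX_BYTES:
            heavy.append(path)
    return heavy
```

An empty result means nothing in the working tree matches the assumed "heavy" criteria.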
## SSM Model Repo
Path:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM
```
Main code locations:
| Area | File or directory | Notes |
|---|---|---|
| DPLR SSM core | `gamma_space_model/modules/s4_ternary_dplr_ssm.py` | Current main SSM core for TaoNet-SSM |
| Gamma S4 core | `gamma_space_model/modules/ssm_gamma_s4.py` | Older Gamma/S4-style core |
| Baseline Gamma core | `gamma_space_model/modules/ssm_gamma.py` | Baseline/reference SSM |
| SSM blocks | `gamma_space_model/modules/block*.py` | Standalone SSM block wrappers |
| TileLang/Triton fallback area | `csrc/tilelang/` | Capability detection and fallback code |
| Selective scan op wrapper | `gamma_space_model/ops/selective_scan_interface.py` | SSM op interface |
| DPLR profiler | `scripts/profile_dplr_frequency_path.py` | Profiles DPLR frequency path |
| TileLang diagnosis | `scripts/diagnose_tilelang_acceleration.py` | Reports real vs fallback acceleration |
| SSM variant benchmark | `scripts/benchmark_ssm_variants.py` | Standalone SSM benchmarks |
| SSM tests | `tests/test_s4_ternary_dplr_ssm.py`, `tests/test_ssm_gamma*.py` | Core correctness tests |
| Historical record | `EXPERIMENT_RECORD.md` | Older narrative record; new LLM records should be mirrored into this experiment ledger |
When improving the SSM model itself, start from:
```text
gamma_space_model/modules/s4_ternary_dplr_ssm.py
```
When working on hardware acceleration, start from:
```text
csrc/tilelang/
scripts/profile_dplr_frequency_path.py
scripts/diagnose_tilelang_acceleration.py
```
Remote SSM path used by RepoBridge runs:
```text
/home/student/YouZheng/gamma_ssm_repo
```
## TaoTrain LLM Repo
Path:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain
```
Main code locations:
| Area | File or directory | Notes |
|---|---|---|
| Attention TaoNet baseline | `src/taoTrain/models/taonet.py` | Reference model for comparisons |
| SSM TaoNet wrapper | `src/taoTrain/models/taonet_ssm.py` | Replaces attention core with SSM mixer |
| Model config schema | `src/taoTrain/config.py` | SSM flags live here: hidden dim, mixer dim, shift, kernel mode |
| Model registry | `src/taoTrain/models/registry.py` | Architecture registration |
| Token/data utilities | `src/taoTrain/data/` | JSONL and tokenization data paths |
| Tokenizer trainer | `src/taoTrain/tokenizers/trainer.py` | SentencePiece training path |
| Training loop | `src/taoTrain/training/trainer.py` | Full trainer implementation |
| CLI | `src/taoTrain/cli.py` | TaoTrain command entry |
| Real-token benchmark | `scripts/benchmark_taonet_real_tokens.py` | Main attention vs SSM benchmark for TaoData token tasks |
| Synthetic token benchmark | `scripts/benchmark_taonet_token_variants.py` | Previous/increment/random token probes |
| TaoData pilot tokenizer config | `configs/tokenizer_taodata_pilot.yaml` | Generated pilot 8k SentencePiece tokenizer |
| SSM pretrain config | `configs/ssm_pretrain.yaml` | Config path for SSM pretraining experiments |
| SSM wrapper tests | `tests/test_taonet_ssm.py` | Shape and config behavior tests |
Current real-token benchmark entry point:
```text
scripts/benchmark_taonet_real_tokens.py
```
Current SSM wrapper entry point:
```text
src/taoTrain/models/taonet_ssm.py
```
Current attention baseline:
```text
src/taoTrain/models/taonet.py
```
Remote TaoTrain path used by RepoBridge:
```text
/home/student/YouZheng/repo
```
## TaoData
GitHub:
```text
https://github.com/lobakkang/TaoData
```
Current local status:
```text
No local TaoData checkout was found under C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase as of 2026-04-30.
```
Remote data path used in current benchmarks:
```text
/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
```
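The record schema of this JSONL file is not documented in this guide. A minimal sketch for inspecting the first few records, without assuming any field names, could look like this (`peek_jsonl` is a hypothetical helper):

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path: Path, limit: int = 3) -> list:
    """Parse up to `limit` non-empty lines of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        # Lazily skip blank lines so large files are never fully loaded.
        lines = (line for line in handle if line.strip())
        return [json.loads(line) for line in islice(lines, limit)]
```

Printing `sorted(record.keys())` for each returned record is a quick way to learn which fields the data actually carries.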
Current pilot tokenizer path on remote:
```text
/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab
```
Tokenizer config snapshot in this ledger:
```text
experiments/resources/tokenizers/taodata_pilot_8k.yaml
```
When TaoData is cloned locally, update this guide with the exact data download/generation scripts and any preprocessing entry points.
## RepoBridge
Path:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge
```
Important files:
| File or directory | Purpose |
|---|---|
| `repobridge/core.py` | Sync, SSH, SFTP, run, download implementation |
| `repobridge/cli.py` | CLI entry point |
| `repobridge/app.py` | GUI |
| `CODEX_OPERATOR_GUIDE.md` | Codex remote-run guide |
| `PRODUCTION_RUNBOOK.md` | Production checklist |
| old `repobridge.*.config.json` files | Historical configs; new experiment configs should live in this ledger |
Preferred future location for experiment configs:
```text
Taotern_LLM_Experiments\experiments\runs\<run_id>\repobridge.config.json
```
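Given a run id, the per-run files named in the experiment ledger section can be derived mechanically. The sketch below assumes only the layout stated in this guide; `run_artifacts` and `RUN_FILES` are illustrative names.

```python
from pathlib import PureWindowsPath

LEDGER_ROOT = PureWindowsPath(
    r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern"
) / "Taotern_LLM_Experiments"

# Per-run file names from the experiment ledger's "Important files" table.
RUN_FILES = ("manifest.yaml", "summary.md", "metrics.csv", "repobridge.config.json")


def run_artifacts(run_id: str) -> dict[str, PureWindowsPath]:
    """Map each per-run file name to its path under experiments/runs/<run_id>/."""
    run_dir = LEDGER_ROOT / "experiments" / "runs" / run_id
    return {name: run_dir / name for name in RUN_FILES}


if __name__ == "__main__":
    for name, path in run_artifacts("2026-04-29_spm_b32_500step_mixer_sweep").items():
        print(f"{name}: {path}")
```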
Remote write root:
```text
/home/student/YouZheng
```
Remote output base:
```text
/home/student/YouZheng/outputs-taotrain
```
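As a sanity check, the remote paths quoted in this guide can be tested against the remote write root. This is a sketch; the dictionary keys and `under_write_root` are illustrative names.

```python
from pathlib import PurePosixPath

REMOTE_WRITE_ROOT = PurePosixPath("/home/student/YouZheng")

# Remote paths quoted in this guide; the dictionary keys are illustrative.
REMOTE_PATHS = {
    "ssm_repo": PurePosixPath("/home/student/YouZheng/gamma_ssm_repo"),
    "taotrain_repo": PurePosixPath("/home/student/YouZheng/repo"),
    "outputs": PurePosixPath("/home/student/YouZheng/outputs-taotrain"),
    "tokenizer_dir": PurePosixPath("/home/student/YouZheng/tokenizers/taodata_pilot_8k"),
    "taodata_jsonl": PurePosixPath(
        "/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl"
    ),
}


def under_write_root(path: PurePosixPath) -> bool:
    """Return True if `path` is the write root or sits below it."""
    return path == REMOTE_WRITE_ROOT or REMOTE_WRITE_ROOT in path.parents


if __name__ == "__main__":
    for name, path in REMOTE_PATHS.items():
        print(f"{name}: under write root = {under_write_root(path)}")
```

Of the paths listed, only the TaoData JSONL file sits outside the write root.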
Important operational note:
Avoid downloading the whole remote output base if it contains many historical runs. Prefer downloading or copying only the specific run folder.
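To target a single run folder, the remote path can be built explicitly rather than pointing a download at the output base. The sketch below assumes that each run writes to its own folder directly under the output base; this guide does not state the folder naming, so treat `remote_run_dir` and that layout as hypothetical.

```python
from pathlib import PurePosixPath

REMOTE_OUTPUT_BASE = PurePosixPath("/home/student/YouZheng/outputs-taotrain")


def remote_run_dir(run_folder: str) -> PurePosixPath:
    """Return the remote path of one run folder under the output base."""
    parts = PurePosixPath(run_folder).parts
    # Reject empty names, nested paths, and anything that escapes the base.
    if len(parts) != 1 or parts[0] in {"/", ".", ".."}:
        raise ValueError(f"expected a single folder name, got {run_folder!r}")
    return REMOTE_OUTPUT_BASE / parts[0]
```

Rejecting empty or nested names keeps a typo from silently resolving to the whole output base.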
## Current Best SSM LLM Path
To inspect the current best SSM LLM implementation:
1. Open TaoTrain wrapper:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet_ssm.py
```
2. Follow the DPLR core import into:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM\gamma_space_model\modules\s4_ternary_dplr_ssm.py
```
3. Compare against attention TaoNet:
```text
C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet.py
```
4. Reproduce current best benchmark with:
```text
Taotern_LLM_Experiments\experiments\runs\2026-04-29_spm_b32_500step_mixer_sweep\repobridge.config.json
```
5. Read current conclusion in:
```text
Taotern_LLM_Experiments\README.md
```
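The five steps above can be kept as an ordered checklist that a script verifies against the local checkout. This is a sketch; `INSPECTION_ORDER`, its labels, and `missing_steps` are illustrative names, while the paths are the ones listed in the steps.

```python
import os
from pathlib import PureWindowsPath

ROOT = PureWindowsPath(
    r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern"
)
LEDGER = ROOT / "Taotern_LLM_Experiments"

# Ordered as in the numbered steps of this section.
INSPECTION_ORDER = [
    ("ssm_wrapper", ROOT / "TaoTrain" / "src" / "taoTrain" / "models" / "taonet_ssm.py"),
    ("dplr_core", ROOT / "Taotern_SSM" / "gamma_space_model" / "modules"
        / "s4_ternary_dplr_ssm.py"),
    ("attention_baseline", ROOT / "TaoTrain" / "src" / "taoTrain" / "models" / "taonet.py"),
    ("benchmark_config", LEDGER / "experiments" / "runs"
        / "2026-04-29_spm_b32_500step_mixer_sweep" / "repobridge.config.json"),
    ("conclusion", LEDGER / "README.md"),
]


def missing_steps(exists=os.path.exists) -> list[str]:
    """Return the labels of inspection steps whose file is not present."""
    return [label for label, path in INSPECTION_ORDER if not exists(str(path))]


if __name__ == "__main__":
    print("Missing:", missing_steps() or "none")
```

The `exists` parameter is injectable so the checklist can be exercised without the Windows checkout being present.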