	# Directory Navigation Guide

	This guide maps the project by responsibility. Use it when a new thread needs to find the SSM code, LLM wrapper, data pipeline, train/test scripts, or remote-run artifacts quickly.

	## Local Workspace

	Current local workspace root:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern
	```

	Main local repos:

	| Purpose | Local path | GitHub |
	|---|---|---|
	| Experiment ledger | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments` | `https://github.com/StarMists/Taotern_LLM_Experiments` |
	| SSM model | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM` | `https://github.com/StarMists/gamma_SSM_S4_enhanced` |
	| TaoTrain LLM code | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain` | `https://github.com/lobakkang/TaoTrain` |
	| Remote run tool | `C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge` | local tool repo |
	| TaoData scripts | not currently cloned under this workspace | `https://github.com/lobakkang/TaoData` |

	## Experiment Ledger

	Path:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_LLM_Experiments
	```

	Important files:

	| File | Purpose |
	|---|---|
	| `README.md` | Current SSM LLM status and attention TaoNet comparison |
	| `experiments/index.csv` | Searchable run ledger |
	| `experiments/runs/<run_id>/manifest.yaml` | Run purpose, commits, data, status |
	| `experiments/runs/<run_id>/summary.md` | Human-readable result |
	| `experiments/runs/<run_id>/metrics.csv` | Compact metric snapshot |
	| `experiments/runs/<run_id>/repobridge.config.json` | Exact remote-run config |
	| `experiments/resources/tokenizers/` | Small tokenizer configs only |
	| `docs/WORKFLOW.md` | How future runs should be recorded |
	| `docs/CURRENT_SSM_LLM_ARCHITECTURE.md` | Current TaoNet-SSM layers, equations, matrices, parameters |
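
Because `experiments/index.csv` is described as a searchable ledger, a small lookup helper can be handy. The sketch below is only an illustration: the column names (`run_id`, `note`) and the sample rows are assumptions, since this guide does not document the real schema. It matches a substring against every cell, so it should keep working whatever columns the ledger actually has.

```python
import csv
import io


def search_ledger(csv_text: str, needle: str) -> list[dict]:
    """Return rows in which any cell contains `needle` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = needle.lower()
    return [
        row
        for row in reader
        if any(needle in str(value or "").lower() for value in row.values())
    ]


# Hypothetical sample; the real columns of experiments/index.csv may differ.
SAMPLE = """run_id,note
2026-04-29_spm_b32_500step_mixer_sweep,placeholder row
example_other_run,placeholder row
"""

matches = search_ledger(SAMPLE, "mixer_sweep")
print([row["run_id"] for row in matches])
```

To search the real ledger, read the file with `Path("experiments/index.csv").read_text(encoding="utf-8")` and pass the text in.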

	Rule: keep this repo compact. Commit summaries/configs/CSV metrics, not raw output trees or checkpoints.
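
One way to enforce the rule before committing is a quick check of a run folder. This is a minimal sketch, not part of the ledger tooling: the expected file names come from the table above, while the function name `check_run_dir` and the 1 MB size threshold are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# File names taken from the "Important files" table.
EXPECTED_FILES = ("manifest.yaml", "summary.md", "metrics.csv", "repobridge.config.json")
# Assumed threshold for "compact"; the guide does not specify a number.
MAX_BYTES = 1_000_000


def check_run_dir(run_dir: Path) -> dict:
    """Report expected files that are missing and any file above MAX_BYTES."""
    missing = [name for name in EXPECTED_FILES if not (run_dir / name).is_file()]
    oversized = sorted(
        p.relative_to(run_dir).as_posix()
        for p in run_dir.rglob("*")
        if p.is_file() and p.stat().st_size > MAX_BYTES
    )
    return {"missing": missing, "oversized": oversized}


# Demo on a throwaway directory with only two of the four expected files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "example_run"
    run_dir.mkdir()
    (run_dir / "manifest.yaml").write_text("status: placeholder\n")
    (run_dir / "summary.md").write_text("# placeholder\n")
    report = check_run_dir(run_dir)

print(report)
```

Anything listed under `oversized` is a candidate for raw output or a checkpoint that should stay out of the repo.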

	## SSM Model Repo

	Path:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM
	```

	Main code locations:

	| Area | File or directory | Notes |
	|---|---|---|
	| DPLR SSM core | `gamma_space_model/modules/s4_ternary_dplr_ssm.py` | Current main SSM core for TaoNet-SSM |
	| Gamma S4 core | `gamma_space_model/modules/ssm_gamma_s4.py` | Older Gamma/S4-style core |
	| Baseline Gamma core | `gamma_space_model/modules/ssm_gamma.py` | Baseline/reference SSM |
	| SSM blocks | `gamma_space_model/modules/block*.py` | Standalone SSM block wrappers |
	| TileLang/Triton fallback area | `csrc/tilelang/` | Capability detection and fallback code |
	| Selective scan op wrapper | `gamma_space_model/ops/selective_scan_interface.py` | SSM op interface |
	| DPLR profiler | `scripts/profile_dplr_frequency_path.py` | Profiles DPLR frequency path |
	| TileLang diagnosis | `scripts/diagnose_tilelang_acceleration.py` | Reports real vs fallback acceleration |
	| SSM variant benchmark | `scripts/benchmark_ssm_variants.py` | Standalone SSM benchmarks |
	| SSM tests | `tests/test_s4_ternary_dplr_ssm.py`, `tests/test_ssm_gamma*.py` | Core correctness tests |
	| Historical record | `EXPERIMENT_RECORD.md` | Older narrative record; new LLM records should be mirrored into this experiment ledger |

	When improving the SSM model itself, start from:

	```text
	gamma_space_model/modules/s4_ternary_dplr_ssm.py
	```

	When working on hardware acceleration, start from:

	```text
	csrc/tilelang/
	scripts/profile_dplr_frequency_path.py
	scripts/diagnose_tilelang_acceleration.py
	```

	Remote SSM path used by RepoBridge runs:

	```text
	/home/student/YouZheng/gamma_ssm_repo
	```

	## TaoTrain LLM Repo

	Path:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain
	```

	Main code locations:

	| Area | File or directory | Notes |
	|---|---|---|
	| Attention TaoNet baseline | `src/taoTrain/models/taonet.py` | Reference model for comparisons |
	| SSM TaoNet wrapper | `src/taoTrain/models/taonet_ssm.py` | Replaces attention core with SSM mixer |
	| Model config schema | `src/taoTrain/config.py` | SSM flags live here: hidden dim, mixer dim, shift, kernel mode |
	| Model registry | `src/taoTrain/models/registry.py` | Architecture registration |
	| Token/data utilities | `src/taoTrain/data/` | JSONL and tokenization data paths |
	| Tokenizer trainer | `src/taoTrain/tokenizers/trainer.py` | SentencePiece training path |
	| Training loop | `src/taoTrain/training/trainer.py` | Full trainer implementation |
	| CLI | `src/taoTrain/cli.py` | TaoTrain command entry |
	| Real-token benchmark | `scripts/benchmark_taonet_real_tokens.py` | Main attention vs SSM benchmark for TaoData token tasks |
	| Synthetic token benchmark | `scripts/benchmark_taonet_token_variants.py` | Previous/increment/random token probes |
	| TaoData pilot tokenizer config | `configs/tokenizer_taodata_pilot.yaml` | Generated pilot 8k SentencePiece tokenizer |
	| SSM pretrain config | `configs/ssm_pretrain.yaml` | Config path for SSM pretraining experiments |
	| SSM wrapper tests | `tests/test_taonet_ssm.py` | Shape and config behavior tests |

	Current real-token benchmark entry point:

	```text
	scripts/benchmark_taonet_real_tokens.py
	```

	Current SSM wrapper entry point:

	```text
	src/taoTrain/models/taonet_ssm.py
	```

	Current attention baseline:

	```text
	src/taoTrain/models/taonet.py
	```

	Remote TaoTrain path used by RepoBridge:

	```text
	/home/student/YouZheng/repo
	```

	## TaoData

	GitHub:

	```text
	https://github.com/lobakkang/TaoData
	```

	Current local status:

	```text
	No local TaoData checkout was found under C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase as of 2026-04-30.
	```

	Remote data path used in current benchmarks:

	```text
	/home/student/Data/TaoData/pretrain.jsonl.fineweb.jsonl
	```
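
Since no local checkout exists, a quick way to learn the record layout is to peek at the first few lines on the remote machine. The sketch below assumes the file is standard JSON Lines (one JSON object per line), as its extension suggests; the helper name `peek_jsonl` and the sample field names are hypothetical, because the real schema is not documented in this guide.

```python
import io
import json
from itertools import islice


def peek_jsonl(lines, limit=3):
    """Return the sorted keys of the first `limit` non-empty JSONL records."""
    records = (json.loads(line) for line in lines if line.strip())
    return [sorted(record) for record in islice(records, limit)]


# Hypothetical sample; the real TaoData field names may differ.
sample = io.StringIO(
    '{"text": "placeholder one"}\n'
    "\n"
    '{"text": "placeholder two", "id": 2}\n'
)
print(peek_jsonl(sample))
```

On the remote host, pass an open file handle instead of the sample, for example `with open(path, encoding="utf-8") as f: peek_jsonl(f)`. Only the first `limit` records are parsed, so the full file is never loaded.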

	Current pilot tokenizer path on remote:

	```text
	/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.model
	/home/student/YouZheng/tokenizers/taodata_pilot_8k/tokenizer.vocab
	```

	Tokenizer config snapshot in this ledger:

	```text
	experiments/resources/tokenizers/taodata_pilot_8k.yaml
	```

	When TaoData is cloned locally, update this guide with the exact data download/generation scripts and any preprocessing entry points.

	## RepoBridge

	Path:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyComp\RepoBridge
	```

	Important files:

	| File or directory | Purpose |
	|---|---|
	| `repobridge/core.py` | Sync, SSH, SFTP, run, download implementation |
	| `repobridge/cli.py` | CLI entry point |
	| `repobridge/app.py` | GUI |
	| `CODEX_OPERATOR_GUIDE.md` | Codex remote-run guide |
	| `PRODUCTION_RUNBOOK.md` | Production checklist |
	| old `repobridge.*.config.json` files | Historical configs; new experiment configs should live in this ledger |

	Preferred future location for experiment configs:

	```text
	Taotern_LLM_Experiments\experiments\runs\<run_id>\repobridge.config.json
	```

	Remote write root:

	```text
	/home/student/YouZheng
	```

	Remote output base:

	```text
	/home/student/YouZheng/outputs-taotrain
	```

	Important operational note:

	Avoid downloading the whole remote output base if it contains many historical runs. Prefer downloading or copying only the specific run folder.
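
A small guard can make this hard to get wrong when scripting downloads. The sketch below builds the remote path for exactly one run folder and rejects anything that would resolve to the output base itself or escape it. It assumes run folders sit directly under the output base; the function name `remote_run_dir` and the folder name `example_run` are hypothetical, and the actual transfer is left to RepoBridge or whichever tool is in use.

```python
from pathlib import PurePosixPath

# Remote output base from this guide.
OUTPUT_BASE = PurePosixPath("/home/student/YouZheng/outputs-taotrain")


def remote_run_dir(run_name: str) -> PurePosixPath:
    """Return the remote folder for one run.

    Rejects empty names, '.', '..', and names containing '/', so the result
    is always a direct child of OUTPUT_BASE and never the base itself.
    """
    if not run_name or run_name in (".", "..") or "/" in run_name:
        raise ValueError(f"expected a single folder name, got {run_name!r}")
    return OUTPUT_BASE / run_name


print(remote_run_dir("example_run"))
```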

	## Current Best SSM LLM Path

	To inspect the current best SSM LLM implementation:

	1. Open TaoTrain wrapper:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet_ssm.py
	```

	2. Follow the DPLR core import into:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\Taotern_SSM\gamma_space_model\modules\s4_ternary_dplr_ssm.py
	```

	3. Compare against attention TaoNet:

	```text
	C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern\TaoTrain\src\taoTrain\models\taonet.py
	```

	4. Reproduce current best benchmark with:

	```text
	Taotern_LLM_Experiments\experiments\runs\2026-04-29_spm_b32_500step_mixer_sweep\repobridge.config.json
	```

	5. Read current conclusion in:

	```text
	Taotern_LLM_Experiments\README.md
	```
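
For scripting, the five locations above can be assembled from the workspace root rather than hard-coded one by one. This is a sketch only: `WORKSPACE` and the relative paths come from this guide, while the `STEPS` structure and its labels are illustrative. `PureWindowsPath` is used so the paths format the same way on any operating system, without touching the filesystem.

```python
from pathlib import PureWindowsPath

# Workspace root from the "Local Workspace" section.
WORKSPACE = PureWindowsPath(
    r"C:\Users\YouZheng\Documents\LYZ\MyContent\MyLLM\Codebase\Taotern"
)
RUN_ID = "2026-04-29_spm_b32_500step_mixer_sweep"

# (label, path) pairs in the order of the inspection steps above.
STEPS = [
    ("SSM wrapper", WORKSPACE / "TaoTrain/src/taoTrain/models/taonet_ssm.py"),
    ("DPLR core", WORKSPACE / "Taotern_SSM/gamma_space_model/modules/s4_ternary_dplr_ssm.py"),
    ("Attention baseline", WORKSPACE / "TaoTrain/src/taoTrain/models/taonet.py"),
    (
        "Benchmark config",
        WORKSPACE / "Taotern_LLM_Experiments/experiments/runs" / RUN_ID / "repobridge.config.json",
    ),
    ("Conclusion", WORKSPACE / "Taotern_LLM_Experiments/README.md"),
]

for number, (label, path) in enumerate(STEPS, start=1):
    print(f"{number}. {label}: {path}")
```

On the Windows machine itself, wrapping each entry in `pathlib.Path(...)` and calling `.exists()` gives a quick check that nothing has moved.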