	---
	name: env-setup
	description: "Training environment setup on lucia6750000000 — conda env, dependencies, permissions"
	metadata:
	  node_type: memory
	  type: reference
	  originSessionId: a902e50d-bd1f-422b-8298-552e3fb0a73f
	---

	## Environment on lucia6750000000

	- User: tunneladmin (in sudo group, NOT in sigma group)
	- Machine: 8x H100 80GB, 32TB disk at /data
	- `/data/xuano/` owned by sigma — write access granted via `sudo chmod -R o+w /data/xuano/`
	- Conda env `ttt`: `/home/tunneladmin/.conda/envs/ttt/`
	- Python 3.11, PyTorch 2.8+cu128, transformers 4.57.3, VeOmni 0.1.0
	- FlashAttention 2.8.3, liger-kernel, datasets 2.21.0
	- Installed via: `conda create -n ttt python=3.11` + pip per [[qwen3-4b-cpt-experiment]]
	- VeOmni: installed from git commit `9b91e164bea9e17f17ed490aab5e076c2335ca25` of ByteDance-Seed/VeOmni
	- Project code: `/data/xuano/Plug-In-Test-time-training/` (In-Place TTT repo, also registers custom HF models for Qwen3/LLaMA/Mistral)
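
A minimal sketch for checking that the active env still matches the versions listed above. The distribution names (`flash-attn`, `veomni`, and so on) are assumptions about how the packages are registered with pip, and the comparison is a plain prefix match, so treat the output as a hint rather than a strict verification.

```python
from importlib import metadata

# Versions as recorded in this note. Keys are assumed pip distribution names.
EXPECTED = {
    "torch": "2.8.",          # note records "2.8+cu128"; prefix match on 2.8.x
    "transformers": "4.57.3",
    "veomni": "0.1.0",        # assumed distribution name for VeOmni
    "flash-attn": "2.8.3",    # assumed distribution name for FlashAttention
    "datasets": "2.21.0",
}


def check_versions(expected=EXPECTED, get_version=metadata.version):
    """Return {name: (installed_version_or_None, matches_expected_prefix)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = get_version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this env
        report[name] = (have, have is not None and have.startswith(want))
    return report


if __name__ == "__main__":
    for name, (have, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected~={EXPECTED[name]} [{status}]")
```

Run it with the `ttt` env activated; any `MISMATCH` line points at a package that is missing or has drifted from what is recorded here.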

	### Key notes
	- VeOmni's `lr_decay_ratio` is the fraction of total steps over which cosine decay runs (NOT the min lr ratio). Set it to 1.0 to decay over the full run.
	- Set `FLOPS_DISABLE=1` for micro-batch size (mbs) >= 4 at 32K context on H100 80GB; otherwise FlopCounterMode causes OOM
	- Use `nohup` for long training runs so the process survives the SSH session or terminal closing (it ignores SIGHUP)
	- `hf` CLI installed at `/home/tunneladmin/.local/bin/hf` (v1.14.0) for HF bucket sync
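
To make the `lr_decay_ratio` note concrete, here is a small illustrative schedule. It is not VeOmni's code: the function name, the lack of warmup, and the choice to hold at `min_lr` once the decay window ends are all assumptions made for this sketch. It only shows the semantics described above, where the ratio sets how many steps the cosine spans rather than where it bottoms out.

```python
import math


def lr_at_step(step, total_steps, base_lr, lr_decay_ratio=1.0, min_lr=0.0):
    """Cosine decay over the first lr_decay_ratio * total_steps steps.

    Hypothetical helper for illustration only. After the decay window
    the learning rate is held at min_lr (an assumption of this sketch).
    """
    decay_steps = max(1, int(total_steps * lr_decay_ratio))
    if step >= decay_steps:
        return min_lr
    progress = step / decay_steps  # 0.0 at the start, approaches 1.0 at the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000
    for ratio in (1.0, 0.5):
        samples = [lr_at_step(s, total, 1e-4, ratio) for s in (0, 250, 500, 750)]
        print(ratio, [f"{lr:.2e}" for lr in samples])
```

With `lr_decay_ratio=1.0` the cosine reaches its midpoint at step 500 of 1000. With `0.5` it is already at the floor by step 500, which is why confusing the parameter with a min lr ratio would silently cut the effective schedule short.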
