---
name: env-setup
description: "Training environment setup on lucia6750000000 — conda env, dependencies, permissions"
metadata:
  node_type: memory
  type: reference
  originSessionId: a902e50d-bd1f-422b-8298-552e3fb0a73f
---
## Environment on lucia6750000000
- **User:** tunneladmin (in sudo group, NOT in sigma group)
- **Machine:** 8x H100 80GB, 32TB disk at /data
- `/data/xuano/` owned by sigma — write access granted via `sudo chmod -R o+w /data/xuano/`
- **Conda env `ttt`:** `/home/tunneladmin/.conda/envs/ttt/`
  - Python 3.11, PyTorch 2.8+cu128, transformers 4.57.3, VeOmni 0.1.0
  - FlashAttention 2.8.3, liger-kernel, datasets 2.21.0
  - Installed via: `conda create -n ttt python=3.11` + pip per [[qwen3-4b-cpt-experiment]]
- **VeOmni:** Installed from git commit `9b91e164bea9e17f17ed490aab5e076c2335ca25` (ByteDance-Seed/VeOmni)
- **Project code:** `/data/xuano/Plug-In-Test-time-training/` (In-Place TTT repo, also registers custom HF models for Qwen3/LLaMA/Mistral)
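
A quick way to confirm that the active env still matches the pins listed above is to compare installed versions against them. The following is a minimal sketch, not part of the original setup: the helper `check_versions` is hypothetical, and the distribution names (`torch`, `flash-attn`, `veomni`) are assumptions about how the packages are registered with pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the notes above. The distribution names are assumptions.
EXPECTED = {
    "torch": "2.8",
    "transformers": "4.57.3",
    "veomni": "0.1.0",
    "flash-attn": "2.8.3",
    "datasets": "2.21.0",
}


def check_versions(expected, lookup=version):
    """Return {name: (wanted, found)} for each mismatch; found is None if missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Simple prefix match so that "2.8" accepts builds such as "2.8.0+cu128".
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Run it with the `ttt` env activated; no output means every listed pin matched.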
### Key notes
- VeOmni's `lr_decay_ratio` is the fraction of total training steps over which cosine decay runs, NOT the minimum-LR ratio. Set it to 1.0 to decay over the full run.
- Set `FLOPS_DISABLE=1` when the micro-batch size (mbs) is >= 4 at 32K context on H100 80GB; otherwise FlopCounterMode causes OOM.
- Launch long training runs with `nohup` so they survive a dropped SSH session.
- `hf` CLI installed at `/home/tunneladmin/.local/bin/hf` (v1.14.0) for HF bucket sync
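
The `lr_decay_ratio` note is easy to misread, so here is a small sketch of the semantics it describes. This is not VeOmni's implementation: the function `cosine_lr` and its `min_lr` parameter are hypothetical, warmup is ignored, and holding the learning rate at `min_lr` once the decay window ends is an assumption.

```python
import math


def cosine_lr(step, total_steps, base_lr, lr_decay_ratio=1.0, min_lr=0.0):
    """Hypothetical schedule: cosine decay over the first
    lr_decay_ratio * total_steps steps, then hold at min_lr (assumed)."""
    decay_steps = max(1, int(total_steps * lr_decay_ratio))
    if step >= decay_steps:
        return min_lr
    progress = step / decay_steps  # 0.0 at the start, approaches 1.0 at the end
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for ratio in (1.0, 0.5):
        lrs = [cosine_lr(s, 1000, 1e-4, lr_decay_ratio=ratio) for s in (0, 250, 500, 750)]
        print(ratio, [f"{lr:.2e}" for lr in lrs])
```

Under these assumptions, a ratio of 0.5 reaches the floor halfway through the run, while 1.0 spreads the same cosine curve over every step.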
