name: env-setup
description: >-
  Training environment setup on lucia6750000000 — conda env, dependencies,
  permissions
metadata:
  node_type: memory
  type: reference
  originSessionId: a902e50d-bd1f-422b-8298-552e3fb0a73f

Environment on lucia6750000000

  • User: tunneladmin (in sudo group, NOT in sigma group)
  • Machine: 8x H100 80GB, 32TB disk at /data
  • /data/xuano/ is owned by sigma; write access was granted with sudo chmod -R o+w /data/xuano/ (this makes the tree world-writable)
  • Conda env ttt: /home/tunneladmin/.conda/envs/ttt/
    • Python 3.11, PyTorch 2.8+cu128, transformers 4.57.3, VeOmni 0.1.0
    • FlashAttention 2.8.3, liger-kernel, datasets 2.21.0
    • Installed via: conda create -n ttt python=3.11 + pip per [[qwen3-4b-cpt-experiment]]
  • VeOmni: Installed from git commit 9b91e164bea9e17f17ed490aab5e076c2335ca25 (ByteDance-Seed/VeOmni)
  • Project code: /data/xuano/Plug-In-Test-time-training/ (In-Place TTT repo, also registers custom HF models for Qwen3/LLaMA/Mistral)
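
To confirm that an activated ttt env still matches the versions recorded above, a small check along these lines may help. It is a minimal sketch, not part of the original setup: the distribution names in EXPECTED (for example flash-attn and veomni) are assumptions about how the packages are registered with pip, and the comparison is a simple prefix match so that a local build suffix such as +cu128 does not cause a false mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions come from the notes above; the distribution names are assumptions.
EXPECTED = {
    "torch": "2.8",
    "transformers": "4.57.3",
    "flash-attn": "2.8.3",
    "datasets": "2.21.0",
    "veomni": "0.1.0",
}


def check_versions(expected, get_version=version):
    """Return {name: (expected_prefix, installed_or_None)} for each mismatch."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = get_version(name)
        except PackageNotFoundError:
            have = None  # distribution is not installed under this name
        if have is None or not have.startswith(want):
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when every package matches its recorded version prefix.
    for name, (want, have) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}*, found {have}")
```

Run it with the env's own interpreter (for example /home/tunneladmin/.conda/envs/ttt/bin/python) so that it inspects the right site-packages.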

Key notes

  • VeOmni's lr_decay_ratio is the fraction of total training steps over which cosine decay runs (NOT the min lr ratio). Set it to 1.0 to decay over the full run.
  • Set FLOPS_DISABLE=1 for mbs>=4 at 32K context on H100 80GB; otherwise FlopCounterMode causes OOM
  • Use nohup for long training runs so they survive SSH disconnects and terminal hangups (SIGHUP)
  • hf CLI installed at /home/tunneladmin/.local/bin/hf (v1.14.0) for HF bucket sync
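
The lr_decay_ratio note is easy to misread, so here is an illustrative schedule that follows the interpretation given above. It is a sketch under stated assumptions, not VeOmni's implementation: the function name cosine_lr, the min_lr parameter, the absence of warmup, and holding the learning rate at min_lr once the decay window ends are all hypothetical choices made for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr, lr_decay_ratio=1.0, min_lr=0.0):
    """Cosine decay over the first lr_decay_ratio fraction of training.

    Assumptions (not taken from VeOmni): no warmup, decay starts at step 0,
    and the learning rate stays at min_lr after the decay window ends.
    """
    decay_steps = max(1, int(total_steps * lr_decay_ratio))
    if step >= decay_steps:
        return min_lr
    progress = step / decay_steps  # in [0, 1) inside the decay window
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Compare a full-run decay with a decay confined to the first half.
    for ratio in (1.0, 0.5):
        lrs = [cosine_lr(s, 100, 1e-4, lr_decay_ratio=ratio) for s in (0, 25, 50, 75)]
        print(ratio, lrs)
```

Under these assumptions, lr_decay_ratio=0.5 reaches min_lr halfway through training, whereas 1.0 spreads the same cosine curve over every step. A separate parameter (min_lr here) sets the floor, which is the quantity lr_decay_ratio is often mistaken for.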
