# SimToken Setup, Data, Upload, and Download Guide

This guide is for moving the SimToken workspace between rented servers.

Assumed paths:

```bash
PROJECT_ROOT=/workspace/SimToken
SAM2_ROOT=/workspace/sam2
HF_REPO=yfan07/SimToken
```

## 1. Environment Setup

```bash
conda create -n simtoken python=3.10 -y
conda activate simtoken

conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
git lfs install

pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Install SimToken dependencies:

```bash
pip install \
  numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
  transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
  packaging regex requests psutil gdown
```

Optional, only needed if regenerating audio features:

```bash
pip install towhee towhee.models
```

## 2. Repository Download

```bash
cd /workspace
huggingface-cli login
huggingface-cli download yfan07/SimToken \
  --repo-type model \
  --local-dir /workspace/SimToken \
  --local-dir-use-symlinks False
```

## 3. Model Preparation

### Hugging Face Models

```bash
mkdir -p /workspace/hf_models

huggingface-cli download openai/clip-vit-large-patch14 \
  --local-dir /workspace/hf_models/clip-vit-large-patch14 \
  --local-dir-use-symlinks False

huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
  --local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
  --local-dir-use-symlinks False
```

### SAM2 for TubeToken Proposals

Put SAM2 under `/workspace/sam2`:

```bash
cd /workspace
git clone https://github.com/facebookresearch/sam2.git
cd /workspace/sam2
pip install -e .
```

Download SAM2.1 checkpoints:

```bash
cd /workspace/sam2/checkpoints
bash download_ckpts.sh
```

The TubeToken Phase 0 commands use:

```text
/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
```

## 4. Dataset Preparation

Runtime layout:

```text
/workspace/SimToken/data
  metadata.csv
  media/
  gt_mask/
  audio_embed/
  image_embed/
```

Package the four data directories:

```bash
cd /workspace/SimToken/data
tar -cf media.tar media
tar -czf gt_mask.tar.gz gt_mask
tar -czf audio_embed.tar.gz audio_embed
tar -cf image_embed.tar image_embed
```

Restore the four data directories:

```bash
cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
```

## 5. Upload Repository

The remote repo stores the four large data directories as tar archives (`media.tar`, `image_embed.tar`, etc.). The local workspace has them extracted as plain directories. **Do not re-upload these directories.** Use `--exclude` to skip them; otherwise every extracted file would be treated as a new upload.

### 5a. Pack any new data directories before uploading

If `data/text_embed/` is new (first upload after running `precompute_text_feats.py`):

```bash
cd /workspace/SimToken/data
tar -cf text_embed.tar text_embed
```

### 5b. Login

```bash
cd /workspace/SimToken
huggingface-cli login
```

### 5c. Upload (excluding extracted data directories)

Use the new `hf upload` command (not the deprecated `huggingface-cli upload`). The deprecated command hashes all files before applying any filter, which is extremely slow with large data directories. `hf upload` with `--exclude` skips the specified files before hashing.

```bash
hf upload yfan07/SimToken . . \
  --repo-type model \
  --exclude "data/media/**" "data/gt_mask/**" "data/audio_embed/**" "data/image_embed/**" "data/text_embed/**" \
  2>&1 | tee upload.log
```

This uploads everything except the four extracted dataset directories and the raw `text_embed/` folder. The `data/text_embed.tar` file (sitting directly under `data/`) is **not** matched by `data/text_embed/**` and will be uploaded normally.

### Restore on a new server

After downloading the repo (Section 2), extract all packed data:

```bash
cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
tar -xf text_embed.tar   # if present
```

## 6. Current Experiment Files to Preserve

Keep these files and directories for continuing TubeToken experiments:

```text
runs/tubetoken_phase_minus1/audit_full
runs/tubetoken_phase_minus1/simtoken_eval
runs/tubetoken_phase0/proposals_stride8_n64_bidir
runs/tubetoken_phase0/eval_stride8_n64_bidir
runs/tubetoken_phase0/miss_videos_r64.txt
TubeToken_Phase0_Experiment_Log.md
TubeToken_Experiment_Plan_v4_Final.md
```
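
Before releasing a server, it can help to confirm that every path in the list above actually exists. The following is a minimal sketch, not part of the SimToken repository: `check_preserved` is a hypothetical helper name, and it assumes the listed paths are relative to the project root (`/workspace/SimToken` by default).

```bash
#!/usr/bin/env bash
# Sketch: report which of the "files to preserve" are missing under a project root.
# check_preserved is a hypothetical helper, and the paths are ASSUMED to be
# relative to the project root passed as $1 (default: /workspace/SimToken).

check_preserved() {
  local root="${1:-/workspace/SimToken}"
  local missing=0
  local rel
  for rel in \
    runs/tubetoken_phase_minus1/audit_full \
    runs/tubetoken_phase_minus1/simtoken_eval \
    runs/tubetoken_phase0/proposals_stride8_n64_bidir \
    runs/tubetoken_phase0/eval_stride8_n64_bidir \
    runs/tubetoken_phase0/miss_videos_r64.txt \
    TubeToken_Phase0_Experiment_Log.md \
    TubeToken_Experiment_Plan_v4_Final.md
  do
    # -e matches both files and directories
    if [ ! -e "$root/$rel" ]; then
      echo "MISSING: $rel"
      missing=$((missing + 1))
    fi
  done
  echo "missing=$missing"
  # Exit status is 0 only when nothing is missing
  [ "$missing" -eq 0 ]
}
```

After sourcing the snippet, `check_preserved /workspace/SimToken && echo "safe to upload"` prints one `MISSING:` line per absent path and only reaches the final `echo` when all seven entries are present.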