	# SimToken Setup, Data, Upload, and Download Guide

	This guide is for moving the SimToken workspace between rented servers.

	Assumed paths:

	```bash
	PROJECT_ROOT=/workspace/SimToken
	SAM2_ROOT=/workspace/sam2
	HF_REPO=yfan07/SimToken
	```

	## 1. Environment Setup

	```bash
	conda create -n simtoken python=3.10 -y
	conda activate simtoken

	conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
	git lfs install

	pip install --upgrade pip setuptools wheel
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
	```

	If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:

	```bash
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
	```
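
	One common reason to fall back is a GPU driver that only supports an older CUDA runtime. The sketch below is a hypothetical helper (`pick_torch_index` is not part of SimToken or PyTorch) that suggests one of the two index URLs above. It assumes `nvidia-smi` prints `CUDA Version: X.Y` in its header and that `sort -V` is available (GNU coreutils).

	```bash
	# Hypothetical helper: map the driver's supported CUDA version to a wheel index.
	# Only the two index URLs used in this guide are considered.
	pick_torch_index() {
		cuda_version="$1"
		# sort -V orders version strings; if 12.6 sorts first (or ties),
		# the driver is new enough for the cu126 wheels.
		lowest=$(printf '%s\n' "12.6" "$cuda_version" | sort -V | head -n 1)
		if [ "$lowest" = "12.6" ]; then
			echo "https://download.pytorch.org/whl/cu126"
		else
			echo "https://download.pytorch.org/whl/cu121"
		fi
	}

	# Assumption: nvidia-smi prints "CUDA Version: X.Y" in its header.
	detected=$(nvidia-smi 2>/dev/null | sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1)
	echo "detected CUDA: ${detected:-unknown}"
	# With no detected version, fall back to the guide's default (cu126).
	echo "suggested index: $(pick_torch_index "${detected:-12.6}")"
	```

	The helper does not handle drivers older than CUDA 12.1; in that case neither index in this guide is likely to work as-is.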

	Install SimToken dependencies:

	```bash
	pip install \
	  numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
	  transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
	  packaging regex requests psutil gdown
	```

	Optional, only needed if regenerating audio features:

	```bash
	pip install towhee towhee.models
	```

	## 2. Repository Download

	```bash
	cd /workspace
	huggingface-cli login

	huggingface-cli download yfan07/SimToken \
	--repo-type model \
	--local-dir /workspace/SimToken \
	--local-dir-use-symlinks False
	```

	## 3. Model Preparation

	### Hugging Face Models

	```bash
	mkdir -p /workspace/hf_models

	huggingface-cli download openai/clip-vit-large-patch14 \
	--local-dir /workspace/hf_models/clip-vit-large-patch14 \
	--local-dir-use-symlinks False

	huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
	--local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
	--local-dir-use-symlinks False
	```

	### SAM2 for TubeToken Proposals

	Put SAM2 under `/workspace/sam2`:

	```bash
	cd /workspace
	git clone https://github.com/facebookresearch/sam2.git
	cd /workspace/sam2

	pip install -e .
	```

	Download SAM2.1 checkpoints:

	```bash
	cd /workspace/sam2/checkpoints
	bash download_ckpts.sh
	```

	The TubeToken Phase 0 commands use:

	```text
	/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
	/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
	```
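
	Before starting a Phase 0 run, it may help to confirm that both files are actually in place. This is a minimal sketch; `check_files` is a hypothetical helper, not something shipped with SAM2 or SimToken.

	```bash
	# Hypothetical helper: report any expected file that is missing or empty.
	check_files() {
		missing=0
		for path in "$@"; do
			# -s is true only for an existing file with size greater than zero.
			if [ ! -s "$path" ]; then
				echo "missing or empty: $path" >&2
				missing=$((missing + 1))
			fi
		done
		# Succeed only when every file was found.
		[ "$missing" -eq 0 ]
	}

	if check_files \
		/workspace/sam2/checkpoints/sam2.1_hiera_large.pt \
		/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
	then
		echo "SAM2 files present"
	else
		echo "SAM2 files incomplete; rerun download_ckpts.sh or check the clone"
	fi
	```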

	## 4. Dataset Preparation

	Runtime layout:

	```text
	/workspace/SimToken/data/
	  metadata.csv
	  media/
	  gt_mask/
	  audio_embed/
	  image_embed/
	```

	Package the four data directories:

	```bash
	cd /workspace/SimToken/data

	tar -cf media.tar media
	tar -czf gt_mask.tar.gz gt_mask
	tar -czf audio_embed.tar.gz audio_embed
	tar -cf image_embed.tar image_embed
	```
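
	Since the archives are what gets moved between servers, a quick sanity check before relying on them is to compare file counts. The sketch below uses a hypothetical `verify_archive` helper. It assumes the directories hold only regular files (no symlinks), that no file name contains a newline, and that `tar` auto-detects gzip when listing, as GNU tar and bsdtar do.

	```bash
	# Hypothetical helper: compare regular-file counts in a directory and its archive.
	# Usage: verify_archive <directory> <archive>
	verify_archive() {
		dir_count=$(find "$1" -type f | wc -l)
		# tar -t lists members; directory entries end in "/" and are not counted.
		tar_count=$(tar -tf "$2" | grep -vc '/$')
		echo "$1: $dir_count files on disk, $tar_count in $2"
		[ "$dir_count" -eq "$tar_count" ]
	}

	# Only runs where the guide's data directory exists.
	if [ -d /workspace/SimToken/data ]; then
		(
			cd /workspace/SimToken/data
			verify_archive media media.tar
			verify_archive gt_mask gt_mask.tar.gz
			verify_archive audio_embed audio_embed.tar.gz
			verify_archive image_embed image_embed.tar
		)
	fi
	```

	A matching count does not prove the contents are identical, but it catches an archive that was interrupted or created from a partial directory.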

	Restore the four data directories:

	```bash
	cd /workspace/SimToken/data

	tar -xf media.tar
	tar -xzf gt_mask.tar.gz
	tar -xzf audio_embed.tar.gz
	tar -xf image_embed.tar
	```

	## 5. Upload Repository

	The remote repo stores the four large data directories as tar archives (`media.tar`, `gt_mask.tar.gz`, `audio_embed.tar.gz`, `image_embed.tar`).
	The local workspace has them extracted as plain directories.
	Do not upload the extracted directories. Skip them with `--exclude`; otherwise every extracted file would be treated as a new upload.

	### 5a. Pack any new data directories before uploading

	If `data/text_embed/` is new (first upload after running `precompute_text_feats.py`):

	```bash
	cd /workspace/SimToken/data
	tar -cf text_embed.tar text_embed
	```

	### 5b. Login

	```bash
	cd /workspace/SimToken
	huggingface-cli login
	```

	### 5c. Upload (excluding extracted data directories)

	Use the new `hf upload` command (not the deprecated `huggingface-cli upload`).
	The deprecated command hashes all files before applying any filter, which is extremely slow with large data directories.
	`hf upload` with `--exclude` skips the specified files before hashing.

	```bash
	hf upload yfan07/SimToken . . \
	--repo-type model \
	--exclude "data/media/" "data/gt_mask/" "data/audio_embed/" "data/image_embed/" "data/text_embed/**" \
	2>&1 | tee upload.log
	```

	This uploads everything except the four extracted dataset directories and the raw `text_embed/` folder.
	The `data/text_embed.tar` file (sitting directly under `data/`) is not matched by `data/text_embed/**` and will be uploaded normally.
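
	To preview which paths the exclude list would skip, the patterns can be approximated with shell `case` globs. This is only a rough sketch, not the matcher that `hf upload` itself runs: it assumes the CLI uses fnmatch-style matching in which `*` may also match `/`, and that a pattern ending in `/` covers everything under that directory. The helper `is_excluded` and the sample file names are hypothetical.

	```bash
	# Rough preview of the --exclude list using shell case globs.
	# Assumption: as with fnmatch-style matching, "*" here may match "/" as well.
	is_excluded() {
		case "$1" in
			data/media/*|data/gt_mask/*|data/audio_embed/*|data/image_embed/*|data/text_embed/*)
				return 0 ;;
			*)
				return 1 ;;
		esac
	}

	# Sample paths are made up for illustration.
	for path in data/text_embed.tar data/text_embed/0001.pt data/media.tar data/media/a/b.mp4 train.py; do
		if is_excluded "$path"; then
			echo "skip    $path"
		else
			echo "upload  $path"
		fi
	done
	```

	The archives sit directly under `data/`, so none of the directory patterns match them.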

	### 5d. Restore on a new server

	After downloading the repo (Section 2), extract all packed data:

	```bash
	cd /workspace/SimToken/data
	tar -xf media.tar
	tar -xzf gt_mask.tar.gz
	tar -xzf audio_embed.tar.gz
	tar -xf image_embed.tar
	[ -f text_embed.tar ] && tar -xf text_embed.tar  # only if the archive exists
	```
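
	Large archives can be truncated in transit, so one option is to record checksums on the old server and verify them on the new one before extracting. The sketch below is one way to do that. `SHA256SUMS` is an assumed file name, both helpers are hypothetical, and `sha256sum` is assumed to be available (GNU coreutils).

	```bash
	# Hypothetical helpers; SHA256SUMS is an assumed file name, not part of the repo.

	# Run in data/ on the old server, before uploading.
	write_manifest() {
		# Record one checksum line per archive in the current directory.
		# An unmatched glob stays literal and is skipped by the -f test.
		for f in *.tar *.tar.gz; do
			if [ -f "$f" ]; then
				sha256sum "$f"
			fi
		done > SHA256SUMS
	}

	# Run in data/ on the new server, after downloading and before extracting.
	check_manifest() {
		# Exits non-zero if any listed archive is missing or has changed.
		sha256sum -c SHA256SUMS
	}
	```

	Because the manifest sits directly under `data/`, the exclude patterns in 5c do not match it and it would be uploaded with the archives.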

	## 6. Current Experiment Files to Preserve

	Keep these files and directories for continuing TubeToken experiments:

	```text
	runs/tubetoken_phase_minus1/audit_full
	runs/tubetoken_phase_minus1/simtoken_eval
	runs/tubetoken_phase0/proposals_stride8_n64_bidir
	runs/tubetoken_phase0/eval_stride8_n64_bidir
	runs/tubetoken_phase0/miss_videos_r64.txt
	TubeToken_Phase0_Experiment_Log.md
	TubeToken_Experiment_Plan_v4_Final.md
	```