# SimToken Setup, Data, Upload, and Download Guide
This guide is for moving the SimToken workspace between rented servers.
The commands below assume the following paths:
```bash
PROJECT_ROOT=/workspace/SimToken
SAM2_ROOT=/workspace/sam2
HF_REPO=yfan07/SimToken
```
## 1. Environment Setup
```bash
conda create -n simtoken python=3.10 -y
conda activate simtoken
conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
git lfs install
pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```
If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
Install SimToken dependencies:
```bash
pip install \
numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
packaging regex requests psutil gdown
```
Optional, only needed if regenerating audio features:
```bash
pip install towhee towhee.models
```
## 2. Repository Download
```bash
cd /workspace
huggingface-cli login
huggingface-cli download yfan07/SimToken \
--repo-type model \
--local-dir /workspace/SimToken \
--local-dir-use-symlinks False
```
## 3. Model Preparation
### Hugging Face Models
```bash
mkdir -p /workspace/hf_models
huggingface-cli download openai/clip-vit-large-patch14 \
--local-dir /workspace/hf_models/clip-vit-large-patch14 \
--local-dir-use-symlinks False
huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
--local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
--local-dir-use-symlinks False
```
### SAM2 for TubeToken Proposals
Put SAM2 under `/workspace/sam2`:
```bash
cd /workspace
git clone https://github.com/facebookresearch/sam2.git
cd /workspace/sam2
pip install -e .
```
Download SAM2.1 checkpoints:
```bash
cd /workspace/sam2/checkpoints
bash download_ckpts.sh
```
The TubeToken Phase 0 commands use:
```text
/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
```
## 4. Dataset Preparation
Runtime layout:
```text
/workspace/SimToken/data
  metadata.csv
  media/
  gt_mask/
  audio_embed/
  image_embed/
```
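When moving between servers, a per-directory file count is a cheap way to confirm that nothing was lost in transit. The following is a minimal sketch that assumes the layout above; `count_data_files` is a hypothetical helper name, not part of SimToken.

```bash
# Hypothetical helper (not part of SimToken): print the number of regular
# files in each data directory so counts can be compared across servers.
count_data_files() {
  local data_dir="${1:-/workspace/SimToken/data}"
  local d n
  for d in media gt_mask audio_embed image_embed; do
    if [ -d "$data_dir/$d" ]; then
      n=$(find "$data_dir/$d" -type f | wc -l)
      # $((n)) strips any padding that some wc implementations add.
      printf '%s\t%s\n' "$d" "$((n))"
    else
      printf '%s\tMISSING\n' "$d"
    fi
  done
}
```

Running `count_data_files /workspace/SimToken/data` on the old server before packaging and again on the new server after extraction should give matching counts.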
Package the four data directories:
```bash
cd /workspace/SimToken/data
tar -cf media.tar media
tar -czf gt_mask.tar.gz gt_mask
tar -czf audio_embed.tar.gz audio_embed
tar -cf image_embed.tar image_embed
```
Restore the four data directories:
```bash
cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
```
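As an optional extra check, a checksum manifest can be written next to the archives before upload and verified after download, so a truncated or corrupted archive is caught before extraction. This is a sketch, not part of the SimToken workflow: the helper names and the `SHA256SUMS` file name are assumptions, and it relies on `sha256sum` from GNU coreutils being installed.

```bash
# Hypothetical helpers (not part of SimToken). Assumes GNU coreutils sha256sum.

# Record checksums using relative file names so the manifest stays valid
# when the data directory lives at a different path on another server.
write_manifest() {
  local data_dir="${1:-/workspace/SimToken/data}"
  ( cd "$data_dir" && \
    sha256sum media.tar gt_mask.tar.gz audio_embed.tar.gz image_embed.tar > SHA256SUMS )
}

# Returns non-zero if any listed archive is missing or its checksum differs.
verify_manifest() {
  local data_dir="${1:-/workspace/SimToken/data}"
  ( cd "$data_dir" && sha256sum -c SHA256SUMS )
}
```

The intended use is `write_manifest` on the old server after packaging and `verify_manifest` on the new server before running the restore commands.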
## 5. Upload Repository
The remote repo stores the four large data directories as tar archives (`media.tar`, `image_embed.tar`, etc.).
The local workspace has them extracted as plain directories.
**Do not re-upload these directories.** Skip them with `--exclude` (see 5c); otherwise every extracted file would be treated as a new upload.
### 5a. Pack any new data directories before uploading
If `data/text_embed/` is new (that is, this is the first upload after running `precompute_text_feats.py`), pack it first:
```bash
cd /workspace/SimToken/data
tar -cf text_embed.tar text_embed
```
### 5b. Login
```bash
cd /workspace/SimToken
huggingface-cli login
```
### 5c. Upload (excluding extracted data directories)
Use the new `hf upload` command (not the deprecated `huggingface-cli upload`).
The deprecated command hashes all files before applying any filter, which is extremely slow with large data directories.
`hf upload` with `--exclude` skips the specified files before hashing.
```bash
hf upload yfan07/SimToken . . \
--repo-type model \
--exclude "data/media/**" "data/gt_mask/**" "data/audio_embed/**" "data/image_embed/**" "data/text_embed/**" \
2>&1 | tee upload.log
```
This uploads everything except the four extracted dataset directories and the raw `text_embed/` folder.
The `data/text_embed.tar` file (sitting directly under `data/`) is **not** matched by `data/text_embed/**` and will be uploaded normally.
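Before starting a long upload, it can help to preview which files remain once the extracted directories are skipped. The sketch below approximates the `--exclude` patterns locally with `find -prune`; it does not call Hugging Face, its matching may differ from the CLI's in edge cases, and `list_upload_candidates` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of SimToken or the hf CLI): list files that
# would remain after pruning the extracted data directories.
list_upload_candidates() {
  local root="${1:-/workspace/SimToken}"
  ( cd "$root" && find . \
      \( -path ./data/media -o -path ./data/gt_mask -o -path ./data/audio_embed \
         -o -path ./data/image_embed -o -path ./data/text_embed \) -prune \
      -o -type f -print )
}
```

For example, `list_upload_candidates /workspace/SimToken | grep '^./data/'` should list `text_embed.tar` and the other archives, but no individual files from inside the extracted directories.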
### 5d. Restore on a new server
After downloading the repo (Section 2), extract all packed data:
```bash
cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
# text_embed.tar only exists if text features were packed (see 5a)
if [ -f text_embed.tar ]; then tar -xf text_embed.tar; fi
```
## 6. Current Experiment Files to Preserve
Keep these files and directories for continuing TubeToken experiments:
```text
runs/tubetoken_phase_minus1/audit_full
runs/tubetoken_phase_minus1/simtoken_eval
runs/tubetoken_phase0/proposals_stride8_n64_bidir
runs/tubetoken_phase0/eval_stride8_n64_bidir
runs/tubetoken_phase0/miss_videos_r64.txt
TubeToken_Phase0_Experiment_Log.md
TubeToken_Experiment_Plan_v4_Final.md
```
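Before releasing a rented server, a short check can confirm that every item in the list above is still present. This is a sketch that assumes the paths are relative to the project root (`/workspace/SimToken`); `check_preserved` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of SimToken): report any listed path that is
# missing under the project root. Returns 1 if anything is missing.
check_preserved() {
  local root="${1:-/workspace/SimToken}"
  local missing=0 p
  for p in \
    runs/tubetoken_phase_minus1/audit_full \
    runs/tubetoken_phase_minus1/simtoken_eval \
    runs/tubetoken_phase0/proposals_stride8_n64_bidir \
    runs/tubetoken_phase0/eval_stride8_n64_bidir \
    runs/tubetoken_phase0/miss_videos_r64.txt \
    TubeToken_Phase0_Experiment_Log.md \
    TubeToken_Experiment_Plan_v4_Final.md
  do
    if [ ! -e "$root/$p" ]; then
      echo "MISSING: $p"
      missing=1
    fi
  done
  return "$missing"
}
```

The same check can be run on the new server after download to confirm the experiment state arrived intact.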