# SimToken Setup, Data, Upload, and Download Guide

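This guide covers moving the SimToken workspace between rented servers: environment setup, model and dataset preparation, and uploading to and restoring from the Hugging Face Hub.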

Assumed paths (the commands below hard-code these values, so adjust them if your layout differs):

```bash
PROJECT_ROOT=/workspace/SimToken
SAM2_ROOT=/workspace/sam2
HF_REPO=yfan07/SimToken
```

## 1. Environment Setup

```bash
conda create -n simtoken python=3.10 -y
conda activate simtoken

conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
git lfs install

pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
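
The fallback can also be scripted so that a fresh server tries each wheel index in turn. This is a minimal sketch, not part of the SimToken repo: `install_torch_with_fallback` is a hypothetical helper name, and its optional first argument (a stand-in installer command) exists only so the logic can be dry-run without touching the environment.

```bash
# Hypothetical helper (not part of SimToken): try the CUDA 12.6 wheel index
# first, then fall back to CUDA 12.1 if the install command fails.
install_torch_with_fallback() {
  local installer="${1:-pip}"   # assumption: pip from the active conda env
  local tag
  for tag in cu126 cu121; do
    if "$installer" install torch torchvision torchaudio \
        --index-url "https://download.pytorch.org/whl/${tag}"; then
      echo "installed from ${tag}"
      return 0
    fi
  done
  echo "no PyTorch wheel index worked" >&2
  return 1
}
```

Calling `install_torch_with_fallback` with no arguments uses `pip`; a non-zero exit status means neither index produced a working install.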

Install SimToken dependencies:

```bash
pip install \
  numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
  transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
  packaging regex requests psutil gdown
```

Optional, only needed if regenerating audio features:

```bash
pip install towhee towhee.models
```

## 2. Repository Download

```bash
cd /workspace
huggingface-cli login

huggingface-cli download yfan07/SimToken \
  --repo-type model \
  --local-dir /workspace/SimToken \
  --local-dir-use-symlinks False
```

## 3. Model Preparation

### Hugging Face Models

```bash
mkdir -p /workspace/hf_models

huggingface-cli download openai/clip-vit-large-patch14 \
  --local-dir /workspace/hf_models/clip-vit-large-patch14 \
  --local-dir-use-symlinks False

huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
  --local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
  --local-dir-use-symlinks False
```

### SAM2 for TubeToken Proposals

Put SAM2 under `/workspace/sam2`:

```bash
cd /workspace
git clone https://github.com/facebookresearch/sam2.git
cd /workspace/sam2

pip install -e .
```

Download SAM2.1 checkpoints:

```bash
cd /workspace/sam2/checkpoints
bash download_ckpts.sh
```

The TubeToken Phase 0 commands expect this checkpoint and config:

```text
/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
```

## 4. Dataset Preparation

Runtime layout:

```text
/workspace/SimToken/data
  metadata.csv
  media/
  gt_mask/
  audio_embed/
  image_embed/
```
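
Before launching a run on a new server, it can help to confirm that this layout is complete. The following is a sketch only: `check_data_layout` is a hypothetical helper, and it checks only the five entries listed above.

```bash
# Hypothetical helper: report which entries of the runtime layout are missing.
# Returns 0 if everything is present, 1 otherwise.
check_data_layout() {
  local data_dir="${1:-/workspace/SimToken/data}"
  local missing=0
  local name
  [ -f "${data_dir}/metadata.csv" ] || { echo "missing: metadata.csv"; missing=1; }
  for name in media gt_mask audio_embed image_embed; do
    [ -d "${data_dir}/${name}" ] || { echo "missing: ${name}/"; missing=1; }
  done
  return "$missing"
}
```

Running `check_data_layout` with no argument inspects `/workspace/SimToken/data`; pass another directory to check a different location.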

Package the four data directories:

```bash
cd /workspace/SimToken/data

tar -cf media.tar media
tar -czf gt_mask.tar.gz gt_mask
tar -czf audio_embed.tar.gz audio_embed
tar -cf image_embed.tar image_embed
```
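
Since these archives are large and travel between servers, recording checksums next to them gives a cheap integrity check after download. This is an optional sketch, assuming GNU coreutils `sha256sum` is available; the helper names and the `SHA256SUMS` file name are illustrative, not something the SimToken repo defines.

```bash
# Hypothetical helpers: record and verify SHA-256 checksums of the archives.
# Assumes GNU coreutils `sha256sum`; SHA256SUMS is an illustrative file name.
write_archive_checksums() {
  local data_dir="${1:-/workspace/SimToken/data}"
  (
    cd "$data_dir" || exit 1
    : > SHA256SUMS                      # start from an empty checksum file
    for f in *.tar *.tar.gz; do
      if [ -f "$f" ]; then              # skip unmatched glob patterns
        sha256sum "$f" >> SHA256SUMS
      fi
    done
  )
}

verify_archive_checksums() {
  local data_dir="${1:-/workspace/SimToken/data}"
  # --quiet prints only files that fail; exit status is non-zero on mismatch.
  ( cd "$data_dir" && sha256sum --check --quiet SHA256SUMS )
}
```

One possible workflow is to run `write_archive_checksums` after packaging and `verify_archive_checksums` on the new server before extracting.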

Restore the four data directories:

```bash
cd /workspace/SimToken/data

tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
```

## 5. Upload Repository

The remote repo stores the four large data directories as tar archives (`media.tar`, `image_embed.tar`, etc.).
The local workspace has them extracted as plain directories.
**Do not re-upload these directories.** Pass `--exclude` patterns for them (see 5c); otherwise every extracted file would be treated as a new upload.

### 5a. Pack any new data directories before uploading

If `data/text_embed/` is new (that is, this is the first upload after running `precompute_text_feats.py`), pack it as well:

```bash
cd /workspace/SimToken/data
tar -cf text_embed.tar text_embed
```

### 5b. Login

```bash
cd /workspace/SimToken
huggingface-cli login
```

### 5c. Upload (excluding extracted data directories)

Use the new `hf upload` command (not the deprecated `huggingface-cli upload`).
The deprecated command hashes all files before applying any filter, which is extremely slow with large data directories.
`hf upload` with `--exclude` skips the specified files before hashing.

```bash
hf upload yfan07/SimToken . . \
  --repo-type model \
  --exclude "data/media/**" "data/gt_mask/**" "data/audio_embed/**" "data/image_embed/**" "data/text_embed/**" \
  2>&1 | tee /workspace/upload.log   # keep the log outside the folder being uploaded
```

This uploads everything except the four extracted dataset directories and the raw `text_embed/` folder.
The `data/text_embed.tar` file (sitting directly under `data/`) is **not** matched by `data/text_embed/**` and will be uploaded normally.
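
To see why the archive survives the filter, the exclusion check can be approximated locally. This is a sketch under an assumption: it uses bash `case` globbing, where `*` also matches `/`, as a stand-in for the Hub client's pattern matching, and `is_excluded` is a hypothetical helper, not part of the `hf` CLI.

```bash
# Hypothetical helper: return 0 if PATH matches any of the given glob patterns.
# Approximation only: bash `case` globbing stands in for the Hub client's matcher.
is_excluded() {
  local path="$1"
  shift
  local pat
  for pat in "$@"; do
    case "$path" in
      $pat) return 0 ;;   # unquoted on purpose so $pat is treated as a glob
    esac
  done
  return 1
}
```

With the patterns from the command above, `is_excluded "data/text_embed/a.pt" "data/text_embed/**"` succeeds, while `is_excluded "data/text_embed.tar" "data/text_embed/**"` fails because the archive name does not start with `data/text_embed/`.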

### 5d. Restore on a new server

After downloading the repo (Section 2), extract all packed data:

```bash
cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
tar -xf text_embed.tar      # if present
```
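
Because `text_embed.tar` may or may not exist, the same restore can be written as a loop that skips missing archives. This is a sketch, assuming a `tar` that detects gzip compression automatically on extraction (GNU tar and bsdtar both do); `restore_data_archives` is a hypothetical helper name.

```bash
# Hypothetical helper: extract every expected archive that is present.
restore_data_archives() {
  local data_dir="${1:-/workspace/SimToken/data}"
  local archive
  for archive in media.tar gt_mask.tar.gz audio_embed.tar.gz \
                 image_embed.tar text_embed.tar; do
    if [ -f "${data_dir}/${archive}" ]; then
      # -C extracts into data_dir regardless of the current directory.
      tar -xf "${data_dir}/${archive}" -C "$data_dir" || return 1
      echo "extracted: ${archive}"
    else
      echo "skipped (not found): ${archive}"
    fi
  done
}
```

Running `restore_data_archives` with no argument operates on `/workspace/SimToken/data` and prints one status line per archive.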

## 6. Current Experiment Files to Preserve

Keep these files and directories for continuing TubeToken experiments:

```text
runs/tubetoken_phase_minus1/audit_full
runs/tubetoken_phase_minus1/simtoken_eval
runs/tubetoken_phase0/proposals_stride8_n64_bidir
runs/tubetoken_phase0/eval_stride8_n64_bidir
runs/tubetoken_phase0/miss_videos_r64.txt
TubeToken_Phase0_Experiment_Log.md
TubeToken_Experiment_Plan_v4_Final.md
```
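
For an extra safety copy before releasing a server, the entries above can be bundled into a single archive. This is an optional sketch, not a step the guide requires: `pack_experiment_files` and the default output name `tubetoken_experiments.tar.gz` are hypothetical, and entries that do not exist are skipped with a warning.

```bash
# Hypothetical helper: bundle the listed experiment files that exist.
pack_experiment_files() {
  local root="${1:-/workspace/SimToken}"
  local out="${2:-tubetoken_experiments.tar.gz}"   # illustrative file name
  local keep=()
  local path
  for path in \
      runs/tubetoken_phase_minus1/audit_full \
      runs/tubetoken_phase_minus1/simtoken_eval \
      runs/tubetoken_phase0/proposals_stride8_n64_bidir \
      runs/tubetoken_phase0/eval_stride8_n64_bidir \
      runs/tubetoken_phase0/miss_videos_r64.txt \
      TubeToken_Phase0_Experiment_Log.md \
      TubeToken_Experiment_Plan_v4_Final.md; do
    if [ -e "${root}/${path}" ]; then
      keep+=("$path")
    else
      echo "warning: not found, skipping ${path}" >&2
    fi
  done
  # Nothing to archive: signal failure instead of writing an empty tarball.
  [ "${#keep[@]}" -gt 0 ] || return 1
  tar -czf "$out" -C "$root" "${keep[@]}"
}
```

For example, `pack_experiment_files /workspace/SimToken /workspace/tubetoken_experiments.tar.gz` writes the bundle outside the repo folder so it is not picked up by the upload in Section 5.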