Space: rdz-falcon/SignMotionGPT

rdz-falcon committed on Dec 7, 2025

Commit

4bd136e

1 Parent(s): bf06606

Deploy SignMotionGPT Demo with LFS

Files changed (31)

.gitattributes +4 -0
INFERENCE_AND_VIS.md +253 -0
README.md +116 -12
app.py +166 -0
collators.py +75 -0
config.py +87 -0
data.py +169 -0
data/motion_llm_dataset.json +3 -0
data/smplx_models/SMPLX_NEUTRAL.npz +3 -0
data/vqvae_model.pt +3 -0
data/vqvae_stats.pt +3 -0
generate.py +194 -0
inference.py +244 -0
mGPT/__init__.py +0 -0
mGPT/archs/__init__.py +0 -0
mGPT/archs/mgpt_vq.py +189 -0
mGPT/archs/tools/__init__.py +0 -0
mGPT/archs/tools/quantize_cnn.py +410 -0
mGPT/archs/tools/resnet.py +81 -0
metrics.py +731 -0
model.py +152 -0
requirements.txt +32 -0
setup_env.sh +70 -0
templates.py +133 -0
test_dataset_eval.py +534 -0
test_overfit.py +1562 -0
train.py +744 -0
train_mgpt_vqvae.py +438 -0
train_pipeline.py +264 -0
train_vqvae.py +421 -0
visualize.py +681 -0

.gitattributes CHANGED

@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+data/motion_llm_dataset.json filter=lfs diff=lfs merge=lfs -text
+data/vqvae_model.pt filter=lfs diff=lfs merge=lfs -text
+data/smplx_models/SMPLX_NEUTRAL.npz filter=lfs diff=lfs merge=lfs -text
+data/vqvae_stats.pt filter=lfs diff=lfs merge=lfs -text

INFERENCE_AND_VIS.md ADDED

	@@ -0,0 +1,253 @@

+# Inference & Visualization Quick Reference
+## Overview
+After training your 3-stage SignMotionGPT model, use these scripts to generate and visualize motions.
+---
+## 1. Inference (Generate Motion Tokens)
+### Basic Usage
+```bash
+# Generate from Stage 3 model (recommended)
+python inference.py --prompt "walking forward"
+# Try different stages
+python inference.py --prompt "dancing" --stage 1  # Motion-only LM
+python inference.py --prompt "dancing" --stage 2  # Multi-task
+python inference.py --prompt "dancing" --stage 3  # T2M SFT (best quality)
+```
+### Save Output
+```bash
+python inference.py --prompt "jumping" --output my_motion.txt
+```
+### With Participant ID
+```bash
+python inference.py --prompt "yoga pose" --pid P40
+```
+### Expected Output
+```
+============================================================
+Motion Generation Inference - Stage 3
+============================================================
+Prompt: 'walking forward'
+Device: cuda
+Loading Stage 3 model from: /kaggle/working/SignMotionGPT/stage3_t2m_sft
+✅ Stage 3 model loaded successfully
+Generating motion for: 'walking forward'
+============================================================
+Generated Motion:
+============================================================
+<MOT_BEGIN><motion_224><motion_39><motion_76>...<MOT_END>
+============================================================
+```
+---
+## 2. Visualization (Motion Tokens → 3D Animation)
+### Prerequisites
+#### Option A: Use Google Drive (Colab/Kaggle)
+Edit `setup_env.sh` and add your Google Drive file IDs:
+```bash
+VQVAE_MODEL_ID="1AbCdEfGhIj"           # VQ-VAE checkpoint (.pt)
+VQVAE_STATS_ID="2KlMnOpQrSt"          # Normalization stats (.pt)
+SMPLX_MODELS_ID="3UvWxYzAbCd"         # SMPL-X models (.zip)
+```
+Then run:
+```bash
+bash setup_env.sh
+```
+#### Option B: Manual Setup (Local)
+```bash
+export VQVAE_CHECKPOINT=/path/to/vqvae_model.pt
+export VQVAE_STATS_PATH=/path/to/vqvae_stats.pt
+export SMPLX_MODEL_DIR=/path/to/smplx_models
+```
+### Basic Usage
+```bash
+# Visualize token string
+python visualize.py --tokens "<MOT_BEGIN><motion_177><motion_135>...<MOT_END>"
+# Visualize from file
+python visualize.py --input my_motion.txt
+# Generate + visualize in one command
+python visualize.py --prompt "walking" --stage 3
+```
+### Custom Output
+```bash
+python visualize.py \
+  --input motion_tokens.txt \
+  --output walk_animation.html \
+  --title "Walking Forward" \
+  --fps 30
+```
+### With Custom Paths
+```bash
+python visualize.py \
+  --tokens "<MOT_BEGIN>..." \
+  --vqvae-ckpt /custom/vqvae.pt \
+  --stats /custom/stats.pt \
+  --smplx-dir /custom/smplx_models \
+  --output animation.html
+```
+### Expected Output
+```
+============================================================
+Motion Visualization Pipeline
+============================================================
+[1/5] Parsing tokens...
+   Parsed 15 tokens
+[2/5] Loading VQ-VAE...
+✅ VQ-VAE loaded (codebook size: 512)
+[3/5] Loading normalization stats...
+✅ Stats loaded (mean shape: (182,))
+[4/5] Loading SMPL-X model...
+✅ SMPL-X loaded
+[5/5] Decoding and rendering...
+   Decoding tokens to SMPL-X parameters...
+   Decoded params shape: (16, 182)
+   Converting parameters to vertices...
+   Vertices shape: (16, 10475, 3), Faces: (20908, 3)
+   Creating animation...
+✅ Animation saved to: motion_animation.html
+============================================================
+✅ Visualization complete!
+============================================================
+```
+---
+## 3. Complete Workflow Example
+### A. Train (already done)
+```bash
+python train_pipeline.py
+```
+### B. Generate Motion Tokens
+```bash
+python inference.py --prompt "college" --stage 3 --output college_motion.txt
+```
+### C. Visualize
+```bash
+python visualize.py --input college_motion.txt --output college_animation.html
+```
+### D. View Animation
+Open `college_animation.html` in a browser. You'll see an interactive 3D SMPL-X character performing the motion. Use the mouse to rotate and zoom, and click the Play/Pause buttons to control playback.
+---
+## 4. Troubleshooting
+### Inference Issues
+**"Checkpoint not found"**
+- Ensure you've trained all stages first: `python train_pipeline.py`
+- Check that `OUT_S1`, `OUT_S2`, `OUT_S3` directories exist in `WORK_DIR`
+**"Dataset not found"**
+- Inference needs the dataset to build vocabulary
+- Set `DATA_JSON_PATH` in `config.py` or via environment variable
+### Visualization Issues
+**"VQ-VAE checkpoint not found"**
+- Download VQ-VAE model or set `VQVAE_CHECKPOINT` path
+- The VQ-VAE is separate from LLM training (used to decode tokens to SMPL-X params)
+**"SMPL-X models not found"**
+- Download SMPL-X models from https://smpl-x.is.tue.mpg.de/
+- Extract to a directory and set `SMPLX_MODEL_DIR`
+**"No tokens to visualize"**
+- Check token format: should contain `<motion_ID>` tags or space-separated numbers
+- Example valid formats:
+  - `<MOT_BEGIN><motion_177><motion_135><MOT_END>`
+  - `177 135 152 200 46 142`
+**"Shape mismatch" or "Decoding errors"**
+- Ensure VQ-VAE checkpoint matches the codebook size used in LLM training
+- Check `CODEBOOK_SIZE`, `CODE_DIM`, `SMPL_DIM` in `visualize.py` match training
+---
+## 5. Configuration
+### Key Environment Variables
+| Variable | Purpose | Default |
+|----------|---------|---------|
+| `VQVAE_CHECKPOINT` | VQ-VAE model path | `./data/vqvae_model.pt` |
+| `VQVAE_STATS_PATH` | Normalization stats | `./data/vqvae_stats.pt` |
+| `SMPLX_MODEL_DIR` | SMPL-X models directory | `./data/smplx_models` |
+| `VIS_OUTPUT_DIR` | Output directory for animations | `WORK_DIR` |
+### VQ-VAE Architecture (must match training)
+In `visualize.py`:
+```python
+SMPL_DIM = 182           # SMPL-X parameter dimension
+CODEBOOK_SIZE = 512      # Motion vocabulary size
+CODE_DIM = 512           # Latent code dimension
+VQ_ARGS = dict(
+    width=512,
+    depth=3,
+    down_t=2,
+    stride_t=2,
+    ...
+)
+```
+---
+## 6. Tips
+### Inference
+- **Stage 3** generally produces the best quality for text-to-motion
+- **Stage 2** can handle M2T and denoising (but `inference.py` only does T2M)
+- **Stage 1** generates motion without text conditioning (it still needs a prompt to determine the length)
+- Use `--no-per-prompt-vocab` to allow novel combinations (less constrained)
+### Visualization
+- **FPS 20-30** works well for most motions
+- Longer sequences may take a few seconds to render
+- The HTML file is self-contained and can be shared
+- 3D mesh has ~10K vertices; animations can be large for long sequences
+### Performance
+- Inference: ~1-2 seconds per generation (depends on length)
+- Visualization: ~3-10 seconds (depends on sequence length and batch size)
+- Both run on GPU if available, fall back to CPU otherwise
+---
+## 7. Next Steps
+- **Batch Inference**: Loop over multiple prompts and save outputs
+- **Evaluate Quality**: Compare generated tokens to ground truth using edit distance
+- **Fine-tune Generation**: Adjust `GEN_TEMPERATURE`, `GEN_TOP_P` in `config.py`
+- **Export to Other Formats**: Extend `visualize.py` to export BVH, FBX, or USD
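
Reviewer sketch (not part of the commit): the troubleshooting section of INFERENCE_AND_VIS.md names two accepted token formats, and its "Next Steps" list suggests comparing generated tokens to ground truth by edit distance. The following is a minimal sketch of both ideas. `parse_motion_tokens` and `edit_distance` are hypothetical helpers written for illustration; the actual parser in `visualize.py` is not shown in this excerpt and may behave differently.

```python
import re
from typing import List

# Matches tags such as <motion_177>; <MOT_BEGIN>/<MOT_END> are ignored.
_MOTION_TAG = re.compile(r"<motion_(\d+)>")


def parse_motion_tokens(text: str) -> List[int]:
    """Return codebook indices from either documented token format."""
    tagged = _MOTION_TAG.findall(text)
    if tagged:
        return [int(t) for t in tagged]
    # Fall back to whitespace-separated integers; skip anything else.
    return [int(t) for t in text.split() if t.isdigit()]


def edit_distance(a: List[int], b: List[int]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


generated = parse_motion_tokens("<MOT_BEGIN><motion_177><motion_135><MOT_END>")
reference = parse_motion_tokens("177 135 152")
distance = edit_distance(generated, reference)
```

Dividing `distance` by the longer sequence length would give a length-normalized score, which may be easier to compare across prompts.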

README.md CHANGED

@@ -1,12 +1,116 @@
----
-title: SignMotionGPT
-emoji: 🌍
-colorFrom: yellow
-colorTo: green
-sdk: gradio
-sdk_version: 6.0.2
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+### 1) Configure setup script (one time)
+Run the setup:
+```bash
+bash setup_env.sh
+```
+After setup, defaults are:
+- `WORK_DIR` = current directory
+- `DATA_JSON_PATH` = `./data/motion_llm_dataset.json`
+You can override via environment variables if needed:
+```bash
+export WORK_DIR=/path/to/workdir
+export DATA_JSON_PATH=/path/to/motion_llm_dataset.json
+```
+## Overview
+This repository implements a robust 2-stage training pipeline for motion generation, replicating the high-performance "overfit" test setup:
+- **Stage 1**: Motion-only Language Model (MLM) - Pre-training on motion token sequences to learn the "language of motion".
+- **Stage 2**: Text-to-Motion Fine-Tuning (T2M) - Supervised fine-tuning to align text prompts with motion sequences.
+Key features:
+- **Integrated Evaluation**: Automatically computes FID, Diversity, and Multimodality (MIM) metrics.
+- **Side-by-Side Visualization**: Generates HTML comparisons of Ground Truth vs Generated motions.
+- **Test Set Evaluation**: Can optionally run evaluation on a held-out test set (SMPL-X data).
+- **Hugging Face Integration**: Automatic checkpointing and resuming from the Hub.
+## Installation
+```bash
+# Clone the repository
+git clone https://github.com/rajvizala/SignMotionGPT.git
+cd SignMotionGPT
+# Setup Everything
+bash setup_env.sh
+```
+## Dataset Format
+Your dataset should be a JSON file with the following structure:
+```json
+[
+  {
+    "text_query": "a person walks forward",
+    "motion_tokens": "42 18 91 ...",
+    "participant_id": "P001"  // Optional
+  },
+  ...
+]
+```
+## Quick Start
+### 1. Configure Training
+Edit `config.py` to set your paths and hyperparameters. Key settings include:
+- `DATA_JSON_PATH`: Path to your dataset.
+- `MODEL_NAME`: Base model (e.g., "Qwen/Qwen3-0.6B").
+- `PIPELINE_OUTPUT_DIR`: Directory for checkpoints and results.
+- `HF_TOKEN`: Your Hugging Face token (or set via env var).
+### 2. Run Full Pipeline
+```bash
+python train_pipeline.py
+```
+This script orchestrates the entire process:
+1.  **Data Loading & Cleaning**: Deduplicates samples and builds vocabulary.
+2.  **Stage 1 Training**: Motion Language Modeling (Pre-training).
+3.  **Stage 2 Training**: Text-to-Motion Fine-Tuning.
+4.  **Evaluation**: Runs inference on specific words, computes metrics (FID, Diversity, MIM), and generates visualizations.
+5.  **Test Set Evaluation**: (Optional) Runs evaluation on held-out test data if configured.
+### 3. Environment Variables
+You can control many aspects via environment variables without editing code:
+```bash
+# Training Config
+export PIPELINE_S1_EPOCHS=20
+export PIPELINE_S2_EPOCHS=20
+export PIPELINE_S1_BATCH=8
+export PIPELINE_S2_BATCH=8
+# Hugging Face
+export HUGGINGFACE_HUB_TOKEN="your_token"
+export HF_UPLOAD_INTERVAL_EPOCHS=2
+# Evaluation
+export EVALUATION_WORDS="passport,send,library"
+export TEST_EVAL_SAMPLE_LIMIT=100
+```
+## Held-out Test Dataset Evaluation
+The pipeline includes integration with `test_dataset_eval.py` to measure performance on an unseen SMPL-X test dataset.
+To enable this, ensure that `TEST_EVAL_DOWNLOAD_DIR` or `TEST_EVAL_EXTRACT_DIR` is configured in `config.py` or via environment variables. The pipeline will attempt to run this evaluation after training if the data is available.
+## Visualization
+The pipeline automatically generates side-by-side HTML visualizations in the output directory (`html_visualizations` folder). You can open these in any browser to compare Ground Truth motions with the model's generations.
+To manually visualize tokens:
+```bash
+python visualize.py --tokens "<MOT_BEGIN><motion_177>...<MOT_END>" --output my_anim.html
+```
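
Reviewer sketch (not part of the commit): the README's "Dataset Format" section describes records with a whitespace-separated `motion_tokens` string. A quick pre-flight check along these lines might catch malformed records before training. Note that the README documents the text field as `text_query`, while `data.py` in this same commit reads `word`; accepting either key is an assumption of this sketch, and `validate_entries` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def validate_entries(entries: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in dataset records."""
    problems = []
    for i, entry in enumerate(entries):
        # Assumption: either key may carry the text; see the note above.
        if not (entry.get("text_query") or entry.get("word")):
            problems.append(f"entry {i}: missing text_query/word")
        tokens = str(entry.get("motion_tokens", "")).split()
        if not tokens:
            problems.append(f"entry {i}: empty motion_tokens")
        elif not all(t.isdigit() for t in tokens):
            problems.append(f"entry {i}: non-integer motion token")
    return problems


sample = json.loads("""
[
  {"text_query": "a person walks forward", "motion_tokens": "42 18 91", "participant_id": "P001"},
  {"word": "send", "motion_tokens": "7 x 9"},
  {"motion_tokens": ""}
]
""")
problems = validate_entries(sample)
```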

app.py ADDED

	@@ -0,0 +1,166 @@

+import gradio as gr
+import torch
+import os
+import sys
+import warnings
+from pathlib import Path
+# Add root to path to allow imports from project root when running from demo-code/
+# or when running from root
+current_dir = os.path.dirname(os.path.abspath(__file__))
+parent_dir = os.path.dirname(current_dir)
+sys.path.append(current_dir)
+sys.path.append(parent_dir)
+# Import project modules
+try:
+    from inference import load_trained_model, inference as run_inference_cmd
+    from visualize import visualize
+    from model import setup_model_and_tokenizer, get_motion_token_info
+    from generate import generate_t2m
+    from data import compute_length_stats, build_prompt_vocab, check_has_participant_id, load_dataset
+    import config
+except ImportError as e:
+    print(f"Error importing project modules: {e}")
+    print("Make sure you are running this from the project root or have the project structure intact.")
+# Constants
+HF_REPO_ID = "rdz-falcon/SignMotionGPT"
+EPOCH_SUBFOLDER = "stage2/epoch-030"
+def load_model_from_hf(repo_id, subfolder, token=None):
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    print(f"Loading model from HF: {repo_id}/{subfolder}")
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, token=token, trust_remote_code=True)
+        model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, token=token, trust_remote_code=True)
+        return model, tokenizer
+    except Exception as e:
+        print(f"Error loading model: {e}")
+        return None, None
+# Global model cache
+MODEL = None
+TOKENIZER = None
+MOTION_TOKEN_IDS = None
+MOT_BEGIN_ID = None
+MOT_END_ID = None
+CODEBOOK_SIZE = 512
+def init_model():
+    global MODEL, TOKENIZER, MOTION_TOKEN_IDS, MOT_BEGIN_ID, MOT_END_ID
+    if MODEL is not None:
+        return
+    token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
+    # Load model/tokenizer
+    MODEL, TOKENIZER = load_model_from_hf(HF_REPO_ID, EPOCH_SUBFOLDER, token)
+    if MODEL is None:
+        raise RuntimeError(f"Failed to load model from {HF_REPO_ID}/{EPOCH_SUBFOLDER}")
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    MODEL.to(device)
+    MODEL.eval()
+    # Setup token info
+    motion_token_ids = []
+    for i in range(CODEBOOK_SIZE):
+        t = f"<motion_{i}>"
+        if t in TOKENIZER.get_vocab():
+            motion_token_ids.append(TOKENIZER.convert_tokens_to_ids(t))
+    MOTION_TOKEN_IDS = motion_token_ids
+    MOT_BEGIN_ID = TOKENIZER.convert_tokens_to_ids("<MOT_BEGIN>") if "<MOT_BEGIN>" in TOKENIZER.get_vocab() else None
+    MOT_END_ID = TOKENIZER.convert_tokens_to_ids("<MOT_END>") if "<MOT_END>" in TOKENIZER.get_vocab() else None
+    print("Model initialized.")
+def generate_motion_app(text_prompt):
+    if not text_prompt:
+        return None, "Please enter a prompt."
+    if MODEL is None:
+        try:
+            init_model()
+        except Exception as e:
+            return None, f"Model Initialization Failed: {e}"
+    device = MODEL.device
+    print(f"Generating for: {text_prompt}")
+    try:
+        generated_tokens = generate_t2m(
+            model=MODEL,
+            tokenizer=TOKENIZER,
+            prompt_text=text_prompt,
+            mot_begin_id=MOT_BEGIN_ID,
+            mot_end_id=MOT_END_ID,
+            motion_token_ids=MOTION_TOKEN_IDS,
+            length_stats_by_text={}, # Fallback to global_median_len
+            global_median_len=100,   # Reasonable default
+            prompt_vocab=None,
+            has_pid=False,
+            per_prompt_vocab=False   # Allow all tokens
+        )
+    except Exception as e:
+        return None, f"Generation Error: {e}"
+    # Visualization
+    try:
+        # Ensure paths for VQ-VAE and SMPL-X
+        # In HF Spaces, we assume these are in the repo (e.g., ./data)
+        data_dir = os.environ.get("DATA_DIR", "data")
+        vqvae_ckpt = os.path.join(data_dir, "vqvae_model.pt")
+        stats_path = os.path.join(data_dir, "vqvae_stats.pt")
+        smplx_dir = os.path.join(data_dir, "smplx_models")
+        # Check existence
+        missing = []
+        if not os.path.exists(vqvae_ckpt): missing.append(vqvae_ckpt)
+        if not os.path.exists(stats_path): missing.append(stats_path)
+        if not os.path.exists(smplx_dir): missing.append(smplx_dir)
+        if missing:
+            return None, f"Missing visualization files in {data_dir}: {missing}. Please ensure they are uploaded to the Space."
+        # Output to a temporary file
+        # Gradio needs a file path or HTML string. visualize returns a Figure.
+        output_html = "temp_viz.html"
+        fig = visualize(
+            tokens=generated_tokens,
+            vqvae_ckpt=vqvae_ckpt,
+            stats_path=stats_path,
+            smplx_dir=smplx_dir,
+            output_html=output_html,
+            title=f"Motion: {text_prompt}",
+            fps=20
+        )
+        if fig is None:
+            return None, "Visualization failed (no frames produced)."
+        return fig, f"Success! Generated tokens length: {len(generated_tokens.split())}"
+    except Exception as e:
+        return None, f"Visualization Error: {e}"
+# Gradio UI
+with gr.Interface(
+    fn=generate_motion_app,
+    inputs=gr.Textbox(label="Enter Motion Prompt", placeholder="e.g. walking forward"),
+    outputs=[
+        gr.Plot(label="Motion Visualization"),
+        gr.Textbox(label="Status/Output")
+    ],
+    title="SignMotionGPT Demo",
+    description="Generate Sign Language/Motion Avatars from Text. Using model checkpoint: epoch 30."
+) as demo:
+    pass
+if __name__ == "__main__":
+    demo.launch()
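
Reviewer sketch (not part of the commit): `init_model` looks up `<motion_i>`, `<MOT_BEGIN>`, and `<MOT_END>` in the tokenizer vocabulary, calling `get_vocab()` on every loop iteration. The same lookup can be done against a single vocabulary dict. In the sketch below a plain `dict` stands in for `TOKENIZER.get_vocab()`, and `collect_motion_ids` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple


def collect_motion_ids(
    vocab: Dict[str, int], codebook_size: int
) -> Tuple[List[int], Optional[int], Optional[int]]:
    """Return (motion token ids, <MOT_BEGIN> id, <MOT_END> id) from a vocab."""
    motion_ids = [
        vocab[f"<motion_{i}>"]
        for i in range(codebook_size)
        if f"<motion_{i}>" in vocab
    ]
    # dict.get returns None when a marker token is absent.
    return motion_ids, vocab.get("<MOT_BEGIN>"), vocab.get("<MOT_END>")


# Toy vocabulary with made-up ids, standing in for tokenizer.get_vocab().
toy_vocab = {"hello": 5, "<MOT_BEGIN>": 100, "<MOT_END>": 101,
             "<motion_0>": 102, "<motion_2>": 104}
motion_ids, begin_id, end_id = collect_motion_ids(toy_vocab, codebook_size=4)
```

An empty `motion_ids` list or a `None` marker id is worth surfacing as an error at startup. `data.py` in this commit wraps tokens as `<M{n}>` between `<M_START>` and `<M_END>`, so whether the deployed checkpoint's tokenizer contains the `<motion_i>` names cannot be confirmed from this diff.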

collators.py ADDED

	@@ -0,0 +1,75 @@

+"""
+Data collators with label masking for training
+"""
+import torch
+class AssistantSpanCollator:
+    """
+    Collator that masks labels to only train on assistant responses.
+    For where=="mot": labels only inside <MOT_BEGIN>...<MOT_END> in assistant
+    For where=="text": labels entire assistant span (for M2T tasks)
+    """
+    def __init__(self, tokenizer, max_length):
+        self.tok = tokenizer
+        self.max_len = max_length
+        # Get special token IDs
+        self.im_start = self.tok.convert_tokens_to_ids("<|im_start|>")
+        self.im_end = self.tok.convert_tokens_to_ids("<|im_end|>")
+        self.mot_beg = self.tok.convert_tokens_to_ids("<MOT_BEGIN>")
+        self.mot_end = self.tok.convert_tokens_to_ids("<MOT_END>")
+    def __call__(self, examples):
+        texts = [e["text"] for e in examples]
+        wheres = [e["where"] for e in examples]
+        # Tokenize
+        enc = self.tok(
+            texts,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=self.max_len
+        )
+        input_ids = enc["input_ids"]
+        labels = input_ids.clone().fill_(-100)
+        # Apply label masking per example
+        for i, w in enumerate(wheres):
+            seq = input_ids[i]
+            # Find last <|im_start|> (start of assistant)
+            starts = (seq == self.im_start).nonzero(as_tuple=True)[0]
+            if starts.numel() == 0:
+                continue
+            a_start = int(starts[-1].item())
+            # Find corresponding <|im_end|>
+            sub = seq[a_start+1:]
+            ends = (sub == self.im_end).nonzero(as_tuple=True)[0]
+            a_end = (a_start + 1 + int(ends[0].item())) if ends.numel() > 0 else (seq.size(0) - 1)
+            if w == "text":
+                # Label entire assistant span
+                labels[i, a_start+1:a_end] = seq[a_start+1:a_end]
+            else:
+                # Label only motion tokens between <MOT_BEGIN> and <MOT_END>
+                asst = seq[a_start+1:a_end]
+                bpos = (asst == self.mot_beg).nonzero(as_tuple=True)[0]
+                epos = (asst == self.mot_end).nonzero(as_tuple=True)[0]
+                if bpos.numel() > 0 and epos.numel() > 0 and epos[0] >= bpos[0]:
+                    b = a_start + 1 + int(bpos[0].item())
+                    e = a_start + 1 + int(epos[0].item())
+                    labels[i, b:e+1] = seq[b:e+1]
+        return {
+            "input_ids": input_ids,
+            "attention_mask": enc["attention_mask"],
+            "labels": labels
+        }
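
Reviewer sketch (not part of the commit): the masking rule in `AssistantSpanCollator.__call__` can be restated on plain Python lists, which makes it easy to check by hand. The token ids below are made up, and `mask_labels` is a hypothetical pure-Python restatement that omits tokenization and batching.

```python
from typing import List

IGNORE = -100  # label value that the loss ignores


def mask_labels(seq: List[int], where: str, im_start: int, im_end: int,
                mot_beg: int, mot_end: int) -> List[int]:
    """Label only the last assistant span ('text') or its motion block."""
    labels = [IGNORE] * len(seq)
    starts = [i for i, t in enumerate(seq) if t == im_start]
    if not starts:
        return labels
    a_start = starts[-1]  # last <|im_start|> opens the assistant turn
    # First <|im_end|> after it, else fall back to the last position.
    a_end = next((i for i in range(a_start + 1, len(seq)) if seq[i] == im_end),
                 len(seq) - 1)
    if where == "text":
        labels[a_start + 1:a_end] = seq[a_start + 1:a_end]
        return labels
    span = range(a_start + 1, a_end)
    b = next((i for i in span if seq[i] == mot_beg), None)
    e = next((i for i in span if seq[i] == mot_end), None)
    if b is not None and e is not None and e >= b:
        labels[b:e + 1] = seq[b:e + 1]  # include both boundary tokens
    return labels


IM_START, IM_END, MOT_BEG, MOT_END = 1, 2, 3, 4
# user turn, then assistant turn containing a motion block, then one pad token
seq = [1, 10, 11, 2, 1, 12, 3, 50, 51, 4, 2, 0]
mot_labels = mask_labels(seq, "mot", IM_START, IM_END, MOT_BEG, MOT_END)
text_labels = mask_labels(seq, "text", IM_START, IM_END, MOT_BEG, MOT_END)
```

In the `"mot"` case only positions 6 through 9 carry labels. In the `"text"` case the role token at position 5 is also labeled, and the closing `<|im_end|>` is not.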

config.py ADDED

	@@ -0,0 +1,87 @@

+"""
+Configuration file for Motion LLM training
+"""
+import os
+import torch
+# Random seed
+SEED = 42
+# Paths
+# WORK_DIR defaults to current working directory if not explicitly set
+WORK_DIR = os.environ.get("WORK_DIR", os.getcwd())
+DATA_DIR = os.environ.get("DATA_DIR", os.path.join(WORK_DIR, "data"))
+os.makedirs(DATA_DIR, exist_ok=True)
+# Single-file JSON dataset path (can be overridden via env)
+DATA_JSON_PATH = os.environ.get(
+    "DATA_JSON_PATH",
+    os.path.join(DATA_DIR, "motion_llm_dataset.json"),
+)
+# Directory Configuration
+# PIPELINE_OUTPUT_DIR matches test_overfit's default "./motion_gpt_full_model"
+PIPELINE_OUTPUT_DIR = os.environ.get("PIPELINE_OUTPUT_DIR", "./motion_gpt_full_model")
+METRICS_JSON_PATH = os.path.join(PIPELINE_OUTPUT_DIR, "metrics.json")
+CHECKPOINTS_DIR = os.path.join(PIPELINE_OUTPUT_DIR, "checkpoints")
+# Model configuration
+MODEL_NAME = "Qwen/Qwen3-0.6B"  # Matches test_overfit.py
+MAX_SEQ_LEN = 512 # Kept from previous config, though test_overfit uses 256 in datasets
+DTYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
+# Evaluation Words (matches test_overfit.py)
+EVALUATION_WORDS = ["passport", "send", "library", "push"]
+# Training Hyperparameters (matches test_overfit.py)
+# Stage 1
+S1_EPOCHS = 20
+S1_LR = 5e-5
+S1_BATCH_SIZE = 8
+# Stage 2
+S2_EPOCHS = 20
+S2_LR = 2e-5
+S2_BATCH_SIZE = 8
+# Inference Hyperparameters (matches test_overfit.py)
+INFERENCE_REPETITION_PENALTY = 1.2
+INFERENCE_TEMPERATURE = 0.7
+INFERENCE_TOP_K = 50
+# Special Tokens (matches test_overfit.py)
+M_START = "<M_START>"
+M_END = "<M_END>"
+PAD_TOKEN = "<PAD>"
+# Hugging Face Hub Configuration
+HF_USE_HUB = True
+HF_TOKEN = os.environ.get("HUGGINGFACE_HUB_TOKEN") or os.environ.get("hf_auth_token")
+HF_USER = os.environ.get("HF_USER", "rdz-falcon") # Derived from test_overfit.py repo ids
+HF_STAGE1_REPO_ID = "rdz-falcon/SignMotionGPTfit-archive"
+HF_STAGE2_REPO_ID = "rdz-falcon/SignMotionGPTfit-archive"
+HF_PRIVATE_REPO = os.environ.get("HF_PRIVATE", "true").lower() != "false"
+FORCE_STAGE2_FROM_STAGE1_RAW = os.environ.get("FORCE_STAGE2_FROM_STAGE1", "false")
+FORCE_STAGE2_FROM_STAGE1 = str(FORCE_STAGE2_FROM_STAGE1_RAW).strip().lower() not in ("0", "false", "no", "off")
+HF_STAGE2_SAVE_SUBDIR = os.environ.get("HF_STAGE2_SAVE_SUBDIR", "stage2_v2")
+CHECKPOINT_UPLOAD_INTERVAL_EPOCHS = int(os.environ.get("HF_UPLOAD_INTERVAL_EPOCHS", "2"))
+HF_DISABLE_PROGRESS = os.environ.get("HF_DISABLE_PROGRESS", "true").lower() != "false"
+# Evaluation controls
+RUN_EVALS_ONLY = False
+EVAL_SAMPLE_LIMIT = 100
+# Test Eval Config (from test_dataset_eval.py defaults)
+TEST_EVAL_OUTPUT_DIR = os.environ.get("TEST_EVAL_OUTPUT_DIR", PIPELINE_OUTPUT_DIR)
+TEST_EVAL_DOWNLOAD_DIR = os.environ.get(
+    "TEST_EVAL_DOWNLOAD_DIR", os.path.join(WORK_DIR, "test_data", "downloads")
+)
+TEST_EVAL_EXTRACT_DIR = os.environ.get(
+    "TEST_EVAL_EXTRACT_DIR", os.path.join(WORK_DIR, "test_data", "extracted")
+)
+TEST_EVAL_SAMPLE_LIMIT = int(os.environ.get("TEST_EVAL_SAMPLE_LIMIT", "300"))
+TEST_EVAL_MAX_ZIPS = int(os.environ.get("TEST_EVAL_MAX_ZIPS", "500"))
+TEST_EVAL_HF_REPO = os.environ.get("TEST_EVAL_HF_REPO", "rdz-falcon/SignMotionGPTfit-archive")
+TEST_EVAL_HF_SUBFOLDER = os.environ.get(
+    "TEST_EVAL_HF_SUBFOLDER", f"{HF_STAGE2_SAVE_SUBDIR}/latest"
+)
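
Reviewer sketch (not part of the commit): `config.py` parses boolean environment variables in two ways. `HF_PRIVATE` and `HF_DISABLE_PROGRESS` use `!= "false"`, so a value such as `0` still evaluates to true, while `FORCE_STAGE2_FROM_STAGE1` checks against a set of falsy strings. One shared helper could make the behaviour uniform. `env_flag` is a hypothetical name, and treating the empty string as false is a design assumption of this sketch that differs from the original tuple check.

```python
import os

# Assumption: the empty string counts as false, unlike the original check.
_FALSY = {"0", "false", "no", "off", ""}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean environment variable with one consistent rule."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY


os.environ["DEMO_HF_PRIVATE"] = "0"
private = env_flag("DEMO_HF_PRIVATE", default=True)
```

With the `!= "false"` form used in the commit, the same `"0"` value would yield `True`.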

data.py ADDED

	@@ -0,0 +1,169 @@

+"""
+Dataset loading and vocabulary building utilities
+"""
+import json
+import os
+import random
+from typing import List, Dict, Tuple, Any
+from collections import defaultdict
+import torch
+from torch.utils.data import Dataset, DataLoader
+from transformers import AutoTokenizer
+from config import M_START, M_END, PAD_TOKEN
+# ======================================================================================
+# Logic from test_overfit.py
+# ======================================================================================
+def read_json_data(json_path: str) -> List[Dict[str, Any]]:
+    """Loads the dataset from the specified JSON file."""
+    if not os.path.exists(json_path):
+        raise FileNotFoundError(f"Dataset not found at: {json_path}")
+    with open(json_path, "r", encoding="utf-8") as f:
+        return json.load(f)
+def deduplicate_and_prepare_data(entries: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[str]]:
+    """
+    Cleans the entire dataset by ensuring each (word, participant_id) pair is unique.
+    If a conflict is found (same pair, different motion), it keeps only the first one encountered.
+    Then, it prepares the full list of motion tokens from the cleaned data.
+    """
+    print("\n---> Cleaning dataset by removing ambiguous (word, participant_id) pairs...")
+    unique_samples = {}
+    conflicts_found = 0
+    for entry in entries:
+        word = entry.get("word", "").lower()
+        pid = entry.get("participant_id", "")
+        key = (word, pid)
+        if key not in unique_samples:
+            unique_samples[key] = entry
+        else:
+            # A sample for this key already exists. We only care if it's a conflict.
+            existing_tokens = unique_samples[key].get("motion_tokens")
+            current_tokens = entry.get("motion_tokens")
+            if existing_tokens != current_tokens:
+                conflicts_found += 1
+                # We do nothing, effectively discarding this new conflicting sample.
+    cleaned_data = list(unique_samples.values())
+    print(f"Original samples: {len(entries)}")
+    print(f"Cleaned samples (unique (word, pid) pairs): {len(cleaned_data)}")
+    print(f"Removed {len(entries) - len(cleaned_data)} total samples. ({conflicts_found} were direct conflicts).")
+    print("\n---> Extracting motion tokens from the full cleaned dataset...")
+    all_motion_tokens = set()
+    for entry in cleaned_data:
+        motion_tokens = entry.get("motion_tokens", "").strip().split()
+        for token in motion_tokens:
+            all_motion_tokens.add(f"<M{token}>")
+    unique_tokens = sorted(list(all_motion_tokens))
+    print(f"Found {len(unique_tokens)} unique motion tokens in the entire dataset.")
+    return cleaned_data, unique_tokens
+class MotionDataset(Dataset):
+    """Dataset for Stage 1: Contains only motion token sequences."""
+    def __init__(self, data: List[Dict[str, Any]], tokenizer: AutoTokenizer, max_length: int = 256):
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+        self.sequences = []
+        for item in data:
+            tokens_str = item.get("motion_tokens", "")
+            wrapped_tokens = " ".join([f"<M{t}>" for t in tokens_str.split()])
+            full_sequence = f"{M_START} {wrapped_tokens} {M_END}"
+            self.sequences.append(full_sequence)
+    def __len__(self):
+        return len(self.sequences)
+    def __getitem__(self, idx):
+        return self.tokenizer(
+            self.sequences[idx],
+            truncation=True,
+            max_length=self.max_length,
+            padding="max_length",
+            return_tensors="pt"
+        )
+class TextMotionDataset(Dataset):
+    """Dataset for Stage 2: Contains (prompt, motion_sequence) pairs."""
+    def __init__(self, data: List[Dict[str, Any]], tokenizer: AutoTokenizer, max_length: int = 256):
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+        self.items = []
+        for item in data:
+            prompt = f"Instruction: Generate motion for word '{item['word']}' with variant '{item['participant_id']}'.\nMotion: "
+            tokens_str = item.get("motion_tokens", "")
+            wrapped_tokens = " ".join([f"<M{t}>" for t in tokens_str.split()])
+            target_sequence = f"{M_START} {wrapped_tokens} {M_END}"
+            full_text = prompt + target_sequence
+            tokenized = self.tokenizer(
+                full_text,
+                truncation=True,
+                max_length=self.max_length,
+                padding="max_length",
+                return_tensors="pt"
+            )
+            prompt_tokenized = self.tokenizer(prompt, return_tensors="pt")
+            prompt_len = prompt_tokenized.input_ids.shape[1]
+            labels = tokenized['input_ids'].clone()
+            labels[0, :prompt_len] = -100
+            self.items.append({
+                "input_ids": tokenized['input_ids'].squeeze(0),
+                "attention_mask": tokenized['attention_mask'].squeeze(0),
+                "labels": labels.squeeze(0)
+            })
+    def __len__(self):
+        return len(self.items)
+    def __getitem__(self, idx):
+        return self.items[idx]
+# ======================================================================================
+# Legacy utilities (kept for compatibility if needed, but mostly superseded)
+# ======================================================================================
+def build_motion_vocab(dataset):
+    """
+    Build motion vocabulary by finding max token ID
+    Returns: (codebook_size, max_token_id)
+    """
+    def max_token_in_example(ex):
+        return max(int(x) for x in ex["motion_tokens"].split())
+    global_max_id = 0
+    for ex in dataset:
+        global_max_id = max(global_max_id, max_token_in_example(ex))
+    codebook_size = global_max_id + 1
+    return codebook_size, global_max_id
+def motion_specials_to_ids(s: str) -> List[int]:
+    """Extract motion IDs from special tokens"""
+    toks = s.strip().split()
+    ids = []
+    for t in toks:
+        if t.startswith("<motion_") or (t.startswith("<M") and t.endswith(">") and t[2:-1].isdigit()):
+            # Handle both <motion_ID> and <M{ID}> formats
+            try:
+                if t.startswith("<motion_"):
+                    ids.append(int(t[8:-1]))
+                else:
+                    ids.append(int(t[2:-1]))
+            except ValueError:
+                # Skip malformed tags such as "<motion_abc>"
+                pass
+    return ids
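
Reviewer sketch (not part of the commit): a small worked example of the two core transformations in `data.py`, first-wins deduplication on `(word, participant_id)` and wrapping raw token ids as `<M{n}>` between the start and end markers. `dedupe_first_wins` and `wrap_motion_tokens` are hypothetical condensed versions of the logic above, without the logging.

```python
from typing import Any, Dict, List, Tuple


def dedupe_first_wins(
    entries: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], int]:
    """Keep the first entry per (word, participant_id); count conflicts."""
    unique: Dict[Tuple[str, str], Dict[str, Any]] = {}
    conflicts = 0
    for entry in entries:
        key = (entry.get("word", "").lower(), entry.get("participant_id", ""))
        if key not in unique:
            unique[key] = entry
        elif unique[key].get("motion_tokens") != entry.get("motion_tokens"):
            conflicts += 1  # same key, different motion: later one is dropped
    return list(unique.values()), conflicts


def wrap_motion_tokens(tokens_str: str, m_start: str = "<M_START>",
                       m_end: str = "<M_END>") -> str:
    """Turn '1 2' into '<M_START> <M1> <M2> <M_END>'."""
    wrapped = " ".join(f"<M{t}>" for t in tokens_str.split())
    return f"{m_start} {wrapped} {m_end}"


entries = [
    {"word": "Send", "participant_id": "P1", "motion_tokens": "1 2"},
    {"word": "send", "participant_id": "P1", "motion_tokens": "3 4"},  # conflict
    {"word": "send", "participant_id": "P1", "motion_tokens": "1 2"},  # exact duplicate
    {"word": "send", "participant_id": "P2", "motion_tokens": "5"},
]
cleaned, conflicts = dedupe_first_wins(entries)
first_sequence = wrap_motion_tokens(cleaned[0]["motion_tokens"])
```

Because the key lowercases `word`, "Send" and "send" collapse into one entry, and the first occurrence is the one kept.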

data/motion_llm_dataset.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ba9a0521241d7c72d0759c739ea323eee47e04cf41a5a7b756b9e083b40bc4e1
+size 16798494

data/smplx_models/SMPLX_NEUTRAL.npz ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:376021446ddc86e99acacd795182bbef903e61d33b76b9d8b359c2b0865bd992
+size 108752058

data/vqvae_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fadbf3fb4ded1c6fe7752e7e381b627a46fa37787d051d969b73d97f81b278fb
+size 231392924

data/vqvae_stats.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fa86de891dd702ca71f0006cfbf68839c5eba35fb728891ab9f1890949dca943
+size 2876

generate.py ADDED

	@@ -0,0 +1,194 @@

+"""
+Generation and inference utilities with constrained decoding
+"""
+import torch
+from transformers import LogitsProcessor, LogitsProcessorList
+from typing import Dict
+from config import (
+    SYSTEM_MSG, GEN_MAX_NEW_TOKENS, GEN_TEMPERATURE,
+    GEN_TOP_P, GEN_TOP_K, GEN_NO_REPEAT_NGRAM_SIZE,
+    GEN_REPETITION_PENALTY, GEN_END_LOGIT_SLOPE
+)
+class LengthAwareMotionLogitsProcessor(LogitsProcessor):
+    """
+    Constrained decoding processor that:
+    1. Enforces motion token vocabulary
+    2. Controls sequence length (min/soft_target/max)
+    3. Biases toward ending at soft_target length
+    """
+    def __init__(self, prompt_len, mot_begin_id, mot_end_id, motion_ids,
+                 hard_min, soft_target, hard_max, end_logit_slope=0.25):
+        super().__init__()
+        self.prompt_len = int(prompt_len)
+        self.mot_begin_id = int(mot_begin_id)
+        self.mot_end_id = int(mot_end_id)
+        self.motion_ids = torch.tensor(sorted(set(int(x) for x in motion_ids)))
+        self.motion_plus_end = torch.tensor(
+            sorted(set(list(self.motion_ids.tolist()) + [self.mot_end_id]))
+        )
+        self.hard_min = int(hard_min)
+        self.soft_target = int(soft_target)
+        self.hard_max = int(hard_max)
+        self.end_logit_slope = float(end_logit_slope)
+    def __call__(self, input_ids, scores):
+        device = scores.device
+        bs = scores.size(0)
+        mask = torch.full_like(scores, float("-inf"))
+        for b in range(bs):
+            gen = input_ids[b, self.prompt_len:]
+            # No tokens generated yet - must start with MOT_BEGIN
+            if gen.numel() == 0:
+                allowed = torch.tensor([self.mot_begin_id], device=device)
+                mask[b].index_fill_(0, allowed, 0.0)
+                continue
+            # Find MOT_BEGIN position
+            begin_pos = (gen == self.mot_begin_id).nonzero(as_tuple=True)[0]
+            if begin_pos.numel() == 0:
+                allowed = torch.tensor([self.mot_begin_id], device=device)
+                mask[b].index_fill_(0, allowed, 0.0)
+                continue
+            # Already generated MOT_END - force EOS
+            if (gen == self.mot_end_id).any():
+                allowed = torch.tensor([self.mot_end_id], device=device)
+                mask[b].index_fill_(0, allowed, 0.0)
+                continue
+            # Count motion tokens after MOT_BEGIN
+            after_begin = gen[begin_pos[0].item() + 1:]
+            cur_len = after_begin.numel()
+            # Before minimum length - only allow motion tokens
+            if cur_len < self.hard_min:
+                allowed = self.motion_ids.to(device)
+                mask[b].index_fill_(0, allowed, 0.0)
+            # At or beyond maximum length - force end
+            elif cur_len >= self.hard_max:
+                allowed = torch.tensor([self.mot_end_id], device=device)
+                mask[b].index_fill_(0, allowed, 0.0)
+            # Between min and max - allow motion tokens or end
+            else:
+                allowed = self.motion_plus_end.to(device)
+                mask[b].index_fill_(0, allowed, 0.0)
+                # Bias toward ending at soft_target
+                distance = max(0, cur_len - self.soft_target)
+                bias = self.end_logit_slope * float(distance)
+                scores[b, self.mot_end_id] = scores[b, self.mot_end_id] + bias
+        return scores + mask
+def get_len_controls(prompt_text: str, length_stats_by_text: Dict, global_median_len: int):
+    """
+    Get length controls (min/soft_target/max) for a given prompt
+    """
+    s = length_stats_by_text.get(prompt_text)
+    if s is None:
+        med = global_median_len
+    else:
+        med = s["median"]
+    hard_min = max(1, int(0.6 * med))
+    soft_tgt = med
+    hard_max = max(hard_min + 4, int(1.4 * med))
+    return hard_min, soft_tgt, hard_max
+def generate_t2m(
+    model,
+    tokenizer,
+    prompt_text: str,
+    mot_begin_id: int,
+    mot_end_id: int,
+    motion_token_ids: list,
+    length_stats_by_text: Dict,
+    global_median_len: int,
+    prompt_vocab: Dict = None,
+    pid: str = None,
+    has_pid: bool = False,
+    max_new_tokens: int = None,
+    per_prompt_vocab: bool = True
+):
+    """
+    Generate motion sequence from text prompt with constrained decoding
+    """
+    model.eval()
+    device = next(model.parameters()).device
+    if max_new_tokens is None:
+        max_new_tokens = GEN_MAX_NEW_TOKENS
+    # Build prompt
+    pid_tok = ""
+    if has_pid and pid is not None:
+        pid_tok = f"<PID_{pid}>"
+    user_text = f"<T2M>{pid_tok}\n\n" + prompt_text
+    prompt = (
+        "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+        + "<|im_start|>user\n" + user_text + "\n<|im_end|>\n"
+        + "<|im_start|>assistant\n"
+    )
+    # Tokenize
+    inputs = tokenizer(prompt, return_tensors="pt").to(device)
+    prompt_len = inputs["input_ids"].size(1)
+    # Get length controls
+    hard_min, soft_tgt, hard_max = get_len_controls(
+        prompt_text, length_stats_by_text, global_median_len
+    )
+    # Get allowed motion tokens
+    if per_prompt_vocab and prompt_vocab:
+        allowed_motion_ids = prompt_vocab.get(prompt_text, motion_token_ids)
+    else:
+        allowed_motion_ids = motion_token_ids
+    # Setup constrained decoding
+    processors = LogitsProcessorList([
+        LengthAwareMotionLogitsProcessor(
+            prompt_len=prompt_len,
+            mot_begin_id=mot_begin_id,
+            mot_end_id=mot_end_id,
+            motion_ids=allowed_motion_ids,
+            hard_min=hard_min,
+            soft_target=soft_tgt,
+            hard_max=hard_max,
+            end_logit_slope=GEN_END_LOGIT_SLOPE,
+        )
+    ])
+    # Generate
+    with torch.no_grad():
+        out = model.generate(
+            input_ids=inputs["input_ids"],
+            attention_mask=inputs.get("attention_mask"),
+            max_new_tokens=min(max_new_tokens, hard_max + 4),
+            do_sample=True,
+            temperature=GEN_TEMPERATURE,
+            top_p=GEN_TOP_P,
+            top_k=GEN_TOP_K,
+            no_repeat_ngram_size=GEN_NO_REPEAT_NGRAM_SIZE,
+            repetition_penalty=GEN_REPETITION_PENALTY,
+            logits_processor=processors,
+            eos_token_id=mot_end_id,
+            pad_token_id=tokenizer.eos_token_id,
+        )
+    # Decode
+    decoded = tokenizer.decode(out[0], skip_special_tokens=False)
+    reply = decoded.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0]
+    return reply
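
The length handling in `get_len_controls` and the end-token bias in `LengthAwareMotionLogitsProcessor` reduce to a few lines of arithmetic. The sketch below restates them in plain Python to show concrete numbers. `length_window` and `end_bias` are hypothetical names used only for illustration, and the slope of 0.25 is the processor's default argument; the value actually used at generation time is `GEN_END_LOGIT_SLOPE` from `config.py`, which is not shown here.

```python
from typing import Tuple


def length_window(median_len: int) -> Tuple[int, int, int]:
    """Mirror the min/target/max rule: 60% and 140% of the median length."""
    hard_min = max(1, int(0.6 * median_len))
    soft_target = median_len
    hard_max = max(hard_min + 4, int(1.4 * median_len))
    return hard_min, soft_target, hard_max


def end_bias(cur_len: int, soft_target: int, slope: float = 0.25) -> float:
    """Logit bonus added to the end token once cur_len exceeds soft_target."""
    return slope * max(0, cur_len - soft_target)


print(length_window(25))  # median of 25 motion tokens
print(length_window(2))   # very short clips still get a window of 4 tokens
for n in (20, 25, 30, 34):
    print(n, end_bias(n, soft_target=25))
```

The bonus is zero up to the soft target and then grows linearly. In the processor it is only applied between `hard_min` and `hard_max`, where both motion tokens and the end token are allowed.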

inference.py ADDED Viewed

	@@ -0,0 +1,244 @@

+"""
+Inference script for generating motion tokens from text prompts.
+Run after training to generate motion sequences from any text description.
+Usage:
+    python inference.py --prompt "walking forward" --stage 3
+    python inference.py --prompt "dancing" --stage 2 --output motion_output.txt
+"""
+import os
+import argparse
+import torch
+from pathlib import Path
+from config import (
+    OUT_S1, OUT_S2, OUT_S3, MAX_SEQ_LEN, DATA_JSON_PATH,
+    WORK_DIR
+)
+from data import (
+    load_dataset, compute_length_stats, build_prompt_vocab,
+    check_has_participant_id
+)
+from model import setup_model_and_tokenizer, get_motion_token_info
+from generate import generate_t2m
+def load_trained_model(stage: int, device: torch.device):
+    """
+    Load a trained model from a specific stage checkpoint.
+    Args:
+        stage: Stage number (1, 2, or 3)
+        device: Device to load model on
+    Returns:
+        model, tokenizer, motion_token_ids, mot_begin_id, mot_end_id, raw_ds
+    """
+    stage_dirs = {1: OUT_S1, 2: OUT_S2, 3: OUT_S3}
+    stage_dir = stage_dirs.get(stage)
+    if not stage_dir or not os.path.exists(stage_dir):
+        raise FileNotFoundError(
+            f"Stage {stage} checkpoint not found at {stage_dir}. "
+            f"Train stage {stage} first."
+        )
+    print(f"\nLoading Stage {stage} model from: {stage_dir}")
+    # Load dataset to build vocab (needed for model setup)
+    if not os.path.exists(DATA_JSON_PATH):
+        raise FileNotFoundError(f"Dataset not found: {DATA_JSON_PATH}")
+    raw_ds = load_dataset(DATA_JSON_PATH)
+    # Build motion vocab
+    def max_token_in_example(ex):
+        return max(int(x) for x in ex["motion_tokens"].split())
+    global_max_id = max(max_token_in_example(ex) for ex in raw_ds)
+    codebook_size = global_max_id + 1
+    # Check for participant IDs
+    has_pid = check_has_participant_id(raw_ds)
+    unique_pids = None
+    if has_pid:
+        unique_pids = sorted({str(ex["participant_id"]) for ex in raw_ds})
+    # Setup model and tokenizer with same config as training
+    model, tokenizer, _ = setup_model_and_tokenizer(codebook_size, unique_pids)
+    # Load trained weights from checkpoint
+    # Try different checkpoint naming patterns
+    possible_ckpts = [
+        os.path.join(stage_dir, "pytorch_model.bin"),
+        os.path.join(stage_dir, "model.safetensors"),
+        os.path.join(stage_dir, "adapter_model.bin"),
+    ]
+    loaded = False
+    for ckpt_path in possible_ckpts:
+        if os.path.exists(ckpt_path):
+            print(f"Found checkpoint: {ckpt_path}")
+            # NOTE: nothing is loaded here; this loop only checks that a checkpoint
+            # file exists (Unsloth/PEFT models save adapters separately).
+            loaded = True
+            break
+    if not loaded:
+        print(f"⚠️  No explicit checkpoint file found, using model directory: {stage_dir}")
+    # Move model to device
+    model.to(device)
+    model.eval()
+    # Get motion token info
+    motion_token_ids, mot_begin_id, mot_end_id = get_motion_token_info(
+        tokenizer, codebook_size
+    )
+    print(f"✅ Stage {stage} model loaded successfully")
+    print(f"   Vocabulary size: {len(tokenizer)}")
+    print(f"   Motion tokens: {len(motion_token_ids)}")
+    return model, tokenizer, motion_token_ids, mot_begin_id, mot_end_id, raw_ds
+def inference(
+    prompt: str,
+    stage: int = 3,
+    pid: str = None,
+    output_file: str = None,
+    per_prompt_vocab: bool = True,
+    device: torch.device = None
+):
+    """
+    Generate motion tokens from a text prompt.
+    Args:
+        prompt: Text description of desired motion
+        stage: Which training stage model to use (1, 2, or 3)
+        pid: Optional participant ID for personalization
+        output_file: Optional file to save output tokens
+        per_prompt_vocab: Whether to use per-prompt vocabulary constraints
+        device: Device to run inference on
+    Returns:
+        Generated motion token string
+    """
+    if device is None:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print("="*60)
+    print(f"Motion Generation Inference - Stage {stage}")
+    print("="*60)
+    print(f"Prompt: '{prompt}'")
+    print(f"Device: {device}")
+    # Load model and dataset
+    model, tokenizer, motion_token_ids, mot_begin_id, mot_end_id, raw_ds = load_trained_model(stage, device)
+    # Compute length stats and prompt vocab
+    print("\nComputing dataset statistics...")
+    length_stats_by_text, global_median_len = compute_length_stats(raw_ds)
+    prompt_vocab = build_prompt_vocab(raw_ds)
+    has_pid = check_has_participant_id(raw_ds)
+    # Generate motion tokens
+    print(f"\nGenerating motion for: '{prompt}'")
+    print(f"Per-prompt vocabulary: {per_prompt_vocab}")
+    generated = generate_t2m(
+        model=model,
+        tokenizer=tokenizer,
+        prompt_text=prompt,
+        mot_begin_id=mot_begin_id,
+        mot_end_id=mot_end_id,
+        motion_token_ids=motion_token_ids,
+        length_stats_by_text=length_stats_by_text,
+        global_median_len=global_median_len,
+        prompt_vocab=prompt_vocab,
+        has_pid=has_pid,
+        per_prompt_vocab=per_prompt_vocab,
+        pid=pid
+    )
+    print("\n" + "="*60)
+    print("Generated Motion:")
+    print("="*60)
+    print(generated)
+    print("="*60)
+    # Optionally save to file
+    if output_file:
+        output_path = Path(output_file)
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        with open(output_path, 'w') as f:
+            f.write(generated)
+        print(f"\n✅ Output saved to: {output_file}")
+    return generated
+def main():
+    parser = argparse.ArgumentParser(
+        description="Generate motion tokens from text prompts using trained SignMotionGPT model"
+    )
+    parser.add_argument(
+        "--prompt",
+        type=str,
+        required=True,
+        help="Text description of the desired motion (e.g., 'walking forward', 'dancing')"
+    )
+    parser.add_argument(
+        "--stage",
+        type=int,
+        default=3,
+        choices=[1, 2, 3],
+        help="Which training stage model to use (1=motion-only, 2=multi-task, 3=T2M SFT, default=3)"
+    )
+    parser.add_argument(
+        "--pid",
+        type=str,
+        default=None,
+        help="Optional participant ID for personalized generation (e.g., 'P40')"
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=None,
+        help="Optional output file to save generated tokens"
+    )
+    parser.add_argument(
+        "--no-per-prompt-vocab",
+        action="store_true",
+        help="Disable per-prompt vocabulary constraints (allows all motion tokens)"
+    )
+    parser.add_argument(
+        "--device",
+        type=str,
+        default=None,
+        choices=["cpu", "cuda", "cuda:0", "cuda:1"],
+        help="Device to run inference on (default: auto-detect)"
+    )
+    args = parser.parse_args()
+    # Setup device
+    if args.device:
+        device = torch.device(args.device)
+    else:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    # Run inference
+    inference(
+        prompt=args.prompt,
+        stage=args.stage,
+        pid=args.pid,
+        output_file=args.output,
+        per_prompt_vocab=not args.no_per_prompt_vocab,
+        device=device
+    )
+if __name__ == "__main__":
+    main()
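
The `--device` option in `main()` above accepts only a fixed list of four strings. One possible alternative, shown here as a sketch and not as the project's CLI, is an `argparse` type function that accepts any CUDA index. `device_string` is a hypothetical helper; the two option names are taken from the script.

```python
import argparse
import re


def device_string(value: str) -> str:
    """Accept 'cpu', 'cuda', or 'cuda:<index>' and reject anything else."""
    if value == "cpu" or re.fullmatch(r"cuda(:\d+)?", value):
        return value
    raise argparse.ArgumentTypeError(f"unsupported device: {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--device", type=device_string, default=None)
parser.add_argument("--no-per-prompt-vocab", action="store_true")

# Parse an explicit argument list so the example does not depend on sys.argv.
args = parser.parse_args(["--device", "cuda:3", "--no-per-prompt-vocab"])
per_prompt_vocab = not args.no_per_prompt_vocab  # same inversion as in main()
print(args.device, per_prompt_vocab)
```

The returned string can be passed to `torch.device(...)` exactly as the script already does.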

mGPT/__init__.py ADDED Viewed

File without changes

mGPT/archs/__init__.py ADDED Viewed

File without changes

mGPT/archs/mgpt_vq.py ADDED Viewed

	@@ -0,0 +1,189 @@

+from typing import List, Optional, Union
+import torch
+import torch.nn as nn
+from torch import Tensor
+from torch.distributions.distribution import Distribution
+from .tools.resnet import Resnet1D
+from .tools.quantize_cnn import QuantizeEMAReset, Quantizer, QuantizeEMA, QuantizeReset
+from collections import OrderedDict
+class VQVae(nn.Module):
+    def __init__(self,
+                 nfeats: int,
+                 quantizer: str = "ema_reset",
+                 code_num=512,
+                 code_dim=512,
+                 output_emb_width=512,
+                 down_t=3,
+                 stride_t=2,
+                 width=512,
+                 depth=3,
+                 dilation_growth_rate=3,
+                 norm=None,
+                 activation: str = "relu",
+                 **kwargs) -> None:
+        super().__init__()
+        self.code_dim = code_dim
+        self.encoder = Encoder(nfeats,
+                               output_emb_width,
+                               down_t,
+                               stride_t,
+                               width,
+                               depth,
+                               dilation_growth_rate,
+                               activation=activation,
+                               norm=norm)
+        self.decoder = Decoder(nfeats,
+                               output_emb_width,
+                               down_t,
+                               stride_t,
+                               width,
+                               depth,
+                               dilation_growth_rate,
+                               activation=activation,
+                               norm=norm)
+        if quantizer == "ema_reset":
+            self.quantizer = QuantizeEMAReset(code_num, code_dim, mu=0.99)
+        elif quantizer == "orig":
+            self.quantizer = Quantizer(code_num, code_dim, beta=1.0)
+        elif quantizer == "ema":
+            self.quantizer = QuantizeEMA(code_num, code_dim, mu=0.99)
+        elif quantizer == "reset":
+            self.quantizer = QuantizeReset(code_num, code_dim)
+        else:
+            raise ValueError(f"Unknown quantizer: {quantizer!r}")
+    def preprocess(self, x):
+        # (bs, T, Jx3) -> (bs, Jx3, T)
+        x = x.permute(0, 2, 1)
+        return x
+    def postprocess(self, x):
+        # (bs, Jx3, T) ->  (bs, T, Jx3)
+        x = x.permute(0, 2, 1)
+        return x
+    def forward(self, features: Tensor):
+        # Preprocess
+        x_in = self.preprocess(features)
+        # Encode
+        x_encoder = self.encoder(x_in)
+        # quantization
+        x_quantized, loss, perplexity = self.quantizer(x_encoder)
+        # decoder
+        x_decoder = self.decoder(x_quantized)
+        x_out = self.postprocess(x_decoder)
+        return x_out, loss, perplexity
+    def encode(
+        self,
+        features: Tensor,
+    ) -> Union[Tensor, Distribution]:
+        N, T, _ = features.shape
+        x_in = self.preprocess(features)
+        x_encoder = self.encoder(x_in)
+        x_encoder = self.postprocess(x_encoder)
+        x_encoder = x_encoder.contiguous().view(-1,
+                                                x_encoder.shape[-1])  # (NT, C)
+        code_idx = self.quantizer.quantize(x_encoder)
+        code_idx = code_idx.view(N, -1)
+        # latent, dist
+        return code_idx, None
+    def decode(self, z: Tensor):
+        x_d = self.quantizer.dequantize(z)
+        x_d = x_d.view(1, -1, self.code_dim).permute(0, 2, 1).contiguous()
+        # decoder
+        x_decoder = self.decoder(x_d)
+        x_out = self.postprocess(x_decoder)
+        return x_out
+class Encoder(nn.Module):
+    def __init__(self,
+                 input_emb_width=3,
+                 output_emb_width=512,
+                 down_t=3,
+                 stride_t=2,
+                 width=512,
+                 depth=3,
+                 dilation_growth_rate=3,
+                 activation='relu',
+                 norm=None):
+        super().__init__()
+        blocks = []
+        filter_t, pad_t = stride_t * 2, stride_t // 2
+        blocks.append(nn.Conv1d(input_emb_width, width, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        for i in range(down_t):
+            input_dim = width
+            block = nn.Sequential(
+                nn.Conv1d(input_dim, width, filter_t, stride_t, pad_t),
+                Resnet1D(width,
+                         depth,
+                         dilation_growth_rate,
+                         activation=activation,
+                         norm=norm),
+            )
+            blocks.append(block)
+        blocks.append(nn.Conv1d(width, output_emb_width, 3, 1, 1))
+        self.model = nn.Sequential(*blocks)
+    def forward(self, x):
+        return self.model(x)
+class Decoder(nn.Module):
+    def __init__(self,
+                 input_emb_width=3,
+                 output_emb_width=512,
+                 down_t=3,
+                 stride_t=2,
+                 width=512,
+                 depth=3,
+                 dilation_growth_rate=3,
+                 activation='relu',
+                 norm=None):
+        super().__init__()
+        blocks = []
+        filter_t, pad_t = stride_t * 2, stride_t // 2
+        blocks.append(nn.Conv1d(output_emb_width, width, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        for i in range(down_t):
+            out_dim = width
+            block = nn.Sequential(
+                Resnet1D(width,
+                         depth,
+                         dilation_growth_rate,
+                         reverse_dilation=True,
+                         activation=activation,
+                         norm=norm), nn.Upsample(scale_factor=2,
+                                                 mode='nearest'),
+                nn.Conv1d(width, out_dim, 3, 1, 1))
+            blocks.append(block)
+        blocks.append(nn.Conv1d(width, width, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        blocks.append(nn.Conv1d(width, input_emb_width, 3, 1, 1))
+        self.model = nn.Sequential(*blocks)
+    def forward(self, x):
+        return self.model(x)
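
The `Encoder` and `Decoder` above change the temporal length in a predictable way. Each encoder stage is a `Conv1d` with kernel `stride_t * 2`, stride `stride_t` and padding `stride_t // 2`, and each decoder stage upsamples by a fixed factor of 2. The sketch below works through that arithmetic. It assumes the `Resnet1D` blocks preserve sequence length (their definition is in `resnet.py`, not shown here); the function names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def encoded_length(num_frames: int, down_t: int = 3, stride_t: int = 2) -> int:
    """Number of latent steps (codes) the encoder yields for num_frames."""
    kernel, padding = stride_t * 2, stride_t // 2  # filter_t, pad_t
    length = num_frames
    for _ in range(down_t):
        length = conv1d_out_len(length, kernel, stride_t, padding)
    return length


def decoded_length(num_codes: int, down_t: int = 3) -> int:
    """Each decoder stage upsamples by a fixed factor of 2."""
    return num_codes * 2 ** down_t


print(encoded_length(64), decoded_length(encoded_length(64)))  # 8 64
print(encoded_length(50), decoded_length(encoded_length(50)))  # 6 48
```

With the default `down_t=3`, a clip whose length is a multiple of 8 round-trips to the same number of frames; other lengths come back shorter, as the 50-frame case shows.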

mGPT/archs/tools/__init__.py ADDED Viewed

File without changes

mGPT/archs/tools/quantize_cnn.py ADDED Viewed

	@@ -0,0 +1,410 @@

+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class QuantizeEMAReset(nn.Module):
+    def __init__(self, nb_code, code_dim, mu):
+        super().__init__()
+        self.nb_code = nb_code
+        self.code_dim = code_dim
+        self.mu = mu
+        self.reset_codebook()
+    def reset_codebook(self):
+        self.init = False
+        self.code_sum = None
+        self.code_count = None
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.register_buffer('codebook', torch.zeros(self.nb_code, self.code_dim).to(device))
+    def _tile(self, x):
+        nb_code_x, code_dim = x.shape
+        if nb_code_x < self.nb_code:
+            n_repeats = (self.nb_code + nb_code_x - 1) // nb_code_x
+            std = 0.01 / np.sqrt(code_dim)
+            out = x.repeat(n_repeats, 1)
+            out = out + torch.randn_like(out) * std
+        else :
+            out = x
+        return out
+    def init_codebook(self, x):
+        out = self._tile(x)
+        self.codebook = out[:self.nb_code]
+        self.code_sum = self.codebook.clone()
+        self.code_count = torch.ones(self.nb_code, device=self.codebook.device)
+        self.init = True
+    @torch.no_grad()
+    def compute_perplexity(self, code_idx) :
+        # Count how often each code is used
+        code_onehot = torch.zeros(self.nb_code, code_idx.shape[0], device=code_idx.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, code_idx.shape[0]), 1)
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    @torch.no_grad()
+    def update_codebook(self, x, code_idx):
+        code_onehot = torch.zeros(self.nb_code, x.shape[0], device=x.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, x.shape[0]), 1)
+        code_sum = torch.matmul(code_onehot, x)  # nb_code, w
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        out = self._tile(x)
+        code_rand = out[:self.nb_code]
+        # Update centres
+        self.code_sum = self.mu * self.code_sum + (1. - self.mu) * code_sum  # nb_code, w
+        self.code_count = self.mu * self.code_count + (1. - self.mu) * code_count  # nb_code
+        usage = (self.code_count.view(self.nb_code, 1) >= 1.0).float()
+        code_update = self.code_sum.view(self.nb_code, self.code_dim) / self.code_count.view(self.nb_code, 1)
+        self.codebook = usage * code_update + (1 - usage) * code_rand
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    def preprocess(self, x):
+        # NCT -> NTC -> [NT, C]
+        x = x.permute(0, 2, 1).contiguous()
+        x = x.view(-1, x.shape[-1])
+        return x
+    def quantize(self, x):
+        # Calculate latent code x_l
+        k_w = self.codebook.t()
+        distance = torch.sum(x ** 2, dim=-1, keepdim=True) - 2 * torch.matmul(x, k_w) + torch.sum(k_w ** 2, dim=0,
+                                                                                            keepdim=True)  # (N * L, b)
+        _, code_idx = torch.min(distance, dim=-1)
+        return code_idx
+    def dequantize(self, code_idx):
+        x = F.embedding(code_idx, self.codebook)
+        return x
+    def forward(self, x):
+        N, width, T = x.shape
+        # Preprocess
+        x = self.preprocess(x)
+        # Init codebook if not inited
+        if self.training and not self.init:
+            self.init_codebook(x)
+        # quantize and dequantize through bottleneck
+        code_idx = self.quantize(x)
+        x_d = self.dequantize(code_idx)
+        # Update embeddings
+        if self.training:
+            perplexity = self.update_codebook(x, code_idx)
+        else :
+            perplexity = self.compute_perplexity(code_idx)
+        # Loss
+        commit_loss = F.mse_loss(x, x_d.detach())
+        # Passthrough
+        x_d = x + (x_d - x).detach()
+        # Postprocess
+        x_d = x_d.view(N, T, -1).permute(0, 2, 1).contiguous()   #(N, DIM, T)
+        return x_d, commit_loss, perplexity
+class Quantizer(nn.Module):
+    def __init__(self, n_e, e_dim, beta):
+        super(Quantizer, self).__init__()
+        self.e_dim = e_dim
+        self.n_e = n_e
+        self.beta = beta
+        self.embedding = nn.Embedding(self.n_e, self.e_dim)
+        self.embedding.weight.data.uniform_(-1.0 / self.n_e, 1.0 / self.n_e)
+    def forward(self, z):
+        N, width, T = z.shape
+        z = self.preprocess(z)
+        assert z.shape[-1] == self.e_dim
+        z_flattened = z.contiguous().view(-1, self.e_dim)
+        # B x V
+        d = (torch.sum(z_flattened ** 2, dim=1, keepdim=True)
+             + torch.sum(self.embedding.weight ** 2, dim=1)
+             - 2 * torch.matmul(z_flattened, self.embedding.weight.t()))
+        # B x 1
+        min_encoding_indices = torch.argmin(d, dim=1)
+        z_q = self.embedding(min_encoding_indices).view(z.shape)
+        # compute loss for embedding
+        loss = (torch.mean((z_q - z.detach()) ** 2)
+                + self.beta * torch.mean((z_q.detach() - z) ** 2))
+        # preserve gradients
+        z_q = z + (z_q - z).detach()
+        z_q = z_q.view(N, T, -1).permute(0, 2, 1).contiguous()   #(N, DIM, T)
+        min_encodings = F.one_hot(min_encoding_indices, self.n_e).type(z.dtype)
+        e_mean = torch.mean(min_encodings, dim=0)
+        perplexity = torch.exp(-torch.sum(e_mean*torch.log(e_mean + 1e-10)))
+        return z_q, loss, perplexity
+    def quantize(self, z):
+        assert z.shape[-1] == self.e_dim
+        # B x V
+        d = (torch.sum(z ** 2, dim=1, keepdim=True)
+             + torch.sum(self.embedding.weight ** 2, dim=1)
+             - 2 * torch.matmul(z, self.embedding.weight.t()))
+        # B x 1
+        min_encoding_indices = torch.argmin(d, dim=1)
+        return min_encoding_indices
+    def dequantize(self, indices):
+        index_flattened = indices.view(-1)
+        z_q = self.embedding(index_flattened)
+        z_q = z_q.view(indices.shape + (self.e_dim, )).contiguous()
+        return z_q
+    def preprocess(self, x):
+        # NCT -> NTC -> [NT, C]
+        x = x.permute(0, 2, 1).contiguous()
+        x = x.view(-1, x.shape[-1])
+        return x
+class QuantizeReset(nn.Module):
+    def __init__(self, nb_code, code_dim):
+        super().__init__()
+        self.nb_code = nb_code
+        self.code_dim = code_dim
+        self.reset_codebook()
+        self.codebook = nn.Parameter(torch.randn(nb_code, code_dim))
+    def reset_codebook(self):
+        self.init = False
+        self.code_count = None
+    def _tile(self, x):
+        nb_code_x, code_dim = x.shape
+        if nb_code_x < self.nb_code:
+            n_repeats = (self.nb_code + nb_code_x - 1) // nb_code_x
+            std = 0.01 / np.sqrt(code_dim)
+            out = x.repeat(n_repeats, 1)
+            out = out + torch.randn_like(out) * std
+        else :
+            out = x
+        return out
+    def init_codebook(self, x):
+        out = self._tile(x)
+        self.codebook = nn.Parameter(out[:self.nb_code])
+        self.code_count = torch.ones(self.nb_code, device=self.codebook.device)
+        self.init = True
+    @torch.no_grad()
+    def compute_perplexity(self, code_idx) :
+        # Count how often each code is used
+        code_onehot = torch.zeros(self.nb_code, code_idx.shape[0], device=code_idx.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, code_idx.shape[0]), 1)
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    def update_codebook(self, x, code_idx):
+        code_onehot = torch.zeros(self.nb_code, x.shape[0], device=x.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, x.shape[0]), 1)
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        out = self._tile(x)
+        code_rand = out[:self.nb_code]
+        # Update centres
+        self.code_count = code_count  # nb_code
+        usage = (self.code_count.view(self.nb_code, 1) >= 1.0).float()
+        self.codebook.data = usage * self.codebook.data + (1 - usage) * code_rand
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    def preprocess(self, x):
+        # NCT -> NTC -> [NT, C]
+        x = x.permute(0, 2, 1).contiguous()
+        x = x.view(-1, x.shape[-1])
+        return x
+    def quantize(self, x):
+        # Calculate latent code x_l
+        k_w = self.codebook.t()
+        distance = torch.sum(x ** 2, dim=-1, keepdim=True) - 2 * torch.matmul(x, k_w) + torch.sum(k_w ** 2, dim=0,
+                                                                                            keepdim=True)  # (N * L, b)
+        _, code_idx = torch.min(distance, dim=-1)
+        return code_idx
+    def dequantize(self, code_idx):
+        x = F.embedding(code_idx, self.codebook)
+        return x
+    def forward(self, x):
+        N, width, T = x.shape
+        # Preprocess
+        x = self.preprocess(x)
+        # Init codebook if not inited
+        if self.training and not self.init:
+            self.init_codebook(x)
+        # quantize and dequantize through bottleneck
+        code_idx = self.quantize(x)
+        x_d = self.dequantize(code_idx)
+        # Update embeddings
+        if self.training:
+            perplexity = self.update_codebook(x, code_idx)
+        else :
+            perplexity = self.compute_perplexity(code_idx)
+        # Loss
+        commit_loss = F.mse_loss(x, x_d.detach())
+        # Passthrough
+        x_d = x + (x_d - x).detach()
+        # Postprocess
+        x_d = x_d.view(N, T, -1).permute(0, 2, 1).contiguous()   #(N, DIM, T)
+        return x_d, commit_loss, perplexity
+class QuantizeEMA(nn.Module):
+    def __init__(self, nb_code, code_dim, mu):
+        super().__init__()
+        self.nb_code = nb_code
+        self.code_dim = code_dim
+        self.mu = mu
+        self.reset_codebook()
+    def reset_codebook(self):
+        self.init = False
+        self.code_sum = None
+        self.code_count = None
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        self.register_buffer('codebook', torch.zeros(self.nb_code, self.code_dim).to(device))
+    def _tile(self, x):
+        nb_code_x, code_dim = x.shape
+        if nb_code_x < self.nb_code:
+            n_repeats = (self.nb_code + nb_code_x - 1) // nb_code_x
+            std = 0.01 / np.sqrt(code_dim)
+            out = x.repeat(n_repeats, 1)
+            out = out + torch.randn_like(out) * std
+        else :
+            out = x
+        return out
+    def init_codebook(self, x):
+        out = self._tile(x)
+        self.codebook = out[:self.nb_code]
+        self.code_sum = self.codebook.clone()
+        self.code_count = torch.ones(self.nb_code, device=self.codebook.device)
+        self.init = True
+    @torch.no_grad()
+    def compute_perplexity(self, code_idx) :
+        # Count how often each code is used
+        code_onehot = torch.zeros(self.nb_code, code_idx.shape[0], device=code_idx.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, code_idx.shape[0]), 1)
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    @torch.no_grad()
+    def update_codebook(self, x, code_idx):
+        code_onehot = torch.zeros(self.nb_code, x.shape[0], device=x.device)  # nb_code, N * L
+        code_onehot.scatter_(0, code_idx.view(1, x.shape[0]), 1)
+        code_sum = torch.matmul(code_onehot, x)  # nb_code, w
+        code_count = code_onehot.sum(dim=-1)  # nb_code
+        # Update centres
+        self.code_sum = self.mu * self.code_sum + (1. - self.mu) * code_sum  # nb_code, w
+        self.code_count = self.mu * self.code_count + (1. - self.mu) * code_count  # nb_code
+        code_update = self.code_sum.view(self.nb_code, self.code_dim) / self.code_count.view(self.nb_code, 1)
+        self.codebook = code_update
+        prob = code_count / torch.sum(code_count)
+        perplexity = torch.exp(-torch.sum(prob * torch.log(prob + 1e-7)))
+        return perplexity
+    def preprocess(self, x):
+        # NCT -> NTC -> [NT, C]
+        x = x.permute(0, 2, 1).contiguous()
+        x = x.view(-1, x.shape[-1])
+        return x
+    def quantize(self, x):
+        # Calculate latent code x_l
+        k_w = self.codebook.t()
+        distance = (
+            torch.sum(x ** 2, dim=-1, keepdim=True)
+            - 2 * torch.matmul(x, k_w)
+            + torch.sum(k_w ** 2, dim=0, keepdim=True)
+        )  # (N * L, nb_code)
+        _, code_idx = torch.min(distance, dim=-1)
+        return code_idx
+    def dequantize(self, code_idx):
+        x = F.embedding(code_idx, self.codebook)
+        return x
+    def forward(self, x):
+        N, width, T = x.shape
+        # Preprocess
+        x = self.preprocess(x)
+        # Init codebook if not inited
+        if self.training and not self.init:
+            self.init_codebook(x)
+        # quantize and dequantize through bottleneck
+        code_idx = self.quantize(x)
+        x_d = self.dequantize(code_idx)
+        # Update embeddings
+        if self.training:
+            perplexity = self.update_codebook(x, code_idx)
+        else:
+            perplexity = self.compute_perplexity(code_idx)
+        # Loss
+        commit_loss = F.mse_loss(x, x_d.detach())
+        # Passthrough
+        x_d = x + (x_d - x).detach()
+        # Postprocess
+        x_d = x_d.view(N, T, -1).permute(0, 2, 1).contiguous()   #(N, DIM, T)
+        return x_d, commit_loss, perplexity
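
A note on the `perplexity` value returned by `forward` above: it is the exponential of the entropy of the empirical code-usage distribution, so it ranges from about 1 (a single code used) up to `nb_code` (all codes used equally). The following is a minimal pure-Python sketch of that calculation, not the repository's implementation; the function name `codebook_perplexity` is hypothetical, and only the `1e-7` epsilon is taken from the code above.

```python
import math
from collections import Counter


def codebook_perplexity(code_idx, eps=1e-7):
    """Return exp(entropy) of the empirical code-usage distribution.

    `code_idx` is an iterable of integer code indices (one per latent frame).
    Unused codes have probability 0 and contribute nothing to the entropy,
    so only observed codes need to be visited.
    """
    counts = Counter(code_idx)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        prob = count / total
        entropy -= prob * math.log(prob + eps)
    return math.exp(entropy)


if __name__ == "__main__":
    # Four codes used equally often: perplexity is close to 4.
    print(codebook_perplexity([0, 1, 2, 3]))
    # One code used for every frame: perplexity is close to 1 (codebook collapse).
    print(codebook_perplexity([7, 7, 7, 7]))
```

A perplexity far below `nb_code` during training is a common sign that only a small part of the codebook is being used.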

mGPT/archs/tools/resnet.py ADDED Viewed

	@@ -0,0 +1,81 @@

+import torch.nn as nn
+import torch
+class nonlinearity(nn.Module):
+    def __init__(self):
+        super().__init__()
+    def forward(self, x):
+        # swish
+        return x * torch.sigmoid(x)
+class ResConv1DBlock(nn.Module):
+    def __init__(self, n_in, n_state, dilation=1, activation='silu', norm=None, dropout=None):
+        super().__init__()
+        padding = dilation
+        self.norm = norm
+        if norm == "LN":
+            self.norm1 = nn.LayerNorm(n_in)
+            self.norm2 = nn.LayerNorm(n_in)
+        elif norm == "GN":
+            self.norm1 = nn.GroupNorm(num_groups=32, num_channels=n_in, eps=1e-6, affine=True)
+            self.norm2 = nn.GroupNorm(num_groups=32, num_channels=n_in, eps=1e-6, affine=True)
+        elif norm == "BN":
+            self.norm1 = nn.BatchNorm1d(num_features=n_in, eps=1e-6, affine=True)
+            self.norm2 = nn.BatchNorm1d(num_features=n_in, eps=1e-6, affine=True)
+        else:
+            self.norm1 = nn.Identity()
+            self.norm2 = nn.Identity()
+        if activation == "relu":
+            self.activation1 = nn.ReLU()
+            self.activation2 = nn.ReLU()
+        elif activation == "silu":
+            self.activation1 = nonlinearity()
+            self.activation2 = nonlinearity()
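+        elif activation == "gelu":
+            self.activation1 = nn.GELU()
+            self.activation2 = nn.GELU()
+        else:
+            # Fail early instead of hitting a missing attribute in forward()
+            raise ValueError(f"Unsupported activation: {activation!r}")
+        self.conv1 = nn.Conv1d(n_in, n_state, 3, 1, padding, dilation)
+        self.conv2 = nn.Conv1d(n_state, n_in, 1, 1, 0)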
+    def forward(self, x):
+        x_orig = x
+        if self.norm == "LN":
+            x = self.norm1(x.transpose(-2, -1))
+            x = self.activation1(x.transpose(-2, -1))
+        else:
+            x = self.norm1(x)
+            x = self.activation1(x)
+        x = self.conv1(x)
+        if self.norm == "LN":
+            x = self.norm2(x.transpose(-2, -1))
+            x = self.activation2(x.transpose(-2, -1))
+        else:
+            x = self.norm2(x)
+            x = self.activation2(x)
+        x = self.conv2(x)
+        x = x + x_orig
+        return x
+class Resnet1D(nn.Module):
+    def __init__(self, n_in, n_depth, dilation_growth_rate=1, reverse_dilation=True, activation='relu', norm=None):
+        super().__init__()
+        blocks = [ResConv1DBlock(n_in, n_in, dilation=dilation_growth_rate ** depth, activation=activation, norm=norm) for depth in range(n_depth)]
+        if reverse_dilation:
+            blocks = blocks[::-1]
+        self.model = nn.Sequential(*blocks)
+    def forward(self, x):
+        return self.model(x)
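
To illustrate how `dilation_growth_rate` and `reverse_dilation` shape the block stack in `Resnet1D` above, here is a small sketch that reproduces the dilation schedule and estimates the temporal receptive field. It assumes, as in `ResConv1DBlock`, one kernel-size-3 dilated convolution per block followed by a 1x1 convolution; the helper names `resnet1d_dilations` and `receptive_field` are hypothetical and do not exist in the repository.

```python
def resnet1d_dilations(n_depth, dilation_growth_rate=1, reverse_dilation=True):
    """Dilation of each block, in the order the blocks are applied."""
    dilations = [dilation_growth_rate ** depth for depth in range(n_depth)]
    return dilations[::-1] if reverse_dilation else dilations


def receptive_field(dilations, kernel_size=3):
    """Number of input frames that can influence one output frame.

    Each dilated convolution widens the field by (kernel_size - 1) * dilation;
    the 1x1 convolutions and residual additions do not widen it further.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    schedule = resnet1d_dilations(n_depth=3, dilation_growth_rate=3)
    print(schedule)                   # [9, 3, 1]
    print(receptive_field(schedule))  # 27
```

With the default `dilation_growth_rate=1`, every block uses dilation 1, so three blocks see only 7 frames; a growth rate of 3 widens that to 27 frames at the same depth.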

metrics.py ADDED Viewed

	@@ -0,0 +1,731 @@

+"""
+Evaluation metrics for motion generation
+"""
+import random
+import os
+import re
+import json
+import numpy as np
+import scipy.linalg
+import torch
+from typing import List, Tuple, Dict, Optional, Any
+from rapidfuzz.distance import Levenshtein
+from collections import defaultdict
+from data import motion_specials_to_ids
+from config import (
+    SEED, PIPELINE_OUTPUT_DIR, M_START, M_END,
+    INFERENCE_TEMPERATURE, INFERENCE_TOP_K, INFERENCE_REPETITION_PENALTY
+)
+random.seed(SEED)
+# ======================================================================================
+# Logic from test_overfit.py (Metrics & Visualization)
+# ======================================================================================
+def calculate_activation_statistics_np(activations: np.ndarray):
+    """
+    Params:
+    -- activations: num_samples x dim_feat (numpy)
+    Returns:
+    -- mu: dim_feat
+    -- sigma: dim_feat x dim_feat
+    """
+    mu = np.mean(activations, axis=0)
+    cov = np.cov(activations, rowvar=False)
+    return mu, cov
+def calculate_frechet_distance_np(mu1, sigma1, mu2, sigma2, eps=1e-6):
+    """Numpy implementation of the Frechet Distance."""
+    mu1 = np.atleast_1d(mu1)
+    mu2 = np.atleast_1d(mu2)
+    sigma1 = np.atleast_2d(sigma1)
+    sigma2 = np.atleast_2d(sigma2)
+    assert mu1.shape == mu2.shape, "Training and test mean vectors have different lengths"
+    assert sigma1.shape == sigma2.shape, "Training and test covariances have different dimensions"
+    diff = mu1 - mu2
+    covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
+    if not np.isfinite(covmean).all():
+        offset = np.eye(sigma1.shape[0]) * eps
+        covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
+    if np.iscomplexobj(covmean):
+        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
+            m = np.max(np.abs(covmean.imag))
+            raise ValueError(f"Imaginary component {m}")
+        covmean = covmean.real
+    tr_covmean = np.trace(covmean)
+    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean
+def calculate_diversity_np(activation: np.ndarray, diversity_times: int = 200) -> float:
+    """Mean pairwise L2 distance across random pairs."""
+    assert len(activation.shape) == 2
+    if activation.shape[0] < 2:
+        return 0.0
+    num_samples = activation.shape[0]
+    effective_times = min(diversity_times, max(1, num_samples - 1))
+    first_indices = np.random.choice(num_samples, effective_times, replace=False)
+    second_indices = np.random.choice(num_samples, effective_times, replace=False)
+    diffs = activation[first_indices] - activation[second_indices]
+    dist = np.linalg.norm(diffs, axis=1)
+    return float(dist.mean())
+def calculate_multimodality_np(activation: np.ndarray, multimodality_times: int = 20) -> float:
+    """
+    activation: [num_labels, num_per_label, D]
+    Returns mean pairwise within-label diversity (higher = more multimodal).
+    """
+    assert len(activation.shape) == 3
+    num_labels, num_per_label, _ = activation.shape
+    if num_per_label < 2:
+        return float("nan")
+    effective_times = min(multimodality_times, max(1, num_per_label - 1))
+    first_dices = np.random.choice(num_per_label, effective_times, replace=False)
+    second_dices = np.random.choice(num_per_label, effective_times, replace=False)
+    diffs = activation[:, first_dices] - activation[:, second_dices]
+    dist = np.linalg.norm(diffs, axis=2)
+    return float(dist.mean())
+# --------------------------------------------------------------------------------------
+# Token sequence → activation (bag-of-motion-tokens) helpers
+# --------------------------------------------------------------------------------------
+def _extract_motion_tokens_from_sequence(seq: str) -> list[str]:
+    # Expect tokens like <M123>, within M_START/M_END fences; keep only <M...>
+    return [tok for tok in seq.split() if tok.startswith("<M") and tok.endswith(">")]
+def _extract_ids_from_sequence(seq: str) -> list[int]:
+    return [int(t[2:-1]) for t in _extract_motion_tokens_from_sequence(seq) if t[2:-1].isdigit()]
+def _build_token_index(tokens_vocab: list[str]) -> Dict[str, int]:
+    return {tok: idx for idx, tok in enumerate(tokens_vocab)}
+def _sequence_to_activation(seq: str, token_to_index: Dict[str, int]) -> np.ndarray:
+    vec = np.zeros((len(token_to_index),), dtype=np.float32)
+    for tok in _extract_motion_tokens_from_sequence(seq):
+        idx = token_to_index.get(tok)
+        if idx is not None:
+            vec[idx] += 1.0
+    # Normalize to unit length to reduce length bias
+    norm = np.linalg.norm(vec)
+    if norm > 0:
+        vec = vec / norm
+    return vec
+def generate_motion(model, tokenizer, prompt, device):
+    """Generates a motion sequence from a prompt using sampling."""
+    model.eval()
+    inputs = tokenizer(prompt, return_tensors="pt").to(device)
+    with torch.no_grad():
+        output = model.generate(
+            **inputs,
+            max_new_tokens=100,
+            do_sample=True,
+            temperature=INFERENCE_TEMPERATURE,
+            top_k=INFERENCE_TOP_K,
+            repetition_penalty=INFERENCE_REPETITION_PENALTY,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.convert_tokens_to_ids(M_END),
+        )
+    decoded = tokenizer.decode(output[0], skip_special_tokens=False)
+    if "Motion: " in decoded:
+        motion_part = decoded.split("Motion: ")[-1]
+    else:
+        motion_part = decoded
+    return motion_part.strip()
+def _collect_eval_pairs(model, tokenizer, data, device) -> list[Tuple[str, str, str, str]]:
+    """
+    Returns list of (word, participant_id, gt_sequence, generated_sequence) for each sample in data.
+    """
+    results = []
+    for sample in data:
+        gt_tokens_str = sample.get("motion_tokens", "")
+        gt_wrapped = " ".join([f"<M{t}>" for t in gt_tokens_str.split()])
+        gt_sequence = f"{M_START} {gt_wrapped} {M_END}"
+        prompt = f"Instruction: Generate motion for word '{sample['word']}' with variant '{sample['participant_id']}'.\nMotion: "
+        generated_sequence = generate_motion(model, tokenizer, prompt, device)
+        pid = str(sample.get("participant_id", ""))
+        results.append((sample["word"], pid, gt_sequence, generated_sequence))
+    return results
+def _activations_from_pairs(pairs: list[Tuple[str, str, str]], vocab_tokens: list[str]):
+    """
+    Build numpy activations and labels arrays from sequences.
+    Returns:
+      gt_acts: (N, D)
+      gen_acts: (N, D)
+      labels: list[str] length N (word labels)
+    """
+    token_to_index = _build_token_index(vocab_tokens)
+    gt_vecs = []
+    gen_vecs = []
+    labels = []
+    for pair in pairs:
+        # Support both legacy 3-tuple (word, gt, gen) and new 4-tuple (word, pid, gt, gen)
+        if len(pair) == 4:
+            word, _pid, gt_seq, gen_seq = pair
+        else:
+            word, gt_seq, gen_seq = pair
+        gt_vecs.append(_sequence_to_activation(gt_seq, token_to_index))
+        gen_vecs.append(_sequence_to_activation(gen_seq, token_to_index))
+        labels.append(word)
+    return np.stack(gt_vecs, axis=0), np.stack(gen_vecs, axis=0), labels
+def _to_label_tensor3(acts: np.ndarray, labels: list[str]) -> np.ndarray:
+    """
+    Convert N x D activations with string labels to [L, K, D]. Labels with fewer than
+    two samples are skipped; the rest are truncated to the minimum count across labels.
+    """
+    label_to_indices: Dict[str, list[int]] = {}
+    for i, lbl in enumerate(labels):
+        label_to_indices.setdefault(lbl, []).append(i)
+    # Keep only labels with at least two samples so every stacked slice has the same shape
+    label_to_indices = {lbl: idxs for lbl, idxs in label_to_indices.items() if len(idxs) >= 2}
+    per_label_counts = [len(idxs) for idxs in label_to_indices.values()]
+    if len(per_label_counts) == 0:
+        raise ValueError("No labels with at least two samples found for multimodality computation.")
+    min_count = min(per_label_counts)
+    label_names = sorted(label_to_indices.keys())
+    stacked = []
+    for lbl in label_names:
+        idxs = label_to_indices[lbl][:min_count]
+        stacked.append(acts[idxs])
+    return np.stack(stacked, axis=0)  # [L, K, D]
+def evaluate_metrics_motiongpt_style(model, tokenizer, eval_data, all_motion_tokens, device):
+    """
+    Computes:
+      - Diversity: GT vs GEN (pair)
+      - Multimodality (MIM): GT vs GEN (pair)
+      - FID: between GT and GEN
+    """
+    print("\n" + "="*80)
+    print("      METRICS EVALUATION (FID, Diversity, Multimodality)")
+    print("="*80)
+    pairs = _collect_eval_pairs(model, tokenizer, eval_data, device)
+    gt_acts, gen_acts, labels = _activations_from_pairs(pairs, all_motion_tokens)
+    # Diversity
+    diversity_times = min(200, max(4, gt_acts.shape[0] - 1))
+    diversity_gt = calculate_diversity_np(gt_acts, diversity_times=diversity_times)
+    diversity_gen = calculate_diversity_np(gen_acts, diversity_times=diversity_times)
+    # Multimodality (MIM)
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_acts, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_acts, labels)
+        multimodality_times = min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        mim_gt = calculate_multimodality_np(gt_lbl_tensor, multimodality_times=multimodality_times)
+        mim_gen = calculate_multimodality_np(gen_lbl_tensor, multimodality_times=multimodality_times)
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    # FID
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_acts)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_acts)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    print(f"Diversity:    GT = {diversity_gt:.4f} | GEN = {diversity_gen:.4f}")
+    print(f"Multimodality (MIM): GT = {mim_gt:.4f} | GEN = {mim_gen:.4f}")
+    print(f"FID (GT vs GEN): {fid:.4f}")
+    return {
+        "diversity_gt": diversity_gt,
+        "diversity_gen": diversity_gen,
+        "mim_gt": mim_gt,
+        "mim_gen": mim_gen,
+        "fid": fid,
+        "pairs": pairs,  # for visualization usage
+    }
+def _encode_params_to_feature(params: np.ndarray, vq_model, mean, std, device) -> np.ndarray:
+    """
+    Convert SMPL-X parameter sequence (T, D) into a single clip feature using
+    the VQ-VAE encoder output BEFORE quantization. Average-pool over time to get (D_embed,).
+    """
+    if params.size == 0:
+        return np.zeros((getattr(vq_model.vqvae, "output_emb_width", 512),), dtype=np.float32)
+    x = torch.from_numpy(params.astype(np.float32)).to(device)  # [T, D]
+    x = x.unsqueeze(0)  # [1, T, D]
+    with torch.no_grad():
+        # Normalize / preprocess
+        x_pre = None
+        if hasattr(vq_model.vqvae, "preprocess"):
+            try:
+                x_pre = vq_model.vqvae.preprocess(x)  # expected to return tensor ready for encoder
+            except Exception:
+                x_pre = None
+        if x_pre is None:
+            # Manual normalization with provided mean/std
+            if mean is not None and std is not None:
+                mean_t = torch.from_numpy(np.array(mean, dtype=np.float32)).to(device).view(1, 1, -1)
+                std_t = torch.from_numpy(np.array(std, dtype=np.float32)).to(device).view(1, 1, -1)
+                x_norm = (x - mean_t) / (std_t + 1e-8)
+            else:
+                x_norm = x
+            # Some encoders expect [N, D, T]
+            x_pre = x_norm.transpose(1, 2).contiguous()  # [1, D, T]
+        # Encode to get pre-quant latent
+        z_e = vq_model.vqvae.encoder(x_pre)
+        # z_e could be [N, D_embed, T_q] or [N, T_q, D_embed]
+        if z_e.dim() == 3:
+            embed_dim_known = getattr(vq_model.vqvae, "output_emb_width", None)
+            if embed_dim_known is not None:
+                if z_e.shape[1] == embed_dim_known:
+                    time_axis = 2  # [N, D_embed, T_q]
+                elif z_e.shape[2] == embed_dim_known:
+                    time_axis = 1  # [N, T_q, D_embed]
+                else:
+                    time_axis = 2 if z_e.shape[2] < z_e.shape[1] else 1
+            else:
+                time_axis = 2 if z_e.shape[2] < z_e.shape[1] else 1
+            feat = z_e.mean(dim=time_axis).squeeze(0)
+        elif z_e.dim() == 2:
+            feat = z_e.squeeze(0)
+        else:
+            feat = z_e.view(1, -1).mean(dim=0)
+        feat_np = feat.detach().cpu().numpy().astype(np.float32)
+        # L2 normalize
+        norm = np.linalg.norm(feat_np)
+        if norm > 0:
+            feat_np = feat_np / norm
+        return feat_np
+def evaluate_metrics_encoder_style(
+    model,
+    tokenizer,
+    eval_data,
+    device,
+    vqvae_ckpt: Optional[str] = None,
+    stats_path: Optional[str] = None,
+    sample_limit: int = 100,
+):
+    """
+    Computes FID, Diversity, and MIM using VQ-VAE encoder pre-quantization features.
+    """
+    print("\n" + "="*80)
+    print("      METRICS EVALUATION (VQ-VAE Encoder Features)")
+    print("="*80)
+    # Lazy import to reuse your visualization utilities and stats
+    try:
+        from visualize import load_vqvae, load_stats, VQVAE_CHECKPOINT as DEFAULT_VQ, STATS_PATH as DEFAULT_STATS
+        vq_ckpt = vqvae_ckpt or os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+        stats_p = stats_path or os.getenv("VQVAE_STATS_PATH", DEFAULT_STATS)
+        vq_model = load_vqvae(vq_ckpt, device=device)
+        mean, std = load_stats(stats_p)
+        from visualize import decode_tokens_to_params
+    except Exception as exc:
+        print(f"⚠️  Could not set up VQ-VAE encoder metrics: {exc}")
+        return {}
+    # Collect GT/GEN token sequences for pairs (limit to speed-up)
+    pairs = _collect_eval_pairs(model, tokenizer, eval_data[:sample_limit], device)
+    # Build features
+    gt_feats = []
+    gen_feats = []
+    labels = []
+    for pair in pairs:
+        if len(pair) == 4:
+            word, _pid, gt_seq, gen_seq = pair
+        else:
+            word, gt_seq, gen_seq = pair
+        # Decode to SMPL-X
+        tokens_gt = _extract_ids_from_sequence(gt_seq)
+        tokens_gen = _extract_ids_from_sequence(gen_seq)
+        try:
+            params_gt = decode_tokens_to_params(tokens_gt, vq_model, mean, std, device=device)  # (T, D) denorm
+        except Exception:
+            params_gt = np.zeros((0, 182), dtype=np.float32)
+        try:
+            params_gen = decode_tokens_to_params(tokens_gen, vq_model, mean, std, device=device)  # (T, D) denorm
+        except Exception:
+            params_gen = np.zeros((0, 182), dtype=np.float32)
+        # Encode (pre-quant) -> pooled feature
+        feat_gt = _encode_params_to_feature(params_gt, vq_model, mean, std, device)
+        feat_gen = _encode_params_to_feature(params_gen, vq_model, mean, std, device)
+        gt_feats.append(feat_gt)
+        gen_feats.append(feat_gen)
+        labels.append(word)
+    gt_feats = np.stack(gt_feats, axis=0)
+    gen_feats = np.stack(gen_feats, axis=0)
+    # Diversity
+    diversity_times = min(200, max(4, gt_feats.shape[0] - 1))
+    diversity_gt = calculate_diversity_np(gt_feats, diversity_times=diversity_times)
+    diversity_gen = calculate_diversity_np(gen_feats, diversity_times=diversity_times)
+    # Multimodality (MIM)
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_feats, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_feats, labels)
+        multimodality_times = min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        mim_gt = calculate_multimodality_np(gt_lbl_tensor, multimodality_times=multimodality_times)
+        mim_gen = calculate_multimodality_np(gen_lbl_tensor, multimodality_times=multimodality_times)
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    # FID (on encoder features)
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_feats)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_feats)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    print(f"Diversity (encoder feats):    GT = {diversity_gt:.4f} | GEN = {diversity_gen:.4f}")
+    print(f"Multimodality (MIM, encoder): GT = {mim_gt:.4f} | GEN = {mim_gen:.4f}")
+    print(f"FID (encoder feats, GT vs GEN): {fid:.4f}")
+    return {
+        "diversity_gt": diversity_gt,
+        "diversity_gen": diversity_gen,
+        "mim_gt": mim_gt,
+        "mim_gen": mim_gen,
+        "fid": fid,
+        "pairs": pairs,
+    }
+def save_side_by_side_visualizations(pairs: list[Tuple[str, str, str]], output_dir: str, limit: int = 4):
+    """
+    Generate side-by-side 3D animations for GT vs GEN.
+    """
+    try:
+        from visualize import (
+            load_vqvae, load_stats, load_smplx_model,
+            decode_tokens_to_params, params_to_vertices,
+            VQVAE_CHECKPOINT as DEFAULT_VQ, STATS_PATH as DEFAULT_STATS, SMPLX_MODEL_DIR as DEFAULT_SMPLX
+        )
+        import plotly.graph_objects as go
+        from plotly.subplots import make_subplots
+    except Exception as exc:
+        print(f"⚠️  Visualization skipped (missing dependencies): {exc}")
+        return
+    os.makedirs(output_dir, exist_ok=True)
+    vqvae_ckpt = os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+    stats_path = os.getenv("VQVAE_STATS_PATH", DEFAULT_STATS)
+    smplx_dir = os.getenv("SMPLX_MODEL_DIR", DEFAULT_SMPLX)
+    print("Loading VQ-VAE, stats, SMPL-X ...")
+    vq_model = load_vqvae(vqvae_ckpt)
+    mean, std = load_stats(stats_path)
+    smplx_model = load_smplx_model(smplx_dir)
+    def animate_side_by_side(verts_left, faces, verts_right, fps=20, titles=("Ground Truth", "LLM Generated"), output_html=None):
+        T = min(verts_left.shape[0], verts_right.shape[0])
+        verts_left, verts_right = verts_left[:T], verts_right[:T]
+        i, j, k = faces.T.tolist()
+        fig = make_subplots(
+            rows=1, cols=2,
+            specs=[[{'type': 'scene'}, {'type': 'scene'}]],
+            horizontal_spacing=0.05,
+            subplot_titles=list(titles)
+        )
+        left_mesh = go.Mesh3d(x=verts_left[0,:,0], y=verts_left[0,:,1], z=verts_left[0,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False)
+        right_mesh = go.Mesh3d(x=verts_right[0,:,0], y=verts_right[0,:,1], z=verts_right[0,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False)
+        fig.add_trace(left_mesh, row=1, col=1)
+        fig.add_trace(right_mesh, row=1, col=2)
+        frames = []
+        for t in range(T):
+            frames.append(go.Frame(
+                name=str(t),
+                data=[
+                    go.Mesh3d(x=verts_left[t,:,0], y=verts_left[t,:,1], z=verts_left[t,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False,scene="scene"),
+                    go.Mesh3d(x=verts_right[t,:,0], y=verts_right[t,:,1], z=verts_right[t,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False,scene="scene2")
+                ]
+            ))
+        fig.frames = frames
+        fig.update_layout(
+            showlegend=False,
+            margin=dict(l=10, r=10, t=50, b=10),
+            scene=dict(aspectmode='data',xaxis=dict(visible=False),yaxis=dict(visible=False),zaxis=dict(visible=False),
+                       camera=dict(eye=dict(x=0,y=-2,z=0.7))),
+            scene2=dict(aspectmode='data',xaxis=dict(visible=False),yaxis=dict(visible=False),zaxis=dict(visible=False),
+                        camera=dict(eye=dict(x=0,y=-2,z=0.7))),
+            updatemenus=[dict(
+                type="buttons", x=0.5, xanchor="center", y=1.15, yanchor="top",
+                buttons=[
+                    dict(label="Play", method="animate", args=[None, {"frame": {"duration": max(1,1000//fps), "redraw": True}, "fromcurrent": True}]),
+                    dict(label="Pause", method="animate", args=[[None], {"frame": {"duration": 0, "redraw": False}}])
+                ]
+            )]
+        )
+        if output_html:
+            fig.write_html(output_html)
+            print(f"✅ Saved: {output_html}")
+        return fig
+    # Determine which words to include (up to `limit` distinct words)
+    allowed_words = None
+    if isinstance(limit, int) and limit > 0:
+        ordered_unique_words = []
+        for pair in pairs:
+            word = pair[0]
+            if word not in ordered_unique_words:
+                ordered_unique_words.append(word)
+            if len(ordered_unique_words) >= limit:
+                break
+        allowed_words = set(ordered_unique_words)
+    for pair in pairs:
+        try:
+            if len(pair) == 4:
+                word, pid, gt_seq, gen_seq = pair
+            else:
+                word, gt_seq, gen_seq = pair
+                pid = "unknown"
+            if allowed_words is not None and word not in allowed_words:
+                continue
+            tokens_gt = _extract_ids_from_sequence(gt_seq)
+            tokens_gen = _extract_ids_from_sequence(gen_seq)
+            params_gt = decode_tokens_to_params(tokens_gt, vq_model, mean, std)
+            params_gen = decode_tokens_to_params(tokens_gen, vq_model, mean, std)
+            verts_gt, faces = params_to_vertices(params_gt, smplx_model)
+            verts_gen, _ = params_to_vertices(params_gen, smplx_model)
+            out_dir = output_dir  # already created above
+            # Sanitize for filesystem safety
+            safe_word = re.sub(r'[^A-Za-z0-9_-]+', '_', str(word))
+            safe_pid = re.sub(r'[^A-Za-z0-9_-]+', '_', str(pid))
+            output_html = os.path.join(out_dir, f"word_{safe_word}_{safe_pid}_side_by_side.html")
+            animate_side_by_side(
+                verts_left=verts_gt,
+                faces=faces,
+                verts_right=verts_gen,
+                fps=20,
+                titles=("Ground Truth", "LLM Generated"),
+                output_html=output_html
+            )
+        except Exception as exc:
+            print(f"⚠️  Error creating visualization for word '{pair[0]}': {exc}")
+def run_inference_on_all_samples(model, tokenizer, data, device):
+    """
+    Runs inference on ALL available samples for the trained words and compares
+    each one to its specific ground truth.
+    """
+    print("\n" + "="*80)
+    print("      INFERENCE AND EVALUATION (ALL SAMPLES)")
+    print("      Goal: Test the model's performance on every variant.")
+    print("="*80)
+    def compare_sequences(gt: str, gen: str):
+        """Provides a simple visual diff of two sequences without external libraries."""
+        gt_tokens = gt.split()
+        gen_tokens = gen.split()
+        print("\nDetailed Comparison (✅ = Match, ❌ = Mismatch/Missing/Added):")
+        gt_str =   "  GT:  "
+        gen_str =  "  GEN: "
+        diff_str = "       "
+        max_len = max(len(gt_tokens), len(gen_tokens))
+        for i in range(max_len):
+            gt_tok = gt_tokens[i] if i < len(gt_tokens) else "___"
+            gen_tok = gen_tokens[i] if i < len(gen_tokens) else "___"
+            max_tok_len = max(len(gt_tok), len(gen_tok))
+            gt_tok_padded = gt_tok.ljust(max_tok_len)
+            gen_tok_padded = gen_tok.ljust(max_tok_len)
+            gt_str += gt_tok_padded + " "
+            gen_str += gen_tok_padded + " "
+            if gt_tok == gen_tok:
+                diff_str += "✅".ljust(max_tok_len) + " "
+            else:
+                diff_str += "❌".ljust(max_tok_len) + " "
+        print(gt_str)
+        print(gen_str)
+        print(diff_str)
+    data_by_word = {}
+    for item in data:
+        word = item['word']
+        if word not in data_by_word:
+            data_by_word[word] = []
+        data_by_word[word].append(item)
+    for word, samples in data_by_word.items():
+        print(f"\n\n{'='*25} TESTING WORD: '{word}' {'='*25}")
+        num_correct = 0
+        for i, sample in enumerate(samples):
+            print(f"\n--- Testing Variant {i+1}/{len(samples)}: '{sample['participant_id']}' ---")
+            gt_tokens_str = sample.get("motion_tokens", "")
+            gt_wrapped = " ".join([f"<M{t}>" for t in gt_tokens_str.split()])
+            gt_sequence = f"{M_START} {gt_wrapped} {M_END}"
+            print(f"Ground Truth:\n{gt_sequence}")
+            prompt = f"Instruction: Generate motion for word '{sample['word']}' with variant '{sample['participant_id']}'.\nMotion: "
+            generated_sequence = generate_motion(model, tokenizer, prompt, device)
+            print(f"\nLLM Generated:\n{generated_sequence}")
+            compare_sequences(gt_sequence, generated_sequence)
+            if gt_sequence.strip() == generated_sequence.strip():
+                num_correct += 1
+            print("-" * 80)
+        accuracy = (num_correct / len(samples)) * 100
+        print(f"\nSUMMARY FOR '{word}': {num_correct}/{len(samples)} correct ({accuracy:.1f}%)")
+# ======================================================================================
+# Existing Utilities (Compatibility)
+# ======================================================================================
+def seq_edit_distance(a_ids: List[int], b_ids: List[int]) -> int:
+    """Token-level Levenshtein distance"""
+    return Levenshtein.distance(a_ids, b_ids)
+def best_ref_distance(pred_ids: List[int], refs: List[List[int]]) -> int:
+    """Find minimum edit distance to any reference"""
+    if not refs:
+        return len(pred_ids)
+    return min(seq_edit_distance(pred_ids, r) for r in refs)
+def build_text_to_refs(dataset):
+    """
+    Build mapping from text prompts to list of reference motion sequences
+    """
+    text_to_refs = defaultdict(list)
+    for ex in dataset:
+        text_to_refs[ex["text_query"]].append(
+            [int(x) for x in ex["motion_tokens"].split()]
+        )
+    return text_to_refs
+def _concat(ids_list: List[List[int]]) -> List[int]:
+    out = []
+    for s in ids_list:
+        out.extend(s)
+    return out
+def _distinct_n(ids_list: List[List[int]], n: int) -> float:
+    if n <= 0:
+        return 0.0
+    total = 0
+    uniq = set()
+    for seq in ids_list:
+        if len(seq) < n:
+            continue
+        total += (len(seq) - n + 1)
+        for i in range(len(seq) - n + 1):
+            uniq.add(tuple(seq[i:i+n]))
+    if total == 0:
+        return 0.0
+    return len(uniq) / float(total)
+def token_fid_diag(gens: List[List[int]], refs: List[List[int]], codebook_size: int) -> float:
+    """
+    Diagonal-covariance Fréchet distance between histograms of token usage.
+    This is a lightweight proxy for FID using token distributions.
+    """
+    if len(gens) == 0 or len(refs) == 0:
+        return float("nan")
+    def feats(batch: List[List[int]]) -> np.ndarray:
+        mats = []
+        for seq in batch:
+            hist = np.bincount([x for x in seq if 0 <= x < codebook_size], minlength=codebook_size).astype(np.float64)
+            s = hist.sum()
+            if s > 0:
+                hist /= s
+            mats.append(hist)
+        return np.stack(mats, axis=0)
+    G = feats(gens)
+    R = feats(refs)
+    mu_g = G.mean(axis=0)
+    mu_r = R.mean(axis=0)
+    var_g = G.var(axis=0)
+    var_r = R.var(axis=0)
+    mean_term = np.sum((mu_g - mu_r) ** 2)
+    # Diagonal covariance approximation
+    cov_term = np.sum(var_g + var_r - 2.0 * np.sqrt(np.clip(var_g * var_r, 0.0, None)))
+    return float(mean_term + cov_term)
+def compute_token_metrics(
+    gen_by_text: Dict[str, List[int]],
+    text_to_refs: Dict[str, List[List[int]]],
+    codebook_size: int,
+) -> Dict[str, float]:
+    """
+    Compute token-level metrics:
+      - FID_diag: Fréchet distance between token histograms (diag cov)
+      - MIM: average min edit distance to references
+      - Diversity: distinct-1 and distinct-2
+    """
+    gens = list(gen_by_text.values())
+    # Flatten the per-prompt reference lists into a single list of reference sequences
+    ref_seqs = [r for refs in text_to_refs.values() for r in refs]
+    fid_diag = token_fid_diag(gens, ref_seqs, codebook_size)
+    # MIM: average best edit distance per prompt (only over prompts we generated)
+    mim_dists = []
+    for text, gen_ids in gen_by_text.items():
+        refs = text_to_refs.get(text, [])
+        mim_dists.append(best_ref_distance(gen_ids, refs))
+    mim = float(sum(mim_dists) / len(mim_dists)) if mim_dists else float("nan")
+    div1 = _distinct_n(gens, 1)
+    div2 = _distinct_n(gens, 2)
+    return {
+        "FID_diag": fid_diag,
+        "MIM": mim,
+        "distinct_1": div1,
+        "distinct_2": div2,
+    }
+def eval_t2m_set(
+    model,
+    tokenizer,
+    sample_pairs: List[Tuple[str, List[List[int]]]],
+    mot_begin_id: int,
+    mot_end_id: int,
+    motion_token_ids: list,
+    length_stats_by_text: dict,
+    global_median_len: int,
+    prompt_vocab: dict = None,
+    has_pid: bool = False,
+    per_prompt_vocab: bool = True,
+    n_eval: int = 100
+):
+    """
+    Evaluate text-to-motion generation on a set of samples
+    Returns a compact dict with avg_edit_dist & median_len; kept for pipeline compatibility.
+    """
+    # Sample without replacement; unlike random.shuffle, this leaves the caller's list untouched
+    subset = random.sample(sample_pairs, min(n_eval, len(sample_pairs)))
+    dists = []
+    lens = []
+    for text, ref_list in subset:
+        gen = generate_t2m(
+            model=model,
+            tokenizer=tokenizer,
+            prompt_text=text,
+            mot_begin_id=mot_begin_id,
+            mot_end_id=mot_end_id,
+            motion_token_ids=motion_token_ids,
+            length_stats_by_text=length_stats_by_text,
+            global_median_len=global_median_len,
+            prompt_vocab=prompt_vocab,
+            pid=None,
+            has_pid=has_pid,
+            per_prompt_vocab=per_prompt_vocab
+        )
+        span = gen.split("<MOT_BEGIN>")[-1]
+        span = span.split("<MOT_END>")[0]
+        pred_ids = motion_specials_to_ids(span)
+        d = best_ref_distance(pred_ids, ref_list)
+        dists.append(d)
+        lens.append(len(pred_ids))
+    if dists:
+        avg_dist = sum(dists) / len(dists)
+        median_len = sorted(lens)[len(lens)//2] if lens else 0
+        print(f"Eval T2M: avg_edit_dist={avg_dist:.2f}, median_len={median_len}, n={len(dists)}")
+        return {"avg_edit_dist": avg_dist, "median_len": median_len, "n_samples": len(dists)}
+    else:
+        print("Eval T2M: no samples")
+        return {}
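
The token-level metrics in `metrics.py` can be checked by hand on a toy batch. The block below is a minimal standalone sketch, not the project's code: `distinct_n` and `diag_frechet` are hypothetical re-implementations of `_distinct_n` and `token_fid_diag`, and the sketch assumes every token id already lies in `[0, codebook_size)`. It writes the covariance term as a squared difference of per-token standard deviations, which is algebraically the same as `var_g + var_r - 2 * sqrt(var_g * var_r)`.

```python
import numpy as np


def distinct_n(seqs, n):
    """Fraction of unique n-grams among all n-grams found in `seqs`."""
    total, uniq = 0, set()
    for seq in seqs:
        grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
        total += len(grams)
        uniq.update(grams)
    return len(uniq) / total if total else 0.0


def diag_frechet(gens, refs, codebook_size):
    """Fréchet distance between token-usage histograms, diagonal covariance only."""
    def hist(seq):
        # Normalised histogram of token ids (assumes 0 <= id < codebook_size)
        h = np.bincount(seq, minlength=codebook_size).astype(np.float64)
        return h / h.sum() if h.sum() > 0 else h

    G = np.stack([hist(s) for s in gens])
    R = np.stack([hist(s) for s in refs])
    mean_term = np.sum((G.mean(axis=0) - R.mean(axis=0)) ** 2)
    # (sqrt(var_g) - sqrt(var_r))^2 == var_g + var_r - 2 * sqrt(var_g * var_r)
    cov_term = np.sum((np.sqrt(G.var(axis=0)) - np.sqrt(R.var(axis=0))) ** 2)
    return float(mean_term + cov_term)


toy = [[1, 1, 2], [1, 2]]
d1 = distinct_n(toy, 1)  # 2 unique unigrams out of 5
d2 = distinct_n(toy, 2)  # 2 unique bigrams out of 3
same = diag_frechet([[0, 0], [1, 1]], [[0, 0], [1, 1]], codebook_size=2)
apart = diag_frechet([[0, 0], [0, 0]], [[1, 1], [1, 1]], codebook_size=2)
```

Identical generated and reference batches give a distance of zero, while batches that use disjoint tokens are separated by the mean term alone, since each batch has zero per-token variance.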

model.py ADDED

	@@ -0,0 +1,152 @@

+"""
+Model and tokenizer initialization
+"""
+import torch
+from typing import List, Set, Tuple
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from unsloth import FastLanguageModel
+from config import (
+    MODEL_NAME, MAX_SEQ_LEN, DTYPE,
+    LORA_R, LORA_ALPHA, LORA_DROPOUT,
+    LORA_TARGET_MODULES, LORA_MODULES_TO_SAVE,
+    PAD_TOKEN, M_START, M_END
+)
+# ======================================================================================
+# Logic from test_overfit.py (Standard Transformers)
+# ======================================================================================
+def setup_model_and_tokenizer_raw(model_name: str, motion_tokens: List[str]) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
+    """Loads the model and tokenizer, adding special and motion tokens (Standard Transformers)."""
+    print(f"\n---> Loading base model and tokenizer: {model_name}")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+    # Add special tokens (matches test_overfit.py)
+    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN, "additional_special_tokens": [M_START, M_END]})
+    print(f"Adding {len(motion_tokens)} motion tokens to the tokenizer.")
+    tokenizer.add_tokens(motion_tokens, special_tokens=True)
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+def ensure_tokenizer_has_motion_tokens(tokenizer: AutoTokenizer, motion_tokens: List[str]) -> int:
+    """
+    Registers the pad and motion boundary tokens, then adds any missing motion tokens.
+    Returns the number of motion tokens newly added (the pad/boundary tokens are not counted).
+    """
+    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN, "additional_special_tokens": [M_START, M_END]})
+    added = tokenizer.add_tokens(motion_tokens, special_tokens=True)
+    return added
+# ======================================================================================
+# Existing Logic (Unsloth / LoRA)
+# ======================================================================================
+def build_special_tokens(codebook_size: int, unique_pids: List[str] = None) -> List[str]:
+    """
+    Build all special tokens for motion vocabulary
+    """
+    # Motion tokens
+    motion_tokens = [f"<motion_{i}>" for i in range(codebook_size)]
+    # Boundary tokens
+    boundary_tokens = ["<MOT_BEGIN>", "<MOT_END>"]
+    # Task tokens
+    task_tokens = ["<T2M>", "<M2T>", "<DENOISE>", "<MOTION_MASK>"]
+    # Participant ID tokens
+    pid_tokens = []
+    if unique_pids:
+        pid_tokens = ["<PID_NULL>"] + [f"<PID_{pid}>" for pid in unique_pids]
+    return boundary_tokens + motion_tokens + task_tokens + pid_tokens
+def setup_model_and_tokenizer(codebook_size: int, unique_pids: List[str] = None):
+    """
+    Initialize model and tokenizer with custom tokens (Unsloth LoRA)
+    Returns: (model, tokenizer, new_token_ids)
+    """
+    # Build special tokens
+    additional_special_tokens = build_special_tokens(codebook_size, unique_pids)
+    # Load base model
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name=MODEL_NAME,
+        max_seq_length=MAX_SEQ_LEN,
+        dtype=DTYPE,
+        load_in_4bit=False,
+        trust_remote_code=True,
+    )
+    # Configure tokenizer
+    tokenizer.padding_side = "right"
+    # Add special tokens
+    existing = set(tokenizer.special_tokens_map_extended.get("additional_special_tokens", []))
+    to_add = [t for t in additional_special_tokens if t not in existing]
+    if to_add:
+        tokenizer.add_special_tokens({"additional_special_tokens": to_add})
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    # Resize embeddings
+    model.resize_token_embeddings(len(tokenizer))
+    # Apply LoRA
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=LORA_R,
+        lora_alpha=LORA_ALPHA,
+        lora_dropout=LORA_DROPOUT,
+        bias="none",
+        target_modules=LORA_TARGET_MODULES,
+        modules_to_save=LORA_MODULES_TO_SAVE,
+        use_gradient_checkpointing="unsloth",
+    )
+    # Get new token IDs for gradient masking
+    new_token_ids = set(tokenizer.convert_tokens_to_ids(additional_special_tokens))
+    # Apply gradient mask to prevent base vocab drift
+    apply_gradient_mask(model, new_token_ids)
+    return model, tokenizer, new_token_ids
+def apply_gradient_mask(model, new_token_ids: Set[int]):
+    """
+    Apply gradient mask so only new token embeddings are updated
+    """
+    def mask_rows_hook(param, rows: set):
+        mask = torch.zeros(param.size(0), device=param.device, dtype=param.dtype)
+        idxs = sorted(list(rows))
+        if len(idxs) > 0:
+            mask[idxs] = 1.0
+        param.register_hook(lambda g: g * mask.unsqueeze(1))
+    with torch.no_grad():
+        emb = model.get_input_embeddings().weight
+        head = model.get_output_embeddings().weight
+        mask_rows_hook(emb, new_token_ids)
+        mask_rows_hook(head, new_token_ids)
+def get_motion_token_info(tokenizer, codebook_size: int):
+    """
+    Get motion token IDs and boundary token IDs
+    Returns: (motion_token_ids, mot_begin_id, mot_end_id)
+    """
+    motion_token_strs = [f"<motion_{i}>" for i in range(codebook_size)]
+    motion_token_ids = tokenizer.convert_tokens_to_ids(motion_token_strs)
+    mot_begin_id = tokenizer.convert_tokens_to_ids("<MOT_BEGIN>")
+    mot_end_id = tokenizer.convert_tokens_to_ids("<MOT_END>")
+    return motion_token_ids, mot_begin_id, mot_end_id
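
The effect of the row mask in `apply_gradient_mask` can be illustrated without PyTorch. The sketch below is an assumption-laden toy, not the project's training loop: `masked_sgd_step` is a hypothetical helper that applies one plain SGD update to a NumPy matrix standing in for the embedding table, with the gradient multiplied by a 0/1 row mask in the same way the hook multiplies `g * mask.unsqueeze(1)`.

```python
import numpy as np


def masked_sgd_step(weights, grads, trainable_rows, lr=0.1):
    """One plain SGD step in which only `trainable_rows` receive gradient."""
    mask = np.zeros(weights.shape[0], dtype=weights.dtype)
    idxs = sorted(trainable_rows)
    if idxs:
        mask[idxs] = 1.0
    # Broadcast the per-row mask across the embedding dimension
    return weights - lr * grads * mask[:, None]


# Toy vocabulary of 5 rows; rows 3 and 4 play the role of newly added tokens
weights = np.ones((5, 3))
grads = np.full((5, 3), 2.0)
updated = masked_sgd_step(weights, grads, {3, 4})
```

Rows outside the mask keep their original values under this plain update. An optimizer that applies decoupled weight decay, such as AdamW, can still change rows whose gradient was zeroed, so the mask alone may not freeze the base vocabulary exactly.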

requirements.txt ADDED

	@@ -0,0 +1,32 @@

+# Unpinned dependencies
+gradio
+scipy
+plotly
+smplx
+# Core dependencies
+torch>=2.0.0
+transformers>=4.40.0
+datasets>=2.14.0
+accelerate>=0.20.0
+# Unsloth for efficient training
+unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
+# Training utilities
+bitsandbytes>=0.41.0
+peft>=0.4.0
+trl>=0.4.7
+# Evaluation
+rapidfuzz>=3.0.0
+# Utilities
+numpy>=1.24.0
+tqdm>=4.65.0
+huggingface_hub>=0.22.0
+gdown>=5.2.0

setup_env.sh ADDED

	@@ -0,0 +1,70 @@

+#!/usr/bin/env bash
+set -euo pipefail
+# Usage:
+#   bash setup_env.sh
+#
+# - Installs Python dependencies from requirements.txt
+# - Downloads a public Google Drive dataset file into ./data/motion_llm_dataset.json
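+# - Optionally exports HUGGINGFACE_HUB_TOKEN for the remainder of this script
+#   (a script run with `bash` cannot change the calling shell's environment) and prints instructions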
+THIS_DIR="$(pwd)"
+DATA_DIR="$THIS_DIR/data"
+mkdir -p "$DATA_DIR"
+# --- Explicit placeholders (replace these later) ---
+# Training dataset
+GDRIVE_ID="11711RgTmzauXpYVFoqLF8DZXiZlZovfn"
+# Visualization assets (optional - only needed for visualize.py)
+VQVAE_MODEL_ID="1JEMKVZWFG4Ue7k3Nm7q1o7-uBVsVricY"
+VQVAE_STATS_ID="1WTwP5DdBl4c-X5Kj7jXtlEHofOX2BifZ"
+SMPLX_MODELS_ID="1tZEfqw9zHgOaBEw5X_oazAEnesRtE9ky"
+# Hugging Face token
+HF_TOKEN_IN=""
+# ---------------------------------------------------
+echo "Installing Python dependencies..."
+python -m pip install --upgrade pip
+pip install -r requirements.txt
+if [[ -n "$GDRIVE_ID" ]] && [[ "$GDRIVE_ID" != "YOUR_GOOGLE_DRIVE_FILE_ID_HERE" ]]; then
+  echo "Downloading training dataset from Google Drive (file id: $GDRIVE_ID)..."
+  gdown --id "$GDRIVE_ID" -O "$DATA_DIR/motion_llm_dataset.json"
+else
+  echo "No training dataset Google Drive ID provided. Skipping dataset download."
+fi
+# Download visualization assets if IDs are provided
+if [[ -n "$VQVAE_MODEL_ID" ]] && [[ "$VQVAE_MODEL_ID" != "YOUR_VQVAE_CHECKPOINT_GDRIVE_ID_HERE" ]]; then
+  echo "Downloading VQ-VAE model from Google Drive (file id: $VQVAE_MODEL_ID)..."
+  gdown --id "$VQVAE_MODEL_ID" -O "$DATA_DIR/vqvae_model.pt"
+fi
+if [[ -n "$VQVAE_STATS_ID" ]] && [[ "$VQVAE_STATS_ID" != "YOUR_VQVAE_STATS_GDRIVE_ID_HERE" ]]; then
+  echo "Downloading VQ-VAE stats from Google Drive (file id: $VQVAE_STATS_ID)..."
+  gdown --id "$VQVAE_STATS_ID" -O "$DATA_DIR/vqvae_stats.pt"
+fi
+if [[ -n "$SMPLX_MODELS_ID" ]] && [[ "$SMPLX_MODELS_ID" != "YOUR_SMPLX_MODELS_GDRIVE_ID_HERE" ]]; then
+  echo "Downloading SMPL-X neutral model (.npz) from Google Drive (file id: $SMPLX_MODELS_ID)..."
+  mkdir -p "$DATA_DIR/smplx_models"
+  gdown --id "$SMPLX_MODELS_ID" -O "$DATA_DIR/smplx_models/SMPLX_NEUTRAL.npz"
+  echo "Saved SMPLX_NEUTRAL.npz to $DATA_DIR/smplx_models"
+fi
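+if [[ -n "$HF_TOKEN_IN" ]]; then
+  # The export only affects this script's process; it does not persist in the calling shell.
+  echo "Exporting HUGGINGFACE_HUB_TOKEN for the remainder of this script..."
+  export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN_IN"
+fi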
+echo
+echo "Environment setup complete."
+echo "- WORK_DIR defaults to: $THIS_DIR"
+echo "- DATA_JSON_PATH defaults to: $DATA_DIR/motion_llm_dataset.json"
+echo "- To persist HF token, set an environment variable before running:"
+echo "    export HUGGINGFACE_HUB_TOKEN=hf_..."
+echo
+echo "You can now run your training scripts."

templates.py ADDED

	@@ -0,0 +1,133 @@

+"""
+Prompt templates and mapping functions for different training stages
+"""
+import random
+from data import ids_to_motion_specials
+from config import SYSTEM_MSG, SEED
+random.seed(SEED)
+def pid_token_from_example(ex, has_pid: bool):
+    """Get participant ID token from example"""
+    if not has_pid:
+        return ""
+    pid = ex.get("participant_id", None)
+    if pid is not None:
+        return f"<PID_{pid}>"
+    return "<PID_NULL>"
+def map_stage1(ex, has_pid: bool):
+    """
+    Stage 1: Word + optional PID conditioning to learn motion language.
+    The user explicitly provides the word (+PID); assistant outputs motion span.
+    """
+    mot = ids_to_motion_specials(ex["motion_tokens"])
+    assistant = f"<MOT_BEGIN> {mot} <MOT_END>"
+    pid_tok = pid_token_from_example(ex, has_pid)
+    word = ex.get("word", ex.get("text_query", ""))
+    # Word + PID conditioning (no natural language chatter to keep it compact)
+    user = f"<T2M>{pid_tok}\nword: {word}"
+    text = (
+        "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+        + "<|im_start|>user\n" + user + "\n<|im_end|>\n"
+        + "<|im_start|>assistant\n" + assistant + "\n<|im_end|>\n"
+    )
+    return {"text": text, "where": "mot"}
+def map_stage2(ex, has_pid: bool):
+    """
+    Stage 2: Multi-task (T2M/M2T/DENOISE)
+    Randomly choose between text-to-motion, motion-to-text, or denoising
+    """
+    t = ex["text_query"]
+    mot = ids_to_motion_specials(ex["motion_tokens"])
+    pid_tok = pid_token_from_example(ex, has_pid)
+    # Sample task type
+    task = random.choices(["t2m", "m2t", "denoise"], weights=[0.5, 0.3, 0.2], k=1)[0]
+    if task == "t2m":
+        # Text to motion
+        assistant = f"<MOT_BEGIN> {mot} <MOT_END>"
+        text = (
+            "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+            + "<|im_start|>user\n" + f"<T2M>{pid_tok}\n\n" + t + "\n<|im_end|>\n"
+            + "<|im_start|>assistant\n" + assistant + "\n<|im_end|>\n"
+        )
+        where = "mot"
+    elif task == "m2t":
+        # Motion to text
+        user = f"<M2T>{pid_tok}\n\n<MOT_BEGIN> {mot} <MOT_END>"
+        text = (
+            "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+            + "<|im_start|>user\n" + user + "\n<|im_end|>\n"
+            + "<|im_start|>assistant\n" + t + "\n<|im_end|>\n"
+        )
+        where = "text"
+    else:
+        # Denoising
+        toks = mot.split()
+        noisy = []
+        for tok in toks:
+            if random.random() < 0.15:
+                noisy.append("<MOTION_MASK>")
+            else:
+                noisy.append(tok)
+        user = f"<DENOISE>{pid_tok}\n\n<MOT_BEGIN> {' '.join(noisy)} <MOT_END>"
+        assistant = f"<MOT_BEGIN> {mot} <MOT_END>"
+        text = (
+            "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+            + "<|im_start|>user\n" + user + "\n<|im_end|>\n"
+            + "<|im_start|>assistant\n" + assistant + "\n<|im_end|>\n"
+        )
+        where = "mot"
+    return {"text": text, "where": where, "text_query": t}
+def map_stage3(ex, has_pid: bool):
+    """
+    Stage 3 (Instruct): Word-only request, no participant ID.
+    The system prompt directs: "Output motion tokens for the given word".
+    """
+    t = ex["text_query"]
+    mot = ids_to_motion_specials(ex["motion_tokens"])
+    assistant = f"<MOT_BEGIN> {mot} <MOT_END>"
+    # Instruct-style, no PID
+    user = f"<T2M>\nword: {t}"
+    text = (
+        "<|im_start|>system\n" + SYSTEM_MSG + "<|im_end|>\n"
+        + "<|im_start|>user\n" + user + "\n<|im_end|>\n"
+        + "<|im_start|>assistant\n" + assistant + "\n<|im_end|>\n"
+    )
+    return {
+        "text": text,
+        "where": "mot",
+        "text_query": t,
+        "motion_tokens": ex["motion_tokens"]
+    }
+def create_mapper(stage: int, has_pid: bool):
+    """
+    Create a mapper function for a specific stage
+    """
+    if stage == 1:
+        return lambda ex: map_stage1(ex, has_pid)
+    elif stage == 2:
+        return lambda ex: map_stage2(ex, has_pid)
+    elif stage == 3:
+        return lambda ex: map_stage3(ex, has_pid)
+    else:
+        raise ValueError(f"Unknown stage: {stage}")
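
The denoising branch of `map_stage2` can be sketched in isolation. The block below is a simplified illustration under stated assumptions: `mask_motion_tokens` and `build_denoise_example` are hypothetical helpers, the system message is a placeholder rather than the project's `SYSTEM_MSG`, and the participant ID token is omitted. It passes a local `random.Random` instance instead of relying on the module-level seed, which keeps the masking reproducible per call.

```python
import random

MASK_TOKEN = "<MOTION_MASK>"


def mask_motion_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with MASK_TOKEN independently with probability `mask_prob`."""
    rng = rng if rng is not None else random.Random()
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


def build_denoise_example(tokens, system_msg, mask_prob=0.15, rng=None):
    """Assemble a ChatML-style training string: noisy span in, clean span out."""
    noisy = mask_motion_tokens(tokens, mask_prob, rng)
    user = f"<DENOISE>\n\n<MOT_BEGIN> {' '.join(noisy)} <MOT_END>"
    assistant = f"<MOT_BEGIN> {' '.join(tokens)} <MOT_END>"
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user}\n<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}\n<|im_end|>\n"
    )


tokens = ["<motion_5>", "<motion_9>", "<motion_2>"]
example = build_denoise_example(
    tokens, "placeholder system message", rng=random.Random(0)
)
```

The assistant turn always carries the clean sequence, so the model is trained to reconstruct the original tokens regardless of which positions were masked in the user turn.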

test_dataset_eval.py ADDED

	@@ -0,0 +1,534 @@

+"""
+Evaluate the SignMotionGPT model on a held-out SMPL-X test dataset.
+The script can download Google Drive archives or consume an already extracted
+directory of `video_data.pkl` files. Each sequence is converted into encoder
+features via the project VQ-VAE utilities and compared against motions generated
+by the LLM to compute FID/Diversity/Multimodality metrics.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import pickle
+import random
+import sys
+import zipfile
+from typing import Dict, List, Optional, Tuple
+import numpy as np
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from config import (
+    TEST_EVAL_DOWNLOAD_DIR,
+    TEST_EVAL_EXTRACT_DIR,
+    TEST_EVAL_HF_REPO,
+    TEST_EVAL_HF_SUBFOLDER,
+    TEST_EVAL_MAX_ZIPS,
+    TEST_EVAL_OUTPUT_DIR,
+    TEST_EVAL_SAMPLE_LIMIT,
+)
+M_START = "<M_START>"
+M_END = "<M_END>"
+PAD_TOKEN = "<PAD>"
+INFERENCE_REPETITION_PENALTY = 1.2
+INFERENCE_TEMPERATURE = 0.7
+INFERENCE_TOP_K = 50
+# -----------------------------------------------------------------------------
+# Download / extraction helpers
+# -----------------------------------------------------------------------------
+def try_import_gdown() -> bool:
+    try:
+        import gdown  # noqa: F401
+        return True
+    except Exception:
+        return False
+def download_drive_folder(folder_url_or_id: str, dest_dir: str) -> None:
+    os.makedirs(dest_dir, exist_ok=True)
+    if not try_import_gdown():
+        raise RuntimeError("gdown is required for Drive downloads. Install with `pip install gdown`.")
+    import gdown
+    if "drive.google.com" in folder_url_or_id:
+        url = folder_url_or_id
+    else:
+        url = f"https://drive.google.com/drive/folders/{folder_url_or_id}"
+    print(f"Downloading Drive folder to {dest_dir} ...")
+    gdown.download_folder(url=url, output=dest_dir, quiet=False, use_cookies=False)
+    print("Download complete.")
+def list_zip_files(download_dir: str) -> List[str]:
+    matches: List[str] = []
+    for root, _dirs, files in os.walk(download_dir):
+        for name in files:
+            if name.lower().endswith(".zip"):
+                matches.append(os.path.join(root, name))
+    return sorted(matches)
+def extract_zip_files(zip_paths: List[str], extract_dir: str, limit: Optional[int]) -> List[str]:
+    os.makedirs(extract_dir, exist_ok=True)
+    extracted_roots: List[str] = []
+    for idx, zp in enumerate(zip_paths):
+        if limit is not None and idx >= limit:
+            break
+        try:
+            with zipfile.ZipFile(zp, "r") as archive:
+                subdir = os.path.splitext(os.path.basename(zp))[0]
+                target = os.path.join(extract_dir, subdir)
+                os.makedirs(target, exist_ok=True)
+                archive.extractall(target)
+                extracted_roots.append(target)
+        except Exception as exc:
+            print(f"⚠️  Failed to extract {zp}: {exc}")
+    print(f"Extracted {len(extracted_roots)} archives.")
+    return extracted_roots
+def find_video_pkl_paths(extracted_root: str) -> List[str]:
+    matches: List[str] = []
+    for root, _dirs, files in os.walk(extracted_root):
+        for name in files:
+            if name == "video_data.pkl":
+                matches.append(os.path.join(root, name))
+    return matches
+def parse_word_from_path(path: str) -> str:
+    base = os.path.basename(os.path.dirname(path))
+    if "-" in base:
+        word = base.split("-", 1)[1]
+    else:
+        word = base
+    return word.strip().lower()
+# -----------------------------------------------------------------------------
+# SMPL-X helpers
+# -----------------------------------------------------------------------------
+def try_to_array(value) -> Optional[np.ndarray]:
+    if isinstance(value, np.ndarray):
+        return value
+    try:
+        return np.asarray(value)
+    except Exception:
+        return None
+def load_smplx_params_from_pkl(pkl_path: str) -> Optional[np.ndarray]:
+    try:
+        with open(pkl_path, "rb") as handle:
+            payload = pickle.load(handle)
+    except Exception as exc:
+        print(f"⚠️  Could not read {pkl_path}: {exc}")
+        return None
+    if not isinstance(payload, (list, tuple)) or len(payload) == 0:
+        return None
+    def get_vec(frame: dict, key: str, expected: int, allow_trim: bool = True) -> np.ndarray:
+        val = frame.get(key)
+        arr = try_to_array(val)
+        if arr is None:
+            return np.zeros((expected,), dtype=np.float32)
+        arr = np.array(arr, dtype=np.float32).reshape(-1)
+        if arr.size == expected:
+            return arr
+        if allow_trim and arr.size > expected:
+            if key == "body_pose" and arr.size == 66 and expected == 63:
+                return arr[3:3 + 63]
+            return arr[:expected]
+        if arr.size < expected:
+            out = np.zeros((expected,), dtype=np.float32)
+            out[: arr.size] = arr
+            return out
+        return arr[:expected]
+    sequences: List[np.ndarray] = []
+    for frame in payload:
+        if not isinstance(frame, dict):
+            continue
+        vec = np.concatenate(
+            [
+                get_vec(frame, "shape", 10),
+                get_vec(frame, "body_pose", 63),
+                get_vec(frame, "lhand_pose", 45),
+                get_vec(frame, "rhand_pose", 45),
+                get_vec(frame, "cam_trans", 3),
+                get_vec(frame, "expression", 10),
+                get_vec(frame, "jaw_pose", 3),
+                np.zeros((3,), dtype=np.float32),  # eye pose placeholder
+            ],
+            axis=0,
+        )
+        sequences.append(vec)
+    if not sequences:
+        return None
+    return np.stack(sequences, axis=0).astype(np.float32)
+def import_visualize_helpers():
+    try:
+        from visualize import (
+            load_vqvae,
+            load_stats,
+            decode_tokens_to_params,
+            VQVAE_CHECKPOINT as DEFAULT_VQ,
+            STATS_PATH as DEFAULT_STATS,
+        )
+        return load_vqvae, load_stats, decode_tokens_to_params, DEFAULT_VQ, DEFAULT_STATS
+    except Exception as exc:
+        raise RuntimeError(f"Failed to import visualize helpers: {exc}") from exc
+def _encode_params_to_feature(
+    params: np.ndarray,
+    vq_model,
+    mean,
+    std,
+    device: torch.device,
+) -> Optional[np.ndarray]:
+    if params is None or params.size == 0:
+        return None
+    clip = torch.from_numpy(params.astype(np.float32)).unsqueeze(0).to(device)
+    with torch.no_grad():
+        x_pre = None
+        if hasattr(vq_model.vqvae, "preprocess"):
+            try:
+                x_pre = vq_model.vqvae.preprocess(clip)
+            except Exception:
+                x_pre = None
+        if x_pre is None:
+            if mean is not None and std is not None:
+                mean_t = torch.from_numpy(np.array(mean, dtype=np.float32)).to(device).view(1, 1, -1)
+                std_t = torch.from_numpy(np.array(std, dtype=np.float32)).to(device).view(1, 1, -1)
+                clip = (clip - mean_t) / (std_t + 1e-8)
+            x_pre = clip.transpose(1, 2).contiguous()
+        latent = vq_model.vqvae.encoder(x_pre)
+        if latent.dim() == 3:
+            embed_dim = getattr(vq_model.vqvae, "output_emb_width", None)
+            if embed_dim is not None:
+                if latent.shape[1] == embed_dim:
+                    axis = 2
+                elif latent.shape[2] == embed_dim:
+                    axis = 1
+                else:
+                    axis = 2 if latent.shape[2] < latent.shape[1] else 1
+            else:
+                axis = 2 if latent.shape[2] < latent.shape[1] else 1
+            feat = latent.mean(dim=axis).squeeze(0)
+        elif latent.dim() == 2:
+            feat = latent.squeeze(0)
+        else:
+            feat = latent.view(1, -1).mean(dim=0)
+        vec = feat.detach().cpu().numpy().astype(np.float32)
+        norm = np.linalg.norm(vec)
+        if norm > 0:
+            vec = vec / norm
+        return vec
+# -----------------------------------------------------------------------------
+# Metrics helpers
+# -----------------------------------------------------------------------------
+def calculate_activation_statistics_np(activations: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
+    mu = np.mean(activations, axis=0)
+    cov = np.cov(activations, rowvar=False)
+    return mu, cov
+def calculate_frechet_distance_np(mu1, sigma1, mu2, sigma2, eps=1e-6) -> float:
+    from scipy.linalg import sqrtm
+    mu1 = np.atleast_1d(mu1)
+    mu2 = np.atleast_1d(mu2)
+    sigma1 = np.atleast_2d(sigma1)
+    sigma2 = np.atleast_2d(sigma2)
+    assert mu1.shape == mu2.shape, "Mean vectors must match"
+    assert sigma1.shape == sigma2.shape, "Covariance matrices must match"
+    diff = mu1 - mu2
+    covmean, _ = sqrtm(sigma1.dot(sigma2), disp=False)
+    if not np.isfinite(covmean).all():
+        offset = np.eye(sigma1.shape[0]) * eps
+        covmean = sqrtm((sigma1 + offset).dot(sigma2 + offset))
+    if np.iscomplexobj(covmean):
+        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
+            raise ValueError("Covmean contains large imaginary components")
+        covmean = covmean.real
+    return float(diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean))
+def calculate_diversity_np(activation: np.ndarray, diversity_times: int = 200) -> float:
+    assert activation.ndim == 2
+    n = activation.shape[0]
+    if n < 2:
+        return float("nan")
+    times = min(diversity_times, max(1, n - 1))
+    idx1 = np.random.choice(n, times, replace=False)
+    idx2 = np.random.choice(n, times, replace=False)
+    diffs = activation[idx1] - activation[idx2]
+    return float(np.linalg.norm(diffs, axis=1).mean())
+def _to_label_tensor3(acts: np.ndarray, labels: List[str]) -> np.ndarray:
+    label_to_indices: Dict[str, List[int]] = {}
+    for idx, lbl in enumerate(labels):
+        label_to_indices.setdefault(lbl, []).append(idx)
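+    counts = [len(v) for v in label_to_indices.values()]
+    if not counts:
+        raise ValueError("No labels available for multimodality computation.")
+    # Use at least two samples per label whenever any label has that many.
+    eligible = [c for c in counts if c >= 2]
+    min_count = min(eligible) if eligible else 1
+    stacked = []
+    for lbl in sorted(label_to_indices.keys()):
+        idxs = label_to_indices[lbl]
+        # Skip labels too small to fill a row; ragged rows would make np.stack fail.
+        if len(idxs) < min_count:
+            continue
+        stacked.append(acts[idxs[:min_count]])
+    return np.stack(stacked, axis=0)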
+def calculate_multimodality_np(activation: np.ndarray, multimodality_times: int = 20) -> float:
+    assert activation.ndim == 3
+    _, per_label, _ = activation.shape
+    if per_label < 2:
+        return float("nan")
+    times = min(multimodality_times, max(1, per_label - 1))
+    first = np.random.choice(per_label, times, replace=False)
+    second = np.random.choice(per_label, times, replace=False)
+    diffs = activation[:, first] - activation[:, second]
+    return float(np.linalg.norm(diffs, axis=2).mean())
+# -----------------------------------------------------------------------------
+# Generation helpers
+# -----------------------------------------------------------------------------
+def extract_ids_from_sequence(seq: str) -> List[int]:
+    content = seq
+    if M_START in seq and M_END in seq:
+        content = seq.split(M_START, 1)[-1].split(M_END, 1)[0]
+    ids: List[int] = []
+    for tok in content.split():
+        if tok.startswith("<M") and tok.endswith(">"):
+            payload = tok[2:-1]
+            if payload.isdigit():
+                ids.append(int(payload))
+    return ids
+def generate_motion_text(model, tokenizer, word: str, device: torch.device) -> str:
+    model.eval()
+    prompt = f"Instruction: Generate motion for word '{word}' with variant 'unknown'.\nMotion: "
+    inputs = tokenizer(prompt, return_tensors="pt").to(device)
+    with torch.no_grad():
+        output = model.generate(
+            **inputs,
+            max_new_tokens=100,
+            do_sample=True,
+            temperature=INFERENCE_TEMPERATURE,
+            top_k=INFERENCE_TOP_K,
+            repetition_penalty=INFERENCE_REPETITION_PENALTY,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.convert_tokens_to_ids(M_END),
+        )
+    decoded = tokenizer.decode(output[0], skip_special_tokens=False)
+    if "Motion: " in decoded:
+        return decoded.split("Motion: ", 1)[-1].strip()
+    return decoded.strip()
+# -----------------------------------------------------------------------------
+# Core evaluation
+# -----------------------------------------------------------------------------
+def parse_args() -> argparse.Namespace:
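+    # Pass the text as `description`; the first positional argument of ArgumentParser is `prog`.
+    parser = argparse.ArgumentParser(
+        description="Evaluate the trained Stage 2 model on an unseen SMPL-X test dataset."
+    )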
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument("--drive-url", type=str, help="Google Drive folder URL to download archives from.")
+    group.add_argument("--drive-id", type=str, help="Google Drive folder ID to download archives from.")
+    group.add_argument(
+        "--local-extracted-dir",
+        type=str,
+        help="Use an existing directory that already contains extracted `video_data.pkl` files.",
+    )
+    parser.add_argument("--max-zips", type=int, default=TEST_EVAL_MAX_ZIPS, help="Maximum number of zip files to extract.")
+    parser.add_argument("--download-dir", type=str, default=TEST_EVAL_DOWNLOAD_DIR, help="Directory to store downloaded zips.")
+    parser.add_argument("--extract-dir", type=str, default=TEST_EVAL_EXTRACT_DIR, help="Directory to extract archives into.")
+    parser.add_argument("--hf-repo-id", type=str, default=TEST_EVAL_HF_REPO, help="Hugging Face repo containing the Stage 2 checkpoint.")
+    parser.add_argument(
+        "--hf-subfolder",
+        type=str,
+        default=TEST_EVAL_HF_SUBFOLDER,
+        help="Subfolder inside the repo that hosts the Stage 2 model (e.g., `stage2_v2/epoch-020`).",
+    )
+    parser.add_argument("--vqvae-ckpt", type=str, default=None, help="Optional override for VQ-VAE checkpoint path.")
+    parser.add_argument("--stats-path", type=str, default=None, help="Optional override for VQ-VAE stats file.")
+    parser.add_argument("--output-dir", type=str, default=TEST_EVAL_OUTPUT_DIR, help="Directory to write metrics JSON.")
+    parser.add_argument("--sample-limit", type=int, default=TEST_EVAL_SAMPLE_LIMIT, help="Maximum number of samples to evaluate.")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
+    return parser.parse_args()
+def run_evaluation(args: argparse.Namespace) -> Dict[str, object]:
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    os.makedirs(args.output_dir, exist_ok=True)
+    metrics_path = os.path.join(args.output_dir, "metrics_test.json")
+    print(f"Loading Stage 2 model from HF: {args.hf_repo_id} (subfolder='{args.hf_subfolder}')")
+    tokenizer = AutoTokenizer.from_pretrained(args.hf_repo_id, subfolder=args.hf_subfolder, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(args.hf_repo_id, subfolder=args.hf_subfolder, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    model.to(device)
+    load_vqvae, load_stats, decode_tokens_to_params, DEFAULT_VQ, DEFAULT_STATS = import_visualize_helpers()
+    vq_ckpt = args.vqvae_ckpt if args.vqvae_ckpt else os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+    stats_path = args.stats_path if args.stats_path else os.getenv("VQVAE_STATS_PATH", DEFAULT_STATS)
+    print(f"Loading VQ-VAE from: {vq_ckpt}")
+    vq_model = load_vqvae(vq_ckpt, device=device)
+    print(f"Loading stats from: {stats_path}")
+    mean, std = load_stats(stats_path)
+    extracted_dirs: List[str] = []
+    if args.local_extracted_dir:
+        if not os.path.isdir(args.local_extracted_dir):
+            raise FileNotFoundError(f"Local extracted dir not found: {args.local_extracted_dir}")
+        extracted_dirs = [args.local_extracted_dir]
+    else:
+        folder_ref = args.drive_url if args.drive_url else args.drive_id
+        download_drive_folder(folder_ref, args.download_dir)
+        zips = list_zip_files(args.download_dir)
+        if not zips:
+            raise RuntimeError("No zip files found after download.")
+        extracted_dirs = extract_zip_files(zips, args.extract_dir, limit=args.max_zips)
+    samples: List[Tuple[str, str]] = []
+    for root in extracted_dirs:
+        for pkl_path in find_video_pkl_paths(root):
+            samples.append((parse_word_from_path(pkl_path), pkl_path))
+    if not samples:
+        raise RuntimeError("No `video_data.pkl` files discovered in the extracted directories.")
+    random.shuffle(samples)
+    samples = samples[: args.sample_limit]
+    print(f"Found {len(samples)} samples to evaluate.")
+    gt_features: List[np.ndarray] = []
+    gen_features: List[np.ndarray] = []
+    labels: List[str] = []
+    for idx, (word, pkl_path) in enumerate(samples, 1):
+        params_gt = load_smplx_params_from_pkl(pkl_path)
+        if params_gt is None or params_gt.ndim != 2:
+            print(f"Skipping {pkl_path}: invalid SMPL-X payload.")
+            continue
+        try:
+            feat_gt = _encode_params_to_feature(params_gt, vq_model, mean, std, device)
+        except Exception as exc:
+            print(f"Skipping {pkl_path}: encoder failed ({exc}).")
+            continue
+        if feat_gt is None:
+            print(f"Skipping {pkl_path}: empty GT feature.")
+            continue
+        gen_text = generate_motion_text(model, tokenizer, word, device)
+        token_ids = extract_ids_from_sequence(gen_text)
+        if not token_ids:
+            print(f"Skipping GEN for '{word}': no motion tokens produced.")
+            continue
+        try:
+            params_gen = decode_tokens_to_params(token_ids, vq_model, mean, std, device=device)
+        except Exception as exc:
+            print(f"Skipping GEN for '{word}': decode failed ({exc}).")
+            continue
+        feat_gen = _encode_params_to_feature(params_gen, vq_model, mean, std, device)
+        if feat_gen is None:
+            print(f"Skipping GEN for '{word}': empty GEN feature.")
+            continue
+        gt_features.append(feat_gt)
+        gen_features.append(feat_gen)
+        labels.append(word)
+        if idx % 25 == 0:
+            print(f"Processed {idx} samples...")
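+    if not gt_features:
+        raise RuntimeError("No samples survived encoding/generation; cannot compute metrics.")
+    if len(gt_features) < 5 or len(gen_features) < 5:
+        print("⚠️  Not enough samples to compute stable metrics; results may be noisy.")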
+    gt_feats = np.stack(gt_features, axis=0)
+    gen_feats = np.stack(gen_features, axis=0)
+    diversity_gt = calculate_diversity_np(gt_feats, diversity_times=min(200, max(4, gt_feats.shape[0] - 1)))
+    diversity_gen = calculate_diversity_np(gen_feats, diversity_times=min(200, max(4, gen_feats.shape[0] - 1)))
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_feats, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_feats, labels)
+        mim_gt = calculate_multimodality_np(
+            gt_lbl_tensor, multimodality_times=min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        )
+        mim_gen = calculate_multimodality_np(
+            gen_lbl_tensor, multimodality_times=min(20, max(3, gen_lbl_tensor.shape[1] - 1))
+        )
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_feats)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_feats)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    metrics_payload = {
+        "source": "test_raw_smplx_encoder_features",
+        "counts": {
+            "samples_total": len(samples),
+            "samples_used": int(gt_feats.shape[0]),
+        },
+        "fid": fid,
+        "diversity": {
+            "ground_truth": diversity_gt,
+            "model": diversity_gen,
+        },
+        "multimodality": {
+            "ground_truth": mim_gt,
+            "model": mim_gen,
+        },
+    }
+    with open(metrics_path, "w", encoding="utf-8") as handle:
+        json.dump(metrics_payload, handle, ensure_ascii=False, indent=2)
+    print(f"\n✅ Saved test metrics to {metrics_path}")
+    return metrics_payload
+def main() -> None:
+    args = parse_args()
+    try:
+        run_evaluation(args)
+    except Exception as exc:
+        print(f"Evaluation failed: {exc}")
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
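
The FID step in `run_evaluation` compares the Gaussian statistics (mean and covariance) of the ground-truth and generated encoder features. The repository's `calculate_activation_statistics_np` and `calculate_frechet_distance_np` live in `metrics.py`, which is not shown here, so the following is only a minimal NumPy sketch of the standard Fréchet distance under that assumption. The names `activation_statistics` and `frechet_distance` are hypothetical, and the trace of the matrix square root is taken from the eigenvalues of the covariance product rather than from a matrix square-root routine.

```python
import numpy as np


def activation_statistics(feats):
    """Mean vector and covariance matrix of an (N, D) feature array."""
    feats = np.asarray(feats, dtype=np.float64)
    return feats.mean(axis=0), np.cov(feats, rowvar=False)


def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between the Gaussians N(mu1, cov1) and N(mu2, cov2).

    d^2 = ||mu1 - mu2||^2 + Tr(cov1) + Tr(cov2) - 2 * Tr(sqrt(cov1 @ cov2))
    """
    diff = mu1 - mu2
    # For positive semi-definite covariances, the eigenvalues of cov1 @ cov2 are
    # real and non-negative, so Tr(sqrt(cov1 @ cov2)) is the sum of their square
    # roots. Tiny negative or imaginary parts are numerical noise and are dropped.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)


# Synthetic stand-in for encoder features (illustrative data, not from the dataset).
rng = np.random.default_rng(0)
gt = rng.normal(size=(200, 4))
mu_gt, cov_gt = activation_statistics(gt)

# Identical statistics give a distance close to 0.
same = frechet_distance(mu_gt, cov_gt, mu_gt, cov_gt)
# Shifting only the mean by 1.0 in each of the 4 dimensions gives a distance close to 4.
shifted = frechet_distance(mu_gt, cov_gt, mu_gt + 1.0, cov_gt)
```

With equal covariances the trace terms cancel, which is why the second value reduces to the squared length of the mean shift. Covariance estimates are unreliable when the number of samples is small relative to the feature dimension, which is consistent with the script's warning about noisy metrics for fewer than five samples.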

test_overfit.py ADDED

	@@ -0,0 +1,1562 @@

+import os
+import re
+import json
+import random
+from typing import Dict, List, Tuple, Any, Optional
+import shutil
+from datetime import datetime
+import time
+import torch
+from torch.utils.data import Dataset, DataLoader
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from torch.optim import AdamW
+from huggingface_hub import HfApi, upload_folder, hf_hub_download
+import numpy as np
+import scipy.linalg
+# ======================================================================================
+# 0. Configuration
+# ======================================================================================
+# --- Paths and Words ---
+DATASET_PATH = "/content/SignMotionGPT/data/motion_llm_dataset.json"
+MODEL_NAME = "Qwen/Qwen3-0.6B"
+# We will train on the full dataset, but use these words for our final evaluation
+EVALUATION_WORDS = ["passport", "send", "library", "push"]
+OUTPUT_DIR = "./motion_gpt_full_model"
+# --- Evaluation controls ---
+# If True: after training, only compute metrics (FID, Diversity, MIM) and save to JSON.
+#          Skip per-sample inference logs and HTML visualizations.
+# If False: run the existing flow and also compute these 3 metrics.
+RUN_EVALS_ONLY = False
+EVAL_SAMPLE_LIMIT = 100
+METRICS_JSON_PATH = ""
+# --- Training Hyperparameters ---
+# NOTE: Training on the full dataset will take longer.
+# These epochs are a starting point.
+S1_EPOCHS = 20
+S1_LR = 5e-5
+S1_BATCH_SIZE = 8 # Kept small for Colab VRAM
+S2_EPOCHS = 20
+S2_LR = 2e-5
+S2_BATCH_SIZE = 8
+# --- Inference Hyperparameters ---
+INFERENCE_REPETITION_PENALTY = 1.2
+INFERENCE_TEMPERATURE = 0.7
+INFERENCE_TOP_K = 50
+# --- Special Tokens ---
+M_START = "<M_START>"
+M_END = "<M_END>"
+PAD_TOKEN = "<PAD>"
+# --- Hugging Face Hub Configuration ---
+# Provide hf_auth_token or HUGGINGFACE_HUB_TOKEN in the environment (required for private repos).
+HF_USE_HUB = True
+hf_auth_token = os.getenv("hf_auth_token") or os.getenv("HUGGINGFACE_HUB_TOKEN")
+if hf_auth_token is None:
+    raise ValueError("Neither hf_auth_token nor HUGGINGFACE_HUB_TOKEN is set in the environment")
+HF_STAGE1_REPO_ID = "rdz-falcon/SignMotionGPTfit-archive"
+HF_STAGE2_REPO_ID = "rdz-falcon/SignMotionGPTfit-archive"
+HF_PRIVATE_REPO = os.environ.get("HF_PRIVATE", "true").lower() != "false"
+FORCE_STAGE2_FROM_STAGE1_RAW = os.environ.get("FORCE_STAGE2_FROM_STAGE1", "false")
+FORCE_STAGE2_FROM_STAGE1 = str(FORCE_STAGE2_FROM_STAGE1_RAW).strip().lower() not in ("0", "false", "no", "off")
+# Save Stage 2 checkpoints to a new subfolder so old stage2 checkpoints remain intact
+HF_STAGE2_SAVE_SUBDIR = os.environ.get("HF_STAGE2_SAVE_SUBDIR", "stage2_v2")
+# --- Local Checkpoint Root ---
+CHECKPOINTS_DIR = ""
+# --- Upload frequency and progress control ---
+# Push to Hugging Face only every N epochs (still save locally every epoch)
+CHECKPOINT_UPLOAD_INTERVAL_EPOCHS = int(os.environ.get("HF_UPLOAD_INTERVAL_EPOCHS", "2"))
+# Disable HF Hub progress bars to reduce noisy logs (set HF_DISABLE_PROGRESS=false to re-enable)
+HF_DISABLE_PROGRESS = os.environ.get("HF_DISABLE_PROGRESS", "true").lower() != "false"
+def _refresh_runtime_paths() -> None:
+    """Refresh derived paths when OUTPUT_DIR changes."""
+    global METRICS_JSON_PATH, CHECKPOINTS_DIR
+    METRICS_JSON_PATH = os.path.join(OUTPUT_DIR, "metrics.json")
+    CHECKPOINTS_DIR = os.path.join(OUTPUT_DIR, "checkpoints")
+def _apply_progress_setting() -> None:
+    """Apply huggingface_hub progress bar preference."""
+    if HF_DISABLE_PROGRESS:
+        try:
+            # Also respected by huggingface_hub internal progress usage
+            os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
+            from huggingface_hub.utils import disable_progress_bars  # type: ignore
+            disable_progress_bars()
+        except Exception:
+            pass
+    else:
+        os.environ.pop("HF_HUB_DISABLE_PROGRESS_BARS", None)
+def apply_config_overrides(overrides: Optional[Dict[str, Any]] = None) -> None:
+    """
+    Allow external callers to override module-level configuration prior to running main().
+    """
+    global hf_auth_token, HF_DISABLE_PROGRESS, OUTPUT_DIR
+    if not overrides:
+        return
+    updated_paths = False
+    progress_flag_updated = False
+    for key, value in overrides.items():
+        if key == "hf_auth_token":
+            hf_auth_token = value
+            continue
+        if key not in globals():
+            print(f"[config] Unknown override ignored: {key}")
+            continue
+        globals()[key] = value
+        if key == "OUTPUT_DIR":
+            updated_paths = True
+        if key == "HF_DISABLE_PROGRESS":
+            progress_flag_updated = True
+    if updated_paths:
+        _refresh_runtime_paths()
+    if progress_flag_updated:
+        _apply_progress_setting()
+_refresh_runtime_paths()
+_apply_progress_setting()
+# ======================================================================================
+# 1. Data Loading and Preparation (NEW & IMPROVED)
+# ======================================================================================
+def read_json_data(json_path: str) -> List[Dict[str, Any]]:
+    """Loads the dataset from the specified JSON file."""
+    if not os.path.exists(json_path):
+        raise FileNotFoundError(f"Dataset not found at: {json_path}")
+    with open(json_path, "r", encoding="utf-8") as f:
+        return json.load(f)
+def deduplicate_and_prepare_data(entries: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[str]]:
+    """
+    Cleans the entire dataset by ensuring each (word, participant_id) pair is unique.
+    If a conflict is found (same pair, different motion), it keeps only the first one encountered.
+    Then, it prepares the full list of motion tokens from the cleaned data.
+    """
+    print("\n---> Cleaning dataset by removing ambiguous (word, participant_id) pairs...")
+    unique_samples = {}
+    conflicts_found = 0
+    for entry in entries:
+        word = entry.get("word", "").lower()
+        pid = entry.get("participant_id", "")
+        key = (word, pid)
+        if key not in unique_samples:
+            unique_samples[key] = entry
+        else:
+            # A sample for this key already exists. We only care if it's a conflict.
+            existing_tokens = unique_samples[key].get("motion_tokens")
+            current_tokens = entry.get("motion_tokens")
+            if existing_tokens != current_tokens:
+                conflicts_found += 1
+                # We do nothing, effectively discarding this new conflicting sample.
+    cleaned_data = list(unique_samples.values())
+    print(f"Original samples: {len(entries)}")
+    print(f"Cleaned samples (unique (word, pid) pairs): {len(cleaned_data)}")
+    print(f"Removed {len(entries) - len(cleaned_data)} total samples. ({conflicts_found} were direct conflicts).")
+    print("\n---> Extracting motion tokens from the full cleaned dataset...")
+    all_motion_tokens = set()
+    for entry in cleaned_data:
+        motion_tokens = entry.get("motion_tokens", "").strip().split()
+        for token in motion_tokens:
+            all_motion_tokens.add(f"<M{token}>")
+    unique_tokens = sorted(list(all_motion_tokens))
+    print(f"Found {len(unique_tokens)} unique motion tokens in the entire dataset.")
+    return cleaned_data, unique_tokens
+# ======================================================================================
+# 2. Model and Tokenizer Setup
+# ======================================================================================
+def setup_model_and_tokenizer(model_name: str, motion_tokens: List[str]) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
+    """Loads the model and tokenizer, adding special and motion tokens."""
+    print(f"\n---> Loading base model and tokenizer: {model_name}")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN, "additional_special_tokens": [M_START, M_END]})
+    print(f"Adding {len(motion_tokens)} motion tokens to the tokenizer.")
+    tokenizer.add_tokens(motion_tokens, special_tokens=True)
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+# ======================================================================================
+# 2b. Hugging Face Hub Utilities and Checkpointing
+# ======================================================================================
+def _format_seconds(seconds: float) -> str:
+    """Formats seconds into H:MM:SS or M:SS."""
+    seconds = int(max(0, seconds))
+    h = seconds // 3600
+    m = (seconds % 3600) // 60
+    s = seconds % 60
+    if h > 0:
+        return f"{h:d}:{m:02d}:{s:02d}"
+    return f"{m:d}:{s:02d}"
+def _ensure_dir(path: str) -> None:
+    os.makedirs(path, exist_ok=True)
+def _resolve_and_ensure_repo(repo_id: str) -> Optional[str]:
+    """
+    Ensures the HF repo exists. Returns the repo_id, prefixed with the resolved
+    namespace when it lacks one, or None when the Hub is disabled or no token is available.
+    """
+    if not HF_USE_HUB:
+        return None
+    if hf_auth_token is None:
+        print("⚠️  HF token not found. Set HUGGINGFACE_HUB_TOKEN or hf_auth_token to enable Hub sync.")
+        return None
+    api = HfApi()
+    try:
+        who = api.whoami(token=hf_auth_token)
+        namespace = who.get("name") or (who.get("orgs", [None])[0] if isinstance(who.get("orgs"), list) else None)
+    except Exception as exc:
+        print(f"⚠️  Unable to resolve HF namespace: {exc}")
+        namespace = None
+    if "/" not in repo_id and namespace:
+        full_repo_id = f"{namespace}/{repo_id}"
+    else:
+        full_repo_id = repo_id
+    try:
+        api.create_repo(
+            repo_id=full_repo_id,
+            token=hf_auth_token,
+            repo_type="model",
+            private=HF_PRIVATE_REPO,
+            exist_ok=True,
+        )
+    except Exception as exc:
+        print(f"⚠️  create_repo failed (may already exist): {exc}")
+    return full_repo_id
+def _repo_has_stage_latest(repo_id: str, stage: str) -> bool:
+    """Checks if a stage/latest checkpoint exists in the HF repo."""
+    if not HF_USE_HUB or hf_auth_token is None:
+        return False
+    api = HfApi()
+    try:
+        files = api.list_repo_files(repo_id=repo_id, repo_type="model", token=hf_auth_token)
+        return any(path.startswith(f"{stage}/latest/") and path.endswith("config.json") for path in files)
+    except Exception as exc:
+        print(f"⚠️  Could not list files for {repo_id}: {exc}")
+        return False
+def _repo_list_epoch_numbers(repo_id: str, stage: str) -> List[int]:
+    """
+    Returns sorted list of epoch numbers available under {stage}/epoch-XXX/ by scanning files.
+    Works even if 'latest' does not exist.
+    """
+    if not HF_USE_HUB or hf_auth_token is None:
+        return []
+    api = HfApi()
+    try:
+        files = api.list_repo_files(repo_id=repo_id, repo_type="model", token=hf_auth_token)
+    except Exception as exc:
+        print(f"⚠️  Could not list files for {repo_id}: {exc}")
+        return []
+    epoch_numbers: List[int] = []
+    pattern = re.compile(rf"^{re.escape(stage)}/epoch-(\d+)/config\.json$")
+    for path in files:
+        m = pattern.match(path)
+        if m:
+            try:
+                epoch_numbers.append(int(m.group(1)))
+            except ValueError:
+                pass
+    return sorted(set(epoch_numbers))
+def _repo_get_latest_epoch_subfolder(repo_id: str, stage: str) -> Optional[str]:
+    """
+    Returns subfolder path like '{stage}/epoch-XXX' for the highest available epoch, or None.
+    """
+    epochs = _repo_list_epoch_numbers(repo_id, stage)
+    if not epochs:
+        return None
+    latest = max(epochs)
+    return f"{stage}/epoch-{latest:03d}"
+def _load_model_and_tokenizer_from_hf_subfolder(repo_id: str, subfolder: str) -> Optional[Tuple[AutoModelForCausalLM, AutoTokenizer]]:
+    """
+    Loads model and tokenizer from HF under a specific subfolder (e.g., 'stage1/epoch-020').
+    """
+    if not HF_USE_HUB or hf_auth_token is None:
+        return None
+    print(f"\n---> Loading checkpoint from Hugging Face: {repo_id} (subfolder='{subfolder}')")
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
+        model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
+    except Exception as exc:
+        print(f"⚠️  Failed to load model/tokenizer from subfolder '{subfolder}': {exc}")
+        return None
+    if tokenizer.pad_token is None:
+        tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+def _download_training_state_from_subfolder(repo_id: str, subfolder: str) -> Optional[Dict[str, Any]]:
+    """
+    Downloads training_state.json from a specific subfolder (e.g., 'stage1/epoch-020').
+    """
+    if not HF_USE_HUB or hf_auth_token is None:
+        return None
+    try:
+        state_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{subfolder}/training_state.json",
+            repo_type="model",
+            token=hf_auth_token,
+        )
+        with open(state_path, "r", encoding="utf-8") as f:
+            return json.load(f)
+    except Exception:
+        return None
+def _download_training_state(repo_id: str, stage: str) -> Optional[Dict[str, Any]]:
+    """Downloads training_state.json from HF if present."""
+    if not HF_USE_HUB or hf_auth_token is None:
+        return None
+    try:
+        state_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{stage}/latest/training_state.json",
+            repo_type="model",
+            token=hf_auth_token,
+        )
+        with open(state_path, "r", encoding="utf-8") as f:
+            return json.load(f)
+    except Exception:
+        return None
+def _download_optimizer_state(repo_id: str, stage: str) -> Optional[str]:
+    """Downloads optimizer.pt for resuming optimizer state."""
+    if not HF_USE_HUB or hf_auth_token is None:
+        return None
+    try:
+        opt_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{stage}/latest/optimizer.pt",
+            repo_type="model",
+            token=hf_auth_token,
+        )
+        return opt_path
+    except Exception:
+        return None
+def _load_model_and_tokenizer_from_hf(repo_id: str, stage: str) -> Optional[Tuple[AutoModelForCausalLM, AutoTokenizer]]:
+    """
+    Loads model and tokenizer from HF under subfolder {stage}/latest if available.
+    """
+    if not _repo_has_stage_latest(repo_id, stage):
+        return None
+    print(f"\n---> Loading checkpoint from Hugging Face: {repo_id} (subfolder='{stage}/latest')")
+    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=f"{stage}/latest", trust_remote_code=True)
+    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=f"{stage}/latest", trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+def _ensure_tokenizer_has_motion_tokens(tokenizer: AutoTokenizer, motion_tokens: List[str]) -> int:
+    """
+    Adds any missing motion tokens to the tokenizer. Returns number of tokens added.
+    """
+    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN, "additional_special_tokens": [M_START, M_END]})
+    added = tokenizer.add_tokens(motion_tokens, special_tokens=True)
+    return added
+def _save_and_push_checkpoint(
+    stage: str,
+    epoch_index_zero_based: int,
+    model: AutoModelForCausalLM,
+    tokenizer: AutoTokenizer,
+    optimizer: AdamW,
+    avg_loss: float,
+    dataloader_len: int,
+    batch_size: int,
+    total_epochs: int,
+    repo_id: Optional[str],
+) -> None:
+    """
+    Saves checkpoint locally (per-epoch and latest) and pushes to HF under:
+      - {stage}/epoch-XXX
+      - {stage}/latest
+    Also saves optimizer state and training_state.json to preserve resume info.
+    """
+    epoch_number = epoch_index_zero_based + 1
+    stage_dir = os.path.join(CHECKPOINTS_DIR, stage)
+    epoch_dir_name = f"epoch-{epoch_number:03d}"
+    epoch_dir = os.path.join(stage_dir, epoch_dir_name)
+    latest_dir = os.path.join(stage_dir, "latest")
+    _ensure_dir(epoch_dir)
+    _ensure_dir(stage_dir)
+    # Save model + tokenizer
+    model.save_pretrained(epoch_dir)
+    tokenizer.save_pretrained(epoch_dir)
+    # Save optimizer state
+    torch.save(optimizer.state_dict(), os.path.join(epoch_dir, "optimizer.pt"))
+    # Save training state
+    training_state = {
+        "stage": stage,
+        "epoch_completed": epoch_number,
+        "total_epochs_for_stage": total_epochs,
+        "global_step": epoch_number * dataloader_len,
+        "avg_loss": float(avg_loss),
+        "batch_size": batch_size,
+        "saved_at": datetime.utcnow().isoformat() + "Z",
+    }
+    with open(os.path.join(epoch_dir, "training_state.json"), "w", encoding="utf-8") as f:
+        json.dump(training_state, f, ensure_ascii=False, indent=2)
+    # Update "latest"
+    if os.path.exists(latest_dir):
+        shutil.rmtree(latest_dir)
+    shutil.copytree(epoch_dir, latest_dir)
+    # Push to Hugging Face
+    if HF_USE_HUB and repo_id and hf_auth_token:
+        try:
+            upload_folder(
+                repo_id=repo_id,
+                folder_path=epoch_dir,
+                path_in_repo=f"{stage}/{epoch_dir_name}",
+                repo_type="model",
+                token=hf_auth_token,
+                commit_message=f"{stage}: save {epoch_dir_name}",
+            )
+            upload_folder(
+                repo_id=repo_id,
+                folder_path=latest_dir,
+                path_in_repo=f"{stage}/latest",
+                repo_type="model",
+                token=hf_auth_token,
+                commit_message=f"{stage}: update latest -> {epoch_dir_name}",
+            )
+            print(f"☁️  Pushed checkpoint to HF: {repo_id} ({stage}/{epoch_dir_name} and {stage}/latest)")
+        except Exception as exc:
+            print(f"⚠️  Failed to push checkpoint to HF: {exc}")
+    else:
+        print("ℹ️  Skipped HF push (Hub disabled or token/repo missing).")
+# ======================================================================================
+# 3. Training Stage 1: Motion Language Modeling
+# ======================================================================================
+class MotionDataset(Dataset):
+    """Dataset for Stage 1: Contains only motion token sequences."""
+    def __init__(self, data: List[Dict[str, Any]], tokenizer: AutoTokenizer, max_length: int = 256):
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+        self.sequences = []
+        for item in data:
+            tokens_str = item.get("motion_tokens", "")
+            wrapped_tokens = " ".join([f"<M{t}>" for t in tokens_str.split()])
+            full_sequence = f"{M_START} {wrapped_tokens} {M_END}"
+            self.sequences.append(full_sequence)
+    def __len__(self):
+        return len(self.sequences)
+    def __getitem__(self, idx):
+        return self.tokenizer(
+            self.sequences[idx],
+            truncation=True,
+            max_length=self.max_length,
+            padding="max_length",
+            return_tensors="pt"
+        )
+def train_stage1(
+    model,
+    tokenizer,
+    data,
+    device,
+    start_epoch: int = 0,
+    hf_repo_id: Optional[str] = None,
+):
+    """Trains the model on motion sequences only to learn the 'language of motion'.
+    Resumes from Hugging Face if available (model/tokenizer/optimizer)."""
+    print("\n" + "="*80)
+    print("      STAGE 1: MOTION LANGUAGE MODELING (PRE-TRAINING)")
+    print(f"      Training on {len(data)} samples.")
+    print("="*80)
+    dataset = MotionDataset(data, tokenizer)
+    dataloader = DataLoader(dataset, batch_size=S1_BATCH_SIZE, shuffle=True)
+    optimizer = AdamW(model.parameters(), lr=S1_LR)
+    model.to(device)
+    model.train()
+    # Try to resume optimizer if we resumed from HF
+    if hf_repo_id and start_epoch > 0 and HF_USE_HUB and hf_auth_token:
+        opt_path = _download_optimizer_state(hf_repo_id, "stage1")
+        if opt_path is not None:
+            try:
+                optimizer.load_state_dict(torch.load(opt_path, map_location=device))
+                print("↩️  Resumed optimizer state for Stage 1 from HF.")
+            except Exception as exc:
+                print(f"⚠️  Failed to load optimizer state for Stage 1: {exc}")
+    for epoch in range(start_epoch, S1_EPOCHS):
+        total_loss = 0
+        total_batches = len(dataloader)
+        epoch_start_time = time.time()
+        step_interval = max(1, total_batches // 50)  # ~2% progress updates
+        for i, batch in enumerate(dataloader, 1):
+            optimizer.zero_grad()
+            input_ids = batch['input_ids'].squeeze(1).to(device)
+            attention_mask = batch['attention_mask'].squeeze(1).to(device)
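+            # Exclude padded positions from the loss (-100 is ignored by the LM loss).
+            labels = input_ids.clone()
+            labels[attention_mask == 0] = -100
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)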
+            loss = outputs.loss
+            loss.backward()
+            optimizer.step()
+            total_loss += loss.item()
+            # Progress with ETA
+            if i == 1 or (i % step_interval == 0) or (i == total_batches):
+                elapsed = time.time() - epoch_start_time
+                est_total = (elapsed / i) * total_batches
+                eta = est_total - elapsed
+                pct = (i / total_batches) * 100.0
+                print(
+                    f"\r[Stage 1] Epoch {epoch+1}/{S1_EPOCHS} - "
+                    f"{i}/{total_batches} ({pct:.1f}%) - ETA {_format_seconds(eta)}",
+                    end="",
+                    flush=True,
+                )
+        # Finish the progress line
+        print()
+        avg_loss = total_loss / len(dataloader)
+        print(f"--- End of Epoch {epoch+1}/{S1_EPOCHS}, Average Loss: {avg_loss:.4f} ---")
+        # Save checkpoint locally every epoch; push to HF only at interval or final epoch
+        push_this_epoch = ((epoch + 1) % CHECKPOINT_UPLOAD_INTERVAL_EPOCHS == 0) or ((epoch + 1) == S1_EPOCHS)
+        repo_for_epoch = hf_repo_id if push_this_epoch else None
+        _save_and_push_checkpoint(
+            stage="stage1",
+            epoch_index_zero_based=epoch,
+            model=model,
+            tokenizer=tokenizer,
+            optimizer=optimizer,
+            avg_loss=avg_loss,
+            dataloader_len=len(dataloader),
+            batch_size=S1_BATCH_SIZE,
+            total_epochs=S1_EPOCHS,
+            repo_id=repo_for_epoch,
+        )
+    print("\n✅ Stage 1 Training Complete.")
+    return model
+# ======================================================================================
+# 4. Training Stage 2: Text-to-Motion Fine-Tuning
+# ======================================================================================
+class TextMotionDataset(Dataset):
+    """Dataset for Stage 2: Contains (prompt, motion_sequence) pairs."""
+    def __init__(self, data: List[Dict[str, Any]], tokenizer: AutoTokenizer, max_length: int = 256):
+        self.tokenizer = tokenizer
+        self.max_length = max_length
+        self.items = []
+        for item in data:
+            prompt = f"Instruction: Generate motion for word '{item['word']}' with variant '{item['participant_id']}'.\nMotion: "
+            tokens_str = item.get("motion_tokens", "")
+            wrapped_tokens = " ".join([f"<M{t}>" for t in tokens_str.split()])
+            target_sequence = f"{M_START} {wrapped_tokens} {M_END}"
+            full_text = prompt + target_sequence
+            tokenized = self.tokenizer(
+                full_text,
+                truncation=True,
+                max_length=self.max_length,
+                padding="max_length",
+                return_tensors="pt"
+            )
+            prompt_tokenized = self.tokenizer(prompt, return_tensors="pt")
+            prompt_len = prompt_tokenized.input_ids.shape[1]
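+            labels = tokenized['input_ids'].clone()
+            # Mask the prompt and the padding so only the motion sequence contributes to the loss.
+            labels[0, :prompt_len] = -100
+            labels[tokenized['attention_mask'] == 0] = -100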
+            self.items.append({
+                "input_ids": tokenized['input_ids'].squeeze(0),
+                "attention_mask": tokenized['attention_mask'].squeeze(0),
+                "labels": labels.squeeze(0)
+            })
+    def __len__(self):
+        return len(self.items)
+    def __getitem__(self, idx):
+        return self.items[idx]
+def train_stage2(
+    model,
+    tokenizer,
+    data,
+    device,
+    start_epoch: int = 0,
+    hf_repo_id: Optional[str] = None,
+    hf_stage_subdir: str = "stage2",
+):
+    """Fine-tunes the motion-aware model to connect text prompts to motions.
+    Resumes from Hugging Face if available (model/tokenizer/optimizer)."""
+    print("\n" + "="*80)
+    print("      STAGE 2: TEXT-TO-MOTION FINE-TUNING")
+    print(f"      Training on {len(data)} samples.")
+    print("="*80)
+    dataset = TextMotionDataset(data, tokenizer)
+    dataloader = DataLoader(dataset, batch_size=S2_BATCH_SIZE, shuffle=True)
+    optimizer = AdamW(model.parameters(), lr=S2_LR)
+    model.to(device)
+    model.train()
+    # Try to resume optimizer if we resumed from HF
+    if hf_repo_id and start_epoch > 0 and HF_USE_HUB and hf_auth_token:
+        opt_path = _download_optimizer_state(hf_repo_id, hf_stage_subdir)
+        if opt_path is not None:
+            try:
+                optimizer.load_state_dict(torch.load(opt_path, map_location=device))
+                print("↩️  Resumed optimizer state for Stage 2 from HF.")
+            except Exception as exc:
+                print(f"⚠️  Failed to load optimizer state for Stage 2: {exc}")
+    for epoch in range(start_epoch, S2_EPOCHS):
+        total_loss = 0
+        total_batches = len(dataloader)
+        epoch_start_time = time.time()
+        step_interval = max(1, total_batches // 50)  # ~2% progress updates
+        for i, batch in enumerate(dataloader, 1):
+            optimizer.zero_grad()
+            input_ids = batch['input_ids'].to(device)
+            attention_mask = batch['attention_mask'].to(device)
+            labels = batch['labels'].to(device)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs.loss
+            loss.backward()
+            optimizer.step()
+            total_loss += loss.item()
+            # Progress with ETA
+            if i == 1 or (i % step_interval == 0) or (i == total_batches):
+                elapsed = time.time() - epoch_start_time
+                est_total = (elapsed / i) * total_batches
+                eta = est_total - elapsed
+                pct = (i / total_batches) * 100.0
+                print(
+                    f"\r[Stage 2] Epoch {epoch+1}/{S2_EPOCHS} - "
+                    f"{i}/{total_batches} ({pct:.1f}%) - ETA {_format_seconds(eta)}",
+                    end="",
+                    flush=True,
+                )
+        # Finish the progress line
+        print()
+        avg_loss = total_loss / len(dataloader)
+        print(f"--- End of Epoch {epoch+1}/{S2_EPOCHS}, Average Loss: {avg_loss:.4f} ---")
+        # Save checkpoint locally every epoch; push to HF only at interval or final epoch
+        push_this_epoch = ((epoch + 1) % CHECKPOINT_UPLOAD_INTERVAL_EPOCHS == 0) or ((epoch + 1) == S2_EPOCHS)
+        repo_for_epoch = hf_repo_id if push_this_epoch else None
+        _save_and_push_checkpoint(
+            stage=hf_stage_subdir,
+            epoch_index_zero_based=epoch,
+            model=model,
+            tokenizer=tokenizer,
+            optimizer=optimizer,
+            avg_loss=avg_loss,
+            dataloader_len=len(dataloader),
+            batch_size=S2_BATCH_SIZE,
+            total_epochs=S2_EPOCHS,
+            repo_id=repo_for_epoch,
+        )
+    print("\n✅ Stage 2 Training Complete.")
+    os.makedirs(OUTPUT_DIR, exist_ok=True)
+    model.save_pretrained(OUTPUT_DIR)
+    tokenizer.save_pretrained(OUTPUT_DIR)
+    print(f"Model saved to {OUTPUT_DIR}")
+    return model
+# ======================================================================================
+# 5. Inference and Comparison
+# ======================================================================================
+def generate_motion(model, tokenizer, prompt, device):
+    """Generates a motion sequence from a prompt using sampling."""
+    model.eval()
+    inputs = tokenizer(prompt, return_tensors="pt").to(device)
+    with torch.no_grad():
+        output = model.generate(
+            **inputs,
+            max_new_tokens=100,
+            do_sample=True,
+            temperature=INFERENCE_TEMPERATURE,
+            top_k=INFERENCE_TOP_K,
+            repetition_penalty=INFERENCE_REPETITION_PENALTY,
+            pad_token_id=tokenizer.pad_token_id,
+            # Stop at the motion end token; early_stopping only applies to beam search, so it is omitted
+            eos_token_id=tokenizer.convert_tokens_to_ids(M_END),
+        )
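+    # Decode only the newly generated tokens so the result does not depend on
+    # the prompt text round-tripping exactly through the tokenizer
+    generated_ids = output[0][inputs["input_ids"].shape[1]:]
+    decoded = tokenizer.decode(generated_ids, skip_special_tokens=False)
+    return decoded.strip()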
+def compare_sequences(gt: str, gen: str):
+    """Provides a simple visual diff of two sequences without external libraries."""
+    gt_tokens = gt.split()
+    gen_tokens = gen.split()
+    print("\nDetailed Comparison (✅ = Match, ❌ = Mismatch/Missing/Added):")
+    gt_str =   "  GT:  "
+    gen_str =  "  GEN: "
+    diff_str = "       "
+    max_len = max(len(gt_tokens), len(gen_tokens))
+    for i in range(max_len):
+        gt_tok = gt_tokens[i] if i < len(gt_tokens) else "___"
+        gen_tok = gen_tokens[i] if i < len(gen_tokens) else "___"
+        max_tok_len = max(len(gt_tok), len(gen_tok))
+        gt_tok_padded = gt_tok.ljust(max_tok_len)
+        gen_tok_padded = gen_tok.ljust(max_tok_len)
+        gt_str += gt_tok_padded + " "
+        gen_str += gen_tok_padded + " "
+        if gt_tok == gen_tok:
+            diff_str += "✅".ljust(max_tok_len) + " "
+        else:
+            diff_str += "❌".ljust(max_tok_len) + " "
+    print(gt_str)
+    print(gen_str)
+    print(diff_str)
+def run_inference_on_all_samples(model, tokenizer, data, device):
+    """
+    Runs inference on ALL available samples for the trained words and compares
+    each one to its specific ground truth.
+    """
+    print("\n" + "="*80)
+    print("      INFERENCE AND EVALUATION (ALL SAMPLES)")
+    print("      Goal: Test the model's performance on every variant.")
+    print("="*80)
+    data_by_word = {}
+    for item in data:
+        word = item['word']
+        if word not in data_by_word:
+            data_by_word[word] = []
+        data_by_word[word].append(item)
+    for word, samples in data_by_word.items():
+        print(f"\n\n{'='*25} TESTING WORD: '{word}' {'='*25}")
+        num_correct = 0
+        for i, sample in enumerate(samples):
+            print(f"\n--- Testing Variant {i+1}/{len(samples)}: '{sample['participant_id']}' ---")
+            gt_tokens_str = sample.get("motion_tokens", "")
+            gt_wrapped = " ".join([f"<M{t}>" for t in gt_tokens_str.split()])
+            gt_sequence = f"{M_START} {gt_wrapped} {M_END}"
+            print(f"Ground Truth:\n{gt_sequence}")
+            prompt = f"Instruction: Generate motion for word '{sample['word']}' with variant '{sample['participant_id']}'.\nMotion: "
+            generated_sequence = generate_motion(model, tokenizer, prompt, device)
+            print(f"\nLLM Generated:\n{generated_sequence}")
+            compare_sequences(gt_sequence, generated_sequence)
+            if gt_sequence.strip() == generated_sequence.strip():
+                num_correct += 1
+            print("-" * 80)
+        accuracy = (num_correct / len(samples)) * 100
+        print(f"\nSUMMARY FOR '{word}': {num_correct}/{len(samples)} correct ({accuracy:.1f}%)")
+# ======================================================================================
+# 5b. Metrics: FID, Diversity, Multimodality (MIM) using MotionGPT-style utils
+# ======================================================================================
+def calculate_activation_statistics_np(activations: np.ndarray):
+    """
+    Params:
+    -- activations: num_samples x dim_feat (numpy)
+    Returns:
+    -- mu: dim_feat
+    -- sigma: dim_feat x dim_feat
+    """
+    mu = np.mean(activations, axis=0)
+    cov = np.cov(activations, rowvar=False)
+    return mu, cov
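+def calculate_frechet_distance_np(mu1, sigma1, mu2, sigma2, eps=1e-6):
+    """Numpy implementation of the Frechet distance between two Gaussians.
+    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 * Tr(sqrt(sigma1 @ sigma2))
+    If the matrix square root is not finite, eps is added to both covariance diagonals and it is recomputed.
+    """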
+    mu1 = np.atleast_1d(mu1)
+    mu2 = np.atleast_1d(mu2)
+    sigma1 = np.atleast_2d(sigma1)
+    sigma2 = np.atleast_2d(sigma2)
+    assert mu1.shape == mu2.shape, "Training and test mean vectors have different lengths"
+    assert sigma1.shape == sigma2.shape, "Training and test covariances have different dimensions"
+    diff = mu1 - mu2
+    covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
+    if not np.isfinite(covmean).all():
+        offset = np.eye(sigma1.shape[0]) * eps
+        covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
+    if np.iscomplexobj(covmean):
+        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
+            m = np.max(np.abs(covmean.imag))
+            raise ValueError(f"Imaginary component {m}")
+        covmean = covmean.real
+    tr_covmean = np.trace(covmean)
+    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean
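+def calculate_diversity_np(activation: np.ndarray, diversity_times: int = 200) -> float:
+    """Mean L2 distance over `diversity_times` randomly drawn pairs of samples.
+    The two index sets are drawn independently, so a sample can occasionally be paired with itself."""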
+    assert len(activation.shape) == 2
+    assert activation.shape[0] > max(2, diversity_times)
+    num_samples = activation.shape[0]
+    first_indices = np.random.choice(num_samples, diversity_times, replace=False)
+    second_indices = np.random.choice(num_samples, diversity_times, replace=False)
+    diffs = activation[first_indices] - activation[second_indices]
+    dist = np.linalg.norm(diffs, axis=1)
+    return float(dist.mean())
+def calculate_multimodality_np(activation: np.ndarray, multimodality_times: int = 20) -> float:
+    """
+    activation: [num_labels, num_per_label, D]
+    Returns mean pairwise within-label diversity (higher = more multimodal).
+    """
+    assert len(activation.shape) == 3
+    num_labels, num_per_label, _ = activation.shape
+    assert num_per_label > multimodality_times
+    first_indices = np.random.choice(num_per_label, multimodality_times, replace=False)
+    second_indices = np.random.choice(num_per_label, multimodality_times, replace=False)
+    diffs = activation[:, first_indices] - activation[:, second_indices]
+    dist = np.linalg.norm(diffs, axis=2)
+    return float(dist.mean())
+# --------------------------------------------------------------------------------------
+# Token sequence → activation (bag-of-motion-tokens) helpers
+# --------------------------------------------------------------------------------------
+def _extract_motion_tokens_from_sequence(seq: str) -> list[str]:
+    # Expect tokens like <M123>, within M_START/M_END fences; keep only <M...>
+    return [tok for tok in seq.split() if tok.startswith("<M") and tok.endswith(">")]
+def _build_token_index(tokens_vocab: list[str]) -> Dict[str, int]:
+    return {tok: idx for idx, tok in enumerate(tokens_vocab)}
+def _sequence_to_activation(seq: str, token_to_index: Dict[str, int]) -> np.ndarray:
+    vec = np.zeros((len(token_to_index),), dtype=np.float32)
+    for tok in _extract_motion_tokens_from_sequence(seq):
+        idx = token_to_index.get(tok)
+        if idx is not None:
+            vec[idx] += 1.0
+    # Normalize to unit length to reduce length bias
+    norm = np.linalg.norm(vec)
+    if norm > 0:
+        vec = vec / norm
+    return vec
+def _collect_eval_pairs(model, tokenizer, data, device) -> list[Tuple[str, str, str, str]]:
+    """
+    Returns list of (word, participant_id, gt_sequence, generated_sequence) for each sample in data.
+    """
+    results = []
+    for sample in data:
+        gt_tokens_str = sample.get("motion_tokens", "")
+        gt_wrapped = " ".join([f"<M{t}>" for t in gt_tokens_str.split()])
+        gt_sequence = f"{M_START} {gt_wrapped} {M_END}"
+        prompt = f"Instruction: Generate motion for word '{sample['word']}' with variant '{sample['participant_id']}'.\nMotion: "
+        generated_sequence = generate_motion(model, tokenizer, prompt, device)
+        pid = str(sample.get("participant_id", ""))
+        results.append((sample["word"], pid, gt_sequence, generated_sequence))
+    return results
+def _activations_from_pairs(pairs: list[Tuple[str, ...]], vocab_tokens: list[str]):
+    """
+    Build numpy activations and labels arrays from sequences.
+    Returns:
+      gt_acts: (N, D)
+      gen_acts: (N, D)
+      labels: list[str] length N (word labels)
+    """
+    token_to_index = _build_token_index(vocab_tokens)
+    gt_vecs = []
+    gen_vecs = []
+    labels = []
+    for pair in pairs:
+        # Support both legacy 3-tuple (word, gt, gen) and new 4-tuple (word, pid, gt, gen)
+        if len(pair) == 4:
+            word, _pid, gt_seq, gen_seq = pair
+        else:
+            word, gt_seq, gen_seq = pair
+        gt_vecs.append(_sequence_to_activation(gt_seq, token_to_index))
+        gen_vecs.append(_sequence_to_activation(gen_seq, token_to_index))
+        labels.append(word)
+    return np.stack(gt_vecs, axis=0), np.stack(gen_vecs, axis=0), labels
+def _to_label_tensor3(acts: np.ndarray, labels: list[str]) -> np.ndarray:
+    """
+    Convert N x D activations with string labels to [L, K, D] by truncating each label
+    to the minimum count across labels.
+    """
+    label_to_indices: Dict[str, list[int]] = {}
+    for i, lbl in enumerate(labels):
+        label_to_indices.setdefault(lbl, []).append(i)
+    per_label_counts = [len(idxs) for idxs in label_to_indices.values()]
+    if len(per_label_counts) == 0:
+        raise ValueError("No labels found for multimodality computation.")
+    min_count = max(2, min(per_label_counts))
+    label_names = sorted(label_to_indices.keys())
+    stacked = []
+    for lbl in label_names:
+        idxs = label_to_indices[lbl][:min_count]
+        stacked.append(acts[idxs])
+    return np.stack(stacked, axis=0)  # [L, K, D]
+def evaluate_metrics_motiongpt_style(model, tokenizer, eval_data, all_motion_tokens, device):
+    """
+    Computes:
+      - Diversity: GT vs GEN (pair)
+      - Multimodality (MIM): GT vs GEN (pair)
+      - FID: between GT and GEN
+    """
+    print("\n" + "="*80)
+    print("      METRICS EVALUATION (FID, Diversity, Multimodality)")
+    print("="*80)
+    pairs = _collect_eval_pairs(model, tokenizer, eval_data, device)
+    gt_acts, gen_acts, labels = _activations_from_pairs(pairs, all_motion_tokens)
+    # Diversity
+    diversity_times = min(200, max(4, gt_acts.shape[0] - 1))
+    diversity_gt = calculate_diversity_np(gt_acts, diversity_times=diversity_times)
+    diversity_gen = calculate_diversity_np(gen_acts, diversity_times=diversity_times)
+    # Multimodality (MIM)
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_acts, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_acts, labels)
+        multimodality_times = min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        mim_gt = calculate_multimodality_np(gt_lbl_tensor, multimodality_times=multimodality_times)
+        mim_gen = calculate_multimodality_np(gen_lbl_tensor, multimodality_times=multimodality_times)
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    # FID
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_acts)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_acts)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    print(f"Diversity:    GT = {diversity_gt:.4f} | GEN = {diversity_gen:.4f}")
+    print(f"Multimodality (MIM): GT = {mim_gt:.4f} | GEN = {mim_gen:.4f}")
+    print(f"FID (GT vs GEN): {fid:.4f}")
+    return {
+        "diversity_gt": diversity_gt,
+        "diversity_gen": diversity_gen,
+        "mim_gt": mim_gt,
+        "mim_gen": mim_gen,
+        "fid": fid,
+        "pairs": pairs,  # for visualization usage
+    }
+# ======================================================================================
+# 5b-ALT. Metrics using VQ-VAE codebook embeddings (near-standard activations)
+# ======================================================================================
+def _sequence_to_codebook_feature(seq: str, vq_model) -> np.ndarray:
+    """
+    Build a single clip feature by mean-pooling VQ-VAE codebook embeddings
+    corresponding to the token ids in the sequence. L2-normalized.
+    """
+    token_ids = _extract_ids_from_sequence(seq)
+    # Resolve code dimension and codebook availability
+    quantizer = getattr(vq_model.vqvae, "quantizer", None)
+    if quantizer is None:
+        raise RuntimeError("VQ-VAE quantizer missing; cannot extract codebook embeddings.")
+    # Try dequantize -> mean over time (preferred)
+    feat_vec = None
+    if hasattr(quantizer, "dequantize") and token_ids:
+        try:
+            idx = torch.tensor(token_ids, dtype=torch.long, device=next(vq_model.parameters()).device).unsqueeze(0)
+            with torch.no_grad():
+                dq = quantizer.dequantize(idx)
+            if dq is not None:
+                # Expect shape [N, code_dim, T]; average over T
+                if dq.ndim == 3:
+                    if dq.shape[0] == 1:
+                        x = dq.squeeze(0)  # [code_dim, T] or [T, code_dim]
+                    else:
+                        x = dq.mean(dim=0)
+                    if x.shape[0] < x.shape[1]:
+                        # [code_dim, T]
+                        feat = x.mean(dim=1)
+                    else:
+                        # [T, code_dim]
+                        feat = x.mean(dim=0)
+                    feat_vec = feat.detach().cpu().numpy().astype(np.float32)
+        except Exception:
+            feat_vec = None
+    # Fallback: direct codebook lookup -> mean over token ids
+    if feat_vec is None:
+        codebook = getattr(quantizer, "codebook", None)
+        if codebook is None:
+            raise RuntimeError("Quantizer has neither dequantize() nor codebook.")
+        code_np = codebook.detach().cpu().numpy()  # [K, D]
+        if not token_ids:
+            feat_vec = np.zeros((code_np.shape[1],), dtype=np.float32)
+        else:
+            ids = np.asarray(token_ids, dtype=np.int64)
+            ids = np.clip(ids, 0, code_np.shape[0] - 1)
+            feat_vec = code_np[ids].mean(axis=0).astype(np.float32)
+    # L2-normalize to reduce length/scale bias
+    norm = np.linalg.norm(feat_vec)
+    if norm > 0:
+        feat_vec = feat_vec / norm
+    return feat_vec
+def _activations_from_pairs_codebook(pairs: list[Tuple[str, ...]], vq_model):
+    """
+    Produce codebook-embedding features for GT and GEN sequences and their labels.
+    Returns:
+      gt_feats: (N, D)
+      gen_feats: (N, D)
+      labels: list[str] of length N (word labels)
+    """
+    gt_feats = []
+    gen_feats = []
+    labels = []
+    for pair in pairs:
+        if len(pair) == 4:
+            word, _pid, gt_seq, gen_seq = pair
+        else:
+            word, gt_seq, gen_seq = pair
+        gt_feats.append(_sequence_to_codebook_feature(gt_seq, vq_model))
+        gen_feats.append(_sequence_to_codebook_feature(gen_seq, vq_model))
+        labels.append(word)
+    return np.stack(gt_feats, axis=0), np.stack(gen_feats, axis=0), labels
+def evaluate_metrics_codebook_style(model, tokenizer, eval_data, device, vqvae_ckpt: Optional[str] = None):
+    """
+    Computes FID, Diversity, and MIM using features derived from the VQ-VAE codebook:
+      - Feature per clip = mean-pooled codebook embeddings over token sequence, L2-normalized
+      - Diversity/MIM computed exactly as in MotionGPT-style helpers but on these features
+      - FID computed via full covariance Fréchet distance on these features
+    Returns a dict mirroring evaluate_metrics_motiongpt_style.
+    """
+    print("\n" + "="*80)
+    print("      METRICS EVALUATION (Codebook-Embedding Features)")
+    print("="*80)
+    # Lazy import to avoid hard dependency at module import time
+    try:
+        from visualize import load_vqvae, VQVAE_CHECKPOINT as DEFAULT_VQ
+        vq_ckpt = vqvae_ckpt or os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+        vq_model = load_vqvae(vq_ckpt, device=device)
+    except Exception as exc:
+        print(f"⚠️  Could not load VQ-VAE for codebook metrics: {exc}")
+        return {}
+    # Collect pairs and build features
+    pairs = _collect_eval_pairs(model, tokenizer, eval_data, device)
+    gt_feats, gen_feats, labels = _activations_from_pairs_codebook(pairs, vq_model)
+    # Diversity
+    diversity_times = min(200, max(4, gt_feats.shape[0] - 1))
+    diversity_gt = calculate_diversity_np(gt_feats, diversity_times=diversity_times)
+    diversity_gen = calculate_diversity_np(gen_feats, diversity_times=diversity_times)
+    # Multimodality (MIM)
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_feats, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_feats, labels)
+        multimodality_times = min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        mim_gt = calculate_multimodality_np(gt_lbl_tensor, multimodality_times=multimodality_times)
+        mim_gen = calculate_multimodality_np(gen_lbl_tensor, multimodality_times=multimodality_times)
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    # FID (on codebook features)
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_feats)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_feats)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    print(f"Diversity (codebook feats):    GT = {diversity_gt:.4f} | GEN = {diversity_gen:.4f}")
+    print(f"Multimodality (MIM, codebook): GT = {mim_gt:.4f} | GEN = {mim_gen:.4f}")
+    print(f"FID (codebook feats, GT vs GEN): {fid:.4f}")
+    return {
+        "diversity_gt": diversity_gt,
+        "diversity_gen": diversity_gen,
+        "mim_gt": mim_gt,
+        "mim_gen": mim_gen,
+        "fid": fid,
+        "pairs": pairs,
+    }
+# ======================================================================================
+# 5b-ALT2. Metrics using VQ-VAE encoder pre-quantization features (as described)
+# ======================================================================================
+def _encode_params_to_feature(params: np.ndarray, vq_model, mean, std, device) -> np.ndarray:
+    """
+    Convert SMPL-X parameter sequence (T, D) into a single clip feature using
+    the VQ-VAE encoder output BEFORE quantization. Average-pool over time to get (D_embed,).
+    - Attempts to use vq_model.vqvae.preprocess; otherwise applies manual normalization with mean/std.
+    - Handles encoder outputs shaped as [N, D_embed, T_q] or [N, T_q, D_embed].
+    """
+    if params.size == 0:
+        return np.zeros((getattr(vq_model.vqvae, "output_emb_width", 512),), dtype=np.float32)
+    x = torch.from_numpy(params.astype(np.float32)).to(device)  # [T, D]
+    x = x.unsqueeze(0)  # [1, T, D]
+    with torch.no_grad():
+        # Normalize / preprocess
+        x_pre = None
+        if hasattr(vq_model.vqvae, "preprocess"):
+            try:
+                x_pre = vq_model.vqvae.preprocess(x)  # expected to return tensor ready for encoder
+            except Exception:
+                x_pre = None
+        if x_pre is None:
+            # Manual normalization with provided mean/std
+            if mean is not None and std is not None:
+                mean_t = torch.from_numpy(np.array(mean, dtype=np.float32)).to(device).view(1, 1, -1)
+                std_t = torch.from_numpy(np.array(std, dtype=np.float32)).to(device).view(1, 1, -1)
+                x_norm = (x - mean_t) / (std_t + 1e-8)
+            else:
+                x_norm = x
+            # Some encoders expect [N, D, T]
+            x_pre = x_norm.transpose(1, 2).contiguous()  # [1, D, T]
+        # Encode to get pre-quant latent
+        z_e = vq_model.vqvae.encoder(x_pre)
+        # z_e could be [N, D_embed, T_q] or [N, T_q, D_embed]
+        if z_e.dim() == 3:
+            # Determine which axis is time by comparing to known embed dim when available,
+            # otherwise assume time is the smaller dimension (varies per clip).
+            embed_dim_known = getattr(vq_model.vqvae, "output_emb_width", None)
+            if embed_dim_known is not None:
+                if z_e.shape[1] == embed_dim_known:
+                    time_axis = 2  # [N, D_embed, T_q]
+                elif z_e.shape[2] == embed_dim_known:
+                    time_axis = 1  # [N, T_q, D_embed]
+                else:
+                    time_axis = 2 if z_e.shape[2] < z_e.shape[1] else 1
+            else:
+                time_axis = 2 if z_e.shape[2] < z_e.shape[1] else 1
+            feat = z_e.mean(dim=time_axis).squeeze(0)
+        elif z_e.dim() == 2:
+            feat = z_e.squeeze(0)
+        else:
+            # Fallback: flatten then reduce
+            feat = z_e.view(1, -1).mean(dim=0)
+        feat_np = feat.detach().cpu().numpy().astype(np.float32)
+        # L2 normalize
+        norm = np.linalg.norm(feat_np)
+        if norm > 0:
+            feat_np = feat_np / norm
+        return feat_np
+def evaluate_metrics_encoder_style(
+    model,
+    tokenizer,
+    eval_data,
+    device,
+    vqvae_ckpt: Optional[str] = None,
+    stats_path: Optional[str] = None,
+    sample_limit: int = 100,
+):
+    """
+    Computes FID, Diversity, and MIM using VQ-VAE encoder pre-quantization features:
+      - For each sample, decode tokens -> SMPL-X params, then run through VQ-VAE encoder,
+        average-pool across time, L2-normalize to get a clip feature.
+      - Diversity/MIM identical formulations but on these encoder features.
+      - FID via full covariance Fréchet distance on these encoder features.
+    Evaluates on up to 'sample_limit' samples for speed.
+    """
+    print("\n" + "="*80)
+    print("      METRICS EVALUATION (VQ-VAE Encoder Features)")
+    print("="*80)
+    # Lazy import to reuse your visualization utilities and stats
+    try:
+        from visualize import load_vqvae, load_stats, VQVAE_CHECKPOINT as DEFAULT_VQ, STATS_PATH as DEFAULT_STATS
+        vq_ckpt = vqvae_ckpt or os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+        stats_p = stats_path or os.getenv("VQVAE_STATS_PATH", DEFAULT_STATS)
+        vq_model = load_vqvae(vq_ckpt, device=device)
+        mean, std = load_stats(stats_p)
+        from visualize import decode_tokens_to_params
+    except Exception as exc:
+        print(f"⚠️  Could not set up VQ-VAE encoder metrics: {exc}")
+        return {}
+    # Collect GT/GEN token sequences for pairs (limit to speed-up)
+    pairs = _collect_eval_pairs(model, tokenizer, eval_data[:sample_limit], device)
+    # Build features
+    gt_feats = []
+    gen_feats = []
+    labels = []
+    for pair in pairs:
+        if len(pair) == 4:
+            word, _pid, gt_seq, gen_seq = pair
+        else:
+            word, gt_seq, gen_seq = pair
+        # Decode to SMPL-X
+        tokens_gt = _extract_ids_from_sequence(gt_seq)
+        tokens_gen = _extract_ids_from_sequence(gen_seq)
+        try:
+            params_gt = decode_tokens_to_params(tokens_gt, vq_model, mean, std, device=device)  # (T, D) denorm
+        except Exception:
+            params_gt = np.zeros((0, 182), dtype=np.float32)
+        try:
+            params_gen = decode_tokens_to_params(tokens_gen, vq_model, mean, std, device=device)  # (T, D) denorm
+        except Exception:
+            params_gen = np.zeros((0, 182), dtype=np.float32)
+        # Encode (pre-quant) -> pooled feature
+        feat_gt = _encode_params_to_feature(params_gt, vq_model, mean, std, device)
+        feat_gen = _encode_params_to_feature(params_gen, vq_model, mean, std, device)
+        gt_feats.append(feat_gt)
+        gen_feats.append(feat_gen)
+        labels.append(word)
+    gt_feats = np.stack(gt_feats, axis=0)
+    gen_feats = np.stack(gen_feats, axis=0)
+    # Diversity
+    diversity_times = min(200, max(4, gt_feats.shape[0] - 1))
+    diversity_gt = calculate_diversity_np(gt_feats, diversity_times=diversity_times)
+    diversity_gen = calculate_diversity_np(gen_feats, diversity_times=diversity_times)
+    # Multimodality (MIM)
+    try:
+        gt_lbl_tensor = _to_label_tensor3(gt_feats, labels)
+        gen_lbl_tensor = _to_label_tensor3(gen_feats, labels)
+        multimodality_times = min(20, max(3, gt_lbl_tensor.shape[1] - 1))
+        mim_gt = calculate_multimodality_np(gt_lbl_tensor, multimodality_times=multimodality_times)
+        mim_gen = calculate_multimodality_np(gen_lbl_tensor, multimodality_times=multimodality_times)
+    except Exception as exc:
+        print(f"⚠️  Multimodality could not be computed reliably: {exc}")
+        mim_gt = float("nan")
+        mim_gen = float("nan")
+    # FID (on encoder features)
+    mu_gen, cov_gen = calculate_activation_statistics_np(gen_feats)
+    mu_gt, cov_gt = calculate_activation_statistics_np(gt_feats)
+    fid = calculate_frechet_distance_np(mu_gt, cov_gt, mu_gen, cov_gen)
+    print(f"Diversity (encoder feats):    GT = {diversity_gt:.4f} | GEN = {diversity_gen:.4f}")
+    print(f"Multimodality (MIM, encoder): GT = {mim_gt:.4f} | GEN = {mim_gen:.4f}")
+    print(f"FID (encoder feats, GT vs GEN): {fid:.4f}")
+    return {
+        "diversity_gt": diversity_gt,
+        "diversity_gen": diversity_gen,
+        "mim_gt": mim_gt,
+        "mim_gen": mim_gen,
+        "fid": fid,
+        "pairs": pairs,
+    }
+# ======================================================================================
+# 5c. Side-by-side visualization (4 samples)
+# ======================================================================================
+def _extract_ids_from_sequence(seq: str) -> list[int]:
+    return [int(t[2:-1]) for t in _extract_motion_tokens_from_sequence(seq) if t[2:-1].isdigit()]
+def save_side_by_side_visualizations(pairs: list[Tuple[str, ...]], output_dir: str, limit: int = 4):
+    """
+    Generate side-by-side 3D animations for GT vs GEN, saving one HTML per sample
+    using filename scheme: word_PID_side_by_side.html.
+    - Processes ALL samples for up to `limit` distinct words (if provided).
+    - Requires visualize.py utilities and plotly.
+    """
+    try:
+        from visualize import (
+            load_vqvae, load_stats, load_smplx_model,
+            decode_tokens_to_params, params_to_vertices,
+            VQVAE_CHECKPOINT as DEFAULT_VQ, STATS_PATH as DEFAULT_STATS, SMPLX_MODEL_DIR as DEFAULT_SMPLX
+        )
+        import plotly.graph_objects as go
+        from plotly.subplots import make_subplots
+    except Exception as exc:
+        print(f"⚠️  Visualization skipped (missing dependencies): {exc}")
+        return
+    os.makedirs(output_dir, exist_ok=True)
+    vqvae_ckpt = os.getenv("VQVAE_CHECKPOINT", DEFAULT_VQ)
+    stats_path = os.getenv("VQVAE_STATS_PATH", DEFAULT_STATS)
+    smplx_dir = os.getenv("SMPLX_MODEL_DIR", DEFAULT_SMPLX)
+    print("Loading VQ-VAE, stats, SMPL-X ...")
+    vq_model = load_vqvae(vqvae_ckpt)
+    mean, std = load_stats(stats_path)
+    smplx_model = load_smplx_model(smplx_dir)
+    def animate_side_by_side(verts_left, faces, verts_right, fps=20, titles=("Ground Truth", "LLM Generated"), output_html=None):
+        T = min(verts_left.shape[0], verts_right.shape[0])
+        verts_left, verts_right = verts_left[:T], verts_right[:T]
+        i, j, k = faces.T.tolist()
+        fig = make_subplots(
+            rows=1, cols=2,
+            specs=[[{'type': 'scene'}, {'type': 'scene'}]],
+            horizontal_spacing=0.05,
+            subplot_titles=list(titles)
+        )
+        left_mesh = go.Mesh3d(x=verts_left[0,:,0], y=verts_left[0,:,1], z=verts_left[0,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False)
+        right_mesh = go.Mesh3d(x=verts_right[0,:,0], y=verts_right[0,:,1], z=verts_right[0,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False)
+        fig.add_trace(left_mesh, row=1, col=1)
+        fig.add_trace(right_mesh, row=1, col=2)
+        frames = []
+        for t in range(T):
+            frames.append(go.Frame(
+                name=str(t),
+                data=[
+                    go.Mesh3d(x=verts_left[t,:,0], y=verts_left[t,:,1], z=verts_left[t,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False,scene="scene"),
+                    go.Mesh3d(x=verts_right[t,:,0], y=verts_right[t,:,1], z=verts_right[t,:,2], i=i,j=j,k=k,opacity=0.7,showscale=False,scene="scene2")
+                ]
+            ))
+        fig.frames = frames
+        fig.update_layout(
+            showlegend=False,
+            margin=dict(l=10, r=10, t=50, b=10),
+            scene=dict(aspectmode='data',xaxis=dict(visible=False),yaxis=dict(visible=False),zaxis=dict(visible=False),
+                       camera=dict(eye=dict(x=0,y=-2,z=0.7))),
+            scene2=dict(aspectmode='data',xaxis=dict(visible=False),yaxis=dict(visible=False),zaxis=dict(visible=False),
+                        camera=dict(eye=dict(x=0,y=-2,z=0.7))),
+            updatemenus=[dict(
+                type="buttons", x=0.5, xanchor="center", y=1.15, yanchor="top",
+                buttons=[
+                    dict(label="Play", method="animate", args=[None, {"frame": {"duration": max(1,1000//fps), "redraw": True}, "fromcurrent": True}]),
+                    dict(label="Pause", method="animate", args=[[None], {"frame": {"duration": 0, "redraw": False}}])
+                ]
+            )]
+        )
+        if output_html:
+            fig.write_html(output_html)
+            print(f"✅ Saved: {output_html}")
+        return fig
+    # Determine which words to include (up to `limit` distinct words)
+    allowed_words = None
+    if isinstance(limit, int) and limit > 0:
+        ordered_unique_words = []
+        for pair in pairs:
+            word = pair[0]
+            if word not in ordered_unique_words:
+                ordered_unique_words.append(word)
+            if len(ordered_unique_words) >= limit:
+                break
+        allowed_words = set(ordered_unique_words)
+    for pair in pairs:
+        try:
+            if len(pair) == 4:
+                word, pid, gt_seq, gen_seq = pair
+            else:
+                word, gt_seq, gen_seq = pair
+                pid = "unknown"
+            if allowed_words is not None and word not in allowed_words:
+                continue
+            tokens_gt = _extract_ids_from_sequence(gt_seq)
+            tokens_gen = _extract_ids_from_sequence(gen_seq)
+            params_gt = decode_tokens_to_params(tokens_gt, vq_model, mean, std)
+            params_gen = decode_tokens_to_params(tokens_gen, vq_model, mean, std)
+            verts_gt, faces = params_to_vertices(params_gt, smplx_model)
+            verts_gen, _ = params_to_vertices(params_gen, smplx_model)
+            os.makedirs(output_dir, exist_ok=True)
+            # Sanitize for filesystem safety
+            safe_word = re.sub(r'[^A-Za-z0-9_-]+', '_', str(word))
+            safe_pid = re.sub(r'[^A-Za-z0-9_-]+', '_', str(pid))
+            output_html = os.path.join(output_dir, f"word_{safe_word}_{safe_pid}_side_by_side.html")
+            animate_side_by_side(
+                verts_left=verts_gt,
+                faces=faces,
+                verts_right=verts_gen,
+                fps=20,
+                titles=("Ground Truth", "LLM Generated"),
+                output_html=output_html
+            )
+        except Exception as exc:
+            print(f"⚠️  Error creating visualization for word '{pair[0]}': {exc}")
+# ======================================================================================
+# 6. Main Execution Block (UPDATED)
+# ======================================================================================
+def main(config_overrides: Optional[Dict[str, Any]] = None):
+    """Main function to run the entire pipeline."""
+    apply_config_overrides(config_overrides)
+    if config_overrides:
+        printable = {k: v for k, v in config_overrides.items() if "token" not in k.lower()}
+        if printable:
+            print("\nApplied config overrides:")
+            for key, value in printable.items():
+                print(f"  - {key} = {value}")
+    random.seed(42)
+    torch.manual_seed(42)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"Using device: {device}")
+    # 1. Load ALL data
+    all_entries = read_json_data(DATASET_PATH)
+    # 2. Clean the ENTIRE dataset and get all tokens
+    cleaned_data, all_motion_tokens = deduplicate_and_prepare_data(all_entries)
+    # 3. Stage 1: Initialize or resume from HF, then train
+    resolved_stage1_repo = _resolve_and_ensure_repo(HF_STAGE1_REPO_ID) if HF_USE_HUB else None
+    start_epoch_s1 = 0
+    stage1_loaded = None
+    if resolved_stage1_repo:
+        if _repo_has_stage_latest(resolved_stage1_repo, "stage1"):
+            stage1_loaded = _load_model_and_tokenizer_from_hf(resolved_stage1_repo, "stage1")
+            state_s1 = _download_training_state(resolved_stage1_repo, "stage1")
+            if state_s1 and isinstance(state_s1.get("epoch_completed"), int):
+                start_epoch_s1 = state_s1["epoch_completed"]
+        else:
+            # Fallback: no 'latest' folder; select highest epoch-XXX
+            latest_s1_sub = _repo_get_latest_epoch_subfolder(resolved_stage1_repo, "stage1")
+            if latest_s1_sub:
+                stage1_loaded = _load_model_and_tokenizer_from_hf_subfolder(resolved_stage1_repo, latest_s1_sub)
+                state_s1 = _download_training_state_from_subfolder(resolved_stage1_repo, latest_s1_sub)
+                if state_s1 and isinstance(state_s1.get("epoch_completed"), int):
+                    start_epoch_s1 = state_s1["epoch_completed"]
+    if stage1_loaded:
+        base_model, tokenizer = stage1_loaded
+        # Ensure tokenizer contains all motion tokens (add missing if dataset expanded)
+        added = _ensure_tokenizer_has_motion_tokens(tokenizer, all_motion_tokens)
+        if added > 0:
+            base_model.resize_token_embeddings(len(tokenizer))
+    else:
+        base_model, tokenizer = setup_model_and_tokenizer(MODEL_NAME, all_motion_tokens)
+    print(f"\nStarting Stage 1 training on {len(cleaned_data)} samples (resume from epoch {start_epoch_s1}).")
+    motion_model = train_stage1(
+        base_model,
+        tokenizer,
+        cleaned_data,
+        device,
+        start_epoch=start_epoch_s1,
+        hf_repo_id=resolved_stage1_repo,
+    )
+    # 4. Stage 2: Initialize or resume from HF, then train
+    resolved_stage2_repo = _resolve_and_ensure_repo(HF_STAGE2_REPO_ID) if HF_USE_HUB else None
+    start_epoch_s2 = 0
+    stage2_loaded = None
+    print(f"\nStage 2 resume policy: FORCE_STAGE2_FROM_STAGE1={FORCE_STAGE2_FROM_STAGE1}, save_subdir='{HF_STAGE2_SAVE_SUBDIR}'")
+    # For this run, Stage 2 should start from the Stage 1 model (epoch-20) even if an old Stage 2 checkpoint exists;
+    # FORCE_STAGE2_FROM_STAGE1 enforces that. Stage 2 is resumed only when the flag is off and a checkpoint
+    # exists under the configured save subdir.
+    if not FORCE_STAGE2_FROM_STAGE1 and resolved_stage2_repo:
+        # Prefer loading from the configured Stage 2 save subdir (e.g., 'stage2_v2')
+        if _repo_has_stage_latest(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR):
+            stage2_loaded = _load_model_and_tokenizer_from_hf(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR)
+            state_s2 = _download_training_state(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR)
+            if state_s2 and isinstance(state_s2.get("epoch_completed"), int):
+                start_epoch_s2 = state_s2["epoch_completed"]
+            print(f"Resuming Stage 2 from HF subfolder: {HF_STAGE2_SAVE_SUBDIR}/latest (epoch_completed={start_epoch_s2})")
+        else:
+            latest_s2_sub = _repo_get_latest_epoch_subfolder(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR)
+            if latest_s2_sub:
+                stage2_loaded = _load_model_and_tokenizer_from_hf_subfolder(resolved_stage2_repo, latest_s2_sub)
+                state_s2 = _download_training_state_from_subfolder(resolved_stage2_repo, latest_s2_sub)
+                if state_s2 and isinstance(state_s2.get("epoch_completed"), int):
+                    start_epoch_s2 = state_s2["epoch_completed"]
+                print(f"Resuming Stage 2 from HF subfolder: {latest_s2_sub} (epoch_completed={start_epoch_s2})")
+    if stage2_loaded:
+        stage2_model, tokenizer = stage2_loaded
+        added2 = _ensure_tokenizer_has_motion_tokens(tokenizer, all_motion_tokens)
+        if added2 > 0:
+            stage2_model.resize_token_embeddings(len(tokenizer))
+    else:
+        stage2_model = motion_model  # Start Stage 2 from Stage 1 model
+    print(f"\nStarting Stage 2 training on {len(cleaned_data)} samples (resume from epoch {start_epoch_s2}).")
+    final_model = train_stage2(
+        stage2_model,
+        tokenizer,
+        cleaned_data,
+        device,
+        start_epoch=start_epoch_s2,
+        hf_repo_id=resolved_stage2_repo,
+        hf_stage_subdir=HF_STAGE2_SAVE_SUBDIR,
+    )
+    # 5. Filter the cleaned data to get a smaller set for evaluation
+    # This keeps the evaluation focused on our benchmark words and the logs readable
+    print("\n--- Filtering data for evaluation on specific words ---")
+    evaluation_data = [item for item in cleaned_data if item['word'].lower() in EVALUATION_WORDS]
+    print(f"Found {len(evaluation_data)} samples for evaluation words: {EVALUATION_WORDS}")
+    # 6. Metrics-only mode or full flow
+    if RUN_EVALS_ONLY:
+        # Compute the 3 metrics using VQ-VAE encoder features and save to JSON
+        metrics_enc = evaluate_metrics_encoder_style(
+            final_model, tokenizer, evaluation_data, device, sample_limit=EVAL_SAMPLE_LIMIT
+        )
+        os.makedirs(OUTPUT_DIR, exist_ok=True)
+        metrics_payload = {
+            "source": "vqvae_encoder",
+            "fid": metrics_enc.get("fid"),
+            "diversity": {
+                "ground_truth": metrics_enc.get("diversity_gt"),
+                "model": metrics_enc.get("diversity_gen"),
+            },
+            "multimodality": {
+                "ground_truth": metrics_enc.get("mim_gt"),
+                "model": metrics_enc.get("mim_gen"),
+            },
+            "num_pairs": len(metrics_enc.get("pairs", [])),
+        }
+        with open(METRICS_JSON_PATH, "w", encoding="utf-8") as f:
+            json.dump(metrics_payload, f, ensure_ascii=False, indent=2)
+        print(f"\n✅ Saved metrics to {METRICS_JSON_PATH}")
+        return
+    # Full flow: inference logs + MotionGPT-style metrics + encoder metrics + visualizations
+    run_inference_on_all_samples(final_model, tokenizer, evaluation_data, device)
+    metrics_token = evaluate_metrics_motiongpt_style(final_model, tokenizer, evaluation_data, all_motion_tokens, device)
+    # Also compute encoder-based 3 metrics
+    metrics_enc = evaluate_metrics_encoder_style(
+        final_model, tokenizer, evaluation_data, device, sample_limit=EVAL_SAMPLE_LIMIT
+    )
+    # Visualizations (skip if metrics-only)
+    viz_dir = os.path.join(OUTPUT_DIR, "html_visualizations")
+    save_side_by_side_visualizations(metrics_token["pairs"], viz_dir, limit=4)
+if __name__ == "__main__":
+    main()
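
The resume policy in `main()` above prefers a `{stage}/latest` checkpoint and otherwise falls back to the highest-numbered `{stage}/epoch-XXX` folder. A minimal sketch of that selection over a plain list of repo paths, with no Hub access, might look like the following. `pick_resume_subfolder` and the sample `files` list are hypothetical names introduced for illustration only; they are not part of the repository.

```python
import re
from typing import List, Optional


def pick_resume_subfolder(files: List[str], stage: str) -> Optional[str]:
    """Hypothetical helper: choose the checkpoint subfolder to resume from.

    Prefers '{stage}/latest' when it contains a config.json; otherwise picks the
    highest '{stage}/epoch-XXX' that contains a config.json; otherwise None.
    """
    has_latest = any(
        p.startswith(f"{stage}/latest/") and p.endswith("config.json") for p in files
    )
    if has_latest:
        return f"{stage}/latest"
    pattern = re.compile(rf"^{re.escape(stage)}/epoch-(\d+)/config\.json$")
    epochs = [int(m.group(1)) for m in map(pattern.match, files) if m]
    if not epochs:
        return None
    return f"{stage}/epoch-{max(epochs):03d}"


# Assumed sample listing, shaped like the paths a repo file listing would return.
files = [
    "stage1/epoch-001/config.json",
    "stage1/epoch-020/config.json",
    "stage2_v2/latest/config.json",
]
print(pick_resume_subfolder(files, "stage1"))
print(pick_resume_subfolder(files, "stage2_v2"))
print(pick_resume_subfolder(files, "stage3"))
```

Keeping the selection separate from the download calls makes the fallback order easy to check without network access.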

train.py ADDED

	@@ -0,0 +1,744 @@

+"""
+Training utilities and functions
+"""
+import math
+import os
+import re
+import time
+import json
+import shutil
+import torch
+from datetime import datetime
+from typing import Optional, Dict, Any, List, Tuple
+from torch.optim import AdamW
+from torch.utils.data import DataLoader
+from transformers import TrainingArguments, Trainer, AutoModelForCausalLM, AutoTokenizer
+from transformers.trainer_callback import TrainerCallback
+from huggingface_hub import HfApi, upload_folder, snapshot_download, hf_hub_download
+from config import (
+    BATCH_TRAIN, BATCH_EVAL, GRAD_ACCUM, LR, WARMUP,
+    LOG_STEPS, EVAL_STEPS, SAVE_STEPS, SEED, DTYPE,
+    HUB_REPO_S1, HUB_REPO_S2, HUB_REPO_S3, HF_TOKEN,
+    CHECKPOINTS_DIR, HF_USE_HUB, CHECKPOINT_UPLOAD_INTERVAL_EPOCHS,
+    S1_BATCH_SIZE, S1_LR, S1_EPOCHS, S2_BATCH_SIZE, S2_LR, S2_EPOCHS,
+    PAD_TOKEN, M_START, M_END
+)
+# ======================================================================================
+# Logic from test_overfit.py (Raw Training Loops & HF Utils)
+# ======================================================================================
+def _format_seconds(seconds: float) -> str:
+    """Formats seconds into H:MM:SS or M:SS."""
+    seconds = int(max(0, seconds))
+    h = seconds // 3600
+    m = (seconds % 3600) // 60
+    s = seconds % 60
+    if h > 0:
+        return f"{h:d}:{m:02d}:{s:02d}"
+    return f"{m:d}:{s:02d}"
+def _ensure_dir(path: str) -> None:
+    os.makedirs(path, exist_ok=True)
+def resolve_and_ensure_repo(repo_id: str, hf_auth_token: Optional[str] = None) -> Optional[str]:
+    """
+    Ensures the HF repo exists and returns the fully-qualified repo_id (namespace/repo).
+    Returns None when Hub sync is disabled or no token is available. If the namespace
+    cannot be resolved, the input repo_id is used as-is.
+    """
+    if not HF_USE_HUB:
+        return None
+    token = hf_auth_token or HF_TOKEN
+    if not token:
+        print("⚠️  HF token not found. Set HUGGINGFACE_HUB_TOKEN to enable Hub sync.")
+        return None
+    api = HfApi()
+    try:
+        who = api.whoami(token=token)
+        namespace = who.get("name") or (who.get("orgs", [None])[0] if isinstance(who.get("orgs"), list) else None)
+    except Exception as exc:
+        print(f"⚠️  Unable to resolve HF namespace: {exc}")
+        namespace = None
+    if "/" not in repo_id and namespace:
+        full_repo_id = f"{namespace}/{repo_id}"
+    else:
+        full_repo_id = repo_id
+    try:
+        api.create_repo(
+            repo_id=full_repo_id,
+            token=token,
+            repo_type="model",
+            private=True, # Default to private as in test_overfit config if not specified
+            exist_ok=True,
+        )
+    except Exception as exc:
+        print(f"⚠️  create_repo failed (may already exist): {exc}")
+    return full_repo_id
+def repo_has_stage_latest(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> bool:
+    """Checks if a stage/latest checkpoint exists in the HF repo."""
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return False
+    api = HfApi()
+    try:
+        files = api.list_repo_files(repo_id=repo_id, repo_type="model", token=token)
+        return any(path.startswith(f"{stage}/latest/") and path.endswith("config.json") for path in files)
+    except Exception as exc:
+        print(f"⚠️  Could not list files for {repo_id}: {exc}")
+        return False
+def repo_list_epoch_numbers(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> List[int]:
+    """
+    Returns sorted list of epoch numbers available under {stage}/epoch-XXX/ by scanning files.
+    """
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return []
+    api = HfApi()
+    try:
+        files = api.list_repo_files(repo_id=repo_id, repo_type="model", token=token)
+    except Exception as exc:
+        print(f"⚠️  Could not list files for {repo_id}: {exc}")
+        return []
+    epoch_numbers: List[int] = []
+    pattern = re.compile(rf"^{re.escape(stage)}/epoch-(\d+)/config\.json$")
+    for path in files:
+        m = pattern.match(path)
+        if m:
+            try:
+                epoch_numbers.append(int(m.group(1)))
+            except ValueError:
+                pass
+    return sorted(set(epoch_numbers))
+def repo_get_latest_epoch_subfolder(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> Optional[str]:
+    """
+    Returns subfolder path like '{stage}/epoch-XXX' for the highest available epoch, or None.
+    """
+    epochs = repo_list_epoch_numbers(repo_id, stage, hf_auth_token)
+    if not epochs:
+        return None
+    latest = max(epochs)
+    return f"{stage}/epoch-{latest:03d}"
+def load_model_and_tokenizer_from_hf_subfolder(repo_id: str, subfolder: str, hf_auth_token: Optional[str] = None) -> Optional[Tuple[AutoModelForCausalLM, AutoTokenizer]]:
+    """
+    Loads model and tokenizer from HF under a specific subfolder.
+    """
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return None
+    print(f"\n---> Loading checkpoint from Hugging Face: {repo_id} (subfolder='{subfolder}')")
+    try:
+        # Pass the token explicitly: repos are created private by default, so anonymous reads would fail
+        tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True, token=token)
+        model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True, token=token)
+    except Exception as exc:
+        print(f"⚠️  Failed to load model/tokenizer from subfolder '{subfolder}': {exc}")
+        return None
+    if tokenizer.pad_token is None:
+        tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+def download_training_state_from_subfolder(repo_id: str, subfolder: str, hf_auth_token: Optional[str] = None) -> Optional[Dict[str, Any]]:
+    """
+    Downloads training_state.json from a specific subfolder.
+    """
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return None
+    try:
+        state_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{subfolder}/training_state.json",
+            repo_type="model",
+            token=token,
+        )
+        with open(state_path, "r", encoding="utf-8") as f:
+            return json.load(f)
+    except Exception:
+        return None
+def download_training_state(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> Optional[Dict[str, Any]]:
+    """Downloads training_state.json from HF if present."""
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return None
+    try:
+        state_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{stage}/latest/training_state.json",
+            repo_type="model",
+            token=token,
+        )
+        with open(state_path, "r", encoding="utf-8") as f:
+            return json.load(f)
+    except Exception:
+        return None
+def download_optimizer_state(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> Optional[str]:
+    """Downloads optimizer.pt for resuming optimizer state."""
+    token = hf_auth_token or HF_TOKEN
+    if not HF_USE_HUB or not token:
+        return None
+    try:
+        opt_path = hf_hub_download(
+            repo_id=repo_id,
+            filename=f"{stage}/latest/optimizer.pt",
+            repo_type="model",
+            token=token,
+        )
+        return opt_path
+    except Exception:
+        return None
+def load_model_and_tokenizer_from_hf(repo_id: str, stage: str, hf_auth_token: Optional[str] = None) -> Optional[Tuple[AutoModelForCausalLM, AutoTokenizer]]:
+    """
+    Loads model and tokenizer from HF under subfolder {stage}/latest if available.
+    """
+    token = hf_auth_token or HF_TOKEN
+    if not repo_has_stage_latest(repo_id, stage, hf_auth_token):
+        return None
+    print(f"\n---> Loading checkpoint from Hugging Face: {repo_id} (subfolder='{stage}/latest')")
+    # Pass the token explicitly: repos are created private by default, so anonymous reads would fail
+    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=f"{stage}/latest", trust_remote_code=True, token=token)
+    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=f"{stage}/latest", trust_remote_code=True, token=token)
+    if tokenizer.pad_token is None:
+        tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
+    model.resize_token_embeddings(len(tokenizer))
+    model.config.pad_token_id = tokenizer.pad_token_id
+    return model, tokenizer
+def save_and_push_checkpoint(
+    stage: str,
+    epoch_index_zero_based: int,
+    model: AutoModelForCausalLM,
+    tokenizer: AutoTokenizer,
+    optimizer: AdamW,
+    avg_loss: float,
+    dataloader_len: int,
+    batch_size: int,
+    total_epochs: int,
+    repo_id: Optional[str],
+    hf_auth_token: Optional[str] = None
+) -> None:
+    """
+    Saves checkpoint locally and pushes to HF.
+    """
+    token = hf_auth_token or HF_TOKEN
+    epoch_number = epoch_index_zero_based + 1
+    stage_dir = os.path.join(CHECKPOINTS_DIR, stage)
+    epoch_dir_name = f"epoch-{epoch_number:03d}"
+    epoch_dir = os.path.join(stage_dir, epoch_dir_name)
+    latest_dir = os.path.join(stage_dir, "latest")
+    _ensure_dir(epoch_dir)
+    _ensure_dir(stage_dir)
+    # Save model + tokenizer
+    model.save_pretrained(epoch_dir)
+    tokenizer.save_pretrained(epoch_dir)
+    # Save optimizer state
+    torch.save(optimizer.state_dict(), os.path.join(epoch_dir, "optimizer.pt"))
+    # Save training state
+    training_state = {
+        "stage": stage,
+        "epoch_completed": epoch_number,
+        "total_epochs_for_stage": total_epochs,
+        "global_step": epoch_number * dataloader_len,
+        "avg_loss": float(avg_loss),
+        "batch_size": batch_size,
+        # UTC timestamp via time.gmtime(); datetime.utcnow() is deprecated since Python 3.12
+        "saved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+    }
+    with open(os.path.join(epoch_dir, "training_state.json"), "w", encoding="utf-8") as f:
+        json.dump(training_state, f, ensure_ascii=False, indent=2)
+    # Update "latest"
+    if os.path.exists(latest_dir):
+        shutil.rmtree(latest_dir)
+    shutil.copytree(epoch_dir, latest_dir)
+    # Push to Hugging Face
+    if HF_USE_HUB and repo_id and token:
+        try:
+            upload_folder(
+                repo_id=repo_id,
+                folder_path=epoch_dir,
+                path_in_repo=f"{stage}/{epoch_dir_name}",
+                repo_type="model",
+                token=token,
+                commit_message=f"{stage}: save {epoch_dir_name}",
+            )
+            upload_folder(
+                repo_id=repo_id,
+                folder_path=latest_dir,
+                path_in_repo=f"{stage}/latest",
+                repo_type="model",
+                token=token,
+                commit_message=f"{stage}: update latest -> {epoch_dir_name}",
+            )
+            print(f"☁️  Pushed checkpoint to HF: {repo_id} ({stage}/{epoch_dir_name} and {stage}/latest)")
+        except Exception as exc:
+            print(f"⚠️  Failed to push checkpoint to HF: {exc}")
+    else:
+        print("ℹ️  Skipped HF push (Hub disabled or token/repo missing).")
+def train_stage1_raw(
+    model,
+    tokenizer,
+    data: List[Dict[str, Any]],
+    device,
+    start_epoch: int = 0,
+    hf_repo_id: Optional[str] = None,
+):
+    """Trains the model on motion sequences only to learn the 'language of motion'."""
+    from data import MotionDataset # Import here to avoid circular imports
+    print("\n" + "="*80)
+    print("      STAGE 1: MOTION LANGUAGE MODELING (PRE-TRAINING)")
+    print(f"      Training on {len(data)} samples.")
+    print("="*80)
+    dataset = MotionDataset(data, tokenizer)
+    dataloader = DataLoader(dataset, batch_size=S1_BATCH_SIZE, shuffle=True)
+    optimizer = AdamW(model.parameters(), lr=S1_LR)
+    model.to(device)
+    model.train()
+    # Try to resume optimizer if we resumed from HF
+    token = HF_TOKEN
+    if hf_repo_id and start_epoch > 0 and HF_USE_HUB and token:
+        opt_path = download_optimizer_state(hf_repo_id, "stage1", token)
+        if opt_path is not None:
+            try:
+                optimizer.load_state_dict(torch.load(opt_path, map_location=device))
+                print("↩️  Resumed optimizer state for Stage 1 from HF.")
+            except Exception as exc:
+                print(f"⚠️  Failed to load optimizer state for Stage 1: {exc}")
+    for epoch in range(start_epoch, S1_EPOCHS):
+        total_loss = 0
+        total_batches = len(dataloader)
+        epoch_start_time = time.time()
+        step_interval = max(1, total_batches // 50)  # ~2% progress updates
+        for i, batch in enumerate(dataloader, 1):
+            optimizer.zero_grad()
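+            input_ids = batch['input_ids'].squeeze(1).to(device)
+            attention_mask = batch['attention_mask'].squeeze(1).to(device)
+            # Exclude padded positions from the loss (-100 is ignored by the causal-LM loss)
+            labels = input_ids.masked_fill(attention_mask == 0, -100)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)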
+            loss = outputs.loss
+            loss.backward()
+            optimizer.step()
+            total_loss += loss.item()
+            # Progress with ETA
+            if i == 1 or (i % step_interval == 0) or (i == total_batches):
+                elapsed = time.time() - epoch_start_time
+                est_total = (elapsed / i) * total_batches
+                eta = est_total - elapsed
+                pct = (i / total_batches) * 100.0
+                print(
+                    f"\r[Stage 1] Epoch {epoch+1}/{S1_EPOCHS} - "
+                    f"{i}/{total_batches} ({pct:.1f}%) - ETA {_format_seconds(eta)}",
+                    end="",
+                    flush=True,
+                )
+        # Finish the progress line
+        print()
+        avg_loss = total_loss / max(1, len(dataloader))  # guard against an empty dataloader
+        print(f"--- End of Epoch {epoch+1}/{S1_EPOCHS}, Average Loss: {avg_loss:.4f} ---")
+        # Save checkpoint locally every epoch; push to HF only at interval or final epoch
+        push_this_epoch = ((epoch + 1) % CHECKPOINT_UPLOAD_INTERVAL_EPOCHS == 0) or ((epoch + 1) == S1_EPOCHS)
+        repo_for_epoch = hf_repo_id if push_this_epoch else None
+        save_and_push_checkpoint(
+            stage="stage1",
+            epoch_index_zero_based=epoch,
+            model=model,
+            tokenizer=tokenizer,
+            optimizer=optimizer,
+            avg_loss=avg_loss,
+            dataloader_len=len(dataloader),
+            batch_size=S1_BATCH_SIZE,
+            total_epochs=S1_EPOCHS,
+            repo_id=repo_for_epoch,
+            hf_auth_token=token
+        )
+    print("\n✅ Stage 1 Training Complete.")
+    return model
+def train_stage2_raw(
+    model,
+    tokenizer,
+    data: List[Dict[str, Any]],
+    device,
+    start_epoch: int = 0,
+    hf_repo_id: Optional[str] = None,
+    hf_stage_subdir: str = "stage2",
+):
+    """Fine-tunes the motion-aware model to connect text prompts to motions."""
+    from data import TextMotionDataset # Import here to avoid circular imports
+    print("\n" + "="*80)
+    print("      STAGE 2: TEXT-TO-MOTION FINE-TUNING")
+    print(f"      Training on {len(data)} samples.")
+    print("="*80)
+    dataset = TextMotionDataset(data, tokenizer)
+    dataloader = DataLoader(dataset, batch_size=S2_BATCH_SIZE, shuffle=True)
+    optimizer = AdamW(model.parameters(), lr=S2_LR)
+    model.to(device)
+    model.train()
+    # Try to resume optimizer if we resumed from HF
+    token = HF_TOKEN
+    if hf_repo_id and start_epoch > 0 and HF_USE_HUB and token:
+        opt_path = download_optimizer_state(hf_repo_id, hf_stage_subdir, token)
+        if opt_path is not None:
+            try:
+                optimizer.load_state_dict(torch.load(opt_path, map_location=device))
+                print("↩️  Resumed optimizer state for Stage 2 from HF.")
+            except Exception as exc:
+                print(f"⚠️  Failed to load optimizer state for Stage 2: {exc}")
+    for epoch in range(start_epoch, S2_EPOCHS):
+        total_loss = 0
+        total_batches = len(dataloader)
+        epoch_start_time = time.time()
+        step_interval = max(1, total_batches // 50)  # ~2% progress updates
+        for i, batch in enumerate(dataloader, 1):
+            optimizer.zero_grad()
+            input_ids = batch['input_ids'].to(device)
+            attention_mask = batch['attention_mask'].to(device)
+            labels = batch['labels'].to(device)
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
+            loss = outputs.loss
+            loss.backward()
+            optimizer.step()
+            total_loss += loss.item()
+            # Progress with ETA
+            if i == 1 or (i % step_interval == 0) or (i == total_batches):
+                elapsed = time.time() - epoch_start_time
+                est_total = (elapsed / i) * total_batches
+                eta = est_total - elapsed
+                pct = (i / total_batches) * 100.0
+                print(
+                    f"\r[Stage 2] Epoch {epoch+1}/{S2_EPOCHS} - "
+                    f"{i}/{total_batches} ({pct:.1f}%) - ETA {_format_seconds(eta)}",
+                    end="",
+                    flush=True,
+                )
+        # Finish the progress line
+        print()
+        avg_loss = total_loss / max(1, len(dataloader))  # guard against an empty dataloader
+        print(f"--- End of Epoch {epoch+1}/{S2_EPOCHS}, Average Loss: {avg_loss:.4f} ---")
+        # Save checkpoint locally every epoch; push to HF only at interval or final epoch
+        push_this_epoch = ((epoch + 1) % CHECKPOINT_UPLOAD_INTERVAL_EPOCHS == 0) or ((epoch + 1) == S2_EPOCHS)
+        repo_for_epoch = hf_repo_id if push_this_epoch else None
+        save_and_push_checkpoint(
+            stage=hf_stage_subdir,
+            epoch_index_zero_based=epoch,
+            model=model,
+            tokenizer=tokenizer,
+            optimizer=optimizer,
+            avg_loss=avg_loss,
+            dataloader_len=len(dataloader),
+            batch_size=S2_BATCH_SIZE,
+            total_epochs=S2_EPOCHS,
+            repo_id=repo_for_epoch,
+            hf_auth_token=token
+        )
+    print("\n✅ Stage 2 Training Complete.")
+    return model
+# ======================================================================================
+# Existing Utilities
+# ======================================================================================
+def make_training_args(out_dir: str, epochs: int, two_point_hub: bool = False) -> TrainingArguments:
+    """
+    Create TrainingArguments for a training stage
+    """
+    return TrainingArguments(
+        output_dir=out_dir,
+        per_device_train_batch_size=BATCH_TRAIN,
+        per_device_eval_batch_size=BATCH_EVAL,
+        gradient_accumulation_steps=GRAD_ACCUM,
+        learning_rate=LR,
+        num_train_epochs=epochs,
+        logging_steps=LOG_STEPS,
+        eval_strategy="steps",
+        eval_steps=EVAL_STEPS,
+        # When using two-point hub checkpointing, disable periodic local saves and rely on forced saves
+        save_steps=(10**12 if two_point_hub else SAVE_STEPS),
+        save_total_limit=2,
+        warmup_ratio=WARMUP,
+        bf16=(DTYPE == torch.bfloat16),
+        fp16=(DTYPE == torch.float16),
+        lr_scheduler_type="cosine",
+        optim="adamw_torch",
+        report_to="none",
+        seed=SEED,
+        remove_unused_columns=False,
+    )
+def latest_hub_checkpoint(repo_id: str) -> Optional[str]:
+    """
+    Download and return the local path to the latest checkpoint folder from the Hub.
+    Returns None if no checkpoint exists or on failure.
+    """
+    api = HfApi()
+    try:
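+        # HF_TOKEN may be None; in that case huggingface_hub falls back to any locally stored credentials
+        files = api.list_repo_files(repo_id=repo_id, repo_type="model", token=HF_TOKEN)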
+    except Exception as e:
+        print(f"Hub list failed for {repo_id}: {e}")
+        return None
+    def _step_key(dirname: str) -> int:
+        nums = re.findall(r"\d+", dirname)
+        return int(nums[-1]) if nums else -1
+    ckpt_dirs = sorted(
+        {p.split('/')[0] for p in files if p.startswith("checkpoint-")},
+        key=_step_key,
+    )
+    if not ckpt_dirs:
+        return None
+    latest = ckpt_dirs[-1]
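+    # Return None on download failure, as the docstring promises
+    try:
+        local_root = snapshot_download(
+            repo_id=repo_id,
+            repo_type="model",
+            allow_patterns=[f"{latest}/**", "trainer_state.json"],
+            token=HF_TOKEN,
+        )
+    except Exception as e:
+        print(f"Hub download failed for {repo_id}: {e}")
+        return None
+    return os.path.join(local_root, latest)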
+class TwoPointHubCheckpointCallback(TrainerCallback):
+    """
+    Save to Hugging Face Hub exactly twice per training: halfway and at final step.
+    Keeps only the most recent N checkpoints on Hub.
+    """
+    def __init__(self, repo_id: str, keep_last: int = 2, token: Optional[str] = None):
+        self.repo_id = repo_id
+        self.keep_last = keep_last
+        self.api = HfApi()
+        self.token = token or os.environ.get("HUGGINGFACE_HUB_TOKEN")
+        self._half_step: Optional[int] = None
+        self._final_step: Optional[int] = None
+        self._saved_steps = set()
+        self._pending_push_for_step: Optional[int] = None
+        try:
+            self.api.create_repo(repo_id=self.repo_id, private=True, exist_ok=True, token=self.token)
+        except Exception as e:
+            print(f"Could not ensure repo exists: {e}")
+    def _enforce_keep_last(self) -> None:
+        try:
+            files = self.api.list_repo_files(repo_id=self.repo_id, repo_type="model", token=self.token)
+            def _step_key(dirname: str) -> int:
+                nums = re.findall(r"\d+", dirname)
+                return int(nums[-1]) if nums else -1
+            ckpt_dirs = sorted(
+                {p.split('/')[0] for p in files if p.startswith("checkpoint-")},
+                key=_step_key,
+            )
+            if len(ckpt_dirs) <= self.keep_last:
+                return
+            to_delete = ckpt_dirs[:-self.keep_last]
+            for d in to_delete:
+                for f in [p for p in files if p.startswith(f"{d}/")]:
+                    try:
+                        self.api.delete_file(path_in_repo=f, repo_id=self.repo_id, repo_type="model", token=self.token)
+                    except Exception as e:
+                        print(f"Failed deleting {f}: {e}")
+        except Exception as e:
+            print(f"Keep-last enforcement failed: {e}")
+    def on_train_begin(self, args, state, control, **kwargs):
+        # Prefer Trainer-computed max_steps
+        if state.max_steps and state.max_steps > 0:
+            self._half_step = max(1, state.max_steps // 2)
+            self._final_step = state.max_steps
+            print(f"Two-point checkpointing: half={self._half_step}, final={self._final_step}")
+        else:
+            # Best-effort fallback using dataloader length and grad accumulation if available
+            td = kwargs.get("train_dataloader")
+            if td is not None and args.gradient_accumulation_steps > 0:
+                steps_per_epoch = math.ceil(len(td) / args.gradient_accumulation_steps)
+                self._final_step = steps_per_epoch * int(args.num_train_epochs)
+                self._half_step = max(1, self._final_step // 2)
+                print(f"Two-point checkpointing (approx): half={self._half_step}, final={self._final_step}")
+    def on_step_end(self, args, state, control, **kwargs):
+        if not self._final_step:
+            return control
+        gs = state.global_step
+        if gs == self._half_step and gs not in self._saved_steps:
+            control.should_save = True
+            self._pending_push_for_step = gs
+        if gs == self._final_step and gs not in self._saved_steps:
+            control.should_save = True
+            self._pending_push_for_step = gs
+        return control
+    def on_save(self, args, state, control, **kwargs):
+        # Push only when we triggered this save
+        if self._pending_push_for_step is None:
+            return control
+        step = self._pending_push_for_step
+        self._pending_push_for_step = None
+        self._saved_steps.add(step)
+        ckpt_dirname = f"checkpoint-{step}"
+        try:
+            upload_folder(
+                repo_id=self.repo_id,
+                folder_path=args.output_dir,
+                repo_type="model",
+                token=self.token,
+                commit_message=f"upload {ckpt_dirname}",
+                allow_patterns=[f"{ckpt_dirname}/**", "trainer_state.json"],
+            )
+            self._enforce_keep_last()
+            print(f"Pushed {ckpt_dirname} to {self.repo_id}")
+        except Exception as e:
+            print(f"Hub upload failed for {ckpt_dirname}: {e}")
+        return control
+def train_stage(
+    stage_name: str,
+    model,
+    tokenizer,
+    train_dataset,
+    eval_dataset,
+    data_collator,
+    out_dir: str,
+    epochs: int,
+    hub_repo: Optional[str] = None,
+):
+    """
+    Train a single stage
+    """
+    print(f"\n{'='*60}")
+    print(f"Training {stage_name}")
+    print(f"{'='*60}")
+    # Auto-select Hub repo by stage if not provided
+    if hub_repo is None:
+        s = (stage_name or "").lower()
+        if s.startswith("stage1"):
+            hub_repo = HUB_REPO_S1
+        elif s.startswith("stage2"):
+            hub_repo = HUB_REPO_S2
+        elif s.startswith("stage3"):
+            hub_repo = HUB_REPO_S3
+    args = make_training_args(out_dir, epochs, two_point_hub=(hub_repo is not None))
+    trainer = Trainer(
+        model=model,
+        tokenizer=tokenizer,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        args=args,
+        data_collator=data_collator,
+    )
+    # Train-loss early stop (match test_overfit behavior)
+    class TrainLossStopCallback(TrainerCallback):
+        def __init__(self, threshold: float = 1.0):
+            self.threshold = float(threshold)
+            self.triggered = False
+        def on_log(self, args, state, control, logs=None, **kwargs):
+            if logs is None:
+                return control
+            loss = logs.get("loss")
+            if loss is not None and loss < self.threshold and state.global_step > 0 and not self.triggered:
+                self.triggered = True
+                print(f"\nTrain-loss early stop: loss={loss:.4f} < {self.threshold}")
+                control.should_training_stop = True
+            return control
+    trainer.add_callback(TrainLossStopCallback(threshold=1.0))
+    # Add two-point Hub checkpoint uploader if configured
+    if hub_repo:
+        # Pass token if available to avoid auth prompts in Kaggle/Colab
+        token = HF_TOKEN if isinstance(HF_TOKEN, str) and len(HF_TOKEN) > 0 else None
+        trainer.add_callback(TwoPointHubCheckpointCallback(hub_repo, token=token))
+    # Train (with auto-resume from Hub if available)
+    resume_path = latest_hub_checkpoint(hub_repo) if hub_repo else None
+    if resume_path:
+        print(f"Resuming from Hub checkpoint: {resume_path}")
+        trainer.train(resume_from_checkpoint=resume_path)
+    else:
+        print(f"Starting training for {stage_name}...")
+        trainer.train()
+    # Evaluate
+    print(f"Evaluating {stage_name}...")
+    metrics = trainer.evaluate()
+    # Compute perplexity
+    eval_loss = metrics.get("eval_loss", float("nan"))
+    ppl = math.exp(eval_loss) if not math.isnan(eval_loss) else float("nan")
+    print(f"\n{stage_name} Results:")
+    print(f"  eval_loss: {eval_loss:.4f}")
+    print(f"  perplexity: {ppl:.3f}")
+    # Save model (optional - can be commented out to save space)
+    # trainer.save_model(out_dir)
+    # print(f"Model saved to {out_dir}")
+    return metrics
+def save_model_to_hub(model, tokenizer, repo_id: str, stage_name: str):
+    """
+    Save model and tokenizer to HuggingFace Hub
+    """
+    print(f"\nSaving {stage_name} to HuggingFace Hub: {repo_id}")
+    model.push_to_hub(repo_id, commit_message=f"Upload {stage_name}")
+    tokenizer.push_to_hub(repo_id, commit_message=f"Upload {stage_name}")
+    print(f"Successfully saved {stage_name}")
+def load_model_from_hub(repo_id: str):
+    """
+    Load model and tokenizer from HuggingFace Hub
+    """
+    from unsloth import FastLanguageModel
+    from config import MAX_SEQ_LEN, DTYPE
+    print(f"\nLoading model from HuggingFace Hub: {repo_id}")
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name=repo_id,
+        max_seq_length=MAX_SEQ_LEN,
+        dtype=DTYPE,
+        load_in_4bit=True,
+    )
+    print(f"Successfully loaded model from {repo_id}")
+    return model, tokenizer
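
The checkpoint bookkeeping in `latest_hub_checkpoint` and `TwoPointHubCheckpointCallback` above can be exercised without touching the Hub. The following is a minimal sketch, not the repository's code: `checkpoint_dirs`, `split_keep_delete`, and `two_point_steps` are hypothetical helper names introduced here, and the file list is made up for illustration. It shows why the directories are sorted by their numeric step rather than as strings, and how the half and final steps are derived in the fallback path.

```python
import math
import re
from typing import Iterable, List, Tuple


def step_key(dirname: str) -> int:
    """Return the last integer in a directory name, or -1 if it has none."""
    nums = re.findall(r"\d+", dirname)
    return int(nums[-1]) if nums else -1


def checkpoint_dirs(files: Iterable[str]) -> List[str]:
    """Collect top-level checkpoint-* directories, ordered by step number."""
    dirs = {p.split("/")[0] for p in files if p.startswith("checkpoint-")}
    return sorted(dirs, key=step_key)


def split_keep_delete(files: Iterable[str], keep_last: int = 2) -> Tuple[List[str], List[str]]:
    """Split checkpoint directories into (kept, deleted); assumes keep_last >= 1."""
    ordered = checkpoint_dirs(files)
    if len(ordered) <= keep_last:
        return ordered, []
    return ordered[-keep_last:], ordered[:-keep_last]


def two_point_steps(num_batches: int, grad_accum: int, epochs: int) -> Tuple[int, int]:
    """Approximate (half_step, final_step) from dataloader length and accumulation."""
    steps_per_epoch = math.ceil(num_batches / grad_accum)
    final_step = steps_per_epoch * int(epochs)
    return max(1, final_step // 2), final_step


if __name__ == "__main__":
    # Hypothetical repo listing; a plain string sort would place 1000 before 900.
    repo_files = [
        "checkpoint-900/model.safetensors",
        "checkpoint-1000/model.safetensors",
        "checkpoint-100/config.json",
        "trainer_state.json",
    ]
    print(checkpoint_dirs(repo_files))
    print(split_keep_delete(repo_files, keep_last=2))
    print(two_point_steps(num_batches=125, grad_accum=4, epochs=3))
```

Sorting by the parsed step keeps `checkpoint-1000` newer than `checkpoint-900`, which is what both the resume logic and the keep-last pruning rely on.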

train_mgpt_vqvae.py ADDED

	@@ -0,0 +1,438 @@

+import os
+import pickle
+import zipfile
+import torch
+import torch.nn as nn
+import pandas as pd
+import numpy as np
+from torch.utils.data import Dataset, DataLoader
+import glob
+import warnings
+import json
+import time
+from datetime import datetime
+import random
+import math
+import matplotlib.pyplot as plt
+import sys
+# Add the mGPT directory to the path
+sys.path.append('/kaggle/working')
+from mGPT.archs.mgpt_vq import VQVae
+warnings.filterwarnings("ignore")
+# Configuration
+DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+DATA_ROOT = '/kaggle/working/extracted_files'
+CHECKPOINT_DIR = '/kaggle/working/checkpoints_mgpt'
+os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+print("Device:", DEVICE)
+# ──────────────────────────────────────────────────────────
+# Enhanced Dataset with File Tracking and Batching (UNCHANGED)
+# ──────────────────────────────────────────────────────────
+def load_smplx_from_folder(folder_path):
+    all_frame_dicts = []
+    for pkl_file in sorted(glob.glob(os.path.join(folder_path, '*.pkl'))):
+        try:
+            with open(pkl_file, 'rb') as f:
+                data = pickle.load(f)
+                if isinstance(data, list):
+                    all_frame_dicts.extend(data)
+                elif isinstance(data, dict):
+                    all_frame_dicts.append(data)
+        except Exception:
+            continue
+    if not all_frame_dicts:
+        return None
+    param_keys = ['shape','body_pose','lhand_pose','rhand_pose','jaw_pose',
+                  'expression','root_pose','cam_trans']
+    param_dims = [10,63,45,45,3,10,3,3]
+    sequences = []
+    for frame in all_frame_dicts:
+        vec = []
+        for key, dim in zip(param_keys, param_dims):
+            arr = np.zeros(dim)
+            if key in frame and frame[key] is not None:
+                v = np.array(frame[key]).flatten()
+                arr[:min(len(v), dim)] = v[:dim]
+            vec.append(arr)
+        sequences.append(np.concatenate(vec))
+    return torch.tensor(np.stack(sequences), dtype=torch.float32)
+class EnhancedMotionDataset(Dataset):
+    def __init__(self, root_dir, processed_files_path, batch_folders=1000):
+        self.root_dir = root_dir
+        self.processed_files_path = processed_files_path
+        self.batch_folders = batch_folders
+        print(f"\n[DEBUG] Initializing Dataset.")
+        print(f"[DEBUG] Root directory: '{self.root_dir}'")
+        if not os.path.exists(self.root_dir):
+            print(f"[DEBUG] ERROR: The root directory '{self.root_dir}' does not exist!")
+            self.all_folders = []
+        else:
+            print(f"[DEBUG] Root directory exists.")
+            glob_path = os.path.join(root_dir, '*')
+            print(f"[DEBUG] Using glob pattern: '{glob_path}'")
+            all_paths = glob.glob(glob_path)
+            print(f"[DEBUG] Glob found {len(all_paths)} total paths.")
+            self.all_folders = [d for d in all_paths if os.path.isdir(d)]
+            print(f"[DEBUG] Found {len(self.all_folders)} directories.")
+        self.processed = self._load_processed()
+        print(f"[DEBUG] Loaded {len(self.processed)} processed folder paths.")
+        self.unprocessed = [f for f in self.all_folders if f not in self.processed]
+        print(f"[DEBUG] Found {len(self.unprocessed)} unprocessed folders.")
+        self._prep_batch()
+    def _load_processed(self):
+        if os.path.exists(self.processed_files_path):
+            with open(self.processed_files_path, 'r') as f:
+                return json.load(f)
+        return []
+    def _save_processed(self):
+        with open(self.processed_files_path, 'w') as f:
+            json.dump(self.processed, f)
+    def _prep_batch(self):
+        self.current = self.unprocessed[:self.batch_folders]
+        self.samples = self.current.copy()
+        print(f"→ Loading {len(self.samples)} folders this batch")
+    def mark_batch_as_processed(self):
+        self.processed += self.current
+        self._save_processed()
+    def get_next_batch(self):
+        all_folders = [d for d in glob.glob(os.path.join(self.root_dir, '*')) if os.path.isdir(d)]
+        self.processed = self._load_processed()
+        self.unprocessed = [f for f in all_folders if f not in self.processed]
+        if not self.unprocessed:
+            print("✅ All data processed")
+            return False
+        self._prep_batch()
+        return True
+    def __len__(self):
+        return len(self.samples)
+    def __getitem__(self, idx):
+        seq = load_smplx_from_folder(self.samples[idx])
+        if seq is None or seq.shape[0] < 64:
+            return None
+        return seq
+# ──────────────────────────────────────────────────────────
+# Checkpoint Management (UNCHANGED)
+# ──────────────────────────────────────────────────────────
+class CheckpointManager:
+    def __init__(self, checkpoint_dir, max_checkpoints=2):
+        self.checkpoint_dir = checkpoint_dir
+        self.max_checkpoints = max_checkpoints
+    def save_checkpoint(self, model, optimizer, epoch, batch_idx, loss, metadata=None):
+        checkpoint = {
+            'epoch': epoch,
+            'batch_idx': batch_idx,
+            'model_state_dict': model.state_dict(),
+            'optimizer_state_dict': optimizer.state_dict(),
+            'loss': loss,
+            'timestamp': datetime.now().isoformat(),
+            'metadata': metadata or {}
+        }
+        checkpoint_path = os.path.join(
+            self.checkpoint_dir,
+            f'mgpt_vqvae_epoch_{epoch:03d}_batch_{batch_idx:04d}.pt'
+        )
+        torch.save(checkpoint, checkpoint_path)
+        print(f"Saved checkpoint: {checkpoint_path}")
+        self.cleanup_old_checkpoints()
+        return checkpoint_path
+    def cleanup_old_checkpoints(self):
+        checkpoints = glob.glob(os.path.join(self.checkpoint_dir, 'mgpt_vqvae_epoch_*.pt'))
+        checkpoints.sort(key=os.path.getmtime, reverse=True)
+        if len(checkpoints) > self.max_checkpoints:
+            for checkpoint in checkpoints[self.max_checkpoints:]:
+                os.remove(checkpoint)
+                print(f"Removed old checkpoint: {checkpoint}")
+    def load_latest_checkpoint(self):
+        checkpoints = glob.glob(os.path.join(self.checkpoint_dir, 'mgpt_vqvae_epoch_*.pt'))
+        if not checkpoints:
+            return None
+        latest_checkpoint = max(checkpoints, key=os.path.getmtime)
+        print(f"Loading checkpoint: {latest_checkpoint}")
+        return torch.load(latest_checkpoint, map_location=DEVICE)
+    def get_checkpoint_info(self):
+        checkpoints = glob.glob(os.path.join(self.checkpoint_dir, 'mgpt_vqvae_epoch_*.pt'))
+        return len(checkpoints), checkpoints
+# ──────────────────────────────────────────────────────────
+# Enhanced Training Function with MotionGPT VQ-VAE
+# ──────────────────────────────────────────────────────────
+def train_mgpt_vqvae(vq_model, dataset, epochs_per_batch=20, batch_size=16, lr=1e-4):
+    print("\n" + "="*70)
+    print("      STARTING MGPT VQ-VAE TRAINING WITH CHECKPOINTING      ")
+    print("="*70)
+    optimizer = torch.optim.AdamW(vq_model.parameters(), lr=lr, weight_decay=1e-4)
+    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
+    loss_fn = nn.SmoothL1Loss(reduction='none')
+    checkpoint_manager = CheckpointManager(CHECKPOINT_DIR)
+    checkpoint = checkpoint_manager.load_latest_checkpoint()
+    global_epoch = 1
+    if checkpoint:
+        vq_model.load_state_dict(checkpoint['model_state_dict'])
+        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
+        global_epoch = checkpoint.get('metadata', {}).get('global_epoch', checkpoint['epoch'])
+        print(f"Resumed from GLOBAL epoch {global_epoch}")
+    vq_model.to(DEVICE).train()
+    # Define per-feature loss weights for the SMPL-X parameter vector
+    param_dims = [10, 63, 45, 45, 3, 10, 3, 3]
+    param_starts = np.cumsum([0] + param_dims[:-1]).tolist()
+    smpl_dim = sum(param_dims)
+    loss_weights = torch.ones(smpl_dim, device=DEVICE)
+    loss_weights[param_starts[1]:param_starts[5]] = 10.0  # pose parameters
+    loss_weights[param_starts[0]:param_starts[1]] = 5.0   # shape parameters
+    loss_weights[param_starts[5]:param_starts[6]] = 8.0   # expression parameters
+    def log_codebook_analysis(x_recon, loss, perplexity, epoch, batch_idx):
+        # Extract encoded indices for analysis
+        with torch.no_grad():
+            x_in = vq_model.preprocess(x_recon[:1])  # Use reconstructed sample for analysis
+            x_encoder = vq_model.encoder(x_in)
+            x_flat = vq_model.quantizer.preprocess(x_encoder)
+            indices = vq_model.quantizer.quantize(x_flat)
+        unique_codes = torch.unique(indices)
+        usage_percentage = (len(unique_codes) / vq_model.quantizer.nb_code) * 100
+        print(f"[ANALYSIS] Epoch {epoch}, Batch {batch_idx}")
+        print(f"Unique codes used: {len(unique_codes)}/{vq_model.quantizer.nb_code} ({usage_percentage:.1f}%)")
+        print(f"Perplexity: {perplexity:.2f}")
+        return usage_percentage, indices
+    def save_reconstruction_sample(x, x_recon, lengths, epoch):
+        original_seq = x[0, :lengths[0]].cpu().numpy()
+        recon_seq = x_recon[0, :lengths[0]].cpu().numpy()
+        filename = os.path.join(CHECKPOINT_DIR, f'mgpt_recon_epoch_{epoch}.npz')
+        np.savez(filename, original=original_seq, reconstructed=recon_seq)
+        print(f"Saved reconstruction sample to {filename}")
+        mse = ((original_seq - recon_seq) ** 2).mean()
+        print(f"Reconstruction MSE: {mse:.6f}")
+        return mse
+    def collate_fn_enhanced(batch):
+        batch = [item for item in batch if item is not None]
+        if not batch:
+            return None
+        batch.sort(key=lambda x: x.shape[0], reverse=True)
+        max_len = batch[0].shape[0]
+        max_len = min(max_len, 256)
+        downsampling_factor = 8
+        padded_max_len = math.ceil(max_len / downsampling_factor) * downsampling_factor
+        padded_batch = torch.zeros(len(batch), padded_max_len, batch[0].shape[1])
+        lengths = []
+        for i, x in enumerate(batch):
+            length = min(x.shape[0], padded_max_len)
+            padded_batch[i, :length, :] = x[:length, :]
+            lengths.append(length)
+        return padded_batch, torch.tensor(lengths)
+    while True:
+        print(f"\n{'='*50}")
+        print(f"Processing file batch with {len(dataset)} files")
+        print(f"{'='*50}")
+        if len(dataset) == 0:
+            if not dataset.get_next_batch():
+                print("✅ All data processed! Training complete.")
+                break
+            continue
+        dataloader = DataLoader(
+            dataset, batch_size=batch_size, shuffle=True,
+            num_workers=0, collate_fn=collate_fn_enhanced, drop_last=True
+        )
+        for epoch in range(global_epoch, global_epoch + epochs_per_batch):
+            epoch_losses, epoch_vq_losses, epoch_rec_losses = [], [], []
+            codebook_usage_history = []
+            epoch_indices = []
+            for batch_idx, batch_data in enumerate(dataloader):
+                if batch_data is None:
+                    continue
+                motion_batch, lengths = batch_data
+                x = motion_batch.to(DEVICE)
+                # Forward pass through MotionGPT VQ-VAE
+                x_recon, vq_loss, perplexity = vq_model(x)
+                if batch_idx % 50 == 0:
+                    usage_pct, indices = log_codebook_analysis(x_recon, vq_loss, perplexity, epoch, batch_idx)
+                    epoch_indices.append(indices.cpu().numpy().flatten())
+                # Calculate reconstruction loss with weighted parameters
+                rec_loss_unreduced = loss_fn(x_recon, x) * loss_weights.unsqueeze(0).unsqueeze(0)
+                mask = torch.zeros_like(x[:, :, 0])
+                for i, length in enumerate(lengths):
+                    mask[i, :length] = 1.0
+                mask = mask.unsqueeze(-1).expand_as(rec_loss_unreduced)
+                rec_loss = (rec_loss_unreduced * mask).sum() / mask.sum()
+                vq_weight = 1.0
+                total_loss = rec_loss + vq_weight * vq_loss
+                optimizer.zero_grad()
+                total_loss.backward()
+                torch.nn.utils.clip_grad_norm_(vq_model.parameters(), max_norm=1.0)
+                optimizer.step()
+                # Stepped once per batch, so T_0=10 counts optimizer steps, not epochs
+                scheduler.step()
+                epoch_losses.append(total_loss.item())
+                epoch_vq_losses.append(vq_loss.item())
+                epoch_rec_losses.append(rec_loss.item())
+                if batch_idx % 20 == 0:
+                    current_lr = optimizer.param_groups[0]['lr']
+                    print(f"[E:{epoch:03d}] B:{batch_idx:03d} | "
+                          f"Loss: {total_loss.item():.4f} "
+                          f"(Rec: {rec_loss.item():.4f}, VQ: {vq_loss.item():.4f}) | "
+                          f"Perplexity: {perplexity:.2f} | "
+                          f"LR: {current_lr:.2e}")
+            if epoch_losses:
+                avg_loss = np.mean(epoch_losses)
+                avg_vq_loss = np.mean(epoch_vq_losses)
+                avg_rec_loss = np.mean(epoch_rec_losses)
+                print(f"\n[EPOCH {epoch:03d} SUMMARY]")
+                print(f"Avg Loss: {avg_loss:.4f} (Rec: {avg_rec_loss:.4f}, VQ: {avg_vq_loss:.4f})")
+                # Create histogram if we collected indices
+                if epoch_indices:
+                    all_epoch_indices = np.concatenate(epoch_indices)
+                    plt.figure(figsize=(12, 6))
+                    # One unit-width bin per codebook index: [0, 1), [1, 2), ...
+                    plt.hist(all_epoch_indices, bins=vq_model.quantizer.nb_code,
+                           range=(0, vq_model.quantizer.nb_code))
+                    plt.title(f'MotionGPT Codebook Usage Distribution - Epoch {epoch}')
+                    plt.xlabel('Codebook Index')
+                    plt.ylabel('Frequency')
+                    hist_path = os.path.join(CHECKPOINT_DIR, f'mgpt_codebook_usage_epoch_{epoch:03d}.png')
+                    plt.savefig(hist_path)
+                    plt.close()
+                    print(f"Saved codebook usage histogram to {hist_path}")
+            if epoch > 0 and epoch % 5 == 0:
+                vq_model.eval()
+                with torch.no_grad():
+                    for val_data in dataloader:
+                        if val_data is not None:
+                            motion_batch, lengths = val_data
+                            x = motion_batch.to(DEVICE)
+                            x_recon, _, _ = vq_model(x)
+                            save_reconstruction_sample(x, x_recon, lengths, epoch)
+                            break
+                vq_model.train()
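+            # Skip the checkpoint when no batch was trained this epoch (mean loss would be NaN)
+            if epoch_losses and epoch > 0 and epoch % 10 == 0: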
+                checkpoint_manager.save_checkpoint(
+                    vq_model, optimizer, epoch, -1, np.mean(epoch_losses),
+                    metadata={'global_epoch': epoch}
+                )
+        global_epoch += epochs_per_batch
+        dataset.mark_batch_as_processed()
+        if not dataset.get_next_batch():
+            print("✅ All data processed! Training complete.")
+            break
+    return vq_model
+# ──────────────────────────────────────────────────────────
+# Main Training Script
+# ──────────────────────────────────────────────────────────
+def main():
+    print("Starting MotionGPT VQ-VAE Training System")
+    print(f"Checkpoint directory: {CHECKPOINT_DIR}")
+    smpl_dim = 182
+    codebook_size = 512
+    code_dim = 512
+    # Initialize MotionGPT VQ-VAE
+    vq_model = VQVae(
+        nfeats=smpl_dim,
+        quantizer="ema_reset",  # Options: "ema_reset", "orig", "ema", "reset"
+        code_num=codebook_size,
+        code_dim=code_dim,
+        output_emb_width=code_dim,
+        down_t=3,
+        stride_t=2,
+        width=512,
+        depth=3,
+        dilation_growth_rate=3,
+        norm=None,
+        activation="relu"
+    ).to(DEVICE)
+    total_params = sum(p.numel() for p in vq_model.parameters())
+    trainable_params = sum(p.numel() for p in vq_model.parameters() if p.requires_grad)
+    print(f"Total parameters: {total_params:,}")
+    print(f"Trainable parameters: {trainable_params:,}")
+    motion_dataset = EnhancedMotionDataset(
+        root_dir=DATA_ROOT,
+        processed_files_path=os.path.join(CHECKPOINT_DIR, 'processed_folders_mgpt.json'),
+        batch_folders=800
+    )
+    vq_model = train_mgpt_vqvae(
+        vq_model,
+        motion_dataset,
+        epochs_per_batch=15,
+        batch_size=12,
+        lr=2e-4
+    )
+    print("\n" + "="*70)
+    print("MGPT VQ-VAE TRAINING COMPLETED SUCCESSFULLY!")
+    print("="*70)
+    final_model_path = os.path.join(CHECKPOINT_DIR, 'final_mgpt_vqvae_model.pt')
+    torch.save({
+        'model_state_dict': vq_model.state_dict(),
+        'model_config': {
+            'nfeats': smpl_dim,
+            'code_num': codebook_size,
+            'code_dim': code_dim,
+            'quantizer': "ema_reset"
+        },
+        'training_completed': True
+    }, final_model_path)
+    print(f"Final model saved to: {final_model_path}")
+if __name__ == "__main__":
+    main()
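
The 182-dimensional feature layout and the loss weighting used in `train_mgpt_vqvae.py` above are easy to misread from slice indices alone. The sketch below recomputes them with the standard library only, under the assumption that the key order and dimensions are exactly those in `load_smplx_from_folder`. The helper names `feature_slices`, `build_loss_weights`, and `padded_length` are hypothetical and exist only for this illustration; the training script itself uses NumPy and a torch tensor.

```python
import math
from itertools import accumulate
from typing import Dict, List, Sequence, Tuple

PARAM_KEYS = ["shape", "body_pose", "lhand_pose", "rhand_pose",
              "jaw_pose", "expression", "root_pose", "cam_trans"]
PARAM_DIMS = [10, 63, 45, 45, 3, 10, 3, 3]


def feature_slices(keys: Sequence[str], dims: Sequence[int]) -> Dict[str, Tuple[int, int]]:
    """Map each parameter name to its [start, end) range in the flat vector."""
    starts = [0] + list(accumulate(dims))[:-1]
    return {k: (s, s + d) for k, s, d in zip(keys, starts, dims)}


def build_loss_weights(slices: Dict[str, Tuple[int, int]], total: int) -> List[float]:
    """Per-feature weights: 10 for poses, 5 for shape, 8 for expression, 1 otherwise."""
    weights = [1.0] * total
    groups = {
        10.0: ("body_pose", "lhand_pose", "rhand_pose", "jaw_pose"),
        5.0: ("shape",),
        8.0: ("expression",),
    }
    for value, names in groups.items():
        for name in names:
            lo, hi = slices[name]
            weights[lo:hi] = [value] * (hi - lo)
    return weights


def padded_length(max_len: int, cap: int = 256, factor: int = 8) -> int:
    """Clamp to `cap` frames, then round up to a multiple of the downsampling factor."""
    return math.ceil(min(max_len, cap) / factor) * factor


if __name__ == "__main__":
    slices = feature_slices(PARAM_KEYS, PARAM_DIMS)
    weights = build_loss_weights(slices, sum(PARAM_DIMS))
    for name, (lo, hi) in slices.items():
        print(f"{name:12s} [{lo:3d}, {hi:3d})  weight={weights[lo]}")
    print("padded length for 65 frames:", padded_length(65))
```

The factor of 8 corresponds to `down_t=3` with `stride_t=2` in the model configuration, since three stride-2 stages shorten the sequence by 2^3.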

train_pipeline.py ADDED

	@@ -0,0 +1,264 @@

+"""
+Main training pipeline for Motion LLM (Matched to test_overfit.py logic)
+Run this script to execute the full training process matching the reference implementation.
+"""
+import os
+import random
+import torch
+import json
+import argparse
+from types import SimpleNamespace
+import warnings
+# Import updated modules
+from config import (
+    SEED, DATA_JSON_PATH, MODEL_NAME, PIPELINE_OUTPUT_DIR,
+    HF_STAGE1_REPO_ID, HF_STAGE2_REPO_ID, HF_STAGE2_SAVE_SUBDIR,
+    FORCE_STAGE2_FROM_STAGE1, HF_USE_HUB, HF_TOKEN,
+    EVALUATION_WORDS, EVAL_SAMPLE_LIMIT, RUN_EVALS_ONLY,
+    TEST_EVAL_OUTPUT_DIR, TEST_EVAL_DOWNLOAD_DIR, TEST_EVAL_EXTRACT_DIR,
+    TEST_EVAL_SAMPLE_LIMIT, TEST_EVAL_MAX_ZIPS, TEST_EVAL_HF_REPO, TEST_EVAL_HF_SUBFOLDER
+)
+from data import read_json_data, deduplicate_and_prepare_data, build_motion_vocab
+from model import setup_model_and_tokenizer_raw, ensure_tokenizer_has_motion_tokens
+from train import (
+    train_stage1_raw, train_stage2_raw, resolve_and_ensure_repo,
+    repo_has_stage_latest, load_model_and_tokenizer_from_hf,
+    download_training_state, repo_get_latest_epoch_subfolder,
+    load_model_and_tokenizer_from_hf_subfolder, download_training_state_from_subfolder
+)
+from metrics import (
+    evaluate_metrics_encoder_style, run_inference_on_all_samples,
+    evaluate_metrics_motiongpt_style, save_side_by_side_visualizations
+)
+import test_dataset_eval
+# Suppress warnings
+warnings.filterwarnings("ignore")
+def main():
+    """Main function to run the entire pipeline matching test_overfit.py."""
+    print("="*80)
+    print("      Motion LLM Training Pipeline (Matches test_overfit.py)")
+    print("="*80)
+    # Set seeds
+    random.seed(SEED)
+    torch.manual_seed(SEED)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(SEED)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"Using device: {device}")
+    # 1. Load ALL data
+    print(f"\n[1/6] Loading dataset from {DATA_JSON_PATH}...")
+    all_entries = read_json_data(DATA_JSON_PATH)
+    # 2. Clean the ENTIRE dataset and get all tokens
+    print("\n[2/6] Cleaning dataset...")
+    cleaned_data, all_motion_tokens = deduplicate_and_prepare_data(all_entries)
+    # 3. Stage 1: Initialize or resume from HF, then train
+    print("\n[3/6] Stage 1 Setup & Training...")
+    resolved_stage1_repo = resolve_and_ensure_repo(HF_STAGE1_REPO_ID, HF_TOKEN) if HF_USE_HUB else None
+    start_epoch_s1 = 0
+    stage1_loaded = None
+    if resolved_stage1_repo:
+        if repo_has_stage_latest(resolved_stage1_repo, "stage1", HF_TOKEN):
+            stage1_loaded = load_model_and_tokenizer_from_hf(resolved_stage1_repo, "stage1", HF_TOKEN)
+            state_s1 = download_training_state(resolved_stage1_repo, "stage1", HF_TOKEN)
+            if state_s1 and isinstance(state_s1.get("epoch_completed"), int):
+                start_epoch_s1 = state_s1["epoch_completed"]
+        else:
+            # Fallback: no 'latest' folder; select highest epoch-XXX
+            latest_s1_sub = repo_get_latest_epoch_subfolder(resolved_stage1_repo, "stage1", HF_TOKEN)
+            if latest_s1_sub:
+                stage1_loaded = load_model_and_tokenizer_from_hf_subfolder(resolved_stage1_repo, latest_s1_sub, HF_TOKEN)
+                state_s1 = download_training_state_from_subfolder(resolved_stage1_repo, latest_s1_sub, HF_TOKEN)
+                if state_s1 and isinstance(state_s1.get("epoch_completed"), int):
+                    start_epoch_s1 = state_s1["epoch_completed"]
+    if stage1_loaded:
+        base_model, tokenizer = stage1_loaded
+        # Ensure tokenizer contains all motion tokens (add missing if dataset expanded)
+        added = ensure_tokenizer_has_motion_tokens(tokenizer, all_motion_tokens)
+        if added > 0:
+            base_model.resize_token_embeddings(len(tokenizer))
+    else:
+        base_model, tokenizer = setup_model_and_tokenizer_raw(MODEL_NAME, all_motion_tokens)
+    print(f"\nStarting Stage 1 training on {len(cleaned_data)} samples (resume from epoch {start_epoch_s1}).")
+    motion_model = train_stage1_raw(
+        base_model,
+        tokenizer,
+        cleaned_data,
+        device,
+        start_epoch=start_epoch_s1,
+        hf_repo_id=resolved_stage1_repo,
+    )
+    # 4. Stage 2: Initialize or resume from HF, then train
+    print("\n[4/6] Stage 2 Setup & Training...")
+    resolved_stage2_repo = resolve_and_ensure_repo(HF_STAGE2_REPO_ID, HF_TOKEN) if HF_USE_HUB else None
+    start_epoch_s2 = 0
+    stage2_loaded = None
+    print(f"Stage 2 resume policy: FORCE_STAGE2_FROM_STAGE1={FORCE_STAGE2_FROM_STAGE1}, save_subdir='{HF_STAGE2_SAVE_SUBDIR}'")
+    if not FORCE_STAGE2_FROM_STAGE1 and resolved_stage2_repo:
+        # Prefer loading from the configured Stage 2 save subdir (e.g., 'stage2_v2')
+        if repo_has_stage_latest(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR, HF_TOKEN):
+            stage2_loaded = load_model_and_tokenizer_from_hf(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR, HF_TOKEN)
+            state_s2 = download_training_state(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR, HF_TOKEN)
+            if state_s2 and isinstance(state_s2.get("epoch_completed"), int):
+                start_epoch_s2 = state_s2["epoch_completed"]
+            print(f"Resuming Stage 2 from HF subfolder: {HF_STAGE2_SAVE_SUBDIR}/latest (epoch_completed={start_epoch_s2})")
+        else:
+            latest_s2_sub = repo_get_latest_epoch_subfolder(resolved_stage2_repo, HF_STAGE2_SAVE_SUBDIR, HF_TOKEN)
+            if latest_s2_sub:
+                stage2_loaded = load_model_and_tokenizer_from_hf_subfolder(resolved_stage2_repo, latest_s2_sub, HF_TOKEN)
+                state_s2 = download_training_state_from_subfolder(resolved_stage2_repo, latest_s2_sub, HF_TOKEN)
+                if state_s2 and isinstance(state_s2.get("epoch_completed"), int):
+                    start_epoch_s2 = state_s2["epoch_completed"]
+                print(f"Resuming Stage 2 from HF subfolder: {latest_s2_sub} (epoch_completed={start_epoch_s2})")
+    if stage2_loaded:
+        stage2_model, tokenizer = stage2_loaded
+        added2 = ensure_tokenizer_has_motion_tokens(tokenizer, all_motion_tokens)
+        if added2 > 0:
+            stage2_model.resize_token_embeddings(len(tokenizer))
+    else:
+        stage2_model = motion_model  # Start Stage 2 from Stage 1 model
+    print(f"\nStarting Stage 2 training on {len(cleaned_data)} samples (resume from epoch {start_epoch_s2}).")
+    final_model = train_stage2_raw(
+        stage2_model,
+        tokenizer,
+        cleaned_data,
+        device,
+        start_epoch=start_epoch_s2,
+        hf_repo_id=resolved_stage2_repo,
+        hf_stage_subdir=HF_STAGE2_SAVE_SUBDIR,
+    )
+    # Save final model locally
+    os.makedirs(PIPELINE_OUTPUT_DIR, exist_ok=True)
+    final_model.save_pretrained(PIPELINE_OUTPUT_DIR)
+    tokenizer.save_pretrained(PIPELINE_OUTPUT_DIR)
+    print(f"Model saved to {PIPELINE_OUTPUT_DIR}")
+    # 5. Evaluation on Specific Words
+    print("\n[5/6] Evaluation on Specific Words...")
+    print("--- Filtering data for evaluation on specific words ---")
+    evaluation_data = [item for item in cleaned_data if item['word'].lower() in EVALUATION_WORDS]
+    print(f"Found {len(evaluation_data)} samples for evaluation words: {EVALUATION_WORDS}")
+    metrics_json_path = os.path.join(PIPELINE_OUTPUT_DIR, "metrics.json")
+    # Metrics-only mode or full flow (still part of step 5)
+    if RUN_EVALS_ONLY:
+        # Compute the 3 metrics using VQ-VAE encoder features and save to JSON
+        metrics_enc = evaluate_metrics_encoder_style(
+            final_model, tokenizer, evaluation_data, device, sample_limit=EVAL_SAMPLE_LIMIT
+        )
+        os.makedirs(PIPELINE_OUTPUT_DIR, exist_ok=True)
+        metrics_payload = {
+            "source": "vqvae_encoder",
+            "fid": metrics_enc.get("fid"),
+            "diversity": {
+                "ground_truth": metrics_enc.get("diversity_gt"),
+                "model": metrics_enc.get("diversity_gen"),
+            },
+            "multimodality": {
+                "ground_truth": metrics_enc.get("mim_gt"),
+                "model": metrics_enc.get("mim_gen"),
+            },
+            "num_pairs": len(metrics_enc.get("pairs", [])),
+        }
+        with open(metrics_json_path, "w", encoding="utf-8") as f:
+            json.dump(metrics_payload, f, ensure_ascii=False, indent=2)
+        print(f"\n✅ Saved metrics to {metrics_json_path}")
+        return
+    # Full flow: inference logs + MotionGPT-style metrics + encoder metrics + visualizations
+    run_inference_on_all_samples(final_model, tokenizer, evaluation_data, device)
+    metrics_token = evaluate_metrics_motiongpt_style(final_model, tokenizer, evaluation_data, all_motion_tokens, device)
+    # Also compute encoder-based 3 metrics
+    metrics_enc = evaluate_metrics_encoder_style(
+        final_model, tokenizer, evaluation_data, device, sample_limit=EVAL_SAMPLE_LIMIT
+    )
+    # Visualizations (skip if metrics-only)
+    viz_dir = os.path.join(PIPELINE_OUTPUT_DIR, "html_visualizations")
+    save_side_by_side_visualizations(metrics_token["pairs"], viz_dir, limit=4)
+    # 7. Run Test Dataset Evaluation (test_dataset_eval.py)
+    print("\n[6/6] Running Evaluation on Held-out Test Dataset...")
+    try:
+        # Construct args matching test_dataset_eval.parse_args
+        eval_args = SimpleNamespace(
+            drive_url=None,
+            drive_id=None,
+            # Left unset here; filled in below only if TEST_EVAL_EXTRACT_DIR already holds data.
+            local_extracted_dir=None,
+            # download_dir and extract_dir come from config.
+            download_dir=TEST_EVAL_DOWNLOAD_DIR,
+            extract_dir=TEST_EVAL_EXTRACT_DIR,
+            max_zips=TEST_EVAL_MAX_ZIPS,
+            hf_repo_id=TEST_EVAL_HF_REPO,
+            hf_subfolder=TEST_EVAL_HF_SUBFOLDER,
+            vqvae_ckpt=None,
+            stats_path=None,
+            output_dir=TEST_EVAL_OUTPUT_DIR,
+            sample_limit=TEST_EVAL_SAMPLE_LIMIT,
+            seed=SEED
+        )
+        # test_dataset_eval.run_evaluation() reloads the model from the Hugging Face Hub
+        # (hf_repo_id / hf_subfolder) and does not accept an in-memory model, so this step
+        # can fail when the trained model was never pushed (e.g. HF_USE_HUB=False).
+        # It also needs a data source: a Drive URL/ID or a local extracted directory.
+        # No Drive URL is configured here, so use TEST_EVAL_EXTRACT_DIR when it already
+        # contains files and skip the evaluation otherwise.
+        print("Calling test_dataset_eval.run_evaluation...")
+        if os.path.exists(TEST_EVAL_EXTRACT_DIR) and os.listdir(TEST_EVAL_EXTRACT_DIR):
+            eval_args.local_extracted_dir = TEST_EVAL_EXTRACT_DIR
+        else:
+            # No Drive URL is configured, so there is no data source to fall back to.
+            print("⚠️  Skipping test_dataset_eval: No local data found and no Drive URL configured.")
+            eval_args = None
+        if eval_args:
+            test_dataset_eval.run_evaluation(eval_args)
+    except Exception as e:
+        print(f"⚠️  Test dataset evaluation failed: {e}")
+    print("\n" + "="*60)
+    print("Training pipeline complete!")
+    print("="*60)
+    print(f"Models saved to: {PIPELINE_OUTPUT_DIR}")
+if __name__ == "__main__":
+    main()
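
The data-source fallback at the end of `main()` (use the extracted directory when it already holds files, otherwise skip the held-out evaluation) can be isolated into a small helper. The sketch below is only an illustration under assumptions: `resolve_local_eval_dir` and `build_eval_args` are hypothetical names that do not exist in this commit, and the namespace carries only a subset of the fields the pipeline passes to `test_dataset_eval.run_evaluation`.

```python
import os
from types import SimpleNamespace
from typing import Optional


def resolve_local_eval_dir(extract_dir: str) -> Optional[str]:
    """Return extract_dir if it is an existing, non-empty directory, else None.

    Hypothetical helper; mirrors the os.path.exists / os.listdir check in main().
    """
    if os.path.isdir(extract_dir) and os.listdir(extract_dir):
        return extract_dir
    return None


def build_eval_args(extract_dir: str, **extra) -> Optional[SimpleNamespace]:
    """Build the evaluation namespace, or None when no local data is available.

    Hypothetical helper; `extra` stands in for the remaining config-driven fields
    (output_dir, sample_limit, seed, ...).
    """
    local_dir = resolve_local_eval_dir(extract_dir)
    if local_dir is None:
        return None  # caller prints a warning and skips the evaluation
    return SimpleNamespace(
        drive_url=None,
        drive_id=None,
        local_extracted_dir=local_dir,
        extract_dir=extract_dir,
        **extra,
    )
```

With such a helper, the skip decision is made before the `SimpleNamespace` is built, rather than by setting `eval_args = None` afterwards.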

train_vqvae.py ADDED Viewed

	@@ -0,0 +1,421 @@

+import os
+import pickle
+import torch
+import torch.nn as nn
+import numpy as np
+from torch.utils.data import Dataset, DataLoader
+import glob
+import warnings
+import json
+from datetime import datetime
+import math
+import matplotlib.pyplot as plt
+import torch.nn.functional as F
+import sys
+from tqdm import tqdm
+# ==============================================================================
+# 0) SETUP: Architecture files
+# ==============================================================================
+# Make sure your mGPT folder is in the Python path
+# sys.path.append('/path/to/your/mGPT_folder')
+from mGPT.archs.mgpt_vq import VQVae
+from mGPT.archs.tools import quantize_cnn
+warnings.filterwarnings("ignore")
+# ==============================================================================
+# 1) CONFIGURATION
+# ==============================================================================
+SANITY_CHECK_ENABLED = True
+sanity_check_counter = 0
+DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+print("Device:", DEVICE)
+print(f"Sanity checks are {'ENABLED' if SANITY_CHECK_ENABLED else 'DISABLED'}.")
+# ==============================================================================
+# 2) VQ-VAE MODEL (instrumented with sanity checks)
+# ==============================================================================
+class QuantizeEMAReset_Sanity(quantize_cnn.QuantizeEMAReset):
+    def forward(self, x, current_batch_idx=0):
+        global sanity_check_counter
+        N, width, T = x.shape
+        x_proc = self.preprocess(x)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print("[Quantizer.forward] Input shape `x`: ", x.shape)
+            print("[Quantizer.forward] Shape after preprocess `x_proc`: ", x_proc.shape)
+            print(f"[Quantizer.forward] Codebook shape: {self.codebook.shape}")
+            if self.training and not self.init: print("[Quantizer.forward] Codebook is UNINITIALIZED.")
+            else: print(f"[Quantizer.forward] Codebook stats: min={self.codebook.min():.3f}, max={self.codebook.max():.3f}, mean={self.codebook.mean():.3f}")
+        if self.training and not self.init: self.init_codebook(x_proc)
+        code_idx = self.quantize(x_proc)
+        x_d = self.dequantize(code_idx)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print(f"[Quantizer.forward] Code index range: min={code_idx.min()}, max={code_idx.max()}")
+            assert code_idx.max() < self.nb_code, "A code index is out of bounds!"
+        if self.training: perplexity = self.update_codebook(x_proc, code_idx)
+        else: perplexity = self.compute_perplexity(code_idx)
+        commit_loss = F.mse_loss(x_proc, x_d.detach())
+        x_d = x_proc + (x_d - x_proc).detach()
+        x_d = x_d.view(N, T, -1).permute(0, 2, 1).contiguous()
+        return x_d, commit_loss, perplexity
+class VQVae_Sanity(VQVae):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        if isinstance(self.quantizer, quantize_cnn.QuantizeEMAReset):
+            self.quantizer = QuantizeEMAReset_Sanity(
+                self.quantizer.nb_code, self.quantizer.code_dim, self.quantizer.mu
+            )
+    def forward(self, features, current_batch_idx=0):
+        global sanity_check_counter
+        x_in = self.preprocess(features)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0: print("[VQVae.forward] Shape after preprocess (permute): ", x_in.shape)
+        x_encoder = self.encoder(x_in)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print("[VQVae.forward] Shape after encoder `x_encoder`: ", x_encoder.shape)
+            total_downsample_factor = 2**3
+            expected_len = math.ceil(features.shape[1] / total_downsample_factor)
+            print(f"[VQVae.forward] Calculated expected quantized length: ~{expected_len}")
+            assert abs(x_encoder.shape[2] - expected_len) <= 1, "Temporal downsampling seems incorrect."
+        x_quantized, loss, perplexity = self.quantizer(x_encoder, current_batch_idx)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0: print("[VQVae.forward] Shape after quantizer `x_quantized`: ", x_quantized.shape)
+        x_decoder = self.decoder(x_quantized)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print("[VQVae.forward] Shape after decoder `x_decoder`: ", x_decoder.shape)
+            assert x_decoder.shape[2] == features.shape[1], "Decoder output temporal dim mismatch!"
+        x_out = self.postprocess(x_decoder)
+        return x_out, loss, perplexity
+# Monkey-patching
+sys.modules['mGPT.archs.mgpt_vq'].VQVae = VQVae_Sanity
+sys.modules['mGPT.archs.mgpt_vq'].QuantizeEMAReset = QuantizeEMAReset_Sanity
+class MotionGPT_VQVAE_Wrapper(nn.Module):
+    def __init__(self, smpl_dim, codebook_size=512, code_dim=512, **kwargs):
+        super().__init__()
+        self.smpl_dim = smpl_dim
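+        # Use the instrumented subclass directly: the module-level `VQVae` name was bound at
+        # import time, so the monkey-patch above does not change what it refers to.
+        self.vqvae = VQVae_Sanity(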
+            nfeats=smpl_dim, code_num=codebook_size, code_dim=code_dim,
+            output_emb_width=code_dim, **kwargs
+        )
+        param_dims = [10, 63, 45, 45, 3, 10, 3, 3]
+        param_starts = np.cumsum([0] + param_dims[:-1]).tolist()
+        loss_weights = torch.ones(smpl_dim)
+        loss_weights[param_starts[1]:param_starts[5]] = 10.0
+        loss_weights[param_starts[0]:param_starts[1]] = 5.0
+        loss_weights[param_starts[5]:param_starts[6]] = 8.0
+        self.register_buffer('loss_weights', loss_weights)
+        print(f"Initialized MotionGPT VQ-VAE with {codebook_size} codebook size")
+    def forward(self, x, current_batch_idx=0):
+        global sanity_check_counter
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print("\n" + "="*50)
+            print("--- VQ-VAE WRAPPER SANITY CHECK (Batch 0) ---")
+            print(f"[Input] Shape of input features `x`: {x.shape}")
+            print("-"*50)
+        x_recon, vq_loss, perplexity = self.vqvae(x, current_batch_idx)
+        if SANITY_CHECK_ENABLED and current_batch_idx == 0 and sanity_check_counter == 0:
+            print("[Output] Shape of reconstructed features `x_recon`: ", x_recon.shape)
+            assert x.shape == x_recon.shape, "Shape mismatch!"
+            print(f"[Output] vq_loss: {vq_loss.item():.6f}, perplexity: {perplexity.item():.2f}")
+            print("--- VQ-VAE WRAPPER SANITY CHECK COMPLETE ---")
+            print("="*50 + "\n")
+        indices, _ = self.vqvae.encode(x)
+        return x_recon, vq_loss, indices, perplexity
+# ==============================================================================
+# 3) DATA LOADING
+# ==============================================================================
+def load_motion_from_npz(file_path):
+    try:
+        with np.load(file_path) as data:
+            motion_data = data['motion']
+            return torch.tensor(motion_data, dtype=torch.float32)
+    except Exception as e:
+        print(f"Warning: Could not load {os.path.basename(file_path)}. Skipping. Error: {e}")
+        return None
+class NpzMotionDataset(Dataset):
+    def __init__(self, root_dir, stats_path=None, min_seq_len=64):
+        self.min_seq_len = min_seq_len
+        print(f"\n[Dataset] Initializing from NPZ files in: '{root_dir}'")
+        glob_pattern = os.path.join(root_dir, '**', '*.npz')
+        self.files = glob.glob(glob_pattern, recursive=True)
+        if not self.files:
+            raise FileNotFoundError(f"FATAL: No .npz files found at '{glob_pattern}'.")
+        print(f"[Dataset] Found {len(self.files)} total .npz files.")
+        if stats_path and os.path.exists(stats_path):
+            stats = torch.load(stats_path, map_location='cpu')
+            self.mean = stats['mean']
+            self.std = stats['std']
+            print("[Dataset] Successfully loaded normalization stats to CPU.")
+        else:
+            print("❗ [Dataset] WARNING: Stats file not found. Proceeding without normalization. This will affect loss values and model performance.")
+            self.mean = 0
+            self.std = 1
+    def __len__(self):
+        return len(self.files)
+    def __getitem__(self, idx):
+        file_path = self.files[idx]
+        seq = load_motion_from_npz(file_path)
+        if seq is None or seq.shape[0] < self.min_seq_len:
+            return None
+        normalized_seq = (seq - self.mean) / self.std
+        return normalized_seq
+# ==============================================================================
+# 4) CHECKPOINT & CODEBOOK INITIALIZATION
+# ==============================================================================
+class CheckpointManager:
+    """Saves training checkpoints and keeps only the most recent `max_checkpoints`."""
+    def __init__(self, checkpoint_dir, max_checkpoints=3):
+        self.checkpoint_dir = checkpoint_dir
+        self.max_checkpoints = max_checkpoints
+    def save_checkpoint(self, model, optimizer, epoch, loss, metadata=None):
+        checkpoint_path = os.path.join(self.checkpoint_dir, f'vqvae_epoch_{epoch:03d}.pt')
+        torch.save({
+            'epoch': epoch,
+            'model_state_dict': model.state_dict(),
+            'optimizer_state_dict': optimizer.state_dict(),
+            'loss': loss,
+            'timestamp': datetime.now().isoformat(),
+            'metadata': metadata or {}
+        }, checkpoint_path)
+        print(f"✅ Saved checkpoint: {checkpoint_path}")
+        self.cleanup_old_checkpoints()
+    def cleanup_old_checkpoints(self):
+        checkpoints = glob.glob(os.path.join(self.checkpoint_dir, 'vqvae_epoch_*.pt'))
+        if len(checkpoints) > self.max_checkpoints:
+            checkpoints.sort(key=os.path.getmtime)
+            for old_checkpoint in checkpoints[:-self.max_checkpoints]:
+                os.remove(old_checkpoint)
+                print(f"🗑️ Removed old checkpoint: {old_checkpoint}")
+    def load_latest_checkpoint(self):
+        checkpoints = glob.glob(os.path.join(self.checkpoint_dir, 'vqvae_epoch_*.pt'))
+        if not checkpoints: return None
+        latest_checkpoint_path = max(checkpoints, key=os.path.getmtime)
+        print(f"🔄 Loading latest checkpoint: {latest_checkpoint_path}")
+        return torch.load(latest_checkpoint_path, map_location=DEVICE, weights_only=False)
+def initialize_codebook_from_dataset(model, dataloader, num_batches=100):
+    print(f"⚙️ Collecting data from {num_batches} batches for codebook initialization...")
+    all_latents = []
+    model.eval()
+    with torch.no_grad():
+        for i, batch_data in enumerate(dataloader):
+            if i >= num_batches: break
+            if batch_data and batch_data[0] is not None:
+                motion_batch, _ = batch_data
+                x = motion_batch.to(DEVICE)
+                z_e = model.vqvae.encoder(model.vqvae.preprocess(x))
+                z_e_flat = z_e.permute(0, 2, 1).reshape(-1, z_e.shape[1])
+                all_latents.append(z_e_flat.cpu())
+    if not all_latents: raise ValueError("Could not collect any latents for initialization.")
+    all_latents = torch.cat(all_latents, dim=0)
+    print(f"Collected {all_latents.shape[0]} latent vectors.")
+    codebook_size = model.vqvae.quantizer.nb_code
+    indices = torch.randperm(all_latents.shape[0])[:codebook_size]
+    initial_codebook = all_latents[indices].to(DEVICE)
+    model.vqvae.quantizer.init_codebook(initial_codebook)
+    print("✅ Codebook initialized successfully from a diverse data sample.")
+    model.train()
+# ==============================================================================
+# 5) TRAINING FUNCTION
+# ==============================================================================
+def train_vqvae_colab(vq_model, dataset, checkpoint_dir, num_epochs=300, batch_size=32, lr=2e-4):
+    """
+    Train the VQ-VAE on .npz motion files (written for Colab).
+    Paths are passed in as arguments (checkpoint_dir); the only module-level state
+    used is the sanity-check flag and counter.
+    """
+    global sanity_check_counter
+    print("\n" + "="*70 + "\n     STARTING VQ-VAE TRAINING ON COLAB     \n" + "="*70)
+    optimizer = torch.optim.AdamW(vq_model.parameters(), lr=lr, weight_decay=1e-4)
+    # scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=15, T_mult=2)
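+    # The scheduler is stepped once per batch below, so T_max is expressed in optimizer steps
+    # (the DataLoader uses drop_last=True, hence the floor division).
+    steps_per_epoch = max(1, len(dataset) // batch_size)
+    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * steps_per_epoch)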
+    loss_fn = nn.SmoothL1Loss(reduction='none')
+    # Use the passed-in checkpoint_dir
+    checkpoint_manager = CheckpointManager(checkpoint_dir)
+    start_epoch = 1
+    checkpoint = checkpoint_manager.load_latest_checkpoint()
+    if checkpoint:
+        vq_model.load_state_dict(checkpoint['model_state_dict'], strict=False)
+        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
+        start_epoch = checkpoint.get('epoch', 1) + 1
+        print(f"✅ Resumed training from epoch {start_epoch}")
+    else:
+        print("No checkpoint found; starting from epoch 1.")
+    vq_model.to(DEVICE).train()
+    codebook_size = vq_model.vqvae.quantizer.nb_code
+    def collate_fn_enhanced(batch):
+        batch = [item for item in batch if item is not None]
+        if not batch: return None, None
+        batch.sort(key=lambda x: x.shape[0], reverse=True)
+        max_len = min(batch[0].shape[0], 256)
+        padded_max_len = math.ceil(max_len / 8) * 8
+        padded_batch = torch.zeros(len(batch), padded_max_len, batch[0].shape[1])
+        lengths = [min(x.shape[0], padded_max_len) for x in batch]
+        for i, x_item in enumerate(batch):
+            padded_batch[i, :lengths[i], :] = x_item[:lengths[i], :]
+        return padded_batch, torch.tensor(lengths)
+    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=2,
+                            collate_fn=collate_fn_enhanced, drop_last=True, pin_memory=True)
+    if start_epoch == 1 and not getattr(vq_model.vqvae.quantizer, 'init', False):
+        initialize_codebook_from_dataset(vq_model, dataloader, num_batches=100)
+    for epoch in range(start_epoch, num_epochs + 1):
+        print(f"\n{'='*30} EPOCH {epoch}/{num_epochs} {'='*30}")
+        epoch_losses, epoch_vq_losses, epoch_rec_losses, epoch_perplexity = [], [], [], []
+        epoch_indices = []
+        for batch_idx, batch_data in enumerate(tqdm(dataloader, desc=f"Epoch {epoch}")):
+            if not batch_data or batch_data[0] is None: continue
+            motion_batch, lengths = batch_data
+            x = motion_batch.to(DEVICE)
+            x_recon, vq_loss, indices, perplexity = vq_model(x, batch_idx)
+            rec_loss_unreduced = loss_fn(x_recon, x) * vq_model.loss_weights
+            mask = torch.zeros_like(x[:, :, 0], device=DEVICE)
+            for i, length in enumerate(lengths): mask[i, :length] = 1.0
+            mask = mask.unsqueeze(-1).expand_as(rec_loss_unreduced)
+            rec_loss = (rec_loss_unreduced * mask).sum() / mask.sum()
+            # vq_weight = max(150.0 * (0.97 ** max(0, epoch - 3)), 1.0)
+            beta = 0.25  # Commitment-loss weight (0.25 is the value commonly used for VQ-VAEs)
+            total_loss = rec_loss + (beta * vq_loss)
+            # total_loss = rec_loss + (vq_weight * vq_loss)
+            optimizer.zero_grad(set_to_none=True)
+            total_loss.backward()
+            torch.nn.utils.clip_grad_norm_(vq_model.parameters(), max_norm=1.0)
+            optimizer.step()
+            scheduler.step()
+            epoch_losses.append(total_loss.item())
+            epoch_vq_losses.append(vq_loss.item())
+            epoch_rec_losses.append(rec_loss.item())
+            epoch_perplexity.append(perplexity.item())
+            epoch_indices.append(indices.cpu().numpy().flatten())
+            if batch_idx % 50 == 0 and batch_idx > 0:
+                print(f"\n[E:{epoch:03d}] B:{batch_idx:03d} | Loss: {total_loss.item():.4f} (Rec: {rec_loss.item():.4f}, VQ: {vq_loss.item():.6f}) | Perplexity: {perplexity.item():.2f}")
+            if SANITY_CHECK_ENABLED and batch_idx == 0 and sanity_check_counter == 0:
+                sanity_check_counter += 1
+        if not epoch_losses: continue
+        all_epoch_indices_flat = np.concatenate(epoch_indices)
+        counts = np.bincount(all_epoch_indices_flat, minlength=codebook_size)
+        avg_usage = (counts > 0).sum()
+        with torch.no_grad(): code_variance = vq_model.vqvae.quantizer.codebook.var(dim=0).mean().item()
+        print(f"\n[EPOCH {epoch:03d} SUMMARY]")
+        print(f"  Avg Loss: {np.mean(epoch_losses):.4f} (Rec: {np.mean(epoch_rec_losses):.4f}, VQ: {np.mean(epoch_vq_losses):.6f})")
+        print(f"  Avg Perplexity: {np.mean(epoch_perplexity):.2f}")
+        print(f"  Codebook Usage: {avg_usage}/{codebook_size} ({(avg_usage/codebook_size)*100:.1f}%) | Variance: {code_variance:.6f}")
+        # Use the passed-in checkpoint_dir for saving plots
+        hist_path = os.path.join(checkpoint_dir, f'codebook_usage_epoch_{epoch:03d}.png')
+        plt.figure(figsize=(12, 6))
+        plt.hist(all_epoch_indices_flat, bins=codebook_size)
+        plt.title(f'Codebook Usage - Epoch {epoch}')
+        plt.savefig(hist_path)
+        plt.close()
+        if epoch > 0 and epoch % 5 == 0:
+            print("\n--- Performing End-of-Epoch Tasks ---")
+            vq_model.eval()
+            with torch.no_grad():
+                val_data = next(iter(dataloader))
+                if val_data and val_data[0] is not None:
+                    motion_batch, lengths = val_data
+                    x_val = motion_batch.to(DEVICE)
+                    x_recon_val, _, _, _ = vq_model(x_val, -1)
+                    orig = x_val[0, :lengths[0]].cpu().numpy()
+                    recon = x_recon_val[0, :lengths[0]].cpu().numpy()
+                    mse = ((orig - recon) ** 2).mean()
+                    print(f"Reconstruction MSE on sample: {mse:.6f}")
+            with torch.no_grad():
+                usage_threshold = 10
+                underutilized_indices = torch.from_numpy(np.where(counts < usage_threshold)[0]).to(DEVICE)
+                num_to_reset = len(underutilized_indices)
+                if num_to_reset > 0:
+                    print(f"[CODEBOOK MGMT] Resetting {num_to_reset} underutilized codes.")
+                    reset_data = next(iter(dataloader))
+                    if reset_data and reset_data[0] is not None:
+                        motion_batch, _ = reset_data
+                        x_reset = motion_batch.to(DEVICE)
+                        z_e = vq_model.vqvae.encoder(vq_model.vqvae.preprocess(x_reset))
+                        z_e_flat = z_e.permute(0, 2, 1).reshape(-1, z_e.shape[1])
+                        if z_e_flat.shape[0] >= num_to_reset:
+                            indices = torch.randperm(z_e_flat.size(0))[:num_to_reset]
+                            vq_model.vqvae.quantizer.codebook.data[underutilized_indices] = z_e_flat[indices]
+            vq_model.train()
+        if epoch > 0 and epoch % 5 == 0:
+            checkpoint_manager.save_checkpoint(vq_model, optimizer, epoch, np.mean(epoch_losses))
+    print("\n✅ Training loop finished.")
+    return vq_model
+# ==============================================================================
+# 6) MAIN EXECUTION SCRIPT (No Globals)
+# ==============================================================================
+def main_colab():
+    from google.colab import drive
+    drive.mount('/content/drive')
+    print("✅ Google Drive mounted successfully.")
+    GDRIVE_ROOT = '/content/drive/MyDrive'
+    # Define all paths locally within the main function
+    STATS_PATH = '/content/dataset_stats.pt'
+    DATA_ROOT = f'{GDRIVE_ROOT}/kaggle_upload/npz_data/batch_1'
+    CHECKPOINT_DIR = f'{GDRIVE_ROOT}/Colab_Checkpoints/MotionGPT_VQVAE_Final'
+    # Create the checkpoint directory if it does not exist yet
+    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+    print(f"Data Root: {DATA_ROOT}")
+    print(f"Stats Path: {STATS_PATH}")
+    print(f"Checkpoint Dir: {CHECKPOINT_DIR}")
+    smpl_dim = 182
+    codebook_size = 512
+    code_dim = 512
+    vq_model = MotionGPT_VQVAE_Wrapper(
+        smpl_dim=smpl_dim, codebook_size=codebook_size, code_dim=code_dim,
+        quantizer="ema_reset", width=512, depth=3, down_t=3, stride_t=2,
+        dilation_growth_rate=3, activation='relu', norm=None
+    ).to(DEVICE)
+    motion_dataset = NpzMotionDataset(
+        root_dir=DATA_ROOT,
+        stats_path=STATS_PATH,
+        min_seq_len=64
+    )
+    # Pass CHECKPOINT_DIR as an argument to the training function
+    vq_model = train_vqvae_colab(
+        vq_model,
+        motion_dataset,
+        checkpoint_dir=CHECKPOINT_DIR, # Pass the path here
+        num_epochs=1000,
+        batch_size=32,
+        lr=2e-4
+    )
+    print("\n" + "="*70 + "\nVQ-VAE TRAINING COMPLETED SUCCESSFULLY!\n" + "="*70)
+    final_model_path = os.path.join(CHECKPOINT_DIR, 'final_vqvae_model.pt')
+    torch.save({'model_state_dict': vq_model.state_dict()}, final_model_path)
+    print(f"Final model saved to: {final_model_path}")
+if __name__ == "__main__":
+    main_colab()
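
The reconstruction loss in `train_vqvae_colab` combines two ideas: per-feature weights derived from the SMPL-X parameter layout, and a mask that excludes zero-padded frames from the average. The sketch below restates both in isolation, with the mask built by broadcasting instead of a Python loop. It is a minimal illustration, not the code in this commit: the function names `build_feature_weights` and `masked_weighted_smooth_l1` are hypothetical, and the slice-to-parameter mapping in the comments assumes the `PARAM_NAMES` order given in `visualize.py`.

```python
import torch
import torch.nn.functional as F

# Parameter layout taken from the source (sums to 182 features).
PARAM_DIMS = [10, 63, 45, 45, 3, 10, 3, 3]


def build_feature_weights(param_dims=PARAM_DIMS):
    """Per-feature loss weights (hypothetical helper mirroring the wrapper's buffer)."""
    starts = [0]
    for d in param_dims[:-1]:
        starts.append(starts[-1] + d)
    w = torch.ones(sum(param_dims))
    w[starts[1]:starts[5]] = 10.0  # body pose, both hands, translation
    w[starts[0]:starts[1]] = 5.0   # betas
    w[starts[5]:starts[6]] = 8.0   # expression
    return w                       # jaw and eye pose keep weight 1.0


def masked_weighted_smooth_l1(x_recon, x, lengths, feature_weights):
    """Smooth-L1 loss averaged over valid (non-padded) elements only.

    x_recon, x: (B, T, D); lengths: (B,) valid frame counts; feature_weights: (D,).
    """
    # Element-wise loss, scaled per feature (broadcasts over batch and time).
    per_elem = F.smooth_l1_loss(x_recon, x, reduction="none") * feature_weights
    # Frame mask without a Python loop: 1 where frame index < sequence length.
    t = torch.arange(x.shape[1], device=x.device)
    mask = (t.unsqueeze(0) < lengths.to(x.device).unsqueeze(1)).unsqueeze(-1)
    mask = mask.to(per_elem.dtype).expand_as(per_elem)
    # As in the training loop, the denominator counts valid elements, not weights.
    return (per_elem * mask).sum() / mask.sum().clamp_min(1.0)
```

Because padded frames are multiplied by zero before the sum, reconstruction errors in the padding never reach the gradient, which is why the collate function can pad with zeros safely.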

visualize.py ADDED Viewed

	@@ -0,0 +1,681 @@

+"""
+Visualization script to convert motion tokens to SMPL-X 3D animation.
+Requires VQ-VAE checkpoint, dataset stats, and SMPL-X model files.
+Usage:
+    # Visualize from LLM output string
+    python visualize.py --tokens "<MOT_BEGIN><motion_177><motion_135>...<MOT_END>"
+    # Visualize from saved file
+    python visualize.py --input motion_output.txt
+    # Generate and visualize in one go
+    python visualize.py --prompt "walking" --stage 3
+    # Custom paths
+    python visualize.py --tokens "..." --vqvae-ckpt /path/to/vqvae.pt --smplx-dir /path/to/smplx
+"""
+import os
+import sys
+import re
+import argparse
+from pathlib import Path
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from config import WORK_DIR, DATA_DIR
+# Try importing visualization dependencies
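+try:
+    import plotly.graph_objects as go
+except ImportError:
+    # Install with the running interpreter's pip so the package lands in this environment
+    import subprocess
+    print("Installing plotly...")
+    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "plotly"])
+    import plotly.graph_objects as go
+try:
+    import smplx
+except ImportError:
+    import subprocess
+    print("Installing smplx...")
+    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "smplx==0.1.28"])
+    import smplx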
+# =====================================================================
+# Configuration - can be overridden via command-line or environment
+# =====================================================================
+# VQ-VAE checkpoint path (trained motion encoder/decoder)
+VQVAE_CHECKPOINT = os.environ.get(
+    "VQVAE_CHECKPOINT",
+    os.path.join(DATA_DIR, "vqvae_model.pt")
+)
+# Dataset normalization stats (mean/std used during VQ-VAE training)
+STATS_PATH = os.environ.get(
+    "VQVAE_STATS_PATH",
+    os.path.join(DATA_DIR, "vqvae_stats.pt")
+)
+# SMPL-X model directory (contains SMPLX_NEUTRAL.npz, etc.)
+SMPLX_MODEL_DIR = os.environ.get(
+    "SMPLX_MODEL_DIR",
+    os.path.join(DATA_DIR, "smplx_models")
+)
+# Output directory for HTML animations
+OUTPUT_DIR = os.environ.get("VIS_OUTPUT_DIR", WORK_DIR)
+# Device
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
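+# VQ-VAE architecture params (must match the config used to train the checkpoint;
+# note that train_vqvae.py in this commit uses down_t=3, while down_t=2 is set here)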
+SMPL_DIM = 182
+CODEBOOK_SIZE = 512
+CODE_DIM = 512
+VQ_ARGS = dict(
+    width=512,
+    depth=3,
+    down_t=2,
+    stride_t=2,
+    dilation_growth_rate=3,
+    activation='relu',
+    norm=None,
+    quantizer="ema_reset"
+)
+# SMPL-X parameter layout (must match VQ-VAE training)
+PARAM_DIMS = [10, 63, 45, 45, 3, 10, 3, 3]
+PARAM_NAMES = ["betas", "body_pose", "left_hand_pose", "right_hand_pose",
+               "trans", "expression", "jaw_pose", "eye_pose"]
+# =====================================================================
+# Import VQ-VAE architecture
+# =====================================================================
+try:
+    # Add this file's directory (the SignMotionGPT root) to sys.path if not already present
+    sign_mgpt_dir = os.path.dirname(os.path.abspath(__file__))
+    if sign_mgpt_dir not in sys.path:
+        sys.path.insert(0, sign_mgpt_dir)
+    from mGPT.archs.mgpt_vq import VQVae
+except ImportError as e:
+    print(f"❌ Could not import VQVae: {e}")
+    print("Make sure mGPT/archs/mgpt_vq.py exists in the project.")
+    sys.exit(1)
+# =====================================================================
+# VQ-VAE Wrapper
+# =====================================================================
+class MotionGPT_VQVAE_Wrapper(nn.Module):
+    """Wrapper matching the VQ-VAE training setup"""
+    def __init__(self, smpl_dim=SMPL_DIM, codebook_size=CODEBOOK_SIZE,
+                 code_dim=CODE_DIM, **kwargs):
+        super().__init__()
+        self.vqvae = VQVae(
+            nfeats=smpl_dim,
+            code_num=codebook_size,
+            code_dim=code_dim,
+            output_emb_width=code_dim,
+            **kwargs
+        )
+# =====================================================================
+# Token Parsing
+# =====================================================================
+def parse_motion_tokens(token_str):
+    """
+    Parse motion tokens from LLM output string.
+    Accepts:
+      - "<MOT_BEGIN><motion_177><motion_135>...<MOT_END>"
+      - "177 135 152 200 46..."
+      - List/array of ints
+    Returns:
+        List of token integers
+    """
+    if isinstance(token_str, (list, tuple, np.ndarray)):
+        return [int(x) for x in token_str]
+    if not isinstance(token_str, str):
+        raise ValueError("Tokens must be string or list-like")
+    # Try extracting <motion_ID> tokens
+    matches = re.findall(r'<motion_(\d+)>', token_str)
+    if matches:
+        return [int(x) for x in matches]
+    # Try space-separated numbers
+    token_str = token_str.strip()
+    if token_str:
+        try:
+            return [int(x) for x in token_str.split()]
+        except ValueError:
+            pass
+    raise ValueError(f"Could not parse motion tokens from: {token_str[:100]}...")
+# =====================================================================
+# Model Loading
+# =====================================================================
+def load_vqvae(checkpoint_path, device=DEVICE, vq_args=VQ_ARGS):
+    """Load trained VQ-VAE model from checkpoint"""
+    if not os.path.exists(checkpoint_path):
+        raise FileNotFoundError(
+            f"VQ-VAE checkpoint not found: {checkpoint_path}\n"
+            f"Please download it and set VQVAE_CHECKPOINT environment variable "
+            f"or use --vqvae-ckpt argument."
+        )
+    print(f"Loading VQ-VAE from: {checkpoint_path}")
+    model = MotionGPT_VQVAE_Wrapper(
+        smpl_dim=SMPL_DIM,
+        codebook_size=CODEBOOK_SIZE,
+        code_dim=CODE_DIM,
+        **vq_args
+    ).to(device)
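+    # weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
+    ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)
+    state_dict = ckpt.get('model_state_dict', ckpt)
+    # strict=False tolerates key mismatches, so report them instead of ignoring them silently.
+    load_result = model.load_state_dict(state_dict, strict=False)
+    if load_result.missing_keys or load_result.unexpected_keys:
+        print(f"⚠️  State dict mismatch: {len(load_result.missing_keys)} missing, "
+              f"{len(load_result.unexpected_keys)} unexpected keys")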
+    model.eval()
+    print(f"✅ VQ-VAE loaded (codebook size: {CODEBOOK_SIZE})")
+    return model
+def load_stats(stats_path):
+    """Load normalization statistics (mean/std) used during VQ-VAE training"""
+    if not stats_path or not os.path.exists(stats_path):
+        print(f"⚠️  Stats file not found: {stats_path}")
+        print("   Will skip denormalization (may affect quality)")
+        return None, None
+    print(f"Loading stats from: {stats_path}")
+    st = torch.load(stats_path, map_location='cpu', weights_only=False)
+    mean = st.get('mean', 0)
+    std = st.get('std', 1)
+    # Convert to numpy
+    if torch.is_tensor(mean):
+        mean = mean.cpu().numpy()
+    if torch.is_tensor(std):
+        std = std.cpu().numpy()
+    print(f"✅ Stats loaded (mean shape: {np.array(mean).shape})")
+    return mean, std
+def load_smplx_model(model_dir, device=DEVICE):
+    """Load SMPL-X body model"""
+    if not os.path.exists(model_dir):
+        raise FileNotFoundError(
+            f"SMPL-X model directory not found: {model_dir}\n"
+            f"Please download SMPL-X models and set SMPLX_MODEL_DIR environment variable "
+            f"or use --smplx-dir argument."
+        )
+    print(f"Loading SMPL-X from: {model_dir}")
+    model = smplx.SMPLX(
+        model_path=model_dir,
+        model_type='smplx',
+        gender='neutral',
+        use_pca=False,
+        create_global_orient=True,
+        create_body_pose=True,
+        create_betas=True,
+        create_expression=True,
+        create_jaw_pose=True,
+        create_left_hand_pose=True,
+        create_right_hand_pose=True,
+        create_transl=True
+    ).to(device)
+    print("✅ SMPL-X loaded")
+    return model
+# =====================================================================
+# Token Decoding
+# =====================================================================
+def decode_tokens_to_params(tokens, vqvae_model, mean=None, std=None, device=DEVICE):
+    """
+    Decode motion tokens to SMPL-X parameters.
+    Args:
+        tokens: List of motion token IDs
+        vqvae_model: Trained VQ-VAE model
+        mean: Optional normalization mean
+        std: Optional normalization std
+        device: Device to run on
+    Returns:
+        numpy array of shape (T, SMPL_DIM) with SMPL-X parameters
+    """
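+    # Use len() so NumPy arrays work too; truth-testing an array is ambiguous.
+    if tokens is None or len(tokens) == 0:
+        return np.zeros((0, SMPL_DIM), dtype=np.float32)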
+    # Prepare token indices
+    idx = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)  # (1, T_q)
+    T_q = idx.shape[1]
+    quantizer = vqvae_model.vqvae.quantizer
+    # Get code dimension
+    if hasattr(quantizer, "codebook"):
+        codebook = quantizer.codebook.to(device)
+        code_dim = codebook.shape[1]
+    else:
+        code_dim = CODE_DIM
+    # Dequantize tokens
+    x_quantized = None
+    if hasattr(quantizer, "dequantize"):
+        try:
+            with torch.no_grad():
+                dq = quantizer.dequantize(idx)
+            if dq is not None:
+                dq = dq.contiguous()
+                # Ensure shape is (N, code_dim, T_q)
+                if dq.ndim == 3 and dq.shape[1] == code_dim:
+                    x_quantized = dq
+                elif dq.ndim == 3 and dq.shape[1] == T_q:
+                    x_quantized = dq.permute(0, 2, 1).contiguous()
+                else:
+                    x_quantized = None
+        except Exception:
+            x_quantized = None
+    # Fallback: manual codebook lookup
+    if x_quantized is None:
+        if not hasattr(quantizer, "codebook"):
+            raise RuntimeError("No dequantize method and no codebook available")
+        with torch.no_grad():
+            emb = codebook[idx]  # (1, T_q, code_dim)
+            x_quantized = emb.permute(0, 2, 1).contiguous()  # (1, code_dim, T_q)
+    # Decode through VQ-VAE decoder
+    with torch.no_grad():
+        x_dec = vqvae_model.vqvae.decoder(x_quantized)
+        smpl_out = vqvae_model.vqvae.postprocess(x_dec)  # (1, T_out, SMPL_DIM)
+        params_np = smpl_out.squeeze(0).cpu().numpy()  # (T_out, SMPL_DIM)
+    # Denormalize if stats provided
+    if (mean is not None) and (std is not None):
+        mean_arr = np.array(mean).reshape(1, -1)
+        std_arr = np.array(std).reshape(1, -1)
+        params_np = (params_np * std_arr) + mean_arr
+    return params_np
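Reviewer note: the fallback path in `decode_tokens_to_params` (index the codebook, then move the code dimension ahead of time) and the final denormalization step can be sketched without PyTorch. This is a toy illustration in plain Python lists; `lookup_codes`, `denormalize`, and the tiny codebook are hypothetical, and the range check is an addition that the original code does not perform.

```python
def lookup_codes(codebook, token_ids):
    """Gather one embedding per token, then transpose to (code_dim, T_q)."""
    for t in token_ids:
        # Added safety check (not in the original): reject out-of-range IDs early.
        if not 0 <= t < len(codebook):
            raise IndexError(f"token {t} outside codebook of size {len(codebook)}")
    per_token = [codebook[t] for t in token_ids]      # shape (T_q, code_dim)
    return [list(col) for col in zip(*per_token)]     # shape (code_dim, T_q)


def denormalize(frames, mean, std):
    """Undo per-feature z-score normalization: x * std + mean."""
    return [[x * s + m for x, m, s in zip(row, mean, std)] for row in frames]


toy_codebook = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # 3 codes, code_dim = 2
channels_first = lookup_codes(toy_codebook, [2, 0, 1])
restored = denormalize([[1.0, -1.0]], mean=[10.0, 20.0], std=[2.0, 4.0])
print(channels_first)
print(restored)
```

The transpose corresponds to the `permute(0, 2, 1)` call above, which puts the embeddings in the channels-first layout the decoder is called with.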
+# =====================================================================
+# SMPL-X Parameter to Vertices
+# =====================================================================
+def params_to_vertices(params_seq, smplx_model, batch_size=32):
+    """
+    Convert SMPL-X parameters to 3D vertices.
+    Args:
+        params_seq: numpy array (T, SMPL_DIM)
+        smplx_model: loaded SMPL-X model
+        batch_size: batch size for processing
+    Returns:
+        verts: numpy array (T, V, 3)
+        faces: numpy array (F, 3)
+    """
+    # Compute parameter slicing indices
+    starts = np.cumsum([0] + PARAM_DIMS[:-1])
+    ends = starts + np.array(PARAM_DIMS)
+    T = params_seq.shape[0]
+    all_verts = []
+    # Infer number of body joints
+    num_body_joints = getattr(smplx_model, "NUM_BODY_JOINTS", 21)
+    with torch.no_grad():
+        for s in range(0, T, batch_size):
+            batch = params_seq[s:s+batch_size]  # (B, SMPL_DIM)
+            B = batch.shape[0]
+            # Extract parameters
+            np_parts = {}
+            for name, st, ed in zip(PARAM_NAMES, starts, ends):
+                np_parts[name] = batch[:, st:ed].astype(np.float32)
+            # Convert to tensors
+            tensor_parts = {
+                name: torch.from_numpy(arr).to(DEVICE)
+                for name, arr in np_parts.items()
+            }
+            # Handle body pose (may or may not include global orient)
+            body_t = tensor_parts['body_pose']
+            L_body = body_t.shape[1]
+            expected_no_go = num_body_joints * 3
+            expected_with_go = (num_body_joints + 1) * 3
+            if L_body == expected_with_go:
+                global_orient = body_t[:, :3].contiguous()
+                body_pose_only = body_t[:, 3:].contiguous()
+            elif L_body == expected_no_go:
+                global_orient = torch.zeros((B, 3), dtype=torch.float32, device=DEVICE)
+                body_pose_only = body_t
+            else:
+                # Best-effort fallback
+                if L_body > expected_no_go:
+                    global_orient = body_t[:, :3].contiguous()
+                    body_pose_only = body_t[:, 3:].contiguous()
+                else:
+                    pad_len = max(0, expected_no_go - L_body)
+                    body_pose_only = F.pad(body_t, (0, pad_len))
+                    global_orient = torch.zeros((B, 3), dtype=torch.float32, device=DEVICE)
+            # Call SMPL-X
+            out = smplx_model(
+                betas=tensor_parts['betas'],
+                global_orient=global_orient,
+                body_pose=body_pose_only,
+                left_hand_pose=tensor_parts['left_hand_pose'],
+                right_hand_pose=tensor_parts['right_hand_pose'],
+                expression=tensor_parts['expression'],
+                jaw_pose=tensor_parts['jaw_pose'],
+                # The single eye_pose slice is applied to both eyes.
+                leye_pose=tensor_parts['eye_pose'],
+                reye_pose=tensor_parts['eye_pose'],
+                transl=tensor_parts['trans'],
+                return_verts=True
+            )
+            verts = out.vertices.detach().cpu().numpy()  # (B, V, 3)
+            all_verts.append(verts)
+    verts_all = np.concatenate(all_verts, axis=0)  # (T, V, 3)
+    faces = smplx_model.faces.astype(np.int32)
+    return verts_all, faces
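Reviewer note: the cumulative-offset slicing at the top of `params_to_vertices` can be shown in isolation. In this minimal sketch the names `PARAM_NAMES_EXAMPLE` and `PARAM_DIMS_EXAMPLE` and their values are assumptions chosen for illustration; the real layout comes from the project's configuration, which is not visible in this part of the diff.

```python
from itertools import accumulate

# Hypothetical layout, for illustration only.
PARAM_NAMES_EXAMPLE = ["betas", "body_pose", "trans"]
PARAM_DIMS_EXAMPLE = [10, 63, 3]


def split_params(row, names, dims):
    """Slice one flat parameter row into named chunks using running offsets."""
    ends = list(accumulate(dims))       # exclusive end index of each chunk
    starts = [0] + ends[:-1]            # each chunk starts where the previous one ended
    return {name: row[s:e] for name, s, e in zip(names, starts, ends)}


row = list(range(sum(PARAM_DIMS_EXAMPLE)))   # 76 placeholder values: 0..75
parts = split_params(row, PARAM_NAMES_EXAMPLE, PARAM_DIMS_EXAMPLE)
print({name: len(chunk) for name, chunk in parts.items()})
```

The chunk widths must sum to the row length; otherwise the trailing chunk is silently short, which is worth checking before calling the body model.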
+# =====================================================================
+# Visualization
+# =====================================================================
+def animate_motion(verts, faces, title="Generated Motion", output_path=None, fps=20):
+    """
+    Create interactive 3D animation using Plotly.
+    Args:
+        verts: numpy array (T, V, 3)
+        faces: numpy array (F, 3)
+        title: Plot title
+        output_path: Path to save HTML file
+        fps: Frames per second for animation
+    Returns:
+        Plotly figure object
+    """
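+    # Guard against a zero or negative fps, which would break the frame-duration division below.
+    if fps <= 0:
+        raise ValueError("fps must be a positive integer")
+    T, V, _ = verts.shape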
+    i, j, k = faces.T.tolist()
+    # Initial mesh
+    mesh = go.Mesh3d(
+        x=verts[0, :, 0],
+        y=verts[0, :, 1],
+        z=verts[0, :, 2],
+        i=i, j=j, k=k,
+        name=title,
+        flatshading=True,
+        opacity=0.7
+    )
+    # Create frames
+    frames = [
+        go.Frame(
+            data=[go.Mesh3d(
+                x=verts[t, :, 0],
+                y=verts[t, :, 1],
+                z=verts[t, :, 2],
+                i=i, j=j, k=k,
+                flatshading=True,
+                opacity=0.7
+            )],
+            name=str(t)
+        )
+        for t in range(T)
+    ]
+    # Create figure
+    fig = go.Figure(data=[mesh], frames=frames)
+    fig.update_layout(
+        title_text=title,
+        scene=dict(
+            aspectmode='data',
+            xaxis=dict(visible=False),
+            yaxis=dict(visible=False),
+            zaxis=dict(visible=False),
+            camera=dict(eye=dict(x=0, y=-2, z=0.7))
+        ),
+        updatemenus=[dict(
+            type="buttons",
+            buttons=[
+                dict(
+                    label="Play",
+                    method="animate",
+                    args=[None, {
+                        "frame": {"duration": 1000//fps, "redraw": True},
+                        "fromcurrent": True
+                    }]
+                ),
+                dict(
+                    label="Pause",
+                    method="animate",
+                    args=[[None], {
+                        "frame": {"duration": 0, "redraw": False},
+                        "mode": "immediate",
+                        "transition": {"duration": 0}
+                    }]
+                )
+            ]
+        )]
+    )
+    # Save HTML
+    if output_path:
+        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
+        fig.write_html(output_path)
+        print(f"✅ Animation saved to: {output_path}")
+    return fig
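Reviewer note: the Play button converts `fps` into a per-frame duration in milliseconds with integer division. A minimal sketch of that conversion follows; `frame_duration_ms` is a hypothetical helper, and the lower bound of 1 ms is an assumption added here, not behavior of the code above.

```python
def frame_duration_ms(fps):
    """Convert frames per second into a whole-millisecond frame duration."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Integer division truncates; clamp so very high fps never yields 0 ms.
    return max(1, 1000 // fps)


durations = {fps: frame_duration_ms(fps) for fps in (20, 30, 2000)}
print(durations)
```

Since the division truncates, 30 fps plays at 33 ms per frame, which is slightly faster than real time.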
+# =====================================================================
+# Main Visualization Pipeline
+# =====================================================================
+def visualize(
+    tokens,
+    vqvae_ckpt=VQVAE_CHECKPOINT,
+    stats_path=STATS_PATH,
+    smplx_dir=SMPLX_MODEL_DIR,
+    output_html=None,
+    title="Generated Motion",
+    fps=20
+):
+    """
+    Complete visualization pipeline: tokens -> vertices -> animation.
+    Args:
+        tokens: Motion tokens (string or list of ints)
+        vqvae_ckpt: Path to VQ-VAE checkpoint
+        stats_path: Path to normalization stats
+        smplx_dir: Path to SMPL-X model directory
+        output_html: Path to save HTML animation
+        title: Animation title
+        fps: Frames per second
+    Returns:
+        Plotly figure object
+    """
+    print("="*60)
+    print("Motion Visualization Pipeline")
+    print("="*60)
+    # Parse tokens
+    print("\n[1/5] Parsing tokens...")
+    token_list = parse_motion_tokens(tokens)
+    print(f"   Parsed {len(token_list)} tokens")
+    if not token_list:
+        print("❌ No tokens to visualize")
+        return None
+    # Load models
+    print("\n[2/5] Loading VQ-VAE...")
+    vq_model = load_vqvae(vqvae_ckpt, device=DEVICE)
+    print("\n[3/5] Loading normalization stats...")
+    mean, std = load_stats(stats_path)
+    print("\n[4/5] Loading SMPL-X model...")
+    smplx_model = load_smplx_model(smplx_dir, device=DEVICE)
+    # Decode tokens
+    print("\n[5/5] Decoding and rendering...")
+    print("   Decoding tokens to SMPL-X parameters...")
+    params = decode_tokens_to_params(token_list, vq_model, mean, std, device=DEVICE)
+    print(f"   Decoded params shape: {params.shape}")
+    if params.shape[0] == 0:
+        print("❌ No frames produced from decoder")
+        return None
+    # Convert to vertices
+    print("   Converting parameters to vertices...")
+    verts, faces = params_to_vertices(params, smplx_model, batch_size=32)
+    print(f"   Vertices shape: {verts.shape}, Faces: {faces.shape}")
+    # Create animation
+    print("   Creating animation...")
+    if output_html is None:
+        output_html = os.path.join(OUTPUT_DIR, "motion_animation.html")
+    fig = animate_motion(verts, faces, title=title, output_path=output_html, fps=fps)
+    print("\n" + "="*60)
+    print("✅ Visualization complete!")
+    print("="*60)
+    return fig
+# =====================================================================
+# CLI
+# =====================================================================
+def main():
+    parser = argparse.ArgumentParser(
+        description="Visualize motion tokens as 3D SMPL-X animation"
+    )
+    # Input options (mutually exclusive)
+    input_group = parser.add_mutually_exclusive_group(required=True)
+    input_group.add_argument(
+        "--tokens",
+        type=str,
+        help="Motion tokens string (e.g., '<MOT_BEGIN><motion_177>...<MOT_END>' or '177 135 152...')"
+    )
+    input_group.add_argument(
+        "--input",
+        type=str,
+        help="Path to file containing motion tokens"
+    )
+    input_group.add_argument(
+        "--prompt",
+        type=str,
+        help="Generate tokens from a text prompt first (uses the model selected by --stage)"
+    )
+    # Generation options (if using --prompt)
+    parser.add_argument(
+        "--stage",
+        type=int,
+        default=3,
+        choices=[1, 2, 3],
+        help="Stage model to use for generation (default: 3)"
+    )
+    # Model paths
+    parser.add_argument(
+        "--vqvae-ckpt",
+        type=str,
+        default=VQVAE_CHECKPOINT,
+        help=f"Path to VQ-VAE checkpoint (default: {VQVAE_CHECKPOINT})"
+    )
+    parser.add_argument(
+        "--stats",
+        type=str,
+        default=STATS_PATH,
+        help=f"Path to normalization stats (default: {STATS_PATH})"
+    )
+    parser.add_argument(
+        "--smplx-dir",
+        type=str,
+        default=SMPLX_MODEL_DIR,
+        help=f"Path to SMPL-X model directory (default: {SMPLX_MODEL_DIR})"
+    )
+    # Output options
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=None,
+        help="Path to save HTML animation (default: motion_animation.html inside OUTPUT_DIR)"
+    )
+    parser.add_argument(
+        "--title",
+        type=str,
+        default="Generated Motion",
+        help="Animation title"
+    )
+    parser.add_argument(
+        "--fps",
+        type=int,
+        default=20,
+        help="Frames per second for animation (default: 20)"
+    )
+    args = parser.parse_args()
+    # Get tokens
+    if args.prompt:
+        # Generate tokens first using inference.py
+        print("Generating motion tokens from prompt...")
+        from inference import inference
+        tokens = inference(
+            prompt=args.prompt,
+            stage=args.stage,
+            output_file=None,
+            per_prompt_vocab=True
+        )
+    elif args.input:
+        # Read from file
+        with open(args.input, 'r', encoding='utf-8') as f:
+            tokens = f.read().strip()
+    else:
+        # Direct token string
+        tokens = args.tokens
+    # Visualize
+    visualize(
+        tokens=tokens,
+        vqvae_ckpt=args.vqvae_ckpt,
+        stats_path=args.stats,
+        smplx_dir=args.smplx_dir,
+        output_html=args.output,
+        title=args.title,
+        fps=args.fps
+    )
+if __name__ == "__main__":
+    main()
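Reviewer note: the CLI accepts exactly one of `--tokens`, `--input`, or `--prompt`, while `--stage` is optional and defaults to 3. The sketch below reproduces only that part of the parser so the behavior can be tried in isolation; `build_parser` and the example prompt text are hypothetical.

```python
import argparse


def build_parser():
    """Reduced copy of the input options, for illustration only."""
    parser = argparse.ArgumentParser(description="Sketch of the input options")
    # Exactly one input source must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--tokens", type=str)
    group.add_argument("--input", type=str)
    group.add_argument("--prompt", type=str)
    # Only consulted when --prompt is used; has a default, so it is never required.
    parser.add_argument("--stage", type=int, default=3, choices=[1, 2, 3])
    return parser


args = build_parser().parse_args(["--prompt", "a person waves", "--stage", "2"])
default_args = build_parser().parse_args(["--tokens", "177 135 152"])
print(args.prompt, args.stage, default_args.stage)
```

Passing two input sources, or none, makes `argparse` exit with a usage error before any model is loaded.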