# Architecture of Motion-S RVQ-VAE

This implementation is a **Residual Vector Quantized Variational Autoencoder (RVQ-VAE)** designed for motion sequence compression and tokenization. Let me break down each component:

---

## 1. **Overall Architecture Flow**

```
Input Motion (B, D_POSE, N)
        ↓
   [Encoder] → Continuous Latent (B, d, n)
        ↓
   [RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
        ↓
   [Decoder] → Reconstructed Motion (B, D_POSE, N)
```

Where:
- **B** = Batch size
- **D_POSE** = Motion feature dimension (e.g., 263 for body pose)
- **N** = Original sequence length (frames)
- **d** = Latent dimension (default 256)
- **n** = Downsampled sequence length (`N // downsampling_ratio`)
---

## 2. **MotionEncoder: Convolutional Downsampling**

### Purpose
Compresses motion sequences both **spatially** (D_POSE → d) and **temporally** (N → n).

### Architecture
```python
Input: (B, D_POSE, N)  # Treats D_POSE as channels, N as sequence length
Layer Structure (4 layers default):
    Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1)  # Temporal downsampling
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=2, padding=1)     # More downsampling
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=1, padding=1)     # Maintain resolution
    ReLU + BatchNorm
    Conv1D(512 → 256, kernel=3, stride=1, padding=1)     # Project to latent_dim
    ReLU + BatchNorm
Output: (B, 256, n)  # n ≈ N/4 for downsampling_ratio=4
```

### Key Design Choices
- **Stride=2 for the first log₂(ratio) layers**: Achieves 4x downsampling with two stride-2 convolutions
- **BatchNorm**: Stabilizes training by normalizing activations
- **1D Convolutions**: Efficient for sequential data compared with 2D convolutions or RNNs
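The downsampling arithmetic can be sanity-checked with the standard Conv1D output-length formula; `conv1d_out_len` and `encoder_out_len` below are hypothetical helpers for shape checking, not part of the implementation.

```python
# Standard Conv1D output-length formula:
# out = floor((N + 2*padding - kernel) / stride) + 1
def conv1d_out_len(n, kernel=3, stride=1, padding=1):
    return (n + 2 * padding - kernel) // stride + 1

def encoder_out_len(n_frames, downsampling_ratio=4):
    """Trace sequence length through the 4-layer encoder sketched above."""
    num_strided = downsampling_ratio.bit_length() - 1  # log2(ratio) stride-2 layers
    n = n_frames
    for _ in range(num_strided):
        n = conv1d_out_len(n, stride=2)   # halves the length each time
    for _ in range(2):
        n = conv1d_out_len(n, stride=1)   # stride-1 layers keep the length
    return n

print(encoder_out_len(64))  # → 16 (i.e. 64 / 4)
```

With kernel=3 and padding=1, each stride-2 layer exactly halves the length for even inputs, which is why two such layers give the 4x compression.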
---

## 3. **ResidualVectorQuantizer (RVQ): Hierarchical Quantization**

### Purpose
Converts continuous latents into **discrete tokens** using a hierarchy of codebooks.

### Core Concept: Residual Quantization
Instead of quantizing once, RVQ quantizes the **residual error** iteratively:

```
Step 0: Quantize input        → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual     → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual        → bⱽ = Qᵥ(rⱽ)

Final Output: b⁰ + b¹ + ... + bⱽ  # Sum of all quantized codes
```
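The iteration above can be sketched numerically. The toy 2-D codebooks and the `quantize` helper below are illustrative assumptions (random codebooks, one per layer, at decreasing scales), not the model's actual codebooks:

```python
import numpy as np

# Toy sketch of residual quantization: each layer quantizes what the previous
# layers left over, so the running sum of codes approximates the input.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 2)) * 0.5 ** v for v in range(4)]  # coarse → fine

def quantize(x, codebook):
    """Return the codebook row nearest to vector x."""
    dists = ((codebook - x) ** 2).sum(axis=1)
    return codebook[dists.argmin()]

x = np.array([0.7, -0.3])
approx, residual = np.zeros(2), x
for cb in codebooks:
    code = quantize(residual, cb)  # bᵛ = Qᵥ(rᵛ)
    approx = approx + code         # running sum of codes
    residual = residual - code     # rᵛ⁺¹ = rᵛ - bᵛ

# By construction, the summed codes plus the final residual reproduce x exactly;
# the final residual is the part the hierarchy failed to capture.
```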
### Architecture
```python
num_quantizers = 6  # V+1 layers (0 to 5)
For each layer v:
    1. Compute distances to the codebook:
       distances = ||z - embedding||²     # (B*n, num_embeddings)
    2. Find the nearest code:
       indices = argmin(distances)        # (B*n,)
    3. Look up the quantized vector:
       quantized = embedding[:, indices]  # (B, d, n)
    4. Compute the next residual:
       residual = residual - quantized
```
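The four steps of a single layer can be sketched in NumPy for latents shaped (B, d, n) and a codebook shaped (d, K); `vq_lookup` is an illustrative name, not the module's API:

```python
import numpy as np

def vq_lookup(z, embedding):
    """One layer's nearest-code lookup. z: (B, d, n); embedding: (d, K)."""
    B, d, n = z.shape
    flat = z.transpose(0, 2, 1).reshape(-1, d)         # (B*n, d)
    # Squared distances expanded as ||z||² - 2·z·e + ||e||²
    dists = (
        (flat ** 2).sum(1, keepdims=True)
        - 2 * flat @ embedding
        + (embedding ** 2).sum(0, keepdims=True)
    )                                                   # (B*n, K)
    indices = dists.argmin(axis=1)                      # (B*n,)
    quantized = embedding[:, indices].T                 # (B*n, d)
    return quantized.reshape(B, n, d).transpose(0, 2, 1), indices.reshape(B, n)
```

A convenient self-test: if every column of `z` is an exact codebook entry, the lookup returns `z` unchanged along with the matching indices.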
### VectorQuantizer: Single-Layer Quantization
Each layer has:
- **Codebook**: an `embedding` tensor of shape `(d, num_embeddings=512)`
  - 512 learnable code vectors, each of dimension 256
- **EMA Updates** (Exponential Moving Average):
  ```python
  cluster_size  = (1-decay) * new_counts + decay * old_counts
  embedding_avg = (1-decay) * new_codes  + decay * old_codes
  embedding     = embedding_avg / cluster_size  # Normalize
  ```
  - Prevents codebook collapse (dead codes)
  - No explicit gradient descent on the codebook
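One EMA step can be sketched as follows, under the assumption that `flat_inputs` holds the (M, d) encoder outputs assigned by `indices`; the ε guard against empty clusters is an added safeguard, not necessarily what the implementation does:

```python
import numpy as np

def ema_codebook_update(embedding, cluster_size, embedding_avg,
                        flat_inputs, indices, decay=0.99, eps=1e-5):
    """One EMA step. embedding/embedding_avg: (d, K); cluster_size: (K,)."""
    K = embedding.shape[1]
    one_hot = np.eye(K)[indices]                       # (M, K) hard assignments
    new_counts = one_hot.sum(axis=0)                   # usage count per code
    new_sums = flat_inputs.T @ one_hot                 # (d, K) sum of assigned inputs
    cluster_size = decay * cluster_size + (1 - decay) * new_counts
    embedding_avg = decay * embedding_avg + (1 - decay) * new_sums
    # ε keeps codes with (near-)zero usage from dividing by zero (assumption)
    embedding = embedding_avg / np.maximum(cluster_size, eps)
    return embedding, cluster_size, embedding_avg
```

With `decay=0` the update degenerates to plain k-means: each code becomes the mean of the inputs assigned to it, which makes the formula easy to check.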
- **Straight-Through Estimator**:
  ```python
  quantized_st = inputs + (quantized - inputs).detach()
  ```
  - Forward: uses the quantized values
  - Backward: gradients flow through `inputs`, bypassing the non-differentiable argmin
- **Commitment Loss**:
  ```python
  loss = λ * ||quantized - inputs||²
  ```
  - Encourages the encoder to produce latents close to codebook entries
---

## 4. **MotionDecoder: Convolutional Upsampling**

### Purpose
Reconstructs the original motion from the quantized latent.

### Architecture
```python
Input: (B, 256, n)
Layer Structure (mirror of the encoder):
    ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
    ReLU + BatchNorm
    ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=1, padding=1)
    ReLU + BatchNorm
    Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1)  # Final layer, no activation
Output: (B, D_POSE, N)  # Restored to original dimensions
```

### Key Design Choices
- **ConvTranspose1D**: Learns the upsampling (more flexible than fixed interpolation)
- **output_padding**: Ensures exact size matching after strided convolutions
- **No activation on the final layer**: Allows an unrestricted output range
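The role of `output_padding` can be checked with the standard ConvTranspose1D output-length formula; `conv_transpose1d_out_len` is a hypothetical helper for shape checking:

```python
# Standard ConvTranspose1D output-length formula:
# out = (n - 1) * stride - 2 * padding + kernel + output_padding
def conv_transpose1d_out_len(n, kernel=3, stride=2, padding=1, output_padding=1):
    return (n - 1) * stride - 2 * padding + kernel + output_padding

n = 16
for _ in range(2):                    # two stride-2 transposed conv layers
    n = conv_transpose1d_out_len(n)
print(n)  # → 64 (16 * 4, undoing the encoder's 4x downsampling)
```

Without `output_padding=1` each layer would produce 2n−1 frames instead of 2n, so the decoder could never exactly recover even-length inputs.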
---

## 5. **Loss Function: Multi-Component Objective**

```python
Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel
```

### Components
1. **Reconstruction Loss** (L_rec):
   ```python
   L_rec = SmoothL1(reconstructed, target)
   ```
   - Main objective: match the overall motion
2. **Global/Root Loss** (L_global):
   ```python
   L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
   ```
   - Focuses on the first 4 dimensions:
     - Root rotation velocity
     - Root linear velocity (X/Z)
     - Root height
   - Weighted 1.5x to prioritize global motion
3. **Velocity Loss** (L_vel):
   ```python
   pred_vel   = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
   target_vel = target[:, :, 1:] - target[:, :, :-1]
   L_vel = SmoothL1(pred_vel, target_vel)
   ```
   - Ensures temporal smoothness
   - Prevents jittery motion
   - Weighted 2.0x for importance
4. **Commitment Loss** (L_commit):
   ```python
   L_commit = Σ ||quantized_v - inputs_v||²  # summed over all RVQ layers
   ```
   - From the RVQ: encourages encoder outputs to stay near the codebook
   - Weighted 0.02x (kept small to avoid over-constraining the encoder)
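Putting the four components together, a NumPy sketch of the objective (weights taken from the text; `smooth_l1` uses the standard Huber form with β = 1, an assumption about the exact variant):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 / Huber: quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def rvq_loss(recon, target, commit_loss,
             w_global=1.5, w_vel=2.0, w_commit=0.02):
    """recon/target: (B, D_POSE, N); commit_loss: scalar from the RVQ."""
    l_rec = smooth_l1(recon, target)
    l_global = smooth_l1(recon[:, :4], target[:, :4])        # root channels
    l_vel = smooth_l1(recon[:, :, 1:] - recon[:, :, :-1],    # frame-to-frame
                      target[:, :, 1:] - target[:, :, :-1])  # velocities
    return l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
```

A perfect reconstruction with zero commitment loss drives the total to exactly zero, which is a quick sanity check for the wiring.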
---

## 6. **Training Features**

### Quantization Dropout
```python
if training and rand() < 0.2:
    num_active_layers = randint(1, num_quantizers+1)
```
- Randomly uses 1 to V+1 quantization layers
- Improves robustness and generalization
- Forces lower layers to capture more information
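A stdlib sketch of this rule (the function name is illustrative). Note that Python's `random.randint` is inclusive on both ends, so the upper bound becomes `num_quantizers` rather than the exclusive `num_quantizers + 1` of the numpy-style pseudocode above:

```python
import random

def sample_active_layers(num_quantizers=6, p_dropout=0.2, training=True):
    """Return how many RVQ layers to use for this batch."""
    if training and random.random() < p_dropout:
        return random.randint(1, num_quantizers)  # random prefix of layers
    return num_quantizers                          # all layers active
```

Because only a prefix of layers survives, the early (coarse) quantizers must learn to stand on their own rather than lean on the fine layers.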
### Masking Support
```python
loss = mean_flat(error * mask) / (mask.sum() + ε)
```
- Handles variable-length sequences with padding
- Computes the loss only on valid frames
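One way to realize this in NumPy, assuming `mask` is (B, N) with 1 on valid frames and is broadcast over the D feature channels (the shapes and the squared-error choice are assumptions for illustration):

```python
import numpy as np

def masked_mse(pred, target, mask, eps=1e-8):
    """pred/target: (B, D, N); mask: (B, N) with 1 = valid frame, 0 = padding."""
    B, D, N = pred.shape
    error = (pred - target) ** 2
    masked = error * mask[:, None, :]             # zero out padded frames
    return masked.sum() / (mask.sum() * D + eps)  # mean over valid elements only
```

Padded frames contribute nothing to either the numerator or the denominator, so garbage values in the padding cannot inflate the loss.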
---

## 7. **Token Representation**

### Encoding to Tokens
```python
tokens = [indices_0, indices_1, ..., indices_V]  # List of (B, n) tensors
```
- Each token sequence corresponds to one RVQ layer
- Token values ∈ [0, 511] (for 512 codebook entries)
- Effective vocabulary: 512^(V+1) possible code combinations per timestep

### Decoding from Tokens
```python
quantized = Σ embedding_v[:, tokens_v]  # summed over layers v
reconstructed = decoder(quantized)
```
- Look up each layer's codes in that layer's codebook
- Sum all codes to get the final latent
- Pass the result through the decoder
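The lookup-and-sum step can be sketched as follows, with `codebooks` as a list of (d, K) arrays and `tokens` as a list of (B, n) integer arrays (names are illustrative):

```python
import numpy as np

def tokens_to_latent(tokens, codebooks):
    """Sum each RVQ layer's codebook lookup into one latent of shape (B, d, n)."""
    B, n = tokens[0].shape
    d = codebooks[0].shape[0]
    latent = np.zeros((B, d, n))
    for tok, cb in zip(tokens, codebooks):
        latent += cb[:, tok].transpose(1, 0, 2)  # (d, B, n) → (B, d, n)
    return latent
```

`cb[:, tok]` uses NumPy advanced indexing: indexing a (d, K) array with a (B, n) integer array yields (d, B, n), which is then transposed back to the model's (B, d, n) layout.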
---

## 8. **Key Hyperparameters**

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `input_dim` | 263 | Motion feature dimension |
| `latent_dim` | 256 | Bottleneck dimension |
| `downsampling_ratio` | 4 | Temporal compression (N → N/4) |
| `num_quantizers` | 6 | RVQ hierarchy depth (V+1) |
| `num_embeddings` | 512 | Codebook size per layer |
| `commitment_cost` | 1.0 | Weight for commitment loss |
| `decay` | 0.99 | EMA decay for codebook updates |
| `quantization_dropout` | 0.2 | Probability of layer dropout |

---
## 9. **Usage Example**

```python
# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)

# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion)  # List of (B, n) discrete tokens

# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)
```

---
license: apache-2.0
---