Architecture of Motion-S RVQ-VAE
This implementation is a Residual Vector Quantized Variational Autoencoder (RVQ-VAE) designed for motion sequence compression and tokenization. The sections below break down each component:
1. Overall Architecture Flow
Input Motion (B, D_POSE, N)
↓
[Encoder] → Continuous Latent (B, d, n)
↓
[RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
↓
[Decoder] → Reconstructed Motion (B, D_POSE, N)
Where:
- B = Batch size
- D_POSE = Motion feature dimension (e.g., 263 for body pose)
- N = Original sequence length (frames)
- d = Latent dimension (default 256)
- n = Downsampled sequence length (N // downsampling_ratio)
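As a quick sanity check on these shapes (B=32 and N=196 are hypothetical values chosen for illustration; only d=256 and the downsampling ratio of 4 come from this card):

```python
# Hypothetical batch size and sequence length; latent_dim and ratio are from the card.
B, D_POSE, N = 32, 263, 196      # batch, pose feature dim, frames
d, ratio = 256, 4                # latent dim, temporal downsampling ratio
n = N // ratio                   # downsampled sequence length

print((B, D_POSE, N))   # encoder input shape
print((B, d, n))        # latent / quantized shape -> (32, 256, 49)
```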
2. MotionEncoder: Convolutional Downsampling
Purpose
Compresses motion sequences both spatially (D_POSE → d) and temporally (N → n).
Architecture
Input: (B, D_POSE, N) # Treats D_POSE as channels, N as sequence length
Layer Structure (4 layers default):
Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1) # Temporal downsampling
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=2, padding=1) # More downsampling
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=1, padding=1) # Maintain resolution
ReLU + BatchNorm
Conv1D(512 → 256, kernel=3, stride=1, padding=1) # Project to latent_dim
ReLU + BatchNorm
Output: (B, 256, n) # n ≈ N/4 for downsampling_ratio=4
Key Design Choices
- Stride=2 for first log₂(ratio) layers: Achieves 4x downsampling with two stride-2 convolutions
- BatchNorm: Stabilizes training by normalizing activations
- 1D Convolutions: Efficient for sequential data vs 2D/RNNs
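A minimal PyTorch sketch of the layer structure above (class and argument names are assumptions; the card lists "ReLU + BatchNorm" after each conv, and the actual code's ordering within a block may differ):

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Sketch: two stride-2 convs give 4x temporal downsampling, then projection to d."""
    def __init__(self, input_dim=263, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(input_dim, hidden, 3, stride=2, padding=1),   # N -> N/2
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=2, padding=1),      # N/2 -> N/4
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=1, padding=1),      # maintain resolution
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, latent_dim, 3, stride=1, padding=1),  # project to latent_dim
            nn.ReLU(), nn.BatchNorm1d(latent_dim),
        )

    def forward(self, x):      # x: (B, D_POSE, N)
        return self.net(x)     # (B, d, N // 4)

enc = MotionEncoder()
z = enc(torch.randn(2, 263, 64))
print(z.shape)  # torch.Size([2, 256, 16])
```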
3. ResidualVectorQuantizer (RVQ): Hierarchical Quantization
Purpose
Converts continuous latents into discrete tokens using a codebook hierarchy.
Core Concept: Residual Quantization
Instead of quantizing once, RVQ quantizes the residual error iteratively:
Step 0: Quantize input → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual → bⱽ = Qᵥ(rⱽ)
Final Output: b⁰ + b¹ + ... + bⱽ # Sum of all quantized codes
Architecture
num_quantizers = 6 # V+1 layers (0 to 5)
For each layer v:
1. Calculate distances to codebook:
distances = ||z - embedding||² # (B*n, num_embeddings)
2. Find nearest code:
indices = argmin(distances) # (B*n,)
3. Lookup quantized vector:
quantized = embedding[:, indices] # (B, d, n)
4. Compute next residual:
residual = residual - quantized
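The per-layer loop above can be sketched with plain tensors. The codebooks here are random, purely to show the mechanics (the real model learns them via EMA), and they are stored as (num_embeddings, d) for lookup convenience, whereas the card's embedding is (d, num_embeddings):

```python
import torch

def rvq_encode(z, codebooks):
    """z: (B, d, n); codebooks: list of (num_embeddings, d) tensors.
    Returns the summed quantized latent and per-layer token indices."""
    B, d, n = z.shape
    residual = z.permute(0, 2, 1).reshape(-1, d)   # (B*n, d)
    quantized_sum = torch.zeros_like(residual)
    all_indices = []
    for cb in codebooks:
        dist = torch.cdist(residual, cb)           # distances to codebook, (B*n, num_embeddings)
        idx = dist.argmin(dim=1)                   # nearest code per vector
        q = cb[idx]                                # codebook lookup
        quantized_sum += q
        residual = residual - q                    # next layer quantizes this error
        all_indices.append(idx.reshape(B, n))
    return quantized_sum.reshape(B, n, d).permute(0, 2, 1), all_indices

z = torch.randn(2, 256, 8)
codebooks = [torch.randn(512, 256) for _ in range(6)]
q, tokens = rvq_encode(z, codebooks)
print(q.shape, len(tokens), tokens[0].shape)  # torch.Size([2, 256, 8]) 6 torch.Size([2, 8])
```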
VectorQuantizer: Single-Layer Quantization
Each layer has:
Codebook:
embedding: tensor of shape (d, num_embeddings) = (256, 512)
- 512 learnable code vectors, each of dimension 256
EMA Updates (Exponential Moving Average):
cluster_size = (1-decay) * new_counts + decay * old_counts
embedding_avg = (1-decay) * new_codes + decay * old_codes
embedding = embedding_avg / cluster_size # Normalize
- Prevents codebook collapse (dead codes)
- No explicit gradient descent on codebook
Straight-Through Estimator:
quantized_st = inputs + (quantized - inputs).detach()
- Forward: Use quantized values
- Backward: Gradients flow through inputs (bypassing non-differentiable argmin)
Commitment Loss:
loss = λ * ||quantized - inputs||²
- Encourages the encoder to produce latents close to codebook entries
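Putting the single-layer pieces together, a minimal sketch of one quantizer step (EMA bookkeeping omitted; nearest-code lookup, commitment loss, and the straight-through estimator):

```python
import torch

def vq_layer(inputs, embedding, commitment_cost=1.0):
    """inputs: (B, d, n); embedding: (d, num_embeddings), as in the card."""
    B, d, n = inputs.shape
    flat = inputs.permute(0, 2, 1).reshape(-1, d)             # (B*n, d)
    dist = torch.cdist(flat, embedding.t())                   # distances to all codes
    indices = dist.argmin(dim=1)                              # (B*n,)
    quantized = embedding.t()[indices]                        # (B*n, d) lookup
    quantized = quantized.reshape(B, n, d).permute(0, 2, 1)   # back to (B, d, n)
    # Commitment loss: pull encoder outputs toward the chosen codes.
    loss = commitment_cost * torch.mean((quantized.detach() - inputs) ** 2)
    # Straight-through: forward pass uses quantized values,
    # backward pass routes gradients to `inputs`, bypassing argmin.
    quantized_st = inputs + (quantized - inputs).detach()
    return quantized_st, indices.reshape(B, n), loss

x = torch.randn(2, 256, 8, requires_grad=True)
emb = torch.randn(256, 512)
q, idx, loss = vq_layer(x, emb)
loss.backward()
print(q.shape, idx.shape, x.grad is not None)
```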
4. MotionDecoder: Convolutional Upsampling
Purpose
Reconstructs original motion from quantized latent.
Architecture
Input: (B, 256, n)
Layer Structure (mirror of encoder):
ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
ReLU + BatchNorm
ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=1, padding=1)
ReLU + BatchNorm
Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1) # Final layer, no activation
Output: (B, D_POSE, N) # Restored to original dimensions
Key Design Choices
- ConvTranspose1D: Learns upsampling (better than fixed interpolation)
- output_padding: Ensures exact size matching after strided convolutions
- No activation on final layer: Allows unrestricted output range
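The role of output_padding can be checked directly: with kernel=3, stride=2, padding=1, a transposed conv maps length n to 2n-1 unless output_padding=1 restores the exact 2x size (a standalone check, not the model's actual decoder):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 16)  # quantized latent with n = 16

up_bad  = nn.ConvTranspose1d(256, 512, 3, stride=2, padding=1)                    # no output_padding
up_good = nn.ConvTranspose1d(256, 512, 3, stride=2, padding=1, output_padding=1)

print(up_bad(x).shape)   # torch.Size([1, 512, 31]) -> off by one
print(up_good(x).shape)  # torch.Size([1, 512, 32]) -> exact 2x upsampling
```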
5. Loss Function: Multi-Component Objective
Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel
Components
Reconstruction Loss (L_rec):
L_rec = SmoothL1(reconstructed, target)
- Main objective: match the overall motion
Global/Root Loss (L_global):
L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
- Focuses on the first 4 dimensions:
  - Root rotation velocity
  - Root linear velocity (X/Z)
  - Root height
- Weighted 1.5x to prioritize global motion
Velocity Loss (L_vel):
pred_vel = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
target_vel = target[:, :, 1:] - target[:, :, :-1]
L_vel = SmoothL1(pred_vel, target_vel)
- Ensures temporal smoothness
- Prevents jittery motion
- Weighted 2.0x for importance
Commitment Loss (L_commit):
L_commit = Σᵥ ||quantized_v - inputs_v||² # summed over all RVQ layers
- From RVQ: encourages encoder outputs near codebook entries
- Weighted 0.02x (small, to avoid over-constraining the encoder)
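The full objective can be sketched as follows. The weights (1.5, 2.0, 0.02) come from the card; the function name matches the compute_rvq_loss call in the usage example, but its exact signature is an assumption:

```python
import torch
import torch.nn.functional as F

def compute_rvq_loss(recon, target, commit_loss,
                     w_global=1.5, w_vel=2.0, w_commit=0.02):
    """Sketch of the multi-component objective; weights taken from the card."""
    l_rec = F.smooth_l1_loss(recon, target)                    # overall reconstruction
    l_global = F.smooth_l1_loss(recon[:, :4], target[:, :4])   # root channels only
    pred_vel = recon[:, :, 1:] - recon[:, :, :-1]              # frame-to-frame differences
    target_vel = target[:, :, 1:] - target[:, :, :-1]
    l_vel = F.smooth_l1_loss(pred_vel, target_vel)             # temporal smoothness
    total = l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
    return total, {"rec": l_rec.item(), "global": l_global.item(),
                   "vel": l_vel.item(), "commit": float(commit_loss)}

recon, target = torch.randn(2, 263, 32), torch.randn(2, 263, 32)
total, parts = compute_rvq_loss(recon, target, commit_loss=torch.tensor(0.1))
print(sorted(parts))  # ['commit', 'global', 'rec', 'vel']
```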
6. Training Features
Quantization Dropout
if training and rand() < 0.2:
num_active_layers = randint(1, num_quantizers+1)
- Randomly uses 1 to V+1 quantization layers
- Improves robustness and generalization
- Forces lower layers to capture more information
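The dropout rule above as a standalone helper (the 0.2 probability and 6 layers come from the card; note Python's random.randint is inclusive on both ends, unlike the exclusive upper bound in the pseudocode above):

```python
import random

def active_layers(num_quantizers=6, dropout_prob=0.2, training=True):
    """With probability 0.2 during training, keep only a random prefix of RVQ layers."""
    if training and random.random() < dropout_prob:
        return random.randint(1, num_quantizers)  # 1..6 layers active
    return num_quantizers                         # otherwise use all layers

counts = {active_layers() for _ in range(1000)}
print(min(counts), max(counts))  # layer counts stay within 1..6
```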
Masking Support
loss = mean_flat(error * mask) / (mask.sum() + ε)
- Handles variable-length sequences with padding
- Only computes loss on valid frames
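A sketch of the masked loss for padded batches (the mask marks valid frames; the normalization here averages over valid positions, which may differ in detail from the real code's mean_flat):

```python
import torch

def masked_smooth_l1(pred, target, mask, eps=1e-8):
    """pred/target: (B, D, N); mask: (B, 1, N) with 1 for valid frames, 0 for padding."""
    error = torch.nn.functional.smooth_l1_loss(pred, target, reduction="none")
    # Zero out padded frames, then average only over valid positions.
    return (error * mask).sum() / (mask.sum() * pred.shape[1] + eps)

pred = torch.randn(2, 263, 10)
target = pred.clone()
target[:, :, 5:] += 100.0          # corrupt only the padded tail
mask = torch.ones(2, 1, 10)
mask[:, :, 5:] = 0                 # mark the tail as padding
print(masked_smooth_l1(pred, target, mask).item())  # 0.0 -> padding is ignored
```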
7. Token Representation
Encoding to Tokens
tokens = [indices_0, indices_1, ..., indices_V] # List of (B, n) tensors
- Each token sequence represents one RVQ layer
- Token values ∈ [0, 511] (for 512 codebook entries)
- Total vocabulary size: 512^(V+1) combinations
Decoding from Tokens
quantized = Σ(embedding[:, tokens_v]) for v in layers
reconstructed = decoder(quantized)
- Lookup codes from each layer's codebook
- Sum all codes to get final latent
- Pass through decoder
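The token-to-latent step above is just codebook lookups plus a sum; a sketch using a (num_embeddings, d) codebook layout (the helper name is hypothetical, and the result would then be passed to the decoder):

```python
import torch

def tokens_to_latent(tokens, codebooks):
    """tokens: list of (B, n) index tensors, one per RVQ layer.
    codebooks: list of (num_embeddings, d) tensors. Returns (B, d, n)."""
    quantized = sum(cb[t] for t, cb in zip(tokens, codebooks))  # sum of per-layer codes, (B, n, d)
    return quantized.permute(0, 2, 1)                           # (B, d, n) for the decoder

codebooks = [torch.randn(512, 256) for _ in range(6)]
tokens = [torch.randint(0, 512, (2, 8)) for _ in range(6)]
latent = tokens_to_latent(tokens, codebooks)
print(latent.shape)  # torch.Size([2, 256, 8])
```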
8. Key Hyperparameters
| Parameter | Default | Purpose |
|---|---|---|
| input_dim | 263 | Motion feature dimension |
| latent_dim | 256 | Bottleneck dimension |
| downsampling_ratio | 4 | Temporal compression (N → N/4) |
| num_quantizers | 6 | RVQ hierarchy depth (V+1) |
| num_embeddings | 512 | Codebook size per layer |
| commitment_cost | 1.0 | Weight for commitment loss |
| decay | 0.99 | EMA decay for codebook updates |
| quantization_dropout | 0.2 | Probability of layer dropout |
9. Usage Example
# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)
# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion) # List of (B, n) discrete tokens
# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)
License: apache-2.0