
Architecture of Motion-S RVQ-VAE

This implementation is a Residual Vector Quantized Variational Autoencoder (RVQ-VAE) designed for motion sequence compression and tokenization. The sections below break down each component:


1. Overall Architecture Flow

Input Motion (B, D_POSE, N) 
    ↓
[Encoder] → Continuous Latent (B, d, n)
    ↓
[RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
    ↓
[Decoder] → Reconstructed Motion (B, D_POSE, N)

Where:

  • B = Batch size
  • D_POSE = Motion feature dimension (e.g., 263 for body pose)
  • N = Original sequence length (frames)
  • d = Latent dimension (default 256)
  • n = Downsampled sequence length (N // downsampling_ratio)

2. MotionEncoder: Convolutional Downsampling

Purpose

Compresses motion sequences both spatially (D_POSE → d) and temporally (N → n).

Architecture

Input: (B, D_POSE, N)  # Treats D_POSE as channels, N as sequence length

Layer Structure (4 layers default):
  Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1)  # Temporal downsampling
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=2, padding=1)    # More downsampling
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=1, padding=1)    # Maintain resolution
  ReLU + BatchNorm
  
  Conv1D(512 → 256, kernel=3, stride=1, padding=1)    # Project to latent_dim
  ReLU + BatchNorm

Output: (B, 256, n)  # n ≈ N/4 for downsampling_ratio=4

Key Design Choices

  • Stride=2 for first log₂(ratio) layers: Achieves 4x downsampling with two stride-2 convolutions
  • BatchNorm: Stabilizes training by normalizing activations
  • 1D Convolutions: Efficient for sequential data vs 2D/RNNs
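The encoder stack above can be sketched as follows (a minimal PyTorch sketch under the stated defaults; the class name and the exact ordering of ReLU/BatchNorm are assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

class MotionEncoderSketch(nn.Module):
    """Illustrative 4-layer conv encoder: (B, D_POSE, N) -> (B, latent_dim, N/4)."""
    def __init__(self, input_dim=263, latent_dim=256, width=512):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
                nn.ReLU(),
                nn.BatchNorm1d(c_out),
            )
        self.net = nn.Sequential(
            block(input_dim, width, stride=2),   # N -> N/2
            block(width, width, stride=2),       # N/2 -> N/4
            block(width, width, stride=1),       # keep resolution
            block(width, latent_dim, stride=1),  # project to latent_dim
        )

    def forward(self, x):   # x: (B, D_POSE, N)
        return self.net(x)  # -> (B, latent_dim, N // 4)

enc = MotionEncoderSketch()
z = enc(torch.randn(2, 263, 64))
print(z.shape)  # torch.Size([2, 256, 16])
```

Two stride-2 convolutions give the 4x temporal downsampling; with kernel=3 and padding=1 each stride-2 layer exactly halves the sequence length.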

3. ResidualVectorQuantizer (RVQ): Hierarchical Quantization

Purpose

Converts continuous latents into discrete tokens using a codebook hierarchy.

Core Concept: Residual Quantization

Instead of quantizing once, RVQ quantizes the residual error iteratively:

Step 0: Quantize input          → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual       → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual   → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual          → bⱽ = Qᵥ(rⱽ)

Final Output: Σ(b⁰, b¹, ..., bⱽ)  # Sum of all quantized codes
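The iteration above can be demonstrated with random codebooks (a NumPy sketch; sizes and the codebook contents are illustrative, and with trained codebooks each extra layer would typically shrink the remaining error):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_codes, num_quantizers = 8, 16, 6

# One hypothetical codebook per layer: num_codes vectors of dimension d.
codebooks = rng.normal(size=(num_quantizers, num_codes, d))

z = rng.normal(size=(d,))        # one continuous latent vector (r^0)
residual = z.copy()
quantized_sum = np.zeros_like(z)

for v in range(num_quantizers):
    # Nearest code in layer v's codebook (squared Euclidean distance).
    dists = ((codebooks[v] - residual) ** 2).sum(axis=1)
    code = codebooks[v][dists.argmin()]
    quantized_sum += code        # running sum b^0 + b^1 + ... + b^v
    residual = residual - code   # next layer quantizes what is left

# By construction, z minus the summed codes equals the final residual.
print(np.linalg.norm(residual))
```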

Architecture

num_quantizers = 6  # V+1 layers (0 to 5)

For each layer v:
  1. Calculate distances to codebook:
     distances = ||z - embedding||²  # (B*n, num_embeddings)
  
  2. Find nearest code:
     indices = argmin(distances)     # (B*n,)
  
  3. Lookup quantized vector:
     quantized = embedding[:, indices]  # (B, d, n)
  
  4. Compute next residual:
     residual = residual - quantized
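Steps 1–3 are usually computed in batch using the expansion ||z − e||² = ||z||² − 2·z·e + ||e||², which avoids materializing a (B·n, K, d) difference tensor. A sketch with the shapes from the text (random tensors, illustrative only):

```python
import torch

B, d, n, K = 2, 256, 16, 512
z = torch.randn(B, d, n)
embedding = torch.randn(d, K)  # codebook: K codes of dimension d

flat = z.permute(0, 2, 1).reshape(-1, d)  # (B*n, d)

# Squared distances via the ||z||^2 - 2 z.e + ||e||^2 expansion.
dists = (flat.pow(2).sum(1, keepdim=True)
         - 2 * flat @ embedding
         + embedding.pow(2).sum(0, keepdim=True))  # (B*n, K)

indices = dists.argmin(dim=1)  # (B*n,) nearest code per position
quantized = (embedding[:, indices]          # (d, B*n) lookup
             .t().reshape(B, n, d)
             .permute(0, 2, 1))             # back to (B, d, n)
print(quantized.shape)
```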

VectorQuantizer: Single-Layer Quantization

Each layer has:

  • Codebook: embedding tensor of shape (d, num_embeddings=512)

    • 512 learnable code vectors, each of dimension 256
  • EMA Updates (Exponential Moving Average):

    cluster_size = (1-decay) * new_counts + decay * old_counts
    embedding_avg = (1-decay) * new_codes + decay * old_codes
    embedding = embedding_avg / cluster_size  # Normalize
    
    • Prevents codebook collapse (dead codes)
    • No explicit gradient descent on codebook
  • Straight-Through Estimator:

    quantized_st = inputs + (quantized - inputs).detach()
    
    • Forward: Use quantized values
    • Backward: Gradients flow through inputs (bypassing non-differentiable argmin)
  • Commitment Loss:

    loss = λ * ||quantized - inputs||²
    
    • Encourages encoder to produce latents close to codebook entries
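The straight-through trick is a one-liner worth seeing in isolation. In this sketch, `round()` stands in for the (equally non-differentiable) codebook lookup:

```python
import torch

inputs = torch.randn(4, 8, requires_grad=True)
quantized = inputs.round()  # stand-in for argmin + codebook lookup

# Forward: quantized values. Backward: gradient flows to `inputs` unchanged,
# because the detached term contributes nothing to the graph.
quantized_st = inputs + (quantized - inputs).detach()

quantized_st.sum().backward()
print(inputs.grad)  # all ones: identity gradient through the quantizer
```

Note the commitment loss uses the opposite detachment: `(quantized.detach() - inputs).pow(2).mean()` pulls the encoder toward the codes while the codes themselves are trained by EMA, not by this gradient.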

4. MotionDecoder: Convolutional Upsampling

Purpose

Reconstructs original motion from quantized latent.

Architecture

Input: (B, 256, n)

Layer Structure (mirror of encoder):
  ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
  ReLU + BatchNorm
  
  ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=1, padding=1)
  ReLU + BatchNorm
  
  Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1)  # Final layer, no activation

Output: (B, D_POSE, N)  # Restored to original dimensions

Key Design Choices

  • ConvTranspose1D: Learns upsampling (better than fixed interpolation)
  • output_padding: Ensures exact size matching after strided convolutions
  • No activation on final layer: Allows unrestricted output range
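The decoder mirror can be sketched the same way as the encoder (again an illustrative PyTorch layout, not the repository's exact code):

```python
import torch
import torch.nn as nn

class MotionDecoderSketch(nn.Module):
    """Illustrative decoder: (B, latent_dim, n) -> (B, output_dim, 4*n)."""
    def __init__(self, latent_dim=256, output_dim=263, width=512):
        super().__init__()
        self.net = nn.Sequential(
            # output_padding=1 makes each transposed conv exactly double n.
            nn.ConvTranspose1d(latent_dim, width, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(), nn.BatchNorm1d(width),
            nn.ConvTranspose1d(width, width, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(), nn.BatchNorm1d(width),
            nn.Conv1d(width, width, 3, stride=1, padding=1),
            nn.ReLU(), nn.BatchNorm1d(width),
            nn.Conv1d(width, output_dim, 3, stride=1, padding=1),  # no activation
        )

    def forward(self, z):
        return self.net(z)

dec = MotionDecoderSketch()
x = dec(torch.randn(2, 256, 16))
print(x.shape)  # torch.Size([2, 263, 64])
```

With stride=2, kernel=3, padding=1, the transposed-conv output length is (n−1)·2 − 2 + 3 + output_padding, so output_padding=1 is what restores exactly 2n per layer.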

5. Loss Function: Multi-Component Objective

Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel

Components

  1. Reconstruction Loss (L_rec):

    L_rec = SmoothL1(reconstructed, target)
    
    • Main objective: Match overall motion
  2. Global/Root Loss (L_global):

    L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
    
    • Focuses on first 4 dimensions:
      • Root rotation velocity
      • Root linear velocity (X/Z)
      • Root height
    • Weighted 1.5x to prioritize global motion
  3. Velocity Loss (L_vel):

    pred_vel = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
    target_vel = target[:, :, 1:] - target[:, :, :-1]
    L_vel = SmoothL1(pred_vel, target_vel)
    
    • Ensures temporal smoothness
    • Prevents jittery motion
    • Weighted 2.0x for importance
  4. Commitment Loss (L_commit):

    L_commit = Σ(||quantized_v - inputs_v||²) for all RVQ layers
    
    • From RVQ: encourages encoder outputs near codebook
    • Weighted 0.02x (small to avoid over-constraining)
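Putting the four terms together (a sketch assuming the weights stated above; the function name and the returned dict keys are illustrative):

```python
import torch
import torch.nn.functional as F

def compute_rvq_loss_sketch(reconstructed, target, commit_loss,
                            w_global=1.5, w_vel=2.0, w_commit=0.02):
    """Multi-component objective: reconstruction + root + velocity + commitment."""
    l_rec = F.smooth_l1_loss(reconstructed, target)
    l_global = F.smooth_l1_loss(reconstructed[:, :4], target[:, :4])  # root dims
    # Finite-difference velocities along the time axis.
    pred_vel = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
    target_vel = target[:, :, 1:] - target[:, :, :-1]
    l_vel = F.smooth_l1_loss(pred_vel, target_vel)
    total = l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
    return total, {"rec": l_rec, "global": l_global, "vel": l_vel}

recon, target = torch.randn(2, 263, 64), torch.randn(2, 263, 64)
total, parts = compute_rvq_loss_sketch(recon, target, commit_loss=torch.tensor(0.1))
```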

6. Training Features

Quantization Dropout

if training and rand() < 0.2:
    num_active_layers = randint(1, num_quantizers+1)
  • Randomly uses 1 to V+1 quantization layers
  • Improves robustness and generalization
  • Forces lower layers to capture more information

Masking Support

loss = mean_flat(error * mask) / (mask.sum() + ε)
  • Handles variable-length sequences with padding
  • Only computes loss on valid frames
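One common masking convention is sketched below (an assumption about shapes: mask is (B, 1, N) with 1 on valid frames, and the normalizer counts only valid elements):

```python
import torch

error = torch.randn(2, 263, 64) ** 2  # per-element reconstruction error
mask = torch.ones(2, 1, 64)
mask[0, :, 48:] = 0                    # first clip is padding after frame 48

eps = 1e-8
# Zero out padded frames, then average over valid elements only.
loss = (error * mask).sum() / (mask.sum() * error.shape[1] + eps)
print(loss)
```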

7. Token Representation

Encoding to Tokens

tokens = [indices_0, indices_1, ..., indices_V]  # List of (B, n) tensors
  • Each token sequence represents one RVQ layer
  • Token values ∈ [0, 511] (for 512 codebook entries)
  • Effective vocabulary: 512^(V+1) possible code combinations per time step

Decoding from Tokens

quantized = Σ(embedding[:, tokens_v]) for v in layers
reconstructed = decoder(quantized)
  • Lookup codes from each layer's codebook
  • Sum all codes to get final latent
  • Pass through decoder
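The decode side is a per-layer lookup followed by a sum (a shapes-only sketch with random codebooks and random tokens; the real model would use its trained codebooks and then run the decoder):

```python
import torch

B, d, n, K, V1 = 2, 256, 16, 512, 6      # V1 = num_quantizers = V+1
codebooks = [torch.randn(d, K) for _ in range(V1)]

# Encoding yields one (B, n) index tensor per RVQ layer...
tokens = [torch.randint(0, K, (B, n)) for _ in range(V1)]

# ...and decoding sums the looked-up codes to rebuild the latent.
quantized = torch.zeros(B, d, n)
for emb, idx in zip(codebooks, tokens):
    codes = emb[:, idx.reshape(-1)]                       # (d, B*n)
    quantized += codes.t().reshape(B, n, d).permute(0, 2, 1)

print(quantized.shape)  # torch.Size([2, 256, 16]) -> feed to decoder
```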

8. Key Hyperparameters

Parameter             Default   Purpose
input_dim             263       Motion feature dimension
latent_dim            256       Bottleneck dimension
downsampling_ratio    4         Temporal compression (N → N/4)
num_quantizers        6         RVQ hierarchy depth (V+1)
num_embeddings        512       Codebook size per layer
commitment_cost       1.0       Weight for commitment loss
decay                 0.99      EMA decay for codebook updates
quantization_dropout  0.2       Probability of layer dropout


9. Usage Example

# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)

# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion)  # List of (B, n) discrete tokens

# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)

license: apache-2.0
