Architecture of Motion-S RVQ-VAE
This implementation is a Residual Vector Quantized Variational Autoencoder (RVQ-VAE) designed for motion sequence compression and tokenization. The sections below break down each component:
1. Overall Architecture Flow
Input Motion (B, D_POSE, N)
↓
[Encoder] → Continuous Latent (B, d, n)
↓
[RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
↓
[Decoder] → Reconstructed Motion (B, D_POSE, N)
Where:
- B = Batch size
- D_POSE = Motion feature dimension (e.g., 263 for body pose)
- N = Original sequence length (frames)
- d = Latent dimension (default 256)
- n = Downsampled sequence length (N // downsampling_ratio)
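As a quick sanity check on these shapes (B=32 and N=196 are hypothetical values chosen for illustration; only d=256 and the downsampling ratio of 4 come from this card):

```python
# Hypothetical batch size and sequence length; latent_dim and ratio are from the card.
B, D_POSE, N = 32, 263, 196      # batch, pose feature dim, frames
d, ratio = 256, 4                # latent dim, temporal downsampling ratio
n = N // ratio                   # downsampled sequence length

print((B, D_POSE, N))   # encoder input shape
print((B, d, n))        # latent / quantized shape -> (32, 256, 49)
```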
2. MotionEncoder: Convolutional Downsampling
Purpose
Compresses motion sequences both spatially (D_POSE → d) and temporally (N → n).
Architecture
Input: (B, D_POSE, N) # Treats D_POSE as channels, N as sequence length
Layer Structure (4 layers default):
Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1) # Temporal downsampling
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=2, padding=1) # More downsampling
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=1, padding=1) # Maintain resolution
ReLU + BatchNorm
Conv1D(512 → 256, kernel=3, stride=1, padding=1) # Project to latent_dim
ReLU + BatchNorm
Output: (B, 256, n) # n ≈ N/4 for downsampling_ratio=4
Key Design Choices
- Stride=2 for first log₂(ratio) layers: Achieves 4x downsampling with two stride-2 convolutions
- BatchNorm: Stabilizes training by normalizing activations
- 1D Convolutions: Efficient for sequential data vs 2D/RNNs
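A minimal PyTorch sketch of the layer structure above (class and argument names are assumptions; the card lists "ReLU + BatchNorm" after each conv, and the actual code's ordering within a block may differ):

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Sketch: two stride-2 convs give 4x temporal downsampling, then projection to d."""
    def __init__(self, input_dim=263, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(input_dim, hidden, 3, stride=2, padding=1),   # N -> N/2
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=2, padding=1),      # N/2 -> N/4
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=1, padding=1),      # maintain resolution
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, latent_dim, 3, stride=1, padding=1),  # project to latent_dim
            nn.ReLU(), nn.BatchNorm1d(latent_dim),
        )

    def forward(self, x):      # x: (B, D_POSE, N)
        return self.net(x)     # (B, d, N // 4)

enc = MotionEncoder()
z = enc(torch.randn(2, 263, 64))
print(z.shape)  # torch.Size([2, 256, 16])
```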
3. ResidualVectorQuantizer (RVQ): Hierarchical Quantization
Purpose
Converts continuous latents into discrete tokens using a codebook hierarchy.
Core Concept: Residual Quantization
Instead of quantizing once, RVQ quantizes the residual error iteratively:
Step 0: Quantize input → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual → bⱽ = Qᵥ(rⱽ)
Final Output: b⁰ + b¹ + ... + bⱽ # Sum of all quantized codes
Architecture
num_quantizers = 6 # V+1 layers (0 to 5)
For each layer v:
1. Calculate distances to codebook:
distances = ||z - embedding||² # (B*n, num_embeddings)
2. Find nearest code:
indices = argmin(distances) # (B*n,)
3. Lookup quantized vector:
quantized = embedding[:, indices] # (B, d, n)
4. Compute next residual:
residual = residual - quantized
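The per-layer loop above can be sketched with plain tensors. The codebooks here are random, purely to show the mechanics (the real model learns them via EMA), and they are stored as (num_embeddings, d) for lookup convenience, whereas the card's embedding is (d, num_embeddings):

```python
import torch

def rvq_encode(z, codebooks):
    """z: (B, d, n); codebooks: list of (num_embeddings, d) tensors.
    Returns the summed quantized latent and per-layer token indices."""
    B, d, n = z.shape
    residual = z.permute(0, 2, 1).reshape(-1, d)   # (B*n, d)
    quantized_sum = torch.zeros_like(residual)
    all_indices = []
    for cb in codebooks:
        dist = torch.cdist(residual, cb)           # distances to codebook, (B*n, num_embeddings)
        idx = dist.argmin(dim=1)                   # nearest code per vector
        q = cb[idx]                                # codebook lookup
        quantized_sum += q
        residual = residual - q                    # next layer quantizes this error
        all_indices.append(idx.reshape(B, n))
    return quantized_sum.reshape(B, n, d).permute(0, 2, 1), all_indices

z = torch.randn(2, 256, 8)
codebooks = [torch.randn(512, 256) for _ in range(6)]
q, tokens = rvq_encode(z, codebooks)
print(q.shape, len(tokens), tokens[0].shape)  # torch.Size([2, 256, 8]) 6 torch.Size([2, 8])
```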
VectorQuantizer: Single-Layer Quantization
Each layer has:
Codebook:
embedding: tensor of shape (d, num_embeddings) = (256, 512)
- 512 learnable code vectors, each of dimension 256
EMA Updates (Exponential Moving Average):
cluster_size = (1-decay) * new_counts + decay * old_counts
embedding_avg = (1-decay) * new_codes + decay * old_codes
embedding = embedding_avg / cluster_size # Normalize
- Prevents codebook collapse (dead codes)
- No explicit gradient descent on codebook
Straight-Through Estimator:
quantized_st = inputs + (quantized - inputs).detach()
- Forward: Use quantized values
- Backward: Gradients flow through inputs (bypassing non-differentiable argmin)
Commitment Loss:
loss = λ * ||quantized - inputs||²
- Encourages the encoder to produce latents close to codebook entries
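Putting the single-layer pieces together, a minimal sketch of one quantizer step (EMA bookkeeping omitted; nearest-code lookup, commitment loss, and the straight-through estimator):

```python
import torch

def vq_layer(inputs, embedding, commitment_cost=1.0):
    """inputs: (B, d, n); embedding: (d, num_embeddings), as in the card."""
    B, d, n = inputs.shape
    flat = inputs.permute(0, 2, 1).reshape(-1, d)             # (B*n, d)
    dist = torch.cdist(flat, embedding.t())                   # distances to all codes
    indices = dist.argmin(dim=1)                              # (B*n,)
    quantized = embedding.t()[indices]                        # (B*n, d) lookup
    quantized = quantized.reshape(B, n, d).permute(0, 2, 1)   # back to (B, d, n)
    # Commitment loss: pull encoder outputs toward the chosen codes.
    loss = commitment_cost * torch.mean((quantized.detach() - inputs) ** 2)
    # Straight-through: forward pass uses quantized values,
    # backward pass routes gradients to `inputs`, bypassing argmin.
    quantized_st = inputs + (quantized - inputs).detach()
    return quantized_st, indices.reshape(B, n), loss

x = torch.randn(2, 256, 8, requires_grad=True)
emb = torch.randn(256, 512)
q, idx, loss = vq_layer(x, emb)
loss.backward()
print(q.shape, idx.shape, x.grad is not None)
```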
4. MotionDecoder: Convolutional Upsampling
Purpose
Reconstructs original motion from quantized latent.
Architecture
Input: (B, 256, n)
Layer Structure (mirror of encoder):
ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
ReLU + BatchNorm
ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
ReLU + BatchNorm
Conv1D(512 → 512, kernel=3, stride=1, padding=1)
ReLU + BatchNorm
Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1) # Final layer, no activation
Output: (B, D_POSE, N) # Restored to original dimensions
Key Design Choices
- ConvTranspose1D: Learns upsampling (better than fixed interpolation)
- output_padding: Ensures exact size matching after strided convolutions
- No activation on final layer: Allows unrestricted output range
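The role of output_padding can be checked directly: with kernel=3, stride=2, padding=1, a transposed conv maps length n to 2n-1 unless output_padding=1 restores the exact 2x size (a standalone check, not the model's actual decoder):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 16)  # quantized latent with n = 16

up_bad  = nn.ConvTranspose1d(256, 512, 3, stride=2, padding=1)                    # no output_padding
up_good = nn.ConvTranspose1d(256, 512, 3, stride=2, padding=1, output_padding=1)

print(up_bad(x).shape)   # torch.Size([1, 512, 31]) -> off by one
print(up_good(x).shape)  # torch.Size([1, 512, 32]) -> exact 2x upsampling
```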
5. Loss Function: Multi-Component Objective
Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel
Components
Reconstruction Loss (L_rec):
L_rec = SmoothL1(reconstructed, target)
- Main objective: match the overall motion
Global/Root Loss (L_global):
L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
- Focuses on the first 4 dimensions:
  - Root rotation velocity
  - Root linear velocity (X/Z)
  - Root height
- Weighted 1.5x to prioritize global motion
Velocity Loss (L_vel):
pred_vel = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
target_vel = target[:, :, 1:] - target[:, :, :-1]
L_vel = SmoothL1(pred_vel, target_vel)
- Ensures temporal smoothness
- Prevents jittery motion
- Weighted 2.0x for importance
Commitment Loss (L_commit):
L_commit = Σᵥ ||quantized_v - inputs_v||² # summed over all RVQ layers
- From RVQ: encourages encoder outputs near codebook entries
- Weighted 0.02x (small, to avoid over-constraining the encoder)
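The full objective can be sketched as follows. The weights (1.5, 2.0, 0.02) come from the card; the function name matches the compute_rvq_loss call in the usage example, but its exact signature is an assumption:

```python
import torch
import torch.nn.functional as F

def compute_rvq_loss(recon, target, commit_loss,
                     w_global=1.5, w_vel=2.0, w_commit=0.02):
    """Sketch of the multi-component objective; weights taken from the card."""
    l_rec = F.smooth_l1_loss(recon, target)                    # overall reconstruction
    l_global = F.smooth_l1_loss(recon[:, :4], target[:, :4])   # root channels only
    pred_vel = recon[:, :, 1:] - recon[:, :, :-1]              # frame-to-frame differences
    target_vel = target[:, :, 1:] - target[:, :, :-1]
    l_vel = F.smooth_l1_loss(pred_vel, target_vel)             # temporal smoothness
    total = l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
    return total, {"rec": l_rec.item(), "global": l_global.item(),
                   "vel": l_vel.item(), "commit": float(commit_loss)}

recon, target = torch.randn(2, 263, 32), torch.randn(2, 263, 32)
total, parts = compute_rvq_loss(recon, target, commit_loss=torch.tensor(0.1))
print(sorted(parts))  # ['commit', 'global', 'rec', 'vel']
```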
6. Training Features
Quantization Dropout
if training and rand() < 0.2:
num_active_layers = randint(1, num_quantizers+1)
- Randomly uses 1 to V+1 quantization layers
- Improves robustness and generalization
- Forces lower layers to capture more information
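The dropout rule above as a standalone helper (the 0.2 probability and 6 layers come from the card; note Python's random.randint is inclusive on both ends, unlike the exclusive upper bound in the pseudocode above):

```python
import random

def active_layers(num_quantizers=6, dropout_prob=0.2, training=True):
    """With probability 0.2 during training, keep only a random prefix of RVQ layers."""
    if training and random.random() < dropout_prob:
        return random.randint(1, num_quantizers)  # 1..6 layers active
    return num_quantizers                         # otherwise use all layers

counts = {active_layers() for _ in range(1000)}
print(min(counts), max(counts))  # layer counts stay within 1..6
```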
Masking Support
loss = mean_flat(error * mask) / (mask.sum() + ε)
- Handles variable-length sequences with padding
- Only computes loss on valid frames
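A sketch of the masked loss for padded batches (the mask marks valid frames; the normalization here averages over valid positions, which may differ in detail from the real code's mean_flat):

```python
import torch

def masked_smooth_l1(pred, target, mask, eps=1e-8):
    """pred/target: (B, D, N); mask: (B, 1, N) with 1 for valid frames, 0 for padding."""
    error = torch.nn.functional.smooth_l1_loss(pred, target, reduction="none")
    # Zero out padded frames, then average only over valid positions.
    return (error * mask).sum() / (mask.sum() * pred.shape[1] + eps)

pred = torch.randn(2, 263, 10)
target = pred.clone()
target[:, :, 5:] += 100.0          # corrupt only the padded tail
mask = torch.ones(2, 1, 10)
mask[:, :, 5:] = 0                 # mark the tail as padding
print(masked_smooth_l1(pred, target, mask).item())  # 0.0 -> padding is ignored
```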
7. Token Representation
Encoding to Tokens
tokens = [indices_0, indices_1, ..., indices_V] # List of (B, n) tensors
- Each token sequence represents one RVQ layer
- Token values ∈ [0, 511] (for 512 codebook entries)
- Total vocabulary size: 512^(V+1) combinations
Decoding from Tokens
quantized = Σ(embedding[:, tokens_v]) for v in layers
reconstructed = decoder(quantized)
- Lookup codes from each layer's codebook
- Sum all codes to get final latent
- Pass through decoder
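The token-to-latent step above is just codebook lookups plus a sum; a sketch using a (num_embeddings, d) codebook layout (the helper name is hypothetical, and the result would then be passed to the decoder):

```python
import torch

def tokens_to_latent(tokens, codebooks):
    """tokens: list of (B, n) index tensors, one per RVQ layer.
    codebooks: list of (num_embeddings, d) tensors. Returns (B, d, n)."""
    quantized = sum(cb[t] for t, cb in zip(tokens, codebooks))  # sum of per-layer codes, (B, n, d)
    return quantized.permute(0, 2, 1)                           # (B, d, n) for the decoder

codebooks = [torch.randn(512, 256) for _ in range(6)]
tokens = [torch.randint(0, 512, (2, 8)) for _ in range(6)]
latent = tokens_to_latent(tokens, codebooks)
print(latent.shape)  # torch.Size([2, 256, 8])
```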
8. Key Hyperparameters
| Parameter | Default | Purpose |
|---|---|---|
| input_dim | 263 | Motion feature dimension |
| latent_dim | 256 | Bottleneck dimension |
| downsampling_ratio | 4 | Temporal compression (N → N/4) |
| num_quantizers | 6 | RVQ hierarchy depth (V+1) |
| num_embeddings | 512 | Codebook size per layer |
| commitment_cost | 1.0 | Weight for commitment loss |
| decay | 0.99 | EMA decay for codebook updates |
| quantization_dropout | 0.2 | Probability of layer dropout |
9. Usage Example
# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)
# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion) # List of (B, n) discrete tokens
# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)
License: apache-2.0