
Architecture of Motion-S RVQ-VAE

This implementation is a Residual Vector Quantized Variational Autoencoder (RVQ-VAE) designed for motion sequence compression and tokenization. Each component is described below:


1. Overall Architecture Flow

Input Motion (B, D_POSE, N) 
    ↓
[Encoder] → Continuous Latent (B, d, n)
    ↓
[RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
    ↓
[Decoder] → Reconstructed Motion (B, D_POSE, N)

Where:

  • B = Batch size
  • D_POSE = Motion feature dimension (e.g., 263 for body pose)
  • N = Original sequence length (frames)
  • d = Latent dimension (default 256)
  • n = Downsampled sequence length (N // downsampling_ratio)
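The shape bookkeeping above can be checked with a small helper (the function name is illustrative, not part of the codebase):

```python
def latent_shape(batch, pose_dim, frames, latent_dim=256, ratio=4):
    """Shape of the encoder output for a (batch, pose_dim, frames) input."""
    # pose_dim is compressed to latent_dim; frames are downsampled by `ratio`
    return (batch, latent_dim, frames // ratio)

print(latent_shape(8, 263, 64))  # (8, 256, 16): d = 256, n = 64 // 4
```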

2. MotionEncoder: Convolutional Downsampling

Purpose

Compresses motion sequences both spatially (D_POSE → d) and temporally (N → n).

Architecture

Input: (B, D_POSE, N)  # Treats D_POSE as channels, N as sequence length

Layer Structure (4 layers default):
  Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1)  # Temporal downsampling
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=2, padding=1)     # More downsampling
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=1, padding=1)     # Maintain resolution
  ReLU + BatchNorm
  
  Conv1D(512 → 256, kernel=3, stride=1, padding=1)     # Project to latent_dim
  ReLU + BatchNorm

Output: (B, 256, n)  # n ≈ N/4 for downsampling_ratio=4

Key Design Choices

  • Stride=2 for first log₂(ratio) layers: Achieves 4x downsampling with two stride-2 convolutions
  • BatchNorm: Stabilizes training by normalizing activations
  • 1D Convolutions: Efficient for sequential data compared with 2D convolutions or RNNs
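The layer structure above can be sketched in PyTorch as follows (widths and strides follow the text; the real Motion-S code may differ in detail):

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Minimal sketch of the 4-layer convolutional encoder described above."""
    def __init__(self, input_dim=263, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(input_dim, hidden, 3, stride=2, padding=1),   # N -> N/2
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=2, padding=1),      # N/2 -> N/4
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, hidden, 3, stride=1, padding=1),      # keep resolution
            nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Conv1d(hidden, latent_dim, 3, stride=1, padding=1),  # project to d
            nn.ReLU(), nn.BatchNorm1d(latent_dim),
        )

    def forward(self, x):      # x: (B, D_POSE, N)
        return self.net(x)     # (B, latent_dim, N // 4)

enc = MotionEncoder()
z = enc(torch.randn(2, 263, 64))
print(z.shape)  # torch.Size([2, 256, 16])
```

With kernel=3 and padding=1, each stride-2 layer halves the sequence length exactly, so two such layers give the 4x temporal compression.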

3. ResidualVectorQuantizer (RVQ): Hierarchical Quantization

Purpose

Converts continuous latents into discrete tokens using a codebook hierarchy.

Core Concept: Residual Quantization

Instead of quantizing once, RVQ quantizes the residual error iteratively:

Step 0: Quantize input          → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual       → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual   → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual          → bⱽ = Qᵥ(rⱽ), where rⱽ = rⱽ⁻¹ - bⱽ⁻¹

Final Output: Σ(b⁰, b¹, ..., bⱽ)  # Sum of all quantized codes
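The iterative scheme above can be demonstrated on a 1-D toy example with fixed multi-scale codebooks (illustrative only: the real RVQ quantizes 256-dimensional vectors with learned codebooks):

```python
import numpy as np

# 4 layers, 9 codes each; each layer's codebook covers a finer scale
codebooks = [np.linspace(-1.0, 1.0, 9) * (0.25 ** v) for v in range(4)]

x = 0.3173
residual, total = x, 0.0
for cb in codebooks:
    idx = int(np.argmin(np.abs(cb - residual)))  # nearest code at this scale
    total += cb[idx]                             # accumulate quantized codes
    residual -= cb[idx]                          # next layer quantizes the error
print(abs(x - total))  # reconstruction error shrinks as layers are summed
```

Each additional layer refines the approximation of the original value, which is why deeper RVQ hierarchies reconstruct more faithfully.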

Architecture

num_quantizers = 6  # V+1 layers (0 to 5)

For each layer v:
  1. Calculate distances to codebook:
     distances = ||z - embedding||²  # (B*n, num_embeddings)
  
  2. Find nearest code:
     indices = argmin(distances)     # (B*n,)
  
  3. Lookup quantized vector:
     quantized = embedding[:, indices]  # (B, d, n)
  
  4. Compute next residual:
     residual = residual - quantized

VectorQuantizer: Single-Layer Quantization

Each layer has:

  • Codebook: embedding tensor of shape (d, num_embeddings=512)

    • 512 learnable code vectors, each of dimension 256
  • EMA Updates (Exponential Moving Average):

    cluster_size = (1-decay) * new_counts + decay * old_counts
    embedding_avg = (1-decay) * new_codes + decay * old_codes
    embedding = embedding_avg / cluster_size  # Normalize
    
    • Prevents codebook collapse (dead codes)
    • No explicit gradient descent on codebook
  • Straight-Through Estimator:

    quantized_st = inputs + (quantized - inputs).detach()
    
    • Forward: Use quantized values
    • Backward: Gradients flow through inputs (bypassing non-differentiable argmin)
  • Commitment Loss:

    loss = λ * ||quantized - inputs||²
    
    • Encourages encoder to produce latents close to codebook entries
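The pieces above (nearest-code lookup, straight-through estimator, commitment loss) fit together in a single-layer forward pass roughly as follows; this is a minimal sketch assuming a `(d, K)` codebook layout as in the text, with EMA bookkeeping omitted:

```python
import torch
import torch.nn.functional as F

def vq_forward(z, codebook, commitment_cost=1.0):
    """z: (B, d, n); codebook: (d, K). Returns ST-quantized z, indices, loss."""
    B, d, n = z.shape
    flat = z.permute(0, 2, 1).reshape(-1, d)            # (B*n, d)
    # squared distances ||z - e||^2 to every codebook entry, expanded
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook
            + codebook.pow(2).sum(0, keepdim=True))     # (B*n, K)
    idx = dist.argmin(dim=1)                            # nearest code per vector
    quantized = codebook[:, idx].reshape(d, B, n).permute(1, 0, 2)  # (B, d, n)
    # commitment loss: pull encoder outputs toward the (frozen) codes
    commit = commitment_cost * F.mse_loss(quantized.detach(), z)
    # straight-through: forward uses quantized, backward copies gradients to z
    quantized_st = z + (quantized - z).detach()
    return quantized_st, idx.reshape(B, n), commit

z = torch.randn(2, 256, 16, requires_grad=True)
codebook = torch.randn(256, 512)
q, idx, commit = vq_forward(z, codebook)
q.sum().backward()  # gradients reach z despite the non-differentiable argmin
```

Because the codebook is updated by EMA rather than backpropagation, only the commitment term and the straight-through path carry gradients.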

4. MotionDecoder: Convolutional Upsampling

Purpose

Reconstructs original motion from quantized latent.

Architecture

Input: (B, 256, n)

Layer Structure (mirror of encoder):
  ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
  ReLU + BatchNorm
  
  ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
  ReLU + BatchNorm
  
  Conv1D(512 → 512, kernel=3, stride=1, padding=1)
  ReLU + BatchNorm
  
  Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1)  # Final layer, no activation

Output: (B, D_POSE, N)  # Restored to original dimensions

Key Design Choices

  • ConvTranspose1D: Learns upsampling (better than fixed interpolation)
  • output_padding: Ensures exact size matching after strided convolutions
  • No activation on final layer: Allows unrestricted output range
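The role of output_padding can be verified against PyTorch's output-length formula for ConvTranspose1d:

```python
def conv_transpose1d_out(n, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a ConvTranspose1d layer (PyTorch formula)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

print(conv_transpose1d_out(16))  # 32: exact 2x upsampling
```

With output_padding=0 the same layer would produce 31 frames, so the two stride-2 transposed convolutions would not exactly invert the encoder's downsampling.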

5. Loss Function: Multi-Component Objective

Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel

Components

  1. Reconstruction Loss (L_rec):

    L_rec = SmoothL1(reconstructed, target)
    
    • Main objective: Match overall motion
  2. Global/Root Loss (L_global):

    L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
    
    • Focuses on first 4 dimensions:
      • Root rotation velocity
      • Root linear velocity (X/Z)
      • Root height
    • Weighted 1.5x to prioritize global motion
  3. Velocity Loss (L_vel):

    pred_vel = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
    target_vel = target[:, :, 1:] - target[:, :, :-1]
    L_vel = SmoothL1(pred_vel, target_vel)
    
    • Ensures temporal smoothness
    • Prevents jittery motion
    • Weighted 2.0x for importance
  4. Commitment Loss (L_commit):

    L_commit = Σ(||quantized_v - inputs_v||²) for all RVQ layers
    
    • From RVQ: encourages encoder outputs near codebook
    • Weighted 0.02x (small to avoid over-constraining)
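The four terms combine roughly as sketched below; the weights follow the text (1.5 global, 2.0 velocity, 0.02 commitment), but the exact implementation details are assumptions:

```python
import torch
import torch.nn.functional as F

def compute_rvq_loss(recon, target, commit_loss,
                     w_global=1.5, w_vel=2.0, w_commit=0.02):
    """recon, target: (B, D_POSE, N); commit_loss: scalar from the RVQ."""
    l_rec = F.smooth_l1_loss(recon, target)
    # first 4 feature channels: root rotation/linear velocity and height
    l_global = F.smooth_l1_loss(recon[:, :4], target[:, :4])
    # frame-to-frame differences penalize jitter
    l_vel = F.smooth_l1_loss(recon[:, :, 1:] - recon[:, :, :-1],
                             target[:, :, 1:] - target[:, :, :-1])
    total = l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
    return total, {"rec": l_rec, "global": l_global,
                   "vel": l_vel, "commit": commit_loss}

recon = torch.zeros(2, 263, 8)
target = torch.zeros(2, 263, 8)
total, parts = compute_rvq_loss(recon, target, torch.tensor(0.5))
print(float(total))  # 0.01: only the 0.02-weighted commitment term remains
```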

6. Training Features

Quantization Dropout

if training and rand() < 0.2:
    num_active_layers = randint(1, num_quantizers+1)
  • Randomly uses 1 to V+1 quantization layers
  • Improves robustness and generalization
  • Forces lower layers to capture more information
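The sampling rule above amounts to the following sketch (function name and rng plumbing are illustrative):

```python
import random

def sample_active_layers(num_quantizers=6, p_dropout=0.2, rng=random):
    """With probability p_dropout, keep only a random prefix of RVQ layers."""
    if rng.random() < p_dropout:
        return rng.randint(1, num_quantizers)  # inclusive: 1 .. num_quantizers
    return num_quantizers                      # otherwise use the full stack

rng = random.Random(0)
samples = [sample_active_layers(rng=rng) for _ in range(200)]
```

Because reconstruction must still succeed when later layers are dropped, the earliest codebooks are pushed to carry the coarse structure of the motion.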

Masking Support

loss = mean_flat(error * mask) / (mask.sum() + ε)
  • Handles variable-length sequences with padding
  • Only computes loss on valid frames
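A minimal version of the masked average (the epsilon value is an assumption):

```python
import torch

def masked_loss(error, mask, eps=1e-8):
    """error: (B, D, N); mask: (B, 1, N), 1 on valid frames, 0 on padding."""
    # padded positions contribute nothing; normalize by the valid count
    return (error * mask).sum() / (mask.sum() + eps)

error = torch.ones(1, 1, 4)
mask = torch.tensor([[[1.0, 1.0, 0.0, 0.0]]])
print(float(masked_loss(error, mask)))  # ≈ 1.0: averaged over the 2 valid frames
```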

7. Token Representation

Encoding to Tokens

tokens = [indices_0, indices_1, ..., indices_V]  # List of (B, n) tensors
  • Each token sequence represents one RVQ layer
  • Token values ∈ [0, 511] (for 512 codebook entries)
  • Combined code space: 512^(V+1) possible layer combinations per timestep (each layer's own vocabulary is 512)

Decoding from Tokens

quantized = Σ(embedding[:, tokens_v]) for v in layers
reconstructed = decoder(quantized)
  • Lookup codes from each layer's codebook
  • Sum all codes to get final latent
  • Pass through decoder
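The lookup-and-sum step can be sketched as follows, assuming the `(d, K)` codebook layout used earlier (the decoder call is omitted):

```python
import torch

def tokens_to_latent(tokens, codebooks):
    """tokens: list of (B, n) index tensors; codebooks: list of (d, K) tensors."""
    quantized = None
    for idx, cb in zip(tokens, codebooks):
        code = cb[:, idx].permute(1, 0, 2)   # lookup -> (B, d, n)
        quantized = code if quantized is None else quantized + code
    return quantized                          # sum over all RVQ layers

codebooks = [torch.randn(4, 8) for _ in range(2)]   # 2 layers, d=4, K=8
tokens = [torch.zeros(1, 3, dtype=torch.long),
          torch.ones(1, 3, dtype=torch.long)]       # B=1, n=3
latent = tokens_to_latent(tokens, codebooks)
print(latent.shape)  # torch.Size([1, 4, 3])
```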

8. Key Hyperparameters

| Parameter | Default | Purpose |
|---|---|---|
| input_dim | 263 | Motion feature dimension |
| latent_dim | 256 | Bottleneck dimension |
| downsampling_ratio | 4 | Temporal compression (N → N/4) |
| num_quantizers | 6 | RVQ hierarchy depth (V+1) |
| num_embeddings | 512 | Codebook size per layer |
| commitment_cost | 1.0 | Weight for commitment loss |
| decay | 0.99 | EMA decay for codebook updates |
| quantization_dropout | 0.2 | Probability of layer dropout |


9. Usage Example

# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)

# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion)  # List of (B, n) discrete tokens

# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)

license: apache-2.0