# Architecture of Motion-S RVQ-VAE

This implementation is a **Residual Vector Quantized Variational Autoencoder (RVQ-VAE)** designed for motion sequence compression and tokenization. Let me break down each component:

---

## 1. **Overall Architecture Flow**

```
Input Motion (B, D_POSE, N)
        ↓
   [Encoder] → Continuous Latent (B, d, n)
        ↓
   [RVQ] → Quantized Latent (B, d, n) + Discrete Tokens
        ↓
   [Decoder] → Reconstructed Motion (B, D_POSE, N)
```

Where:
- **B** = Batch size
- **D_POSE** = Motion feature dimension (e.g., 263 for body pose)
- **N** = Original sequence length (frames)
- **d** = Latent dimension (default 256)
- **n** = Downsampled sequence length (`N // downsampling_ratio`)
---

## 2. **MotionEncoder: Convolutional Downsampling**

### Purpose
Compresses motion sequences both **spatially** (D_POSE → d) and **temporally** (N → n).

### Architecture
```python
Input: (B, D_POSE, N)  # Treats D_POSE as channels, N as sequence length
Layer Structure (4 layers default):
    Conv1D(D_POSE → 512, kernel=3, stride=2, padding=1)  # Temporal downsampling
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=2, padding=1)     # More downsampling
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=1, padding=1)     # Maintain resolution
    ReLU + BatchNorm
    Conv1D(512 → 256, kernel=3, stride=1, padding=1)     # Project to latent_dim
    ReLU + BatchNorm
Output: (B, 256, n)  # n ≈ N/4 for downsampling_ratio=4
```

### Key Design Choices
- **Stride=2 for the first log₂(ratio) layers**: Achieves 4x downsampling with two stride-2 convolutions
- **BatchNorm**: Stabilizes training by normalizing activations
- **1D Convolutions**: Efficient for sequential data compared with 2D convolutions or RNNs
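The downsampling arithmetic can be sanity-checked with the standard Conv1D output-length formula; `conv1d_out_len` and `encoder_out_len` below are hypothetical helpers for shape checking, not part of the implementation.

```python
# Standard Conv1D output-length formula:
# out = floor((N + 2*padding - kernel) / stride) + 1
def conv1d_out_len(n, kernel=3, stride=1, padding=1):
    return (n + 2 * padding - kernel) // stride + 1

def encoder_out_len(n_frames, downsampling_ratio=4):
    """Trace sequence length through the 4-layer encoder sketched above."""
    num_strided = downsampling_ratio.bit_length() - 1  # log2(ratio) stride-2 layers
    n = n_frames
    for _ in range(num_strided):
        n = conv1d_out_len(n, stride=2)   # halves the length each time
    for _ in range(2):
        n = conv1d_out_len(n, stride=1)   # stride-1 layers keep the length
    return n

print(encoder_out_len(64))  # → 16 (i.e. 64 / 4)
```

With kernel=3 and padding=1, each stride-2 layer exactly halves the length for even inputs, which is why two such layers give the 4x compression.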
---

## 3. **ResidualVectorQuantizer (RVQ): Hierarchical Quantization**

### Purpose
Converts continuous latents into **discrete tokens** using a hierarchy of codebooks.

### Core Concept: Residual Quantization
Instead of quantizing once, RVQ quantizes the **residual error** iteratively:

```
Step 0: Quantize input        → b⁰ = Q₀(r⁰), where r⁰ = continuous latent
Step 1: Quantize residual     → b¹ = Q₁(r¹), where r¹ = r⁰ - b⁰
Step 2: Quantize new residual → b² = Q₂(r²), where r² = r¹ - b¹
...
Step V: Final residual        → bⱽ = Qᵥ(rⱽ)

Final Output: b⁰ + b¹ + ... + bⱽ  # Sum of all quantized codes
```
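The iteration above can be sketched numerically. The toy 2-D codebooks and the `quantize` helper below are illustrative assumptions (random codebooks, one per layer, at decreasing scales), not the model's actual codebooks:

```python
import numpy as np

# Toy sketch of residual quantization: each layer quantizes what the previous
# layers left over, so the running sum of codes approximates the input.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 2)) * 0.5 ** v for v in range(4)]  # coarse → fine

def quantize(x, codebook):
    """Return the codebook row nearest to vector x."""
    dists = ((codebook - x) ** 2).sum(axis=1)
    return codebook[dists.argmin()]

x = np.array([0.7, -0.3])
approx, residual = np.zeros(2), x
for cb in codebooks:
    code = quantize(residual, cb)  # bᵛ = Qᵥ(rᵛ)
    approx = approx + code         # running sum of codes
    residual = residual - code     # rᵛ⁺¹ = rᵛ - bᵛ

# By construction, the summed codes plus the final residual reproduce x exactly;
# the final residual is the part the hierarchy failed to capture.
```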
### Architecture
```python
num_quantizers = 6  # V+1 layers (0 to 5)
For each layer v:
    1. Compute distances to the codebook:
       distances = ||z - embedding||²     # (B*n, num_embeddings)
    2. Find the nearest code:
       indices = argmin(distances)        # (B*n,)
    3. Look up the quantized vector:
       quantized = embedding[:, indices]  # (B, d, n)
    4. Compute the next residual:
       residual = residual - quantized
```
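The four steps of a single layer can be sketched in NumPy for latents shaped (B, d, n) and a codebook shaped (d, K); `vq_lookup` is an illustrative name, not the module's API:

```python
import numpy as np

def vq_lookup(z, embedding):
    """One layer's nearest-code lookup. z: (B, d, n); embedding: (d, K)."""
    B, d, n = z.shape
    flat = z.transpose(0, 2, 1).reshape(-1, d)         # (B*n, d)
    # Squared distances expanded as ||z||² - 2·z·e + ||e||²
    dists = (
        (flat ** 2).sum(1, keepdims=True)
        - 2 * flat @ embedding
        + (embedding ** 2).sum(0, keepdims=True)
    )                                                   # (B*n, K)
    indices = dists.argmin(axis=1)                      # (B*n,)
    quantized = embedding[:, indices].T                 # (B*n, d)
    return quantized.reshape(B, n, d).transpose(0, 2, 1), indices.reshape(B, n)
```

A convenient self-test: if every column of `z` is an exact codebook entry, the lookup returns `z` unchanged along with the matching indices.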
### VectorQuantizer: Single-Layer Quantization
Each layer has:
- **Codebook**: an `embedding` tensor of shape `(d, num_embeddings=512)`
  - 512 learnable code vectors, each of dimension 256
- **EMA Updates** (Exponential Moving Average):
  ```python
  cluster_size  = (1-decay) * new_counts + decay * old_counts
  embedding_avg = (1-decay) * new_codes  + decay * old_codes
  embedding     = embedding_avg / cluster_size  # Normalize
  ```
  - Prevents codebook collapse (dead codes)
  - No explicit gradient descent on the codebook
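One EMA step can be sketched as follows, under the assumption that `flat_inputs` holds the (M, d) encoder outputs assigned by `indices`; the ε guard against empty clusters is an added safeguard, not necessarily what the implementation does:

```python
import numpy as np

def ema_codebook_update(embedding, cluster_size, embedding_avg,
                        flat_inputs, indices, decay=0.99, eps=1e-5):
    """One EMA step. embedding/embedding_avg: (d, K); cluster_size: (K,)."""
    K = embedding.shape[1]
    one_hot = np.eye(K)[indices]                       # (M, K) hard assignments
    new_counts = one_hot.sum(axis=0)                   # usage count per code
    new_sums = flat_inputs.T @ one_hot                 # (d, K) sum of assigned inputs
    cluster_size = decay * cluster_size + (1 - decay) * new_counts
    embedding_avg = decay * embedding_avg + (1 - decay) * new_sums
    # ε keeps codes with (near-)zero usage from dividing by zero (assumption)
    embedding = embedding_avg / np.maximum(cluster_size, eps)
    return embedding, cluster_size, embedding_avg
```

With `decay=0` the update degenerates to plain k-means: each code becomes the mean of the inputs assigned to it, which makes the formula easy to check.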
- **Straight-Through Estimator**:
  ```python
  quantized_st = inputs + (quantized - inputs).detach()
  ```
  - Forward: uses the quantized values
  - Backward: gradients flow through `inputs`, bypassing the non-differentiable argmin
- **Commitment Loss**:
  ```python
  loss = λ * ||quantized - inputs||²
  ```
  - Encourages the encoder to produce latents close to codebook entries
---

## 4. **MotionDecoder: Convolutional Upsampling**

### Purpose
Reconstructs the original motion from the quantized latent.

### Architecture
```python
Input: (B, 256, n)
Layer Structure (mirror of the encoder):
    ConvTranspose1D(256 → 512, kernel=3, stride=2, padding=1, output_padding=1)
    ReLU + BatchNorm
    ConvTranspose1D(512 → 512, kernel=3, stride=2, padding=1, output_padding=1)
    ReLU + BatchNorm
    Conv1D(512 → 512, kernel=3, stride=1, padding=1)
    ReLU + BatchNorm
    Conv1D(512 → D_POSE, kernel=3, stride=1, padding=1)  # Final layer, no activation
Output: (B, D_POSE, N)  # Restored to original dimensions
```

### Key Design Choices
- **ConvTranspose1D**: Learns the upsampling (more flexible than fixed interpolation)
- **output_padding**: Ensures exact size matching after strided convolutions
- **No activation on the final layer**: Allows an unrestricted output range
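The role of `output_padding` can be checked with the standard ConvTranspose1D output-length formula; `conv_transpose1d_out_len` is a hypothetical helper for shape checking:

```python
# Standard ConvTranspose1D output-length formula:
# out = (n - 1) * stride - 2 * padding + kernel + output_padding
def conv_transpose1d_out_len(n, kernel=3, stride=2, padding=1, output_padding=1):
    return (n - 1) * stride - 2 * padding + kernel + output_padding

n = 16
for _ in range(2):                    # two stride-2 transposed conv layers
    n = conv_transpose1d_out_len(n)
print(n)  # → 64 (16 * 4, undoing the encoder's 4x downsampling)
```

Without `output_padding=1` each layer would produce 2n−1 frames instead of 2n, so the decoder could never exactly recover even-length inputs.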
---

## 5. **Loss Function: Multi-Component Objective**

```python
Total Loss = L_rec + λ_global·L_global + λ_commit·L_commit + λ_vel·L_vel
```

### Components
1. **Reconstruction Loss** (L_rec):
   ```python
   L_rec = SmoothL1(reconstructed, target)
   ```
   - Main objective: match the overall motion
2. **Global/Root Loss** (L_global):
   ```python
   L_global = SmoothL1(reconstructed[:, :4], target[:, :4])
   ```
   - Focuses on the first 4 dimensions:
     - Root rotation velocity
     - Root linear velocity (X/Z)
     - Root height
   - Weighted 1.5x to prioritize global motion
3. **Velocity Loss** (L_vel):
   ```python
   pred_vel   = reconstructed[:, :, 1:] - reconstructed[:, :, :-1]
   target_vel = target[:, :, 1:] - target[:, :, :-1]
   L_vel = SmoothL1(pred_vel, target_vel)
   ```
   - Ensures temporal smoothness
   - Prevents jittery motion
   - Weighted 2.0x for importance
4. **Commitment Loss** (L_commit):
   ```python
   L_commit = Σ ||quantized_v - inputs_v||²  # summed over all RVQ layers
   ```
   - From the RVQ: encourages encoder outputs to stay near the codebook
   - Weighted 0.02x (kept small to avoid over-constraining the encoder)
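Putting the four components together, a NumPy sketch of the objective (weights taken from the text; `smooth_l1` uses the standard Huber form with β = 1, an assumption about the exact variant):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 / Huber: quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def rvq_loss(recon, target, commit_loss,
             w_global=1.5, w_vel=2.0, w_commit=0.02):
    """recon/target: (B, D_POSE, N); commit_loss: scalar from the RVQ."""
    l_rec = smooth_l1(recon, target)
    l_global = smooth_l1(recon[:, :4], target[:, :4])        # root channels
    l_vel = smooth_l1(recon[:, :, 1:] - recon[:, :, :-1],    # frame-to-frame
                      target[:, :, 1:] - target[:, :, :-1])  # velocities
    return l_rec + w_global * l_global + w_vel * l_vel + w_commit * commit_loss
```

A perfect reconstruction with zero commitment loss drives the total to exactly zero, which is a quick sanity check for the wiring.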
---

## 6. **Training Features**

### Quantization Dropout
```python
if training and rand() < 0.2:
    num_active_layers = randint(1, num_quantizers+1)
```
- Randomly uses 1 to V+1 quantization layers
- Improves robustness and generalization
- Forces lower layers to capture more information
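A stdlib sketch of this rule (the function name is illustrative). Note that Python's `random.randint` is inclusive on both ends, so the upper bound becomes `num_quantizers` rather than the exclusive `num_quantizers + 1` of the numpy-style pseudocode above:

```python
import random

def sample_active_layers(num_quantizers=6, p_dropout=0.2, training=True):
    """Return how many RVQ layers to use for this batch."""
    if training and random.random() < p_dropout:
        return random.randint(1, num_quantizers)  # random prefix of layers
    return num_quantizers                          # all layers active
```

Because only a prefix of layers survives, the early (coarse) quantizers must learn to stand on their own rather than lean on the fine layers.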
### Masking Support
```python
loss = mean_flat(error * mask) / (mask.sum() + ε)
```
- Handles variable-length sequences with padding
- Computes the loss only on valid frames
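One way to realize this in NumPy, assuming `mask` is (B, N) with 1 on valid frames and is broadcast over the D feature channels (the shapes and the squared-error choice are assumptions for illustration):

```python
import numpy as np

def masked_mse(pred, target, mask, eps=1e-8):
    """pred/target: (B, D, N); mask: (B, N) with 1 = valid frame, 0 = padding."""
    B, D, N = pred.shape
    error = (pred - target) ** 2
    masked = error * mask[:, None, :]             # zero out padded frames
    return masked.sum() / (mask.sum() * D + eps)  # mean over valid elements only
```

Padded frames contribute nothing to either the numerator or the denominator, so garbage values in the padding cannot inflate the loss.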
---

## 7. **Token Representation**

### Encoding to Tokens
```python
tokens = [indices_0, indices_1, ..., indices_V]  # List of (B, n) tensors
```
- Each token sequence corresponds to one RVQ layer
- Token values ∈ [0, 511] (for 512 codebook entries)
- Effective vocabulary: 512^(V+1) possible code combinations per timestep

### Decoding from Tokens
```python
quantized = Σ embedding_v[:, tokens_v]  # summed over layers v
reconstructed = decoder(quantized)
```
- Look up each layer's codes in that layer's codebook
- Sum all codes to get the final latent
- Pass the result through the decoder
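The lookup-and-sum step can be sketched as follows, with `codebooks` as a list of (d, K) arrays and `tokens` as a list of (B, n) integer arrays (names are illustrative):

```python
import numpy as np

def tokens_to_latent(tokens, codebooks):
    """Sum each RVQ layer's codebook lookup into one latent of shape (B, d, n)."""
    B, n = tokens[0].shape
    d = codebooks[0].shape[0]
    latent = np.zeros((B, d, n))
    for tok, cb in zip(tokens, codebooks):
        latent += cb[:, tok].transpose(1, 0, 2)  # (d, B, n) → (B, d, n)
    return latent
```

`cb[:, tok]` uses NumPy advanced indexing: indexing a (d, K) array with a (B, n) integer array yields (d, B, n), which is then transposed back to the model's (B, d, n) layout.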
---

## 8. **Key Hyperparameters**

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `input_dim` | 263 | Motion feature dimension |
| `latent_dim` | 256 | Bottleneck dimension |
| `downsampling_ratio` | 4 | Temporal compression (N → N/4) |
| `num_quantizers` | 6 | RVQ hierarchy depth (V+1) |
| `num_embeddings` | 512 | Codebook size per layer |
| `commitment_cost` | 1.0 | Weight for commitment loss |
| `decay` | 0.99 | EMA decay for codebook updates |
| `quantization_dropout` | 0.2 | Probability of layer dropout |

---
## 9. **Usage Example**

```python
# Training
model = RVQVAE(input_dim=263, output_dim=263)
reconstructed, tokens, commit_loss = model(motion)
total_loss, loss_dict = compute_rvq_loss(reconstructed, motion, commit_loss)

# Inference: Motion → Tokens
tokens = model.encode_to_tokens(motion)  # List of (B, n) discrete tokens

# Inference: Tokens → Motion
reconstructed = model.decode_from_tokens(tokens)
```

---
license: apache-2.0
---