# How Frames Work in Hunyuan-GameCraft
## Overview
The Hunyuan-GameCraft system generates high-dynamic interactive game videos using diffusion models and a hybrid history-conditioned training strategy. The frame handling is complex due to several factors:
1. **Causal VAE compression** (spatial and temporal with different ratios)
2. **Hybrid history conditioning** (using past frames/clips as context for autoregressive generation)
3. **Different generation modes** (image-to-video vs. video-to-video continuation)
4. **Rotary position embeddings (RoPE)** requirements for the MM-DiT backbone
## Paper Context
According to the paper, Hunyuan-GameCraft:
- Operates at **25 FPS**, with each video chunk comprising **33-frame clips** at 720p resolution
- Uses a **causal VAE** for encoding/decoding, which encodes initial and subsequent frames unevenly
- Implements **chunk-wise autoregressive extension**, where each chunk corresponds to one action
- Employs **hybrid history conditioning** with ratios of 70% single historical clip, 5% multiple clips, and 25% single frame, as sketched below
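To make the conditioning ratios concrete, here is a minimal sketch of how a training sample's history mode could be drawn; the mode names and helper are hypothetical illustrations, not the repo's API:
```python
import random

# Hypothetical mode names for the three history-conditioning variants;
# the ratios are the ones reported in the paper.
HISTORY_MODES = ["single_clip", "multiple_clips", "single_frame"]
HISTORY_WEIGHTS = [0.70, 0.05, 0.25]

def sample_history_mode(rng=random):
    """Draw which form of history conditioning a training sample uses."""
    return rng.choices(HISTORY_MODES, weights=HISTORY_WEIGHTS, k=1)[0]

print(sample_history_mode())  # e.g. "single_clip" about 70% of the time
```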
## Key Frame Numbers Explained
### The Magic Numbers: 33, 34, 37, 66, 69
These numbers are fundamental to the architecture, not arbitrary:
- **33 frames**: The base video chunk size; each action generates exactly 33 frames (1.32 seconds at 25 FPS)
- **34 frames**: Used for image-to-video generation in latent space (33 + 1 initial frame)
- **37 frames**: Used for rotary position embeddings when starting from an image
- **66 frames**: Used for video-to-video continuation in latent space (2 × 33-frame chunks)
- **69 frames**: Used for rotary position embeddings when continuing from video
### Why These Specific Numbers?
The paper describes a "chunk latent denoising" process in which each chunk is a 33-frame segment. The specific numbers arise from:
1. **Base Chunk Size**: 33 frames per action (fixed by training)
2. **Initial Frame Handling**: +1 frame for the reference image in image-to-video mode
3. **RoPE Alignment**: +3 frames for proper positional-encoding alignment in the transformer
4. **History Conditioning**: Doubling for video continuation (using the previous chunk as context), as the sketch after this list shows
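The arithmetic behind this list can be written down directly; the constant names here are illustrative, not identifiers from the codebase:
```python
# Deriving the hardcoded lengths from the 33-frame chunk size.
CHUNK = 33      # frames generated per action
ROPE_PAD = 3    # extra positions for RoPE alignment

image_target = CHUNK + 1              # 34: chunk + initial reference frame
video_target = 2 * CHUNK              # 66: previous chunk + new chunk
image_rope = image_target + ROPE_PAD  # 37
video_rope = video_target + ROPE_PAD  # 69

assert (image_target, image_rope, video_target, video_rope) == (34, 37, 66, 69)
```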
## VAE Compression Explained
### VAE Types and the "4n+1" / "8n+1" Formulas
The project uses different VAE (Variational Autoencoder) models identified by codes like "884" or "888":
#### VAE Naming Convention: "XYZ-16c-hy0801"
- **First digit (X)**: Spatial compression ratio (height)
- **Second digit (Y)**: Spatial compression ratio (width)
- **Third digit (Z)**: Temporal compression ratio
- **16c**: 16 latent channels
- **hy0801**: Version identifier
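Assuming the convention above, a name like "884-16c-hy0801" can be decoded mechanically; this parser is a hypothetical illustration, not repo code:
```python
# Decode "884-16c-hy0801" → 8x height, 8x width, 4x temporal compression.
def parse_vae_code(name: str) -> dict:
    code = name.split("-")[0]  # e.g. "884"
    return {
        "spatial_h": int(code[0]),
        "spatial_w": int(code[1]),
        "temporal": int(code[2]),
    }

assert parse_vae_code("884-16c-hy0801") == {"spatial_h": 8, "spatial_w": 8, "temporal": 4}
```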
| #### "884" VAE (Default in Code) | |
| - **Temporal compression**: 4:1 (every 4 frames → 1 latent frame) | |
| - **Spatial compression**: 8:1 for both height and width | |
| - **Frame formula**: | |
| - Standard: `latent_frames = (video_frames - 1) // 4 + 1` | |
| - Special handling: `latent_frames = (video_frames - 2) // 4 + 2` (for certain cases) | |
| - **Why "4n+1"?**: The causal VAE requires frames in multiples of 4 plus 1 for proper temporal compression | |
| - Example: 33 frames → (33-1)/4 + 1 = 9 latent frames | |
| - Example: 34 frames → (34-2)/4 + 2 = 9 latent frames (special case in pipeline) | |
| - Example: 66 frames → (66-2)/4 + 2 = 17 latent frames | |
| #### "888" VAE (Alternative) | |
| - **Temporal compression**: 8:1 (every 8 frames → 1 latent frame) | |
| - **Spatial compression**: 8:1 for both height and width | |
| - **Frame formula**: `latent_frames = (video_frames - 1) // 8 + 1` | |
| - **Why "8n+1"?**: Similar principle but with 8:1 temporal compression | |
| - Example: 33 frames → (33-1)/8 + 1 = 5 latent frames | |
| - Example: 65 frames → (65-1)/8 + 1 = 9 latent frames | |
#### No Compression VAE
- When the VAE code doesn't match either pattern, no temporal compression is applied
- `latent_frames = video_frames`
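Putting the three variants together, here is a minimal sketch of the latent-frame calculation; the assumption that the special-handling branch fires exactly for the 4n+2 lengths (34, 66) is mine, inferred from the examples above:
```python
def latent_frame_count(video_frames: int, vae: str = "884-16c-hy0801") -> int:
    """Sketch of the latent-frame formulas above; not the repo's exact code."""
    if "884" in vae:
        if video_frames % 4 == 2:              # 34, 66: special handling
            return (video_frames - 2) // 4 + 2
        return (video_frames - 1) // 4 + 1     # 33, 17, 21, ...
    if "888" in vae:
        return (video_frames - 1) // 8 + 1
    return video_frames                        # no temporal compression

assert latent_frame_count(33) == 9
assert latent_frame_count(34) == 10 and latent_frame_count(66) == 18
assert latent_frame_count(33, "888-16c-hy0801") == 5
```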
### Why Different Formulas?
The formulas handle the causal nature of the VAE, as described in the paper:
1. **Causal VAE Characteristics**: The paper notes that causal VAEs exhibit "uneven encoding of initial versus subsequent frames"
2. **First Frame Special Treatment**: The initial frame requires different handling than subsequent frames
3. **Temporal Consistency**: Causal attention ensures each frame attends only to previous frames, maintaining temporal coherence
4. **Chunk Boundaries**: The formulas ensure proper alignment with the 33-frame chunk size used in training
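The first-frame special treatment is easiest to see by listing how frames group into latents; this grouping code is for intuition only, not the VAE's implementation:
```python
# Causal grouping for the 884 VAE: the first frame maps to its own latent,
# then every 4 subsequent frames share one latent.
def causal_groups(frames: int, stride: int = 4) -> list[int]:
    return [1] + [stride] * ((frames - 1) // stride)

groups = causal_groups(33)
print(groups)        # [1, 4, 4, 4, 4, 4, 4, 4, 4]
print(len(groups))   # 9 latent frames, matching (33-1)/4 + 1
assert sum(groups) == 33
```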
## Frame Processing Pipeline
### 1. Image-to-Video Generation (First Segment)
```python
# Starting from a single image
if is_image:
    target_length = 34  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(37, height, width)
```
**Why 34 and 37?**
- 34 frames in latent space = 33 generated frames + 1 initial frame
- 37 for RoPE = 34 + 3 extra for positional-encoding alignment
### 2. Video-to-Video Continuation
```python
# Continuing from existing video
else:
    target_length = 66  # In latent space
    # For RoPE embeddings
    freqs_cos, freqs_sin = get_rotary_pos_embed(69, height, width)
```
**Why 66 and 69?**
- 66 frames = 2 × 33 frames (using the previous segment as context)
- 69 for RoPE = 66 + 3 extra for positional-encoding alignment
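The two modes differ only in their lengths; a small hedged wrapper (assuming the `get_rotary_pos_embed(frames, height, width)` signature quoted above) makes the relationship explicit:
```python
ROPE_PAD = 3  # the +3 alignment offset used in both modes

def lengths_for(is_image: bool) -> tuple[int, int]:
    """Return (target_length, rope_length) for the given generation mode."""
    target_length = 34 if is_image else 66
    return target_length, target_length + ROPE_PAD

assert lengths_for(True) == (34, 37)
assert lengths_for(False) == (66, 69)
```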
### 3. Camera Network Compression
The CameraNet has special handling for these frame counts:
```python
def compress_time(self, x, num_frames):
    # 34 (image start) and 66 (video continuation) are split into two halves
    if x.shape[-1] == 66 or x.shape[-1] == 34:
        x_len = x.shape[-1]
        # First half of the sequence
        x_clip1 = x[..., : x_len // 2]
        # Second half of the sequence
        x_clip2 = x[..., x_len // 2 : x_len]
        # ... (each half then keeps its first frame and pools the remainder)
```
This compression strategy:
1. **Preserves key frames**: Keeps the first frame of each segment
2. **Pools temporal information**: Averages the remaining frames
3. **Maintains continuity**: Ensures smooth transitions between segments, as the sketch below illustrates
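Here is a minimal reconstruction of that strategy, assuming "pool" means a temporal mean; this is my sketch for intuition, not cameranet.py's exact implementation:
```python
import torch

def keep_first_pool_rest(clip: torch.Tensor) -> torch.Tensor:
    """(..., T) → (..., 2): the first frame plus the mean of the remaining frames."""
    first = clip[..., :1]
    pooled = clip[..., 1:].mean(dim=-1, keepdim=True)
    return torch.cat([first, pooled], dim=-1)

x = torch.randn(16, 34)                # e.g. one camera feature over 34 frames
halves = (x[..., :17], x[..., 17:])    # split as in compress_time above
out = torch.cat([keep_first_pool_rest(h) for h in halves], dim=-1)
print(out.shape)                       # torch.Size([16, 4])
```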
## Case Study: Using 17 Frames Instead of 33
While the model is trained on 33-frame chunks, it can theoretically be adapted to 17-frame chunks, which roughly halve the duration while maintaining VAE compatibility:
### 1. Why 17 Frames Works with the VAE
17 frames is compatible with both VAE architectures:
- **884 VAE** (4:1 temporal compression):
  - Formula: (17-1)/4 + 1 = 5 latent frames ✓
  - Clean division ensures proper encoding/decoding
- **888 VAE** (8:1 temporal compression):
  - Formula: (17-1)/8 + 1 = 3 latent frames ✓
  - Also divides cleanly for proper compression
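A short sanity check of the arithmetic above:
```python
# 17 = 4·4 + 1 and 17 = 8·2 + 1, so both VAEs divide it cleanly.
for stride, expected in [(4, 5), (8, 3)]:
    assert (17 - 1) % stride == 0
    assert (17 - 1) // stride + 1 == expected
```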
### 2. Required Code Modifications
To implement 17-frame generation, you would need to update:
#### a. Core Frame Configuration
- **app.py**: Change `args.sample_n_frames = 17`
- **ActionToPoseFromID**: Update the `duration=17` parameter
- **sample_inference.py**: Adjust the target_length calculations:
```python
if is_image:
    target_length = 18  # 17 generated + 1 initial
else:
    target_length = 34  # 2 × 17 for video continuation
```
#### b. RoPE Embeddings
- For image-to-video: Use 21 instead of 37 (18 + 3 for alignment)
- For video-to-video: Use 37 instead of 69 (34 + 3 for alignment)
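These substitutions all follow one pattern, parameterized by the chunk size; the helper below is illustrative, not repo code:
```python
def mode_lengths(chunk: int, rope_pad: int = 3) -> dict:
    """target_length and RoPE length for both modes, given a chunk size."""
    return {
        "image_target": chunk + 1,    # chunk + initial frame
        "video_target": 2 * chunk,    # previous chunk + new chunk
        "image_rope": chunk + 1 + rope_pad,
        "video_rope": 2 * chunk + rope_pad,
    }

assert mode_lengths(33) == {"image_target": 34, "video_target": 66,
                            "image_rope": 37, "video_rope": 69}
assert mode_lengths(17) == {"image_target": 18, "video_target": 34,
                            "image_rope": 21, "video_rope": 37}
```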
#### c. CameraNet Compression
Update the frame-count checks in `cameranet.py`:
```python
if x.shape[-1] == 34 or x.shape[-1] == 18:  # support both 33- and 17-frame modes
    # Adjust the compression logic for shorter sequences
```
### 3. Trade-offs and Considerations
**Advantages of 17 frames:**
- **Reduced memory usage**: ~48% less VRAM required
- **Faster generation**: Shorter sequences process more quickly
- **More responsive**: Actions complete in 0.68 seconds instead of 1.32 seconds
**Disadvantages:**
- **Quality degradation**: The model wasn't trained on 17-frame chunks
- **Choppy motion**: Less temporal information for smooth transitions
- **Action granularity**: Shorter actions may feel abrupt
- **Potential artifacts**: The VAE and attention patterns are optimized for 33 frames
### 4. Why Other Frame Counts Are Problematic
Not all frame counts satisfy the VAE constraints:
- **18 frames**: ❌ (18-1)/4 = 4.25 (not an integer for the 884 VAE)
- **19 frames**: ❌ (19-1)/4 = 4.5 (not an integer)
- **20 frames**: ❌ (20-1)/4 = 4.75 (not an integer)
- **21 frames**: ✓ Works with the 884 VAE (6 latent frames)
- **25 frames**: ✓ Works with both VAEs (7 and 4 latent frames)
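A quick way to enumerate the valid counts (matching the 4n+1 / 8n+1 patterns listed in the recommendations below):
```python
def vae_compatible(frames: int, temporal: int = 4) -> bool:
    """True when (frames - 1) divides evenly by the temporal ratio."""
    return (frames - 1) % temporal == 0

print([f for f in range(16, 38) if vae_compatible(f)])     # [17, 21, 25, 29, 33, 37]
print([f for f in range(16, 42) if vae_compatible(f, 8)])  # [17, 25, 33, 41]
```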
### 5. Implementation Note
While technically possible, using 17 frames would require:
1. **Extensive testing**: Verify quality and temporal consistency
2. **Possible fine-tuning**: The model may need adaptation for optimal results
3. **Adjustment of action speeds**: Camera movements are calibrated for 33 frames
4. **Modified training strategy**: If fine-tuning, adjust the hybrid history ratios
## Recommendations for Frame Count Modification
If you must change frame counts, consider the following:
1. **Use VAE-compatible numbers**:
   - For the 884 VAE: 17, 21, 25, 29, 33, 37, ... (4n+1 pattern)
   - For the 888 VAE: 17, 25, 33, 41, ... (8n+1 pattern)
2. **Modify all dependent locations**:
   - `sample_inference.py`: Update the target_length logic
   - `cameranet.py`: Update the compress_time conditions
   - `ActionToPoseFromID`: Change the duration parameter
   - App configuration: Update sample_n_frames
3. **Consider retraining or fine-tuning**:
   - The model may need adaptation for different sequence lengths
   - Quality may be suboptimal without retraining
4. **Test thoroughly**:
   - Different frame counts may expose edge cases
   - Ensure VAE encoding/decoding works correctly
   - Verify temporal consistency in generated videos
## Technical Details
### Latent Space Calculation Examples
For the **884 VAE** (4:1 temporal compression):
```
Input: 33 frames → (33-1)/4 + 1 = 9 latent frames
Input: 34 frames → (34-2)/4 + 2 = 10 latent frames (special case)
Input: 66 frames → (66-2)/4 + 2 = 18 latent frames (special case)
```
For the **888 VAE** (8:1 temporal compression):
```
Input: 33 frames → (33-1)/8 + 1 = 5 latent frames
Input: 65 frames → (65-1)/8 + 1 = 9 latent frames
```
### Memory Implications
Fewer frames mean less memory usage:
- 33 frames at 704×1216: ~85MB per frame in FP16
- 18 frames would use ~46% less memory
- But VAE constraints limit the viable options
## Paper-Code Consistency Analysis
This documentation is consistent with both the paper and the codebase:
### From the Paper:
- "The system operates at 25 fps, with each video chunk comprising 33-frame clips at 720p resolution"
- Uses a "chunk latent denoising process" for autoregressive generation
- Implements a "hybrid history-conditioned training strategy"
- Mentions the causal VAE's "uneven encoding of initial versus subsequent frames"
### From the Code:
- `sample_n_frames = 33` throughout the codebase
- The VAE compression formulas match the 884/888 patterns
- The hardcoded frame values (34, 37, 66, 69) align with the chunk-based architecture
- CameraNet's special handling for 34/66 frames confirms the two-mode generation
## Conclusion
The frame counts in Hunyuan-GameCraft are fundamental to its architecture:
1. **33 frames** is the atomic unit, trained into the model and fixed by the dataset construction
2. **34/37 and 66/69** emerge from the interaction between:
   - The 33-frame chunk size
   - Causal VAE requirements
   - The MM-DiT transformer's RoPE needs
   - The hybrid history conditioning strategy
3. The **884 VAE** (4:1 temporal compression) is the default, requiring frame counts of the form 4n+1 or 4n+2
4. Changing to a different frame count (like 18) would require:
   - Retraining the entire model
   - Reconstructing the dataset
   - Modifying the VAE architecture
   - Updating all hardcoded dependencies
The system's design reflects careful engineering trade-offs among generation quality, temporal consistency, and computational efficiency, as validated by the paper's experimental results showing superior performance compared to alternatives.