# HY-WorldPlay FP8 Quantized (48GB GPU Ready)
HY-WorldPlay (8B Dense DiT, 72GB VRAM at BF16) compressed to a 37.4GB peak via:

- Native FP8 weights (`float8_e4m3fn`, per-tensor scale): 32GB → 8GB (4×)
- turbo3 V cache compression (PolarQuant 3-bit): applied at runtime, no pre-saved data needed
Successfully runs on a single RTX 4090 48GB or L40S 48GB (SM89 required for FP8).
## Results
| Configuration | GPU | Peak VRAM | Status |
|---|---|---|---|
| BF16 baseline | A800 80GB | 73.8 GB | ✅ |
| BF16 baseline | RTX 4090 48GB | OOM (46.5 GB) | ❌ |
| FP8 + turbo3 | RTX 4090 48GB | 37.4 GB | ✅ |
## Inference Speed (v3: `torch._scaled_mm` + SageAttention)
| Chunk | Time/step |
|---|---|
| 0 | ~0.86s |
| 4 | ~2.4s |
| 7 | ~2.8s |
| Total | 196.8s (4 steps × 8 chunks) |
## Files
- `diffusion_pytorch_model.fp8.safetensors`: FP8 quantized transformer weights (8GB)
- `scripts/native_fp8_patch.py`: FP8 Linear layer with `torch._scaled_mm`
- `scripts/turbo3_integration.py`: V cache PolarQuant compression (GPU optimized)
- `scripts/run_fp8_turbo3_gpu.py`: inference wrapper
- `scripts/run_fp8_turbo3_gpu.sh`: one-click launch script
- `scripts/batch_inference.py`: batch inference with random WASD poses
- `videos/`: generated video samples
## Usage

### Requirements
- GPU: SM89 (RTX 4090 / L40S) with ≥ 48GB VRAM
- PyTorch ≥ 2.1 with FP8 support
- HY-WorldPlay codebase
- SageAttention (optional, 1.8x attention speedup)
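Because the FP8 `torch._scaled_mm` path needs SM89, it can help to fail fast with a capability check. A sketch; `supports_native_fp8` is an illustrative name, not part of the repo:

```python
import torch

def supports_native_fp8() -> bool:
    # torch._scaled_mm's FP8 kernels require compute capability >= 8.9
    # (Ada-generation parts such as the RTX 4090 and L40S).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```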
### Quick Start
```bash
# 1. Clone HY-WorldPlay
git clone https://github.com/Tencent/HunyuanVideo.git
cd HunyuanVideo

# 2. Download this repo's FP8 weights
# Place diffusion_pytorch_model.fp8.safetensors in your model directory

# 3. Run inference
bash scripts/run_fp8_turbo3_gpu.sh
```
### Loading FP8 Weights
```python
import safetensors.torch
import torch

# Load the FP8 quantized weights
state_dict = safetensors.torch.load_file("diffusion_pytorch_model.fp8.safetensors")

# Weights with dtype float8_e4m3fn are quantized; each has a corresponding
# *_scale tensor containing its per-tensor scale.
# Dequantize: weight_bf16 = fp8_weight.to(torch.bfloat16) * weight_scale
```
## Quality
| Optimization | Cosine Similarity | Verified |
|---|---|---|
| FP8 weights | > 0.999 | ✅ |
| V cache turbo3 (3-bit) | 0.983 | ✅ (A800 real KV cache) |
| FP8 + turbo3 combined | end-to-end video generated | ✅ |
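The similarity metric in the table above can be computed per tensor with a small helper (a sketch; `cosine_sim` is an illustrative name, and the repo's own verification harness may differ):

```python
import torch

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Compare the two tensors as flat vectors in float32;
    # 1.0 means identical direction, i.e. no quantization drift.
    return torch.nn.functional.cosine_similarity(
        a.flatten().float(), b.flatten().float(), dim=0
    ).item()
```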
## Acknowledgments
Based on Tencent-Hunyuan/HY-WorldPlay (Apache 2.0).