HY-WorldPlay FP8 Quantized (48GB GPU Ready)

HY-WorldPlay (an 8B dense DiT requiring ~72 GB of VRAM at BF16) compressed to a 37.4 GB peak via:

  • Native FP8 weights (float8_e4m3fn, per-tensor scale) β€” 32GB β†’ 8GB (4x)
  • turbo3 V cache compression (PolarQuant 3-bit) β€” runtime, no pre-saved data needed

Successfully runs on a single RTX 4090 48GB or L40S 48GB (SM89 required for FP8).

Results

| Configuration | GPU | Peak VRAM | Status |
|---|---|---|---|
| BF16 baseline | A800 80GB | 73.8 GB | ✅ |
| BF16 baseline | RTX 4090 48GB | OOM (46.5 GB) | ❌ |
| FP8 + turbo3 | RTX 4090 48GB | 37.4 GB | ✅ |

Inference Speed (v3: _scaled_mm + SageAttention)

| Chunk | Time/step |
|---|---|
| 0 | ~0.86s |
| 4 | ~2.4s |
| 7 | ~2.8s |
| Total | 196.8s (4 steps × 8 chunks) |

Files

  • diffusion_pytorch_model.fp8.safetensors β€” FP8 quantized transformer weights (8GB)
  • scripts/native_fp8_patch.py β€” FP8 Linear layer with torch._scaled_mm
  • scripts/turbo3_integration.py β€” V cache PolarQuant compression (GPU optimized)
  • scripts/run_fp8_turbo3_gpu.py β€” inference wrapper
  • scripts/run_fp8_turbo3_gpu.sh β€” one-click launch script
  • scripts/batch_inference.py β€” batch inference with random WASD poses
  • videos/ β€” generated video samples

Usage

Requirements

  • GPU: SM89 (RTX 4090 / L40S) with β‰₯ 48GB VRAM
  • PyTorch β‰₯ 2.1 with FP8 support
  • HY-WorldPlay codebase
  • SageAttention (optional, 1.8x attention speedup)

Quick Start

```bash
# 1. Clone HY-WorldPlay
git clone https://github.com/Tencent/HunyuanVideo.git
cd HunyuanVideo

# 2. Download this repo's FP8 weights
# Place diffusion_pytorch_model.fp8.safetensors in your model directory

# 3. Run inference
bash scripts/run_fp8_turbo3_gpu.sh
```

Loading FP8 Weights

```python
import safetensors.torch
import torch

# Load the FP8 quantized weights
state_dict = safetensors.torch.load_file("diffusion_pytorch_model.fp8.safetensors")

# Tensors stored as float8_e4m3fn are quantized; each has a corresponding
# *_scale tensor holding its per-tensor scale.
# Dequantize: weight_bf16 = fp8_weight.to(bfloat16) * weight_scale
for name, tensor in list(state_dict.items()):
    if tensor.dtype == torch.float8_e4m3fn:
        scale = state_dict[name + "_scale"]  # key naming assumed from the *_scale convention
        state_dict[name] = tensor.to(torch.bfloat16) * scale
```

Quality

| Optimization | Cosine Similarity | Verified |
|---|---|---|
| FP8 weights | > 0.999 | ✅ |
| V cache turbo3 (3-bit) | 0.983 | ✅ (A800 real KV cache) |
| FP8 + turbo3 combined | | ✅ end-to-end video generated |
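The similarity numbers above compare dequantized outputs against the BF16 reference. A helper of roughly this shape (hypothetical name) reproduces the metric:

```python
import torch
import torch.nn.functional as F

def cosine_sim(ref: torch.Tensor, test: torch.Tensor) -> float:
    """Flatten both tensors and compare direction in float32."""
    return F.cosine_similarity(ref.float().flatten(), test.float().flatten(), dim=0).item()

a = torch.randn(1024)
sim = cosine_sim(a, a + 0.01 * torch.randn(1024))
```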

Acknowledgments

Based on Tencent-Hunyuan/HY-WorldPlay (Apache 2.0).
