# HunyuanImage-3 Base INT8
INT8 quantized version of tencent/HunyuanImage-3.0 using bitsandbytes. Reduces model size from ~160GB (BF16) to ~81GB while maintaining quality.
## Model Details
- Architecture: ~130B parameter Mixture-of-Experts (MoE) with 64 experts, top-8 routing
- Quantization: INT8 via bitsandbytes `Linear8bitLt` on transformer linear layers
- Original precision: BF16 → INT8 (VAE, vision model, and embeddings remain in full precision)
- Variant: Base (text-to-image only, 20 diffusion steps, no classifier-free guidance)
## Quality Notes
INT8 quantization preserves the model's strengths remarkably well — generated images feature correct anatomy, proper finger counts, and strong resistance to extra limbs and other common AI artifacts. The Base INT8 variant performs particularly well on a 96GB Blackwell GPU (~4 minutes per image at 1024x1024).
## Usage

### With the generation scripts
The easiest way to use this model is with the companion generation scripts:
```bash
git clone https://github.com/jamesw767/hunyuan-image-int8.git
cd hunyuan-image-int8
pip install -r requirements.txt

# Download this model
huggingface-cli download jamesw767/HunyuanImage-3-Base-INT8 \
    --local-dir ./HunyuanImage-3-Base-INT8

# Generate
python generate.py \
    --model-path ./HunyuanImage-3-Base-INT8 \
    --prompt "A red fox sitting in autumn leaves, realistic photography"
```
### Direct loading with transformers
```python
from transformers import AutoModelForCausalLM
import torch

model_path = "jamesw767/HunyuanImage-3-Base-INT8"

# The bitsandbytes INT8 quantization config is stored in the checkpoint's
# config.json, so from_pretrained applies it automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
Note: Direct loading requires the exception-based memory management trick to handle VAE decode — see the generation scripts repo for the full pipeline.
## How It Was Made

The INT8 weights were created using `save_quantized.py` from the generation scripts:
- Load the BF16 model with `BitsAndBytesConfig(load_in_8bit=True)`
- Extract the quantized state dict, resolving meta tensors from accelerate's CPU offload hooks
- Save as sharded safetensors (5GB per shard, 17 shards total)
Modules excluded from INT8 quantization (kept in original precision):
`vae`, `vision_model`, `vision_aligner`, `patch_embed`, `final_layer`, `time_embed`, `time_embed_2`, `timestep_emb`, `guidance_emb`, `timestep_r_emb`, `lm_head`, `model.wte`, `model.ln_f`
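The flow above can be sketched roughly as follows. This is a hedged outline, not the actual `save_quantized.py` (which additionally resolves meta tensors from accelerate's offload hooks); the `llm_int8_skip_modules` list mirrors the exclusions above, and running it requires the full ~160GB BF16 checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Modules kept in original precision, per the exclusion list above.
SKIP_MODULES = [
    "vae", "vision_model", "vision_aligner", "patch_embed", "final_layer",
    "time_embed", "time_embed_2", "timestep_emb", "guidance_emb",
    "timestep_r_emb", "lm_head", "model.wte", "model.ln_f",
]

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=SKIP_MODULES,
)

# Quantize the transformer linear layers to INT8 on load.
model = AutoModelForCausalLM.from_pretrained(
    "tencent/HunyuanImage-3.0",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Write sharded safetensors, ~5GB per shard.
model.save_pretrained("./HunyuanImage-3-Base-INT8", max_shard_size="5GB")
```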
## GPU Requirements
- 96GB VRAM recommended: RTX PRO 6000 Blackwell, A100 80GB+, H100
- 48GB+ VRAM: may work with aggressive CPU offloading via `--gpu-budget`/`--cpu-budget`
- System RAM: 64GB+ recommended (offloaded layers use CPU memory)
During diffusion, KV cache and MoE activations expand to ~80GB regardless of model weight placement. The generation scripts use an exception-based stack unwinding trick to free this memory before VAE decode.
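The unwinding trick can be illustrated with a minimal, framework-free sketch. All names here (`_UnwindForDecode`, `run_diffusion`, `generate`) are hypothetical stand-ins for the real pipeline, where the exception payload would be latent tensors and the cleanup would also call `torch.cuda.empty_cache()`.

```python
import gc

class _UnwindForDecode(Exception):
    """Hypothetical exception used to unwind the diffusion call stack."""
    def __init__(self, latents):
        super().__init__("unwinding to free diffusion memory")
        self.latents = latents

def run_diffusion():
    # Stand-in for the diffusion loop: in the real pipeline, ~80GB of
    # KV cache and MoE activations are referenced from locals here.
    kv_cache = [0.0] * 1_000_000   # placeholder for large activations
    latents = "latents"            # placeholder for the final latent tensor
    # Raising unwinds every frame of the loop, dropping those references.
    raise _UnwindForDecode(latents)

def generate():
    try:
        run_diffusion()
    except _UnwindForDecode as exc:
        latents = exc.latents      # only the small payload survives
    # Python clears `exc` when the except block exits, releasing the
    # traceback frames; collect so the freed memory is reclaimed now.
    gc.collect()
    # VAE decode now runs with the diffusion working set released.
    return f"decoded({latents})"
```

The key point is that returning normally would keep the loop's locals alive on the caller's stack, while raising discards every intermediate frame at once, so only the small payload attached to the exception survives into the decode step.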
## Differences from Instruct/Distil Variants
| | Base | Instruct | Instruct-Distil |
|---|---|---|---|
| Steps | 20 | 50 | 8 |
| CFG | No | Yes | No |
| Chat format | No | Yes | Yes |
| Speed (96GB GPU) | ~4 min | ~13 min | ~90s |
## Other INT8 Models
- jamesw767/HunyuanImage-3-Instruct-INT8 — Full Instruct, 50 steps, highest quality
- jamesw767/HunyuanImage-3-Instruct-Distil-INT8 — Distilled, 8 steps, fastest
## License
This model is a derivative of tencent/HunyuanImage-3.0, released under the Tencent Hunyuan Community License.
Important: This license does not apply in the European Union, United Kingdom, or South Korea.
Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright (c) 2025 Tencent. All Rights Reserved. The trademark rights of "Tencent Hunyuan" are owned by Tencent or its affiliate.
## Credits
- Original model by Tencent Hunyuan
- INT8 quantization and generation scripts by jamesw767