HunyuanImage-3 Base INT8

INT8 quantized version of tencent/HunyuanImage-3.0 using bitsandbytes. Reduces model size from ~160GB (BF16) to ~81GB while maintaining quality.

Model Details

  • Architecture: ~80B parameter Mixture-of-Experts (MoE) with 64 experts, top-8 routing (~13B parameters active per token)
  • Quantization: INT8 via bitsandbytes Linear8bitLt on transformer linear layers
  • Original precision: BF16 → INT8 (VAE, vision model, and embeddings remain in full precision)
  • Variant: Base (text-to-image only, 20 diffusion steps, no classifier-free guidance)
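
As a sanity check on the sizes quoted above: the ~160GB BF16 checkpoint implies roughly 80B quantizable parameters at 2 bytes each, and dropping to 1 byte per weight lands near the ~81GB INT8 figure (the modules kept in full precision account for the small difference). A quick sketch of the arithmetic:

```python
def ckpt_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough checkpoint size in GB, ignoring quantization metadata."""
    return n_params * bytes_per_param / 1e9

n_params = 80e9  # implied by the ~160 GB BF16 checkpoint at 2 bytes/param

print(f"BF16: ~{ckpt_gb(n_params, 2):.0f} GB")  # BF16: ~160 GB
print(f"INT8: ~{ckpt_gb(n_params, 1):.0f} GB")  # INT8: ~80 GB (~81 GB on disk with fp32/bf16 modules)
```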

Quality Notes

INT8 quantization preserves image quality well: generated images show correct anatomy, proper finger counts, and strong resistance to extra limbs and other common AI artifacts. On a 96GB Blackwell GPU, the Base INT8 variant generates a 1024x1024 image in roughly 4 minutes.

Usage

With the generation scripts

The easiest way to use this model is with the companion generation scripts:

git clone https://github.com/jamesw767/hunyuan-image-int8.git
cd hunyuan-image-int8
pip install -r requirements.txt

# Download this model
huggingface-cli download jamesw767/HunyuanImage-3-Base-INT8 \
    --local-dir ./HunyuanImage-3-Base-INT8

# Generate
python generate.py \
    --model-path ./HunyuanImage-3-Base-INT8 \
    --prompt "A red fox sitting in autumn leaves, realistic photography"

Direct loading with transformers

from transformers import AutoModelForCausalLM
import torch

model_path = "jamesw767/HunyuanImage-3-Base-INT8"

# The checkpoint ships pre-quantized: transformers reads the bitsandbytes
# quantization_config stored in config.json, so no BitsAndBytesConfig is
# needed at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Note: Direct loading requires the exception-based memory management trick to handle VAE decode — see the generation scripts repo for the full pipeline.
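
The general shape of that trick can be sketched in plain Python. The idea is that returning normally from a deep call stack keeps every caller frame (and its locals) alive, while raising an exception unwinds the stack so those references die and can be collected before decode. The names below (UnwindForDecode, deep_diffusion_loop) are illustrative, not the repo's actual identifiers:

```python
import gc


class UnwindForDecode(Exception):
    """Raised to pop the deep diffusion call stack so its locals
    (KV cache, MoE activations) become unreachable before decode."""

    def __init__(self, payload):
        self.payload = payload


def deep_diffusion_loop():
    big_activations = [bytearray(1024) for _ in range(4)]  # stand-in for ~80 GB of tensors
    latents = "latents"
    # Raising (instead of returning) unwinds every frame between here and
    # the handler, dropping big_activations and all intermediate locals.
    raise UnwindForDecode(latents)


def generate():
    try:
        deep_diffusion_loop()
    except UnwindForDecode as exc:
        latents = exc.payload
    gc.collect()  # in the real pipeline, torch.cuda.empty_cache() would follow
    return f"decoded({latents})"
```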

How It Was Made

The INT8 weights were created using save_quantized.py from the generation scripts:

  1. Load the BF16 model with BitsAndBytesConfig(load_in_8bit=True)
  2. Extract the quantized state dict, resolving meta tensors from accelerate's CPU offload hooks
  3. Save as sharded safetensors (5GB per shard, 17 shards total)
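
Step 3's sharding can be pictured as a greedy bin-fill over tensor sizes, each shard capped at the size limit. This is an illustration of the idea (with toy byte counts), not the script's actual code:

```python
def shard_state_dict(tensor_bytes: dict, max_shard_bytes: int) -> list:
    """Greedily partition {tensor_name: nbytes} into shards no larger than
    max_shard_bytes; a single oversized tensor gets its own shard."""
    shards, current, used = [], {}, 0
    for name, nbytes in tensor_bytes.items():
        if current and used + nbytes > max_shard_bytes:
            shards.append(current)
            current, used = {}, 0
        current[name] = nbytes
        used += nbytes
    if current:
        shards.append(current)
    return shards


sizes = {"a": 3, "b": 3, "c": 5, "d": 2}
print(shard_state_dict(sizes, 5))  # [{'a': 3}, {'b': 3}, {'c': 5}, {'d': 2}]
```

With the real checkpoint, the same greedy pass over ~81GB of tensors at a 5GB cap yields the 17 shards noted above, plus an index file mapping each tensor name to its shard.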

Modules excluded from INT8 quantization (kept in original precision): vae, vision_model, vision_aligner, patch_embed, final_layer, time_embed, time_embed_2, timestep_emb, guidance_emb, timestep_r_emb, lm_head, model.wte, model.ln_f
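
A skip list like this is typically applied by substring-matching against each module's dotted path, which is roughly how bitsandbytes/transformers handle llm_int8_skip_modules. A minimal sketch of that matching (illustrative only):

```python
SKIP = ["vae", "vision_model", "vision_aligner", "patch_embed", "final_layer",
        "time_embed", "time_embed_2", "timestep_emb", "guidance_emb",
        "timestep_r_emb", "lm_head", "model.wte", "model.ln_f"]


def quantize_to_int8(module_path: str) -> bool:
    # A linear layer is converted to Linear8bitLt only if no skip entry
    # appears in its dotted module path.
    return not any(skip in module_path for skip in SKIP)


print(quantize_to_int8("model.layers.0.mlp.experts.7.up_proj"))  # True
print(quantize_to_int8("vae.decoder.conv_in"))                   # False
```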

GPU Requirements

  • 96GB VRAM recommended: RTX PRO 6000 Blackwell, A100 80GB+, H100
  • 48GB+ VRAM: May work with aggressive CPU offloading via --gpu-budget / --cpu-budget
  • System RAM: 64GB+ recommended (offloaded layers use CPU memory)
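
Budget flags like --gpu-budget / --cpu-budget presumably translate into something like transformers' max_memory mapping, which drives accelerate's layer placement under device_map="auto". A hedged sketch (the helper name and exact flag semantics are assumptions, not the scripts' actual code):

```python
def max_memory_map(gpu_budget_gib: int, cpu_budget_gib: int, gpu_index: int = 0) -> dict:
    """Build a mapping in the shape accepted by transformers' `max_memory=`
    argument; layers that exceed the GPU budget are offloaded to CPU RAM."""
    return {gpu_index: f"{gpu_budget_gib}GiB", "cpu": f"{cpu_budget_gib}GiB"}


print(max_memory_map(44, 96))  # {0: '44GiB', 'cpu': '96GiB'}
```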

During diffusion, KV cache and MoE activations expand to ~80GB regardless of model weight placement. The generation scripts use an exception-based stack unwinding trick to free this memory before VAE decode.

Differences from Instruct/Distil Variants

                   Base    Instruct  Instruct-Distil
Steps              20      50        8
CFG                No      Yes       No
Chat format        No      Yes       Yes
Speed (96GB GPU)   ~4 min  ~13 min   ~90s

Other INT8 Models

License

This model is a derivative of tencent/HunyuanImage-3.0, released under the Tencent Hunyuan Community License.

Important: This license does not apply in the European Union, United Kingdom, or South Korea.

Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright (c) 2025 Tencent. All Rights Reserved. The trademark rights of "Tencent Hunyuan" are owned by Tencent or its affiliate.

Credits
