HunyuanImage-3 Base INT8

INT8 quantized version of tencent/HunyuanImage-3.0 using bitsandbytes. Reduces model size from ~160GB (BF16) to ~81GB while maintaining quality.

Model Details

  • Architecture: ~80B parameter Mixture-of-Experts (MoE) with 64 experts, top-8 routing (~13B parameters active per token)
  • Quantization: INT8 via bitsandbytes Linear8bitLt on transformer linear layers
  • Original precision: BF16 → INT8 (VAE, vision model, and embeddings remain in full precision)
  • Variant: Base (text-to-image only, 20 diffusion steps, no classifier-free guidance)
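
As a sanity check on the sizes quoted above: the ~160GB BF16 checkpoint implies roughly 80B quantizable parameters at 2 bytes each, and dropping to 1 byte per weight lands near the ~81GB INT8 figure (the modules kept in full precision account for the small difference). A quick sketch of the arithmetic:

```python
def ckpt_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough checkpoint size in GB, ignoring quantization metadata."""
    return n_params * bytes_per_param / 1e9

n_params = 80e9  # implied by the ~160 GB BF16 checkpoint at 2 bytes/param

print(f"BF16: ~{ckpt_gb(n_params, 2):.0f} GB")  # BF16: ~160 GB
print(f"INT8: ~{ckpt_gb(n_params, 1):.0f} GB")  # INT8: ~80 GB (~81 GB on disk with fp32/bf16 modules)
```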

Quality Notes

INT8 quantization preserves image quality well: generated images show correct anatomy, proper finger counts, and strong resistance to extra limbs and other common AI artifacts. On a 96GB Blackwell GPU, the Base INT8 variant generates a 1024x1024 image in roughly 4 minutes.

Usage

With the generation scripts

The easiest way to use this model is with the companion generation scripts:

git clone https://github.com/jamesw767/hunyuan-image-int8.git
cd hunyuan-image-int8
pip install -r requirements.txt

# Download this model
huggingface-cli download jamesw767/HunyuanImage-3-Base-INT8 \
    --local-dir ./HunyuanImage-3-Base-INT8

# Generate
python generate.py \
    --model-path ./HunyuanImage-3-Base-INT8 \
    --prompt "A red fox sitting in autumn leaves, realistic photography"

Direct loading with transformers

from transformers import AutoModelForCausalLM
import torch

model_path = "jamesw767/HunyuanImage-3-Base-INT8"

# The checkpoint ships pre-quantized: transformers reads the bitsandbytes
# quantization_config stored in config.json, so no BitsAndBytesConfig is
# needed at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Note: Direct loading requires the exception-based memory management trick to handle VAE decode — see the generation scripts repo for the full pipeline.
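
The general shape of that trick can be sketched in plain Python. The idea is that returning normally from a deep call stack keeps every caller frame (and its locals) alive, while raising an exception unwinds the stack so those references die and can be collected before decode. The names below (UnwindForDecode, deep_diffusion_loop) are illustrative, not the repo's actual identifiers:

```python
import gc


class UnwindForDecode(Exception):
    """Raised to pop the deep diffusion call stack so its locals
    (KV cache, MoE activations) become unreachable before decode."""

    def __init__(self, payload):
        self.payload = payload


def deep_diffusion_loop():
    big_activations = [bytearray(1024) for _ in range(4)]  # stand-in for ~80 GB of tensors
    latents = "latents"
    # Raising (instead of returning) unwinds every frame between here and
    # the handler, dropping big_activations and all intermediate locals.
    raise UnwindForDecode(latents)


def generate():
    try:
        deep_diffusion_loop()
    except UnwindForDecode as exc:
        latents = exc.payload
    gc.collect()  # in the real pipeline, torch.cuda.empty_cache() would follow
    return f"decoded({latents})"
```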

How It Was Made

The INT8 weights were created using save_quantized.py from the generation scripts:

  1. Load the BF16 model with BitsAndBytesConfig(load_in_8bit=True)
  2. Extract the quantized state dict, resolving meta tensors from accelerate's CPU offload hooks
  3. Save as sharded safetensors (5GB per shard, 17 shards total)
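
Step 3's sharding can be pictured as a greedy bin-fill over tensor sizes, each shard capped at the size limit. This is an illustration of the idea (with toy byte counts), not the script's actual code:

```python
def shard_state_dict(tensor_bytes: dict, max_shard_bytes: int) -> list:
    """Greedily partition {tensor_name: nbytes} into shards no larger than
    max_shard_bytes; a single oversized tensor gets its own shard."""
    shards, current, used = [], {}, 0
    for name, nbytes in tensor_bytes.items():
        if current and used + nbytes > max_shard_bytes:
            shards.append(current)
            current, used = {}, 0
        current[name] = nbytes
        used += nbytes
    if current:
        shards.append(current)
    return shards


sizes = {"a": 3, "b": 3, "c": 5, "d": 2}
print(shard_state_dict(sizes, 5))  # [{'a': 3}, {'b': 3}, {'c': 5}, {'d': 2}]
```

With the real checkpoint, the same greedy pass over ~81GB of tensors at a 5GB cap yields the 17 shards noted above, plus an index file mapping each tensor name to its shard.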

Modules excluded from INT8 quantization (kept in original precision): vae, vision_model, vision_aligner, patch_embed, final_layer, time_embed, time_embed_2, timestep_emb, guidance_emb, timestep_r_emb, lm_head, model.wte, model.ln_f
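
A skip list like this is typically applied by substring-matching against each module's dotted path, which is roughly how bitsandbytes/transformers handle llm_int8_skip_modules. A minimal sketch of that matching (illustrative only):

```python
SKIP = ["vae", "vision_model", "vision_aligner", "patch_embed", "final_layer",
        "time_embed", "time_embed_2", "timestep_emb", "guidance_emb",
        "timestep_r_emb", "lm_head", "model.wte", "model.ln_f"]


def quantize_to_int8(module_path: str) -> bool:
    # A linear layer is converted to Linear8bitLt only if no skip entry
    # appears in its dotted module path.
    return not any(skip in module_path for skip in SKIP)


print(quantize_to_int8("model.layers.0.mlp.experts.7.up_proj"))  # True
print(quantize_to_int8("vae.decoder.conv_in"))                   # False
```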

GPU Requirements

  • 96GB VRAM recommended: RTX PRO 6000 Blackwell, A100 80GB+, H100
  • 48GB+ VRAM: May work with aggressive CPU offloading via --gpu-budget / --cpu-budget
  • System RAM: 64GB+ recommended (offloaded layers use CPU memory)
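
Budget flags like --gpu-budget / --cpu-budget presumably translate into something like transformers' max_memory mapping, which drives accelerate's layer placement under device_map="auto". A hedged sketch (the helper name and exact flag semantics are assumptions, not the scripts' actual code):

```python
def max_memory_map(gpu_budget_gib: int, cpu_budget_gib: int, gpu_index: int = 0) -> dict:
    """Build a mapping in the shape accepted by transformers' `max_memory=`
    argument; layers that exceed the GPU budget are offloaded to CPU RAM."""
    return {gpu_index: f"{gpu_budget_gib}GiB", "cpu": f"{cpu_budget_gib}GiB"}


print(max_memory_map(44, 96))  # {0: '44GiB', 'cpu': '96GiB'}
```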

During diffusion, KV cache and MoE activations expand to ~80GB regardless of model weight placement. The generation scripts use an exception-based stack unwinding trick to free this memory before VAE decode.

Differences from Instruct/Distil Variants

                   Base    Instruct  Instruct-Distil
Steps              20      50        8
CFG                No      Yes       No
Chat format        No      Yes       Yes
Speed (96GB GPU)   ~4 min  ~13 min   ~90s

Other INT8 Models

License

This model is a derivative of tencent/HunyuanImage-3.0, released under the Tencent Hunyuan Community License.

Important: This license does not apply in the European Union, United Kingdom, or South Korea.

Tencent Hunyuan is licensed under the Tencent Hunyuan Community License Agreement, Copyright (c) 2025 Tencent. All Rights Reserved. The trademark rights of "Tencent Hunyuan" are owned by Tencent or its affiliate.

Credits
