中文文档 (Chinese Documentation)

HunyuanImage-3.0 Logo

๐ŸŽจ HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation

HunyuanImage-3.0 Banner

๐Ÿ‘ Join our WeChat and Discord | ๐Ÿ’ป Official website(ๅฎ˜็ฝ‘) Try our model!  

๐Ÿ”ฅ๐Ÿ”ฅ๐Ÿ”ฅ News

๐Ÿงฉ Community Contributions

If you develop or use HunyuanImage-3.0 in your projects, please let us know.

๐Ÿ“‘ Open-source Plan

  • HunyuanImage-3.0 (Image Generation Model)
    • Inference
    • HunyuanImage-3.0 Checkpoints
    • HunyuanImage-3.0-Instruct Checkpoints (with reasoning)
    • vLLM Support
    • Distilled Checkpoints
    • Image-to-Image Generation
    • Multi-turn Interaction

๐Ÿ—‚๏ธ Contents


๐Ÿ“– Introduction

HunyuanImage-3.0 is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image and image-to-image model achieves performance comparable to or surpassing leading closed-source models.

HunyuanImage-3.0 Framework

โœจ Key Features

  • ๐Ÿง  Unified Multimodal Architecture: Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.

  • ๐Ÿ† The Largest Image Generation MoE Model: This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.

  • ๐ŸŽจ Superior Image Generation Performance: Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.

  • ๐Ÿ’ญ Intelligent Image Understanding and World-Knowledge Reasoning: The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It under stands user's input image, and leverages its extensive world knowledge to intelligently interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.

๐Ÿš€ Usage

๐Ÿ“ฆ Environment Setup

  • ๐Ÿ Python: 3.12+ (recommended and tested)
  • โšก CUDA: 12.8

๐Ÿ“ฅ Install Dependencies

# 1. First install PyTorch (CUDA 12.8 Version)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 2. Install tencentcloud-sdk for Prompt Enhancement (PE); needed only for HunyuanImage-3.0, not HunyuanImage-3.0-Instruct
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

# 3. Then install other dependencies
pip install -r requirements.txt

For up to 3x faster inference, install these optimizations:

# FlashInfer for optimized MoE inference. v0.5.0 is tested.
pip install flashinfer-python==0.5.0

๐Ÿ’กInstallation Tips: It is critical that the CUDA version used by PyTorch matches the system's CUDA version. FlashInfer relies on this compatibility when compiling kernels at runtime. GCC version >=9 is recommended for compiling FlashAttention and FlashInfer.

โšก Performance Tips: These optimizations can significantly speed up your inference!

๐Ÿ’กNotation: When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.

HunyuanImage-3.0-Instruct (Instruction reasoning and Image-to-image generation, including editing and multi-image fusion)

๐Ÿ”ฅ Quick Start with Transformers

1๏ธโƒฃ Download model weights
# Download from HuggingFace and rename the directory.
# Notice that the directory name should not contain dots, which may cause issues when loading using Transformers.
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
2๏ธโƒฃ Run with Transformers
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3-Instruct"
# Currently the model cannot be loaded directly with the HF model id `tencent/HunyuanImage-3.0-Instruct`
# due to the dot in the name.

kwargs = dict(
    attn_implementation="sdpa", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
    moe_drop_tokens=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Image-to-Image generation (TI2I)
prompt = "基于图一的logo，参考图二中冰箱贴的材质，制作一个新的冰箱贴"  # "Based on the logo in image 1 and the fridge-magnet material in image 2, make a new fridge magnet"

input_img1 = "./assets/demo_instruct_imgs/input_1_0.png"
input_img2 = "./assets/demo_instruct_imgs/input_1_1.png"
imgs_input = [input_img1, input_img2]

cot_text, samples = model.generate_image(
    prompt=prompt,
    image=imgs_input,
    seed=42,
    image_size="auto",
    use_system_prompt="en_unified",
    bot_task="think_recaption",  # Use "think_recaption" for reasoning and enhancement
    infer_align_image_size=True,  # Align output image size to input image size
    diff_infer_steps=50, 
    verbose=2
)

# Save the generated image
samples[0].save("image_edit.png")

๐Ÿ  Local Installation & Usage

1๏ธโƒฃ Clone the Repository
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
2๏ธโƒฃ Download Model Weights
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
3๏ธโƒฃ Run the Demo

More demos in run_demo_instruct.sh.

export MODEL_PATH="./HunyuanImage-3-Instruct"
bash run_demo_instruct.sh
4๏ธโƒฃ Command Line Arguments
| Argument | Description | Recommended |
|---|---|---|
| `--prompt` | Input prompt | (Required) |
| `--image` | Input image(s). For multiple images, use comma-separated paths (e.g., `img1.png,img2.png`) | (Required) |
| `--model-id` | Model path | (Required) |
| `--attn-impl` | Attention implementation. Currently only `sdpa` is supported | `sdpa` |
| `--moe-impl` | MoE implementation. Either `eager` or `flashinfer` | `flashinfer` |
| `--seed` | Random seed for image generation. Use `None` for a random seed | `None` |
| `--diff-infer-steps` | Number of inference steps | 50 |
| `--image-size` | Image resolution. Can be `auto`, an explicit size like `1280x768`, or a ratio like `16:9` | `auto` |
| `--use-system-prompt` | System prompt type. Options: `None`, `dynamic`, `en_vanilla`, `en_recaption`, `en_think_recaption`, `en_unified`, `custom` | `en_unified` |
| `--system-prompt` | Custom system prompt. Used when `--use-system-prompt` is `custom` | `None` |
| `--bot-task` | Task type. `image` for direct generation; `auto` for text; `recaption` for rewrite, then generate; `think_recaption` for think, rewrite, then generate | `think_recaption` |
| `--save` | Image save path | `image.png` |
| `--verbose` | Verbose level | 2 |
| `--reproduce` | Whether to reproduce the results | `True` |
| `--infer-align-image-size` | Whether to align the target image size to the source image size | `True` |
| `--max_new_tokens` | Maximum number of new tokens to generate | 2048 |
| `--use-taylor-cache` | Use Taylor Cache when sampling | `False` |
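The `--image-size` flag accepts three forms (`auto`, an explicit `WxH` size, or a `W:H` aspect ratio). A hypothetical parser illustrating how these forms can be distinguished (not the repository's actual implementation):

```python
def parse_image_size(value: str):
    """Classify an --image-size value: 'auto', an explicit 'WxH' size,
    or a 'W:H' aspect ratio. Illustrative helper only."""
    if value == "auto":
        return ("auto", None)
    if "x" in value:
        w, h = value.split("x")
        return ("explicit", (int(w), int(h)))
    if ":" in value:
        w, h = value.split(":")
        return ("ratio", (int(w), int(h)))
    raise ValueError(f"Unrecognized image size: {value!r}")

print(parse_image_size("1280x768"))  # ('explicit', (1280, 768))
print(parse_image_size("16:9"))      # ('ratio', (16, 9))
```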
5๏ธโƒฃ For fewer Sampling Steps

We recommend using the model HunyuanImage-3.0-Instruct-Distil with --diff-infer-steps 8, while keeping all other recommended parameter values unchanged.

# Download HunyuanImage-3.0-Instruct-Distil from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct-Distil --local-dir ./HunyuanImage-3-Instruct-Distil

# Run the demo with 8 sampling steps
export MODEL_PATH="./HunyuanImage-3-Instruct-Distil"
bash run_demo_instruct_Distil.sh
Previous Version (Pure Text-to-Image)

HunyuanImage-3.0 (Text-to-image)

๐Ÿ”ฅ Quick Start with Transformers

1๏ธโƒฃ Download model weights
# Download from HuggingFace and rename the directory.
# Notice that the directory name should not contain dots, which may cause issues when loading using Transformers.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
2๏ธโƒฃ Run with Transformers
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3"
# Currently the model cannot be loaded directly with the HF model id `tencent/HunyuanImage-3.0`
# due to the dot in the name.

kwargs = dict(
    attn_implementation="sdpa",     # Use "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# generate the image
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")

๐Ÿ  Local Installation & Usage

1๏ธโƒฃ Clone the Repository
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
2๏ธโƒฃ Download Model Weights
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
3๏ธโƒฃ Run the Demo

The pretrained checkpoint does not automatically rewrite or enhance input prompts. For optimal results, we currently recommend that community partners use DeepSeek to rewrite prompts. You can apply for an API key on Tencent Cloud.

# Without PE
export MODEL_PATH="./HunyuanImage-3"
python3 run_image_gen.py \
    --model-id $MODEL_PATH \
    --verbose 1 \
    --prompt "A brown and white dog is running on the grass" \
    --bot-task image \
    --image-size "1024x1024" \
    --save ./image.png \
    --moe-impl flashinfer

# With PE
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"
export MODEL_PATH="./HunyuanImage-3"
python3 run_image_gen.py \
    --model-id $MODEL_PATH \
    --verbose 1 \
    --prompt "A brown and white dog is running on the grass" \
    --bot-task image \
    --image-size "1024x1024" \
    --save ./image.png \
    --moe-impl flashinfer \
    --rewrite 1
4๏ธโƒฃ Command Line Arguments
| Argument | Description | Recommended |
|---|---|---|
| `--prompt` | Input prompt | (Required) |
| `--model-id` | Model path | (Required) |
| `--attn-impl` | Attention implementation. Either `sdpa` or `flash_attention_2` | `sdpa` |
| `--moe-impl` | MoE implementation. Either `eager` or `flashinfer` | `flashinfer` |
| `--seed` | Random seed for image generation | `None` |
| `--diff-infer-steps` | Number of diffusion inference steps | 50 |
| `--image-size` | Image resolution. Can be `auto`, an explicit size like `1280x768`, or a ratio like `16:9` | `auto` |
| `--save` | Image save path | `image.png` |
| `--verbose` | Verbose level. 0: no log; 1: log inference information | 0 |
| `--rewrite` | Whether to enable prompt rewriting | 1 |

๐ŸŽจ Interactive Gradio Demo

Launch an interactive web interface for easy text-to-image generation.

1๏ธโƒฃ Install Gradio
pip install "gradio>=4.21.0"
2๏ธโƒฃ Configure Environment
# Set your model path
export MODEL_ID="path/to/your/model"

# Optional: Configure GPU usage (default: 0,1,2,3)
export GPUS="0,1,2,3"

# Optional: Configure host and port (default: 0.0.0.0:443)
export HOST="0.0.0.0"
export PORT="443"
3๏ธโƒฃ Launch the Web Interface

Basic Launch:

sh run_app.sh

With Performance Optimizations:

# Use both optimizations for maximum performance
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
4๏ธโƒฃ Access the Interface

๐ŸŒ Web Interface: Open your browser and navigate to http://localhost:443 (or your configured port)

๐Ÿงฑ Models Cards

| Model | Params | Download | Recommended VRAM | Supported |
|---|---|---|---|---|
| HunyuanImage-3.0 | 80B total (13B active) | HuggingFace | ≥ 3 × 80 GB | ✅ Text-to-Image |
| HunyuanImage-3.0-Instruct | 80B total (13B active) | HuggingFace | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite<br>✅ CoT Think |
| HunyuanImage-3.0-Instruct-Distil | 80B total (13B active) | HuggingFace | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite<br>✅ CoT Think<br>✅ Fewer sampling steps (8 steps recommended) |

Notes:

  • Install performance extras (FlashAttention, FlashInfer) for faster inference.
  • Multiโ€‘GPU inference is recommended for the Base model.

๐Ÿ“Š Evaluation

Evaluation of HunyuanImage-3.0-Instruct

  • ๐Ÿ‘ฅ GSB (Human Evaluation) We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000+ single- and multi-images editing cases, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators.

Human Evaluation with Other Models

Evaluation of HunyuanImage-3.0 (Text-to-Image)

  • ๐Ÿค– SSAE (Machine Evaluation)
    SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used multimodal large language models to automatically evaluate and score by comparing the generated images with these key points based on the visual content of the images. Mean Image Accuracy represents the image-wise average score across all key points, while Global Accuracy directly calculates the average score across all key points.

Human Evaluation with Other Models

Human Evaluation with Other Models
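The two SSAE aggregates differ only in how key points are pooled. A minimal sketch with hypothetical per-image key-point scores (1 = matched, 0 = missed):

```python
def mean_image_accuracy(per_image_scores):
    """Average each image's key-point scores first, then average over images."""
    return sum(sum(s) / len(s) for s in per_image_scores) / len(per_image_scores)

def global_accuracy(per_image_scores):
    """Pool every key point across all images, then average once."""
    flat = [x for img in per_image_scores for x in img]
    return sum(flat) / len(flat)

# Two images: one with 2 key points all matched, one with 4 all missed.
scores = [[1, 1], [0, 0, 0, 0]]
print(mean_image_accuracy(scores))        # 0.5
print(round(global_accuracy(scores), 3))  # 0.333
```

The example shows why the two numbers can diverge: images with few key points count more per key point under the image-wise average.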

  • ๐Ÿ‘ฅ GSB (Human Evaluation)

We adopted the GSB (Good/Same/Bad) evaluation method, commonly used to assess the relative performance of two models from an overall image-perception perspective. In total, we used 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results. When comparing with baseline methods, we kept the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators.

Human Evaluation with Other Models
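GSB outcomes are typically summarized as a relative win rate. The formula below, (G - B) / (G + S + B), is a common convention and an assumption on our part, not necessarily the exact metric used in the charts; the vote counts are hypothetical:

```python
def gsb_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate from Good/Same/Bad vote counts:
    positive means the evaluated model is preferred over the baseline."""
    total = good + same + bad
    return (good - bad) / total

# Hypothetical counts from 1,000 pairwise comparisons.
print(gsb_win_rate(420, 380, 200))  # 0.22
```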

๐Ÿ–ผ๏ธ Showcase

Our model can follow complex instructions to generate highโ€‘quality, creative images.

HunyuanImage 3.0 Demo

For text-to-image showcases in HunyuanImage-3.0, click the following links:

Showcases of HunyuanImage-3.0-Instruct

HunyuanImage-3.0-Instruct demonstrates powerful capabilities in intelligent image generation and editing. The following showcases highlight its core features:

  • ๐Ÿง  Intelligent Visual Understanding and Reasoning (CoT Think): The model performs structured thinking to analyze user's input image and prompt, expand user's intent and editing tasks into a stucture, comprehnsive instructions, and leading to a better image generation and editing performance.

breaking down complex prompts and editing tasks into detailed visual components including subject, composition, lighting, color palette, and style.

  • โœ๏ธ Prompt Self-Rewrite: Automatically enhances sparse or vague prompts into professional-grade, detail-rich descriptions that capture the user's intent more accurately.

  • ๐ŸŽจ Text-to-Image (T2I): Generates high-quality images from text prompts with exceptional prompt adherence and photorealistic quality.

  • ๐Ÿ–ผ๏ธ Image-to-Image (TI2I): Supports creative image editing, including adding elements, removing objects, modifying styles, and seamless background replacement while preserving key visual elements.

  • ๐Ÿ”€ Multi-Image Fusion: Intelligently combines multiple reference images (up to 3 inputs) to create coherent composite images that integrate visual elements from different sources.

Showcase 1: Detailed Thought and Reasoning Process

HunyuanImage-3.0-Instruct Showcase 1

Showcase 2: Creative T2I Generation with Complex Scene Understanding

Prompt: 3D 毛绒质感拟人化马，暖棕浅棕肌理，穿藏蓝西装、白衬衫，戴深棕手套；疲惫带期待，坐于电脑前，旁置印 "HAPPY AGAIN" 的马克杯。橙红渐变背景，配超大号藏蓝粗体 "马上下班"，叠加米黄 "Happy New Year" 并标 "(2026)"。橙红为主，藏蓝米黄撞色，毛绒温暖柔和。 (Translation: A 3D plush-textured anthropomorphic horse with warm light-brown fur, wearing a navy suit, white shirt, and dark brown gloves; tired but expectant, sitting at a computer with a mug printed "HAPPY AGAIN" beside it. Orange-red gradient background with oversized bold navy text "马上下班" ("off work soon"), overlaid with cream-yellow "Happy New Year" marked "(2026)". Predominantly orange-red with navy and cream accents; a warm, soft plush feel.)

HunyuanImage-3.0-Instruct Showcase 2

Showcase 3: Precise Image Editing with Element Preservation

HunyuanImage-3.0-Instruct Showcase 3

Showcase 4: Style Transformation with Thematic Enhancement

HunyuanImage-3.0-Instruct Showcase 4

Showcase 5: Advanced Style Transfer and Product Mockup Generation

HunyuanImage-3.0-Instruct Showcase 5

Showcase 6: Multi-Image Fusion and Creative Composition

HunyuanImage-3.0-Instruct Showcase 6

๐Ÿ“š Citation

If you find HunyuanImage-3.0 useful in your research, please cite our work:

@article{cao2025hunyuanimage,
  title={HunyuanImage 3.0 Technical Report},
  author={Cao, Siyu and Chen, Hangting and Chen, Peng and Cheng, Yiji and Cui, Yutao and Deng, Xinchi and Dong, Ying and Gong, Kipper and Gu, Tianpeng and Gu, Xiusen and others},
  journal={arXiv preprint arXiv:2509.23951},
  year={2025}
}

๐Ÿ™ Acknowledgements

We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:

๐ŸŒŸ๐Ÿš€ Github Star History

GitHub stars GitHub forks

Star History Chart
