---

license: other
license_name: tencent-hunyuan-community
license_link: https://huggingface.co/tencent/HunyuanImage-3.0/blob/main/LICENSE.txt
base_model: tencent/HunyuanImage-3.0-Instruct-Distil
pipeline_tag: text-to-image
library_name: transformers
tags:
- Hunyuan
- hunyuan
- quantization
- nf4
- comfyui
- custom-nodes
- autoregressive
- DiT
- HunyuanImage-3.0
- instruct
- image-editing
- bitsandbytes
- 4bit
- distilled
---


# Hunyuan Image 3.0 Instruct Distil -- NF4 Quantized (v2)

NF4 (4-bit) quantization of the HunyuanImage-3.0 Instruct Distil model (v2). This is the most accessible variant: it fits on a single 48GB GPU and generates images ~6x faster than the full Instruct model (8 diffusion steps vs 50), giving the best balance of speed, quality, and VRAM.

## What's New in v2

v2 uses improved quantization with more precise skip-module selection, keeping attention projections and critical embedding layers in full BF16 precision for better image quality.

## Key Features

- **Instruct model** -- supports text-to-image, image editing, multi-image fusion
- **Chain-of-Thought** -- built-in `think_recaption` mode for highest quality
- **NF4 quantized** -- ~48 GB on disk
- **8 diffusion steps** (CFG-distilled)
- **Block swap support** -- offload transformer blocks to CPU for lower VRAM
- **ComfyUI ready** -- works with [Comfy_HunyuanImage3](https://github.com/EricRollei/Comfy_HunyuanImage3) nodes

## VRAM Requirements

| Component | Memory |
|-----------|--------|
| Weight loading | ~29 GB |
| Inference (additional) | ~12-20 GB |
| **Total** | **~41-49 GB** |

**Recommended Hardware:**

- **Single 48GB GPU** (RTX 6000 Ada, RTX PRO 5000, A6000)
- With block swap: may work on 24GB GPUs (swapping ~20 blocks)


## Model Details

- **Architecture:** HunyuanImage-3.0 Mixture-of-Experts Diffusion Transformer
- **Parameters:** 80B total, 13B active per token (top-K MoE routing)
- **Variant:** Instruct Distil (CFG-Distilled, 8-step)
- **Quantization:** 4-bit NormalFloat (NF4) quantization via bitsandbytes with double quantization
- **Diffusion Steps:** 8
- **Default Guidance Scale:** 2.5
- **Resolution:** Up to 2048x2048
- **Language:** English and Chinese prompts

### Distillation

This is the **CFG-Distilled** variant:
- Only **8 diffusion steps** needed (vs 50 for the full Instruct model)
- **~6x faster** image generation
- Negligible quality loss -- distilled to closely match the full model's output
- `cfg_distilled: true` -- no separate classifier-free guidance pass is needed

## Quantization Details

**Layers quantized to NF4:**
- Feed-forward networks (FFN/MLP layers)
- Expert layers in MoE architecture (64 experts per layer)
- Large linear transformations

**Kept in full precision (BF16):**
- VAE encoder/decoder (critical for image quality)
- Attention projection layers (q_proj, k_proj, v_proj, o_proj)
- Patch embedding layers
- Time embedding layers
- Vision model (SigLIP2)
- Final output layers
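
The recipe above corresponds roughly to a bitsandbytes configuration like the one below. This is an illustrative sketch, not the exact script used to produce this checkpoint; the module names in the skip list are assumptions inferred from the lists above, not verified against this checkpoint's state dict.

```python
import torch
from transformers import BitsAndBytesConfig

# Approximate NF4 recipe: 4-bit NormalFloat with double quantization,
# BF16 compute, and the precision-critical modules left unquantized.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantized matmuls run in BF16
    llm_int8_skip_modules=[                 # assumed names; kept in BF16
        "q_proj", "k_proj", "v_proj", "o_proj",
        "patch_embed", "time_embed", "vae", "vision_model", "lm_head",
    ],
)
```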

## Usage

### ComfyUI (Recommended)

This model is designed to work with the [Comfy_HunyuanImage3](https://github.com/EricRollei/Comfy_HunyuanImage3) custom nodes:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/EricRollei/Comfy_HunyuanImage3
```

1. Download this model to your preferred models directory
2. Use the **"Hunyuan 3 Instruct Loader"** node
3. Select this model folder and choose `nf4` precision
4. Connect to the **"Hunyuan 3 Instruct Generate"** node for text-to-image
5. Or use **"Hunyuan 3 Instruct Edit"** for image editing
6. Or use **"Hunyuan 3 Instruct Multi-Fusion"** for combining multiple images
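
Step 1 can be done with `huggingface-cli`. The repository id and target directory below are placeholders -- substitute the actual id of this model repo and whatever models path your loader expects.

```shell
# Placeholder repo id and path -- replace with this repo's actual
# Hugging Face id and your preferred models directory.
huggingface-cli download <user>/<this-repo-id> \
  --local-dir ComfyUI/models/hunyuan_image_3
```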

### Bot Task Modes

The Instruct model supports three generation modes:

| Mode | Description | Speed |
|------|-------------|-------|
| `image` | Direct text-to-image, prompt used as-is | Fastest |
| `recaption` | Model rewrites prompt into detailed description, then generates | Medium |
| `think_recaption` | CoT reasoning -> prompt enhancement -> generation (best quality) | Slowest |
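
Outside ComfyUI, the upstream HunyuanImage-3.0 checkpoints expose a custom `generate_image` method via `trust_remote_code`. The sketch below assumes this quantized repo keeps that interface and that the mode is selected with a `bot_task` argument -- verify both against the loaded model before relying on them.

```python
from transformers import AutoModelForCausalLM

# Load the pre-quantized checkpoint. The local path and the bot_task
# argument are assumptions; check the upstream model card for the
# exact custom-code interface.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/HunyuanImage-3.0-Instruct-Distil-NF4",
    trust_remote_code=True,   # HunyuanImage ships custom modeling code
    device_map="auto",
)

image = model.generate_image(
    prompt="A watercolor fox in a misty forest",
    bot_task="think_recaption",  # assumed values: "image" | "recaption" | "think_recaption"
)
image.save("fox.png")
```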

## Block Swap

Block swap lets the model run on GPUs with less VRAM than the full set of
weights requires. The loader keeps N transformer blocks on the CPU and swaps
them to the GPU on demand during each diffusion step.

| blocks_to_swap | VRAM Saved | Recommended For |
|---------------|------------|-----------------|
| 0 | 0 GB | 96GB+ GPU (no swap needed) |
| 4 | ~5 GB | 80-90GB GPU |
| 8 | ~10 GB | 64-80GB GPU |
| 16 | ~19 GB | 48-64GB GPU |
| -1 (auto) | varies | Let the system decide |
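
The swap schedule can be illustrated with a toy simulation in plain Python (no real GPU; devices are just strings): `blocks_to_swap` blocks live "on CPU" and each is brought "to GPU" only for its forward pass, then evicted. This is a sketch of the idea, not the node's actual implementation.

```python
# Toy block-swap simulation: the last `blocks_to_swap` transformer blocks
# are CPU-resident and only occupy the GPU for the moment they are needed.

class Block:
    def __init__(self, idx, resident):
        self.idx = idx
        self.device = "gpu" if resident else "cpu"

def run_step(blocks):
    """Run one diffusion step; return the peak number of GPU-resident blocks."""
    peak = sum(b.device == "gpu" for b in blocks)
    for b in blocks:
        swapped_in = b.device == "cpu"
        if swapped_in:
            b.device = "gpu"  # copy weights to GPU on demand
        # ... the block's forward pass would happen here ...
        peak = max(peak, sum(x.device == "gpu" for x in blocks))
        if swapped_in:
            b.device = "cpu"  # evict immediately after use
    return peak

num_blocks, blocks_to_swap = 32, 16
blocks = [Block(i, resident=i < num_blocks - blocks_to_swap)
          for i in range(num_blocks)]
print(run_step(blocks))  # prints 17: 16 resident blocks + 1 swapped in
```

At any instant only one swapped block sits on the GPU, which is why VRAM savings scale with `blocks_to_swap` while the per-step cost is the extra CPU-to-GPU copies.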

## Original Model

This is a quantized derivative of [Tencent's HunyuanImage-3.0 Instruct Distil](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil).

- **License:** [Tencent Hunyuan Community License](https://huggingface.co/tencent/HunyuanImage-3.0/blob/main/LICENSE.txt)

## Credits

- **Original Model:** [Tencent Hunyuan Team](https://huggingface.co/tencent)
- **Quantization:** Eric Rollei
- **ComfyUI Integration:** [Comfy_HunyuanImage3](https://github.com/EricRollei/Comfy_HunyuanImage3)

## License

This model inherits the license from the original Hunyuan Image 3.0 model:
[Tencent Hunyuan Community License](https://huggingface.co/tencent/HunyuanImage-3.0/blob/main/LICENSE.txt)