| # AutoRound | |
| [AutoRound](https://github.com/intel/auto-round) is an advanced quantization toolkit. It achieves high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning by leveraging sign-gradient descent, and it offers broad hardware compatibility. See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](https://arxiv.org/abs/2512.04746) for more details. | |
| Install `auto-round` (version ≥ 0.13.0): | |
| ```bash | |
| pip install "auto-round>=0.13.0" | |
| ``` | |
| To use the Marlin kernel for faster CUDA inference, install `gptqmodel`: | |
| ```bash | |
| pip install "gptqmodel>=5.8.0" | |
| ``` | |
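Before loading a model, it can help to confirm that the installed packages satisfy the minimum versions above. The following is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `check_requirement`) are hypothetical and not part of `auto-round` or Diffusers, and the comparison deliberately ignores pre-release suffixes such as `rc1`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Return the leading numeric components, e.g. "0.13.0" -> (0, 13, 0).

    Simplification (assumption): pre-release and local suffixes are ignored.
    """
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric release components."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(distribution, minimum):
    """Return (ok, message) for a distribution name as used with pip."""
    try:
        installed = version(distribution)
    except PackageNotFoundError:
        return False, f"{distribution} is not installed"
    ok = meets_minimum(installed, minimum)
    return ok, f"{distribution} {installed} (minimum {minimum})"


if __name__ == "__main__":
    for name, minimum in [("auto-round", "0.13.0"), ("gptqmodel", "5.8.0")]:
        ok, message = check_requirement(name, minimum)
        print("OK  " if ok else "FAIL", message)
```

`gptqmodel` is only needed for the Marlin and ExllamaV2 kernels, so a failure on that line is not fatal if you use another backend.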
| ## Load a quantized model | |
| Load a pre-quantized AutoRound model by passing `AutoRoundConfig` to [from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained). The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers. | |
| You can use [PipelineQuantizationConfig](/docs/diffusers/main/en/api/quantization#diffusers.PipelineQuantizationConfig) to quantize specific components of a pipeline: | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig | |
| pipeline_quant_config = PipelineQuantizationConfig( | |
| quant_mapping={"transformer": AutoRoundConfig(backend="auto")} | |
| ) | |
| pipe = DiffusionPipeline.from_pretrained( | |
| "INCModel/Z-Image-W4A16-AutoRound", | |
| quantization_config=pipeline_quant_config, | |
| torch_dtype=torch.bfloat16, | |
| device_map="cuda", | |
| ) | |
| image = pipe("a cat holding a sign that says hello").images[0] | |
| image.save("output.png") | |
| ``` | |
| Or load a quantized model component directly: | |
| ```python | |
| import torch | |
| from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig | |
| model_id = "INCModel/Z-Image-W4A16-AutoRound" | |
| quantization_config = AutoRoundConfig(backend="auto") | |
| transformer = ZImageTransformer2DModel.from_pretrained( | |
| model_id, | |
| subfolder="transformer", | |
| quantization_config=quantization_config, | |
| torch_dtype=torch.bfloat16, | |
| device_map="cuda", | |
| ) | |
| pipe = ZImagePipeline.from_pretrained( | |
| model_id, | |
| transformer=transformer, | |
| torch_dtype=torch.bfloat16, | |
| device_map="cuda", | |
| ) | |
| image = pipe("a cat holding a sign that says hello").images[0] | |
| image.save("output.png") | |
| ``` | |
| > [!NOTE] | |
| > AutoRound in Diffusers only supports loading *pre-quantized* models. To quantize a model from scratch, use the [AutoRound CLI or Python API](https://github.com/intel/auto-round) directly, then load the result with Diffusers. | |
| ## torch.compile | |
| AutoRound is compatible with [`torch.compile`](../optimization/fp16#torchcompile) for faster inference. You can compile the quantized transformer (DiT) for better performance: | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig | |
| pipeline_quant_config = PipelineQuantizationConfig( | |
| quant_mapping={"transformer": AutoRoundConfig(backend="auto")} | |
| ) | |
| pipe = DiffusionPipeline.from_pretrained( | |
| "INCModel/Z-Image-W4A16-AutoRound", | |
| quantization_config=pipeline_quant_config, | |
| torch_dtype=torch.bfloat16, | |
| device_map="cuda", | |
| ) | |
| pipe.transformer = torch.compile(pipe.transformer, mode="default", fullgraph=False) | |
| ``` | |
| ## Backends | |
| AutoRound supports multiple inference backends for weight-only quantized models. The backend controls which kernel handles dequantization during the forward pass. Set the `backend` parameter in `AutoRoundConfig` to choose one: | |
| | Backend | Value | Device | Requirements | Notes | | |
| |---------|-------|--------|--------------|-------| | |
| | **Auto** | `"auto"` | Any | — | Default. Automatically selects the best available backend. | | |
| | **PyTorch** | `"torch"` | CPU / CUDA | — | Pure PyTorch implementation. Broadest compatibility. | | |
| | **Triton** | `"tritonv2"` | CUDA | `triton` | Triton-based kernel for GPU inference. | | |
| | **ExllamaV2** | `"exllamav2"` | CUDA | `gptqmodel>=5.8.0` | Good CUDA performance via the ExllamaV2 kernel. | | |
| | **Marlin** | `"marlin"` | CUDA | `gptqmodel>=5.8.0` | Best CUDA performance via the Marlin kernel. | | |
| ```python | |
| from diffusers import AutoRoundConfig | |
| # Auto-select (default) | |
| config = AutoRoundConfig() | |
| # Explicit Triton backend for CUDA | |
| config = AutoRoundConfig(backend="tritonv2") | |
| # Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0) | |
| config = AutoRoundConfig(backend="marlin") | |
| # ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0) | |
| config = AutoRoundConfig(backend="exllamav2") | |
| # PyTorch backend for CPU/CUDA inference | |
| config = AutoRoundConfig(backend="torch") | |
| ``` | |
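If you prefer to pin a backend explicitly instead of relying on `"auto"`, the preference order in the table can be sketched as a small helper. This is an illustration only: `choose_backend` and `detect_environment` are hypothetical names, the logic is not how AutoRound selects a backend internally, and it checks only whether a package is importable, not whether `gptqmodel` meets the `>=5.8.0` requirement.

```python
from importlib.util import find_spec


def choose_backend(has_cuda, has_gptqmodel, has_triton):
    """Pick a backend value from the table, fastest listed option first."""
    if has_cuda and has_gptqmodel:
        return "marlin"    # listed as best CUDA performance
    if has_cuda and has_triton:
        return "tritonv2"  # Triton kernel, needs only `triton`
    return "torch"         # pure PyTorch, works on CPU and CUDA


def detect_environment():
    """Probe the current environment without failing on missing packages."""
    has_cuda = False
    if find_spec("torch") is not None:
        import torch

        has_cuda = torch.cuda.is_available()
    return {
        "has_cuda": has_cuda,
        "has_gptqmodel": find_spec("gptqmodel") is not None,
        "has_triton": find_spec("triton") is not None,
    }


if __name__ == "__main__":
    backend = choose_backend(**detect_environment())
    print(f"Suggested backend: {backend}")
    # The result can then be passed as AutoRoundConfig(backend=backend).
```

In most cases `backend="auto"` remains the simpler choice; an explicit value is mainly useful for benchmarking kernels against each other or for reproducing a specific setup.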
| ## Save and load | |
| AutoRound requires calibration data to quantize a model. Quantization is done outside of Diffusers using the [AutoRound library](https://github.com/intel/auto-round) directly: | |
| ```python | |
| from auto_round import AutoRound | |
| autoround = AutoRound( | |
| "Tongyi-MAI/Z-Image", | |
| scheme="W4A16", # W4G128 symmetric | |
| enable_torch_compile=True, | |
| num_inference_steps=3, | |
| guidance_scale=7.5, | |
| dataset="coco2014", | |
| ) | |
| autoround.quantize_and_save("Z-Image-W4A16-AutoRound") | |
| ``` | |
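| For more details on calibration options, see the [AutoRound documentation](https://github.com/intel/auto-round). | |
| Load the quantized checkpoint with Diffusers. The example below uses the pre-quantized checkpoint from the Hub and passes no `AutoRoundConfig`: | |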
| ```python | |
| import torch | |
| from diffusers import ZImagePipeline | |
| model_id = "INCModel/Z-Image-W4A16-AutoRound" | |
| # The inference backend will be automatically selected. | |
| pipe = ZImagePipeline.from_pretrained( | |
| model_id, | |
| torch_dtype=torch.bfloat16, | |
| device_map="cuda", | |
| ) | |
| image = pipe("a cat holding a sign that says hello").images[0] | |
| image.save("output.png") | |
| ``` | |
| ### Supported quantization schemes | |
| AutoRound supports several quantization schemes: | |
| - **W4A16** (bits: 4, group_size: 128, sym: True, act_bits: 16) | |
| - **W8A16** (bits: 8, group_size: 128, sym: True, act_bits: 16) | |
| - **W3A16** (bits: 3, group_size: 128, sym: True, act_bits: 16) | |
| - **W2A16** (bits: 2, group_size: 128, sym: True, act_bits: 16) | |
| - **GGUF:Q4_K_M** (all Q*_K, Q*_0, and Q*_1 types provided by llama.cpp are supported) | |
| - **NVFP4** (Experimental feature; exporting to the `llm_compressor` format is recommended. data_type nvfp4, act_data_type nvfp4, static_global_scale, group_size 16) | |
| - **MXFP4** (**Research feature, no real kernel**, Standard MXFP4, data_type mxfp, act_data_type mxfp, bits 4, act_bits 4, group_size 32) | |
| - **MXINT4** (**Research feature, no real kernel**, Standard MXINT4, data_type mxint, act_data_type mxint, bits 4, act_bits 4, group_size 32) | |
| - **MXFP4_RCEIL** (**Research feature, no real kernel**, NVIDIA's variant, data_type mxfp, act_data_type mxfp_rceil, bits 4, act_bits 4, group_size 32) | |
| - **MXFP8** (**Research feature, no real kernel**, data_type mxfp, act_data_type mxfp_rceil, group_size 32) | |
| - **FPW8A16** (**Research feature, no real kernel**, data_type fp8, group_size 0 -> per tensor) | |
| - **FP8_STATIC** (**Research feature, no real kernel**, data_type: fp8, act_data_type: fp8, group_size -1 -> per channel, act_group_size 0 -> per tensor) | |
| You can also customize `group_size`, `bits`, `sym`, and many other settings, although a real inference kernel may not exist for every combination. | |
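To get a rough feel for what the integer weight-only schemes mean for storage, the parameters listed above can be turned into an approximate bits-per-weight figure. This is a back-of-the-envelope sketch, not AutoRound's actual packing format: it assumes one 16-bit scale per group and no stored zero-point, and it ignores packing overhead and any layers left unquantized. `IntScheme` and `approx_bits_per_weight` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntScheme:
    bits: int        # weight bit width
    group_size: int  # weights sharing one scale
    sym: bool        # symmetric quantization
    act_bits: int    # activation bit width (16 means unquantized here)


# Parameters copied from the scheme list above.
SCHEMES = {
    "W4A16": IntScheme(bits=4, group_size=128, sym=True, act_bits=16),
    "W8A16": IntScheme(bits=8, group_size=128, sym=True, act_bits=16),
    "W3A16": IntScheme(bits=3, group_size=128, sym=True, act_bits=16),
    "W2A16": IntScheme(bits=2, group_size=128, sym=True, act_bits=16),
}


def approx_bits_per_weight(scheme, scale_bits=16):
    """Weight bits plus the per-group scale amortized over the group.

    Assumption: one `scale_bits`-wide scale per group, no zero-point.
    """
    return scheme.bits + scale_bits / scheme.group_size


if __name__ == "__main__":
    for name, scheme in SCHEMES.items():
        bpw = approx_bits_per_weight(scheme)
        ratio = 16 / bpw  # relative to 16-bit weights
        print(f"{name}: ~{bpw:.3f} bits/weight, ~{ratio:.2f}x smaller")
```

Under these assumptions, W4A16 comes out at about 4.125 bits per weight, so the per-group scale adds only a small overhead at a group size of 128.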
| ## Resources | |
| - [Pre-quantized AutoRound models on the Hub](https://huggingface.co/models?search=autoround) | |