	# AutoRound

	[AutoRound](https://github.com/intel/auto-round) is an advanced quantization toolkit. It uses sign-gradient descent to achieve high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning, and it runs on a broad range of hardware. See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](https://arxiv.org/abs/2512.04746) for more details.
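
To make "low-bit weight quantization" concrete, here is a minimal pure-Python sketch of symmetric, group-wise quantization using plain round-to-nearest. This is a baseline for illustration only, not AutoRound's method: AutoRound tunes the rounding with sign-gradient descent instead of rounding each weight independently. The function names and the scale convention (`max_abs / qmax`) are assumptions made for this example.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of floats to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1   # largest code, e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # smallest code, e.g. -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Assumed convention: map the largest magnitude in the group onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [code * scale for code in codes]


def quantize_weights(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each group separately."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    group = [0.5, -1.0, 0.25, 0.0]
    codes, scale = quantize_group(group, bits=4)
    restored = dequantize_group(codes, scale)
    worst = max(abs(w - r) for w, r in zip(group, restored))
    print("codes:", codes)
    print("worst-case error:", worst)
```

Each group shares one scale, so a smaller `group_size` tracks the weight distribution more closely at the cost of storing more scales.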

	Install `auto-round` (version 0.13.0 or later):

	```bash
	pip install "auto-round>=0.13.0"
	```

	To use the Marlin kernel for faster CUDA inference, install `gptqmodel`:

	```bash
	pip install "gptqmodel>=5.8.0"
	```
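
If you want to confirm the installed versions before loading a model, a small check along these lines may help. It is a sketch that uses only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it compares only the leading numeric release and ignores pre-release suffixes such as `rc1`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Return the leading numeric release, e.g. '0.13.0rc1' -> (0, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def release_at_least(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def meets_minimum(package, minimum):
    """Return True if `package` is installed at `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return release_at_least(installed, minimum)


if __name__ == "__main__":
    for package, minimum in [("auto-round", "0.13.0"), ("gptqmodel", "5.8.0")]:
        status = "ok" if meets_minimum(package, minimum) else "missing or too old"
        print(f"{package}>={minimum}: {status}")
```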

	## Load a quantized model

	Load a pre-quantized AutoRound model by passing `AutoRoundConfig` to [from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained). The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.

	You can use [PipelineQuantizationConfig](/docs/diffusers/main/en/api/quantization#diffusers.PipelineQuantizationConfig) to quantize specific components of a pipeline:

	```python
	import torch
	from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

	pipeline_quant_config = PipelineQuantizationConfig(
	    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
	)
	pipe = DiffusionPipeline.from_pretrained(
	    "INCModel/Z-Image-W4A16-AutoRound",
	    quantization_config=pipeline_quant_config,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	image = pipe("a cat holding a sign that says hello").images[0]
	image.save("output.png")
	```

	Or load a quantized model component directly:

	```python
	import torch
	from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig

	model_id = "INCModel/Z-Image-W4A16-AutoRound"

	quantization_config = AutoRoundConfig(backend="auto")
	transformer = ZImageTransformer2DModel.from_pretrained(
	    model_id,
	    subfolder="transformer",
	    quantization_config=quantization_config,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	pipe = ZImagePipeline.from_pretrained(
	    model_id,
	    transformer=transformer,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	image = pipe("a cat holding a sign that says hello").images[0]
	image.save("output.png")
	```

	> [!NOTE]
	> AutoRound in Diffusers only supports loading pre-quantized models. To quantize a model from scratch, use the [AutoRound CLI or Python API](https://github.com/intel/auto-round) directly, then load the result with Diffusers.
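
Before loading a local checkpoint, it can be useful to check whether a component was actually saved in quantized form. The sketch below assumes that the component's `config.json` stores a `quantization_config` entry, as Diffusers does for other quantization backends; this page does not confirm that layout for AutoRound checkpoints, so treat it as an assumption. The helper name is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(component_dir):
    """Return the `quantization_config` dict from a component's config.json.

    Returns None if the file is missing or has no such entry (assumed layout).
    """
    config_path = Path(component_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the checkpoint was saved.
    quant_config = read_quantization_config("Z-Image-W4A16-AutoRound/transformer")
    if quant_config is None:
        print("No quantization_config found; this may not be a pre-quantized component.")
    else:
        print("Found quantization_config:", quant_config)
```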

	## torch.compile

	AutoRound-quantized models are compatible with [`torch.compile`](../optimization/fp16#torchcompile). Compile the quantized transformer (DiT) to speed up inference:

	```python
	import torch
	from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

	pipeline_quant_config = PipelineQuantizationConfig(
	    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
	)
	pipe = DiffusionPipeline.from_pretrained(
	    "INCModel/Z-Image-W4A16-AutoRound",
	    quantization_config=pipeline_quant_config,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	pipe.transformer = torch.compile(pipe.transformer, mode="default", fullgraph=False)
	```
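
`torch.compile` compiles lazily, so the first call after compiling includes compilation time and is not representative. To judge whether compilation pays off, time the pipeline after a warm-up call. The helper below is a generic, standard-library sketch (the `benchmark` name and its parameters are hypothetical); because CUDA kernels run asynchronously, pass a synchronization callback when timing GPU work.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=3, sync=None):
    """Return the median wall-clock time in seconds of `fn()` after warm-up calls.

    `sync` is an optional callable invoked before timing stops, so that
    asynchronous device work is included in each measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as compilation
    if sync is not None:
        sync()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    seconds = benchmark(lambda: sum(range(100_000)), warmup=1, repeats=5)
    print(f"median: {seconds:.6f} s")
```

With the pipeline above, a call such as `benchmark(lambda: pipe("a cat holding a sign that says hello"), sync=torch.cuda.synchronize)` measures steady-state latency.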

	## Backends

	AutoRound supports multiple inference backends for weight-only quantized models. The backend controls which kernel handles dequantization during the forward pass. Set the `backend` parameter in `AutoRoundConfig` to choose one:

	| Backend | Value | Device | Requirements | Notes |
	|---------|-------|--------|--------------|-------|
	| Auto | `"auto"` | Any | None | Default. Automatically selects the best available backend. |
	| PyTorch | `"torch"` | CPU / CUDA | None | Pure PyTorch implementation. Broadest compatibility. |
	| Triton | `"tritonv2"` | CUDA | `triton` | Triton-based kernel for GPU inference. |
	| ExllamaV2 | `"exllamav2"` | CUDA | `gptqmodel>=5.8.0` | Good CUDA performance via the ExllamaV2 kernel. |
	| Marlin | `"marlin"` | CUDA | `gptqmodel>=5.8.0` | Best CUDA performance via the Marlin kernel. |

	```python
	from diffusers import AutoRoundConfig

	# Auto-select (default)
	config = AutoRoundConfig()

	# Explicit Triton backend for CUDA
	config = AutoRoundConfig(backend="tritonv2")

	# Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0)
	config = AutoRoundConfig(backend="marlin")

	# ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0)
	config = AutoRoundConfig(backend="exllamav2")

	# PyTorch backend for CPU/CUDA inference
	config = AutoRoundConfig(backend="torch")
	```
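
The requirements in the table can also be expressed as a small lookup that reports which `backend` values are usable on a given machine. This is only an illustrative sketch: `BACKEND_TABLE` and `usable_backends` are hypothetical names, the check only tests whether a package is importable (it does not verify `gptqmodel>=5.8.0`), and `backend="auto"` already performs the real selection.

```python
from importlib.util import find_spec

# backend value -> (supported devices or None for any, required package or None)
BACKEND_TABLE = {
    "auto": (None, None),
    "torch": ({"cpu", "cuda"}, None),
    "tritonv2": ({"cuda"}, "triton"),
    "exllamav2": ({"cuda"}, "gptqmodel"),
    "marlin": ({"cuda"}, "gptqmodel"),
}


def _is_importable(name):
    """Return True if a top-level package can be found without importing it."""
    return find_spec(name) is not None


def usable_backends(device, is_installed=_is_importable):
    """List backend values whose device and package requirements are met."""
    usable = []
    for value, (devices, requirement) in BACKEND_TABLE.items():
        if devices is not None and device not in devices:
            continue
        if requirement is not None and not is_installed(requirement):
            continue
        usable.append(value)
    return usable


if __name__ == "__main__":
    print("cpu:", usable_backends("cpu"))
    print("cuda:", usable_backends("cuda"))
```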

	## Save and load

	AutoRound needs calibration data to quantize a model. Quantization happens outside of Diffusers, using the [AutoRound library](https://github.com/intel/auto-round) directly:

	```python
	from auto_round import AutoRound

	autoround = AutoRound(
	    "Tongyi-MAI/Z-Image",
	    scheme="W4A16",  # W4G128 symmetric
	    enable_torch_compile=True,
	    num_inference_steps=3,
	    guidance_scale=7.5,
	    dataset="coco2014",
	)
	autoround.quantize_and_save("Z-Image-W4A16-AutoRound")
	```

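	For more details on calibration options, see the [AutoRound documentation](https://github.com/intel/auto-round).

	Then load the quantized checkpoint with Diffusers. No `AutoRoundConfig` is passed here, so the inference backend is selected automatically: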

	```python
	import torch
	from diffusers import ZImagePipeline

	model_id = "INCModel/Z-Image-W4A16-AutoRound"

	# The inference backend will be automatically selected.
	pipe = ZImagePipeline.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	)

	image = pipe("a cat holding a sign that says hello").images[0]
	image.save("output.png")
	```

	### Supported quantization schemes

	AutoRound supports several quantization schemes:

	- `W4A16` (bits: 4, group_size: 128, sym: True, act_bits: 16)
	- `W8A16` (bits: 8, group_size: 128, sym: True, act_bits: 16)
	- `W3A16` (bits: 3, group_size: 128, sym: True, act_bits: 16)
	- `W2A16` (bits: 2, group_size: 128, sym: True, act_bits: 16)
	- `GGUF:Q4_K_M` (all `Q*_K`, `Q*_0`, and `Q*_1` types provided by llama.cpp are supported)
	- `NVFP4` (experimental feature; exporting to the `llm_compressor` format is recommended; data_type nvfp4, act_data_type nvfp4, static_global_scale, group_size 16)
	- `MXFP4` (research feature, no real kernel; standard MXFP4; data_type mxfp, act_data_type mxfp, bits 4, act_bits 4, group_size 32)
	- `MXINT4` (research feature, no real kernel; standard MXINT4; data_type mxint, act_data_type mxint, bits 4, act_bits 4, group_size 32)
	- `MXFP4_RCEIL` (research feature, no real kernel; NVIDIA's variant; data_type mxfp, act_data_type mxfp_rceil, bits 4, act_bits 4, group_size 32)
	- `MXFP8` (research feature, no real kernel; data_type mxfp, act_data_type mxfp_rceil, group_size 32)
	- `FPW8A16` (research feature, no real kernel; data_type fp8, group_size 0, i.e. per tensor)
	- `FP8_STATIC` (research feature, no real kernel; data_type fp8, act_data_type fp8, group_size -1, i.e. per channel, act_group_size 0, i.e. per tensor)

	You can also customize `group_size`, `bits`, `sym`, and many other settings, though a real inference kernel may not exist for every combination.
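
To get a feel for how `bits` and `group_size` trade off against storage, the rough arithmetic below estimates the effective bits per weight for the integer weight-only schemes. It is a sketch under stated assumptions: one 16-bit scale per group, no zero-point (symmetric quantization), and no other metadata. Real packed formats may store more, so treat the numbers as a lower bound. The function names are hypothetical.

```python
# Integer weight-only presets from the list above: name -> (bits, group_size)
PRESETS = {
    "W8A16": (8, 128),
    "W4A16": (4, 128),
    "W3A16": (3, 128),
    "W2A16": (2, 128),
}


def effective_bits_per_weight(bits, group_size, scale_bits=16):
    """Weight bits plus the per-group scale amortized over the group (assumed layout)."""
    return bits + scale_bits / group_size


def estimated_megabytes(num_weights, bits, group_size, scale_bits=16):
    """Approximate storage in MB (10**6 bytes) for `num_weights` quantized weights."""
    total_bits = num_weights * effective_bits_per_weight(bits, group_size, scale_bits)
    return total_bits / 8 / 1e6


if __name__ == "__main__":
    for name, (bits, group_size) in PRESETS.items():
        per_weight = effective_bits_per_weight(bits, group_size)
        ratio = 16 / per_weight  # relative to 16-bit weights
        print(f"{name}: {per_weight:.3f} bits/weight, about {ratio:.2f}x smaller than 16-bit")
```

Under these assumptions, halving `group_size` from 128 to 64 adds 0.125 bits per weight, which is why smaller groups improve accuracy only at some cost in model size.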

	## Resources

	- [Pre-quantized AutoRound models on the Hub](https://huggingface.co/models?search=autoround)
