	# MXFP4

	Note: MXFP4 quantization currently only works for OpenAI GPT-OSS 120b and 20b.

	MXFP4 is a 4-bit floating point format that dramatically reduces the memory requirements of large models. A large model such as GPT-OSS-120B can fit on a single 80GB GPU, and a smaller model such as GPT-OSS-20B only requires 16GB of memory. It uses blockwise scaling to preserve its range and accuracy, which typically degrade at lower precisions.
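
To make blockwise scaling concrete, here is a minimal pure-Python sketch. It is not the Triton kernel Transformers uses. It assumes the layout described in the OCP Microscaling (MX) format: each block of 32 values shares one power-of-two scale, and each value is stored as a 4-bit E2M1 number whose magnitude is one of 0, 0.5, 1, 1.5, 2, 3, 4, or 6. The function names are hypothetical.

```python
import math

# Assumption: E2M1 magnitudes and a 32-element block, as in the OCP MX format.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32
FP4_MAX_EXPONENT = 2  # the largest FP4 magnitude is 6 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_scale, fp4_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top
    # of the FP4 range.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - FP4_MAX_EXPONENT)
    quantized = []
    for x in block:
        magnitude = min(abs(x) / scale, FP4_VALUES[-1])  # clamp to the FP4 range
        # Nearest representable magnitude (ties go to the smaller value here,
        # which is a simplification of real rounding rules).
        nearest = min(FP4_VALUES, key=lambda v: abs(v - magnitude))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from a shared scale and FP4 values."""
    return [scale * q for q in quantized]
```

Calling `quantize_block` on successive slices of `BLOCK_SIZE` weights gives one scale per slice. Because the scale is chosen per block rather than per tensor, a block of small weights keeps its resolution even when another block contains large outliers.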

	To use MXFP4, make sure your environment meets the following requirements.

	- Install Accelerate, kernels, and Triton ≥ 3.4. You only need to install Triton ≥ 3.4 manually if you're using PyTorch 2.7; PyTorch 2.8 already supports it.
	- An NVIDIA GPU with Compute Capability ≥ 7.5, which covers Turing GPUs (such as the T4) and newer. Use [get_device_capability](https://docs.pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html) to check the Compute Capability.

	```python
	import torch

	# Returns a (major, minor) tuple for the current CUDA device.
	print(torch.cuda.get_device_capability())
	# Example output: (7, 5)
	```
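
The requirements above can also be checked in one place. The sketch below is one possible way to do that. The helper names are hypothetical, and it assumes the packages are published under the distribution names `accelerate`, `kernels`, and `triton`, which may differ for nightly or vendor builds.

```python
import re
from importlib import metadata

MIN_CAPABILITY = (7, 5)  # minimum Compute Capability listed above
MIN_TRITON = (3, 4)      # minimum Triton version listed above


def parse_major_minor(version_text):
    """Turn a version string such as '3.4.0' into the tuple (3, 4)."""
    match = re.match(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def capability_ok(capability, minimum=MIN_CAPABILITY):
    """Compare a (major, minor) Compute Capability against the minimum."""
    return tuple(capability) >= minimum


def triton_ok(version_text, minimum=MIN_TRITON):
    """Compare a Triton version string against the minimum."""
    return parse_major_minor(version_text) >= minimum


def report():
    """Print which requirements appear to be satisfied on this machine."""
    for package in ("accelerate", "kernels", "triton"):
        print(package, installed_version(package) or "not installed")
    triton_version = installed_version("triton")
    if triton_version is not None:
        print("Triton new enough:", triton_ok(triton_version))
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
        print("Compute Capability", capability, "ok:", capability_ok(capability))
    else:
        print("No CUDA device is visible")
```

Call `report()` from a Python session on the target machine to print a short summary.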

	Check a model's quantization config as shown below to see if it supports MXFP4. If `'quant_method': 'mxfp4'`, then the model automatically uses MXFP4.

	```py
	from transformers import GptOssConfig

	model_id = "openai/gpt-oss-120b"
	cfg = GptOssConfig.from_pretrained(model_id)
	print(cfg.quantization_config)

	# Example output:
	# {
	#     'modules_to_not_convert': [
	#         'model.layers.*.self_attn',
	#         'model.layers.*.mlp.router',
	#         'model.embed_tokens',
	#         'lm_head'
	#     ],
	#     'quant_method': 'mxfp4'
	# }
	```
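
If you want to make this check in code rather than by reading the printed output, a small helper along these lines may be enough. `uses_mxfp4` is a hypothetical name, and the sketch assumes the quantization config is either a dict like the one printed above or an object with a `quant_method` attribute.

```python
def uses_mxfp4(quantization_config):
    """Return True if a quantization config declares the 'mxfp4' method."""
    if quantization_config is None:
        return False
    if isinstance(quantization_config, dict):
        method = quantization_config.get("quant_method")
    else:
        # Assumption: non-dict configs expose the method as an attribute.
        method = getattr(quantization_config, "quant_method", None)
    # An enum member carries the plain string in `.value`.
    method = getattr(method, "value", method)
    return method == "mxfp4"
```

With the config loaded in the previous block, `uses_mxfp4(cfg.quantization_config)` would return `True` for the example output shown there.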

	## MXFP4 kernels

	Transformers automatically pulls the MXFP4-aware Triton kernels from the community repository when you load a model that needs them. The kernels are stored in your local cache and used during the forward pass.

	MXFP4 kernels are used by default when they are available and supported, and they do not require any code changes.
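
In practice this means the loading code looks the same as for any other checkpoint. The sketch below shows an ordinary load with no MXFP4-specific arguments. The wrapper function `load_gpt_oss` is hypothetical, and the choice of `torch_dtype="auto"` and `device_map="auto"` is an assumption about a typical setup, not a requirement stated here.

```python
def load_gpt_oss(model_id="openai/gpt-oss-20b"):
    """Load a GPT-OSS checkpoint with a standard from_pretrained call."""
    # Imported inside the function so that defining it does not require
    # Transformers to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # assumption: newer releases also accept `dtype`
        device_map="auto",   # relies on Accelerate, listed in the requirements
    )
    return tokenizer, model
```

Calling `load_gpt_oss()` downloads the full checkpoint on first use, so run it only on a machine that meets the requirements above.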

	You can use [hf cache scan](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache#scan-your-cache) to verify the kernels are downloaded.

	```shell
	hf cache scan
	```

	```shell
	REPO ID                          REPO TYPE SIZE ON DISK
	-------------------------------- --------- ------------
	kernels-community/triton_kernels model           536.2K
	openai/gpt-oss-20b               model            13.8G
	```
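
The same check can be scripted with `scan_cache_dir` from `huggingface_hub`. This is a sketch under the assumption that the kernels are cached under the repository id shown in the listing above. The helper names are hypothetical.

```python
def find_cached_repo(repos, repo_id):
    """Return the first cached repo whose `repo_id` matches, or None."""
    for repo in repos:
        if repo.repo_id == repo_id:
            return repo
    return None


def kernels_are_cached(repo_id="kernels-community/triton_kernels"):
    """Scan the local Hugging Face cache for the kernels repository."""
    # Imported here so the helper above works without huggingface_hub.
    from huggingface_hub import scan_cache_dir

    # Note: scan_cache_dir raises an error if the cache directory is missing.
    cache_info = scan_cache_dir()
    return find_cached_repo(cache_info.repos, repo_id) is not None
```

`kernels_are_cached()` is expected to return `True` once a GPT-OSS model has been loaded at least once on the machine.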

	## Resources

	Learn more about MXFP4 quantization and how blockwise scaling works in this [blog post](https://huggingface.co/blog/faster-transformers#mxfp4-quantization).

