
	# Fine-grained FP8

	Fine-grained FP8 quantization quantizes the weights and activations to fp8.

	- The weights are quantized to 8 bits per 2D block (`weight_block_size=(128, 128)`), so each block has its own scale.
	- The activations are quantized to 8 bits per token, in groups along the input channel. The group size matches the weight block size along the input channel (128 by default).
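
To make the granularity concrete, here is a minimal pure-Python sketch of how many scales each scheme needs and how a per-block scale could be derived. It is an illustration under stated assumptions, not the library's implementation: it assumes absmax scaling into the FP8 E4M3 range (largest finite value 448), and the helper names `weight_scale_shape`, `activation_scale_shape`, and `block_scales` are hypothetical.

```python
import math

# Assumption: the target format is FP8 E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def weight_scale_shape(out_features, in_features, block=(128, 128)):
    """Shape of the scale grid when every 2D weight block gets one scale."""
    return (math.ceil(out_features / block[0]), math.ceil(in_features / block[1]))


def activation_scale_shape(num_tokens, in_features, group=128):
    """Shape of the scale grid when every token gets one scale per channel group."""
    return (num_tokens, math.ceil(in_features / group))


def block_scales(matrix, block=(2, 2)):
    """Absmax scale for each 2D block: max(|x|) in the block divided by the FP8 max."""
    rows, cols = len(matrix), len(matrix[0])
    block_rows, block_cols = block
    scales = []
    for r0 in range(0, rows, block_rows):
        row_of_scales = []
        for c0 in range(0, cols, block_cols):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block_cols, cols))
            )
            row_of_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_of_scales)
    return scales


# A 4096 x 4096 weight with 128 x 128 blocks needs a 32 x 32 grid of scales.
print(weight_scale_shape(4096, 4096))
# 10 tokens with 4096 input channels need 32 scales per token.
print(activation_scale_shape(10, 4096))
```

Because each block or group carries its own scale, an outlier only affects the precision of the values that share its block, rather than the whole tensor.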

	FP8 quantization enables support for [DeepSeek-V3](https://hf.co/papers/2412.19437) and DeepSeek-R1.



	> [!TIP]
	> You need a GPU with compute capability >= 9.0 (for example, an H100) and a PyTorch version compatible with your GPU's CUDA version.
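
A quick way to check the requirement before loading a model is to compare the device's compute capability against the minimum stated above. The sketch below is a hedged illustration: `meets_requirement` and `current_gpu_capability` are hypothetical helper names, and the `(9, 0)` threshold is taken from the tip rather than from the library's own checks.

```python
def meets_requirement(capability, minimum=(9, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def current_gpu_capability():
    """Return the (major, minor) capability of the current CUDA device, or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no CUDA device is visible
    return torch.cuda.get_device_capability()


capability = current_gpu_capability()
if capability is None:
    print("No CUDA device detected.")
elif meets_requirement(capability):
    print(f"Compute capability {capability} meets the requirement.")
else:
    print(f"Compute capability {capability} is below 9.0.")
```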

	Install Accelerate and upgrade to the latest version of PyTorch.

	```bash
	pip install --upgrade accelerate torch
	```

	Create a [FineGrainedFP8Config](/docs/transformers/pr_26617/en/main_classes/quantization#transformers.FineGrainedFP8Config) instance and pass it to [from_pretrained()](/docs/transformers/pr_26617/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) to quantize the model as it is loaded. By default, the weights are loaded in full precision (`torch.float32`) regardless of the data type they are stored in. Set `dtype="auto"` to load the weights in the data type defined in the model's `config.json` file, so the most memory-optimal data type is used automatically.

	```py
	from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

	model_name = "meta-llama/Meta-Llama-3-8B"
	# Default config: 128 x 128 weight blocks
	quantization_config = FineGrainedFP8Config()
	# Quantize the weights while loading the model
	quantized_model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto", quantization_config=quantization_config)

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	input_text = "What are we having for dinner?"
	# Tokenize the prompt and move the tensors to the model's device
	inputs = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)

	output = quantized_model.generate(**inputs, max_new_tokens=10)
	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	Use [save_pretrained()](/docs/transformers/pr_26617/en/main_classes/model#transformers.PreTrainedModel.save_pretrained) to save the quantized model and reload it with [from_pretrained()](/docs/transformers/pr_26617/en/main_classes/model#transformers.PreTrainedModel.from_pretrained).

	```py
	quant_path = "/path/to/save/quantized/model"
	# Save the quantized weights and config
	quantized_model.save_pretrained(quant_path)
	# Reload the already-quantized model from disk
	model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
	```
