# Fine-grained FP8

Fine-grained FP8 quantization quantizes the weights and activations to FP8.

- The weights are quantized to 8 bits for each 2D block (`weight_block_size=(128, 128)`).
- The activations are quantized to 8 bits for each group per token. The group size matches the weight block size along the input channel dimension (128 by default).
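
The two granularities above can be illustrated with a minimal pure-Python sketch that computes only the scaling factors: one per 2D weight block and one per activation group for a single token. It assumes the E4M3 flavor of FP8 (largest finite value 448) and absolute-max scaling. The helper names `block_scales` and `group_scales` are hypothetical and are not part of Transformers, whose actual kernels may differ.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in float8 E4M3 (assumed format)


def block_scales(weight, block_size=(128, 128)):
    """Return one scale per 2D block of a weight matrix given as a list of rows."""
    block_rows, block_cols = block_size
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_of_scales = []
        for c0 in range(0, n_cols, block_cols):
            # Largest magnitude inside this block (edge blocks may be smaller).
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            row_of_scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_of_scales)
    return scales


def group_scales(token_activations, group_size=128):
    """Return one scale per contiguous group of channels for a single token."""
    scales = []
    for start in range(0, len(token_activations), group_size):
        amax = max(abs(x) for x in token_activations[start:start + group_size])
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


# Tiny example with 2x2 blocks and groups of 2 so the numbers are easy to follow.
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
w_scales = block_scales(weight, block_size=(2, 2))
a_scales = group_scales([448.0, -896.0, 1.0, 2.0], group_size=2)
print(w_scales, a_scales)
```

Dividing each block or group by its scale maps its largest magnitude to the edge of the FP8 range, so an outlier in one block does not reduce the precision of the others.
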

FP8 quantization enables support for [DeepSeek-V3](https://hf.co/papers/2412.19437) and DeepSeek-R1.

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/b7b3b34bf826a6423ea82ffc57ecac80c46c3c76/transformers/quantization/quantization_deepseek.png">
</div>

> [!TIP]
> You need a GPU with compute capability >= 9 (for example, an H100) and a PyTorch version compatible with your GPU's CUDA version.
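
One way to check the hardware requirement before loading a model is sketched below. The helper `meets_fp8_requirement` is a hypothetical name, and its default threshold of `(9, 0)` is taken from the tip above. The PyTorch query runs only if PyTorch is installed and a CUDA device is visible.

```python
def meets_fp8_requirement(capability, minimum=(9, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


try:
    import torch
except ImportError:  # PyTorch is not installed, so the device query is skipped
    torch = None

if torch is not None and torch.cuda.is_available():
    # (major, minor) compute capability of the current CUDA device
    capability = torch.cuda.get_device_capability()
    print(capability, meets_fp8_requirement(capability))
```
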

Install Accelerate and upgrade to the latest version of PyTorch.

```bash
pip install --upgrade accelerate torch
```

Create a [FineGrainedFP8Config](/docs/transformers/pr_33892/en/main_classes/quantization#transformers.FineGrainedFP8Config) instance and pass it to [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) to quantize a model. By default, the weights are loaded in full precision (`torch.float32`) regardless of the data type they are stored in. Set `dtype="auto"` to load the weights in the data type defined in the model's `config.json` file, which automatically selects the most memory-optimal data type.

```py
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FineGrainedFP8Config()
# Quantize the weights to FP8 while loading the checkpoint
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
# Move the tokenized inputs to the same device type as the model
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device.type)

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
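
For a sense of the memory effect, the following back-of-the-envelope sketch compares weight storage at 2 bytes per parameter (for example, bfloat16) with 1 byte per parameter plus one scale per 128x128 block. The assumptions are loud ones: a round 8 billion parameters, every parameter quantized, and a 4-byte scale per block. In practice some layers may stay unquantized and the scale data type may differ, so treat the result as an order-of-magnitude estimate. `estimate_weight_bytes` is a hypothetical helper.

```python
def estimate_weight_bytes(num_params, bytes_per_param, block_elems=None, scale_bytes=4):
    """Estimate weight storage, optionally adding one scale per block of `block_elems` values."""
    total = num_params * bytes_per_param
    if block_elems is not None:
        num_blocks = -(-num_params // block_elems)  # ceiling division
        total += num_blocks * scale_bytes
    return total


num_params = 8_000_000_000  # assumed round figure, not an exact parameter count
bf16_bytes = estimate_weight_bytes(num_params, bytes_per_param=2)
fp8_bytes = estimate_weight_bytes(num_params, bytes_per_param=1, block_elems=128 * 128)

print(f"16-bit weights: {bf16_bytes / 1e9:.2f} GB")
print(f"FP8 weights:    {fp8_bytes / 1e9:.2f} GB")
```

Under these assumptions the per-block scales add only about 2 MB, so the quantized weights take roughly half the space of the 16-bit weights.
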

Use [save_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.save_pretrained) to save the quantized model and reload it with [from_pretrained()](/docs/transformers/pr_33892/en/main_classes/model#transformers.PreTrainedModel.from_pretrained).

```py
quant_path = "/path/to/save/quantized/model"
# Save the model quantized in the previous example
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
```