Quantization
GPTQ-Model Integration
🤗 Optimum integrates with GPTQ-Model to provide a simple API for GPTQ quantization of language models. With GPTQ quantization, you can quantize a language model to 8, 4, 3, or even 2 bits, typically without a large drop in model quality and with faster inference. Most GPU hardware is supported.
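To give a rough sense of what these bit widths mean for memory, the sketch below estimates weight storage only. It assumes every parameter is stored at the given bit width and ignores per-group scales, zero points, and any layers left unquantized, so real checkpoints will be somewhat larger. The function name `estimate_weight_gib` and the round parameter count are illustrative and not part of Optimum.

```python
def estimate_weight_gib(num_params: int, bits: int) -> float:
    """Approximate weight storage in GiB for num_params parameters at `bits` bits each."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    total_bytes = num_params * bits / 8
    return total_bytes / 1024**3


if __name__ == "__main__":
    num_params = 125_000_000  # assumed round figure for a ~125M-parameter model
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: {estimate_weight_gib(num_params, bits):.3f} GiB")
```

The estimate scales linearly with the bit width, so going from 16-bit to 4-bit weights cuts weight storage to roughly a quarter under these assumptions.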
If you want to quantize 🤗 Transformers models with GPTQ, follow this documentation.
To learn more about the quantization technique used in GPTQ, please refer to:
- the GPTQ paper
- GPTQ-Model
Optimum requires GPTQ-Model for GPTQ quantization and quantized loading.
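As background for the references above, the following toy example shows what storing weights in a small number of bits looks like. It is plain min-max round-to-nearest quantization of a single group of weights, not the GPTQ algorithm: GPTQ additionally uses calibration data to adjust the remaining weights and compensate for rounding error. The helper names `quantize_group` and `dequantize_group` are hypothetical and exist only for this illustration.

```python
def quantize_group(weights, bits):
    """Round-to-nearest min-max quantization of one group of floats.

    Returns (integer codes in [0, 2**bits - 1], scale, offset).
    """
    qmax = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are equal (range is zero).
    scale = (w_max - w_min) / qmax or 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [c * scale + offset for c in codes]


if __name__ == "__main__":
    group = [-1.0, -0.2, 0.3, 1.0]
    for bits in (8, 4, 3, 2):
        codes, scale, offset = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, offset)
        worst = max(abs(a - b) for a, b in zip(group, restored))
        print(f"{bits}-bit codes={codes} worst error={worst:.4f}")
```

Fewer bits mean fewer representable levels per group and therefore a larger worst-case rounding error, which is the error GPTQ tries to reduce.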
Requirements
Install the following packages to run the code below:
- Optimum library: pip install --upgrade optimum
- GPTQ-Model: pip install "gptqmodel>=6.0.3"
- Latest transformers library, installed from source: pip install --upgrade git+https://github.com/huggingface/transformers.git
- Latest accelerate library: pip install --upgrade accelerate
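Before running the examples, it can help to confirm which of these packages are actually installed. The following is a minimal sketch using only the standard library; the distribution names are taken from the pip commands above, and `installed_versions` is a hypothetical helper, not part of Optimum. It only reports versions and does not check the `>=6.0.3` constraint.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this section.
REQUIRED = ("optimum", "gptqmodel", "transformers", "accelerate")


def installed_versions(names=REQUIRED):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```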
Load and quantize a model
The GPTQQuantizer class is used to quantize your model. In order to quantize your model, you need to provide a few arguments:
- the number of bits: bits
- the dataset used to calibrate the quantization: dataset
- the model sequence length used to process the dataset: model_seqlen
- the block name to quantize: block_name_to_quantize
With the 🤗 Transformers integration, you don't need to pass block_name_to_quantize and model_seqlen, because they can be retrieved from the model. For a custom model, however, you need to specify them. Also, make sure that your model is converted to torch.float16 before quantization.
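To make the role of model_seqlen more concrete, here is a minimal sketch of how calibration samples are commonly built: the calibration text is tokenized, and fixed-length windows of model_seqlen tokens are drawn from it. This illustrates the general idea only and is not Optimum's actual implementation; `sample_calibration_windows` and the stand-in token list are hypothetical.

```python
import random


def sample_calibration_windows(token_ids, seqlen, n_samples, seed=0):
    """Draw n_samples random contiguous windows of `seqlen` tokens from token_ids."""
    if len(token_ids) < seqlen:
        raise ValueError("need at least seqlen tokens to form one window")
    rng = random.Random(seed)  # fixed seed so the calibration set is reproducible
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, len(token_ids) - seqlen)  # inclusive on both ends
        windows.append(token_ids[start:start + seqlen])
    return windows


if __name__ == "__main__":
    fake_tokens = list(range(10_000))  # stand-in for real tokenizer output
    batch = sample_calibration_windows(fake_tokens, seqlen=2048, n_samples=4)
    print([len(window) for window in batch])
```

A larger sequence length means each calibration sample covers more context, at the cost of more memory and time per sample.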
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer, load_quantized_model
import torch

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model in float16, as required before quantization.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# block_name_to_quantize and model_seqlen are optional for Transformers models; they are shown here for completeness.
quantizer = GPTQQuantizer(bits=4, dataset="c4", block_name_to_quantize="model.decoder.layers", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)
GPTQ quantization currently works only for text models. The quantization process can also take a long time depending on your hardware (for example, a 175B model takes about 4 GPU hours on an NVIDIA A100). Before quantizing, check the Hugging Face Hub to see whether a GPTQ-quantized version of your model already exists.
Save the model
To save your model, use the save method of the GPTQQuantizer class. It creates a folder containing the model state dict and the quantization config.
save_folder = "/path/to/save_folder/"
quantizer.save(model, save_folder)
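After saving, you may want to check what was written to the folder. The sketch below lists every file with its size and parses any top-level JSON files, which is where a quantization config would typically be found. The exact file names produced by the save method are not assumed here, and `summarize_folder` is a hypothetical helper rather than an Optimum function.

```python
import json
from pathlib import Path


def summarize_folder(folder):
    """Return (file sizes in bytes keyed by relative path, parsed top-level JSON files)."""
    root = Path(folder)
    sizes = {
        str(path.relative_to(root)): path.stat().st_size
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
    configs = {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(root.glob("*.json"))
    }
    return sizes, configs


if __name__ == "__main__":
    save_folder = "/path/to/save_folder/"  # same placeholder path as above
    if Path(save_folder).is_dir():
        sizes, configs = summarize_folder(save_folder)
        for name, size in sizes.items():
            print(f"{name}: {size} bytes")
        print("JSON files found:", list(configs))
    else:
        print(f"{save_folder} does not exist yet")
```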
Load quantized weights
You can load your quantized weights by using the load_quantized_model() function.
With the Accelerate library, you can load a model faster and with lower memory usage. The model is first initialized with empty weights, and the quantized weights are loaded in a second step.
from accelerate import init_empty_weights

# Build the model skeleton without allocating memory for the weights.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
empty_model.tie_weights()
# Load the quantized weights from save_folder into the empty model.
quantized_model = load_quantized_model(empty_model, save_folder=save_folder, device_map="auto")
Kernel selection for faster inference
When GPTQ models are loaded through GPTQ-Model, the runtime automatically selects an appropriate inference kernel for the current hardware and quantized model. In the common case, you should not pass backend explicitly.
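from transformers import AutoModelForCausalLM
from optimum.gptq import GPTQQuantizer, load_quantized_model
import torch
from accelerate import init_empty_weights

# model_name and save_folder are the values defined in the previous sections.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
empty_model.tie_weights()
# No backend argument is passed, so the kernel is selected automatically.
quantized_model = load_quantized_model(
    empty_model,
    save_folder=save_folder,
    device_map="auto",
)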
If you are fine-tuning with PEFT, prefer the default automatic backend selection unless you have a specific reason to override it.
You can find the benchmark of these kernels here
Fine-tune a quantized model
With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
See the PEFT library for more details.