Do you have a quantized version of the model that works with sentence_transformers?

by sungkim - opened Feb 20, 2024

Discussion

sungkim

Feb 20, 2024

Do you plan to add a quantized version of the model that works with sentence_transformers?

yliu279

Salesforce AI Research org Feb 21, 2024

Hi @sungkim,

We didn't plan to release a quantized version of the model, because Hugging Face's model loading already supports quantization. To enable 4-bit quantization, you can use the following code snippet:

import torch
from transformers import BitsAndBytesConfig, AutoModel

# 4-bit NF4 quantization with double quantization; matmuls run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Weights are quantized while the checkpoint is loaded
encoder = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
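
For a rough sense of what the 4-bit setting saves, here is a back-of-the-envelope sketch of weight memory only. The 7-billion parameter count is an assumption based on the model being Mistral-7B-sized, and `weight_memory_gb` is a hypothetical helper, not part of any library. The estimate ignores quantization constants, any layers left unquantized, activations, and framework overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7 billion parameters (a Mistral-7B-sized encoder).
NUM_PARAMS = 7e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit weights, constants ignored

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by about a factor of four, which is usually what decides whether the model fits on a single consumer GPU.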

prudant

Mar 10, 2024

edited Mar 10, 2024

Hi! HF quantization is really slow for production environments. I have a question: would it be possible to quantize the model to AWQ or GPTQ in order to serve it with TGI or vLLM? I can quantize it to AWQ or GPTQ and push it to the Hub, but I need to know whether the model is compatible with those quantization formats. Regards!
