How to run the Int4 quantized model?

#10

by CharlesLincoln - opened Jun 6, 2024

Jun 6, 2024

Same as the title


GeroldMeisinger

Aug 17, 2024

In the GitHub repo, edit basic_demo/trans_cli_vision_demo.py: comment out the default model-loading block and uncomment the INT4 block, so that it looks like this:

#model = AutoModel.from_pretrained(
#    MODEL_PATH,
#    trust_remote_code=True,
#    # attn_implementation="flash_attention_2",  # Use Flash Attention
#    torch_dtype=torch.bfloat16,
#    device_map="auto",
#).eval()


## For INT4 inference
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).eval()
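For anyone who wants to try this outside the demo script, here is a minimal standalone sketch of the same INT4 load. It reuses the exact `from_pretrained` arguments from the block above. The function name `load_int4_model` and the `MODEL_PATH` environment variable are my own placeholders, not something from the repo. It assumes `torch`, `transformers`, `accelerate`, and `bitsandbytes` are installed, and 4-bit loading with bitsandbytes typically needs a CUDA GPU.

```python
import os


def load_int4_model(model_path):
    """Load a checkpoint with 4-bit bitsandbytes quantization.

    Hypothetical helper: it wraps the same call as the demo's INT4 block.
    """
    # Imported here so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel, BitsAndBytesConfig

    model = AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,  # the model ships its own modeling code
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized parts
        low_cpu_mem_usage=True,
    )
    return model.eval()


if __name__ == "__main__":
    # MODEL_PATH is an assumed environment variable holding a repo id or local path.
    path = os.environ.get("MODEL_PATH")
    if path is None:
        print("Set MODEL_PATH to a model repo id or local directory first.")
    else:
        model = load_int4_model(path)
        print(type(model).__name__)
```

The weights are quantized while loading, so no separate pre-quantized checkpoint is needed for this route.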

kylesayrs

Jan 25, 2025

Note that you can also quantize the model yourself and run it with vLLM, using this branch and examples.
