AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Paper: arXiv:2306.00978
How to use radi-cho/gemma-2-2b-AWQ with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="radi-cho/gemma-2-2b-AWQ")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("radi-cho/gemma-2-2b-AWQ")
model = AutoModelForCausalLM.from_pretrained("radi-cho/gemma-2-2b-AWQ")
How to use radi-cho/gemma-2-2b-AWQ with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "radi-cho/gemma-2-2b-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "radi-cho/gemma-2-2b-AWQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Alternatively, run the model with Docker Model Runner:
docker model run hf.co/radi-cho/gemma-2-2b-AWQ
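The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the URL assumes the vLLM server is listening on its default `localhost:8000` as in the curl example. The response is read as `choices[0].text`, which is the shape of an OpenAI-style completions reply.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint.

    Hypothetical helper: it mirrors the JSON body of the curl example.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000", "radi-cho/gemma-2-2b-AWQ", "Once upon a time,"
)
# Uncomment once the server from the previous step is running:
# print(complete(request))
```

Building the request separately from sending it keeps the payload easy to inspect before a server is available.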
How to use radi-cho/gemma-2-2b-AWQ with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "radi-cho/gemma-2-2b-AWQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "radi-cho/gemma-2-2b-AWQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Alternatively, start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "radi-cho/gemma-2-2b-AWQ" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "radi-cho/gemma-2-2b-AWQ",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use radi-cho/gemma-2-2b-AWQ with Docker Model Runner:
docker model run hf.co/radi-cho/gemma-2-2b-AWQ
AWQ-quantized package (W4G128: 4-bit weights, group size 128) of google/gemma-2-2b.
Support for Gemma2 in the AutoAWQ codebase is proposed in pull request #562.
To use the model, follow the AutoAWQ examples with AutoAWQ installed from the source of #562.
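To make the W4G128 label concrete, here is a minimal NumPy sketch of 4-bit quantization with one scale and zero point per group of 128 weights. It uses plain round-to-nearest and is only an illustration of the storage scheme: it does not implement AWQ's activation-aware scaling, and the function names `quantize_groupwise` and `dequantize_groupwise` are hypothetical, not part of AutoAWQ.

```python
import numpy as np


def quantize_groupwise(w, bits=4, group_size=128):
    """Asymmetric round-to-nearest quantization of a 1-D weight vector.

    Each group of `group_size` weights gets its own scale and zero point.
    Illustrative only: AWQ additionally rescales channels using activations.
    """
    assert w.size % group_size == 0, "length must be a multiple of group_size"
    groups = w.reshape(-1, group_size)
    qmax = 2**bits - 1  # 15 for 4-bit codes
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax  # step size per group
    zero = np.round(-w_min / scale)                 # integer code that maps to 0.0
    q = np.clip(np.round(groups / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero


def dequantize_groupwise(q, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return ((q.astype(np.float32) - zero) * scale).reshape(-1)


rng = np.random.default_rng(0)
weights = rng.standard_normal(256)           # two groups of 128
codes, scale, zero = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scale, zero)
print(codes.shape, float(np.max(np.abs(weights - restored))))
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales and zero points.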
Evaluation
WikiText-2 PPL: 11.05
C4 PPL: 12.99
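For readers unfamiliar with the metric, perplexity (PPL) is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows only that definition; the card does not state the evaluation protocol (context length, stride, tokenization), and the `perplexity` helper is a hypothetical name.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p(token_i | previous tokens))).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))
```

Read this way, the WikiText-2 figure of 11.05 corresponds to a mean negative log-likelihood of about 2.40 nats per token, since `math.log(11.05)` is roughly 2.40.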
Loading
model_path = "radi-cho/gemma-2-2b-AWQ"
# With transformers
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda:0")
# With transformers (fused)
from transformers import AutoModelForCausalLM, AwqConfig
quantization_config = AwqConfig(bits=4, fuse_max_seq_len=512, do_fuse=True)
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=quantization_config).to(0)
# With AutoAWQ
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(model_path)