Instructions to use google/gemma-2-27b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use google/gemma-2-27b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-2-27b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Local Apps Settings

vLLM

How to use google/gemma-2-27b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-2-27b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-2-27b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/google/gemma-2-27b-it

SGLang

How to use google/gemma-2-27b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-2-27b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-2-27b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-2-27b-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-2-27b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use google/gemma-2-27b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-2-27b-it
```

4bit-quantized gemma-2-27b-it generates only pad tokens, like '<pad><pad><pad><pad><pad><pad><pad><pad><pad>'.

#29

by kshinoda - opened Jul 17, 2024

Discussion

kshinoda

Jul 17, 2024

•

edited Jul 17, 2024

Thank you for releasing the great models!

I found that this model (gemma-2-27b-it) seems to generate only PAD tokes in my environment when using 4-bit quantization.
My environment and codes are as follows.

How should this issue be fixed?
Thanks for your support in advance.

torch==2.3.0+cu118
transformers==4.42.4
bitsandbytes==0.43.1
CUDA==11.6

from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
kwargs = {'device_map': 'auto'}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True
)
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-27b-it', low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-27b-it', use_fast=False, padding_side='right')

chat = [
    {'role': 'user', 'content': 'Hello!'},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], add_special_tokens=False, padding=True, truncation=True, return_tensors="pt")
inputs = {k: inputs[k].to('cuda') for k in inputs}

outputs = model.generate(**inputs)

tokenizer.decode(outputs[0].cpu().numpy().tolist())

and this is the output

'<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n<pad><pad><pad><pad><pad><pad><pad><pad><pad>'

kshinoda changed discussion title from It generates only pad tokens, like '<pad><pad><pad><pad><pad><pad><pad><pad><pad>'. to 4bit-quantized gemma-2-27b-it generates only pad tokens, like '<pad><pad><pad><pad><pad><pad><pad><pad><pad>'. Jul 17, 2024

Jaume

Jul 17, 2024

Just add that I'm facing the same issue with while using 8-bit quantization.

luanagbmartins

Jul 19, 2024

Same here with 4-bit quantization too.

mdouglas

Aug 1, 2024

Hi all. Please use torch_dtype=torch.bfloat16 when loading with from_pretrained(). There's a PR to update the model card examples here: #33.

lkv

Google org Dec 4, 2024

Hi @kshinoda , I hope this issue has been resolved after using torch_dtype=torch.bfloat16. Could you please confirm if you have any concerns otherwise will close this issue. Thank you.

Jaume

Dec 4, 2024

Hello @lkv , I was having the same issue and worked for me, thank you for checking!

kshinoda

Dec 4, 2024

Hello @lkv , this issue has been resolved for me as well. I will close this issue.
Thank you all for the responses!

kshinoda changed discussion status to closed Dec 4, 2024

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment