<pad> spam issue

#40

by Zewsic - opened Feb 23, 2024

Discussion

Zewsic

Feb 23, 2024 • edited Feb 23, 2024

I'm trying to run the example code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto", torch_dtype=torch.float16)

chat = [
    { "role": "user", "content": "Write a hello world program on python" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt").to("mps")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))

and I get this output:

Write a hello world program on python<end_of_turn>
<start_of_turn>model
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>```


Why is this happening?
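
A quick way to confirm this symptom programmatically is to count the run of pad markers at the end of the decoded text. This is only a sketch using plain string handling: the helper names `trailing_pad_count` and `looks_degenerate` are hypothetical, and it assumes the pad token decodes to the literal string `<pad>`, as in the output above.

```python
def trailing_pad_count(text: str, pad: str = "<pad>") -> int:
    """Count how many consecutive pad markers end the decoded text."""
    if not pad:
        raise ValueError("pad marker must be a non-empty string")
    text = text.rstrip()
    count = 0
    while text.endswith(pad):
        text = text[: -len(pad)]
        count += 1
    return count


def looks_degenerate(text: str, max_new_tokens: int, pad: str = "<pad>",
                     threshold: float = 0.9) -> bool:
    """True when at least `threshold` of the generated tokens are pad markers."""
    return trailing_pad_count(text, pad) >= threshold * max_new_tokens


# Synthetic sample shaped like the output above (not real model output).
sample = "<start_of_turn>model\n" + "<pad>" * 150
print(trailing_pad_count(sample))                    # 150
print(looks_degenerate(sample, max_new_tokens=150))  # True
```

In practice the text passed in would be `tokenizer.decode(outputs[0])` from the snippet above.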

suryabhupa

Google org Feb 23, 2024

That's really odd. Can you share exactly what the prompt variable looks like?

EarthWorm001

Feb 23, 2024

I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33

In my experience, loading the model in float32 may fix it in your case.
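
For reference, a minimal sketch of switching the load precision. The helper names `build_load_options` and `load_gemma` are hypothetical; `torch_dtype` and `device_map` are the same `from_pretrained` arguments used in the original post. Whether float32 actually fits depends on the memory available on the machine.

```python
ALLOWED_DTYPES = ("float32", "bfloat16", "float16")


def build_load_options(dtype_name: str = "float32", device_map: str = "auto") -> dict:
    """Validate the requested precision and return the load options."""
    if dtype_name not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype {dtype_name!r}; expected one of {ALLOWED_DTYPES}")
    return {"dtype_name": dtype_name, "device_map": device_map}


def load_gemma(model_id: str = "google/gemma-7b-it", dtype_name: str = "float32",
               device_map: str = "auto"):
    """Load the model in the requested precision (downloads the weights)."""
    # Heavy imports stay inside the function so the helper above
    # can be used without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM

    opts = build_load_options(dtype_name, device_map)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, opts["dtype_name"]),
        device_map=opts["device_map"],
    )


print(build_load_options("float32"))
```

Calling `load_gemma(dtype_name="float32")` would then replace the `from_pretrained` line in the original snippet.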

Zewsic

Feb 23, 2024

> That's really odd. Can you share exactly what the prompt variable looks like?

<bos><start_of_turn>user
Write a hello world program on python<end_of_turn>
<start_of_turn>model

Zewsic

Feb 23, 2024

> I had the same issue: https://huggingface.co/google/gemma-7b/discussions/33
>
> In my experience, loading the model in float32 may fix it in your case.

I tried this, and I get no result at all. I get this message:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.

After that, generation simply never happens, and no errors are displayed. The behavior is exactly the same with both the base model and the instruction-tuned model.
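
The offloading warning is consistent with the float32 weights not fitting in memory. A rough estimate is sketched below; it assumes about 8.5 billion parameters for the 7B checkpoint (an approximation, not a figure from this thread) and counts weights only, ignoring activations and the KV cache. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumed parameter count for the 7B checkpoint (approximate).
ASSUMED_PARAMS = 8.5e9


def weight_memory_gb(num_params: float, dtype_name: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e9


for name in ("float32", "bfloat16", "float16"):
    print(f"{name}: ~{weight_memory_gb(ASSUMED_PARAMS, name):.0f} GB")
# float32: ~34 GB
# bfloat16: ~17 GB
# float16: ~17 GB
```

Under these assumptions, float32 needs roughly twice the memory of the half-precision formats, so a machine that holds the float16 weights may still have to offload parts of the float32 model to disk.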

yunhuan929

Mar 6, 2024 • edited Mar 6, 2024

It works correctly with bf16 or fp32, but generates pad tokens when using fp16. I'd like to know why.
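
The thread does not settle the cause. One known difference between the formats is range: float16 has a 5-bit exponent and a largest finite value of 65504, while bfloat16 keeps the 8-bit exponent of float32. The sketch below only illustrates that range difference using the standard library; it is an assumption, not something confirmed here, that overflow of this kind is related to the pad output. `fits_in_float16` and `to_bfloat16` are hypothetical helpers, and `to_bfloat16` truncates rather than rounds.

```python
import struct


def fits_in_float16(x: float) -> bool:
    """True if x can be encoded as a finite IEEE half-precision value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    high_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", high_bytes + b"\x00\x00")[0]


print(fits_in_float16(65504.0))  # True
print(fits_in_float16(70000.0))  # False
print(to_bfloat16(70000.0))      # 69632.0
```

The value 70000 cannot be stored in float16 at all, while bfloat16 stores it with reduced precision.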

suryabhupa

Google org Mar 7, 2024

Do you only see this with the 7B IT model and not any other model?

YM1024

Mar 8, 2024 • edited Mar 8, 2024

Hi @suryabhupa

I've seen similar errors. The 2B-it model works well with all precision options, but the 7B-it only works under bfloat16. With float16, 8-bit, and 4-bit, when handling long inputs, the model freezes for a couple of minutes, then repeats the input and generates lots of <pad> tokens.
P.S. The experiments ran on a server with a Tesla A100, so I don't think it's triggered by the hardware.

drewjiang

Apr 5, 2024

I have the same problem. Did you find a solution?

suryabhupa

Google org Apr 5, 2024

That's quite bizarre. I'm curious whether you see this when using the PyTorch or JAX codepaths. Just trying to diagnose where the issue might be coming from.

mindoflight

Apr 7, 2024

I have the same problem. If I switch to CPU it works well, but on GPU it does not work. I have 2 GPUs and am running the Hugging Face examples.

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

The output looks like this:
<bos>Write me a poem about Machine Learning.<pad><pad>...

mindoflight

Apr 7, 2024

A single GPU works:
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="cuda:0", torch_dtype=torch.bfloat16)
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda:0")
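
To check whether `device_map="auto"` actually split the model across both GPUs, the placement can be inspected. This sketch assumes the loaded model exposes an `hf_device_map` dictionary (module name to device), which Transformers sets when a device map is used. `devices_used` and `is_sharded` are hypothetical helpers, and the example map is made up for illustration.

```python
def devices_used(device_map: dict) -> list:
    """Return the sorted, de-duplicated devices that hold at least one module."""
    return sorted({str(device) for device in device_map.values()})


def is_sharded(device_map: dict) -> bool:
    """True when the modules are spread over more than one device."""
    return len(devices_used(device_map)) > 1


# Made-up placement resembling a model split over two GPUs.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.27": 1,
    "lm_head": 1,
}
print(devices_used(example_map))    # ['0', '1']
print(is_sharded(example_map))      # True
print(is_sharded({"": "cuda:0"}))   # False
```

With a real model, the call would be `is_sharded(model.hf_device_map)`.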

aiqwe

May 21, 2024 • edited May 21, 2024

Hello @suryabhupa,
The continuous <pad> output occurs on the gemma-2b-it model too.
I suspect it is hardware-related, because the behavior differs across hardware with the same dtype.
I attached a screenshot, and I hope it helps.

Model : gemma-1.1-2b-it

Experiments
Case1) CPU + float16 -> works well
Case2) MPS + float16 -> continuous <pad> padding occurs
Case3) CUDA + float16 -> works well

device specs

CPU : Macbook Air M3
MPS : same as CPU
CUDA : L4(google colab)
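
A comparison like this can be scripted so that every device and dtype pair runs the same prompt. The sketch below takes the generation step as a callable, so the loop itself has no dependencies. `run_matrix`, `count_pads`, and `fake_generate` are hypothetical names, and `fake_generate` simply mimics the results reported above instead of running a model.

```python
from itertools import product


def count_pads(text: str, pad: str = "<pad>") -> int:
    """Number of pad markers in the decoded text."""
    return text.count(pad)


def run_matrix(generate, devices, dtypes) -> dict:
    """Run `generate(device, dtype)` for every pair and record the pad count."""
    results = {}
    for device, dtype in product(devices, dtypes):
        results[(device, dtype)] = count_pads(generate(device, dtype))
    return results


def fake_generate(device: str, dtype: str) -> str:
    # Stand-in for real generation: reproduces the pattern reported above.
    if device == "mps" and dtype == "float16":
        return "<pad>" * 20
    return "print('Hello, world!')"


results = run_matrix(fake_generate, ["cpu", "mps", "cuda"], ["float16"])
for (device, dtype), pads in results.items():
    status = "pad output" if pads else "ok"
    print(f"{device} + {dtype}: {status}")
# cpu + float16: ok
# mps + float16: pad output
# cuda + float16: ok
```

To use it for real, `generate` would load the model on the given device with the given dtype, run the prompt, and return the decoded text.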

Screenshots

mfixman

Aug 19, 2024

@aiqwe I can confirm that changing torch_dtype to float16 when declaring the model for CUDA fixes the issue with pads in Gemma.

Thanks!
