How can I input the sys message for the gemma instruct model?

#25

by Yingding - opened Feb 22, 2024

Yingding

Feb 22, 2024 (edited Feb 22, 2024)

I entered the prompt

<start_of_turn>user
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.
Your answers should only be text and not include any HTML or other markup.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information. Just return "<end_of_turn>"


Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

<end_of_turn>

and got the following response back. Can I set a system message for the Gemma instruct model at all?

 'Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can '
 'has 4 tennis balls. How many tennis balls does he have now?\n'
 'A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis '
 'balls. 3 + 8 = 11. The answer is 11.\n'
 'Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 '
 'more, how many apples do they have?\n'
 'A; They initially  had   \n'
 '\n'
 '**Answer:**\n'
 '\n'
 '\n'
 '## Q& A Explanation\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '### **Question One**:\n'
 '\n'
 '\n'
 '\n'
 '- This question involves counting the number or objects (tennis ball) that '
 'are already present in possession by "Roger" which was three(<b><u>Starting '

Yingding changed discussion title from How can I input the sys message for the gemma instruct model. to How can I input the sys message for the gemma instruct model? Feb 22, 2024

ArthurZ

Feb 22, 2024

I don't think the model was trained with a system prompt role 😉
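
If the model has no system role, one possible workaround is to merge the system text into the first user message before building the prompt. The sketch below is only an illustration: `fold_system_into_user` is a hypothetical helper name, not part of any library, and the blank-line separator between the two texts is an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_into_user(messages: List[Message]) -> List[Message]:
    """Return a copy of `messages` with a leading system message merged
    into the first user message (hypothetical helper, not a library API)."""
    # Nothing to fold: return a shallow copy of each message unchanged.
    if not messages or messages[0]["role"] != "system":
        return [dict(m) for m in messages]

    system_text = messages[0]["content"].strip()
    rest = [dict(m) for m in messages[1:]]

    if rest and rest[0]["role"] == "user":
        # Prepend the instructions to the first user message.
        rest[0]["content"] = f"{system_text}\n\n{rest[0]['content']}"
    else:
        # No user message follows: send the instructions as a user message.
        rest.insert(0, {"role": "user", "content": system_text})
    return rest


messages = [
    {"role": "system", "content": "You are a helpful, respectful and honest assistant."},
    {"role": "user", "content": "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"},
]
folded = fold_system_into_user(messages)
print(folded)
```

The folded list contains only a `user` message, so it can be passed to whatever prompt-building step is used next.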

ahmedabobakr

Feb 22, 2024 (edited Feb 22, 2024)

This worked with some modifications to your prompt:

<start_of_turn>user
You are a helpful, respectful, and honest assistant.
Always answer as helpfully as possible using the context text provided.
Your answers should only answer the question once and not have any text after the answer is done.
Your answers should only be text and not include any HTML or other markup.

If a question does not make sense or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. Just say I don't know

BEGIN EXAMPLE
Q: Roger has 3 tennis balls. He buys 2 more cans of tennis balls. Each can has 4 tennis balls. How many tennis balls does he have now?
A: Roger started with 3 balls. 2 cans of 4 tennis balls each is 8 tennis balls. 3 + 8 = 11. The answer is 11.
END EXAMPLE 

Your turn:
Q1: The cafeteria had 23 apples. If they used 20 apples to make lunch and bought 6 more, how many apples do they have?

Q2: What color is the sound of music?

<end_of_turn>
<start_of_turn>model

Model response:

## Q1:

The cafeteria had 23 apples, used 20 apples for lunch, and bought 6 more apples. Therefore, there are 23 - 20 + 6 = 9 apples left.

## Q2:

The sound of music does not have a color associated with it.

suryabhupa

Google org Feb 23, 2024

That's correct -- the model wasn't trained with any system instructions.

To get the best performance, try using the right chat template, as @ahmedabobakr did, i.e. ending the prompt with the "<start_of_turn>model" part after "<end_of_turn>". Thanks @ahmedabobakr!
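
As a rough sketch of that layout, the prompt can be assembled by hand: a user turn holding the instructions and the question, closed with `<end_of_turn>`, followed by an open model turn. `build_gemma_prompt` is a hypothetical helper name, and the sketch assumes the tokenizer prepends the beginning-of-sequence token itself, so it is not written into the string.

```python
def build_gemma_prompt(user_text: str, instructions: str = "") -> str:
    """Build a single-turn prompt in the turn format used in this thread
    (hypothetical helper; instructions are placed inside the user turn)."""
    instructions = instructions.strip()
    user_text = user_text.strip()
    # Instructions, if any, go first, separated from the question by a blank line.
    content = f"{instructions}\n\n{user_text}" if instructions else user_text
    return (
        "<start_of_turn>user\n"
        f"{content}<end_of_turn>\n"
        # Leave the model turn open so generation starts with the answer.
        "<start_of_turn>model\n"
    )


prompt = build_gemma_prompt(
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?",
    instructions="You are a helpful, respectful and honest assistant.",
)
print(prompt)
```

When working with a `messages` list instead of raw strings, the tokenizer's `apply_chat_template(..., add_generation_prompt=True)` should produce the same open model turn.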

suryabhupa changed discussion status to closed Feb 24, 2024
