Instructions to use google/gemma-7b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-7b-it with Transformers:
```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-7b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
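To inspect the exact prompt string the chat template produces, you can render it without tokenizing. A minimal sketch; the turn markers shown in the comments are Gemma's chat format:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
messages = [{"role": "user", "content": "Who are you?"}]

# Render the chat template to a plain string instead of token IDs
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected shape of the prompt (Gemma's turn markers):
# <bos><start_of_turn>user
# Who are you?<end_of_turn>
# <start_of_turn>model
```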
- llama-cpp-python
How to use google/gemma-7b-it with llama-cpp-python:
```
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-7b-it",
    filename="gemma-7b-it.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
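`create_chat_completion` also accepts the usual sampling parameters. A minimal sketch, assuming the same GGUF file as above:

```
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-7b-it",
    filename="gemma-7b-it.gguf",
)

# max_tokens caps the reply length; temperature controls sampling randomness
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=128,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```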
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use google/gemma-7b-it with llama.cpp:
Install from brew
```
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b-it

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b-it
```
Install from WinGet (Windows)
```
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b-it

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b-it
```
Use pre-built binary
```
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf google/gemma-7b-it

# Run inference directly in the terminal:
./llama-cli -hf google/gemma-7b-it
```
Build from source code
```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf google/gemma-7b-it

# Run inference directly in the terminal:
./build/bin/llama-cli -hf google/gemma-7b-it
```
Use Docker
```
docker model run hf.co/google/gemma-7b-it
```
- LM Studio
- Jan
- vLLM
How to use google/gemma-7b-it with vLLM:
Install from pip and serve model
```
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-7b-it"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
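Since the vLLM server speaks the OpenAI-compatible API, you can also call it from Python. A minimal sketch, assuming the `openai` client package and the default local port used above (the `api_key` value is arbitrary for a local server):

```
# pip install openai
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="google/gemma-7b-it",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```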
Use Docker
```
docker model run hf.co/google/gemma-7b-it
```
- SGLang
How to use google/gemma-7b-it with SGLang:
Install from pip and serve model
```
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-7b-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
Use Docker images
```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "google/gemma-7b-it" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b-it",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
- Ollama
How to use google/gemma-7b-it with Ollama:
```
ollama run hf.co/google/gemma-7b-it
```
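Once pulled, the model is also reachable through Ollama's local REST API (port 11434 by default). A minimal sketch using Python's `requests`; the model tag is assumed to match the `hf.co/...` name used in the run command:

```
# pip install requests
import requests

# Ollama's local chat endpoint (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/google/gemma-7b-it",  # assumed tag from the command above
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["message"]["content"])
```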
- Unsloth Studio
How to use google/gemma-7b-it with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b-it to start chatting
```
Install Unsloth Studio (Windows)
```
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b-it to start chatting
```
Use Hugging Face Spaces for Unsloth
```
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for google/gemma-7b-it to start chatting
```
- Docker Model Runner
How to use google/gemma-7b-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-7b-it
```
- Lemonade
How to use google/gemma-7b-it with Lemonade:
Pull the model
```
# Download Lemonade from https://lemonade-server.ai/
lemonade pull google/gemma-7b-it
```
Run and chat with the model
```
lemonade run user.gemma-7b-it-{{QUANT_TAG}}
```
List all available models
```
lemonade list
```
Error in model.generate()
Code:
```
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-7b-it"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ["HF_TOKEN"])
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
    token=os.environ["HF_TOKEN"],
)

# %%time  (Jupyter cell magic in the original notebook)
chat = [
    {"role": "user", "content": "Write a hello world program"},
]
# Build the prompt with Gemma's chat template
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=250)

# Decode and print the output
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
Facing the same issue here.
I'm running it on a Colab T4 GPU. Somehow gemma-2b-it runs, but 7b-it throws the above error.
Same issue here with gemma-7b-it:
```
RuntimeError: shape '[1, 9, 3072]' is invalid for input of size 36864
```
And somehow, the model runs fine on Kaggle: I can use gemma-7b-it there, but it throws the size error in Colab. On the flip side, gemma-2b-it runs fine in Colab, but I don't know how to control the number of output tokens generated; the response is cut off in the middle. For example, for the question "Who are you?", the response I received was "I am a large language model, trained by Google. I am a".
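On the truncation question: generation length is capped by `max_new_tokens`, which defaults to a small value. A minimal sketch, assuming the Transformers pipeline API:

```
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-2b-it")
messages = [{"role": "user", "content": "Who are you?"}]

# max_new_tokens raises the cap on generated tokens; the small
# default is why replies get cut off mid-sentence.
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"])
```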
Curiously, the gemma-2b-it model works correctly, but the 7b-it and 7b base models do not.
Google Colab T4, V100, and A100 GPUs don't work.
Thanks all for reporting! I manage to reproduce it using torch 2.1.0, but the error doesn't appear with torch 2.2.0.
Could you share your torch version, or upgrade to 2.2.0 if you haven't already, and let us know if that helps?
Hey all! The source of the issue is the difference in attention implementations. Any torch version before 2.1.1 will use eager, as sdpa isn't supported in those torch versions. We will fix the models to work with these versions in transformers ASAP and release a patch; in the meantime, we recommend using a torch version that satisfies torch>=2.1.1 in order to leverage the sdpa attention implementation, which works correctly.
Here is the necessary line to install the relevant PyTorch version in Colab:
```
pip install "torch>=2.1.1" -U
```
Please restart your runtime afterwards so it picks up the updated PyTorch version!
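If upgrading torch isn't an option right away, the attention implementation can also be pinned explicitly at load time. A minimal sketch, assuming a Transformers version that accepts the `attn_implementation` argument:

```
import torch
from transformers import AutoModelForCausalLM

# Force the sdpa attention path (requires torch>=2.1.1);
# on older torch versions Transformers falls back to "eager".
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```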
https://huggingface.co/google/gemma-7b/discussions/17
Hey all! There's a PR to fix the "eager" attention in Transformers: https://github.com/huggingface/transformers/pull/29187. Once this is merged, we'll do a patch release and bump the latest PyPI version of Transformers to include this fix.
cc @ArthurZ
Patch release is done! Thanks all for the prompt report, and sorry for not catching it!
```
pip install -U transformers
```