Instructions to use google/gemma-7b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-7b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-7b")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
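A quick way to check that generation works, continuing from the tokenizer and model loaded above (the prompt and max_new_tokens value are only illustrative):

# Generate a short continuation with the model loaded above
inputs = tokenizer("Once upon a time,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))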
- llama-cpp-python
How to use google/gemma-7b with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-7b",
    filename="gemma-7b.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
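To stream tokens as they are generated instead, the same llm object can be called with stream=True. A minimal sketch (the prompt and max_tokens value are only illustrative):

# Stream the completion chunk by chunk
for chunk in llm("Once upon a time,", max_tokens=256, stream=True):
    # Each chunk carries the newly generated text
    print(chunk["choices"][0]["text"], end="", flush=True)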
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use google/gemma-7b with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
llama-cli -hf google/gemma-7b
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
./llama-cli -hf google/gemma-7b
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf google/gemma-7b

# Run inference directly in the terminal:
./build/bin/llama-cli -hf google/gemma-7b
Use Docker
docker model run hf.co/google/gemma-7b
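Once llama-server is running via any of the install options above, it can also be queried from Python. A minimal sketch, assuming the server listens on the default port 8080 and using llama.cpp's native /completion endpoint (the prompt and n_predict value are only illustrative):

import requests

# Ask the local llama-server for a completion
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Once upon a time,", "n_predict": 128},
)
print(resp.json()["content"])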
- LM Studio
- Jan
- vLLM
How to use google/gemma-7b with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "google/gemma-7b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker
docker model run hf.co/google/gemma-7b
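Because vLLM exposes an OpenAI-compatible API, the server started with the pip instructions above can also be called from the openai Python client. A minimal sketch, assuming the openai package is installed and the server is listening on localhost:8000 (the api_key value is a placeholder, since the local server does not require one by default):

from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="google/gemma-7b",
    prompt="Once upon a time,",
    max_tokens=64,
)
print(completion.choices[0].text)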
- SGLang
How to use google/gemma-7b with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-7b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "google/gemma-7b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
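The same OpenAI-compatible endpoint can also be called from Python. A minimal sketch using requests, assuming the SGLang server above is listening on port 30000 (the prompt and max_tokens value are only illustrative):

import requests

# Request a completion from the local SGLang server
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "google/gemma-7b",
        "prompt": "Once upon a time,",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])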
- Ollama
How to use google/gemma-7b with Ollama:
ollama run hf.co/google/gemma-7b
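Ollama also exposes a local REST API, so the model can be called programmatically once it has been pulled with the command above. A minimal sketch, assuming Ollama's default port 11434 and that the model is registered under the hf.co/google/gemma-7b name used above:

import requests

# Ask the local Ollama server for a single, non-streaming response
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/google/gemma-7b",
        "prompt": "Once upon a time,",
        "stream": False,
    },
)
print(resp.json()["response"])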
- Unsloth Studio
How to use google/gemma-7b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for google/gemma-7b to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for google/gemma-7b to start chatting
- Docker Model Runner
How to use google/gemma-7b with Docker Model Runner:
docker model run hf.co/google/gemma-7b
- Lemonade
How to use google/gemma-7b with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull google/gemma-7b
Run and chat with the model
lemonade run user.gemma-7b-{{QUANT_TAG}}
List all available models
lemonade list
Error "shape '[1, 9, 3072]' is invalid for input of size 36864" while running Gemma 7b using torch.float16
Hello,
I'm trying to run the Gemma 7b example from this model's card using torch.float16, but I keep getting the error shape '[1, 9, 3072]' is invalid for input of size 36864.
I just copy/pasted the example from the card page to a Google Colab notebook (and installed the necessary dependencies of course).
Am I doing something wrong?
EDIT: I tried using 8-bit precision but got the same error.
I got the same error on Google Colab with a T4.
I found that gemma-2b and gemma-2b-it worked, but gemma-7b and gemma-7b-it failed with RuntimeError: shape '[1, 9, 3072]' is invalid for input of size 36864.
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0
import os
from google.colab import userdata
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_id = "google/gemma-7b" # gemma-2b and gemma-2b-it worked, but gemma-7b and gemma-7b-it got error
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
Hey all! We're looking into it! Things work with torch 2.2.0 but not 2.1.0. We'll update here once we find the issue.
Hey all! The source of the issue is the difference in the attention implementation. Any torch version before 2.1.1 will use eager, as sdpa isn't supported in torch in those versions. We will fix the models to work with these versions in transformers ASAP and release a patch; but in the meantime, we recommend using a torch version that satisfies torch>=2.1.1 in order to leverage the sdpa attention implementation, which works correctly.
Here is the necessary line to install the relevant pytorch version in colab:
pip install "torch>=2.1.1" -U
Please restart your runtime afterwards for it to leverage the updated pytorch version!
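For reference, the attention implementation can also be requested explicitly when loading the model. A minimal sketch, assuming transformers >= 4.38 and torch >= 2.1.1 (the "sdpa" choice simply mirrors the workaround described above; "eager" would force the older implementation):

import torch
from transformers import AutoModelForCausalLM

# Explicitly select the SDPA attention implementation (requires torch>=2.1.1)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)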
Thank you so much!
https://huggingface.co/google/gemma-7b-it/discussions/13
@osanseviero @lysandre Thank you!
I tested on Google Colab with a T4 and confirmed that it works without error after adding this cell at the top of the notebook.
!pip3 install -q -U torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 torchdata==0.7.1 torchtext==0.16.1 --index-url https://download.pytorch.org/whl/cu121
By the way, the example prompt Write me a poem about Machine Learning. does not seem suitable for the non-instruct model gemma-7b: it generates nonsense output, which makes it hard to tell whether the model is working correctly.
<bos>Write me a poem about Machine Learning.
<bos><bos><bos><bos><bos><bos><bos><bos><bos><bos>
But it actually works well if "Because" is appended to the prompt, i.e. Write me a poem about Machine Learning. Because:
<bos>Write me a poem about Machine Learning. Because I’m a poet. And I’m
Hi all! We just did a new release in transformers that fixes the issue being discussed in this thread. Make sure to upgrade. Thanks everyone!
@osanseviero Thank you so much!
I tested on Google Colab (torch 2.1.0+cu121) with transformers==4.38.1 and confirmed that the example worked well.
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1 # NOT 4.38.0
Great to hear! I'll close this discussion, but feel free to comment if you still face the issue!