Saving, loading, and running inference with the Gemma model

#64

by Iamexperimenting - opened Mar 7, 2024

Discussion

Iamexperimenting

Mar 7, 2024

Hi team, thanks for the model and examples. I just looked at this example notebook and couldn't find a section that shows how to save the fine-tuned model, load it again, and run inference with it.

Notebook: https://huggingface.co/google/gemma-7b/blob/main/examples/notebook_sft_peft.ipynb

Could you please add those sections? It would be very helpful.

@suryabhupa

Iamexperimenting

Mar 7, 2024

Additionally, I noticed the example for distributed training on TPU devices. However, I use NVIDIA GPUs and don't have access to a TPU machine. Could you please provide an example of distributed training on NVIDIA GPUs?

Example for TPU: https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py

@suryabhupa

suryabhupa

Google org Mar 7, 2024

hello!

re: loading the IT model; the use case you have in mind seems somewhat specific, do you think it might be possible for you to add a PR to do this?

re: GPUs: unfortunately I don't have any examples on hand about distributed training on GPUs (our internal stack in Google is TPUs), maybe others have pointers here? cc @pengchong @osanseviero
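
For readers on NVIDIA GPUs, here is a minimal sketch (not an official Gemma example) of one common pattern for data-parallel training with the Hugging Face Trainer: start one process per GPU with torchrun and load the quantized model onto that process's own GPU. The helper name `local_device_map` and the script name `train.py` are hypothetical, and `bnb_config` stands for a `BitsAndBytesConfig` like the one used later in this thread.

```python
import os


def local_device_map() -> dict:
    """Map the whole model ("" is the root module) to this process's GPU.

    torchrun exports LOCAL_RANK for every worker it starts. When the script
    runs as a plain single process the variable is absent and GPU 0 is used.
    """
    return {"": int(os.environ.get("LOCAL_RANK", "0"))}


if __name__ == "__main__":
    # Assumed usage inside a hypothetical train.py, started with:
    #   torchrun --nproc_per_node=2 train.py
    #
    # model = AutoModelForCausalLM.from_pretrained(
    #     "google/gemma-7b",
    #     quantization_config=bnb_config,
    #     device_map=local_device_map(),
    # )
    print(local_device_map())
```

With this placement each process holds a full copy of the model, and the Trainer takes care of gradient synchronization between the processes.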

Iamexperimenting

Mar 7, 2024

@suryabhupa, I'm new to Hugging Face, which is why I'm looking for a concrete example. I searched as much as I could and found that there are different ways to save a model, and that a model fine-tuned with PEFT is saved and loaded differently. That is why I was asking the Gemma team to add those sections to their example.

I believe saving and loading are the same for every use case.
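
As a rough illustration of the two save paths mentioned above, the sketch below separates "adapter only" from "merged model". It assumes a trainer that was created with a `peft_config`, in which case `trainer.save_model(some_dir)` writes only the LoRA adapter. The function names and directory names are hypothetical, and the heavy imports sit inside the functions so the file can be read and imported without a GPU.

```python
def load_adapter(adapter_dir: str):
    """Load the base model plus the LoRA adapter saved in adapter_dir."""
    import torch
    from peft import AutoPeftModelForCausalLM

    return AutoPeftModelForCausalLM.from_pretrained(
        adapter_dir,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )


def merge_and_save(adapter_model, tokenizer, out_dir: str):
    """Fold the LoRA weights into the base weights and save a standalone model."""
    merged = adapter_model.merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)
    tokenizer.save_pretrained(out_dir)
    return merged


def load_merged(merged_dir: str):
    """Load a merged model like any regular Transformers checkpoint."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(merged_dir)
    model = AutoModelForCausalLM.from_pretrained(
        merged_dir,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    return model, tokenizer
```

A possible flow would be `trainer.save_model("finetuned_adapter")`, then `load_adapter("finetuned_adapter")` in a fresh session, and only then `merge_and_save(...)` if a standalone checkpoint is needed.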

Iamexperimenting

Mar 12, 2024 • edited Mar 13, 2024

@suryabhupa @ybelkada I fine-tuned the 7b model, and when I run inference right after fine-tuning I get correct results. I then saved the fine-tuned model. When I load the saved model and run inference with it, the model hallucinates like the un-fine-tuned model; its output is nowhere near the fine-tuned model's results.

I'm not sure whether I'm saving, merging, and loading the model correctly. Can you please guide me here?

import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)


model_id = "google/gemma-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
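    train_dataset=data["train"],  # `data` is the user's dataset; loading it is not shown in the post
    args=transformers.TrainingArguments(
        num_train_epochs=50,  # has no effect here: a positive max_steps overrides it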
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=5,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)
trainer.train()

# Save the LoRA adapter locally (to output_dir, i.e. "outputs")
trainer.save_model()


# Empty VRAM
del model
del trainer
import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()

from peft import AutoPeftModelForCausalLM

new_model = AutoPeftModelForCausalLM.from_pretrained(
    'outputs',
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

merged_model = new_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("custom-fine-tuned-merged", safe_serialization=True)
tokenizer.save_pretrained("custom-fine-tuned-merged")

text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.cuda.amp.autocast():
    outputs = merged_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

ybelkada

Mar 13, 2024

Hi @Iamexperimenting
Thanks for the experiments! Can you show us how you create the LoraConfig?

Iamexperimenting

Mar 13, 2024

Oh, my bad. I have updated the code:

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

@ybelkada

ybelkada

Mar 15, 2024

Hi @Iamexperimenting
Thanks! What happens if you generate with the un-merged model? Can you also try reloading the model in 4-bit precision or torch.bfloat16? I wonder if there is something off with the model precision.
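
One possible reading of the precision concern, shown as a toy sketch in plain Python: an adapter trained on top of 4-bit quantized base weights sees slightly different base values than the float16 copy it is later applied to or merged into. The numbers are made up, and `quantize_4bit` is a hypothetical uniform 16-level quantizer, not the NF4 scheme used by bitsandbytes.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_4bit(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Toy uniform quantizer with 16 levels between lo and hi (not NF4)."""
    step = (hi - lo) / 15
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


def dot(weights, xs):
    """Plain dot product, standing in for one output of a linear layer."""
    return sum(w * x for w, x in zip(weights, xs))


base_weights = [0.137, -0.482, 0.905, -0.061]  # made-up base weights
lora_delta = [0.010, -0.020, 0.005, 0.015]     # made-up learned LoRA update
inputs = [1.0, 2.0, -1.5, 0.5]

# What the adapter saw during training: quantized base plus LoRA update.
out_train = dot(
    [quantize_4bit(w) + d for w, d in zip(base_weights, lora_delta)], inputs
)

# What a float16 reload or merge computes: original base plus LoRA update.
out_reloaded = dot(
    [to_float16(w + d) for w, d in zip(base_weights, lora_delta)], inputs
)

gap = abs(out_train - out_reloaded)
print(f"train-time output: {out_train:.4f}")
print(f"reloaded output:   {out_reloaded:.4f}")
print(f"gap:               {gap:.4f}")
```

The gap here comes entirely from the base weights differing between the two setups, which is why reloading the base in the same precision used for training is a useful first check.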

Iamexperimenting

Mar 19, 2024

@ybelkada

What happens if you generate with the un-merged model?
Answer: It generates the expected output.

Can you also try to reload the model in 4-bit precision or torch.bfloat16?
Can you please provide a code snippet to reload the model in 4-bit precision or torch.bfloat16?
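
A minimal sketch of what such a reload could look like, assuming the adapter was saved to a local directory and the base model is the one used for training. The function names are hypothetical; the 4-bit settings mirror the `BitsAndBytesConfig` from the training script above. Imports are placed inside the functions so the file loads without a GPU.

```python
def load_adapter_on_4bit_base(adapter_dir: str, base_model_id: str = "google/gemma-7b"):
    """Reload the base model in 4-bit (as during training) and attach the adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, quantization_config=bnb_config, device_map={"": 0}
    )
    return PeftModel.from_pretrained(base, adapter_dir)


def load_adapter_on_bf16_base(adapter_dir: str, base_model_id: str = "google/gemma-7b"):
    """Reload the base model in bfloat16 and attach the adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16, device_map={"": 0}
    )
    return PeftModel.from_pretrained(base, adapter_dir)


def generate_greedy(model, tokenizer, text: str, max_new_tokens: int = 20) -> str:
    """Greedy decoding, so repeated runs on the same weights are comparable."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Comparing `generate_greedy` output from both loaders against the output seen right after training should show whether the base-model precision is what changes the result.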

Iamexperimenting

Mar 26, 2024

Hi @ybelkada, can you please provide a sample?

Iamexperimenting

Apr 4, 2024

@ybelkada @suryabhupa can you please help me here?

suryabhupa

Google org Apr 5, 2024

I would suggest carefully checking that everything (the precision, the weights, the activations) is what you expect at every step of the way, all the way to inference, whether it's from the un-merged model or not. I suspect something is getting overwritten or is not being loaded properly.
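
One small check along these lines, sketched with the standard library only: look at what was actually written to the save directory. It assumes the usual PEFT layout, where a saved adapter contains `adapter_config.json` plus `adapter_model.safetensors` (or `adapter_model.bin` in older versions). The function name `describe_adapter_dir` is hypothetical.

```python
import json
from pathlib import Path


def describe_adapter_dir(path: str) -> dict:
    """Summarize a directory that is expected to hold a saved LoRA adapter."""
    root = Path(path)
    files = sorted(p.name for p in root.iterdir() if p.is_file())
    config_path = root / "adapter_config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    return {
        "files": files,
        "has_adapter_config": config_path.exists(),
        "has_adapter_weights": any(name in files for name in weight_names),
        # The base model the adapter will be attached to when it is reloaded.
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "target_modules": config.get("target_modules"),
    }
```

If `base_model` does not match the model used for training, or the weights file is missing, the reload cannot reproduce the fine-tuned behaviour.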

Iamexperimenting

Apr 13, 2024 • edited Apr 13, 2024

@suryabhupa @ybelkada, I feel the same. Can you please help me here? I have put my training and inference scripts below.

Please find the reproducible code below.

Here I'm fine-tuning the Gemma model with my dataset.

import os

import pandas as pd
import torch
import transformers
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset

model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

data = pd.read_csv('train_data.csv')
train_df = Dataset.from_pandas(data)

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_df,
    dataset_text_field="text",
    max_seq_length=512,
    args=transformers.TrainingArguments(
        num_train_epochs=10,  # has no effect here: a positive max_steps overrides it
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        seed = 12,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)

trainer.train()
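
A side note on the training arguments, as a small arithmetic sketch: in `TrainingArguments`, a positive `max_steps` overrides `num_train_epochs`, so both scripts in this thread stop after 10 optimizer steps. The helper `training_plan` is hypothetical and assumes a single GPU unless told otherwise.

```python
def training_plan(per_device_batch_size: int, grad_accum_steps: int,
                  max_steps: int, num_devices: int = 1) -> tuple:
    """Return (optimizer_steps, examples_consumed) for a run capped by max_steps."""
    if max_steps <= 0:
        raise ValueError("this sketch only covers runs with a positive max_steps")
    examples_per_step = per_device_batch_size * grad_accum_steps * num_devices
    return max_steps, max_steps * examples_per_step


# First script in the thread: batch size 1, accumulation 4, max_steps 10.
print(training_plan(1, 4, 10))    # (10, 40)

# Second script: batch size 4, accumulation 16, max_steps 10.
print(training_plan(4, 16, 10))   # (10, 640)
```

So the first run consumes about 40 training examples and the second about 640 (cycling through the dataset again if it is smaller), regardless of the `num_train_epochs` values.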

After training finished, I immediately tested the model on two examples to check its predictions, and it generated the expected output.

# Below is with example 1 input

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Below is with example 2 input

text = "trained input example text 2"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After checking the fine-tuned model's performance, I saved the model with the step below:

trainer.save_model("finetuned_model")

After saving the model, I restarted the kernel and loaded the fine-tuned model:

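from peft import AutoPeftModelForCausalLM
import torch

# Note: after a kernel restart the tokenizer must be loaded again as well
# (that step is not shown in the original post).
new_finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "finetuned_model",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)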

After loading the fine-tuned model, I tested it with the same example input and noticed that the generated output is different.

text = "trained input example text 1"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = new_finetuned_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I'm not sure where I'm making a mistake. Could you please help me here?

suryabhupa

Google org Apr 24, 2024

It seems like the problem lies somewhere between restarting the kernel and reloading the fine-tuned model; something seems broken there.
