Instructions for using mlabonne/gemma-2b-GGUF with libraries and local apps.

Libraries

How to use mlabonne/gemma-2b-GGUF with Transformers:

```
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("mlabonne/gemma-2b-GGUF", dtype="auto")
```

llama-cpp-python

How to use mlabonne/gemma-2b-GGUF with llama-cpp-python:

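```
# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the Q2_K file from the Hub (cached after the first call) and load it
llm = Llama.from_pretrained(
    repo_id="mlabonne/gemma-2b-GGUF",
    filename="gemma-2b.Q2_K.gguf",
)

# Generate up to 512 tokens; echo=True includes the prompt in the returned text
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
```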
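
The call above prints the whole response object rather than only the generated text. As a minimal sketch, assuming the response follows the OpenAI-style completion layout that llama-cpp-python returns for non-streaming calls, the text can be pulled out as shown below. The helper name `completion_text` and the `sample` dict are hypothetical and only illustrate the shape of the data.

```python
from typing import Any, Dict


def completion_text(output: Dict[str, Any]) -> str:
    """Return the text of the first choice in an OpenAI-style completion dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"]


# Hypothetical response, shaped like the dict returned by `llm(...)`.
sample = {
    "choices": [
        {"text": "Once upon a time, there was a fox.", "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 6, "total_tokens": 11},
}

print(completion_text(sample))
```

With a real model loaded, `completion_text(output)` would replace `print(output)` in the snippet above.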

Local Apps

llama.cpp

How to use mlabonne/gemma-2b-GGUF with llama.cpp:

Install (macOS, Linux)

```
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
```

Install from WinGet (Windows)

```
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
```

Use pre-built binary

```
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
```

Build from source code

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
```
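
Once the server is running, it can be queried over HTTP. The following is a minimal sketch, assuming the server listens on `http://localhost:8080` (the usual `llama-server` default; adjust if a different port was passed) and exposes the OpenAI-style `/v1/completions` route. The function names are hypothetical. Since this is a base model rather than an instruct model, the sketch sends a plain completion prompt instead of chat messages.

```python
import json
import urllib.request

# Assumption: default llama-server address; change it if the server uses another port.
BASE_URL = "http://localhost:8080"


def build_completion_request(prompt, max_tokens=64, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64, base_url=BASE_URL):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example, with the server started as shown above:
# print(complete("Once upon a time,"))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.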

Use Docker

```
docker model run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
```

Ollama
How to use mlabonne/gemma-2b-GGUF with Ollama:
```
ollama run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
```
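
After the model has been pulled with the command above, it can also be called through Ollama's local REST API. This is a minimal sketch, assuming Ollama is listening on its default address `http://localhost:11434` and that the model is registered under the same `hf.co/...` name used on the command line. The function names are hypothetical.

```python
import json
import urllib.request

# Assumptions: default Ollama address, and the model name used with `ollama run`.
OLLAMA_URL = "http://localhost:11434"
MODEL = "hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M"


def build_generate_request(prompt, model=MODEL, base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL, base_url=OLLAMA_URL):
    """Send the request and return the generated text (needs Ollama running)."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["response"]


# Example, with Ollama running locally:
# print(generate("Once upon a time,"))
```

Setting `"stream": False` asks for a single JSON object instead of a stream of partial responses, which keeps the parsing to one `json.load` call.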

Unsloth Studio

How to use mlabonne/gemma-2b-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mlabonne/gemma-2b-GGUF to start chatting
```

Install Unsloth Studio (Windows)

```
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mlabonne/gemma-2b-GGUF to start chatting
```

Using HuggingFace Spaces for Unsloth

No setup required. Open https://huggingface.co/spaces/unsloth/studio in your browser and search for mlabonne/gemma-2b-GGUF to start chatting.

Docker Model Runner
How to use mlabonne/gemma-2b-GGUF with Docker Model Runner:
```
docker model run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
```

Lemonade

How to use mlabonne/gemma-2b-GGUF with Lemonade:

Pull the model

```
# Download Lemonade from https://lemonade-server.ai/
lemonade pull mlabonne/gemma-2b-GGUF:Q4_K_M
```

Run and chat with the model

```
lemonade run user.gemma-2b-GGUF-Q4_K_M
```

List all available models

```
lemonade list
```

Gemma-2B GGUF

This is a quantized version of the google/gemma-2b model using llama.cpp.

This model card corresponds to the 2B base version of the Gemma model. You can also visit the model cards for the 7B base model, 7B instruct model, and 2B instruct model.

Model Page: Gemma

Terms of Use: Terms

⚡ Quants

q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_s: Uses Q3_K for all tensors
q4_0: Original quant method, 4-bit.
q4_1: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models.
q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
q4_k_s: Uses Q4_K for all tensors
q5_0: Higher accuracy, but higher resource usage and slower inference.
q5_1: Even higher accuracy, with even higher resource usage and slower inference.
q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
q5_k_s: Uses Q5_K for all tensors
q6_k: Uses Q6_K for all tensors
q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
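
To make the trade-off between the non-K formats more concrete, the sketch below computes their bits per weight from the ggml block layouts (32 weights per block plus a 16-bit scale and, for the `_1` variants, a 16-bit minimum) and turns that into a rough file-size estimate. The parameter count of 2.5 billion is an assumption used only for illustration, and real files will differ because not every tensor is stored in the same format. The K-quants are left out because they mix tensor types as described in the list above.

```python
# (weights per block, bytes per block) for the non-K block formats
BLOCK_LAYOUTS = {
    "q4_0": (32, 2 + 16),          # fp16 scale + 32 packed 4-bit values
    "q4_1": (32, 2 + 2 + 16),      # fp16 scale + fp16 min + 32 packed 4-bit values
    "q5_0": (32, 2 + 4 + 16),      # fp16 scale + 32 high bits + 32 low nibbles
    "q5_1": (32, 2 + 2 + 4 + 16),  # scale + min + high bits + low nibbles
    "q8_0": (32, 2 + 32),          # fp16 scale + 32 int8 values
}


def bits_per_weight(quant: str) -> float:
    """Average number of bits stored per weight, including block metadata."""
    weights, n_bytes = BLOCK_LAYOUTS[quant]
    return n_bytes * 8 / weights


def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough size in GB if every weight used the given format."""
    return n_params * bits_per_weight(quant) / 8 / 1e9


N_PARAMS = 2.5e9  # assumption: illustrative parameter count, not read from the file

for name in BLOCK_LAYOUTS:
    print(f"{name}: {bits_per_weight(name):.2f} bits/weight, "
          f"~{estimated_size_gb(N_PARAMS, name):.2f} GB")
```

The ordering q4_0 < q4_1 < q5_0 < q5_1 < q8_0 in bits per weight matches the accuracy and resource-usage ordering given in the list.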

💻 Usage

This model can be used with the latest version of llama.cpp and LM Studio >0.2.16.


GGUF

Model size: 3B params

Architecture: gemma

Hardware compatibility: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
