Instructions to use mlabonne/gemma-2b-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use mlabonne/gemma-2b-GGUF with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("mlabonne/gemma-2b-GGUF", dtype="auto")
- llama-cpp-python
How to use mlabonne/gemma-2b-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="mlabonne/gemma-2b-GGUF",
    filename="gemma-2b.Q2_K.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use mlabonne/gemma-2b-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mlabonne/gemma-2b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf mlabonne/gemma-2b-GGUF:Q4_K_M
Use Docker
docker model run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use mlabonne/gemma-2b-GGUF with Ollama:
ollama run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
- Unsloth Studio
How to use mlabonne/gemma-2b-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mlabonne/gemma-2b-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mlabonne/gemma-2b-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for mlabonne/gemma-2b-GGUF to start chatting
- Docker Model Runner
How to use mlabonne/gemma-2b-GGUF with Docker Model Runner:
docker model run hf.co/mlabonne/gemma-2b-GGUF:Q4_K_M
- Lemonade
How to use mlabonne/gemma-2b-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull mlabonne/gemma-2b-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.gemma-2b-GGUF-Q4_K_M
List all available models
lemonade list
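The llama-server commands above start a local OpenAI-compatible server. As a rough sketch of how a client might talk to it, the snippet below builds a text-completion request using only the Python standard library. The address `http://127.0.0.1:8080`, the `/v1/completions` path, and the helper names `build_completion_request` and `complete` are assumptions made for illustration and are not taken from this page; adjust them to match how your server is actually started.

```python
import json
import urllib.request

# Assumption: llama-server is listening on this local address.
BASE_URL = "http://127.0.0.1:8080"


def build_completion_request(prompt, max_tokens=128, base_url=BASE_URL):
    """Build (but do not send) a POST request for a text completion."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",  # assumed OpenAI-style endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=128, base_url=BASE_URL):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the response follows the OpenAI completions layout.
    return body["choices"][0]["text"]


# With a server running, one would call for example:
# print(complete("Once upon a time,", max_tokens=64))
```

A plain completion request is used here rather than a chat request because this repository holds the base (non-instruct) model.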
Gemma-2B GGUF
This is a quantized version of the google/gemma-2b model using llama.cpp.
This model card corresponds to the 2B base version of the Gemma model. You can also visit the model cards of the 7B base model, 7B instruct model, and 2B instruct model.
Model Page: Gemma
Terms of Use: Terms
⚡ Quants
- q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
- q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K.
- q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K.
- q3_k_s: Uses Q3_K for all tensors.
- q4_0: Original quant method, 4-bit.
- q4_1: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.
- q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- q4_k_s: Uses Q4_K for all tensors.
- q5_0: Higher accuracy, higher resource usage and slower inference.
- q5_1: Even higher accuracy, resource usage and slower inference.
- q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
- q5_k_s: Uses Q5_K for all tensors.
- q6_k: Uses Q8_K for all tensors.
- q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
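To pick one of these quants programmatically, a small helper can map a quant name to a GGUF filename. The sketch below assumes every file in the repository follows the same naming pattern as `gemma-2b.Q2_K.gguf`, which is the only filename shown on this page; the other filenames it produces are unverified guesses, and `gguf_filename` is a hypothetical helper, not part of any library.

```python
# Quant names as listed in the Quants section of this model card.
QUANTS = [
    "q2_k", "q3_k_l", "q3_k_m", "q3_k_s", "q4_0", "q4_1", "q4_k_m",
    "q4_k_s", "q5_0", "q5_1", "q5_k_m", "q5_k_s", "q6_k", "q8_0",
]


def gguf_filename(quant: str, base: str = "gemma-2b") -> str:
    """Map a quant name to a filename, assuming the '<base>.<QUANT>.gguf' pattern."""
    key = quant.strip().lower()
    if key not in QUANTS:
        raise ValueError(f"unknown quant: {quant!r}")
    # Assumption: files are named like 'gemma-2b.Q2_K.gguf'.
    return f"{base}.{key.upper()}.gguf"


print(gguf_filename("q2_k"))
```

The returned string could then be passed as the `filename` argument of `Llama.from_pretrained`, after checking that the file actually exists in the repository.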
💻 Usage
This model can be used with the latest version of llama.cpp and LM Studio >0.2.16.
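When the model is run through llama-cpp-python as shown earlier, `print(output)` displays the whole response dictionary. The sketch below pulls out only the generated text, assuming the response uses the OpenAI-style completion layout with a `choices` list. The helper `completion_text` and the `example_output` dictionary are hypothetical stand-ins for illustration, not real model output.

```python
def completion_text(output: dict) -> str:
    """Return the generated text from a completion-style response dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Hypothetical response, shaped like a completion dictionary (values are made up).
example_output = {
    "choices": [
        {"text": "Once upon a time, there was a fox.", "index": 0, "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 7, "total_tokens": 12},
}

print(completion_text(example_output))
```

With `echo=True`, as in the earlier call, the returned text also includes the prompt itself.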