Instructions to use Omarrran/gemma2_model_9B_q4_km with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Omarrran/gemma2_model_9B_q4_km with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Omarrran/gemma2_model_9B_q4_km",
	filename="unsloth.F16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use Omarrran/gemma2_model_9B_q4_km with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16
# Run inference directly in the terminal:
llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16
# Run inference directly in the terminal:
llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16
# Run inference directly in the terminal:
./llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16

Use Docker

docker model run hf.co/Omarrran/gemma2_model_9B_q4_km:F16

LM Studio
Jan
Ollama
How to use Omarrran/gemma2_model_9B_q4_km with Ollama:
```
ollama run hf.co/Omarrran/gemma2_model_9B_q4_km:F16
```

Unsloth Studio

How to use Omarrran/gemma2_model_9B_q4_km with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Omarrran/gemma2_model_9B_q4_km to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Omarrran/gemma2_model_9B_q4_km to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Omarrran/gemma2_model_9B_q4_km to start chatting

Docker Model Runner
How to use Omarrran/gemma2_model_9B_q4_km with Docker Model Runner:
```
docker model run hf.co/Omarrran/gemma2_model_9B_q4_km:F16
```

Lemonade

How to use Omarrran/gemma2_model_9B_q4_km with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Omarrran/gemma2_model_9B_q4_km:F16

Run and chat with the model

lemonade run user.gemma2_model_9B_q4_km-F16

List all available models

lemonade list

To use this model directly import the following

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Omarrran/gemma2_model_9B_q4_km",
    filename="unsloth.Q4_K_M.gguf",
)

To know more about how to use llama_cpp Inferenece mode see here : Link to llama_cpp

llama.cpp Conversion, Quantization, & Merging

This guide was originally part of the CPU LoRA guide. Check it out if you'd like to train a LoRA using your CPU!

To keep things simple, I recommend creating a single folder somewhere on your system to work out of. For example, C:\working-dir. I'll use this path in the examples below. If you use a different path, just make sure to adjust the commands.

Required Tools
Setup
- Updating
Quantizing Models
Converting Models to GGUF
Quantizing Models
- Quantizing Special Cases
Converting LoRAs to GGUF
Merging LoRAs into a Model
Q&A
Changelog

Required Tools

Install git so you can download and update llama.cpp easily.
Install Python (3.10 or newer, use the 64-bit installer)
Download the latest w64devkit (you want the file named w64devkit-fortran-x.xx.x.zip)
Unzip it, move the files to C:\working-dir\w64devkit

Setup

!!! info Once llama.cpp has been compiled, you don't need to repeat any of these steps unless you update to a newer version of llama.cpp. If you are updating, skip the first 2 steps.

Open a command prompt and move to our working folder: cd C:\working-dir
Download llama.cpp using git: git clone https://github.com/ggerganov/llama.cpp.git
Move into the llama.cpp directory: cd llama.cpp
If there is an existing .venv directory, delete it: rmdir .venv /s /q
Create a python virtual environment: python -m venv .venv
Activate the environment: .venv\Scripts\activate.bat (use .venv\Scripts\Activate.ps1 if you're using PowerShell)
Install the required python modules to the environment: pip install -r requirements.txt
Install pytorch: pip install torch
(Optional) Deactivate the environment: deactivate
Move up one directory: cd ..
Run the compiler tools: w64devkit\w64devkit.exe
Once you see ~ $, move to the llama.cpp repo: cd C:/working-dir/llama.cpp (Make sure to use forward slashes!)
Delete any leftover files: make clean
Compile everything: make all -j
Once it's done compiling, close the compiler tools: exit

Updating

Open a command prompt and move to our llama.cpp folder: cd C:\working-dir\llama.cpp
Download the updates: git pull
Move up one directory: cd ..
Re-run the setup instructions above. Skip the first 2 steps related to downloading llama.cpp.

Converting Models to GGUF

This converts models to the GGUF format (FP32 or FP16). For quantized models, see the next section.

Make sure you installed the required tools and set up llama.cpp.
Open a command prompt and move to our working folder: cd C:\working-dir
Download your base model using git, for example: git clone https://huggingface.co/Sao10K/Stheno-L2-13B
(Optional) Inside the model folder, you can delete the .git directory to save some hard drive space.
Activate the environment: llama.cpp\.venv\Scripts\activate.bat (use llama.cpp\.venv\Scripts\Activate.ps1 if you're using PowerShell)
Convert your model to a GGUF: python llama.cpp\convert.py Stheno-L2-13B --outtype F32 --outfile Stheno-L2-13B.FP32.gguf
(Optional) Deactivate the environment: deactivate

Step 6 uses FP32 format, you can also use FP16 format by changing --outtype to F16. Remember to update the file name for --outfile too!

!!! info Speed up convert.py by adding --concurrency N to step 5 above. Replace N with the number of physical CPU cores in your system.

Quantizing Models

This converts models to quantized GGUF formats (Q8_0, Q6_K, Q6_K_M, etc.). For FP32 and FP16 see the previous section.

Make sure you installed the required tools and set up llama.cpp.
Convert the model to either FP16 or FP32 (either is fine). Follow Converting Models to GGUF.
Open a command prompt and move to our working folder: cd C:\working-dir
Quantize the model: llama.cpp\quantize.exe Stheno-L2-13B.FP32.gguf Stheno-L2-13B.Q8_0.gguf Q8_0

Obviously, step 4 needs to be customized to your conversion slightly. Change the FP32 to FP16 based on your conversion. Then change both of the Q8_0 items to the quantization format of your choice: Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, or Q2_K. Make sure to update the model names too.

!!! info Speed up quantize.exe by adding the number of physical CPU cores in your system to the end of step 4's command (after the quantization format).

Quantizing Special Cases

Ideally, we want to quantize our models with the default QKK setting as it's efficient. However, on rare occasions, you need to change the QKK setting to get it to work with some models. Basically, if you get an error when quantizing a model, try building quantize.exe with the LLAMA_QKK_64 flag set (e.g., make quantize -j LLAMA_QKK_64=1).

Converting LoRAs to GGUF

In order for LoRAs to work with llama.cpp (and its derivatives like koboldcpp), you need to convert them to GGUF format. If you trained a LoRA using llama.cpp, you don't need to do this as the LoRA is already in GGUF format.

Make sure you installed the required tools and set up llama.cpp.
Open a command prompt and move to our working folder: cd C:\working-dir
Download your LoRA, for example: git clone https://huggingface.co/Undi95/Storytelling-v2-13B-lora
(Optional) Inside the model folder, you can delete the .git directory to save some hard drive space.
Activate the environment: llama.cpp\.venv\Scripts\activate.bat (use llama.cpp\.venv\Scripts\Activate.ps1 if you're using PowerShell)
Convert your model: python llama.cpp\convert-lora-to-ggml.py Storytelling-v2-13B-lora
(Optional) Deactivate the environment: deactivate

Merging LoRAs into a Model

If you want to offload any layers to your GPU, you're going to want to merge your LoRA with the base model.

Make sure you installed the required tools and set up llama.cpp.
Before merging, you need a GGUF model and a GGUF LoRA. If you have non-GGUF files, convert your model to GGUF, and convert your LoRA to GGUF.
Open a command prompt and move to our working folder: cd C:\working-dir
Merge the model and LoRA: llama.cpp\export-lora.exe --model-base Stheno-L2-13B.Q8_0.gguf --model-out Stheno-L2-Storytelling-13B.Q8_0.gguf --lora-scaled Storytelling-v2-13B-lora\ggml-adapter-model.bin 1.0

Remember to update the file names to match your models and LoRAs. The 1.0 at the end specifies how strongly the LoRA should be applied, with 0.0 being not at all, 1.0 being 100%, 2.0 being 200%, and so on.

You can apply as many LoRAs as you want at once! Simply add more --lora-scaled path\to\lora.bin 1.0 commands for each LoRA you want to merge. The percentages do not need to add to 100%, they are applied individually.

!!! info Speed up export-lora.exe by adding --threads N to step 4 above. Replace N with the number of physical CPU cores in your system.

Q&A

Q: How long does it take to convert/quantize/merge a model? A: It's system dependent, but generally less than a minute. Sometimes seconds.

Q: Do I need a GPU for this?/Can I make these faster if I have a GPU? A: No, you don't need a GPU. Some operations might be slightly faster using one, but it's not really worth the extra setup effort.

Changelog

2023-09-24
- Moved from CPU LoRA guide & added LoRA merge instructions.

Downloads last month: 23

GGUF

Model size

9B params

Architecture

gemma2

Hardware compatibility

4-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Omarrran
/

gemma2_model_9B_q4_km