Instructions to use Omarrran/gemma2_model_9B_q4_km with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use Omarrran/gemma2_model_9B_q4_km with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="Omarrran/gemma2_model_9B_q4_km", filename="unsloth.F16.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Omarrran/gemma2_model_9B_q4_km with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16 # Run inference directly in the terminal: llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16 # Run inference directly in the terminal: llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16 # Run inference directly in the terminal: ./llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16
Use Docker
docker model run hf.co/Omarrran/gemma2_model_9B_q4_km:F16
- LM Studio
- Jan
- Ollama
How to use Omarrran/gemma2_model_9B_q4_km with Ollama:
ollama run hf.co/Omarrran/gemma2_model_9B_q4_km:F16
- Unsloth Studio
How to use Omarrran/gemma2_model_9B_q4_km with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Omarrran/gemma2_model_9B_q4_km to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Omarrran/gemma2_model_9B_q4_km to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Omarrran/gemma2_model_9B_q4_km to start chatting
- Docker Model Runner
How to use Omarrran/gemma2_model_9B_q4_km with Docker Model Runner:
docker model run hf.co/Omarrran/gemma2_model_9B_q4_km:F16
- Lemonade
How to use Omarrran/gemma2_model_9B_q4_km with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull Omarrran/gemma2_model_9B_q4_km:F16
Run and chat with the model
lemonade run user.gemma2_model_9B_q4_km-F16
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16# Run inference directly in the terminal:
llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16# Run inference directly in the terminal:
./llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16# Run inference directly in the terminal:
./build/bin/llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16Use Docker
docker model run hf.co/Omarrran/gemma2_model_9B_q4_km:F16To use this model directly import the following
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="Omarrran/gemma2_model_9B_q4_km",
filename="unsloth.Q4_K_M.gguf",
)
To know more about how to use llama_cpp Inferenece mode see here : Link to llama_cpp
llama.cpp Conversion, Quantization, & Merging
This guide was originally part of the CPU LoRA guide. Check it out if you'd like to train a LoRA using your CPU!
To keep things simple, I recommend creating a single folder somewhere on your system to work out of. For example, C:\working-dir. I'll use this path in the examples below. If you use a different path, just make sure to adjust the commands.
Table of Contents
- Required Tools
- Setup
- Quantizing Models
- Converting Models to GGUF
- Quantizing Models
- Converting LoRAs to GGUF
- Merging LoRAs into a Model
- Q&A
- Changelog
Required Tools
- Install git so you can download and update llama.cpp easily.
- Install Python (3.10 or newer, use the 64-bit installer)
- Download the latest w64devkit (you want the file named
w64devkit-fortran-x.xx.x.zip) - Unzip it, move the files to
C:\working-dir\w64devkit
Setup
!!! info Once llama.cpp has been compiled, you don't need to repeat any of these steps unless you update to a newer version of llama.cpp. If you are updating, skip the first 2 steps.
- Open a command prompt and move to our working folder:
cd C:\working-dir - Download llama.cpp using git:
git clone https://github.com/ggerganov/llama.cpp.git - Move into the llama.cpp directory:
cd llama.cpp - If there is an existing
.venvdirectory, delete it:rmdir .venv /s /q - Create a python virtual environment:
python -m venv .venv - Activate the environment:
.venv\Scripts\activate.bat(use.venv\Scripts\Activate.ps1if you're using PowerShell) - Install the required python modules to the environment:
pip install -r requirements.txt - Install pytorch:
pip install torch - (Optional) Deactivate the environment:
deactivate - Move up one directory:
cd .. - Run the compiler tools:
w64devkit\w64devkit.exe - Once you see
~ $, move to the llama.cpp repo:cd C:/working-dir/llama.cpp(Make sure to use forward slashes!) - Delete any leftover files:
make clean - Compile everything:
make all -j - Once it's done compiling, close the compiler tools:
exit
Updating
- Open a command prompt and move to our llama.cpp folder:
cd C:\working-dir\llama.cpp - Download the updates:
git pull - Move up one directory:
cd .. - Re-run the setup instructions above. Skip the first 2 steps related to downloading llama.cpp.
Converting Models to GGUF
This converts models to the GGUF format (FP32 or FP16). For quantized models, see the next section.
- Make sure you installed the required tools and set up llama.cpp.
- Open a command prompt and move to our working folder:
cd C:\working-dir - Download your base model using git, for example:
git clone https://huggingface.co/Sao10K/Stheno-L2-13B - (Optional) Inside the model folder, you can delete the
.gitdirectory to save some hard drive space. - Activate the environment:
llama.cpp\.venv\Scripts\activate.bat(usellama.cpp\.venv\Scripts\Activate.ps1if you're using PowerShell) - Convert your model to a GGUF:
python llama.cpp\convert.py Stheno-L2-13B --outtype F32 --outfile Stheno-L2-13B.FP32.gguf - (Optional) Deactivate the environment:
deactivate
Step 6 uses FP32 format, you can also use FP16 format by changing --outtype to F16. Remember to update the file name for --outfile too!
!!! info Speed up convert.py by adding --concurrency N to step 5 above. Replace N with the number of physical CPU cores in your system.
Quantizing Models
This converts models to quantized GGUF formats (Q8_0, Q6_K, Q6_K_M, etc.). For FP32 and FP16 see the previous section.
- Make sure you installed the required tools and set up llama.cpp.
- Convert the model to either FP16 or FP32 (either is fine). Follow Converting Models to GGUF.
- Open a command prompt and move to our working folder:
cd C:\working-dir - Quantize the model:
llama.cpp\quantize.exe Stheno-L2-13B.FP32.gguf Stheno-L2-13B.Q8_0.gguf Q8_0
Obviously, step 4 needs to be customized to your conversion slightly. Change the FP32 to FP16 based on your conversion. Then change both of the Q8_0 items to the quantization format of your choice: Q8_0, Q6_K, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, Q3_K_L, Q3_K_M, Q3_K_S, or Q2_K. Make sure to update the model names too.
!!! info Speed up quantize.exe by adding the number of physical CPU cores in your system to the end of step 4's command (after the quantization format).
Quantizing Special Cases
Ideally, we want to quantize our models with the default QKK setting as it's efficient. However, on rare occasions, you need to change the QKK setting to get it to work with some models. Basically, if you get an error when quantizing a model, try building quantize.exe with the LLAMA_QKK_64 flag set (e.g., make quantize -j LLAMA_QKK_64=1).
Converting LoRAs to GGUF
In order for LoRAs to work with llama.cpp (and its derivatives like koboldcpp), you need to convert them to GGUF format. If you trained a LoRA using llama.cpp, you don't need to do this as the LoRA is already in GGUF format.
- Make sure you installed the required tools and set up llama.cpp.
- Open a command prompt and move to our working folder:
cd C:\working-dir - Download your LoRA, for example:
git clone https://huggingface.co/Undi95/Storytelling-v2-13B-lora - (Optional) Inside the model folder, you can delete the
.gitdirectory to save some hard drive space. - Activate the environment:
llama.cpp\.venv\Scripts\activate.bat(usellama.cpp\.venv\Scripts\Activate.ps1if you're using PowerShell) - Convert your model:
python llama.cpp\convert-lora-to-ggml.py Storytelling-v2-13B-lora - (Optional) Deactivate the environment:
deactivate
Merging LoRAs into a Model
If you want to offload any layers to your GPU, you're going to want to merge your LoRA with the base model.
- Make sure you installed the required tools and set up llama.cpp.
- Before merging, you need a GGUF model and a GGUF LoRA. If you have non-GGUF files, convert your model to GGUF, and convert your LoRA to GGUF.
- Open a command prompt and move to our working folder:
cd C:\working-dir - Merge the model and LoRA:
llama.cpp\export-lora.exe --model-base Stheno-L2-13B.Q8_0.gguf --model-out Stheno-L2-Storytelling-13B.Q8_0.gguf --lora-scaled Storytelling-v2-13B-lora\ggml-adapter-model.bin 1.0
Remember to update the file names to match your models and LoRAs. The 1.0 at the end specifies how strongly the LoRA should be applied, with 0.0 being not at all, 1.0 being 100%, 2.0 being 200%, and so on.
You can apply as many LoRAs as you want at once! Simply add more --lora-scaled path\to\lora.bin 1.0 commands for each LoRA you want to merge. The percentages do not need to add to 100%, they are applied individually.
!!! info Speed up export-lora.exe by adding --threads N to step 4 above. Replace N with the number of physical CPU cores in your system.
Q&A
Q: How long does it take to convert/quantize/merge a model? A: It's system dependent, but generally less than a minute. Sometimes seconds.
Q: Do I need a GPU for this?/Can I make these faster if I have a GPU? A: No, you don't need a GPU. Some operations might be slightly faster using one, but it's not really worth the extra setup effort.
Changelog
- 2023-09-24
- Moved from CPU LoRA guide & added LoRA merge instructions.
- Downloads last month
- 23
4-bit
16-bit
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf Omarrran/gemma2_model_9B_q4_km:F16# Run inference directly in the terminal: llama-cli -hf Omarrran/gemma2_model_9B_q4_km:F16