Instructions to use rahul7star/gemma-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use rahul7star/gemma-gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="rahul7star/gemma-gguf", filename="Qwopus3.5-9B-v3-abliterated-apex-i-quality.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use rahul7star/gemma-gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf rahul7star/gemma-gguf:Q4_0 # Run inference directly in the terminal: llama-cli -hf rahul7star/gemma-gguf:Q4_0
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf rahul7star/gemma-gguf:Q4_0 # Run inference directly in the terminal: llama-cli -hf rahul7star/gemma-gguf:Q4_0
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf rahul7star/gemma-gguf:Q4_0 # Run inference directly in the terminal: ./llama-cli -hf rahul7star/gemma-gguf:Q4_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf rahul7star/gemma-gguf:Q4_0 # Run inference directly in the terminal: ./build/bin/llama-cli -hf rahul7star/gemma-gguf:Q4_0
Use Docker
docker model run hf.co/rahul7star/gemma-gguf:Q4_0
- LM Studio
- Jan
- Ollama
How to use rahul7star/gemma-gguf with Ollama:
ollama run hf.co/rahul7star/gemma-gguf:Q4_0
- Unsloth Studio
How to use rahul7star/gemma-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rahul7star/gemma-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rahul7star/gemma-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for rahul7star/gemma-gguf to start chatting
- Pi
How to use rahul7star/gemma-gguf with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf rahul7star/gemma-gguf:Q4_0
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "rahul7star/gemma-gguf:Q4_0" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use rahul7star/gemma-gguf with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf rahul7star/gemma-gguf:Q4_0
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default rahul7star/gemma-gguf:Q4_0
Run Hermes
hermes
- Docker Model Runner
How to use rahul7star/gemma-gguf with Docker Model Runner:
docker model run hf.co/rahul7star/gemma-gguf:Q4_0
- Lemonade
How to use rahul7star/gemma-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull rahul7star/gemma-gguf:Q4_0
Run and chat with the model
lemonade run user.gemma-gguf-Q4_0
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rahul7star/gemma-gguf:Q4_0# Run inference directly in the terminal:
llama-cli -hf rahul7star/gemma-gguf:Q4_0Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf rahul7star/gemma-gguf:Q4_0# Run inference directly in the terminal:
./llama-cli -hf rahul7star/gemma-gguf:Q4_0Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf rahul7star/gemma-gguf:Q4_0# Run inference directly in the terminal:
./build/bin/llama-cli -hf rahul7star/gemma-gguf:Q4_0Use Docker
docker model run hf.co/rahul7star/gemma-gguf:Q4_0model-apex-i-quality.gguf => is gemma4 apex
model-gemma4-heretic-apexi-quality.gguf => Gemma4 Fast Heretic Model
DEMO
https://huggingface.co/spaces/rahul7star/apex-gguf
Gemma 12 B model
google/gemma-4-12B-it => model-i-gemma-e4-12b-quality.gguf
DEMO https://huggingface.co/spaces/rahul7star/Gguf-gradio
Gemma 12B Heretic
model-gemma-4-12B-it-qat-q4_0-unquantized-heretic-i-quality.gguf
Llama-3.1-Nemotron-Nano-4B-v1.1-heretic
model-iLlama-3.1-Nemotron-Nano-4B-v1.1-heretic-quality.gguf
Knowledge base
understading 46 layers of GEMMA
Gemma 4 12B model:
### What are the layers?
This line:
```text
gemma4.block_count = 48
means the model has 48 transformer blocks (layers).
Think of them as a pipeline:
Input text
β
Layer 1
β
Layer 2
β
...
β
Layer 48
β
Output probabilities
Every token always passes through all 48 layers during inference (unless you explicitly use layer skipping techniques, which GGUF normally doesn't).
What does each layer do?
Very roughly:
| Layer Range | Typical Role |
|---|---|
| 1-10 | Basic language patterns, spelling, grammar |
| 11-20 | Word relationships, syntax |
| 21-35 | Facts, concepts, reasoning |
| 36-48 | Final prediction and response shaping |
This isn't a hard rule, but it's a useful mental model.
What does temperature do?
Temperature only affects the final token selection after all 48 layers have finished computing.
The model calculates something like:
Token Score
----------------
"cat" 0.60
"dog" 0.25
"bird" 0.10
"banana" 0.05
Then temperature modifies those probabilities.
Temperature = 0.2
Very deterministic:
cat 95%
dog 4%
bird 1%
Almost always picks "cat".
Temperature = 1.0
Normal:
cat 60%
dog 25%
bird 10%
banana 5%
Balanced.
Temperature = 2.0
More random:
cat 35%
dog 28%
bird 22%
banana 15%
Much more variety.
So what are these values?
Your GGUF metadata says:
general.sampling.top_k = 64
general.sampling.top_p = 0.95
general.sampling.temp = 1.0
These are just recommended default sampling settings stored inside the GGUF.
They mean:
- temp=1.0 β normal randomness
- top_k=64 β only consider best 64 candidate tokens
- top_p=0.95 β consider tokens whose cumulative probability reaches 95%
Many frontends ignore these and use their own settings.
What about Q6_K?
You mentioned:
Q6_K
That's not a layer either.
It means the weights are quantized to roughly 6 bits per weight.
Typical quality ladder:
Q2_K = very small, lower quality
Q3_K
Q4_K
Q5_K
Q6_K β very high quality
Q8_0
F16 = full precision
For a 12B model:
F16 β 24 GB
Q8 β 13 GB
Q6_K β 10 GB
Q5_K β 8 GB
Q4_K β 7 GB
Q6_K is usually very close to F16 quality.
Interesting Gemma-specific settings
Your model has:
context_length = 262144
That's 262K context, which is huge.
And:
attention.head_count = 16
key_length = 512
value_length = 512
Meaning each of the 48 layers contains a multi-head attention system with 16 attention heads processing information in parallel.
A simplified picture:
48 Layers
ββββββββββββββββ
β Layer 1 β β 16 attention heads
β Layer 2 β β 16 attention heads
β Layer 3 β β 16 attention heads
β ... β
β Layer 48 β β 16 attention heads
ββββββββββββββββ
Total attention computations are happening across all layers every token generation step.
So:
- 48 layers = model depth
- 16 heads per layer = parallel attention mechanisms
- Temperature = randomness of token selection
- Top-k / Top-p = filtering candidate tokens
- Q6_K = quantization level
- Temperature does NOT change which layers are used; all 48 layers run regardless of temperature.
- Downloads last month
- 22,639
4-bit
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf rahul7star/gemma-gguf:Q4_0# Run inference directly in the terminal: llama-cli -hf rahul7star/gemma-gguf:Q4_0