---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: text-generation
tags:
- mlx
- safetensors
- gemma4
- 4-bit
- quantized
- apple-silicon
- multimodal
- vision
- reasoning
- chain-of-thought
- opus
- claude-code
- sft
- fused
- turboquant
- kv-cache-compression
- long-context
- ravenx
- tool-calling
- function-calling
base_model: deadbydawn101/gemma-4-E4B-mlx-4bit
base_model_relation: finetune
language:
- en
datasets:
- Crownelius/Opus-4.6-Reasoning-2100x-formatted
---
<div align="center">
# Gemma 4 E4B: Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In
> **Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill. `<think>` reasoning baked in, no adapter needed. 10.5 GB.**
### Reasoning baked in. No adapter needed. Built by [RavenX AI](https://github.com/DeadByDawn101)
[TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx)
[Gemini CLI](https://github.com/DeadByDawn101/gemini-cli)
[License: Gemma](https://ai.google.dev/gemma/docs/gemma_4_license)
</div>
---
**Gemma 4 E4B with Opus Reasoning + Claude Code LoRA fused directly into the weights.** No adapter and no extra memory: just load and run with Claude-style `<think>` reasoning baked in.
> **~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.**
This is [`gemma-4-E4B-mlx-4bit`](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) with the [Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) merged directly into the base weights using `mlx` weight arithmetic.
---
## What's different from the base model
| | Base model | This model |
|--|:--:|:--:|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | ❌ | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |
---
## 🧪 Live Demos: Try It Now
<div align="center">
| Space | What to try |
|---|---|
| 🔥 [**Agentic Tool Calling Demo**](https://huggingface.co/spaces/deadbydawn101/gemma4-agentic-tool-calling-demo) | Live agentic loop: tool calling, `<think>` reasoning, calculator, web search |
| 🐳 [**OpenClaw Sandbox Demo**](https://huggingface.co/spaces/deadbydawn101/openclaw-agent-sandbox-demo) | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |
</div>
## Quickstart
```bash
pip install mlx-lm mlx-vlm
```
```python
from mlx_lm import load, generate
# No adapter_path needed: reasoning is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# → Will produce <think>...</think> followed by a structured answer
```
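Since the output mixes a `<think>...</think>` block with the final answer, downstream code often wants the two separated. The helper below is a minimal sketch of one way to do that; `split_reasoning` is a hypothetical name, and it assumes the model emits at most one well-formed `<think>` pair, which may not hold for every generation.

```python
import re

# Assumption: reasoning is wrapped in a single <think>...</think> pair,
# as described in this card. Adjust the pattern if your outputs differ.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>Factoring large semiprimes is slow.</think>RSA relies on factoring being hard."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

In the quickstart above, the same helper could be applied to `response` to log the reasoning separately from the text shown to a user.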
### CLI
```bash
mlx_lm.generate \
--model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
--prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
--max-tokens 1024
```
---
## 🧩 OpenHarness + OpenClaw + Hermes Agent
This model is built to sit inside a **real agent stack**, not just a chat box.
We support:
- **[OpenHarness](https://github.com/HKUDS/OpenHarness)** for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- **OpenClaw** for orchestration, sessions, reminders, and cross-agent routing
- **Hermes agent skill** for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution
### Why this combo matters
| Layer | Role |
|---|---|
| **Gemma 4 E4B Opus Reasoning + Claude Code** | reasoning + tool-use behavior baked into the weights |
| **Gemini CLI** | coding agent + tool orchestration |
| **OpenHarness** | harness runtime, tool loop, swarm, hooks, memory |
| **OpenClaw** | orchestration, sessions, skills, messaging |
| **Hermes skill** | agent behavior for concise, terminal-first execution |
### OpenHarness quickstart
```bash
pip install openharness
mlx_lm.server \
--model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
--port 8080
oh --model http://localhost:8080/v1 \
--skill hermes-agent \
-p "Review this repo, find bugs, patch them, and summarize the result"
```
### OpenClaw skill stack
Inside OpenClaw, pair this model with:
- `openharness` skill: run/configure `oh`
- `hermes-agent` skill: shape coding-agent behavior
That gives you a fully local Apple Silicon agent lane with:
- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration
## 💻 Gemini CLI: Coding Agent + Tool Orchestration
We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal (Google Search grounding, file read/write, shell execution, web fetching, and MCP server support), all wired to a 1M token context window. The 1M figure is the CLI's context window; the local model itself is listed at 131K context above.
```bash
# Install
npm install -g @google/gemini-cli
# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080
# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server: databases, APIs, custom tools |
| **Search grounding** | Google Search built in, so the model gets live data |
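To make the "Tool calling" row concrete, here is a minimal sketch of the harness side of one tool-call step. It assumes the model's tool request has already been parsed into an OpenAI-style dict with a `name` and a JSON-string `arguments` field; the `TOOLS` registry and the `add` tool are hypothetical and are not how Gemini CLI itself is implemented.

```python
import json


def add(a: float, b: float) -> str:
    """Hypothetical tool: return the sum of two numbers as text."""
    return str(a + b)


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"add": add}


def dispatch(tool_call: dict) -> dict:
    """Run one parsed tool call and wrap the result as a tool message."""
    name = tool_call["name"]
    args = json.loads(tool_call["arguments"])
    if name not in TOOLS:
        content = f"error: unknown tool {name!r}"
    else:
        content = TOOLS[name](**args)
    # The returned message is appended to the chat history for the next turn.
    return {"role": "tool", "name": name, "content": content}


call = {"name": "add", "arguments": json.dumps({"a": 2, "b": 3})}
print(dispatch(call))
```

A real harness would loop: send messages, parse any tool call from the reply, dispatch it, append the tool message, and repeat until the model answers without requesting a tool.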
```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
"Review all Python files in ./src, find potential bugs, and suggest fixes"
# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli): Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX: 4.6x KV Cache Compression
Pair with [TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) to compress the KV cache and run 4.6x longer reasoning chains at the same memory:
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module
cache_module.make_prompt_cache = lambda model, **kw: [
TurboQuantKVCache() for _ in range(len(model.layers))
]
from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```
→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [v2.0 Release](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)
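As a rough illustration of what a 4.6x KV compression ratio means for sequence length, the sketch below estimates how many tokens fit in a fixed cache budget. It assumes KV memory grows linearly with token count; the 64 KiB-per-token figure and the 2 GiB budget are hypothetical placeholders, not measured values for this model.

```python
def tokens_in_budget(budget_bytes: int, bytes_per_token: int, compression: float = 1.0) -> int:
    """Tokens whose KV cache fits in budget_bytes, assuming linear growth per token."""
    return int(budget_bytes * compression // bytes_per_token)


BUDGET = 2 * 1024**3        # hypothetical 2 GiB cache budget
BYTES_PER_TOKEN = 64 * 1024  # hypothetical uncompressed KV bytes per token

baseline = tokens_in_budget(BUDGET, BYTES_PER_TOKEN)
compressed = tokens_in_budget(BUDGET, BYTES_PER_TOKEN, compression=4.6)
print(f"baseline: {baseline} tokens, with 4.6x compression: {compressed} tokens")
```

Swap in measured per-token cache sizes for your own configuration to get a realistic estimate.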
---
## How it was made
### Training data
| Source | Examples |
|--------|--------:|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| **Total** | **2,163** |
### Training
```
Base: deadbydawn101/gemma-4-E4B-mlx-4bit
Method: SFT completions-only (mlx_vlm.lora)
Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Final loss: ~3.5e-7
```
### Fusion
All **378 LoRA pairs** merged via weight arithmetic:
```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```
Result dequantized to bfloat16 and saved as 3-shard safetensors.
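The merge formula can be checked on a toy example. The sketch below is plain Python rather than the actual `mlx` code used for this model; it assumes `A` has shape (in, rank), `B` has shape (rank, out), and `W` has shape (out, in), and it skips the dequantization step by starting from a float matrix. The helper names are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Swap rows and columns."""
    return [list(col) for col in zip(*x)]


def merge_lora(w, a, b, alpha, rank):
    """Return W + (A @ B).T * (alpha / rank) for one LoRA pair."""
    scale = alpha / rank
    delta = transpose(matmul(a, b))  # shape (out, in), same as W
    return [
        [w_val + scale * d_val for w_val, d_val in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy sizes: in=2, out=3, rank=1; alpha=2 gives the same 2.0 scale as 16 / 8.
W = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
A = [[1.0], [2.0]]
B = [[1.0, 0.0, -1.0]]
print(merge_lora(W, A, B, alpha=2, rank=1))
```

With the training settings listed above (rank 8, alpha 16) the scale factor is 16 / 8 = 2.0, so each low-rank update is doubled before being added to the base weight.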
---
## 🦙 Ollama / LM Studio / llama.cpp
> **This is an MLX model optimized for Apple Silicon.** For Ollama, LM Studio, or llama.cpp, use the GGUF version:
>
> 👉 **[gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF)**
>
> Available in Q4_K_M (2.7 GB), Q5_K_M (3.1 GB), Q8_0 (4.5 GB), and F16 (8.3 GB).
>
> ```bash
> ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
> ```
### Run with mlx_lm server (native, faster on Apple Silicon)
```bash
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
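The same request can be made from Python with only the standard library. This is a minimal sketch that assumes the server started above is listening on port 8080 and returns the usual OpenAI-style `choices[0].message.content` field; `build_chat_request` and `chat` are hypothetical helper names, and the `max_tokens` default is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"
MODEL = "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit"


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; only chat() touches the network.
request = build_chat_request("Hello!")
print(request.full_url)
```

With the server running, `print(chat("Hello!"))` sends the request and prints the model's reply.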
## Related models
| Model | Size | Notes |
|-------|------|-------|
| [gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Base model (4-bit, use with adapter) |
| **gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit** | ~10.5 GB | **This model**: fused, no adapter needed |
| [**GGUF version**](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF) | 2.7-8.3 GB | Ollama, LM Studio, llama.cpp |
| [gemma-4-E4B-opus-reasoning-claude-code-lora](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) | 658 MB | Adapter-only |
| [gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated |
| [gemma-4-21b-REAP-Tool-Calling-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit) | 12 GB | 21B MoE REAP |
---
## License
[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license)
---
<div align="center">
Built with 🖤 by <a href="https://github.com/DeadByDawn101">RavenX AI</a> · <a href="https://github.com/DeadByDawn101/turboquant-mlx">TurboQuant-MLX</a> · <a href="https://github.com/DeadByDawn101/gemini-cli">Gemini CLI</a>
</div>
## TriAttention KV Compression
> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA), PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**
Apply **10.7x KV memory reduction** and **2.5x throughput** on top of this model's built-in 4-bit TurboQuant quantization for ~50x combined compression vs full fp16:
```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
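One way to read the ~50x figure is as the product of the two KV compression ratios quoted in this card. The arithmetic below is only a sketch under that assumption (that the 10.7x TriAttention reduction and the 4.6x TurboQuant ratio are independent and multiply); the card does not state how the combined number was derived.

```python
def combined_ratio(*factors: float) -> float:
    """Multiply independent compression ratios into one overall ratio."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


# Assumption: the two ratios quoted in this card compose multiplicatively.
overall = combined_ratio(10.7, 4.6)
print(f"combined KV compression: about {overall:.1f}x")
```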
## RavenX Inference Harness
One-command inference, benchmarking, and local OpenAI-compatible server:
```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness
# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"
# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048
# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```