---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: text-generation
tags:
- mlx
- safetensors
- gemma4
- 4-bit
- quantized
- apple-silicon
- multimodal
- vision
- reasoning
- chain-of-thought
- opus
- claude-code
- sft
- fused
- turboquant
- kv-cache-compression
- long-context
- ravenx
- tool-calling
- function-calling
base_model: deadbydawn101/gemma-4-E4B-mlx-4bit
base_model_relation: finetune
language:
- en
datasets:
- Crownelius/Opus-4.6-Reasoning-2100x-formatted
---
<div align="center">

# Gemma 4 E4B · Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In

> **Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill. `<think>` reasoning baked in, no adapter needed. 10.5 GB.**

### Reasoning baked in. No adapter needed. Built by [RavenX AI](https://github.com/DeadByDawn101)

[TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) · [Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli) · [Gemma license](https://ai.google.dev/gemma/docs/gemma_4_license)

</div>

---

**Gemma 4 E4B with Opus Reasoning + Claude Code LoRA fused directly into the weights**: no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.

> **~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.**

This is [`gemma-4-E4B-mlx-4bit`](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) with the [Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) merged directly into the base weights using `mlx` weight arithmetic.
---

## What's different from the base model

| | Base model | This model |
|--|:--:|:--:|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | ✅ | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |

---
## 🧪 Live Demos: Try It Now

<div align="center">

| Space | What to try |
|---|---|
| 🔥 [**Agentic Tool Calling Demo**](https://huggingface.co/spaces/deadbydawn101/gemma4-agentic-tool-calling-demo) | Live agentic loop: tool calling, `<think>` reasoning, calculator, web search |
| 🐳 [**OpenClaw Sandbox Demo**](https://huggingface.co/spaces/deadbydawn101/openclaw-agent-sandbox-demo) | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |

</div>
## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed: the reasoning behavior is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# Produces <think>...</think> followed by a structured answer
```
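
Because the reasoning is emitted inline, downstream code often wants to separate the `<think>` block from the final answer. The sketch below continues from the quickstart above; the `split_reasoning` helper is illustrative, and it assumes the tags appear verbatim in the generated text, at most once, before the answer.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    # Illustrative helper: split generated text into (reasoning, answer),
    # assuming a single verbatim <think>...</think> block precedes the answer.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

thinking, answer = split_reasoning(response)
print(answer)
```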
### CLI

```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```

---
## 🧩 OpenHarness + OpenClaw + Hermes Agent

This model is built to sit inside a **real agent stack**, not just a chat box.

We support:

- **[OpenHarness](https://github.com/HKUDS/OpenHarness)** for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- **OpenClaw** for orchestration, sessions, reminders, and cross-agent routing
- **Hermes agent skill** for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution

### Why this combo matters

| Layer | Role |
|---|---|
| **Gemma 4 E4B Opus Reasoning + Claude Code** | reasoning + tool-use behavior baked into the weights |
| **Gemini CLI** | coding agent + tool orchestration |
| **OpenHarness** | harness runtime, tool loop, swarm, hooks, memory |
| **OpenClaw** | orchestration, sessions, skills, messaging |
| **Hermes skill** | agent behavior for concise, terminal-first execution |
### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
  --skill hermes-agent \
  -p "Review this repo, find bugs, patch them, and summarize the result"
```
### OpenClaw skill stack

Inside OpenClaw, pair this model with:

- `openharness` skill → run/configure `oh`
- `hermes-agent` skill → shape coding-agent behavior

That gives you a fully local Apple Silicon agent lane with:

- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration
## 💻 Gemini CLI: Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal: Google Search grounding, file read/write, shell execution, web fetching, and MCP server support, all wired to a 1M token context window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, the model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server → databases, APIs, custom tools |
| **Search grounding** | Google Search built in → model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli): Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX: 4.6x KV Cache Compression

Pair with [TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) to compress the KV cache and run 4.6x longer reasoning chains within the same memory budget:

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [v2.0 Release](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)
---

## How it was made

### Training data

| Source | Examples |
|--------|--------:|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| **Total** | **2,163** |

### Training

```
Base: deadbydawn101/gemma-4-E4B-mlx-4bit
Method: SFT completions-only (mlx_vlm.lora)
Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Final loss: ~3.5e-7
```
### Fusion

All **378 LoRA pairs** were merged via weight arithmetic:

```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```

The merged result was cast to bfloat16 and saved as 3-shard safetensors.
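
As a rough illustration of that arithmetic, here is a minimal sketch of merging a single LoRA pair into one quantized base matrix with `mlx`. The function name `fuse_lora_pair`, the `[in, rank] @ [rank, out]` adapter layout, and the 4-bit/group-size-64 quantization parameters are assumptions for illustration; the actual fusion script may differ.

```python
import mlx.core as mx

def fuse_lora_pair(w_q, scales, biases, lora_a, lora_b,
                   alpha=16, rank=8, group_size=64, bits=4):
    # Hypothetical helper illustrating
    #   W_merged = dequantize(W_base) + (A @ B).T * (alpha / rank)
    w_base = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)
    delta = (lora_a @ lora_b).T * (alpha / rank)   # low-rank update, scaled by alpha/rank
    return (w_base + delta).astype(mx.bfloat16)    # keep the merged weight in bfloat16
```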
---

## 📦 Ollama / LM Studio / llama.cpp

> **This is an MLX model optimized for Apple Silicon.** For Ollama, LM Studio, or llama.cpp, use the GGUF version:
>
> **[gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF)**
>
> Available in Q4_K_M (2.7 GB), Q5_K_M (3.1 GB), Q8_0 (4.5 GB), and F16 (8.3 GB).
>
> ```bash
> ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
> ```
### Run with mlx_lm server (native, faster on Apple Silicon)

```bash
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
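
The same endpoint can also be called from Python with the official `openai` client, including an OpenAI-style `tools` array for function calling. This is a sketch rather than the project's own example: it assumes `pip install openai`, the server started as above on port 8080, and an `mlx_lm` version that forwards `tools` to the chat template; the `get_weather` tool is hypothetical and for illustration only.

```python
from openai import OpenAI

# Point the client at the local mlx_lm server; no real API key is needed
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit",
    messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
    tools=tools,
    max_tokens=512,
)
print(resp.choices[0].message)
```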
## Related models

| Model | Size | Notes |
|-------|------|-------|
| [gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Base model (4-bit, use with adapter) |
| **gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit** | ~10.5 GB | **This model**: fused, no adapter needed |
| [**GGUF version**](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF) | 2.7-8.3 GB | Ollama, LM Studio, llama.cpp |
| [gemma-4-E4B-opus-reasoning-claude-code-lora](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) | 658 MB | Adapter-only |
| [gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated |
| [gemma-4-21b-REAP-Tool-Calling-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit) | 12 GB | 21B MoE REAP |
---

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license)

---

<div align="center">

Built with 🤗 by <a href="https://github.com/DeadByDawn101">RavenX AI</a> · <a href="https://github.com/DeadByDawn101/turboquant-mlx">TurboQuant-MLX</a> · <a href="https://github.com/DeadByDawn101/gemini-cli">Gemini CLI</a>

</div>
## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) via PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

Apply a **10.7x KV memory reduction** and a **2.5x throughput gain** on top of this model's built-in 4-bit TurboQuant quantization, for roughly 50x combined compression versus full fp16:

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```