---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: text-generation
tags:
  - mlx
  - safetensors
  - gemma4
  - 4-bit
  - quantized
  - apple-silicon
  - multimodal
  - vision
  - reasoning
  - chain-of-thought
  - opus
  - claude-code
  - sft
  - fused
  - turboquant
  - kv-cache-compression
  - long-context
  - ravenx
  - tool-calling
  - function-calling
base_model: deadbydawn101/gemma-4-E4B-mlx-4bit
base_model_relation: finetune
language:
  - en
datasets:
  - Crownelius/Opus-4.6-Reasoning-2100x-formatted
---

<div align="center">

# Gemma 4 E4B — Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In

> **Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill. `<think>` reasoning baked in — no adapter needed. 10.5 GB.**

### Reasoning baked in. No adapter needed. Built by [RavenX AI](https://github.com/DeadByDawn101)

[![TurboQuant](https://img.shields.io/badge/TurboQuant--MLX-4.6x_KV_compression-blueviolet)](https://github.com/DeadByDawn101/turboquant-mlx)
[![Gemini CLI](https://img.shields.io/badge/Gemini_CLI-MCP_compatible-blue)](https://github.com/DeadByDawn101/gemini-cli)
[![License](https://img.shields.io/badge/license-Gemma-green)](https://ai.google.dev/gemma/docs/gemma_4_license)

</div>

---

**Gemma 4 E4B with Opus Reasoning + Claude Code LoRA fused directly into the weights** — no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.

> **~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.**

This is [`gemma-4-E4B-mlx-4bit`](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) with the [Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) merged directly into the base weights using `mlx` weight arithmetic.

---

## What's different from the base model

| | Base model | This model |
|--|:--:|:--:|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | — | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |

---


## 🧪 Live Demos — Try It Now

<div align="center">

| Space | What to try |
|---|---|
| 🔥 [**Agentic Tool Calling Demo**](https://huggingface.co/spaces/deadbydawn101/gemma4-agentic-tool-calling-demo) | Live agentic loop — tool calling, `<think>` reasoning, calculator, web search |
| 🐳 [**OpenClaw Sandbox Demo**](https://huggingface.co/spaces/deadbydawn101/openclaw-agent-sandbox-demo) | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |

</div>

## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed — reasoning is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# Expected: a <think>...</think> block followed by a structured answer
```
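
If you want to show or log the reasoning separately from the final answer, the response can be split on the `<think>` tags. The snippet below is a minimal sketch, assuming the output contains at most one `<think>...</think>` pair; `split_reasoning` is a hypothetical helper, not part of `mlx_lm`.

```python
import re

# Assumption: reasoning is wrapped in a single <think>...</think> pair,
# as this model card describes. Real outputs may vary.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think> block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>Factoring large semiprimes is slow.</think>RSA relies on that hardness."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

In the quickstart above, you would pass `response` instead of `sample`.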

### CLI
```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```

---



## 🧩 OpenHarness + OpenClaw + Hermes Agent

This model is built to sit inside a **real agent stack**, not just a chat box.

We support:
- **[OpenHarness](https://github.com/HKUDS/OpenHarness)** for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- **OpenClaw** for orchestration, sessions, reminders, and cross-agent routing
- **Hermes agent skill** for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution

### Why this combo matters

| Layer | Role |
|---|---|
| **Gemma 4 E4B Opus Reasoning + Claude Code** | reasoning + tool-use behavior baked into the weights |
| **Gemini CLI** | coding agent + tool orchestration |
| **OpenHarness** | harness runtime, tool loop, swarm, hooks, memory |
| **OpenClaw** | orchestration, sessions, skills, messaging |
| **Hermes skill** | agent behavior for concise, terminal-first execution |

### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
   --skill hermes-agent \
   -p "Review this repo, find bugs, patch them, and summarize the result"
```

### OpenClaw skill stack

Inside OpenClaw, pair this model with:
- `openharness` skill — run/configure `oh`
- `hermes-agent` skill — shape coding-agent behavior

That gives you a fully local Apple Silicon agent lane with:
- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration

## 💻 Gemini CLI — Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

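Gemini CLI gives you a full agentic loop in the terminal: Google Search grounding, file read/write, shell execution, web fetching, and MCP server support. The 1M token context window applies when the CLI talks to the Gemini API; against this local model, the limit is the model's own 131K context.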

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```

### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server — databases, APIs, custom tools |
| **Search grounding** | Google Search built in — model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
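
The "call tools" step in that loop boils down to mapping a tool call emitted by the model onto a local function and feeding the result back. Here is a minimal sketch of that dispatch step, assuming the tool call has already been parsed into JSON with `name` and `arguments` fields. The `TOOLS` registry, the `dispatch` helper, and the JSON shape are all hypothetical illustrations, not Gemini CLI's actual internals or this model's native tool-token format.

```python
import json

# Hypothetical tool registry: tool name -> plain Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "word_count": lambda text: len(text.split()),
}


def dispatch(tool_call_json: str) -> str:
    """Run one tool call and return a JSON string to hand back to the model."""
    call = json.loads(tool_call_json)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps({"result": TOOLS[name](**args)})
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"error": str(exc)})


print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(dispatch('{"name": "word_count", "arguments": {"text": "read files then reason"}}'))
```

Returning errors as data rather than raising lets the model see the failure and retry with corrected arguments.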

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli) — Apache 2.0, free tier, MCP-compatible

## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression

Pair with [TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) to compress the KV cache and fit up to 4.6x longer reasoning chains in the same KV cache memory:

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```
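
To get a feel for what a 4.6x KV compression factor buys, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not this model's real configuration, and the formula assumes a plain full-attention fp16 cache, so treat the numbers as illustrative only.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size: keys plus values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def tokens_in_budget(budget_bytes, compression=1.0, **dims):
    """How many tokens fit in a memory budget at a given compression factor."""
    per_token = kv_cache_bytes(1, **dims) / compression
    return int(budget_bytes // per_token)


# Hypothetical dimensions for illustration only.
dims = dict(layers=32, kv_heads=8, head_dim=128)
budget = 4 * 1024**3  # 4 GiB reserved for the KV cache

baseline = tokens_in_budget(budget, **dims)
compressed = tokens_in_budget(budget, compression=4.6, **dims)
print(f"fp16 cache:      {baseline:,} tokens")
print(f"4.6x compressed: {compressed:,} tokens")
```

Swap in the values from the model's `config.json` to estimate your own budget.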

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [v2.0 Release](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)

---

## How it was made

### Training data
| Source | Examples |
|--------|--------:|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| **Total** | **2,163** |

### Training
```
Base:      deadbydawn101/gemma-4-E4B-mlx-4bit
Method:    SFT completions-only (mlx_vlm.lora)
Rank:      8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware:  Apple M4 Max 128GB · Peak mem: 7.876 GB

Final loss: ~3.5e-7
```

### Fusion
All **378 LoRA pairs** merged via weight arithmetic:
```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```
The merged weights are stored in bfloat16 and saved as three safetensors shards.
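
The merge formula can be checked on a toy example. The sketch below uses plain Python lists so it runs anywhere; it assumes `W_base` is already dequantized, with shapes `(out, in)` for the base weight, `(in, rank)` for `A`, and `(rank, out)` for `B`. The helper names are illustrative, and the real fusion operates on `mlx` arrays.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def fuse_lora(w_base, lora_a, lora_b, alpha, rank):
    """Compute W_merged = W_base + (A @ B).T * (alpha / rank)."""
    scale = alpha / rank
    delta = transpose(matmul(lora_a, lora_b))  # shape (out, in), same as w_base
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]


# Toy shapes: out=2, in=3, rank=1. alpha/rank = 2, matching 16 / 8 from training.
w_base = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
lora_a = [[1.0], [2.0], [3.0]]  # (in=3, rank=1)
lora_b = [[0.5, -0.5]]          # (rank=1, out=2)

w_merged = fuse_lora(w_base, lora_a, lora_b, alpha=2, rank=1)
print(w_merged)  # [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

Once merged, the adapter matrices are no longer needed at inference time, which is why the fused model loads without `adapter_path`.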

---


## 🦙 Ollama / LM Studio / llama.cpp

> **This is an MLX model optimized for Apple Silicon.** For Ollama, LM Studio, or llama.cpp, use the GGUF version:
> 
> 👉 **[gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF)**
> 
> Available in Q4_K_M (2.7 GB), Q5_K_M (3.1 GB), Q8_0 (4.5 GB), and F16 (8.3 GB).
>
> ```bash
> ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
> ```

### Run with mlx_lm server (native, faster on Apple Silicon)
```bash
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
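
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server started above is listening on port 8080 and returns an OpenAI-style `choices[0].message.content` field; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # same port as the mlx_lm.server command above
MODEL = "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit"


def build_chat_request(user_text: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=120) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello!"))
```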

## Related models

| Model | Size | Notes |
|-------|------|-------|
| [gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Base model (4-bit, use with adapter) |
| **gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit** | ~10.5 GB | **This model** — fused, no adapter needed |
| [**GGUF version**](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF) | 2.7-8.3 GB | Ollama, LM Studio, llama.cpp |
| [gemma-4-E4B-opus-reasoning-claude-code-lora](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) | 658 MB | Adapter-only |
| [gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated |
| [gemma-4-21b-REAP-Tool-Calling-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit) | 12 GB | 21B MoE REAP |

---

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license)

---

<div align="center">
Built with 🖤 by <a href="https://github.com/DeadByDawn101">RavenX AI</a> · <a href="https://github.com/DeadByDawn101/turboquant-mlx">TurboQuant-MLX</a> · <a href="https://github.com/DeadByDawn101/gemini-cli">Gemini CLI</a>
</div>


## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) — PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

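Apply TriAttention for **10.7x KV memory reduction** and **2.5x throughput**. Stacked with TurboQuant-MLX's 4.6x KV cache compression (set up separately, as in the TurboQuant section), that is ~50x combined KV compression (10.7 × 4.6 ≈ 49) versus an uncompressed fp16 cache: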

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```

## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```