---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: text-generation
tags:
  - mlx
  - safetensors
  - gemma4
  - 4-bit
  - quantized
  - apple-silicon
  - multimodal
  - vision
  - reasoning
  - chain-of-thought
  - opus
  - claude-code
  - sft
  - fused
  - turboquant
  - kv-cache-compression
  - long-context
  - ravenx
  - tool-calling
  - function-calling
base_model: deadbydawn101/gemma-4-E4B-mlx-4bit
base_model_relation: finetune
language:
  - en
datasets:
  - Crownelius/Opus-4.6-Reasoning-2100x-formatted
---

<div align="center">

# Gemma 4 E4B: Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In

> **Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill. `<think>` reasoning baked in, no adapter needed. 10.5 GB.**

### Reasoning baked in. No adapter needed. Built by [RavenX AI](https://github.com/DeadByDawn101)

[![TurboQuant](https://img.shields.io/badge/TurboQuant--MLX-4.6x_KV_compression-blueviolet)](https://github.com/DeadByDawn101/turboquant-mlx)
[![Gemini CLI](https://img.shields.io/badge/Gemini_CLI-MCP_compatible-blue)](https://github.com/DeadByDawn101/gemini-cli)
[![License](https://img.shields.io/badge/license-Gemma-green)](https://ai.google.dev/gemma/docs/gemma_4_license)

</div>

---

**Gemma 4 E4B with Opus Reasoning + Claude Code LoRA fused directly into the weights**: no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.

> **~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.**

This is [`gemma-4-E4B-mlx-4bit`](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) with the [Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) merged directly into the base weights using `mlx` weight arithmetic.

---

## What's different from the base model

| | Base model | This model |
|--|:--:|:--:|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | - | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |

---


## 🧪 Live Demos: Try It Now

<div align="center">

| Space | What to try |
|---|---|
| 🔥 [**Agentic Tool Calling Demo**](https://huggingface.co/spaces/deadbydawn101/gemma4-agentic-tool-calling-demo) | Live agentic loop: tool calling, `<think>` reasoning, calculator, web search |
| 🐳 [**OpenClaw Sandbox Demo**](https://huggingface.co/spaces/deadbydawn101/openclaw-agent-sandbox-demo) | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |

</div>

## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed: reasoning is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# → Will produce <think>...</think> followed by a structured answer
```
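Because the response interleaves reasoning and answer, downstream code often wants them separated. The following is a minimal sketch, assuming the output contains at most one well-formed `<think>...</think>` block as described above; `split_reasoning` is a hypothetical helper, not part of `mlx_lm`.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is "" when no think block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>RSA relies on factoring.</think>\nFactoring large semiprimes is hard."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

If generation stops before the closing tag (for example, when `max_tokens` is reached), the pattern does not match and the whole text is returned as the answer, so callers may want to handle that case explicitly.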

### CLI
```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```

---



## 🧩 OpenHarness + OpenClaw + Hermes Agent

This model is built to sit inside a **real agent stack**, not just a chat box.

We support:
- **[OpenHarness](https://github.com/HKUDS/OpenHarness)** for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- **OpenClaw** for orchestration, sessions, reminders, and cross-agent routing
- **Hermes agent skill** for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution

### Why this combo matters

| Layer | Role |
|---|---|
| **Gemma 4 E4B Opus Reasoning + Claude Code** | reasoning + tool-use behavior baked into the weights |
| **Gemini CLI** | coding agent + tool orchestration |
| **OpenHarness** | harness runtime, tool loop, swarm, hooks, memory |
| **OpenClaw** | orchestration, sessions, skills, messaging |
| **Hermes skill** | agent behavior for concise, terminal-first execution |

### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
   --skill hermes-agent \
   -p "Review this repo, find bugs, patch them, and summarize the result"
```

### OpenClaw skill stack

Inside OpenClaw, pair this model with:
- `openharness` skill: run/configure `oh`
- `hermes-agent` skill: shape coding-agent behavior

That gives you a fully local Apple Silicon agent lane with:
- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration

## 💻 Gemini CLI: Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal (Google Search grounding, file read/write, shell execution, web fetching, and MCP server support), all wired to a 1M token context window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against Gemini API (free tier: 60 req/min)
gemini
```

### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server: databases, APIs, custom tools |
| **Search grounding** | Google Search built in, so the model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli): Apache 2.0, free tier, MCP-compatible

## ⚡ TurboQuant-MLX: 4.6x KV Cache Compression

Pair with [TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) to compress the KV cache and run 4.6x longer reasoning chains at the same memory:

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [v2.0 Release](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)

---

## How it was made

### Training data
| Source | Examples |
|--------|--------:|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| **Total** | **2,163** |

### Training
```
Base:      deadbydawn101/gemma-4-E4B-mlx-4bit
Method:    SFT completions-only (mlx_vlm.lora)
Rank:      8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware:  Apple M4 Max 128GB · Peak mem: 7.876 GB

Final loss: ~3.5e-7
```

### Fusion
All **378 LoRA pairs** merged via weight arithmetic:
```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```
Result dequantized to bfloat16 and saved as 3-shard safetensors.
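
The merge formula can be illustrated on a toy matrix. This is a sketch in pure Python rather than the actual MLX fusion script: it skips the dequantization step, and the shapes (`A` as in × rank, `B` as rank × out, `W` as out × in) are an assumption chosen so that `(A @ B).T` lines up with `W`. The toy values use `alpha=2`, `rank=1`, which gives the same scale of 2 as the card's alpha 16 / rank 8.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*x)]


def merge_lora(w_base, a, b, alpha, rank):
    """Return w_base + (a @ b).T * (alpha / rank), element by element."""
    scale = alpha / rank
    delta = transpose(matmul(a, b))  # shape (out, in), same as w_base
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]


# Toy sizes: in=2, out=3, rank=1 (assumed layout, see lead-in).
w_base = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # (out=3, in=2)
a = [[1.0], [2.0]]                              # (in=2, rank=1)
b = [[0.5, 0.0, -1.0]]                          # (rank=1, out=3)

w_merged = merge_lora(w_base, a, b, alpha=2.0, rank=1)
print(w_merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the fused model loads without an `adapter_path`.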

---


## 🦙 Ollama / LM Studio / llama.cpp

> **This is an MLX model optimized for Apple Silicon.** For Ollama, LM Studio, or llama.cpp, use the GGUF version:
> 
> 👉 **[gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF)**
> 
> Available in Q4_K_M (2.7 GB), Q5_K_M (3.1 GB), Q8_0 (4.5 GB), and F16 (8.3 GB).
>
> ```bash
> ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
> ```

### Run with mlx_lm server (native, faster on Apple Silicon)
```bash
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
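
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the `curl` call above; `build_chat_request` and `chat` are hypothetical helper names, and the `choices[0].message.content` response path assumes the server returns an OpenAI-style chat completion body.

```python
import json
import urllib.request

MODEL = "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit"


def build_chat_request(prompt, base_url="http://localhost:8080", model=MODEL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request; needs a running server. Response shape is assumed."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello!")
print(request.full_url)
```

Calling `chat("Hello!")` only works while `mlx_lm.server` is listening on the given port; building the request separately makes the payload easy to inspect or test offline.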

## Related models

| Model | Size | Notes |
|-------|------|-------|
| [gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Base model (4-bit, use with adapter) |
| **gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit** | ~10.5 GB | **This model**: fused, no adapter needed |
| [**GGUF version**](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF) | 2.7-8.3 GB | Ollama, LM Studio, llama.cpp |
| [gemma-4-E4B-opus-reasoning-claude-code-lora](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) | 658 MB | Adapter-only |
| [gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated |
| [gemma-4-21b-REAP-Tool-Calling-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit) | 12 GB | 21B MoE REAP |

---

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license)

---

<div align="center">
Built with 🖤 by <a href="https://github.com/DeadByDawn101">RavenX AI</a> · <a href="https://github.com/DeadByDawn101/turboquant-mlx">TurboQuant-MLX</a> · <a href="https://github.com/DeadByDawn101/gemini-cli">Gemini CLI</a>
</div>


## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) as PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

Apply **10.7x KV memory reduction** and **2.5x throughput** on top of this model's built-in 4-bit TurboQuant quantization for ~50x combined compression vs full fp16:

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
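
As a back-of-the-envelope check on the "~50x combined" figure, the sketch below multiplies the two quoted factors and applies them to an fp16 KV cache. The layer count, KV head count, and head dimension are hypothetical placeholders, not this model's actual configuration, and treating the two factors as independent multipliers is an assumption.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of an uncompressed KV cache; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


TURBOQUANT_FACTOR = 4.6     # quoted KV compression for TurboQuant-MLX
TRIATTENTION_FACTOR = 10.7  # quoted KV memory reduction for TriAttention

# Assumes the two reductions compose multiplicatively.
combined = TURBOQUANT_FACTOR * TRIATTENTION_FACTOR

# Hypothetical dimensions, chosen only to make the arithmetic visible.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=131_072)
compressed = baseline / combined

print(f"combined factor: {combined:.1f}x")
print(f"fp16 baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed:      {compressed / 2**30:.2f} GiB")
```

With these placeholder dimensions the product of the two factors comes out just under 50, which is consistent with the rounded figure quoted above.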

## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```