---
library_name: mlx
license: gemma
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: text-generation
tags:
- mlx
- safetensors
- gemma4
- 4-bit
- quantized
- apple-silicon
- multimodal
- vision
- reasoning
- chain-of-thought
- opus
- claude-code
- sft
- fused
- turboquant
- kv-cache-compression
- long-context
- ravenx
- tool-calling
- function-calling
base_model: deadbydawn101/gemma-4-E4B-mlx-4bit
base_model_relation: finetune
language:
- en
datasets:
- Crownelius/Opus-4.6-Reasoning-2100x-formatted
---
<div align="center">

# Gemma 4 E4B · Opus Reasoning + Claude Code | Tool Calling ✅ | OpenHarness ✅ | OpenClaw ✅ | Hermes Agent ✅ | Reasoning Baked In

> **Opus 4.6 reasoning + Claude Code fused into weights. Native tool calling. OpenHarness agent harness. OpenClaw orchestration. Hermes terminal-agent skill. `<think>` reasoning baked in, no adapter needed. 10.5 GB.**

### Reasoning baked in. No adapter needed. Built by [RavenX AI](https://github.com/DeadByDawn101)

[TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) · [Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli) · [Gemma license](https://ai.google.dev/gemma/docs/gemma_4_license)

</div>

---

**Gemma 4 E4B with Opus Reasoning + Claude Code LoRA fused directly into the weights**: no adapter needed, no extra memory, just load and run with Claude-style `<think>` reasoning baked in.

> **~10.5 GB. 131K context. Text + vision. Drop-in reasoning upgrade.**

This is [`gemma-4-E4B-mlx-4bit`](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) with the [Opus Reasoning + Claude Code LoRA](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) merged directly into the base weights using `mlx` weight arithmetic.
---

## What's different from the base model

| | Base model | This model |
|--|:--:|:--:|
| `<think>` tag reasoning | ❌ | ✅ baked in |
| Claude-style structured answers | ❌ | ✅ |
| Tool-use patterns | ❌ | ✅ |
| Requires adapter | ✅ | ❌ no adapter needed |
| File size | 4.86 GB (4-bit) | ~10.5 GB (bfloat16 merged) |
| Vision support | ✅ | ✅ |

---
## 🧪 Live Demos: Try It Now

<div align="center">

| Space | What to try |
|---|---|
| 🔥 [**Agentic Tool Calling Demo**](https://huggingface.co/spaces/deadbydawn101/gemma4-agentic-tool-calling-demo) | Live agentic loop: tool calling, `<think>` reasoning, calculator, web search |
| 🐳 [**OpenClaw Sandbox Demo**](https://huggingface.co/spaces/deadbydawn101/openclaw-agent-sandbox-demo) | OpenClaw-style orchestration, Docker runtime, sandbox/approval modes |

</div>
## Quickstart

```bash
pip install mlx-lm mlx-vlm
```

```python
from mlx_lm import load, generate

# No adapter_path needed: the reasoning behavior is in the weights
model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")

messages = [{"role": "user", "content": "Explain why RSA encryption is hard to break."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
# Produces <think>...</think> followed by a structured answer
```
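
Because the reasoning is emitted inline, downstream code often wants to separate the `<think>` block from the final answer. The sketch below continues from the quickstart above; the `split_reasoning` helper is illustrative, and it assumes the tags appear verbatim in the generated text, at most once, before the answer.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    # Illustrative helper: split generated text into (reasoning, answer),
    # assuming a single verbatim <think>...</think> block precedes the answer.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

thinking, answer = split_reasoning(response)
print(answer)
```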
### CLI

```bash
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --prompt "Debug this Python code: def fib(n): return fib(n-1) + fib(n-2)" \
  --max-tokens 1024
```

---
## 🧩 OpenHarness + OpenClaw + Hermes Agent

This model is built to sit inside a **real agent stack**, not just a chat box.

We support:

- **[OpenHarness](https://github.com/HKUDS/OpenHarness)** for agent harness/runtime, skills, hooks, tool loops, and multi-agent flows
- **OpenClaw** for orchestration, sessions, reminders, and cross-agent routing
- **Hermes agent skill** for terminal-native coding posture, short planning, aggressive tool use, and repo-aware execution

### Why this combo matters

| Layer | Role |
|---|---|
| **Gemma 4 E4B Opus Reasoning + Claude Code** | reasoning + tool-use behavior baked into the weights |
| **Gemini CLI** | coding agent + tool orchestration |
| **OpenHarness** | harness runtime, tool loop, swarm, hooks, memory |
| **OpenClaw** | orchestration, sessions, skills, messaging |
| **Hermes skill** | agent behavior for concise, terminal-first execution |
### OpenHarness quickstart

```bash
pip install openharness

mlx_lm.server \
  --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit \
  --port 8080

oh --model http://localhost:8080/v1 \
  --skill hermes-agent \
  -p "Review this repo, find bugs, patch them, and summarize the result"
```
### OpenClaw skill stack

Inside OpenClaw, pair this model with:

- `openharness` skill → run/configure `oh`
- `hermes-agent` skill → shape coding-agent behavior

That gives you a fully local Apple Silicon agent lane with:

- baked-in reasoning
- native tool calling
- Gemini CLI integration
- OpenHarness runtime support
- OpenClaw orchestration
## 💻 Gemini CLI: Coding Agent + Tool Orchestration

We use **[RavenX AI's Gemini CLI fork](https://github.com/DeadByDawn101/gemini-cli)** as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.

Gemini CLI gives you a full agentic loop in the terminal: Google Search grounding, file read/write, shell execution, web fetching, and MCP server support, all wired to a 1M token context window.

```bash
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together

| Capability | How |
|---|---|
| **Code generation** | Gemini CLI reads your codebase, the model reasons with `<think>` tags |
| **Tool calling** | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| **Long context** | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| **MCP servers** | Connect any MCP server → databases, APIs, custom tools |
| **Search grounding** | Google Search built in → model gets live data |

```bash
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```

→ [DeadByDawn101/gemini-cli on GitHub](https://github.com/DeadByDawn101/gemini-cli): Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX: 4.6x KV Cache Compression

Pair with [TurboQuant-MLX](https://github.com/DeadByDawn101/turboquant-mlx) to compress the KV cache and run 4.6x longer reasoning chains within the same memory budget:

```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
# Long reasoning chains now fit in the same RAM budget
```

→ [TurboQuant-MLX on GitHub](https://github.com/DeadByDawn101/turboquant-mlx) · [v2.0 Release](https://github.com/DeadByDawn101/turboquant-mlx/releases/tag/v2.0.0)
---

## How it was made

### Training data

| Source | Examples |
|--------|--------:|
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2,054 |
| Claude Code tool-use patterns | 140 files |
| **Total** | **2,163** |

### Training

```
Base: deadbydawn101/gemma-4-E4B-mlx-4bit
Method: SFT completions-only (mlx_vlm.lora)
Rank: 8 · Alpha: 16 · LR: 1e-5 · Iters: 1,000
Hardware: Apple M4 Max 128GB · Peak mem: 7.876 GB
Final loss: ~3.5e-7
```
### Fusion

All **378 LoRA pairs** were merged via weight arithmetic:

```
W_merged = dequantize(W_base) + (A @ B).T × (alpha / rank)
```

The merged result was cast to bfloat16 and saved as 3-shard safetensors.
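
As a rough illustration of that arithmetic, here is a minimal sketch of merging a single LoRA pair into one quantized base matrix with `mlx`. The function name `fuse_lora_pair`, the `[in, rank] @ [rank, out]` adapter layout, and the 4-bit/group-size-64 quantization parameters are assumptions for illustration; the actual fusion script may differ.

```python
import mlx.core as mx

def fuse_lora_pair(w_q, scales, biases, lora_a, lora_b,
                   alpha=16, rank=8, group_size=64, bits=4):
    # Hypothetical helper illustrating
    #   W_merged = dequantize(W_base) + (A @ B).T * (alpha / rank)
    w_base = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)
    delta = (lora_a @ lora_b).T * (alpha / rank)   # low-rank update, scaled by alpha/rank
    return (w_base + delta).astype(mx.bfloat16)    # keep the merged weight in bfloat16
```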
---

## 📦 Ollama / LM Studio / llama.cpp

> **This is an MLX model optimized for Apple Silicon.** For Ollama, LM Studio, or llama.cpp, use the GGUF version:
>
> **[gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF)**
>
> Available in Q4_K_M (2.7 GB), Q5_K_M (3.1 GB), Q8_0 (4.5 GB), and F16 (8.3 GB).
>
> ```bash
> ollama run hf.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF
> ```
### Run with mlx_lm server (native, faster on Apple Silicon)

```bash
mlx_lm.server --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --port 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'
```
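
The same endpoint can also be called from Python with the official `openai` client, including an OpenAI-style `tools` array for function calling. This is a sketch rather than the project's own example: it assumes `pip install openai`, the server started as above on port 8080, and an `mlx_lm` version that forwards `tools` to the chat template; the `get_weather` tool is hypothetical and for illustration only.

```python
from openai import OpenAI

# Point the client at the local mlx_lm server; no real API key is needed
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit",
    messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
    tools=tools,
    max_tokens=512,
)
print(resp.choices[0].message)
```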
## Related models

| Model | Size | Notes |
|-------|------|-------|
| [gemma-4-E4B-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E4B-mlx-4bit) | 4.86 GB | Base model (4-bit, use with adapter) |
| **gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit** | ~10.5 GB | **This model**: fused, no adapter needed |
| [**GGUF version**](https://huggingface.co/deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-GGUF) | 2.7-8.3 GB | Ollama, LM Studio, llama.cpp |
| [gemma-4-E4B-opus-reasoning-claude-code-lora](https://huggingface.co/deadbydawn101/gemma-4-E4B-opus-reasoning-claude-code-lora) | 658 MB | Adapter-only |
| [gemma-4-E2B-Heretic-Uncensored-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit) | 3.34 GB | 2B abliterated |
| [gemma-4-21b-REAP-Tool-Calling-mlx-4bit](https://huggingface.co/deadbydawn101/gemma-4-21b-REAP-Tool-Calling-mlx-4bit) | 12 GB | 21B MoE REAP |
---

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license)

---

<div align="center">

Built with 🤗 by <a href="https://github.com/DeadByDawn101">RavenX AI</a> · <a href="https://github.com/DeadByDawn101/turboquant-mlx">TurboQuant-MLX</a> · <a href="https://github.com/DeadByDawn101/gemini-cli">Gemini CLI</a>

</div>
## TriAttention KV Compression

> **[2026-04-09] Our MLX port was merged into [TriAttention](https://github.com/WeianMao/triattention) (MIT + NVIDIA) via PR #1 by [@DeadByDawn101](https://github.com/DeadByDawn101) (RavenX AI).**

Apply a **10.7x KV memory reduction** and a **2.5x throughput gain** on top of this model's built-in 4-bit TurboQuant quantization, for roughly 50x combined compression versus full fp16:

```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```bash
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E4B-Agentic-Opus-Reasoning-GeminiCLI-mlx-4bit --triattention
```