---
base_model: Daffaadityp/AxonAI-MX4-2.0
language:
  - en
  - id
license: apache-2.0
tags:
  - gguf
  - quantized
  - qwen3
  - dora
  - axonlabs
  - reasoning
  - local-llm
  - chain-of-thought
  - edge-ai
  - ollama
  - llama-cpp
  - indonesian-ai
  - text-generation
  - 4b
  - instruct
pipeline_tag: text-generation
library_name: gguf
---

<div align="center">

# 🧠 Poterry AI — GGUF Quantized Edition

### *Reasoning-First Language Model · 4B Parameters · Chain-of-Thought Native*
### *Optimized for Local Inference · Edge Devices · Laptops · Offline AI*

<br>

[![Model](https://img.shields.io/badge/Base%20Model-AxonAI%20MX4%202.0-blueviolet?style=for-the-badge&logo=huggingface)](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0)
[![Format](https://img.shields.io/badge/Format-GGUF-orange?style=for-the-badge&logo=llvm)](https://github.com/ggerganov/llama.cpp)
[![Quantization](https://img.shields.io/badge/Quants-Q2__K%20%7C%20Q4__K__M%20%7C%20Q8__0-brightgreen?style=for-the-badge)](https://github.com/ggerganov/llama.cpp#quantization)
[![Ollama](https://img.shields.io/badge/Ollama-Compatible-informational?style=for-the-badge&logo=ollama)](https://ollama.com)
[![llama.cpp](https://img.shields.io/badge/llama.cpp-Compatible-success?style=for-the-badge)](https://github.com/ggerganov/llama.cpp)
[![LM Studio](https://img.shields.io/badge/LM%20Studio-Compatible-9cf?style=for-the-badge)](https://lmstudio.ai)
[![Parameters](https://img.shields.io/badge/Parameters-4B-blue?style=for-the-badge)](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0)
[![License](https://img.shields.io/badge/License-Apache%202.0-red?style=for-the-badge)](https://www.apache.org/licenses/LICENSE-2.0)
[![Made in Indonesia](https://img.shields.io/badge/Made%20in-Indonesia%20🇮🇩-red?style=for-the-badge)](https://github.com/Daffaadityp)

<br>

> **This repository contains the official GGUF quantized files for AxonAI MX4 2.0.**
> Run a full Chain-of-Thought reasoning LLM *entirely locally* — no GPU required, no internet connection, no API costs. Just pure, structured intelligence on your own hardware.

</div>

---

## 📌 Quick Navigation

| Section | Description |
|---|---|
| [🗂️ Available Files](#️-available-gguf-files--quantization-guide) | Q2_K, Q4_K_M, Q8_0 — which one is right for you? |
| [🚀 Ollama Quickstart](#-ollama-quickstart-recommended) | Easiest way to run locally — one command |
| [⚙️ llama.cpp CLI](#️-llamacpp-cli) | For advanced users and scripting |
| [🖥️ LM Studio / GPT4All](#️-lm-studio--gpt4all) | GUI-based local inference |
| [🧬 Why Quantized Reasoning?](#-why-a-quantized-reasoning-model-is-so-powerful) | The secret sauce — explained for GGUF |
| [🛠️ Prompt Format](#️-prompt--system-format) | How to structure your prompts |
| [🇮🇩 Indonesian Community](#-untuk-developer-indonesia) | For developers in Indonesia |

---

## 🌐 What Is This Repository?

This is the **official GGUF release** of [AxonAI MX4 2.0](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0), a 4-billion-parameter reasoning-first language model built by **AxonLabs** (SMKN 26 Jakarta). The original model was fine-tuned from a Qwen3 base model with **DoRA (Weight-Decomposed Low-Rank Adaptation)**. The fine-tuning trains it to produce structured, transparent Chain-of-Thought reasoning inside `<think>` tags before every final response.

These GGUF files were produced using `llama.cpp`'s official quantization pipeline, preserving the model's reasoning depth while dramatically reducing memory footprint — making **local LLM inference** accessible on consumer hardware.

**If you want the full-precision FP16/BF16 weights**, visit the original repository:
👉 [`Daffaadityp/AxonAI-MX4-2.0`](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0)

---

## 🗂️ Available GGUF Files & Quantization Guide

Choose the right quantization level for your hardware. As a general rule: **higher Q = better quality, higher RAM requirement**.

| File | Quant Type | Size (Est.) | Min RAM | Quality | Use Case |
|---|---|---|---|---|---|
| `AxonAI-MX4-2.0-Q2_K.gguf` | Q2_K | ~1.7 GB | 4 GB | ⚡ Fast / Compressed | Raspberry Pi, very old laptops, extreme RAM constraints |
| `AxonAI-MX4-2.0-Q4_K_M.gguf` | Q4_K_M | ~2.7 GB | 6 GB | ⭐ **Recommended** | Mac M1/M2, standard laptops, WSL2, most modern CPUs |
| `AxonAI-MX4-2.0-Q8_0.gguf` | Q8_0 | ~4.5 GB | 8 GB | 🔬 Near-FP16 | Workstations, gaming PCs with ample RAM, power users |
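
You can sanity-check the Size column with a back-of-the-envelope formula: file size ≈ parameter count × bits per weight ÷ 8. The sketch below assumes a parameter count of about 4.0 billion. Q8_0 stores each block of 32 weights as 8-bit integers plus one FP16 scale, which works out to (32 × 8 + 16) / 32 = 8.5 bits per weight. K-quants such as Q4_K_M and Q2_K mix several tensor types, so their effective bits per weight is only an average. The helper name `gguf_size_gb` is hypothetical.

```bash
# Hypothetical helper: estimate a GGUF file size in GB (10^9 bytes).
# Usage: gguf_size_gb <parameter_count> <bits_per_weight>
# Ignores metadata and tokenizer data, which add a little on top.
gguf_size_gb() {
  awk -v params="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", params * bpw / 8 / 1e9 }'
}

# Q8_0: 32 int8 weights + one FP16 scale per block = 8.5 bits per weight
gguf_size_gb 4000000000 8.5   # prints 4.25
```

The table's sizes are estimates, so expect small differences from this formula.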

### ⭐ Recommendation: Start with `Q4_K_M`

`Q4_K_M` is the universally recommended sweet spot for local LLM inference. It delivers:
- **Quality close to the full-precision model** (not yet formally benchmarked for AxonAI MX4 2.0) at roughly a third of the memory: ~2.7 GB versus ~8 GB for 16-bit weights
- Excellent performance on **Apple Silicon (M1/M2/M3)**, standard x86 laptops, and cloud VMs
- The best balance of **inference speed**, **reasoning coherence**, and **RAM efficiency**

> 💡 For most users: **Q4_K_M is the right choice. Start here.**

---

## 🚀 Ollama Quickstart (Recommended)

[Ollama](https://ollama.com) is the fastest way to run AxonAI MX4 2.0 locally. No Python setup required.

### Step 1 — Install Ollama

```bash
# Linux (the install script targets Linux only)
curl -fsSL https://ollama.com/install.sh | sh

# macOS / Windows: download the installer from https://ollama.com/download
```

### Step 2 — Create a Modelfile

Create a file named `Modelfile` (no extension) in your working directory:

```dockerfile
# Modelfile for AxonAI MX4 2.0 (Q4_K_M - Recommended)
FROM ./AxonAI-MX4-2.0-Q4_K_M.gguf

# --- Core Identity & Reasoning System Prompt ---
SYSTEM """
You are AxonAI, an advanced reasoning assistant developed by AxonLabs.
Before answering any question, you MUST use your internal scratchpad enclosed in <think>...</think> tags to reason step-by-step.
Only after completing your reasoning should you provide a clear, structured, and helpful final answer.
Be precise, thorough, and transparent in your logic.
"""

# --- Generation Parameters (Optimized for Reasoning) ---
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
```

> 💡 **Why the `<think>` system prompt?** AxonAI MX4 2.0 was fine-tuned with Chain-of-Thought supervision. Including this system prompt *unlocks* the model's full reasoning capability. Without it, you may get direct answers without the structured deliberation the model was trained to produce.
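
Recent Ollama versions usually pick up the chat template embedded in the GGUF file. After building the model in Step 3, you can check this with `ollama show axonai-mx4 --modelfile`. If the output shows no ChatML (`<|im_start|>`) template, you can append an explicit template and stop tokens to the Modelfile and rebuild. The sketch below is a plain ChatML template in Ollama's Go-template syntax. It is an assumption based on the ChatML format described under Prompt & System Format, not a template shipped by AxonLabs, and it does not handle tool calls.

```bash
# Append a plain ChatML chat template and stop tokens to the Modelfile.
# Assumption: the model expects standard ChatML, as documented in this card.
# The quoted 'EOF' keeps the shell from expanding anything inside the template.
cat >> Modelfile <<'EOF'

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
EOF
```

Run `ollama create` again afterwards so the new template takes effect.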

### Step 3 — Build and Run

```bash
# Build the local Ollama model from your Modelfile
ollama create axonai-mx4 -f ./Modelfile

# Run it interactively
ollama run axonai-mx4

# Or run with a direct prompt
ollama run axonai-mx4 "Explain the P vs NP problem and whether you think it will ever be solved."
```

### Using the Ollama REST API

Once running, Ollama exposes a local REST API — perfect for integrations:

```bash
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "axonai-mx4",
    "prompt": "What are the ethical implications of deploying AI in judicial systems?",
    "stream": false
  }'
```
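
With `"stream": false`, the `response` field of the JSON reply holds the full generated text. For this model, that text normally starts with the `<think>...</think>` block. If an application only needs the final answer, one option is a POSIX `awk` filter that drops the first `<think>` block. The function name `strip_think` is hypothetical.

```bash
# Hypothetical helper: remove the first <think>...</think> block from stdin
# and print what remains (the final answer), minus leading blank lines.
strip_think() {
  awk '
    { buf = buf $0 "\n" }                  # collect all of stdin into one string
    END {
      s = index(buf, "<think>")
      e = index(buf, "</think>")
      if (s > 0 && e > s)
        buf = substr(buf, 1, s - 1) substr(buf, e + length("</think>"))
      sub(/^\n+/, "", buf)                 # drop blank lines left at the start
      printf "%s", buf
    }'
}
```

Usage, assuming `jq` is installed: pipe the API reply through `jq -r '.response' | strip_think`. Input without a `<think>` block passes through unchanged.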

---

## ⚙️ llama.cpp CLI

For advanced users, scripting pipelines, or maximum performance control.

### Install llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j "$(getconf _NPROCESSORS_ONLN)"
```

### Run Inference

```bash
# Basic interactive mode (Q4_K_M recommended)
./build/bin/llama-cli \
  -m ./AxonAI-MX4-2.0-Q4_K_M.gguf \
  -n 2048 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.1 \
  --ctx-size 8192 \
  -i \
  -r "User:" \
  --in-prefix " " \
  -p "You are AxonAI, a reasoning assistant. Think step by step inside <think> tags before answering.\n\nUser:"
```

```bash
# Single-shot inference (batch/scripting).
# -no-cnv keeps llama-cli in raw completion mode so the ChatML prompt below is used as-is;
# recent builds otherwise switch to chat mode automatically for models with a chat template.
./build/bin/llama-cli \
  -m ./AxonAI-MX4-2.0-Q8_0.gguf \
  -no-cnv \
  -n 1024 \
  --temp 0.6 \
  --ctx-size 8192 \
  -p "<|im_start|>system\nYou are AxonAI. Reason carefully using <think> tags.<|im_end|>\n<|im_start|>user\nSolve: If a train travels 120km at 60km/h, then 80km at 40km/h, what is the average speed for the whole journey?<|im_end|>\n<|im_start|>assistant\n"
```
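
Use this reference when checking the model's output. The correct answer is total distance divided by total time:

$$
\bar{v} = \frac{d_1 + d_2}{t_1 + t_2} = \frac{120 + 80}{\frac{120}{60} + \frac{80}{40}} = \frac{200\ \text{km}}{2\ \text{h} + 2\ \text{h}} = 50\ \text{km/h}
$$

Both legs happen to take 2 hours, so the plain average of the two speeds, (60 + 40) / 2, also gives 50 km/h here. When the leg times differ, only the total-distance-over-total-time formula is correct.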

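> 🔧 **Performance tip:** If llama.cpp was built with a GPU backend (CUDA, ROCm/HIP, Vulkan, or Metal), add `-ngl 99` to offload all layers to the GPU. Metal is enabled by default in Apple Silicon builds. Depending on your hardware, this can give a **3–10x speedup** over CPU-only inference.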

---

## 🖥️ LM Studio / GPT4All

Both LM Studio and GPT4All support direct GGUF loading with a graphical interface — ideal for non-technical users or demos.

**LM Studio:**
1. Download from [lmstudio.ai](https://lmstudio.ai)
2. Go to **Search** → search `AxonAI` or import GGUF manually via **My Models**
3. Load `AxonAI-MX4-2.0-Q4_K_M.gguf`
4. In the **System Prompt** field, paste the reasoning system prompt from the Modelfile above
5. Start chatting. LM Studio can also serve the model through a local OpenAI-compatible API (port `1234` by default) once you start its local server

**GPT4All:**
1. Download from [nomic.ai/gpt4all](https://www.nomic.ai/gpt4all)
2. Under **Add Model** → choose **Import from file** and select your `.gguf` file
3. GPT4All works entirely offline after the initial load — perfect for privacy-sensitive use cases

---

## 🧬 Why a Quantized Reasoning Model Is So Powerful

Many local LLMs are **answer-first**: they respond directly, pattern-matching to the most statistically likely answer. AxonAI MX4 2.0 is built differently.

It was trained to **reason before it answers** — meaning every response is preceded by an internal deliberation process encoded inside `<think>...</think>` tags. This is the Chain-of-Thought (CoT) paradigm, and when applied to a quantized local model, several powerful properties emerge:

### 🔒 Complete Privacy, Full Intelligence
Your prompts **never leave your machine**. Unlike cloud LLM APIs, no data is sent to any server, and you still get structured, step-by-step reasoning entirely offline. This is essential for:
- Legal document analysis
- Medical note summarization
- Private financial reasoning
- Proprietary code review

### 📉 Quantization ≠ Reasoning Degradation
Quantization tends to hurt precise factual recall, where it can increase hallucination, more than it hurts **structured reasoning**. The logical flow learned during DoRA fine-tuning holds up at 4-bit precision. The model still deliberates, still checks its own steps, and still produces structured conclusions.

### 🧩 The DoRA Advantage
AxonAI MX4 2.0 was adapted using **DoRA (Weight-Decomposed Low-Rank Adaptation)**. DoRA splits each weight into a magnitude component and a direction component. It trains the magnitude directly and applies the low-rank update only to the direction. This produces **more stable, nuanced fine-tuning** than standard LoRA. AxonLabs expects that stability to carry through quantization, so the model should keep reasoning faithfully at Q4 compression; formal benchmarks are still forthcoming.
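
For readers curious about the mechanics, this is the general DoRA update rule from the original method. The rank, target modules, and other training settings used for AxonAI MX4 2.0 are not documented in this card.

$$
W' = m \, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}
$$

The symbols are:

- $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight.
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank factors, with $r \ll \min(d, k)$.
- $\lVert \cdot \rVert_c$ is the column-wise vector norm.
- $m \in \mathbb{R}^{1 \times k}$ is a trainable magnitude vector initialized to $\lVert W_0 \rVert_c$.

If you drop the normalization and $m$, you recover plain LoRA, $W' = W_0 + BA$.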

### ⚡ The Efficiency Equation
A 4B-parameter model at Q4_K_M runs at roughly **20–60 tokens/second** on Apple M-series chips and modern CPUs, depending on the hardware. That is fast enough for real-time, interactive reasoning. Think of it as having a thoughtful senior analyst available offline, on any machine, forever.
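
The model writes its reasoning before the answer, so end-to-end latency scales with the total number of generated tokens, $t \approx N / r$. As a rough illustration, take a hypothetical 600-token response (reasoning plus answer) at the quoted throughput range, ignoring prompt-processing time:

$$
\frac{600\ \text{tokens}}{60\ \text{tokens/s}} = 10\ \text{s}, \qquad \frac{600\ \text{tokens}}{20\ \text{tokens/s}} = 30\ \text{s}
$$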

---

## 🛠️ Prompt & System Format

AxonAI MX4 2.0 uses the **ChatML** prompt template (inherited from Qwen3):

```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
<think>
{internal reasoning — model generates this}
</think>
{final answer — model generates this}
<|im_end|>
```
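
When you call llama.cpp or another raw-completion backend directly, you have to build this template yourself. The shell sketch below does that. It fills in the system and user turns and leaves the assistant turn open, so the model starts by generating its `<think>` block. The function name `chatml_prompt` is hypothetical.

```bash
# Hypothetical helper: build a single-turn ChatML prompt.
# $1 = system prompt, $2 = user message.
# Passing the messages through %s keeps any % or backslash characters literal.
chatml_prompt() {
  printf '<|im_start|>system\n%s<|im_end|>\n<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n' "$1" "$2"
}

chatml_prompt "You are AxonAI. Always reason inside <think>...</think> before your final answer." \
  "What is 17 * 23?"
```

Command substitution (`$(...)`) strips the trailing newline after `<|im_start|>assistant`. To keep the prompt exact, write it to a file instead, for example `chatml_prompt "$SYSTEM" "$QUESTION" > prompt.txt`. Then pass that file to llama-cli with `-f prompt.txt`.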

### Recommended System Prompt (Full Version)

```
You are AxonAI, an advanced reasoning language model developed by AxonLabs.
Your core capability is structured deliberation: before answering any question,
you MUST think step-by-step inside <think>...</think> tags.

Guidelines:
- Use <think> to break down the problem, consider edge cases, and verify your logic.
- After </think>, give a clear, well-structured, and helpful final answer.
- Be honest about uncertainty. Never fabricate facts.
- For math and logic, show your work explicitly inside <think>.
- For creative or open-ended tasks, use <think> to plan your response structure.
```

### Minimal System Prompt (Fast / Lightweight)

```
You are AxonAI. Always reason inside <think>...</think> before your final answer.
```

---

## 📊 Model Architecture & Training Summary

| Property | Value |
|---|---|
| **Base Architecture** | Qwen3 (4B) |
| **Fine-Tuning Method** | DoRA (Weight-Decomposed Low-Rank Adaptation) |
| **Training Paradigm** | Chain-of-Thought Supervised Fine-Tuning |
| **Context Window** | 8,192 tokens |
| **Vocab Size** | 151,936 |
| **Attention Heads** | 32 |
| **Key-Value Heads** | 8 (Grouped Query Attention) |
| **Hidden Dimensions** | 2,560 |
| **GGUF Quantizer** | llama.cpp (official) |
| **Available Quants** | Q2_K, Q4_K_M, Q8_0 |
| **Language Support** | English (primary), Indonesian (strong) |
| **License** | Apache 2.0 |
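
Grouped Query Attention keeps long contexts affordable on a laptop. The KV cache stores one key vector and one value vector per KV head, per layer, per token. With 8 KV heads instead of 32 full attention heads, the cache is a quarter of the size. The sketch below estimates the cache size for `num_ctx 8192`. It assumes:

- 36 transformer layers and a head dimension of 128, the values in the public Qwen3-4B configuration (neither is listed in the table above).
- An FP16 cache at 2 bytes per element.

`kv_cache_mib` is a hypothetical helper name.

```bash
# Hypothetical helper: KV-cache size in MiB.
# Formula: 2 (K and V) x layers x kv_heads x head_dim x context x bytes_per_element
kv_cache_mib() {
  layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024 / 1024 ))
}

# Assumed Qwen3-4B values: 36 layers, 8 KV heads, head_dim 128, num_ctx 8192, FP16 cache
kv_cache_mib 36 8 128 8192 2   # prints 1152 (about 1.1 GiB)
```

Without GQA, all 32 heads would be cached. Calling `kv_cache_mib 36 32 128 8192 2` gives 4608 MiB, four times as much.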

---

## 🔬 Benchmark Context

> AxonAI MX4 2.0 is a research and educational model from AxonLabs. Formal benchmark results are forthcoming. The following reflects qualitative design targets based on the training methodology.

| Capability | Assessment |
|---|---|
| Structured Reasoning (CoT) | ✅ Strong — core training objective |
| Mathematical Problem Solving | ✅ Good — benefiting from step-by-step CoT |
| Code Generation (Python/JS) | ✅ Good |
| Factual Q&A (English) | ✅ Good |
| Indonesian Language (id) | ✅ Good |
| Long-Context Coherence (8K) | ⚠️ Moderate — improves with Q8_0 |
| Complex Multi-Step Agentic Tasks | ⚠️ Moderate — use longer system prompts |

*Community evaluations and PR-based benchmark additions are welcome.*

---

## 🇮🇩 Untuk Developer Indonesia

**Hello, Indonesian developers! 🙌**

This is the first local AI model from AxonLabs that you can run **100% offline on your own laptop or PC**. You need no expensive GPU, no API fees, and no internet connection.

Imagine having an AI assistant that thinks step by step, understands context, and answers complex questions, all running on your own machine. That is the goal of AxonAI MX4 2.0 GGUF.

**Why does this matter to you?**
- 🔒 **Total privacy**: your data never leaves your device
- 💸 **Free forever**: no subscription or per-token fees
- 🌐 **Works offline**: even in areas with limited connectivity
- 🧠 **Reasoning-first**: this model *thinks first* before answering instead of guessing

Built by vocational high school (SMK) students, for everyone in Indonesia who wants to explore AI hands-on.

> *"The best AI is AI you can control yourself."*
> — AxonLabs, SMKN 26 Jakarta

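**The fastest way to get started (5 minutes):**
```bash
# 1. Install Ollama (Linux; on macOS/Windows use the installer from https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Create the Modelfile (see the guide above), then:
ollama create axonai-mx4 -f ./Modelfile

# 3. Run it! The prompt is in Indonesian:
#    "Explain how the transformer architecture works in easy-to-understand language."
ollama run axonai-mx4 "Jelaskan cara kerja transformer architecture dalam bahasa yang mudah dipahami."
```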

---

## ⚖️ License & Usage

This model is released under the **Apache 2.0 License**.

- ✅ Free for personal, academic, and commercial use
- ✅ Modification and redistribution permitted with attribution
- ✅ Derivative models and fine-tunes welcome
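- ⚠️ Derivative releases must include a copy of the Apache 2.0 license and keep attribution to AxonLabs / `Daffaadityp/AxonAI-MX4-2.0`
- ❌ Please do not use this model to generate illegal, harmful, or deceptive content (an AxonLabs usage request; Apache 2.0 itself imposes no use restrictions)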

---

## 🔗 Related Resources

| Resource | Link |
|---|---|
| 🧠 Original FP16 Model | [Daffaadityp/AxonAI-MX4-2.0](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0) |
| 📦 llama.cpp Repository | [github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) |
| 🦙 Ollama Documentation | [ollama.com](https://ollama.com) |
| 🖥️ LM Studio | [lmstudio.ai](https://lmstudio.ai) |
| 🏫 AxonLabs / SMKN 26 Jakarta | [Daffaadityp on HuggingFace](https://huggingface.co/Daffaadityp) |

---

## 💬 Community & Feedback

Found a bug? Have a benchmark result to share? Want to contribute evaluation data?

- **Open a Discussion** on this HuggingFace repository
- **Open an Issue** on the [AxonAI GitHub](https://github.com/Daffaadityp) (if available)
- **Community evaluations are actively welcomed** — especially Indonesian-language benchmarks

---

<div align="center">

*Built with 🧠 by AxonLabs · SMKN 26 Jakarta · Indonesia 🇮🇩*

*"Intelligence is not about speed. It's about depth of thought."*

*"Michie Edition"*

[![HuggingFace](https://img.shields.io/badge/🤗%20HuggingFace-Daffaadityp-yellow?style=for-the-badge)](https://huggingface.co/Daffaadityp)

</div>