---
base_model: Daffaadityp/AxonAI-MX4-2.0
language:
- en
- id
license: apache-2.0
tags:
- gguf
- quantized
- qwen3
- dora
- axonlabs
- reasoning
- local-llm
- chain-of-thought
- edge-ai
- ollama
- llama-cpp
- indonesian-ai
- text-generation
- 4b
- instruct
pipeline_tag: text-generation
library_name: gguf
---
<div align="center">
# Poterry AI: GGUF Quantized Edition
### *Reasoning-First Language Model · 4B Parameters · Chain-of-Thought Native*
### *Optimized for Local Inference · Edge Devices · Laptops · Offline AI*
<br>
<br>
> **This repository contains the official GGUF quantized files for AxonAI MX4 2.0.**
> Run a full Chain-of-Thought reasoning LLM *entirely locally*: no GPU required, no internet connection, no API costs. Just pure, structured intelligence on your own hardware.
</div>
---
## Quick Navigation
| Section | Description |
|---|---|
| [Available Files](#available-gguf-files--quantization-guide) | Q2_K, Q4_K_M, Q8_0: which one is right for you? |
| [Ollama Quickstart](#ollama-quickstart-recommended) | Easiest way to run locally, with one command |
| [llama.cpp CLI](#llamacpp-cli) | For advanced users and scripting |
| [LM Studio / GPT4All](#lm-studio--gpt4all) | GUI-based local inference |
| [Why Quantized Reasoning?](#why-a-quantized-reasoning-model-is-so-powerful) | The secret sauce, explained for GGUF |
| [Prompt Format](#prompt--system-format) | How to structure your prompts |
| [Indonesian Community](#for-indonesian-developers) | For developers in Indonesia |
---
## What Is This Repository?
This is the **official GGUF release** of [AxonAI MX4 2.0](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0), a 4-billion-parameter reasoning-first language model built by **AxonLabs** (SMKN 26 Jakarta). The original model was trained using **DoRA (Weight-Decomposed Low-Rank Adaptation)** on top of the Qwen3 architecture, fine-tuned to produce structured, transparent Chain-of-Thought (`<think>`) reasoning before every final response.
These GGUF files were produced using `llama.cpp`'s official quantization pipeline, preserving the model's reasoning depth while dramatically reducing memory footprint, which makes **local LLM inference** accessible on consumer hardware.
**If you want the full-precision FP16/BF16 weights**, visit the original repository:
[`Daffaadityp/AxonAI-MX4-2.0`](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0)
---
## Available GGUF Files & Quantization Guide
Choose the right quantization level for your hardware. As a general rule: **higher Q = better quality, higher RAM requirement**.
| File | Quant Type | Size (Est.) | Min RAM | Quality | Use Case |
|---|---|---|---|---|---|
| `AxonAI-MX4-2.0-Q2_K.gguf` | Q2_K | ~1.7 GB | 4 GB | Fast / Compressed | Raspberry Pi, very old laptops, extreme RAM constraints |
| `AxonAI-MX4-2.0-Q4_K_M.gguf` | Q4_K_M | ~2.7 GB | 6 GB | **Recommended** | Mac M1/M2, standard laptops, WSL2, most modern CPUs |
| `AxonAI-MX4-2.0-Q8_0.gguf` | Q8_0 | ~4.5 GB | 8 GB | Near-FP16 | Workstations, gaming PCs with ample RAM, power users |
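The table can be turned into a small lookup for scripts that need to choose a file automatically. The following is a minimal sketch, not part of the official tooling: the helper name `pick_quant` is hypothetical, and the thresholds are simply the "Min RAM" values copied from the table.

```python
from typing import Optional

# Minimum RAM figures (GB) copied from the table above.
QUANT_MIN_RAM_GB = {
    "AxonAI-MX4-2.0-Q2_K.gguf": 4,
    "AxonAI-MX4-2.0-Q4_K_M.gguf": 6,
    "AxonAI-MX4-2.0-Q8_0.gguf": 8,
}


def pick_quant(available_ram_gb: float) -> Optional[str]:
    """Return the largest listed file whose minimum RAM fits, or None.

    Hypothetical helper: it assumes a higher minimum RAM always means a
    higher-quality quantization, which holds for the three files listed.
    """
    candidates = [
        (min_ram, name)
        for name, min_ram in QUANT_MIN_RAM_GB.items()
        if min_ram <= available_ram_gb
    ]
    if not candidates:
        return None
    return max(candidates)[1]


for ram in (2, 4, 7, 16):
    print(f"{ram} GB RAM -> {pick_quant(ram)}")
```

A machine that falls below the 4 GB minimum gets `None`, which a caller can treat as "no listed file fits".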
### Recommendation: Start with `Q4_K_M`
`Q4_K_M` is the universally recommended sweet spot for local LLM inference. It delivers:
- **Roughly 95% of the full-precision model quality** (a rule-of-thumb estimate, not a measured benchmark) at less than 35% of the FP16 memory cost
- Excellent performance on **Apple Silicon (M1/M2/M3)**, standard x86 laptops, and cloud VMs
- The best balance of **inference speed**, **reasoning coherence**, and **RAM efficiency**
> For most users: **Q4_K_M is the right choice. Start here.**
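The "less than 35% of the memory cost" figure can be sanity-checked with simple arithmetic. This sketch assumes exactly 4 billion parameters, 2 bytes per FP16 weight, and 1 GB = 10^9 bytes; it compares file sizes only and ignores runtime overhead such as the KV cache.

```python
PARAMS = 4e9               # "4B" parameters, as stated on this card (assumed exact)
FP16_BYTES_PER_PARAM = 2   # 16-bit weights

# Estimated file sizes from the quantization table, in GB.
QUANT_SIZE_GB = {"Q2_K": 1.7, "Q4_K_M": 2.7, "Q8_0": 4.5}

fp16_size_gb = PARAMS * FP16_BYTES_PER_PARAM / 1e9


def size_ratio(quant: str) -> float:
    """Fraction of the estimated FP16 size taken by the given quantization."""
    return QUANT_SIZE_GB[quant] / fp16_size_gb


for quant in QUANT_SIZE_GB:
    print(f"{quant}: {size_ratio(quant):.1%} of ~{fp16_size_gb:.0f} GB FP16")
```

Under these assumptions Q4_K_M comes out at about 34% of the FP16 size, consistent with the claim above.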
---
## Ollama Quickstart (Recommended)
[Ollama](https://ollama.com) is the fastest way to run AxonAI MX4 2.0 locally. No Python setup required.
### Step 1: Install Ollama
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download installer from https://ollama.com/download
```
### Step 2: Create a Modelfile
Create a file named `Modelfile` (no extension) in your working directory:
```dockerfile
# Modelfile for AxonAI MX4 2.0 (Q4_K_M - Recommended)
FROM ./AxonAI-MX4-2.0-Q4_K_M.gguf
# --- Core Identity & Reasoning System Prompt ---
SYSTEM """
You are AxonAI, an advanced reasoning assistant developed by AxonLabs.
Before answering any question, you MUST use your internal scratchpad enclosed in <think>...</think> tags to reason step-by-step.
Only after completing your reasoning should you provide a clear, structured, and helpful final answer.
Be precise, thorough, and transparent in your logic.
"""
# --- Generation Parameters (Optimized for Reasoning) ---
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
```
> **Why the `<think>` system prompt?** AxonAI MX4 2.0 was fine-tuned with Chain-of-Thought supervision. Including this system prompt *unlocks* the model's full reasoning capability. Without it, you may get direct answers without the structured deliberation the model was trained to produce.
### Step 3: Build and Run
```bash
# Build the local Ollama model from your Modelfile
ollama create axonai-mx4 -f ./Modelfile
# Run it interactively
ollama run axonai-mx4
# Or run with a direct prompt
ollama run axonai-mx4 "Explain the P vs NP problem and whether you think it will ever be solved."
```
### Using the Ollama REST API
Once running, Ollama exposes a local REST API, which is convenient for integrations:
```bash
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "axonai-mx4",
    "prompt": "What are the ethical implications of deploying AI in judicial systems?",
    "stream": false
  }'
```
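The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl request above; the helper names `build_generate_request` and `generate` are illustrative, and `generate` needs a running Ollama server with the `axonai-mx4` model already created.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the curl example


def build_generate_request(prompt: str, model: str = "axonai-mx4") -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": false, Ollama returns a single JSON object whose
    # "response" field holds the generated text.
    return body["response"]
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything touches the network.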
---
## llama.cpp CLI
For advanced users, scripting pipelines, or maximum performance control.
### Install llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
```
### Run Inference
```bash
# Basic interactive mode (Q4_K_M recommended)
./build/bin/llama-cli \
-m ./AxonAI-MX4-2.0-Q4_K_M.gguf \
-n 2048 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.1 \
--ctx-size 8192 \
-i \
-r "User:" \
--in-prefix " " \
-p "You are AxonAI, a reasoning assistant. Think step by step inside <think> tags before answering.\n\nUser:"
```
```bash
# Single-shot inference (batch/scripting)
./build/bin/llama-cli \
-m ./AxonAI-MX4-2.0-Q8_0.gguf \
-n 1024 \
--temp 0.6 \
--ctx-size 8192 \
-p "<|im_start|>system\nYou are AxonAI. Reason carefully using <think> tags.<|im_end|>\n<|im_start|>user\nSolve: If a train travels 120km at 60km/h, then 80km at 40km/h, what is the average speed for the whole journey?<|im_end|>\n<|im_start|>assistant\n"
```
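When checking the model's output for the train question in the single-shot prompt, it helps to know the expected answer. Average speed is total distance divided by total time, not the mean of the two speeds. A short sketch (the function name `average_speed` is illustrative):

```python
def average_speed(legs):
    """Average speed over a list of (distance_km, speed_kmh) legs."""
    total_distance = sum(distance for distance, _ in legs)
    total_time = sum(distance / speed for distance, speed in legs)
    return total_distance / total_time


# 120 km at 60 km/h takes 2 h; 80 km at 40 km/h takes 2 h.
legs = [(120, 60), (80, 40)]
print(average_speed(legs))  # 200 km / 4 h = 50.0 km/h
```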
> **Performance tip:** Add the `-ngl 99` flag if you have a GPU (NVIDIA/AMD/Metal) to offload layers. This can yield a **3–10x speedup** even with quantized GGUF files.
---
## LM Studio / GPT4All
Both LM Studio and GPT4All support direct GGUF loading with a graphical interface, which is ideal for non-technical users or demos.
**LM Studio:**
1. Download from [lmstudio.ai](https://lmstudio.ai)
2. Go to **Search** and search for `AxonAI`, or import the GGUF manually via **My Models**
3. Load `AxonAI-MX4-2.0-Q4_K_M.gguf`
4. In the **System Prompt** field, paste the reasoning system prompt from the Modelfile above
5. Start chatting. LM Studio also exposes a local OpenAI-compatible API on port `1234`
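Because the LM Studio server is OpenAI-compatible, a chat request can be sketched with the standard library alone. Several things here are assumptions rather than facts from this card: the `/v1/chat/completions` path follows the usual OpenAI convention, `MODEL_ID` is a hypothetical placeholder for whatever identifier LM Studio displays for the loaded model, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"   # port from the step above; path is the OpenAI convention
MODEL_ID = "axonai-mx4-2.0-q4_k_m"      # hypothetical; use the id LM Studio shows


def build_chat_request(user_message: str, system_prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.6,  # same value as the Modelfile in this card
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str, system_prompt: str) -> str:
    """Send the request; requires LM Studio's local server to be running."""
    with urllib.request.urlopen(build_chat_request(user_message, system_prompt)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```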
**GPT4All:**
1. Download from [nomic.ai/gpt4all](https://www.nomic.ai/gpt4all)
2. Under **Add Model**, choose **Import from file** and select your `.gguf` file
3. GPT4All works entirely offline after the initial load, which suits privacy-sensitive use cases
---
## Why a Quantized Reasoning Model Is So Powerful
Most local LLMs are **answer-first**: they pattern-match to the most statistically likely response. AxonAI MX4 2.0 is fundamentally different.
It was trained to **reason before it answers**, meaning every response is preceded by an internal deliberation process encoded inside `<think>...</think>` tags. This is the Chain-of-Thought (CoT) paradigm, and when applied to a quantized local model, several powerful properties emerge:
### Complete Privacy, Full Intelligence
Your prompts **never leave your machine**. Unlike cloud LLM APIs, no data is sent to any server. You get structured reasoning capability that rivals much larger models, entirely offline. This is essential for:
- Legal document analysis
- Medical note summarization
- Private financial reasoning
- Proprietary code review
### Quantization ≠ Reasoning Degradation
Unlike factual recall (where quantization can cause more hallucination), **structured reasoning is surprisingly robust** to quantization. The logical flow encoded during DoRA fine-tuning is preserved at 4-bit precision. The model still deliberates. It still checks its own steps. It still produces structured conclusions.
### The DoRA Advantage
AxonAI MX4 2.0 was adapted using **DoRA (Weight-Decomposed Low-Rank Adaptation)**, which separates weight updates into magnitude and direction components. This produces **more stable, nuanced fine-tuning** than standard LoRA, and that stability carries through quantization. You get a model that reasons with fidelity even at Q4 compression.
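For reference, the magnitude/direction split can be written out as follows. The notation is the one commonly used in the DoRA paper, not something taken from the AxonAI training code, so treat it as a general description of the method rather than of this model's exact configuration:

$$
W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}
$$

Here $W_0$ is the frozen pretrained weight matrix, $B$ and $A$ are the trainable low-rank matrices (as in LoRA), $m$ is a trainable magnitude vector initialized to $\lVert W_0 \rVert_c$, and $\lVert \cdot \rVert_c$ denotes the column-wise vector norm. The fraction carries the direction and $m$ carries the magnitude.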
### The Efficiency Equation
A 4B parameter model at Q4_K_M runs at **~20–60 tokens/second** on Apple M-series chips and modern CPUs. That's fast enough for real-time, interactive reasoning: think of it as having a thoughtful senior analyst available offline, on any machine, forever.
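Since a reasoning model emits its `<think>` trace before the final answer, total response length matters for perceived latency. The arithmetic below is a rough sketch that uses the quoted 20 to 60 tokens/second range as given and ignores prompt-processing time; the function name is illustrative.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a constant decoding speed."""
    return num_tokens / tokens_per_second


for num_tokens in (256, 1024, 2048):
    fastest = generation_seconds(num_tokens, 60)
    slowest = generation_seconds(num_tokens, 20)
    print(f"{num_tokens} tokens: {fastest:.1f}s to {slowest:.1f}s")
```

At the low end of the range, a 1024-token response takes a little under a minute, which is worth keeping in mind when setting `-n` or similar output limits.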
---
## Prompt & System Format
AxonAI MX4 2.0 uses the **ChatML** prompt template (inherited from Qwen3):
```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
<think>
{internal reasoning: model generates this}
</think>
{final answer: model generates this}
<|im_end|>
```
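When calling a raw completion endpoint (for example `llama-cli -p`), the template has to be assembled by hand. A minimal sketch of a builder that stops at the assistant header so the model generates the `<think>` block and the answer itself; the function name `build_chatml_prompt` is illustrative:

```python
def build_chatml_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a ChatML prompt that ends where the model should start writing."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    "You are AxonAI. Always reason inside <think>...</think> before your final answer.",
    "What is 17 * 6?",
)
print(prompt)
```

Chat-style frontends such as Ollama apply the template themselves, so a builder like this is only needed for raw prompt interfaces.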
### Recommended System Prompt (Full Version)
```
You are AxonAI, an advanced reasoning language model developed by AxonLabs.
Your core capability is structured deliberation: before answering any question,
you MUST think step-by-step inside <think>...</think> tags.
Guidelines:
- Use <think> to break down the problem, consider edge cases, and verify your logic.
- After </think>, give a clear, well-structured, and helpful final answer.
- Be honest about uncertainty. Never fabricate facts.
- For math and logic, show your work explicitly inside <think>.
- For creative or open-ended tasks, use <think> to plan your response structure.
```
### Minimal System Prompt (Fast / Lightweight)
```
You are AxonAI. Always reason inside <think>...</think> before your final answer.
```
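Applications often want to show the final answer and keep the reasoning trace separate. The following sketch splits a raw completion on the first `<think>...</think>` pair; the name `split_reasoning` is illustrative, and it assumes the model emits at most one such block per response.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a raw model completion.

    If no <think> block is present, reasoning is an empty string and the
    whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        reasoning, rest = "", text
    else:
        reasoning, rest = match.group(1).strip(), text[match.end():]
    # Drop a trailing ChatML end-of-turn marker if the runtime left it in.
    answer = rest.replace("<|im_end|>", "").strip()
    return reasoning, answer


raw = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4.<|im_end|>"
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```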
---
## Model Architecture & Training Summary
| Property | Value |
|---|---|
| **Base Architecture** | Qwen3 (4B) |
| **Fine-Tuning Method** | DoRA (Weight-Decomposed Low-Rank Adaptation) |
| **Training Paradigm** | Chain-of-Thought Supervised Fine-Tuning |
| **Context Window** | 8,192 tokens |
| **Vocab Size** | 151,936 |
| **Attention Heads** | 32 |
| **Key-Value Heads** | 8 (Grouped Query Attention) |
| **Hidden Dimensions** | 2,560 |
| **GGUF Quantizer** | llama.cpp (official) |
| **Available Quants** | Q2_K, Q4_K_M, Q8_0 |
| **Language Support** | English (primary), Indonesian (strong) |
| **License** | Apache 2.0 |
---
## Benchmark Context
> AxonAI MX4 2.0 is a research and educational model from AxonLabs. Formal benchmark results are forthcoming. The following reflects qualitative design targets based on the training methodology.
| Capability | Assessment |
|---|---|
| Structured Reasoning (CoT) | Strong: core training objective |
| Mathematical Problem Solving | Good: benefits from step-by-step CoT |
| Code Generation (Python/JS) | Good |
| Factual Q&A (English) | Good |
| Indonesian Language (id) | Good |
| Long-Context Coherence (8K) | Moderate: improves with Q8_0 |
| Complex Multi-Step Agentic Tasks | Moderate: use longer system prompts |
*Community evaluations and PR-based benchmark additions are welcome.*
---
## For Indonesian Developers
**Hello, Indonesian developers!**
This is the first local AI model from AxonLabs that you can run **100% offline on your own laptop or PC**, with no expensive GPU, no API fees, and no internet connection.
Imagine having an AI assistant that thinks step by step, understands context, and answers complex questions, all running on your own machine. That is the goal of AxonAI MX4 2.0 GGUF.
**Why does this matter to you?**
- **Total privacy**: your data never leaves your device
- **Free forever**: no subscription or token fees
- **Works offline**: even in areas with limited connectivity
- **Reasoning-first**: this model *thinks first* before answering instead of guessing
Built by vocational high school (SMK) students, for every Indonesian who wants to explore AI hands-on.
> *"The best AI is the AI you can control yourself."*
> – AxonLabs, SMKN 26 Jakarta
**The fastest way to get started (5 minutes):**
```bash
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Create a Modelfile (see the guide above), then:
ollama create axonai-mx4 -f ./Modelfile
# 3. Run it!
ollama run axonai-mx4 "Jelaskan cara kerja transformer architecture dalam bahasa yang mudah dipahami."
```
---
## License & Usage
This model is released under the **Apache 2.0 License**.
- Free for personal, academic, and commercial use
- Modification and redistribution permitted with attribution
- Derivative models and fine-tunes welcome
- Must not be used to generate illegal, harmful, or deceptive content
- Attribution to AxonLabs / `Daffaadityp/AxonAI-MX4-2.0` required for derivative releases
---
## Related Resources
| Resource | Link |
|---|---|
| Original FP16 Model | [Daffaadityp/AxonAI-MX4-2.0](https://huggingface.co/Daffaadityp/AxonAI-MX4-2.0) |
| llama.cpp Repository | [github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) |
| Ollama Documentation | [ollama.com](https://ollama.com) |
| LM Studio | [lmstudio.ai](https://lmstudio.ai) |
| AxonLabs / SMKN 26 Jakarta | [Daffaadityp on HuggingFace](https://huggingface.co/Daffaadityp) |
---
## Community & Feedback
Found a bug? Have a benchmark result to share? Want to contribute evaluation data?
- **Open a Discussion** on this HuggingFace repository
- **Open an Issue** on the [AxonAI GitHub](https://github.com/Daffaadityp) (if available)
- **Community evaluations are actively welcomed**, especially Indonesian-language benchmarks
---
<div align="center">
*Built by AxonLabs · SMKN 26 Jakarta · Indonesia*
*"Intelligence is not about speed. It's about depth of thought."*
*"Michie Edition"*
</div>