Instructions for using Daffaadityp/PoterryAI with libraries and local apps.

Libraries

How to use Daffaadityp/PoterryAI with llama-cpp-python:

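# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
    repo_id="Daffaadityp/PoterryAI",
    filename="AxonAI-MX4-2.0-Q2_K.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# The reply text is in the first choice's message.
print(response["choices"][0]["message"]["content"])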

Local Apps

llama.cpp

How to use Daffaadityp/PoterryAI with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Daffaadityp/PoterryAI:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Daffaadityp/PoterryAI:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Daffaadityp/PoterryAI:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Daffaadityp/PoterryAI:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Daffaadityp/PoterryAI:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Daffaadityp/PoterryAI:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Daffaadityp/PoterryAI:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Daffaadityp/PoterryAI:Q4_K_M

Use Docker

docker model run hf.co/Daffaadityp/PoterryAI:Q4_K_M

vLLM

How to use Daffaadityp/PoterryAI with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Daffaadityp/PoterryAI"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Daffaadityp/PoterryAI",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Daffaadityp/PoterryAI:Q4_K_M

Ollama
How to use Daffaadityp/PoterryAI with Ollama:
```
ollama run hf.co/Daffaadityp/PoterryAI:Q4_K_M
```

Unsloth Studio

How to use Daffaadityp/PoterryAI with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Daffaadityp/PoterryAI to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Daffaadityp/PoterryAI to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Daffaadityp/PoterryAI to start chatting

Pi

How to use Daffaadityp/PoterryAI with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Daffaadityp/PoterryAI:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Daffaadityp/PoterryAI:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Daffaadityp/PoterryAI with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Daffaadityp/PoterryAI:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Daffaadityp/PoterryAI:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Daffaadityp/PoterryAI with Docker Model Runner:
```
docker model run hf.co/Daffaadityp/PoterryAI:Q4_K_M
```

Lemonade

How to use Daffaadityp/PoterryAI with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Daffaadityp/PoterryAI:Q4_K_M

Run and chat with the model

lemonade run user.PoterryAI-Q4_K_M

List all available models

lemonade list

🧠 Poterry AI — GGUF Quantized Edition

Reasoning-First Language Model · 4B Parameters · Chain-of-Thought Native

Optimized for Local Inference · Edge Devices · Laptops · Offline AI

This repository contains the official GGUF quantized files for AxonAI MX4 2.0. They let you run a Chain-of-Thought reasoning LLM entirely on your own hardware: no GPU is required, and once the files are downloaded there is no need for an internet connection and no API cost.

📌 Quick Navigation

| Section | Description |
| --- | --- |
| 🗂️ Available Files | Q2_K, Q4_K_M, Q8_0: which one is right for you? |
| 🚀 Ollama Quickstart | Easiest way to run locally, in one command |
| ⚙️ llama.cpp CLI | For advanced users and scripting |
| 🖥️ LM Studio / GPT4All | GUI-based local inference |
| 🧬 Why Quantized Reasoning? | Why reasoning holds up under GGUF quantization |
| 🛠️ Prompt Format | How to structure your prompts |
| 🇮🇩 Indonesian Community | For developers in Indonesia |

🌐 What Is This Repository?

This is the official GGUF release of AxonAI MX4 2.0, a 4-billion-parameter reasoning-first language model built by AxonLabs (SMKN 26 Jakarta). The original model was trained using DoRA (Weight-Decomposed Low-Rank Adaptation) on top of the Qwen3 architecture, fine-tuned to produce structured, transparent Chain-of-Thought (<think>) reasoning before every final response.

These GGUF files were produced using llama.cpp's official quantization pipeline, preserving the model's reasoning depth while dramatically reducing memory footprint — making local LLM inference accessible on consumer hardware.

If you want the full-precision FP16/BF16 weights, visit the original repository: 👉 Daffaadityp/AxonAI-MX4-2.0

🗂️ Available GGUF Files & Quantization Guide

Choose the quantization level that fits your hardware. As a general rule, a higher number after the Q (more bits per weight) means better quality and a higher RAM requirement.

| File | Quant Type | Size (Est.) | Min RAM | Quality | Use Case |
| --- | --- | --- | --- | --- | --- |
| `AxonAI-MX4-2.0-Q2_K.gguf` | Q2_K | ~1.7 GB | 4 GB | ⚡ Fast / Compressed | Raspberry Pi, very old laptops, extreme RAM constraints |
| `AxonAI-MX4-2.0-Q4_K_M.gguf` | Q4_K_M | ~2.7 GB | 6 GB | ⭐ Recommended | Mac M1/M2, standard laptops, WSL2, most modern CPUs |
| `AxonAI-MX4-2.0-Q8_0.gguf` | Q8_0 | ~4.5 GB | 8 GB | 🔬 Near-FP16 | Workstations, gaming PCs with ample RAM, power users |
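
The table above can be read as a simple selection rule. The sketch below is illustrative only: `pick_quant` is a hypothetical helper, and the sizes and minimum-RAM figures are copied from the table's estimates rather than measured.

```python
from typing import Optional

# File names, estimated sizes, and minimum RAM as listed in the table above.
QUANTS = [
    # (file name, estimated size in GB, minimum RAM in GB)
    ("AxonAI-MX4-2.0-Q2_K.gguf", 1.7, 4),
    ("AxonAI-MX4-2.0-Q4_K_M.gguf", 2.7, 6),
    ("AxonAI-MX4-2.0-Q8_0.gguf", 4.5, 8),
]


def pick_quant(ram_gb: float) -> Optional[str]:
    """Return the largest listed file whose minimum RAM fits, or None."""
    best = None
    for name, _size_gb, min_ram_gb in QUANTS:
        if ram_gb >= min_ram_gb:
            best = name  # the list is ordered from smallest to largest
    return best


for ram in (2, 4, 6, 16):
    print(f"{ram} GB RAM -> {pick_quant(ram)}")
```

This picks the highest-quality file that fits. If other applications share the machine's memory, stepping down one level (for example to Q4_K_M on an 8 GB machine) leaves more headroom.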

⭐ Recommendation: Start with `Q4_K_M`

Q4_K_M is the usual starting point for local LLM inference. It offers:

- Quality close to the full-precision model (the ~95% figure is an informal estimate; formal benchmark results are not yet published) at less than 35% of the FP16 memory cost
- Good performance on Apple Silicon (M1/M2/M3), standard x86 laptops, and cloud VMs
- A practical balance of inference speed, reasoning coherence, and RAM efficiency

💡 For most users, Q4_K_M is the right choice. Start here.
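
The memory figure can be sanity-checked with rough arithmetic. This sketch assumes "4B parameters" means exactly 4 billion weights and that full precision means 2 bytes per weight (FP16/BF16). The 2.7 GB value is the table's estimate, and the result covers weights only, not the KV cache or runtime overhead.

```python
PARAMS = 4e9        # assumption: "4B parameters" taken as exactly 4 billion
FP16_BYTES = 2      # bytes per weight at 16-bit precision
Q4_K_M_GB = 2.7     # estimated Q4_K_M file size from the table

fp16_gb = PARAMS * FP16_BYTES / 1e9
ratio = Q4_K_M_GB / fp16_gb

print(f"FP16 weights: ~{fp16_gb:.1f} GB")   # FP16 weights: ~8.0 GB
print(f"Q4_K_M / FP16: {ratio:.0%}")        # Q4_K_M / FP16: 34%
```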

🚀 Ollama Quickstart (Recommended)

Ollama is the fastest way to run AxonAI MX4 2.0 locally. No Python setup required.

Step 1 — Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: Download installer from https://ollama.com/download

Step 2 — Create a Modelfile

Create a file named Modelfile (no extension) in your working directory:

# Modelfile for AxonAI MX4 2.0 (Q4_K_M - Recommended)
FROM ./AxonAI-MX4-2.0-Q4_K_M.gguf

# --- Core Identity & Reasoning System Prompt ---
SYSTEM """
You are AxonAI, an advanced reasoning assistant developed by AxonLabs.
Before answering any question, you MUST use your internal scratchpad enclosed in <think>...</think> tags to reason step-by-step.
Only after completing your reasoning should you provide a clear, structured, and helpful final answer.
Be precise, thorough, and transparent in your logic.
"""

# --- Generation Parameters (Optimized for Reasoning) ---
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192

💡 Why the <think> system prompt? AxonAI MX4 2.0 was fine-tuned with Chain-of-Thought supervision. Including this system prompt unlocks the model's full reasoning capability. Without it, you may get direct answers without the structured deliberation the model was trained to produce.
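
To check whether a reply actually contains the structured deliberation, the `<think>` block can be separated from the final answer. This is a minimal sketch: `split_reasoning` is a hypothetical helper, and it assumes at most one `<think>...</think>` block per reply.

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(reply: str) -> Tuple[Optional[str], str]:
    """Split a reply into (reasoning, final_answer).

    reasoning is None when the reply has no <think>...</think> block.
    """
    match = THINK_RE.search(reply)
    if match is None:
        return None, reply.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    answer = (reply[:match.start()] + reply[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```

A `None` in the first position is a quick signal that the system prompt was not applied.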

Step 3 — Build and Run

# Build the local Ollama model from your Modelfile
ollama create axonai-mx4 -f ./Modelfile

# Run it interactively
ollama run axonai-mx4

# Or run with a direct prompt
ollama run axonai-mx4 "Explain the P vs NP problem and whether you think it will ever be solved."

Using the Ollama REST API

Once running, Ollama exposes a local REST API — perfect for integrations:

curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "axonai-mx4",
    "prompt": "What are the ethical implications of deploying AI in judicial systems?",
    "stream": false
  }'
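
The same call can be made from Python with only the standard library. This is a sketch, not an official client: `build_request` and `generate` are hypothetical names, and it assumes the `axonai-mx4` model created in Step 3 and Ollama listening on its default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call


def build_request(prompt: str, model: str = "axonai-mx4") -> urllib.request.Request:
    """Build the POST request; nothing is sent until it is opened."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "axonai-mx4") -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]  # with "stream": false the full text is in "response"
```

With the server running, `generate("What is 17 * 23?")` returns the model's reply as a single string.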

⚙️ llama.cpp CLI

For advanced users, scripting pipelines, or maximum performance control.

Install llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
# nproc may be missing on macOS; there, use -j$(sysctl -n hw.ncpu) instead
cmake --build build --config Release -j$(nproc)

Run Inference

# Basic interactive mode (Q4_K_M recommended)
./build/bin/llama-cli \
  -m ./AxonAI-MX4-2.0-Q4_K_M.gguf \
  -n 2048 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.1 \
  --ctx-size 8192 \
  -i \
  -r "User:" \
  --in-prefix " " \
  -p "You are AxonAI, a reasoning assistant. Think step by step inside <think> tags before answering.\n\nUser:"

# Single-shot inference (batch/scripting)
./build/bin/llama-cli \
  -m ./AxonAI-MX4-2.0-Q8_0.gguf \
  -n 1024 \
  --temp 0.6 \
  --ctx-size 8192 \
  -p "<|im_start|>system\nYou are AxonAI. Reason carefully using <think> tags.<|im_end|>\n<|im_start|>user\nSolve: If a train travels 120km at 60km/h, then 80km at 40km/h, what is the average speed for the whole journey?<|im_end|>\n<|im_start|>assistant\n"

🔧 Performance tip: If you have a GPU (NVIDIA, AMD, or Apple Metal) and a GPU-enabled build, add the -ngl 99 flag to offload layers to it. This can yield a substantial speedup (roughly 3–10x, depending on hardware), even with quantized GGUF files.

🖥️ LM Studio / GPT4All

Both LM Studio and GPT4All support direct GGUF loading with a graphical interface — ideal for non-technical users or demos.

LM Studio:

1. Download from lmstudio.ai
2. Go to Search and search for AxonAI, or import the GGUF manually via My Models
3. Load AxonAI-MX4-2.0-Q4_K_M.gguf
4. In the System Prompt field, paste the reasoning system prompt from the Modelfile above
5. Start chatting. LM Studio can also expose a local OpenAI-compatible API on port 1234
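
Once the local API is enabled, it can be called with the OpenAI chat-completions request shape. The sketch below uses only the standard library. `MODEL_ID` is a placeholder (use whatever identifier LM Studio shows for the loaded model), and `chat_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"   # LM Studio port mentioned above
MODEL_ID = "axonai-mx4-2.0"             # placeholder: copy the identifier from LM Studio
SYSTEM_PROMPT = "You are AxonAI. Always reason inside <think>...</think> before your final answer."


def chat_payload(user_message: str) -> dict:
    """Build an OpenAI-style chat request with the reasoning system prompt."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.6,  # same value as the Modelfile
    }


def chat(user_message: str) -> str:
    """POST to /chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

Changing `BASE_URL` is the only adjustment needed to target another OpenAI-compatible server.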

GPT4All:

1. Download from gpt4all.io
2. Under Add Model, choose Import from file and select your .gguf file

GPT4All works entirely offline after the initial load, which suits privacy-sensitive use cases.

🧬 Why a Quantized Reasoning Model Is So Powerful

Many local LLMs answer directly, without showing any intermediate reasoning. AxonAI MX4 2.0 takes a different approach.

It was trained to reason before it answers: every response is preceded by a deliberation pass enclosed in <think>...</think> tags. This is the Chain-of-Thought (CoT) paradigm, and applying it to a quantized local model has several useful properties:

🔒 Complete Privacy, Full Intelligence

Your prompts never leave your machine. Unlike cloud LLM APIs, no data is sent to any server, and the model is designed to deliver structured reasoning entirely offline. This matters for tasks such as:

- Legal document analysis
- Medical note summarization
- Private financial reasoning
- Proprietary code review

📉 Quantization ≠ Reasoning Degradation

Factual recall can suffer under quantization (more hallucination), but structured reasoning is expected to be more robust. The design goal is that the logical flow learned during DoRA fine-tuning survives 4-bit precision: the model still deliberates, still checks its own steps, and still produces structured conclusions. Formal benchmarks to confirm this are not yet published.

🧩 The DoRA Advantage

AxonAI MX4 2.0 was adapted using DoRA (Weight-Decomposed Low-Rank Adaptation), which decomposes each pretrained weight matrix into a magnitude component and a direction component and applies the low-rank update to the direction. This is intended to give more stable fine-tuning than standard LoRA, and that stability is expected to carry through quantization, so the model reasons faithfully even at Q4 compression.
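
The magnitude/direction split can be illustrated with a toy example. This sketch follows the general DoRA formulation (column-wise normalization of the updated weight, rescaled by a learned magnitude vector). It is not the training code used for this model, and all matrices and helper names are made-up toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, b, a, magnitude):
    """Recompose W' = magnitude * (W0 + B A) / ||W0 + B A||, column by column."""
    delta = matmul(b, a)  # low-rank update, as in LoRA
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))] for i in range(len(w0))]
    norms = column_norms(v)  # direction = v with each column scaled to unit norm
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))] for i in range(len(v))]


w0 = [[3.0, 0.0], [4.0, 1.0]]   # toy "pretrained" weight
b = [[1.0], [0.0]]              # rank-1 factors (toy values)
a = [[0.0, 1.0]]
magnitude = [10.0, 2.0]         # learned per-column magnitudes (toy values)

w = dora_weight(w0, b, a, magnitude)
print(column_norms(w))  # each column's norm equals its magnitude, up to rounding
```

Because the low-rank update only changes the direction, the length of each column is controlled separately by `magnitude`.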

⚡ The Efficiency Equation

A 4B-parameter model at Q4_K_M runs at roughly 20–60 tokens/second on Apple M-series chips and modern CPUs, depending on hardware. At that rate a 1,000-token reasoning trace takes about 17–50 seconds, which is fast enough for interactive use: a careful analyst available offline on any machine.

🛠️ Prompt & System Format

AxonAI MX4 2.0 uses the ChatML prompt template (inherited from Qwen3):

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
<think>
{internal reasoning — model generates this}
</think>
{final answer — model generates this}
<|im_end|>
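
When a tool needs the raw prompt string (for example the `-p` argument of llama-cli), the template can be filled in programmatically. A minimal sketch, where `build_chatml_prompt` is a hypothetical helper name. It stops after opening the assistant turn so that the model generates the `<think>` block and the answer itself.

```python
def build_chatml_prompt(system_prompt: str, user_message: str) -> str:
    """Render one system turn and one user turn, then open the assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    "You are AxonAI. Always reason inside <think>...</think> before your final answer.",
    "What is 17 * 23?",
)
print(prompt)
```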

Recommended System Prompt (Full Version)

You are AxonAI, an advanced reasoning language model developed by AxonLabs.
Your core capability is structured deliberation: before answering any question,
you MUST think step-by-step inside <think>...</think> tags.

Guidelines:
- Use <think> to break down the problem, consider edge cases, and verify your logic.
- After </think>, give a clear, well-structured, and helpful final answer.
- Be honest about uncertainty. Never fabricate facts.
- For math and logic, show your work explicitly inside <think>.
- For creative or open-ended tasks, use <think> to plan your response structure.

Minimal System Prompt (Fast / Lightweight)

You are AxonAI. Always reason inside <think>...</think> before your final answer.

📊 Model Architecture & Training Summary

| Property | Value |
| --- | --- |
| Base Architecture | Qwen3 (4B) |
| Fine-Tuning Method | DoRA (Weight-Decomposed Low-Rank Adaptation) |
| Training Paradigm | Chain-of-Thought Supervised Fine-Tuning |
| Context Window | 8,192 tokens |
| Vocab Size | 151,936 |
| Attention Heads | 32 |
| Key-Value Heads | 8 (Grouped Query Attention) |
| Hidden Dimensions | 2,560 |
| GGUF Quantizer | llama.cpp (official) |
| Available Quants | Q2_K, Q4_K_M, Q8_0 |
| Language Support | English (primary), Indonesian (strong) |
| License | Apache 2.0 |

🔬 Benchmark Context

AxonAI MX4 2.0 is a research and educational model from AxonLabs. Formal benchmark results are forthcoming. The following reflects qualitative design targets based on the training methodology.

| Capability | Assessment |
| --- | --- |
| Structured Reasoning (CoT) | ✅ Strong (core training objective) |
| Mathematical Problem Solving | ✅ Good (benefits from step-by-step CoT) |
| Code Generation (Python/JS) | ✅ Good |
| Factual Q&A (English) | ✅ Good |
| Indonesian Language (id) | ✅ Good |
| Long-Context Coherence (8K) | ⚠️ Moderate (improves with Q8_0) |
| Complex Multi-Step Agentic Tasks | ⚠️ Moderate (use longer system prompts) |

Community evaluations and PR-based benchmark additions are welcome.

🇮🇩 For Indonesian Developers

Hello, Indonesian developers! 🙌

This is the first local AI model from AxonLabs that you can run 100% offline on your own laptop or PC, with no expensive GPU, no API fees, and no internet connection.

Imagine having an AI assistant that can think step by step, understand context, and answer complex questions, all running inside your own machine. That is the goal of AxonAI MX4 2.0 GGUF.

Why does this matter to you?

🔒 Total privacy: your data never leaves your device
💸 Free forever: no subscription or token fees
🌐 Works offline: even in areas with limited connectivity
🧠 Reasoning-first: this model thinks before it answers instead of guessing

Built by vocational high school (SMK) students, for everyone in Indonesia who wants to explore AI hands-on.

"The best AI is the AI you can control yourself." (AxonLabs, SMKN 26 Jakarta)

The fastest way to get started (5 minutes):

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Create a Modelfile (see the guide above), then:
ollama create axonai-mx4 -f ./Modelfile

# 3. Run it! The prompt asks, in Indonesian, for an easy-to-understand explanation of how the transformer architecture works.
ollama run axonai-mx4 "Jelaskan cara kerja transformer architecture dalam bahasa yang mudah dipahami."

⚖️ License & Usage

This model is released under the Apache 2.0 License.

✅ Free for personal, academic, and commercial use
✅ Modification and redistribution permitted with attribution
✅ Derivative models and fine-tunes welcome
❌ Must not be used to generate illegal, harmful, or deceptive content
❌ Derivative releases must not omit attribution to AxonLabs / Daffaadityp/AxonAI-MX4-2.0

🔗 Related Resources

| Resource | Link |
| --- | --- |
| 🧠 Original FP16 Model | Daffaadityp/AxonAI-MX4-2.0 |
| 📦 llama.cpp Repository | github.com/ggerganov/llama.cpp |
| 🦙 Ollama Documentation | ollama.com/docs |
| 🖥️ LM Studio | lmstudio.ai |
| 🏫 AxonLabs / SMKN 26 Jakarta | Daffaadityp on HuggingFace |

💬 Community & Feedback

Found a bug? Have a benchmark result to share? Want to contribute evaluation data?

- Open a Discussion on this HuggingFace repository
- Open an Issue on the AxonAI GitHub (if available)
- Community evaluations are actively welcomed, especially Indonesian-language benchmarks

Built with 🧠 by AxonLabs · SMKN 26 Jakarta · Indonesia 🇮🇩

"Intelligence is not about speed. It's about depth of thought."

"Michie Edition"


Model tree for Daffaadityp/PoterryAI

Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B
Finetuned: unsloth/Qwen3-4B
Finetuned: Daffaadityp/AxonAI-MX4-2.0
Quantized (2): this model
