This is a static quantization of NoesisLab/Kai-3B-Instruct, made by SimplySara.

Quantization quality relative to the BF16 baseline (Size_GB = file size in GB; BPW = bits per weight; PPL_Q = perplexity of the quantized model; KLD = KL divergence from the baseline's token distributions; Top_P_Match = how often the quantized model's top-ranked token matches the baseline's):

Model Size_GB BPW PPL_Q KLD_Mean KLD_Max Top_P_Match
Kai-3B-Instruct-BF16.gguf 5.735 16.02 12.2614 -1.2e-05 4e-06 100.000%
Kai-3B-Instruct-MXFP4_MOE.gguf 3.051 8.52 12.268 0.001919 0.161748 97.288%
Kai-3B-Instruct-i1-MXFP4_MOE.gguf 3.051 8.52 12.268 0.001919 0.161748 97.288%
Kai-3B-Instruct-Q8_0.gguf 3.051 8.52 12.268 0.001919 0.161748 97.288%
Kai-3B-Instruct-i1-Q8_0.gguf 3.051 8.52 12.268 0.001919 0.161748 97.288%
Kai-3B-Instruct-Q6_K.gguf 2.357 6.58 12.3055 0.009404 0.366649 94.435%
Kai-3B-Instruct-i1-Q6_K.gguf 2.357 6.58 12.3486 0.008842 0.528699 94.605%
Kai-3B-Instruct-Q5_1.gguf 2.173 6.07 12.4607 0.022546 1.62058 92.336%
Kai-3B-Instruct-i1-Q5_1.gguf 2.173 6.07 12.3913 0.015555 0.887861 93.164%
Kai-3B-Instruct-Q5_K_M.gguf 2.062 5.76 12.3932 0.015953 2.06684 93.315%
Kai-3B-Instruct-i1-Q5_K_M.gguf 2.062 5.76 12.3974 0.014712 1.21054 93.344%
Kai-3B-Instruct-i1-Q5_0.gguf 2.014 5.63 12.3845 0.018582 1.7811 92.676%
Kai-3B-Instruct-Q5_K_S.gguf 2.009 5.61 12.4705 0.021112 2.25188 92.477%
Kai-3B-Instruct-i1-Q5_K_S.gguf 2.009 5.61 12.422 0.016098 1.02742 93.198%
Kai-3B-Instruct-Q5_0.gguf 2.009 5.61 12.5354 0.024549 2.64757 91.846%
Kai-3B-Instruct-i1-Q4_1.gguf 1.845 5.16 12.6693 0.039282 2.17269 90.104%
Kai-3B-Instruct-Q4_1.gguf 1.845 5.16 12.8411 0.070893 9.75963 87.274%
Kai-3B-Instruct-i1-Q4_K_M.gguf 1.784 4.98 12.562 0.033791 2.37929 90.693%
Kai-3B-Instruct-Q4_K_M.gguf 1.784 4.98 12.5551 0.039329 8.08951 90.011%
Kai-3B-Instruct-IQ4_NL.gguf 1.697 4.74 12.6349 0.04746 3.75837 89.164%
Kai-3B-Instruct-Q4_K_S.gguf 1.693 4.73 12.6881 0.050317 7.15421 88.889%
Kai-3B-Instruct-i1-Q4_K_S.gguf 1.693 4.73 12.672 0.038976 2.35062 90.141%
Kai-3B-Instruct-i1-Q4_0.gguf 1.687 4.71 12.9318 0.056914 4.90942 88.242%
Kai-3B-Instruct-i1-IQ4_NL.gguf 1.686 4.71 12.7029 0.041041 2.82814 89.995%
Kai-3B-Instruct-Q4_0.gguf 1.682 4.7 13.1831 0.079359 5.30813 86.546%
Kai-3B-Instruct-IQ4_XS.gguf 1.619 4.52 12.6642 0.048527 3.11693 89.010%
Kai-3B-Instruct-i1-IQ4_XS.gguf 1.605 4.48 12.7351 0.042119 2.81661 89.976%
Kai-3B-Instruct-Q3_K_L.gguf 1.574 4.4 13.2229 0.095355 8.63835 85.518%
Kai-3B-Instruct-i1-Q3_K_L.gguf 1.574 4.4 13.2477 0.084668 5.71143 86.163%
Kai-3B-Instruct-Q3_K_M.gguf 1.463 4.09 13.3455 0.112669 9.19842 84.135%
Kai-3B-Instruct-i1-Q3_K_M.gguf 1.463 4.09 13.4095 0.095939 7.93677 85.368%
Kai-3B-Instruct-i1-IQ3_M.gguf 1.368 3.82 13.1481 0.112437 6.45799 84.307%
Kai-3B-Instruct-IQ3_M.gguf 1.368 3.82 14.5693 0.246713 7.29781 77.711%
Kai-3B-Instruct-IQ3_S.gguf 1.339 3.74 20.2851 0.623557 14.9444 66.169%
Kai-3B-Instruct-i1-IQ3_S.gguf 1.339 3.74 13.2823 0.120975 6.12451 83.724%
Kai-3B-Instruct-i1-Q3_K_S.gguf 1.334 3.73 14.4279 0.196396 11.9249 79.536%
Kai-3B-Instruct-Q3_K_S.gguf 1.334 3.73 14.5753 0.20947 10.2762 79.235%
Kai-3B-Instruct-i1-IQ3_XS.gguf 1.277 3.57 13.5713 0.149838 5.19091 81.978%
Kai-3B-Instruct-i1-IQ3_XXS.gguf 1.181 3.3 14.4968 0.218333 7.41132 78.317%
Kai-3B-Instruct-i1-Q2_K.gguf 1.167 3.26 17.0515 0.362859 13.7054 73.511%
Kai-3B-Instruct-Q2_K.gguf 1.167 3.26 18.421 0.471699 10.9955 70.276%
Kai-3B-Instruct-i1-Q2_K_S.gguf 1.096 3.06 19.0203 0.47105 9.39981 70.322%
Kai-3B-Instruct-i1-IQ2_M.gguf 1.048 2.93 16.8179 0.377914 8.06048 72.505%
Kai-3B-Instruct-i1-IQ2_S.gguf 0.974 2.72 18.9657 0.507571 10.146 68.855%
Kai-3B-Instruct-i1-IQ2_XS.gguf 0.946 2.64 20.7434 0.60263 12.2848 66.248%
Kai-3B-Instruct-i1-IQ2_XXS.gguf 0.868 2.42 28.0716 0.912772 20.8551 59.005%
Kai-3B-Instruct-i1-IQ1_M.gguf 0.776 2.17 56.0938 1.71797 16.7686 46.262%
Kai-3B-Instruct-i1-IQ1_S.gguf 0.72 2.01 142.119 2.71244 23.1949 35.970%
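The quality columns above come from comparing the quantized model's per-token output distribution against the BF16 baseline. As a rough illustration only (toy logits and NumPy; this is not the actual llama.cpp measurement code), mean/max KL divergence and top-token agreement can be computed like this:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quant_quality(base_logits, quant_logits):
    """Compare quantized-model logits against a full-precision baseline.

    Returns (mean KLD, max KLD, fraction of positions where the
    top-1 token agrees), aggregated over all token positions.
    """
    p = softmax(base_logits)   # baseline distribution per position
    q = softmax(quant_logits)  # quantized distribution per position
    kld = np.sum(p * (np.log(p) - np.log(q)), axis=-1)  # KL(p||q) per position
    top_match = np.mean(p.argmax(-1) == q.argmax(-1))   # top-token agreement rate
    return kld.mean(), kld.max(), top_match

rng = np.random.default_rng(0)
base = rng.normal(size=(128, 1000))                      # 128 positions, 1000-token vocab
quant = base + rng.normal(scale=0.05, size=base.shape)   # simulated quantization noise
mean_kld, max_kld, match = quant_quality(base, quant)
```

Larger quantization noise raises both KLD numbers and lowers the match rate, which is exactly the trend visible down the table.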

Kai-3B-Instruct

A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new ADS (Adaptive Dual-Search Distillation) technique.

Model Details

Model Kai-3B-Instruct
Architecture SmolLM3ForCausalLM
Parameters 3B
Hidden size 2048
Intermediate size 11008
Layers 36
Attention heads 16 (4 KV heads, GQA)
Context length 65536
Precision bfloat16
Vocab size 128,256

What is ADS?

Adaptive Dual-Search Distillation treats model fine-tuning as a constrained optimization problem inspired by Operations Research. The core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts based on embedding space entropy β€” forcing the model to converge to high-confidence predictions at difficult reasoning points, without modifying the model architecture.
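NoesisLab has not published ADS code, so the exact loss is unknown. Purely as an illustration of the idea described above (a stateful dual penalty that grows when predictive entropy is high, pushing the model toward confident predictions), a sketch in PyTorch might look like this; every name and constant here is a hypothetical stand-in, not the official implementation:

```python
import torch
import torch.nn.functional as F

def ads_style_loss(logits, targets, lam, eta=0.1, tau=1.0):
    """Hypothetical entropy-adaptive dual-penalty loss (NOT the official ADS).

    lam is the stateful dual variable carried across training steps; it
    grows (dual ascent) whenever mean predictive entropy exceeds the
    threshold tau, increasingly penalizing low-confidence predictions.
    """
    ce = F.cross_entropy(logits, targets)                   # standard LM loss
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    loss = ce + lam * F.relu(entropy - tau)                 # penalize excess entropy
    new_lam = lam + eta * F.relu(entropy.detach() - tau)    # dual ascent update
    return loss, new_lam

logits = torch.randn(8, 50, requires_grad=True)  # toy batch: 8 positions, 50-token vocab
targets = torch.randint(0, 50, (8,))
loss, lam = ads_style_loss(logits, targets, lam=torch.tensor(0.0))
```

Because the penalty enters through a dual variable rather than an architectural change, the trained model is a plain SmolLM3 checkpoint at inference time, consistent with the claim above.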

Benchmark Results

Performance Comparison Across General, Code, and Math Benchmarks

General (5-shot, log-likelihood)

Model Params MMLU ARC-c (acc_norm) HellaSwag (acc_norm) PIQA (acc_norm)
TinyLlama 1.1B ~26.0% ~33.0% ~60.0% ~71.0%
SmolLM2 1.7B ~35.0% ~38.0% ~65.0% ~74.0%
Llama-2-7B 7B 45.3% 46.2% 77.2% 79.8%
Gemma-2-2B 2.6B ~52.0% ~53.0% 75.0% ~78.0%
Kai-3B-Instruct 3B 53.62% 51.88% 69.53% 77.53%
Qwen2.5-3B 3B ~63.0% ~55.0% ~73.0% ~80.0%

Code Generation β€” HumanEval (Pass@1, 0-shot)

Model Params HumanEval (Pass@1) Notes
Llama-2-7B 7B ~12.8% Kai-3B scores ~3x higher with less than half the parameters
SmolLM2-1.7B 1.7B ~25.0% ADS delivers a ~14pp gain over this baseline
Gemma-2-2B 2B ~30.0% Kai-3B surpasses Google's heavily distilled 2B flagship
Kai-3B-Instruct 3B 39.02% Full ADS pipeline, including topological pruning
GPT-3.5 (Legacy) 175B ~48.0% Kai-3B trails the original GPT-3.5 by only ~9pp

Math β€” GSM8K (0-shot)

Model Params GSM8K (exact_match)
Kai-3B-Instruct 3B 39.27%

Key Observations

  1. Surpasses Llama-2-7B: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with less than half the parameters β€” a 7B model decisively beaten by a 3B distilled model.

  2. Competitive with Gemma-2-2B: Matches or exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and PIQA, despite Gemma being trained with significantly more compute.

  3. HellaSwag: At 69.53%, Kai-3B surpasses all sub-2B models by a wide margin and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.

  4. PIQA: At 77.53%, Kai-3B nearly matches Gemma-2-2B (78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (80.0%).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the original BF16 weights from the base model repository
model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Kai-3B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

messages = [{"role": "user", "content": "What is 25 * 4?"}]
# add_generation_prompt appends the assistant turn header so the model replies
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
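The snippet above runs the original BF16 checkpoint through transformers. The GGUF files in this repo are instead intended for llama.cpp and compatible runtimes. Assuming a recent llama.cpp build and the huggingface-cli tool, a typical session looks like this (Q4_K_M is picked from the table above as a common size/quality tradeoff; flag names may differ across llama.cpp versions):

```shell
# Fetch one quant from this repo, then chat with it in llama.cpp's conversation mode
huggingface-cli download SimplySara/Kai-3B-Instruct-GGUF \
    Kai-3B-Instruct-Q4_K_M.gguf --local-dir .
llama-cli -m Kai-3B-Instruct-Q4_K_M.gguf -cnv
```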

Citation

@misc{noesislab2026kai3b,
  title={Kai-3B-Instruct},
  author={NoesisLab},
  year={2026},
  url={https://huggingface.co/NoesisLab/Kai-3B-Instruct}
}

License

Apache 2.0


Model tree for SimplySara/Kai-3B-Instruct-GGUF
Quantized from NoesisLab/Kai-3B-Instruct