Instructions for using NoesisLab/Kai-3B-Instruct with libraries and local apps.

Libraries

How to use NoesisLab/Kai-3B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NoesisLab/Kai-3B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")
# The checkpoint is a text-only causal LM (SmolLM3ForCausalLM), so load it with AutoModelForCausalLM.
model = AutoModelForCausalLM.from_pretrained("NoesisLab/Kai-3B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use NoesisLab/Kai-3B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NoesisLab/Kai-3B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NoesisLab/Kai-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
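
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_payload` and `chat` and the `max_tokens` default are illustrative assumptions, and the base URL assumes the vLLM server started above is listening on port 8000 as in the curl call.

```python
import json
import urllib.request

MODEL_ID = "NoesisLab/Kai-3B-Instruct"


def build_chat_payload(prompt, model=MODEL_ID, max_tokens=128):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url="http://localhost:8000"):
    """POST the prompt to /v1/chat/completions and return the reply text.

    Assumes a server is already running at base_url.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("What is the capital of France?"))
```

Passing a different `base_url` (for example `http://localhost:30000`) should let the same helper talk to any other server that exposes the same OpenAI-compatible route.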

SGLang

How to use NoesisLab/Kai-3B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NoesisLab/Kai-3B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NoesisLab/Kai-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NoesisLab/Kai-3B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NoesisLab/Kai-3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use NoesisLab/Kai-3B-Instruct with Docker Model Runner:
```
docker model run hf.co/NoesisLab/Kai-3B-Instruct
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Kai-3B-Instruct

A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new ADS (Adaptive Dual-Search Distillation) technique.

Model Details


Model	Kai-3B-Instruct
Architecture	SmolLM3ForCausalLM
Parameters	3B
Hidden size	2048
Intermediate size	11008
Layers	36
Attention heads	16 (4 KV heads, GQA)
Context length	65536
Precision	bfloat16
Vocab size	128,256
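
As a sanity check, the figures in the table can be combined into a rough parameter count and KV-cache size. This is a back-of-the-envelope sketch, not an official breakdown. It assumes a head dimension of hidden size divided by attention heads, a gated MLP with three projection matrices, no bias terms, and tied input/output embeddings, none of which are stated in the table.

```python
# Rough parameter and KV-cache estimate from the Model Details table.
# Assumptions (not stated in the table): head_dim = hidden / heads,
# a gated MLP with three projection matrices, no bias terms,
# norm weights only, and input/output embeddings tied.
HIDDEN = 2048
INTERMEDIATE = 11008
LAYERS = 36
HEADS = 16
KV_HEADS = 4
VOCAB = 128_256
CONTEXT = 65_536
BYTES_PER_VALUE = 2  # bfloat16

head_dim = HIDDEN // HEADS
kv_width = KV_HEADS * head_dim  # width of the K and V projections under GQA

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_width  # Q, O plus K, V
mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
norms = 2 * HIDDEN  # two norm weight vectors per layer
per_layer = attention + mlp + norms

embeddings = VOCAB * HIDDEN
total_params = LAYERS * per_layer + embeddings + HIDDEN  # + final norm

# Keys and values for every layer, stored once per token.
kv_bytes_per_token = 2 * LAYERS * kv_width * BYTES_PER_VALUE
kv_bytes_full_context = kv_bytes_per_token * CONTEXT

print(f"parameters: {total_params / 1e9:.2f}B")
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at full context: {kv_bytes_full_context / 2**30:.1f} GiB")
```

Under these assumptions the estimate comes to about 3.08B parameters and about 4.5 GiB of bfloat16 KV cache for a single 65,536-token sequence. With 16 KV heads instead of 4, the cache would be four times larger.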

What is ADS?

Adaptive Dual-Search Distillation (ADS) treats model fine-tuning as a constrained optimization problem, an approach inspired by Operations Research. Its core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts to the entropy of the embedding space. The penalty pushes the model toward high-confidence predictions at difficult reasoning points, and it does so without modifying the model architecture.
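
To make the idea of a stateful dual penalty concrete, here is a toy sketch in plain Python. It is not the authors' implementation. The source does not say how embedding space entropy is measured, so this sketch substitutes the entropy of the output distribution. The class name `DualPenaltyLoss`, the `target_entropy` threshold, and the `dual_lr` step size are all hypothetical. The multiplier update is ordinary projected dual ascent, a standard technique from constrained optimization.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class DualPenaltyLoss:
    """Cross-entropy plus an entropy penalty weighted by a stateful multiplier.

    Hypothetical illustration: the constraint is entropy <= target_entropy,
    and the multiplier follows projected dual ascent on that constraint.
    """

    def __init__(self, target_entropy=0.5, dual_lr=0.1):
        self.target_entropy = target_entropy
        self.dual_lr = dual_lr
        self.multiplier = 0.0  # state carried across training steps

    def __call__(self, logits, target_index):
        probs = softmax(logits)
        cross_entropy = -math.log(probs[target_index])
        violation = entropy(probs) - self.target_entropy
        loss = cross_entropy + self.multiplier * max(0.0, violation)
        # Dual ascent: grow the multiplier while the constraint is violated,
        # shrink it (never below zero) once predictions are confident enough.
        self.multiplier = max(0.0, self.multiplier + self.dual_lr * violation)
        return loss
```

In this toy version, repeated uncertain predictions raise the multiplier, so the same level of uncertainty costs more on each later step. Confident predictions let the multiplier decay back toward zero.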

Benchmark Results

General (5-shot, log-likelihood)

Model	Params	MMLU	ARC-c (acc_norm)	HellaSwag (acc_norm)	PIQA (acc_norm)
TinyLlama	1.1B	~26.0%	~33.0%	~60.0%	~71.0%
SmolLM2	1.7B	~35.0%	~38.0%	~65.0%	~74.0%
Llama-2-7B	7B	45.3%	46.2%	77.2%	79.8%
Gemma-2-2B	2.6B	~52.0%	~53.0%	75.0%	~78.0%
Kai-3B-Instruct	3B	53.62%	51.88%	69.53%	77.53%
Qwen2.5-3B	3B	~63.0%	~55.0%	~73.0%	~80.0%

Code Generation — HumanEval (Pass@1, 0-shot)

Model	Params	HumanEval (Pass@1)	Notes
Llama-2-7B	7B	~12.8%	Kai-3B scores about 3x higher with fewer than half the parameters
SmolLM2-1.7B	1.7B	~25.0%	Kai-3B scores ~14pp higher
Gemma-2-2B	2.6B	~30.0%	Kai-3B surpasses Google's heavily distilled 2B-class flagship by ~9pp
Kai-3B-Instruct	3B	39.02%	ADS topological pruning, full pipeline
GPT-3.5 (Legacy)	175B	~48.0%	Kai-3B trails the original GPT-3.5 by only ~9pp
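
Pass@1 is the share of problems for which a generated solution passes all unit tests. The sketch below shows the unbiased pass@k estimator commonly used with HumanEval. The source does not state which evaluation harness or sampling setup produced the numbers above. The function names and the sample data are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative data only: one sample per problem, 64 of 164 problems solved.
results = [(1, 1)] * 64 + [(1, 0)] * 100
print(f"{mean_pass_at_k(results, k=1):.2%}")  # prints 39.02%
```

HumanEval contains 164 problems, so a Pass@1 of 39.02% is consistent with 64 solved problems at one sample each. That count is an inference from the percentage, not a figure reported by the source.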

Math — GSM8K (0-shot)

Model	Params	GSM8K (exact_match)
Kai-3B-Instruct	3B	39.27%

Key Observations

Surpasses Llama-2-7B on MMLU and ARC-Challenge: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with fewer than half the parameters. Llama-2-7B still leads on HellaSwag (77.2% vs. 69.53%) and PIQA (79.8% vs. 77.53%).
Competitive with Gemma-2-2B: Kai-3B exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and comes within ~0.5pp on PIQA, despite Gemma being trained with significantly more compute. Gemma-2-2B leads on ARC-Challenge (~53.0% vs. 51.88%) and HellaSwag (75.0% vs. 69.53%).
HellaSwag: At 69.53%, Kai-3B leads the sub-2B models (SmolLM2 at ~65.0%, TinyLlama at ~60.0%) by roughly 4.5 to 9.5pp and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.
PIQA: At 77.53%, Kai-3B nearly matches Gemma-2-2B (~78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (~80.0%).
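
The percentage-point gaps quoted in these observations can be recomputed from the General benchmark table. The short sketch below copies the scores by hand. The dictionary layout and the `gap_pp` helper are illustrative. Scores marked "~" in the table are approximate, so the resulting gaps are approximate too.

```python
# Scores copied from the General benchmark table (percent).
# Values marked "~" in the table are approximate, so the gaps are too.
scores = {
    "Kai-3B-Instruct": {"MMLU": 53.62, "ARC-c": 51.88, "HellaSwag": 69.53, "PIQA": 77.53},
    "Llama-2-7B": {"MMLU": 45.3, "ARC-c": 46.2, "HellaSwag": 77.2, "PIQA": 79.8},
    "Gemma-2-2B": {"MMLU": 52.0, "ARC-c": 53.0, "HellaSwag": 75.0, "PIQA": 78.0},
    "Qwen2.5-3B": {"MMLU": 63.0, "ARC-c": 55.0, "HellaSwag": 73.0, "PIQA": 80.0},
}


def gap_pp(baseline, benchmark, model="Kai-3B-Instruct"):
    """Return the model's score minus the baseline's score, in percentage points."""
    return round(scores[model][benchmark] - scores[baseline][benchmark], 2)


for baseline in ("Llama-2-7B", "Gemma-2-2B", "Qwen2.5-3B"):
    gaps = {name: gap_pp(baseline, name) for name in scores[baseline]}
    print(baseline, gaps)
```

A positive value means Kai-3B-Instruct scores higher than the baseline on that benchmark, and a negative value means it scores lower.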

Usage

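from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the weights in bfloat16, the precision listed in Model Details.
model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Kai-3B-Instruct",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

messages = [{"role": "user", "content": "What is 25 * 4?"}]
# add_generation_prompt=True appends the assistant turn header so the model
# answers the question instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))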

Citation

@misc{noesislab2026kai3b,
  title={Kai-3B-Instruct},
  author={NoesisLab},
  year={2026},
  url={https://huggingface.co/NoesisLab/Kai-3B-Instruct}
}

License

Apache 2.0

Evaluation results (self-reported)

Benchmark	Metric	Split	Score
ARC-Challenge	Accuracy (normalized)	test	51.88
HellaSwag	Accuracy (normalized)	validation	69.53
MMLU	Accuracy	test	53.62
PIQA	Accuracy (normalized)	validation	77.53
HumanEval	Pass@1	test	39.02
GSM8K	Exact Match (flexible)	test	39.27