---
library_name: transformers
license: apache-2.0
tags:
- math
- reasoning
- text-generation
- ads
- distillation
language:
- en
pipeline_tag: text-generation
model-index:
- name: Kai-3B-Instruct
  results:
  - task:
      type: multiple-choice
      name: ARC-Challenge
    dataset:
      name: ARC-Challenge
      type: allenai/ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: acc_norm
      value: 51.88
      name: Accuracy (normalized)
  - task:
      type: multiple-choice
      name: HellaSwag
    dataset:
      name: HellaSwag
      type: Rowan/hellaswag
      split: validation
    metrics:
    - type: acc_norm
      value: 69.53
      name: Accuracy (normalized)
  - task:
      type: multiple-choice
      name: MMLU
    dataset:
      name: MMLU
      type: cais/mmlu
      split: test
    metrics:
    - type: acc
      value: 53.62
      name: Accuracy
  - task:
      type: multiple-choice
      name: PIQA
    dataset:
      name: PIQA
      type: piqa
      split: validation
    metrics:
    - type: acc_norm
      value: 77.53
      name: Accuracy (normalized)
  - task:
      type: text-generation
      name: HumanEval
    dataset:
      name: HumanEval
      type: openai/openai_humaneval
      split: test
    metrics:
    - type: pass@1
      value: 39.02
      name: Pass@1
  - task:
      type: text-generation
      name: GSM8K
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: exact_match
      value: 39.27
      name: Exact Match (flexible)
---
# Kai-3B-Instruct
A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new **ADS (Adaptive Dual-Search Distillation)** technique.
## Model Details
| | |
|---|---|
| **Model** | Kai-3B-Instruct |
| **Architecture** | SmolLM3ForCausalLM |
| **Parameters** | 3B |
| **Hidden size** | 2048 |
| **Intermediate size** | 11008 |
| **Layers** | 36 |
| **Attention heads** | 16 (4 KV heads, GQA) |
| **Context length** | 65536 |
| **Precision** | bfloat16 |
| **Vocab size** | 128,256 |
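
Thanks to GQA (4 KV heads against 16 query heads), the KV cache stays modest even at the full 65,536-token context. A quick back-of-the-envelope check, assuming a bf16 cache and `head_dim = hidden_size / heads = 128`:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for the KV cache: key + value tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Kai-3B: 36 layers, 4 KV heads, head_dim = 2048 / 16 = 128, bf16 (2 bytes)
total = kv_cache_bytes(layers=36, kv_heads=4, head_dim=128, seq_len=65536)
print(f"{total / 2**30:.1f} GiB")  # 4.5 GiB at the full context length
```

With 16 full KV heads (no GQA) the same cache would be four times larger, around 18 GiB.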
## What is ADS?
**Adaptive Dual-Search Distillation (ADS)** treats model fine-tuning as a constrained optimization problem inspired by operations research. Its core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts to embedding-space entropy, pushing the model toward high-confidence predictions at difficult reasoning points without modifying the model architecture.
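
The card does not publish the ADS update rule, so the following is only an illustrative sketch of one way such a mechanism could look: cross-entropy plus an entropy penalty whose dual factor `lam` rises when predictive entropy exceeds a budget and decays otherwise (projected dual ascent). Every name and the update rule itself are assumptions, not the released implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of the predictive distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ads_style_loss(logits, target, lam):
    """Cross-entropy plus an entropy penalty weighted by the dual factor lam.
    ILLUSTRATIVE ONLY: the actual ADS objective is not published."""
    probs = softmax(logits)
    ce = -math.log(probs[target])
    h = entropy(probs)
    return ce + lam * h, h

def dual_ascent_step(lam, h, h_budget, lr=0.1):
    """Raise lam when entropy exceeds the budget, decay it otherwise,
    then project back onto lam >= 0 (a standard dual-ascent update)."""
    return max(0.0, lam + lr * (h - h_budget))

# At an uncertain (high-entropy) step the penalty factor grows,
# so the next update pushes harder toward a confident prediction.
lam = 0.5
loss, h = ads_style_loss([0.1, 0.2, 0.1], target=1, lam=lam)
lam = dual_ascent_step(lam, h, h_budget=0.5)
```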
## Benchmark Results

### General (5-shot, log-likelihood)
| Model | Params | MMLU | ARC-c (acc_norm) | HellaSwag (acc_norm) | PIQA (acc_norm) |
|---|:---:|:---:|:---:|:---:|:---:|
| TinyLlama | 1.1B | ~26.0% | ~33.0% | ~60.0% | ~71.0% |
| SmolLM2 | 1.7B | ~35.0% | ~38.0% | ~65.0% | ~74.0% |
| Llama-2-7B | 7B | 45.3% | 46.2% | 77.2% | 79.8% |
| Gemma-2-2B | 2.6B | ~52.0% | ~53.0% | 75.0% | ~78.0% |
| **Kai-3B-Instruct** | **3B** | **53.62%** | **51.88%** | **69.53%** | **77.53%** |
| Qwen2.5-3B | 3B | ~63.0% | ~55.0% | ~73.0% | ~80.0% |
### Code Generation — HumanEval (Pass@1, 0-shot)
| Model | Params | HumanEval (Pass@1) | Notes |
|---|:---:|:---:|---|
| Llama-2-7B | 7B | ~12.8% | Kai-3B scores ~3× higher with fewer than half the parameters |
| SmolLM2-1.7B | 1.7B | ~25.0% | Kai-3B leads by +14pp |
| Gemma-2-2B | 2B | ~30.0% | Kai-3B surpasses Google's heavily distilled 2B model by ~9pp |
| **Kai-3B-Instruct** | **3B** | **39.02%** | **Full ADS pipeline, including topological pruning** |
| GPT-3.5 (Legacy) | 175B | ~48.0% | Kai-3B trails the original GPT-3.5 by only ~9pp |
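
For reference, HumanEval pass@1 is conventionally computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which this card is assumed to follow:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of which pass the unit
    tests) is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=4, k=1))  # 0.4 — for k=1 this reduces to c / n
```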
### Math — GSM8K (0-shot)
| Model | Params | GSM8K (exact_match) |
|---|:---:|:---:|
| **Kai-3B-Instruct** | **3B** | **39.27%** |
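
"Flexible" exact match for GSM8K typically means extracting the last number from the generation rather than requiring a fixed answer format. A hypothetical sketch (the exact extraction rules behind this card's score are not stated):

```python
import re

def extract_last_number(text: str):
    """Pull the final number out of a generation, tolerating commas."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def flexible_exact_match(prediction: str, gold: str) -> bool:
    """Score a hit if the last number in the prediction equals the gold answer."""
    return extract_last_number(prediction) == gold

print(flexible_exact_match("25 * 4 = 100. The answer is 100.", "100"))  # True
```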
### Key Observations
1. **Surpasses Llama-2-7B**: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with less than half the parameters.
2. **Competitive with Gemma-2-2B**: Matches or exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and PIQA, despite Gemma being trained with significantly more compute.
3. **HellaSwag**: At **69.53%**, Kai-3B surpasses all sub-2B models by a wide margin and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.
4. **PIQA**: At **77.53%**, Kai-3B nearly matches Gemma-2-2B (~78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (~80.0%).
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Kai-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

messages = [{"role": "user", "content": "What is 25 * 4?"}]
# add_generation_prompt=True appends the assistant turn header so the
# model generates a reply instead of continuing the user message
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Citation
```bibtex
@misc{noesislab2026kai3b,
  title={Kai-3B-Instruct},
  author={NoesisLab},
  year={2026},
  url={https://huggingface.co/NoesisLab/Kai-3B-Instruct}
}
```
## License
Apache 2.0