---
library_name: transformers
license: apache-2.0
tags:
  - math
  - reasoning
  - text-generation
  - ads
  - distillation
language:
  - en
pipeline_tag: text-generation
model-index:
  - name: Kai-3B-Instruct
    results:
      - task:
          type: multiple-choice
          name: ARC-Challenge
        dataset:
          name: ARC-Challenge
          type: allenai/ai2_arc
          config: ARC-Challenge
          split: test
        metrics:
          - type: acc_norm
            value: 51.88
            name: Accuracy (normalized)
      - task:
          type: multiple-choice
          name: HellaSwag
        dataset:
          name: HellaSwag
          type: Rowan/hellaswag
          split: validation
        metrics:
          - type: acc_norm
            value: 69.53
            name: Accuracy (normalized)
      - task:
          type: multiple-choice
          name: MMLU
        dataset:
          name: MMLU
          type: cais/mmlu
          split: test
        metrics:
          - type: acc
            value: 53.62
            name: Accuracy
      - task:
          type: multiple-choice
          name: PIQA
        dataset:
          name: PIQA
          type: piqa
          split: validation
        metrics:
          - type: acc_norm
            value: 77.53
            name: Accuracy (normalized)
      - task:
          type: text-generation
          name: HumanEval
        dataset:
          name: HumanEval
          type: openai/openai_humaneval
          split: test
        metrics:
          - type: pass@1
            value: 39.02
            name: Pass@1
      - task:
          type: text-generation
          name: GSM8K
        dataset:
          name: GSM8K
          type: gsm8k
          split: test
        metrics:
          - type: exact_match
            value: 39.27
            name: Exact Match (flexible)
---
# Kai-3B-Instruct

A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new **ADS (Adaptive Dual-Search Distillation)** technique.

## Model Details

| | |
|---|---|
| **Model** | Kai-3B-Instruct |
| **Architecture** | SmolLM3ForCausalLM |
| **Parameters** | 3B |
| **Hidden size** | 2048 |
| **Intermediate size** | 11008 |
| **Layers** | 36 |
| **Attention heads** | 16 (4 KV heads, GQA) |
| **Context length** | 65,536 tokens |
| **Precision** | bfloat16 |
| **Vocab size** | 128,256 |
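
With 4 KV heads shared across 16 query heads, GQA shrinks the per-token KV cache to a quarter of a full multi-head layout. A quick back-of-the-envelope check using only the configuration values from the table above (the helper function is illustrative, not part of the released code):

```python
# Per-token KV-cache size implied by the table above (bfloat16 = 2 bytes).
hidden_size = 2048
num_layers = 36
num_heads = 16
num_kv_heads = 4
bytes_per_value = 2  # bfloat16

head_dim = hidden_size // num_heads  # 128

def kv_cache_bytes_per_token(kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each kv_heads * head_dim values.
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes_per_token(num_kv_heads)  # 73_728 bytes (~72 KiB)
mha = kv_cache_bytes_per_token(num_heads)     # 294_912 bytes (~288 KiB)
print(gqa, mha, mha // gqa)  # 73728 294912 4
```

At the full 65,536-token context, that 4x reduction is the difference between roughly 4.5 GiB and 18 GiB of cache per sequence.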

## What is ADS?

**Adaptive Dual-Search Distillation (ADS)** treats model fine-tuning as a constrained optimization problem inspired by operations research. The core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts to the entropy of the embedding space, pushing the model toward high-confidence predictions at difficult reasoning points without modifying the model architecture.
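
The ADS internals are not published with this card, so the following pure-Python sketch is only a rough illustration of an entropy-adaptive dual penalty: a Lagrange-style multiplier is updated from the entropy of a predicted distribution (here a token distribution rather than embedding-space entropy), growing while the model is uncertain and decaying once it is confident. All names, the target value, and the update rule are assumptions, not the released implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class DualPenalty:
    """Stateful dual multiplier: grows while entropy exceeds a target,
    decays (but never below zero) once predictions are confident."""
    def __init__(self, target_entropy=0.5, step_size=0.1):
        self.lam = 0.0
        self.target = target_entropy
        self.eta = step_size

    def update(self, probs):
        # Dual ascent on the constraint H(p) <= target.
        self.lam = max(0.0, self.lam + self.eta * (entropy(probs) - self.target))
        return self.lam

def penalized_loss(ce_loss, probs, dual):
    # Cross-entropy plus the entropy term weighted by the dual variable.
    return ce_loss + dual.update(probs) * entropy(probs)

dual = DualPenalty()
uncertain = [0.25, 0.25, 0.25, 0.25]  # H = ln 4 ≈ 1.386 > target, penalty grows
confident = [0.97, 0.01, 0.01, 0.01]  # low entropy, penalty decays toward zero
loss1 = penalized_loss(2.0, uncertain, dual)
loss2 = penalized_loss(2.0, confident, dual)
```

The stateful multiplier is what distinguishes this from a fixed entropy-regularization weight: pressure accumulates only where the model stays uncertain across steps.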

## Benchmark Results

![Performance Comparison Across General, Code, and Math Benchmarks](model_comparison.png)

### General (5-shot, log-likelihood)

| Model | Params | MMLU | ARC-c (acc_norm) | HellaSwag (acc_norm) | PIQA (acc_norm) |
|---|:---:|:---:|:---:|:---:|:---:|
| TinyLlama | 1.1B | ~26.0% | ~33.0% | ~60.0% | ~71.0% |
| SmolLM2 | 1.7B | ~35.0% | ~38.0% | ~65.0% | ~74.0% |
| Llama-2-7B | 7B | 45.3% | 46.2% | 77.2% | 79.8% |
| Gemma-2-2B | 2.6B | ~52.0% | ~53.0% | 75.0% | ~78.0% |
| **Kai-3B-Instruct** | **3B** | **53.62%** | **51.88%** | **69.53%** | **77.53%** |
| Qwen2.5-3B | 3B | ~63.0% | ~55.0% | ~73.0% | ~80.0% |

### Code Generation — HumanEval (Pass@1, 0-shot)

| Model | Params | HumanEval (Pass@1) | Notes |
|---|:---:|:---:|---|
| Llama-2-7B | 7B | ~12.8% | Kai-3B scores roughly 3x higher with under half the parameters |
| SmolLM2-1.7B | 1.7B | ~25.0% | Kai-3B leads by ~14pp |
| Gemma-2-2B | 2.6B | ~30.0% | Kai-3B leads by ~9pp over Google's distilled 2B-class model |
| **Kai-3B-Instruct** | **3B** | **39.02%** | **Full ADS pipeline, including topological pruning** |
| GPT-3.5 (Legacy) | 175B | ~48.0% | Kai-3B trails by ~9pp at under 2% of the parameter count |
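
Pass@1 reports the fraction of HumanEval problems solved by a generated program. When more than one sample is drawn per problem, the standard unbiased pass@k estimator from the HumanEval paper applies; a generic sketch (not this card's evaluation harness) looks like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw success rate c/n.
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```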

### Math — GSM8K (0-shot)

| Model | Params | GSM8K (exact_match) |
|---|:---:|:---:|
| **Kai-3B-Instruct** | **3B** | **39.27%** |
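
"Flexible" exact match on GSM8K typically extracts the last number from the completion before comparing against the reference. A minimal sketch in the spirit of lm-evaluation-harness's flexible-extract filter (the regex and normalization are assumptions, not the exact harness used here):

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d[\d,]*\.?\d*")

def extract_last_number(text: str) -> Optional[str]:
    """Pull the last number from a completion, dropping thousands separators."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "").rstrip(".")

def flexible_exact_match(completion: str, reference: str) -> bool:
    predicted = extract_last_number(completion)
    return predicted is not None and predicted == reference.replace(",", "")

print(flexible_exact_match("25 * 4 = 100. The answer is 100", "100"))  # True
```

Flexible extraction credits a correct final answer even when the model's surrounding phrasing varies, which is why it is the usual 0-shot GSM8K metric for chat models.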

### Key Observations

1. **Surpasses Llama-2-7B on knowledge benchmarks**: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with less than half the parameters, though Llama-2-7B keeps its lead on HellaSwag and PIQA.

2. **Competitive with Gemma-2-2B**: Exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and nearly matches it on PIQA, despite Gemma being trained with significantly more compute.

3. **HellaSwag**: At **69.53%**, Kai-3B surpasses all sub-2B models by a wide margin and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.

4. **PIQA**: At **77.53%**, Kai-3B nearly matches Gemma-2-2B (~78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (~80.0%).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Kai-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

messages = [{"role": "user", "content": "What is 25 * 4?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header
    return_tensors="pt",
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Citation

```bibtex
@misc{noesislab2026kai3b,
  title={Kai-3B-Instruct},
  author={NoesisLab},
  year={2026},
  url={https://huggingface.co/NoesisLab/Kai-3B-Instruct}
}
```

## License

Apache 2.0