---
library_name: transformers
license: apache-2.0
tags:
- math
- reasoning
- text-generation
- ads
- distillation
language:
- en
pipeline_tag: text-generation
model-index:
- name: Kai-3B-Instruct
  results:
  - task:
      type: multiple-choice
      name: ARC-Challenge
    dataset:
      name: ARC-Challenge
      type: allenai/ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: acc_norm
      value: 51.88
      name: Accuracy (normalized)
  - task:
      type: multiple-choice
      name: HellaSwag
    dataset:
      name: HellaSwag
      type: Rowan/hellaswag
      split: validation
    metrics:
    - type: acc_norm
      value: 69.53
      name: Accuracy (normalized)
  - task:
      type: multiple-choice
      name: MMLU
    dataset:
      name: MMLU
      type: cais/mmlu
      split: test
    metrics:
    - type: acc
      value: 53.62
      name: Accuracy
  - task:
      type: multiple-choice
      name: PIQA
    dataset:
      name: PIQA
      type: piqa
      split: validation
    metrics:
    - type: acc_norm
      value: 77.53
      name: Accuracy (normalized)
  - task:
      type: text-generation
      name: HumanEval
    dataset:
      name: HumanEval
      type: openai/openai_humaneval
      split: test
    metrics:
    - type: pass@1
      value: 39.02
      name: Pass@1
  - task:
      type: text-generation
      name: GSM8K
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: exact_match
      value: 39.27
      name: Exact Match (flexible)
---
# Kai-3B-Instruct

A 3B-parameter instruction-tuned language model optimized for reasoning, math, and code generation tasks, powered by our new **ADS (Adaptive Dual-Search Distillation)** technique.

## Model Details

| | |
|---|---|
| **Model** | Kai-3B-Instruct |
| **Architecture** | SmolLM3ForCausalLM |
| **Parameters** | 3B |
| **Hidden size** | 2048 |
| **Intermediate size** | 11008 |
| **Layers** | 36 |
| **Attention heads** | 16 (4 KV heads, GQA) |
| **Context length** | 65,536 |
| **Precision** | bfloat16 |
| **Vocab size** | 128,256 |
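
The head configuration above (16 attention heads over 4 KV heads) is grouped-query attention. As a toy NumPy illustration of how each KV head is broadcast across its group of query heads (shapes and variable names are illustrative, not the model's actual code):

```python
import numpy as np

# Toy shapes mirroring the card: 16 query heads share 4 KV heads (GQA),
# so each KV head serves a group of 16 // 4 = 4 query heads.
n_q_heads, n_kv_heads, seq, d_head = 16, 4, 8, 32
group = n_q_heads // n_kv_heads

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, d_head))
k = rng.normal(size=(n_kv_heads, seq, d_head))
v = rng.normal(size=(n_kv_heads, seq, d_head))

# Broadcast each KV head across its query-head group (the "repeat_kv"
# step in most GQA implementations), then attend as usual.
k_rep = np.repeat(k, group, axis=0)  # (16, seq, d_head)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
out = weights @ v_rep  # (16, seq, d_head): 16-head output from 4 KV heads
```

Since only the 4 shared KV heads need caching at inference time, the KV cache is roughly 4x smaller than with full multi-head attention, which helps make the 65K context practical at 3B scale.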
|
## What is ADS?

**Adaptive Dual-Search Distillation (ADS)** treats model fine-tuning as a constrained optimization problem inspired by operations research. The core mechanism is a dynamic loss function with a stateful dual penalty factor that adapts based on embedding-space entropy, forcing the model to converge to high-confidence predictions at difficult reasoning points without modifying the model architecture.
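
The card does not publish the ADS objective, so the following is only a plausible NumPy sketch of the idea described above: a Lagrangian-style loss whose stateful dual variable is updated by dual ascent against an entropy constraint. The entropy budget, the dual learning rate, and the use of output-distribution entropy (in place of the embedding-space entropy mentioned above) are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ads_step(logits, targets, lam, entropy_budget=0.5, lr_dual=0.1):
    """One step of a hypothetical ADS-style loss.

    Treats fine-tuning as constrained optimization:
        minimize    CE(logits, targets)
        subject to  mean token entropy <= entropy_budget
    and enforces the constraint with a stateful dual variable `lam`,
    updated by dual ascent and projected to stay non-negative.
    """
    p = softmax(logits)  # (T, V) predictive distribution per token
    ce = -np.log(p[np.arange(len(targets)), targets]).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(-1).mean()
    loss = ce + lam * (entropy - entropy_budget)  # Lagrangian
    lam = max(0.0, lam + lr_dual * (entropy - entropy_budget))  # dual ascent
    return loss, lam, entropy
```

In a real training loop `lam` would persist across steps, rising while predictions stay diffuse and relaxing once they sharpen; this is one way the "stateful dual penalty factor" described above could behave.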
|
## Benchmark Results

![Benchmark results](./kai_benchmark.png)
|
### General (5-shot, log-likelihood)

| Model | Params | MMLU | ARC-c (acc_norm) | HellaSwag (acc_norm) | PIQA (acc_norm) |
|---|:---:|:---:|:---:|:---:|:---:|
| TinyLlama | 1.1B | ~26.0% | ~33.0% | ~60.0% | ~71.0% |
| SmolLM2 | 1.7B | ~35.0% | ~38.0% | ~65.0% | ~74.0% |
| Llama-2-7B | 7B | 45.3% | 46.2% | 77.2% | 79.8% |
| Gemma-2-2B | 2.6B | ~52.0% | ~53.0% | 75.0% | ~78.0% |
| **Kai-3B-Instruct** | **3B** | **53.62%** | **51.88%** | **69.53%** | **77.53%** |
| Qwen2.5-3B | 3B | ~63.0% | ~55.0% | ~73.0% | ~80.0% |
### Code Generation — HumanEval (Pass@1, 0-shot)

| Model | Params | HumanEval (Pass@1) | Notes |
|---|:---:|:---:|---|
| Llama-2-7B | 7B | ~12.8% | Kai-3B scores roughly 3x higher with fewer than half the parameters |
| SmolLM2-1.7B | 1.7B | ~25.0% | Kai-3B leads by ~14pp |
| Gemma-2-2B | 2.6B | ~30.0% | Kai-3B leads Google's heavily distilled 2B-class model by ~9pp |
| **Kai-3B-Instruct** | **3B** | **39.02%** | **ADS with topological pruning (full pipeline)** |
| GPT-3.5 (Legacy) | 175B | ~48.0% | Kai-3B trails the original GPT-3.5 by only ~9pp |
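
HumanEval Pass@1 is conventionally computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The card does not state its sampling setup, so this reference implementation is offered only to clarify the metric:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes.
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy sample per problem (n = k = 1), pass@1 reduces to
# the fraction of problems whose generation passes all unit tests.
```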
### Math — GSM8K (0-shot)

| Model | Params | GSM8K (exact_match) |
|---|:---:|:---:|
| **Kai-3B-Instruct** | **3B** | **39.27%** |
|
### Key Observations

1. **Surpasses Llama-2-7B**: Kai-3B outperforms Llama-2-7B on MMLU (+8.3pp) and ARC-Challenge (+5.7pp) with fewer than half the parameters.

2. **Competitive with Gemma-2-2B**: Matches or exceeds Google's Gemma-2-2B on MMLU (+1.6pp) and PIQA, despite Gemma-2-2B being trained with significantly more compute.

3. **HellaSwag**: At **69.53%**, Kai-3B surpasses all sub-2B models by a wide margin and trails the compute-heavy Qwen2.5-3B by only ~3.5pp.

4. **PIQA**: At **77.53%**, Kai-3B nearly matches Gemma-2-2B (~78.0%) and approaches the 3B-class ceiling set by Qwen2.5-3B (~80.0%).
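
The card does not name the evaluation harness used. Assuming EleutherAI's lm-evaluation-harness (whose default task names and `acc_norm`/`exact_match` metrics match the tables above), the general results could be reproduced with a command along these lines:

```shell
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=NoesisLab/Kai-3B-Instruct,dtype=bfloat16 \
  --tasks mmlu,arc_challenge,hellaswag,piqa \
  --num_fewshot 5 \
  --batch_size auto
```

Exact numbers will vary with harness version, prompt format, and few-shot seed, so treat small deviations from the tables as expected.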
|
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Kai-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Kai-3B-Instruct")

messages = [{"role": "user", "content": "What is 25 * 4?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
## Citation

```bibtex
@misc{noesislab2026kai3b,
  title={Kai-3B-Instruct},
  author={NoesisLab},
  year={2026},
  url={https://huggingface.co/NoesisLab/Kai-3B-Instruct}
}
```
|
## License

Apache 2.0
|