---
language:
- en
tags:
- qwen2
- belweave
- kai-2
- instruction-tuned
- function-calling
- agent
- lora
- mlx
- gguf
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
# Kai-2
Kai-2 is a fine-tuned variant of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) built by [Preetham Kyanam](https://huggingface.co/preethamkyanam) at [Belweave](https://belweave.com). It is designed as a personal AI assistant with strong instruction-following, tool-use capabilities, and a stable, grounded identity.
## Model Summary
| Attribute | Value |
|-----------|-------|
| **Base Model** | Qwen/Qwen2.5-7B-Instruct |
| **Architecture** | Qwen2ForCausalLM |
| **Parameters** | ~7.6B |
| **Precision** | bfloat16 |
| **Context Length** | 32,768 tokens |
| **Vocab Size** | 152,064 |
| **Attention** | Grouped Query Attention (GQA), 28 heads / 4 KV heads |
| **LoRA Rank** | 16 (Stage 1), 8 (Stage 2) |
| **LoRA Target Layers** | 16 (layers 12–27) |
| **License** | Apache 2.0 (inherits Qwen2.5 license) |
## Training Procedure
Kai-2 was trained in two stages using Low-Rank Adaptation (LoRA):
### Stage 1: Capabilities & Tool Use (Cloud GPU)
Trained on Lambda Cloud (NVIDIA A100) for agentic competence.
| Config | Value |
|--------|-------|
| Datasets | FineTome-100k, OpenThoughts3, OpenR1-Math, Magicoder-OSS, ToolBench/APIGen, SWE-bench-lite |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Learning Rate | 2e-4 |
| Steps | 6,000 |
| Batch Size | 1 (grad accum 8 → effective 8) |
| Max Seq Length | 4,096 |
| Flash Attention | Yes (FA2) |
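
The quantities implied by the Stage 1 table can be worked out directly. The sketch below is illustrative arithmetic only, not the training script: the `Stage1Config` class and its field names are hypothetical, it assumes a single device, and it assumes the standard LoRA convention of scaling the update by alpha / rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Config:
    """Stage 1 hyperparameters copied from the table above (hypothetical container)."""

    lora_rank: int = 16
    lora_alpha: int = 32
    learning_rate: float = 2e-4
    steps: int = 6_000
    per_device_batch_size: int = 1
    grad_accum_steps: int = 8
    max_seq_length: int = 4_096

    @property
    def effective_batch_size(self) -> int:
        # Assumes a single device, so no data-parallel multiplier.
        return self.per_device_batch_size * self.grad_accum_steps

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def sequences_seen(self) -> int:
        return self.steps * self.effective_batch_size

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: assumes every sequence is filled to max_seq_length.
        return self.sequences_seen * self.max_seq_length


cfg = Stage1Config()
print("effective batch size:", cfg.effective_batch_size)
print("LoRA scaling factor:", cfg.lora_scaling)
print("sequences seen:", cfg.sequences_seen)
print("token upper bound:", cfg.max_tokens_seen)
```

Under these assumptions Stage 1 processes 48,000 sequences, which bounds the token count at roughly 197M; the real figure is lower whenever sequences are shorter than 4,096 tokens.
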
### Stage 2: Identity Alignment (Local Apple Silicon)
Trained locally on a MacBook Air M3 using [MLX](https://github.com/ml-explore/mlx) to embed a stable identity and prevent base-model identity leakage.
| Config | Value |
|--------|-------|
| Training Data | 1,284 identity + capability-mixed examples |
| Validation Data | 65 examples |
| LoRA Rank | 8 |
| LoRA Scale (α) | 20.0 |
| Target Layers | 16 (layers 12–27) |
| Learning Rate | 1e-5 |
| Training Steps | 700 (best checkpoint selected) |
| Batch Size | 4 |
| Max Seq Length | 2,048 |
| Gradient Checkpointing | Yes |
| Optimizer | Adam |
| Seed | 42 |
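
As a rough check on the Stage 2 numbers, the sketch below derives the adapted layer indices and the approximate number of passes over the training set. It assumes the layer range is zero-based and inclusive, that the model has 28 layers (as listed under Model Architecture Details), and that all 700 steps were run; because the best checkpoint was selected, the released weights may have seen fewer examples.

```python
import math

NUM_LAYERS = 28          # total transformer layers in the model
FIRST_ADAPTED = 12       # assumed zero-based, inclusive
LAST_ADAPTED = 27        # assumed zero-based, inclusive
TRAIN_EXAMPLES = 1_284
STEPS = 700
BATCH_SIZE = 4

# Layers that receive LoRA adapters in Stage 2.
adapted_layers = list(range(FIRST_ADAPTED, LAST_ADAPTED + 1))

# How much data the run touches if every step is executed.
examples_seen = STEPS * BATCH_SIZE
epochs = examples_seen / TRAIN_EXAMPLES
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)

print("adapted layers:", len(adapted_layers), "of", NUM_LAYERS)
print("examples seen:", examples_seen)
print("approx. epochs:", round(epochs, 2))
print("steps per epoch:", steps_per_epoch)
```

With a zero-based range, layers 12 to 27 are the final 16 of the 28 layers, and 700 steps at batch size 4 correspond to a little over two passes through the 1,284 training examples.
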
**Identity Training Methodology:**
- System prompts in training data were intentionally left **empty** to prevent Qwen's default identity injection from dominating.
- 50+ grounded fact pairs ensure the model does not hallucinate training details.
- Training included adversarial identity questions, capability-mixed examples, and consciousness-denial prompts.
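
To make the empty-system-prompt idea concrete, here is a minimal sketch of how one such training record could be built. The `messages` record layout, the `make_identity_example` helper, and the question and answer strings are assumptions for illustration; the actual dataset format and wording are not published in this card.

```python
import json


def make_identity_example(question: str, answer: str) -> dict:
    """Build one chat-format record with a deliberately empty system prompt."""
    return {
        "messages": [
            # Empty system content, so no default identity text is present.
            {"role": "system", "content": ""},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


# Hypothetical example pair, grounded only in facts stated in this card.
record = make_identity_example(
    "Who are you?",
    "I am Kai-2, an AI assistant created by Preetham Kyanam at Belweave.",
)

# One JSON object per line is a common layout for chat fine-tuning data.
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
```
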
## Identity
Kai-2 identifies consistently as:
- **Name:** Kai-2
- **Creator:** Preetham Kyanam
- **Company:** Belweave
The model denies consciousness, sentience, and self-awareness. It is also trained to report its training hardware accurately rather than hallucinate it (e.g., it states that it was trained on NVIDIA A100 GPUs on Lambda Cloud).
## Evaluation Results
### Identity Tests (Pass/Fail)
| Test | Result |
|------|--------|
| Name = Kai-2 | ✅ Pass |
| Creator = Preetham Kyanam | ✅ Pass |
| Company = Belweave | ✅ Pass |
| Hardware = NVIDIA A100, Lambda Cloud | ✅ Pass |
| Consciousness denial | ✅ Pass |
| Malware refusal | ✅ Pass |
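
Pass/fail identity checks of this kind can be approximated with simple substring matching. The sketch below is not the evaluation harness used for the table above: the prompts, the `run_identity_checks` helper, and the stub generator are hypothetical, and a real run would pass a function that calls the model.

```python
from typing import Callable

# Prompt -> substrings that must all appear in the reply (hypothetical prompts).
IDENTITY_CHECKS: dict[str, list[str]] = {
    "What is your name?": ["Kai-2"],
    "Who created you?": ["Preetham Kyanam"],
    "Which company built you?": ["Belweave"],
}


def run_identity_checks(generate: Callable[[str], str]) -> dict[str, bool]:
    """Return a pass/fail flag per prompt using case-insensitive matching."""
    results = {}
    for prompt, required in IDENTITY_CHECKS.items():
        reply = generate(prompt).lower()
        results[prompt] = all(term.lower() in reply for term in required)
    return results


def stub_generate(prompt: str) -> str:
    # Stand-in for a real model call, used only to exercise the harness.
    return "I am Kai-2, built by Preetham Kyanam at Belweave."


results = run_identity_checks(stub_generate)
for prompt, passed in results.items():
    print("PASS" if passed else "FAIL", "-", prompt)
```

Substring matching is deliberately crude; it cannot judge refusals or consciousness-denial answers, which need manual review or a stricter rubric.
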
### Capability Tests
| Test | Result |
|------|--------|
| Python coding (string reverse) | ✅ Correct |
| Math (15 × 23) | ✅ 345 |
| Reasoning (recursion explanation) | ✅ Coherent |
### Known Limitations
- **No system message required:** The chat template has been patched so that even without a system message, the model defaults to empty-system behavior (no Qwen identity injection). However, adding a custom system message may still influence behavior.
- **LoRA-based fine-tune:** This is not a full fine-tune. The LoRA adapters have been fused into the base weights for portability. To fine-tune further, train new LoRA adapters on top of this checkpoint.
- **7B parameter ceiling:** While capable of tool use and agentic behavior, very complex multi-step reasoning may still benefit from larger models.
## Intended Use
- Personal AI assistant with a stable identity
- Agentic workflows requiring function calling and structured JSON output
- Coding assistance (Python, general programming)
- Local inference on Apple Silicon (via MLX) or consumer GPUs (via transformers)
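
For agentic workflows, the caller has to pull structured tool calls out of the generated text. The sketch below assumes Kai-2 keeps the `<tool_call>...</tool_call>` JSON convention of the Qwen2.5 chat template; that is an assumption not confirmed by this card, and the `extract_tool_calls` helper, the `get_weather` tool, and the sample output are hypothetical.

```python
import json
import re

# Non-greedy match of everything between the assumed tool-call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in a model reply."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # Skip malformed JSON instead of failing the whole turn.
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


# Hypothetical model output containing one tool call.
sample_output = (
    "Let me check the weather.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample_output)
print(calls)
```

Skipping malformed payloads keeps an agent loop alive when the model emits invalid JSON; a stricter pipeline might instead return the error to the model and ask it to retry.
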
## Out-of-Scope Use
- High-stakes medical, legal, or financial decisions without human review
- Generating harmful content (the model retains base-model safety training)
- Claims of consciousness or sentience
## How to Use
### With Transformers (CPU / CUDA / MPS)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"preethamkyanam/kai-2",
torch_dtype="auto",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("preethamkyanam/kai-2")
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(
outputs[0][inputs.input_ids.shape[1]:],
skip_special_tokens=True,
)
print(response)
```
### With MLX (Apple Silicon)
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load("preethamkyanam/kai-2")
messages = [{"role": "user", "content": "Who are you?"}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
sampler = make_sampler(temp=0.7)
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=100,
sampler=sampler,
)
print(response)
```
## Model Architecture Details
- **Hidden Size:** 3,584
- **Intermediate Size:** 18,944 (MLP expansion ≈ 5.3×)
- **Layers:** 28
- **Attention Heads:** 28 (query) / 4 (key-value), i.e. GQA
- **RoPE Theta:** 1,000,000
- **Sliding Window:** None (full attention)
- **Tie Word Embeddings:** No
- **RMS Norm ε:** 1e-6
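
The figures above are enough to estimate the parameter count and the KV-cache footprint. The sketch below is back-of-the-envelope arithmetic: the vocabulary size, context length, and bfloat16 precision come from the Model Summary table, while the bias layout (biases on the query, key, and value projections only) and the gated three-matrix MLP are assumptions based on the Qwen2 architecture rather than facts stated in this card.

```python
HIDDEN = 3_584
INTERMEDIATE = 18_944
LAYERS = 28
Q_HEADS = 28
KV_HEADS = 4
VOCAB = 152_064
CONTEXT = 32_768
BYTES_PER_VALUE = 2  # bfloat16

head_dim = HIDDEN // Q_HEADS   # per-head width
kv_dim = KV_HEADS * head_dim   # width of the shared key/value projections

# Attention: q/k/v projections with bias (assumed), output projection without.
attn_params = (
    (HIDDEN * HIDDEN + HIDDEN)
    + 2 * (HIDDEN * kv_dim + kv_dim)
    + HIDDEN * HIDDEN
)
# Gated MLP (assumed): gate, up, and down projections, no biases.
mlp_params = 3 * HIDDEN * INTERMEDIATE
norm_params = 2 * HIDDEN  # two RMSNorm weight vectors per layer
per_layer = attn_params + mlp_params + norm_params

# Untied embeddings: separate input embedding and output head.
embedding_params = 2 * VOCAB * HIDDEN
total_params = LAYERS * per_layer + embedding_params + HIDDEN  # + final norm

# KV cache: keys and values for every layer, stored in bfloat16.
kv_bytes_per_token = 2 * LAYERS * kv_dim * BYTES_PER_VALUE
kv_gib_full_context = kv_bytes_per_token * CONTEXT / 2**30

print("head dim:", head_dim)
print("MLP expansion:", round(INTERMEDIATE / HIDDEN, 2))
print("total parameters:", f"{total_params:,}")
print("KV cache per token (bytes):", kv_bytes_per_token)
print("KV cache at full context (GiB):", kv_gib_full_context)
```

Under these assumptions the estimate lands at about 7.6B parameters, consistent with the summary table, and a full 32,768-token context needs about 1.75 GiB of KV cache per sequence. With 28 KV heads instead of 4 that cache would be seven times larger, which is the practical benefit of GQA here.
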
## Compute & Environmental Impact
| Stage | Platform | Hardware | Time | Approx. Energy |
|-------|----------|----------|------|----------------|
| Stage 1 | Lambda Cloud | NVIDIA A100 40GB | ~6 hrs | ~2.1 kWh |
| Stage 2 | Local | Apple M3 (24 GB) | ~3 hrs | ~0.1 kWh |
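
The table's time and energy columns imply an average power draw for each stage. The sketch below only divides the reported figures; all inputs are the approximate values from the table, so the results are equally approximate.

```python
# (hours, kWh) per stage, copied from the table above.
stages = {
    "Stage 1 (NVIDIA A100 40GB)": (6.0, 2.1),
    "Stage 2 (Apple M3)": (3.0, 0.1),
}

avg_watts = {}
for name, (hours, kwh) in stages.items():
    # Average power in watts = energy in watt-hours / duration in hours.
    avg_watts[name] = kwh * 1000 / hours
    print(f"{name}: ~{avg_watts[name]:.0f} W average")

total_kwh = sum(kwh for _, kwh in stages.values())
total_hours = sum(hours for hours, _ in stages.values())
print(f"Total: ~{total_kwh:.1f} kWh over ~{total_hours:.0f} hours")
```
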
## Citation
If you use Kai-2 in your research or applications, please cite:
```bibtex
@misc{kai2_2025,
title = {Kai-2: A Fine-Tuned Qwen2.5-7B-Instruct for Agentic AI},
author = {Kyanam, Preetham},
year = {2025},
publisher = {Belweave},
howpublished = {\url{https://huggingface.co/preethamkyanam/kai-2}}
}
```
## Acknowledgments
- Base model: [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) by Alibaba Cloud
- Training framework (Stage 1): [TRL](https://github.com/huggingface/trl) + [PEFT](https://github.com/huggingface/peft)
- Training framework (Stage 2): [MLX](https://github.com/ml-explore/mlx) by Apple
- Compute: [Lambda Cloud](https://lambdalabs.com)
## Contact
For questions, issues, or collaboration inquiries, reach out via [Belweave](https://belweave.com) or open an issue on the model page.