# Kisoku 3B SFT
The instruction-tuned version of Kisoku 3B Base, fine-tuned using supervised fine-tuning (SFT) on Google Cloud TPUs with MaxText.
Trained entirely from scratch (pretraining + SFT) by a solo researcher, supported by Google's TPU Research Cloud (TRC).
## Overview
This model was fine-tuned from the Kisoku 3B base checkpoint using a custom text-only chat template (`### User` / `### Assistant` format) designed to avoid the out-of-vocabulary special-token issues common with Llama-family tokenizers.
The model uses the Granite architecture (Llama-compatible, with the addition of runtime logit scaling), which enables GGUF conversion and local deployment via llama.cpp.
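A minimal sketch of how a message list might be rendered into this template. The card only specifies the `### User` / `### Assistant` markers; the exact newline conventions and the `format_chat` helper are assumptions for illustration:

```python
def format_chat(messages: list[dict]) -> str:
    """Render OpenAI-style messages into the text-only template.

    Whitespace details are assumed, not taken from the model card.
    """
    parts = []
    for m in messages:
        role = "User" if m["role"] == "user" else "Assistant"
        parts.append(f"### {role}\n{m['content']}")
    parts.append("### Assistant\n")  # cue the model to respond
    return "\n".join(parts)

print(format_chat([{"role": "user", "content": "Hello!"}]))
```

Because every marker is plain text, no new special tokens need to be added to the tokenizer, which is what sidesteps the out-of-vocabulary issue.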
## Architecture
| Parameter | Value |
|---|---|
| Architecture | GraniteForCausalLM |
| Parameters | ~3B |
| Layers | 28 |
| Hidden size | 3072 |
| FFN size | 8192 |
| Attention heads | 24 |
| KV heads | 6 (Grouped-Query Attention) |
| Head dim | 128 |
| Vocab size | 128,256 |
| Context length | 4,096 |
| Logit scaling | 55.43 (Granite-specific) |
| Activation | SiLU |
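The head counts in the table imply a 4:1 query-to-KV grouping for GQA. A quick arithmetic check (the fp16 KV-cache figure is an illustrative calculation, not a number from the card):

```python
# Values from the architecture table above
n_heads, n_kv_heads, head_dim, n_layers = 24, 6, 128, 28

groups = n_heads // n_kv_heads
print(groups)  # 4 query heads share each KV head

# Per-token KV cache in fp16: K and V, each n_kv_heads * head_dim
# values of 2 bytes, across all layers
kv_bytes_per_token = 2 * n_kv_heads * head_dim * 2 * n_layers
print(kv_bytes_per_token / 1024)  # 84.0 KiB per token
```

With 6 KV heads instead of 24, the KV cache is a quarter of what full multi-head attention would need at the same head dimension.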
## Training Details
### Pretraining (Base Model)
| Detail | Value |
|---|---|
| Framework | MaxText (JAX) on TPU v4-32 |
| Steps | 460,000 |
| Data | DCLM-Baseline 1.0, FineWeb-Edu |
### SFT
| Detail | Value |
|---|---|
| Framework | MaxText SFT on TPU |
| Steps | ~2,499 |
| Final loss | ~1.6 |
| Chat template | Custom text-only (### User / ### Assistant) |
| Tokenizer | Custom (at kisoku-sft-tokenizer/) |
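As a rough sanity check, a final SFT loss of ~1.6 nats corresponds to a per-token perplexity of about e^1.6 ≈ 5:

```python
import math

final_loss = 1.6  # from the table above
print(round(math.exp(final_loss), 2))  # ≈ 4.95
```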
## Local Deployment (GGUF)
A GGUF quantized version (Q8_0, 3.5GB) is available for local serving via llama.cpp:
```bash
# Serve with llama-server
llama-server -m kisoku-3b-sft-q8.gguf -c 4096 --port 8900

# Use with any OpenAI-compatible client
curl http://localhost:8900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kisoku", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Note: Due to the Granite logit scaling (55.43×), use temperature ~0.01 for standard sampling behavior, or use the included proxy script, which auto-adjusts the temperature and injects `logit_bias` for special tokens.
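One way to read this note: if the exported logits still carry the Granite division by 55.43, then sampling them at temperature T behaves like sampling unscaled logits at T × 55.43, so a conventional temperature should be divided by the scale before being passed to llama.cpp. This interpretation, and the `adjust_temperature` helper, are assumptions for illustration, not part of the card:

```python
GRANITE_LOGIT_SCALE = 55.43  # from the architecture table

def adjust_temperature(desired_temp: float,
                       scale: float = GRANITE_LOGIT_SCALE) -> float:
    """Hypothetical helper: map a conventional sampling temperature to the
    value to pass to the server, assuming the served logits are already
    divided by the Granite scale (softmax(l/s/T) == softmax(l/(s*T)))."""
    return desired_temp / scale

print(round(adjust_temperature(0.7), 3))  # 0.013, close to the ~0.01 above
```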
## Limitations
- Undertrained base model (needs more pretraining tokens for competitive performance)
- English-focused
- No safety alignment (RLHF/DPO) applied
- Granite logit scaling requires temperature adjustment at inference
## Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
## License
Apache 2.0