# Kisoku 3B Base
A 3B parameter language model trained entirely from scratch on Google Cloud TPUs using MaxText (JAX), supported by Google's TPU Research Cloud (TRC).
## Overview
Kisoku 3B is an independent research project by a solo researcher, trained from scratch with no pretrained weight initialization. This model represents a learning journey in open-source LLM pretraining on TPU infrastructure.
This is the base (pretrained) model. For the instruction-tuned version, see kisoku-3b-sft.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~3B |
| Layers | 28 |
| Hidden size | 3072 |
| FFN size | 8192 |
| Attention heads | 24 |
| KV heads | 6 (Grouped-Query Attention) |
| Head dim | 128 |
| Vocab size | 128,256 |
| Context length | 4,096 |
| Position encoding | RoPE (theta=500,000) |
| Activation | SiLU |
| Tied embeddings | Yes |
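The figures in the table can be sanity-checked against the stated ~3B parameter count. A minimal sketch, assuming a Llama-style gated FFN (gate, up, and down projections, which the SiLU activation and `LlamaForCausalLM` architecture imply) and ignoring the small normalization weights:

```python
# Approximate parameter count from the architecture table.
# Norm weights are omitted (they contribute well under 0.1%).
vocab, hidden, ffn, layers = 128_256, 3072, 8192, 28
n_heads, n_kv_heads, head_dim = 24, 6, 128

embed = vocab * hidden  # counted once: input and output embeddings are tied
attn = (hidden * n_heads * head_dim            # Q projection
        + 2 * hidden * n_kv_heads * head_dim   # K and V (GQA: only 6 KV heads)
        + n_heads * head_dim * hidden)         # output projection
mlp = 3 * hidden * ffn                         # gate, up, down projections
total = embed + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B parameters")        # ≈ 3.17B
```

Tied embeddings and grouped-query attention keep the count near 3B despite the 128k vocabulary: the embedding matrix is counted once, and the K/V projections are a quarter the size of a full multi-head layout.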
## Training Details
| Detail | Value |
|---|---|
| Framework | MaxText (JAX) on Google Cloud TPU |
| Hardware | TPU v4-32 (16 chips / 32 TensorCores), on-demand |
| Training steps | 460,000 |
| Training data | DCLM-Baseline 1.0, FineWeb-Edu |
| Precision | bfloat16 |
| Compute provider | Google TPU Research Cloud (TRC) |
## Benchmarks
Evaluated with lm-evaluation-harness v0.4.11, 0-shot, 500 samples per task:
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 0.300 |
| ARC-Challenge | acc_norm | 0.262 |
| WinoGrande | acc | 0.508 |
| TruthfulQA MC2 | acc | 0.493 |
Note: these scores indicate the model is significantly undertrained. Competitive 3B models are typically trained on 2-11T tokens; this model saw substantially fewer. A v2 trained on dramatically more data is planned.
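The scores above should be reproducible with the harness CLI. A sketch, assuming the `lm_eval` entry point from lm-evaluation-harness v0.4.x (the batch size is illustrative; adjust for your hardware):

```shell
pip install "lm-eval==0.4.*"

lm_eval --model hf \
  --model_args pretrained=0arch-io/kisoku-3b-base,dtype=bfloat16 \
  --tasks hellaswag,arc_challenge,winogrande,truthfulqa_mc2 \
  --num_fewshot 0 \
  --limit 500 \
  --batch_size 8
```

`--limit 500` matches the 500-samples-per-task setting reported above; full-dataset runs may shift the scores slightly.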
## Usage
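A minimal loading sketch with `transformers` (the repo id is taken from this card; generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "0arch-io/kisoku-3b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Base model: plain text continuation, no chat template.
inputs = tokenizer("The history of aviation began", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model with no instruction tuning, prompt it as a text-completion engine rather than with chat-style instructions.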
## Limitations
- Undertrained: needs significantly more tokens to reach competitive performance
- English-focused
- No safety alignment or content filtering applied
- Base model only (not instruction-tuned)
## Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
## License
Apache 2.0