Kisoku 3B Base

A 3B parameter language model trained entirely from scratch on Google Cloud TPUs using MaxText (JAX), supported by Google's TPU Research Cloud (TRC).

Overview

Kisoku 3B is an independent research project by a solo researcher, trained from scratch with no pretrained weight initialization. This model represents a learning journey in open-source LLM pretraining on TPU infrastructure.

This is the base (pretrained) model. For the instruction-tuned version, see kisoku-3b-sft.

Architecture

Parameter           Value
Architecture        LlamaForCausalLM
Parameters          ~3B
Layers              28
Hidden size         3072
FFN size            8192
Attention heads     24
KV heads            6 (grouped-query attention)
Head dim            128
Vocab size          128,256
Context length      4,096
Position encoding   RoPE (theta = 500,000)
Activation          SiLU
Tied embeddings     Yes
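For reference, the table above corresponds to a Hugging Face `LlamaConfig` roughly like the one below. This is a sketch, not the checkpoint's exact `config.json`: field names follow transformers' Llama implementation, and the parameter count is estimated analytically from the table rather than read from the weights.

```python
from transformers import LlamaConfig

# Sketch of a config matching the architecture table above.
config = LlamaConfig(
    vocab_size=128_256,
    hidden_size=3072,
    intermediate_size=8192,
    num_hidden_layers=28,
    num_attention_heads=24,
    num_key_value_heads=6,        # grouped-query attention
    max_position_embeddings=4096,
    rope_theta=500_000.0,
    hidden_act="silu",
    tie_word_embeddings=True,
)

def estimate_params(cfg: LlamaConfig) -> int:
    """Analytic parameter count assuming no biases and tied embeddings."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads        # 3072 / 24 = 128
    embed = cfg.vocab_size * cfg.hidden_size                     # counted once (tied)
    attn = (2 * cfg.hidden_size * cfg.hidden_size                # q and o projections
            + 2 * cfg.hidden_size * cfg.num_key_value_heads * head_dim)  # k and v
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size            # gate, up, down
    norms = 2 * cfg.hidden_size                                  # pre-attn / pre-MLP RMSNorm
    return embed + cfg.num_hidden_layers * (attn + mlp + norms) + cfg.hidden_size

print(estimate_params(config))  # 3,168,709,632 -> the "~3B" in the table
```

The estimate lands at about 3.17B parameters, consistent with the "~3B" headline figure.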

Training Details

Detail             Value
Framework          MaxText (JAX) on Google Cloud TPU
Hardware           TPU v4-32 (32 chips), on-demand
Training steps     460,000
Training data      DCLM-Baseline 1.0, FineWeb-Edu
Precision          bfloat16
Compute provider   Google TPU Research Cloud (TRC)

Benchmarks

Evaluated with lm-evaluation-harness v0.4.11, 0-shot, 500 samples per task:

Benchmark        Metric    Score
HellaSwag        acc_norm  0.300
ARC-Challenge    acc_norm  0.262
WinoGrande       acc       0.508
TruthfulQA MC2   acc       0.493

Note: Benchmark scores indicate the model is significantly undertrained. Competitive 3B models are typically trained on 2-11T tokens; this model was trained on substantially fewer. A v2 trained on dramatically more data is planned.
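The scores above should be reproducible with a command along these lines (a sketch: task names follow lm-evaluation-harness conventions, and exact flags can differ between harness versions):

```shell
# Sketch: evaluate on the four tasks above, 0-shot, 500 samples per task.
lm_eval --model hf \
  --model_args pretrained=0arch-io/kisoku-3b-base,dtype=bfloat16 \
  --tasks hellaswag,arc_challenge,winogrande,truthfulqa_mc2 \
  --num_fewshot 0 \
  --limit 500
```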

Usage
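A minimal loading-and-generation sketch with transformers is shown below. The model id is assumed from this repo's name; dtype and device settings are illustrative and can be adjusted. Remember this is a base model: it continues text rather than following instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "0arch-io/kisoku-3b-base"  # assumed from the repo name

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Sample a continuation from the base model (not instruction-tuned)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example:
#   print(generate("Tensor processing units were first deployed"))
```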

Limitations

  • Undertrained: needs significantly more tokens to reach competitive performance
  • English-focused
  • No safety alignment or content filtering applied
  • Base model only (not instruction-tuned)

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).

License

Apache 2.0
