Kisoku 3B Base

A 3B parameter language model trained entirely from scratch on Google Cloud TPUs using MaxText (JAX), supported by Google's TPU Research Cloud (TRC).

Overview

Kisoku 3B is an independent research project by a solo researcher, trained from scratch with no pretrained weight initialization. This model represents a learning journey in open-source LLM pretraining on TPU infrastructure.

This is the base (pretrained) model. For the instruction-tuned version, see kisoku-3b-sft.

Architecture

Parameter           Value
Architecture        LlamaForCausalLM
Parameters          ~3B
Layers              28
Hidden size         3072
FFN size            8192
Attention heads     24
KV heads            6 (grouped-query attention)
Head dim            128
Vocab size          128,256
Context length      4,096
Position encoding   RoPE (theta = 500,000)
Activation          SiLU
Tied embeddings     Yes
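For reference, the table above corresponds to a Hugging Face `LlamaConfig` roughly like the one below. This is a sketch, not the checkpoint's exact `config.json`: field names follow transformers' Llama implementation, and the parameter count is estimated analytically from the table rather than read from the weights.

```python
from transformers import LlamaConfig

# Sketch of a config matching the architecture table above.
config = LlamaConfig(
    vocab_size=128_256,
    hidden_size=3072,
    intermediate_size=8192,
    num_hidden_layers=28,
    num_attention_heads=24,
    num_key_value_heads=6,        # grouped-query attention
    max_position_embeddings=4096,
    rope_theta=500_000.0,
    hidden_act="silu",
    tie_word_embeddings=True,
)

def estimate_params(cfg: LlamaConfig) -> int:
    """Analytic parameter count assuming no biases and tied embeddings."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads        # 3072 / 24 = 128
    embed = cfg.vocab_size * cfg.hidden_size                     # counted once (tied)
    attn = (2 * cfg.hidden_size * cfg.hidden_size                # q and o projections
            + 2 * cfg.hidden_size * cfg.num_key_value_heads * head_dim)  # k and v
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size            # gate, up, down
    norms = 2 * cfg.hidden_size                                  # pre-attn / pre-MLP RMSNorm
    return embed + cfg.num_hidden_layers * (attn + mlp + norms) + cfg.hidden_size

print(estimate_params(config))  # 3,168,709,632 -> the "~3B" in the table
```

The estimate lands at about 3.17B parameters, consistent with the "~3B" headline figure.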

Training Details

Detail             Value
Framework          MaxText (JAX) on Google Cloud TPU
Hardware           TPU v4-32 (32 chips), on-demand
Training steps     460,000
Training data      DCLM-Baseline 1.0, FineWeb-Edu
Precision          bfloat16
Compute provider   Google TPU Research Cloud (TRC)

Benchmarks

Evaluated with lm-evaluation-harness v0.4.11, 0-shot, 500 samples per task:

Benchmark        Metric    Score
HellaSwag        acc_norm  0.300
ARC-Challenge    acc_norm  0.262
WinoGrande       acc       0.508
TruthfulQA MC2   acc       0.493

Note: Benchmark scores indicate the model is significantly undertrained. Competitive 3B models are typically trained on 2-11T tokens; this model was trained on substantially fewer. A v2 trained on dramatically more data is planned.
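The scores above should be reproducible with a command along these lines (a sketch: task names follow lm-evaluation-harness conventions, and exact flags can differ between harness versions):

```shell
# Sketch: evaluate on the four tasks above, 0-shot, 500 samples per task.
lm_eval --model hf \
  --model_args pretrained=0arch-io/kisoku-3b-base,dtype=bfloat16 \
  --tasks hellaswag,arc_challenge,winogrande,truthfulqa_mc2 \
  --num_fewshot 0 \
  --limit 500
```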

Usage
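A minimal loading-and-generation sketch with transformers is shown below. The model id is assumed from this repo's name; dtype and device settings are illustrative and can be adjusted. Remember this is a base model: it continues text rather than following instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "0arch-io/kisoku-3b-base"  # assumed from the repo name

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Sample a continuation from the base model (not instruction-tuned)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example:
#   print(generate("Tensor processing units were first deployed"))
```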

Limitations

  • Undertrained: needs significantly more tokens to reach competitive performance
  • English-focused
  • No safety alignment or content filtering applied
  • Base model only (not instruction-tuned)

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC).

License

Apache 2.0
