KodaLite-1.3B (Koda-v0.1)

A 1.27B parameter LLaMA-style decoder-only language model, trained entirely from scratch on 2x NVIDIA L40S GPUs using JAX + Flax NNX, then converted to HuggingFace Transformers format.

TL;DR — KodaLite reaches ~37% average accuracy on standard LLM benchmarks. It is severely undertrained (only 1.64B tokens vs 40B–3T for comparable models), which places it just below GPT-2-124M despite having 10× more parameters. A nice illustration of the Chinchilla scaling law: tokens matter more than parameters at this budget.

Benchmark results (zero-shot, 8 standard tasks)

Evaluated zero-shot against eight GPT-2-, OPT-, Pythia-, and TinyLlama-family baselines (0.12B–1.56B parameters) on the same eight benchmarks (HellaSwag, ARC-Easy, ARC-Challenge, WinoGrande, PIQA, BoolQ, OpenBookQA, LAMBADA-OpenAI).

| Rank | Model | Params | Train tokens | Avg accuracy |
|------|-------|--------|--------------|--------------|
| 1 | TinyLlama-1.1B | 1.10B | 3000B | 50.3% |
| 2 | Pythia-1.4B | 1.41B | 300B | 50.2% |
| 3 | GPT-2-XL | 1.56B | 40B | 49.4% |
| 4 | OPT-1.3B | 1.32B | 180B | 49.1% |
| 5 | Pythia-1B | 1.01B | 300B | 47.6% |
| 6 | GPT-2-large | 0.77B | 40B | 46.2% |
| 7 | GPT-2-medium | 0.35B | 40B | 44.2% |
| 8 | GPT-2-124M | 0.12B | 40B | 39.7% |
| 9 | KodaLite-1.3B | 1.27B | 1.64B | 36.8% |

Per-task breakdown

| Task | KodaLite-1.3B | GPT-2-124M | GPT-2-XL | Pythia-1.4B | TinyLlama-1.1B | Random |
|------|---------------|------------|----------|-------------|----------------|--------|
| HellaSwag | 25.65 | 29.22 | 47.94 | 49.21 | 56.2 | 25.0 |
| ARC-Easy | 32.79 | 38.30 | 50.80 | 51.73 | 43.9 | 25.0 |
| ARC-Challenge | 21.50 | 22.70 | 28.16 | 29.01 | 30.0 | 25.0 |
| WinoGrande | 49.57 | 49.49 | 51.93 | 52.88 | 52.2 | 50.0 |
| PIQA | 58.92 | 62.24 | 70.89 | 71.22 | 72.1 | 50.0 |
| BoolQ | 44.34 | 49.76 | 61.59 | 63.70 | 60.6 | 50.0 |
| OpenBookQA | 25.00 | 26.40 | 34.20 | 33.40 | 37.2 | 25.0 |
| LAMBADA (acc / ppl) | 18.22 / 93.8 | 30.84 / 17.5 | 50.79 / 6.4 | 61.03 / 3.8 | — | — |

Why KodaLite scores below GPT-2-124M (despite being 10× bigger)

The Chinchilla scaling law (Hoffmann et al., DeepMind, 2022) says that a model with N parameters needs roughly 20 × N training tokens to be compute-optimal:

| Model | Params | Chinchilla target (~20× params) | Actual tokens | Ratio |
|-------|--------|--------------------------------|---------------|-------|
| KodaLite-1.3B | 1.27B | ~25B | 1.64B | 6.5% 🔴 |
| GPT-2-XL | 1.5B | ~30B | 40B | 133% |
| Pythia-1.4B | 1.4B | ~28B | 300B | 1070% |
| TinyLlama-1.1B | 1.1B | ~22B | 3000B | 13600% |

KodaLite has seen only 6.5% of what it would need to be competitive. A bigger but undertrained model scores lower than a smaller but well-trained one. The LAMBADA perplexity (94 vs 17 for GPT-2-124M) is the clearest signal: the base language modeling is not converged.
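The ratios in the table above come from simple arithmetic; a minimal sketch, using the parameter and token counts already listed:

```python
# Back-of-the-envelope Chinchilla check: optimal token budget ≈ 20 × parameters.
# Figures are taken from the tables above (billions).
models = {
    # name: (params_b, train_tokens_b)
    "KodaLite-1.3B": (1.27, 1.64),
    "GPT-2-XL": (1.5, 40.0),
    "Pythia-1.4B": (1.4, 300.0),
    "TinyLlama-1.1B": (1.1, 3000.0),
}

for name, (params_b, tokens_b) in models.items():
    target_b = 20 * params_b      # Chinchilla-optimal token budget
    ratio = tokens_b / target_b   # fraction of that budget actually seen
    print(f"{name}: target ~{target_b:.0f}B tokens, ratio {ratio:.1%}")
```

For KodaLite this prints a target of ~25B tokens and a ratio of 6.5%, matching the table.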

On PIQA (physical commonsense) the gap is smallest — that kind of knowledge appears to be learned faster than factual knowledge or precise language modeling.

Chat Format

The model uses three text markers (`<|user|>`, `<|assistant|>`, `<|end|>`) followed by `<|endoftext|>` (token id 50256, the GPT-2 BPE EOS):

```
<|user|>
Your question
<|assistant|>
Model response
<|end|><|endoftext|>
```

A short LoRA pass (May 2026) taught the model to emit <|endoftext|> (50256) right after <|end|>, so generation now stops natively on EOS in Transformers, MLX, llama.cpp, Ollama, and LM Studio, without any stop_strings workaround.

Usage (Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YoAbriel/KodaLite-1.3B")
model = AutoModelForCausalLM.from_pretrained(
    "YoAbriel/KodaLite-1.3B", dtype=torch.bfloat16, device_map="auto"
)

msg = [{"role": "user", "content": "What is the capital of France?"}]
# add_generation_prompt=True appends the <|assistant|> header so the model answers
prompt = tok.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40,
)
# decode only the newly generated tokens
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Usage (MLX — Apple Silicon)

See YoAbriel/KodaLite-1.3B-mlx.

```python
from mlx_lm import load, generate

model, tok = load("YoAbriel/KodaLite-1.3B-mlx-8bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tok, prompt=prompt, max_tokens=150))
```

Usage (llama.cpp / Ollama / LM Studio)

See YoAbriel/KodaLite-1.3B-GGUF.

```shell
ollama run hf.co/YoAbriel/KodaLite-1.3B-GGUF:Q4_K_M
```

In LM Studio, just load the GGUF file. The model now emits <|endoftext|> (token 50256) at the end of every turn, so it stops natively without any Stop String configuration. Output may include the <|end|> text marker, which is harmless and easy to strip.
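If the leftover `<|end|>` text marker bothers you, a few lines of post-processing remove it; a minimal sketch (the `strip_markers` helper is hypothetical, not shipped with the model):

```python
def strip_markers(text: str) -> str:
    """Remove the <|end|> text marker and any EOS remnant from a response."""
    for marker in ("<|endoftext|>", "<|end|>"):
        text = text.replace(marker, "")
    return text.strip()
```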

Architecture (LLaMA-compatible)

| Component | Value |
|-----------|-------|
| Parameters | 1.27B |
| Layers | 24 |
| Hidden size | 2048 |
| Attention | GQA (32 Q / 8 KV heads) |
| Head dim | 64 |
| FFN | SwiGLU, intermediate 5504 |
| Normalization | RMSNorm (pre-norm) |
| Position | RoPE (theta = 10000) |
| Context | 1024 tokens |
| Vocab | 50,257 (GPT-2 BPE) |
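As a sanity check, the 1.27B figure can be rebuilt from the table above, assuming untied input and output embeddings (an assumption on our part; with tied embeddings the count would come out near 1.17B):

```python
# Rebuild the parameter count from the architecture table.
vocab, d_model, n_layers = 50_257, 2048, 24
n_kv_heads, head_dim = 8, 64
d_ffn = 5504

kv_dim = n_kv_heads * head_dim                       # 512 (GQA shrinks K/V)
attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O + K, V projections
ffn = 3 * d_model * d_ffn                            # gate, up, down (SwiGLU)
norms = 2 * d_model                                  # two RMSNorms per layer
per_layer = attn + ffn + norms

# embeddings + lm_head (untied) + final norm
total = n_layers * per_layer + 2 * vocab * d_model + d_model
print(f"{total / 1e9:.2f}B")  # prints 1.27B
```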

Training

Pre-training

  • Dataset: SlimPajama-6B (streaming)
  • Tokens seen: 1.64B
  • Hardware: 2x NVIDIA L40S (96GB VRAM total)
  • Precision: bfloat16
  • Framework: JAX + Flax NNX (trained from scratch, no base model)

SFT

  • Datasets: Databricks Dolly-15K + OpenAssistant OASST1
  • Method: LoRA (rank=16, alpha=32), then merged into base weights
  • End-of-turn: <|end|> (5 BPE tokens) followed by <|endoftext|> (token 50256, the GPT-2 EOS)

EOS fine-tune (May 2026)

  • Goal: teach the model to emit <|endoftext|> (50256) right after <|end|> so any framework with single-token EOS support (GGUF, MLX, Transformers) can stop natively.
  • Method: 200 extra LoRA steps on the existing SFT corpus with <|endoftext|> appended after each <|end|> boundary.
  • Result: 5/5 MLX 8bit and llama.cpp tests stop on EOS without stop_strings workarounds.
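The data patch for this fine-tune can be sketched as a string transform over each SFT example (illustrative only; the actual training script is not published):

```python
def add_eos_after_end(text: str) -> str:
    """Append <|endoftext|> after every <|end|> turn boundary."""
    out = text.replace("<|end|>", "<|end|><|endoftext|>")
    # Collapse accidental doubles so the transform is safe to re-run
    # on examples that were already patched.
    return out.replace("<|endoftext|><|endoftext|>", "<|endoftext|>")
```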

Limitations

  • Severely undertrained (6.5% of Chinchilla-optimal) — factual accuracy is low
  • May produce repetitive or inaccurate responses
  • English only
  • 1024 context window
  • Educational / research project — not production-ready

Lessons learned (for a potential v0.2)

  1. Train longer: aim for 20B+ tokens (Chinchilla-optimal for 1.3B would be ~25B).
  2. Pick a single-token end-of-turn marker from the start. We initially trained with <|end|> (5 BPE tokens), which broke single-token EOS frameworks. Patching it after the fact via a 200-step LoRA worked, but designing with a single token like <|endoftext|> would have been cleaner.
  3. The SwiGLU + RMSNorm + GQA + RoPE architecture itself caused no problems: loss followed the expected scaling curve, so v0.2 can keep it unchanged.

License

Apache 2.0
