PicoLM-0.5M

A ~0.48M-parameter GPT-2-style causal language model, pretrained from scratch with a custom 4,096-token BPE vocabulary. It is the smallest model in the PicoLM family and was trained in roughly 25 minutes on a single NVIDIA T4 GPU.

Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | ~0.48M |
| Context length | 256 tokens |
| Vocabulary size | 4,096 (custom BPE) |
| Layers | 4 |
| Attention heads | 4 |
| Hidden size | 64 |
| FFN size | 256 |
| Tokenizer | Custom BPE trained on TinyStories |
| Training steps | 5,000 |
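For reference, the hyperparameters above map onto a GPT-2 configuration roughly as follows. This is a plain-Python sketch; the field names mirror Hugging Face's `GPT2Config` conventions, and the per-head dimension is derived here rather than stated in the table.

```python
# Sketch of the model configuration; field names follow Hugging Face
# GPT2Config conventions (n_embd, n_layer, n_head, n_inner).
config = {
    "vocab_size": 4096,   # custom BPE vocabulary
    "n_positions": 256,   # context length in tokens
    "n_embd": 64,         # hidden size
    "n_layer": 4,         # transformer layers
    "n_head": 4,          # attention heads
    "n_inner": 256,       # FFN size (4x hidden)
}

# Each attention head works in a 64 / 4 = 16-dimensional subspace.
head_dim = config["n_embd"] // config["n_head"]
print(head_dim)  # 16
```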

Training

Hardware: Google Colab, NVIDIA T4 (15GB VRAM)

Dataset mix:

  • 50% TinyStories — synthetic English children's stories
  • 25% Gutenberg English — public domain classic literature
  • 15% WikiText-2 — English Wikipedia slice
  • 10% CoLA — grammatically labeled English sentences
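The mixing ratios above can be realized with simple weighted sampling over the source streams. The snippet below is illustrative only, not the actual pipeline (the real run streamed Hugging Face datasets); the source names are placeholders.

```python
import random

# Illustrative weighted sampling over placeholder source names,
# using the stated 50/25/15/10 mix.
sources = ["tinystories", "gutenberg", "wikitext2", "cola"]
weights = [0.50, 0.25, 0.15, 0.10]

random.seed(0)
draws = random.choices(sources, weights=weights, k=10_000)
share = draws.count("tinystories") / len(draws)
print(round(share, 2))  # close to 0.50
```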

Training config:

  • Optimizer: AdamW (lr=5e-4, weight_decay=0.1)
  • LR schedule: Cosine with 300 warmup steps
  • Batch size: 32 per device × 2 gradient-accumulation steps = effective batch size 64
  • Mixed precision: fp16
  • Streaming: yes (no full dataset download)
  • Custom BPE tokenizer trained on 50k TinyStories samples
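The schedule above (linear warmup for 300 steps, then cosine decay over the remaining 4,700 of 5,000 steps) works out to something like the following sketch. The decay-to-zero floor is an assumption, since the minimum learning rate is not stated.

```python
import math

MAX_LR, WARMUP, TOTAL = 5e-4, 300, 5000

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to 0 (assumed floor)."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MAX_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(150))   # halfway through warmup: 2.5e-4
print(lr_at(300))   # peak: 5e-4
```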

Usage

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast, GenerationConfig
import torch

tokenizer = PreTrainedTokenizerFast.from_pretrained("Tralalabs/PicoLM-0.5M")
model = GPT2LMHeadModel.from_pretrained("Tralalabs/PicoLM-0.5M")
model.eval()

inputs = tokenizer("Once upon a time", return_tensors="pt")
gen_config = GenerationConfig(
    max_new_tokens=60,
    do_sample=True,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
with torch.no_grad():
    out = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Sample Outputs

Prompt: Once upon a time

Once upon a time the Itfore, or more as he must not of the place. As he could ween her, "I said, and all his head...

Prompt: In the beginning

In the beginning of the "I What did I think you, so the devan the I would have done...

Parameter Breakdown

| Component | Params |
|---|---|
| Token embedding (4,096 × 64) | 262,144 |
| Position embedding (256 × 64) | 16,384 |
| 4 transformer layers | 198,656 |
| Final LayerNorm | 128 |
| LM head (tied to token embedding) | 0 |
| Total | 477,312 |
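The totals above check out arithmetically. A quick sanity check (the final LayerNorm's 128 parameters are assumed to be 64 weights plus 64 biases):

```python
# Reproduce the parameter totals from the breakdown table.
tok_emb = 4096 * 64          # token embedding -> 262,144
pos_emb = 256 * 64           # position embedding -> 16,384
layers = 198_656             # 4 transformer layers (from the table)
ln_f = 64 + 64               # final LayerNorm: weight + bias -> 128
lm_head = 0                  # tied to the token embedding

total = tok_emb + pos_emb + layers + ln_f + lm_head
print(total)  # 477312
```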

Languages

English only. All training datasets are English. The model has no meaningful multilingual capability.

Knowledge

| Property | Value |
|---|---|
| Training data cutoff | 2023 (TinyStories generation date) |
| Knowledge cutoff | ~2016 (WikiText-2 Wikipedia snapshot) |
| Real-world knowledge | Effectively none |
| Oldest data | Pre-1928 (Project Gutenberg) |

Limitations

  • Extremely small scale — outputs are often incoherent or repetitive
  • Custom 4,096-entry vocabulary is not interchangeable with standard pretrained tokenizers (e.g. GPT-2's); always load the bundled tokenizer
  • Gutenberg OCR artifacts may appear in outputs (stray numbers, broken Unicode)
  • Not instruction-tuned — base pretrained model only
  • No real-world knowledge retention at this scale
  • Best treated as a research/educational artifact

Intended Use

  • Educational — understanding minimum viable language model pretraining
  • Baseline for ultra-small model research
  • Experimentation with tiny model fine-tuning
  • Part of the PicoLM model family

PicoLM Family

| Model | Params | Status |
|---|---|---|
| PicoLM-0.5M | 0.48M | Released |
| PicoLM-15M | 19M | Released |
| PicoLM-15M-IT | 19M | Released |
| PicoLM-60M | 60M | Planned |

Citation

```bibtex
@misc{picolm2026,
  author = {Tralalabs},
  title = {PicoLM-0.5M: A Minimum Viable Language Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Tralalabs/PicoLM-0.5M}
}
```