PicoLM-0.5M

A ~0.48M-parameter GPT-2-style causal language model, pretrained from scratch with a custom 4,096-token BPE vocabulary. It is the smallest model in the PicoLM family and was trained in roughly 25 minutes on a single NVIDIA T4 GPU.

Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | ~0.48M |
| Context length | 256 tokens |
| Vocabulary size | 4,096 (custom BPE) |
| Layers | 4 |
| Attention heads | 4 |
| Hidden size | 64 |
| FFN size | 256 |
| Tokenizer | Custom BPE trained on TinyStories |
| Training steps | 5,000 |
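For reference, the hyperparameters above map onto a GPT-2 configuration roughly as follows. This is a plain-Python sketch; the field names mirror Hugging Face's `GPT2Config` conventions, and the per-head dimension is derived here rather than stated in the table.

```python
# Sketch of the model configuration; field names follow Hugging Face
# GPT2Config conventions (n_embd, n_layer, n_head, n_inner).
config = {
    "vocab_size": 4096,   # custom BPE vocabulary
    "n_positions": 256,   # context length in tokens
    "n_embd": 64,         # hidden size
    "n_layer": 4,         # transformer layers
    "n_head": 4,          # attention heads
    "n_inner": 256,       # FFN size (4x hidden)
}

# Each attention head works in a 64 / 4 = 16-dimensional subspace.
head_dim = config["n_embd"] // config["n_head"]
print(head_dim)  # 16
```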

Training

Hardware: Google Colab, NVIDIA T4 (15GB VRAM)

Dataset mix:

  • 50% TinyStories — synthetic English children's stories
  • 25% Gutenberg English — public domain classic literature
  • 15% WikiText-2 — English Wikipedia slice
  • 10% CoLA — grammatically labeled English sentences
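The mixing ratios above can be realized with simple weighted sampling over the source streams. The snippet below is illustrative only, not the actual pipeline (the real run streamed Hugging Face datasets); the source names are placeholders.

```python
import random

# Illustrative weighted sampling over placeholder source names,
# using the stated 50/25/15/10 mix.
sources = ["tinystories", "gutenberg", "wikitext2", "cola"]
weights = [0.50, 0.25, 0.15, 0.10]

random.seed(0)
draws = random.choices(sources, weights=weights, k=10_000)
share = draws.count("tinystories") / len(draws)
print(round(share, 2))  # close to 0.50
```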

Training config:

  • Optimizer: AdamW (lr=5e-4, weight_decay=0.1)
  • LR schedule: Cosine with 300 warmup steps
  • Batch size: 32 per device × 2 gradient-accumulation steps = effective batch size 64
  • Mixed precision: fp16
  • Streaming: yes (no full dataset download)
  • Custom BPE tokenizer trained on 50k TinyStories samples
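The schedule above (linear warmup for 300 steps, then cosine decay over the remaining 4,700 of 5,000 steps) works out to something like the following sketch. The decay-to-zero floor is an assumption, since the minimum learning rate is not stated.

```python
import math

MAX_LR, WARMUP, TOTAL = 5e-4, 300, 5000

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to 0 (assumed floor)."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MAX_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(150))   # halfway through warmup: 2.5e-4
print(lr_at(300))   # peak: 5e-4
```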

Usage

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast, GenerationConfig
import torch

tokenizer = PreTrainedTokenizerFast.from_pretrained("Tralalabs/PicoLM-0.5M")
model = GPT2LMHeadModel.from_pretrained("Tralalabs/PicoLM-0.5M")
model.eval()

inputs = tokenizer("Once upon a time", return_tensors="pt")
gen_config = GenerationConfig(
    max_new_tokens=60,
    do_sample=True,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
with torch.no_grad():
    out = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Sample Outputs

Prompt: Once upon a time

Once upon a time the Itfore, or more as he must not of the place. As he could ween her, "I said, and all his head...

Prompt: In the beginning

In the beginning of the "I What did I think you, so the devan the I would have done...

Parameter Breakdown

| Component | Params |
|---|---|
| Token embedding (4,096 × 64) | 262,144 |
| Position embedding (256 × 64) | 16,384 |
| 4 transformer layers | 198,656 |
| Final LayerNorm | 128 |
| LM head (tied to token embedding) | 0 |
| Total | 477,312 |
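The totals above check out arithmetically. A quick sanity check (the final LayerNorm's 128 parameters are assumed to be 64 weights plus 64 biases):

```python
# Reproduce the parameter totals from the breakdown table.
tok_emb = 4096 * 64          # token embedding -> 262,144
pos_emb = 256 * 64           # position embedding -> 16,384
layers = 198_656             # 4 transformer layers (from the table)
ln_f = 64 + 64               # final LayerNorm: weight + bias -> 128
lm_head = 0                  # tied to the token embedding

total = tok_emb + pos_emb + layers + ln_f + lm_head
print(total)  # 477312
```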

Languages

English only. All training datasets are English. The model has no meaningful multilingual capability.

Knowledge

| Property | Value |
|---|---|
| Training data cutoff | 2023 (TinyStories generation date) |
| Knowledge cutoff | ~2016 (WikiText-2 Wikipedia snapshot) |
| Real-world knowledge | Effectively none |
| Oldest data | Pre-1928 (Project Gutenberg) |

Limitations

  • Extremely small scale — outputs are often incoherent or repetitive
  • Custom 4,096-entry vocabulary is not interchangeable with standard pretrained tokenizers (e.g. GPT-2's); always load the bundled tokenizer
  • Gutenberg OCR artifacts may appear in outputs (stray numbers, broken Unicode)
  • Not instruction-tuned — base pretrained model only
  • No real-world knowledge retention at this scale
  • Best treated as a research/educational artifact

Intended Use

  • Educational — understanding minimum viable language model pretraining
  • Baseline for ultra-small model research
  • Experimentation with tiny model fine-tuning
  • Part of the PicoLM model family

PicoLM Family

| Model | Params | Status |
|---|---|---|
| PicoLM-0.5M | 0.48M | Released |
| PicoLM-15M | 19M | Released |
| PicoLM-15M-IT | 19M | Released |
| PicoLM-60M | 60M | Planned |

Citation

```bibtex
@misc{picolm2026,
  author = {Tralalabs},
  title = {PicoLM-0.5M: A Minimum Viable Language Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Tralalabs/PicoLM-0.5M}
}
```