ZeroShot-500M

A 530M-parameter decoder-only transformer trained entirely from scratch (base pre-training, mid-training, and supervised fine-tuning) on a single rented RTX 5090.

Part of the ZeroShot scaling series by TobiasLogic: a progression of increasingly large GPT-2-style LLMs trained from zero on consumer and prosumer GPUs at minimal cost.


Model Details

| | |
| --- | --- |
| Parameters | ~530M |
| Architecture | GPT-2-style decoder-only transformer |
| Layers | 24 |
| Attention Heads | 20 |
| Embedding Dim | 1280 |
| Head Dim | 64 |
| Context Window | 2,048 tokens |
| Vocab Size | 50,304 (GPT-2 BPE, padded for tensor cores) |
| Precision | bfloat16 |
| Attention | Flash Attention via F.scaled_dot_product_attention |
| Weight Tying | Embedding ↔ LM head |
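As a sanity check, the headline parameter count follows from the config above. The sketch below is a rough estimate that ignores biases and layernorm parameters:

```python
# Rough parameter count for the config above (ignoring biases and layernorms).
d, n_layers, vocab, ctx = 1280, 24, 50304, 2048

assert vocab % 128 == 0          # GPT-2's 50,257 padded up to 50,304 for tensor cores
embed = vocab * d                # token embeddings (tied with the LM head, so counted once)
pos = ctx * d                    # learned positional embeddings
per_block = 12 * d * d           # attention QKV + output proj (4d^2) plus the 4x-wide MLP (8d^2)
total = embed + pos + n_layers * per_block

print(f"~{total / 1e6:.0f}M parameters")  # ~539M, in line with the quoted ~530M
```

Note also that 20 heads × 64 head dim = 1280, matching the embedding dimension.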

Training

Stage 1: Base Pre-training ✅

| | |
| --- | --- |
| Data | FineWeb-Edu (streamed, zero disk usage) |
| Tokens | ~7.9B |
| Steps | 30,000 |
| LR Schedule | Cosine decay, 4e-4 → 4e-5 |
| Effective Batch | 128 sequences (4 micro-batch × 32 grad accumulation) |
| Final Loss | 2.75 |

Base loss curve
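The ~7.9B token figure follows directly from the step count, effective batch size, and context window:

```python
# Tokens seen in Stage 1: steps x effective batch x context length.
steps, batch, ctx = 30_000, 128, 2048
tokens = steps * batch * ctx
print(tokens / 1e9)  # 7.86432 -> the "~7.9B" above
```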

Stage 2: Mid-training ✅

| | |
| --- | --- |
| Data | SmolTalk (80k) + Alpaca (50k) + OpenAssistant (50k) |
| Steps | 4,975 |
| LR | 1e-4 (cosine decay) |
| Final Loss | 1.03 |

Mid loss curve

Stage 3: SFT ✅

| | |
| --- | --- |
| Steps | 1,975 |
| LR | 3e-5 (cosine decay) |
| Final Loss | 0.93 |

SFT loss curve
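All three stages use cosine-decay learning-rate schedules. A generic sketch of that schedule (the function name `cosine_lr` is illustrative, not necessarily what train.py calls it):

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min):
    """Cosine decay from lr_max at step 0 down to lr_min at the final step."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Stage 1 values: 4e-4 -> 4e-5 over 30k steps.
print(cosine_lr(0, 30_000, 4e-4, 4e-5))       # starts at ~4e-4
print(cosine_lr(30_000, 30_000, 4e-4, 4e-5))  # ends at ~4e-5
```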


Hardware

| | |
| --- | --- |
| GPU | NVIDIA RTX 5090 (32 GB GDDR7) |
| Platform | Vast.ai (South Korea) |
| Cost/hr | $0.343 |
| Throughput | ~43,000 tokens/sec |
| Total Cost | ~$18 |
| PyTorch | Nightly (cu128, Blackwell sm_120) |
| torch.compile | Disabled (unsupported on sm_120) |
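The budget roughly checks out from the throughput and hourly rate, assuming Stage 1 dominates total compute:

```python
# Back-of-the-envelope cost check from the numbers above.
tokens = 7.9e9            # Stage 1 token count (the bulk of total compute)
tok_per_sec = 43_000
rate = 0.343              # $/hr on Vast.ai

hours = tokens / tok_per_sec / 3600
cost = hours * rate
print(f"~{hours:.0f} h, ~${cost:.2f}")  # roughly 51 h and $17.5 for Stage 1 alone
```

With mid-training and SFT on top, this lands close to the quoted ~$18 total.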

Checkpoints

| File | Description |
| --- | --- |
| ckpt_base_final.pt | Base pre-trained (30k steps, loss 2.75) |
| ckpt_mid_final.pt | Post mid-training (4,975 steps, loss 1.03) |
| ckpt_sft_final.pt | Final chat model (1,975 steps, loss 0.93) |

Usage

Requires the GPT class from train.py included in this repo.

```python
import torch
import tiktoken
from train import GPT, ModelConfig

# Load the SFT checkpoint (chat model)
ckpt = torch.load("ckpt_sft_final.pt", map_location="cuda", weights_only=False)
model = GPT(ModelConfig(**ckpt["model_config"])).to("cuda")
model.load_state_dict(ckpt["model"])
model.eval()

enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(
    [enc.encode("The meaning of life is")],
    dtype=torch.long,
    device="cuda"
)

with torch.no_grad():
    output = model.generate(tokens, max_new_tokens=200, temperature=0.8, top_k=200)

print(enc.decode(output[0].tolist()))
```
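The generate call above is provided by the GPT class in train.py. For readers curious what the temperature and top_k arguments do, here is a minimal pure-Python sketch of top-k sampling over a single logit vector (`sample_top_k` is illustrative only, not part of this repo):

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0):
    """Sample one token id: keep the k highest logits, softmax, then draw."""
    # Pair each logit with its index and keep only the top-k candidates.
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # Temperature-scaled softmax over the surviving logits (max-subtracted for stability).
    scaled = [v / temperature for _, v in top]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw an index in proportion to its probability.
    r = random.random()
    acc = 0.0
    for (idx, _), p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1][0]
```

Lower temperature sharpens the distribution toward the argmax; smaller k restricts sampling to fewer candidates.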

Blackwell GPU users (RTX 5060 Ti / 5090): disable torch.compile and use PyTorch nightly cu128.


ZeroShot Family

| Model | Params | Base Loss | SFT Loss | Cost | GPU |
| --- | --- | --- | --- | --- | --- |
| MicroGPT | 30.5M | 3.85 | – | Free | RTX 3050 |
| ZeroShot-124M | 124M | 3.45 | 1.60 | ~$6.77 | RTX 5060 Ti |
| ZeroShot-350M | 337M | 3.20 | 1.30 | ~$7.88 | RTX 5090 |
| ZeroShot-500M | 530M | 2.75 | 0.93 | ~$18 | RTX 5090 |

Limitations

  • Undertrained by Chinchilla standards (trained on ~75% of the Chinchilla-optimal token budget for 530M params)
  • Will hallucinate, repeat, and struggle on complex reasoning tasks
  • English only
  • No RLHF or safety alignment
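The ~75% figure follows from the common Chinchilla heuristic of roughly 20 training tokens per parameter:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 530e6
optimal = 20 * params          # ~10.6B tokens for a 530M model
trained = 7.9e9                # tokens actually seen in base pre-training
print(trained / optimal)       # ~0.75 -> the "~75%" figure above
```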

License

MIT
