# ZeroShot-500M

A 530M-parameter decoder-only transformer trained entirely from scratch (base pre-training, mid-training, and supervised fine-tuning) on a single rented RTX 5090.

Part of the **ZeroShot** scaling series by TobiasLogic: a progression of increasingly large GPT-2 style LLMs trained from zero, on consumer/prosumer GPUs, at minimal cost.
## Model Details

| | |
|---|---|
| Parameters | ~530M |
| Architecture | GPT-2 style decoder-only transformer |
| Layers | 24 |
| Attention Heads | 20 |
| Embedding Dim | 1280 |
| Head Dim | 64 |
| Context Window | 2,048 tokens |
| Vocab Size | 50,304 (GPT-2 BPE, padded for tensor cores) |
| Precision | bfloat16 |
| Attention | Flash Attention via `F.scaled_dot_product_attention` |
| Weight Tying | Embedding ↔ LM head |
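The headline parameter count can be sanity-checked from the config above. A back-of-the-envelope sketch (ignores LayerNorm and bias terms, so it lands close to, not exactly on, the quoted ~530M):

```python
# Rough parameter estimate for the ZeroShot-500M config
# (GPT-2 style: tied token embedding / LM head, learned positional embeddings).
d_model = 1280      # embedding dim
n_layers = 24
vocab = 50_304      # GPT-2 BPE, padded for tensor cores
ctx = 2_048         # context window

tok_emb = vocab * d_model           # shared with the LM head via weight tying
pos_emb = ctx * d_model
attn_per_layer = 4 * d_model ** 2   # Q, K, V, and output projections
mlp_per_layer = 8 * d_model ** 2    # two linear layers with 4x expansion

total = tok_emb + pos_emb + n_layers * (attn_per_layer + mlp_per_layer)
print(f"~{total / 1e6:.0f}M parameters")  # ~539M parameters
```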
## Training

### Stage 1: Base Pre-training

| | |
|---|---|
| Data | FineWeb-Edu (streamed, zero disk usage) |
| Tokens | ~7.9B |
| Steps | 30,000 |
| LR Schedule | Cosine decay: 4e-4 → 4e-5 |
| Effective Batch | 128 sequences (4 micro × 32 grad accumulation) |
| Final Loss | 2.75 |
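The ~7.9B token figure follows directly from the schedule: steps × effective batch × context length. A quick check:

```python
# Stage 1 token budget: steps x effective batch x context length.
steps = 30_000
micro_batch = 4
grad_accum = 32
ctx = 2_048

eff_batch = micro_batch * grad_accum   # 128 sequences per optimizer step
tokens = steps * eff_batch * ctx
print(f"{tokens / 1e9:.2f}B tokens")   # 7.86B tokens
```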

### Stage 2: Mid-training

| | |
|---|---|
| Steps | 4,975 |
| Final Loss | 1.03 |

### Stage 3: SFT

| | |
|---|---|
| Steps | 1,975 |
| LR | 3e-5 (cosine decay) |
| Final Loss | 0.93 |

## Hardware

| | |
|---|---|
| GPU | NVIDIA RTX 5090 (32GB GDDR7) |
| Platform | Vast.ai (South Korea) |
| Cost/hr | $0.343/hr |
| Throughput | ~43,000 tokens/sec |
| Total Cost | ~$18 |
| PyTorch | Nightly (cu128, Blackwell sm_120) |
| torch.compile | Disabled (unsupported on sm_120) |
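The quoted throughput and hourly rate roughly reproduce the total-cost figure. A sketch for the base run alone (assumes the ~43k tok/s figure holds throughout; mid-training and SFT add a little on top, which is consistent with the ~$18 total):

```python
# Reconstructing the cost of base pre-training from throughput and rental rate.
tokens = 7.9e9          # Stage 1 token budget
tok_per_sec = 43_000    # measured throughput
rate = 0.343            # $/hr on Vast.ai

hours = tokens / tok_per_sec / 3600
cost = hours * rate
print(f"{hours:.0f} GPU-hours, ~${cost:.2f} for base pre-training")
```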
## Checkpoints

| File | Description |
|---|---|
| `ckpt_base_final.pt` | Base pre-trained: 30k steps, loss 2.75 |
| `ckpt_mid_final.pt` | Post mid-training: 4,975 steps, loss 1.03 |
| `ckpt_sft_final.pt` | Final chat model: 1,975 steps, loss 0.93 |
## Usage

Requires the `GPT` class from `train.py` included in this repo.

```python
import torch
import tiktoken

from train import GPT, ModelConfig

# Load the checkpoint (weights_only=False: it stores a config dict, not just tensors)
ckpt = torch.load("ckpt_sft_final.pt", map_location="cuda", weights_only=False)
model = GPT(ModelConfig(**ckpt["model_config"])).to("cuda")
model.load_state_dict(ckpt["model"])
model.eval()

# GPT-2 BPE tokenizer
enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(
    [enc.encode("The meaning of life is")],
    dtype=torch.long,
    device="cuda",
)

with torch.no_grad():
    output = model.generate(tokens, max_new_tokens=200, temperature=0.8, top_k=200)

print(enc.decode(output[0].tolist()))
```
**Blackwell GPU users (RTX 5060 Ti / 5090):** disable `torch.compile` and use a PyTorch nightly build with cu128.
## ZeroShot Family

| Model | Params | Base Loss | SFT Loss | Cost | GPU |
|---|---|---|---|---|---|
| MicroGPT | 30.5M | 3.85 | N/A | Free | RTX 3050 |
| ZeroShot-124M | 124M | 3.45 | 1.60 | ~$6.77 | RTX 5060 Ti |
| ZeroShot-350M | 337M | 3.20 | 1.30 | ~$7.88 | RTX 5090 |
| ZeroShot-500M | 530M | 2.75 | 0.93 | ~$18 | RTX 5090 |
## Limitations

- Undertrained by Chinchilla standards (~75% of the optimal token count for 530M params)
- Will hallucinate, repeat, and struggle on complex reasoning tasks
- English only
- No RLHF or safety alignment
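The ~75% figure above is consistent with the common 20-tokens-per-parameter rule of thumb from the Chinchilla paper:

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
params = 530e6
trained_tokens = 7.9e9

optimal_tokens = 20 * params               # ~10.6B tokens for a 530M model
ratio = trained_tokens / optimal_tokens
print(f"{ratio:.0%} of Chinchilla-optimal")  # 75% of Chinchilla-optimal
```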
## License

MIT