# LLaMA-124M

A 124M-parameter LLaMA-style decoder-only transformer trained from scratch on FineWeb-Edu (sample-10BT, ~9.65 billion tokens).

Every component (attention, feed-forward layers, normalization, positional encoding, training loop) is implemented from scratch in pure PyTorch, with no dependency on the HuggingFace `transformers` library.

Source code: github.com/alexgarabt/transformer
## Model Architecture

| Parameter | Value |
|---|---|
| Parameters | 124,472,064 |
| `d_model` | 768 |
| `n_heads` | 12 |
| `n_layers` | 12 |
| `d_ff` | 2,560 |
| `max_seq_len` | 1,024 |
| Activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Learned |
| Weight tying | Yes (embedding and LM head share weights) |
| Vocab size | 32,000 (SentencePiece BPE) |
| Dropout | 0.0 |
| Bias | No |
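The 124,472,064 figure can be cross-checked by hand from the table. Below is a minimal sketch of that arithmetic, assuming bias-free attention and SwiGLU projections, scale-only RMSNorm (two per block plus one final norm), a learned positional table of `max_seq_len` rows, and a tied embedding/LM head; the breakdown is an illustration, not taken from the repository's code:

```python
# Cross-check the parameter count from the architecture table.
d_model, n_heads, n_layers, d_ff = 768, 12, 12, 2560
vocab_size, max_seq_len = 32000, 1024

embed = vocab_size * d_model     # token embedding, tied with the LM head
pos = max_seq_len * d_model      # learned positional embedding
attn = 4 * d_model * d_model     # Q, K, V, and output projections (no bias)
ffn = 3 * d_model * d_ff         # SwiGLU: gate, up, and down projections
norms = 2 * d_model              # two RMSNorm scales per block
final_norm = d_model             # final RMSNorm before the LM head

total = embed + pos + n_layers * (attn + ffn + norms) + final_norm
print(total)  # 124472064
```

With weight tying, the LM head adds no parameters of its own, which is why the embedding matrix is counted only once.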
## Training Details

| Detail | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~9.65B tokens) |
| Epochs | 1 |
| Effective batch size | 128 × 1,024 × 4 = 524,288 tokens/step |
| Optimizer | AdamW (lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1) |
| LR schedule | Cosine decay with 750 warmup steps |
| Precision | bfloat16 (mixed precision) |
| Attention | Flash Attention (`F.scaled_dot_product_attention`) |
| Gradient clipping | 1.0 (max norm) |
| Hardware | 1× NVIDIA H200 (~110 GB VRAM) |
| Training time | ~6 hours |
| Estimated cost | ~$24 |
| Final val loss | 3.1626 |
| Final perplexity | ~23.6 |
| Seed | 42 |
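The schedule in the table (linear warmup to the peak LR, then cosine decay) can be sketched as a plain function. The minimum LR and total step count below are illustrative assumptions, not values from the training run; ~18,400 steps is simply ~9.65B tokens divided by 524,288 tokens/step:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=750, total_steps=18_400):
    """Linear warmup to max_lr, then cosine decay toward min_lr.

    min_lr and total_steps are assumed for illustration only.
    """
    if step < warmup:
        # Linear warmup over the first `warmup` steps.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The same shape is what PyTorch's `LambdaLR` expects if the function is divided through by `max_lr`.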
## Usage

### Quick Start

```shell
git clone https://github.com/alexgarabt/transformer
cd transformer
uv sync --extra data
uv run python scripts/inference.py --prompt "The meaning of life" --device cpu
```
### Download Weights Manually

```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("alexgarabt/llama-124m-fineweb", "model.pt")
params_path = hf_hub_download("alexgarabt/llama-124m-fineweb", "params.json")
tokenizer_path = hf_hub_download("alexgarabt/llama-124m-fineweb", "tokenizer.model")
```
### Load and Generate (Python)

```python
import json

import torch
from transformer.config import TransformerLMConfig
from transformer.models import TransformerLM
from transformer.data import Tokenizer
from transformer.generation import generate

# Load config
with open(params_path) as f:
    params = json.load(f)
config = TransformerLMConfig(**params["model"])
model = TransformerLM(config)

# Load weights
checkpoint = torch.load(model_path, map_location="cpu", weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load tokenizer and generate
tokenizer = Tokenizer(tokenizer_path)
prompt_ids = torch.tensor([tokenizer.encode("Once upon a time", add_bos=True)])
output = generate(model, prompt_ids, max_new_tokens=100, temperature=0.8, top_p=0.95)
print(tokenizer.decode(output[0].tolist()))
```
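The `generate` call above combines temperature scaling with top-p (nucleus) sampling. A minimal single-step sketch of that filtering logic, shown as an illustration of the technique rather than the repository's actual implementation:

```python
import torch

def sample_top_p(logits, temperature=0.8, top_p=0.95):
    """Sample one token id from 1-D logits via temperature + nucleus filtering."""
    # Temperature scaling sharpens (<1) or flattens (>1) the distribution.
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative probability
    # covers top_p; the highest-probability token always survives.
    mask = cumulative - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the surviving mass
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[next_sorted]
```

Lower `top_p` makes output more conservative by cutting the long tail of unlikely tokens before sampling.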
## Files in This Repository

| File | Description | Size |
|---|---|---|
| `model.pt` | Model weights (clean state dict) | ~500 MB |
| `params.json` | Model & training configuration | ~1 KB |
| `tokenizer.model` | SentencePiece BPE tokenizer | ~786 KB |
| `tokenizer.vocab` | Vocabulary file | ~503 KB |
| `config/llama_124M.json` | Training configuration | ~1 KB |
| `runs/` | TensorBoard training logs | ~95 MB |
Limitations
- Base model only β no instruction tuning (SFT), no RLHF, no DPO. The model generates coherent English text but does not follow instructions or answer questions reliably.
- Small model β 124M parameters is relatively small. Generation quality is limited compared to larger models.
- Single epoch β trained for only one pass over the dataset.
- English only β the tokenizer and training data are English.
- Max 1024 tokens β the model's positional encoding supports sequences up to 1024 tokens.
## Evaluation results

| Metric | Dataset | Value |
|---|---|---|
| Validation loss | FineWeb-Edu (sample-10BT) | 3.163 (self-reported) |
| Perplexity | FineWeb-Edu (sample-10BT) | 23.620 (self-reported) |