LLM from Scratch (124M)

A clean, from-scratch implementation of a 124M-parameter decoder-only Transformer in PyTorch. No nn.Transformer, no shortcuts: every layer (attention, FFN, LayerNorm, embeddings) is built manually. Trained on FineWeb-Edu with mixed precision, gradient accumulation, and automatic checkpoint resume.

🚀 Live Demo: https://avneeshjadhav04--llm-api.modal.run

📂 Training Code: https://github.com/avneeshjadhav04/llm-from-scratch

Model Description

This is a GPT-2-style causal language model trained entirely from random initialization on the FineWeb-Edu dataset. It was built as an educational and portfolio project to demonstrate deep understanding of Transformer internals and large-scale training loops.

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~124.44M |
| Vocab Size | 50,257 (GPT-2 / tiktoken) |
| Context Length | 1024 tokens |
| Hidden Size | 768 |
| Layers | 12 |
| Attention Heads | 12 |
| FFN Dimension | 3,072 |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Weight Tying | Input / Output embeddings shared |
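
The ~124.44M figure can be reproduced from the values listed above. The sketch below is only an estimate under stated assumptions that this card does not confirm: learned absolute position embeddings, bias terms on every linear layer, a fused QKV projection, and a final LayerNorm before the tied output head (the original GPT-2 layout). The function name is hypothetical.

```python
def gpt2_param_count(vocab=50_257, ctx=1024, d_model=768, n_layers=12, d_ff=3072):
    """Estimate the parameter count of a GPT-2-style model with tied embeddings."""
    tok_emb = vocab * d_model   # shared with the output projection (weight tying)
    pos_emb = ctx * d_model     # assumption: learned absolute position embeddings

    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to d_ff, then project back, both with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)

    per_layer = attn + ffn + norms
    final_norm = 2 * d_model    # assumption: one LayerNorm after the last block
    return tok_emb + pos_emb + n_layers * per_layer + final_norm


total = gpt2_param_count()
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

The number of attention heads does not appear in the formula because splitting the 768-dimensional projection into 12 heads changes the shape of the computation, not the number of weights.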

Training Results

| Metric | Value |
|---|---|
| Final validation loss | 2.6943 |
| Final validation perplexity | 14.8 |
| Training tokens | ~2B (FineWeb-Edu) |
| Train/val split | 95% / 5% |
| Hardware | NVIDIA A100-SXM4-80GB |
| Wall time | ~5 hours |
| Framework | PyTorch 2.10 + torch.compile |
| Precision | FP16 mixed precision |
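
The loss and perplexity reported above are two views of the same quantity. A minimal check, assuming the validation loss is the mean per-token cross-entropy in nats (the default for PyTorch's cross-entropy loss):

```python
import math

val_loss = 2.6943                 # final validation loss reported above
perplexity = math.exp(val_loss)   # perplexity = exp(mean cross-entropy in nats)
print(f"perplexity = {perplexity:.1f}")
```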

Intended Use

  • Educational: Understanding how GPT-2-scale models work under the hood.
  • Research: Baseline for data-efficiency or architectural ablation studies.
  • Text Generation: Short-form English text completion with temperature / top-k / top-p sampling.

Limitations

  • Scale: 124M params is small by modern standards. Output can be repetitive on long generations.
  • Dataset: Trained exclusively on FineWeb-Edu (filtered web text). May reflect web biases.
  • Language: English only.
  • Length: Best results under 50–100 generated tokens. Longer generations may loop.
  • Optimization: Trained to roughly Chinchilla-optimal scale (~20 tokens per parameter); the ~2B training tokens work out to about 16 tokens per parameter.
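
The tokens-per-parameter ratio behind the Chinchilla remark can be worked out from the figures in this card. This is a rough sketch: both inputs are the approximate values reported here, and the factor of 20 is the commonly quoted rule of thumb rather than an exact constant.

```python
params = 124.44e6   # approximate parameter count from this card
tokens = 2e9        # approximate training tokens from this card

ratio = tokens / params           # tokens seen per parameter
rule_of_thumb = 20 * params       # tokens suggested by the ~20x heuristic

print(f"tokens per parameter: {ratio:.1f}")
print(f"~20x heuristic: {rule_of_thumb / 1e9:.2f}B tokens")
```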

Architecture Highlights

  1. Manual Transformer: Every component coded from scratch (no nn.Transformer).
  2. Pre-Norm + Weight Tying: Stable training, fewer parameters, better perplexity.
  3. Mixed Precision: FP16 via torch.amp for ~1.5× speedup.
  4. Session-Safe: Auto-resumes from any checkpoint (safe for Colab / Kaggle timeouts).
  5. Custom Sampling: Temperature, top-k, and top-p nucleus sampling implemented manually.
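
The sampling strategies in item 5 can be sketched as follows. This is a pure-Python illustration for readability, not the repository's PyTorch code; the function names `filter_logits` and `sample_token` are hypothetical, and the choice to apply top-k before top-p is one common convention among several.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return sampling probabilities after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]

    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids ordered from most to least probable.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        keep = keep[:max(1, top_k)]           # keep only the k most likely tokens
    if top_p is not None:
        mass = sum(probs[i] for i in keep)    # renormalize over the survivors
        cumulative, nucleus = 0.0, []
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:           # smallest prefix reaching top_p
                break
        keep = nucleus

    # Zero out the rejected tokens and renormalize the rest.
    kept = set(keep)
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token id from the filtered distribution."""
    probs = filter_logits(logits, temperature, top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_logits(example_logits, temperature=0.8, top_k=2))
print(sample_token(example_logits, temperature=0.8, top_k=40, top_p=0.9))
```

Lower temperatures sharpen the distribution toward the most likely token, while top-k and top-p both cut off the low-probability tail: top-k by a fixed count, top-p by cumulative probability mass.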

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace the placeholder repo id with the actual model repository.
repo_id = "your-username/llm-from-scratch-124m"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

inputs = tokenizer("The future of artificial intelligence is", return_tensors="pt")
# do_sample=True is required for temperature and top_k to take effect.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_k=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Citation

If you use this model or code, please cite:

```bibtex
@misc{llm-from-scratch-2024,
  author = {Avneesh Jadhav},
  title = {LLM from Scratch: A 124M-Parameter Transformer in PyTorch},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/avneeshjadhav04/llm-from-scratch}}
}
```

Acknowledgments

License
