Custom GPT-2 Small (124M) - Trained from Scratch

This model is a faithful implementation of the GPT-2 Small (124M) architecture, built and trained entirely from scratch using PyTorch. It serves as an educational benchmark for training Large Language Models (LLMs) on constrained hardware (NVIDIA L4) using modern optimization techniques.

Model Details

Model Description

  • Developed by: [Your Name/BitLabs]
  • Model type: Causal Language Model (Transformer Decoder)
  • Language(s) (NLP): English
  • License: MIT
  • Architecture: GPT-2 Small (12 Layers, 12 Heads, 768 Embedding Dim)
  • Context Length: 1024 Tokens
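The architecture figures above can be captured in a small configuration sketch (a hypothetical dataclass for illustration, not the actual training code; the vocabulary size is taken from the Training Data section below):

```python
from dataclasses import dataclass

@dataclass
class GPT2SmallConfig:
    # Values from the Model Description above
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding dimension
    block_size: int = 1024   # context length in tokens
    vocab_size: int = 50257  # GPT-2 BPE vocabulary

cfg = GPT2SmallConfig()
# Each head operates on n_embd / n_head = 64 dimensions
assert cfg.n_embd % cfg.n_head == 0
```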

Uses

Direct Use

This model is intended for:

  • Educational research on LLM training dynamics.
  • Benchmarking optimization techniques (Gradient Checkpointing, Flash Attention) on consumer/mid-range GPUs.
  • Generating short, grammatically correct English text.

Out-of-Scope Use

  • This model is not suitable for factual queries, reasoning tasks, or production deployment requiring deep world knowledge. It was trained in a low-data regime (100k samples, ~100M tokens) and will readily hallucinate facts.

Training Details

Training Data

The model was trained on a subset of the FineWeb dataset (sample-10BT configuration).

  • Split: 100,000 samples (~100 Million tokens).
  • Tokenizer: GPT-2 Byte-Pair Encoding (BPE) with a vocabulary size of 50,257.
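A quick data-budget sanity check ties these numbers to the hyperparameters reported below (hedged arithmetic: it assumes every training sequence is packed to the full 1024-token context):

```python
# Data-budget arithmetic from the figures in this card.
total_tokens = 100_000_000   # ~100M tokens from 100k FineWeb samples
effective_batch = 256        # effective batch size (Training Hyperparameters)
context_length = 1024        # model context window
steps = 1600                 # optimization steps completed

tokens_per_step = effective_batch * context_length   # 262,144 tokens/step
tokens_seen = tokens_per_step * steps                # ~419M tokens total
epochs = tokens_seen / total_tokens                  # ~4.2 passes over the data
print(f"{tokens_per_step:,} tokens/step, ~{epochs:.1f} epochs")
```

So the run makes roughly four passes over the ~100M-token corpus, consistent with the saturation observed in the Training Logs.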

Training Procedure

The primary goal of this training run was to maximize throughput on a single NVIDIA L4 (24GB VRAM) while maintaining stability with a large context window (1024).

Engineering & Optimization

The following engineering choices were implemented to fit the model and batch size into memory:

  1. Gradient Checkpointing: Implemented manually on Transformer Blocks to reduce VRAM usage by ~60%, allowing for a physical batch size of 16.
  2. Flash Attention (SDPA): Utilized torch.nn.functional.scaled_dot_product_attention for memory-efficient QKV operations.
  3. Mixed Precision (AMP): Enabled torch.amp autocast together with TensorFloat-32 (TF32) matrix multiplications, which are supported on Ampere-class and newer GPUs (the L4 is Ada Lovelace).
  4. Fused AdamW: Used the fused kernel implementation of the AdamW optimizer to reduce CPU-GPU dispatch overhead.
  5. Gradient Accumulation: Used 16 accumulation steps to simulate a global effective batch size of 256.
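The interplay of these techniques can be sketched as a single toy optimization step (a minimal CPU-runnable illustration, not the actual training code: dimensions are shrunk, autocast and the fused AdamW kernel are omitted since they target CUDA, and all module names are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Tiny stand-in for one transformer block; the real model stacks 12 of these.
class Block(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, heads, T, head_dim) for SDPA
        q, k, v = (t.view(B, T, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # Item 2: memory-efficient / flash kernel when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

block = Block()
head = nn.Linear(32, 10)
params = list(block.parameters()) + list(head.parameters())
# Item 4 would pass fused=True, which requires CUDA; plain AdamW runs anywhere
opt = torch.optim.AdamW(params, lr=6e-4, weight_decay=0.1)

accum_steps = 4  # the card uses 16 to reach an effective batch of 256
opt.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x = torch.randn(2, 8, 32)
    target = torch.randint(0, 10, (2, 8))
    # Item 1: discard activations in forward, recompute them in backward
    h = checkpoint(block, x, use_reentrant=False)
    loss = F.cross_entropy(head(h).view(-1, 10), target.view(-1))
    # Item 5: scale the loss so accumulated gradients average correctly
    (loss / accum_steps).backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # matches Max Gradient Norm
opt.step()
```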

Training Hyperparameters

  • Batch Size: 16 (Physical) / 256 (Effective)
  • Learning Rate: 6e-4 (Cosine Decay with Warmup)
  • Weight Decay: 0.1
  • Max Gradient Norm: 1.0
  • Optimization Steps: 1600 (Stopped via Early Stopping)
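The learning-rate schedule can be sketched as follows (a hedged illustration: the card specifies only the 6e-4 peak, cosine decay, and warmup, so the minimum LR and warmup length below are assumptions):

```python
import math

max_lr = 6e-4       # peak LR from the card
min_lr = 6e-5       # assumption: decay to 10% of peak, a common choice
warmup_steps = 100  # assumption: warmup length not stated in the card
max_steps = 1600    # optimization steps actually run

def lr_at(step):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    t = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The schedule ramps linearly to the peak, then follows a half-cosine down to the floor by the final step.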

Evaluation

Testing Data, Factors & Metrics

  • Metric: Cross-Entropy Loss
  • Final Validation Loss: 4.2818
  • Final Training Loss: 4.1493
  • Implied Perplexity: ~72.37
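The perplexity figure follows directly from the validation cross-entropy, since perplexity is exp(loss) for loss measured in nats per token:

```python
import math

val_loss = 4.2818              # final validation cross-entropy (nats/token)
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")     # ~72.37, matching the figure above
```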

Training Logs

The model reached its peak performance at Step 1600, after which it began to saturate (Validation loss rose to 4.31 at Step 1700).

How to Get Started with the Model

Use the code below to get started with the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "bitlabsdb/gpt2-124m-transformer_model"

# 1. Load Model & Tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# 2. Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 3. Generate
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Use sampling for better results
outputs = model.generate(
    **inputs, 
    max_new_tokens=50, 
    do_sample=True, 
    temperature=0.8, 
    top_k=40
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))