Pico-GPT

A small GPT-style decoder-only language model (~49.2M parameters) trained from scratch on OpenWebText.

Model Details

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer with Pre-LayerNorm |
| Parameters | ~49,218,816 |
| Layers | 6 |
| Hidden Size | 384 |
| FFN Hidden Size | 1536 |
| Attention Heads | 6 |
| Head Dimension | 64 |
| Context Length | 128 tokens |
| Vocabulary | 50257 (GPT-2) |
| Flash Attention | ✅ Enabled |
| Dropout | 0.1 |
| Bias | Disabled |
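The table above maps directly onto a small configuration object. A sketch of what such a config might look like (the field names here are assumptions for illustration; see pico_gpt/config.py for the real ones):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names are illustrative, not necessarily pico_gpt's.
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384        # hidden size; head dim = 384 / 6 = 64
    ffn_hidden: int = 1536   # 4 * n_embd
    block_size: int = 128    # context length in tokens
    vocab_size: int = 50257  # GPT-2 BPE vocabulary
    dropout: float = 0.1
    bias: bool = False       # no bias terms in linear layers
```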

Training Objective

The model was trained using causal language modeling (next-token prediction). The loss function is cross-entropy over the vocabulary.

For a given sequence of tokens x_1, x_2, ..., x_n, the model is trained to predict x_{i+1} given x_1, ..., x_i.
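Concretely, the shifted cross-entropy objective can be sketched in a few lines of NumPy (illustrative, not the actual training code):

```python
import numpy as np

def causal_lm_loss(logits, tokens):
    """Mean cross-entropy of predicting x_{i+1} from position i.

    logits: (n, vocab) array of model outputs at positions 1..n
    tokens: (n,) array of input token ids
    """
    # Position i predicts tokens[i + 1], so drop the last logit row
    # and the first token.
    preds, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax over the vocabulary.
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform logits over a vocabulary of size V, this reduces to log V, the loss of a model with no knowledge of the data.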

Dataset

Source

  • Dataset: OpenWebText
  • Hugging Face: Skylion007/openwebtext
  • Mode: Streaming preprocessing
  • License: Same as OpenAI's GPT-2 dataset

Preprocessing Pipeline

  • Tokenizer: GPT-2 (tiktoken)
  • Tokenization: Streaming, incremental
  • EOS Token: Appended after each document
  • Text Cleaning: Minimal (strip whitespace, skip empty strings)
  • Sharding: Binary shards (uint16), 5M tokens per shard
  • Train/Val Split: Deterministic split by token count
  • Memory Mapping: Enabled for efficient loading
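The pipeline steps above can be sketched as a minimal shard writer (function and constant names here are illustrative; the real pipeline streams documents from Hugging Face and tokenizes with tiktoken):

```python
import os
import numpy as np

SHARD_TOKENS = 5_000_000  # tokens per shard, as in the pipeline above
EOS = 50256               # GPT-2 end-of-text token id

def write_shards(token_stream, out_dir, shard_tokens=SHARD_TOKENS):
    """Accumulate token ids from an iterator of per-document id lists
    and flush fixed-size uint16 shards (train_000.bin, train_001.bin, ...)."""
    buf, shard_idx, paths = [], 0, []

    def flush():
        nonlocal buf, shard_idx
        path = os.path.join(out_dir, f"train_{shard_idx:03d}.bin")
        np.array(buf, dtype=np.uint16).tofile(path)
        paths.append(path)
        shard_idx += 1
        buf = []

    for doc_ids in token_stream:
        buf.extend(doc_ids)
        buf.append(EOS)  # EOS appended after each document
        while len(buf) >= shard_tokens:
            rest = buf[shard_tokens:]
            buf = buf[:shard_tokens]
            flush()  # write one full shard
            buf = rest
    if buf:
        flush()  # final partial shard
    return paths
```

uint16 suffices because all GPT-2 token ids (0–50256) fit in 16 bits, halving disk and memory-map footprint versus int32.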

Dataset Statistics

  • Total Tokens Collected: 1B tokens
  • Training Tokens: 950M tokens
  • Validation Tokens: 50M tokens
  • Training Shards: ~190 files (train_000.bin to train_189.bin)
  • Validation Shard: val.bin
  • Data Type: uint16 (supports memory mapping)
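A sketch of how uint16 shards are typically consumed through memory mapping (the actual loader in pico_gpt may differ): `np.memmap` reads pages on demand, so random batches can be drawn from a multi-gigabyte shard without loading it into RAM.

```python
import numpy as np

def get_batch(shard_path, batch_size=64, context=128, seed=0):
    """Sample (x, y) next-token training pairs from a uint16 shard."""
    data = np.memmap(shard_path, dtype=np.uint16, mode="r")
    rng = np.random.default_rng(seed)
    # Random start offsets that leave room for context + 1 tokens.
    ix = rng.integers(0, len(data) - context - 1, size=batch_size)
    x = np.stack([data[i : i + context] for i in ix]).astype(np.int64)
    # Targets are the same windows shifted one token to the right.
    y = np.stack([data[i + 1 : i + 1 + context] for i in ix]).astype(np.int64)
    return x, y
```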

Training Configuration

Hyperparameters

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3e-4 |
| Weight Decay | 0.1 |
| Betas | (0.9, 0.95) |
| Max Steps | N/A |
| Batch Size | 64 |
| Context Window | 128 |
| Gradient Clipping | 1.0 |
| Checkpoint Interval | N/A |
| Log Interval | N/A |
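As a sketch, the optimizer setup and a single training step with gradient clipping might look like this (the decay/no-decay parameter grouping is an assumption common to GPT-style training, not confirmed from the pico-gpt source):

```python
import torch

def configure_optimizer(model):
    """AdamW with the hyperparameters from the table above.

    Weight decay is conventionally applied only to 2D weight matrices,
    not to 1D tensors such as LayerNorm gains (an assumption here).
    """
    decay = [p for p in model.parameters() if p.dim() >= 2]
    no_decay = [p for p in model.parameters() if p.dim() < 2]
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.1},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=3e-4, betas=(0.9, 0.95))

def train_step(model, optimizer, x, y):
    """One step: forward, backward, clip gradient norm at 1.0, update."""
    logits = model(x)  # (batch, context, vocab)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```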

Training Results

| Metric | Value |
|---|---|
| Final Training Loss | N/A |
| Training Time | N/A |
| Hardware | NVIDIA A100 (20GB) or equivalent |

Model Files

| File | Description |
|---|---|
| model.safetensors | Model weights in safetensors format (secure, fast loading) |
| config.json | Model architecture configuration |
| training_config.json | Training hyperparameters and results |
| training_log.csv | Training metrics over time (step, loss, elapsed_time) |
| samples.txt | Sample generations from the trained model |
| tokenizer_config.json | Tokenizer configuration |
| special_tokens_map.json | Special tokens mapping |

Installation

This model requires the custom pico_gpt package to load and run.

```shell
git clone https://github.com/ChidambaraRaju/pico-gpt.git
cd pico-gpt
pip install -e .
```

Usage

Loading with safetensors:

```python
import torch
from safetensors.torch import load_file
import json

# Load config
with open("config.json", "r") as f:
    config = json.load(f)

# Load weights
state_dict = load_file("model.safetensors")

# Create model (requires custom model class from pico_gpt/model.py)
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

model = GPT(ModelConfig(**config))
model.load_state_dict(state_dict)
model.eval()
```

Text Generation:

```python
import torch
import tiktoken

# Load tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare prompt
prompt = "The future of artificial intelligence is"
context_length = 128  # the model's context window (see config.json)
tokens = enc.encode(prompt)
tokens = tokens[-context_length:]  # Truncate to context length if needed
idx = torch.tensor([tokens], dtype=torch.long)

# Generate
with torch.no_grad():
    generated = model.generate(
        idx,
        max_new_tokens=100,
        temperature=0.8,
        eos_token_id=enc.eot_token,
    )

# Decode result
generated_text = enc.decode(generated[0].tolist())
print(generated_text)
```
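The core of a decoding loop like `generate()` is one temperature-sampling step per new token; a minimal sketch of that step (illustrative, not the pico_gpt implementation):

```python
import torch

@torch.no_grad()
def sample_step(logits, temperature=0.8, generator=None):
    """Sample one token id from the last-position logits.

    Dividing by temperature < 1 sharpens the distribution toward the
    argmax; temperature > 1 flattens it toward uniform.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator)
```

The loop appends the sampled id to the context, re-truncates to the context window, and stops at `max_new_tokens` or when `eos_token_id` is produced.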

Loading Checkpoint:

```python
import torch

# Load checkpoint
checkpoint = torch.load("checkpoint_step_<N>.pt", map_location="cpu")
model_state = checkpoint["model_state_dict"]
config = checkpoint["config"]

# Load training config if needed
training_config = checkpoint.get("training_config", {})

# Use with custom GPT class
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# The checkpoint may store the config as a plain dict; rebuild it if so
if isinstance(config, dict):
    config = ModelConfig(**config)
model = GPT(config)
model.load_state_dict(model_state)
```

Limitations

  • Small Model Size: ~49.2M parameters limits reasoning capability
  • Short Context: 128 token context window limits long-range dependencies
  • Single Dataset: Trained only on web text (OpenWebText subset)
  • No Instruction Tuning: Not aligned for chat/instruction following
  • Potential Biases: May contain biases present in the training data
  • No Weight Tying: Embedding and output layers have separate parameters
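On the last point: tying the output projection to the token embedding would save roughly vocab × hidden = 50257 × 384 ≈ 19.3M parameters. A minimal illustration of the technique (not pico_gpt code):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Weight tying: the LM head reuses the token-embedding matrix,
    so both layers share one (vocab, d_model) parameter tensor."""
    def __init__(self, vocab=50257, d_model=384):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # tie the weights
```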

Future Work

  • Convert to native Hugging Face GPT-2 architecture
  • Increase model size and context length
  • Add instruction tuning / alignment
  • Evaluation on downstream benchmarks (perplexity, etc.)
  • Fine-tune for specific tasks
  • Implement more sampling strategies (top-k, top-p)
  • Add support for streaming inference
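For the top-k / top-p item, a common approach is to filter the logits before sampling; a hedged sketch (standard technique, not pico_gpt code):

```python
import torch

def top_k_top_p_filter(logits, top_k=50, top_p=0.9):
    """Set logits outside the top-k ids, and outside the smallest
    nucleus whose probability mass exceeds top_p, to -inf."""
    logits = logits.clone()
    if top_k is not None:
        # Keep only ids at least as likely as the k-th largest logit.
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits[logits < kth] = float("-inf")
    if top_p is not None:
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        # Drop tokens whose preceding cumulative mass already exceeds
        # top_p; the most likely token is always kept.
        remove = cum - probs > top_p
        sorted_logits[remove] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(
            -1, idx, sorted_logits)
    return logits
```

The filtered logits then feed the same temperature softmax and `torch.multinomial` call used for plain sampling.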

Citation

```bibtex
@misc{pico-gpt,
  title={Pico-GPT: A Small Language Model from Scratch},
  author={Chidambara Raju G},
  year={2026},
  howpublished={\url{https://huggingface.co/justjuu/pico-gpt}},
}
```

Acknowledgments

  • This project uses the GPT-2 tokenizer from OpenAI's tiktoken library
  • Dataset: OpenWebText by Skylion007
  • Architecture inspired by GPT, GPT-2, and nanoGPT

For training details, see training_config.json and training_log.csv. Model files use the safetensors format for safe and efficient loading.
