Pico-GPT
A small GPT-style decoder-only language model (~49.2M parameters) trained from scratch on OpenWebText.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer with Pre-LayerNorm |
| Parameters | ~49,218,816 |
| Layers | 6 |
| Hidden Size | 384 |
| FFN Hidden Size | 1536 |
| Attention Heads | 6 |
| Head Dimension | 64 |
| Context Length | 128 tokens |
| Vocabulary | 50257 (GPT-2) |
| Flash Attention | ✅ Enabled |
| Dropout | 0.1 |
| Bias | Disabled |
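The parameter count can be sanity-checked from the table above. The breakdown below is a back-of-the-envelope sketch, assuming GPT-2-style learned position embeddings and an untied output head (the Limitations section notes weight tying is not used); the exact reported figure may exclude some of these terms depending on the counting convention.

```python
# Rough parameter count for the configuration in the table above.
# Assumes learned position embeddings and an untied lm_head.
vocab, d, ffn, layers, ctx = 50257, 384, 1536, 6, 128

tok_emb = vocab * d            # token embedding
pos_emb = ctx * d              # learned position embedding
per_layer = (
    d                          # ln1 weight (bias disabled)
    + d * 3 * d                # fused Q/K/V projection
    + d * d                    # attention output projection
    + d                        # ln2 weight
    + d * ffn + ffn * d        # two FFN projections
)
final_ln = d
lm_head = d * vocab            # untied output head

total = tok_emb + pos_emb + layers * per_layer + final_ln + lm_head
print(f"{total:,}")  # lands within ~0.1% of the stated ~49.2M
```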
Training Objective
The model was trained using causal language modeling (next-token prediction). The loss function is cross-entropy over the vocabulary.
For a given sequence of tokens x_1, x_2, ..., x_n, the model is trained to predict x_{i+1} given x_1, ..., x_i.
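In code, this objective is a single cross-entropy over the vocabulary, with targets equal to the inputs shifted left by one position. A minimal sketch (random logits stand in for the model's output; shapes and vocab size follow the model card):

```python
import torch
import torch.nn.functional as F

# Causal LM objective: predict x_{i+1} from x_1..x_i.
B, T, V = 2, 8, 50257                    # batch, sequence length, vocab
logits = torch.randn(B, T, V)            # stand-in for model output
tokens = torch.randint(0, V, (B, T + 1)) # stand-in token sequence

inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted by one
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```

With untrained (random) logits the loss sits near ln(50257) ≈ 10.8, which is also the expected starting loss during training.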
Dataset
Source
- Dataset: OpenWebText
- Hugging Face: Skylion007/openwebtext
- Mode: Streaming preprocessing
- License: Same as OpenAI's GPT-2 dataset
Preprocessing Pipeline
- Tokenizer: GPT-2 (tiktoken)
- Tokenization: Streaming, incremental
- EOS Token: Appended after each document
- Text Cleaning: Minimal (strip whitespace, skip empty strings)
- Sharding: Binary shards (uint16), 5M tokens per shard
- Train/Val Split: Deterministic split by token count
- Memory Mapping: Enabled for efficient loading
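The shard format described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing script: the stand-in token lists replace real tiktoken output, but the uint16 layout, per-document EOS, and memory-mapped readback match the pipeline described.

```python
import numpy as np

# uint16 is valid storage because the GPT-2 vocab (50257) fits in 16 bits.
SHARD_TOKENS = 5_000_000  # 5M tokens per shard, per the model card
EOS = 50256               # GPT-2 end-of-text token id

# Stand-in for a stream of tokenized documents (real pipeline: tiktoken).
docs = [[15496, 995], [1212, 318, 257, 1332]]

buf = np.empty(SHARD_TOKENS, dtype=np.uint16)
n = 0
for doc in docs:
    toks = doc + [EOS]             # EOS appended after each document
    buf[n:n + len(toks)] = toks
    n += len(toks)

buf[:n].tofile("demo_shard.bin")   # write one binary shard

# Reading back via memory map (no full load into RAM):
shard = np.memmap("demo_shard.bin", dtype=np.uint16, mode="r")
```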
Dataset Statistics
- Total Tokens Collected: 1B tokens
- Training Tokens: 950M tokens
- Validation Tokens: 50M tokens
- Training Shards: ~190 files (train_000.bin to train_189.bin)
- Validation Shard: val.bin
- Data Type: uint16 (supports memory mapping)
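Given these memory-mapped shards, a training batch is typically drawn by sampling random offsets and taking overlapping (x, y) windows, where y is x shifted by one token. A sketch under those assumptions (the synthetic shard file stands in for a real one; batch size 64 and context 128 follow the training configuration):

```python
import numpy as np
import torch

B, T = 64, 128  # batch size and context window from the training config

# Synthesize a small stand-in shard; real shards hold 5M GPT-2 token ids.
np.arange(20_000, dtype=np.uint16).tofile("demo_shard.bin")
data = np.memmap("demo_shard.bin", dtype=np.uint16, mode="r")

# Random starting offsets; each yields one (input, target) pair.
ix = np.random.randint(0, len(data) - T - 1, size=B)
x = torch.stack([torch.from_numpy(data[i:i + T].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + T].astype(np.int64)) for i in ix])
```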
Training Configuration
Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3e-4 |
| Weight Decay | 0.1 |
| Betas | (0.9, 0.95) |
| Max Steps | N/A |
| Batch Size | 64 |
| Context Window | 128 |
| Gradient Clipping | 1.0 |
| Checkpoint Interval | N/A |
| Log Interval | N/A |
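The hyperparameters above translate into a standard PyTorch optimization step; a minimal sketch, with a linear layer standing in for the actual model:

```python
import torch

# Stand-in model; the real training loop uses the pico_gpt GPT class.
model = torch.nn.Linear(384, 384, bias=False)
opt = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)

x = torch.randn(64, 384)          # batch size 64, per the config
loss = model(x).pow(2).mean()     # stand-in loss

opt.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
opt.step()
```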
Training Results
| Metric | Value |
|---|---|
| Final Training Loss | N/A |
| Training Time | N/A |
| Hardware | NVIDIA A100 (20GB) or equivalent |
Model Files
| File | Description |
|---|---|
| model.safetensors | Model weights in safetensors format (secure, fast loading) |
| config.json | Model architecture configuration |
| training_config.json | Training hyperparameters and results |
| training_log.csv | Training metrics over time (step, loss, elapsed_time) |
| samples.txt | Sample generations from the trained model |
| tokenizer_config.json | Tokenizer configuration |
| special_tokens_map.json | Special tokens mapping |
Installation
This model requires the custom pico_gpt package to load and run.
```bash
git clone https://github.com/ChidambaraRaju/pico-gpt.git
cd pico-gpt
pip install -e .
```
Usage
Loading with safetensors:
```python
import json
import torch
from safetensors.torch import load_file

from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# Load config
with open("config.json", "r") as f:
    config = json.load(f)

# Load weights
state_dict = load_file("model.safetensors")

# Create model (requires the custom model class from pico_gpt/model.py)
model = GPT(ModelConfig(**config))
model.load_state_dict(state_dict)
model.eval()
```
Text Generation:
```python
import torch
import tiktoken

# Load tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare prompt
context_length = 128  # model's context window (see Model Details)
prompt = "The future of artificial intelligence is"
tokens = enc.encode(prompt)
tokens = tokens[-context_length:]  # truncate to context length if needed
idx = torch.tensor([tokens], dtype=torch.long)

# Generate
with torch.no_grad():
    generated = model.generate(
        idx,
        max_new_tokens=100,
        temperature=0.8,
        eos_token_id=enc.eot_token,
    )

# Decode result
generated_text = enc.decode(generated[0].tolist())
print(generated_text)
```
Loading Checkpoint:
```python
import torch

from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# Load checkpoint
checkpoint = torch.load("checkpoint_step_<N>.pt", map_location="cpu")
model_state = checkpoint["model_state_dict"]
config = checkpoint["config"]

# Load training config if needed
training_config = checkpoint.get("training_config", {})

# Rebuild the model; if the checkpoint stores the config as a plain
# dict, convert it back to a ModelConfig first
if isinstance(config, dict):
    config = ModelConfig(**config)
model = GPT(config)
model.load_state_dict(model_state)
```
Limitations
- Small Model Size: ~49.2M parameters limits reasoning capability
- Short Context: 128 token context window limits long-range dependencies
- Single Dataset: Trained only on web text (OpenWebText subset)
- No Instruction Tuning: Not aligned for chat/instruction following
- Potential Biases: May contain biases present in the training data
- No Weight Tying: Embedding and output layers have separate parameters
Future Work
- Convert to native Hugging Face GPT-2 architecture
- Increase model size and context length
- Add instruction tuning / alignment
- Evaluation on downstream benchmarks (perplexity, etc.)
- Fine-tune for specific tasks
- Implement more sampling strategies (top-k, top-p)
- Add support for streaming inference
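Top-k sampling, listed above as future work, is small enough to sketch: keep the k largest logits, mask the rest to negative infinity, and renormalize. This is a generic illustration, not code from the pico_gpt package:

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all but the k largest logits so softmax ignores them."""
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[..., -1, None]] = float("-inf")  # threshold at k-th value
    return out

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
probs = torch.softmax(top_k_filter(logits, 2), dim=-1)  # mass on top 2 only
```

Top-p (nucleus) sampling works the same way but thresholds on cumulative probability instead of a fixed count.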
Citation
```bibtex
@misc{pico-gpt,
  title={Pico-GPT: A Small Language Model from Scratch},
  author={Chidambara Raju G},
  year={2026},
  howpublished={\url{https://huggingface.co/justjuu/pico-gpt}},
}
```
Acknowledgments
- This project uses the GPT-2 tokenizer from OpenAI's tiktoken library
- Dataset: OpenWebText by Skylion007
- Architecture inspired by GPT, GPT-2, and nanoGPT
For training details, see training_config.json and training_log.csv.
Model files use the safetensors format for safe and efficient loading.