Pico-GPT
A small GPT-style decoder-only language model (~49.2M parameters) trained from scratch on OpenWebText.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer with Pre-LayerNorm |
| Parameters | ~49,218,816 |
| Layers | 6 |
| Hidden Size | 384 |
| FFN Hidden Size | 1536 |
| Attention Heads | 6 |
| Head Dimension | 64 |
| Context Length | 128 tokens |
| Vocabulary | 50257 (GPT-2) |
| Flash Attention | ✅ Enabled |
| Dropout | 0.1 |
| Bias | Disabled |
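The parameter count can be sanity-checked from the table above. The breakdown below is a back-of-the-envelope sketch, assuming GPT-2-style learned position embeddings and an untied output head (the Limitations section notes weight tying is not used); the exact reported figure may exclude some of these terms depending on the counting convention.

```python
# Rough parameter count for the configuration in the table above.
# Assumes learned position embeddings and an untied lm_head.
vocab, d, ffn, layers, ctx = 50257, 384, 1536, 6, 128

tok_emb = vocab * d            # token embedding
pos_emb = ctx * d              # learned position embedding
per_layer = (
    d                          # ln1 weight (bias disabled)
    + d * 3 * d                # fused Q/K/V projection
    + d * d                    # attention output projection
    + d                        # ln2 weight
    + d * ffn + ffn * d        # two FFN projections
)
final_ln = d
lm_head = d * vocab            # untied output head

total = tok_emb + pos_emb + layers * per_layer + final_ln + lm_head
print(f"{total:,}")  # lands within ~0.1% of the stated ~49.2M
```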
Training Objective
The model was trained using causal language modeling (next-token prediction). The loss function is cross-entropy over the vocabulary.
For a given sequence of tokens x_1, x_2, ..., x_n, the model is trained to predict x_{i+1} given x_1, ..., x_i.
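In code, this objective is a single cross-entropy over the vocabulary, with targets equal to the inputs shifted left by one position. A minimal sketch (random logits stand in for the model's output; shapes and vocab size follow the model card):

```python
import torch
import torch.nn.functional as F

# Causal LM objective: predict x_{i+1} from x_1..x_i.
B, T, V = 2, 8, 50257                    # batch, sequence length, vocab
logits = torch.randn(B, T, V)            # stand-in for model output
tokens = torch.randint(0, V, (B, T + 1)) # stand-in token sequence

inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted by one
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```

With untrained (random) logits the loss sits near ln(50257) ≈ 10.8, which is also the expected starting loss during training.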
Dataset
Source
- Dataset: OpenWebText
- Hugging Face: Skylion007/openwebtext
- Mode: Streaming preprocessing
- License: Same as OpenAI's GPT-2 dataset
Preprocessing Pipeline
- Tokenizer: GPT-2 (tiktoken)
- Tokenization: Streaming, incremental
- EOS Token: Appended after each document
- Text Cleaning: Minimal (strip whitespace, skip empty strings)
- Sharding: Binary shards (uint16), 5M tokens per shard
- Train/Val Split: Deterministic split by token count
- Memory Mapping: Enabled for efficient loading
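The shard format described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing script: the stand-in token lists replace real tiktoken output, but the uint16 layout, per-document EOS, and memory-mapped readback match the pipeline described.

```python
import numpy as np

# uint16 is valid storage because the GPT-2 vocab (50257) fits in 16 bits.
SHARD_TOKENS = 5_000_000  # 5M tokens per shard, per the model card
EOS = 50256               # GPT-2 end-of-text token id

# Stand-in for a stream of tokenized documents (real pipeline: tiktoken).
docs = [[15496, 995], [1212, 318, 257, 1332]]

buf = np.empty(SHARD_TOKENS, dtype=np.uint16)
n = 0
for doc in docs:
    toks = doc + [EOS]             # EOS appended after each document
    buf[n:n + len(toks)] = toks
    n += len(toks)

buf[:n].tofile("demo_shard.bin")   # write one binary shard

# Reading back via memory map (no full load into RAM):
shard = np.memmap("demo_shard.bin", dtype=np.uint16, mode="r")
```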
Dataset Statistics
- Total Tokens Collected: 1B tokens
- Training Tokens: 950M tokens
- Validation Tokens: 50M tokens
- Training Shards: ~190 files (train_000.bin to train_189.bin)
- Validation Shard: val.bin
- Data Type: uint16 (supports memory mapping)
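Given these memory-mapped shards, a training batch is typically drawn by sampling random offsets and taking overlapping (x, y) windows, where y is x shifted by one token. A sketch under those assumptions (the synthetic shard file stands in for a real one; batch size 64 and context 128 follow the training configuration):

```python
import numpy as np
import torch

B, T = 64, 128  # batch size and context window from the training config

# Synthesize a small stand-in shard; real shards hold 5M GPT-2 token ids.
np.arange(20_000, dtype=np.uint16).tofile("demo_shard.bin")
data = np.memmap("demo_shard.bin", dtype=np.uint16, mode="r")

# Random starting offsets; each yields one (input, target) pair.
ix = np.random.randint(0, len(data) - T - 1, size=B)
x = torch.stack([torch.from_numpy(data[i:i + T].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + T].astype(np.int64)) for i in ix])
```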
Training Configuration
Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 3e-4 |
| Weight Decay | 0.1 |
| Betas | (0.9, 0.95) |
| Max Steps | N/A |
| Batch Size | 64 |
| Context Window | 128 |
| Gradient Clipping | 1.0 |
| Checkpoint Interval | N/A |
| Log Interval | N/A |
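The hyperparameters above translate into a standard PyTorch optimization step; a minimal sketch, with a linear layer standing in for the actual model:

```python
import torch

# Stand-in model; the real training loop uses the pico_gpt GPT class.
model = torch.nn.Linear(384, 384, bias=False)
opt = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)

x = torch.randn(64, 384)          # batch size 64, per the config
loss = model(x).pow(2).mean()     # stand-in loss

opt.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
opt.step()
```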
Training Results
| Metric | Value |
|---|---|
| Final Training Loss | N/A |
| Training Time | N/A |
| Hardware | NVIDIA A100 (20GB) or equivalent |
Model Files
| File | Description |
|---|---|
| model.safetensors | Model weights in safetensors format (secure, fast loading) |
| config.json | Model architecture configuration |
| training_config.json | Training hyperparameters and results |
| training_log.csv | Training metrics over time (step, loss, elapsed_time) |
| samples.txt | Sample generations from the trained model |
| tokenizer_config.json | Tokenizer configuration |
| special_tokens_map.json | Special tokens mapping |
Installation
This model requires the custom pico_gpt package to load and run.
```bash
git clone https://github.com/ChidambaraRaju/pico-gpt.git
cd pico-gpt
pip install -e .
```
Usage
Loading with safetensors:
```python
import json
import torch
from safetensors.torch import load_file

from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# Load config
with open("config.json", "r") as f:
    config = json.load(f)

# Load weights
state_dict = load_file("model.safetensors")

# Create model (requires the custom model class from pico_gpt/model.py)
model = GPT(ModelConfig(**config))
model.load_state_dict(state_dict)
model.eval()
```
Text Generation:
```python
import torch
import tiktoken

# Load tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare prompt
context_length = 128  # model's context window (see Model Details)
prompt = "The future of artificial intelligence is"
tokens = enc.encode(prompt)
tokens = tokens[-context_length:]  # truncate to context length if needed
idx = torch.tensor([tokens], dtype=torch.long)

# Generate
with torch.no_grad():
    generated = model.generate(
        idx,
        max_new_tokens=100,
        temperature=0.8,
        eos_token_id=enc.eot_token,
    )

# Decode result
generated_text = enc.decode(generated[0].tolist())
print(generated_text)
```
Loading Checkpoint:
```python
import torch

from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# Load checkpoint
checkpoint = torch.load("checkpoint_step_<N>.pt", map_location="cpu")
model_state = checkpoint["model_state_dict"]
config = checkpoint["config"]

# Load training config if needed
training_config = checkpoint.get("training_config", {})

# Rebuild the model; if the checkpoint stores the config as a plain
# dict, convert it back to a ModelConfig first
if isinstance(config, dict):
    config = ModelConfig(**config)
model = GPT(config)
model.load_state_dict(model_state)
```
Limitations
- Small Model Size: ~49.2M parameters limits reasoning capability
- Short Context: 128 token context window limits long-range dependencies
- Single Dataset: Trained only on web text (OpenWebText subset)
- No Instruction Tuning: Not aligned for chat/instruction following
- Potential Biases: May contain biases present in the training data
- No Weight Tying: Embedding and output layers have separate parameters
Future Work
- Convert to native Hugging Face GPT-2 architecture
- Increase model size and context length
- Add instruction tuning / alignment
- Evaluation on downstream benchmarks (perplexity, etc.)
- Fine-tune for specific tasks
- Implement more sampling strategies (top-k, top-p)
- Add support for streaming inference
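Top-k sampling, listed above as future work, is small enough to sketch: keep the k largest logits, mask the rest to negative infinity, and renormalize. This is a generic illustration, not code from the pico_gpt package:

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all but the k largest logits so softmax ignores them."""
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[..., -1, None]] = float("-inf")  # threshold at k-th value
    return out

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
probs = torch.softmax(top_k_filter(logits, 2), dim=-1)  # mass on top 2 only
```

Top-p (nucleus) sampling works the same way but thresholds on cumulative probability instead of a fixed count.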
Citation
```bibtex
@misc{pico-gpt,
  title={Pico-GPT: A Small Language Model from Scratch},
  author={Chidambara Raju G},
  year={2026},
  howpublished={\url{https://huggingface.co/justjuu/pico-gpt}},
}
```
Acknowledgments
- This project uses the GPT-2 tokenizer from OpenAI's tiktoken library
- Dataset: OpenWebText by Skylion007
- Architecture inspired by GPT, GPT-2, and nanoGPT
For training details, see training_config.json and training_log.csv.
Model files use the safetensors format for safe and efficient loading.