NanoGPT-X - GPT-Style Transformer Model
Model Card
Model Description
This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The model architecture is inspired by modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memmapped dataset for handling large-scale data.
- Developed by: Antonín Tomeček
- Model type: Causal language model (autoregressive Transformer)
- Language(s): English (trained on clean, educational English content from FineWeb-Edu)
- License: Apache 2.0
- Model size: Approximately 130 million parameters
- Vocabulary size: 32,000 (using SentencePiece tokenizer)
- Maximum sequence length: 1,024 tokens
- Training tokens: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- Pretraining objective: Next-token prediction (causal language modeling)
- Framework: PyTorch with Accelerate for distributed training
- Date: Pretrained as of January 3, 2026
The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. It was trained with a focus on efficiency, including gradient checkpointing, mixed-precision training (BF16/FP16), and gradient accumulation.
Architecture Details
- Embedding dimension: 768
- Number of layers: 12
- Number of attention heads: 12 (query heads)
- Number of KV heads: 4 (GQA for efficiency)
- FFN hidden dimension multiplier: 3.0 (resulting in ~2,304 hidden units per layer, aligned to multiple_of=256)
- Normalization: RMSNorm (eps=1e-5)
- Attention mechanism: Flash Attention 2 (fallback to PyTorch SDPA)
- Positional encoding: RoPE (precomputed for up to 2,048 positions)
- Tokenizer: SentencePiece (BPE-based, model file: tokenizer.model)
The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
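The architecture table above maps naturally onto a small config object. The sketch below is illustrative only; the field names are assumptions and may differ from the repo's actual ModelArgs. It also shows how the FFN hidden size of 2,304 follows from the 3.0 multiplier and multiple_of=256:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Hypothetical mirror of the architecture table above; the repo's
    # actual ModelArgs may use different field names.
    dim: int = 768            # embedding dimension
    n_layers: int = 12
    n_heads: int = 12         # query heads
    n_kv_heads: int = 4       # KV heads (GQA)
    vocab_size: int = 32_000
    max_seq_len: int = 1024
    ffn_mult: float = 3.0
    multiple_of: int = 256
    norm_eps: float = 1e-5    # RMSNorm epsilon

    def ffn_hidden_dim(self) -> int:
        # Scale dim by the FFN multiplier, then round up to the
        # nearest multiple of `multiple_of`.
        hidden = int(self.dim * self.ffn_mult)
        return self.multiple_of * ((hidden + self.multiple_of - 1) // self.multiple_of)

cfg = ModelConfig()
print(cfg.ffn_hidden_dim())  # 768 * 3.0 = 2304, already a multiple of 256
```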
Intended Uses & Limitations
Intended Uses
- Text generation: Generate coherent continuations from prompts (e.g., stories, explanations).
- Fine-tuning: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- Research: Study Transformer efficiency, scaling laws, or dataset quality impacts.
Example usage for inference (after loading the model):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assuming the model is uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Limitations
- Dataset bias: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- Hallucinations: As a pretrained model, it may generate factually incorrect information.
- Context length: Limited to 1,024 tokens; longer contexts require modifications.
- No fine-tuning: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- Compute requirements: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
- Language: Primarily English; multilingual capabilities are untested.
- Safety: Not aligned or safety-tuned; may produce harmful or inappropriate content.
Training Data
The model was pretrained on ~2B tokens from FineWeb-Edu, a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memmapped binary files (dataset.bin for training, valid.bin for validation).
- Preprocessing: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- Validation split: A small holdout from the dataset for perplexity evaluation.
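A minimal sketch of how a flat int32 token file like dataset.bin can be written and read back through a numpy memmap, slicing input/target windows without loading the whole file into RAM. The exact on-disk layout used by the repo is an assumption here (a flat stream of int32 token ids, as described above):

```python
import numpy as np

# Hypothetical demo file standing in for dataset.bin.
tokens = np.array([5, 17, 999, 3, 42, 7, 128, 64], dtype=np.int32)
tokens.tofile("dataset_demo.bin")  # flat int32 stream

# Memory-map the file read-only; only touched pages are loaded.
data = np.memmap("dataset_demo.bin", dtype=np.int32, mode="r")

seq_len = 4
# A training window and its next-token targets (shifted by one).
x = data[0:seq_len]
y = data[1:seq_len + 1]
print(list(x), list(y))
```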
Training Procedure
The model was trained using the provided script, which handles:
- Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.01)
- Learning rate: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- Batch size: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- Epochs: 1 (full pass over ~2B tokens)
- Mixed precision: BF16 (or FP16 fallback)
- Gradient checkpointing: Enabled for memory efficiency
- Checkpoints: Saved every 100,000 steps, including model, optimizer, and scheduler states
- Hardware: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- Logging: Clean English logs with tqdm progress bars
- Resuming: Supports loading from checkpoints (e.g., checkpoints/step_XXXXXX.pt)
Total training steps: approximately 2B tokens / (1,024 seq_len × 8 effective batch size) ≈ 244,000 steps.
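The learning-rate schedule described above (linear warmup to the peak LR, then cosine annealing over the remaining steps) can be sketched in pure Python. The warmup shape and a floor LR of 0 are assumptions, not confirmed details of the training script:

```python
import math

def lr_at_step(step, peak_lr=1e-5, warmup=500, total_steps=244_000, min_lr=0.0):
    """Warmup-then-cosine schedule matching the card's description.

    total_steps ≈ 2e9 tokens / (1024 seq_len * 8 effective batch).
    """
    if step < warmup:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup
    # Cosine annealing from peak_lr down to min_lr.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(499))      # end of warmup: peak LR
print(lr_at_step(244_000))  # end of training: annealed to ~0
```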
During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.
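Sample generation during training and inference both rely on nucleus (top-p) sampling, which the card configures with top_p=0.95. A self-contained sketch of the technique (not the repo's implementation):

```python
import math
import random

def top_p_sample(logits, p=0.95, rng=random):
    # Softmax over logits (with max subtracted for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the kept set and draw one token id.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very small p the nucleus collapses to the single most likely token, which makes the behavior easy to check.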
Hyperparameters
- See ModelArgs in the code for the full configuration.
- Customizable: sequence length, batch size, accumulation steps, learning rate, etc.
Evaluation
- Perplexity: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data for this scale).
- Qualitative: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
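Since validation loss is the mean next-token cross-entropy in nats, perplexity follows directly as its exponential; a loss around 2.3 nats corresponds to the rough perplexity target of 10 mentioned above:

```python
import math

def perplexity(mean_loss: float) -> float:
    # Perplexity = exp(mean cross-entropy loss in nats).
    return math.exp(mean_loss)

print(round(perplexity(2.3), 2))  # ≈ 9.97, just under the target of 10
```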
How to Get Started
- Clone the repository or download from Hugging Face.
- Install dependencies: pip install torch accelerate tqdm sentencepiece flash-attn (Flash Attention is optional).
- Prepare data: Tokenize FineWeb-Edu into .bin files (not included; generate your own).
- Run training: python train.py
- For inference, convert to HF format if needed (use transformers for easy loading).
Citation
If you use this model, please cite:
@misc{tomecek2026nanogpt-x,
author = {Antonín Tomeček},
title = {GPT-Style Transformer Pretrained on FineWeb-Edu},
year = {2026},
url = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
Acknowledgments
- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.
For questions, contact Antonín Tomeček (Prague, CZ).