# NanoGPT-X - GPT-Style Transformer Model

## Model Card

### Model Description

This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The model architecture is inspired by modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memmapped dataset for handling large-scale data.

- **Developed by**: Antonín Tomeček
- **Model type**: Causal language model (autoregressive Transformer)
- **Language(s)**: English (trained on clean, educational English content from FineWeb-Edu)
- **License**: Apache 2.0
- **Model size**: Approximately 130 million parameters
- **Vocabulary size**: 32,000 (SentencePiece tokenizer)
- **Maximum sequence length**: 1,024 tokens
- **Training tokens**: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- **Pretraining objective**: Next-token prediction (causal language modeling)
- **Framework**: PyTorch with Accelerate for distributed training
- **Date**: Pretrained as of January 3, 2026

The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. It was trained with a focus on efficiency, including gradient checkpointing, mixed precision (BF16/FP16), and correct gradient accumulation.
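The memmapped dataset mentioned above can be sampled lazily, without ever loading the full ~2B-token file into RAM. Below is a minimal sketch of such a batch sampler (`get_batch` is a hypothetical helper, not the training script's actual loader), assuming tokens are stored as a flat int32 `.bin` file as described under Training Data:

```python
import numpy as np

def get_batch(bin_path, batch_size, seq_len, rng=None):
    """Sample a random batch of (input, target) windows from a memmapped token file."""
    rng = rng or np.random.default_rng()
    # np.memmap reads pages on demand, so the token file never sits fully in RAM
    data = np.memmap(bin_path, dtype=np.int32, mode="r")
    # random starting offsets for each sequence in the batch
    ix = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
    x = np.stack([data[i : i + seq_len] for i in ix])          # inputs
    y = np.stack([data[i + 1 : i + 1 + seq_len] for i in ix])  # targets, shifted by one
    return x, y
```

In a training loop these arrays would typically be wrapped with `torch.from_numpy(...)` and moved to the GPU; the actual script may differ in details.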
### Architecture Details

- **Embedding dimension**: 768
- **Number of layers**: 12
- **Number of attention heads**: 12 (query heads)
- **Number of KV heads**: 4 (GQA for efficiency)
- **FFN hidden dimension multiplier**: 3.0 (resulting in ~2,304 hidden units per layer, aligned to `multiple_of=256`)
- **Normalization**: RMSNorm (eps=1e-5)
- **Attention mechanism**: Flash Attention 2 (with fallback to PyTorch SDPA)
- **Positional encoding**: RoPE (precomputed for up to 2,048 positions)
- **Tokenizer**: SentencePiece (BPE-based, model file: `tokenizer.model`)

The model has a parameter count of ~130M, making it lightweight yet capable for research and prototyping.

### Intended Uses & Limitations

#### Intended Uses

- **Text generation**: Generate coherent continuations from prompts (e.g., stories, explanations).
- **Fine-tuning**: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- **Research**: Study Transformer efficiency, scaling laws, or dataset quality impacts.

Example usage for inference (after loading the model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # Replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Limitations

- **Dataset bias**: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- **Hallucinations**: As a pretrained model, it may generate factually incorrect information.
- **Context length**: Limited to 1,024 tokens; longer contexts require modifications.
- **No fine-tuning**: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- **Compute requirements**: Training requires GPU(s) with at least 16 GB VRAM for the provided batch size/accumulation settings.
- **Language**: Primarily English; multilingual capabilities are untested.
- **Safety**: Not aligned or safety-tuned; may produce harmful or inappropriate content.

### Training Data

The model was pretrained on ~2B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memmapped binary files (`dataset.bin` for training, `valid.bin` for validation).

- **Preprocessing**: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- **Validation split**: A small holdout from the dataset for perplexity evaluation.

### Training Procedure

The model was trained using the provided script, which handles:

- **Optimizer**: AdamW (betas=0.9/0.95, weight_decay=0.01)
- **Learning rate**: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- **Batch size**: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- **Epochs**: 1 (full pass over ~2B tokens)
- **Mixed precision**: BF16 (or FP16 fallback)
- **Gradient checkpointing**: Enabled for memory efficiency
- **Checkpoints**: Saved every 100,000 steps, including model, optimizer, and scheduler states
- **Hardware**: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- **Logging**: Clean English logs with tqdm progress bars
- **Resuming**: Supports loading from checkpoints (e.g., `checkpoints/step_XXXXXX.pt`)

Total training steps: approximately 2B tokens / (1,024 seq_len × effective batch size of 8) ≈ 244,000 steps.
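The warmup-plus-cosine schedule can be expressed as a pure function of the step index. Here is a minimal sketch (`lr_at` is a hypothetical helper, not necessarily the script's actual scheduler), using the hyperparameters above and a total of roughly 2e9 / (1,024 × 8) ≈ 244,000 steps:

```python
import math

def lr_at(step, peak_lr=1e-5, warmup=500, total_steps=244_000):
    """Linear warmup to peak_lr over `warmup` steps, then cosine annealing to zero."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup  # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Note also that with grad_accum=8, each micro-batch loss is divided by 8 before `backward()` so the accumulated gradient equals the average over the effective batch — the "correct gradient accumulation" the description emphasizes.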
During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.

#### Hyperparameters

- See `ModelArgs` in the code for the full config.
- Customizable: sequence length, batch size, accumulation steps, LR, etc.

### Evaluation

- **Perplexity**: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data at this scale).
- **Qualitative**: Generated samples from prompts like "Once upon a time" improve in coherence over training steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).

### How to Get Started

1. Clone the repository or download from Hugging Face.
2. Install dependencies:
   ```bash
   pip install torch accelerate tqdm sentencepiece flash-attn  # Flash Attention optional
   ```
3. Prepare data: tokenize FineWeb-Edu into `.bin` files (not included; generate your own).
4. Run training:
   ```bash
   python train.py
   ```
5. For inference, convert to HF format if needed (use `transformers` for easy loading).

### Citation

If you use this model, please cite:

```
@misc{tomecek2026nanogpt-x,
  author = {Antonín Tomeček},
  title  = {GPT-Style Transformer Pretrained on FineWeb-Edu},
  year   = {2026},
  url    = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
```

### Acknowledgments

- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and to the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.

For questions, contact Antonín Tomeček (Prague, CZ).