# NanoGPT-X - GPT-Style Transformer Model
## Model Card
### Model Description
This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The architecture follows modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memory-mapped dataset for handling large-scale data.
- **Developed by**: Antonín Tomeček
- **Model type**: Causal language model (autoregressive Transformer)
- **Language(s)**: English (trained on clean, educational English content from FineWeb-Edu)
- **License**: Apache 2.0
- **Model size**: Approximately 130 million parameters
- **Vocabulary size**: 32,000 (SentencePiece tokenizer)
- **Maximum sequence length**: 1,024 tokens
- **Training tokens**: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- **Pretraining objective**: Next-token prediction (causal language modeling)
- **Framework**: PyTorch with Accelerate for distributed training
- **Date**: Pretraining completed January 3, 2026
The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. Training emphasized efficiency: gradient checkpointing, mixed precision (BF16/FP16), and gradient accumulation.
### Architecture Details
- **Embedding dimension**: 768
- **Number of layers**: 12
- **Number of attention heads**: 12 (query heads)
- **Number of KV heads**: 4 (GQA for efficiency)
- **FFN hidden dimension multiplier**: 3.0 (resulting in ~2,304 hidden units per layer, aligned to multiple_of=256)
- **Normalization**: RMSNorm (eps=1e-5)
- **Attention mechanism**: Flash Attention 2 (fallback to PyTorch SDPA)
- **Positional encoding**: RoPE (precomputed for up to 2,048 positions)
- **Tokenizer**: SentencePiece (BPE-based, model file: `tokenizer.model`)
The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
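To make the normalization choice concrete, here is a minimal sketch of an RMSNorm layer as used in Llama-style models (eps=1e-5, as above). This is an illustrative implementation, not the repository's exact code:

```python
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm: scale by the reciprocal root-mean-square; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of mean of squares over the embedding dimension
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 4, 768)
y = RMSNorm(768)(x)
print(y.shape)  # torch.Size([2, 4, 768])
```

Compared with LayerNorm, RMSNorm drops the mean subtraction and bias term, which saves a small amount of compute per layer without hurting quality at this scale.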
### Intended Uses & Limitations
#### Intended Uses
- **Text generation**: Generate coherent continuations from prompts (e.g., stories, explanations).
- **Fine-tuning**: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- **Research**: Study Transformer efficiency, scaling laws, or dataset quality impacts.
Example usage for inference (after loading the model):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # Assuming uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # Replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required; otherwise temperature and top_p are ignored
outputs = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
#### Limitations
- **Dataset bias**: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- **Hallucinations**: As a pretrained model, it may generate factually incorrect information.
- **Context length**: Limited to 1,024 tokens; longer contexts require modifications.
- **No fine-tuning**: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- **Compute requirements**: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
- **Language**: Primarily English; multilingual capabilities are untested.
- **Safety**: Not aligned or safety-tuned; may produce harmful or inappropriate content.
### Training Data
The model was pretrained on ~2B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memory-mapped binary files (`dataset.bin` for training, `valid.bin` for validation).
- **Preprocessing**: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- **Validation split**: A small holdout from the dataset for perplexity evaluation.
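A memory-mapped `.bin` file of int32 token IDs can be sampled without loading the whole dataset into RAM. The sketch below shows the general pattern with `np.memmap` (the function name `get_batch` and its signature are illustrative; the repository's actual loader may differ):

```python
import numpy as np
import torch

def get_batch(path: str, batch_size: int, seq_len: int):
    """Sample random (input, target) windows from a flat int32 token stream."""
    data = np.memmap(path, dtype=np.int32, mode="r")  # lazy: pages in on access
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    # inputs are tokens [i, i+seq_len); targets are shifted one position right
    x = torch.stack([torch.from_numpy(data[i : i + seq_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + seq_len].astype(np.int64)) for i in ix])
    return x, y
```

Because targets are just the inputs shifted by one token, no separate label file is needed for next-token prediction.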
### Training Procedure
The model was trained using the provided script, which handles:
- **Optimizer**: AdamW (betas=0.9/0.95, weight_decay=0.01)
- **Learning rate**: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- **Batch size**: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- **Epochs**: 1 (full pass over ~2B tokens)
- **Mixed precision**: BF16 (or FP16 fallback)
- **Gradient checkpointing**: Enabled for memory efficiency
- **Checkpoints**: Saved every 100,000 steps, including model, optimizer, and scheduler states
- **Hardware**: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- **Logging**: Clean English logs with tqdm progress bars
- **Resuming**: Supports loading from checkpoints (e.g., `checkpoints/step_XXXXXX.pt`)
Total training steps: approximately 2B tokens / (1,024 seq_len × 8 effective batch size) ≈ 244,000 steps.
During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.
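The warmup-plus-cosine schedule described above can be sketched as a plain function of the step index (the exact shape in the training script may differ; `min_lr=0.0` and the ~244k total-step count are assumptions here):

```python
import math

def lr_at_step(step: int, peak_lr: float = 1e-5, warmup: int = 500,
               total_steps: int = 244_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup          # linear ramp over warmup steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 122_000, 244_000):
    print(s, lr_at_step(s))
```

The schedule peaks exactly at the end of warmup and decays smoothly to `min_lr` at the final step, which is the usual shape for single-epoch pretraining runs.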
#### Hyperparameters
- See `ModelArgs` in the code for the full config.
- Customizable: sequence length, batch size, accumulation steps, LR, etc.
### Evaluation
- **Perplexity**: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data at this scale).
- **Qualitative**: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
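For reference, perplexity is simply the exponential of the mean cross-entropy loss (in nats) on held-out data, so a perplexity target below 10 corresponds to a validation loss below ln(10) ≈ 2.303:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Convert mean next-token cross-entropy (nats) to perplexity."""
    return math.exp(mean_nll)

print(perplexity(2.3))  # ≈ 9.97, just under the <10 target
```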
### How to Get Started
1. Clone the repository or download from Hugging Face.
2. Install dependencies:
```bash
pip install torch accelerate tqdm sentencepiece flash-attn  # Flash Attention optional
```
3. Prepare data: Tokenize FineWeb-Edu into `.bin` files (not included; generate your own).
4. Run training:
```bash
python train.py
```
5. For inference, convert to HF format if needed (use `transformers` for easy loading).
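For step 3 above, producing a `.bin` file amounts to streaming int32 token IDs into a flat binary file. The helper below is a hypothetical sketch (not part of the repo); `encode` would typically be a `SentencePieceProcessor`'s encode method loaded from `tokenizer.model`:

```python
import numpy as np

def write_bin(texts, encode, out_path: str) -> None:
    """Append token IDs from each text to a flat int32 file readable via np.memmap.

    encode: any callable mapping a string to a list of int token IDs,
    e.g. sentencepiece.SentencePieceProcessor(model_file="tokenizer.model").encode
    """
    with open(out_path, "ab") as out:
        for text in texts:
            np.asarray(encode(text), dtype=np.int32).tofile(out)
```

Writing in append mode lets you process the dataset shard by shard without holding it in memory; the resulting file is exactly what a memmap-based loader expects.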
### Citation
If you use this model, please cite:
```
@misc{tomecek2026nanogptx,
  author = {Antonín Tomeček},
  title  = {GPT-Style Transformer Pretrained on FineWeb-Edu},
  year   = {2026},
  url    = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
```
### Acknowledgments
- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.
For questions, contact Antonín Tomeček (Prague, CZ).