NanoGPT-X - GPT-Style Transformer Model
Model Card
Model Description
This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The model architecture is inspired by modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memmapped dataset for handling large-scale data.
- Developed by: Antonín Tomeček
- Model type: Causal language model (autoregressive Transformer)
- Language(s): English (trained on clean, educational English content from FineWeb-Edu)
- License: Apache 2.0
- Model size: Approximately 130 million parameters
- Vocabulary size: 32,000 (using SentencePiece tokenizer)
- Maximum sequence length: 1,024 tokens
- Training tokens: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- Pretraining objective: Next-token prediction (causal language modeling)
- Framework: PyTorch with Accelerate for distributed training
- Date: Pretrained as of January 3, 2026
The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. It was trained with a focus on efficiency, including gradient checkpointing, mixed-precision training (BF16/FP16), and gradient accumulation.
Architecture Details
- Embedding dimension: 768
- Number of layers: 12
- Number of attention heads: 12 (query heads)
- Number of KV heads: 4 (GQA for efficiency)
- FFN hidden dimension multiplier: 3.0 (resulting in ~2,304 hidden units per layer, aligned to multiple_of=256)
- Normalization: RMSNorm (eps=1e-5)
- Attention mechanism: Flash Attention 2 (fallback to PyTorch SDPA)
- Positional encoding: RoPE (precomputed for up to 2,048 positions)
- Tokenizer: SentencePiece (BPE-based, model file: tokenizer.model)
The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
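The architecture table above maps naturally onto a small config object. The sketch below is illustrative only; the field names are assumptions and may differ from the repo's actual ModelArgs. It also shows how the FFN hidden size of 2,304 follows from the 3.0 multiplier and multiple_of=256:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Hypothetical mirror of the architecture table above; the repo's
    # actual ModelArgs may use different field names.
    dim: int = 768            # embedding dimension
    n_layers: int = 12
    n_heads: int = 12         # query heads
    n_kv_heads: int = 4       # KV heads (GQA)
    vocab_size: int = 32_000
    max_seq_len: int = 1024
    ffn_mult: float = 3.0
    multiple_of: int = 256
    norm_eps: float = 1e-5    # RMSNorm epsilon

    def ffn_hidden_dim(self) -> int:
        # Scale dim by the FFN multiplier, then round up to the
        # nearest multiple of `multiple_of`.
        hidden = int(self.dim * self.ffn_mult)
        return self.multiple_of * ((hidden + self.multiple_of - 1) // self.multiple_of)

cfg = ModelConfig()
print(cfg.ffn_hidden_dim())  # 768 * 3.0 = 2304, already a multiple of 256
```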
Intended Uses & Limitations
Intended Uses
- Text generation: Generate coherent continuations from prompts (e.g., stories, explanations).
- Fine-tuning: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- Research: Study Transformer efficiency, scaling laws, or dataset quality impacts.
Example usage for inference (after loading the model):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assuming the model is uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=100,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Limitations
- Dataset bias: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- Hallucinations: As a pretrained model, it may generate factually incorrect information.
- Context length: Limited to 1,024 tokens; longer contexts require modifications.
- No fine-tuning: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- Compute requirements: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
- Language: Primarily English; multilingual capabilities are untested.
- Safety: Not aligned or safety-tuned; may produce harmful or inappropriate content.
Training Data
The model was pretrained on ~2B tokens from FineWeb-Edu, a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memmapped binary files (dataset.bin for training, valid.bin for validation).
- Preprocessing: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- Validation split: A small holdout from the dataset for perplexity evaluation.
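A minimal sketch of how a flat int32 token file like dataset.bin can be written and read back through a numpy memmap, slicing input/target windows without loading the whole file into RAM. The exact on-disk layout used by the repo is an assumption here (a flat stream of int32 token ids, as described above):

```python
import numpy as np

# Hypothetical demo file standing in for dataset.bin.
tokens = np.array([5, 17, 999, 3, 42, 7, 128, 64], dtype=np.int32)
tokens.tofile("dataset_demo.bin")  # flat int32 stream

# Memory-map the file read-only; only touched pages are loaded.
data = np.memmap("dataset_demo.bin", dtype=np.int32, mode="r")

seq_len = 4
# A training window and its next-token targets (shifted by one).
x = data[0:seq_len]
y = data[1:seq_len + 1]
print(list(x), list(y))
```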
Training Procedure
The model was trained using the provided script, which handles:
- Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.01)
- Learning rate: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- Batch size: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- Epochs: 1 (full pass over ~2B tokens)
- Mixed precision: BF16 (or FP16 fallback)
- Gradient checkpointing: Enabled for memory efficiency
- Checkpoints: Saved every 100,000 steps, including model, optimizer, and scheduler states
- Hardware: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- Logging: Clean English logs with tqdm progress bars
- Resuming: Supports loading from checkpoints (e.g., checkpoints/step_XXXXXX.pt)
Total training steps: approximately 2B tokens / (1,024 seq_len × 8 effective batch size) ≈ 244,000 steps.
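The learning-rate schedule described above (linear warmup to the peak LR, then cosine annealing over the remaining steps) can be sketched in pure Python. The warmup shape and a floor LR of 0 are assumptions, not confirmed details of the training script:

```python
import math

def lr_at_step(step, peak_lr=1e-5, warmup=500, total_steps=244_000, min_lr=0.0):
    """Warmup-then-cosine schedule matching the card's description.

    total_steps ≈ 2e9 tokens / (1024 seq_len * 8 effective batch).
    """
    if step < warmup:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup
    # Cosine annealing from peak_lr down to min_lr.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(499))      # end of warmup: peak LR
print(lr_at_step(244_000))  # end of training: annealed to ~0
```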
During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.
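Sample generation during training and inference both rely on nucleus (top-p) sampling, which the card configures with top_p=0.95. A self-contained sketch of the technique (not the repo's implementation):

```python
import math
import random

def top_p_sample(logits, p=0.95, rng=random):
    # Softmax over logits (with max subtracted for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize over the kept set and draw one token id.
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a very small p the nucleus collapses to the single most likely token, which makes the behavior easy to check.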
Hyperparameters
- See ModelArgs in the code for the full configuration.
- Customizable: sequence length, batch size, accumulation steps, learning rate, etc.
Evaluation
- Perplexity: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data for this scale).
- Qualitative: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
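Since validation loss is the mean next-token cross-entropy in nats, perplexity follows directly as its exponential; a loss around 2.3 nats corresponds to the rough perplexity target of 10 mentioned above:

```python
import math

def perplexity(mean_loss: float) -> float:
    # Perplexity = exp(mean cross-entropy loss in nats).
    return math.exp(mean_loss)

print(round(perplexity(2.3), 2))  # ≈ 9.97, just under the target of 10
```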
How to Get Started
- Clone the repository or download from Hugging Face.
- Install dependencies: pip install torch accelerate tqdm sentencepiece flash-attn (Flash Attention is optional).
- Prepare data: Tokenize FineWeb-Edu into .bin files (not included; generate your own).
- Run training: python train.py
- For inference, convert to HF format if needed (use transformers for easy loading).
Citation
If you use this model, please cite:
@misc{tomecek2026nanogpt-x,
author = {Antonín Tomeček},
title = {GPT-Style Transformer Pretrained on FineWeb-Edu},
year = {2026},
url = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
Acknowledgments
- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.
For questions, contact Antonín Tomeček (Prague, CZ).