
NanoGPT-X - GPT-Style Transformer Model

Model Card

Model Description

This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The model architecture is inspired by modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memmapped dataset for handling large-scale data.

  • Developed by: Antonín Tomeček
  • Model type: Causal language model (autoregressive Transformer)
  • Language(s): English (trained on clean, educational English content from FineWeb-Edu)
  • License: Apache 2.0
  • Model size: Approximately 130 million parameters
  • Vocabulary size: 32,000 (using SentencePiece tokenizer)
  • Maximum sequence length: 1,024 tokens
  • Training tokens: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
  • Pretraining objective: Next-token prediction (causal language modeling)
  • Framework: PyTorch with Accelerate for distributed training
  • Date: Pretrained as of January 3, 2026

The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. It was trained with a focus on efficiency, including gradient checkpointing, mixed-precision (BF16/FP16), and correct gradient accumulation.

Architecture Details

  • Embedding dimension: 768
  • Number of layers: 12
  • Number of attention heads: 12 (query heads)
  • Number of KV heads: 4 (GQA for efficiency)
  • FFN hidden dimension multiplier: 3.0 (768 × 3 = 2,304 hidden units per layer, already a multiple of multiple_of=256)
  • Normalization: RMSNorm (eps=1e-5)
  • Attention mechanism: Flash Attention 2 (fallback to PyTorch SDPA)
  • Positional encoding: RoPE (precomputed for up to 2,048 positions)
  • Tokenizer: SentencePiece (BPE-based, model file: tokenizer.model)

The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
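The ~130M figure can be sanity-checked from the configuration above. The sketch below is a back-of-the-envelope estimate, not the repository's actual counting code; it assumes untied input/output embeddings and ignores the (negligible) RMSNorm parameters.

```python
# Rough parameter-count estimate from the architecture table above.
# Assumptions: untied embedding and LM head; RMSNorm params ignored.

def estimate_params(dim=768, n_layers=12, n_heads=12, n_kv_heads=4,
                    vocab_size=32_000, ffn_mult=3.0, multiple_of=256):
    head_dim = dim // n_heads                    # 64
    kv_dim = n_kv_heads * head_dim               # 256 with GQA
    ffn_hidden = int(dim * ffn_mult)
    # round hidden size up to a multiple of `multiple_of`
    ffn_hidden = multiple_of * ((ffn_hidden + multiple_of - 1) // multiple_of)

    attn = dim * dim + 2 * dim * kv_dim + dim * dim   # Wq, Wk, Wv, Wo
    ffn = 3 * dim * ffn_hidden                        # SwiGLU: gate, up, down
    embeddings = vocab_size * dim                     # input embedding
    lm_head = vocab_size * dim                        # untied output projection
    return n_layers * (attn + ffn) + embeddings + lm_head

print(f"{estimate_params() / 1e6:.0f}M parameters")   # ~132M, consistent with the ~130M figure
```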

Intended Uses & Limitations

Intended Uses

  • Text generation: Generate coherent continuations from prompts (e.g., stories, explanations).
  • Fine-tuning: Adapt to specific tasks like chatbots, code generation, or educational content creation.
  • Research: Study Transformer efficiency, scaling laws, or dataset quality impacts.

Example usage for inference (after loading the model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # Assuming uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # Replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time"
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)  # do_sample=True is required, or temperature/top_p are ignored
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Limitations

  • Dataset bias: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
  • Hallucinations: As a pretrained model, it may generate factually incorrect information.
  • Context length: Limited to 1,024 tokens; longer contexts require modifications.
  • No fine-tuning: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
  • Compute requirements: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
  • Language: Primarily English; multilingual capabilities are untested.
  • Safety: Not aligned or safety-tuned; may produce harmful or inappropriate content.

Training Data

The model was pretrained on ~2B tokens from FineWeb-Edu, a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memmapped binary files (dataset.bin for training, valid.bin for validation).

  • Preprocessing: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
  • Validation split: A small holdout from the dataset for perplexity evaluation.
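The memmapped layout above keeps the full token stream on disk and reads only the slices a batch needs. A minimal loading sketch, assuming dataset.bin is a flat stream of int32 token ids (the actual loader in the repository may differ):

```python
import numpy as np
import torch

def get_batch(path, batch_size=1, seq_len=1024):
    # Map the file without loading it into RAM; dtype matches the int32 tokenization.
    data = np.memmap(path, dtype=np.int32, mode="r")
    # Sample random window starts, leaving room for the shifted target.
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + seq_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + seq_len].astype(np.int64)) for i in ix])
    return x, y  # inputs and next-token targets
```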

Training Procedure

The model was trained using the provided script, which handles:

  • Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.01)
  • Learning rate: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
  • Batch size: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
  • Epochs: 1 (full pass over ~2B tokens)
  • Mixed precision: BF16 (or FP16 fallback)
  • Gradient checkpointing: Enabled for memory efficiency
  • Checkpoints: Saved every 100,000 steps, including model, optimizer, and scheduler states
  • Hardware: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
  • Logging: Clean English logs with tqdm progress bars
  • Resuming: Supports loading from checkpoints (e.g., checkpoints/step_XXXXXX.pt)
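The schedule above (peak LR 1e-5, 500 warmup steps, cosine annealing) can be sketched with a LambdaLR multiplier. The total-step count and the 10% decay floor are assumptions for illustration; the repository's scheduler may use different values.

```python
import math
import torch

def lr_lambda(step, warmup=500, total=244_000, min_ratio=0.1):
    # Linear warmup to the peak LR, then cosine decay to min_ratio * peak.
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=1e-5, betas=(0.9, 0.95), weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```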

Total training steps: approximately 2B tokens / (1,024 seq_len × 8 effective batch size) ≈ 244,000 optimizer steps.

During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.
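The gradient-accumulation pattern (batch_size=1, grad_accum=8) can be illustrated on a toy model: each micro-batch loss is scaled by 1/grad_accum so the accumulated gradient matches one large-batch step. This is a minimal sketch; train.py additionally handles mixed precision, checkpointing, and Accelerate.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)                 # stand-in for the Transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum = 8

data = torch.randn(grad_accum, 1, 16)          # 8 micro-batches of batch_size=1
targets = torch.randn(grad_accum, 1, 1)

opt.zero_grad()
for micro_step in range(grad_accum):
    loss = torch.nn.functional.mse_loss(model(data[micro_step]), targets[micro_step])
    (loss / grad_accum).backward()             # scale so the sum equals the mean loss
opt.step()                                     # one optimizer step per effective batch
```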

Hyperparameters

  • See ModelArgs in the code for full config.
  • Customizable: Sequence length, batch size, accumulation steps, LR, etc.

Evaluation

  • Perplexity: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data for this scale).
  • Qualitative: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
  • No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
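Perplexity here is the exponential of the mean next-token cross-entropy. A quick sketch of the metric, using random placeholder logits (in practice they come from the model's forward pass on valid.bin batches):

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return torch.exp(loss).item()

logits = torch.zeros(1, 8, 32_000)             # uniform distribution over the vocab
targets = torch.randint(0, 32_000, (1, 8))
print(perplexity(logits, targets))             # ~32,000: a uniform model's perplexity equals vocab size
```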

How to Get Started

  1. Clone the repository or download from Hugging Face.
  2. Install dependencies:
    pip install torch accelerate tqdm sentencepiece flash-attn  # Flash Attention optional
    
  3. Prepare data: Tokenize FineWeb-Edu into .bin files (not included; generate your own).
  4. Run training:
    python train.py
    
  5. For inference, convert to HF format if needed (use transformers for easy loading).

Citation

If you use this model, please cite:

@misc{tomecek2026nanogpt-x,
  author = {Antonín Tomeček},
  title = {GPT-Style Transformer Pretrained on FineWeb-Edu},
  year = {2026},
  url = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}

Acknowledgments

  • Inspired by NanoGPT and Llama architectures.
  • Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
  • Built with PyTorch, Accelerate, and Flash Attention.

For questions, contact Antonín Tomeček (Prague, CZ).
