# NanoGPT-X - GPT-Style Transformer Model
## Model Card
### Model Description
This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The architecture follows modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memory-mapped dataset for handling large-scale data.
- **Developed by**: Antonín Tomeček
- **Model type**: Causal language model (autoregressive Transformer)
- **Language(s)**: English (trained on clean, educational English content from FineWeb-Edu)
- **License**: Apache 2.0
- **Model size**: Approximately 130 million parameters
- **Vocabulary size**: 32,000 (SentencePiece tokenizer)
- **Maximum sequence length**: 1,024 tokens
- **Training tokens**: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- **Pretraining objective**: Next-token prediction (causal language modeling)
- **Framework**: PyTorch with Accelerate for distributed training
- **Date**: Pretraining completed January 3, 2026
The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. Training emphasized efficiency: gradient checkpointing, mixed precision (BF16/FP16), and gradient accumulation.
### Architecture Details
- **Embedding dimension**: 768
- **Number of layers**: 12
- **Number of attention heads**: 12 (query heads)
- **Number of KV heads**: 4 (GQA for efficiency)
- **FFN hidden dimension multiplier**: 3.0 (resulting in ~2,304 hidden units per layer, aligned to multiple_of=256)
- **Normalization**: RMSNorm (eps=1e-5)
- **Attention mechanism**: Flash Attention 2 (fallback to PyTorch SDPA)
- **Positional encoding**: RoPE (precomputed for up to 2,048 positions)
- **Tokenizer**: SentencePiece (BPE-based, model file: `tokenizer.model`)
The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
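To make the normalization choice concrete, here is a minimal sketch of an RMSNorm layer as used in Llama-style models (eps=1e-5, as above). This is an illustrative implementation, not the repository's exact code:

```python
import torch

class RMSNorm(torch.nn.Module):
    """RMSNorm: scale by the reciprocal root-mean-square; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of mean of squares over the embedding dimension
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 4, 768)
y = RMSNorm(768)(x)
print(y.shape)  # torch.Size([2, 4, 768])
```

Compared with LayerNorm, RMSNorm drops the mean subtraction and bias term, which saves a small amount of compute per layer without hurting quality at this scale.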
### Intended Uses & Limitations
#### Intended Uses
- **Text generation**: Generate coherent continuations from prompts (e.g., stories, explanations).
- **Fine-tuning**: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- **Research**: Study Transformer efficiency, scaling laws, or dataset quality impacts.
Example usage for inference (after loading the model):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # Assuming uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # Replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required; otherwise temperature and top_p are ignored
outputs = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
#### Limitations
- **Dataset bias**: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- **Hallucinations**: As a pretrained model, it may generate factually incorrect information.
- **Context length**: Limited to 1,024 tokens; longer contexts require modifications.
- **No fine-tuning**: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- **Compute requirements**: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
- **Language**: Primarily English; multilingual capabilities are untested.
- **Safety**: Not aligned or safety-tuned; may produce harmful or inappropriate content.
### Training Data
The model was pretrained on ~2B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memory-mapped binary files (`dataset.bin` for training, `valid.bin` for validation).
- **Preprocessing**: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- **Validation split**: A small holdout from the dataset for perplexity evaluation.
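A memory-mapped `.bin` file of int32 token IDs can be sampled without loading the whole dataset into RAM. The sketch below shows the general pattern with `np.memmap` (the function name `get_batch` and its signature are illustrative; the repository's actual loader may differ):

```python
import numpy as np
import torch

def get_batch(path: str, batch_size: int, seq_len: int):
    """Sample random (input, target) windows from a flat int32 token stream."""
    data = np.memmap(path, dtype=np.int32, mode="r")  # lazy: pages in on access
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    # inputs are tokens [i, i+seq_len); targets are shifted one position right
    x = torch.stack([torch.from_numpy(data[i : i + seq_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + seq_len].astype(np.int64)) for i in ix])
    return x, y
```

Because targets are just the inputs shifted by one token, no separate label file is needed for next-token prediction.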
### Training Procedure
The model was trained using the provided script, which handles:
- **Optimizer**: AdamW (betas=0.9/0.95, weight_decay=0.01)
- **Learning rate**: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- **Batch size**: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- **Epochs**: 1 (full pass over ~2B tokens)
- **Mixed precision**: BF16 (or FP16 fallback)
- **Gradient checkpointing**: Enabled for memory efficiency
- **Checkpoints**: Saved every 100,000 steps, including model, optimizer, and scheduler states
- **Hardware**: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- **Logging**: Clean English logs with tqdm progress bars
- **Resuming**: Supports loading from checkpoints (e.g., `checkpoints/step_XXXXXX.pt`)
Total training steps: approximately 2B tokens / (1,024 seq_len × 8 effective batch size) ≈ 244,000 steps.
During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.
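The warmup-plus-cosine schedule described above can be sketched as a plain function of the step index (the exact shape in the training script may differ; `min_lr=0.0` and the ~244k total-step count are assumptions here):

```python
import math

def lr_at_step(step: int, peak_lr: float = 1e-5, warmup: int = 500,
               total_steps: int = 244_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup          # linear ramp over warmup steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 122_000, 244_000):
    print(s, lr_at_step(s))
```

The schedule peaks exactly at the end of warmup and decays smoothly to `min_lr` at the final step, which is the usual shape for single-epoch pretraining runs.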
#### Hyperparameters
- See `ModelArgs` in the code for the full config.
- Customizable: sequence length, batch size, accumulation steps, LR, etc.
### Evaluation
- **Perplexity**: Validation loss reported during training (e.g., aim for <10 on held-out FineWeb-Edu data at this scale).
- **Qualitative**: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
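For reference, perplexity is simply the exponential of the mean cross-entropy loss (in nats) on held-out data, so a perplexity target below 10 corresponds to a validation loss below ln(10) ≈ 2.303:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Convert mean next-token cross-entropy (nats) to perplexity."""
    return math.exp(mean_nll)

print(perplexity(2.3))  # ≈ 9.97, just under the <10 target
```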
### How to Get Started
1. Clone the repository or download from Hugging Face.
2. Install dependencies:
```bash
pip install torch accelerate tqdm sentencepiece flash-attn  # Flash Attention optional
```
3. Prepare data: Tokenize FineWeb-Edu into `.bin` files (not included; generate your own).
4. Run training:
```bash
python train.py
```
5. For inference, convert to HF format if needed (use `transformers` for easy loading).
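For step 3 above, producing a `.bin` file amounts to streaming int32 token IDs into a flat binary file. The helper below is a hypothetical sketch (not part of the repo); `encode` would typically be a `SentencePieceProcessor`'s encode method loaded from `tokenizer.model`:

```python
import numpy as np

def write_bin(texts, encode, out_path: str) -> None:
    """Append token IDs from each text to a flat int32 file readable via np.memmap.

    encode: any callable mapping a string to a list of int token IDs,
    e.g. sentencepiece.SentencePieceProcessor(model_file="tokenizer.model").encode
    """
    with open(out_path, "ab") as out:
        for text in texts:
            np.asarray(encode(text), dtype=np.int32).tofile(out)
```

Writing in append mode lets you process the dataset shard by shard without holding it in memory; the resulting file is exactly what a memmap-based loader expects.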
### Citation
If you use this model, please cite:
```
@misc{tomecek2026nanogptx,
  author = {Antonín Tomeček},
  title  = {GPT-Style Transformer Pretrained on FineWeb-Edu},
  year   = {2026},
  url    = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
```
### Acknowledgments
- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.
For questions, contact Antonín Tomeček (Prague, CZ).