# NanoGPT-X - GPT-Style Transformer Model

## Model Card

### Model Description
This is a GPT-style Transformer language model pretrained from scratch on approximately 2 billion tokens from the FineWeb-Edu dataset. The model architecture is inspired by modern Transformer designs, incorporating Grouped Query Attention (GQA), RMSNorm, Rotary Positional Embeddings (RoPE), and SwiGLU feed-forward layers. It supports efficient training with Flash Attention 2 (if available) and uses a memmapped dataset for handling large-scale data.

- **Developed by**: Antonín Tomeček
- **Model type**: Causal language model (autoregressive Transformer)
- **Language(s)**: English (trained on clean, educational English content from FineWeb-Edu)
- **License**: Apache 2.0
- **Model size**: Approximately 130 million parameters
- **Vocabulary size**: 32,000 (using SentencePiece tokenizer)
- **Maximum sequence length**: 1,024 tokens
- **Training tokens**: ~2B from FineWeb-Edu (a high-quality, deduplicated, educational subset of CommonCrawl data, filtered for English and educational value)
- **Pretraining objective**: Next-token prediction (causal language modeling)
- **Framework**: PyTorch with Accelerate for distributed training
- **Date**: Pretrained as of January 3, 2026

The model is suitable for fine-tuning on downstream tasks such as text generation, summarization, or question answering. It was trained with a focus on efficiency, including gradient checkpointing, mixed-precision training (BF16/FP16), and gradient accumulation.

### Architecture Details
- **Embedding dimension**: 768
- **Number of layers**: 12
- **Number of attention heads**: 12 (query heads)
- **Number of KV heads**: 4 (GQA for efficiency)
- **FFN hidden dimension multiplier**: 3.0 (2,304 hidden units per layer, aligned to multiple_of=256)
- **Normalization**: RMSNorm (eps=1e-5)
- **Attention mechanism**: Flash Attention 2 (fallback to PyTorch SDPA)
- **Positional encoding**: RoPE (precomputed for up to 2,048 positions)
- **Tokenizer**: SentencePiece (BPE-based, model file: `tokenizer.model`)
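The GQA configuration above can be illustrated with a small sketch of how 12 query heads share 4 KV heads (shapes follow the listed config; the SDPA call is the PyTorch fallback mentioned above, and all tensor values here are random placeholders, not the model's weights):

```python
import torch

# 12 query heads share 4 KV heads: each group of 12 // 4 = 3 query heads
# reuses one KV head (grouped query attention).
n_heads, n_kv_heads, head_dim = 12, 4, 64
B, T = 2, 16  # example batch size and sequence length

q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# Expand KV heads to match the query-head count before attention.
rep = n_heads // n_kv_heads
k = k.repeat_interleave(rep, dim=1)  # (B, 12, T, 64)
v = v.repeat_interleave(rep, dim=1)

# Causal attention via PyTorch SDPA (the fallback when Flash Attention 2
# is unavailable).
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 16, 64])
```

The KV projections are 3x smaller than the query projection, which shrinks the KV cache at inference time with little quality loss at this scale.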

The model achieves a parameter count of ~130M, making it lightweight yet capable for research and prototyping.
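The ~130M figure can be sanity-checked from the listed dimensions. Below is a rough back-of-the-envelope count; it assumes an untied output head and ignores the small RMSNorm parameters (assumptions for illustration, not confirmed details of the implementation):

```python
# Rough parameter-count estimate from the architecture listed above.
vocab, d, layers = 32_000, 768, 12
n_heads, n_kv_heads = 12, 4
head_dim = d // n_heads            # 64
kv_dim = n_kv_heads * head_dim     # 256 (GQA: smaller K/V projections)
ffn_hidden = 2_304                 # 3.0 * 768, already a multiple of 256

embed = vocab * d                              # token embeddings
attn = d * d + 2 * d * kv_dim + d * d          # Wq, Wk + Wv (GQA), Wo
ffn = 3 * d * ffn_hidden                       # SwiGLU: gate, up, down
per_layer = attn + ffn

total = embed + layers * per_layer + vocab * d  # last term: untied LM head
print(f"{total / 1e6:.1f}M parameters")         # 131.7M parameters
```

This lands at ~131.7M, consistent with the ~130M stated above.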

### Intended Uses & Limitations
#### Intended Uses
- **Text generation**: Generate coherent continuations from prompts (e.g., stories, explanations).
- **Fine-tuning**: Adapt to specific tasks like chatbots, code generation, or educational content creation.
- **Research**: Study Transformer efficiency, scaling laws, or dataset quality impacts.

Example usage for inference (after loading the model):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assuming uploaded to HF

model_name = "antonintomecek/gpt-fineweb-edu-130m"  # Replace with your HF repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Limitations
- **Dataset bias**: Trained solely on FineWeb-Edu, which emphasizes educational content but may inherit biases from web crawls (e.g., Western-centric views).
- **Hallucinations**: As a pretrained model, it may generate factually incorrect information.
- **Context length**: Limited to 1,024 tokens; longer contexts require modifications.
- **No fine-tuning**: This is a base pretrained model; performance on specific tasks will improve with fine-tuning.
- **Compute requirements**: Training requires GPU(s) with at least 16GB VRAM for the provided batch size/accumulation settings.
- **Language**: Primarily English; multilingual capabilities are untested.
- **Safety**: Not aligned or safety-tuned; may produce harmful or inappropriate content.

### Training Data
The model was pretrained on ~2B tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a curated dataset derived from CommonCrawl. FineWeb-Edu applies deduplication, language filtering (English), and quality scoring to focus on educational text (e.g., Wikipedia-like articles, textbooks). Data was tokenized using the provided SentencePiece model and stored in memmapped binary files (`dataset.bin` for training, `valid.bin` for validation).

- **Preprocessing**: Tokenized into int32 sequences; no additional filtering beyond dataset defaults.
- **Validation split**: A small holdout from the dataset for perplexity evaluation.
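Reading batches out of such a memmapped token file can be sketched as follows. This is a minimal sketch assuming a flat stream of int32 token ids, as described above; `get_batch` and its signature are hypothetical, not the script's actual API:

```python
import numpy as np
import torch

def get_batch(path, batch_size, seq_len, rng):
    """Sample random (input, target) windows from a memmapped int32 token file."""
    data = np.memmap(path, dtype=np.int32, mode="r")
    starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
    # Inputs are token windows; targets are the same windows shifted by one
    # (next-token prediction). Copy to int64 for torch embedding lookups.
    x = torch.stack([torch.from_numpy(data[s : s + seq_len].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1 : s + 1 + seq_len].astype(np.int64)) for s in starts])
    return x, y
```

Because `np.memmap` pages data in lazily, the full ~2B-token file never needs to fit in RAM.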

### Training Procedure
The model was trained using the provided script, which handles:
- **Optimizer**: AdamW (betas=0.9/0.95, weight_decay=0.01)
- **Learning rate**: Peak LR=1e-5 with cosine annealing (warmup=500 steps)
- **Batch size**: Effective batch size of 8 (batch_size=1, grad_accum=8; scalable with Accelerate)
- **Epochs**: 1 (full pass over ~2B tokens)
- **Mixed precision**: BF16 (or FP16 fallback)
- **Gradient checkpointing**: Enabled for memory efficiency
- **Checkpoints**: Saved every 100,000 steps, including model, optimizer, and scheduler states
- **Hardware**: Trained on GPU(s) with CUDA support; Flash Attention 2 for faster attention computation
- **Logging**: Clean English logs with tqdm progress bars
- **Resuming**: Supports loading from checkpoints (e.g., `checkpoints/step_XXXXXX.pt`)

Total training steps: approximately 2B tokens / (1024 seq_len × 8 effective batch size) ≈ 244,000 optimizer steps.
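The warmup-plus-cosine schedule can be sketched as a pure function of the step index. `lr_at` is a hypothetical helper; decaying all the way to zero and the ~244,000-step total are assumptions for illustration:

```python
import math

def lr_at(step, peak_lr=1e-5, warmup=500, total_steps=244_000, min_lr=0.0):
    """Linear warmup to peak_lr over `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)  # 0 → 1
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The LR rises linearly for 500 steps, peaks at 1e-5, and follows a half-cosine down for the remainder of training.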

During training, periodic text samples were generated from fixed prompts to monitor progress qualitatively.

#### Hyperparameters
- See `ModelArgs` in the code for full config.
- Customizable: Sequence length, batch size, accumulation steps, LR, etc.
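The effective-batch-of-8 setup can be sketched as a plain PyTorch accumulation loop. `train_step` is hypothetical; the actual script delegates this pattern (plus mixed precision and distribution) to Accelerate:

```python
import torch

def train_step(model, optimizer, batches, grad_accum=8):
    """One optimizer step over grad_accum micro-batches of (input, target) ids."""
    optimizer.zero_grad()
    total = 0.0
    for x, y in batches[:grad_accum]:
        logits = model(x)  # (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
        # Scale before backward so accumulated gradients equal the mean
        # over the full effective batch, not the sum.
        (loss / grad_accum).backward()
        total += loss.item()
    optimizer.step()
    return total / grad_accum
```

Dividing the loss by `grad_accum` before `backward()` is what makes accumulation equivalent to a single large batch.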

### Evaluation
- **Perplexity**: Validation loss is reported during training; a reasonable target at this scale is a perplexity below ~10 on held-out FineWeb-Edu data.
- **Qualitative**: Generated samples from prompts like "Once upon a time" improve in coherence over steps.
- No downstream benchmarks yet; evaluate after fine-tuning (e.g., using LM-Eval).
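For reference, perplexity is the exponential of the mean next-token cross-entropy loss, so a perplexity target maps directly to a validation-loss target (perplexity < 10 corresponds to loss < ln 10 ≈ 2.30):

```python
import math

def perplexity(mean_nll):
    """Convert mean next-token negative log-likelihood (nats) to perplexity."""
    return math.exp(mean_nll)

print(perplexity(2.0))  # ≈ 7.39
```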

### How to Get Started
1. Clone the repository or download from Hugging Face.
2. Install dependencies:
   ```bash
   pip install torch accelerate tqdm sentencepiece flash-attn  # Flash Attention optional
   ```
3. Prepare data: Tokenize FineWeb-Edu into `.bin` files (not included; generate your own).
4. Run training:
   ```bash
   python train.py
   ```
5. For inference, convert to HF format if needed (use `transformers` for easy loading).
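Step 3 above can be sketched as follows. This is a hypothetical layout, not the repository's actual preprocessing script: `encode` stands in for `SentencePieceProcessor.encode` loaded from `tokenizer.model`, and separating documents with an EOS token is an assumption:

```python
import numpy as np

def build_bin(texts, encode, eos_id, out_path="dataset.bin"):
    """Tokenize documents into one flat int32 stream and write it to disk."""
    ids = []
    for text in texts:
        ids.extend(encode(text))
        ids.append(eos_id)  # assumed document separator
    np.array(ids, dtype=np.int32).tofile(out_path)
```

The resulting file matches the flat int32 format the training script memmaps.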

### Citation
If you use this model, please cite:
```
@misc{tomecek2026nanogpt-x,
  author = {Antonín Tomeček},
  title = {GPT-Style Transformer Pretrained on FineWeb-Edu},
  year = {2026},
  url = {https://huggingface.co/luxopes/NanoGPT-X_Base},
}
```

### Acknowledgments
- Inspired by NanoGPT and Llama architectures.
- Thanks to Hugging Face for hosting and the FineWeb team for the dataset.
- Built with PyTorch, Accelerate, and Flash Attention.

For questions, contact Antonín Tomeček (Prague, CZ).