GPT-WikiText103
A 110M parameter GPT-2 style causal language model trained from scratch on the WikiText-103 dataset using PyTorch and HuggingFace Transformers.
Model Architecture
| Component | Details |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 110,418,432 (110.4M) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Feed-Forward Size | 3072 |
| Context Length | 1024 tokens |
| Vocabulary | 32,000 (Byte-Level BPE) |
| Activation | GELU |
| Positional Encoding | Learned absolute |
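The parameter total in the table can be checked by hand from the listed dimensions. The sketch below is a back-of-the-envelope count. It assumes the standard GPT-2 layout (fused QKV projection, biases on every linear layer, two layer norms per block plus a final one) and an output head tied to the token embedding. Those layout details are assumptions, not stated in this card, and `gpt2_param_count` is a hypothetical helper name.

```python
def gpt2_param_count(vocab=32_000, ctx=1024, d=768, layers=12, d_ff=3072):
    """Count trainable parameters for a GPT-2 style model (assumed layout)."""
    # Token embedding + learned absolute position embedding.
    embeddings = vocab * d + ctx * d
    # Attention: fused QKV projection, then output projection (weights + biases).
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # Feed-forward: up-projection, then down-projection (weights + biases).
    mlp = (d * d_ff + d_ff) + (d_ff * d + d)
    # Two layer norms per block, each with a weight and a bias vector.
    norms = 2 * (2 * d)
    per_block = attn + mlp + norms
    # Final layer norm; the LM head is assumed tied to the token embedding.
    final_norm = 2 * d
    return embeddings + layers * per_block + final_norm


print(f"{gpt2_param_count():,}")  # 110,418,432
```

Under these assumptions the count matches the figure in the table, which suggests the output head shares its weights with the token embedding.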
Training Details
| Setting | Value |
|---|---|
| Dataset | WikiText-103 (~112M tokens) |
| Epochs | 3 |
| Optimizer | AdamW (betas=0.9/0.95, weight_decay=0.1) |
| Learning Rate | 3e-4 (cosine decay with 5% warmup) |
| Batch Size | 8 micro-batch x 8 accumulation = 64 effective |
| Precision | FP16 mixed precision |
| Gradient Clipping | max_norm = 1.0 |
| Dropout | 0.1 (residual, embedding, attention) |
| Hardware | Single NVIDIA Tesla T4 (16 GB) |
| Training Time | ~7 hours |
| Total Optimizer Steps | 5,124 |
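The total step count follows from the batch settings and the 109,336 training sequences reported in the Dataset section. The sketch below is one way to arrive at 5,124. It assumes that leftover micro-batches at the end of each epoch do not trigger an optimizer step, which is a guess about the training loop rather than something this card states. `optimizer_steps` is a hypothetical helper.

```python
def optimizer_steps(num_sequences, micro_batch, accumulation, epochs):
    """Optimizer steps when incomplete accumulation groups are dropped per epoch."""
    micro_batches_per_epoch = num_sequences // micro_batch
    # Assumption: a partial accumulation group at epoch end is discarded.
    steps_per_epoch = micro_batches_per_epoch // accumulation
    return steps_per_epoch * epochs


print(optimizer_steps(109_336, micro_batch=8, accumulation=8, epochs=3))  # 5124
```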
Tokenizer
A custom Byte-Level BPE tokenizer was trained on the WikiText-103 training corpus (529 MB of cleaned text) with a vocabulary size of 32,000 tokens.
Weight Initialization
- Linear and embedding layers: N(0, 0.02)
- Residual projection layers: scaled by 1/sqrt(2N) per the GPT-2 paper
- Layer norms: weight=1.0, bias=0.0
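As a worked example of the residual scaling, the sketch below computes the resulting standard deviation. It assumes that N is the number of transformer layers (12) and that the scale factor multiplies the base standard deviation of 0.02. Both are common readings of this scheme, not details confirmed by this card.

```python
import math


def residual_init_std(base_std=0.02, num_layers=12):
    """Std for residual projection weights under the assumed 1/sqrt(2N) scaling."""
    # 2 * num_layers: each block adds two residual branches (attention and MLP).
    return base_std / math.sqrt(2 * num_layers)


print(f"{residual_init_std():.5f}")  # 0.00408
```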
Results
Evaluation Metrics
| Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|---|---|---|---|---|
| 1 | 5.1393 | 170.6 | 3.9563 | 52.3 |
| 2 | 3.8409 | 46.6 | 3.5589 | 35.1 |
| 3 | 3.5854 | 36.1 | 3.4760 | 32.3 |
Best model: Epoch 3, with Eval Loss 3.4760 and Eval Perplexity 32.3
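The perplexity columns are the exponential of the corresponding cross-entropy loss (in nats). A minimal check against the eval losses in the table, with `perplexity` as an illustrative helper name:

```python
import math


def perplexity(loss):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


for epoch, loss in {1: 3.9563, 2: 3.5589, 3: 3.4760}.items():
    print(epoch, round(perplexity(loss), 1))
```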
Training Progression
Training loss trended downward over the 5,124 optimizer steps, with small fluctuations between logged steps (for example, between steps 2,500 and 3,000):
| Step | Loss | PPL | LR |
|---|---|---|---|
| 50 | 8.0844 | 3,243.6 | 5.86e-05 |
| 500 | 5.4757 | 238.8 | 2.98e-04 |
| 1,000 | 4.7087 | 110.9 | 2.83e-04 |
| 1,500 | 4.0939 | 60.0 | 2.54e-04 |
| 2,000 | 3.9379 | 51.3 | 2.15e-04 |
| 2,500 | 3.7692 | 43.3 | 1.68e-04 |
| 3,000 | 3.8041 | 44.9 | 1.20e-04 |
| 3,500 | 3.6710 | 39.3 | 7.51e-05 |
| 4,000 | 3.6703 | 39.3 | 3.78e-05 |
| 4,500 | 3.5960 | 36.5 | 1.20e-05 |
| 5,100 | 3.6593 | 38.8 | 1.80e-08 |
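The LR column is consistent with linear warmup followed by cosine decay to zero. The sketch below is a plain-Python version of such a schedule, assuming 256 warmup steps (5% of 5,124, rounded down) and a decay that ends exactly at the final step. It illustrates the schedule shape and is not the notebook's actual scheduler code. `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, total_steps=5124, peak_lr=3e-4, warmup_frac=0.05):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = int(warmup_frac * total_steps)  # assumed rounding: floor
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (50, 500, 2500, 5100):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the values at steps 50, 500, 2,500, and 5,100 agree with the table to the precision shown.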
Generation Samples
Text generated with temperature=0.8, top_k=50, top_p=0.95, repetition_penalty=1.2:
Prompt: The history of artificial intelligence
The history of artificial intelligence has not been discovered in the ancient Egyptian @-@ era period , with a variety of scientific and educational institutions such as those from the Middle Ages to the 13th century . These include the Authorsian Archaeological Institute ( now called the University of Egypt ) , the British Museum of Archaeology and Art .
Prompt: In a distant galaxy, far beyond
In a distant galaxy, far beyond Earth , it is the only planet of its class .
Prompt: The economy of the United States
The economy of the United States was based on large @-@ scale industries ; a large majority , including many smaller states and cities . The nation 's population growth in agriculture increased from around $ 2 billion to $ 3 @.@ 6 million . This growth continued throughout the 1970s to 80 percent during the 1980s .
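To make the sampling settings above concrete, here is a simplified pure-Python sketch of how temperature, top-k, and top-p (nucleus) filtering combine to shape the next-token distribution. It is an illustration only: it omits the repetition penalty, operates on a plain list of logits, and is not the Transformers implementation. `sampling_distribution` is a hypothetical name.

```python
import math


def sampling_distribution(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the surviving tokens (max subtracted for stability).
    peak = max(scaled[i] for i in kept)
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}
    # Top-p: smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}


dist = sampling_distribution([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_k=3, top_p=0.9)
print(sorted(dist))  # [0, 1]
```

In this toy call, top-k removes the lowest-scoring token and top-p then removes the third, leaving two candidates to sample from.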
Dataset
WikiText-103-raw-v1: a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
| Split | Raw Lines | Cleaned Paragraphs | Tokens |
|---|---|---|---|
| Train | 1,801,350 | 874,786 | 111,960,667 |
| Validation | 3,760 | 1,863 | 234,817 |
| Test | 4,358 | - | - |
After tokenization, the corpus was concatenated and chunked into fixed-length sequences of 1,024 tokens, yielding 109,336 training sequences and 229 validation sequences.
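The concatenate-and-chunk step can be sketched as follows. The sequence counts above are consistent with dropping the final partial chunk, which is the assumption made here. `chunk_tokens` is a hypothetical helper, not a function from the training notebook.

```python
def chunk_tokens(token_ids, block_size=1024):
    """Split one long token stream into full blocks, dropping the remainder."""
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Sequence counts implied by the token totals in the table above.
print(111_960_667 // 1024, 234_817 // 1024)  # 109336 229
```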
Model Files
| File | Description |
|---|---|
| `model.safetensors` | Model weights (safetensors format) |
| `config.json` | Model architecture configuration |
| `tokenizer.json` | Trained BPE tokenizer |
| `tokenizer_config.json` | Tokenizer settings |
| `generation_config.json` | Default generation parameters |
Total size on disk: 444 MB
Usage
```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load the trained weights and tokenizer from the local output directory.
model = GPT2LMHeadModel.from_pretrained("gpt_wikitext103/final_model")
tokenizer = PreTrainedTokenizerFast.from_pretrained("gpt_wikitext103/final_model")

input_ids = tokenizer.encode("The history of", return_tensors="pt")

# Sample a continuation with the same settings used for the samples above.
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Reproducing
Install dependencies:
```shell
pip install torch transformers datasets tokenizers matplotlib tqdm
```
Run the training notebook:
```shell
jupyter notebook train_gpt_wikitext103.ipynb
```
Execute all cells sequentially. Training takes approximately 7 hours on a single T4 GPU.
Limitations
- Trained on Wikipedia text only: the model reflects Wikipedia's style, coverage biases, and knowledge cutoff.
- 110M parameters is relatively small; outputs can be repetitive or incoherent over long spans.
- The `@-@` and `@.@` artifacts in generated text come from WikiText-103's preprocessing of hyphens and decimals.
- Not instruction-tuned: the model is a raw language model and will not follow instructions or answer questions reliably.
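The WikiText markers can be removed from generated text with a small post-processing step. The helper below is a hypothetical sketch, not part of this repository. It also handles `@,@`, which WikiText-103 uses for thousands separators, and it leaves the dataset's other spacing conventions (such as spaces before punctuation) untouched.

```python
import re

# WikiText marker -> the plain character it stands for.
_MARKERS = {"@-@": "-", "@.@": ".", "@,@": ","}


def clean_wikitext_markers(text):
    """Replace ' @-@ ', ' @.@ ', and ' @,@ ' with the bare symbol."""
    # One optional space on each side of the marker is absorbed.
    return re.sub(r" ?(@[-.,]@) ?", lambda m: _MARKERS[m.group(1)], text)


print(clean_wikitext_markers("large @-@ scale industries"))  # large-scale industries
```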
License
This project uses the WikiText-103 dataset, which is released under the Creative Commons Attribution-ShareAlike License.