GPT-WikiText103

A 110M-parameter GPT-2-style causal language model trained from scratch on the WikiText-103 dataset using PyTorch and Hugging Face Transformers.

Model Architecture

| Component | Details |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | 110,418,432 (110.4M) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Feed-Forward Size | 3072 |
| Context Length | 1024 tokens |
| Vocabulary | 32,000 (Byte-Level BPE) |
| Activation | GELU |
| Positional Encoding | Learned absolute |
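The parameter total can be reproduced from the other rows of the table. The sketch below assumes the standard GPT-2 layout (bias terms on every linear layer, two layer norms per block plus a final one, and an output head tied to the token embedding); none of these details are stated in this card, but the count they give matches the reported figure exactly. The function name is illustrative.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer, n_inner):
    """Count parameters of a GPT-2 style model with a tied output head."""
    token_emb = vocab_size * n_embd              # shared with the LM head (assumed tied)
    pos_emb = n_positions * n_embd               # learned absolute positions
    layer_norm = 2 * n_embd                      # weight + bias
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd  # fused query/key/value projection
    attn_out = n_embd * n_embd + n_embd          # attention output projection
    mlp_in = n_embd * n_inner + n_inner          # feed-forward expansion
    mlp_out = n_inner * n_embd + n_embd          # feed-forward contraction
    per_layer = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out
    final_norm = layer_norm
    return token_emb + pos_emb + n_layer * per_layer + final_norm


total = gpt2_param_count(
    vocab_size=32_000, n_positions=1_024, n_embd=768, n_layer=12, n_inner=3_072
)
print(f"{total:,}")  # 110,418,432
```

Of that total, about 24.6M parameters sit in the token embedding and about 85.1M in the 12 transformer blocks.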

Training Details

| Setting | Value |
|---|---|
| Dataset | WikiText-103 (~112M tokens) |
| Epochs | 3 |
| Optimizer | AdamW (betas=(0.9, 0.95), weight_decay=0.1) |
| Learning Rate | 3e-4 (cosine decay with 5% warmup) |
| Batch Size | 8 micro-batch × 8 accumulation steps = 64 effective |
| Precision | FP16 mixed precision |
| Gradient Clipping | max_norm = 1.0 |
| Dropout | 0.1 (residual, embedding, attention) |
| Hardware | Single NVIDIA Tesla T4 (16 GB) |
| Training Time | 426 minutes (7.1 hours) |
| Total Optimizer Steps | 5,124 |
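The step count follows from the batch settings and the 109,336 training sequences reported in the Dataset section. A minimal sketch of that arithmetic, assuming the incomplete final batch of each epoch is dropped (an assumption, but one that reproduces the 5,124 figure):

```python
def optimizer_steps(num_sequences, micro_batch, accum_steps, epochs):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = micro_batch * accum_steps
    # Assumption: sequences that do not fill a whole effective batch are skipped.
    steps_per_epoch = num_sequences // effective_batch
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = optimizer_steps(
    num_sequences=109_336, micro_batch=8, accum_steps=8, epochs=3
)
tokens_per_step = effective_batch * 1_024  # 1,024-token sequences

print(effective_batch, steps_per_epoch, total_steps)  # 64 1708 5124
print(f"{tokens_per_step:,} tokens per optimizer step")  # 65,536 tokens per optimizer step
```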

Tokenizer

A custom Byte-Level BPE tokenizer was trained on the WikiText-103 training corpus (529 MB of cleaned text) with a vocabulary size of 32,000 tokens.

Weight Initialization

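  • Linear and embedding layers: normal distribution with mean 0 and standard deviation 0.02
  • Residual projection layers: standard deviation additionally scaled by 1/sqrt(2N), where N is the number of transformer layers (12 here), following the residual scaling described in the GPT-2 paper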
  • Layer norms: weight=1.0, bias=0.0
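As a worked example of the scaling rule, the snippet below computes the two standard deviations. It assumes N is the number of transformer layers (12) and that the 1/sqrt(2N) factor multiplies the base value of 0.02; the helper name is illustrative and is not taken from the training notebook.

```python
import math


def init_std(n_layer, base_std=0.02, residual_projection=False):
    """Standard deviation for normal weight initialization.

    Residual projections (the attention and feed-forward output layers)
    get a smaller std so the residual stream's variance does not grow
    with depth.
    """
    if residual_projection:
        return base_std / math.sqrt(2 * n_layer)
    return base_std


print(round(init_std(12), 6))                            # 0.02
print(round(init_std(12, residual_projection=True), 6))  # 0.004082
```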

Results

Evaluation Metrics

| Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|---|---|---|---|---|
| 1 | 5.1393 | 170.6 | 3.9563 | 52.3 |
| 2 | 3.8409 | 46.6 | 3.5589 | 35.1 |
| 3 | 3.5854 | 36.1 | 3.4760 | 32.3 |

Best model: Epoch 3 (eval loss 3.4760, eval perplexity 32.3)
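The perplexity columns are the exponential of the corresponding loss, which is consistent with the loss being a mean cross-entropy in nats per token. A short check against the eval losses reported above:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)


eval_losses = {1: 3.9563, 2: 3.5589, 3: 3.4760}
for epoch, loss in eval_losses.items():
    print(f"epoch {epoch}: eval PPL {perplexity(loss):.1f}")
# epoch 1: eval PPL 52.3
# epoch 2: eval PPL 35.1
# epoch 3: eval PPL 32.3
```

Because the loss is averaged over tokens from the custom 32,000-entry BPE vocabulary, these values are per-BPE-token perplexities and are not directly comparable with word-level WikiText-103 results.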

Training Progression

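The logged loss fell from 8.08 at step 50 to roughly 3.6 by the end of the 5,124 optimizer steps, with small fluctuations between checkpoints (for example, 3.77 at step 2,500 versus 3.80 at step 3,000):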

| Step | Loss | PPL | LR |
|---|---|---|---|
| 50 | 8.0844 | 3,243.6 | 5.86e-05 |
| 500 | 5.4757 | 238.8 | 2.98e-04 |
| 1,000 | 4.7087 | 110.9 | 2.83e-04 |
| 1,500 | 4.0939 | 60.0 | 2.54e-04 |
| 2,000 | 3.9379 | 51.3 | 2.15e-04 |
| 2,500 | 3.7692 | 43.3 | 1.68e-04 |
| 3,000 | 3.8041 | 44.9 | 1.20e-04 |
| 3,500 | 3.6710 | 39.3 | 7.51e-05 |
| 4,000 | 3.6703 | 39.3 | 3.78e-05 |
| 4,500 | 3.5960 | 36.5 | 1.20e-05 |
| 5,100 | 3.6593 | 38.8 | 1.80e-08 |
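The LR column matches a linear warmup followed by cosine decay to zero. The scheduler code is not shown in this card, so the sketch below is a reconstruction under two assumptions: warmup lasts int(0.05 × 5,124) = 256 steps, and the cosine decays all the way to 0 at the final step. With those assumptions it reproduces the tabulated values.

```python
import math


def lr_at(step, total_steps=5_124, peak_lr=3e-4, warmup_frac=0.05):
    """Learning rate at a given optimizer step (assumed schedule, see above)."""
    warmup_steps = int(warmup_frac * total_steps)  # 256 for this run
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (50, 500, 2_500, 5_100):
    print(step, f"{lr_at(step):.2e}")
# 50 5.86e-05
# 500 2.98e-04
# 2500 1.68e-04
# 5100 1.80e-08
```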

Generation Samples

Text generated with temperature=0.8, top_k=50, top_p=0.95, repetition_penalty=1.2:

Prompt: The history of artificial intelligence

The history of artificial intelligence has not been discovered in the ancient Egyptian @-@ era period , with a variety of scientific and educational institutions such as those from the Middle Ages to the 13th century . These include the Authorsian Archaeological Institute ( now called the University of Egypt ) , the British Museum of Archaeology and Art .

Prompt: In a distant galaxy, far beyond

In a distant galaxy, far beyond Earth , it is the only planet of its class .

Prompt: The economy of the United States

The economy of the United States was based on large @-@ scale industries ; a large majority , including many smaller states and cities . The nation 's population growth in agriculture increased from around $ 2 billion to $ 3 @.@ 6 million . This growth continued throughout the 1970s to 80 percent during the 1980s .
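To make the four sampling controls concrete, here is a small pure-Python sketch of how they reshape a next-token distribution. It is an illustrative re-implementation of the general technique, not the Transformers source code; the function names and the toy logits are made up, and boundary handling for top-p may differ slightly from the library.

```python
import math
import random


def filter_logits(logits, prev_ids, temperature=0.8, top_k=50, top_p=0.95,
                  repetition_penalty=1.2):
    """Return {token_id: probability} after applying the sampling controls."""
    logits = list(logits)
    # Repetition penalty: push already-generated tokens towards lower scores.
    for i in set(prev_ids):
        if logits[i] < 0:
            logits[i] *= repetition_penalty
        else:
            logits[i] /= repetition_penalty
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    logits = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ranked = ranked[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    peak = logits[ranked[0]]
    exps = [math.exp(logits[i] - peak) for i in ranked]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {token_id: p / kept_mass for token_id, p in kept}


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical 5-token vocabulary
dist = filter_logits(toy_logits, prev_ids=[0], top_k=3)
print(sorted(dist))  # [0, 1, 2]
print(sample(dist) in dist)  # True
```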

Dataset

WikiText-103-raw-v1: a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.

| Split | Raw Lines | Cleaned Paragraphs | Tokens |
|---|---|---|---|
| Train | 1,801,350 | 874,786 | 111,960,667 |
| Validation | 3,760 | 1,863 | 234,817 |
| Test | 4,358 | - | - |

After tokenization, the corpus was concatenated and chunked into fixed-length sequences of 1,024 tokens, yielding 109,336 training sequences and 229 validation sequences.
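The chunking step can be sketched as follows. The helper is illustrative rather than taken from the notebook, and it assumes the trailing partial block is discarded, which is consistent with the sequence counts quoted above.

```python
def chunk_token_ids(token_ids, block_size=1_024):
    """Split one long token stream into full blocks, dropping the remainder."""
    n_blocks = len(token_ids) // block_size
    return [
        token_ids[i * block_size:(i + 1) * block_size]
        for i in range(n_blocks)
    ]


# Toy example: 10 tokens in blocks of 4 leave 2 tokens unused.
print(chunk_token_ids(list(range(10)), block_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]

# The same integer division applied to the token counts in the table.
print(111_960_667 // 1_024, 234_817 // 1_024)  # 109336 229
```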

Model Files

| File | Description |
|---|---|
| model.safetensors | Model weights (safetensors format) |
| config.json | Model architecture configuration |
| tokenizer.json | Trained BPE tokenizer |
| tokenizer_config.json | Tokenizer settings |
| generation_config.json | Default generation parameters |

Total size on disk: 444 MB

Usage

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load the weights and the custom BPE tokenizer from the local model directory.
model = GPT2LMHeadModel.from_pretrained("gpt_wikitext103/final_model")
tokenizer = PreTrainedTokenizerFast.from_pretrained("gpt_wikitext103/final_model")
model.eval()  # disable dropout for inference

# Encode the prompt and sample a continuation with the settings used for the samples above.
input_ids = tokenizer.encode("The history of", return_tensors="pt")
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Reproducing

  1. Install dependencies:

    pip install torch transformers datasets tokenizers matplotlib tqdm
    
  2. Run the training notebook:

    jupyter notebook train_gpt_wikitext103.ipynb
    

    Execute all cells sequentially. Training takes approximately 7 hours on a single T4 GPU.

Limitations

  • Trained on Wikipedia text only, so the model reflects Wikipedia's style, coverage biases, and knowledge cutoff.
  • 110M parameters is relatively small; outputs can be repetitive or incoherent over long spans.
  • The @-@ and @.@ artifacts in generated text come from WikiText-103's preprocessing of hyphens and decimals.
  • Not instruction-tuned: the model is a raw language model and will not follow instructions or answer questions reliably.
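If the @-@ and @.@ artifacts get in the way, generated text can be cleaned up after decoding. The helper below is a heuristic sketch, not part of this repository, and it is not an exact inverse of WikiText's tokenization; it also handles the @,@ marker that the dataset uses for thousands separators.

```python
import re


def clean_wikitext_artifacts(text):
    """Heuristically undo WikiText-style spacing in generated text."""
    # " @-@ " -> "-", " @.@ " -> ".", " @,@ " -> ","
    text = re.sub(r"\s*@([-.,])@\s*", r"\1", text)
    # Remove the space before closing punctuation and after "(" or "$".
    text = re.sub(r"\s+([,.;:!?)])", r"\1", text)
    text = re.sub(r"([($])\s+", r"\1", text)
    # Re-attach the possessive "'s" to the preceding word.
    text = re.sub(r"\s+'s\b", "'s", text)
    return text


sample_text = "The nation 's growth increased from around $ 2 billion to $ 3 @.@ 6 million ."
print(clean_wikitext_artifacts(sample_text))
# The nation's growth increased from around $2 billion to $3.6 million.
```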

License

This project uses the WikiText-103 dataset, which is released under the Creative Commons Attribution-ShareAlike License.
