GPT-WikiText103

A 110M-parameter GPT-2-style causal language model trained from scratch on the WikiText-103 dataset using PyTorch and HuggingFace Transformers.

Model Architecture

Component	Details
Architecture	GPT-2 (decoder-only transformer)
Parameters	110,418,432 (110.4M)
Layers	12
Hidden Size	768
Attention Heads	12
Feed-Forward Size	3072
Context Length	1024 tokens
Vocabulary	32,000 (Byte-Level BPE)
Activation	GELU
Positional Encoding	Learned absolute
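
The parameter count can be checked from the table with a short calculation. This is a sketch that assumes the standard GPT-2 layout (biases on all linear layers, two layer norms per block plus a final one, and an output head tied to the token embedding); the function name is illustrative and not part of this repository.

```python
def gpt2_param_count(vocab=32_000, ctx=1024, d=768, layers=12, d_ff=3072):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    embeddings = vocab * d + ctx * d                 # token table + learned position table
    layer_norm = 2 * d                               # weight and bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV projection + output projection
    mlp = (d * d_ff + d_ff) + (d_ff * d + d)         # up-projection + down-projection
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + layers * per_layer + layer_norm  # final layer norm added once

total = gpt2_param_count()
print(f"{total:,}")  # 110,418,432
```

Under these assumptions the result matches the reported 110,418,432 parameters.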

Training Details

Setting	Value
Dataset	WikiText-103 (~112M tokens)
Epochs	3
Optimizer	AdamW (betas=(0.9, 0.95), weight_decay=0.1)
Learning Rate	3e-4 (cosine decay with 5% warmup)
Batch Size	8 micro-batch x 8 accumulation = 64 effective
Precision	FP16 mixed precision
Gradient Clipping	max_norm = 1.0
Dropout	0.1 (residual, embedding, attention)
Hardware	Single NVIDIA Tesla T4 (16 GB)
Training Time	~426 minutes (~7.1 hours)
Total Optimizer Steps	5,124
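
The step count follows from the batch settings and the number of 1,024-token training sequences (109,336, reported in the Dataset section). The sketch below assumes that incomplete micro-batches and incomplete accumulation groups are dropped at the end of each epoch; the training notebook may handle the remainder differently.

```python
train_sequences = 109_336   # 1,024-token training sequences reported in this card
micro_batch = 8
grad_accum = 8
epochs = 3
context_length = 1024

effective_batch = micro_batch * grad_accum           # sequences per optimizer step
tokens_per_step = effective_batch * context_length   # tokens per optimizer step

# Assumption: leftovers that do not fill a micro-batch or an accumulation group are dropped.
micro_batches_per_epoch = train_sequences // micro_batch
steps_per_epoch = micro_batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

print(effective_batch, tokens_per_step, steps_per_epoch, total_steps)
# 64 65536 1708 5124
```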

Tokenizer

A custom Byte-Level BPE tokenizer was trained on the WikiText-103 training corpus (529 MB of cleaned text) with a vocabulary size of 32,000 tokens.

Weight Initialization

Linear and embedding layers: N(0, 0.02)
Residual projection layers: scaled by 1/sqrt(2N), where N is the number of transformer layers (each layer contributes two residual branches), per the GPT-2 paper
Layer norms: weight=1.0, bias=0.0
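
As a rough illustration of the residual scaling, the snippet below computes the resulting standard deviation for this model. It assumes that the 0.02 in N(0, 0.02) is a standard deviation (as in the usual GPT-2 setup) and that N is the 12 layers from the architecture table; neither is stated explicitly above.

```python
import math

n_layers = 12      # from the architecture table
base_std = 0.02    # assumption: 0.02 is a standard deviation, not a variance

# Each transformer layer has two residual branches (attention and MLP),
# so the scaling factor uses 2 * n_layers.
residual_std = base_std / math.sqrt(2 * n_layers)
print(f"{residual_std:.6f}")  # 0.004082
```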

Results

Evaluation Metrics

Epoch	Train Loss	Train PPL	Eval Loss	Eval PPL
1	5.1393	170.6	3.9563	52.3
2	3.8409	46.6	3.5589	35.1
3	3.5854	36.1	3.4760	32.3

Best model: Epoch 3 — Eval Loss 3.4760, Eval Perplexity 32.3
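
Perplexity in the table is the exponential of the loss, assuming the loss is the mean cross-entropy per token in nats. A minimal check against the eval column (the helper name is illustrative):

```python
import math

def perplexity(loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss)

eval_losses = {1: 3.9563, 2: 3.5589, 3: 3.4760}
for epoch, loss in eval_losses.items():
    print(epoch, round(perplexity(loss), 1))
# 1 52.3
# 2 35.1
# 3 32.3
```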

Training Progression

Loss trended downward across the 5,124 optimizer steps, with small fluctuations between logged steps (for example at steps 3,000 and 5,100):

Step	Loss	PPL	LR
50	8.0844	3,243.6	5.86e-05
500	5.4757	238.8	2.98e-04
1,000	4.7087	110.9	2.83e-04
1,500	4.0939	60.0	2.54e-04
2,000	3.9379	51.3	2.15e-04
2,500	3.7692	43.3	1.68e-04
3,000	3.8041	44.9	1.20e-04
3,500	3.6710	39.3	7.51e-05
4,000	3.6703	39.3	3.78e-05
4,500	3.5960	36.5	1.20e-05
5,100	3.6593	38.8	1.80e-08
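
The LR column is consistent with linear warmup followed by cosine decay to zero. The sketch below assumes 256 warmup steps (5% of 5,124, rounded down) and the same formula as `get_cosine_schedule_with_warmup` in Transformers; the notebook's exact scheduler is not shown in this card, so treat the function as an approximation rather than the training code.

```python
import math

def lr_at(step, peak_lr=3e-4, total_steps=5124, warmup_steps=256):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (50, 500, 2500, 5100):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the printed values agree with the logged learning rates at the same steps.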

Generation Samples

Text generated with temperature=0.8, top_k=50, top_p=0.95, repetition_penalty=1.2:

Prompt: The history of artificial intelligence

The history of artificial intelligence has not been discovered in the ancient Egyptian @-@ era period , with a variety of scientific and educational institutions such as those from the Middle Ages to the 13th century . These include the Authorsian Archaeological Institute ( now called the University of Egypt ) , the British Museum of Archaeology and Art .

Prompt: In a distant galaxy, far beyond

In a distant galaxy, far beyond Earth , it is the only planet of its class .

Prompt: The economy of the United States

The economy of the United States was based on large @-@ scale industries ; a large majority , including many smaller states and cities . The nation 's population growth in agriculture increased from around $ 2 billion to $ 3 @.@ 6 million . This growth continued throughout the 1970s to 80 percent during the 1980s .

Dataset

WikiText-103-raw-v1 — a collection of over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.

Split	Raw Lines	Cleaned Paragraphs	Tokens
Train	1,801,350	874,786	111,960,667
Validation	3,760	1,863	234,817
Test	4,358	—	—

After tokenization, the corpus was concatenated and chunked into fixed-length sequences of 1,024 tokens, yielding 109,336 training sequences and 229 validation sequences.
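
The chunking step can be sketched as follows. The helper is hypothetical, and it assumes the trailing tokens that do not fill a full 1,024-token block are discarded, which is consistent with the sequence counts above.

```python
def chunk_tokens(token_ids, block_size=1024):
    """Split a flat token stream into full blocks, dropping the trailing remainder."""
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy stream standing in for the concatenated corpus.
blocks = chunk_tokens(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]

# Sequence counts implied by the token totals in the table.
print(111_960_667 // 1024, 234_817 // 1024)  # 109336 229
```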

Model Files

File	Description
`model.safetensors`	Model weights (safetensors format)
`config.json`	Model architecture configuration
`tokenizer.json`	Trained BPE tokenizer
`tokenizer_config.json`	Tokenizer settings
`generation_config.json`	Default generation parameters

Total size on disk: 444 MB
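
Most of the disk footprint is the weight file. A quick estimate, assuming the weights are stored as 32-bit floats (F32) and that sizes are reported in decimal megabytes:

```python
n_params = 110_418_432     # parameter count reported in this card
bytes_per_param = 4        # assumption: weights stored as 32-bit floats

weights_mb = n_params * bytes_per_param / 1e6   # decimal megabytes
print(f"{weights_mb:.1f} MB")  # 441.7 MB
```

The remaining couple of megabytes would plausibly be the tokenizer and configuration files, though the card does not break the total down.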

Usage

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load the trained weights and tokenizer from the local output directory.
model = GPT2LMHeadModel.from_pretrained("gpt_wikitext103/final_model")
tokenizer = PreTrainedTokenizerFast.from_pretrained("gpt_wikitext103/final_model")

# Encode the prompt and sample a continuation with the same settings
# used for the generation samples in this card.
input_ids = tokenizer.encode("The history of", return_tensors="pt")
output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Reproducing

Install dependencies:

```
pip install torch transformers datasets tokenizers matplotlib tqdm
```

Run the training notebook:
```
jupyter notebook train_gpt_wikitext103.ipynb
```
Execute all cells sequentially. Training takes approximately 7 hours on a single T4 GPU.

Limitations

Trained on Wikipedia text only — the model reflects Wikipedia's style, coverage biases, and knowledge cutoff.
110M parameters is relatively small; outputs can be repetitive or incoherent over long spans.
The @-@ and @.@ artifacts in generated text come from WikiText-103's preprocessing of hyphens and decimals.
Not instruction-tuned — the model is a raw language model and will not follow instructions or answer questions reliably.

License

This project uses the WikiText-103 dataset, which is released under the Creative Commons Attribution-ShareAlike License.

Model weights: Safetensors, 0.1B params, tensor type F32