gpt2-wikitext2

This model is a fine-tuned version of gpt2 on the wikitext2 dataset. It achieves the following results on the evaluation set:

  • Loss: 6.3377
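
For context, the evaluation loss of 6.3377 corresponds to a perplexity of exp(6.3377) ≈ 565; a quick check:

```python
import math

eval_loss = 6.3377                # evaluation cross-entropy loss reported above
perplexity = math.exp(eval_loss)  # perplexity is exp(cross-entropy)
print(f"Perplexity: {perplexity:.1f}")  # ~565.5
```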

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 2.0
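
A minimal Hugging Face Transformers TrainingArguments sketch matching the values above (the output directory name is illustrative; the betas and epsilon noted in the comment are the adamw_torch defaults):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt2-wikitext2",     # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",             # AdamW with betas=(0.9, 0.999), epsilon=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=2.0,
)
```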

Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 6.3991        | 1.0   | 2249 | 6.5169          |
| 6.2508        | 2.0   | 4498 | 6.3377          |

Framework versions

  • Transformers 4.52.4
  • Pytorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.2

Language Model Training Notebook

  • Causal Language Modeling (CLM): Training a model to predict the next token in a sequence. (Current model; a short loss-computation sketch follows this list.)
  • Masked Language Modeling (MLM): Training a model to predict masked tokens in a sequence.
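
As a quick illustration of the CLM objective, passing the input ids as labels to a causal LM makes it compute the shifted next-token cross-entropy internally. A minimal sketch with the gpt2 checkpoint (the example sentence is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Wikipedia is a free online encyclopedia.", return_tensors="pt")
# For causal LM training, the labels are the input ids; the model shifts them
# internally so each position predicts the following token.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # next-token cross-entropy loss
```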

Dataset

I used the Wikitext 2 dataset as an example in this notebook, but you can easily adapt it to use other datasets from the Hugging Face Hub or your own local data.
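
For example, loading the Hub dataset used here, or swapping in local plain-text files (the local file paths are illustrative):

```python
from datasets import load_dataset

# WikiText-2 from the Hugging Face Hub (the configuration used in this card)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Or adapt to your own local text files
# dataset = load_dataset("text", data_files={"train": "train.txt", "validation": "valid.txt"})
```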

Pre-training

I fine-tuned a GPT-2 model using the steps outlined in this notebook (a data-preparation sketch follows the configuration list below).

  • Model Architecture: I used the gpt2 model checkpoint.
  • Dataset: I used the wikitext-2-raw-v1 dataset.
  • Training Duration: I trained for 2 epochs.
  • Key Configurations:
    • Learning rate: 2e-5
    • Weight decay: 0.01
    • Batch size: Determined by per_device_train_batch_size in TrainingArguments (default is 8). Gradient accumulation steps are not set (default is 1).
    • Optimizer: AdamW (default in Trainer)
    • Scheduler: Linear with warmup (default in Trainer)
    • Checkpointing: Saved checkpoint every epoch, keeping the last 2.
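
The data-preparation step for CLM tokenizes the raw text and concatenates it into fixed-length blocks. A sketch of that pattern (the group_texts helper and the block size of 1024 follow the standard language-modeling recipe, not code copied verbatim from this notebook):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 1024  # GPT-2's maximum context length

raw = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token lists, then split them into blocks of block_size.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    # For CLM, the labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```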

GPT-2 Architecture

The GPT-2 model is based on a decoder-only transformer architecture. Key architectural details:

  • Type: Decoder-only Transformer
  • Layers: 12 transformer blocks (GPT-2 base)
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary Size: 50,257
  • Max Sequence Length: 1024 tokens
  • Positional Embeddings: Learned
  • Activation Function: GELU
  • Layer Normalization: Applied before attention and feed-forward layers (pre-LN)
  • Causal Self-Attention: Each token attends only to previous tokens
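
These values can be read directly from the checkpoint's configuration:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gpt2")
print(config.n_layer)              # 12 transformer blocks
print(config.n_embd)               # 768 hidden size
print(config.n_head)               # 12 attention heads
print(config.vocab_size)           # 50257
print(config.n_positions)          # 1024 max sequence length
print(config.activation_function)  # "gelu_new"
```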