---
library_name: transformers
license: mit
base_model: gpt2
tags:
- generated_from_trainer
model-index:
- name: gpt2-wikitext2
  results: []
---

# gpt2-wikitext2

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on the wikitext2 dataset.
It achieves the following results on the evaluation set:
- Loss: 6.3377 (a perplexity of roughly exp(6.3377) ≈ 566)

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 2.0

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 6.3991        | 1.0   | 2249 | 6.5169          |
| 6.2508        | 2.0   | 4498 | 6.3377          |

### Framework versions

- Transformers 4.52.4
- Pytorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.2

# Language Model Training Notebook

The notebook covers two language modeling objectives:

- **Causal Language Modeling (CLM)**: training a model to predict the next token in a sequence (the objective used for this model).
- **Masked Language Modeling (MLM)**: training a model to predict masked tokens in a sequence.

## Dataset

I used the [Wikitext 2](https://huggingface.co/datasets/wikitext) dataset as an example in this notebook, but you can easily adapt it to other datasets from the Hugging Face Hub or to your own local data.

## Fine-tuning

I fine-tuned a GPT-2 model using the steps outlined in this notebook (a code sketch consistent with this setup appears at the end of this card).

- **Model Architecture**: I used the `gpt2` model checkpoint.
- **Dataset**: I used the `wikitext-2-raw-v1` dataset.
- **Training Duration**: I trained for 2 epochs.
- **Key Configurations**:
  - **Learning rate**: `2e-5`
  - **Weight decay**: `0.01`
  - **Batch size**: determined by `per_device_train_batch_size` in `TrainingArguments` (default is 8); gradient accumulation steps are not set (default is 1).
  - **Optimizer**: AdamW (default in `Trainer`)
  - **Scheduler**: linear with warmup (default in `Trainer`)
  - **Checkpointing**: saved a checkpoint every epoch, keeping the last 2.

## GPT-2 Architecture

The GPT-2 model is based on a decoder-only transformer architecture. Key architectural details:

- **Type**: Decoder-only Transformer
- **Layers**: 12 transformer blocks (GPT-2 base)
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 50,257
- **Max Sequence Length**: 1024 tokens
- **Positional Embeddings**: Learned
- **Activation Function**: GELU
- **Layer Normalization**: Applied before attention and feed-forward layers (pre-LN)
- **Causal Self-Attention**: Each token attends only to previous tokens
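
These figures can be read directly off the checkpoint's configuration. A minimal sketch using the Transformers config API:

```python
from transformers import GPT2Config

# Inspect the stock GPT-2 (base) configuration to confirm the figures above.
config = GPT2Config.from_pretrained("gpt2")

print(config.n_layer)              # 12 transformer blocks
print(config.n_embd)               # 768 hidden size
print(config.n_head)               # 12 attention heads
print(config.vocab_size)           # 50257
print(config.n_positions)          # 1024 max sequence length
print(config.activation_function)  # "gelu_new", a GELU variant
```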
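
## Training Sketch

For reference, the usual CLM preprocessing for `wikitext-2-raw-v1` tokenizes the raw text, concatenates it, and splits it into fixed-size blocks whose labels equal their inputs. The sketch below follows that standard recipe rather than the notebook's exact code; the `block_size` of 1024 is an assumption (it matches GPT-2's maximum context).

```python
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 1024  # assumed; the notebook's exact value is not recorded in this card

raw = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate all token lists, then split into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    # For causal LM the labels are the inputs themselves; the model shifts them internally.
    result["labels"] = result["input_ids"].copy()
    return result

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_datasets = tokenized.map(group_texts, batched=True)
```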
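
A `Trainer` setup consistent with the hyperparameters listed above might then look like the following. This is a hedged reconstruction, not the original notebook code; `lm_datasets` comes from the preprocessing sketch above, and `output_dir` is illustrative.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="gpt2-wikitext2",      # illustrative name
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=2.0,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
    save_strategy="epoch",            # checkpoint every epoch ...
    save_total_limit=2,               # ... keeping only the last 2
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
trainer.train()
```

Because `group_texts` produces equal-length blocks with labels already attached, the default data collator suffices and no `DataCollatorForLanguageModeling` is needed here.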
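
## Usage

Once the fine-tuned weights are on the Hub, generation works through the standard `pipeline` API. The repo id below is a placeholder; substitute the actual model id or a local checkpoint path.

```python
from transformers import pipeline

# "your-username/gpt2-wikitext2" is a hypothetical repo id, not this card's real path.
generator = pipeline("text-generation", model="your-username/gpt2-wikitext2")
out = generator("The history of the city begins", max_new_tokens=40)
print(out[0]["generated_text"])
```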