---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters), trained from scratch.

## Model Description

- **Model type:** GPT-2 (125M parameters)
- **Architecture:** Transformer-based autoregressive language model following the original GPT-2 design
- **Training data:** A mix of public datasets (see the dataset tags above), totaling roughly 18 billion tokens
- **Language:** English

## Performance and Evaluation

| Dataset        | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|----------------|--------|-------------------------|------------------|
| HellaSwag      | acc    | **0.291**               | 0.289            |
| SciQ           | acc    | **0.754**               | 0.752            |
| Winogrande     | acc    | 0.491                   | **0.516**        |
| TruthfulQA MC1 | acc    | **0.236**               | 0.228            |
| MMLU (overall) | acc    | **0.230**               | 0.229            |
| - Humanities   | acc    | 0.242                   | 0.242            |
| - Social Sci.  | acc    | 0.217                   | 0.217            |
| - STEM         | acc    | 0.213                   | 0.213            |
| - Other        | acc    | **0.239**               | 0.238            |

## Training Details

- **Training corpus:** Approximately 18B tokens (120 GB)
- **Training duration:** 1 epoch (approximately 8 hours total)
- **Hardware:** 8× NVIDIA A100 PCIe GPUs via runpod.io
- **Estimated cost:** Approximately $108 (8 × $13.52) for the complete run
- **Token context:** 1024 tokens

### Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
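The `max_lr`/`min_lr` pair above suggests a warmup-plus-cosine-decay schedule, which is typical for GPT-2-style pretraining; the actual schedule lives in `train.py` and is not shown in this card. A minimal sketch using the listed hyperparameters, where the warmup length and total step count are assumed placeholders (total steps estimated as ~18B tokens / 524288 tokens per step):

```python
import math

MAX_LR = 6.0e-4     # from the hyperparameter list
MIN_LR = 6.0e-5     # from the hyperparameter list
WARMUP_STEPS = 700  # assumption: not stated in the card
MAX_STEPS = 34000   # assumption: ~18e9 tokens / 524288 tokens per step

# Sanity check: 8 GPUs x 64 sequences x 1024 tokens = 524288 tokens,
# which matches total_batch_size exactly (no gradient accumulation needed).
assert 8 * 64 * 1024 == 524288

def get_lr(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    # progress goes 0 -> 1 over the decay phase; coeff goes 1 -> 0
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```

With this shape the learning rate peaks at `max_lr` right after warmup and decays smoothly to `min_lr` by the final step.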
## Commands Used During Setup

```bash
pip install wandb
pip install tiktoken
pip install --upgrade huggingface_hub
pip install torchinfo
pip install datasets
sudo apt update && sudo apt install tmux
tmux new -s training
wandb login
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 \
  torchrun --standalone --nproc_per_node=8 train.py
```

## Contact

GitHub: [thecr7guy2](https://github.com/thecr7guy2)
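If the trained checkpoint is published in the standard Hugging Face GPT-2 format, it can be loaded for inference with `transformers`. This is a sketch, not a confirmed usage path: the repo id below is taken from the evaluation table and is assumed to point at the uploaded weights.

```python
from transformers import pipeline

# Assumption: the checkpoint is stored in Hugging Face GPT-2 format under
# this repo id (taken from the evaluation table above); adjust if it differs.
generator = pipeline("text-generation", model="thecr7guy/gpt2-pretrain")

result = generator("The history of science shows that", max_new_tokens=40)
print(result[0]["generated_text"])
```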