---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters), trained from scratch.

## Model Description

- **Model type:** GPT-2 (125M parameters)
- **Architecture:** Transformer-based autoregressive language model following the original GPT-2 design
- **Training data:** Multiple datasets (see the tags above), totaling approximately 18 billion tokens
- **Language:** English
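
Since the model follows the standard GPT-2 architecture, it should be loadable with the `transformers` library. A minimal sketch, assuming the weights are published in a `transformers`-compatible format under the `thecr7guy/gpt2-pretrain` repo id used in the evaluation table:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the evaluation table below; adjust if the
# checkpoint lives elsewhere or was exported in a different format.
repo_id = "thecr7guy/gpt2-pretrain"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The history of the telescope begins", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```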

## Performance and Evaluation

| Dataset        | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|----------------|--------|-------------------------|------------------|
| HellaSwag      | acc    | **0.291**               | 0.289            |
| SciQ           | acc    | **0.754**               | 0.752            |
| Winogrande     | acc    | 0.491                   | **0.516**        |
| TruthfulQA MC1 | acc    | **0.236**               | 0.228            |
| MMLU (overall) | acc    | **0.230**               | 0.229            |
| - Humanities   | acc    | 0.242                   | 0.242            |
| - Social Sci.  | acc    | 0.217                   | 0.217            |
| - STEM         | acc    | 0.213                   | 0.213            |
| - Other        | acc    | **0.239**               | 0.238            |
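
Scores like these are typically measured with EleutherAI's `lm-evaluation-harness`. The card does not state the exact evaluation setup, so the following invocation is only a sketch (task names follow current harness conventions, and the repo id is taken from the table above):

```shell
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=thecr7guy/gpt2-pretrain \
    --tasks hellaswag,sciq,winogrande,truthfulqa_mc1,mmlu \
    --batch_size 8
```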

## Training Details

- **Training corpus:** Approximately 18B tokens (~120 GB)
- **Training duration:** 1 epoch (approximately 8 hours total)
- **Hardware:** 8× NVIDIA A100 PCIe GPUs via runpod.io
- **Estimated cost:** $108.16 (8 GPUs × $13.52) for the complete run
- **Context length:** 1024 tokens

### Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
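
The paired max_lr/min_lr values suggest the usual GPT-2/GPT-3-style schedule of linear warmup followed by cosine decay. The card does not state the warmup length or total step count, so those numbers in the sketch below are assumptions (total steps estimated as ~18B tokens / 524,288 tokens per step):

```python
import math

max_lr = 6.0e-4   # from the hyperparameter list
min_lr = 6.0e-5   # from the hyperparameter list

# Assumptions, not stated in the card:
warmup_steps = 700
max_steps = 34_332  # ~18B tokens / 524,288 tokens per optimizer step

def get_lr(step):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

# Batch arithmetic: micro-batch 64 sequences x 1024-token context x 8 GPUs
# = 524,288 tokens per step, exactly total_batch_size, so this run needs
# no gradient accumulation.
grad_accum_steps = 524_288 // (64 * 1024 * 8)
```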
## Setup and Training Commands

- `pip install wandb`
- `pip install tiktoken`
- `pip install --upgrade huggingface_hub`
- `pip install torchinfo`
- `pip install datasets`
- `sudo apt update && sudo apt install tmux`
- `tmux new -s training`
- `wandb login`
- `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py`

## Contact

GitHub: [thecr7guy2](https://github.com/thecr7guy2)