---
language:
- en
license: apache-2.0
tags:
- gpt2
- pytorch
- causal-lm
- text-generation
- fineweb
datasets:
- HuggingFaceFW/fineweb-edu
---

# LiteGPT-Base

LiteGPT-Base is a **124M-parameter** language model (GPT-2 Small architecture) pre-trained from scratch on the **FineWeb-Edu** dataset.

It is the base model for [LiteGPT-Instruct](https://huggingface.co/koganrath/LiteGPT-Instruct).

## Model Details

- **Architecture**: GPT-2 Small (12 layers, 12 heads, 768 embedding dim)
- **Parameters**: ~124 Million
- **Context Length**: 1024 tokens
- **Training Data**: 10 Billion tokens from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (Sample 10BT).
- **Tokenizer**: GPT-2 (via `tiktoken`)
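
The ~124M figure follows directly from the architecture numbers above. As a sanity check, here is a minimal sketch of the arithmetic, assuming the standard GPT-2 layout (tied input/output embeddings, 4x MLP expansion, and GPT-2's 50,257-token vocabulary):

```python
# Approximate parameter count for GPT-2 Small from the numbers above.
# Assumes tied input/output embeddings and GPT-2's 50,257-token vocabulary.
n_layer, n_embd, n_ctx, n_vocab = 12, 768, 1024, 50257

embeddings = n_vocab * n_embd + n_ctx * n_embd      # token + position embeddings
per_block = (
    3 * n_embd * n_embd + 3 * n_embd                # attention QKV projection
    + n_embd * n_embd + n_embd                      # attention output projection
    + 4 * n_embd * n_embd + 4 * n_embd              # MLP up-projection (4x width)
    + 4 * n_embd * n_embd + n_embd                  # MLP down-projection
    + 2 * (2 * n_embd)                              # two LayerNorms per block
)
total = embeddings + n_layer * per_block + 2 * n_embd  # + final LayerNorm

print(f"{total:,}")  # 124,439,808 ≈ 124M
```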

## Usage

This is a **completion model**: it predicts the next tokens from the input text. It is **not** an instruction-following model (chatbot).

### Python Example

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("koganrath/LiteGPT-Base")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Once upon a time in a digital world,"
|
|
|
inputs = tokenizer(text, return_tensors="pt")
# Sample rather than greedy-decode; greedy output from a small base model tends to repeat.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
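
Why sampling rather than greedy decoding? At generation time, the model's logits are turned into a distribution via a temperature-scaled softmax: low temperature sharpens it toward the top token (greedy-like, repetitive), high temperature flattens it. A toy, pure-Python sketch of that scaling (illustrative only, not the `transformers` internals):

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by 1/temperature, then normalize with the usual max-shift trick.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_cold = softmax(logits, temperature=0.5)  # sharper: top token dominates
p_hot = softmax(logits, temperature=2.0)   # flatter: probability is spread out
print(p_cold[0] > softmax(logits)[0] > p_hot[0])  # True
```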

## Limitations

- **Size**: 124M parameters is small by modern standards.
- **Coherence**: Long-form generation may lose coherence.
- **Knowledge**: Limited to the training data cut-off and scope.

## Authors

Trained by **koganrath** as part of the LiteGPT Project.