Fix dataset mix to 50-30-20 and update code snippet with better generation params
README.md (changed)

@@ -42,7 +42,7 @@ datasets:

# GPT-2 70M - Optimal Dataset Mixing

A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.
## Model Description

@@ -64,7 +64,7 @@ The model was trained on **1 billion tokens** with the following composition:

- **30%** - DCLM Baseline (300M tokens): Filtered web content
- **20%** - FineWeb-Edu (200M tokens): Educational web content
This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
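
As a rough illustration of what a budgeted 50-30-20 mix involves, the split can be sketched in plain Python. The dataset keys and the sampling helper below are hypothetical placeholders, since this card does not show the actual training pipeline:

```python
import random

# Illustrative 50-30-20 mix; the keys are hypothetical placeholders,
# not the identifiers used in the real training pipeline.
MIX = {"finepdf": 0.5, "dclm_baseline": 0.3, "fineweb_edu": 0.2}

def token_budgets(total_tokens: int, mix: dict) -> dict:
    """Split a total token budget according to the mixing ratios."""
    return {name: round(total_tokens * frac) for name, frac in mix.items()}

def sample_source(mix: dict, rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names, weights = zip(*mix.items())
    return rng.choices(names, weights=weights, k=1)[0]

budgets = token_budgets(1_000_000_000, MIX)
print(budgets)  # 500M / 300M / 200M tokens
```

Sampling per document rather than concatenating the datasets keeps the three sources interleaved throughout training.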
## Training Details

@@ -106,16 +106,23 @@

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")
# Generate text with better sampling parameters
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,        # Enable sampling
    temperature=0.8,       # Control randomness
    top_p=0.9,             # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))
```
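
To make the sampling parameters above concrete, here is a small self-contained sketch, independent of the transformers library, of how temperature scaling and nucleus (top-p) filtering reshape a toy next-token distribution:

```python
import math

def softmax_with_temperature(logits, temperature=0.8):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.8)
print(top_p_filter(probs, top_p=0.9))
```

Temperature sharpens or flattens the distribution before the top-p cutoff is applied, so together the two settings trade diversity against coherence.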
## Key Insights
1. **Data Quality > Quantity**: The 50-30-20 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
2. **Factual Accuracy**: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (50%)
3. **Practical Commonsense**: Strong performance on PIQA and WinoGrande shows effective real-world reasoning
4. **Knowledge Gaps**: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic/scientific knowledge for this scale