codelion committed · verified
Commit 5045572 · Parent(s): 7ba6110

Fix dataset mix to 50-30-20 and update code snippet with better generation params
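For reference, the corrected 50-30-20 split over the README's stated 1-billion-token budget works out as sketched below. This is a quick illustrative calculation, not part of the training code; the assignment of FinePDF to the 50% share follows the updated Key Insights in the diff.

```python
# Per-dataset token budget under the corrected 50-30-20 mix.
# Integer arithmetic (percent * total // 100) keeps the counts exact.
total_tokens = 1_000_000_000  # 1B tokens, as stated in the README

mix_percent = {
    "FinePDF": 50,        # high-quality PDF content (per Key Insights)
    "DCLM Baseline": 30,  # filtered web content
    "FineWeb-Edu": 20,    # educational web content
}

tokens = {name: total_tokens * p // 100 for name, p in mix_percent.items()}
print(tokens)
# → {'FinePDF': 500000000, 'DCLM Baseline': 300000000, 'FineWeb-Edu': 200000000}
```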

Files changed (1): README.md (+13 −6)
README.md CHANGED

@@ -42,7 +42,7 @@ datasets:
 
 # GPT-2 70M - Optimal Dataset Mixing
 
-A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 40-30-30 dataset mixing strategy.
+A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.
 
 ## Model Description
 
@@ -64,7 +64,7 @@ The model was trained on **1 billion tokens** with the following composition:
 - **30%** - DCLM Baseline (300M tokens): Filtered web content
 - **30%** - FineWeb-Edu (300M tokens): Educational web content
 
-This 40-30-30 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
+This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
 
 ## Training Details
 
@@ -106,16 +106,23 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
 model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")
 
-# Generate text
+# Generate text with better sampling parameters
 inputs = tokenizer("The future of AI is", return_tensors="pt")
-outputs = model.generate(**inputs, max_length=50)
+outputs = model.generate(
+    **inputs,
+    max_length=50,
+    do_sample=True,       # Enable sampling
+    temperature=0.8,      # Control randomness
+    top_p=0.9,            # Nucleus sampling
+    pad_token_id=tokenizer.eos_token_id
+)
 print(tokenizer.decode(outputs[0]))
 ```
 
 ## Key Insights
 
-1. **Data Quality > Quantity**: The 40-30-30 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
-2. **Factual Accuracy**: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (40%)
+1. **Data Quality > Quantity**: The 50-30-20 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
+2. **Factual Accuracy**: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (50%)
 3. **Practical Commonsense**: Strong performance on PIQA and WinoGrande shows effective real-world reasoning
 4. **Knowledge Gaps**: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic/scientific knowledge for this scale
 
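The updated snippet switches from greedy decoding to stochastic decoding: `temperature` rescales the logits before the softmax (lower values sharpen the distribution), and `top_p` restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9 (nucleus sampling). A minimal pure-Python sketch of that filtering step, using toy logits rather than a real model:

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return the post-filter sampling distribution as {token_index: prob},
    after temperature scaling and nucleus (top-p) truncation."""
    # Temperature scaling: dividing logits by T < 1 sharpens the distribution
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus truncation: keep the most probable tokens until their
    # cumulative probability first reaches top_p, then renormalize
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

# With these toy logits, only the two highest-probability tokens survive
dist = top_p_filter([2.0, 1.0, 0.1, -1.0])
print(dist)
```

In `model.generate`, the surviving distribution is then sampled from (that is what `do_sample=True` enables); `pad_token_id=tokenizer.eos_token_id` simply silences the warning GPT-2 emits because it ships without a pad token.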