Fix dataset mix to 50-30-20 and update code snippet with better generation params
README.md (changed)

@@ -42,7 +42,7 @@ datasets:

# GPT-2 70M - Optimal Dataset Mixing

A 70M parameter GPT-2 model trained on 1 billion tokens using an optimized 50-30-20 dataset mixing strategy.
## Model Description

@@ -64,7 +64,7 @@ The model was trained on **1 billion tokens** with the following composition:

- **30%** - DCLM Baseline (300M tokens): Filtered web content
- **20%** - FineWeb-Edu (200M tokens): Educational web content
This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
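
As a rough illustration of what a budgeted 50-30-20 mix involves, the split can be sketched in plain Python. The dataset keys and the sampling helper below are hypothetical placeholders, since this card does not show the actual training pipeline:

```python
import random

# Illustrative 50-30-20 mix; the keys are hypothetical placeholders,
# not the identifiers used in the real training pipeline.
MIX = {"finepdf": 0.5, "dclm_baseline": 0.3, "fineweb_edu": 0.2}

def token_budgets(total_tokens: int, mix: dict) -> dict:
    """Split a total token budget according to the mixing ratios."""
    return {name: round(total_tokens * frac) for name, frac in mix.items()}

def sample_source(mix: dict, rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names, weights = zip(*mix.items())
    return rng.choices(names, weights=weights, k=1)[0]

budgets = token_budgets(1_000_000_000, MIX)
print(budgets)  # 500M / 300M / 200M tokens
```

Sampling per document rather than concatenating the datasets keeps the three sources interleaved throughout training.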
## Training Details

@@ -106,16 +106,23 @@

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")
# Generate text with better sampling parameters
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,        # Enable sampling
    temperature=0.8,       # Control randomness
    top_p=0.9,             # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))
```
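
To make the sampling parameters above concrete, here is a small self-contained sketch, independent of the transformers library, of how temperature scaling and nucleus (top-p) filtering reshape a toy next-token distribution:

```python
import math

def softmax_with_temperature(logits, temperature=0.8):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.8)
print(top_p_filter(probs, top_p=0.9))
```

Temperature sharpens or flattens the distribution before the top-p cutoff is applied, so together the two settings trade diversity against coherence.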
## Key Insights
1. **Data Quality > Quantity**: The 50-30-20 mixing strategy demonstrates that careful dataset composition can achieve strong performance with significantly reduced compute
2. **Factual Accuracy**: The model excels at truthfulness (TruthfulQA), likely due to high-quality FinePDF content (50%)
3. **Practical Commonsense**: Strong performance on PIQA and WinoGrande shows effective real-world reasoning
4. **Knowledge Gaps**: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic/scientific knowledge for this scale