samkeet committed · verified · Commit fa72749 · 1 Parent(s): 1a24421

Update README.md

Files changed (1): README.md (+11 −3)
## Evaluation

### Training & Validation Loss
Validation was conducted on `100 million tokens` held out from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves show stable convergence with minimal overfitting: training loss reached a minimum of 2.88, while validation loss stabilized at 2.97.
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/fAwiSHr4f9SmO9PYiCntY.png)
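The validation procedure described above can be sketched as follows. This is an illustrative sketch only, not the repository's actual code: `evaluate`, `model_probs_fn`, and the probability-dict interface are hypothetical stand-ins for a real language model.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the gold next tokens.

    probs: one probability distribution (dict: token -> p) per position
    targets: the gold next token at each position
    """
    nll = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(nll) / len(nll)

def evaluate(model_probs_fn, val_sequences):
    """Average cross-entropy over held-out sequences (the validation loss).

    model_probs_fn(context_tokens) -> next-token distribution; hypothetical
    interface standing in for a real LM forward pass.
    """
    losses = []
    for tokens in val_sequences:
        # Score each next token given its preceding context.
        probs = [model_probs_fn(tokens[:i]) for i in range(1, len(tokens))]
        losses.append(cross_entropy(probs, tokens[1:]))
    return sum(losses) / len(losses)
```

A sanity check of the sketch: a model that is uniform over a 4-token vocabulary should score a loss of ln(4) ≈ 1.386 on any held-out text.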

### Results
The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only `10 billion tokens`, versus the `300 billion` used for GPT-3 Small, GPT-124M outperformed both models on the `HellaSwag` benchmark. This advantage is attributed to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual, multi-domain corpus.
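For context, `HellaSwag` is typically scored by having the model rank four candidate endings for a context and picking the most likely one. A minimal sketch of that scoring loop, with a hypothetical `score_fn` standing in for a real language model's log-likelihood call:

```python
def hellaswag_choice(score_fn, context, endings):
    """Pick the ending with the highest length-normalized log-likelihood.

    score_fn(context, ending) -> total log-prob of the ending given the
    context (hypothetical interface). Normalizing by ending length avoids
    a bias toward short endings.
    """
    best_idx, best_score = 0, float("-inf")
    for i, ending in enumerate(endings):
        tokens = ending.split()  # crude whitespace tokenization for the sketch
        score = score_fn(context, ending) / max(len(tokens), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

def accuracy(predictions, labels):
    """Fraction of examples where the chosen ending matches the gold label."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
```

The reported metric is simply the accuracy of these argmax choices over the evaluation set.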
According to the Chinchilla scaling laws, the compute-optimal budget of roughly 20 training tokens per parameter implies that a 124M-parameter model needs only about `2.48 billion tokens`. The far larger token budget used for GPT-3 Small may therefore have run into diminishing returns.
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/Ne2MYAB2C0yHWFJLjCww3.png)

### Key Insights from Evaluation

- **Efficient Training:** The model performs strongly relative to its modest training token count, suggesting efficient use of compute.

- **Data-Specific Advantage:** Training exclusively on educational content may have given GPT-124M an edge on benchmarks like `HellaSwag`.

- **Scaling Considerations:** Despite its 300B-token training budget, GPT-3 Small does not show proportionally better performance, consistent with diminishing returns beyond the compute-optimal point.
## Environmental Impact

- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`