## Evaluation

### Training & Validation Loss

Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.
### Results

The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only `10 billion tokens`, compared to GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` benchmark. This advantage is likely attributable to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual, multi-domain training corpus.
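As a rough illustration of how `HellaSwag` is commonly scored in a zero-shot setting (a hypothetical helper, not this repository's evaluation code): each candidate ending is scored by the model's length-normalized log-likelihood, and the highest-scoring ending is selected.

```python
# Hypothetical sketch of HellaSwag-style multiple-choice scoring.
# Assumes the model has already produced per-token log-probabilities
# for each candidate ending; the real pipeline would compute these
# with a forward pass over context + ending.
def pick_ending(logprobs_per_ending):
    """Return the index of the candidate ending with the highest
    length-normalized (mean per-token) log-likelihood."""
    scores = [sum(lp) / len(lp) for lp in logprobs_per_ending]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example with made-up log-probs for four candidate endings:
candidates = [[-2.1, -1.8], [-0.9, -1.0, -1.1], [-3.0], [-2.5, -2.4]]
print(pick_ending(candidates))  # -> 1 (highest mean log-prob)
```

Length normalization avoids biasing the choice toward shorter endings, which would otherwise accumulate less total negative log-likelihood.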
According to Chinchilla’s scaling laws, the compute-optimal ratio of roughly 20 training tokens per parameter suggests that a 124M-parameter model ideally requires about `2.48 billion tokens` for training. The excess training tokens used for GPT-3 Small might therefore have yielded diminishing returns in performance.
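The Chinchilla figure above follows directly from the ~20-tokens-per-parameter rule of thumb; a minimal sketch (function name is illustrative):

```python
# Sketch: compute-optimal token budget per Chinchilla's ~20:1 rule of thumb.
def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param

params = 124_000_000  # GPT-124M
print(chinchilla_optimal_tokens(params))  # 2_480_000_000 -> ~2.48B tokens
```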
### Key Insights from Evaluation

- **Efficient Training:** The model demonstrates impressive performance relative to its training token count, suggesting efficient use of resources.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge on benchmarks like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance due to scaling limitations.
## Environmental Impact

- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`