### Key Insights from Evaluation
- **Efficient Training:** The model demonstrates impressive performance relative to its training token count, suggesting an efficient use of resources, aided by training with the Distributed Data Parallel (DDP) technique.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge in evaluation metrics like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance due to scaling limitations.
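The scaling point above can be made concrete with rough arithmetic. This is an illustrative sketch, not a measurement from this repo: the ~20 tokens-per-parameter figure is the commonly cited Chinchilla-optimal ratio, and GPT-3 Small's ~125M parameters trained on 300B tokens are its commonly reported figures.

```python
# Illustrative only: the numbers below are assumptions, not repo measurements.
# Chinchilla suggests roughly 20 training tokens per parameter as
# compute-optimal; GPT-3 Small is commonly reported as ~125M parameters
# trained on 300B tokens.

CHINCHILLA_RATIO = 20.0  # approximate compute-optimal tokens per parameter


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


ratio = tokens_per_param(300e9, 125e6)  # GPT-3 Small
print(f"GPT-3 Small: {ratio:.0f} tokens/param, "
      f"{ratio / CHINCHILLA_RATIO:.0f}x the Chinchilla-optimal ratio")
# -> GPT-3 Small: 2400 tokens/param, 120x the Chinchilla-optimal ratio
```

Training two orders of magnitude past the compute-optimal ratio is consistent with the diminishing returns noted in the last bullet.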
## Environmental Impact