## Evaluation

### Training & Validation Loss

Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.
### Results

The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only `10 billion tokens`, compared to GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` benchmark. This advantage is likely attributable to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual, multi-domain training corpus.
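As a rough illustration of how `HellaSwag` is commonly scored in a zero-shot setting (a hypothetical helper, not this repository's evaluation code): each candidate ending is scored by the model's length-normalized log-likelihood, and the highest-scoring ending is selected.

```python
# Hypothetical sketch of HellaSwag-style multiple-choice scoring.
# Assumes the model has already produced per-token log-probabilities
# for each candidate ending; the real pipeline would compute these
# with a forward pass over context + ending.
def pick_ending(logprobs_per_ending):
    """Return the index of the candidate ending with the highest
    length-normalized (mean per-token) log-likelihood."""
    scores = [sum(lp) / len(lp) for lp in logprobs_per_ending]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example with made-up log-probs for four candidate endings:
candidates = [[-2.1, -1.8], [-0.9, -1.0, -1.1], [-3.0], [-2.5, -2.4]]
print(pick_ending(candidates))  # -> 1 (highest mean log-prob)
```

Length normalization avoids biasing the choice toward shorter endings, which would otherwise accumulate less total negative log-likelihood.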
According to Chinchilla’s scaling laws, the compute-optimal ratio of roughly 20 training tokens per parameter suggests that a 124M-parameter model ideally requires about `2.48 billion tokens` for training. The excess training tokens used for GPT-3 Small might therefore have yielded diminishing returns in performance.
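The Chinchilla figure above follows directly from the ~20-tokens-per-parameter rule of thumb; a minimal sketch (function name is illustrative):

```python
# Sketch: compute-optimal token budget per Chinchilla's ~20:1 rule of thumb.
def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param

params = 124_000_000  # GPT-124M
print(chinchilla_optimal_tokens(params))  # 2_480_000_000 -> ~2.48B tokens
```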
### Key Insights from Evaluation

- **Efficient Training:** The model demonstrates impressive performance relative to its training token count, suggesting efficient use of resources.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge on benchmarks like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance due to scaling limitations.
## Environmental Impact

- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`