| Training Loss | ~6.0 | ~2.0 | 1.98 |
| Perplexity | ~400+ | ~7-10 | 7.29 |

> [!NOTE]
> I don't know why the logging starts at step 4.6k.

How do **i3-22m** and **i3-80m** compare?

The model shows strong convergence with stable training dynamics and efficient GPU utilization.
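As a quick sanity check, the reported perplexity should be roughly the exponential of the final cross-entropy loss (assuming the loss is per-token cross-entropy in nats); a minimal sketch:

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp of the per-token cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_loss)

# Final training loss from the table above.
final_loss = 1.98
print(round(perplexity(final_loss), 2))  # ~7.24, consistent with the reported 7.29
```

The small gap between exp(1.98) ≈ 7.24 and the reported 7.29 is expected, since the logged loss and perplexity are typically averaged over different steps.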

## Usage