Doge uses `wsd_scheduler` as the training scheduler, which divides the learning rate into three stages: `warmup`, `stable`, and `decay`. This allows training to be resumed from any checkpoint in the `stable` stage without causing a loss rebound.

Here are the initial learning rates required to continue training from each checkpoint:

- **Doge-20M**: 8e-3
- **Doge-60M**: 6e-3
- **Doge-160M**: 4e-3
- **Doge-320M**: 2e-3
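A warmup-stable-decay schedule of this kind can be sketched as a simple piecewise function. This is an illustrative sketch, not the actual `wsd_scheduler` implementation; the function name, stage lengths, and linear ramp shapes here are assumptions for demonstration.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Illustrative warmup-stable-decay (WSD) learning-rate schedule.

    Three stages: linear warmup to max_lr, a long constant (stable)
    plateau, then linear decay to min_lr. Resuming from a checkpoint
    taken during the stable stage simply restarts at max_lr, which is
    why the listed per-model learning rates are the stable-stage values.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold max_lr constant; checkpoints here resume cleanly.
        return max_lr
    # Decay: ramp linearly from max_lr down to min_lr, then clamp.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return max_lr - (max_lr - min_lr) * progress


# Example: resuming a hypothetical Doge-20M run mid-way through the
# stable stage yields the listed initial learning rate of 8e-3.
lr = wsd_lr(step=500, max_lr=8e-3, warmup_steps=100,
            stable_steps=800, decay_steps=100)
print(lr)  # 0.008
```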