End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_tinystoriesppl:
-- eval_loss: 1.
-- eval_runtime: 13.
-- eval_samples_per_second: 75.
-- eval_steps_per_second: 9.
+- eval_enwikippl: 177.7982
+- eval_frwikippl: 71457.8281
+- eval_zhwikippl: 1401097.25
+- eval_tinystoriesppl: 9.7578
+- eval_loss: 1.1698
+- eval_runtime: 13.1698
+- eval_samples_per_second: 75.932
+- eval_steps_per_second: 9.491
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.
+- learning_rate: 0.01
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
@@ -57,38 +57,38 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
-Peak GPU Memory: 8.
+Peak GPU Memory: 8.0568 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
-| 0 | 0 |
-| 500 | 0.0404 |
-| 1000 | 0.0808 |
-| 1500 | 0.1212 |
-| 2000 | 0.1616 |
-| 2500 | 0.2020 |
-| 3000 | 0.2424 |
-| 3500 | 0.2828 |
-| 4000 | 0.3232 |
-| 4500 | 0.3636 |
-| 5000 | 0.4040 |
-| 5500 | 0.4444 |
-| 6000 | 0.4848 |
-| 6500 | 0.5253 |
-| 7000 | 0.5657 |
-| 7500 | 0.6061 |
-| 8000 | 0.6465 |
-| 8500 | 0.6869 |
-| 9000 | 0.7273 |
-| 9500 | 0.7677 |
-| 10000 | 0.8081 |
-| 10500 | 0.8485 |
-| 11000 | 0.8889 |
-| 11500 | 0.9293 |
-| 12000 | 0.9697 |
-| 12375 | 1.0 |
+| 0 | 0 | 88697.0156 | 150478.2188 | 6.9925 | 13.279 | 75.307 | 9.413 | 69390.6016 | 113346.8047 |
+| 500 | 0.0404 | 4772.3877 | 41781.0312 | 4.4675 | 13.261 | 75.409 | 9.426 | 1755.5269 | 85471.3125 |
+| 1000 | 0.0808 | 159.9700 | 23484.4004 | 2.7209 | 13.208 | 75.712 | 9.464 | 14.2542 | 117503.6719 |
+| 1500 | 0.1212 | 339.4117 | 183525.9688 | 2.5581 | 13.2723 | 75.345 | 9.418 | 11.8515 | 4105987.25 |
+| 2000 | 0.1616 | 331.9490 | 187405.2031 | 1.7211 | 13.2071 | 75.717 | 9.465 | 12.0039 | 1849146.875 |
+| 2500 | 0.2020 | 495.6664 | 785470.0 | 1.6396 | 13.2577 | 75.428 | 9.428 | 11.9702 | 16244978.0 |
+| 3000 | 0.2424 | 303.5385 | 246071.1094 | 1.3174 | 13.2799 | 75.302 | 9.413 | 11.2371 | 5965317.5 |
+| 3500 | 0.2828 | 203.5562 | 98305.3203 | 1.2166 | 13.2691 | 75.363 | 9.42 | 10.0871 | 1967744.0 |
+| 4000 | 0.3232 | 173.5457 | 63847.1445 | 1.1965 | 13.2141 | 75.677 | 9.46 | 9.5835 | 1061893.875 |
+| 4500 | 0.3636 | 166.6617 | 57248.3125 | 1.1922 | 13.2557 | 75.439 | 9.43 | 9.4960 | 928536.9375 |
+| 5000 | 0.4040 | 196.6893 | 119945.6172 | 1.2002 | 13.2052 | 75.728 | 9.466 | 9.5117 | 3237270.25 |
+| 5500 | 0.4444 | 180.5183 | 85077.1328 | 1.1802 | 13.2023 | 75.744 | 9.468 | 9.3612 | 2035016.125 |
+| 6000 | 0.4848 | 173.1228 | 71347.1719 | 1.1743 | 13.2273 | 75.601 | 9.45 | 9.3863 | 1387704.875 |
+| 6500 | 0.5253 | 173.5524 | 73148.3516 | 1.1740 | 13.2503 | 75.47 | 9.434 | 9.2676 | 1480652.625 |
+| 7000 | 0.5657 | 172.1599 | 69698.2734 | 1.1730 | 13.2236 | 75.623 | 9.453 | 9.3360 | 1310343.875 |
+| 7500 | 0.6061 | 174.2867 | 74426.7188 | 1.1723 | 13.2073 | 75.716 | 9.464 | 9.3767 | 1511788.0 |
+| 8000 | 0.6465 | 173.0625 | 69953.9844 | 1.1716 | 13.2511 | 75.465 | 9.433 | 9.4510 | 1417641.375 |
+| 8500 | 0.6869 | 175.9622 | 72553.1562 | 1.1703 | 13.2273 | 75.601 | 9.45 | 9.5974 | 1453256.875 |
+| 9000 | 0.7273 | 177.5917 | 71942.5625 | 1.1698 | 13.2247 | 75.616 | 9.452 | 9.7756 | 1375906.75 |
+| 9500 | 0.7677 | 177.7982 | 71457.8281 | 1.1698 | 13.1698 | 75.932 | 9.491 | 9.7578 | 1401097.25 |
+| 10000 | 0.8081 | 182.3172 | 70652.1328 | 1.1690 | 13.2521 | 75.46 | 9.432 | 10.2168 | 1272107.75 |
+| 10500 | 0.8485 | 184.0270 | 70617.3047 | 1.1694 | 13.2146 | 75.674 | 9.459 | 10.4603 | 1281646.125 |
+| 11000 | 0.8889 | 181.9786 | 70831.5156 | 1.1687 | 13.2442 | 75.505 | 9.438 | 10.2134 | 1352613.125 |
+| 11500 | 0.9293 | 182.2607 | 71593.8438 | 1.1688 | 13.2727 | 75.343 | 9.418 | 10.2172 | 1358399.375 |
+| 12000 | 0.9697 | 181.1417 | 70522.8828 | 1.1687 | 13.2373 | 75.544 | 9.443 | 10.2155 | 1319816.5 |
+| 12375 | 1.0 | 181.2119 | 70612.3203 | 1.1688 | 13.2662 | 75.38 | 9.422 | 10.2206 | 1326170.375 |
 
 ### Framework versions
 - Distily 0.2.0
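For context on the `distillation_objective` listed in the hyperparameters: with `loss_fn=kl` on the logits at weight 1, and the hidden-state and attention components at weight 0, the objective reduces to a plain forward KL divergence between the teacher's and student's output distributions. The sketch below is illustrative only (plain Python, not Distily's actual implementation); `softmax` and `kl_logits_loss` are hypothetical helper names, shown for a single token position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of float logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_logits_loss(student_logits, teacher_logits):
    """Forward KL(teacher || student) for one token position's logits."""
    p = softmax(teacher_logits)  # teacher distribution (the fixed target)
    q = softmax(student_logits)  # student distribution being trained
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give (near-)zero loss; mismatched logits give a positive loss.
assert abs(kl_logits_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])) < 1e-12
assert kl_logits_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0.0
```

In training this per-position loss would be averaged over all token positions in the batch, which is what drives the `eval_loss` column above toward its final value.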
logs/learning_rate=0.01, lr_scheduler_type=linear, warmup_ratio=0.5/events.out.tfevents.1723847774.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0170d1cb74de8a18089ef197819bc686153124057a621bb3a611e10437aa43c3
+size 307