Update README.md
README.md
CHANGED
@@ -136,11 +136,6 @@ A model to test how MoE will route without square expansion.
### "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"

-<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
-<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
-
-### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
-
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
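The context line above mentions testing "how MoE will route." For readers new to the idea, below is a minimal, hypothetical NumPy sketch of top-k expert routing as described in the linked blog post: a learned gate scores each token against every expert and only the k highest-scoring experts run, which is why a MoE model carries many more parameters than it spends compute on per token. All names and sizes are illustrative; this is not this repository's or Mistral's implementation.

```python
# Minimal top-k MoE routing sketch (illustrative only, not this repo's code).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden, num_experts, top_k = 4, 8, 4, 2

tokens = rng.standard_normal((num_tokens, hidden))
gate_w = rng.standard_normal((hidden, num_experts))           # router (gate) weights
experts = rng.standard_normal((num_experts, hidden, hidden))  # each expert reduced to one matrix for brevity

logits = tokens @ gate_w                          # (num_tokens, num_experts) routing scores
chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts per token

out = np.zeros_like(tokens)
for t in range(num_tokens):
    sel = chosen[t]
    w = np.exp(logits[t, sel] - logits[t, sel].max())
    w /= w.sum()                                  # softmax over the selected experts only
    for weight, e in zip(w, sel):
        out[t] += weight * (tokens[t] @ experts[e])  # only k experts are evaluated per token

print(out.shape)  # (4, 8): same output shape as a dense FFN, at roughly top_k/num_experts of the expert compute
```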