Upload folder using huggingface_hub
README.md CHANGED

@@ -47,10 +47,8 @@ embeddings = model.encode(["Example sentence"])
 Model2vec creates a small, static model that outperforms other static embedding models by a large margin on all tasks on MTEB. This model is pre-trained using Tokenlearn. It's created using the following steps:
 
 - Distillation: first, a model is distilled from a sentence transformer model using Model2Vec.
-- Training data creation: the sentence transformer model is used to create training data by creating mean output embeddings on a large corpus.
+- Training data creation: the sentence transformer model is used to create training data by creating mean output embeddings on a large corpus. In this case, 2 million sentences from the C4 dataset were used from 101 different languages, sampled using temperature-smoothed sampling proportional to the language size.
 - Training: the distilled model is trained on the training data using Tokenlearn.
-- Post-training re-regularization: after training, the model is re-regularized by weighting the tokens based on their
-  frequency, applying PCA, and finally applying SIF weighting.
 
 The results for this model can be found on the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).
 
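The "temperature-smoothed sampling proportional to the language size" mentioned in the added line can be sketched as follows. This is a minimal illustration, not code from Tokenlearn or Model2Vec: the exponent convention (raising raw language proportions to a power below 1) and the sentence counts are assumptions made for the example.

```python
import numpy as np

def temperature_smoothed_probs(counts, alpha=0.3):
    """Turn per-language corpus sizes into sampling probabilities.

    Raw proportions are raised to the power `alpha` (0 < alpha < 1) and
    renormalized. This flattens the distribution: low-resource languages
    are sampled more often than their raw share of the corpus, while the
    ordering by language size is preserved.
    """
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()        # raw proportions
    smoothed = probs ** alpha            # temperature smoothing
    return smoothed / smoothed.sum()     # renormalize to a distribution

# Hypothetical sentence counts for a large, a medium, and a small language.
counts = [1_000_000, 100_000, 10_000]
print(temperature_smoothed_probs(counts))
```

With `alpha=1` this reduces to sampling exactly proportional to corpus size; smaller values pull the distribution toward uniform.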