Pringled committed on
Commit 45359ff · verified · 1 Parent(s): 50d4c7e

Upload folder using huggingface_hub

Files changed (1): README.md (+1, -3)
README.md CHANGED

@@ -47,10 +47,8 @@ embeddings = model.encode(["Example sentence"])
 Model2vec creates a small, static model that outperforms other static embedding models by a large margin on all tasks on MTEB. This model is pre-trained using Tokenlearn. It's created using the following steps:
 
 - Distillation: first, a model is distilled from a sentence transformer model using Model2Vec.
-- Training data creation: the sentence transformer model is used to create training data by creating mean output embeddings on a large corpus.
+- Training data creation: the sentence transformer model is used to create training data by creating mean output embeddings on a large corpus. In this case, 2 million sentences from the C4 dataset were used from 101 different languages, sampled using temperature-smoothed sampling proportional to the language size.
 - Training: the distilled model is trained on the training data using Tokenlearn.
-- Post-training re-regularization: after training, the model is re-regularized by weighting the tokens based on their
-- frequency, applying PCA, and finally applying SIF weighting.
 
 The results for this model can be found on the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).
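The "mean output embeddings" in the training-data-creation step can be sketched as follows. This is a minimal illustration only, not the Tokenlearn implementation: in the real pipeline a sentence transformer produces per-token output vectors over the corpus, whereas here the transformer is stubbed out with a hypothetical token-to-vector table.

```python
from typing import Dict, List

def mean_output_embedding(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average per-token output vectors into one sentence-level target vector."""
    vecs = [table[t] for t in tokens if t in table]
    dim = len(next(iter(table.values())))
    if not vecs:
        # No known tokens: fall back to a zero vector of the right width.
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical "output embeddings" for a toy two-token vocabulary.
toy_table = {"example": [1.0, 0.0], "sentence": [0.0, 1.0]}
target = mean_output_embedding(["example", "sentence"], toy_table)
# target == [0.5, 0.5]
```

Each such target then serves as the regression label the distilled static model is trained against.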
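The removed "post-training re-regularization" bullet mentions SIF weighting. As an assumption, the sketch below uses the standard smooth-inverse-frequency formula w(t) = a / (a + p(t)); the exact weighting Tokenlearn applies after PCA may differ.

```python
from collections import Counter
from typing import Dict, List

def sif_weights(corpus_tokens: List[str], a: float = 1e-3) -> Dict[str, float]:
    """Down-weight frequent tokens: w(t) = a / (a + p(t)), with p(t) the corpus frequency."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: a / (a + count / total) for tok, count in counts.items()}

w = sif_weights(["the", "the", "the", "model", "static"])
# Frequent tokens ("the") receive smaller weights than rare ones ("model", "static").
```

Scaling each token embedding by its weight before pooling suppresses high-frequency function words, which is the intent of the re-regularization step described above.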