---
tags:
- embeddings
- static-embeddings
- sentence-transformers
datasets:
- HuggingFaceFW/fineweb-2
language:
- da
base_model:
- KennethEnevoldsen/dfm-sentence-encoder-large
---

# rasgaard/m2v-dfm-large Model Card
This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of a Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.
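To see why static embeddings are so much faster, here is a toy sketch (the vocabulary, vectors, and dimension are made up for illustration — this is not this model's actual tokenizer or weights): embedding a text is just a table lookup plus a mean, with no transformer forward pass.

```python
import numpy as np

# Toy static embedder: every token maps to one fixed vector, and a text
# embedding is the mean of its token vectors. Vocabulary and dimension
# are invented for the sketch.
rng = np.random.default_rng(0)
vocab = {"hej": 0, "med": 1, "dig": 2}
token_vectors = rng.standard_normal((len(vocab), 4))  # vocab_size x dim

def embed(text: str) -> np.ndarray:
    ids = [vocab[tok] for tok in text.split() if tok in vocab]
    return token_vectors[ids].mean(axis=0)

print(embed("hej med dig").shape)  # (4,)
```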

## Training

Training followed the [tokenlearn usage docs](https://minish.ai/packages/tokenlearn/usage).

Create the features:

```sh
uv run python -m tokenlearn.featurize \
    --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
    --output-dir "./tokenlearn/dfm-large/data" \
    --dataset-path "HuggingFaceFW/fineweb-2" \
    --dataset-name "dan_Latn" \
    --dataset-split "train" \
    --max-means 2000000 \
    --batch-size 512
```
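Conceptually, featurization stores one mean teacher embedding per corpus passage (capped by `--max-means`); these means later serve as regression targets for the static model. The following is only a shape-level sketch of that idea, not tokenlearn's actual code — the "teacher" here is random and the corpus and dimension are invented.

```python
import numpy as np

# Sketch: for each passage, average the teacher's per-token embeddings
# into one target vector ("mean"). The teacher below is a random stand-in.
rng = np.random.default_rng(42)
dim = 8

def teacher_token_embeddings(text: str) -> np.ndarray:
    # Stand-in for a sentence-transformer forward pass: one vector per token.
    return rng.standard_normal((len(text.split()), dim))

corpus = ["hej med dig", "model2vec er hurtig"]
means = np.stack([teacher_token_embeddings(t).mean(axis=0) for t in corpus])
print(means.shape)  # (2, 8)
```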

Then train the model:

```sh
uv run python -m tokenlearn.train \
    --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
    --data-path "./dfm-large/data" \
    --save-path "./dfm-large/model"
```
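The training step can be pictured as fitting one static vector per token so that mean pooling over a passage's tokens reproduces the teacher's passage embedding. With mean pooling this is a linear least-squares problem, sketched below on fake data (this is a conceptual illustration of the objective, not tokenlearn's implementation):

```python
import numpy as np

# Fake teacher targets: one vector per corpus passage.
rng = np.random.default_rng(0)
vocab = {"hej": 0, "med": 1, "dig": 2, "tak": 3}
corpus = ["hej med dig", "hej tak"]
dim = 4
targets = rng.standard_normal((len(corpus), dim))

# Bag-of-tokens design matrix: row i mean-pools the tokens of passage i.
A = np.zeros((len(corpus), len(vocab)))
for i, text in enumerate(corpus):
    ids = [vocab[t] for t in text.split()]
    A[i, ids] = 1.0 / len(ids)

# Solve A @ E ~= targets for the static embedding table E.
E, *_ = np.linalg.lstsq(A, targets, rcond=None)
print(np.allclose(A @ E, targets))  # True
```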
## Installation