rasgaard committed on
Commit 4d34d4d · verified · 1 Parent(s): 4884e69

Update README.md

Files changed (1): README.md (+31 -0)
README.md CHANGED
@@ -6,12 +6,43 @@ tags:
  - embeddings
  - static-embeddings
  - sentence-transformers
+ datasets:
+ - HuggingFaceFW/fineweb-2
+ language:
+ - da
+ base_model:
+ - KennethEnevoldsen/dfm-sentence-encoder-large
  ---

  # rasgaard/m2v-dfm-large Model Card

  This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of a Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.
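The speed claim comes from inference being a single embedding-table lookup followed by pooling, with no transformer forward pass. A toy numpy sketch of that idea (illustrative only, not the `model2vec` API; the vocabulary and vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a distilled static embedding table: one fixed vector
# per vocabulary token. (A real Model2Vec table is distilled from a
# Sentence Transformer; here it is random.)
vocab = {"statiske": 0, "indlejringer": 1, "er": 2, "hurtige": 3}
table = rng.normal(size=(len(vocab), 8))

def encode(text: str) -> np.ndarray:
    """Embed a text with a table lookup + mean pooling — this is why
    static embedders are so much faster than transformer encoders."""
    ids = [vocab[tok] for tok in text.lower().split() if tok in vocab]
    return table[ids].mean(axis=0)

vec = encode("statiske indlejringer er hurtige")
print(vec.shape)  # (8,)
```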

+ ## Training
+
+ Training followed the tokenlearn [docs](https://minish.ai/packages/tokenlearn/usage).
+
+ Create the features:
+ ```
+ uv run python -m tokenlearn.featurize \
+     --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
+     --output-dir "./tokenlearn/dfm-large/data" \
+     --dataset-path "HuggingFaceFW/fineweb-2" \
+     --dataset-name "dan_Latn" \
+     --dataset-split "train" \
+     --max-means 2000000 \
+     --batch-size 512
+ ```
+
+ and then train the model:
+ ```
+ uv run python -m tokenlearn.train \
+     --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
+     --data-path "./dfm-large/data" \
+     --save-path "./dfm-large/model"
+ ```
+

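Conceptually (going by the tokenlearn docs, not its source code), the two steps above first compute a teacher text vector per corpus passage, then fit static token embeddings whose mean pooling reproduces those vectors. A toy least-squares version of the fitting step, with random stand-in targets in place of real teacher features:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_texts = 50, 16, 200

# Stand-in corpus: each "text" is a bag of token ids, paired with the
# vector a teacher Sentence Transformer would assign it (random here).
texts = [rng.integers(0, vocab_size, size=8) for _ in range(n_texts)]
targets = rng.normal(size=(n_texts, dim))

# Averaging matrix A: row i puts weight 1/len on text i's tokens, so
# A @ emb is the mean static embedding of each text.
A = np.zeros((n_texts, vocab_size))
for i, toks in enumerate(texts):
    for tok in toks:
        A[i, tok] += 1 / len(toks)

# Least-squares fit: static token embeddings whose mean pooling best
# reproduces the teacher's text vectors.
emb, *_ = np.linalg.lstsq(A, targets, rcond=None)

mse = float(((A @ emb - targets) ** 2).mean())
print(emb.shape, mse)
```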
  ## Installation