rasgaard committed on
Commit 4d34d4d · verified · 1 Parent(s): 4884e69

Update README.md

Files changed (1): README.md (+31 -0)
README.md CHANGED
@@ -6,12 +6,43 @@ tags:
  - embeddings
  - static-embeddings
  - sentence-transformers
+ datasets:
+ - HuggingFaceFW/fineweb-2
+ language:
+ - da
+ base_model:
+ - KennethEnevoldsen/dfm-sentence-encoder-large
  ---

  # rasgaard/m2v-dfm-large Model Card

  This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of a Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.
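The speed claim comes from inference being a single embedding-table lookup followed by pooling, with no transformer forward pass. A toy numpy sketch of that idea (illustrative only, not the `model2vec` API; the vocabulary and vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a distilled static embedding table: one fixed vector
# per vocabulary token. (A real Model2Vec table is distilled from a
# Sentence Transformer; here it is random.)
vocab = {"statiske": 0, "indlejringer": 1, "er": 2, "hurtige": 3}
table = rng.normal(size=(len(vocab), 8))

def encode(text: str) -> np.ndarray:
    """Embed a text with a table lookup + mean pooling — this is why
    static embedders are so much faster than transformer encoders."""
    ids = [vocab[tok] for tok in text.lower().split() if tok in vocab]
    return table[ids].mean(axis=0)

vec = encode("statiske indlejringer er hurtige")
print(vec.shape)  # (8,)
```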

+ ## Training
+
+ Training followed the tokenlearn [docs](https://minish.ai/packages/tokenlearn/usage).
+
+ Create the features:
+ ```
+ uv run python -m tokenlearn.featurize \
+     --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
+     --output-dir "./tokenlearn/dfm-large/data" \
+     --dataset-path "HuggingFaceFW/fineweb-2" \
+     --dataset-name "dan_Latn" \
+     --dataset-split "train" \
+     --max-means 2000000 \
+     --batch-size 512
+ ```
+
+ and then train the model:
+ ```
+ uv run python -m tokenlearn.train \
+     --model-name "KennethEnevoldsen/dfm-sentence-encoder-large" \
+     --data-path "./dfm-large/data" \
+     --save-path "./dfm-large/model"
+ ```
+

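Conceptually (going by the tokenlearn docs, not its source code), the two steps above first compute a teacher text vector per corpus passage, then fit static token embeddings whose mean pooling reproduces those vectors. A toy least-squares version of the fitting step, with random stand-in targets in place of real teacher features:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_texts = 50, 16, 200

# Stand-in corpus: each "text" is a bag of token ids, paired with the
# vector a teacher Sentence Transformer would assign it (random here).
texts = [rng.integers(0, vocab_size, size=8) for _ in range(n_texts)]
targets = rng.normal(size=(n_texts, dim))

# Averaging matrix A: row i puts weight 1/len on text i's tokens, so
# A @ emb is the mean static embedding of each text.
A = np.zeros((n_texts, vocab_size))
for i, toks in enumerate(texts):
    for tok in toks:
        A[i, tok] += 1 / len(toks)

# Least-squares fit: static token embeddings whose mean pooling best
# reproduces the teacher's text vectors.
emb, *_ = np.linalg.lstsq(A, targets, rcond=None)

mse = float(((A @ emb - targets) ** 2).mean())
print(emb.shape, mse)
```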
  ## Installation