Commit 525fcdf (verified), committed by wolfnuker
1 Parent(s): 0e3fbf1

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +158 -0
README.md ADDED
@@ -0,0 +1,158 @@
---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
  results:
  - task:
      type: Retrieval
    dataset:
      type: BeIR
      name: BEIR
      config: nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.0896
---

# LEAF Embed BEIR

A text embedding model trained using **LEAF (Lightweight Embedding Alignment Framework) Distillation**, with the goal of competitive retrieval performance on the BEIR benchmark.

## Model Description

This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.

### Architecture

| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT, hidden size 512 |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs. 109M for the teacher) |
| **Pooling** | Mean pooling |

### Training

- **Method**: LEAF Distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples from BEIR, MS MARCO, Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2%

## Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions via the attention mask
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # [2, 768]
```

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```

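Because the Transformers example L2-normalizes its output, the dot product of two embeddings equals their cosine similarity, which is the usual score for ranking documents against a query. A minimal sketch of that ranking step follows. The small hand-written vectors stand in for real model outputs, and `l2_normalize` and `rank_documents` are hypothetical helpers, not part of this repository:

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (same idea as torch.nn.functional.normalize)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def rank_documents(query_emb, doc_embs):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query_emb)
    scores = []
    for idx, doc in enumerate(doc_embs):
        d = l2_normalize(doc)
        scores.append((idx, sum(a * b for a, b in zip(q, d))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins; real embeddings would come from the model.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
print([idx for idx, _ in ranking])
```

With real embeddings, the same ranking can be computed as a single matrix product between the query vector and the document matrix.
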
## Evaluation Results

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |

*Note: This is an initial baseline model. Performance is expected to improve with:*
- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning

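For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 retrieved documents with that of the best possible ordering. The sketch below assumes linear gain and a log2 rank discount. The `dcg_at_k` and `ndcg_at_k` helpers and the relevance labels are illustrative only, and this is not the evaluation code that produced the score in the table:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels (rank 1 first)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query: graded relevance of the retrieved documents, in rank order.
retrieved = [0, 2, 0, 1, 0]
# All judged relevance labels for the query (one relevant document was never retrieved).
judged = [2, 1, 1]
score = ndcg_at_k(retrieved, judged)
print(round(score, 4))
```

A benchmark score is the mean of this per-query value over all test queries.
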
## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 (64 × 5 gradient accumulation) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |

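Combined with the 5M samples and 3 epochs listed under Training, these values imply a step count and a learning-rate curve. The sketch below is a rough reconstruction, not the training script. It assumes linear warmup (the table gives only the warmup ratio) and one scheduler update per optimizer step; `lr_at_step` is a hypothetical helper:

```python
import math

SAMPLES = 5_000_000        # training samples, from the Training section
EPOCHS = 3
EFFECTIVE_BATCH = 64 * 5   # batch of 64 with 5 gradient-accumulation steps = 320
PEAK_LR, FINAL_LR = 2e-5, 2e-8
WARMUP_RATIO = 0.10

total_steps = math.ceil(SAMPLES / EFFECTIVE_BATCH) * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)

def lr_at_step(step):
    """Learning rate at an optimizer step: linear warmup (assumed), then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

print(total_steps, warmup_steps)  # 46875 4687
```

Under these assumptions the rate peaks at 2e-5 when warmup ends and reaches 2e-8 on the final step.
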
### Loss Function

LEAF uses L2 loss on normalized embeddings:

```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```

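Read literally, the formula scales both embeddings to unit length and then averages the squared differences over the embedding dimensions. A dependency-free sketch for a single pair of vectors follows. The function names are illustrative, and the actual training code, which is not included in this README, presumably operates on batched tensors. For unit vectors the summed squared difference equals `2 - 2 * cos(s, t)`, so minimizing this loss pushes the student's cosine similarity with the teacher toward 1:

```python
import math

def normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def leaf_l2_loss(student_emb, teacher_emb):
    """Mean squared error between the normalized student and teacher embeddings."""
    s = normalize(student_emb)
    t = normalize(teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

# Toy vectors chosen for easy arithmetic; real embeddings have 768 dimensions.
student = [0.6, 0.8, 0.0]
teacher = [1.0, 0.0, 0.0]
loss = leaf_l2_loss(student, teacher)

# Identity check: dim * MSE == 2 - 2 * cosine for unit-length vectors.
assert abs(len(student) * loss - (2 - 2 * cosine(student, teacher))) < 1e-9
```
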
## Limitations

- Trained primarily on English text
- Initial baseline; further tuning is recommended before production use
- Optimized for retrieval; may need adaptation for other tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```

## Acknowledgments

- [MongoDB LEAF Paper](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute

## License

Apache 2.0