---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
results:
- task:
type: Retrieval
dataset:
type: BeIR
name: BEIR
config: nfcorpus
metrics:
- type: ndcg_at_10
value: 0.0896
---
# LEAF Embed BEIR
A text embedding model trained using **LEAF (Lightweight Embedding Alignment Framework) Distillation** to achieve competitive performance on the BEIR benchmark.
## Model Description
This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.
### Architecture
| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT with 512 hidden size |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs 109M teacher) |
| **Pooling** | Mean pooling |
### Training
- **Method**: LEAF Distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples from BEIR, MS MARCO, Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2%
## Usage
### With Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")
def mean_pooling(model_output, attention_mask):
    # Average token embeddings, using the attention mask to ignore padding tokens
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)
embeddings = mean_pooling(outputs, encoded["attention_mask"])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape) # [2, 768]
```
### With Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```
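Because the embeddings are L2-normalized (as in the Transformers example above), cosine similarity reduces to a plain dot product, which makes ranking documents against a query cheap. A minimal sketch with random placeholder vectors standing in for real `model.encode(...)` outputs:

```python
import numpy as np

# Placeholder embeddings in place of real model outputs;
# this model's embeddings are 768-dimensional.
rng = np.random.default_rng(0)
query = rng.normal(size=768)
docs = rng.normal(size=(3, 768))

# L2-normalize so that cosine similarity reduces to a dot product
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

scores = docs @ query            # cosine similarities, shape (3,)
ranking = np.argsort(-scores)    # indices of best-matching documents first
print(scores, ranking)
```

With `sentence-transformers`, passing `normalize_embeddings=True` to `encode` gives you unit-length vectors directly.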
## Evaluation Results
### BEIR Benchmark
| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |
*Note: This is an initial baseline model. Performance is expected to improve with:*
- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning
## Training Details
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 (64 × 5 gradient accumulation) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |
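The learning-rate schedule in the table (10% linear warmup, then cosine decay from 2e-5 down to 2e-8) can be sketched as a standalone function. The step count below is illustrative, not taken from the actual training run:

```python
import math

def lr_at(step, total_steps, peak=2e-5, floor=2e-8, warmup_ratio=0.10):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

total = 10_000  # illustrative, not the real step count
print(lr_at(0, total), lr_at(int(total * 0.10), total), lr_at(total - 1, total))
```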
### Loss Function
LEAF uses L2 loss on normalized embeddings:
```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```
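A minimal PyTorch sketch of this loss (tensor shapes are illustrative; the actual training code is not published here). Because both embeddings are normalized first, the loss is invariant to the scale of either model's outputs:

```python
import torch
import torch.nn.functional as F

def leaf_loss(student_emb, teacher_emb):
    """MSE between L2-normalized student and teacher embeddings."""
    s = F.normalize(student_emb, p=2, dim=1)
    t = F.normalize(teacher_emb, p=2, dim=1)
    return F.mse_loss(s, t)

# Random placeholder batches: batch of 4, both models projecting to 768 dims
student = torch.randn(4, 768)
teacher = torch.randn(4, 768)
print(leaf_loss(student, teacher))        # positive scalar
print(leaf_loss(teacher * 3.0, teacher))  # ~0: normalization removes scaling
```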
## Limitations
- Trained primarily on English text
- Initial baseline; further tuning is recommended for production use
- Optimized for retrieval; may need adaptation for other tasks
## Citation
If you use this model, please cite:
```bibtex
@misc{leaf-embed-beir,
  author    = {RankSaga},
  title     = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```
## Acknowledgments
- [MongoDB LEAF Paper](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute
## License
Apache 2.0