+---
+language:
+- en
+license: apache-2.0
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- mteb
+- beir
+- embedding
+- leaf-distillation
+datasets:
+- BeIR
+- ms_marco
+- wikipedia
+pipeline_tag: feature-extraction
+library_name: transformers
+model-index:
+- name: leaf-embed-beir
+  results:
+  - task:
+      type: Retrieval
+    dataset:
+      type: BeIR
+      name: BEIR
+      config: nfcorpus
+    metrics:
+    - type: ndcg_at_10
+      value: 0.0896
+---
+# LEAF Embed BEIR
+A text embedding model trained with **LEAF (Lightweight Embedding Alignment Framework) distillation** for retrieval on the BEIR benchmark. This release is an initial baseline (NDCG@10 of 0.0896 on NFCorpus).
+## Model Description
+This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.
+### Architecture
+| Component | Details |
+|-----------|---------|
+| **Encoder** | 8-layer BERT with 512 hidden size |
+| **Attention Heads** | 8 |
+| **Output Dimension** | 768 |
+| **Parameters** | ~65M (vs 109M teacher) |
+| **Pooling** | Mean pooling |
+### Training
+- **Method**: LEAF Distillation (L2 loss on normalized embeddings)
+- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
+- **Hardware**: NVIDIA B200 GPU on Modal.com
+- **Training Data**: 5M samples from BEIR, MS MARCO, Wikipedia
+- **Epochs**: 3
+- **Final Teacher-Student Similarity**: 77.2%
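
The similarity figure above is not defined further in this card. Assuming it is the mean cosine similarity between teacher and student embeddings of the same texts, it could be computed along these lines. The function names and the toy vectors are illustrative only and are not taken from the training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_similarity(student_embs, teacher_embs):
    """Average cosine similarity over paired student/teacher embeddings."""
    sims = [cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(sims) / len(sims)

# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
score = mean_cosine_similarity(student, teacher)
print(f"{score:.3f}")
```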
+## Usage
+### With Transformers
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
+model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")
+
+def mean_pooling(model_output, attention_mask):
+    """Average token embeddings, ignoring padding positions."""
+    token_embeddings = model_output.last_hidden_state
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+# Example usage
+sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
+# max_length=256 matches the training sequence length
+encoded = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**encoded)
+    embeddings = mean_pooling(outputs, encoded["attention_mask"])
+    # L2-normalize so that dot product equals cosine similarity
+    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
+print(embeddings.shape)  # torch.Size([2, 768])
+```
+### With Sentence-Transformers
+```python
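+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("wolfnuker/leaf-embed-beir")
+sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
+# normalize_embeddings=True returns unit-length vectors, matching the Transformers example
+embeddings = model.encode(sentences, normalize_embeddings=True)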
+```
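
The Transformers example above L2-normalizes its embeddings. For such unit vectors the dot product equals cosine similarity, so retrieval reduces to sorting documents by their dot product with the query. A minimal sketch with hand-written unit vectors in place of real model output; `rank_documents` is a hypothetical helper, not part of either library.

```python
def rank_documents(query_emb, doc_embs):
    """Return (index, score) pairs sorted by descending dot product."""
    scores = [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Hand-written unit vectors standing in for normalized model embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_documents(query, docs)
print(ranking)
```

In practice the query and document vectors would come from the encoding calls shown above, and a vector index would replace the linear scan for large corpora.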
+## Evaluation Results
+### BEIR Benchmark
+| Dataset | NDCG@10 |
+|---------|---------|
+| NFCorpus | 0.0896 |
+*Note: This is an initial baseline model. Performance is expected to improve with:*
+- More training data and epochs
+- IE-specific contrastive training (entity masking, relation pairs)
+- Hyperparameter tuning
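
For readers unfamiliar with the metric in the table, NDCG@10 is the discounted cumulative gain of the top 10 results divided by that of an ideal ordering. The sketch below uses the common linear-gain form with a log2 discount. The evaluation tooling behind the reported score is not stated in this card and may use a different gain convention, and the relevance labels here are made up for illustration.

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(all_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Made-up example: relevance labels of the returned documents in rank order,
# and the labels of every judged document for the same query.
ranked = [0, 2, 0, 1]
judged = [2, 1, 1, 0, 0]
score = ndcg_at_k(ranked, judged)
print(f"{score:.4f}")
```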
+## Training Details
+### Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
+| Batch Size | 320 effective (64 per step × 5 gradient accumulation steps) |
+| Warmup Ratio | 10% |
+| Mixed Precision | FP16 |
+| Max Sequence Length | 256 |
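
The schedule in the table can be sketched as a warmup over the first 10% of steps followed by cosine decay from 2e-5 to 2e-8. The linear shape of the warmup, the function name `lr_at`, and the step count are assumptions; the card does not state how the warmup ramps or how many optimizer steps were run.

```python
import math

PEAK_LR = 2e-5       # from the table
FINAL_LR = 2e-8      # from the table
WARMUP_RATIO = 0.10  # from the table

def lr_at(step, total_steps):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

# Effective batch size: 64 samples per step, accumulated over 5 steps.
effective_batch = 64 * 5
total = 1000  # hypothetical number of optimizer steps
for s in (0, 99, 100, 550, 1000):
    print(s, f"{lr_at(s, total):.3e}")
```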
+### Loss Function
+LEAF distillation minimizes the mean squared error (an L2 loss) between the L2-normalized student and teacher embeddings:
+```
+L = MSE(normalize(student_emb), normalize(teacher_emb))
+```
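
To make the formula concrete, here is a dependency-free sketch of the same computation for a single pair of embeddings. For unit vectors the squared distance equals 2 − 2·cos, so minimizing this loss pushes the cosine similarity toward 1. The helper names are illustrative, and whether the training code averages over dimensions in exactly this way is an assumption.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def leaf_loss(student_emb, teacher_emb):
    """Mean squared error between the L2-normalized embeddings."""
    s, t = normalize(student_emb), normalize(teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

student = [3.0, 4.0]
teacher = [4.0, 3.0]
loss = leaf_loss(student, teacher)

# Cross-check against the identity ||s - t||^2 = 2 - 2 * cos(s, t).
s, t = normalize(student), normalize(teacher)
cos = sum(a * b for a, b in zip(s, t))
print(loss, (2 - 2 * cos) / len(s))
```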
+## Limitations
+- Trained primarily on English text
+- Initial baseline; further tuning is recommended before production use
+- Optimized for retrieval; other tasks may need adaptation
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{leaf-embed-beir,
+  author = {RankSaga},
+  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
+}
+```
+## Acknowledgments
+- [MongoDB LEAF blog post](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
+- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
+- [Modal.com](https://modal.com) for GPU compute
+## License
+Apache 2.0