---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
  results:
  - task:
      type: Retrieval
    dataset:
      type: BeIR
      name: BEIR
      config: nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.0896
---

# LEAF Embed BEIR

A text embedding model trained with **LEAF (Lightweight Embedding Alignment Framework) distillation** for retrieval tasks and evaluated on the BEIR benchmark.

## Model Description

This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.

### Architecture

| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT with 512 hidden size |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs 109M teacher) |
| **Pooling** | Mean pooling |

### Training

- **Method**: LEAF Distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples drawn from BEIR, MS MARCO, and Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2%

## Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # expected: torch.Size([2, 768])
```

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```
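
Since the model is aimed at retrieval, a common next step is to rank documents against a query by cosine similarity. The sketch below is framework-free and uses tiny hand-written vectors in place of real model outputs; `rank_documents`, `l2_normalize`, and the toy vectors are illustrative assumptions, not part of this repository.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_documents(query_emb, doc_embs):
    """Return (doc_index, cosine_similarity) pairs, best match first."""
    q = l2_normalize(query_emb)
    scores = []
    for idx, doc in enumerate(doc_embs):
        d = l2_normalize(doc)
        # For unit vectors the dot product equals the cosine similarity.
        scores.append((idx, sum(a * b for a, b in zip(q, d))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
print([idx for idx, _ in ranking])
```

With real embeddings, the same function should work on the vectors returned by `model.encode` once they are converted to plain lists; for large corpora a vectorized or approximate nearest-neighbor search would be the more practical choice.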

## Evaluation Results

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |

*Note: This is an initial baseline model. Performance is expected to improve with:*
- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning
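
For readers unfamiliar with the metric in the table above, here is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes linear gain (the relevance grade itself) and a log2 rank discount; some implementations use an exponential gain instead, and evaluation toolkits may differ in other details. The function names and the example grades are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    # Rank positions are 1-based, so 0-based index i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query: two judged-relevant documents with grades 2 and 1.
judged = [2, 1]
# Grades of the documents the system returned, in rank order (0 = not relevant).
ranked = [0, 1, 0, 2]
score = ndcg_at_k(ranked, judged)
print(round(score, 4))
```

A benchmark score is then the mean of this per-query value over all queries in the dataset.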

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 effective (64 per step × 5 gradient accumulation steps) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |
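
To make the learning-rate row concrete, the sketch below reproduces a warmup-plus-cosine schedule from the values in the table. Several details are assumptions rather than facts about the actual training run: that warmup is linear from zero, that the 10% warmup ratio is taken over all optimizer steps, and that the step count follows from 5M samples, 3 epochs, and an effective batch size of 320 with no samples dropped.

```python
import math

PEAK_LR = 2e-5       # from the hyperparameter table
FINAL_LR = 2e-8      # from the hyperparameter table
WARMUP_RATIO = 0.10  # from the hyperparameter table

# Assumption: 5M samples x 3 epochs at an effective batch size of 320.
TOTAL_STEPS = (5_000_000 * 3) // 320

def learning_rate(step, total_steps=TOTAL_STEPS):
    """Linear warmup to PEAK_LR, then cosine decay down to FINAL_LR."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from 0 to the peak.
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

# Print the rate at the start, end of warmup, midpoint, and final step.
for step in (0, TOTAL_STEPS // 10, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.3e}")
```

Under these assumptions the run has 46,875 optimizer steps, of which roughly the first 4,687 are warmup.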

### Loss Function

Training minimizes the mean squared error (an L2 loss) between the L2-normalized student and teacher embeddings:

```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```
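
The formula above can be illustrated with a small, dependency-free sketch for a single student/teacher pair. It assumes the mean is taken over the embedding dimension; the function names and the example vectors are hypothetical, and an actual training loop would do the same arithmetic on batched tensors. The example also checks a useful identity: for unit vectors, the squared distance equals `2 - 2 * cos`, so this loss is a direct function of the cosine similarity between student and teacher.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def leaf_style_loss(student_emb, teacher_emb):
    """MSE between the L2-normalized student and teacher embeddings."""
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def cosine_similarity(u, v):
    """Cosine similarity computed as the dot product of unit vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 4-dimensional embeddings for one input text.
student = [0.2, -0.5, 0.1, 0.7]
teacher = [0.3, -0.4, 0.0, 0.8]
loss = leaf_style_loss(student, teacher)
cos = cosine_similarity(student, teacher)
# For unit vectors, ||s - t||^2 = 2 - 2*cos(s, t), so the MSE is (2 - 2*cos) / d.
print(loss, (2 - 2 * cos) / len(student))
```

Because both embeddings are normalized first, the loss ignores vector magnitude and penalizes only the difference in direction.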

## Limitations

- Trained primarily on English text; quality on other languages is untested
- Initial baseline: further tuning is recommended before production use
- Optimized for retrieval and may need adaptation for other tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```

## Acknowledgments

- [MongoDB LEAF blog post](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute

## License

Apache 2.0