---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
  results:
  - task:
      type: Retrieval
    dataset:
      type: BeIR
      name: BEIR
      config: nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.0896
---

# LEAF Embed BEIR

A text embedding model trained using **LEAF (Lightweight Embedding Alignment Framework) distillation** to achieve competitive performance on the BEIR benchmark.

## Model Description

This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.

### Architecture

| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT, hidden size 512 |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs. 109M for the teacher) |
| **Pooling** | Mean pooling |

### Training

- **Method**: LEAF distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples from BEIR, MS MARCO, and Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2% (see the sketch below)
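
How this similarity is computed is not specified above; a plausible reading is the mean cosine similarity between student and teacher embeddings over held-out text. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> float:
    """Mean per-example cosine similarity between two [batch, dim] embedding batches."""
    s = F.normalize(student_emb, p=2, dim=-1)
    t = F.normalize(teacher_emb, p=2, dim=-1)
    return (s * t).sum(dim=-1).mean().item()
```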

## Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")

def mean_pooling(model_output, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # [2, 768]
```
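
Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. Continuing the example above:

```python
# Pairwise cosine similarities between the normalized embeddings above
similarity = embeddings @ embeddings.T
print(similarity)
```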

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```
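
For similarity scoring, the standard `sentence_transformers.util` helpers apply. Continuing the example above:

```python
from sentence_transformers import util

# Cosine similarity matrix between the two encoded sentences
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```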

## Evaluation Results

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |
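
To reproduce this number, an evaluation run with the [BEIR toolkit](https://github.com/beir-cellar/beir) might look like the sketch below; the dataset URL and batch size are standard BEIR quickstart values, not taken from this card:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the NFCorpus test split
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Exact (brute-force) dense retrieval with cosine similarity
retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("wolfnuker/leaf-embed-beir"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # includes NDCG@10
```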

*Note: this is an initial baseline model. Performance is expected to improve with:*

- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 (64 × 5 gradient-accumulation steps) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |
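
A sketch of how this schedule and the effective batch size compose in plain PyTorch; the optimizer choice, step counts, and the `model`, `loader`, and `compute_loss` names are illustrative assumptions, not taken from the training code:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # optimizer choice assumed

total_steps = 10_000              # illustrative; depends on dataset size
warmup_steps = total_steps // 10  # 10% warmup

# Linear warmup to 2e-5, then cosine decay to a 2e-8 floor
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=2e-8)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])

accum_steps = 5  # 64-sample micro-batches -> effective batch of 320
for step, batch in enumerate(loader):
    loss = compute_loss(batch) / accum_steps  # hypothetical loss helper
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```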

### Loss Function

LEAF uses an L2 (MSE) loss on L2-normalized embeddings:

```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```
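
A minimal PyTorch rendering of this objective (the function name is ours, and the teacher is assumed frozen):

```python
import torch
import torch.nn.functional as F

def leaf_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """MSE between L2-normalized student and teacher embeddings."""
    student = F.normalize(student_emb, p=2, dim=-1)
    teacher = F.normalize(teacher_emb, p=2, dim=-1)
    return F.mse_loss(student, teacher.detach())  # no gradients flow to the teacher
```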

## Limitations

- Trained primarily on English text
- Initial baseline; further tuning is recommended for production use
- Optimized for retrieval; may need adaptation for other tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```

## Acknowledgments

- [MongoDB LEAF Paper](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute

## License

Apache 2.0