---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
  results:
  - task:
      type: Retrieval
    dataset:
      type: BeIR
      name: BEIR
      config: nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.0896
---

# LEAF Embed BEIR

A text embedding model trained using **LEAF (Lightweight Embedding Alignment Framework) Distillation**, with the goal of competitive performance on the BEIR benchmark.

## Model Description

This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.

### Architecture

| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT with 512 hidden size |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs 109M teacher) |
| **Pooling** | Mean pooling |

### Training

- **Method**: LEAF Distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples from BEIR, MS MARCO, Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2%

## Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions.
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Clamp the denominator so an all-padding row cannot divide by zero.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
# L2-normalize so that dot products equal cosine similarities.
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # [2, 768]
```

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```

## Evaluation Results

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |

*Note: This is an initial baseline model. Performance is expected to improve with:*

- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 (64 × 5 gradient accumulation) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |

### Loss Function

LEAF uses L2 loss on normalized embeddings:

```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```

## Limitations

- Trained primarily on English text
- Initial baseline; further tuning is recommended before production use
- Optimized for retrieval; other tasks may need adaptation

## Citation

If you use this model, please cite:

```bibtex
@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```

## Acknowledgments

- [MongoDB LEAF Paper](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute

## License

Apache 2.0
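
## Appendix: Loss Function Sketch

The formula in the Loss Function section can be illustrated with a small standard-library sketch. This is not the training code used for this model. The helper names (`l2_normalize`, `leaf_style_loss`, `cosine_similarity`) and the toy vectors are hypothetical, and it assumes that `MSE` averages the squared differences over the embedding dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (the floor guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]


def leaf_style_loss(student_emb, teacher_emb):
    """MSE between L2-normalized student and teacher vectors.

    Assumption: the mean is taken over embedding dimensions.
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


# Toy 4-dimensional vectors (hypothetical values, not real model outputs).
student = [0.2, -0.1, 0.4, 0.3]
teacher = [0.25, -0.05, 0.35, 0.3]

loss = leaf_style_loss(student, teacher)
cos = cosine_similarity(student, teacher)

# For unit vectors, ||s - t||^2 = 2 - 2*cos(s, t), so MSE = (2 - 2*cos) / dim.
print(f"loss={loss:.6f} cosine={cos:.6f} check={(2 - 2 * cos) / len(student):.6f}")
```

Because both vectors are normalized first, the loss depends only on the angle between student and teacher embeddings, so minimizing it is equivalent to maximizing their cosine similarity.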