---
language:
- en
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- beir
- embedding
- leaf-distillation
datasets:
- BeIR
- ms_marco
- wikipedia
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: leaf-embed-beir
  results:
  - task:
      type: Retrieval
    dataset:
      type: BeIR
      name: BEIR
      config: nfcorpus
    metrics:
    - type: ndcg_at_10
      value: 0.0896
---

# LEAF Embed BEIR

A text embedding model trained using **LEAF (Lightweight Embedding Alignment Framework) distillation** to achieve competitive performance on the BEIR benchmark.

## Model Description

This model was created by distilling knowledge from `Snowflake/snowflake-arctic-embed-m-v1.5` (teacher) into a smaller, more efficient student architecture.

### Architecture

| Component | Details |
|-----------|---------|
| **Encoder** | 8-layer BERT, hidden size 512 |
| **Attention Heads** | 8 |
| **Output Dimension** | 768 |
| **Parameters** | ~65M (vs. 109M for the teacher) |
| **Pooling** | Mean pooling |

### Training

- **Method**: LEAF distillation (L2 loss on normalized embeddings)
- **Teacher**: `Snowflake/snowflake-arctic-embed-m-v1.5`
- **Hardware**: NVIDIA B200 GPU on Modal.com
- **Training Data**: 5M samples from BEIR, MS MARCO, and Wikipedia
- **Epochs**: 3
- **Final Teacher-Student Similarity**: 77.2% (see the sketch below)
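
How this similarity is computed is not specified above; a plausible reading is the mean cosine similarity between student and teacher embeddings over held-out text. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> float:
    """Mean per-example cosine similarity between two [batch, dim] embedding batches."""
    s = F.normalize(student_emb, p=2, dim=-1)
    t = F.normalize(teacher_emb, p=2, dim=-1)
    return (s * t).sum(dim=-1).mean().item()
```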

## Usage

### With Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("wolfnuker/leaf-embed-beir")
model = AutoModel.from_pretrained("wolfnuker/leaf-embed-beir")

def mean_pooling(model_output, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example usage
sentences = ["This is an example sentence", "Each sentence is converted to a vector"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # [2, 768]
```
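
Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. Continuing the example above:

```python
# Pairwise cosine similarities between the normalized embeddings above
similarity = embeddings @ embeddings.T
print(similarity)
```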

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("wolfnuker/leaf-embed-beir")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
```
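
For similarity scoring, the standard `sentence_transformers.util` helpers apply. Continuing the example above:

```python
from sentence_transformers import util

# Cosine similarity matrix between the two encoded sentences
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```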

## Evaluation Results

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| NFCorpus | 0.0896 |
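
To reproduce this number, an evaluation run with the [BEIR toolkit](https://github.com/beir-cellar/beir) might look like the sketch below; the dataset URL and batch size are standard BEIR quickstart values, not taken from this card:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load the NFCorpus test split
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Exact (brute-force) dense retrieval with cosine similarity
retriever = EvaluateRetrieval(
    DRES(models.SentenceBERT("wolfnuker/leaf-embed-beir"), batch_size=64),
    score_function="cos_sim",
)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # includes NDCG@10
```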

*Note: this is an initial baseline model. Performance is expected to improve with:*

- More training data and epochs
- IE-specific contrastive training (entity masking, relation pairs)
- Hyperparameter tuning

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 → 2e-8 (cosine decay) |
| Batch Size | 320 (64 × 5 gradient-accumulation steps) |
| Warmup Ratio | 10% |
| Mixed Precision | FP16 |
| Max Sequence Length | 256 |
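
A sketch of how this schedule and the effective batch size compose in plain PyTorch; the optimizer choice, step counts, and the `model`, `loader`, and `compute_loss` names are illustrative assumptions, not taken from the training code:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # optimizer choice assumed

total_steps = 10_000              # illustrative; depends on dataset size
warmup_steps = total_steps // 10  # 10% warmup

# Linear warmup to 2e-5, then cosine decay to a 2e-8 floor
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps, eta_min=2e-8)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])

accum_steps = 5  # 64-sample micro-batches -> effective batch of 320
for step, batch in enumerate(loader):
    loss = compute_loss(batch) / accum_steps  # hypothetical loss helper
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```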

### Loss Function

LEAF uses an L2 (MSE) loss on L2-normalized embeddings:

```
L = MSE(normalize(student_emb), normalize(teacher_emb))
```
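
A minimal PyTorch rendering of this objective (the function name is ours, and the teacher is assumed frozen):

```python
import torch
import torch.nn.functional as F

def leaf_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """MSE between L2-normalized student and teacher embeddings."""
    student = F.normalize(student_emb, p=2, dim=-1)
    teacher = F.normalize(teacher_emb, p=2, dim=-1)
    return F.mse_loss(student, teacher.detach())  # no gradients flow to the teacher
```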

## Limitations

- Trained primarily on English text
- Initial baseline; further tuning is recommended for production use
- Optimized for retrieval; may need adaptation for other tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{leaf-embed-beir,
  author = {RankSaga},
  title = {LEAF Embed BEIR: Text Embeddings via Distillation},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/wolfnuker/leaf-embed-beir}
}
```

## Acknowledgments

- [MongoDB LEAF Paper](https://www.mongodb.com/company/blog/engineering/leaf-distillation-state-of-the-art-text-embedding-models)
- [Snowflake Arctic Embed](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
- [Modal.com](https://modal.com) for GPU compute

## License

Apache 2.0