# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings
## Model Overview
**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. This model is optimized for biomedical and life sciences applications, providing high-quality semantic representations for scientific literature.
## Key Features
- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Optimized training with FP16, gradient clipping, and cosine scheduling
## Model Details
### Base Model
- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens
- **Parameters**: ~110M
### Training Configuration
- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers)
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient-accumulation steps)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimization**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB
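The training script itself is not part of this card. As a rough sketch, the configuration above maps onto the sentence-transformers v3 trainer as follows (the CSV file name and column names are placeholders, and sentence-transformers >= 3.0 is assumed for this API):
```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model (see "Base Model" above)
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Hypothetical pair file with "anchor" and "positive" text columns;
# the actual 26.5M-pair dataset is not distributed with this card
train_dataset = load_dataset("csv", data_files="pubmed_pairs.csv", split="train")

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-negative-sampling",
    num_train_epochs=3,
    per_device_train_batch_size=32,   # 32 × 4 GPUs × 2 accumulation = 256 effective
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,                # AdamW weight decay
    max_grad_norm=5.0,                # gradient clipping (see "Optimization Features")
    fp16=True,
    seed=42,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```
The actual run additionally evaluated a held-out split every 1,000 steps (see "Training Process" below).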
## Dataset
### Source Data
- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from biomedical literature
- **Quality**: 99.7% retention after filtering (128-6,000 character abstracts)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test
### Data Processing
- Error pattern removal and quality filtering
- Balanced train/validation/test splits
- Character length filtering for optimal training
- Duplicate detection and removal
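The preprocessing code is likewise not published; below is a minimal sketch of the character-length filter and duplicate removal listed above, assuming records arrive as dicts with `title` and `abstract` fields:
```python
def filter_and_dedup(records):
    """Keep abstracts of 128-6,000 characters and drop exact duplicates.

    `records` is assumed to be an iterable of dicts with "title" and
    "abstract" keys; the published pipeline may differ in detail.
    """
    seen = set()
    kept = []
    for rec in records:
        abstract = rec["abstract"].strip()
        if not (128 <= len(abstract) <= 6000):
            continue  # character-length filter (see above)
        key = (rec["title"].strip(), abstract)
        if key in seen:
            continue  # duplicate removal
        seen.add(key)
        kept.append(rec)
    return kept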
## Performance & Use Cases
### Intended Applications
- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents (see the sketch after this list)
- **Similarity Analysis**: Comparing research abstracts and papers
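As a concrete example of the document-classification use case, the embeddings can feed a lightweight downstream classifier. A minimal sketch with scikit-learn, using illustrative toy texts and labels:
```python
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/soda-vec-negative-sampling")

# Illustrative toy data: 0 = oncology, 1 = neuroscience
train_texts = [
    "Tumor suppressor gene mutations in breast cancer",
    "Chemotherapy resistance in ovarian carcinoma",
    "Synaptic plasticity in the hippocampus",
    "Dopaminergic signaling in Parkinson's disease",
]
train_labels = [0, 0, 1, 1]

clf = LogisticRegression().fit(model.encode(train_texts), train_labels)
prediction = clf.predict(model.encode(["Amyloid pathology in cortical neurons"]))
print(prediction)  # expected: [1] (neuroscience)
```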
### Biomedical Domains
- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health
## Usage
### Installation
```bash
pip install sentence-transformers
```
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')
# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases",
]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}") # (3, 768)
```
### Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients",
]
# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)
# Find most similar
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```
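For corpora too large for a dense similarity matrix, sentence-transformers provides a batched search helper:
```python
from sentence_transformers import util

# Reuses query_embedding and corpus_embeddings from the snippet above
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:  # results for the first (and only) query
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```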
## Training Details
### Loss Function
The model uses **MultipleNegativesRankingLoss** (sketched below), which:
- Treats every other example in a batch as a negative for a given pair
- Optimizes for high similarity between paired texts relative to those in-batch negatives
- Provides robust contrastive learning without explicitly mined hard negatives
- Is well established in the sentence-transformers ecosystem
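Conceptually, each batch of (anchor, positive) pairs yields a similarity matrix whose diagonal holds the matching pairs, and the loss is a cross-entropy that pushes each row's diagonal entry above the rest. A minimal PyTorch illustration (the 20.0 scale matches the sentence-transformers default for cosine similarity):
```python
import torch
import torch.nn.functional as F

def mnr_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over in-batch similarities: row i's positive is column i."""
    anchor_emb = F.normalize(anchor_emb, dim=1)
    positive_emb = F.normalize(positive_emb, dim=1)
    scores = anchor_emb @ positive_emb.T * scale                 # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # diagonal indices
    return F.cross_entropy(scores, labels)

# Toy check with random embeddings
print(mnr_loss(torch.randn(8, 768), torch.randn(8, 768)))
```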
### Training Process
- **Duration**: ~4 days on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1000 steps (310 evaluations, 1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at end of each epoch
### Optimization Features
- Gradient clipping (max_norm=5.0) for training stability
- Weight decay regularization for generalization
- Cosine learning rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)
## Technical Specifications
### Hardware Requirements
- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB GPU memory for inference
### Software Dependencies
- sentence-transformers >= 2.0.0
- transformers >= 4.20.0
- torch >= 1.12.0
- Python >= 3.8
## Comparison with SODA-VEC (VICReg)
| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |
## Limitations
- **Domain Specificity**: Optimized for biomedical text, may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training data cutoff at July 2024
- **Bias**: May reflect biases present in PubMed literature
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```
## License
This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.
## Acknowledgments
- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources
## Model Card Contact
For questions about this model, please contact EMBO or open an issue in the associated repository.
---
**Last Updated**: August 2024
**Model Version**: 1.0
**Training Completion**: In Progress (ETA: 4 days)