# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings
## Model Overview
**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. This model is optimized for biomedical and life sciences applications, providing high-quality semantic representations for scientific literature.
## Key Features
- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Optimized training with FP16, gradient clipping, and cosine scheduling
## Model Details
### Base Model
- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens
- **Parameters**: ~110M
### Training Configuration
- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers)
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimization**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB
## Dataset
### Source Data
- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from biomedical literature
- **Quality**: 99.7% retention after filtering (128-6,000 character abstracts)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test
### Data Processing
- Error pattern removal and quality filtering
- Balanced train/validation/test splits
- Character length filtering for optimal training
- Duplicate detection and removal
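A minimal sketch of the length filtering and deduplication steps described above; the record layout, field names, and helper are illustrative assumptions, not the actual processing pipeline:
```python
def filter_and_deduplicate(records, min_chars=128, max_chars=6000):
    """Keep abstracts within the character-length window and drop exact duplicates.

    `records` is assumed to be an iterable of dicts with 'title' and 'abstract' keys
    (hypothetical schema for illustration).
    """
    seen = set()
    for record in records:
        abstract = record.get("abstract", "").strip()
        if not (min_chars <= len(abstract) <= max_chars):
            continue  # character-length filtering
        key = abstract.lower()
        if key in seen:
            continue  # duplicate removal
        seen.add(key)
        yield record

# Example: two records, one too short to keep
sample = [
    {"title": "Tau aggregation", "abstract": "A" * 200},
    {"title": "Too short", "abstract": "A" * 50},
]
print(len(list(filter_and_deduplicate(sample))))  # 1
```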
## Performance & Use Cases
### Intended Applications
- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents
- **Similarity Analysis**: Comparing research abstracts and papers
### Biomedical Domains
- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health
## Usage
### Installation
```bash
pip install sentence-transformers
```
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases",
]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```
### Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus (model loaded as in the Basic Usage example)
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients",
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find the most similar corpus entry
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```
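For larger corpora, sentence-transformers also ships a batched helper, `util.semantic_search`, which avoids building the full similarity matrix by hand. A short sketch reusing the model loaded above; the corpus here is illustrative:
```python
from sentence_transformers import util

# Encode the corpus once and reuse it for many queries
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, show_progress_bar=True)

query_embedding = model.encode("Alzheimer's disease biomarkers", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.3f})")
```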
## Training Details
### Loss Function
The model uses **MultipleNegativesRankingLoss**, which:
- Treats every other positive in a batch as a negative for a given anchor
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without explicitly mined negatives
- Is well established in the sentence-transformers ecosystem
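To make the in-batch mechanism concrete, here is a toy re-implementation of the core idea (not the library's exact code): cosine similarities between every anchor and every positive form a square score matrix, and cross-entropy pushes each diagonal entry (the true pair) above the rest of its row.
```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb, positive_emb, scale=20.0):
    """Illustrative in-batch negatives loss: each anchor's matching positive is the
    diagonal entry; every other positive in the batch acts as a negative."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    scores = a @ p.T * scale               # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0))  # diagonal = true pairs
    return F.cross_entropy(scores, labels)

# Toy example with random "embeddings"
anchors = torch.randn(8, 768)
positives = torch.randn(8, 768)
print(in_batch_negatives_loss(anchors, positives))
```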
### Training Process
- **Duration**: ~4 days on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1000 steps (310 evaluations, 1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at end of each epoch
### Optimization Features
- Gradient clipping (max_norm=5.0) for training stability
- Weight decay regularization for generalization
- Cosine learning rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)
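A minimal sketch of how the training configuration and optimization settings above might map onto `SentenceTransformer.fit`; the two training pairs are placeholders for the real 26.5M-pair dataset, and the actual training script may differ:
```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

torch.manual_seed(42)  # reproducibility, as in the reported run

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Placeholder title/abstract pairs standing in for the 26.5M PubMed pairs
train_examples = [
    InputExample(texts=["Tau aggregation in neurons",
                        "Tau protein misfolding drives neurodegeneration."]),
    InputExample(texts=["mRNA vaccine efficacy",
                        "Immunogenicity of mRNA vaccines against SARS-CoV-2 variants."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    scheduler="warmupcosine",          # cosine learning-rate schedule
    optimizer_params={"lr": 1e-5},     # AdamW is the default optimizer
    weight_decay=0.01,                 # weight decay regularization
    max_grad_norm=5.0,                 # gradient clipping
    use_amp=True,                      # FP16 mixed precision
    output_path="soda-vec-negative-sampling",
)
```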
## Technical Specifications
### Hardware Requirements
- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB GPU memory for inference
### Software Dependencies
- sentence-transformers >= 2.0.0
- transformers >= 4.20.0
- torch >= 1.12.0
- Python >= 3.8
## Comparison with SODA-VEC (VICReg)
| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |
## Limitations
- **Domain Specificity**: Optimized for biomedical text, may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training data cutoff at July 2024
- **Bias**: May reflect biases present in PubMed literature
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```
## License
This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.
## Acknowledgments
- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources
## Model Card Contact
For questions about this model, please contact EMBO or open an issue in the associated repository.
---
**Last Updated**: August 2024
**Model Version**: 1.0
**Training Completion**: In Progress (ETA: 4 days)