# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings

## Model Overview

**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. This model is optimized for biomedical and life sciences applications, providing high-quality semantic representations for scientific literature.

## Key Features

- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from the complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Optimized training with FP16, gradient clipping, and cosine scheduling

## Model Details

### Base Model
- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens
- **Parameters**: ~149M
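
These values can be checked directly once the model is loaded:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/soda-vec-negative-sampling")
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # configured token limit
```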

### Training Configuration
- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers)
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimization**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB
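
The training script itself is not part of this card, but the configuration above maps onto the sentence-transformers v3 trainer roughly as follows. The dataset file and its column layout are placeholders, not the actual training data:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
# Hypothetical pairs file with anchor/positive columns (e.g. title/abstract)
train_dataset = load_dataset("json", data_files="pubmed_pairs.jsonl", split="train")

args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-negative-sampling",
    num_train_epochs=3,
    per_device_train_batch_size=32,  # x 4 GPUs x 2 accumulation = 256 effective
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=5.0,
    fp16=True,
    seed=42,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```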

## Dataset

### Source Data
- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from biomedical literature
- **Quality**: 99.7% retention after filtering (128-6,000 character abstracts)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test

### Data Processing
- Error pattern removal and quality filtering
- Balanced train/validation/test splits
- Character length filtering for optimal training
- Duplicate detection and removal
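
A minimal sketch of that filtering logic; the 128–6,000 character window comes from the stats above, while everything else (the sample data, exact-hash deduplication) is an assumption for illustration:

```python
def keep_abstract(abstract: str, seen: set) -> bool:
    """Character-length filter plus exact-duplicate detection."""
    if not 128 <= len(abstract) <= 6000:  # length filtering
        return False
    key = hash(abstract)
    if key in seen:  # duplicate removal
        return False
    seen.add(key)
    return True

raw_pairs = [
    ("A title", "An abstract " * 20),  # kept (240 chars, first occurrence)
    ("A title", "An abstract " * 20),  # dropped as a duplicate
    ("Short", "too short"),            # dropped by the length filter
]
seen = set()
filtered = [(t, a) for t, a in raw_pairs if keep_abstract(a, seen)]
```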

## Performance & Use Cases

### Intended Applications
- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents
- **Similarity Analysis**: Comparing research abstracts and papers

### Biomedical Domains
- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health

## Usage

### Installation
```bash
pip install sentence-transformers
```

### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]

embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```

### Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies", 
    "Beta-amyloid plaques in dementia patients"
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find most similar
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```
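
The same lookup can be done with the built-in `util.semantic_search` helper, which drops the sklearn dependency:

```python
from sentence_transformers import util

query_emb = model.encode([query], convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=1)[0]  # results for the single query
print(f"Best match: {corpus[hits[0]['corpus_id']]} (similarity: {hits[0]['score']:.3f})")
```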

## Training Details

### Loss Function
The model uses **MultipleNegativesRankingLoss**, which:
- Treats all other samples in a batch as negatives
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without explicit negative mining (in-batch negatives are used instead)
- Is well established in the sentence-transformers ecosystem
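
Conceptually, the loss is a cross-entropy over an in-batch similarity matrix, with each anchor's own positive on the diagonal. A toy illustration with random tensors standing in for real encodings (the 20.0 scale matches the loss's default):

```python
import torch
import torch.nn.functional as F

batch, dim = 4, 768
anchors = torch.randn(batch, dim)    # e.g. encoded titles
positives = torch.randn(batch, dim)  # e.g. encoded abstracts

# Entry (i, j) compares anchor i with positive j; off-diagonal entries act as negatives
scores = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) * 20.0
labels = torch.arange(batch)  # positive j = i is the correct "class" for anchor i
loss = F.cross_entropy(scores, labels)
```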

### Training Process
- **Duration**: ~4 days on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1000 steps (310 evaluations, 1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at end of each epoch
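
The step count follows directly from the data and batch size: with drop-last batching, ⌊26,473,900 / 256⌋ = 103,413 steps per epoch, and 3 × 103,413 = 310,239 total steps.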

### Optimization Features
- Gradient clipping (max_norm=5.0) for training stability
- Weight decay regularization for generalization
- Cosine learning rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)

## Technical Specifications

### Hardware Requirements
- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB GPU memory for inference

### Software Dependencies
- sentence-transformers >= 2.0.0
- transformers >= 4.20.0
- torch >= 1.12.0
- Python >= 3.8

## Comparison with SODA-VEC (VICReg)

| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |

## Limitations

- **Domain Specificity**: Optimized for biomedical text, may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training data cutoff at July 2024
- **Bias**: May reflect biases present in PubMed literature

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```

## License

This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.

## Acknowledgments

- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources

## Model Card Contact

For questions about this model, please contact EMBO or open an issue in the associated repository.

---

**Last Updated**: August 2024  
**Model Version**: 1.0  
**Training Completion**: In Progress (ETA: 4 days)