# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings

## Model Overview

**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. It is optimized for biomedical and life-sciences applications, providing high-quality semantic representations of scientific literature.
## Key Features

- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from the complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses the standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Trained with FP16, gradient clipping, and cosine learning-rate scheduling
## Model Details

### Base Model

- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens (the base model supports contexts up to 8,192)
- **Parameters**: ~149M
### Training Configuration

- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers)
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation steps)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimizer**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB
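For reference, the configuration above maps onto the sentence-transformers v3 training API roughly as follows. This is a hedged sketch rather than the actual training script: the dataset file, column layout, and output directory are placeholders, and evaluation/logging options are omitted for brevity.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Placeholder: any dataset with (anchor, positive) text-pair columns works here,
# e.g. (title, abstract) pairs exported from the processed PubMed baseline.
train_dataset = load_dataset("csv", data_files="pubmed_pairs.csv", split="train")

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
loss = MultipleNegativesRankingLoss(model)  # in-batch negatives, no explicit mining

args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-negative-sampling",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,   # × 4 GPUs × 2 accumulation = 256 effective
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=5.0,                # gradient clipping
    fp16=True,
    save_strategy="epoch",            # checkpoint at the end of each epoch
    seed=42,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```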
## Dataset

### Source Data

- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from the biomedical literature
- **Quality**: 99.7% retention after filtering (abstracts of 128-6,000 characters kept)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test
### Data Processing

- Error-pattern removal and quality filtering
- Balanced train/validation/test splits
- Character-length filtering (128-6,000 characters) for stable training
- Duplicate detection and removal
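A minimal sketch of the core filtering steps, assuming records with hypothetical `title` and `abstract` fields; the exact error patterns and split logic are not part of this card:

```python
def clean_records(records):
    """Yield quality-filtered, de-duplicated (title, abstract) records."""
    seen = set()
    for rec in records:
        title = rec.get("title", "").strip()
        abstract = rec.get("abstract", "").strip()
        # Character-length filter: keep abstracts of 128-6,000 characters.
        if not title or not (128 <= len(abstract) <= 6000):
            continue
        # Duplicate detection: drop exact repeats of the (title, abstract) pair.
        key = (title, abstract)
        if key in seen:
            continue
        seen.add(key)
        yield {"title": title, "abstract": abstract}
```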
## Performance & Use Cases

### Intended Applications

- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents
- **Similarity Analysis**: Comparing research abstracts and papers
### Biomedical Domains

- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health
## Usage

### Installation

```bash
pip install sentence-transformers
```
### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```
### Semantic Search

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients"
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find the most similar document
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```
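With recent sentence-transformers releases (3.0+), the scikit-learn step can be replaced by the model's built-in `similarity` helper, which applies the model's default similarity function (cosine here):

```python
# Same search using the built-in similarity utility (sentence-transformers >= 3.0)
scores = model.similarity(query_embedding, corpus_embeddings)  # tensor of shape (1, 3)
best_match = int(scores.argmax())
print(f"Best match: {corpus[best_match]} (similarity: {scores[0, best_match].item():.3f})")
```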
## Training Details

### Loss Function

The model uses **MultipleNegativesRankingLoss**, which:

- Treats the positives of all other pairs in a batch as negatives for each anchor
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without mining explicit hard negatives
- Is well established in the sentence-transformers ecosystem
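Conceptually, the loss is cross-entropy over a batch-by-batch matrix of scaled cosine similarities, where row *i*'s correct "class" is column *i*. A minimal PyTorch sketch for intuition (the scale of 20 matches the sentence-transformers default):

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    # Cosine similarity between every anchor and every positive in the batch.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    scores = scale * a @ p.T  # shape (batch, batch)
    # Row i should score highest at column i; other columns are in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```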
### Training Process

- **Duration**: ~4 days (estimated) on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1,000 steps (310 evaluations, ~1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at the end of each epoch
### Optimization Features

- Gradient clipping (max_norm=5.0) for training stability
- Weight decay regularization for better generalization
- Cosine learning-rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)
## Technical Specifications

### Hardware Requirements

- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB GPU memory for inference
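A hedged inference sketch along these lines, with `abstracts` as placeholder input; batched encoding keeps peak GPU memory low:

```python
import torch
from sentence_transformers import SentenceTransformer

# Use a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("EMBO/soda-vec-negative-sampling", device=device)

abstracts = ["Example biomedical abstract...", "Another abstract..."]  # placeholder
embeddings = model.encode(abstracts, batch_size=64, show_progress_bar=True)
```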
### Software Dependencies

- sentence-transformers >= 2.0.0
- transformers >= 4.48.0 (first release with ModernBERT support)
- torch >= 1.12.0
- Python >= 3.8
## Comparison with SODA-VEC (VICReg)

| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |
## Limitations

- **Domain Specificity**: Optimized for biomedical text; may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training data cutoff of July 2024
- **Bias**: May reflect biases present in the PubMed literature
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```
## License

This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.

## Acknowledgments

- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources
## Model Card Contact

For questions about this model, please contact EMBO or open an issue in the associated repository.

---

**Last Updated**: August 2024
**Model Version**: 1.0
**Training Completion**: In Progress (ETA: 4 days)