# SODA-VEC Negative Sampling: Biomedical Sentence Embeddings

## Model Overview

**SODA-VEC Negative Sampling** is a specialized sentence embedding model trained on 26.5M biomedical text pairs using the MultipleNegativesRankingLoss from sentence-transformers. The model is optimized for biomedical and life sciences applications, providing high-quality semantic representations of scientific literature.

## Key Features

- 🧬 **Biomedical Specialization**: Trained exclusively on PubMed abstracts and titles
- 🔬 **Large Scale**: 26.5M training pairs from the complete PubMed baseline (July 2024)
- ⚡ **Modern Architecture**: Based on ModernBERT-embed-base with 768-dimensional embeddings
- 🎯 **Negative Sampling**: Uses the standard MultipleNegativesRankingLoss for robust contrastive learning
- 📊 **Production Ready**: Optimized training with FP16, gradient clipping, and cosine scheduling

## Model Details

### Base Model

- **Architecture**: ModernBERT-embed-base (nomic-ai/modernbert-embed-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 768 tokens
- **Parameters**: ~149M

### Training Configuration

- **Loss Function**: MultipleNegativesRankingLoss (sentence-transformers)
- **Training Data**: 26,473,900 biomedical text pairs
- **Epochs**: 3
- **Effective Batch Size**: 256 (32 per GPU × 4 GPUs × 2 gradient accumulation steps)
- **Learning Rate**: 1e-5 with cosine scheduling
- **Optimization**: AdamW with weight decay (0.01)
- **Precision**: FP16 for efficiency
- **Hardware**: 4x Tesla V100-DGXS-32GB

## Dataset

### Source Data

- **Origin**: Complete PubMed baseline (July 2024)
- **Content**: Scientific abstracts and titles from the biomedical literature
- **Quality**: 99.7% retention after filtering (abstracts of 128-6,000 characters)
- **Splits**: 99.6% train / 0.2% validation / 0.2% test

### Data Processing

- Error-pattern removal and quality filtering
- Balanced train/validation/test splits
- Character-length filtering for optimal training
- Duplicate detection and removal

## Performance & Use Cases

### Intended Applications

- **Literature Search**: Semantic search across biomedical publications
- **Research Discovery**: Finding related papers and concepts
- **Knowledge Mining**: Extracting relationships from scientific text
- **Document Classification**: Categorizing biomedical documents
- **Similarity Analysis**: Comparing research abstracts and papers

### Biomedical Domains

- Molecular Biology
- Clinical Medicine
- Pharmacology
- Genetics & Genomics
- Biochemistry
- Neuroscience
- Public Health

## Usage

### Installation

```bash
pip install sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-negative-sampling')

# Encode biomedical texts
texts = [
    "CRISPR-Cas9 gene editing in human embryos",
    "mRNA vaccine efficacy against COVID-19 variants",
    "Protein folding mechanisms in neurodegenerative diseases"
]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 768)
```

### Semantic Search

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Query and corpus
query = "Alzheimer's disease biomarkers"
corpus = [
    "Tau protein aggregation in neurodegeneration",
    "COVID-19 vaccine development strategies",
    "Beta-amyloid plaques in dementia patients"
]

# Encode
query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Find the most similar corpus entry
similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
best_match = np.argmax(similarities)
print(f"Best match: {corpus[best_match]} (similarity: {similarities[best_match]:.3f})")
```

## Training Details

### Loss Function

The model uses **MultipleNegativesRankingLoss** (sketched below), which:

- Treats all other samples in a batch as negatives
- Optimizes for high similarity between related texts
- Provides robust contrastive learning without explicit negative sampling
- Is well established in the sentence-transformers ecosystem
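Concretely, for a batch of (anchor, positive) pairs, each anchor is scored against every positive in the batch, and a cross-entropy objective pushes the anchor's own positive above the others. Below is a minimal sketch of such a setup using the classic sentence-transformers `fit` API; the example pairs and hyperparameters are illustrative placeholders, not the actual SODA-VEC training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative (title, abstract) positive pairs; the real run used 26.5M PubMed pairs
train_examples = [
    InputExample(texts=[
        "CRISPR-Cas9 gene editing in human embryos",
        "We describe the application of CRISPR-Cas9 to edit genes in human embryonic cells.",
    ]),
    InputExample(texts=[
        "mRNA vaccine efficacy against COVID-19 variants",
        "This study evaluates the effectiveness of mRNA vaccination across SARS-CoV-2 lineages.",
    ]),
]

model = SentenceTransformer("nomic-ai/modernbert-embed-base")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# For each anchor, every other positive in the batch serves as an in-batch negative
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```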
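To make the in-batch mechanics concrete, the same objective can be computed by hand for a toy batch. By default, sentence-transformers scales cosine similarities by a factor of 20.0 before the softmax:

```python
import torch
import torch.nn.functional as F

batch_size, dim = 3, 768

# Toy unit-normalized embeddings standing in for encoded anchors and positives
anchors = F.normalize(torch.randn(batch_size, dim), dim=1)
positives = F.normalize(torch.randn(batch_size, dim), dim=1)

# Scaled cosine-similarity matrix: row i scores anchor i against all in-batch positives
scores = anchors @ positives.T * 20.0

# Diagonal entries are the true pairs; all off-diagonal entries act as negatives
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)
print(f"MultipleNegativesRankingLoss (by hand): {loss.item():.3f}")
```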
### Training Process

- **Duration**: ~4 days on 4x V100 GPUs
- **Steps**: 310,239 total training steps
- **Evaluation**: Every 1000 steps (310 evaluations, 1.8% overhead)
- **Monitoring**: Real-time TensorBoard logging
- **Checkpointing**: Model saved at the end of each epoch

### Optimization Features

- Gradient clipping (max_norm=5.0) for training stability
- Weight-decay regularization for generalization
- Cosine learning-rate scheduling
- Loss-only evaluation for efficiency
- Reproducible training (seed=42)

## Technical Specifications

### Hardware Requirements

- **Training**: 4x Tesla V100-DGXS-32GB (recommended)
- **Inference**: Any GPU with 4GB+ VRAM, or CPU
- **Memory**: ~2GB GPU memory for inference

### Software Dependencies

- sentence-transformers >= 2.0.0
- transformers >= 4.48.0 (required for ModernBERT support)
- torch >= 1.12.0
- Python >= 3.8

## Comparison with SODA-VEC (VICReg)

| Feature | SODA-VEC (VICReg) | SODA-VEC Negative Sampling |
|---------|-------------------|----------------------------|
| Loss Function | VICReg (custom biomedical) | MultipleNegativesRankingLoss |
| Optimization | Empirically tuned coefficients | Standard contrastive learning |
| Training Data | Same (26.5M pairs) | Same (26.5M pairs) |
| Use Case | Biomedical research focus | General semantic similarity |
| Framework | Custom implementation | sentence-transformers standard |

## Limitations

- **Domain Specificity**: Optimized for biomedical text; may not generalize to other domains
- **Language**: English-only training data
- **Recency**: Training-data cutoff of July 2024
- **Bias**: May reflect biases present in the PubMed literature

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{soda-vec-negative-sampling-2024,
  title={SODA-VEC Negative Sampling: Biomedical Sentence Embeddings},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-negative-sampling},
  note={Trained on 26.5M PubMed text pairs using MultipleNegativesRankingLoss}
}
```

## License

This model is released under the same license as the base ModernBERT model. Please refer to the original model card for licensing details.

## Acknowledgments

- **Base Model**: nomic-ai/modernbert-embed-base
- **Training Framework**: sentence-transformers
- **Data Source**: PubMed/MEDLINE database
- **Infrastructure**: EMBO computational resources

## Model Card Contact

For questions about this model, please contact EMBO or open an issue in the associated repository.

---

**Last Updated**: August 2024

**Model Version**: 1.0

**Training Completion**: In Progress (ETA: 4 days)