drAbreu committed on
Commit
092ba0c
·
verified ·
1 Parent(s): 3036acf

Update model card with training details and usage instructions

Files changed (1): README.md (+176 -0)
README.md ADDED

---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# Dot Only Model

## Model Description

A SODA-VEC embedding model trained with a dot-product loss only: it learns biomedical text representations from L2-normalized embeddings using a purely contrastive (dot-product) objective.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on the **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning

### Training Procedure

**Loss Function**: dot-product loss only, applied to L2-normalized embeddings; it combines a diagonal term (matched title-abstract pairs) with an off-diagonal term (in-batch negatives)

**Coefficients**: dot=1.0

**Base Model**: `answerdotai/ModernBERT-base`
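
The exact objective is defined in the training script; as a rough, hypothetical sketch (the function name and exact form are ours, assuming in-batch negatives from the paired data), a dot-only loss of this shape could look like:

```python
import torch
import torch.nn.functional as F

def dot_only_loss(title_emb: torch.Tensor, abstract_emb: torch.Tensor,
                  coeff_dot: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of a dot-product-only contrastive loss."""
    # L2-normalize so dot products are cosine similarities in [-1, 1]
    a = F.normalize(title_emb, p=2, dim=1)
    b = F.normalize(abstract_emb, p=2, dim=1)
    sim = a @ b.T                            # (batch, batch) similarity matrix
    diag = sim.diagonal()                    # matched title-abstract pairs
    off_diag = sim - torch.diag_embed(diag)  # mismatched (in-batch negative) pairs
    # Pull matched pairs toward similarity 1, push mismatched pairs toward 0
    loss = (1.0 - diag).mean() + off_diag.pow(2).sum() / (sim.numel() - len(diag))
    return coeff_dot * loss
```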

**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256 (16 per GPU × 4 GPUs × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)
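
As a sketch of how this configuration might map onto the Sentence Transformers v3 trainer (an assumption on our part; the actual `scripts/soda-vec-train.py` may configure training differently):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="dot_only",
    per_device_train_batch_size=16,  # 16 per GPU
    gradient_accumulation_steps=4,   # x4 accumulation
    learning_rate=2e-5,
    warmup_steps=100,
    num_train_epochs=1,              # single full pass over the 26.5M pairs
)
# With 4 GPUs: 16 x 4 GPUs x 4 accumulation steps = 256 effective batch size
```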

**Training Command:**
```bash
python scripts/soda-vec-train.py --config dot_only --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (22 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
96
+
97
+ ### Using Hugging Face Transformers
98
+
99
+ ```python
100
+ from transformers import AutoTokenizer, AutoModel
101
+ import torch
102
+ import torch.nn.functional as F
103
+
104
+ # Load model and tokenizer
105
+ tokenizer = AutoTokenizer.from_pretrained("EMBO/dot_only")
106
+ model = AutoModel.from_pretrained("EMBO/dot_only")
107
+
108
+ # Encode sentences
109
+ sentences = [
110
+ "CRISPR-Cas9 gene editing in human cells",
111
+ "Genome editing using CRISPR technology"
112
+ ]
113
+
114
+ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
115
+ with torch.no_grad():
116
+ outputs = model(**inputs)
117
+
118
+ # Mean pooling
119
+ embeddings = outputs.last_hidden_state.mean(dim=1)
120
+
121
+ # Normalize (for VICReg models)
122
+ embeddings = F.normalize(embeddings, p=2, dim=1)
123
+
124
+ # Compute similarity
125
+ similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
126
+ print(f"Similarity: {similarity.item():.4f}")
127
+ ```

## Evaluation

The model has been evaluated on comprehensive biomedical benchmarks, including:

- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs (see the sketch after this list)
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora
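
As a minimal, hypothetical illustration of the title-abstract similarity setup (the example pairs below are ours, not drawn from the benchmark):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("EMBO/dot_only")

title = "CRISPR-Cas9 gene editing in human cells"
matched_abstract = "We describe genome editing of human cell lines using CRISPR-Cas9."
unrelated_abstract = "We report seasonal migration patterns of Arctic shorebirds."

emb = model.encode([title, matched_abstract, unrelated_abstract])
# A well-trained model should rank the matched pair above the unrelated one
print("matched:  ", cos_sim(emb[0], emb[1]).item())
print("unrelated:", cos_sim(emb[0], emb[2]).item())
```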
138
+ For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/EMBO/soda-vec).
139
+

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages (a retrieval sketch follows this list)
- **Scientific Text Similarity**: Computing similarity between biomedical texts
- **Information Retrieval**: Building search systems for scientific literature
- **Downstream Tasks**: As a base for fine-tuning on specific biomedical tasks
- **Research Applications**: Academic and research use in life sciences
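
A minimal retrieval sketch using the `semantic_search` utility (the corpus here is a toy example of our own):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("EMBO/dot_only")

corpus = [
    "CRISPR-Cas9 gene editing in human cells",
    "Mitochondrial dynamics in neurodegeneration",
    "Single-cell RNA sequencing of tumor microenvironments",
]
query = "genome editing technologies"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Returns the top-k corpus entries ranked by cosine similarity
hits = semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```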

## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general-domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
- **Bias**: Inherits biases from the training data (the PubMed Central corpus)
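
One simple chunking strategy (a hypothetical helper, not part of the model) is to embed overlapping word windows and average the results:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("EMBO/dot_only")

def embed_long_text(text: str, window: int = 200, overlap: int = 50) -> np.ndarray:
    """Hypothetical helper: embed overlapping word windows and average them."""
    words = text.split()
    step = window - overlap
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words) - overlap, 1), step)]
    chunk_emb = model.encode(chunks)  # (n_chunks, 768)
    emb = chunk_emb.mean(axis=0)      # average into one document vector
    return emb / np.linalg.norm(emb)  # re-normalize after averaging
```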

## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title  = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year   = {2024},
  url    = {https://github.com/EMBO/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/EMBO/soda-vec).

---

**Model Card Generated**: 2025-11-10