---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---
# VICReg Exact Model
## Model Description
A SODA-VEC embedding model trained with the exact [VICReg](https://arxiv.org/pdf/2105.04906) loss. This model implements the exact VICReg objective with invariance, variance, and covariance terms for biomedical text embeddings.
This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.
**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling
## Training Details
### Training Data
- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning
### Training Procedure
**Loss Function**: the exact [VICReg](https://arxiv.org/pdf/2105.04906) objective, combining an invariance term (MSE between the two views), a variance term (keeping the standard deviation of each embedding dimension above a target), and a covariance term (penalizing off-diagonal covariance between dimensions)
**Coefficients**: sim=25.0, std=25.0, cov=1.0
**Base Model**: `answerdotai/ModernBERT-base`
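As a rough illustration (not the project's training script), the three VICReg terms with the coefficients above can be sketched in NumPy. The function name and the hinge target of 1.0 are assumptions taken from the VICReg paper's defaults, not from this repository:

```python
import numpy as np

def vicreg_loss(z1, z2, coeff_sim=25.0, coeff_std=25.0, coeff_cov=1.0):
    """Sketch of the VICReg objective for two views z1, z2 of shape (N, D)."""
    n, d = z1.shape
    # Invariance: mean-squared error between the two views
    sim_loss = np.mean((z1 - z2) ** 2)
    # Variance: hinge keeping each dimension's std above 1.0 (paper default)
    std_loss = 0.0
    for z in (z1, z2):
        std = np.sqrt(z.var(axis=0) + 1e-4)
        std_loss += np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: penalize off-diagonal entries of the covariance matrix
    cov_loss = 0.0
    for z in (z1, z2):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        cov_loss += (off_diag ** 2).sum() / d
    return coeff_sim * sim_loss + coeff_std * std_loss + coeff_cov * cov_loss
```

With the coefficients used here (sim=25.0, std=25.0, cov=1.0), the invariance and variance terms dominate the covariance term.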
**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)
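The effective batch size in the configuration above follows directly from the other three settings:

```python
gpus = 4
batch_per_gpu = 16
grad_accum = 4
effective_batch = gpus * batch_per_gpu * grad_accum
print(effective_batch)  # 256
```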
**Training Command:**
```bash
python scripts/soda-vec-train.py --config vicreg_exact --coeff_sim 25 --coeff_std 25 --coeff_cov 1 --push_to_hub --hub_org EMBO --save_limit 5
```
### Model Architecture
- **Base Architecture**: ModernBERT-base (12 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings (for VICReg-based models)
## Usage
### Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("EMBO/vicreg_exact")
# Encode sentences
sentences = [
"CRISPR-Cas9 gene editing in human cells",
"Genome editing using CRISPR technology"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
### Using Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_exact")
model = AutoModel.from_pretrained("EMBO/vicreg_exact")
# Encode sentences
sentences = [
"CRISPR-Cas9 gene editing in human cells",
"Genome editing using CRISPR technology"
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean pooling over non-padding tokens only
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
# Normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)
# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```
## Evaluation
The model has been evaluated on a suite of biomedical benchmarks, including:
- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora
For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec).
## Intended Use
This model is designed for:
- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages
- **Scientific Text Similarity**: Computing similarity between biomedical texts
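As a minimal sketch of the semantic-search use case, top-k retrieval over precomputed embeddings can look like the function below. In practice the vectors would come from `model.encode(...)`; here the cosine-similarity ranking itself is shown on plain NumPy arrays:

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=3):
    """Rank corpus embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]
```

With the model loaded, `corpus_embs = model.encode(abstracts)` and `query_emb = model.encode(query)` would feed directly into this function.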
## Limitations
- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking
- **Bias**: Inherits biases from the training data (PubMed Central corpus)
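For documents longer than an abstract, one simple chunking strategy (an assumption for illustration, not something this project provides) is to split the text into overlapping word windows, embed each window, and average the resulting vectors:

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then go through `model.encode(chunks)`, and the chunk embeddings could be mean-pooled into a single document embedding.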
## Citation
If you use this model, please cite:
```bibtex
@software{soda_vec,
title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
author = {EMBO},
year = {2024},
url = {https://github.com/source-data/soda-vec}
}
```
## Model Card Contact
For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec).
---
**Model Card Generated**: 2025-11-10
|