---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# VICReg Our Model

## Model Description

SODA-VEC embedding model trained with our modified [VICReg](https://arxiv.org/pdf/2105.04906) loss ("VICReg Our"). The model uses L2-normalized embeddings with a covariance loss, a feature-correlation loss, and a diagonal-only dot-product loss to learn biomedical text representations.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning

### Training Procedure

**Loss Function**: VICReg Our (L2-normalized embeddings with a covariance loss, a feature-correlation loss, and a diagonal-only dot-product loss)

We implemented a series of changes relative to the original [VICReg paper from Meta](https://arxiv.org/pdf/2105.04906). The main differences:

| Feature | Original VICReg | VICReg Our | VICReg Our Contrast |
|---------|----------------|------------|---------------------|
| Normalization | No | Yes (L2-normalized) | Yes (L2-normalized) |
| Invariance (MSE) | Yes | No | No |
| Variance (hinge) | Yes | No | No |
| Covariance | Yes (unnormalized) | Yes (normalized) | Yes (normalized) |
| Feature correlation | No | Yes (cross-view) | Yes (cross-view) |
| Sample similarity | No | Yes (diagonal only) | Yes (diagonal + off-diagonal) |

**Coefficients**: cov=1.0, feature=1.0, dot=1.0

**Base Model**: `answerdotai/ModernBERT-base`
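
The training implementation lives in the [SODA-VEC repository](https://github.com/source-data/soda-vec). Purely as an illustration of the three terms in the table above, here is a minimal PyTorch sketch; the function name `vicreg_our_loss` and the exact form of each term are our assumptions based on this card, not the actual training code:

```python
import torch
import torch.nn.functional as F

def vicreg_our_loss(z1, z2, coeff_cov=1.0, coeff_feature=1.0, coeff_dot=1.0):
    """Hypothetical sketch of the 'VICReg Our' loss; z1, z2 are the two
    views (e.g. title and abstract embeddings) of shape (batch, dim)."""
    # L2-normalize both views (the "Normalization: Yes" row in the table)
    z1 = F.normalize(z1, p=2, dim=1)
    z2 = F.normalize(z2, p=2, dim=1)
    n, d = z1.shape

    # Covariance term: penalize off-diagonal entries of the covariance
    # matrix of each (normalized) view, as in VICReg's decorrelation loss
    def cov_loss(z):
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    # Cross-view feature term: corresponding feature dimensions of the two
    # views should stay correlated (diagonal of the d x d cross-covariance)
    cross = (z1 - z1.mean(dim=0)).T @ (z2 - z2.mean(dim=0)) / (n - 1)
    feature_loss = (1.0 - torch.diag(cross)).pow(2).mean()

    # Diagonal-only sample similarity: pull each matched pair toward
    # cosine similarity 1; off-diagonal (unmatched) pairs are not used
    dot_loss = (1.0 - (z1 * z2).sum(dim=1)).mean()

    return (coeff_cov * (cov_loss(z1) + cov_loss(z2))
            + coeff_feature * feature_loss
            + coeff_dot * dot_loss)
```

With the coefficients above (cov=1.0, feature=1.0, dot=1.0), the three terms are weighted equally.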

**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256 (4 GPUs × 16 per GPU × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)

**Training Command:**
```bash
python scripts/soda-vec-train.py --config vicreg_our --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (22 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings (for VICReg-based models)

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("EMBO/vicreg_our")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our")
model = AutoModel.from_pretrained("EMBO/vicreg_our")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    
# Mean pooling over non-padding tokens (exclude padding via the attention mask)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize (for VICReg models)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```

## Evaluation

The model has been evaluated on comprehensive biomedical benchmarks including:

- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora

For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec).
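
As an illustration of the title-abstract similarity task (the actual benchmark code lives in the notebooks linked above), one can check that a matched title-abstract pair scores higher than a mismatched one; the texts below are invented:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("EMBO/vicreg_our")

title = "CRISPR-Cas9 gene editing in human cells"
matched_abstract = "We apply CRISPR-Cas9 to edit genes in human cell lines."
unrelated_abstract = "We study mitochondrial dynamics in budding yeast."

# A well-trained model should rank the matched pair above the unrelated one
t, m, u = model.encode([title, matched_abstract, unrelated_abstract])
print(cos_sim(t, m).item() > cos_sim(t, u).item())  # expected: True
```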

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages (see the example below)
- **Scientific Text Similarity**: Computing similarity between biomedical texts
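
A minimal retrieval sketch using `sentence_transformers.util.semantic_search`; the corpus and query below are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("EMBO/vicreg_our")

# Illustrative corpus of abstracts
corpus = [
    "CRISPR-Cas9 enables precise genome editing in mammalian cells.",
    "Single-cell RNA sequencing reveals tumor heterogeneity.",
    "Mitochondrial dynamics regulate apoptosis in neurons.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("gene editing technologies", convert_to_tensor=True)
hits = semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```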

## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking (see the sketch below)
- **Bias**: Inherits biases from the training data (PubMed Central corpus)
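
For documents longer than a title or abstract, one simple strategy is to split the text into chunks, embed each chunk, and average. The helper `embed_long_text` below is a hypothetical sketch, not part of this repository:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("EMBO/vicreg_our")

def embed_long_text(text, chunk_tokens=512):
    """Hypothetical chunk-and-average embedding for long documents."""
    tokens = model.tokenizer.tokenize(text)
    chunks = [
        model.tokenizer.convert_tokens_to_string(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Average the chunk embeddings and re-normalize the result
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)
```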

## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year = {2024},
  url = {https://github.com/source-data/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec).

---

**Model Card Generated**: 2025-11-10