---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---
# VICReg Our Model
## Model Description
A SODA-VEC embedding model trained with the "VICReg Our" loss, a modified [VICReg](https://arxiv.org/pdf/2105.04906) objective. The model uses L2-normalized embeddings with a covariance loss, a cross-view feature-correlation loss, and a diagonal-only dot-product loss to learn biomedical text representations.
This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.
**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling
## Training Details
### Training Data
- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning
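To inspect the pairs without downloading all 26.5M examples, a minimal streaming sketch (the column names are whatever the dataset defines; printing one example reveals them):

```python
from datasets import load_dataset

# Stream the paired dataset rather than downloading it in full
ds = load_dataset(
    "EMBO/soda-vec-data-full_pmc_title_abstract_paired",
    split="train",
    streaming=True,
)
print(next(iter(ds)))  # inspect one example and its column names
```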
### Training Procedure
**Loss Function**: VICReg Our, combining L2-normalized embeddings with a covariance loss, a cross-view feature-correlation loss, and a diagonal-only dot-product loss.
We made several changes relative to the original [VICReg paper from Meta](https://arxiv.org/pdf/2105.04906); the main differences are summarized below:
| Feature | Original VICReg | VICReg Our | VICReg Our Contrast |
|---------|----------------|------------|---------------------|
| Normalization | No | Yes (L2-normalized) | Yes (L2-normalized) |
| Invariance (MSE) | Yes | No | No |
| Variance (hinge) | Yes | No | No |
| Covariance | Yes (unnormalized) | Yes (normalized) | Yes (normalized) |
| Feature correlation | No | Yes (cross-view) | Yes (cross-view) |
| Sample similarity | No | Yes (diagonal only) | Yes (diagonal + off-diagonal) |
**Coefficients**: cov=1.0, feature=1.0, dot=1.0
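For concreteness, here is a minimal PyTorch sketch of the objective the table describes. This is a reconstruction from the table, not the released training code (which lives in `scripts/soda-vec-train.py`); the standardization epsilon and the exact form of each term are assumptions:

```python
import torch
import torch.nn.functional as F

def vicreg_our_loss(z1, z2, coeff_cov=1.0, coeff_feature=1.0, coeff_dot=1.0, eps=1e-6):
    """Sketch of the 'VICReg Our' objective implied by the table above.
    z1, z2: (batch, dim) embeddings of the two views (e.g. title / abstract)."""
    n, d = z1.shape
    # L2-normalize embeddings: the key departure from original VICReg
    z1 = F.normalize(z1, p=2, dim=1)
    z2 = F.normalize(z2, p=2, dim=1)

    # Covariance loss (per view): suppress off-diagonal covariance so
    # features decorrelate, computed on the normalized embeddings
    def cov_loss(z):
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    # Cross-view feature correlation: each feature should correlate with
    # the same feature in the other view (diagonal of the cross-correlation
    # matrix pulled toward 1)
    z1c = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + eps)
    z2c = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + eps)
    cross_corr = (z1c.T @ z2c) / n
    feature_loss = (1.0 - torch.diag(cross_corr)).pow(2).mean()

    # Diagonal-only dot-product loss: pull each pair's cosine similarity
    # toward 1 (off-diagonal pairs are left alone in this variant)
    dot_loss = (1.0 - (z1 * z2).sum(dim=1)).mean()

    return (coeff_cov * (cov_loss(z1) + cov_loss(z2))
            + coeff_feature * feature_loss
            + coeff_dot * dot_loss)
```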
**Base Model**: `answerdotai/ModernBERT-base`
**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256 (4 GPUs × 16 per GPU × 4 accumulation steps)
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)
**Training Command:**
```bash
python scripts/soda-vec-train.py --config vicreg_our --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```
### Model Architecture
- **Base Architecture**: ModernBERT-base (12 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings (for VICReg-based models)
## Usage
### Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("EMBO/vicreg_our")
# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
### Using Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our")
model = AutoModel.from_pretrained("EMBO/vicreg_our")
# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean pooling over non-padding tokens (mask out padding before averaging)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
# L2-normalize (VICReg Our models are trained with normalized embeddings)
embeddings = F.normalize(embeddings, p=2, dim=1)
# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```
## Evaluation
The model has been evaluated on comprehensive biomedical benchmarks including:
- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora
For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec).
## Intended Use
This model is designed for:
- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages
- **Scientific Text Similarity**: Computing similarity between biomedical texts
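As a semantic-search example, a minimal sketch with a toy corpus (the documents below are made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("EMBO/vicreg_our")

# Hypothetical mini-corpus of abstracts to search over
corpus = [
    "Single-cell RNA sequencing reveals tumor heterogeneity.",
    "CRISPR screens identify essential genes in cancer cell lines.",
    "Cryo-EM structure of the SARS-CoV-2 spike protein.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("genome-wide CRISPR knockout screen", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```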
## Limitations
- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking (see the sketch after this list)
- **Bias**: Inherits biases from the training data (PubMed Central corpus)
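One naive chunking strategy, as a sketch only; the chunk size and the plain averaging below are illustrative choices, not part of the released model:

```python
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("EMBO/vicreg_our")

def embed_long_text(text, max_words=200):
    """Split on words, embed each chunk, average, and re-normalize.
    Sentence- or paragraph-aware chunking usually works better."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunk_emb = model.encode(chunks, convert_to_tensor=True)  # (n_chunks, 768)
    emb = chunk_emb.mean(dim=0)
    return torch.nn.functional.normalize(emb, dim=0)
```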
## Citation
If you use this model, please cite:
```bibtex
@software{soda_vec,
  title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year = {2024},
  url = {https://github.com/source-data/soda-vec}
}
```
## Model Card Contact
For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec).
---
**Model Card Generated**: 2025-11-10