	---
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- biomedical
	- embeddings
	- life-sciences
	- scientific-text
	- SODA-VEC
	- EMBO
	datasets:
	- EMBO/soda-vec-data-full_pmc_title_abstract_paired
	metrics:
	- cosine-similarity
	---

	# Dot Only Model

	## Model Description

	This is the SODA-VEC embedding model trained with a dot-product loss only: embeddings are L2-normalized, and contrastive learning on their dot products is the sole objective used to learn biomedical text representations.

	This model is part of the SODA-VEC (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

	Key Features:
	- Trained on 26.5M biomedical title-abstract pairs from PubMed Central
	- Based on ModernBERT-base architecture
	- Optimized for biomedical text similarity and semantic search
	- Produces 768-dimensional embeddings with mean pooling

	## Training Details

	### Training Data

	- Dataset: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
	- Size: 26,473,900 training pairs
	- Source: Complete PubMed Central baseline (July 2024)
	- Format: Paired title-abstract examples optimized for contrastive learning

	### Training Procedure

	- Loss Function: Dot Only. Embeddings are L2-normalized and trained with a dot-product loss alone (diagonal + off-diagonal terms).
	- Coefficients: dot=1.0
	- Base Model: `answerdotai/ModernBERT-base`
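
The training script itself is not reproduced in this card, so the exact form of the loss is not shown here. As a rough illustration only, one plausible reading of "diagonal + off-diagonal" is sketched below with NumPy: matched title-abstract pairs (the diagonal of the batch dot-product matrix) are pushed toward 1, and mismatched in-batch pairs (off-diagonal) toward 0. The function names and the squared-error form are assumptions, not the project's implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def dot_only_loss(title_emb, abstract_emb):
    """Hypothetical dot-product loss; NOT the actual SODA-VEC implementation."""
    t = l2_normalize(title_emb)
    a = l2_normalize(abstract_emb)
    sim = t @ a.T                      # (batch, batch) dot products of unit vectors
    n = sim.shape[0]
    eye = np.eye(n, dtype=bool)
    # Assumed diagonal term: matched pairs should have dot product close to 1.
    diag_term = np.mean((1.0 - sim[eye]) ** 2)
    # Assumed off-diagonal term: mismatched pairs should have dot product close to 0.
    off_term = np.mean(sim[~eye] ** 2) if n > 1 else 0.0
    return diag_term + off_term

# Perfectly aligned, mutually orthogonal pairs give zero loss under this reading.
perfect = np.eye(3)
print(dot_only_loss(perfect, perfect))
```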

	Training Configuration:
	- GPUs: 4
	- Batch Size per GPU: 16
	- Gradient Accumulation: 4
	- Effective Batch Size: 256
	- Learning Rate: 2e-05
	- Warmup Steps: 100
	- Pooling Strategy: mean
	- Epochs: 1 (full dataset pass)
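
The effective batch size follows from the other settings in this list. The short calculation below reproduces it and also estimates the number of optimizer steps in one epoch; the step count assumes the final partial batch is kept, which the card does not state.

```python
import math

# Values copied from the training configuration above.
num_gpus = 4
per_gpu_batch_size = 16
grad_accum_steps = 4
num_training_pairs = 26_473_900

# Pairs consumed per optimizer update across all devices.
effective_batch_size = num_gpus * per_gpu_batch_size * grad_accum_steps

# Assumption: the last partial batch is kept, so round up.
optimizer_steps = math.ceil(num_training_pairs / effective_batch_size)

print(effective_batch_size)  # 256
print(optimizer_steps)       # 103414
```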

	Training Command:
	```bash
	python scripts/soda-vec-train.py --config dot_only --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
	```

	### Model Architecture

	- Base Architecture: ModernBERT-base (22 layers, 768 hidden size)
	- Pooling: Mean pooling over token embeddings
	- Output Dimension: 768
	- Normalization: L2-normalized embeddings

	## Usage

	### Using Sentence-Transformers

	```python
	from sentence_transformers import SentenceTransformer

	# Load the model
	model = SentenceTransformer("EMBO/dot_only")

	# Encode sentences
	sentences = [
	    "CRISPR-Cas9 gene editing in human cells",
	    "Genome editing using CRISPR technology",
	]

	embeddings = model.encode(sentences)
	print(f"Embedding shape: {embeddings.shape}")

	# Compute similarity
	from sentence_transformers.util import cos_sim
	similarity = cos_sim(embeddings[0], embeddings[1])
	print(f"Similarity: {similarity.item():.4f}")
	```

	### Using Hugging Face Transformers

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch
	import torch.nn.functional as F

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("EMBO/dot_only")
	model = AutoModel.from_pretrained("EMBO/dot_only")

	# Encode sentences
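	sentences = [
	    "CRISPR-Cas9 gene editing in human cells",
	    "Genome editing using CRISPR technology",
	]

	inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)

	# Mean pooling over real tokens only; the attention mask excludes padding
	mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
	embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

	# L2-normalize, matching the normalized embeddings used in training
	embeddings = F.normalize(embeddings, p=2, dim=1)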

	# Compute similarity
	similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
	print(f"Similarity: {similarity.item():.4f}")
	```
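
When sentences in a batch have different lengths, the tokenizer pads the shorter ones, and mean pooling is normally computed over real tokens only. The toy NumPy sketch below shows how much a single padded position can shift a plain average; `masked_mean_pool` is an illustrative helper name, not part of any library.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: one sequence, three positions, the last one is padding.
tokens = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(masked_mean_pool(tokens, mask))  # [[2. 2.]]
print(tokens.mean(axis=1))             # plain mean, padding position included
```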

	<!-- ## Evaluation

	The model has been evaluated on comprehensive biomedical benchmarks including:

	- Journal-Category Classification: Matching journals to BioRxiv subject categories
	- Title-Abstract Similarity: Discriminating between related and unrelated paper pairs
	- Field-Specific Separability: Distinguishing between different biological fields
	- Semantic Search: Retrieval quality on biomedical text corpora

	For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/EMBO/soda-vec).
	-->
	## Intended Use

	This model is designed for:

	- Biomedical Semantic Search: Finding relevant papers, abstracts, or text passages
	- Scientific Text Similarity: Computing similarity between biomedical texts
	<!-- - Information Retrieval: Building search systems for scientific literature
	- Downstream Tasks: As a base for fine-tuning on specific biomedical tasks
	- Research Applications: Academic and research use in life sciences
	-->
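
For semantic search, a common pattern is to embed a query and a set of candidate texts, then rank the candidates by cosine similarity. The sketch below uses small hand-written vectors in place of real embeddings so that it runs without downloading the model; `rank_by_cosine` is a hypothetical helper, and in practice the arrays would come from `model.encode(...)`.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    order = np.argsort(-scores)[:top_k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins for embeddings (real ones would be 768-dimensional).
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.0])

for index, score in rank_by_cosine(query_vec, doc_vecs):
    print(index, round(score, 3))
```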
	## Limitations

	- Domain Specificity: Optimized for biomedical and life sciences text; may not perform as well on general domain text
	- Language: English only
	- Text Length: Optimized for titles and abstracts; longer documents may require chunking
	- Bias: Inherits biases from the training data (PubMed Central corpus)
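
For documents longer than an abstract, one simple chunking approach is to split the text into overlapping word windows and embed each window separately. The helper below is a minimal sketch; the name `chunk_words` and the default window sizes are illustrative assumptions, and word counts are only a rough proxy for the model's token limit.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=2))
```

Each chunk can then be encoded on its own, and the chunk embeddings either searched individually or averaged into a single document vector, depending on the application.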

	## Citation

	If you use this model, please cite:

	```bibtex
	@software{soda_vec,
	title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
	author = {EMBO},
	year = {2024},
	url = {https://github.com/source-data/soda-vec}
	}
	```

	## Model Card Contact

	For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/EMBO/soda-vec).

	---

	Model Card Generated: 2025-11-10