---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- biomedical
- embeddings
- life-sciences
- scientific-text
- SODA-VEC
- EMBO
datasets:
- EMBO/soda-vec-data-full_pmc_title_abstract_paired
metrics:
- cosine-similarity
---

# Dot Only Model

## Model Description

SODA-VEC embedding model trained with a dot product loss only. It learns biomedical text representations through contrastive learning on normalized embeddings, with no additional loss terms.

This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

**Key Features:**
- Trained on **26.5M biomedical title-abstract pairs** from PubMed Central
- Based on the **ModernBERT-base** architecture
- Optimized for **biomedical text similarity** and **semantic search**
- Produces **768-dimensional embeddings** with mean pooling

## Training Details

### Training Data

- **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired)
- **Size**: 26,473,900 training pairs
- **Source**: Complete PubMed Central baseline (July 2024)
- **Format**: Paired title-abstract examples optimized for contrastive learning

### Training Procedure

**Loss Function**: Dot Only, a dot product loss on normalized embeddings with diagonal and off-diagonal terms

**Coefficients**: dot=1.0

**Base Model**: `answerdotai/ModernBERT-base`

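
This card does not spell out the exact formula, so the following is only one plausible reading of "diagonal + off-diagonal": title and abstract embeddings are L2-normalized, all pairwise dot products in a batch are computed, matched pairs (the diagonal) are pushed toward 1, and mismatched pairs (the off-diagonal) toward 0. The squared-error form, the function names, and the equal weighting of both terms are assumptions made for illustration; this is not the project's actual training code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot_only_loss(titles, abstracts, coeff_dot=1.0):
    """Illustrative dot-product loss over a batch of paired embeddings.

    Assumed form: mean squared distance of diagonal dot products from 1
    plus mean squared off-diagonal dot products, scaled by coeff_dot.
    """
    t = [l2_normalize(v) for v in titles]
    a = [l2_normalize(v) for v in abstracts]
    n = len(t)
    # sim[i][j] is the dot product of title i with abstract j.
    sim = [[sum(x * y for x, y in zip(ti, aj)) for aj in a] for ti in t]
    diagonal = sum((sim[i][i] - 1.0) ** 2 for i in range(n)) / n
    off_pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    off_diagonal = sum(sim[i][j] ** 2 for i, j in off_pairs) / max(len(off_pairs), 1)
    return coeff_dot * (diagonal + off_diagonal)

# Perfectly aligned pairs with orthogonal negatives give zero loss.
aligned = dot_only_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
# Swapping the abstracts misaligns every pair and raises the loss.
swapped = dot_only_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned, swapped)
```

Under this assumed form, the loss is zero only when each title matches its own abstract exactly and is orthogonal to every other abstract in the batch.
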
**Training Configuration:**
- **GPUs**: 4
- **Batch Size per GPU**: 16
- **Gradient Accumulation**: 4
- **Effective Batch Size**: 256
- **Learning Rate**: 2e-05
- **Warmup Steps**: 100
- **Pooling Strategy**: mean
- **Epochs**: 1 (full dataset pass)

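
The effective batch size follows from the GPU count, per-GPU batch size, and gradient accumulation listed above, and the dataset size then gives a rough count of optimizer steps for one epoch. The sketch below only restates that arithmetic. Whether the trainer keeps or drops the final partial batch is not stated in this card, so the step count is an approximation, and the variable names are illustrative.

```python
import math

num_gpus = 4
batch_size_per_gpu = 16
gradient_accumulation = 4
training_pairs = 26_473_900
warmup_steps = 100

# Pairs consumed per optimizer step across all GPUs.
effective_batch_size = num_gpus * batch_size_per_gpu * gradient_accumulation

# Optimizer steps for one pass, assuming the last partial batch is kept.
steps_per_epoch = math.ceil(training_pairs / effective_batch_size)

# Share of training spent in learning-rate warmup.
warmup_fraction = warmup_steps / steps_per_epoch

print(effective_batch_size)  # 256
print(steps_per_epoch)       # 103414
print(f"{warmup_fraction:.4%}")
```
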
**Training Command:**
```bash
python scripts/soda-vec-train.py --config dot_only --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
```

### Model Architecture

- **Base Architecture**: ModernBERT-base (22 layers, 768 hidden size)
- **Pooling**: Mean pooling over token embeddings
- **Output Dimension**: 768
- **Normalization**: L2-normalized embeddings

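
To make the pooling and normalization steps concrete, here is a minimal pure-Python sketch on toy numbers. It averages token vectors while skipping padding positions, then rescales the result to unit length, at which point the dot product of two embeddings equals their cosine similarity. The helper names and the tiny 2-dimensional vectors are illustrative assumptions; the real model works on 768-dimensional tensors.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three token vectors; the last position is padding and must be ignored.
tokens = [[2.0, 0.0], [4.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)   # [3.0, 4.0]
embedding = l2_normalize(pooled)          # [0.6, 0.8]
length = math.sqrt(sum(x * x for x in embedding))
print(pooled, embedding, length)
```
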
## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute cosine similarity between the two sentence embeddings
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/dot_only")
model = AutoModel.from_pretrained("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over real tokens only (padding positions are masked out)
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize the embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")
```

<!-- ## Evaluation

The model has been evaluated on comprehensive biomedical benchmarks including:

- **Journal-Category Classification**: Matching journals to BioRxiv subject categories
- **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs
- **Field-Specific Separability**: Distinguishing between different biological fields
- **Semantic Search**: Retrieval quality on biomedical text corpora

For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/EMBO/soda-vec).
-->

## Intended Use

This model is designed for:

- **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages
- **Scientific Text Similarity**: Computing similarity between biomedical texts
<!-- - **Information Retrieval**: Building search systems for scientific literature
- **Downstream Tasks**: As a base for fine-tuning on specific biomedical tasks
- **Research Applications**: Academic and research use in life sciences
-->

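
As a sketch of the semantic search use case, the snippet below ranks candidate texts against a query by cosine similarity. It works on plain lists of floats so it runs without downloading the model; in practice the vectors would come from `model.encode(...)` as in the Usage section. The function name `rank_by_similarity` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (corpus_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for embeddings of a query and three abstracts.
query = [1.0, 0.0, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # unrelated
    [0.9, 0.1, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]

results = rank_by_similarity(query, corpus, top_k=2)
print([index for index, _ in results])  # [1, 2]
```
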
## Limitations

- **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general-domain text
- **Language**: English only
- **Text Length**: Optimized for titles and abstracts; longer documents may require chunking
- **Bias**: Inherits biases from the training data (PubMed Central corpus)

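
For documents longer than an abstract, one common workaround is to split the text into overlapping windows and embed each window separately. The sketch below shows only the splitting step, using whitespace-separated words as a rough proxy for tokens. The function name, window size, and overlap are illustrative assumptions, not values recommended by the SODA-VEC project.

```python
def chunk_words(text, window=200, overlap=50):
    """Split text into overlapping chunks of at most `window` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once the current window has reached the end of the text.
        if start + window >= len(words):
            break
    return chunks

# Ten placeholder words, windows of four words, overlapping by two.
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_text, window=4, overlap=2)
print(demo_chunks)
```

Each chunk can then be passed to `model.encode`, and the resulting vectors either averaged into one document embedding or indexed separately for passage-level search.
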
## Citation

If you use this model, please cite:

```bibtex
@software{soda_vec,
  title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings},
  author = {EMBO},
  year = {2024},
  url = {https://github.com/source-data/soda-vec}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/EMBO/soda-vec).

---

**Model Card Generated**: 2025-11-10