MOSAIC-embed-biomed

A biomedical sentence embedding model trained using the MOSAIC framework (Masked Objective with Selective Adaptation for In-domain Contrastive Learning).

This model is optimized for biomedical and clinical text, including PubMed abstracts, clinical notes, and scientific literature.

📄 Paper: MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning (EACL 2026, Findings)

💻 Training Code: github.com/rttl-ai/mosaic

Usage

With Sentence Transformers

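from sentence_transformers import SentenceTransformer

# trust_remote_code=True allows loading the custom model code shipped in the model repository.
model = SentenceTransformer("rttl-ai/MOSAIC-embed-biomed", trust_remote_code=True)

# Each input starts with a task prefix (see Task Prefixes).
sentences = [
    "search_document: Metformin is a first-line treatment for type 2 diabetes.",
    "search_query: What medications treat diabetes?",
]

# encode() returns one embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (number of sentences, embedding dimension)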
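
Once queries and documents are embedded, retrieval usually comes down to ranking documents by cosine similarity to the query. The following is a minimal sketch of that step in plain Python, not part of the model's own tooling. The toy 3-dimensional vectors stand in for the 768-dimensional embeddings, and `cosine_similarity` and `rank_documents` are illustrative helper names chosen here, not library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [
        (i, cosine_similarity(query_embedding, doc))
        for i, doc in enumerate(document_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption: illustrative only).
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query, documents)
print(ranking)
```

With the snippet above, `embeddings[1]` (the `search_query:` input) would play the role of `query` and `[embeddings[0]]` the role of `documents`.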

With Transformers

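import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over the sequence dimension and divide by the number of real tokens;
    # the clamp guards against division by zero.
    summed = torch.sum(token_embeddings * input_mask_expanded, 1)
    counts = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return summed / counts

tokenizer = AutoTokenizer.from_pretrained("rttl-ai/MOSAIC-embed-biomed")
model = AutoModel.from_pretrained("rttl-ai/MOSAIC-embed-biomed", trust_remote_code=True)

# Inputs carry a task prefix (see Task Prefixes).
sentences = ["search_query: What causes Alzheimer's disease?"]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Pool token embeddings into one vector per sentence, then L2-normalize
# so that dot products between embeddings equal cosine similarities.
embeddings = mean_pooling(outputs, inputs["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)  # (number of sentences, embedding dimension)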
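
To make the pooling step concrete, here is a small worked sketch in plain Python that mirrors what `mean_pooling` followed by `F.normalize` does for a single sentence. The 2-dimensional token vectors and the helper names `masked_mean_pool` and `l2_normalize` are illustrative assumptions; real inputs are tensors of model-sized vectors.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                summed[j] += vector[j]
    # Mirrors the clamp in mean_pooling: avoid dividing by zero
    # when every position is padding.
    count = max(count, 1e-9)
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

The padding vector `[100.0, 100.0]` has no effect on `pooled`, which is why the attention mask must be passed to the pooling function rather than averaging over all positions.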

Task Prefixes

Prepend the prefix that matches your task to every input text; the model expects these prefixes for optimal performance:

| Task | Prefix | Example |
|---|---|---|
| Document embedding | `search_document:` | `search_document: Aspirin inhibits platelet aggregation.` |
| Query embedding | `search_query:` | `search_query: How does aspirin work?` |
| Clustering | `clustering:` | `clustering: cardiac arrest treatment protocols` |
| Classification | `classification:` | `classification: The patient presents with fever and cough.` |
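
Adding prefixes by hand is easy to get wrong, so a small helper can keep them consistent. This is a minimal sketch: the prefix strings come from the table above, while the dictionary keys (`"document"`, `"query"`, and so on) and the function name `add_task_prefix` are hypothetical names chosen for illustration.

```python
# Prefix strings as listed in the Task Prefixes table; the keys are
# illustrative names, not identifiers defined by the model.
TASK_PREFIXES = {
    "document": "search_document: ",
    "query": "search_query: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}


def add_task_prefix(texts, task):
    """Return a new list with the prefix for `task` prepended to each text."""
    if task not in TASK_PREFIXES:
        known = ", ".join(sorted(TASK_PREFIXES))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}")
    prefix = TASK_PREFIXES[task]
    return [prefix + text.strip() for text in texts]


queries = add_task_prefix(["How does aspirin work?"], "query")
docs = add_task_prefix(["Aspirin inhibits platelet aggregation."], "document")
print(queries)
print(docs)
```

The resulting lists can be passed directly to `model.encode` or to the tokenizer in the usage examples.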

Model Details

Architecture: NomicBERT (based on nomic-embed)
Embedding Dimension: 768
Max Sequence Length: 256
Training: Joint contrastive + domain-restricted MLM
Domain: Biomedical / Clinical
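
Given the 256-token maximum sequence length, longer inputs such as full clinical notes may need to be split before embedding. The sketch below shows one possible approach, overlapping word windows, under stated assumptions: it counts whitespace-separated words rather than model tokens, and the defaults `max_words=128` and `overlap=16` are hypothetical values picked to leave headroom, since subword tokenization usually produces more tokens than words and the task prefix also consumes tokens. `chunk_words` is an illustrative helper, not part of the model or its libraries.

```python
def chunk_words(text, max_words=128, overlap=16):
    """Split text into overlapping windows of at most `max_words` words.

    Word counts only approximate token counts; verify the real length
    with the model's tokenizer before relying on these windows.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny demonstration with a 10-word text and small windows.
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_text, max_words=4, overlap=1)
print(demo_chunks)
```

Each chunk would then receive the `search_document: ` prefix and be embedded separately; how to combine or index the per-chunk embeddings is left to the application.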

Citation

@inproceedings{mosaic2026,
  title={MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning},
  author={Pavlova, Vera and ...},
  booktitle={Findings of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2026}
}

License

Apache 2.0

Model Files

Format: Safetensors
Model size: 0.1B params
Tensor type: F32

Paper: MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning (2510.16797, published Oct 19, 2025)