MOSAIC-embed-biomed

A biomedical sentence embedding model trained using the MOSAIC framework (Masked Objective with Selective Adaptation for In-domain Contrastive Learning).

This model is optimized for biomedical and clinical text, including PubMed abstracts, clinical notes, and scientific literature.

📄 Paper: MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning (EACL 2026, Findings)

💻 Training Code: github.com/rttl-ai/mosaic


Usage

With Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("rttl-ai/MOSAIC-embed-biomed", trust_remote_code=True)

sentences = [
    "search_document: Metformin is a first-line treatment for type 2 diabetes.",
    "search_query: What medications treat diabetes?"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
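
The embeddings are typically compared with cosine similarity, for example to rank candidate documents against a query. The following is a minimal sketch of that step, not code from the MOSAIC repository. The helper names `cosine` and `rank_documents` are hypothetical, and `model` is assumed to be any object with an `encode` method that returns one vector per input string, such as the SentenceTransformer loaded above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (lists or NumPy arrays)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(model, query, documents):
    """Return (score, document) pairs sorted from most to least similar.

    `model` is assumed to expose encode(list_of_str) and return one vector
    per input string.
    """
    # The query and the documents get different prefixes, as in the example above.
    query_vec = model.encode(["search_query: " + query])[0]
    doc_vecs = model.encode(["search_document: " + d for d in documents])
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    # Sort on the score only, so ties never fall back to comparing the texts.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_documents(model, "What medications treat diabetes?", docs)` with a list of unprefixed document strings returns the documents ordered by similarity to the query. For large collections, a vectorized matrix product or a vector index would be a better fit than this per-pair loop.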

With Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

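def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average over real tokens only; the clamp guards against division by zero.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)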

tokenizer = AutoTokenizer.from_pretrained("rttl-ai/MOSAIC-embed-biomed")
model = AutoModel.from_pretrained("rttl-ai/MOSAIC-embed-biomed", trust_remote_code=True)

sentences = ["search_query: What causes Alzheimer's disease?"]

# Truncate to the 256-token maximum listed under Model Details.
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = mean_pooling(outputs, inputs["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)

Task Prefixes

For best results, start every input with the task-specific prefix that matches your use case:

Task                Prefix            Example
Document embedding  search_document:  search_document: Aspirin inhibits platelet aggregation.
Query embedding     search_query:     search_query: How does aspirin work?
Clustering          clustering:       clustering: cardiac arrest treatment protocols
Classification      classification:   classification: The patient presents with fever and cough.
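
Adding the prefix by hand is easy to forget, so a small helper can make it explicit. The sketch below is one possible way to do this and is not part of the released code. The prefix strings are copied from the table above, with the trailing space taken from its examples; the dictionary keys and the function name `add_prefix` are hypothetical.

```python
# Prefix strings follow the table above; the keys are hypothetical short
# names chosen for this sketch.
TASK_PREFIXES = {
    "document": "search_document: ",
    "query": "search_query: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}


def add_prefix(texts, task):
    """Prepend the prefix for `task` to each text, leaving already-prefixed texts alone."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        ) from None
    return [t if t.startswith(prefix) else prefix + t for t in texts]
```

For example, `add_prefix(["How does aspirin work?"], "query")` yields `["search_query: How does aspirin work?"]`, which can be passed directly to `model.encode`.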

Model Details

  • Architecture: NomicBERT (based on nomic-embed)
  • Embedding Dimension: 768
  • Max Sequence Length: 256
  • Training: Joint contrastive + domain-restricted MLM
  • Domain: Biomedical / Clinical
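
With a maximum sequence length of 256 tokens, longer texts such as full clinical notes will not fit in a single input. One common workaround, which this card does not describe or endorse, is to split the token ids into overlapping windows and embed each window separately. The sketch below shows only the splitting step. The function name and the `window_size` and `overlap` defaults are assumptions; 240 is an illustrative budget that leaves room for special tokens and a task prefix.

```python
def split_into_windows(token_ids, window_size=240, overlap=32):
    """Split a list of token ids into overlapping windows of at most `window_size`.

    The defaults are assumptions for illustration: 240 leaves headroom under
    the 256-token limit for special tokens and a task prefix.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows
```

The token ids could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, and each window can be turned back into text with `tokenizer.decode(window)` before the task prefix is added. How to combine the per-window embeddings (for example by averaging) is a separate design choice that depends on the application.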

Citation

@inproceedings{mosaic2026,
  title={MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning},
  author={Pavlova, Vera and ...},
  booktitle={Findings of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2026}
}

License

Apache 2.0
