---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- en
base_model:
- cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
---

# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:

- abbreviations (e.g., "NAD", "DM")
- misspellings
- shorthand / telegraphic clinical text
- surface variation in real-world clinical notes

---

## What This Model Is

NA-SapBERT is **only an encoder**.
It maps input text to 768-dimensional, L2-normalized embedding vectors.

It does NOT include:

- retrieval logic
- a FAISS index
- exact-match lookup
- rewrite modules
- reranking

These belong to downstream pipelines.

---

## Key Idea

The model is trained with contrastive learning to align:

- noisy clinical mentions
- clean ontology concept names and synonyms

Aligning the two pulls a noisy mention toward the clean names of the same concept in embedding space, which improves robustness to surface variation and semantic consistency.

---

## Model Architecture

- Backbone: PubMedBERT (the card metadata lists `cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token` as the base model)
- Pooling: mean pooling (attention-mask aware)
- Output: 768-dimensional, L2-normalized embeddings
- Max sequence length: 32 tokens (optimized for short clinical mentions)
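
The pooling and normalization steps listed above can be written out explicitly. The notation is introduced here for illustration and does not come from the original card: $h_t$ is the last-layer hidden state of token $t$, $m_t \in \{0, 1\}$ is its attention-mask value (0 for padding), and $T$ is the padded sequence length (at most 32).

$$
\bar{h} = \frac{\sum_{t=1}^{T} m_t \, h_t}{\sum_{t=1}^{T} m_t},
\qquad
e = \frac{\bar{h}}{\lVert \bar{h} \rVert_2} \in \mathbb{R}^{768}
$$

Because every $e$ has unit length, the inner product $e_a^\top e_b$ of two embeddings equals their cosine similarity.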

---

## Training Summary

- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
- Data:
  - SNOMED CT concepts (subset of key semantic types)
  - synthetic noisy variants (LLM + abbreviation-based)

Training pairs:

- clean → clean
- noisy → clean
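
To make the objective concrete, here is a minimal NumPy sketch of what an InfoNCE-style loss with in-batch negatives computes. It is not the training code behind this model: the function names, the toy data, and the `scale=20.0` temperature are assumptions made for illustration, since the card does not state the value used.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def in_batch_contrastive_loss(anchors: np.ndarray,
                              positives: np.ndarray,
                              scale: float = 20.0) -> float:
    """InfoNCE-style loss with in-batch negatives.

    Row i of `anchors` (e.g. a noisy mention) is paired with row i of
    `positives` (its clean concept name); every other row in the batch
    acts as a negative. `scale` is an assumed value, not one taken from
    the model card.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = scale * (a @ p.T)                    # (batch, batch) scaled cosine scores
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # the correct pair sits on the diagonal


# Toy vectors standing in for encoder outputs (hypothetical, not real embeddings).
rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 16))
noisy_aligned = clean + 0.05 * rng.normal(size=clean.shape)  # close to its clean pair
noisy_random = rng.normal(size=clean.shape)                   # unrelated to its pair

loss_aligned = in_batch_contrastive_loss(noisy_aligned, clean)
loss_random = in_batch_contrastive_loss(noisy_random, clean)
print(f"aligned pairs: {loss_aligned:.3f}  unrelated pairs: {loss_random:.3f}")
```

The loss is low when each noisy vector already sits next to its own clean vector and high when the pairing carries no signal, which is the pressure that drives noisy → clean pairs together during training.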

---

## Usage (Recommended)

Use Hugging Face Transformers with custom mean pooling, as in the example below.

### Encoding Example

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel


class Encoder:
    def __init__(self, model_name, device="cuda", max_length=32):
        # Fall back to CPU when CUDA is requested but not available.
        if device == "cuda" and not torch.cuda.is_available():
            device = "cpu"
        self.device = device
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt",
                )
                tokens = {k: v.to(self.device) for k, v in tokens.items()}
                out = self.model(**tokens)
                hidden = out.last_hidden_state  # (batch, seq_len, hidden_size)
                # Mean pooling over real tokens only; padding is masked out.
                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
                # IMPORTANT: normalize embeddings
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
                all_vecs.append(pooled.cpu().numpy())
        if not all_vecs:
            # No input texts: return an empty (0, hidden_size) array.
            return np.zeros((0, self.model.config.hidden_size), dtype="float32")
        return np.vstack(all_vecs).astype("float32")


# Example (downloads the model from the Hugging Face Hub):
# encoder = Encoder("Tao-AI-Informatics/NA-SapBERT")
# vecs = encoder.encode(["DM", "NAD"])  # shape (2, 768)
```

---

## Important Notes

- Mean pooling is required (the CLS token is NOT used)
- L2 normalization is critical for similarity search: with unit-length vectors, the inner product equals cosine similarity
- Designed for short clinical mentions (`max_length=32`); longer inputs are truncated
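
The following toy example illustrates why the normalization note matters. The 2-D vectors are hypothetical values chosen to make the effect visible; they are not model outputs.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Hypothetical 2-D vectors, not real embeddings.
query = np.array([[1.0, 0.0]])
candidates = np.array([
    [0.9, 0.1],   # almost the same direction as the query, small norm
    [3.0, 3.0],   # 45 degrees away from the query, large norm
])

# Raw inner product rewards vector length, not just direction.
raw_scores = (query @ candidates.T)[0]
# After L2 normalization the inner product is the cosine similarity.
cos_scores = (l2_normalize(query) @ l2_normalize(candidates).T)[0]

raw_best = int(np.argmax(raw_scores))
cos_best = int(np.argmax(cos_scores))
print("best by raw inner product:", raw_best)
print("best by cosine similarity:", cos_best)
```

Without normalization the long vector wins even though it points in a different direction; with normalization the ranking depends on direction only, which is what an inner-product index needs in order to behave like cosine search.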

---

## Intended Use

This model is intended for:

- clinical concept normalization pipelines
- dense retrieval over medical ontologies (SNOMED CT, UMLS)
- embedding generation for biomedical text
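
As a rough sketch of the dense-retrieval use case, the snippet below ranks concepts by inner product over L2-normalized vectors using brute-force NumPy. The `top_k` helper, the concept names, and the hand-made 3-D vectors are all hypothetical; in a real pipeline the vectors would come from the encoder, and an approximate-nearest-neighbour index would typically replace the full matrix product for large ontologies.

```python
import numpy as np


def top_k(query_vecs: np.ndarray, concept_vecs: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k most similar concepts per query.

    Both inputs are assumed to be L2-normalized arrays of shape (n, dim),
    so the inner product is the cosine similarity.
    """
    scores = query_vecs @ concept_vecs.T            # (n_queries, n_concepts)
    k = min(k, concept_vecs.shape[0])
    idx = np.argsort(-scores, axis=1)[:, :k]        # highest score first
    return idx, np.take_along_axis(scores, idx, axis=1)


# Hypothetical concept names with hand-made unit vectors (not model outputs).
concept_names = ["diabetes mellitus", "hypertension", "asthma"]
concept_vecs = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]], dtype="float32")

# Hypothetical unit-norm query vector, e.g. standing in for an encoded mention.
query = np.array([[0.8, 0.6, 0.0]], dtype="float32")

idx, scores = top_k(query, concept_vecs, k=2)
ranked = [concept_names[i] for i in idx[0]]
print(ranked, scores[0])
```

The encoder only supplies the vectors; candidate filtering, exact-match shortcuts, and reranking remain the job of the surrounding pipeline.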

---

## Not Intended For

- general-purpose sentence similarity
- long document encoding
- non-biomedical domains

---

## Limitations

- Does not encode:
  - negation
  - temporality
  - broader context
- Abbreviations remain ambiguous without external context
- Performance depends on the downstream retrieval pipeline