---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- en
base_model:
- cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
---

# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:

- abbreviations (e.g., "NAD", "DM")
- misspellings
- shorthand / telegraphic clinical text
- surface variation in real-world clinical notes

---

## What This Model Is

NA-SapBERT is **only an encoder**. It maps input text → 768-dimensional normalized embedding vectors.

It does NOT include:

- retrieval logic
- FAISS index
- exact match
- rewrite modules
- reranking

These belong to downstream pipelines.

---

## Key Idea

The model is trained using contrastive learning to align:

- noisy clinical mentions
- clean ontology concept names and synonyms

This improves embedding robustness and semantic consistency.

---

## Model Architecture

- Backbone: PubMedBERT
- Pooling: Mean pooling (attention-mask aware)
- Output: 768-dim normalized embeddings
- Max sequence length: 32 (optimized for short clinical mentions)

---

## Training Summary

- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
- Data:
  - SNOMED CT concepts (subset of key semantic types)
  - synthetic noisy variants (LLM + abbreviation-based)

Training pairs:

- clean → clean
- noisy → clean

---

## Usage (Recommended)

Use with Hugging Face Transformers + custom pooling.
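
The custom pooling referred to here is attention-mask-aware mean pooling followed by L2 normalization. As a minimal sketch of that arithmetic only, the snippet below reproduces it in NumPy on a toy array. The function name `masked_mean_pool` and the toy values are illustrative assumptions, not part of the model's code.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over non-padding tokens, then L2-normalize.

    hidden:         (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                    # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # guard against division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Toy batch (hypothetical values): one sequence of 3 positions, the last is padding.
hidden = np.array([[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])

vec = masked_mean_pool(hidden, attention_mask)
print(vec)  # the padded position [9, 9] does not influence the result
```

Averaging only over real tokens matters because short mentions are padded to the longest item in the batch; without the mask, the same mention could receive a different vector depending on what else is in the batch.
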
### Encoding Example

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer


class Encoder:
    def __init__(self, model_name, device="cuda", max_length=32):
        self.device = device
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # .to(device) also covers "cpu" and indexed devices such as "cuda:0"
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt",
                ).to(self.device)

                hidden = self.model(**tokens).last_hidden_state

                # Attention-mask-aware mean pooling: padding tokens are excluded
                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

                # IMPORTANT: normalize embeddings
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
                all_vecs.append(pooled.cpu().numpy())

        return np.vstack(all_vecs).astype("float32")
```

---

## Important Notes

- Mean pooling is required (CLS token is NOT used)
- L2 normalization is critical for similarity search
- Designed for short clinical mentions (max_length=32); longer inputs are truncated

---

## Intended Use

This model is intended for:

- clinical concept normalization pipelines
- dense retrieval over medical ontologies (SNOMED CT, UMLS)
- embedding generation for biomedical text

---

## Not Intended For

- general-purpose sentence similarity
- long document encoding
- non-biomedical domains

---

## Limitations

- Does not encode:
  - negation
  - temporality
  - broader context
- Abbreviations remain ambiguous without external context
- End-to-end performance depends on the downstream retrieval pipeline
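
Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is what a downstream retrieval step would typically rank by. The following is a minimal sketch of that ranking step, using hand-made 3-dimensional vectors in place of real 768-dimensional embeddings. The concept names, the vectors, and the helpers `l2_normalize` and `top_k` are hypothetical and only illustrate the idea; they are not a retrieval pipeline shipped with this model.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top_k(query_vecs, concept_vecs, k=2):
    """Return (indices, scores) of the k most similar concepts per query."""
    scores = query_vecs @ concept_vecs.T  # equals cosine similarity for unit vectors
    idx = np.argsort(-scores, axis=1)[:, :k]
    return idx, np.take_along_axis(scores, idx, axis=1)


# Hypothetical stand-ins for encoded ontology names.
concept_names = ["diabetes mellitus", "hypertension", "asthma"]
concept_vecs = l2_normalize(np.array([[1.0, 0.1, 0.0],
                                      [0.0, 1.0, 0.1],
                                      [0.1, 0.0, 1.0]], dtype="float32"))

# Hypothetical stand-in for an encoded noisy mention such as "DM".
query_vecs = l2_normalize(np.array([[0.9, 0.2, 0.0]], dtype="float32"))

idx, scores = top_k(query_vecs, concept_vecs, k=2)
for rank, (i, s) in enumerate(zip(idx[0], scores[0]), start=1):
    print(rank, concept_names[i], round(float(s), 3))
```

For large ontologies, an exhaustive dot product like this would usually be replaced by an approximate nearest-neighbor index; that choice belongs to the downstream pipeline rather than to the encoder.
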