---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- en
base_model:
- cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
---
# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization
NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.
This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
- abbreviations (e.g., "NAD", "DM")
- misspellings
- shorthand / telegraphic clinical text
- surface variation in real-world clinical notes
---
## What This Model Is
NA-SapBERT is **only an encoder**.
It maps input text to 768-dimensional, L2-normalized embedding vectors.
It does NOT include:
- retrieval logic
- FAISS index
- exact match
- rewrite modules
- reranking
These belong to downstream pipelines.
---
## Key Idea
The model is trained using contrastive learning to align:
- noisy clinical mentions
- clean ontology concept names and synonyms
Pulling each noisy mention toward its clean concept name in embedding space makes the embeddings more robust to surface noise and more semantically consistent.
---
## Model Architecture
- Backbone: PubMedBERT (initialized from `cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token`)
- Pooling: Mean pooling (attention-mask aware)
- Output: 768-dim normalized embeddings
- Max sequence length: 32 (optimized for short clinical mentions)
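
To illustrate what "attention-mask aware" mean pooling means, here is a minimal NumPy sketch on toy values. The function name `masked_mean_pool` and the toy tensors are illustrative assumptions, not part of the model's code; the point is only that padding positions are excluded from the average.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real tokens only, ignoring padding.

    hidden: (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # guard against division by zero
    return summed / counts

# Toy input: one sequence with 2 real tokens followed by 2 padding positions.
hidden = np.array([[[1.0, 3.0], [3.0, 1.0], [9.0, 9.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0, 0]])

masked = masked_mean_pool(hidden, attention_mask)  # uses the 2 real tokens only
naive = hidden.mean(axis=1)                        # wrongly includes padding vectors
print(masked, naive)
```
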
---
## Training Summary
- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
- Data:
  - SNOMED CT concepts (subset of key semantic types)
  - synthetic noisy variants (LLM + abbreviation-based)
Training pairs:
- clean → clean
- noisy → clean
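
The objective above can be sketched in a few lines of NumPy. This is an illustration of an InfoNCE-style loss with in-batch negatives, not the actual training code: the function name, the random toy vectors, and `scale=20.0` (the default in sentence-transformers' `MultipleNegativesRankingLoss`) are assumptions, since the card does not state the training hyperparameters.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """InfoNCE-style loss with in-batch negatives.

    anchors, positives: (batch, dim) arrays; row i of each forms a positive pair.
    Every other row of `positives` acts as a negative for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cosine
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal

# Toy stand-ins for embeddings: "noisy" vectors are small perturbations of "clean" ones.
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))
noisy_aligned = clean + 0.01 * rng.normal(size=(4, 8))
noisy_shuffled = noisy_aligned[::-1]                      # every pair is now mismatched

loss_aligned = in_batch_contrastive_loss(noisy_aligned, clean)
loss_shuffled = in_batch_contrastive_loss(noisy_shuffled, clean)
print(loss_aligned, loss_shuffled)  # the aligned pairing should give the lower loss
```
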
---
## Usage (Recommended)
Load the model with Hugging Face Transformers and apply mean pooling and L2 normalization yourself, as in the example below.
### Encoding Example
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel


class Encoder:
    def __init__(self, model_name, device="cuda", max_length=32):
        self.device = device
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Move the model to the requested device ("cuda", "cpu", ...)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt",
                )
                tokens = {k: v.to(self.device) for k, v in tokens.items()}
                out = self.model(**tokens)
                hidden = out.last_hidden_state
                # Attention-mask-aware mean pooling: average real tokens only
                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
                # IMPORTANT: normalize embeddings
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
                all_vecs.append(pooled.cpu().numpy())
        # Stack all batches into one (n_texts, 768) float32 matrix
        return np.vstack(all_vecs).astype("float32")
```
---
## Important Notes
- Mean pooling is required (CLS token is NOT used)
- L2 normalization is critical for similarity search
- Designed for short clinical mentions (max_length=32)
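
A small sketch of why L2 normalization matters for similarity search: without it, an inner-product search can prefer a vector with a large norm over one that points in nearly the same direction as the query. The 2-dimensional vectors are toy values chosen for illustration, not model outputs.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so that dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

query = np.array([[1.0, 0.0]])
candidates = np.array([
    [0.9, 0.1],   # almost the same direction as the query, small norm
    [5.0, 5.0],   # 45 degrees away from the query, large norm
])

raw_scores = (query @ candidates.T)[0]
cos_scores = (l2_normalize(query) @ l2_normalize(candidates).T)[0]

raw_best = int(np.argmax(raw_scores))   # index 1: the large-norm vector wins
cos_best = int(np.argmax(cos_scores))   # index 0: the closest direction wins
print(raw_best, cos_best)
```
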
---
## Intended Use
This model is intended for:
- clinical concept normalization pipelines
- dense retrieval over medical ontologies (SNOMED CT, UMLS)
- embedding generation for biomedical text
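
As a rough sketch of how the embeddings might be used for dense retrieval, the snippet below does a brute-force top-k lookup over a matrix of concept vectors. Everything here is hypothetical: the function `top_k_concepts`, the concept IDs, and the 2-dimensional unit vectors are stand-ins for real 768-dimensional embeddings, and a real pipeline would typically replace the brute-force dot product with a nearest-neighbour index.

```python
import numpy as np

def top_k_concepts(mention_vecs, concept_vecs, concept_ids, k=2):
    """Return the k highest-scoring (concept_id, score) pairs for each mention.

    Both matrices are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = mention_vecs @ concept_vecs.T        # (n_mentions, n_concepts)
    order = np.argsort(-scores, axis=1)[:, :k]    # best-scoring concepts first
    return [
        [(concept_ids[j], float(scores[i, j])) for j in row]
        for i, row in enumerate(order)
    ]

# Toy unit vectors standing in for real embeddings; the IDs are made up.
concept_ids = ["C001", "C002", "C003"]
concept_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
mention_vecs = np.array([[0.8, 0.6]])

hits = top_k_concepts(mention_vecs, concept_vecs, concept_ids, k=2)
print(hits)
```
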
---
## Not Intended For
- general-purpose sentence similarity
- long document encoding
- non-biomedical domains
---
## Limitations
- Does not encode:
  - negation
  - temporality
  - broader context
- Abbreviations remain ambiguous without external context
- Performance depends on downstream retrieval pipeline