Tao-AI-Informatics/NA-SapBERT

	---
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- transformers
	language:
	- en
	base_model:
	- cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
	---

	# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

	NA-SapBERT is a biomedical sentence embedding model designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

	This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
	- abbreviations (e.g., "NAD", "DM")
	- misspellings
	- shorthand / telegraphic clinical text
	- surface variation in real-world clinical notes

	---

	## What This Model Is

	NA-SapBERT is only an encoder.

	It maps input text to 768-dimensional embedding vectors, which are L2-normalized after pooling.

	It does NOT include:
	- retrieval logic
	- FAISS index
	- exact match
	- rewrite modules
	- reranking

	These belong to downstream pipelines.

	---

	## Key Idea

	The model is trained using contrastive learning to align:

	- noisy clinical mentions
	- clean ontology concept names and synonyms

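	As a result, a noisy mention and the clean name of the same concept land close together in embedding space, which makes the embeddings more robust to surface variation.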

	---

	## Model Architecture

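	- Backbone: PubMedBERT (initialized from `cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token`, the base model listed in the metadata)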
	- Pooling: Mean pooling (attention-mask aware)
	- Output: 768-dim normalized embeddings
	- Max sequence length: 32 (optimized for short clinical mentions)
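
To make "attention-mask aware" concrete, here is a minimal NumPy sketch of masked mean pooling followed by L2 normalization. The function name `masked_mean_pool` and the toy tensors are illustrative assumptions, not part of the released model; the point is only that padded positions must not contribute to the average.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real tokens only, then L2-normalize.

    hidden:         (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                     # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # guard against empty rows
    pooled = summed / counts
    norms = np.clip(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-12, None)
    return pooled / norms

# Toy input (hypothetical values): two real tokens and one padded position, dim = 2.
hidden = np.array([[[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)  # the padded [100, 100] vector has no effect on the result
```

A plain `hidden.mean(axis=1)` would instead be dominated by the padded position, which is why the mask matters when short mentions are batched together.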

	---

	## Training Summary

	- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
	- Data:
	  - SNOMED CT concepts (subset of key semantic types)
	  - synthetic noisy variants (LLM + abbreviation-based)

	Training pairs:
	- clean → clean
	- noisy → clean
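
The objective above can be illustrated with a small NumPy sketch of an in-batch-negatives loss: each anchor (for example a noisy mention) should score highest against its own positive (the clean name), and every other positive in the batch acts as a negative. This is a simplified stand-in for the idea, not the actual training code; the function names, the random stand-in vectors, and the `scale=20.0` value are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for row i is column i."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = scale * (a @ p.T)                              # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Stand-in embeddings (random, purely illustrative).
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 16))
noisy = clean + 0.05 * rng.normal(size=(4, 16))   # each noisy row stays close to its clean row
shuffled = noisy[[1, 2, 3, 0]]                     # deliberately break the pairing

aligned_loss = in_batch_contrastive_loss(noisy, clean)
misaligned_loss = in_batch_contrastive_loss(shuffled, clean)
print(aligned_loss, misaligned_loss)
```

When the pairs line up, the loss is small; when the pairing is broken, it grows, which is the signal that pulls noisy mentions toward their clean concept names during training.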

	---

	## Usage (Recommended)

	Load the model with Hugging Face Transformers and apply mean pooling and L2 normalization yourself, as in the example below.

	### Encoding Example

	```python
	import torch
	import numpy as np
	from transformers import AutoTokenizer, AutoModel


	class Encoder:
	    def __init__(self, model_name, device="cuda", max_length=32):
	        self.device = device
	        self.max_length = max_length

	        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
	        self.model = AutoModel.from_pretrained(model_name)

	        # Move the model to the requested device (e.g. "cuda" or "cpu")
	        self.model = self.model.to(device)
	        self.model.eval()

	    def encode(self, texts, batch_size=256):
	        all_vecs = []

	        with torch.no_grad():
	            for i in range(0, len(texts), batch_size):
	                batch = texts[i:i + batch_size]

	                tokens = self.tokenizer(
	                    batch,
	                    padding=True,
	                    truncation=True,
	                    max_length=self.max_length,
	                    return_tensors="pt",
	                )
	                # Keep the inputs on the same device as the model
	                tokens = {k: v.to(self.device) for k, v in tokens.items()}

	                out = self.model(**tokens)

	                # Mean pooling over real tokens only (padding is masked out)
	                hidden = out.last_hidden_state
	                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
	                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

	                # IMPORTANT: normalize embeddings
	                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

	                all_vecs.append(pooled.cpu().numpy())

	        return np.vstack(all_vecs).astype("float32")
	```

	---

	## Important Notes

	- Mean pooling is required (CLS token is NOT used)
	- L2 normalization is critical for similarity search
	- Designed for short clinical mentions (max_length=32)
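
One way to see why L2 normalization matters for similarity search: for unit-length vectors, the inner product equals cosine similarity, and squared Euclidean distance is `2 - 2 * cosine`, so inner-product and L2 search return the same ranking. The sketch below checks this numerically on random stand-in vectors (assumed data, not model outputs).

```python
import numpy as np

rng = np.random.default_rng(0)
# Random rows with deliberately different lengths (illustrative only).
raw = rng.normal(size=(5, 768)) * rng.uniform(0.5, 5.0, size=(5, 1))
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

query, docs = unit[0], unit[1:]
inner = docs @ query                                   # inner product on unit vectors
cosine = (raw[1:] @ raw[0]) / (
    np.linalg.norm(raw[1:], axis=1) * np.linalg.norm(raw[0])
)                                                      # cosine on the raw vectors
sq_dist = ((docs - query) ** 2).sum(axis=1)            # squared Euclidean distance

same_as_cosine = np.allclose(inner, cosine)
distance_identity = np.allclose(sq_dist, 2.0 - 2.0 * inner)
same_ranking = np.array_equal(np.argsort(-inner), np.argsort(sq_dist))
print(same_as_cosine, distance_identity, same_ranking)
```

Without normalization, vector length leaks into inner-product scores, so longer vectors can outrank closer ones.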

	---

	## Intended Use

	This model is intended for:

	- clinical concept normalization pipelines
	- dense retrieval over medical ontologies (SNOMED CT, UMLS)
	- embedding generation for biomedical text
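
As a rough illustration of the dense-retrieval use case, the sketch below ranks concept vectors against a query by inner product. It is a brute-force stand-in for whatever index a real pipeline would use (for example FAISS), and it is not part of this model: `top_k`, the concept labels, and the random vectors are all hypothetical. In practice both sets of vectors would come from the encoder.

```python
import numpy as np

def top_k(query_vecs, concept_vecs, k=3):
    """Return (indices, scores) of the k most similar concepts per query.

    Both inputs are assumed to be L2-normalized, so the inner product
    equals cosine similarity.
    """
    scores = query_vecs @ concept_vecs.T                 # (n_queries, n_concepts)
    idx = np.argsort(-scores, axis=1)[:, :k]             # best match first
    return idx, np.take_along_axis(scores, idx, axis=1)

# Stand-in data: random unit vectors labeled with hypothetical concept names.
rng = np.random.default_rng(0)
concept_names = ["diabetes mellitus", "hypertension", "asthma", "anemia"]
concept_vecs = rng.normal(size=(4, 768)).astype("float32")
concept_vecs /= np.linalg.norm(concept_vecs, axis=1, keepdims=True)

# Simulate a noisy mention of concept 0 by slightly perturbing its vector.
query = concept_vecs[:1] + 0.01 * rng.normal(size=(1, 768)).astype("float32")
query /= np.linalg.norm(query, axis=1, keepdims=True)

idx, scores = top_k(query, concept_vecs, k=2)
best_match = concept_names[idx[0, 0]]
print(best_match, scores[0])
```

For ontologies the size of SNOMED CT or UMLS, the matrix product would typically be replaced by a dedicated vector index, but the ranking logic stays the same.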

	---

	## Not Intended For

	- general-purpose sentence similarity
	- long document encoding
	- non-biomedical domains

	---

	## Limitations

	- Does not encode:
	  - negation
	  - temporality
	  - broader context
	- Abbreviations remain ambiguous without external context
	- Performance depends on the downstream retrieval pipeline