---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- en
base_model:
- cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
---

# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization

NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.

This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
- abbreviations (e.g., "NAD", "DM")
- misspellings
- shorthand / telegraphic clinical text
- surface variation in real-world clinical notes

---

## What This Model Is

NA-SapBERT is **only an encoder**.

It maps input text to 768-dimensional, L2-normalized embedding vectors.

It does NOT include:
- retrieval logic
- FAISS index
- exact match
- rewrite modules
- reranking

These belong to downstream pipelines.
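
Because retrieval is left to downstream code, here is a minimal sketch of how a pipeline might rank ontology concepts against a query embedding. It assumes all vectors are already L2-normalized, so the inner product equals cosine similarity. The helper name `top_k` and the toy 3-dimensional vectors are hypothetical stand-ins for real 768-dimensional outputs; they are not part of this model.

```python
import numpy as np

def top_k(query_vec, concept_vecs, k=3):
    """Return (indices, scores) of the k concepts most similar to the query.

    Hypothetical helper: assumes every vector is L2-normalized, so the
    dot product is the cosine similarity.
    """
    scores = concept_vecs @ query_vec      # shape: (num_concepts,)
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]        # highest score first
    return order, scores[order]

# Toy stand-ins for encoder outputs (real embeddings are 768-dimensional).
concepts = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.6, 0.8, 0.0]], dtype="float32")
query = np.array([0.8, 0.6, 0.0], dtype="float32")

idx, sims = top_k(query, concepts, k=2)
print(idx, sims)
```

A brute-force scan like this is only practical for small concept sets; for a full ontology, an approximate nearest-neighbor index such as FAISS is the more likely choice.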

---

## Key Idea

The model is trained using contrastive learning to align:

- noisy clinical mentions  
- clean ontology concept names and synonyms  

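As a result, a noisy mention and the clean name of the same concept land close together in embedding space, which makes the embeddings more robust to surface variation.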

---

## Model Architecture

- Backbone: PubMedBERT, initialized from `cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token`
- Pooling: Mean pooling (attention-mask aware)
- Output: 768-dim L2-normalized embeddings
- Max sequence length: 32 tokens (optimized for short clinical mentions)
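
To make the pooling step concrete, the following is a small NumPy sketch of attention-mask-aware mean pooling followed by L2 normalization. It illustrates the operation on a toy 2-dimensional input and is not the model's own code; the function name `masked_mean_pool` is hypothetical.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real tokens only, then L2-normalize.

    hidden:         (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # padded positions contribute zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# One sequence with two real tokens and one padded position.
hidden = np.array([[[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

vec = masked_mean_pool(hidden, mask)
print(vec)
```

The padded position, despite its large values, has no effect on the result: the mean of the two real tokens is `[1.5, 2.0]`, which normalizes to `[0.6, 0.8]`.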

---

## Training Summary

- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
- Data:
  - SNOMED CT concepts (subset of key semantic types)
  - synthetic noisy variants (LLM-generated and abbreviation-based)

Training pairs:
- clean → clean
- noisy → clean
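
The objective can be sketched in a few lines of NumPy. In an InfoNCE-style loss with in-batch negatives, each anchor (for example, a noisy mention) is scored against every positive in the batch, and cross-entropy pushes the matching pair above the rest. The function name below is hypothetical, and the `scale` value of 20.0 mirrors the sentence-transformers default for `MultipleNegativesRankingLoss`; the value actually used to train this model is not stated here.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """InfoNCE-style loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i acts as a negative for anchors[i]. Both inputs are assumed
    to be L2-normalized with shape (batch, dim). `scale` is an assumed value.
    """
    logits = scale * (anchors @ positives.T)             # (batch, batch) scaled cosine
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal

# Toy batch: three orthogonal unit vectors.
aligned = np.eye(3)
loss_aligned = in_batch_contrastive_loss(aligned, aligned)
loss_shuffled = in_batch_contrastive_loss(aligned, np.roll(aligned, 1, axis=0))
print(loss_aligned, loss_shuffled)
```

When each anchor matches its own positive, the loss is close to zero; when the positives are shuffled so that every anchor is paired with the wrong vector, the loss is large.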

---

## Usage (Recommended)

Load the model with Hugging Face Transformers and apply attention-mask-aware mean pooling and L2 normalization to the output yourself.

### Encoding Example

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel

class Encoder:
    """Encodes texts into L2-normalized, mean-pooled embeddings."""

    def __init__(self, model_name, device=None, max_length=32):
        # Use CUDA when available; otherwise fall back to CPU.
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"

        self.device = torch.device(device)
        self.max_length = max_length

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device)
        self.model.eval()

    def encode(self, texts, batch_size=256):
        all_vecs = []

        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]

                tokens = self.tokenizer(
                    batch,
                    padding=True,
                    truncation=True,
                    max_length=self.max_length,
                    return_tensors="pt"
                )
                # Works for any device string, e.g. "cpu", "cuda", "cuda:1"
                tokens = {k: v.to(self.device) for k, v in tokens.items()}

                out = self.model(**tokens)

                hidden = out.last_hidden_state
                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)

                # Mean over real tokens only; padded positions are masked out
                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

                # IMPORTANT: normalize embeddings
                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

                all_vecs.append(pooled.cpu().numpy())

        return np.vstack(all_vecs).astype("float32")
```

---

## Important Notes

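- Mean pooling is required (the CLS token is NOT used)
- L2 normalization is critical for similarity search: with unit-length vectors, the inner product equals cosine similarity
- Designed for short clinical mentions (max_length=32); longer inputs are truncated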
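
The toy example below shows why normalization matters when ranking by inner product. The 2-dimensional vectors are made up for illustration and `l2_normalize` is a hypothetical helper; the point is only that an unnormalized inner product can favor a long vector over one that points in the same direction as the query.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

query = np.array([[1.0, 1.0]])
candidates = np.array([[10.0, 0.0],   # long vector, 45 degrees away from the query
                       [1.0, 1.0]])   # short vector, same direction as the query

# Raw inner product rewards vector length.
raw_best = int(np.argmax(candidates @ query.T))

# After L2 normalization the inner product is cosine similarity.
cos_best = int(np.argmax(l2_normalize(candidates) @ l2_normalize(query).T))

print(raw_best, cos_best)
```

Without normalization the first candidate wins on length alone; with normalization the second candidate, which matches the query's direction exactly, ranks first.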

---

## Intended Use

This model is intended for:

- clinical concept normalization pipelines
- dense retrieval over medical ontologies (SNOMED CT, UMLS)
- embedding generation for biomedical text

---

## Not Intended For

- general-purpose sentence similarity
- long document encoding
- non-biomedical domains

---

## Limitations

- Does not encode:
  - negation
  - temporality
  - broader context
- Abbreviations remain ambiguous without external context (e.g., "DM" can mean diabetes mellitus or dermatomyositis)
- End-to-end normalization performance depends on the downstream retrieval pipeline