Tao-AI-Informatics/NA-SapBERT

@@ -12,150 +12,162 @@ base_model:
 - cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
 ---
-# NA-SapBERT: Noise-Augmented SapBERT for Clinical Concept Normalization
-NA-SapBERT is a dense retrieval model designed for clinical concept normalization over large ontologies such as SNOMED CT. It extends SapBERT by incorporating noise-aware training, enabling robust retrieval for real-world clinical mentions.
-Unlike standard SapBERT, this model is trained to handle:
 - abbreviations (e.g., "NAD", "DM")
 - misspellings
-- shorthand and telegraphic clinical text
-- surface form variation across notes
 ---
-## Overview
-Clinical concept normalization maps noisy text mentions to standardized ontology concepts. While modern NER systems perform well, entity linking remains challenging due to:
-- large ontology size
-- noisy clinical text
-- ambiguous abbreviations
-- mismatch between ontology terms and real-world mentions
-NA-SapBERT addresses this by learning invariant embeddings across noisy and canonical forms.
 ---
 ## Key Idea
-During training, the model learns to align:
-- noisy mentions (LLM-generated variants, abbreviations)
-- clean ontology terms (concept names and synonyms)
-This is achieved using contrastive learning:
-- clean–clean pairs preserve structure
-- noisy–clean pairs improve robustness
 ---
 ## Model Architecture
-SentenceTransformer:
-- Transformer (PubMedBERT backbone)
-- Mean Pooling
-Embedding dimension: 768
-Max sequence length: 64
 ---
-## Training Details
-### Data
-- SNOMED CT concepts (subset of key semantic types)
-- Synthetic variants:
-  - LLM-generated (MedGemma) noise
-  - abbreviation mappings
-### Objective
-MultipleNegativesRankingLoss (InfoNCE-style)
-### Training Configuration
-- epochs: 1
-- batch_size: 256
-- learning_rate: 1e-5
-- warmup_steps: 85
 ---
-## Usage
-### Install
-pip install -U sentence-transformers
-### Encode Mentions
 ```python
-from sentence_transformers import SentenceTransformer
-model = SentenceTransformer("YOUR_MODEL_NAME")
-mentions = ["NAD", "hx of diabetes", "left axillary lymph node"]
-embeddings = model.encode(mentions, normalize_embeddings=True)
-```
----
-## Retrieval Example (FAISS)
-```python
-import faiss
-import numpy as np
-from sentence_transformers import SentenceTransformer
-model = SentenceTransformer("YOUR_MODEL_NAME")
-concept_embeddings = np.load("concept_embeddings.npy").astype("float32")
-index = faiss.IndexFlatIP(768)
-index.add(concept_embeddings)
-query = "NAD"
-q_emb = model.encode([query], normalize_embeddings=True)
-scores, indices = index.search(q_emb, k=10)
-```
----
-## Pipeline Integration
-Typical pipeline:
-1. Exact match
-2. Dense retrieval (NA-SapBERT)
-3. Optional rewrite / multi-query
-4. Optional reranking
----
-## Performance Summary
-- SapBERT: XX recall@1
-- NA-SapBERT: XX recall@1
-Improvements:
-- Better handling of noisy mentions
-- Strong generalization to full SNOMED CT
 ---
-## Limitations
-- No explicit modeling of negation or temporality
-- Abbreviations remain ambiguous without context
-- Depends on ontology synonym quality
 ---
-## Use Cases
-Use for:
-- clinical NLP
-- concept normalization
-- ontology retrieval
-Not intended for:
-- general semantic similarity
-- non-biomedical tasks

 - cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
 ---
+# NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization
+NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.
+This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
 - abbreviations (e.g., "NAD", "DM")
 - misspellings
+- shorthand / telegraphic clinical text
+- surface variation in real-world clinical notes
 ---
+## What This Model Is
+NA-SapBERT is **only an encoder**.
+With mean pooling and L2 normalization applied, it maps input text to 768-dimensional unit-length embedding vectors.
+It does NOT include:
+- retrieval logic
+- FAISS index
+- exact match
+- rewrite modules
+- reranking
+These belong to downstream pipelines.
 ---
 ## Key Idea
+The model is trained using contrastive learning to align:
+- noisy clinical mentions
+- clean ontology concept names and synonyms
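+The goal is for a noisy mention and the canonical name of its concept to land close together in embedding space, so that surface noise does not change which concept is nearest.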
 ---
 ## Model Architecture
+- Backbone: PubMedBERT
+- Pooling: Mean pooling (attention-mask aware)
+- Output: 768-dim normalized embeddings
+- Max sequence length: 32 (optimized for short clinical mentions)
 ---
+## Training Summary
+- Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
+- Data:
+  - SNOMED CT concepts (subset of key semantic types)
+  - synthetic noisy variants (LLM + abbreviation-based)
+Training pairs:
+- clean → clean
+- noisy → clean
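The objective above can be illustrated with a small NumPy sketch of an InfoNCE-style loss with in-batch negatives. This is not the training code for this model: the function name `in_batch_contrastive_loss`, the `scale` value, and the random stand-in vectors are all assumptions made for illustration.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """InfoNCE-style loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i acts as a negative for anchors[i].
    `scale` is an assumed value, not one stated in this model card.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy with the diagonal as target

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))                  # stand-ins for concept-name embeddings
noisy = clean + 0.05 * rng.normal(size=(4, 8))   # stand-ins for noisy-mention embeddings
shuffled = noisy[[1, 2, 3, 0]]                   # deliberately misaligned pairs

aligned_loss = in_batch_contrastive_loss(noisy, clean)
misaligned_loss = in_batch_contrastive_loss(shuffled, clean)
print(aligned_loss, misaligned_loss)
```

The loss is low when each noisy vector is closest to its own clean counterpart and high when the pairs are misaligned, which is the signal that pulls noisy and clean forms of the same concept together.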
 ---
+## Usage (Recommended)
+Load the model with Hugging Face Transformers and apply mean pooling and L2 normalization yourself, as in the example below.
+### Encoding Example
 ```python
+import numpy as np
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+class Encoder:
+    def __init__(self, model_name, device="cuda", max_length=32):
+        # Fall back to CPU when CUDA is requested but not available.
+        if str(device).startswith("cuda") and not torch.cuda.is_available():
+            device = "cpu"
+        self.device = torch.device(device)
+        self.max_length = max_length
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+        self.model = AutoModel.from_pretrained(model_name).to(self.device)
+        self.model.eval()
+
+    def encode(self, texts, batch_size=256):
+        """Return an (n, hidden_size) float32 array of L2-normalized embeddings."""
+        if len(texts) == 0:
+            return np.zeros((0, self.model.config.hidden_size), dtype="float32")
+        all_vecs = []
+        with torch.no_grad():
+            for i in range(0, len(texts), batch_size):
+                batch = texts[i:i + batch_size]
+                tokens = self.tokenizer(
+                    batch,
+                    padding=True,
+                    truncation=True,
+                    max_length=self.max_length,
+                    return_tensors="pt",
+                )
+                tokens = {k: v.to(self.device) for k, v in tokens.items()}
+                out = self.model(**tokens)
+                hidden = out.last_hidden_state
+                # Mean pooling over real tokens only: padded positions are masked out.
+                mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
+                pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
+                # IMPORTANT: normalize embeddings
+                pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
+                all_vecs.append(pooled.cpu().numpy())
+        return np.vstack(all_vecs).astype("float32")
+```
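Because the encoder returns L2-normalized float32 arrays, a downstream pipeline can rank ontology concepts by plain inner product. The sketch below shows one way to do that with NumPy. The helper names `top_k_concepts` and `l2_normalize` are hypothetical, and the vectors are random stand-ins for encoder output, so that the snippet runs without downloading the model.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top_k_concepts(query_vecs, concept_vecs, k=3):
    """Rank concepts by inner product; inputs are assumed to be L2-normalized."""
    scores = query_vecs @ concept_vecs.T                  # (n_queries, n_concepts)
    k = min(k, concept_vecs.shape[0])
    idx = np.argsort(-scores, axis=1)[:, :k]              # indices of the k best concepts per query
    return idx, np.take_along_axis(scores, idx, axis=1)

# Random stand-ins for encoder output (assumption: 100 concepts, 768 dimensions).
rng = np.random.default_rng(0)
concept_vecs = l2_normalize(rng.normal(size=(100, 768))).astype("float32")
# Two "mentions" built as slightly perturbed copies of concepts 7 and 42.
query_vecs = l2_normalize(
    concept_vecs[[7, 42]] + 0.01 * rng.normal(size=(2, 768))
).astype("float32")

indices, scores = top_k_concepts(query_vecs, concept_vecs, k=3)
print(indices[:, 0])
```

For large ontologies an approximate or exact vector index would normally replace the dense matrix product; the brute-force version here only illustrates the ranking step.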
 ---
+## Important Notes
+- Mean pooling is required (CLS token is NOT used)
+- L2 normalization is critical for similarity search: on unit-length vectors, the inner product equals cosine similarity
+- Designed for short clinical mentions (max_length=32)
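The first two notes can be checked on toy data. The following NumPy sketch, using made-up token vectors rather than real model output, shows that masked mean pooling ignores whatever sits at padded positions, and that after L2 normalization the inner product matches cosine similarity. The function names are hypothetical.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 3, 4))       # one mention: 3 real tokens, dimension 4
padding = rng.normal(size=(1, 2, 4))      # arbitrary values at 2 padded positions
padded = np.concatenate([tokens, padding], axis=1)
mask = np.array([[1, 1, 1, 0, 0]])

pooled = masked_mean_pool(padded, mask)   # uses the 3 real tokens only
naive = padded.mean(axis=1)               # unmasked mean, polluted by padding

unit = l2_normalize(pooled)
other = l2_normalize(rng.normal(size=(1, 4)))
cosine = float((pooled @ other.T)[0, 0] / (np.linalg.norm(pooled) * np.linalg.norm(other)))
inner = float((unit @ other.T)[0, 0])     # inner product of unit vectors
print(cosine, inner)
```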
 ---
+## Intended Use
+This model is intended for:
+- clinical concept normalization pipelines
+- dense retrieval over medical ontologies (SNOMED CT, UMLS)
+- embedding generation for biomedical text
+---
+## Not Intended For
+- general-purpose sentence similarity
+- long document encoding
+- non-biomedical domains
+---
+## Limitations
+- Does not encode:
+  - negation
+  - temporality
+  - broader context
+- Abbreviations remain ambiguous without external context
+- Performance depends on downstream retrieval pipeline