AvantiB committed on
Commit 39bf384 · verified · 1 Parent(s): ed133d0

Update README.md

  1. README.md +99 -87
README.md CHANGED
@@ -12,150 +12,162 @@ base_model:
12
  - cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
13
  ---
14
 
15
- # NA-SapBERT: Noise-Augmented SapBERT for Clinical Concept Normalization
16
 
17
- NA-SapBERT is a dense retrieval model designed for clinical concept normalization over large ontologies such as SNOMED CT. It extends SapBERT by incorporating noise-aware training, enabling robust retrieval for real-world clinical mentions.
18
 
19
- Unlike standard SapBERT, this model is trained to handle:
20
  - abbreviations (e.g., "NAD", "DM")
21
  - misspellings
22
- - shorthand and telegraphic clinical text
23
- - surface form variation across notes
24
 
25
  ---
26
 
27
- ## Overview
28
 
29
- Clinical concept normalization maps noisy text mentions to standardized ontology concepts. While modern NER systems perform well, entity linking remains challenging due to:
30
 
31
- - large ontology size
32
- - noisy clinical text
33
- - ambiguous abbreviations
34
- - mismatch between ontology terms and real-world mentions
35
 
36
- NA-SapBERT addresses this by learning invariant embeddings across noisy and canonical forms.
37
 
38
  ---
39
 
40
  ## Key Idea
41
 
42
- During training, the model learns to align:
43
- - noisy mentions (LLM-generated variants, abbreviations)
44
- - clean ontology terms (concept names and synonyms)
 
45
 
46
- This is achieved using contrastive learning:
47
- - clean–clean pairs preserve structure
48
- - noisy–clean pairs improve robustness
49
 
50
  ---
51
 
52
  ## Model Architecture
53
 
54
- SentenceTransformer:
55
- - Transformer (PubMedBERT backbone)
56
- - Mean Pooling
57
-
58
- Embedding dimension: 768
59
- Max sequence length: 64
60
 
61
  ---
62
 
63
- ## Training Details
64
 
65
- ### Data
66
- - SNOMED CT concepts (subset of key semantic types)
67
- - Synthetic variants:
68
-   - LLM-generated (MedGemma) noise
69
-   - abbreviation mappings
70
 
71
- ### Objective
72
- MultipleNegativesRankingLoss (InfoNCE-style)
73
-
74
- ### Training Configuration
75
- - epochs: 1
76
- - batch_size: 256
77
- - learning_rate: 1e-5
78
- - warmup_steps: 85
79
 
80
  ---
81
 
82
- ## Usage
 
 
83
 
84
- ### Install
85
- pip install -U sentence-transformers
86
 
87
- ### Encode Mentions
88
  ```python
89
- from sentence_transformers import SentenceTransformer
 
 
90
 
91
- model = SentenceTransformer("YOUR_MODEL_NAME")
92
 
93
- mentions = ["NAD", "hx of diabetes", "left axillary lymph node"]
94
- embeddings = model.encode(mentions, normalize_embeddings=True)
95
- ```
96
 
97
- ---
 
98
 
99
- ## Retrieval Example (FAISS)
 
100
 
101
- ```python
102
- import faiss
103
- import numpy as np
104
- from sentence_transformers import SentenceTransformer
105
 
106
- model = SentenceTransformer("YOUR_MODEL_NAME")
107
 
108
- concept_embeddings = np.load("concept_embeddings.npy").astype("float32")
109
 
110
- index = faiss.IndexFlatIP(768)
111
- index.add(concept_embeddings)
112
 
113
- query = "NAD"
114
- q_emb = model.encode([query], normalize_embeddings=True)
115
 
116
- scores, indices = index.search(q_emb, k=10)
117
- ```
118
 
119
- ---
 
 
 
 
 
 
120
 
121
- ## Pipeline Integration
 
122
 
123
- Typical pipeline:
124
- 1. Exact match
125
- 2. Dense retrieval (NA-SapBERT)
126
- 3. Optional rewrite / multi-query
127
- 4. Optional reranking
128
 
129
- ---
 
 
 
130
 
131
- ## Performance Summary
 
132
 
133
- - SapBERT: XX recall@1
134
- - NA-SapBERT: XX recall@1
135
 
136
- Improvements:
137
- - Better handling of noisy mentions
138
- - Strong generalization to full SNOMED CT
139
 
140
  ---
141
 
142
- ## Limitations
143
 
144
- - No explicit modeling of negation or temporality
145
- - Abbreviations remain ambiguous without context
146
- - Depends on ontology synonym quality
147
 
148
  ---
149
 
150
- ## Use Cases
 
 
151
 
152
- Use for:
153
- - clinical NLP
154
- - concept normalization
155
- - ontology retrieval
156
 
157
- Not intended for:
158
- - general semantic similarity
159
- - non-biomedical tasks
160
 
161
 
12
  - cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token
13
  ---
14
 
15
+ # NA-SapBERT: Noise-Augmented SapBERT Encoder for Clinical Concept Normalization
16
 
17
+ NA-SapBERT is a **biomedical sentence embedding model** designed for encoding clinical mentions into dense vectors for downstream retrieval tasks.
18
 
19
+ This model is a noise-augmented extension of SapBERT, trained to produce robust embeddings for:
20
  - abbreviations (e.g., "NAD", "DM")
21
  - misspellings
22
+ - shorthand / telegraphic clinical text
23
+ - surface variation in real-world clinical notes
24
 
25
  ---
26
 
27
+ ## What This Model Is
28
 
29
+ NA-SapBERT is **only an encoder**.
30
 
31
+ It maps input text → 768-dimensional normalized embedding vectors.
 
 
 
32
 
33
+ It does NOT include:
34
+ - retrieval logic
35
+ - FAISS index
36
+ - exact match
37
+ - rewrite modules
38
+ - reranking
39
+
40
+ These belong to downstream pipelines.
41
 
42
  ---
43
 
44
  ## Key Idea
45
 
46
+ The model is trained using contrastive learning to align:
47
+
48
+ - noisy clinical mentions
49
+ - clean ontology concept names and synonyms
50
 
51
+ This pulls noisy and canonical surface forms of the same concept toward nearby embeddings, which makes retrieval more robust to real-world variation.
 
 
52
 
53
  ---
54
 
55
  ## Model Architecture
56
 
57
+ - Backbone: PubMedBERT
58
+ - Pooling: Mean pooling (attention-mask aware)
59
+ - Output: 768-dim normalized embeddings
60
+ - Max sequence length: 32 (optimized for short clinical mentions)
 
 
61
 
62
  ---
63
 
64
+ ## Training Summary
65
 
66
+ - Objective: MultipleNegativesRankingLoss (contrastive / InfoNCE-style)
67
+ - Data:
68
+   - SNOMED CT concepts (subset of key semantic types)
69
+   - synthetic noisy variants (LLM + abbreviation-based)
 
70
 
71
+ Training pairs:
72
+ - clean → clean
73
+ - noisy → clean
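
The in-batch contrastive objective named above can be sketched in NumPy. This is a minimal illustration only, not the training code used for this model: the function name, the `scale` value, and the random stand-in vectors are all assumptions made for the example. Each anchor (for instance a noisy mention) treats its paired term as the positive and every other term in the batch as a negative.

```python
import numpy as np

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities; row i's positive is column i.

    `scale` is an assumed temperature-like multiplier, not a value taken
    from the README.
    """
    # L2-normalize so the dot product equals cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the diagonal)
    return float(-np.mean(np.diag(log_probs)))

# Stand-in embeddings: "noisy" vectors are small perturbations of "clean" ones
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))
noisy = clean + 0.05 * rng.normal(size=(4, 8))

aligned_loss = in_batch_contrastive_loss(noisy, clean)         # correct pairing
shuffled_loss = in_batch_contrastive_loss(noisy, clean[::-1])  # every pair mismatched
```

With correctly paired rows the loss is small, and it grows when the pairing is broken, which is the signal that pushes noisy and clean forms of the same concept together.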
 
 
 
 
 
74
 
75
  ---
76
 
77
+ ## Usage (Recommended)
78
+
79
+ Use with Hugging Face Transformers + custom pooling.
80
 
81
+ ### Encoding Example
 
82
 
 
83
  ```python
+ import torch
+ import numpy as np
+ from transformers import AutoTokenizer, AutoModel

+ class Encoder:

+     def __init__(self, model_name, device="cuda", max_length=32):

+         self.device = device
+         self.max_length = max_length

+         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+         self.model = AutoModel.from_pretrained(model_name)

+         if device == "cuda":
+             self.model = self.model.cuda()

+         self.model.eval()

+     def encode(self, texts, batch_size=256):

+         all_vecs = []

+         with torch.no_grad():
+             for i in range(0, len(texts), batch_size):

+                 batch = texts[i:i+batch_size]

+                 tokens = self.tokenizer(
+                     batch,
+                     padding=True,
+                     truncation=True,
+                     max_length=self.max_length,
+                     return_tensors="pt"
+                 )

+                 if self.device == "cuda":
+                     tokens = {k: v.cuda() for k, v in tokens.items()}

+                 out = self.model(**tokens)

+                 hidden = out.last_hidden_state
+                 mask = tokens["attention_mask"].unsqueeze(-1)
+
+                 pooled = (hidden * mask).sum(1) / mask.sum(1)

+                 # IMPORTANT: normalize embeddings
+                 pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)

+                 all_vecs.append(pooled.cpu().numpy())

+         return np.vstack(all_vecs).astype("float32")
+ ```
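
Since the model is only an encoder, ranking concepts is left to the caller. The sketch below shows one simple way to do that with plain NumPy, assuming L2-normalized float32 embeddings such as those returned by `Encoder.encode`. The helper names `top_k` and `l2_normalize` are hypothetical, and random stand-in vectors replace real model output so the example runs without downloading weights.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top_k(query_vecs, concept_vecs, k=3):
    """Return (scores, indices) of the k highest inner products per query."""
    scores = query_vecs @ concept_vecs.T          # cosine similarity for unit vectors
    idx = np.argsort(-scores, axis=1)[:, :k]      # best matches first
    return np.take_along_axis(scores, idx, axis=1), idx

# Stand-in vectors; in practice these would come from Encoder.encode(...)
rng = np.random.default_rng(0)
concept_vecs = l2_normalize(rng.normal(size=(5, 768))).astype("float32")

# A query built as a slightly perturbed copy of concept 2
query_vecs = l2_normalize(
    concept_vecs[2:3] + 0.01 * rng.normal(size=(1, 768))
).astype("float32")

scores, indices = top_k(query_vecs, concept_vecs, k=3)
```

For large ontologies an approximate or exact vector index would normally replace the brute-force matrix product, but the scoring rule stays the same.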
 
137
 
138
  ---
139
 
140
+ ## Important Notes
141
 
142
+ - Mean pooling is required (CLS token is NOT used)
143
+ - L2 normalization is critical for similarity search
144
+ - Designed for short clinical mentions (max_length=32)
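
To illustrate the first two notes, here is a small NumPy sketch of attention-mask-aware mean pooling followed by L2 normalization. It mirrors the pooling step in the encoding example but uses a hand-made array instead of model output; the function name `masked_mean_pool` is hypothetical, and the sketch assumes every row contains at least one non-padding token.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """hidden: (batch, seq, dim); attention_mask: (batch, seq) of 0/1."""
    mask = attention_mask[..., None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)   # padding positions contribute zero
    counts = mask.sum(axis=1)              # number of real tokens per row
    pooled = summed / counts
    # L2-normalize so inner product between outputs equals cosine similarity
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# One sequence of three positions; the last one is padding with large values
hidden = np.array([[[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
```

Because the padding position is masked out, its values do not affect the result, and the output has unit length regardless of the input scale.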
145
 
146
  ---
147
 
148
+ ## Intended Use
149
+
150
+ This model is intended for:
151
 
152
+ - clinical concept normalization pipelines
153
+ - dense retrieval over medical ontologies (SNOMED CT, UMLS)
154
+ - embedding generation for biomedical text
 
155
 
156
+ ---
 
 
157
 
158
+ ## Not Intended For
159
+
160
+ - general-purpose sentence similarity
161
+ - long document encoding
162
+ - non-biomedical domains
163
+
164
+ ---
165
+
166
+ ## Limitations
167
 
168
+ - Does not encode:
169
+   - negation
170
+   - temporality
171
+   - broader context
172
+ - Abbreviations remain ambiguous without external context
173
+ - Performance depends on downstream retrieval pipeline