---
license: cc-by-nc-sa-3.0
---

#### AffilBERT

A ModernBERT embedding model, based on [Nomic's ModernBERT embed base](https://huggingface.co/nomic-ai/modernbert-embed-base), fine-tuned with a contrastive loss on the names of research institutions. This model is intended for researcher affiliation canonicalization.

#### Description

Embeddings can be used to link or standardize researcher affiliations by measuring the cosine similarity between two encoded representations. However, standard embedding models frequently confound geographic or topical commonalities with affiliation identity. This can result in `boston university computer science` being closer to `college of charleston computer science` than it is to `boston university department of public health`.

#### Training

This embedding model was trained using hard-negative mining and InfoNCE on a mixture of hand-annotated data gathered from PubMed and data sourced from [ROR](https://ror.org/). Hard negatives were identified using TF-IDF, together with false-positive high-similarity pairs obtained by encoding strings with the base embedding model. The result is a fine-tune that separates different institutions with confounding commonalities more aggressively than the base model. Illustrative sketches of the mining step and the training objective appear at the end of this card.

![finetunesvgmbertaffilwords](https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/DgjkXSFDoys5LO3b1MsOi.png)

#### Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "aimgo/AffilBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    # Prefix inputs with the "clustering: " task prefix used by the base model.
    enc = tokenizer(
        ["clustering: " + t for t in texts],
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    # Mean-pool over non-padding tokens, then L2-normalize.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(emb, p=2, dim=-1)

strings = [
    "boston university computer science",
    "harvard college computer science",
    "college of charleston",
    "cofc",
    "university of south carolina",
    "clemson university",
    "boston university public health",
]

x = embed(strings)
sim = (x @ x.t()).tolist()  # pairwise cosine similarities
```

A small canonicalization example that builds on this snippet appears at the end of this card.

#### Citation

If you use this model in your work, please cite:

```
@misc{mccarthy2026AffilBERT,
  author       = {McCarthy, A. M. and Rao, Sowmya R.},
  title        = {{AffilBERT}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/AffilBERT}},
  note         = {Model}
}
```
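
#### Hard-negative mining (illustrative sketch)

The training section above states that hard negatives were identified with TF-IDF. The sketch below is one possible way to do that, not the exact AffilBERT pipeline: it ranks lexically similar affiliation strings and keeps the closest ones that belong to a different institution. The toy records, the character n-gram settings, and the cutoff of two negatives per string are assumptions made for illustration.

```python
# Illustrative only: mine lexically similar strings from *other* institutions
# as hard negatives, using TF-IDF over character n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy (affiliation string, institution id) pairs; real data might come from ROR.
records = [
    ("boston university computer science", "bu"),
    ("boston university department of public health", "bu"),
    ("college of charleston computer science", "cofc"),
    ("university of south carolina columbia", "usc"),
]
texts = [t for t, _ in records]
ids = [i for _, i in records]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
tfidf = vectorizer.fit_transform(texts)
sim = cosine_similarity(tfidf)

hard_negatives = {}
for i, text in enumerate(texts):
    # Rank all other strings by lexical similarity and keep the top matches
    # that carry a *different* institution id.
    order = sim[i].argsort()[::-1]
    hard_negatives[text] = [texts[j] for j in order if ids[j] != ids[i]][:2]

print(hard_negatives["boston university computer science"])
```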
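
#### Training objective (illustrative sketch)

The training section mentions InfoNCE with mined hard negatives. The snippet below is a minimal, self-contained sketch of that objective, assuming one positive (an alternate string for the same institution) and K mined negatives per anchor; the temperature, batch shape, and the random tensors standing in for encoder outputs are illustrative assumptions, not the released training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE over one batch of L2-normalized embeddings.

    anchor:    (B, D) anchor embeddings
    positive:  (B, D) embeddings of another string for the same institution
    negatives: (B, K, D) embeddings of mined hard negatives
    """
    pos_sim = (anchor * positive).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)        # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature    # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)         # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors in place of model outputs.
B, K, D = 8, 4, 768
anchor = F.normalize(torch.randn(B, D), dim=-1)
positive = F.normalize(torch.randn(B, D), dim=-1)
negatives = F.normalize(torch.randn(B, K, D), dim=-1)
loss = info_nce(anchor, positive, negatives)
```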
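
#### Canonicalization example (illustrative)

Building on the Usage snippet, the sketch below matches a raw affiliation string against a list of canonical institution names by cosine similarity and falls back to the raw string when nothing is close enough. It reuses the `embed()` helper defined above; the canonical names and the 0.8 threshold are placeholder assumptions, not values recommended by the model authors.

```python
# Assumes embed() from the Usage section is already defined.
canonical_names = [
    "Boston University",
    "College of Charleston",
    "University of South Carolina",
    "Clemson University",
]
canonical_emb = embed(canonical_names)  # (N, D), L2-normalized

def canonicalize(raw_affiliation, threshold=0.8):
    query = embed([raw_affiliation])                  # (1, D)
    scores = (query @ canonical_emb.t()).squeeze(0)   # cosine similarities
    best = int(scores.argmax())
    # Below the (placeholder) threshold, keep the raw string unmapped.
    if scores[best] < threshold:
        return raw_affiliation, float(scores[best])
    return canonical_names[best], float(scores[best])

print(canonicalize("boston university dept of public health"))
```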