---
license: cc-by-nc-sa-3.0
---

#### AffilBERT

A ModernBERT embedding model, based on [Nomic's ModernBERT embed base](https://huggingface.co/nomic-ai/modernbert-embed-base), fine-tuned with a contrastive loss on the names of research institutions.

This model is intended for researcher affiliation canonicalization.

#### Description

Embeddings can be used to link or standardize researcher affiliations by measuring the cosine similarity between two encoded representations.
However, standard embedding models frequently confound geographic or topical commonalities with affiliation identity. This can leave `boston university computer science`
closer to `college of charleston computer science` than to `boston university department of public health`.
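Linking two affiliation strings then reduces to a threshold on cosine similarity. A minimal sketch of that decision rule, where the vectors and the `0.85` threshold are illustrative stand-ins rather than model outputs:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity of two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_institution(emb_a, emb_b, threshold=0.85):
    # Hypothetical decision rule: link two affiliations when similarity
    # clears a tuned threshold (0.85 is illustrative, not a model default).
    return cosine(emb_a, emb_b) >= threshold

# Toy 3-d vectors standing in for real model embeddings.
a = np.array([0.9, 0.1, 0.0])
b = np.array([0.88, 0.15, 0.02])  # near-duplicate of a
c = np.array([0.1, 0.9, 0.3])     # unrelated

print(same_institution(a, b), same_institution(a, c))
```

In practice the threshold would be tuned on held-out labelled pairs rather than fixed a priori.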

#### Training

This embedding model was trained with hard-negative mining and the InfoNCE objective on a mixture of hand-annotated data gathered from PubMed and records sourced from [ROR](https://ror.org/).
Hard negatives were identified using TF-IDF similarity, together with high-similarity false-positive pairs found by encoding strings with the base embedding model.
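The TF-IDF side of the mining step can be sketched as follows. This is a hedged reconstruction rather than the actual training pipeline: it uses a minimal word-level TF-IDF and toy institution labels, and ranks cross-institution pairs by lexical similarity so the most confusable ones surface as hard negatives.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Minimal word-level TF-IDF (smoothed IDF), standing in for a real vectoriser.
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    # Cosine similarity over sparse dict vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy labelled affiliations; the label marks the true institution.
affiliations = [
    ("boston university computer science", "bu"),
    ("boston university public health", "bu"),
    ("college of charleston computer science", "cofc"),
    ("university of south carolina", "sc"),
]
vecs = tfidf_vectors([text for text, _ in affiliations])

# Hard negatives: cross-institution pairs ranked by lexical similarity.
pairs = []
for i in range(len(affiliations)):
    for j in range(i + 1, len(affiliations)):
        if affiliations[i][1] != affiliations[j][1]:
            pairs.append((cosine(vecs[i], vecs[j]),
                          affiliations[i][0], affiliations[j][0]))
hard_negatives = sorted(pairs, reverse=True)
print(hard_negatives[0])  # most confusable cross-institution pair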

The result is a fine-tuned model that, compared to the base model, far more aggressively separates distinct institutions that share confounding commonalities.

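The InfoNCE objective mentioned above can be sketched with in-batch negatives: each anchor is scored against every positive in the batch, and the loss is cross-entropy toward its own positive. A minimal NumPy version, with illustrative temperature and batch values:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    # InfoNCE over a batch: anchor i's positive is row i of `positives`;
    # every other row in the batch acts as an in-batch negative.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # (batch, batch) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Loss: mean negative log-probability of the diagonal (true pairs).
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))  # near-identical pairs
print(info_nce(anchors, positives))
```

With near-identical pairs the diagonal dominates the softmax and the loss is close to zero; mismatched pairs drive it up, which is what pushes confusable negatives apart during training.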
#### Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "aimgo/AffilBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    # The base model expects a task prefix; "clustering: " suits grouping
    # and deduplicating affiliation strings.
    enc = tokenizer(["clustering: " + t for t in texts],
                    padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    # Mean pooling over non-padding tokens, then L2 normalisation.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(emb, p=2, dim=-1)

strings = [
    "boston university computer science",
    "harvard college computer science",
    "college of charleston",
    "cofc",
    "university of south carolina",
    "clemson university",
    "boston university public health",
]

x = embed(strings)
sim = (x @ x.t()).tolist()  # pairwise cosine similarities
```
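For canonicalization, a common pattern is to map each raw string to its nearest neighbour in a curated list of canonical names (e.g. ROR records). The sketch below substitutes a toy character-bigram `toy_embed` for the model so it runs standalone; in practice `embed` from the snippet above would take its place, and the canonical list and queries here are illustrative:

```python
import numpy as np

def toy_embed(texts):
    # Toy stand-in for the model: hashed character-bigram counts,
    # L2-normalised so dot products are cosine similarities.
    dim = 512
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for a, b in zip(t, t[1:]):
            out[i, (ord(a) * 31 + ord(b)) % dim] += 1.0
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Hypothetical canonical registry (in practice, e.g. ROR names).
canonical = ["Boston University", "College of Charleston", "Clemson University"]
queries = ["boston univ.", "cofc charleston college"]

c = toy_embed([s.lower() for s in canonical])
q = toy_embed([s.lower() for s in queries])
best = (q @ c.T).argmax(axis=1)  # index of nearest canonical entry per query
for query, idx in zip(queries, best):
    print(query, "->", canonical[idx])
```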

#### Citation

If you use this model in your work, please cite:

```
@misc{mccarthy2026AffilBERT,
  author       = {McCarthy, A. M. and Rao, Sowmya R.},
  title        = {{AffilBERT}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/AffilBERT}},
  note         = {Model}
}
```