---
license: cc-by-nc-sa-3.0
---

#### AffilBERT

A ModernBERT embedding model, based on [Nomic's ModernBERT embed base](https://huggingface.co/nomic-ai/modernbert-embed-base), fine-tuned with a contrastive loss on the names of research institutions. This model is intended for researcher affiliation canonicalization.

#### Description

Embeddings can be used to link or standardize researcher affiliations by measuring the cosine similarity between two encoded representations. However, standard embedding models frequently confound geographic or topical commonalities with affiliation identity. This can result in `boston university computer science` being closer to `college of charleston computer science` than it is to `boston university department of public health`.

#### Training

This embedding model was trained using hard-negative mining and InfoNCE on a mixture of hand-annotated data gathered from PubMed and data sourced from [ROR](https://ror.org/). Hard negatives were identified using TF-IDF, together with false-positive high-similarity pairs obtained by encoding strings with the base embedding model. The result is a fine-tune that separates different institutions with confounding commonalities more aggressively than the base model. Illustrative sketches of the mining step and the training objective appear at the end of this card.

![finetunesvgmbertaffilwords](https://cdn-uploads.huggingface.co/production/uploads/62cf05b026c94b143172379c/DgjkXSFDoys5LO3b1MsOi.png)

#### Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "aimgo/AffilBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    # Prefix inputs with the "clustering: " task prefix used by the base model.
    enc = tokenizer(
        ["clustering: " + t for t in texts],
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    # Mean-pool over non-padding tokens, then L2-normalize.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(emb, p=2, dim=-1)

strings = [
    "boston university computer science",
    "harvard college computer science",
    "college of charleston",
    "cofc",
    "university of south carolina",
    "clemson university",
    "boston university public health",
]

x = embed(strings)
sim = (x @ x.t()).tolist()  # pairwise cosine similarities
```

A small canonicalization example that builds on this snippet appears at the end of this card.

#### Citation

If you use this model in your work, please cite:

```
@misc{mccarthy2026AffilBERT,
  author       = {McCarthy, A. M. and Rao, Sowmya R.},
  title        = {{AffilBERT}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/AffilBERT}},
  note         = {Model}
}
```
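
#### Hard-negative mining (illustrative sketch)

The training section above states that hard negatives were identified with TF-IDF. The sketch below is one possible way to do that, not the exact AffilBERT pipeline: it ranks lexically similar affiliation strings and keeps the closest ones that belong to a different institution. The toy records, the character n-gram settings, and the cutoff of two negatives per string are assumptions made for illustration.

```python
# Illustrative only: mine lexically similar strings from *other* institutions
# as hard negatives, using TF-IDF over character n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy (affiliation string, institution id) pairs; real data might come from ROR.
records = [
    ("boston university computer science", "bu"),
    ("boston university department of public health", "bu"),
    ("college of charleston computer science", "cofc"),
    ("university of south carolina columbia", "usc"),
]
texts = [t for t, _ in records]
ids = [i for _, i in records]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
tfidf = vectorizer.fit_transform(texts)
sim = cosine_similarity(tfidf)

hard_negatives = {}
for i, text in enumerate(texts):
    # Rank all other strings by lexical similarity and keep the top matches
    # that carry a *different* institution id.
    order = sim[i].argsort()[::-1]
    hard_negatives[text] = [texts[j] for j in order if ids[j] != ids[i]][:2]

print(hard_negatives["boston university computer science"])
```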
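
#### Training objective (illustrative sketch)

The training section mentions InfoNCE with mined hard negatives. The snippet below is a minimal, self-contained sketch of that objective, assuming one positive (an alternate string for the same institution) and K mined negatives per anchor; the temperature, batch shape, and the random tensors standing in for encoder outputs are illustrative assumptions, not the released training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE over one batch of L2-normalized embeddings.

    anchor:    (B, D) anchor embeddings
    positive:  (B, D) embeddings of another string for the same institution
    negatives: (B, K, D) embeddings of mined hard negatives
    """
    pos_sim = (anchor * positive).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)        # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature    # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long)         # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors in place of model outputs.
B, K, D = 8, 4, 768
anchor = F.normalize(torch.randn(B, D), dim=-1)
positive = F.normalize(torch.randn(B, D), dim=-1)
negatives = F.normalize(torch.randn(B, K, D), dim=-1)
loss = info_nce(anchor, positive, negatives)
```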
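
#### Canonicalization example (illustrative)

Building on the Usage snippet, the sketch below matches a raw affiliation string against a list of canonical institution names by cosine similarity and falls back to the raw string when nothing is close enough. It reuses the `embed()` helper defined above; the canonical names and the 0.8 threshold are placeholder assumptions, not values recommended by the model authors.

```python
# Assumes embed() from the Usage section is already defined.
canonical_names = [
    "Boston University",
    "College of Charleston",
    "University of South Carolina",
    "Clemson University",
]
canonical_emb = embed(canonical_names)  # (N, D), L2-normalized

def canonicalize(raw_affiliation, threshold=0.8):
    query = embed([raw_affiliation])                  # (1, D)
    scores = (query @ canonical_emb.t()).squeeze(0)   # cosine similarities
    best = int(scores.argmax())
    # Below the (placeholder) threshold, keep the raw string unmapped.
    if scores[best] < threshold:
        return raw_affiliation, float(scores[best])
    return canonical_names[best], float(scores[best])

print(canonicalize("boston university dept of public health"))
```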