---
license: cc-by-nc-sa-3.0
---

#### AffilBERT

A ModernBERT embedding model, based on [Nomic's ModernBERT embed base](https://huggingface.co/nomic-ai/modernbert-embed-base), fine-tuned with a contrastive loss on the names of research institutions.

This model is intended for researcher affiliation canonicalization.

#### Description

Embeddings can be used to link or standardize researcher affiliations by measuring the cosine similarity between two encoded representations.
However, standard embedding models frequently confound geographic or topical commonalities with affiliation identity. This can leave `boston university computer science`
closer to `college of charleston computer science` than to `boston university department of public health`.
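Linking two affiliation strings then reduces to a threshold on cosine similarity. A minimal sketch of that decision rule, where the vectors and the `0.85` threshold are illustrative stand-ins rather than model outputs:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity of two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_institution(emb_a, emb_b, threshold=0.85):
    # Hypothetical decision rule: link two affiliations when similarity
    # clears a tuned threshold (0.85 is illustrative, not a model default).
    return cosine(emb_a, emb_b) >= threshold

# Toy 3-d vectors standing in for real model embeddings.
a = np.array([0.9, 0.1, 0.0])
b = np.array([0.88, 0.15, 0.02])  # near-duplicate of a
c = np.array([0.1, 0.9, 0.3])     # unrelated

print(same_institution(a, b), same_institution(a, c))
```

In practice the threshold would be tuned on held-out labelled pairs rather than fixed a priori.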

#### Training

This embedding model was trained with hard-negative mining and the InfoNCE objective on a mixture of hand-annotated data gathered from PubMed and records sourced from [ROR](https://ror.org/).
Hard negatives were identified using TF-IDF similarity, together with high-similarity false-positive pairs found by encoding strings with the base embedding model.
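The TF-IDF side of the mining step can be sketched as follows. This is a hedged reconstruction rather than the actual training pipeline: it uses a minimal word-level TF-IDF and toy institution labels, and ranks cross-institution pairs by lexical similarity so the most confusable ones surface as hard negatives.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Minimal word-level TF-IDF (smoothed IDF), standing in for a real vectoriser.
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    # Cosine similarity over sparse dict vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy labelled affiliations; the label marks the true institution.
affiliations = [
    ("boston university computer science", "bu"),
    ("boston university public health", "bu"),
    ("college of charleston computer science", "cofc"),
    ("university of south carolina", "sc"),
]
vecs = tfidf_vectors([text for text, _ in affiliations])

# Hard negatives: cross-institution pairs ranked by lexical similarity.
pairs = []
for i in range(len(affiliations)):
    for j in range(i + 1, len(affiliations)):
        if affiliations[i][1] != affiliations[j][1]:
            pairs.append((cosine(vecs[i], vecs[j]),
                          affiliations[i][0], affiliations[j][0]))
hard_negatives = sorted(pairs, reverse=True)
print(hard_negatives[0])  # most confusable cross-institution pair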

The result is a fine-tuned model that, compared to the base model, far more aggressively separates distinct institutions that share confounding commonalities.

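The InfoNCE objective mentioned above can be sketched with in-batch negatives: each anchor is scored against every positive in the batch, and the loss is cross-entropy toward its own positive. A minimal NumPy version, with illustrative temperature and batch values:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    # InfoNCE over a batch: anchor i's positive is row i of `positives`;
    # every other row in the batch acts as an in-batch negative.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # (batch, batch) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Loss: mean negative log-probability of the diagonal (true pairs).
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.01 * rng.normal(size=(4, 8))  # near-identical pairs
print(info_nce(anchors, positives))
```

With near-identical pairs the diagonal dominates the softmax and the loss is close to zero; mismatched pairs drive it up, which is what pushes confusable negatives apart during training.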
#### Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_id = "aimgo/AffilBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    # The base model expects a task prefix; "clustering: " suits grouping
    # and deduplicating affiliation strings.
    enc = tokenizer(["clustering: " + t for t in texts],
                    padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    # Mean pooling over non-padding tokens, then L2 normalisation.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return F.normalize(emb, p=2, dim=-1)

strings = [
    "boston university computer science",
    "harvard college computer science",
    "college of charleston",
    "cofc",
    "university of south carolina",
    "clemson university",
    "boston university public health",
]

x = embed(strings)
sim = (x @ x.t()).tolist()  # pairwise cosine similarities
```
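For canonicalization, a common pattern is to map each raw string to its nearest neighbour in a curated list of canonical names (e.g. ROR records). The sketch below substitutes a toy character-bigram `toy_embed` for the model so it runs standalone; in practice `embed` from the snippet above would take its place, and the canonical list and queries here are illustrative:

```python
import numpy as np

def toy_embed(texts):
    # Toy stand-in for the model: hashed character-bigram counts,
    # L2-normalised so dot products are cosine similarities.
    dim = 512
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for a, b in zip(t, t[1:]):
            out[i, (ord(a) * 31 + ord(b)) % dim] += 1.0
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Hypothetical canonical registry (in practice, e.g. ROR names).
canonical = ["Boston University", "College of Charleston", "Clemson University"]
queries = ["boston univ.", "cofc charleston college"]

c = toy_embed([s.lower() for s in canonical])
q = toy_embed([s.lower() for s in queries])
best = (q @ c.T).argmax(axis=1)  # index of nearest canonical entry per query
for query, idx in zip(queries, best):
    print(query, "->", canonical[idx])
```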

#### Citation

If you use this model in your work, please cite:

```
@misc{mccarthy2026AffilBERT,
  author       = {McCarthy, A. M. and Rao, Sowmya R.},
  title        = {{AffilBERT}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/AffilBERT}},
  note         = {Model}
}
```