---
language:
- en
- multilingual
license: apache-2.0
tags:
- sentence-transformers
- embedding
- ror
- affiliation-matching
- contrastive-learning
base_model: Snowflake/snowflake-arctic-embed-l-v2.0
datasets:
- SIRIS-Lab/affilgood-contrastive-dataset
pipeline_tag: sentence-similarity
---

# Snowflake Arctic Embeddings for ROR Affiliation Matching

A sentence embedding model fine-tuned for Research Organization Registry (ROR) affiliation matching.

## Model Description

This model is fine-tuned from `Snowflake/snowflake-arctic-embed-l-v2.0` using contrastive learning on the AffilGood contrastive dataset. It produces embeddings optimized for matching affiliation strings to ROR organization records.

## Training

- **Base model**: Snowflake/snowflake-arctic-embed-l-v2.0
- **Training dataset**: SIRIS-Lab/affilgood-contrastive-dataset
- **Training examples**: 50,255
- **Validation examples**: 2,645
- **Loss**: MultipleNegativesRankingLoss (with hard negatives)
- **Epochs**: 3
- **Batch size**: 32
- **Learning rate**: 2e-05
- **Max sequence length**: 256

## Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cometadata/snowflake-arctic-ror-affiliations")

# Encode affiliation strings
affiliations = [
    "Department of Physics, MIT, Cambridge, MA",
    "Harvard Medical School, Boston",
]
embeddings = model.encode(affiliations, normalize_embeddings=True)

# Encode ROR organization names for matching
organizations = [
    "Massachusetts Institute of Technology",
    "Harvard University",
]
org_embeddings = model.encode(organizations, normalize_embeddings=True)

# Embeddings are L2-normalized, so the dot product equals cosine similarity
similarities = np.dot(embeddings, org_embeddings.T)
```

## Intended Use

This model is designed for dense retrieval in affiliation matching pipelines. It should be used as a first-stage retriever to find candidate ROR organizations for a given affiliation string.
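The candidate-retrieval step described above can be sketched with plain NumPy: given normalized affiliation and organization embeddings, select the top-k most similar organizations per affiliation string. This is a minimal illustration, not part of the model's official API; the `top_k_candidates` helper, the random stand-in embeddings, and `k=3` are assumptions for the sketch (in practice the inputs would come from `model.encode(..., normalize_embeddings=True)`).

```python
import numpy as np

def top_k_candidates(affil_emb: np.ndarray, org_emb: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k most similar organizations per affiliation.

    Both inputs are assumed L2-normalized (as with normalize_embeddings=True),
    so the dot product equals cosine similarity.
    """
    sims = affil_emb @ org_emb.T                         # (n_affil, n_org)
    k = min(k, org_emb.shape[0])
    idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # unsorted top-k per row
    row = np.arange(sims.shape[0])[:, None]
    order = np.argsort(-sims[row, idx], axis=1)          # sort within the top-k
    idx = idx[row, order]
    return idx, sims[row, idx]

# Stand-in embeddings (in practice, the outputs of model.encode above)
rng = np.random.default_rng(0)
affil = rng.normal(size=(2, 8))
affil /= np.linalg.norm(affil, axis=1, keepdims=True)
orgs = rng.normal(size=(10, 8))
orgs /= np.linalg.norm(orgs, axis=1, keepdims=True)

indices, scores = top_k_candidates(affil, orgs, k=3)
# indices[i] lists the 3 best-matching organization rows for affiliation i,
# with scores[i] sorted in descending cosine similarity
```

The shortlisted candidates would then be passed to a second-stage scorer or reranker before assigning a final ROR identifier.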
## Training Data

Fine-tuned on [SIRIS-Lab/affilgood-contrastive-dataset](https://huggingface.co/datasets/SIRIS-Lab/affilgood-contrastive-dataset), which contains 52,900 affiliation-organization pairs with curated hard negatives across 105 languages.

## Timestamp

2026-01-07T08:08:33.561241+00:00