---
language:
- en
- multilingual
license: apache-2.0
tags:
- sentence-transformers
- embedding
- ror
- affiliation-matching
- contrastive-learning
base_model: Snowflake/snowflake-arctic-embed-l-v2.0
datasets:
- SIRIS-Lab/affilgood-contrastive-dataset
pipeline_tag: sentence-similarity
---

# Snowflake Arctic Embeddings for ROR Affiliation Matching

A sentence embedding model fine-tuned for Research Organization Registry (ROR) affiliation matching.

## Model Description

This model is fine-tuned from `Snowflake/snowflake-arctic-embed-l-v2.0` using contrastive learning
on the AffilGood contrastive dataset. It produces embeddings optimized for matching affiliation
strings to ROR organization records.

## Training

- **Base model**: Snowflake/snowflake-arctic-embed-l-v2.0
- **Training dataset**: SIRIS-Lab/affilgood-contrastive-dataset
- **Training examples**: 50,255
- **Validation examples**: 2,645
- **Loss**: MultipleNegativesRankingLoss (with hard negatives)
- **Epochs**: 3
- **Batch size**: 32
- **Learning rate**: 2e-05
- **Max sequence length**: 256

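To make the loss concrete, here is a minimal NumPy sketch of the objective that MultipleNegativesRankingLoss computes: each anchor's positive is scored against every other candidate in the batch, plus any hard negatives, using scaled cosine similarity and a softmax cross-entropy. It is an illustration only, not the training code used for this model. The function name `mnr_loss`, the toy vectors, and the scale of 20 (the sentence-transformers default, assumed here) do not come from this model card.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mnr_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    anchors, positives: arrays of shape (batch, dim); row i of `positives`
    is the correct match for row i of `anchors`.
    hard_negatives: optional array of shape (n_neg, dim), appended to the
    candidate pool as extra negatives for every anchor.
    scale: similarity multiplier (assumed value, see the note above).
    """
    if hard_negatives is None:
        candidates = positives
    else:
        candidates = np.vstack([positives, hard_negatives])
    scores = scale * (l2_normalize(anchors) @ l2_normalize(candidates).T)
    # Numerically stable log-softmax over the candidates of each anchor
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct candidate for anchor i sits in column i
    idx = np.arange(len(anchors))
    return float(-log_probs[idx, idx].mean())


# Toy 2-dimensional vectors standing in for encoded text (hypothetical data)
anchors = np.array([[1.0, 0.1], [0.1, 1.0]])
positives = np.array([[0.9, 0.0], [0.0, 0.8]])
hard_negatives = np.array([[0.7, 0.7]])

loss_in_batch = mnr_loss(anchors, positives)
loss_with_hard = mnr_loss(anchors, positives, hard_negatives)
```

Adding a hard negative that sits close to the anchors raises the loss, which is the signal that pushes the model to separate near-miss organizations from the correct one.
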
## Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cometadata/snowflake-arctic-ror-affiliations")

# Encode affiliation strings; L2-normalizing them makes the dot product
# computed below a cosine similarity
affiliations = [
    "Department of Physics, MIT, Cambridge, MA",
    "Harvard Medical School, Boston",
]
embeddings = model.encode(affiliations, normalize_embeddings=True)

# Encode ROR organization names for matching
organizations = [
    "Massachusetts Institute of Technology",
    "Harvard University",
]
org_embeddings = model.encode(organizations, normalize_embeddings=True)

# Similarity matrix of shape (len(affiliations), len(organizations))
similarities = np.dot(embeddings, org_embeddings.T)
```

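The plain dot product in the snippet above acts as a cosine similarity only because both sets of embeddings are L2-normalized. The small NumPy check below illustrates that equivalence with made-up vectors standing in for model output; the variable names and values are hypothetical.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for un-normalized model output
a = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])
b = np.array([[6.0, 8.0, 0.0], [0.0, 3.0, 4.0]])

# Explicit cosine similarity: dot product divided by the product of the norms
norms_a = np.linalg.norm(a, axis=1)[:, None]
norms_b = np.linalg.norm(b, axis=1)[None, :]
cosine = (a @ b.T) / (norms_a * norms_b)

# Same values from a plain dot product after L2-normalizing each row
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
dot_of_units = a_unit @ b_unit.T

# Index of the best-scoring column (organization) for each row (affiliation)
best_match = dot_of_units.argmax(axis=1)
```

If `normalize_embeddings=True` were omitted, the dot product would also reflect vector length, so scores would no longer be comparable as cosine similarities.
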
## Intended Use

This model is designed for dense retrieval in affiliation matching pipelines.
Use it as the first-stage retriever that finds candidate ROR organizations
for a given affiliation string.

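First-stage retrieval with these embeddings might look like the sketch below: keep the top-k organizations per affiliation by similarity and pass that shortlist to whatever later stage makes the final decision. The helper `top_k_candidates`, the toy vectors, and the placeholder labels are hypothetical; in practice the inputs would be the normalized outputs of `model.encode`.

```python
import numpy as np


def top_k_candidates(query_embeddings, org_embeddings, org_labels, k=2):
    """Return, for each query, the k most similar organizations with scores.

    Both embedding arrays are assumed to be L2-normalized, so the dot
    product equals the cosine similarity.
    """
    similarities = query_embeddings @ org_embeddings.T
    k = min(k, len(org_labels))
    # Indices of the k highest scores in each row, best first
    top_idx = np.argsort(-similarities, axis=1)[:, :k]
    return [
        [(org_labels[j], float(similarities[i, j])) for j in row]
        for i, row in enumerate(top_idx)
    ]


# Toy unit vectors standing in for model output (hypothetical data)
org_labels = ["org-a", "org-b", "org-c"]
org_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_embeddings = np.array([[0.8, 0.6]])

candidates = top_k_candidates(query_embeddings, org_embeddings, org_labels, k=2)
```

A full sort over all organizations is fine for a small candidate set; for a registry-sized index, an approximate nearest-neighbor library would likely be a better fit than a dense matrix product.
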
## Training Data

Fine-tuned on [SIRIS-Lab/affilgood-contrastive-dataset](https://huggingface.co/datasets/SIRIS-Lab/affilgood-contrastive-dataset),
which contains 52,900 affiliation-organization pairs with curated hard negatives across 105 languages.

## Timestamp

2026-01-07T08:08:33.561241+00:00