---
language: en
license: apache-2.0
library_name: sentence-transformers
base_model: Snowflake/snowflake-arctic-embed-xs
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- entity-resolution
- record-linkage
- record-matching
- data-matching
- deduplication
- arctic
- snowflake-arctic-embed
- lora
- fine-tuned
model-index:
- name: arctic-embed-xs-entity-resolution
  results:
  - task:
      type: entity-resolution
      name: Entity Resolution
    dataset:
      type: synthetic
      name: Melder Entity Resolution Benchmark (10k x 10k)
    metrics:
    - type: precision
      value: 88.6
      name: Precision
    - type: recall
      value: 99.7
      name: Combined Recall
    - type: overlap
      value: 0.031
      name: Score Overlap Coefficient
---

# Arctic-embed-xs for Entity Resolution

A fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) optimised for **entity resolution** -- matching records that refer to the same real-world entity across messy, inconsistent datasets.

The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".

This model was trained as part of [Melder](https://github.com/anomalyco/melder), an open-source record matching engine in Rust.

## Key results

Evaluated on a held-out dataset of 10,000 entity pairs (never seen during training):

| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | **0.031** (5.2x reduction) |
| Combined recall | 98.1% | **99.7%** |
| Precision | 84.2% | **88.6%** |
| False positives in auto-match | 131 | **0** |
| Non-matches in review queue | 2,826 | **184** (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |

"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, meaning the scoring threshold between "match" and "not a match" becomes much cleaner.
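
The exact coefficient Melder reports is not reproduced here, but one common way to measure this kind of overlap is the histogram intersection of the two score distributions -- a minimal sketch with synthetic scores (the bin count and distributions are illustrative):

```python
import numpy as np

def score_overlap(match_scores, nonmatch_scores, bins=50):
    """Histogram intersection of two score distributions:
    0 = fully separated, 1 = identical. One common overlap
    coefficient; Melder's exact definition may differ."""
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    h1, _ = np.histogram(match_scores, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(nonmatch_scores, bins=bins, range=(lo, hi))
    # Normalise each histogram to probability mass, then sum the per-bin minimum
    p1, p2 = h1 / h1.sum(), h2 / h2.sum()
    return float(np.minimum(p1, p2).sum())

rng = np.random.default_rng(0)
well_separated = score_overlap(rng.normal(0.85, 0.05, 5000), rng.normal(0.30, 0.05, 5000))
overlapping = score_overlap(rng.normal(0.70, 0.10, 5000), rng.normal(0.55, 0.10, 5000))
print(well_separated, overlapping)  # well-separated distributions score near 0
```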

Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is **cleaning the review queue** -- non-matches that would have required human review are pushed clearly below threshold.

## When to use this model

- **Entity resolution / record linkage** across datasets with name, address, and identifier fields
- **Counterparty matching** in financial data (the training domain)
- **Deduplication** of entity records with noisy or inconsistent naming
- **Any short-text matching task** where entities have legal names, abbreviations, addresses, and codes

The model produces 384-dimensional L2-normalised embeddings. Cosine similarity (= dot product for normalised vectors) between embeddings indicates how likely two records are to refer to the same entity.
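
Because the outputs are unit-length, cosine similarity reduces to a plain dot product -- a quick sanity check with synthetic vectors (no model download needed):

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)

# L2-normalise, as the model does to its embeddings
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, a @ b)  # for unit vectors, cosine == dot product
```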

## When NOT to use this model

- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")

# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
    "GS Intl Ltd 133 Fleet St London EC4A 2BB",
    "Morgan Stanley & Co 20 Bank Street, London E14 4AD",
    "Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]

query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)

scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
    print(f"{score:.3f} {candidate}")
# 0.872 GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614 Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298 Morgan Stanley & Co 20 Bank Street, London E14 4AD
```

### With Melder

In your Melder config YAML, point the model at the Hugging Face model ID or a local path to the ONNX export:

```yaml
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution
```

Melder uses the ONNX export (`model.onnx`) for inference via [fastembed](https://github.com/qdrant/fastembed). The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).

### With ONNX Runtime directly

The repository includes `model.onnx` for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.

## Model details

| Property | Value |
|---|---|
| Base model | [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |

## Training details

### Approach

Fine-tuned using **LoRA** (Low-Rank Adaptation) over 22 iterative rounds. Each round:

1. Run Melder's matching pipeline on a training dataset
2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
3. Fine-tune the model with LoRA on the accumulated pairs
4. Evaluate on a fixed holdout set
5. Repeat with the improved model

This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Combined with accumulation of pairs across all rounds, the model sees progressively harder examples.

### Hyperparameters

| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |

### Training data

Synthetic entity resolution data generated by [Melder's data generator](https://github.com/anomalyco/melder):

- **Side A (reference)**: 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- **Side B (query)**: 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- **Holdout**: a separate B dataset (seed 9999) never used in training, used for all evaluation metrics

Training pairs consist of:

- **Positives**: confirmed matched entity pairs (name + address concatenation)
- **Hard negatives**: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same

### Why Arctic-embed-xs?

We tested four base models across 12 experiments:

| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| **Snowflake/arctic-embed-xs** | **22M** | **0.031** | **99.7%** | **2x** |

Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.

### Overlap trajectory

Score overlap coefficient across training rounds (holdout, lower is better):

| R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|
| 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | **0.031** |

The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).

## Limitations

- **Domain-specific**: optimised for financial entity names and addresses. May underperform on other entity types (products, locations, people) without additional fine-tuning.
- **English only**: trained on English-language entity data.
- **Short text**: designed for entity names and addresses (typically 5-30 tokens). Not suitable for paragraph-level text.
- **Acronyms**: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) for production deployments.
- **30 irreducible missed matches** out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.
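
To illustrate the composite-scoring idea, here is a hypothetical blend of embedding similarity with string-level signals -- the weights and the token-Jaccard stand-in for BM25 are illustrative, not Melder's actual scorer:

```python
import difflib

def composite_score(a: str, b: str, emb_score: float,
                    w_emb: float = 0.6, w_fuzzy: float = 0.25,
                    w_tok: float = 0.15) -> float:
    """Blend an embedding similarity with string-level signals so no
    single component dominates the match decision."""
    # Character-level fuzzy similarity
    fuzzy = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Token-overlap (Jaccard) as a crude stand-in for a BM25 component
    ta, tb = set(a.lower().split()), set(b.lower().split())
    token = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return w_emb * emb_score + w_fuzzy * fuzzy + w_tok * token

# Embedding scores here are made up for illustration
same = composite_score("GS Intl Ltd 133 Fleet St",
                       "Goldman Sachs Intl Ltd 133 Fleet Street", emb_score=0.87)
diff = composite_score("GS Intl Ltd 133 Fleet St",
                       "Morgan Stanley & Co 20 Bank Street", emb_score=0.30)
print(same, diff)  # the true pair scores clearly higher
```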

## Citation

If you use this model, please cite:

```bibtex
@misc{melder-arctic-embed-xs-er,
  title={Arctic-embed-xs fine-tuned for Entity Resolution},
  author={Melder Contributors},
  year={2026},
  url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```

## Acknowledgements

- [Snowflake](https://www.snowflake.com/) for the excellent Arctic-embed model family
- [Sentence Transformers](https://www.sbert.net/) for the training framework
- [Melder](https://github.com/anomalyco/melder) for the evaluation pipeline and data generation
|