---
language: en
license: apache-2.0
library_name: sentence-transformers
base_model: Snowflake/snowflake-arctic-embed-xs
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- entity-resolution
- record-linkage
- record-matching
- data-matching
- deduplication
- arctic
- snowflake-arctic-embed
- lora
- fine-tuned
model-index:
- name: arctic-embed-xs-entity-resolution
  results:
  - task:
      type: entity-resolution
      name: Entity Resolution
    dataset:
      type: synthetic
      name: Melder Entity Resolution Benchmark (10k x 10k)
    metrics:
    - type: precision
      value: 88.6
      name: Precision
    - type: recall
      value: 99.7
      name: Combined Recall
    - type: overlap
      value: 0.031
      name: Score Overlap Coefficient
---

# Arctic-embed-xs for Entity Resolution

A fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) optimised for **entity resolution** -- matching records that refer to the same real-world entity across messy, inconsistent datasets.

The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".

This model was trained as part of [Melder](https://github.com/anomalyco/melder), an open-source record-matching engine written in Rust.
## Key results

Evaluated on a held-out dataset of 10,000 entity pairs (never seen during training):

| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | **0.031** (5.2x reduction) |
| Combined recall | 98.1% | **99.7%** |
| Precision | 84.2% | **88.6%** |
| False positives in auto-match | 131 | **0** |
| Non-matches in review queue | 2,826 | **184** (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |

"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, so the scoring threshold between "match" and "not a match" becomes much cleaner. Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is **cleaning the review queue** -- non-matches that would have required human review are pushed clearly below threshold.

## When to use this model

- **Entity resolution / record linkage** across datasets with name, address, and identifier fields
- **Counterparty matching** in financial data (the training domain)
- **Deduplication** of entity records with noisy or inconsistent naming
- **Any short-text matching task** where entities have legal names, abbreviations, addresses, and codes

The model produces 384-dimensional L2-normalised embeddings. Cosine similarity (equal to the dot product for normalised vectors) between embeddings indicates how likely it is that two records refer to the same entity.
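The "score overlap" metric above can be approximated as the shared area under the two score histograms. The sketch below is an illustration only -- Melder's exact definition is assumed, and the sample distributions are synthetic:

```python
import numpy as np

def score_overlap(match_scores, nonmatch_scores, bins=50):
    """Approximate the overlap coefficient of two score distributions
    as the shared area under their normalised histograms.
    0 = perfectly separated, 1 = identical distributions."""
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(match_scores, bins=edges)
    q, _ = np.histogram(nonmatch_scores, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return np.minimum(p, q).sum()

# Well-separated distributions give a small coefficient
rng = np.random.default_rng(0)
matches = rng.normal(0.85, 0.05, 10_000).clip(0, 1)
nonmatches = rng.normal(0.30, 0.08, 10_000).clip(0, 1)
print(round(score_overlap(matches, nonmatches), 3))
```

A lower coefficient means a single threshold can cleanly split auto-matches from rejects, which is exactly the improvement reported in the table above.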
## When NOT to use this model

- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")

# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
    "GS Intl Ltd 133 Fleet St London EC4A 2BB",
    "Morgan Stanley & Co 20 Bank Street, London E14 4AD",
    "Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]

query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)

scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
    print(f"{score:.3f} {candidate}")
# 0.872 GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614 Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298 Morgan Stanley & Co 20 Bank Street, London E14 4AD
```

### With Melder

In your Melder config YAML, point the model at the HuggingFace model ID or a local path to the ONNX export:

```yaml
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution
```

Melder uses the ONNX export (`model.onnx`) for inference via [fastembed](https://github.com/qdrant/fastembed). The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).

### With ONNX Runtime directly

The repository includes `model.onnx` for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.
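When calling `model.onnx` directly, you must reproduce the postprocessing yourself: take the CLS token (position 0) from the last hidden state, then L2-normalise it, matching the pooling settings in the model details below. A minimal numpy sketch of that step -- the `(batch, seq_len, 384)` output shape is an assumption based on standard BERT ONNX exports, and random data stands in for real model output:

```python
import numpy as np

def postprocess(last_hidden_state: np.ndarray) -> np.ndarray:
    """CLS pooling + L2 normalisation.
    Input shape: (batch, seq_len, 384); output: (batch, 384) unit vectors."""
    cls = last_hidden_state[:, 0, :]                     # CLS token pooling
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / norms                                   # unit-length embeddings

# Random data standing in for the ONNX session's last_hidden_state output
hidden = np.random.default_rng(0).normal(size=(2, 16, 384)).astype(np.float32)
emb = postprocess(hidden)

# With unit vectors, cosine similarity is just the dot product
scores = emb @ emb.T
print(np.allclose(np.linalg.norm(emb, axis=1), 1.0))  # True
```

Skipping the normalisation step is the most common integration mistake: dot products of unnormalised CLS vectors are not cosine similarities and will not match the score thresholds used elsewhere in this card.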
## Model details

| Property | Value |
|---|---|
| Base model | [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |

## Training details

### Approach

Fine-tuned using **LoRA** (Low-Rank Adaptation) over 22 iterative rounds. Each round:

1. Run Melder's matching pipeline on a training dataset
2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
3. Fine-tune the model with LoRA on the accumulated pairs
4. Evaluate on a fixed holdout set
5. Repeat with the improved model

This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Because pairs accumulate across all rounds, the model sees progressively harder examples.
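The loop above can be sketched as follows. All helper functions are hypothetical stand-ins for Melder's pipeline, stubbed out so the control flow (mine, accumulate, fine-tune, evaluate, repeat) is runnable on its own:

```python
def run_matching(model, dataset):
    """Stub: run the matching pipeline, returning
    (confirmed_matches, high_scoring_nonmatches)."""
    return ([("Acme Corp", "ACME Corporation")],
            [("Acme Corp", "Acme Holdings")])

def finetune_lora(model, pairs):
    """Stub: LoRA fine-tuning on all accumulated pairs; here the
    'model' is just a counter standing in for an improved checkpoint."""
    return model + 1

def evaluate(model):
    """Stub: score overlap on the fixed holdout set (never trained on)."""
    return 0.162 / (model + 1)

model, accumulated = 0, []
for round_no in range(1, 23):                               # 22 rounds
    matches, hard_negatives = run_matching(model, dataset=round_no)
    accumulated += [(a, b, 1) for a, b in matches]          # positives
    accumulated += [(a, b, 0) for a, b in hard_negatives]   # hard negatives
    model = finetune_lora(model, accumulated)               # ALL pairs so far
    overlap = evaluate(model)                               # holdout only

print(len(accumulated))
```

The key design choice is that `accumulated` is never reset: hard negatives mined in round N stay in the training set for every later round, which is what makes the examples progressively harder.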
### Hyperparameters

| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |

### Training data

Synthetic entity resolution data generated by [Melder's data generator](https://github.com/anomalyco/melder):

- **Side A (reference)**: 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- **Side B (query)**: 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- **Holdout**: a separate B dataset (seed 9999) never used in training, used for all evaluation metrics

Training pairs consist of:

- **Positives**: confirmed matched entity pairs (name + address concatenation)
- **Hard negatives**: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same

### Why Arctic-embed-xs?

We tested four base models across 12 experiments:

| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| **Snowflake/arctic-embed-xs** | **22M** | **0.031** | **99.7%** | **2x** |

Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard-negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.
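MultipleNegativesRankingLoss treats every other positive in the batch as a negative for each anchor: with batch size 128, each positive pair implicitly brings 127 in-batch negatives. A numpy sketch of the computation for one batch -- the scale factor of 20 follows the sentence-transformers default, assumed here rather than taken from this model's training config:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: for anchor i, candidate i is the
    positive and all other candidates are negatives. Inputs are
    L2-normalised embedding matrices of shape (batch, dim)."""
    sims = anchors @ positives.T * scale                  # (batch, batch)
    sims -= sims.max(axis=1, keepdims=True)               # numeric stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))                 # target: diagonal

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)

a = unit(rng.normal(size=(8, 384)))
loss_random = mnr_loss(a, unit(rng.normal(size=(8, 384))))  # unrelated pairs
loss_perfect = mnr_loss(a, a)                               # anchor == positive
print(loss_perfect < loss_random)  # True
```

This is why the explicitly mined hard negatives matter: random in-batch negatives are usually easy, while the review-queue non-matches added each round keep the loss informative late in training.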
### Overlap trajectory

Score overlap coefficient across training rounds (holdout, lower is better):

| R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|
| 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | **0.031** |

The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).

## Limitations

- **Domain-specific**: optimised for financial entity names and addresses. May underperform on other entity types (products, locations, people) without additional fine-tuning.
- **English only**: trained on English-language entity data.
- **Short text**: designed for entity names and addresses (typically 5-30 tokens). Not suitable for paragraph-level text.
- **Acronyms**: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) for production deployments.
- **30 irreducible missed matches** out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.

## Citation

If you use this model, please cite:

```bibtex
@misc{melder-arctic-embed-xs-er,
  title={Arctic-embed-xs fine-tuned for Entity Resolution},
  author={Melder Contributors},
  year={2026},
  url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```

## Acknowledgements

- [Snowflake](https://www.snowflake.com/) for the excellent Arctic-embed model family
- [Sentence Transformers](https://www.sbert.net/) for the training framework
- [Melder](https://github.com/anomalyco/melder) for the evaluation pipeline and data generation