---
language: en
license: apache-2.0
library_name: sentence-transformers
base_model: Snowflake/snowflake-arctic-embed-xs
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- entity-resolution
- record-linkage
- record-matching
- data-matching
- deduplication
- arctic
- snowflake-arctic-embed
- lora
- fine-tuned
model-index:
- name: arctic-embed-xs-entity-resolution
  results:
  - task:
      type: entity-resolution
      name: Entity Resolution
    dataset:
      type: synthetic
      name: Melder Entity Resolution Benchmark (10k x 10k)
    metrics:
    - type: precision
      value: 88.6
      name: Precision
    - type: recall
      value: 99.7
      name: Combined Recall
    - type: overlap
      value: 0.031
      name: Score Overlap Coefficient
---
# Arctic-embed-xs for Entity Resolution
A fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) optimised for **entity resolution** -- matching records that refer to the same real-world entity across messy, inconsistent datasets.
The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".
This model was trained as part of [Melder](https://github.com/anomalyco/melder), an open-source record matching engine in Rust.
## Key results
Evaluated on a held-out dataset of 10,000 query records (never seen during training):
| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | **0.031** (5.2x reduction) |
| Combined recall | 98.1% | **99.7%** |
| Precision | 84.2% | **88.6%** |
| False positives in auto-match | 131 | **0** |
| Non-matches in review queue | 2,826 | **184** (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |
"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, meaning the scoring threshold between "match" and "not a match" becomes much cleaner.
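One common way to estimate this coefficient is the shared area under the two normalised score histograms. A minimal numpy sketch (the exact estimator Melder uses may differ):

```python
import numpy as np

def score_overlap(match_scores, nonmatch_scores, bins=100):
    """Overlap coefficient of two score distributions: the shared area under
    their normalised histograms. 0 = perfectly separated, 1 = identical."""
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(match_scores, bins=edges, density=True)
    q, _ = np.histogram(nonmatch_scores, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.minimum(p, q).sum() * width)

# Well-separated score distributions give an overlap near 0
rng = np.random.default_rng(0)
matches = rng.normal(0.85, 0.05, 10_000)
nonmatches = rng.normal(0.30, 0.05, 10_000)
print(score_overlap(matches, nonmatches))
```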
Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is **cleaning the review queue** -- non-matches that would have required human review are pushed clearly below threshold.
## When to use this model
- **Entity resolution / record linkage** across datasets with name, address, and identifier fields
- **Counterparty matching** in financial data (the training domain)
- **Deduplication** of entity records with noisy or inconsistent naming
- **Any short-text matching task** where entities have legal names, abbreviations, addresses, and codes
The model produces 384-dimensional L2-normalised embeddings. Cosine similarity (= dot product for normalised vectors) between embeddings indicates how likely two records refer to the same entity.
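Because the outputs are unit-length, the two similarity notions coincide, which is why the usage example below scores candidates with a plain matrix product. A quick numpy check:

```python
import numpy as np

# For L2-normalised vectors, cosine similarity reduces to a dot product.
rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(np.dot(a, b), cosine)
```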
## When NOT to use this model
- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach
## Usage
### With sentence-transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")
# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
"GS Intl Ltd 133 Fleet St London EC4A 2BB",
"Morgan Stanley & Co 20 Bank Street, London E14 4AD",
"Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]
query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)
scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
print(f"{score:.3f} {candidate}")
# 0.872 GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614 Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298 Morgan Stanley & Co 20 Bank Street, London E14 4AD
```
### With Melder
In your Melder config YAML, point the model at the HuggingFace model ID or a local path to the ONNX export:
```yaml
embeddings:
model: themelder/arctic-embed-xs-entity-resolution
```
Melder uses the ONNX export (`model.onnx`) for inference via [fastembed](https://github.com/qdrant/fastembed). The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).
### With ONNX Runtime directly
The repository includes `model.onnx` for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.
## Model details
| Property | Value |
|---|---|
| Base model | [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |
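If you run the encoder yourself (for example through the ONNX export), the CLS pooling and L2 normalisation above have to be reproduced by hand. A minimal numpy sketch, using a random array as a stand-in for real encoder output:

```python
import numpy as np

def cls_pool_and_normalise(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: (batch, seq_len, 384) last-layer encoder output.
    CLS pooling keeps only the first token's vector, then L2-normalises it."""
    cls = hidden_states[:, 0, :]                       # (batch, 384)
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / norms

rng = np.random.default_rng(0)
fake_output = rng.normal(size=(2, 16, 384))            # stand-in for model output
emb = cls_pool_and_normalise(fake_output)
assert emb.shape == (2, 384)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0)
```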
## Training details
### Approach
Fine-tuned using **LoRA** (Low-Rank Adaptation) over 22 iterative rounds. Each round:
1. Run Melder's matching pipeline on a training dataset
2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
3. Fine-tune the model with LoRA on the accumulated pairs
4. Evaluate on a fixed holdout set
5. Repeat with the improved model
This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Combined with accumulation of pairs across all rounds, the model sees progressively harder examples.
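The pair-extraction step (step 2) can be sketched as follows. The thresholds, field names, and data shapes here are illustrative assumptions, not Melder's actual pipeline:

```python
# Hypothetical triage thresholds -- illustrative, not Melder's configuration.
AUTO_MATCH, REVIEW_FLOOR = 0.85, 0.60

def extract_training_pairs(scored, confirmed):
    """scored: list of (query_text, candidate_text, score) from one round.
    confirmed: set of (query_text, candidate_text) known true matches."""
    positives, hard_negatives = [], []
    for query, candidate, score in scored:
        if (query, candidate) in confirmed:
            positives.append((query, candidate))
        elif REVIEW_FLOOR <= score < AUTO_MATCH:
            # High enough to land in the review queue but not a true match:
            # exactly the "looks similar, isn't the same" cases.
            hard_negatives.append((query, candidate))
    return positives, hard_negatives

scored = [
    ("GS Intl Ltd", "Goldman Sachs International", 0.88),
    ("GS Intl Ltd", "Goldman Sachs Asset Management", 0.71),
    ("GS Intl Ltd", "Morgan Stanley & Co", 0.30),
]
confirmed = {("GS Intl Ltd", "Goldman Sachs International")}
pos, neg = extract_training_pairs(scored, confirmed)
print(pos)  # [('GS Intl Ltd', 'Goldman Sachs International')]
print(neg)  # [('GS Intl Ltd', 'Goldman Sachs Asset Management')]
```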
### Hyperparameters
| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |
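MultipleNegativesRankingLoss treats every other positive in the batch as a negative for a given anchor. A numpy sketch of what the loss computes (the sentence-transformers implementation operates on torch tensors; its default cosine scale is 20):

```python
import numpy as np

def mnrl(anchor_emb, positive_emb, scale=20.0):
    """In-batch negatives loss: each anchor i should rank positive i above
    every other positive in the batch. Embeddings are assumed L2-normalised,
    so a matrix product gives cosine scores."""
    scores = scale * (anchor_emb @ positive_emb.T)       # (batch, batch)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal (anchor i paired with positive i)
    return -np.mean(np.diag(log_softmax))

def normalise(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
anchors = normalise(rng.normal(size=(4, 384)))
# Positives correlated with their anchors -> low loss
positives = normalise(anchors + 0.05 * rng.normal(size=(4, 384)))
print(mnrl(anchors, positives))
```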
### Training data
Synthetic entity resolution data generated by [Melder's data generator](https://github.com/anomalyco/melder):
- **Side A (reference)**: 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- **Side B (query)**: 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- **Holdout**: A separate B dataset (seed 9999) never used in training, used for all evaluation metrics
Training pairs consist of:
- **Positives**: confirmed matched entity pairs (name + address concatenation)
- **Hard negatives**: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same
### Why Arctic-embed-xs?
We tested four base models across 12 experiments:
| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| **Snowflake/arctic-embed-xs** | **22M** | **0.031** | **99.7%** | **2x** |
Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.
### Overlap trajectory
Score overlap coefficient across training rounds (holdout, lower is better):
| R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|
| 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | **0.031** |
The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).
## Limitations
- **Domain-specific**: optimised for financial entity names and addresses. May underperform on other entity types (products, locations, people) without additional fine-tuning.
- **English only**: trained on English-language entity data.
- **Short text**: designed for entity names and addresses (typically 5-30 tokens). Not suitable for paragraph-level text.
- **Acronyms**: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) for production deployments.
- **30 irreducible missed matches** out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.
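A composite scorer can be as simple as a weighted blend of signals. This sketch uses the stdlib's `difflib` as a stand-in for the fuzzy component; the weights and metric choice are illustrative assumptions, not Melder's production scoring:

```python
from difflib import SequenceMatcher

# Hypothetical blend weights -- illustrative only.
WEIGHTS = {"embedding": 0.6, "fuzzy": 0.4}

def fuzzy_score(a: str, b: str) -> float:
    """Lexical similarity in [0, 1], standing in for a fuzzy/BM25 component."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def composite_score(embedding_sim: float, a: str, b: str) -> float:
    """Weighted blend of the embedding cosine and a lexical score."""
    return (WEIGHTS["embedding"] * embedding_sim
            + WEIGHTS["fuzzy"] * fuzzy_score(a, b))

# A pair the embedding model scores marginally can be pushed above or below
# threshold by the lexical component; acronym pairs still need dedicated rules.
print(composite_score(0.72, "GS Intl Ltd", "Goldman Sachs International Ltd"))
```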
## Citation
If you use this model, please cite:
```bibtex
@misc{melder-arctic-embed-xs-er,
title={Arctic-embed-xs fine-tuned for Entity Resolution},
author={Melder Contributors},
year={2026},
url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```
## Acknowledgements
- [Snowflake](https://www.snowflake.com/) for the excellent Arctic-embed model family
- [Sentence Transformers](https://www.sbert.net/) for the training framework
- [Melder](https://github.com/anomalyco/melder) for the evaluation pipeline and data generation