# Arctic-embed-xs for Entity Resolution

A fine-tuned version of Snowflake/snowflake-arctic-embed-xs optimised for entity resolution -- matching records that refer to the same real-world entity across messy, inconsistent datasets.

The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".

This model was trained as part of Melder, an open-source record matching engine in Rust.

## Key results

Evaluated on a held-out dataset of 10,000 entity pairs (never seen during training):

| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | 0.031 (5.2x reduction) |
| Combined recall | 98.1% | 99.7% |
| Precision | 84.2% | 88.6% |
| False positives in auto-match | 131 | 0 |
| Non-matches in review queue | 2,826 | 184 (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |

"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, meaning the scoring threshold between "match" and "not a match" becomes much cleaner.

Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is cleaning the review queue -- non-matches that would have required human review are pushed clearly below threshold.
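As a concrete reading of this metric, the overlap coefficient can be estimated as the shared area under the two normalised score histograms. The sketch below is a generic histogram-based estimate, not necessarily the exact formula Melder's evaluation uses:

```python
import numpy as np

def score_overlap(match_scores, nonmatch_scores, bins=50):
    """Overlap coefficient: shared area under the two normalised score
    histograms (0 = perfectly separated, 1 = identical distributions)."""
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    h1, _ = np.histogram(match_scores, bins=edges, density=True)
    h2, _ = np.histogram(nonmatch_scores, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.minimum(h1, h2).sum() * width)

rng = np.random.default_rng(0)
matches = rng.normal(0.85, 0.05, 5000)      # true-match scores cluster high
nonmatches = rng.normal(0.40, 0.10, 5000)   # non-match scores cluster low
print(f"overlap: {score_overlap(matches, nonmatches):.3f}")
```

Well-separated distributions like the synthetic ones above score near 0; identical distributions score 1.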

## When to use this model

- Entity resolution / record linkage across datasets with name, address, and identifier fields
- Counterparty matching in financial data (the training domain)
- Deduplication of entity records with noisy or inconsistent naming
- Any short-text matching task where entities have legal names, abbreviations, addresses, and codes

The model produces 384-dimensional L2-normalised embeddings. Cosine similarity (equivalent to the dot product for normalised vectors) between two embeddings indicates how likely the two records are to refer to the same entity.
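Because the embeddings come out unit-length, the cosine denominator is always 1 and a plain matrix product already gives the similarity. A quick numpy check of that equivalence:

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# L2-normalise, as the model does to its outputs
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
print(cosine, dot)  # identical values: both norms are 1
```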

## When NOT to use this model

- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")

# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
    "GS Intl Ltd 133 Fleet St London EC4A 2BB",
    "Morgan Stanley & Co 20 Bank Street, London E14 4AD",
    "Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]

query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)

scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
    print(f"{score:.3f}  {candidate}")
# 0.872  GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614  Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298  Morgan Stanley & Co 20 Bank Street, London E14 4AD
```
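In a resolution pipeline, scores like these are typically bucketed into auto-match, review, and reject bands. The cutoffs below are purely illustrative, not Melder's defaults:

```python
AUTO_MATCH = 0.80   # illustrative thresholds, not Melder's defaults
REVIEW = 0.55

def decide(score: float) -> str:
    """Map a similarity score to a three-band decision."""
    if score >= AUTO_MATCH:
        return "auto-match"
    if score >= REVIEW:
        return "review"
    return "reject"

for score in (0.872, 0.614, 0.298):
    print(decide(score))
# auto-match
# review
# reject
```

The "non-matches in review queue" metric above counts records landing in the middle band; fine-tuning pushes most of them below the review cutoff.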

### With Melder

In your Melder config YAML, point the model at the HuggingFace model ID or a local path to the ONNX export:

```yaml
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution
```

Melder uses the ONNX export (model.onnx) for inference via fastembed. The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).

### With ONNX Runtime directly

The repository includes model.onnx for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.
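Whatever runtime you use, the raw ONNX output still needs the CLS pooling and L2 normalisation listed under Model details. A numpy sketch of that post-processing, assuming the standard BERT output layout of (batch, seq_len, hidden):

```python
import numpy as np

def cls_pool_and_normalise(last_hidden_state: np.ndarray) -> np.ndarray:
    """last_hidden_state: (batch, seq_len, 384) array from the ONNX model.
    Returns L2-normalised (batch, 384) embeddings via CLS pooling."""
    cls = last_hidden_state[:, 0, :]                    # CLS token sits at position 0
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)            # guard against zero vectors

# Stand-in tensor; in practice this comes from an ONNX Runtime session
hidden = np.random.default_rng(0).normal(size=(2, 16, 384))
emb = cls_pool_and_normalise(hidden)
print(emb.shape)  # (2, 384)
```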

## Model details

| Property | Value |
|---|---|
| Base model | Snowflake/snowflake-arctic-embed-xs |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |

## Training details

### Approach

Fine-tuned using LoRA (Low-Rank Adaptation) over 22 iterative rounds. Each round:

  1. Run Melder's matching pipeline on a training dataset
  2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
  3. Fine-tune the model with LoRA on the accumulated pairs
  4. Evaluate on a fixed holdout set
  5. Repeat with the improved model

This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Combined with accumulation of pairs across all rounds, the model sees progressively harder examples.

### Hyperparameters

| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |
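For reference, MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative: row i of the anchor-positive similarity matrix is softmaxed and the loss is the cross-entropy against the diagonal. A minimal numpy version (the scale of 20 is sentence-transformers' default for cosine similarity):

```python
import numpy as np

def mnrl(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """anchors, positives: (batch, dim), L2-normalised; positives[i] is
    the true match for anchors[i], every other row an in-batch negative."""
    sims = scale * (anchors @ positives.T)          # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)         # stabilise the softmax
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.diag(log_softmax).mean())      # target class is the diagonal

emb = np.eye(4, 8)                 # four orthonormal toy embeddings
print(mnrl(emb, emb))              # correct pairing: loss near 0
print(mnrl(emb, emb[::-1]))        # scrambled pairing: loss near 20
```

This is why hard negatives matter: random in-batch negatives are usually easy, while mined high-scoring non-matches force the model to sharpen the boundary.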

### Training data

Synthetic entity resolution data generated by Melder's data generator:

- Side A (reference): 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- Side B (query): 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- Holdout: a separate Side B dataset (seed 9999), never used in training, on which all evaluation metrics are reported

Training pairs consist of:

- Positives: confirmed matched entity pairs (name + address concatenation)
- Hard negatives: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same

## Why Arctic-embed-xs?

We tested four base models across 12 experiments:

| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| Snowflake/arctic-embed-xs | 22M | 0.031 | 99.7% | 2x |

Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.

## Overlap trajectory

Score overlap coefficient across training rounds (holdout, lower is better):

| Round | R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|---|
| Overlap | 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | 0.031 |

The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).

## Limitations

- Domain-specific: optimised for financial entity names and addresses. It may underperform on other entity types (products, locations, people) without additional fine-tuning.
- English only: trained on English-language entity data.
- Short text: designed for entity names and addresses (typically 5-30 tokens); not suitable for paragraph-level text.
- Acronyms: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) in production deployments.
- Irreducible misses: 30 missed matches out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.
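The composite approach mentioned above can be as simple as a weighted blend of the embedding score with a character-level fuzzy score. A sketch using the standard library's `difflib`; the weights are illustrative, not Melder's configuration:

```python
from difflib import SequenceMatcher

def composite_score(embedding_sim: float, a: str, b: str,
                    w_embed: float = 0.7, w_fuzzy: float = 0.3) -> float:
    """Blend embedding similarity with a character-level fuzzy ratio.
    Weights are illustrative; production systems tune them (and may
    add a BM25 term) on labelled pairs."""
    fuzzy = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return w_embed * embedding_sim + w_fuzzy * fuzzy

score = composite_score(0.61, "Goldman Sachs International",
                        "Goldman Sachs Asset Management")
print(f"{score:.3f}")
```

The fuzzy term rescues pairs with heavy abbreviation where the strings still share most characters, which is exactly where embeddings alone struggle.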

## Citation

If you use this model, please cite:

```bibtex
@misc{melder-arctic-embed-xs-er,
    title={Arctic-embed-xs fine-tuned for Entity Resolution},
    author={Melder Contributors},
    year={2026},
    url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```
