---
language: en
license: apache-2.0
library_name: sentence-transformers
base_model: Snowflake/snowflake-arctic-embed-xs
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- entity-resolution
- record-linkage
- record-matching
- data-matching
- deduplication
- arctic
- snowflake-arctic-embed
- lora
- fine-tuned
model-index:
- name: arctic-embed-xs-entity-resolution
results:
- task:
type: entity-resolution
name: Entity Resolution
dataset:
type: synthetic
name: Melder Entity Resolution Benchmark (10k x 10k)
metrics:
- type: precision
value: 88.6
name: Precision
- type: recall
value: 99.7
name: Combined Recall
- type: overlap
value: 0.031
name: Score Overlap Coefficient
---
# Arctic-embed-xs for Entity Resolution
A fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) optimised for **entity resolution** -- matching records that refer to the same real-world entity across messy, inconsistent datasets.
The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".
This model was trained as part of [Melder](https://github.com/anomalyco/melder), an open-source record matching engine in Rust.
## Key results
Evaluated on a held-out dataset of 10,000 entity pairs (never seen during training):
| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | **0.031** (5.2x reduction) |
| Combined recall | 98.1% | **99.7%** |
| Precision | 84.2% | **88.6%** |
| False positives in auto-match | 131 | **0** |
| Non-matches in review queue | 2,826 | **184** (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |
"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, so the decision boundary between "match" and "not a match" becomes much cleaner.
Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is **cleaning the review queue** -- non-matches that would have required human review are pushed clearly below threshold.
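One plausible way to compute such an overlap coefficient (the exact definition Melder uses is not specified here) is the shared area of the two normalised score histograms:

```python
import numpy as np

def overlap_coefficient(match_scores, nonmatch_scores, bins=100):
    """Shared area of the two normalised score histograms:
    0 = perfect separation, 1 = indistinguishable distributions."""
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    h1, _ = np.histogram(match_scores, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(nonmatch_scores, bins=bins, range=(lo, hi))
    # Normalise each histogram to sum to 1, then sum the bin-wise minima.
    return float(np.minimum(h1 / h1.sum(), h2 / h2.sum()).sum())

# Synthetic stand-ins for cosine scores of true matches vs non-matches.
rng = np.random.default_rng(0)
true_matches = rng.normal(0.85, 0.05, 5000).clip(0, 1)
non_matches = rng.normal(0.30, 0.10, 5000).clip(0, 1)
print(f"{overlap_coefficient(true_matches, non_matches):.3f}")  # near zero: well separated
```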
## When to use this model
- **Entity resolution / record linkage** across datasets with name, address, and identifier fields
- **Counterparty matching** in financial data (the training domain)
- **Deduplication** of entity records with noisy or inconsistent naming
- **Any short-text matching task** where entities have legal names, abbreviations, addresses, and codes
The model produces 384-dimensional L2-normalised embeddings. Cosine similarity (= dot product for normalised vectors) between embeddings indicates how likely two records are to refer to the same entity.
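Because the embeddings are unit-length, the plain dot product and the explicit cosine formula give identical scores, which is why candidate scoring reduces to a single matrix multiply. A quick numpy check (random vectors stand in for model output):

```python
import numpy as np

def l2_normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical 384-d embeddings standing in for model output.
rng = np.random.default_rng(42)
query = l2_normalise(rng.normal(size=(1, 384)))
cands = l2_normalise(rng.normal(size=(3, 384)))

dot = query @ cands.T                                    # plain dot product
norms = np.linalg.norm(query) * np.linalg.norm(cands, axis=1)
cosine = (query @ cands.T) / norms                       # explicit cosine

print(np.allclose(dot, cosine))                          # identical after normalisation
```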
## When NOT to use this model
- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach
## Usage
### With sentence-transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")
# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
"GS Intl Ltd 133 Fleet St London EC4A 2BB",
"Morgan Stanley & Co 20 Bank Street, London E14 4AD",
"Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]
query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)
scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
print(f"{score:.3f} {candidate}")
# 0.872 GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614 Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298 Morgan Stanley & Co 20 Bank Street, London E14 4AD
```
### With Melder
In your Melder config YAML, point the model at the HuggingFace model ID or a local path to the ONNX export:
```yaml
embeddings:
model: themelder/arctic-embed-xs-entity-resolution
```
Melder uses the ONNX export (`model.onnx`) for inference via [fastembed](https://github.com/qdrant/fastembed). The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).
### With ONNX Runtime directly
The repository includes `model.onnx` for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.
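Whichever runtime you use, you must reproduce the pooling listed under Model details: take the CLS token, then L2-normalise. A minimal numpy sketch of that post-processing, assuming the export returns the usual `last_hidden_state` tensor of shape `(batch, seq_len, 384)`:

```python
import numpy as np

def postprocess(last_hidden_state):
    """CLS-token pooling followed by L2 normalisation, matching the
    pooling/output settings in the Model details table."""
    cls = last_hidden_state[:, 0, :]                # CLS token is position 0
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

# Stand-in for a raw ONNX Runtime output (real shape: batch x seq x 384).
fake = np.random.default_rng(0).normal(size=(2, 8, 384)).astype(np.float32)
emb = postprocess(fake)
print(emb.shape, np.linalg.norm(emb, axis=1))       # (2, 384), unit norms
```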
## Model details
| Property | Value |
|---|---|
| Base model | [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |
## Training details
### Approach
Fine-tuned using **LoRA** (Low-Rank Adaptation) over 22 iterative rounds. Each round:
1. Run Melder's matching pipeline on a training dataset
2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
3. Fine-tune the model with LoRA on the accumulated pairs
4. Evaluate on a fixed holdout set
5. Repeat with the improved model
This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Combined with accumulation of pairs across all rounds, the model sees progressively harder examples.
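Schematically, the loop above looks like this. Note that `run_pipeline` and `finetune_lora` are hypothetical stand-ins for Melder's matcher and the LoRA training step, not real APIs; the point is the shape of the accumulation:

```python
def run_pipeline(model, dataset):
    # Stand-in: returns (confirmed_matches, high_scoring_nonmatches).
    return [("a", "a'")], [("a", "b")]

def finetune_lora(model, pairs):
    # Stand-in: each round produces a new checkpoint.
    return model + 1

model = 0                       # round-0 base model
accumulated = []                # pairs accumulate across all rounds
for rnd in range(22):
    positives, hard_negatives = run_pipeline(model, dataset=rnd)
    accumulated += [(p, 1) for p in positives] + [(n, 0) for n in hard_negatives]
    model = finetune_lora(model, accumulated)   # trained on everything so far
```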
### Hyperparameters
| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |
### Training data
Synthetic entity resolution data generated by [Melder's data generator](https://github.com/anomalyco/melder):
- **Side A (reference)**: 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- **Side B (query)**: 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- **Holdout**: A separate B dataset (seed 9999) never used in training, used for all evaluation metrics
Training pairs consist of:
- **Positives**: confirmed matched entity pairs (name + address concatenation)
- **Hard negatives**: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same
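For intuition on the loss in the hyperparameter table: MultipleNegativesRankingLoss treats every other positive in the batch as a negative for each anchor, i.e. cross-entropy over a scaled similarity matrix whose diagonal holds the true pairs. A numpy restatement (illustrative only, not sentence-transformers' actual implementation):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over the scaled cosine-similarity matrix; the diagonal
    holds the true (anchor_i, positive_i) pairs and every other positive in
    the batch acts as an in-batch negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                               # (batch, batch)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(1)
anchors = rng.normal(size=(4, 16))
positives = anchors + 0.01 * rng.normal(size=(4, 16))   # near-duplicate pairs
print(f"{mnr_loss(anchors, positives):.4f}")  # small: true pairs dominate their rows
```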
### Why Arctic-embed-xs?
We tested four base models across 12 experiments:
| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| **Snowflake/arctic-embed-xs** | **22M** | **0.031** | **99.7%** | **2x** |
Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.
### Overlap trajectory
Score overlap coefficient across training rounds (holdout, lower is better):
| R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|
| 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | **0.031** |
The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).
## Limitations
- **Domain-specific**: optimised for financial entity names and addresses. May underperform on other entity types (products, locations, people) without additional fine-tuning.
- **English only**: trained on English-language entity data.
- **Short text**: designed for entity names and addresses (typically 5-30 tokens). Not suitable for paragraph-level text.
- **Acronyms**: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) for production deployments.
- **30 irreducible missed matches** out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.
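As a sketch of the composite approach mentioned above, one might blend the embedding score with surface-level signals. Here difflib's fuzzy ratio and a Jaccard token overlap stand in; the weights, and the token-overlap stand-in for BM25, are arbitrary illustrations to be tuned on your own data:

```python
from difflib import SequenceMatcher

def composite_score(a, b, emb_score, w_emb=0.6, w_fuzzy=0.25, w_token=0.15):
    """Blend embedding similarity with a character-level fuzzy ratio and
    Jaccard token overlap (a crude stand-in for BM25)."""
    fuzzy = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ta, tb = set(a.lower().split()), set(b.lower().split())
    token = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return w_emb * emb_score + w_fuzzy * fuzzy + w_token * token

# Surface signals boost a pair the embedding alone under-scores.
print(f"{composite_score('GS Intl Ltd', 'GS International Ltd', emb_score=0.7):.3f}")
```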
## Citation
If you use this model, please cite:
```bibtex
@misc{melder-arctic-embed-xs-er,
title={Arctic-embed-xs fine-tuned for Entity Resolution},
author={Melder Contributors},
year={2026},
url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```
## Acknowledgements
- [Snowflake](https://www.snowflake.com/) for the excellent Arctic-embed model family
- [Sentence Transformers](https://www.sbert.net/) for the training framework
- [Melder](https://github.com/anomalyco/melder) for the evaluation pipeline and data generation