---
language: en
license: apache-2.0
library_name: sentence-transformers
base_model: Snowflake/snowflake-arctic-embed-xs
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - entity-resolution
  - record-linkage
  - record-matching
  - data-matching
  - deduplication
  - arctic
  - snowflake-arctic-embed
  - lora
  - fine-tuned
model-index:
  - name: arctic-embed-xs-entity-resolution
    results:
      - task:
          type: entity-resolution
          name: Entity Resolution
        dataset:
          type: synthetic
          name: Melder Entity Resolution Benchmark (10k x 10k)
        metrics:
          - type: precision
            value: 88.6
            name: Precision
          - type: recall
            value: 99.7
            name: Combined Recall
          - type: overlap
            value: 0.031
            name: Score Overlap Coefficient
---

# Arctic-embed-xs for Entity Resolution

A fine-tuned version of [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) optimised for **entity resolution** -- matching records that refer to the same real-world entity across messy, inconsistent datasets.

The canonical use case is matching counterparty names, addresses, and identifiers between a clean reference master (side A) and noisy operational data (side B). For example, resolving "GS Intl Ltd" to "Goldman Sachs International".

This model was trained as part of [Melder](https://github.com/anomalyco/melder), an open-source record matching engine in Rust.

## Key results

Evaluated on a held-out dataset of 10,000 entity pairs (never seen during training):

| Metric | Base model (untrained) | This model (R22) |
|---|---|---|
| Score overlap (lower is better) | 0.162 | **0.031** (5.2x reduction) |
| Combined recall | 98.1% | **99.7%** |
| Precision | 84.2% | **88.6%** |
| False positives in auto-match | 131 | **0** |
| Non-matches in review queue | 2,826 | **184** (93.5% reduction) |
| Missed matches (clean) | 4 | 19 |
| Missed matches (heavy noise) | 0 | 11 |

"Score overlap" measures how much the score distributions of true matches and non-matches overlap -- lower means better separation. This model reduces overlap by 5.2x compared to the base model, meaning the scoring threshold between "match" and "not a match" becomes much cleaner.
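
As a rough illustration (Melder's exact overlap definition is not spelled out here, so this histogram-intersection formulation is an assumption), the coefficient can be computed from the two score distributions:

```python
import numpy as np

def score_overlap(match_scores, nonmatch_scores, bins=50):
    """Histogram-intersection overlap of two score distributions.

    0.0 = perfectly separated, 1.0 = identical distributions.
    NOTE: illustrative only -- Melder's exact definition may differ.
    """
    lo = min(match_scores.min(), nonmatch_scores.min())
    hi = max(match_scores.max(), nonmatch_scores.max())
    h1, _ = np.histogram(match_scores, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(nonmatch_scores, bins=bins, range=(lo, hi))
    # Normalise to probability mass, then sum the per-bin minimum.
    p1 = h1 / h1.sum()
    p2 = h2 / h2.sum()
    return float(np.minimum(p1, p2).sum())

rng = np.random.default_rng(0)
separated = score_overlap(rng.normal(0.9, 0.03, 5000), rng.normal(0.3, 0.05, 5000))
mixed = score_overlap(rng.normal(0.6, 0.10, 5000), rng.normal(0.5, 0.10, 5000))
print(f"well separated: {separated:.3f}, overlapping: {mixed:.3f}")
```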

Combined recall (auto-matched + review) stays at 99.7%, meaning almost no true matches are lost. The main benefit of fine-tuning is **cleaning the review queue** -- non-matches that would have required human review are pushed clearly below threshold.

## When to use this model

- **Entity resolution / record linkage** across datasets with name, address, and identifier fields
- **Counterparty matching** in financial data (the training domain)
- **Deduplication** of entity records with noisy or inconsistent naming
- **Any short-text matching task** where entities have legal names, abbreviations, addresses, and codes

The model produces 384-dimensional L2-normalised embeddings. The cosine similarity between two embeddings (equivalently, their dot product, since the vectors are unit-length) indicates how likely the two records are to refer to the same entity.
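
Because the outputs are L2-normalised, cosine similarity reduces to a plain dot product, which is cheap to compute in bulk. A quick demonstration with random stand-in vectors rather than real model outputs:

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=(3, 384))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise, as the model does

query = emb[0]
# Full cosine similarity...
cos = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
# ...equals a plain dot product once the vectors are unit-length.
dot = emb @ query
assert np.allclose(cos, dot)
print(dot)
```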

## When NOT to use this model

- General-purpose semantic similarity or retrieval (use the base model instead)
- Long-document embedding (entity names and addresses are short sequences)
- Non-English text (trained on English entity names only)
- Acronym matching ("TRMS" vs "Taylor, Reeves and Mcdaniel SRL") -- no embedding model can reliably resolve these; use a composite scoring approach

## Usage

### With sentence-transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("themelder/arctic-embed-xs-entity-resolution")

# Encode entity records (concatenate name + address for best results)
queries = ["Goldman Sachs International 133 Fleet Street, London EC4A 2BB"]
candidates = [
    "GS Intl Ltd 133 Fleet St London EC4A 2BB",
    "Morgan Stanley & Co 20 Bank Street, London E14 4AD",
    "Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT",
]

query_emb = model.encode(queries, prompt_name="query")
candidate_emb = model.encode(candidates)

scores = query_emb @ candidate_emb.T
for candidate, score in sorted(zip(candidates, scores[0]), key=lambda x: -x[1]):
    print(f"{score:.3f}  {candidate}")
# 0.872  GS Intl Ltd 133 Fleet St London EC4A 2BB
# 0.614  Goldman Sachs Asset Management Christchurch Court, London EC1A 7HT
# 0.298  Morgan Stanley & Co 20 Bank Street, London E14 4AD
```

### With Melder

In your Melder config YAML, point the model at the HuggingFace model ID or a local path to the ONNX export:

```yaml
embeddings:
  model: themelder/arctic-embed-xs-entity-resolution
```

Melder uses the ONNX export (`model.onnx`) for inference via [fastembed](https://github.com/qdrant/fastembed). The model produces 384-dimensional embeddings at roughly 2x the speed of BGE-small models (6 layers vs 12).

### With ONNX Runtime directly

The repository includes `model.onnx` for direct use with ONNX Runtime in any language (Rust, C++, Java, etc.) without Python dependencies.

## Model details

| Property | Value |
|---|---|
| Base model | [Snowflake/snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | 22M |
| Embedding dimension | 384 |
| Max sequence length | 512 tokens |
| Similarity function | Cosine similarity |
| Pooling | CLS token |
| Output | L2-normalised |

## Training details

### Approach

Fine-tuned using **LoRA** (Low-Rank Adaptation) over 22 iterative rounds. Each round:

1. Run Melder's matching pipeline on a training dataset
2. Extract training pairs: confirmed matches become positives, high-scoring non-matches become hard negatives
3. Fine-tune the model with LoRA on the accumulated pairs
4. Evaluate on a fixed holdout set
5. Repeat with the improved model

This iterative approach means the model learns from its own mistakes -- hard negatives from round N become training signal for round N+1. Because pairs accumulate across rounds, the model sees a progressively harder mix of examples.

### Hyperparameters

| Parameter | Value |
|---|---|
| Loss function | MultipleNegativesRankingLoss |
| Batch size | 128 |
| Learning rate | 2e-5 |
| Epochs per round | 1 |
| Warmup ratio | 0.1 |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.1 |
| Rounds | 22 |
| Total training pairs (final round) | ~127,000 |
| Optimizer | AdamW (fused) |
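
MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative for each anchor. A numpy sketch of the objective (not the sentence-transformers implementation itself; `scale=20.0` mirrors that library's default similarity scale):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives ranking loss: for each anchor i, its positive i
    should out-score every other positive j in the batch.
    Rows of both matrices are assumed L2-normalised."""
    sims = scale * (anchors @ positives.T)           # (B, B) similarity matrix
    # Softmax cross-entropy with the diagonal as the target class.
    logits = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
a = unit(rng.normal(size=(8, 384)))
good = mnr_loss(a, a)                               # positives identical to anchors
bad = mnr_loss(a, unit(rng.normal(size=(8, 384))))  # unrelated "positives"
print(f"aligned: {good:.4f}  random: {bad:.4f}")
```

With aligned pairs the diagonal dominates and the loss collapses towards zero; with unrelated pairs it hovers near the log of the batch size.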

### Training data

Synthetic entity resolution data generated by [Melder's data generator](https://github.com/anomalyco/melder):

- **Side A (reference)**: 10,000 synthetic entity records with legal names, short names, country codes, LEIs, and addresses
- **Side B (query)**: 10,000 records per round -- 60% true matches (with noise: case changes, abbreviations, typos, missing fields), 10% ambiguous/heavy noise, 30% unmatched entities
- **Holdout**: A separate B dataset (seed 9999) never used in training, used for all evaluation metrics

Training pairs consist of:
- **Positives**: confirmed matched entity pairs (name + address concatenation)
- **Hard negatives**: high-scoring non-matches from Melder's review queue -- entities that look similar but are not the same
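
A minimal sketch of how such pairs could be assembled from scored match results (the field names, tuple shape, and the 0.6 review threshold are illustrative assumptions, not Melder's actual internals):

```python
def record_text(rec):
    """Concatenate name + address into the string that gets embedded."""
    return f"{rec['name']} {rec['address']}".strip()

def build_pairs(results, review_threshold=0.6):
    """Split scored (a, b, score, is_match) results into positives and
    hard negatives. The threshold is illustrative."""
    positives, hard_negatives = [], []
    for a, b, score, is_match in results:
        pair = (record_text(a), record_text(b))
        if is_match:
            positives.append(pair)
        elif score >= review_threshold:
            # Looks similar but is not the same entity -> hard negative.
            hard_negatives.append(pair)
    return positives, hard_negatives

results = [
    ({"name": "Goldman Sachs International", "address": "133 Fleet Street"},
     {"name": "GS Intl Ltd", "address": "133 Fleet St"}, 0.87, True),
    ({"name": "Goldman Sachs International", "address": "133 Fleet Street"},
     {"name": "Goldman Sachs Asset Management", "address": "Christchurch Court"}, 0.61, False),
    ({"name": "Goldman Sachs International", "address": "133 Fleet Street"},
     {"name": "Morgan Stanley & Co", "address": "20 Bank Street"}, 0.30, False),
]
pos, neg = build_pairs(results)
print(len(pos), len(neg))  # 1 1
```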

### Why Arctic-embed-xs?

We tested four base models across 12 experiments:

| Model | Parameters | Best overlap | Combined recall | Encoding speed |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | (baseline only) | -- | 2x |
| BAAI/bge-small-en-v1.5 | 33M | 0.070 | 97.3% | 1x |
| BAAI/bge-base-en-v1.5 | 110M | 0.046 | ~98.5% | 0.5x |
| **Snowflake/arctic-embed-xs** | **22M** | **0.031** | **99.7%** | **2x** |

Arctic-embed-xs won on every metric despite being the smallest model. Its superior pre-training (400M samples with hard negative mining) gives it better out-of-the-box entity discrimination than larger models trained on simpler data.

### Overlap trajectory

Score overlap coefficient across training rounds (holdout, lower is better):

| R0 | R4 | R8 | R10 | R14 | R17 | R22 |
|---|---|---|---|---|---|---|
| 0.162 | 0.156 | 0.085 | 0.047 | 0.034 | 0.033 | **0.031** |

The model converges cleanly with no regression or oscillation. Extended training to R26 confirmed convergence (overlap 0.030, within noise).

## Limitations

- **Domain-specific**: optimised for financial entity names and addresses. May underperform on other entity types (products, locations, people) without additional fine-tuning.
- **English only**: trained on English-language entity data.
- **Short text**: designed for entity names and addresses (typically 5-30 tokens). Not suitable for paragraph-level text.
- **Acronyms**: cannot match acronyms to full names (e.g. "TRMS" to "Taylor, Reeves and Mcdaniel SRL"). This is a fundamental limitation of embedding models -- use a composite scoring approach (embedding + fuzzy + BM25) for production deployments.
- **30 irreducible missed matches** out of 6,024 reachable pairs on the holdout set (19 clean, 11 heavy noise). These are extreme noise cases that no embedding model in this size class can resolve.
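
One way to implement the composite scoring mentioned above is a weighted blend of the embedding similarity with a character-level fuzzy score and a token-overlap score. Everything here is an illustrative sketch: the weights are arbitrary, and Jaccard token overlap stands in for BM25.

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word tokens -- a crude stand-in for BM25."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def composite(embedding_score: float, a: str, b: str,
              w_emb=0.6, w_fuzzy=0.25, w_tok=0.15) -> float:
    """Weighted blend; weights are illustrative -- tune them on held-out data."""
    return w_emb * embedding_score + w_fuzzy * fuzzy(a, b) + w_tok * token_overlap(a, b)

same = composite(0.95, "Goldman Sachs International", "Goldman Sachs Intl")
diff = composite(0.30, "Goldman Sachs International", "Morgan Stanley & Co")
print(f"match: {same:.3f}  non-match: {diff:.3f}")
```

The fuzzy and lexical signals reward surface similarity that embeddings may under-weight, which helps on abbreviations and truncations; acronyms still need an explicit expansion step or identifier-based matching.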

## Citation

If you use this model, please cite:

```bibtex
@misc{melder-arctic-embed-xs-er,
    title={Arctic-embed-xs fine-tuned for Entity Resolution},
    author={Melder Contributors},
    year={2026},
    url={https://huggingface.co/themelder/arctic-embed-xs-entity-resolution},
}
```

## Acknowledgements

- [Snowflake](https://www.snowflake.com/) for the excellent Arctic-embed model family
- [Sentence Transformers](https://www.sbert.net/) for the training framework
- [Melder](https://github.com/anomalyco/melder) for the evaluation pipeline and data generation