	---
	license: mit
	base_model: MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
	library_name: peft
	language:
	- en
	pipeline_tag: text-classification
	tags:
	- entity-matching
	- person-name-matching
	- record-linkage
	- deduplication
	- lora
	- deberta-v3
	- peft
	metrics:
	- f1
	- precision
	- recall
	- accuracy
	extra_gated_prompt: |-
	  Acknowledge the intended use and limitations before downloading.
	extra_gated_fields:
	  Name: text
	  Affiliation: text
	  Intended use (one sentence): text
	  I have read the Bias, Risks, and Limitations section: checkbox
	---

	# Person Name Match Likelihood (v6)

	> Author: Elad Laor  ·  [LinkedIn](https://www.linkedin.com/in/elad-laor-1b1383250/)
	>
	> A scoring head over [`MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli`](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli) trained to predict whether two strings refer to the same person. Useful for record linkage, deduplication, and KYC-style identity matching where the only signal is a pair of name strings.

	## Quick start

	```python
	from peft import PeftModel
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	import torch

	BASE = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
	ADAPTER = "LessLM/person-name-match-likelihood-v6"

	tokenizer = AutoTokenizer.from_pretrained(BASE)
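	# The base checkpoint ships a 3-way NLI head, so the 2-way head must be allowed to
	# mismatch at load time; the adapter is assumed to restore the trained head weights.
	base = AutoModelForSequenceClassification.from_pretrained(
	    BASE, num_labels=2, ignore_mismatched_sizes=True
	)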
	model = PeftModel.from_pretrained(base, ADAPTER).eval()

	def score(name_a: str, name_b: str) -> float:
	    """Return P(same person) in [0, 1]."""
	    inputs = tokenizer(name_a, name_b, return_tensors="pt", truncation=True, max_length=128)
	    with torch.no_grad():
	        logits = model(**inputs).logits
	    return torch.softmax(logits, dim=-1)[0, 1].item()

	print(score("John A. Smith", "J. Smith")) # ~0.95 (initial expansion)
	print(score("Yitzhak Cohen", "Itzhak Cohen")) # ~0.99 (transliteration)
	print(score("John Smith", "John Smyth")) # ~0.90 (typo)
	print(score("Robert Adams", "Roberta Adams")) # ~0.05 (similar but different)
	```

	The model returns a 2-way softmax over `[no_match, match]`, and the `match` probability can be read as a likelihood score. For calibrated probabilities, this repo also ships a temperature scaler (`calibration.pt`) fitted on a held-out set: load it and apply it to the logits before the softmax for a slightly lower Expected Calibration Error.
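As a rough illustration of what temperature scaling does to a single pair of logits, here is a minimal standard-library sketch. The on-disk format of `calibration.pt` is not documented in this card, so the sketch takes the temperature as a plain number; `T=0.95` is the value quoted under Reproducibility, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_match_probability(logits, temperature=0.95):
    """Divide the [no_match, match] logits by T, then take P(match).

    Assumption: the scaler is a single scalar temperature, as described
    under Reproducibility; how it is stored in calibration.pt is not shown here.
    """
    scaled = [x / temperature for x in logits]
    return softmax(scaled)[1]


# Hypothetical logits for one name pair.
raw = softmax([-1.0, 2.0])[1]
calibrated = calibrated_match_probability([-1.0, 2.0])
print(f"raw={raw:.4f} calibrated={calibrated:.4f}")
```

With a temperature below 1, confident scores move slightly further from 0.5; a temperature above 1 would pull them toward 0.5. With real model outputs, the same division would be applied to the logits tensor before `torch.softmax`.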

	## Headline metrics

	Evaluated on a held-out test set of 2,510 name pairs drawn from real public entity data (OpenSanctions) and a curated synthetic edge-case set.

	| Metric | Score |
	|---|---|
	| F1 | 0.9682 |
	| Precision | 0.9568 |
	| Recall | 0.9798 |
	| Accuracy | 0.9733 |
	| Expected Calibration Error | 0.0162 |
	| Latency (p95, CPU) | 0.42 ms |
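Two rows of this table can be sanity-checked or reproduced on your own data with a few lines of Python. The sketch below recomputes F1 from the reported precision and recall, and shows one common way to compute Expected Calibration Error. The card does not state which binning scheme was used, so the 10 equal-width confidence bins and the function name `expected_calibration_error` are assumptions.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def expected_calibration_error(match_probs, labels, n_bins=10):
    """ECE over equal-width bins of the predicted-class confidence.

    Assumption: n_bins=10 and a 0.5 decision threshold; the card does not
    specify the exact evaluation settings.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(match_probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(labels)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Precision and recall as reported in the table above.
print(round(f1_from_precision_recall(0.9568, 0.9798), 4))
```

`match_probs` would be the outputs of `score(...)` on a labelled sample, and `labels` the corresponding 0/1 ground truth.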

	### Performance by edge case

	| Edge case | Accuracy | n |
	|---|---|---|
	| `nickname` (Bob ↔ Robert) | 100.0% | 121 |
	| `name_order` (Last, First ↔ First Last) | 100.0% | 112 |
	| `transliteration` (Yitzhak ↔ Itzhak) | 100.0% | 93 |
	| `initial` (J. Smith ↔ John Smith) | 100.0% | 112 |
	| `middle_name` add/drop | 100.0% | 50 |
	| `title_suffix` (Dr., Jr.) | 100.0% | 112 |
	| `hyphenation`, `case_variation`, `combined`, `unrelated` | 100.0% | 86 |
	| `tricky_non_match` (similar non-matches) | 97.9% | 331 |
	| `partial_overlap` | 97.7% | 353 |
	| `unknown` (real-world, no curated label) | 97.5% | 682 |
	| `similar_name` (Robert ↔ Roberta) | 95.0% | 341 |
	| `typo` | 84.6% | 117 |

	The model is strongest on canonical edge cases (nicknames, initials, transliteration) and weakest on character-level typos, which overlap with the `similar_name` distribution: a one-character difference can be either a typo of the same name or a genuinely different name.
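As a consistency check, the per-category rows can be combined into an overall accuracy by weighting each row by its `n`. The short sketch below uses only the numbers from the table above; the row counts sum to the 2,510 test pairs, and the weighted average agrees with the headline accuracy to within the rounding of the per-category percentages.

```python
# (category, accuracy, n) copied from the edge-case table above.
rows = [
    ("nickname", 1.000, 121),
    ("name_order", 1.000, 112),
    ("transliteration", 1.000, 93),
    ("initial", 1.000, 112),
    ("middle_name", 1.000, 50),
    ("title_suffix", 1.000, 112),
    ("hyphenation/case_variation/combined/unrelated", 1.000, 86),
    ("tricky_non_match", 0.979, 331),
    ("partial_overlap", 0.977, 353),
    ("unknown", 0.975, 682),
    ("similar_name", 0.950, 341),
    ("typo", 0.846, 117),
]

total_pairs = sum(n for _, _, n in rows)
# Weight each category's accuracy by its share of the test set.
overall_accuracy = sum(acc * n for _, acc, n in rows) / total_pairs

print(total_pairs, round(overall_accuracy, 4))
```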

	## How it was trained

	- Base: `MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli` (184M params, 12-layer encoder with disentangled attention).
	- Adapter: LoRA (rank=16, alpha=32, dropout=0.1), targeting `query_proj`, `key_proj`, `value_proj` in every attention block. ~600K trainable parameters (~0.3% of base).
	- Loss: Focal loss (γ=2.0), which down-weights easy examples so that training focuses on hard pairs (similar names, typos).
	- Optimizer: AdamW, LR=2e-4, cosine schedule, 10% warmup, weight decay 0.01.
	- Schedule: 10 epochs, batch size 32, max sequence length 128 tokens, BF16 mixed precision.
	- Seed: 42.
	- Data: ~174K balanced match/no-match pairs (after 2.5× augmentation). Half are drawn from OpenSanctions entities (real public records) and half from a synthetic generator covering the 15 edge cases in the table above. Train/validation/calibration splits are entity-level (no person appears in more than one split) and use a deterministic MD5-based hash, so the splits reproduce bit-for-bit.
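Two items in the list above lend themselves to a small sketch: the focal loss for a single pair and a deterministic, hash-based entity split. This is not the training code. The split fractions, the entity ID string, and the function names `focal_loss` and `assign_split` are hypothetical, and the card does not say which part of the MD5 digest is used or how it maps to splits.

```python
import hashlib
import math


def focal_loss(p_match: float, label: int, gamma: float = 2.0) -> float:
    """Binary focal loss for one pair: -(1 - p_t)^gamma * log(p_t).

    p_t is the probability the model assigned to the true class.
    """
    p_t = p_match if label == 1 else 1.0 - p_match
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def assign_split(entity_id: str, val_frac: float = 0.10, cal_frac: float = 0.05) -> str:
    """Map an entity ID to a split using an MD5 hash (fractions are assumptions).

    All pairs derived from one entity share its ID, so they land in one split.
    """
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # uniform-ish value in [0, 1)
    if bucket < val_frac:
        return "validation"
    if bucket < val_frac + cal_frac:
        return "calibration"
    return "train"


# An easy example (p_t = 0.9) contributes far less than a hard one (p_t = 0.6).
print(focal_loss(0.9, 1), focal_loss(0.6, 1))
# The same (hypothetical) entity ID always maps to the same split.
print(assign_split("example-entity-001"))
```

Because the split depends only on the entity ID and not on a random number generator, re-running the data pipeline on another machine yields the same assignment.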

	## Bias, Risks, and Limitations

	- Latin script only. The model was trained on Latin-script names. It will not work well on names written in Hebrew, Arabic, Chinese, Cyrillic, or other non-Latin scripts unless they are first transliterated.
	- OpenSanctions skew. The real-world half of the training data is drawn from a public sanctions/PEP entity database. Names in that distribution skew toward political, business, and criminal figures, with heavy representation of Russian, Ukrainian, Iranian, Chinese, and Latin American transliterations and a long tail of titles and honorifics. The model may behave differently on, say, US consumer-database names than on this distribution.
	- Pair-level only. This is a pairwise matcher: given two name strings, score their likelihood of being the same person. It does not do blocking, clustering, or one-to-many matching. For dedup over a large list, pair it with a blocking layer (cheap pre-filter on first-letter, soundex, etc.) before invoking the model.
	- Names alone. No surrounding context (DOB, email, address). Two real-world people with the same name will score as a match. Use this as one signal among several in a real identity-matching pipeline, not as the sole decision.
	- Typos are the weakest category. Accuracy on character-level typos is 84.6%. If your input is OCR output or hand-transcribed names, expect more errors in this category and consider a separate spell-correction step before scoring.
	- No production guarantees. This is a research/portfolio artifact. Performance on your distribution may differ. Evaluate on a sample of your own data before relying on it.
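The blocking layer mentioned under "Pair-level only" could look roughly like the sketch below. It is an illustration under stated assumptions, not part of this repo: the blocking key (the set of token initials), and the names `blocking_keys` and `candidate_pairs`, are hypothetical choices. Using every token's initial, not only the first token's, keeps "Last, First" and "First Last" variants in a shared block.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(name: str) -> set:
    """Cheap blocking keys: the lower-cased initial of every token in the name."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return {t[0] for t in tokens if t}


def candidate_pairs(names: list) -> list:
    """Return index pairs (i, j), i < j, whose names share at least one key."""
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        for key in blocking_keys(name):
            blocks[key].append(idx)

    pairs = set()
    for members in blocks.values():
        # Indices were appended in increasing order, so i < j holds here.
        for i, j in combinations(members, 2):
            pairs.add((i, j))
    return sorted(pairs)


names = ["John Smith", "Smith, John", "Maria Garcia"]
for i, j in candidate_pairs(names):
    # In a real pipeline, this is where score(names[i], names[j]) would be called.
    print(names[i], "<->", names[j])
```

Only the pairs that survive blocking are sent to the model, which keeps the number of model calls well below the quadratic all-pairs count. A phonetic key such as Soundex could replace or supplement the initials.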

	## Intended use

	- Record linkage and dedup of person-name fields in datasets where you have only name strings to work with.
	- KYC and identity-matching workflows as one feature among several.
	- Benchmarking and research on encoder-based entity matching.

	## Out-of-scope use

	- Non-Latin scripts (Hebrew, Arabic, Chinese, etc.) without prior transliteration.
	- Surveillance, social scoring, or any use that would single out individuals for adverse treatment based on a name-match score alone.
	- High-stakes one-shot identity decisions (eligibility, denial, arrest, eviction). The model gives a likelihood, not a verdict.

	## License

	[MIT](https://opensource.org/license/mit/). You are free to use, modify, and redistribute, including commercially, provided you keep the attribution and license notice.

	## Citation

	If you use this model, a backlink to this repo or the author's profile is appreciated.

	```bibtex
	@misc{laor2026_person_name_match_v6,
	author = {Elad Laor},
	title = {Person Name Match Likelihood (v6) — a LoRA adapter on DeBERTa-v3 for pairwise person-name matching},
	year = {2026},
	url = {https://huggingface.co/LessLM/person-name-match-likelihood-v6}
	}
	```

	## Reproducibility & technical details

	- Framework versions: `peft==0.18.1`, `transformers>=4.40`, `torch>=2.0`.
	- Training environment: RunPod RTX 4090, ~3h wall-clock, BF16. Original v6 trained on RTX 3090 Ti (cross-GPU F1 delta: −0.0025).
	- Seed: All randomness is controlled by seed=42 (numpy, torch, transformers, dataloader generators). Re-running the training script with this seed and dataset version produces F1 within ±0.005 across BF16-capable GPUs.
	- Calibration: `calibration.pt` is a single-parameter temperature scaler (T=0.95) fitted on a 1.5K held-out set. Apply it to logits before the final softmax to slightly reduce Expected Calibration Error from 0.016 to ~0.012.
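For readers who want to refit the temperature on their own held-out data, here is a minimal sketch of single-parameter temperature fitting. The card does not describe the fitting procedure behind `calibration.pt`, so this is an assumption: it uses a plain grid search over T that minimises the negative log-likelihood, where many implementations use a gradient-based optimiser. The function names and the toy logits are hypothetical.

```python
import math


def mean_nll(logit_pairs, labels, temperature):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    total = 0.0
    for logits, y in zip(logit_pairs, labels):
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))  # log-sum-exp
        total += log_z - scaled[y]
    return total / len(labels)


def fit_temperature(logit_pairs, labels, lo=0.5, hi=3.0, steps=251):
    """Grid-search the temperature in [lo, hi] that minimises the mean NLL."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda t: mean_nll(logit_pairs, labels, t))


# Toy held-out set: [no_match, match] logits with one confident mistake,
# i.e. an overconfident model, so the fitted temperature should exceed 1.
toy_logits = [[-4.0, 4.0], [-4.0, 4.0], [-4.0, 4.0], [3.0, -3.0]]
toy_labels = [1, 1, 0, 0]
best_t = fit_temperature(toy_logits, toy_labels)
print(best_t)
```

A fitted temperature above 1 softens overconfident scores, while a value below 1 (such as the 0.95 reported here) sharpens them slightly.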