# lt-nobris-en
A sentence-transformer model fine-tuned for entity resolution in research security screening. Given two entity names, the model produces embeddings whose cosine similarity indicates whether they refer to the same organization.
## Quickstart

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nobris/lt-nobris-en")
emb1 = model.encode("Harbin Institute of Technology")
emb2 = model.encode("HIT")
similarity = util.cos_sim(emb1, emb2)  # ~0.85
```
## Intended Use
This model is designed for matching entity names against restricted party lists in the context of research security and export control compliance. Primary use cases include:
- Screening research proposal affiliations against the US Consolidated Screening List (CSL), Section 1260H, Section 1286, and BIOSECURE Act entities
- Matching organization name variants across languages (English, Chinese, Russian)
- Resolving acronyms, aliases, subsidiaries, and transliterations to canonical entity names
- Matching institutional website domains (e.g., "hit.edu.cn") to organization names
## Out-of-Scope Use
- Not a compliance decision system. This model produces similarity scores, not legal determinations. All matches should be reviewed by qualified compliance personnel.
- Not designed for individual/person name matching. The model is trained on organizational entity names.
- Not a general-purpose semantic similarity model. Performance on tasks outside entity resolution (e.g., sentence similarity, paraphrase detection) is not validated.
## Model Details
| Property | Value |
|---|---|
| Architecture | MPNet (12 layers, 12 heads, 768 hidden) |
| Base Model | dell-research-harvard/lt-wikidata-comp-en |
| Max Sequence Length | 512 tokens |
| Output Dimensions | 768 |
| Similarity Function | Cosine Similarity |
| Loss Function | MultipleNegativesRankingLoss (MNRL) |
| Pooling | CLS token |
| Training Precision | FP16 (mixed precision) |
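MultipleNegativesRankingLoss scores each anchor against every positive in the batch and treats the matching pair as the correct class of a softmax over in-batch negatives. A minimal NumPy sketch of that objective (the embeddings below are synthetic stand-ins, not model outputs; the `scale` of 20 is a common default, not a confirmed training value):

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (B, B) cosine-similarity logits
    # log-softmax cross-entropy with the diagonal as the target class
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float(np.mean(log_z - np.diag(scores)))

# Perfectly aligned pairs give a near-zero loss; shuffled positives give a large one.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(mnrl_loss(emb, emb))        # near zero
print(mnrl_loss(emb, emb[::-1]))  # much larger
```

Because every other pair in the batch serves as a negative, larger batches give harder training signal for free, which suits entity pairs where most names are unrelated.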
## Performance

### Validation Set Metrics
Evaluated on a held-out validation set of 259,052 entity pairs (96,168 positive, 162,884 negative):
| Threshold | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.5 | 85.5% | 77.5% | 86.0% | 81.5% |
| 0.6 | 85.9% | 85.9% | 74.2% | 79.6% |
| 0.7 | 82.2% | 91.9% | 57.2% | 70.5% |
| 0.8 | 75.6% | 95.4% | 36.1% | 52.3% |
- Average Precision (AP): 0.877
- Best-accuracy threshold: 0.581
- Best-F1 threshold: 0.541
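The F1 column is the harmonic mean of the precision and recall columns, so the rows can be checked directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Check the threshold-0.5 and threshold-0.7 rows of the table above
print(round(f1(0.775, 0.860), 3))  # 0.815
print(round(f1(0.919, 0.572), 3))  # 0.705
```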
### Acronym Discrimination

Evaluated on a 22,146-pair acronym-focused subset (an acronym on at least one side of each pair):
| Category | Accuracy | Description |
|---|---|---|
| Cross-language acronym negatives | 99.8% | English acronym vs wrong Chinese name (e.g., CASC vs 中国航天科技集团) |
| Acronym format variants | 93.7% | "CASC" matches "C.A.S.C.", "casc", "the CASC" |
| Confusable acronym negatives | 90.0% | CASC ≠ CASIC, AMMS ≠ AMS, HIT ≠ HEU |
| Defense entity negatives | 100% | Curated confusable defense entity pairs |
### Training Progression
| Epoch | Training Loss | Val AP |
|---|---|---|
| 1.0 | 0.330 | 0.862 |
| 2.0 | 0.175 | 0.877 |
| 3.0 | 0.165 | 0.877 |
## Training Data
The model was fine-tuned on 689,049 training pairs from 12 curated data sources covering research security screening scenarios. All positive pairs represent confirmed same-entity matches; all negative pairs represent confirmed different entities.
### Data Sources
| Source | Pairs | Description | License |
|---|---|---|---|
| OpenSanctions Pairs | ~401K | Analyst-judged entity matching pairs from 293 sanctions data sources. Organization/company pairs only. | CC BY-NC 4.0 |
| ROR (Research Organization Registry) | ~106K | Aliases, acronyms, and foreign-language labels for 111K research organizations worldwide. | CC0 (Public Domain) |
| US Consolidated Screening List | ~90K | Entity List, SDN, CMIC, and other US export control lists. Name-alias pairs and cross-entity negatives. | US Government (Public Domain) |
| Hard Negatives | ~53K | Curated confusable pairs and random ROR negatives. | Derived |
| ROR Website Domains | ~53K | Institutional domains (e.g., "hit.edu.cn") paired with org names. Prioritized CN/RU domains. | CC0 (Derived from ROR) |
| International Sanctions | ~45K | EU Financial Sanctions, UK Sanctions List, Australia DFAT. Multilingual aliases across 20+ languages. | Public (EU/UK/AU Government) |
| Acronym Pairs | ~16K | Acronym-to-acronym positives, confusable negatives (CASC vs CASIC, AMMS vs AMS), format variants, cross-language negatives. | Derived |
| CSET PARAT | ~7K | 702 AI companies (43 Chinese) with aliases from Georgetown CSET's Private-sector AI-Related Activity Tracker. | CC BY 4.0 |
| OpenAlex Institutions | ~2K | Real institution names from Chinese AI research papers matched against restricted entity lists. | CC0 |
| Policy Pack Entities | ~1.7K | ASPI defense entities, SOEs, BIOSECURE Act entities, SASTIND Seven Sons universities with Chinese names and aliases. | Various (see below) |
| Defense/Threat Entities | ~400 | PLA branches, defense agencies, Seven Sons universities with acronyms and Chinese aliases. Hand-curated confusable negatives. | Derived |
| Section 1260H / 1286 Lists | ~300 | Chinese military companies (1260H) and defense-linked institutions (1286) with aliases. | US Government (Public Domain) |
### Label Distribution
- Positive (same entity): 308,573 pairs (45%)
- Negative (different entity): 380,476 pairs (55%)
### Languages Covered
The training data includes entity names in English, Simplified Chinese (zh-CN), Russian (Cyrillic), and 20+ additional languages from international sanctions lists (EU covers all official EU languages).
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nobris/lt-nobris-en")

# Encode entity name pairs
pairs = [
    ("Harbin Institute of Technology", "HIT"),                        # Same entity
    ("Harbin Institute of Technology", "hit.edu.cn"),                 # Domain match
    ("Harbin Institute of Technology", "哈尔滨工业大学"),               # Chinese name
    ("Harbin Institute of Technology", "Harbin Medical University"),  # Different
    ("CASC", "CASIC"),                                                # Confusable acronyms
]

for a, b in pairs:
    emb_a = model.encode(a)
    emb_b = model.encode(b)
    sim = model.similarity([emb_a], [emb_b])[0][0].item()
    print(f"{sim:.3f}  {a} <-> {b}")
```
### Recommended Thresholds
| Use Case | Threshold | Behavior |
|---|---|---|
| High recall (don't miss matches) | 0.50 | Best F1 (81.5%); catches acronym matches |
| Balanced | 0.58 | Best accuracy (85.9%) |
| High precision (minimize false positives) | 0.70+ | 91.9% precision; fewer but more confident matches |
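In practice these thresholds are applied to a whole batch of candidate affiliations at once. A minimal sketch, assuming entity embeddings have already been computed; the 4-d vectors and entity names below are synthetic stand-ins for the model's 768-d outputs:

```python
import numpy as np

def screen(query_embs, list_embs, list_names, threshold=0.58):
    """For each query, return watchlist entries whose cosine similarity clears the threshold."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    w = list_embs / np.linalg.norm(list_embs, axis=1, keepdims=True)
    sims = q @ w.T  # (num_queries, num_list_entries) cosine similarities
    return [
        [(list_names[j], float(row[j])) for j in np.argsort(-row) if row[j] >= threshold]
        for row in sims
    ]

# Synthetic embeddings: a tiny two-entry watchlist and two queries
watch = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
names = ["Entity A", "Entity B"]
queries = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
hits = screen(queries, watch, names)
print(hits)  # first query matches "Entity A"; second matches nothing
```

The threshold argument maps directly onto the table above: lower it toward 0.50 for recall-first screening, raise it toward 0.70 when reviewer time is the constraint.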
## Bias, Risks, and Limitations

### Known Limitations
- Acronym recall at high thresholds is limited. Acronym-to-name pairs (e.g., "CASC" ↔ "China Aerospace Science and Technology Corporation") often score 0.5-0.7 rather than 0.8+. Use threshold 0.5-0.6 for acronym-heavy screening.
- Domain matching is a new capability. The model can associate "hit.edu.cn" with "Harbin Institute of Technology", but coverage is limited to the ~109K organizations in ROR that have website links.
- Person names are excluded from training. The model is not suitable for individual name matching.
- Temporal drift. Sanctions lists and entity relationships change over time. The model reflects training data as of March 2026.
### Bias Considerations
- The training data is heavily weighted toward Chinese and Russian entities due to the focus on US export control and sanctions screening. Performance on entities from other regions (e.g., Middle East, Africa) may be lower.
- The model inherits any biases present in the underlying sanctions lists and entity databases.
- False positives on legitimate Chinese academic institutions are a known risk. The model should not be used as the sole basis for restricting research collaborations.
### Ethical Considerations
This model is intended to assist compliance professionals in screening research proposals against restricted party lists. It is not a decision-making system. All flagged matches should be reviewed by qualified personnel who can consider context, intent, and applicable regulations.
Research security screening affects international academic collaboration. Overly aggressive screening can harm legitimate scientific exchange. Users should calibrate thresholds to minimize both missed matches (compliance risk) and false positives (academic freedom risk).
## Training Procedure

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Optimizer | AdamW (fused) |
| Loss | MultipleNegativesRankingLoss |
| Precision | FP16 (mixed) |
| Evaluation Steps | 500 |
| Training Time | 170 minutes (NVIDIA GPU, 16GB VRAM) |
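The table above maps onto a sentence-transformers v3+ training configuration roughly as follows. This is a sketch, not the actual training script: dataset loading is elided, and the `output_dir` and trainer wiring are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("dell-research-harvard/lt-wikidata-comp-en")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="lt-nobris-en",       # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=100,
    optim="adamw_torch_fused",       # fused AdamW
    fp16=True,                       # mixed-precision training
    eval_strategy="steps",
    eval_steps=500,
)
# trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=..., loss=loss)
# trainer.train()
```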
### Framework Versions
- Python: 3.14
- Sentence Transformers: 5.3.0
- Transformers: 5.3.0
- PyTorch: 2.12.0+cu128
## Licensing and Attribution

### Model License
This model is released under the MIT License.
### Base Model
Fine-tuned from dell-research-harvard/lt-wikidata-comp-en (LinkTransformer), itself fine-tuned from sentence-transformers/multi-qa-mpnet-base-dot-v1 (Apache 2.0).
### Training Data Licenses
| Data Source | License | Commercial Use |
|---|---|---|
| ROR | CC0 (Public Domain) | Yes |
| OpenAlex | CC0 (Public Domain) | Yes |
| US CSL / 1260H / 1286 | US Government (Public Domain) | Yes |
| EU / UK / AU Sanctions Lists | Government (Public Domain) | Yes |
| CSET PARAT | CC BY 4.0 | Yes (with attribution) |
| OpenSanctions Pairs | CC BY-NC 4.0 | Non-commercial only (commercial license available from opensanctions.org) |
| ASPI / Policy Pack | Research/reporting use | Verify with source |
Important: The OpenSanctions training data is licensed CC BY-NC 4.0. If you intend to use this model commercially, you should either (a) obtain a commercial license from OpenSanctions, or (b) retrain without the OpenSanctions data.
## Citation

### This Model

```bibtex
@misc{nobris2026ltnobris,
  title={lt-nobris-en: Entity Resolution for Research Security Screening},
  author={Nobris},
  year={2026},
  url={https://huggingface.co/nobris/lt-nobris-en}
}
```
### LinkTransformer (Base Model)

```bibtex
@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
### Sentence-Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084}
}
```
### MultipleNegativesRankingLoss

```bibtex
@misc{oord2019representation,
  title={Representation Learning with Contrastive Predictive Coding},
  author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
  year={2019},
  eprint={1807.03748},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
## Model Card Authors
Nobris Research Security Team
## Contact

For questions about this model, contact info@nobris.dev.