# lt-nobris-en
A sentence-transformer model fine-tuned for entity resolution in research security screening. Given two entity names, the model produces embeddings whose cosine similarity indicates whether they refer to the same organization.
## Quickstart

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nobris/lt-nobris-en")
emb1 = model.encode("Harbin Institute of Technology")
emb2 = model.encode("HIT")
similarity = util.cos_sim(emb1, emb2)  # ~0.85
```
## Intended Use
This model is designed for matching entity names against restricted party lists in the context of research security and export control compliance. Primary use cases include:
- Screening research proposal affiliations against the US Consolidated Screening List (CSL), Section 1260H, Section 1286, and BIOSECURE Act entities
- Matching organization name variants across languages (English, Chinese, Russian)
- Resolving acronyms, aliases, subsidiaries, and transliterations to canonical entity names
- Matching institutional website domains (e.g., "hit.edu.cn") to organization names
## Out-of-Scope Use
- Not a compliance decision system. This model produces similarity scores, not legal determinations. All matches should be reviewed by qualified compliance personnel.
- Not designed for individual/person name matching. The model is trained on organizational entity names.
- Not a general-purpose semantic similarity model. Performance on tasks outside entity resolution (e.g., sentence similarity, paraphrase detection) is not validated.
## Model Details
| Property | Value |
|---|---|
| Architecture | MPNet (12 layers, 12 heads, 768 hidden) |
| Base Model | dell-research-harvard/lt-wikidata-comp-en |
| Max Sequence Length | 512 tokens |
| Output Dimensions | 768 |
| Similarity Function | Cosine Similarity |
| Loss Function | MultipleNegativesRankingLoss (MNRL) |
| Pooling | CLS token |
| Training Precision | FP16 (mixed precision) |
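MultipleNegativesRankingLoss scores each anchor against every positive in the batch and treats the matching pair as the correct class of a softmax over in-batch negatives. A minimal NumPy sketch of that objective (the embeddings below are synthetic stand-ins, not model outputs; the `scale` of 20 is a common default, not a confirmed training value):

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (B, B) cosine-similarity logits
    # log-softmax cross-entropy with the diagonal as the target class
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float(np.mean(log_z - np.diag(scores)))

# Perfectly aligned pairs give a near-zero loss; shuffled positives give a large one.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
print(mnrl_loss(emb, emb))        # near zero
print(mnrl_loss(emb, emb[::-1]))  # much larger
```

Because every other pair in the batch serves as a negative, larger batches give harder training signal for free, which suits entity pairs where most names are unrelated.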
## Performance

### Validation Set Metrics
Evaluated on a held-out validation set of 259,052 entity pairs (96,168 positive, 162,884 negative):
| Threshold | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.5 | 85.5% | 77.5% | 86.0% | 81.5% |
| 0.6 | 85.9% | 85.9% | 74.2% | 79.6% |
| 0.7 | 82.2% | 91.9% | 57.2% | 70.5% |
| 0.8 | 75.6% | 95.4% | 36.1% | 52.3% |
- Average Precision (AP): 0.877
- Best-accuracy threshold: 0.581
- Best-F1 threshold: 0.541
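The F1 column is the harmonic mean of the precision and recall columns, so the rows can be checked directly:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Check the threshold-0.5 and threshold-0.7 rows of the table above
print(round(f1(0.775, 0.860), 3))  # 0.815
print(round(f1(0.919, 0.572), 3))  # 0.705
```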
### Acronym Discrimination

Evaluated on a 22,146-pair acronym-focused subset (an acronym on at least one side of each pair):
| Category | Accuracy | Description |
|---|---|---|
| Cross-language acronym negatives | 99.8% | English acronym vs wrong Chinese name (e.g., CASC vs 中国航天科技集团) |
| Acronym format variants | 93.7% | "CASC" matches "C.A.S.C.", "casc", "the CASC" |
| Confusable acronym negatives | 90.0% | CASC ≠ CASIC, AMMS ≠ AMS, HIT ≠ HEU |
| Defense entity negatives | 100% | Curated confusable defense entity pairs |
### Training Progression
| Epoch | Training Loss | Val AP |
|---|---|---|
| 1.0 | 0.330 | 0.862 |
| 2.0 | 0.175 | 0.877 |
| 3.0 | 0.165 | 0.877 |
## Training Data
The model was fine-tuned on 689,049 training pairs from 12 curated data sources covering research security screening scenarios. All positive pairs represent confirmed same-entity matches; all negative pairs represent confirmed different entities.
### Data Sources
| Source | Pairs | Description | License |
|---|---|---|---|
| OpenSanctions Pairs | ~401K | Analyst-judged entity matching pairs from 293 sanctions data sources. Organization/company pairs only. | CC BY-NC 4.0 |
| ROR (Research Organization Registry) | ~106K | Aliases, acronyms, and foreign-language labels for 111K research organizations worldwide. | CC0 (Public Domain) |
| US Consolidated Screening List | ~90K | Entity List, SDN, CMIC, and other US export control lists. Name-alias pairs and cross-entity negatives. | US Government (Public Domain) |
| Hard Negatives | ~53K | Curated confusable pairs and random ROR negatives. | Derived |
| ROR Website Domains | ~53K | Institutional domains (e.g., "hit.edu.cn") paired with org names. Prioritized CN/RU domains. | CC0 (Derived from ROR) |
| International Sanctions | ~45K | EU Financial Sanctions, UK Sanctions List, Australia DFAT. Multilingual aliases across 20+ languages. | Public (EU/UK/AU Government) |
| Acronym Pairs | ~16K | Acronym-to-acronym positives, confusable negatives (CASC vs CASIC, AMMS vs AMS), format variants, cross-language negatives. | Derived |
| CSET PARAT | ~7K | 702 AI companies (43 Chinese) with aliases from Georgetown CSET's Private-sector AI-Related Activity Tracker. | CC BY 4.0 |
| OpenAlex Institutions | ~2K | Real institution names from Chinese AI research papers matched against restricted entity lists. | CC0 |
| Policy Pack Entities | ~1.7K | ASPI defense entities, SOEs, BIOSECURE Act entities, SASTIND Seven Sons universities with Chinese names and aliases. | Various (see below) |
| Defense/Threat Entities | ~400 | PLA branches, defense agencies, Seven Sons universities with acronyms and Chinese aliases. Hand-curated confusable negatives. | Derived |
| Section 1260H / 1286 Lists | ~300 | Chinese military companies (1260H) and defense-linked institutions (1286) with aliases. | US Government (Public Domain) |
### Label Distribution
- Positive (same entity): 308,573 pairs (45%)
- Negative (different entity): 380,476 pairs (55%)
### Languages Covered
The training data includes entity names in English, Simplified Chinese (zh-CN), Russian (Cyrillic), and 20+ additional languages from international sanctions lists (EU covers all official EU languages).
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nobris/lt-nobris-en")

# Encode entity name pairs
pairs = [
    ("Harbin Institute of Technology", "HIT"),                        # Same entity
    ("Harbin Institute of Technology", "hit.edu.cn"),                 # Domain match
    ("Harbin Institute of Technology", "哈尔滨工业大学"),               # Chinese name
    ("Harbin Institute of Technology", "Harbin Medical University"),  # Different
    ("CASC", "CASIC"),                                                # Confusable acronyms
]

for a, b in pairs:
    emb_a = model.encode(a)
    emb_b = model.encode(b)
    sim = model.similarity([emb_a], [emb_b])[0][0].item()
    print(f"{sim:.3f}  {a} <-> {b}")
```
### Recommended Thresholds
| Use Case | Threshold | Behavior |
|---|---|---|
| High recall (don't miss matches) | 0.50 | Best F1 (81.5%); catches acronym matches |
| Balanced | 0.58 | Best accuracy (85.9%) |
| High precision (minimize false positives) | 0.70+ | 91.9% precision; fewer but more confident matches |
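In practice these thresholds are applied to a whole batch of candidate affiliations at once. A minimal sketch, assuming entity embeddings have already been computed; the 4-d vectors and entity names below are synthetic stand-ins for the model's 768-d outputs:

```python
import numpy as np

def screen(query_embs, list_embs, list_names, threshold=0.58):
    """For each query, return watchlist entries whose cosine similarity clears the threshold."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    w = list_embs / np.linalg.norm(list_embs, axis=1, keepdims=True)
    sims = q @ w.T  # (num_queries, num_list_entries) cosine similarities
    return [
        [(list_names[j], float(row[j])) for j in np.argsort(-row) if row[j] >= threshold]
        for row in sims
    ]

# Synthetic embeddings: a tiny two-entry watchlist and two queries
watch = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
names = ["Entity A", "Entity B"]
queries = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
hits = screen(queries, watch, names)
print(hits)  # first query matches "Entity A"; second matches nothing
```

The threshold argument maps directly onto the table above: lower it toward 0.50 for recall-first screening, raise it toward 0.70 when reviewer time is the constraint.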
## Bias, Risks, and Limitations

### Known Limitations
- Acronym recall at high thresholds is limited. Acronym-to-name pairs (e.g., "CASC" ↔ "China Aerospace Science and Technology Corporation") often score 0.5-0.7 rather than 0.8+. Use threshold 0.5-0.6 for acronym-heavy screening.
- Domain matching is a new capability. The model can associate "hit.edu.cn" with "Harbin Institute of Technology", but coverage is limited to the ~109K organizations in ROR that have website links.
- Person names are excluded from training. The model is not suitable for individual name matching.
- Temporal drift. Sanctions lists and entity relationships change over time. The model reflects training data as of March 2026.
### Bias Considerations
- The training data is heavily weighted toward Chinese and Russian entities due to the focus on US export control and sanctions screening. Performance on entities from other regions (e.g., Middle East, Africa) may be lower.
- The model inherits any biases present in the underlying sanctions lists and entity databases.
- False positives on legitimate Chinese academic institutions are a known risk. The model should not be used as the sole basis for restricting research collaborations.
### Ethical Considerations
This model is intended to assist compliance professionals in screening research proposals against restricted party lists. It is not a decision-making system. All flagged matches should be reviewed by qualified personnel who can consider context, intent, and applicable regulations.
Research security screening affects international academic collaboration. Overly aggressive screening can harm legitimate scientific exchange. Users should calibrate thresholds to minimize both missed matches (compliance risk) and false positives (academic freedom risk).
## Training Procedure

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Optimizer | AdamW (fused) |
| Loss | MultipleNegativesRankingLoss |
| Precision | FP16 (mixed) |
| Evaluation Steps | 500 |
| Training Time | 170 minutes (NVIDIA GPU, 16GB VRAM) |
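The table above maps onto a sentence-transformers v3+ training configuration roughly as follows. This is a sketch, not the actual training script: dataset loading is elided, and the `output_dir` and trainer wiring are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("dell-research-harvard/lt-wikidata-comp-en")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="lt-nobris-en",       # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=100,
    optim="adamw_torch_fused",       # fused AdamW
    fp16=True,                       # mixed-precision training
    eval_strategy="steps",
    eval_steps=500,
)
# trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=..., loss=loss)
# trainer.train()
```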
### Framework Versions
- Python: 3.14
- Sentence Transformers: 5.3.0
- Transformers: 5.3.0
- PyTorch: 2.12.0+cu128
## Licensing and Attribution

### Model License
This model is released under the MIT License.
### Base Model
Fine-tuned from dell-research-harvard/lt-wikidata-comp-en (LinkTransformer), itself fine-tuned from sentence-transformers/multi-qa-mpnet-base-dot-v1 (Apache 2.0).
### Training Data Licenses
| Data Source | License | Commercial Use |
|---|---|---|
| ROR | CC0 (Public Domain) | Yes |
| OpenAlex | CC0 (Public Domain) | Yes |
| US CSL / 1260H / 1286 | US Government (Public Domain) | Yes |
| EU / UK / AU Sanctions Lists | Government (Public Domain) | Yes |
| CSET PARAT | CC BY 4.0 | Yes (with attribution) |
| OpenSanctions Pairs | CC BY-NC 4.0 | Non-commercial only (commercial license available from opensanctions.org) |
| ASPI / Policy Pack | Research/reporting use | Verify with source |
Important: The OpenSanctions training data is licensed CC BY-NC 4.0. If you intend to use this model commercially, you should either (a) obtain a commercial license from OpenSanctions, or (b) retrain without the OpenSanctions data.
## Citation

### This Model

```bibtex
@misc{nobris2026ltnobris,
  title={lt-nobris-en: Entity Resolution for Research Security Screening},
  author={Nobris},
  year={2026},
  url={https://huggingface.co/nobris/lt-nobris-en}
}
```
### LinkTransformer (Base Model)

```bibtex
@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
### Sentence-Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084}
}
```
### MultipleNegativesRankingLoss

```bibtex
@misc{oord2019representation,
  title={Representation Learning with Contrastive Predictive Coding},
  author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
  year={2019},
  eprint={1807.03748},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
## Model Card Authors
Nobris Research Security Team
## Contact

For questions about this model, contact info@nobris.dev.