lt-nobris-en

A sentence-transformer model fine-tuned for entity resolution in research security screening. Given two entity names, the model produces embeddings whose cosine similarity indicates whether they refer to the same organization.

Quickstart

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nobris/lt-nobris-en")
emb1 = model.encode("Harbin Institute of Technology")
emb2 = model.encode("HIT")
similarity = util.cos_sim(emb1, emb2)  # ~0.85

Intended Use

This model is designed for matching entity names against restricted party lists in the context of research security and export control compliance. Primary use cases include:

  • Screening research proposal affiliations against the US Consolidated Screening List (CSL), Section 1260H, Section 1286, and BIOSECURE Act entities
  • Matching organization name variants across languages (English, Chinese, Russian)
  • Resolving acronyms, aliases, subsidiaries, and transliterations to canonical entity names
  • Matching institutional website domains (e.g., "hit.edu.cn") to organization names
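
In practice, each of these use cases reduces to comparing a candidate name's embedding against precomputed embeddings of a restricted-party list. A minimal sketch of that comparison step, using plain numpy (the `screen` helper and its 0.58 default threshold are illustrative, not part of the model's API; in a real pipeline the vectors come from `model.encode`):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def screen(candidate: np.ndarray, watchlist: dict[str, np.ndarray],
           threshold: float = 0.58) -> list[tuple[str, float]]:
    """Return watchlist entries whose similarity to the candidate's
    embedding meets the threshold, sorted by score (descending)."""
    hits = [(name, cosine_sim(candidate, emb)) for name, emb in watchlist.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
```

Encoding the watchlist once and reusing those embeddings across many candidate names keeps screening cheap, since only the candidate side needs a fresh forward pass.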

Out-of-Scope Use

  • Not a compliance decision system. This model produces similarity scores, not legal determinations. All matches should be reviewed by qualified compliance personnel.
  • Not designed for individual/person name matching. The model is trained on organizational entity names.
  • Not a general-purpose semantic similarity model. Performance on tasks outside entity resolution (e.g., sentence similarity, paraphrase detection) is not validated.

Model Details

| Property | Value |
|---|---|
| Architecture | MPNet (12 layers, 12 heads, 768 hidden) |
| Base Model | dell-research-harvard/lt-wikidata-comp-en |
| Max Sequence Length | 512 tokens |
| Output Dimensions | 768 |
| Similarity Function | Cosine Similarity |
| Loss Function | MultipleNegativesRankingLoss (MNRL) |
| Pooling | CLS token |
| Training Precision | FP16 (mixed precision) |
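
The loss listed above, MultipleNegativesRankingLoss, treats each (anchor, positive) pair in a batch as a classification problem: the anchor's own positive must out-score every other positive in the batch, each of which serves as an in-batch negative. A numpy sketch of the computation (illustrative only, with a hypothetical `scale` default; training used the sentence-transformers implementation):

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """MultipleNegativesRankingLoss: cross-entropy over scaled cosine
    similarities, with each anchor's own positive as the target class."""
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)  # (batch, batch): row i vs every positive
    # Numerically stable log-softmax over each row
    m = sims.max(axis=1, keepdims=True)
    log_probs = sims - (m + np.log(np.exp(sims - m).sum(axis=1, keepdims=True)))
    # The diagonal holds each anchor's true-positive log-probability
    return float(-np.mean(np.diag(log_probs)))
```

Because every other pair in the batch acts as a free negative, larger batches give harder training signal without extra labeling, which is why MNRL is a common choice for entity-matching data made of positive pairs.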

Performance

Validation Set Metrics

Evaluated on a held-out validation set of 259,052 entity pairs (96,168 positive, 162,884 negative):

| Threshold | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.5 | 85.5% | 77.5% | 86.0% | 81.5% |
| 0.6 | 85.9% | 85.9% | 74.2% | 79.6% |
| 0.7 | 82.2% | 91.9% | 57.2% | 70.5% |
| 0.8 | 75.6% | 95.4% | 36.1% | 52.3% |

Average Precision (AP): 0.877 | Best Accuracy Threshold: 0.581 | Best F1 Threshold: 0.541
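
Each row of the table follows from the raw (score, label) pairs by thresholding the cosine score: raising the threshold converts borderline positives into misses, trading recall for precision. A sketch of the computation on toy data (not the actual evaluation script):

```python
def metrics_at_threshold(scores, labels, threshold):
    """Accuracy/precision/recall/F1 for binary same-entity labels,
    predicting 'match' whenever the cosine score meets the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return accuracy, precision, recall, f1
```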

Acronym Discrimination

Evaluated on a 22,146-pair acronym-focused subset (with an acronym on at least one side):

| Category | Accuracy | Description |
|---|---|---|
| Cross-language acronym negatives | 99.8% | English acronym vs wrong Chinese name (e.g., CASC vs 中国航天科工集团) |
| Acronym format variants | 93.7% | "CASC" matches "C.A.S.C.", "casc", "the CASC" |
| Confusable acronym negatives | 90.0% | CASC ≠ CASIC, AMMS ≠ AMS, HIT ≠ HEU |
| Defense entity negatives | 100% | Curated confusable defense entity pairs |

Training Progression

| Epoch | Training Loss | Val AP |
|---|---|---|
| 1.0 | 0.330 | 0.862 |
| 2.0 | 0.175 | 0.877 |
| 3.0 | 0.165 | 0.877 |

Training Data

The model was fine-tuned on 689,049 training pairs from 12 curated data sources covering research security screening scenarios. All positive pairs represent confirmed same-entity matches; all negative pairs represent confirmed different entities.

Data Sources

| Source | Pairs | Description | License |
|---|---|---|---|
| OpenSanctions Pairs | ~401K | Analyst-judged entity matching pairs from 293 sanctions data sources. Organization/company pairs only. | CC BY-NC 4.0 |
| ROR (Research Organization Registry) | ~106K | Aliases, acronyms, and foreign-language labels for 111K research organizations worldwide. | CC0 (Public Domain) |
| US Consolidated Screening List | ~90K | Entity List, SDN, CMIC, and other US export control lists. Name-alias pairs and cross-entity negatives. | US Government (Public Domain) |
| Hard Negatives | ~53K | Curated confusable pairs and random ROR negatives. | Derived |
| ROR Website Domains | ~53K | Institutional domains (e.g., "hit.edu.cn") paired with org names. Prioritized CN/RU domains. | CC0 (Derived from ROR) |
| International Sanctions | ~45K | EU Financial Sanctions, UK Sanctions List, Australia DFAT. Multilingual aliases across 20+ languages. | Public (EU/UK/AU Government) |
| Acronym Pairs | ~16K | Acronym-to-acronym positives, confusable negatives (CASC vs CASIC, AMMS vs AMS), format variants, cross-language negatives. | Derived |
| CSET PARAT | ~7K | 702 AI companies (43 Chinese) with aliases from Georgetown CSET's Private-sector AI-Related Activity Tracker. | CC BY 4.0 |
| OpenAlex Institutions | ~2K | Real institution names from Chinese AI research papers matched against restricted entity lists. | CC0 |
| Policy Pack Entities | ~1.7K | ASPI defense entities, SOEs, BIOSECURE Act entities, SASTIND Seven Sons universities with Chinese names and aliases. | Various (see below) |
| Defense/Threat Entities | ~400 | PLA branches, defense agencies, Seven Sons universities with acronyms and Chinese aliases. Hand-curated confusable negatives. | Derived |
| Section 1260H / 1286 Lists | ~300 | Chinese military companies (1260H) and defense-linked institutions (1286) with aliases. | US Government (Public Domain) |

Label Distribution

  • Positive (same entity): 308,573 pairs (45%)
  • Negative (different entity): 380,476 pairs (55%)

Languages Covered

The training data includes entity names in English, Simplified Chinese (zh-CN), Russian (Cyrillic), and 20+ additional languages from international sanctions lists (EU covers all official EU languages).

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nobris/lt-nobris-en")

# Encode entity name pairs
pairs = [
    ("Harbin Institute of Technology", "HIT"),              # Same entity
    ("Harbin Institute of Technology", "hit.edu.cn"),        # Domain match
    ("Harbin Institute of Technology", "哈尔滨工业大学"),       # Chinese name
    ("Harbin Institute of Technology", "Harbin Medical University"),  # Different
    ("CASC", "CASIC"),                                       # Confusable acronyms
]

for a, b in pairs:
    emb_a = model.encode(a)
    emb_b = model.encode(b)
    sim = model.similarity([emb_a], [emb_b])[0][0].item()
    print(f"{sim:.3f}  {a}  <->  {b}")

Recommended Thresholds

| Use Case | Threshold | Behavior |
|---|---|---|
| High recall (don't miss matches) | 0.50 | Best F1 (81.5%); catches acronym matches |
| Balanced | 0.58 | Best accuracy (85.9%) |
| High precision (minimize false positives) | 0.70+ | 91.9% precision; fewer but more confident matches |
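
One way to apply these thresholds is a two-band triage rather than a single cutoff: auto-flag high-confidence matches, route the middle band to human review, and discard the rest. A sketch (the `triage` helper and its band boundaries are illustrative, not a compliance recommendation):

```python
def triage(score: float, low: float = 0.50, high: float = 0.70) -> str:
    """Map a cosine similarity score to a screening decision:
    auto-flag above `high`, human review in [low, high), discard below `low`."""
    if score >= high:
        return "match"     # high-precision band: likely the same entity
    if score >= low:
        return "review"    # high-recall band: send to an analyst
    return "no-match"
```

Consistent with the out-of-scope notes above, even "match" decisions should still be reviewed by qualified compliance personnel.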

Bias, Risks, and Limitations

Known Limitations

  • Acronym recall at high thresholds is limited. Acronym-to-name pairs (e.g., "CASC" ↔ "China Aerospace Science and Technology Corporation") often score 0.5-0.7 rather than 0.8+. Use threshold 0.5-0.6 for acronym-heavy screening.
  • Domain matching is a new capability. The model can associate "hit.edu.cn" with "Harbin Institute of Technology", but coverage is limited to the ~109K organizations in ROR that have website links.
  • Person names are excluded from training. The model is not suitable for individual name matching.
  • Temporal drift. Sanctions lists and entity relationships change over time. The model reflects training data as of March 2026.

Bias Considerations

  • The training data is heavily weighted toward Chinese and Russian entities due to the focus on US export control and sanctions screening. Performance on entities from other regions (e.g., Middle East, Africa) may be lower.
  • The model inherits any biases present in the underlying sanctions lists and entity databases.
  • False positives on legitimate Chinese academic institutions are a known risk. The model should not be used as the sole basis for restricting research collaborations.

Ethical Considerations

This model is intended to assist compliance professionals in screening research proposals against restricted party lists. It is not a decision-making system. All flagged matches should be reviewed by qualified personnel who can consider context, intent, and applicable regulations.

Research security screening affects international academic collaboration. Overly aggressive screening can harm legitimate scientific exchange. Users should calibrate thresholds to minimize both missed matches (compliance risk) and false positives (academic freedom risk).

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Optimizer | AdamW (fused) |
| Loss | MultipleNegativesRankingLoss |
| Precision | FP16 (mixed) |
| Evaluation Steps | 500 |
| Training Time | 170 minutes (NVIDIA GPU, 16GB VRAM) |

Framework Versions

  • Python: 3.14
  • Sentence Transformers: 5.3.0
  • Transformers: 5.3.0
  • PyTorch: 2.12.0+cu128

Licensing and Attribution

Model License

This model is released under the MIT License.

Base Model

Fine-tuned from dell-research-harvard/lt-wikidata-comp-en (LinkTransformer), itself fine-tuned from sentence-transformers/multi-qa-mpnet-base-dot-v1 (Apache 2.0).

Training Data Licenses

| Data Source | License | Commercial Use |
|---|---|---|
| ROR | CC0 (Public Domain) | Yes |
| OpenAlex | CC0 (Public Domain) | Yes |
| US CSL / 1260H / 1286 | US Government (Public Domain) | Yes |
| EU / UK / AU Sanctions Lists | Government (Public Domain) | Yes |
| CSET PARAT | CC BY 4.0 | Yes (with attribution) |
| OpenSanctions Pairs | CC BY-NC 4.0 | Non-commercial only (commercial license available from opensanctions.org) |
| ASPI / Policy Pack | Research/reporting use | Verify with source |

Important: The OpenSanctions training data is licensed CC BY-NC 4.0. If you intend to use this model commercially, you should either (a) obtain a commercial license from OpenSanctions, or (b) retrain without the OpenSanctions data.

Citation

This Model

@misc{nobris2026ltnobris,
  title={lt-nobris-en: Entity Resolution for Research Security Screening},
  author={Nobris},
  year={2026},
  url={https://huggingface.co/nobris/lt-nobris-en}
}

LinkTransformer (Base Model)

@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Sentence-Transformers

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084}
}

MultipleNegativesRankingLoss

@misc{oord2019representation,
  title={Representation Learning with Contrastive Predictive Coding},
  author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
  year={2019},
  eprint={1807.03748},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Model Card Authors

Nobris Research Security Team

Contact

For questions about this model, contact: info@nobris.dev
