w-nobris committed on
Commit e5d9e23 · verified · 1 Parent(s): 694e131

Add new SentenceTransformer model
1_Pooling/config.json ADDED
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
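The config above enables only `pooling_mode_cls_token`: the sentence embedding is the transformer's first-token ([CLS]) vector, with all other pooling modes disabled. A minimal numpy sketch of that step, using toy shapes (the real model emits 768-dimensional hidden states, per `word_embedding_dimension`):

```python
import numpy as np

# Toy transformer output: batch of 2 sequences, 4 tokens each, hidden size 3
# (illustrative stand-in for the model's real hidden size of 768).
token_embeddings = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """CLS pooling: keep only the first token's vector for each sequence,
    matching pooling_mode_cls_token=true with every other mode false."""
    return token_embeddings[:, 0, :]

pooled = cls_pool(token_embeddings)
print(pooled.shape)  # (2, 3): one fixed-size vector per input sequence
```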
README.md ADDED
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- entity-resolution
- research-security
- export-control
- sanctions-screening
license: mit
language:
- en
- zh
- ru
base_model: dell-research-harvard/lt-wikidata-comp-en
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
- cosine_f1
- cosine_precision
- cosine_recall
- cosine_ap
- cosine_mcc
model-index:
- name: lt-nobris-en
  results:
  - task:
      type: binary-classification
      name: Entity Resolution
    dataset:
      name: nobris-val
      type: nobris-val
    metrics:
    - type: cosine_accuracy
      value: 0.859
      name: Cosine Accuracy
    - type: cosine_f1
      value: 0.815
      name: Cosine F1
    - type: cosine_precision
      value: 0.775
      name: Cosine Precision
    - type: cosine_recall
      value: 0.860
      name: Cosine Recall
    - type: cosine_ap
      value: 0.877
      name: Average Precision
    - type: cosine_mcc
      value: 0.679
      name: Matthews Correlation Coefficient
---

# lt-nobris-en

A sentence-transformer model fine-tuned for **entity resolution in research security screening**. Given two entity names, the model produces embeddings whose cosine similarity indicates whether they refer to the same organization.

## Quickstart

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nobris/lt-nobris-en")
emb1 = model.encode("Harbin Institute of Technology")
emb2 = model.encode("HIT")
similarity = util.cos_sim(emb1, emb2)  # ~0.85
```

## Intended Use

This model is designed for matching entity names against restricted party lists in the context of research security and export control compliance. Primary use cases include:

- Screening research proposal affiliations against the US Consolidated Screening List (CSL), Section 1260H, Section 1286, and BIOSECURE Act entities
- Matching organization name variants across languages (English, Chinese, Russian)
- Resolving acronyms, aliases, subsidiaries, and transliterations to canonical entity names
- Matching institutional website domains (e.g., "hit.edu.cn") to organization names

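In a screening pipeline, the restricted list is typically encoded once and each candidate affiliation is compared against the cached embeddings. A sketch of that comparison step; `screen` is an illustrative helper (not part of this model or the sentence-transformers API), and the toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode(...)`:

```python
import numpy as np

def screen(query_emb, list_embs, names, threshold=0.5):
    """Rank restricted-list entries by cosine similarity to the query
    embedding; return (name, score) pairs at or above the threshold,
    best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    m = list_embs / np.linalg.norm(list_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per list entry
    order = np.argsort(-sims)         # best-first
    return [(names[i], float(sims[i])) for i in order if sims[i] >= threshold]

# Illustrative embeddings only; in practice these come from model.encode(...).
names = ["Harbin Institute of Technology", "Harbin Medical University"]
list_embs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
query_emb = np.array([0.9, 0.2, 0.0])  # pretend encoding of "HIT"
hits = screen(query_emb, list_embs, names)
print(hits)  # only the first entry clears the 0.5 threshold
```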
### Out-of-Scope Use

- **Not a compliance decision system.** This model produces similarity scores, not legal determinations. All matches should be reviewed by qualified compliance personnel.
- **Not designed for individual/person name matching.** The model is trained on organizational entity names.
- **Not a general-purpose semantic similarity model.** Performance on tasks outside entity resolution (e.g., sentence similarity, paraphrase detection) is not validated.

## Model Details

| Property | Value |
|:--|:--|
| **Architecture** | MPNet (12 layers, 12 heads, 768 hidden) |
| **Base Model** | [dell-research-harvard/lt-wikidata-comp-en](https://huggingface.co/dell-research-harvard/lt-wikidata-comp-en) |
| **Max Sequence Length** | 512 tokens |
| **Output Dimensions** | 768 |
| **Similarity Function** | Cosine Similarity |
| **Loss Function** | MultipleNegativesRankingLoss (MNRL) |
| **Pooling** | CLS token |
| **Training Precision** | FP16 (mixed precision) |

## Performance

### Validation Set Metrics

Evaluated on a held-out validation set of 259,052 entity pairs (96,168 positive, 162,884 negative):

| Threshold | Accuracy | Precision | Recall | F1 |
|:---------:|:--------:|:---------:|:------:|:--:|
| 0.5 | 85.5% | 77.5% | 86.0% | **81.5%** |
| **0.6** | **85.9%** | 85.9% | 74.2% | 79.6% |
| 0.7 | 82.2% | 91.9% | 57.2% | 70.5% |
| 0.8 | 75.6% | 95.4% | 36.1% | 52.3% |

**Average Precision (AP): 0.877** | **Best Accuracy Threshold: 0.581** | **Best F1 Threshold: 0.541**

### Acronym Discrimination

Evaluated on a 22,146-pair acronym-focused subset (an acronym on at least one side):

| Category | Accuracy | Description |
|:---------|:--------:|:------------|
| Cross-language acronym negatives | **99.8%** | English acronym vs wrong Chinese name (e.g., CASC vs 中国航天科工集团) |
| Acronym format variants | **93.7%** | "CASC" matches "C.A.S.C.", "casc", "the CASC" |
| Confusable acronym negatives | **90.0%** | CASC ≠ CASIC, AMMS ≠ AMS, HIT ≠ HEU |
| Defense entity negatives | **100%** | Curated confusable defense entity pairs |

### Training Progression

| Epoch | Training Loss | Val AP |
|:--:|:--:|:--:|
| 1.0 | 0.330 | 0.862 |
| 2.0 | 0.175 | 0.877 |
| 3.0 | 0.165 | **0.877** |

## Training Data

The model was fine-tuned on 689,049 training pairs from 12 curated data sources covering research security screening scenarios. All positive pairs represent confirmed same-entity matches; all negative pairs represent confirmed different entities.

### Data Sources

| Source | Pairs | Description | License |
|:--|--:|:--|:--|
| **OpenSanctions Pairs** | ~401K | Analyst-judged entity matching pairs from 293 sanctions data sources. Organization/company pairs only. | CC BY-NC 4.0 |
| **ROR (Research Organization Registry)** | ~106K | Aliases, acronyms, and foreign-language labels for 111K research organizations worldwide. | CC0 (Public Domain) |
| **US Consolidated Screening List** | ~90K | Entity List, SDN, CMIC, and other US export control lists. Name-alias pairs and cross-entity negatives. | US Government (Public Domain) |
| **Hard Negatives** | ~53K | Curated confusable pairs and random ROR negatives. | Derived |
| **ROR Website Domains** | ~53K | Institutional domains (e.g., "hit.edu.cn") paired with org names. Prioritized CN/RU domains. | CC0 (Derived from ROR) |
| **International Sanctions** | ~45K | EU Financial Sanctions, UK Sanctions List, Australia DFAT. Multilingual aliases across 20+ languages. | Public (EU/UK/AU Government) |
| **Acronym Pairs** | ~16K | Acronym-to-acronym positives, confusable negatives (CASC vs CASIC, AMMS vs AMS), format variants, cross-language negatives. | Derived |
| **CSET PARAT** | ~7K | 702 AI companies (43 Chinese) with aliases from Georgetown CSET's Private-sector AI-Related Activity Tracker. | CC BY 4.0 |
| **OpenAlex Institutions** | ~2K | Real institution names from Chinese AI research papers matched against restricted entity lists. | CC0 |
| **Policy Pack Entities** | ~1.7K | ASPI defense entities, SOEs, BIOSECURE Act entities, SASTIND Seven Sons universities with Chinese names and aliases. | Various (see below) |
| **Defense/Threat Entities** | ~400 | PLA branches, defense agencies, Seven Sons universities with acronyms and Chinese aliases. Hand-curated confusable negatives. | Derived |
| **Section 1260H / 1286 Lists** | ~300 | Chinese military companies (1260H) and defense-linked institutions (1286) with aliases. | US Government (Public Domain) |

### Label Distribution

- **Positive (same entity):** 308,573 pairs (45%)
- **Negative (different entity):** 380,476 pairs (55%)

### Languages Covered

The training data includes entity names in English, Simplified Chinese (zh-CN), Russian (Cyrillic), and 20+ additional languages from international sanctions lists (EU covers all official EU languages).

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nobris/lt-nobris-en")

# Encode entity name pairs
pairs = [
    ("Harbin Institute of Technology", "HIT"),                        # Same entity
    ("Harbin Institute of Technology", "hit.edu.cn"),                 # Domain match
    ("Harbin Institute of Technology", "哈尔滨工业大学"),               # Chinese name
    ("Harbin Institute of Technology", "Harbin Medical University"),  # Different
    ("CASC", "CASIC"),                                                # Confusable acronyms
]

for a, b in pairs:
    emb_a = model.encode(a)
    emb_b = model.encode(b)
    sim = model.similarity([emb_a], [emb_b])[0][0].item()
    print(f"{sim:.3f}  {a} <-> {b}")
```

### Recommended Thresholds

| Use Case | Threshold | Behavior |
|:--|:--:|:--|
| High recall (don't miss matches) | 0.50 | Best F1 (81.5%); catches acronym matches |
| Balanced | 0.58 | Best accuracy (85.9%) |
| High precision (minimize false positives) | 0.70+ | 91.9% precision; fewer but more confident matches |

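One way to wire these thresholds into a pipeline is a small tiering function. The cutoffs below follow the table's suggestions and are illustrative only, not compliance guidance; any flagged tier still goes to human review:

```python
def screening_tier(similarity: float) -> str:
    """Map a cosine similarity score to a review tier using the
    recommended thresholds above (illustrative cutoffs)."""
    if similarity >= 0.70:
        return "high-confidence match"  # 91.9% precision regime
    if similarity >= 0.58:
        return "probable match"         # best-accuracy regime
    if similarity >= 0.50:
        return "possible match"         # best-F1 / high-recall regime
    return "no match"

print(screening_tier(0.85))  # high-confidence match
print(screening_tier(0.45))  # no match
```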
## Bias, Risks, and Limitations

### Known Limitations

- **Acronym recall at high thresholds is limited.** Acronym-to-name pairs (e.g., "CASC" ↔ "China Aerospace Science and Technology Corporation") often score 0.5-0.7 rather than 0.8+. Use threshold 0.5-0.6 for acronym-heavy screening.
- **Domain matching is a new capability.** The model can associate "hit.edu.cn" with "Harbin Institute of Technology", but coverage is limited to the ~109K organizations in ROR that have website links.
- **Person names** are excluded from training. The model is not suitable for individual name matching.
- **Temporal drift.** Sanctions lists and entity relationships change over time. The model reflects training data as of March 2026.

### Bias Considerations

- The training data is heavily weighted toward Chinese and Russian entities due to the focus on US export control and sanctions screening. Performance on entities from other regions (e.g., Middle East, Africa) may be lower.
- The model inherits any biases present in the underlying sanctions lists and entity databases.
- False positives on legitimate Chinese academic institutions are a known risk. The model should not be used as the sole basis for restricting research collaborations.

### Ethical Considerations

This model is intended to assist compliance professionals in screening research proposals against restricted party lists. It is **not** a decision-making system. All flagged matches should be reviewed by qualified personnel who can consider context, intent, and applicable regulations.

Research security screening affects international academic collaboration. Overly aggressive screening can harm legitimate scientific exchange. Users should calibrate thresholds to minimize both missed matches (compliance risk) and false positives (academic freedom risk).

## Training Procedure

### Hyperparameters

| Parameter | Value |
|:--|:--|
| Epochs | 3 |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Optimizer | AdamW (fused) |
| Loss | MultipleNegativesRankingLoss |
| Precision | FP16 (mixed) |
| Evaluation Steps | 500 |
| Training Time | 170 minutes (NVIDIA GPU, 16GB VRAM) |

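MultipleNegativesRankingLoss treats each (anchor, positive) pair's positive as the target and every other positive in the batch as an in-batch negative, applying cross-entropy over scaled cosine similarities. A self-contained numpy sketch of that objective; the actual sentence-transformers implementation differs in details, though a scale of 20 matches its usual default:

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives cross-entropy: row i of the similarity matrix
    should put its probability mass on column i (anchor i's true positive)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                           # (batch, batch) cosine sims
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())             # targets lie on the diagonal

# Perfectly aligned pairs give a near-zero loss; mismatched pairs are penalized.
pairs = np.eye(4)
print(mnrl_loss(pairs, pairs))                       # close to 0
print(mnrl_loss(pairs, np.roll(pairs, 1, axis=0)))   # much larger
```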
### Framework Versions

- Python: 3.14
- Sentence Transformers: 5.3.0
- Transformers: 5.3.0
- PyTorch: 2.12.0+cu128

## Licensing and Attribution

### Model License

This model is released under the **MIT License**.

### Base Model

Fine-tuned from [dell-research-harvard/lt-wikidata-comp-en](https://huggingface.co/dell-research-harvard/lt-wikidata-comp-en) (LinkTransformer), itself fine-tuned from [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) (Apache 2.0).

### Training Data Licenses

| Data Source | License | Commercial Use |
|:--|:--|:--|
| ROR | CC0 (Public Domain) | Yes |
| OpenAlex | CC0 (Public Domain) | Yes |
| US CSL / 1260H / 1286 | US Government (Public Domain) | Yes |
| EU / UK / AU Sanctions Lists | Government (Public Domain) | Yes |
| CSET PARAT | CC BY 4.0 | Yes (with attribution) |
| OpenSanctions Pairs | **CC BY-NC 4.0** | **Non-commercial only** (commercial license available from opensanctions.org) |
| ASPI / Policy Pack | Research/reporting use | Verify with source |

**Important:** The OpenSanctions training data is licensed CC BY-NC 4.0. If you intend to use this model commercially, you should either (a) obtain a commercial license from [OpenSanctions](https://www.opensanctions.org/licensing/), or (b) retrain without the OpenSanctions data.

## Citation

### This Model

```bibtex
@misc{nobris2026ltnobris,
  title={lt-nobris-en: Entity Resolution for Research Security Screening},
  author={Nobris},
  year={2026},
  url={https://huggingface.co/nobris/lt-nobris-en}
}
```

### LinkTransformer (Base Model)

```bibtex
@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Sentence-Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084}
}
```

### MultipleNegativesRankingLoss

```bibtex
@misc{oord2019representation,
  title={Representation Learning with Contrastive Predictive Coding},
  author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
  year={2019},
  eprint={1807.03748},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```

## Model Card Authors

Nobris Research Security Team

## Contact

For questions about this model, contact: info@nobris.dev
config.json ADDED
{
  "architectures": [
    "MPNetModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": true,
  "transformers_version": "5.3.0",
  "vocab_size": 30527
}
config_sentence_transformers.json ADDED
{
  "__version__": {
    "sentence_transformers": "5.3.0",
    "transformers": "5.3.0",
    "pytorch": "2.12.0.dev20260314+cu128"
  },
  "model_type": "SentenceTransformer",
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5d07e77995c23061d50dd4ad31437d6d3b870c858f4b59beb3f23d0a306da19c
size 437967648
modules.json ADDED
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
{
  "max_seq_length": 512,
  "do_lower_case": false
}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
{
  "backend": "tokenizers",
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "is_local": true,
  "mask_token": "<mask>",
  "max_length": 250,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "MPNetTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}