w-nobris committed on
Commit e5d9e23 · verified · 1 Parent(s): 694e131

Add new SentenceTransformer model
1_Pooling/config.json ADDED
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
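The config above enables only `pooling_mode_cls_token`: the sentence embedding is the transformer's first-token ([CLS]) vector, with all other pooling modes disabled. A minimal numpy sketch of that step, using toy shapes (the real model emits 768-dimensional hidden states, per `word_embedding_dimension`):

```python
import numpy as np

# Toy transformer output: batch of 2 sequences, 4 tokens each, hidden size 3
# (illustrative stand-in for the model's real hidden size of 768).
token_embeddings = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """CLS pooling: keep only the first token's vector for each sequence,
    matching pooling_mode_cls_token=true with every other mode false."""
    return token_embeddings[:, 0, :]

pooled = cls_pool(token_embeddings)
print(pooled.shape)  # (2, 3): one fixed-size vector per input sequence
```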
README.md ADDED
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- entity-resolution
- research-security
- export-control
- sanctions-screening
license: mit
language:
- en
- zh
- ru
base_model: dell-research-harvard/lt-wikidata-comp-en
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
- cosine_f1
- cosine_precision
- cosine_recall
- cosine_ap
- cosine_mcc
model-index:
- name: lt-nobris-en
  results:
  - task:
      type: binary-classification
      name: Entity Resolution
    dataset:
      name: nobris-val
      type: nobris-val
    metrics:
    - type: cosine_accuracy
      value: 0.859
      name: Cosine Accuracy
    - type: cosine_f1
      value: 0.815
      name: Cosine F1
    - type: cosine_precision
      value: 0.775
      name: Cosine Precision
    - type: cosine_recall
      value: 0.860
      name: Cosine Recall
    - type: cosine_ap
      value: 0.877
      name: Average Precision
    - type: cosine_mcc
      value: 0.679
      name: Matthews Correlation Coefficient
---

# lt-nobris-en

A sentence-transformer model fine-tuned for **entity resolution in research security screening**. Given two entity names, the model produces embeddings whose cosine similarity indicates whether they refer to the same organization.

## Quickstart

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nobris/lt-nobris-en")
emb1 = model.encode("Harbin Institute of Technology")
emb2 = model.encode("HIT")
similarity = util.cos_sim(emb1, emb2)  # ~0.85
```

## Intended Use

This model is designed for matching entity names against restricted party lists in the context of research security and export control compliance. Primary use cases include:

- Screening research proposal affiliations against the US Consolidated Screening List (CSL), Section 1260H, Section 1286, and BIOSECURE Act entities
- Matching organization name variants across languages (English, Chinese, Russian)
- Resolving acronyms, aliases, subsidiaries, and transliterations to canonical entity names
- Matching institutional website domains (e.g., "hit.edu.cn") to organization names

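In a screening pipeline, the restricted list is typically encoded once and each candidate affiliation is compared against the cached embeddings. A sketch of that comparison step; `screen` is an illustrative helper (not part of this model or the sentence-transformers API), and the toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode(...)`:

```python
import numpy as np

def screen(query_emb, list_embs, names, threshold=0.5):
    """Rank restricted-list entries by cosine similarity to the query
    embedding; return (name, score) pairs at or above the threshold,
    best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    m = list_embs / np.linalg.norm(list_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per list entry
    order = np.argsort(-sims)         # best-first
    return [(names[i], float(sims[i])) for i in order if sims[i] >= threshold]

# Illustrative embeddings only; in practice these come from model.encode(...).
names = ["Harbin Institute of Technology", "Harbin Medical University"]
list_embs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
query_emb = np.array([0.9, 0.2, 0.0])  # pretend encoding of "HIT"
hits = screen(query_emb, list_embs, names)
print(hits)  # only the first entry clears the 0.5 threshold
```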
### Out-of-Scope Use

- **Not a compliance decision system.** This model produces similarity scores, not legal determinations. All matches should be reviewed by qualified compliance personnel.
- **Not designed for individual/person name matching.** The model is trained on organizational entity names.
- **Not a general-purpose semantic similarity model.** Performance on tasks outside entity resolution (e.g., sentence similarity, paraphrase detection) is not validated.

## Model Details

| Property | Value |
|:--|:--|
| **Architecture** | MPNet (12 layers, 12 heads, 768 hidden) |
| **Base Model** | [dell-research-harvard/lt-wikidata-comp-en](https://huggingface.co/dell-research-harvard/lt-wikidata-comp-en) |
| **Max Sequence Length** | 512 tokens |
| **Output Dimensions** | 768 |
| **Similarity Function** | Cosine Similarity |
| **Loss Function** | MultipleNegativesRankingLoss (MNRL) |
| **Pooling** | CLS token |
| **Training Precision** | FP16 (mixed precision) |

## Performance

### Validation Set Metrics

Evaluated on a held-out validation set of 259,052 entity pairs (96,168 positive, 162,884 negative):

| Threshold | Accuracy | Precision | Recall | F1 |
|:---------:|:--------:|:---------:|:------:|:--:|
| 0.5 | 85.5% | 77.5% | 86.0% | **81.5%** |
| **0.6** | **85.9%** | 85.9% | 74.2% | 79.6% |
| 0.7 | 82.2% | 91.9% | 57.2% | 70.5% |
| 0.8 | 75.6% | 95.4% | 36.1% | 52.3% |

**Average Precision (AP): 0.877** | **Best Accuracy Threshold: 0.581** | **Best F1 Threshold: 0.541**

### Acronym Discrimination

Evaluated on a 22,146-pair acronym-focused subset (an acronym on at least one side):

| Category | Accuracy | Description |
|:---------|:--------:|:------------|
| Cross-language acronym negatives | **99.8%** | English acronym vs wrong Chinese name (e.g., CASC vs 中国航天科工集团) |
| Acronym format variants | **93.7%** | "CASC" matches "C.A.S.C.", "casc", "the CASC" |
| Confusable acronym negatives | **90.0%** | CASC ≠ CASIC, AMMS ≠ AMS, HIT ≠ HEU |
| Defense entity negatives | **100%** | Curated confusable defense entity pairs |

### Training Progression

| Epoch | Training Loss | Val AP |
|:--:|:--:|:--:|
| 1.0 | 0.330 | 0.862 |
| 2.0 | 0.175 | 0.877 |
| 3.0 | 0.165 | **0.877** |

## Training Data

The model was fine-tuned on 689,049 training pairs from 12 curated data sources covering research security screening scenarios. All positive pairs represent confirmed same-entity matches; all negative pairs represent confirmed different entities.

### Data Sources

| Source | Pairs | Description | License |
|:--|--:|:--|:--|
| **OpenSanctions Pairs** | ~401K | Analyst-judged entity matching pairs from 293 sanctions data sources. Organization/company pairs only. | CC BY-NC 4.0 |
| **ROR (Research Organization Registry)** | ~106K | Aliases, acronyms, and foreign-language labels for 111K research organizations worldwide. | CC0 (Public Domain) |
| **US Consolidated Screening List** | ~90K | Entity List, SDN, CMIC, and other US export control lists. Name-alias pairs and cross-entity negatives. | US Government (Public Domain) |
| **Hard Negatives** | ~53K | Curated confusable pairs and random ROR negatives. | Derived |
| **ROR Website Domains** | ~53K | Institutional domains (e.g., "hit.edu.cn") paired with org names. Prioritized CN/RU domains. | CC0 (Derived from ROR) |
| **International Sanctions** | ~45K | EU Financial Sanctions, UK Sanctions List, Australia DFAT. Multilingual aliases across 20+ languages. | Public (EU/UK/AU Government) |
| **Acronym Pairs** | ~16K | Acronym-to-acronym positives, confusable negatives (CASC vs CASIC, AMMS vs AMS), format variants, cross-language negatives. | Derived |
| **CSET PARAT** | ~7K | 702 AI companies (43 Chinese) with aliases from Georgetown CSET's Private-sector AI-Related Activity Tracker. | CC BY 4.0 |
| **OpenAlex Institutions** | ~2K | Real institution names from Chinese AI research papers matched against restricted entity lists. | CC0 |
| **Policy Pack Entities** | ~1.7K | ASPI defense entities, SOEs, BIOSECURE Act entities, SASTIND Seven Sons universities with Chinese names and aliases. | Various (see below) |
| **Defense/Threat Entities** | ~400 | PLA branches, defense agencies, Seven Sons universities with acronyms and Chinese aliases. Hand-curated confusable negatives. | Derived |
| **Section 1260H / 1286 Lists** | ~300 | Chinese military companies (1260H) and defense-linked institutions (1286) with aliases. | US Government (Public Domain) |

### Label Distribution

- **Positive (same entity):** 308,573 pairs (45%)
- **Negative (different entity):** 380,476 pairs (55%)

### Languages Covered

The training data includes entity names in English, Simplified Chinese (zh-CN), Russian (Cyrillic), and 20+ additional languages from international sanctions lists (EU covers all official EU languages).

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nobris/lt-nobris-en")

# Encode entity name pairs
pairs = [
    ("Harbin Institute of Technology", "HIT"),                        # Same entity
    ("Harbin Institute of Technology", "hit.edu.cn"),                 # Domain match
    ("Harbin Institute of Technology", "哈尔滨工业大学"),               # Chinese name
    ("Harbin Institute of Technology", "Harbin Medical University"),  # Different
    ("CASC", "CASIC"),                                                # Confusable acronyms
]

for a, b in pairs:
    emb_a = model.encode(a)
    emb_b = model.encode(b)
    sim = model.similarity([emb_a], [emb_b])[0][0].item()
    print(f"{sim:.3f}  {a} <-> {b}")
```

### Recommended Thresholds

| Use Case | Threshold | Behavior |
|:--|:--:|:--|
| High recall (don't miss matches) | 0.50 | Best F1 (81.5%); catches acronym matches |
| Balanced | 0.58 | Best accuracy (85.9%) |
| High precision (minimize false positives) | 0.70+ | 91.9% precision; fewer but more confident matches |

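One way to wire these thresholds into a pipeline is a small tiering function. The cutoffs below follow the table's suggestions and are illustrative only, not compliance guidance; any flagged tier still goes to human review:

```python
def screening_tier(similarity: float) -> str:
    """Map a cosine similarity score to a review tier using the
    recommended thresholds above (illustrative cutoffs)."""
    if similarity >= 0.70:
        return "high-confidence match"  # 91.9% precision regime
    if similarity >= 0.58:
        return "probable match"         # best-accuracy regime
    if similarity >= 0.50:
        return "possible match"         # best-F1 / high-recall regime
    return "no match"

print(screening_tier(0.85))  # high-confidence match
print(screening_tier(0.45))  # no match
```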
## Bias, Risks, and Limitations

### Known Limitations

- **Acronym recall at high thresholds is limited.** Acronym-to-name pairs (e.g., "CASC" ↔ "China Aerospace Science and Technology Corporation") often score 0.5-0.7 rather than 0.8+. Use threshold 0.5-0.6 for acronym-heavy screening.
- **Domain matching is a new capability.** The model can associate "hit.edu.cn" with "Harbin Institute of Technology", but coverage is limited to the ~109K organizations in ROR that have website links.
- **Person names** are excluded from training. The model is not suitable for individual name matching.
- **Temporal drift.** Sanctions lists and entity relationships change over time. The model reflects training data as of March 2026.

### Bias Considerations

- The training data is heavily weighted toward Chinese and Russian entities due to the focus on US export control and sanctions screening. Performance on entities from other regions (e.g., Middle East, Africa) may be lower.
- The model inherits any biases present in the underlying sanctions lists and entity databases.
- False positives on legitimate Chinese academic institutions are a known risk. The model should not be used as the sole basis for restricting research collaborations.

### Ethical Considerations

This model is intended to assist compliance professionals in screening research proposals against restricted party lists. It is **not** a decision-making system. All flagged matches should be reviewed by qualified personnel who can consider context, intent, and applicable regulations.

Research security screening affects international academic collaboration. Overly aggressive screening can harm legitimate scientific exchange. Users should calibrate thresholds to minimize both missed matches (compliance risk) and false positives (academic freedom risk).

## Training Procedure

### Hyperparameters

| Parameter | Value |
|:--|:--|
| Epochs | 3 |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Warmup Steps | 100 |
| Optimizer | AdamW (fused) |
| Loss | MultipleNegativesRankingLoss |
| Precision | FP16 (mixed) |
| Evaluation Steps | 500 |
| Training Time | 170 minutes (NVIDIA GPU, 16GB VRAM) |

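MultipleNegativesRankingLoss treats each (anchor, positive) pair's positive as the target and every other positive in the batch as an in-batch negative, applying cross-entropy over scaled cosine similarities. A self-contained numpy sketch of that objective; the actual sentence-transformers implementation differs in details, though a scale of 20 matches its usual default:

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives cross-entropy: row i of the similarity matrix
    should put its probability mass on column i (anchor i's true positive)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                           # (batch, batch) cosine sims
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())             # targets lie on the diagonal

# Perfectly aligned pairs give a near-zero loss; mismatched pairs are penalized.
pairs = np.eye(4)
print(mnrl_loss(pairs, pairs))                       # close to 0
print(mnrl_loss(pairs, np.roll(pairs, 1, axis=0)))   # much larger
```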
### Framework Versions

- Python: 3.14
- Sentence Transformers: 5.3.0
- Transformers: 5.3.0
- PyTorch: 2.12.0+cu128

## Licensing and Attribution

### Model License

This model is released under the **MIT License**.

### Base Model

Fine-tuned from [dell-research-harvard/lt-wikidata-comp-en](https://huggingface.co/dell-research-harvard/lt-wikidata-comp-en) (LinkTransformer), itself fine-tuned from [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) (Apache 2.0).

### Training Data Licenses

| Data Source | License | Commercial Use |
|:--|:--|:--|
| ROR | CC0 (Public Domain) | Yes |
| OpenAlex | CC0 (Public Domain) | Yes |
| US CSL / 1260H / 1286 | US Government (Public Domain) | Yes |
| EU / UK / AU Sanctions Lists | Government (Public Domain) | Yes |
| CSET PARAT | CC BY 4.0 | Yes (with attribution) |
| OpenSanctions Pairs | **CC BY-NC 4.0** | **Non-commercial only** (commercial license available from opensanctions.org) |
| ASPI / Policy Pack | Research/reporting use | Verify with source |

**Important:** The OpenSanctions training data is licensed CC BY-NC 4.0. If you intend to use this model commercially, you should either (a) obtain a commercial license from [OpenSanctions](https://www.opensanctions.org/licensing/), or (b) retrain without the OpenSanctions data.

## Citation

### This Model

```bibtex
@misc{nobris2026ltnobris,
  title={lt-nobris-en: Entity Resolution for Research Security Screening},
  author={Nobris},
  year={2026},
  url={https://huggingface.co/nobris/lt-nobris-en}
}
```

### LinkTransformer (Base Model)

```bibtex
@misc{arora2023linktransformer,
  title={LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author={Abhishek Arora and Melissa Dell},
  year={2023},
  eprint={2309.00789},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Sentence-Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={https://arxiv.org/abs/1908.10084}
}
```

### MultipleNegativesRankingLoss

```bibtex
@misc{oord2019representation,
  title={Representation Learning with Contrastive Predictive Coding},
  author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
  year={2019},
  eprint={1807.03748},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```

## Model Card Authors

Nobris Research Security Team

## Contact

For questions about this model, contact: info@nobris.dev
config.json ADDED
{
  "architectures": [
    "MPNetModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": true,
  "transformers_version": "5.3.0",
  "vocab_size": 30527
}
config_sentence_transformers.json ADDED
{
  "__version__": {
    "sentence_transformers": "5.3.0",
    "transformers": "5.3.0",
    "pytorch": "2.12.0.dev20260314+cu128"
  },
  "model_type": "SentenceTransformer",
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5d07e77995c23061d50dd4ad31437d6d3b870c858f4b59beb3f23d0a306da19c
size 437967648
modules.json ADDED
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
{
  "max_seq_length": 512,
  "do_lower_case": false
}
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
{
  "backend": "tokenizers",
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "is_local": true,
  "mask_token": "<mask>",
  "max_length": 250,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "MPNetTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}