emanuelaboros committed on
Commit
a350034
·
1 Parent(s): 5546a59

version old

Files changed (15)
  1. README.md +222 -3
  2. __init__.py +0 -0
  3. config.json +233 -0
  4. configuration_stacked.py +99 -0
  5. generic_ner.py +778 -0
  6. label_map.json +1 -0
  7. model.safetensors +3 -0
  8. modeling_stacked.py +136 -0
  9. special_tokens_map.json +37 -0
  10. test.py +46 -0
  11. tokenizer.json +0 -0
  12. tokenizer_config.json +58 -0
  13. vocab.txt +0 -0
  14. x +0 -0
  15. y +0 -0
README.md CHANGED
@@ -1,3 +1,222 @@
1
- ---
2
- license: agpl-3.0
3
- ---
1
+ ---
2
+ library_name: transformers
3
+ language:
4
+ - en
5
+ - fr
6
+ - de
7
+ tags:
8
+ - v1.0.0
9
+ ---
10
+
11
+ # Model Card for `impresso-project/ner-stacked-bert-multilingual`
12
+
13
+ The **Impresso NER model** is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and is designed to identify fine-grained and coarse-grained entity types in digitized historical texts, including names, titles, and locations.
14
+
15
+ ## Model Details
16
+
17
+ ### Model Description
18
+
19
+ - **Developed by:** The [Impresso team](https://impresso-project.ch) at EPFL. Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
20
+ - **Model type:** Stacked BERT-based token classification for named entity recognition
21
+ - **Languages:** French, German, English (with support for multilingual historical texts)
22
+ - **License:** [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
23
+ - **Finetuned from:** [`dbmdz/bert-medium-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)
24
+
25
+
26
+ ### Model Architecture
27
+
28
+ The model architecture consists of the following components:
29
+ - A **pre-trained BERT encoder** (multilingual historic BERT) as the base.
30
+ - **Additional Transformer encoder layers** (two in this model) stacked on top of the BERT encoder.
31
+ - A **Conditional Random Field (CRF)** decoder layer to model label dependencies.
32
+ - **Learned absolute positional embeddings** for improved handling of noisy inputs.
33
+
34
+ These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
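+
+ As a rough illustration, the stacked head corresponds to the following sketch (a simplified view of the `modeling_stacked.py` file shipped in this repository; the class and argument names are illustrative only, and the CRF decoding step is omitted):
+
+ ```python
+ import torch.nn as nn
+ from transformers import AutoModel
+
+ class StackedTokenClassifier(nn.Module):
+     """Sketch: historic BERT encoder + two extra Transformer layers + one linear head per task."""
+
+     def __init__(self, base_model_name, hidden_size, num_heads, num_labels_per_task):
+         super().__init__()
+         self.bert = AutoModel.from_pretrained(base_model_name)
+         self.dropout = nn.Dropout(0.1)
+         self.stack = nn.TransformerEncoder(
+             nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads),
+             num_layers=2,
+         )
+         # One classifier per annotation level (e.g. NE-COARSE-LIT, NE-FINE-LIT, ...)
+         self.heads = nn.ModuleDict(
+             {task: nn.Linear(hidden_size, n) for task, n in num_labels_per_task.items()}
+         )
+
+     def forward(self, input_ids, attention_mask):
+         hidden = self.bert(input_ids, attention_mask=attention_mask)[0]
+         hidden = self.stack(self.dropout(hidden).transpose(0, 1)).transpose(0, 1)
+         return {task: head(hidden) for task, head in self.heads.items()}
+ ```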
35
+
36
+ ### Entity Types Supported
37
+
38
+ The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/2022 guidelines. The output format of the model includes structured predictions with contextual and semantic details. Each prediction is a dictionary with the following fields:
39
+
40
+ ```python
41
+ {
42
+ 'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
43
+ 'confidence_ner': float, # Confidence score
44
+ 'surface': str, # Surface form in text
45
+ 'lOffset': int, # Start character offset
46
+ 'rOffset': int, # End character offset
47
+ 'name': str, # Optional: full name (for persons)
48
+ 'title': str, # Optional: title (for persons)
49
+ 'function': str # Optional: function (if detected)
50
+ }
51
+ ```
52
+
53
+
54
+ #### Coarse-Grained Entity Types:
55
+ - **pers**: Person entities (individuals, collectives, authors)
56
+ - **org**: Organizations (administrative, enterprise, press agencies)
57
+ - **prod**: Products (media)
58
+ - **time**: Time expressions (absolute dates)
59
+ - **loc**: Locations (towns, regions, countries, physical, facilities)
60
+
61
+ When the surrounding text provides them, the model also returns **person-specific attributes** (see the example after this list):
62
+ - `name`: canonical full name
63
+ - `title`: honorific or title (e.g., "king", "chancellor")
64
+ - `function`: role or function in context (if available)
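+
+ For instance, person mentions and their attributes can be inspected as follows (an illustrative sketch; `entities` is the output of the pipeline call shown in *How to Get Started with the Model* below):
+
+ ```python
+ for ent in entities:
+     if ent["type"] == "pers":
+         # `name`, `title` and `function` are only present when detected in the surrounding text
+         print(ent["surface"], "|", ent.get("name"), "|", ent.get("title"), "|", ent.get("function"))
+ ```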
65
+
66
+ ### Model Sources
67
+
68
+ - **Repository:** https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
69
+ - **Paper:** [Alleviating Digitization Errors in Named Entity Recognition for Historical Documents (CoNLL 2020)](https://aclanthology.org/2020.conll-1.35/)
70
+ - **Demo:** [Impresso project](https://impresso-project.ch)
71
+
72
+ ## Uses
73
+
74
+ ### Direct Use
75
+
76
+ The model is intended to be used directly with the Hugging Face `pipeline` API via the custom `generic-ner` task (loaded with `trust_remote_code=True`) for token classification on historical texts.
77
+
78
+ ### Downstream Use
79
+
80
+ Can be used for downstream tasks such as:
81
+ - Historical information extraction
82
+ - Biographical reconstruction
83
+ - Place and person mention detection across historical archives (a minimal sketch follows this list)
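+
+ A minimal sketch of the last use case (illustrative only; `archive` is a hypothetical mapping from document IDs to plain text, and `ner_pipeline` is created as in *How to Get Started with the Model* below):
+
+ ```python
+ mentions = []
+ for doc_id, text in archive.items():
+     for ent in ner_pipeline(text):
+         if ent["type"] in ("pers", "loc"):
+             mentions.append((doc_id, ent["type"], ent["surface"], ent["lOffset"], ent["rOffset"]))
+ ```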
84
+
85
+ ### Out-of-Scope Use
86
+
87
+ - Not suitable for contemporary named entity recognition in domains such as social media or modern news.
88
+ - Not optimized for OCR-free modern corpora.
89
+
90
+ ## Bias, Risks, and Limitations
91
+
92
+ Due to training on historical documents, the model may reflect historical biases and inaccuracies. It may underperform on contemporary or non-European languages.
93
+
94
+ ### Recommendations
95
+
96
+ - Users should be cautious of historical and typographical biases.
97
+ - Consider post-processing to filter false positives from OCR noise, e.g. by thresholding `confidence_ner` as sketched below.
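+
+ A minimal confidence filter could look like this (an illustrative sketch; the threshold is an assumption to be tuned on your own data, not a recommended value):
+
+ ```python
+ def filter_entities(entities, threshold=75.0):
+     # `confidence_ner` is expressed in percent, as in the example output below
+     return [ent for ent in entities if ent["confidence_ner"] >= threshold]
+ ```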
98
+
99
+ ## How to Get Started with the Model
100
+
101
+ ```python
102
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
103
+
104
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
105
+
106
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
107
+
108
+ ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True, device='cpu')
109
+
110
+ sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
111
+ entities = ner_pipeline(sentence)
112
+ print(entities)
113
+ ```
114
+ #### Example Output
115
+
116
+ ```python
117
+ [
118
+ {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
119
+ {'type': 'loc', 'confidence_ner': 90.75, 'surface': "Europe", 'lOffset': 69, 'rOffset': 75},
120
+ {'type': 'loc', 'confidence_ner': 75.45, 'surface': "Royaume de France", 'lOffset': 80, 'rOffset': 97},
121
+ {'type': 'pers', 'confidence_ner': 85.27, 'surface': "roi Philippe VI", 'lOffset': 181, 'rOffset': 196, 'title': "roi", 'name': "roi Philippe VI"},
122
+ {'type': 'loc', 'confidence_ner': 30.59, 'surface': "Louvre", 'lOffset': 210, 'rOffset': 216},
123
+ {'type': 'loc', 'confidence_ner': 94.46, 'surface': "Paris", 'lOffset': 266, 'rOffset': 271},
124
+ {'type': 'pers', 'confidence_ner': 96.1, 'surface': "chancelier Guillaume de Nogaret", 'lOffset': 350, 'rOffset': 381, 'title': "chancelier", 'name': "Guillaume de Nogaret"},
125
+ {'type': 'loc', 'confidence_ner': 49.35, 'surface': "Royaume", 'lOffset': 80, 'rOffset': 87},
126
+ {'type': 'loc', 'confidence_ner': 24.18, 'surface': "France", 'lOffset': 91, 'rOffset': 97}
127
+ ]
128
+ ```
129
+
130
+ ## Training Details
131
+
132
+ ### Training Data
133
+
134
+ The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated OCR-transcribed historical newspaper content.
135
+
136
+ ### Training Procedure
137
+
138
+ #### Preprocessing
139
+
140
+ OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
141
+
142
+ #### Training Hyperparameters
143
+
144
+ - **Training regime:** Mixed precision (fp16)
145
+ - **Epochs:** 5
146
+ - **Max sequence length:** 512
147
+ - **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
148
+ - **Stacked Transformer layers:** 2
149
+
150
+ #### Speeds, Sizes, Times
151
+
152
+ - **Model size:** ~500MB
153
+ - **Training time:** ~1h on 1 GPU (NVIDIA TITAN X)
154
+
155
+ ## Evaluation
156
+
157
+ #### Testing Data
158
+
159
+ Held-out portion of HIPE-2020 (French, German)
160
+
161
+ #### Metrics
162
+
163
+ - F1-score (micro, macro)
164
+ - Entity-level precision/recall
165
+
166
+ ### Results
167
+
168
+ | Language | Precision | Recall | F1-score |
169
+ |----------|-----------|--------|----------|
170
+ | French | 84.2 | 81.6 | 82.9 |
171
+ | German | 82.0 | 78.7 | 80.3 |
172
+
173
+ #### Summary
174
+
175
+ The model performs robustly on noisy, OCR-derived historical text and supports fine-grained entity typologies.
176
+
177
+ ## Environmental Impact
178
+
179
+ - **Hardware Type:** NVIDIA TITAN X (Pascal, 12GB)
180
+ - **Hours used:** ~1 hour
181
+ - **Cloud Provider:** EPFL, Switzerland
182
+ - **Carbon Emitted:** ~0.022 kg CO₂eq (estimated)
183
+
184
+ ## Technical Specifications
185
+
186
+ ### Model Architecture and Objective
187
+
188
+ Stacked BERT architecture with a multitask token classification objective: one head per HIPE annotation level (coarse and fine, literal and metonymic, nested, and name components).
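+
+ In the shipped `modeling_stacked.py`, each annotation level has its own linear head and the multitask training loss is the sum of the per-task cross-entropy losses, conceptually:
+
+ ```python
+ import torch.nn.functional as F
+
+ def multitask_loss(task_logits, task_labels):
+     # One cross-entropy term per annotation level, summed (sketch of the shipped implementation)
+     return sum(
+         F.cross_entropy(logits.view(-1, logits.size(-1)), task_labels[task].view(-1))
+         for task, logits in task_logits.items()
+         if task in task_labels
+     )
+ ```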
189
+
190
+ ### Compute Infrastructure
191
+
192
+ #### Hardware
193
+
194
+ 1x NVIDIA TITAN X (Pascal, 12GB)
195
+
196
+ #### Software
197
+
198
+ - Python 3.11
199
+ - PyTorch 2.0
200
+ - Transformers 4.36
201
+
202
+ ## Citation
203
+
204
+ **BibTeX:**
205
+
206
+ ```bibtex
207
+ @inproceedings{boros2020alleviating,
208
+ title={Alleviating digitization errors in named entity recognition for historical documents},
209
+ author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
210
+ booktitle={Proceedings of the 24th conference on computational natural language learning},
211
+ pages={431--441},
212
+ year={2020}
213
+ }
214
+ ```
215
+
216
+ ## Contact
217
+
218
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
219
+
220
+ <p align="center">
221
+ <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
222
+ </p>
__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,233 @@
1
+ {
2
+ "_name_or_path": "experiments_final/model_dbmdz_bert_medium_historic_multilingual_cased_max_sequence_length_512_epochs_5_run_extended_suffix_baseline/checkpoint-450",
3
+ "architectures": [
4
+ "ExtendedMultitaskModelForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_stacked.ImpressoConfig",
9
+ "AutoModelForTokenClassification": "modeling_stacked.ExtendedMultitaskModelForTokenClassification"
10
+ },
11
+ "classifier_dropout": null,
12
+ "custom_pipelines": {
13
+ "generic-ner": {
14
+ "impl": "generic_ner.MultitaskTokenClassificationPipeline",
15
+ "pt": "AutoModelForTokenClassification"
16
+ }
17
+ },
18
+ "hidden_act": "gelu",
19
+ "hidden_dropout_prob": 0.1,
20
+ "hidden_size": 512,
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 2048,
23
+ "label_map": {
24
+ "NE-COARSE-LIT": {
25
+ "B-loc": 8,
26
+ "B-org": 0,
27
+ "B-pers": 7,
28
+ "B-prod": 4,
29
+ "B-time": 5,
30
+ "I-loc": 1,
31
+ "I-org": 2,
32
+ "I-pers": 9,
33
+ "I-prod": 10,
34
+ "I-time": 6,
35
+ "O": 3
36
+ },
37
+ "NE-COARSE-METO": {
38
+ "B-loc": 3,
39
+ "B-org": 0,
40
+ "B-time": 5,
41
+ "I-loc": 4,
42
+ "I-org": 2,
43
+ "O": 1
44
+ },
45
+ "NE-FINE-COMP": {
46
+ "B-comp.demonym": 8,
47
+ "B-comp.function": 5,
48
+ "B-comp.name": 1,
49
+ "B-comp.qualifier": 9,
50
+ "B-comp.title": 2,
51
+ "I-comp.demonym": 7,
52
+ "I-comp.function": 3,
53
+ "I-comp.name": 0,
54
+ "I-comp.qualifier": 10,
55
+ "I-comp.title": 4,
56
+ "O": 6
57
+ },
58
+ "NE-FINE-LIT": {
59
+ "B-loc.add.elec": 32,
60
+ "B-loc.add.phys": 5,
61
+ "B-loc.adm.nat": 34,
62
+ "B-loc.adm.reg": 39,
63
+ "B-loc.adm.sup": 12,
64
+ "B-loc.adm.town": 33,
65
+ "B-loc.fac": 36,
66
+ "B-loc.oro": 19,
67
+ "B-loc.phys.geo": 13,
68
+ "B-loc.phys.hydro": 28,
69
+ "B-loc.unk": 4,
70
+ "B-org.adm": 3,
71
+ "B-org.ent": 24,
72
+ "B-org.ent.pressagency": 37,
73
+ "B-pers.coll": 9,
74
+ "B-pers.ind": 0,
75
+ "B-pers.ind.articleauthor": 20,
76
+ "B-prod.doctr": 2,
77
+ "B-prod.media": 10,
78
+ "B-time.date.abs": 23,
79
+ "I-loc.add.elec": 22,
80
+ "I-loc.add.phys": 6,
81
+ "I-loc.adm.nat": 11,
82
+ "I-loc.adm.reg": 35,
83
+ "I-loc.adm.sup": 15,
84
+ "I-loc.adm.town": 8,
85
+ "I-loc.fac": 27,
86
+ "I-loc.oro": 21,
87
+ "I-loc.phys.geo": 25,
88
+ "I-loc.phys.hydro": 17,
89
+ "I-loc.unk": 40,
90
+ "I-org.adm": 29,
91
+ "I-org.ent": 1,
92
+ "I-org.ent.pressagency": 14,
93
+ "I-pers.coll": 26,
94
+ "I-pers.ind": 16,
95
+ "I-pers.ind.articleauthor": 31,
96
+ "I-prod.doctr": 30,
97
+ "I-prod.media": 38,
98
+ "I-time.date.abs": 7,
99
+ "O": 18
100
+ },
101
+ "NE-FINE-METO": {
102
+ "B-loc.adm.town": 6,
103
+ "B-loc.fac": 3,
104
+ "B-loc.oro": 5,
105
+ "B-org.adm": 1,
106
+ "B-org.ent": 7,
107
+ "B-time.date.abs": 9,
108
+ "I-loc.fac": 8,
109
+ "I-org.adm": 2,
110
+ "I-org.ent": 0,
111
+ "O": 4
112
+ },
113
+ "NE-NESTED": {
114
+ "B-loc.adm.nat": 13,
115
+ "B-loc.adm.reg": 15,
116
+ "B-loc.adm.sup": 10,
117
+ "B-loc.adm.town": 9,
118
+ "B-loc.fac": 18,
119
+ "B-loc.oro": 17,
120
+ "B-loc.phys.geo": 11,
121
+ "B-loc.phys.hydro": 1,
122
+ "B-org.adm": 4,
123
+ "B-org.ent": 20,
124
+ "B-pers.coll": 7,
125
+ "B-pers.ind": 2,
126
+ "B-prod.media": 23,
127
+ "I-loc.adm.nat": 8,
128
+ "I-loc.adm.reg": 14,
129
+ "I-loc.adm.town": 6,
130
+ "I-loc.fac": 0,
131
+ "I-loc.oro": 19,
132
+ "I-loc.phys.geo": 21,
133
+ "I-loc.phys.hydro": 22,
134
+ "I-org.adm": 5,
135
+ "I-org.ent": 3,
136
+ "I-pers.ind": 12,
137
+ "I-prod.media": 24,
138
+ "O": 16
139
+ }
140
+ },
141
+ "layer_norm_eps": 1e-12,
142
+ "max_position_embeddings": 512,
143
+ "model_type": "stacked_bert",
144
+ "num_attention_heads": 8,
145
+ "num_hidden_layers": 8,
146
+ "pad_token_id": 0,
147
+ "position_embedding_type": "absolute",
148
+ "pretrained_config": {
149
+ "_name_or_path": "dbmdz/bert-medium-historic-multilingual-cased",
150
+ "add_cross_attention": false,
151
+ "architectures": [
152
+ "BertForMaskedLM"
153
+ ],
154
+ "attention_probs_dropout_prob": 0.1,
155
+ "bad_words_ids": null,
156
+ "begin_suppress_tokens": null,
157
+ "bos_token_id": null,
158
+ "chunk_size_feed_forward": 0,
159
+ "classifier_dropout": null,
160
+ "cross_attention_hidden_size": null,
161
+ "decoder_start_token_id": null,
162
+ "diversity_penalty": 0.0,
163
+ "do_sample": false,
164
+ "early_stopping": false,
165
+ "encoder_no_repeat_ngram_size": 0,
166
+ "eos_token_id": null,
167
+ "exponential_decay_length_penalty": null,
168
+ "finetuning_task": null,
169
+ "forced_bos_token_id": null,
170
+ "forced_eos_token_id": null,
171
+ "hidden_act": "gelu",
172
+ "hidden_dropout_prob": 0.1,
173
+ "hidden_size": 512,
174
+ "id2label": {
175
+ "0": "LABEL_0",
176
+ "1": "LABEL_1"
177
+ },
178
+ "initializer_range": 0.02,
179
+ "intermediate_size": 2048,
180
+ "is_decoder": false,
181
+ "is_encoder_decoder": false,
182
+ "label2id": {
183
+ "LABEL_0": 0,
184
+ "LABEL_1": 1
185
+ },
186
+ "layer_norm_eps": 1e-12,
187
+ "length_penalty": 1.0,
188
+ "max_length": 20,
189
+ "max_position_embeddings": 512,
190
+ "min_length": 0,
191
+ "model_type": "bert",
192
+ "no_repeat_ngram_size": 0,
193
+ "num_attention_heads": 8,
194
+ "num_beam_groups": 1,
195
+ "num_beams": 1,
196
+ "num_hidden_layers": 8,
197
+ "num_return_sequences": 1,
198
+ "output_attentions": false,
199
+ "output_hidden_states": false,
200
+ "output_scores": false,
201
+ "pad_token_id": 0,
202
+ "position_embedding_type": "absolute",
203
+ "prefix": null,
204
+ "problem_type": null,
205
+ "pruned_heads": {},
206
+ "remove_invalid_values": false,
207
+ "repetition_penalty": 1.0,
208
+ "return_dict": true,
209
+ "return_dict_in_generate": false,
210
+ "sep_token_id": null,
211
+ "suppress_tokens": null,
212
+ "task_specific_params": null,
213
+ "temperature": 1.0,
214
+ "tf_legacy_loss": false,
215
+ "tie_encoder_decoder": false,
216
+ "tie_word_embeddings": true,
217
+ "tokenizer_class": null,
218
+ "top_k": 50,
219
+ "top_p": 1.0,
220
+ "torch_dtype": null,
221
+ "torchscript": false,
222
+ "type_vocab_size": 2,
223
+ "typical_p": 1.0,
224
+ "use_bfloat16": false,
225
+ "use_cache": true,
226
+ "vocab_size": 32000
227
+ },
228
+ "torch_dtype": "float32",
229
+ "transformers_version": "4.40.0.dev0",
230
+ "type_vocab_size": 2,
231
+ "use_cache": true,
232
+ "vocab_size": 32000
233
+ }
configuration_stacked.py ADDED
@@ -0,0 +1,99 @@
1
+ from transformers import PretrainedConfig
2
+ import torch
3
+
4
+ class ImpressoConfig(PretrainedConfig):
5
+ model_type = "stacked_bert"
6
+
7
+ def __init__(
8
+ self,
9
+ vocab_size=30522,
10
+ hidden_size=768,
11
+ num_hidden_layers=12,
12
+ num_attention_heads=12,
13
+ intermediate_size=3072,
14
+ hidden_act="gelu",
15
+ hidden_dropout_prob=0.1,
16
+ attention_probs_dropout_prob=0.1,
17
+ max_position_embeddings=512,
18
+ type_vocab_size=2,
19
+ initializer_range=0.02,
20
+ layer_norm_eps=1e-12,
21
+ pad_token_id=0,
22
+ position_embedding_type="absolute",
23
+ use_cache=True,
24
+ classifier_dropout=None,
25
+ pretrained_config=None,
26
+ values_override=None,
27
+ label_map=None,
28
+ **kwargs,
29
+ ):
30
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
31
+
32
+ self.vocab_size = vocab_size
33
+ self.hidden_size = hidden_size
34
+ self.num_hidden_layers = num_hidden_layers
35
+ self.num_attention_heads = num_attention_heads
36
+ self.hidden_act = hidden_act
37
+ self.intermediate_size = intermediate_size
38
+ self.hidden_dropout_prob = hidden_dropout_prob
39
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
40
+ self.max_position_embeddings = max_position_embeddings
41
+ self.type_vocab_size = type_vocab_size
42
+ self.initializer_range = initializer_range
43
+ self.layer_norm_eps = layer_norm_eps
44
+ self.position_embedding_type = position_embedding_type
45
+ self.use_cache = use_cache
46
+ self.classifier_dropout = classifier_dropout
47
+ self.pretrained_config = pretrained_config
48
+ self.label_map = label_map
49
+
50
+ self.values_override = values_override or {}
51
+ self.outputs = {
52
+ "logits": {"shape": [None, None, self.hidden_size], "dtype": "float32"}
53
+ }
54
+
55
+ @classmethod
56
+ def is_torch_support_available(cls):
57
+ """
58
+ Indicate whether Torch support is available for this configuration.
59
+ Required for compatibility with certain parts of the Transformers library.
60
+ """
61
+ return True
62
+
63
+     @classmethod
+     def patch_ops(cls):
+         """
+         Hook expected by some Hugging Face utilities that patch operator mappings.
+         Currently a no-op kept for compatibility: it takes no arguments and returns None.
+         """
+         return None
74
+
75
+ def generate_dummy_inputs(self, tokenizer, batch_size=1, seq_length=8, framework="pt"):
76
+ """
77
+ Generate dummy inputs for testing or export.
78
+ Args:
79
+ tokenizer: The tokenizer used to tokenize inputs.
80
+ batch_size: Number of input samples in the batch.
81
+ seq_length: Length of each sequence.
82
+ framework: Framework ("pt" for PyTorch, "tf" for TensorFlow).
83
+ Returns:
84
+ Dummy inputs as a dictionary.
85
+ """
86
+ if framework == "pt":
87
+ input_ids = torch.randint(
88
+ low=0,
89
+ high=self.vocab_size,
90
+ size=(batch_size, seq_length),
91
+ dtype=torch.long
92
+ )
93
+ attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
94
+ return {"input_ids": input_ids, "attention_mask": attention_mask}
95
+ else:
96
+ raise ValueError("Framework '{}' not supported.".format(framework))
97
+
98
+ # Register the configuration with the transformers library
99
+ ImpressoConfig.register_for_auto_class()
generic_ner.py ADDED
@@ -0,0 +1,778 @@
1
+ import logging
2
+ from transformers import Pipeline
3
+ import numpy as np
4
+ import torch
5
+ import nltk
6
+
7
+ nltk.download("averaged_perceptron_tagger")
8
+ nltk.download("averaged_perceptron_tagger_eng")
9
+ nltk.download("stopwords")
10
+ from nltk.chunk import conlltags2tree
11
+ from nltk import pos_tag
12
+ from nltk.tree import Tree
13
+ import torch.nn.functional as F
14
+ import re, string
15
+
16
+ stop_words = set(nltk.corpus.stopwords.words("english"))
17
+ DEBUG = False
18
+ punctuation = (
19
+ string.punctuation
20
+ + "«»—…“”"
21
+ + "—."
22
+ + "–"
23
+ + "’"
24
+ + "‘"
25
+ + "´"
26
+ + "•"
27
+ + "°"
28
+ + "»"
29
+ + "“"
30
+ + "”"
31
+ + "–"
32
+ + "—"
33
+ + "‘’“”„«»•–—―‣◦…§¶†‡‰′″〈〉"
34
+ )
35
+
36
+ # List of additional "strange" punctuation marks
37
+ # additional_punctuation = "‘’“”„«»•–—―‣◦…§¶†‡‰′″〈〉"
38
+
39
+
40
+ WHITESPACE_RULES = {
41
+ "fr": {
42
+ "pct_no_ws_before": [".", ",", ")", "]", "}", "°", "...", ".-", "%"],
43
+ "pct_no_ws_after": ["(", "[", "{"],
44
+ "pct_no_ws_before_after": ["'", "-"],
45
+ "pct_number": [".", ","],
46
+ },
47
+ "de": {
48
+ "pct_no_ws_before": [
49
+ ".",
50
+ ",",
51
+ ")",
52
+ "]",
53
+ "}",
54
+ "°",
55
+ "...",
56
+ "?",
57
+ "!",
58
+ ":",
59
+ ";",
60
+ ".-",
61
+ "%",
62
+ ],
63
+ "pct_no_ws_after": ["(", "[", "{"],
64
+ "pct_no_ws_before_after": ["'", "-"],
65
+ "pct_number": [".", ","],
66
+ },
67
+ "other": {
68
+ "pct_no_ws_before": [
69
+ ".",
70
+ ",",
71
+ ")",
72
+ "]",
73
+ "}",
74
+ "°",
75
+ "...",
76
+ "?",
77
+ "!",
78
+ ":",
79
+ ";",
80
+ ".-",
81
+ "%",
82
+ ],
83
+ "pct_no_ws_after": ["(", "[", "{"],
84
+ "pct_no_ws_before_after": ["'", "-"],
85
+ "pct_number": [".", ","],
86
+ },
87
+ }
88
+
89
+
90
+ def tokenize(text: str, language: str = "other") -> list[str]:
91
+ """Apply whitespace rules to the given text and language, separating it into tokens.
92
+
93
+ Args:
94
+ text (str): The input text to separate into a list of tokens.
95
+ language (str): Language of the text.
96
+
97
+ Returns:
98
+ list[str]: List of tokens with punctuation as separate tokens.
99
+ """
100
+ # text = add_spaces_around_punctuation(text)
101
+ if not text:
102
+ return []
103
+
104
+ if language not in WHITESPACE_RULES:
105
+ # Default behavior for languages without specific rules:
106
+ # tokenize using standard whitespace splitting
107
+ language = "other"
108
+
109
+ wsrules = WHITESPACE_RULES[language]
110
+ tokenized_text = []
111
+ current_token = ""
112
+
113
+ for char in text:
114
+ if char in wsrules["pct_no_ws_before_after"]:
115
+ if current_token:
116
+ tokenized_text.append(current_token)
117
+ tokenized_text.append(char)
118
+ current_token = ""
119
+ elif char in wsrules["pct_no_ws_before"] or char in wsrules["pct_no_ws_after"]:
120
+ if current_token:
121
+ tokenized_text.append(current_token)
122
+ tokenized_text.append(char)
123
+ current_token = ""
124
+ elif char.isspace():
125
+ if current_token:
126
+ tokenized_text.append(current_token)
127
+ current_token = ""
128
+ else:
129
+ current_token += char
130
+
131
+ if current_token:
132
+ tokenized_text.append(current_token)
133
+
134
+ return tokenized_text
135
+
136
+
137
+ def normalize_text(text):
138
+ # Remove spaces and tabs for the search but keep newline characters
139
+ return re.sub(r"[ \t]+", "", text)
140
+
141
+
142
+ def find_entity_indices(article_text, search_text):
143
+ # Normalize texts by removing spaces and tabs
144
+ normalized_article = normalize_text(article_text)
145
+ normalized_search = normalize_text(search_text)
146
+
147
+ # Initialize a list to hold all start and end indices
148
+ indices = []
149
+
150
+ # Find all occurrences of the search text in the normalized article text
151
+ start_index = 0
152
+ while True:
153
+ start_index = normalized_article.find(normalized_search, start_index)
154
+ if start_index == -1:
155
+ break
156
+
157
+ # Calculate the actual start and end indices in the original article text
158
+ original_chars = 0
159
+ original_start_index = 0
160
+ for i in range(start_index):
161
+ while article_text[original_start_index] in (" ", "\t"):
162
+ original_start_index += 1
163
+ if article_text[original_start_index] not in (" ", "\t", "\n"):
164
+ original_chars += 1
165
+ original_start_index += 1
166
+
167
+ original_end_index = original_start_index
168
+ search_chars = 0
169
+ while search_chars < len(normalized_search):
170
+ if article_text[original_end_index] not in (" ", "\t", "\n"):
171
+ search_chars += 1
172
+ original_end_index += 1 # Increment to include the last character
173
+
174
+ # Append the found indices to the list
175
+ if article_text[original_start_index] == " ":
176
+ original_start_index += 1
177
+ indices.append((original_start_index, original_end_index))
178
+
179
+ # Move start_index to the next position to continue searching
180
+ start_index += 1
181
+
182
+ return indices
183
+
184
+
185
+ def get_entities(tokens, tags, confidences, text):
186
+
187
+ tags = [tag.replace("S-", "B-").replace("E-", "I-") for tag in tags]
188
+ pos_tags = [pos for token, pos in pos_tag(tokens)]
189
+
190
+ for i in range(1, len(tags)):
191
+ # If a 'B-' tag immediately follows an 'I-' tag, treat it as a continuation and change it to 'I-'
192
+ if tags[i].startswith("B-") and tags[i - 1].startswith("I-"):
193
+ tags[i] = "I-" + tags[i][2:] # Change 'B-' to 'I-' for the same entity type
194
+
195
+ conlltags = [(token, pos, tg) for token, pos, tg in zip(tokens, pos_tags, tags)]
196
+ ne_tree = conlltags2tree(conlltags)
197
+
198
+ entities = []
199
+ idx: int = 0
200
+ already_done = []
201
+ for subtree in ne_tree:
202
+ # skipping 'O' tags
203
+ if isinstance(subtree, Tree):
204
+ original_label = subtree.label()
205
+ original_string = " ".join([token for token, pos in subtree.leaves()])
206
+
207
+ for indices in find_entity_indices(text, original_string):
208
+ entity_start_position = indices[0]
209
+ entity_end_position = indices[1]
210
+ if (
211
+ "_".join(
212
+ [original_label, original_string, str(entity_start_position)]
213
+ )
214
+ in already_done
215
+ ):
216
+ continue
217
+ else:
218
+ already_done.append(
219
+ "_".join(
220
+ [
221
+ original_label,
222
+ original_string,
223
+ str(entity_start_position),
224
+ ]
225
+ )
226
+ )
227
+ if len(text[entity_start_position:entity_end_position].strip()) < len(
228
+ text[entity_start_position:entity_end_position]
229
+ ):
230
+ entity_start_position = (
231
+ entity_start_position
232
+ + len(text[entity_start_position:entity_end_position])
233
+ - len(text[entity_start_position:entity_end_position].strip())
234
+ )
235
+
236
+ entities.append(
237
+ {
238
+ "type": original_label,
239
+ "confidence_ner": round(
240
+ np.average(confidences[idx : idx + len(subtree)]) * 100, 2
241
+ ),
242
+ "index": (idx, idx + len(subtree)),
243
+ "surface": text[
244
+ entity_start_position:entity_end_position
245
+ ], # original_string,
246
+ "lOffset": entity_start_position,
247
+ "rOffset": entity_end_position,
248
+ }
249
+ )
250
+
251
+ idx += len(subtree)
252
+
253
+ # Update the current character position
254
+ # We add the length of the original string + 1 (for the space)
255
+ else:
256
+ token, pos = subtree
257
+ # If it's not a named entity, we still need to update the character
258
+ # position
259
+ idx += 1
260
+
261
+ return entities
262
+
263
+
264
+ def realign(
265
+ text_sentence, out_label_preds, softmax_scores, tokenizer, reverted_label_map
266
+ ):
267
+ preds_list, words_list, confidence_list = [], [], []
268
+ word_ids = tokenizer(text_sentence, is_split_into_words=True).word_ids()
269
+ for idx, word in enumerate(text_sentence):
270
+ beginning_index = word_ids.index(idx)
271
+ try:
272
+ preds_list.append(reverted_label_map[out_label_preds[beginning_index]])
273
+ confidence_list.append(max(softmax_scores[beginning_index]))
274
+ except Exception:  # the sentence was longer than max_length
275
+ preds_list.append("O")
276
+ confidence_list.append(0.0)
277
+ words_list.append(word)
278
+
279
+ return words_list, preds_list, confidence_list
280
+
281
+
282
+ def add_spaces_around_punctuation(text):
283
+ # Add a space before and after all punctuation
284
+ all_punctuation = string.punctuation + punctuation
285
+ return re.sub(r"([{}])".format(re.escape(all_punctuation)), r" \1 ", text)
286
+
287
+
288
+ def attach_comp_to_closest(entities):
289
+ # Define valid entity types that can receive a "comp.function" or "comp.name" attachment
290
+ valid_entity_types = {"org", "pers", "org.ent", "pers.ind"}
291
+
292
+ # Separate "comp.function" and "comp.name" entities from other entities
293
+ comp_entities = [ent for ent in entities if ent["type"].startswith("comp")]
294
+ other_entities = [ent for ent in entities if not ent["type"].startswith("comp")]
295
+
296
+ for comp_entity in comp_entities:
297
+ closest_entity = None
298
+ min_distance = float("inf")
299
+
300
+ # Find the closest non-"comp" entity that is valid for attaching
301
+ for other_entity in other_entities:
302
+ # Calculate distance between the comp entity and the other entity
303
+ if comp_entity["lOffset"] > other_entity["rOffset"]:
304
+ distance = comp_entity["lOffset"] - other_entity["rOffset"]
305
+ elif comp_entity["rOffset"] < other_entity["lOffset"]:
306
+ distance = other_entity["lOffset"] - comp_entity["rOffset"]
307
+ else:
308
+ distance = 0 # They overlap or touch
309
+
310
+ # Ensure the entity type is valid and check for minimal distance
311
+ if (
312
+ distance < min_distance
313
+ and other_entity["type"].split(".")[0] in valid_entity_types
314
+ ):
315
+ min_distance = distance
316
+ closest_entity = other_entity
317
+
318
+ # Attach the "comp.function" or "comp.name" if a valid entity is found
319
+ if closest_entity:
320
+ suffix = comp_entity["type"].split(".")[
321
+ -1
322
+ ] # Extract the suffix (e.g., 'name', 'function')
323
+ closest_entity[suffix] = comp_entity["surface"] # Attach the text
324
+
325
+ return other_entities
326
+
327
+
328
+ def conflicting_context(comp_entity, target_entity):
329
+ """
330
+ Determines if there is a conflict between the comp_entity and the target entity.
331
+ Prevents incorrect name and function attachments by using a rule-based approach.
332
+ """
333
+ # Case 1: Check for correct function attachment to person or organization entities
334
+ if comp_entity["type"].startswith("comp.function"):
335
+ if not ("pers" in target_entity["type"] or "org" in target_entity["type"]):
336
+ return True # Conflict: Function should only attach to persons or organizations
337
+
338
+ # Case 2: Avoid attaching comp.* entities to non-person, non-organization types (like locations)
339
+ if "loc" in target_entity["type"]:
340
+ return True # Conflict: comp.* entities should not attach to locations or similar types
341
+
342
+ return False # No conflict
343
+
344
+
345
+ def extract_name_from_text(text, partial_name):
346
+ """
347
+ Extracts the full name from the entity's text based on the partial name.
348
+ This function assumes that the full name starts with capitalized letters and does not
349
+ include any words that come after the partial name.
350
+ """
351
+ # Split the text and partial name into words
352
+ words = tokenize(text)
353
+ partial_words = partial_name.split()
354
+
355
+ if DEBUG:
356
+ print("text:", text)
357
+ if DEBUG:
358
+ print("partial_name:", partial_name)
359
+
360
+ # Find the position of the partial name in the word list
361
+ for i, word in enumerate(words):
362
+ if DEBUG:
363
+ print(words, "---", words[i : i + len(partial_words)])
364
+ if words[i : i + len(partial_words)] == partial_words:
365
+ # Initialize full name with the partial name
366
+ full_name = partial_words[:]
367
+
368
+ if DEBUG:
369
+ print("full_name:", full_name)
370
+
371
+ # Check previous words and only add capitalized words (skip lowercase words)
372
+ j = i - 1
373
+ while j >= 0 and words[j][0].isupper():
374
+ full_name.insert(0, words[j])
375
+ j -= 1
376
+ if DEBUG:
377
+ print("full_name:", full_name)
378
+
379
+ # Return only the full name up to the partial name (ignore words after the name)
380
+ return " ".join(full_name).strip() # Join the words to form the full name
381
+
382
+ # If not found, return the original text (as a fallback)
383
+ return text.strip()
384
+
385
+
386
+ def repair_names_in_entities(entities):
387
+ """
388
+ This function repairs the names in the entities by extracting the full name
389
+ from the text of the entity if a partial name (e.g., 'Washington') is incorrectly attached.
390
+ """
391
+ for entity in entities:
392
+ if "name" in entity and "pers" in entity["type"]:
393
+ name = entity["name"]
394
+ text = entity["surface"]
395
+
396
+ # Check if the attached name is part of the entity's text
397
+ if name in text:
398
+ # Extract the full name from the text by splitting around the attached name
399
+ full_name = extract_name_from_text(entity["surface"], name)
400
+ entity["name"] = (
401
+ full_name # Replace the partial name with the full name
402
+ )
403
+ # if "name" not in entity:
404
+ # entity["name"] = entity["surface"]
405
+
406
+ return entities
407
+
408
+
409
+ def clean_coarse_entities(entities):
410
+ """
411
+ This function removes entities that are not useful for the NEL process.
412
+ """
413
+ # Define a set of entity types that are considered useful for NEL
414
+ useful_types = {
415
+ "pers", # Person
416
+ "loc", # Location
417
+ "org", # Organization
418
+ "date", # Product
419
+ "time", # Time
420
+ }
421
+
422
+ # Filter out entities that are not in the useful_types set unless they are comp.* entities
423
+ cleaned_entities = [
424
+ entity
425
+ for entity in entities
426
+ if entity["type"] in useful_types or "comp" in entity["type"]
427
+ ]
428
+
429
+ return cleaned_entities
430
+
431
+
432
+ def postprocess_entities(entities):
433
+ # Step 1: Filter entities with the same text, keeping the one with the most dots in the 'entity' field
434
+ entity_map = {}
435
+
436
+ # Loop over the entities and prioritize the one with the most dots
437
+ for entity in entities:
438
+ entity_text = entity["surface"]
439
+ num_dots = entity["type"].count(".")
440
+
441
+ # If the entity text is new, or this entity has more dots, update the map
442
+ if (
443
+ entity_text not in entity_map
444
+ or entity_map[entity_text]["type"].count(".") < num_dots
445
+ ):
446
+ entity_map[entity_text] = entity
447
+
448
+ # Collect the filtered entities from the map
449
+ filtered_entities = list(entity_map.values())
450
+
451
+ # Step 2: Attach "comp.function" entities to the closest other entities
452
+ filtered_entities = attach_comp_to_closest(filtered_entities)
453
+ if DEBUG:
454
+ print("After attach_comp_to_closest:", filtered_entities, "\n")
455
+ filtered_entities = repair_names_in_entities(filtered_entities)
456
+ if DEBUG:
457
+ print("After repair_names_in_entities:", filtered_entities, "\n")
458
+
459
+ # Step 3: Remove entities that are not useful for NEL
460
+ # filtered_entities = clean_coarse_entities(filtered_entities)
461
+
462
+ # filtered_entities = remove_blacklisted_entities(filtered_entities)
463
+
464
+ return filtered_entities
465
+
466
+
467
+ def remove_included_entities(entities):
468
+ # Loop through entities and remove those whose text is included in another with the same label
469
+ final_entities = []
470
+ for i, entity in enumerate(entities):
471
+ is_included = False
472
+ for other_entity in entities:
473
+ if entity["surface"] != other_entity["surface"]:
474
+ if "comp" in other_entity["type"]:
475
+ # Check if entity's text is a substring of another entity's text
476
+ if entity["surface"] in other_entity["surface"]:
477
+ is_included = True
478
+ break
479
+ elif (
480
+ entity["type"].split(".")[0] in other_entity["type"].split(".")[0]
481
+ or other_entity["type"].split(".")[0]
482
+ in entity["type"].split(".")[0]
483
+ ):
484
+ if entity["surface"] in other_entity["surface"]:
485
+ is_included = True
486
+ if not is_included:
487
+ final_entities.append(entity)
488
+ return final_entities
489
+
490
+
491
+ def refine_entities_with_coarse(all_entities, coarse_entities):
492
+ """
493
+ Looks through all entities and refines them based on the coarse entities.
494
+ If a surface match is found in the coarse entities and the types match,
495
+ the entity's confidence_ner and type are updated based on the coarse entity.
496
+ """
497
+ # Create a dictionary for coarse entities based on surface and type for quick lookup
498
+ coarse_lookup = {}
499
+ for coarse_entity in coarse_entities:
500
+ key = (coarse_entity["surface"], coarse_entity["type"].split(".")[0])
501
+ coarse_lookup[key] = coarse_entity
502
+
503
+ # Iterate through all entities and compare with the coarse entities
504
+ for entity in all_entities:
505
+ key = (
506
+ entity["surface"],
507
+ entity["type"].split(".")[0],
508
+ ) # Use the coarse type for comparison
509
+
510
+ if key in coarse_lookup:
511
+ coarse_entity = coarse_lookup[key]
512
+ # If a match is found, update the confidence_ner and type in the entity
513
+ if entity["confidence_ner"] < coarse_entity["confidence_ner"]:
514
+ entity["confidence_ner"] = coarse_entity["confidence_ner"]
515
+ entity["type"] = coarse_entity[
516
+ "type"
517
+ ] # Update the type if the confidence is higher
518
+
519
+ # No need to append to refined_entities, we're modifying in place
520
+ for entity in all_entities:
521
+ entity["type"] = entity["type"].split(".")[0]
522
+ return all_entities
523
+
524
+
525
+ def remove_trailing_stopwords(entities):
526
+ """
527
+ This function removes stopwords and punctuation from both the beginning and end of each entity's text
528
+ and repairs the lOffset and rOffset accordingly.
529
+ """
530
+ if DEBUG:
531
+ print(f"Initial entities in remove_trailing_stopwords: {len(entities)}")
532
+ new_entities = []
533
+ for entity in entities:
534
+ if "comp" not in entity["type"]:
535
+ entity_text = entity["surface"]
536
+ original_len = len(entity_text)
537
+
538
+ # Initial offsets
539
+ lOffset = entity.get("lOffset", 0)
540
+ rOffset = entity.get("rOffset", original_len)
541
+
542
+ # Remove stopwords and punctuation from the beginning
543
+ while entity_text and (
544
+ entity_text.split()[0].lower() in stop_words
545
+ or entity_text[0] in punctuation
546
+ ):
547
+ if entity_text.split()[0].lower() in stop_words:
548
+ stopword_len = (
549
+ len(entity_text.split()[0]) + 1
550
+ ) # Adjust length for stopword and following space
551
+ entity_text = entity_text[stopword_len:] # Remove leading stopword
552
+ lOffset += stopword_len # Adjust the left offset
553
+ if DEBUG:
554
+ print(
555
+ f"Removed leading stopword from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
556
+ )
557
+ elif entity_text[0] in punctuation:
558
+ entity_text = entity_text[1:] # Remove leading punctuation
559
+ lOffset += 1 # Adjust the left offset
560
+ if DEBUG:
561
+ print(
562
+ f"Removed leading punctuation from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
563
+ )
564
+
565
+ # Remove stopwords and punctuation from the end
566
+ if len(entity_text.strip()) > 1:
567
+ while entity_text and (
568
+ entity_text.split()[-1].lower() in stop_words
569
+ or entity_text[-1] in punctuation
570
+ ):
571
+ if entity_text.split()[-1].lower() in stop_words:
572
+ stopword_len = (
573
+ len(entity_text.split()[-1]) + 1
574
+ ) # Adjust length for stopword and preceding space
575
+ entity_text = entity_text[
576
+ :-stopword_len
577
+ ] # Remove trailing stopword
578
+ rOffset -= stopword_len # Adjust the right offset
579
+ if DEBUG:
580
+ print(
581
+ f"Removed trailing stopword from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
582
+ )
583
+ if entity_text:
584
+ if entity_text[-1] in punctuation:
585
+ entity_text = entity_text[
586
+ :-1
587
+ ] # Remove trailing punctuation
588
+ rOffset -= 1 # Adjust the right offset
589
+ if DEBUG:
590
+ print(
591
+ f"Removed trailing punctuation from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
592
+ )
593
+
594
+ # Skip certain entities based on rules
595
+ if entity_text in string.punctuation:
596
+ if DEBUG:
597
+ print(f"Skipping entity: {entity_text}")
598
+ entities.remove(entity)
599
+ continue
600
+ # check now if its in stopwords
601
+ if entity_text.lower() in stop_words:
602
+ if DEBUG:
603
+ print(f"Skipping entity: {entity_text}")
604
+ entities.remove(entity)
605
+ continue
606
+ # check now if the entire entity is a list of stopwords:
607
+ if all([word.lower() in stop_words for word in entity_text.split()]):
608
+ if DEBUG:
609
+ print(f"Skipping entity: {entity_text}")
610
+ entities.remove(entity)
611
+ continue
612
+ # Check if the entire entity is made up of stopwords characters
613
+ if all(
614
+ [char.lower() in stop_words for char in entity_text if char.isalpha()]
615
+ ):
616
+ if DEBUG:
617
+ print(
618
+ f"Skipping entity: {entity_text} (all characters are stopwords)"
619
+ )
620
+ entities.remove(entity)
621
+ continue
622
+ # check now if all entity is in a list of punctuation
623
+ if all([word in string.punctuation for word in entity_text.split()]):
624
+ if DEBUG:
625
+ print(
626
+ f"Skipping entity: {entity_text} (all characters are punctuation)"
627
+ )
628
+ entities.remove(entity)
629
+ continue
630
+ if all(
631
+ [
632
+ char.lower() in string.punctuation
633
+ for char in entity_text
634
+ if char.isalpha()
635
+ ]
636
+ ):
637
+ if DEBUG:
638
+ print(
639
+ f"Skipping entity: {entity_text} (all characters are punctuation)"
640
+ )
641
+ entities.remove(entity)
642
+ continue
643
+
644
+ # if it's a number and "time" is not in its type, skip it
645
+ if entity_text.isdigit() and "time" not in entity["type"]:
646
+ if DEBUG:
647
+ print(f"Skipping entity: {entity_text}")
648
+ entities.remove(entity)
649
+ continue
650
+
651
+ if entity_text.startswith(" "):
652
+ entity_text = entity_text[1:]
653
+ # update lOffset, rOffset
654
+ lOffset += 1
655
+ if entity_text.endswith(" "):
656
+ entity_text = entity_text[:-1]
657
+ # update lOffset, rOffset
658
+ rOffset -= 1
659
+
660
+ # Update the entity surface and offsets
661
+ entity["surface"] = entity_text
662
+ entity["lOffset"] = lOffset
663
+ entity["rOffset"] = rOffset
664
+
665
+ # Remove the entity if the surface is empty after cleaning
666
+ if len(entity["surface"].strip()) == 0:
667
+ if DEBUG:
668
+ print(f"Deleted entity: {entity['surface']}")
669
+ entities.remove(entity)
670
+ else:
671
+ new_entities.append(entity)
672
+ else:
673
+ new_entities.append(entity)
674
+ if DEBUG:
675
+ print(f"Remained entities in remove_trailing_stopwords: {len(new_entities)}")
676
+ return new_entities
677
+
678
+
679
+ class MultitaskTokenClassificationPipeline(Pipeline):
680
+
681
+ def _sanitize_parameters(self, **kwargs):
682
+ preprocess_kwargs = {}
683
+ if "text" in kwargs:
684
+ preprocess_kwargs["text"] = kwargs["text"]
685
+ self.label_map = self.model.config.label_map
686
+ self.id2label = {
687
+ task: {id_: label for label, id_ in labels.items()}
688
+ for task, labels in self.label_map.items()
689
+ }
690
+ return preprocess_kwargs, {}, {}
691
+
692
+ def preprocess(self, text, **kwargs):
693
+
694
+ tokenized_inputs = self.tokenizer(
695
+ text, padding="max_length", truncation=True, max_length=512
696
+ )
697
+
698
+ text_sentence = tokenize(add_spaces_around_punctuation(text))
699
+ return tokenized_inputs, text_sentence, text
700
+
701
+ def _forward(self, inputs):
702
+ inputs, text_sentences, text = inputs
703
+ input_ids = torch.tensor([inputs["input_ids"]], dtype=torch.long).to(
704
+ self.model.device
705
+ )
706
+ attention_mask = torch.tensor([inputs["attention_mask"]], dtype=torch.long).to(
707
+ self.model.device
708
+ )
709
+ with torch.no_grad():
710
+ outputs = self.model(input_ids, attention_mask)
711
+ return outputs, text_sentences, text
712
+
713
+ def is_within(self, entity1, entity2):
714
+ """Check if entity1 is fully within the bounds of entity2."""
715
+ return (
716
+ entity1["lOffset"] >= entity2["lOffset"]
717
+ and entity1["rOffset"] <= entity2["rOffset"]
718
+ )
719
+
720
+ def postprocess(self, outputs, **kwargs):
721
+ """
722
+ Postprocess the outputs of the model
723
+ :param outputs:
724
+ :param kwargs:
725
+ :return:
726
+ """
727
+ tokens_result, text_sentence, text = outputs
728
+
729
+ predictions = {}
730
+ confidence_scores = {}
731
+ for task, logits in tokens_result.logits.items():
732
+ predictions[task] = torch.argmax(logits, dim=-1).tolist()[0]
733
+ confidence_scores[task] = F.softmax(logits, dim=-1).tolist()[0]
734
+
735
+ entities = {}
736
+ for task in predictions.keys():
737
+ words_list, preds_list, confidence_list = realign(
738
+ text_sentence,
739
+ predictions[task],
740
+ confidence_scores[task],
741
+ self.tokenizer,
742
+ self.id2label[task],
743
+ )
744
+
745
+ entities[task] = get_entities(words_list, preds_list, confidence_list, text)
746
+
747
+ # add titles to comp entities
748
+ # from pprint import pprint
749
+
750
+ # print("Before:")
751
+ # pprint(entities)
752
+
753
+ all_entities = []
754
+ coarse_entities = []
755
+ for key in entities:
756
+ if key in ["NE-COARSE-LIT"]:
757
+ coarse_entities = entities[key]
758
+ all_entities.extend(entities[key])
759
+
760
+ if DEBUG:
761
+ print(all_entities)
762
+ # print("After remove_included_entities:")
763
+ all_entities = remove_included_entities(all_entities)
764
+ if DEBUG:
765
+ print("After remove_included_entities:", all_entities)
766
+ all_entities = remove_trailing_stopwords(all_entities)
767
+ if DEBUG:
768
+ print("After remove_trailing_stopwords:", all_entities)
769
+ all_entities = postprocess_entities(all_entities)
770
+ if DEBUG:
771
+ print("After postprocess_entities:", all_entities)
772
+ all_entities = refine_entities_with_coarse(all_entities, coarse_entities)
773
+ if DEBUG:
774
+ print("After refine_entities_with_coarse:", all_entities)
775
+ # print("After attach_comp_to_closest:")
776
+ # pprint(all_entities)
777
+ # print("\n")
778
+ return all_entities
label_map.json ADDED
@@ -0,0 +1 @@
1
+ {"NE-COARSE-LIT": {"B-org": 0, "I-loc": 1, "I-org": 2, "O": 3, "B-prod": 4, "B-time": 5, "I-time": 6, "B-pers": 7, "B-loc": 8, "I-pers": 9, "I-prod": 10}, "NE-COARSE-METO": {"B-org": 0, "O": 1, "I-org": 2, "B-loc": 3, "I-loc": 4, "B-time": 5}, "NE-FINE-LIT": {"B-pers.ind": 0, "I-org.ent": 1, "B-prod.doctr": 2, "B-org.adm": 3, "B-loc.unk": 4, "B-loc.add.phys": 5, "I-loc.add.phys": 6, "I-time.date.abs": 7, "I-loc.adm.town": 8, "B-pers.coll": 9, "B-prod.media": 10, "I-loc.adm.nat": 11, "B-loc.adm.sup": 12, "B-loc.phys.geo": 13, "I-org.ent.pressagency": 14, "I-loc.adm.sup": 15, "I-pers.ind": 16, "I-loc.phys.hydro": 17, "O": 18, "B-loc.oro": 19, "B-pers.ind.articleauthor": 20, "I-loc.oro": 21, "I-loc.add.elec": 22, "B-time.date.abs": 23, "B-org.ent": 24, "I-loc.phys.geo": 25, "I-pers.coll": 26, "I-loc.fac": 27, "B-loc.phys.hydro": 28, "I-org.adm": 29, "I-prod.doctr": 30, "I-pers.ind.articleauthor": 31, "B-loc.add.elec": 32, "B-loc.adm.town": 33, "B-loc.adm.nat": 34, "I-loc.adm.reg": 35, "B-loc.fac": 36, "B-org.ent.pressagency": 37, "I-prod.media": 38, "B-loc.adm.reg": 39, "I-loc.unk": 40}, "NE-FINE-METO": {"I-org.ent": 0, "B-org.adm": 1, "I-org.adm": 2, "B-loc.fac": 3, "O": 4, "B-loc.oro": 5, "B-loc.adm.town": 6, "B-org.ent": 7, "I-loc.fac": 8, "B-time.date.abs": 9}, "NE-FINE-COMP": {"I-comp.name": 0, "B-comp.name": 1, "B-comp.title": 2, "I-comp.function": 3, "I-comp.title": 4, "B-comp.function": 5, "O": 6, "I-comp.demonym": 7, "B-comp.demonym": 8, "B-comp.qualifier": 9, "I-comp.qualifier": 10}, "NE-NESTED": {"I-loc.fac": 0, "B-loc.phys.hydro": 1, "B-pers.ind": 2, "I-org.ent": 3, "B-org.adm": 4, "I-org.adm": 5, "I-loc.adm.town": 6, "B-pers.coll": 7, "I-loc.adm.nat": 8, "B-loc.adm.town": 9, "B-loc.adm.sup": 10, "B-loc.phys.geo": 11, "I-pers.ind": 12, "B-loc.adm.nat": 13, "I-loc.adm.reg": 14, "B-loc.adm.reg": 15, "O": 16, "B-loc.oro": 17, "B-loc.fac": 18, "I-loc.oro": 19, "B-org.ent": 20, "I-loc.phys.geo": 21, "I-loc.phys.hydro": 22, "B-prod.media": 23, "I-prod.media": 24}}
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03a807b124debff782406c816eacb7ced1f2e25b9a5198b27e1616a41faa0662
3
+ size 193971960
modeling_stacked.py ADDED
@@ -0,0 +1,136 @@
1
+ from transformers.modeling_outputs import TokenClassifierOutput
2
+ import torch
3
+ import torch.nn as nn
4
+ from transformers import PreTrainedModel, AutoModel, AutoConfig, BertConfig
5
+ from torch.nn import CrossEntropyLoss
6
+ from typing import Optional, Tuple, Union
7
+ import logging, json, os
8
+
9
+ from .configuration_stacked import ImpressoConfig
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
+ def get_info(label_map):
15
+ num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
16
+ return num_token_labels_dict
17
+
18
+
19
+ class ExtendedMultitaskModelForTokenClassification(PreTrainedModel):
20
+
21
+ config_class = ImpressoConfig
22
+ _keys_to_ignore_on_load_missing = [r"position_ids"]
23
+
24
+ def __init__(self, config):
25
+ super().__init__(config)
26
+ self.num_token_labels_dict = get_info(config.label_map)
27
+ self.config = config
28
+
29
+ self.bert = AutoModel.from_pretrained(
30
+ config.pretrained_config["_name_or_path"], config=config.pretrained_config
31
+ )
32
+ if "classifier_dropout" not in config.__dict__:
33
+ classifier_dropout = 0.1
34
+ else:
35
+ classifier_dropout = (
36
+ config.classifier_dropout
37
+ if config.classifier_dropout is not None
38
+ else config.hidden_dropout_prob
39
+ )
40
+ self.dropout = nn.Dropout(classifier_dropout)
41
+
42
+ # Additional transformer layers
43
+ self.transformer_encoder = nn.TransformerEncoder(
44
+ nn.TransformerEncoderLayer(
45
+ d_model=config.hidden_size, nhead=config.num_attention_heads
46
+ ),
47
+ num_layers=2,
48
+ )
49
+
50
+ # For token classification, create a classifier for each task
51
+ self.token_classifiers = nn.ModuleDict(
52
+ {
53
+ task: nn.Linear(config.hidden_size, num_labels)
54
+ for task, num_labels in self.num_token_labels_dict.items()
55
+ }
56
+ )
57
+
58
+ # Initialize weights and apply final processing
59
+ self.post_init()
60
+
61
+ def forward(
62
+ self,
63
+ input_ids: Optional[torch.Tensor] = None,
64
+ attention_mask: Optional[torch.Tensor] = None,
65
+ token_type_ids: Optional[torch.Tensor] = None,
66
+ position_ids: Optional[torch.Tensor] = None,
67
+ head_mask: Optional[torch.Tensor] = None,
68
+ inputs_embeds: Optional[torch.Tensor] = None,
69
+ labels: Optional[torch.Tensor] = None,
70
+ token_labels: Optional[dict] = None,
71
+ output_attentions: Optional[bool] = None,
72
+ output_hidden_states: Optional[bool] = None,
73
+ return_dict: Optional[bool] = None,
74
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
75
+ r"""
76
+ token_labels (`dict` of `torch.LongTensor` of shape `(batch_size, seq_length)`, *optional*):
77
+ Labels for computing the token classification loss. Keys should match the tasks.
78
+ """
79
+ return_dict = (
80
+ return_dict if return_dict is not None else self.config.use_return_dict
81
+ )
82
+
83
+ bert_kwargs = {
84
+ "input_ids": input_ids,
85
+ "attention_mask": attention_mask,
86
+ "token_type_ids": token_type_ids,
87
+ "position_ids": position_ids,
88
+ "head_mask": head_mask,
89
+ "inputs_embeds": inputs_embeds,
90
+ "output_attentions": output_attentions,
91
+ "output_hidden_states": output_hidden_states,
92
+ "return_dict": return_dict,
93
+ }
94
+
95
+ if any(
96
+ keyword in self.config.name_or_path.lower()
97
+ for keyword in ["llama", "deberta"]
98
+ ):
99
+ bert_kwargs.pop("token_type_ids")
100
+ bert_kwargs.pop("head_mask")
101
+
102
+ outputs = self.bert(**bert_kwargs)
103
+
104
+ # For token classification
105
+ token_output = outputs[0]
106
+ token_output = self.dropout(token_output)
107
+
108
+ # Pass through additional transformer layers
109
+ token_output = self.transformer_encoder(token_output.transpose(0, 1)).transpose(
110
+ 0, 1
111
+ )
112
+
113
+ # Collect the logits and compute the loss for each task
114
+ task_logits = {}
115
+ total_loss = 0
116
+ for task, classifier in self.token_classifiers.items():
117
+ logits = classifier(token_output)
118
+ task_logits[task] = logits
119
+ if token_labels and task in token_labels:
120
+ loss_fct = CrossEntropyLoss()
121
+ loss = loss_fct(
122
+ logits.view(-1, self.num_token_labels_dict[task]),
123
+ token_labels[task].view(-1),
124
+ )
125
+ total_loss += loss
126
+
127
+ if not return_dict:
128
+ output = (task_logits,) + outputs[2:]
129
+ return ((total_loss,) + output) if total_loss != 0 else output
130
+
131
+ return TokenClassifierOutput(
132
+ loss=total_loss,
133
+ logits=task_logits,
134
+ hidden_states=outputs.hidden_states,
135
+ attentions=outputs.attentions,
136
+ )
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
test.py ADDED
@@ -0,0 +1,46 @@
1
+ # Import necessary modules from the transformers library
2
+ from transformers import pipeline
3
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
4
+
5
+ # Define the model name to be used for token classification; we use the Impresso NER model,
+ # which can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
7
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
8
+
9
+ # Load the tokenizer corresponding to the specified model name
10
+ ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
11
+
12
+ ner_pipeline = pipeline(
13
+ "generic-ner",
14
+ model=MODEL_NAME,
15
+ tokenizer=ner_tokenizer,
16
+ trust_remote_code=True,
17
+ device="cpu",
18
+ )
19
+ sentences = [
20
+ """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles,
21
+ where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly,
22
+ debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun,
23
+ regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia,
24
+ George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State,
25
+ were drafting policies for the newly established American government following the signing of the Constitution."""
26
+ ]
27
+
28
+ print(sentences[0])
29
+
30
+
31
+ # Helper function to print entities one per row,
+ # using the fields returned by the Impresso NER pipeline
+ def print_nicely(entities):
+     for entity in entities:
+         print(
+             f"Entity: {entity['type']} | Confidence: {entity['confidence_ner']:.2f}% | "
+             f"Text: {entity['surface'].strip()} | Start: {entity['lOffset']} | End: {entity['rOffset']}"
+         )
37
+
38
+
39
+ # Run the pipeline on each sentence; it returns a flat list of
+ # entity dictionaries (one per detected mention)
+ for sentence in sentences:
+     results = ner_pipeline(sentence)
+     print_nicely(results)
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "mask_token": "[MASK]",
49
+ "max_len": 512,
50
+ "model_max_length": 512,
51
+ "never_split": null,
52
+ "pad_token": "[PAD]",
53
+ "sep_token": "[SEP]",
54
+ "strip_accents": false,
55
+ "tokenize_chinese_chars": true,
56
+ "tokenizer_class": "BertTokenizer",
57
+ "unk_token": "[UNK]"
58
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
x ADDED
File without changes
y ADDED
File without changes