lcs06 committed on
Commit
e2c6ed1
·
verified ·
1 Parent(s): 0fe08e9

Initial release

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ confusion_matrix_entity.png filter=lfs diff=lfs merge=lfs -text
+ label_metrics_entity.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,143 @@
1
+ ---
2
+ language:
3
+ - it
4
+ license: apache-2.0
5
+ tags:
6
+ - token-classification
7
+ - ner
8
+ - italian
9
+ - transformers
10
+ - pytorch
11
+ datasets:
12
+ - custom
13
+ metrics:
14
+ - f1
15
+ - precision
16
+ - recall
17
+ base_model: colinglab/BureauBERTo
18
+ pipeline_tag: token-classification
19
+ widget:
20
+ - text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
21
+ example_title: "Documento anagrafico"
22
+ - text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
23
+ example_title: "Documento medico"
24
+ ---
25
+
26
+ # Nerone: Italian NER for Sensitive Data
27
+
28
+ Named Entity Recognition model for extracting and classifying sensitive personal information from Italian documents.
29
+
30
+ ## Model Description
31
+
32
+ Fine-tuned [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo) (an Italian RoBERTa-style model that uses the CamemBERT architecture) for token classification with 70 entity types, tagged in the BIO scheme:
33
+
34
+ - **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
35
+ - **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
36
+ - **Contact**: PHONE, EMAIL, URL
37
+ - **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
38
+ - **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
39
+ - **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
40
+ - **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
41
+ - **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
42
+ - **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
43
+ - **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
44
+ - **Misc**: ORGANIZATION
45
+
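The entity types listed above correspond to the `B-`/`I-` tags in the model's label map. A minimal sketch of collapsing BIO tags into entity type names follows; the helper `entity_types` and the sample list are hypothetical illustrations, not part of the released code. With the real model, the tags could be read from `model.config.id2label.values()`.

```python
# Hypothetical helper: collapse BIO tags (B-X / I-X / O) into entity type names.
def entity_types(labels):
    """Return the sorted entity types found in an iterable of BIO tags."""
    types = set()
    for label in labels:
        if label == "O":
            continue  # "O" marks tokens outside any entity
        prefix, _, name = label.partition("-")
        if prefix in ("B", "I") and name:
            types.add(name)
    return sorted(types)


# A small illustrative sample of tags, not the full label set.
sample = ["O", "B-PERSON", "I-PERSON", "B-IBAN", "I-IBAN", "B-FISCAL_CODE"]
print(entity_types(sample))  # ['FISCAL_CODE', 'IBAN', 'PERSON']
```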
46
+ ## Dataset
47
+
48
+ - **Total samples**: 122,625
49
+ - **Split**: 70% train / 15% validation / 15% test
50
+ - **Source**: Italian administrative documents
51
+
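The card gives percentages but not per-split counts. As a rough sketch, the approximate split sizes can be estimated with integer arithmetic; the rounding rule used here (floor) is an assumption, so the true counts may differ by a few samples.

```python
# Approximate split sizes from the stated total and percentages.
# Assumption: floor division; the actual rounding used is not documented.
total = 122_625
percentages = {"train": 70, "validation": 15, "test": 15}

sizes = {name: total * pct // 100 for name, pct in percentages.items()}
leftover = total - sum(sizes.values())  # samples not assigned by flooring

print(sizes)
print("unassigned by flooring:", leftover)
```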
52
+ ## Training
53
+
54
+ - **Base model**: colinglab/BureauBERTo
55
+ - **Learning rate**: 4e-5
56
+ - **Batch size**: 32
57
+ - **Max sequence length**: 256
58
+
59
+ ## Evaluation Results
60
+
61
+ | Metric | Score |
62
+ |-----------|-------|
63
+ | F1 | 0.915 |
64
+ | Precision | 0.895 |
65
+ | Recall | 0.936 |
66
+
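As a quick sanity check, the reported F1 can be compared with the harmonic mean of the reported precision and recall. This is only a sketch: the card does not say how the scores are averaged, and for macro-averaged metrics the overall F1 need not equal this value. Here the rounded figures agree.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above.
print(round(f1_score(0.895, 0.936), 3))  # 0.915
```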
67
+ ![Entity-level metrics](label_metrics_entity.png)
68
+
69
+ ![Confusion matrix](confusion_matrix_entity.png)
70
+
71
+ ## Usage
72
+
73
+ ```python
74
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
75
+
76
+ model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
77
+ tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")
78
+
79
+ ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")
80
+
81
+ text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
82
+ residente in Via Garibaldi 42, 00153 Roma (RM),
83
+ codice fiscale RSSMRA85C15H501Z,
84
+ dichiara di essere titolare del conto corrente
85
+ IBAN IT60X0542811101000000123456 presso Banca Intesa."""
86
+
87
+ entities = ner(text)
88
+ print(entities)
89
+ ```
90
+
91
+ **Output:**
92
+ ```json
93
+ [
94
+ {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
95
+ {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
96
+ {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
97
+ {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
98
+ {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
99
+ {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
100
+ {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
101
+ ]
102
+ ```
103
+
104
+ ## Intended Use
105
+
106
+ Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:
107
+
108
+ - Document anonymization
109
+ - GDPR compliance
110
+ - Data extraction from public administration documents
111
+
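For the anonymization use case, the `start`/`end` offsets returned by the pipeline can be used to mask each detected span. The function below is a minimal sketch under the assumption that spans do not overlap; `redact` and the hand-written `entities` list are hypothetical, and in practice the list would come from `ner(text)`.

```python
def redact(text, entities):
    """Replace each entity span with a [LABEL] placeholder.

    Assumes non-overlapping spans with character offsets into `text`.
    """
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"] :]
    return redacted


# Hand-written example spans; real spans would come from the NER pipeline.
text = "Mario Rossi, nato il 15/03/1985."
entities = [
    {"entity_group": "PERSON", "start": 0, "end": 11},
    {"entity_group": "DATE", "start": 21, "end": 31},
]
print(redact(text, entities))  # [PERSON], nato il [DATE].
```

Because model scores are probabilistic, a threshold on `score` or a human review step may be advisable before relying on the redacted output.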
112
+ ## Limitations
113
+
114
+ - Optimized for formal Italian text (administrative, legal, medical documents)
115
+ - Performance may degrade on informal text, dialects, or non-standard formatting
116
+
117
+ ## Acknowledgements
118
+
119
+ This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.
120
+
121
+ ```bibtex
122
+ @inproceedings{auriemma2023bureauberto,
123
+ title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
124
+ author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
125
+ booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
126
+ series = {CEUR Workshop Proceedings},
127
+ volume = {3486},
128
+ pages = {240--248},
129
+ publisher = {CEUR-WS.org},
130
+ year = {2023},
131
+ url = {https://ceur-ws.org/Vol-3486/42.pdf}
132
+ }
133
+ ```
134
+
135
+ ## Framework Versions
136
+
137
+ - Transformers: 4.57.6
138
+ - PyTorch: 2.11.0
139
+ - Python: 3.13
140
+
141
+ ## License
142
+
143
+ Apache 2.0
config.json ADDED
@@ -0,0 +1,249 @@
1
+ {
2
+ "architectures": [
3
+ "CamembertForTokenClassification"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 5,
7
+ "classifier_dropout": 0.3,
8
+ "dtype": "float32",
9
+ "eos_token_id": 6,
10
+ "hidden_act": "gelu",
11
+ "hidden_dropout_prob": 0.2,
12
+ "hidden_size": 768,
13
+ "id2label": {
14
+ "0": "O",
15
+ "1": "B-ACCOUNT_NUMBER",
16
+ "2": "B-ACT_NUMBER",
17
+ "3": "B-ADDRESS",
18
+ "4": "B-AGE",
19
+ "5": "B-ALTITUDE",
20
+ "6": "B-ATC_CODE",
21
+ "7": "B-ATECO_CODE",
22
+ "8": "B-BIC",
23
+ "9": "B-BLOOD_TYPE",
24
+ "10": "B-CADASTRAL_MAP",
25
+ "11": "B-CADASTRAL_PARCEL",
26
+ "12": "B-CADASTRAL_SHEET",
27
+ "13": "B-CADASTRAL_SUB",
28
+ "14": "B-CARD_NUMBER",
29
+ "15": "B-CHECK_NUMBER",
30
+ "16": "B-CIG_CODE",
31
+ "17": "B-COUNTRY",
32
+ "18": "B-COURT",
33
+ "19": "B-CUP_CODE",
34
+ "20": "B-CVV",
35
+ "21": "B-DATE",
36
+ "22": "B-DATE_RANGE",
37
+ "23": "B-DISEASE",
38
+ "24": "B-DOSAGE",
39
+ "25": "B-DRIVER_LICENSE",
40
+ "26": "B-DURATION",
41
+ "27": "B-EMAIL",
42
+ "28": "B-FISCAL_CODE",
43
+ "29": "B-FORM",
44
+ "30": "B-FREQUENCY",
45
+ "31": "B-GENDER",
46
+ "32": "B-IBAN",
47
+ "33": "B-ICD_CODE",
48
+ "34": "B-IMEI",
49
+ "35": "B-IP",
50
+ "36": "B-ISBN",
51
+ "37": "B-LATITUDE",
52
+ "38": "B-LAW",
53
+ "39": "B-LICENSE_NUMBER",
54
+ "40": "B-LICENSE_PLATE",
55
+ "41": "B-LONGITUDE",
56
+ "42": "B-MAC",
57
+ "43": "B-MARITAL_STATUS",
58
+ "44": "B-MEDICAL_RECORD",
59
+ "45": "B-MEDICINE",
60
+ "46": "B-MONEY_AMOUNT",
61
+ "47": "B-MUNICIPALITY",
62
+ "48": "B-ORGANIZATION",
63
+ "49": "B-OTP_CODE",
64
+ "50": "B-PASSPORT",
65
+ "51": "B-PERCENTAGE",
66
+ "52": "B-PERSON",
67
+ "53": "B-PHONE",
68
+ "54": "B-PIN",
69
+ "55": "B-PROFESSION",
70
+ "56": "B-PROPERTY_REGIME",
71
+ "57": "B-PROTOCOL_NUMBER",
72
+ "58": "B-PROVINCE",
73
+ "59": "B-REA_CODE",
74
+ "60": "B-REGION",
75
+ "61": "B-SDI_CODE",
76
+ "62": "B-TAX_TYPE",
77
+ "63": "B-TIME",
78
+ "64": "B-TIME_RANGE",
79
+ "65": "B-URL",
80
+ "66": "B-UUID",
81
+ "67": "B-VAT_NUMBER",
82
+ "68": "B-VIN",
83
+ "69": "B-YEAR",
84
+ "70": "B-ZIP_CODE",
85
+ "71": "I-ADDRESS",
86
+ "72": "I-AGE",
87
+ "73": "I-BIC",
88
+ "74": "I-BLOOD_TYPE",
89
+ "75": "I-CADASTRAL_MAP",
90
+ "76": "I-CADASTRAL_PARCEL",
91
+ "77": "I-CADASTRAL_SHEET",
92
+ "78": "I-CADASTRAL_SUB",
93
+ "79": "I-CARD_NUMBER",
94
+ "80": "I-COUNTRY",
95
+ "81": "I-COURT",
96
+ "82": "I-DATE",
97
+ "83": "I-DATE_RANGE",
98
+ "84": "I-DISEASE",
99
+ "85": "I-DOSAGE",
100
+ "86": "I-DURATION",
101
+ "87": "I-EMAIL",
102
+ "88": "I-FORM",
103
+ "89": "I-FREQUENCY",
104
+ "90": "I-IBAN",
105
+ "91": "I-LAW",
106
+ "92": "I-LICENSE_NUMBER",
107
+ "93": "I-LICENSE_PLATE",
108
+ "94": "I-MAC",
109
+ "95": "I-MEDICAL_RECORD",
110
+ "96": "I-MEDICINE",
111
+ "97": "I-MONEY_AMOUNT",
112
+ "98": "I-MUNICIPALITY",
113
+ "99": "I-ORGANIZATION",
114
+ "100": "I-PERSON",
115
+ "101": "I-PHONE",
116
+ "102": "I-PROFESSION",
117
+ "103": "I-PROPERTY_REGIME",
118
+ "104": "I-PROVINCE",
119
+ "105": "I-REA_CODE",
120
+ "106": "I-REGION",
121
+ "107": "I-TIME",
122
+ "108": "I-TIME_RANGE"
123
+ },
124
+ "initializer_range": 0.02,
125
+ "intermediate_size": 3072,
126
+ "label2id": {
127
+ "B-ACCOUNT_NUMBER": 1,
128
+ "B-ACT_NUMBER": 2,
129
+ "B-ADDRESS": 3,
130
+ "B-AGE": 4,
131
+ "B-ALTITUDE": 5,
132
+ "B-ATC_CODE": 6,
133
+ "B-ATECO_CODE": 7,
134
+ "B-BIC": 8,
135
+ "B-BLOOD_TYPE": 9,
136
+ "B-CADASTRAL_MAP": 10,
137
+ "B-CADASTRAL_PARCEL": 11,
138
+ "B-CADASTRAL_SHEET": 12,
139
+ "B-CADASTRAL_SUB": 13,
140
+ "B-CARD_NUMBER": 14,
141
+ "B-CHECK_NUMBER": 15,
142
+ "B-CIG_CODE": 16,
143
+ "B-COUNTRY": 17,
144
+ "B-COURT": 18,
145
+ "B-CUP_CODE": 19,
146
+ "B-CVV": 20,
147
+ "B-DATE": 21,
148
+ "B-DATE_RANGE": 22,
149
+ "B-DISEASE": 23,
150
+ "B-DOSAGE": 24,
151
+ "B-DRIVER_LICENSE": 25,
152
+ "B-DURATION": 26,
153
+ "B-EMAIL": 27,
154
+ "B-FISCAL_CODE": 28,
155
+ "B-FORM": 29,
156
+ "B-FREQUENCY": 30,
157
+ "B-GENDER": 31,
158
+ "B-IBAN": 32,
159
+ "B-ICD_CODE": 33,
160
+ "B-IMEI": 34,
161
+ "B-IP": 35,
162
+ "B-ISBN": 36,
163
+ "B-LATITUDE": 37,
164
+ "B-LAW": 38,
165
+ "B-LICENSE_NUMBER": 39,
166
+ "B-LICENSE_PLATE": 40,
167
+ "B-LONGITUDE": 41,
168
+ "B-MAC": 42,
169
+ "B-MARITAL_STATUS": 43,
170
+ "B-MEDICAL_RECORD": 44,
171
+ "B-MEDICINE": 45,
172
+ "B-MONEY_AMOUNT": 46,
173
+ "B-MUNICIPALITY": 47,
174
+ "B-ORGANIZATION": 48,
175
+ "B-OTP_CODE": 49,
176
+ "B-PASSPORT": 50,
177
+ "B-PERCENTAGE": 51,
178
+ "B-PERSON": 52,
179
+ "B-PHONE": 53,
180
+ "B-PIN": 54,
181
+ "B-PROFESSION": 55,
182
+ "B-PROPERTY_REGIME": 56,
183
+ "B-PROTOCOL_NUMBER": 57,
184
+ "B-PROVINCE": 58,
185
+ "B-REA_CODE": 59,
186
+ "B-REGION": 60,
187
+ "B-SDI_CODE": 61,
188
+ "B-TAX_TYPE": 62,
189
+ "B-TIME": 63,
190
+ "B-TIME_RANGE": 64,
191
+ "B-URL": 65,
192
+ "B-UUID": 66,
193
+ "B-VAT_NUMBER": 67,
194
+ "B-VIN": 68,
195
+ "B-YEAR": 69,
196
+ "B-ZIP_CODE": 70,
197
+ "I-ADDRESS": 71,
198
+ "I-AGE": 72,
199
+ "I-BIC": 73,
200
+ "I-BLOOD_TYPE": 74,
201
+ "I-CADASTRAL_MAP": 75,
202
+ "I-CADASTRAL_PARCEL": 76,
203
+ "I-CADASTRAL_SHEET": 77,
204
+ "I-CADASTRAL_SUB": 78,
205
+ "I-CARD_NUMBER": 79,
206
+ "I-COUNTRY": 80,
207
+ "I-COURT": 81,
208
+ "I-DATE": 82,
209
+ "I-DATE_RANGE": 83,
210
+ "I-DISEASE": 84,
211
+ "I-DOSAGE": 85,
212
+ "I-DURATION": 86,
213
+ "I-EMAIL": 87,
214
+ "I-FORM": 88,
215
+ "I-FREQUENCY": 89,
216
+ "I-IBAN": 90,
217
+ "I-LAW": 91,
218
+ "I-LICENSE_NUMBER": 92,
219
+ "I-LICENSE_PLATE": 93,
220
+ "I-MAC": 94,
221
+ "I-MEDICAL_RECORD": 95,
222
+ "I-MEDICINE": 96,
223
+ "I-MONEY_AMOUNT": 97,
224
+ "I-MUNICIPALITY": 98,
225
+ "I-ORGANIZATION": 99,
226
+ "I-PERSON": 100,
227
+ "I-PHONE": 101,
228
+ "I-PROFESSION": 102,
229
+ "I-PROPERTY_REGIME": 103,
230
+ "I-PROVINCE": 104,
231
+ "I-REA_CODE": 105,
232
+ "I-REGION": 106,
233
+ "I-TIME": 107,
234
+ "I-TIME_RANGE": 108,
235
+ "O": 0
236
+ },
237
+ "layer_norm_eps": 1e-05,
238
+ "max_position_embeddings": 514,
239
+ "model_type": "camembert",
240
+ "num_attention_heads": 12,
241
+ "num_hidden_layers": 12,
242
+ "output_past": true,
243
+ "pad_token_id": 1,
244
+ "position_embedding_type": "absolute",
245
+ "transformers_version": "4.57.6",
246
+ "type_vocab_size": 1,
247
+ "use_cache": true,
248
+ "vocab_size": 40310
249
+ }
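The `id2label` and `label2id` maps in this config are meant to be inverses of each other. A small sketch of checking that property is shown below; `maps_are_inverse` is a hypothetical helper, and the three-entry maps are an excerpt for illustration. Note that JSON stores the `id2label` keys as strings, so they are converted to integers before comparison.

```python
def maps_are_inverse(id2label, label2id):
    """True when label2id maps every label back to its integer id."""
    if len(id2label) != len(label2id):
        return False
    return all(
        label2id.get(label) == int(idx) for idx, label in id2label.items()
    )


# Excerpt of the maps above, with string keys as they appear in JSON.
id2label = {"0": "O", "52": "B-PERSON", "100": "I-PERSON"}
label2id = {"B-PERSON": 52, "I-PERSON": 100, "O": 0}
print(maps_are_inverse(id2label, label2id))  # True
```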
confusion_matrix_entity.png ADDED

Git LFS Details

  • SHA256: 38fb44c2b69b54a84f8b47fdc200445b8cdd633b6d79dba4419b8860c9a20691
  • Pointer size: 131 Bytes
  • Size of remote file: 182 kB
label_metrics_entity.png ADDED

Git LFS Details

  • SHA256: 385c0bafdd0128874adf05081a1d4d4e8b386284c5020edd0422973710a0e7a8
  • Pointer size: 131 Bytes
  • Size of remote file: 169 kB
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8167f2d881ffb666d82418da49fa395021e7ed44a9a7899fb9f0651a7f3e7690
3
+ size 465997628
special_tokens_map.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<s>NOTUSED",
4
+ "</s>NOTUSED"
5
+ ],
6
+ "bos_token": {
7
+ "content": "<s>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false
12
+ },
13
+ "cls_token": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "eos_token": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
+ "mask_token": {
28
+ "content": "<mask>",
29
+ "lstrip": true,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ },
34
+ "pad_token": {
35
+ "content": "<pad>",
36
+ "lstrip": false,
37
+ "normalized": false,
38
+ "rstrip": false,
39
+ "single_word": false
40
+ },
41
+ "sep_token": {
42
+ "content": "</s>",
43
+ "lstrip": false,
44
+ "normalized": false,
45
+ "rstrip": false,
46
+ "single_word": false
47
+ },
48
+ "unk_token": {
49
+ "content": "<unk>",
50
+ "lstrip": false,
51
+ "normalized": false,
52
+ "rstrip": false,
53
+ "single_word": false
54
+ }
55
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff