---
language:
  - it
license: apache-2.0
tags:
  - token-classification
  - ner
  - italian
  - transformers
  - pytorch
datasets:
  - custom
metrics:
  - f1
  - precision
  - recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
  - text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
    example_title: "Documento anagrafico"
  - text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
    example_title: "Documento medico"
---

# Nerone: Italian NER for Sensitive Data

Nerone is a named entity recognition (NER) model that extracts and classifies sensitive personal information in Italian documents.

## Model Description

Nerone is a fine-tuned version of [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), an Italian language model adapted from UmBERTo to bureaucratic text. It performs token classification over 70 entity types:

- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION
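
The grouping above can be turned into a lookup table for downstream filtering. The following is a minimal sketch, not part of the released model: the names `CATEGORIES`, `LABEL_TO_CATEGORY`, and `group_by_category` are hypothetical helpers, and the sample entities are hand-written dictionaries in the shape the `transformers` pipeline returns with an aggregation strategy.

```python
from collections import defaultdict

# Category -> labels, copied from the list in this model card.
CATEGORIES = {
    "Personal": ["PERSON", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION",
                 "BLOOD_TYPE", "FISCAL_CODE"],
    "Geographic": ["ADDRESS", "COUNTRY", "REGION", "PROVINCE", "MUNICIPALITY",
                   "ZIP_CODE", "LATITUDE", "LONGITUDE", "ALTITUDE"],
    "Contact": ["PHONE", "EMAIL", "URL"],
    "Financial": ["MONEY_AMOUNT", "PERCENTAGE", "CARD_NUMBER", "CVV",
                  "CHECK_NUMBER", "ACCOUNT_NUMBER", "IBAN", "BIC",
                  "VAT_NUMBER", "TAX_TYPE"],
    "Medical": ["DISEASE", "MEDICINE", "DOSAGE", "FORM", "MEDICAL_RECORD"],
    "Legal/Administrative": ["PASSPORT", "DRIVER_LICENSE", "LICENSE_NUMBER",
                             "LICENSE_PLATE", "LAW", "COURT", "ACT_NUMBER",
                             "PROTOCOL_NUMBER", "PROPERTY_REGIME"],
    "Cadastral": ["CADASTRAL_SHEET", "CADASTRAL_PARCEL", "CADASTRAL_MAP",
                  "CADASTRAL_SUB"],
    "Technical": ["IP", "IMEI", "MAC", "UUID", "VIN", "OTP_CODE", "PIN"],
    "Codes": ["ISBN", "CIG_CODE", "CUP_CODE", "REA_CODE", "SDI_CODE",
              "ATC_CODE", "ATECO_CODE", "ICD_CODE"],
    "Temporal": ["DATE", "DATE_RANGE", "TIME", "TIME_RANGE", "YEAR",
                 "DURATION", "FREQUENCY"],
    "Misc": ["ORGANIZATION"],
}

# Reverse index: label -> category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in CATEGORIES.items()
    for label in labels
}


def group_by_category(entities):
    """Group entity surface forms by the category of their label."""
    grouped = defaultdict(list)
    for ent in entities:
        category = LABEL_TO_CATEGORY.get(ent["entity_group"], "Unknown")
        grouped[category].append(ent["word"])
    return dict(grouped)


# Hand-written sample entities (illustrative, not real model output).
sample = [
    {"entity_group": "PERSON", "word": "Mario Rossi"},
    {"entity_group": "IBAN", "word": "IT60X0542811101000000123456"},
    {"entity_group": "MUNICIPALITY", "word": "Roma"},
]
print(group_by_category(sample))
```

A table like this makes it easy to apply different policies per category, for example redacting every Financial entity while keeping Temporal ones.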

## Dataset

- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents

## Training

- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256
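
Because the model was trained with a maximum sequence length of 256 tokens, long documents may need to be split before inference. The sketch below is one possible approach, not the authors' preprocessing: it cuts the text into overlapping character windows and shifts entity offsets back to document coordinates. The function names and the `max_chars` and `overlap` defaults are assumptions; the limit is expressed in tokens, so the character budget must be tuned against the tokenizer.

```python
def split_into_windows(text, max_chars=800, overlap=100):
    """Yield (offset, chunk) pairs covering `text` with overlapping windows.

    `max_chars` and `overlap` are illustrative values: the model limit is in
    tokens, not characters, so check chunk lengths with the tokenizer.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        yield start, text[start:end]
        if end == len(text):
            break
        # Step back by `overlap` so an entity cut at a window edge
        # appears whole in the next window.
        start = end - overlap


def shift_entities(entities, offset):
    """Translate chunk-relative start/end offsets to document offsets."""
    return [
        {**ent, "start": ent["start"] + offset, "end": ent["end"] + offset}
        for ent in entities
    ]


windows = list(split_into_windows("a" * 2000))
print([offset for offset, _ in windows])
```

Entities found in the overlapping region can be reported twice, so results from adjacent windows should be deduplicated on their `(start, end)` span after shifting.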

## Evaluation Results

| Metric    | Score |
|-----------|-------|
| F1        | 0.915 |
| Precision | 0.895 |
| Recall    | 0.936 |
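
As a quick consistency check, the reported F1 matches the harmonic mean of the reported precision and recall. This sketch only re-derives the number from the table; it assumes all three scores use the same averaging scheme, which the card does not state.

```python
# Values taken from the evaluation table.
precision = 0.895
recall = 0.936

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.915
```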

![Entity-level metrics](label_metrics_entity.png)

![Confusion matrix](confusion_matrix_entity.png)

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges subword tokens into whole-word entities,
# labelling each word with the tag of its first token.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

# Each result is a dict with entity_group, score, word, start, and end.
entities = ner(text)
print(entities)
```

**Output:**
```json
[
  {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
  {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
  {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
  {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```

## Intended Use

Nerone is designed to identify and classify sensitive personal data in Italian administrative and legal documents. Primary use cases:

- Document anonymization
- GDPR compliance
- Data extraction from public administration documents
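
For the anonymization use case, the character offsets returned by the pipeline can drive a simple redaction step. The following is a minimal sketch under stated assumptions: `redact` and its `min_score` threshold are hypothetical, and the sample entities are hand-written with offsets that match the sample string, not real model output.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `min_score` is an illustrative threshold; tune it for your data.
    """
    kept = [ent for ent in entities if ent["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape.
text = "Mario Rossi vive a Roma."
entities = [
    {"entity_group": "PERSON", "score": 0.99, "start": 0, "end": 11},
    {"entity_group": "MUNICIPALITY", "score": 0.98, "start": 19, "end": 23},
]
print(redact(text, entities))  # [PERSON] vive a [MUNICIPALITY].
```

Replacing spans from the end of the string backwards avoids recomputing offsets after each substitution. Overlapping spans are not handled here and would need to be merged first.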

## Limitations

- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting

## Acknowledgements

This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.

```bibtex
@inproceedings{auriemma2023bureauberto,
  title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series = {CEUR Workshop Proceedings},
  volume = {3486},
  pages = {240--248},
  publisher = {CEUR-WS.org},
  year = {2023},
  url = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```

## Framework Versions

- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13

## License

Apache 2.0