---
language:
- it
license: apache-2.0
tags:
- token-classification
- ner
- italian
- transformers
- pytorch
datasets:
- custom
metrics:
- f1
- precision
- recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
- text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
  example_title: "Documento anagrafico"
- text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
  example_title: "Documento medico"
---
# Nerone: Italian NER for Sensitive Data
A Named Entity Recognition (NER) model that extracts and classifies sensitive personal information in Italian documents.
## Model Description
Fine-tuned [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), a RoBERTa-style Italian language model adapted from UmBERTo, for token classification with 70 entity types:
- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION
## Dataset
- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents
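The card does not say how the split was produced. As a rough sketch, assuming a seeded random shuffle over sample indices (the function name `split_indices` and the seed are illustrative, not taken from the training code), a 70/15/15 partition of 122,625 samples could look like this:

```python
import random


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists.

    Hypothetical helper: the real split procedure is not documented.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test


train, val, test = split_indices(122_625)
print(len(train), len(val), len(test))
```

Because 122,625 is not evenly divisible by these fractions, the leftover samples land in the test split in this sketch.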
## Training
- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256
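Since training used a maximum sequence length of 256, long documents may need to be split before inference. The following is a minimal sketch, not part of the released code: it splits on line boundaries and shifts entity offsets back to the original text. `max_chars` is a hypothetical character budget standing in for a real token count, and `ner` is assumed to be any callable that returns dicts with `start` and `end` keys, like the pipeline shown under Usage.

```python
def iter_chunks(text, max_chars=1000):
    """Yield (offset, chunk) pairs, cutting only at line boundaries.

    A single line longer than max_chars is still yielded as one chunk.
    """
    offset = 0
    buffer = ""
    for line in text.splitlines(keepends=True):
        if buffer and len(buffer) + len(line) > max_chars:
            yield offset, buffer
            offset += len(buffer)
            buffer = ""
        buffer += line
    if buffer:
        yield offset, buffer


def ner_long_text(text, ner, max_chars=1000):
    """Run `ner` chunk by chunk and map spans back to the full text."""
    entities = []
    for offset, chunk in iter_chunks(text, max_chars):
        for ent in ner(chunk):
            shifted = dict(ent)  # copy so the original result is not mutated
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            entities.append(shifted)
    return entities
```

An entity that spans a line break can be cut in two by this approach, so overlapping windows may be a better fit for documents with multi-line addresses.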
## Evaluation Results
| Metric | Score |
|-----------|-------|
| F1 | 0.915 |
| Precision | 0.895 |
| Recall | 0.936 |
![Entity-level metrics](label_metrics_entity.png)
![Confusion matrix](confusion_matrix_entity.png)
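As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below assumes the reported scores are aggregate values computed over the same set of predictions (for example micro-averaged); with per-class macro averaging the relation would not need to hold exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above
print(round(f1_score(0.895, 0.936), 3))  # 0.915
```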
## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole-word spans,
# labelling each word with the tag of its first token
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

# Each result is a dict with entity_group, score, word, start and end
entities = ner(text)
print(entities)
```
**Output:**
```json
[
{"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
{"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
{"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
{"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
{"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
{"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
{"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```
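The pipeline returns a flat list, so a common follow-up step is to group results by entity type and drop low-confidence predictions. A minimal sketch, where `group_entities` and the `min_score` threshold are illustrative choices rather than part of the model's API, and the sample list is hand-written (the low-scoring entry is invented for the example):

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Collect entity strings per type, keeping only confident predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written sample in the same shape as the pipeline output
sample = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi"},
    {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456"},
    {"entity_group": "ORGANIZATION", "score": 0.30, "word": "Banca"},  # invented low score
]
print(group_entities(sample))
```

A suitable threshold depends on the use case: anonymization usually favours recall (a lower threshold), while data extraction may favour precision.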
## Intended Use
Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:
- Document anonymization
- Supporting GDPR compliance by locating personal data in documents
- Data extraction from public administration documents
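For the anonymization use case, the `start` and `end` offsets in the pipeline output can drive a simple redaction step. The sketch below is one possible approach, not a shipped utility: `redact` is a hypothetical helper, the entity list is hand-written for illustration, and spans are assumed not to overlap.

```python
def redact(text, entities):
    """Replace each entity span with a [LABEL] placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    while the string changes length. Assumes spans do not overlap.
    """
    redacted = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


# Hand-written spans in the same shape as the pipeline output
text = "Mario Rossi vive a Roma."
entities = [
    {"entity_group": "PERSON", "start": 0, "end": 11},
    {"entity_group": "MUNICIPALITY", "start": 19, "end": 23},
]
print(redact(text, entities))  # [PERSON] vive a [MUNICIPALITY].
```

It is worth checking the offsets the tokenizer actually reports before redacting, since a span that includes surrounding whitespace would also remove that whitespace.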
## Limitations
- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting
## Acknowledgements
This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.
```bibtex
@inproceedings{auriemma2023bureauberto,
title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
series = {CEUR Workshop Proceedings},
volume = {3486},
pages = {240--248},
publisher = {CEUR-WS.org},
year = {2023},
url = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```
## Framework Versions
- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13
## License
Apache 2.0