---
language:
  - it
license: apache-2.0
tags:
  - token-classification
  - ner
  - italian
  - transformers
  - pytorch
datasets:
  - custom
metrics:
  - f1
  - precision
  - recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
  - text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
    example_title: "Documento anagrafico"
  - text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
    example_title: "Documento medico"
---

# Nerone: Italian NER for Sensitive Data

Nerone is a named entity recognition (NER) model that extracts and classifies sensitive personal information in Italian documents.

## Model Description

Nerone is [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), a RoBERTa-based Italian model adapted from UmBERTo, fine-tuned for token classification over 70 entity types:

- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION

## Dataset

- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents

## Training

- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256

## Evaluation Results

| Metric    | Score |
|-----------|-------|
| F1        | 0.915 |
| Precision | 0.895 |
| Recall    | 0.936 |

![Entity-level metrics](label_metrics_entity.png)

![Confusion matrix](confusion_matrix_entity.png)
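
The aggregate scores in the table are mutually consistent: F1 is the harmonic mean of precision and recall. The snippet below is a quick sanity check of that relationship using the reported values. The averaging scheme (micro vs. macro) is not stated in this card, so treat this as an arithmetic check rather than a description of how the metrics were computed.

```python
# Reported aggregate scores from the evaluation table
precision = 0.895
recall = 0.936

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.915, matching the reported F1
```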

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole words and
# labels each word with the prediction of its first token
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

# Each result is a dict with entity_group, score, word, start and end
entities = ner(text)
print(entities)
```

**Output** (formatted for readability):
```json
[
  {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
  {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
  {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
  {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```
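
The list of dicts returned by the pipeline is often easier to work with once it is grouped by entity type. The following is a minimal sketch of such post-processing, not part of the model itself: `group_entities` and the `min_score` threshold are hypothetical names introduced for illustration, and the sample list is copied by hand from the output above so the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group entity texts by label, dropping predictions below min_score.

    Hypothetical helper: the 0.5 default is an arbitrary illustration value.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-copied subset of the example output above
sample_entities = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi"},
    {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z"},
    {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456"},
]

grouped = group_entities(sample_entities)
strict = group_entities(sample_entities, min_score=0.995)

print(grouped)
print(strict)
```

Raising `min_score` trades recall for precision: in the sample above, a threshold of 0.995 drops the IBAN prediction, which scored 0.99.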

## Intended Use

Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:

- Document anonymization
- Supporting GDPR compliance workflows
- Data extraction from public administration documents
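
For the anonymization use case, the `start` and `end` character offsets in the pipeline output can be used to replace each detected span with a placeholder. The sketch below assumes non-overlapping spans and uses a hypothetical `redact` helper with hand-written sample entities, so that it runs without loading the model. In practice the entities would come from the pipeline call shown in the Usage section.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    Hypothetical helper; assumes spans do not overlap.
    """
    # Work from the last span to the first so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written sample with offsets computed for this exact string
sample_text = "Mario Rossi, codice fiscale RSSMRA85C15H501Z."
sample_spans = [
    {"entity_group": "PERSON", "start": 0, "end": 11},
    {"entity_group": "FISCAL_CODE", "start": 28, "end": 44},
]

redacted = redact(sample_text, sample_spans)
print(redacted)  # [PERSON], codice fiscale [FISCAL_CODE].
```

Replacing spans from right to left avoids recomputing offsets after each substitution, since a placeholder rarely has the same length as the text it replaces.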

## Limitations

- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting

## Acknowledgements

This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.

```bibtex
@inproceedings{auriemma2023bureauberto,
  title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series = {CEUR Workshop Proceedings},
  volume = {3486},
  pages = {240--248},
  publisher = {CEUR-WS.org},
  year = {2023},
  url = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```

## Framework Versions

- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13

## License

Apache 2.0