Nerone: Italian NER for Sensitive Data

Nerone is a named entity recognition (NER) model that extracts and classifies sensitive personal information in Italian documents.

Model Description

Nerone fine-tunes BureauBERTo (an Italian language model adapted from the RoBERTa-based UmBERTo) for token classification over 70 entity types:

  • Personal: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
  • Geographic: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
  • Contact: PHONE, EMAIL, URL
  • Financial: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
  • Medical: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
  • Legal/Administrative: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
  • Cadastral: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
  • Technical: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
  • Codes: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
  • Temporal: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
  • Misc: ORGANIZATION

Dataset

  • Total samples: 122,625
  • Split: 70% train / 15% validation / 15% test
  • Source: Italian administrative documents
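The split procedure is not described here, so the following is only a minimal sketch of how a 70/15/15 partition of 122,625 samples could be produced. The function name `split_indices`, the shuffling step, and the seed are illustrative assumptions, not the authors' actual pipeline.

```python
import random

TOTAL_SAMPLES = 122_625  # dataset size stated above


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary assumption
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test


train_idx, val_idx, test_idx = split_indices(TOTAL_SAMPLES)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because 122,625 is not evenly divisible by these fractions, the exact per-split counts depend on the rounding rule; here the test split simply takes whatever is left over.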

Training

  • Base model: colinglab/BureauBERTo
  • Learning rate: 4e-5
  • Batch size: 32
  • Max sequence length: 256
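Since the model was trained with a maximum sequence length of 256, documents longer than that presumably need to be split before inference. The sketch below shows one common approach, overlapping sliding windows over a token list. The helper `sliding_windows` and the overlap of 64 tokens are hypothetical choices, and the sketch does not reserve room for the special tokens that the tokenizer adds.

```python
def sliding_windows(tokens, max_len=256, stride=64):
    """Split a token list into overlapping windows of at most max_len tokens.

    Returns (start_offset, window) pairs; consecutive windows share
    `stride` tokens so that entities near a boundary appear whole in
    at least one window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


tokens = [f"tok{i}" for i in range(600)]  # stand-in for a tokenized document
for start, window in sliding_windows(tokens):
    print(start, len(window))
```

In practice, Hugging Face fast tokenizers can produce such windows directly through the `return_overflowing_tokens` and `stride` arguments, which also handle special tokens.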

Evaluation Results

Metric      Score
F1          0.915
Precision   0.895
Recall      0.936
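As a quick sanity check, F1 is the harmonic mean of precision and recall, and the reported values are consistent with that formula. This assumes all three scores are aggregated the same way (for example micro-averaged over entities), which is not stated here.

```python
precision = 0.895
recall = 0.936

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.915
```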

[Figure: entity-level metrics]

[Figure: confusion matrix]

Usage

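from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole words and
# labels each word with the tag predicted for its first token.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")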

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

entities = ner(text)
print(entities)

Output:

[
  {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
  {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
  {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
  {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
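The pipeline returns a flat list of dictionaries, so downstream code often regroups it by entity type and drops low-confidence predictions. The following is a minimal sketch of that post-processing; `group_entities` and the `min_score` threshold of 0.5 are hypothetical and not part of the model or of Transformers.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Map each entity_group to the list of words found, skipping low scores."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Three entries copied from the output above.
sample = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
    {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
    {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
]
print(group_entities(sample))
```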

Intended Use

Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:

  • Document anonymization
  • GDPR compliance
  • Data extraction from public administration documents
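For the anonymization use case, the `start` and `end` offsets in the pipeline output can drive span replacement. The sketch below is one simple way to do it, assuming non-overlapping spans; the `anonymize` and `span` helpers are hypothetical, and the entities are built by hand rather than produced by the model.

```python
def anonymize(text, entities):
    """Replace each entity span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


text = "Mario Rossi, nato a Roma il 15/03/1985."


def span(label, value):
    """Build a hand-made entity dict by locating `value` in `text`."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


entities = [
    span("PERSON", "Mario Rossi"),
    span("MUNICIPALITY", "Roma"),
    span("DATE", "15/03/1985"),
]
print(anonymize(text, entities))  # [PERSON], nato a [MUNICIPALITY] il [DATE].
```

Replacing spans with their labels keeps the document readable; a production workflow might instead use consistent pseudonyms or hashing, and would need a policy for entities the model misses.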

Limitations

  • Optimized for formal Italian text (administrative, legal, medical documents)
  • Performance may degrade on informal text, dialects, or non-standard formatting

Acknowledgements

This model is fine-tuned from BureauBERTo, developed by CoLingLab at the University of Pisa. BureauBERTo adapts UmBERTo to Italian bureaucratic and administrative language.

@inproceedings{auriemma2023bureauberto,
  title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series = {CEUR Workshop Proceedings},
  volume = {3486},
  pages = {240--248},
  publisher = {CEUR-WS.org},
  year = {2023},
  url = {https://ceur-ws.org/Vol-3486/42.pdf}
}

Framework Versions

  • Transformers: 4.57.6
  • PyTorch: 2.11.0
  • Python: 3.13

License

Apache 2.0
