Nerone: Italian NER for Sensitive Data

Nerone is a Named Entity Recognition (NER) model that extracts and classifies sensitive personal information in Italian documents.

Model Description

BureauBERTo (an Italian BERT variant) fine-tuned for token classification across 101 entity types grouped into 13 categories. With BIO tagging, each entity type has a B- and an I- label, for a total of 203 token-level labels (2 × 101 + 1, including O).
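As a rough illustration of how a BIO label set is derived from a list of entity types, the sketch below builds the labels for a three-type subset. The helper name `build_bio_labels` and the label ordering are assumptions made for this example; the model's actual mapping is whatever its config stores in `id2label`.

```python
# Illustrative subset of the entity types listed in this card.
entity_types = ["PERSON", "FISCAL_CODE", "IBAN"]

def build_bio_labels(types):
    """Return the BIO label list: "O" plus a B- and an I- label per type."""
    labels = ["O"]
    for t in types:
        labels.append(f"B-{t}")
        labels.append(f"I-{t}")
    return labels

labels = build_bio_labels(entity_types)
label2id = {label: i for i, label in enumerate(labels)}

print(labels)
# ['O', 'B-PERSON', 'I-PERSON', 'B-FISCAL_CODE', 'I-FISCAL_CODE', 'B-IBAN', 'I-IBAN']
print(len(labels))  # 2 * 3 + 1 = 7
```

In general, n entity types yield 2n + 1 token-level labels under this scheme.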

Entity Types

Personal (9)

Entity	Description
PERSON	Full name of an individual
AGE	A person's age
GENDER	Gender / sex
MARITAL_STATUS	Marital status (e.g., coniugato, celibe)
PROFESSION	Job title or occupation
BLOOD_TYPE	Blood group (e.g., A+, 0-)
FISCAL_CODE	Italian personal tax code (codice fiscale)
ID_CARD_NUMBER	Identity card number
HEALTH_CARD_NUMBER	National health card number (tessera sanitaria)

Geographic (9)

Entity	Description
ADDRESS	Full street / postal address
COUNTRY	Country name
REGION	Administrative region
PROVINCE	Province name or two-letter code (e.g., RM)
MUNICIPALITY	City, town or comune
ZIP_CODE	Postal code (CAP)
LATITUDE	Geographic latitude
LONGITUDE	Geographic longitude
ALTITUDE	Elevation above sea level

Contact (3)

Entity	Description
PHONE	Telephone or mobile number
EMAIL	Email address
URL	Web address

Social (2)

Entity	Description
HASHTAG	Social media hashtag (#...)
MENTION	Social media mention (@...)

Financial (15)

Entity	Description
MONEY_AMOUNT	Monetary amount
PERCENTAGE	Percentage value
CARD_NUMBER	Payment card number
CVV	Card verification code
CHECK_NUMBER	Bank cheque number
ACCOUNT_NUMBER	Bank account number
IBAN	International Bank Account Number
BIC	Bank Identifier Code (SWIFT)
VAT_NUMBER	VAT registration number (partita IVA)
TAX_TYPE	Type of tax or levy (e.g., IMU, IRPEF)
TAX_CODE	Tax payment code (codice tributo, used on the F24 form)
ABI_CODE	Italian bank identifier (ABI), 5 digits
CAB_CODE	Italian bank branch identifier (CAB), 5 digits
ISIN_CODE	International Securities Identification Number
LEI_CODE	Legal Entity Identifier

Medical (7)

Entity	Description
DISEASE	Disease or medical condition
MEDICINE	Drug / medication name
DOSAGE	Drug dosage (e.g., 1000mg)
FORM	Pharmaceutical form (e.g., compressa, sciroppo)
MEDICAL_RECORD	Medical record / chart identifier
DRG_CODE	Diagnosis-Related Group code
HEALTH_DISTRICT_CODE	Local health authority / district code (ASL)

Legal / Administrative (8)

Entity	Description
PASSPORT	Passport number
DRIVER_LICENSE	Driving licence number
LICENSE_PLATE	Vehicle registration plate
LAW	Reference to a law, decree or regulation
COURT	Court or tribunal name
ACT_NUMBER	Administrative act / deed number
PROTOCOL_NUMBER	Document protocol number
PROPERTY_REGIME	Matrimonial property regime

Cadastral (4)

Entity	Description
CADASTRAL_SHEET	Cadastral sheet (foglio)
CADASTRAL_PARCEL	Cadastral parcel (particella)
CADASTRAL_MAP	Cadastral map reference
CADASTRAL_SUB	Cadastral sub-unit (subalterno)

Technical (8)

Entity	Description
IP	IP address
IMEI	Mobile device IMEI
MAC	MAC (hardware) address
UUID	Universally unique identifier
VIN	Vehicle identification number
OTP_CODE	One-time password / code
PIN	Personal identification number
BARCODE	Product barcode (EAN/UPC)

Codes & Standards (17)

Entity	Description
ISBN	Book identifier
CIG_CODE	Tender identifier (Codice Identificativo Gara)
CUP_CODE	Public investment project code (Codice Unico di Progetto)
REA_CODE	Business registry number (Repertorio Economico Amministrativo)
SDI_CODE	E-invoicing recipient code (Sistema di Interscambio)
ATC_CODE	Anatomical Therapeutic Chemical drug code
ATECO_CODE	Italian economic activity classification code
ICD_CODE	International Classification of Diseases code
CPV_CODE	Common Procurement Vocabulary code (EU procurement)
NUTS_CODE	EU territorial unit classification code
ISTAT_CODE	ISTAT territorial / municipality code
ISO	ISO standard reference (e.g., ISO 27001)
IEC	IEC/CEI technical standard reference (e.g., IEC 60950)
LOT_NUMBER	Batch / lot number
FLIGHT_NUMBER	Flight number
POD_CODE	Electricity supply point code (Point of Delivery)
PDR_CODE	Gas delivery point code (Punto di Riconsegna)

Measurements (11)

Entity	Description
AREA	Surface area (e.g., m², ettari)
DISTANCE	Length / distance (e.g., km, m)
ENERGY	Energy quantity (e.g., kWh)
FILE_SIZE	Digital file size (e.g., MB, GB)
POWER	Power (e.g., kW, CV)
PRESSURE	Pressure (e.g., bar, Pa)
QUANTITY	Counted quantity with unit (e.g., 20 pezzi)
SPEED	Speed (e.g., km/h)
TEMPERATURE	Temperature (e.g., °C)
VOLUME	Volume (e.g., L, m³)
WEIGHT	Weight / mass (e.g., kg, g)

Temporal (7)

Entity	Description
DATE	Calendar date
DATE_RANGE	Range between two dates
TIME	Time of day
TIME_RANGE	Range between two times
YEAR	Year
DURATION	Length of time (e.g., 5 giorni)
FREQUENCY	Recurrence frequency (e.g., due volte al giorno)

Misc (1)

Entity	Description
ORGANIZATION	Company, institution or public body

Dataset

Total samples: 530,075
Split: 70% train (371,053) / 15% validation (79,511) / 15% test (79,511)
Source: Italian administrative documents
Class weights computed to compensate for label imbalance.
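The card does not say which weighting scheme was used. One common choice is inverse-frequency ("balanced") weighting, sketched below on invented toy labels; the function name `inverse_frequency_weights` is hypothetical and this is not necessarily how the model was trained.

```python
from collections import Counter

def inverse_frequency_weights(label_sequences):
    """Weight each label by total / (num_labels * count), so rare labels weigh more."""
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    num_labels = len(counts)
    return {label: total / (num_labels * n) for label, n in counts.items()}

# Toy token-level labels, invented for illustration only.
toy = [
    ["O", "O", "B-PERSON", "I-PERSON", "O", "O"],
    ["O", "B-IBAN", "O", "O", "O", "O"],
]
weights = inverse_frequency_weights(toy)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<10} {w:.2f}")
```

Here the dominant O label (9 of 12 tokens) gets a weight of about 0.33, while each label seen once gets 3.0. Weights like these are typically passed to a weighted cross-entropy loss.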

Training

Base model: colinglab/BureauBERTo
Learning rate: 4e-5
Batch size: 32
Max sequence length: 256
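With a maximum sequence length of 256 tokens, longer documents may need to be split before inference. The sketch below is one possible approach, not part of the model: it cuts text into overlapping windows of whitespace-delimited words and keeps each window's character offset. The names `chunk_text`, `max_words` and `overlap` are hypothetical. Words are only a rough proxy for tokens, since subword tokenization usually produces more tokens than words, so the default window is kept well below 256.

```python
import re

def chunk_text(text, max_words=100, overlap=20):
    """Split text into overlapping word windows, keeping character offsets.

    Returns a list of (char_offset, chunk) pairs so that entity offsets
    found in a chunk can be mapped back to the full text.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = list(re.finditer(r"\S+", text))
    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0].start(), window[-1].end()
        chunks.append((start, text[start:end]))
        if i + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks

doc = " ".join(f"parola{i}" for i in range(250))
chunks = chunk_text(doc)
print(len(chunks))  # 3 windows: words 0-99, 80-179, 160-249
```

To map an entity found in a chunk back to the full document, add the chunk's character offset to its `start` and `end`. Entities that fall in the overlap can be reported twice and need deduplication.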

Evaluation Results

Test set (entity-level, micro avg):

Metric	Score
F1	0.932
Precision	0.914
Recall	0.950
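For readers unfamiliar with entity-level micro averaging, the sketch below scores predictions by exact match on span boundaries and type, pooled over all entity types. The exact-match criterion and the helper name `micro_prf` are assumptions; the card does not state which scorer produced the figures above. The toy spans are invented.

```python
def micro_prf(gold, pred):
    """Entity-level micro precision/recall/F1 over (doc_id, start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # a prediction counts only if span and type both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy spans invented for illustration.
gold = {(0, 16, 28, "PERSON"), (0, 37, 43, "MUNICIPALITY"), (0, 47, 57, "DATE")}
pred = {(0, 16, 28, "PERSON"), (0, 37, 43, "MUNICIPALITY"),
        (0, 47, 55, "DATE"), (0, 60, 65, "YEAR")}
p, r, f1 = micro_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571

# Consistency check on the reported scores: F1 is the harmonic mean of P and R.
reported_p, reported_r = 0.914, 0.950
print(round(2 * reported_p * reported_r / (reported_p + reported_r), 3))  # 0.932
```

The truncated DATE span counts as both a false positive and a false negative, which is why boundary errors are penalized heavily under exact matching.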

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

def extract(text):
    for e in ner(text):
        # Use character offsets to recover clean spans (the raw `word`
        # field is unreliable with this SentencePiece tokenizer).
        span = text[e["start"]:e["end"]].strip(" ,.;:")
        print(f"{e['entity_group']:<16} {span}")

Example 1: Personal data (anagrafica) document

text = (
    "La sottoscritta Giulia Verdi, nata a Torino il 22/07/1990, "
    "residente in Corso Vittorio Emanuele 18, 10123 Torino (TO), "
    "codice fiscale VRDGLI90L62L219K, email giulia.verdi@example.it."
)
extract(text)

Raw output of ner(text); extract(text) prints one label and span per line:

[
  {"entity_group": "PERSON", "score": 1.0, "word": "Giulia Verdi", "start": 16, "end": 28},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Torino", "start": 37, "end": 43},
  {"entity_group": "DATE", "score": 1.0, "word": "22/07/1990", "start": 47, "end": 57},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Corso Vittorio Emanuele 18, 10123 Torino (TO)", "start": 72, "end": 117},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "VRDGLI90L62L219K", "start": 134, "end": 150},
  {"entity_group": "EMAIL", "score": 1.0, "word": "giulia.verdi@example.it", "start": 158, "end": 181}
]

Example 2: Medical report

text = (
    "Paziente: Anna Conti, 45 anni. Diagnosi: bronchite acuta. "
    "Terapia: Amoxicillina 1 g ogni 8 ore per 7 giorni."
)
extract(text)

Raw output of ner(text); extract(text) prints one label and span per line:

[
  {"entity_group": "PERSON", "score": 1.0, "word": "Anna Conti", "start": 10, "end": 20},
  {"entity_group": "AGE", "score": 1.0, "word": "45 anni", "start": 22, "end": 29},
  {"entity_group": "DISEASE", "score": 1.0, "word": "bronchite acuta", "start": 41, "end": 56},
  {"entity_group": "MEDICINE", "score": 1.0, "word": "Amoxicillina", "start": 67, "end": 79},
  {"entity_group": "FREQUENCY", "score": 1.0, "word": "ogni 8 ore", "start": 84, "end": 94},
  {"entity_group": "DURATION", "score": 0.93, "word": "7 giorni", "start": 99, "end": 107}
]

Intended Use

Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:

Document anonymization
GDPR compliance
Data extraction from public administration documents
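For the anonymization use case, a minimal sketch of span redaction is shown below. It assumes entities in the shape returned by the pipeline (`entity_group`, `start`, `end`) and that spans do not overlap. The `redact` helper and the placeholder format are hypothetical, and the entity list is written by hand rather than produced by the model.

```python
def redact(text, entities, placeholder="[{label}]"):
    """Replace each entity span with a placeholder, working right to left
    so that the character offsets of earlier spans stay valid."""
    out = text
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = placeholder.format(label=e["entity_group"])
        out = out[:e["start"]] + label + out[e["end"]:]
    return out

text = (
    "La sottoscritta Giulia Verdi, nata a Torino il 22/07/1990, "
    "codice fiscale VRDGLI90L62L219K."
)
# Hand-written entities in the pipeline's output shape (illustration only).
entities = [
    {"entity_group": "PERSON", "start": 16, "end": 28},
    {"entity_group": "MUNICIPALITY", "start": 37, "end": 43},
    {"entity_group": "DATE", "start": 47, "end": 57},
    {"entity_group": "FISCAL_CODE", "start": 74, "end": 90},
]
redacted = redact(text, entities)
print(redacted)
# La sottoscritta [PERSON], nata a [MUNICIPALITY] il [DATE], codice fiscale [FISCAL_CODE].
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution.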

Limitations

Optimized for formal Italian text (administrative, legal, medical documents)
Performance may degrade on informal text, dialects, or non-standard formatting
Lower-frequency or harder entity types (e.g. DOSAGE, MEDICAL_RECORD, FREQUENCY) show weaker scores than high-volume types
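One way to work around the weaker entity types is to send low-confidence predictions to manual review instead of accepting them automatically. The sketch below illustrates the idea; the threshold values, the names `DEFAULT_THRESHOLD`, `PER_TYPE_THRESHOLD` and `triage`, and the sample entities are all hypothetical and would need tuning on validation data.

```python
DEFAULT_THRESHOLD = 0.80   # hypothetical value, tune on validation data
PER_TYPE_THRESHOLD = {     # hypothetical stricter values for weaker types
    "DOSAGE": 0.90,
    "MEDICAL_RECORD": 0.90,
    "FREQUENCY": 0.90,
}

def triage(entities):
    """Split predictions into (accepted, needs_review) by confidence score."""
    accepted, needs_review = [], []
    for e in entities:
        threshold = PER_TYPE_THRESHOLD.get(e["entity_group"], DEFAULT_THRESHOLD)
        if e["score"] >= threshold:
            accepted.append(e)
        else:
            needs_review.append(e)
    return accepted, needs_review

# Invented predictions in the pipeline's output shape.
sample = [
    {"entity_group": "PERSON", "score": 0.99, "start": 10, "end": 20},
    {"entity_group": "FREQUENCY", "score": 0.85, "start": 84, "end": 94},
]
accepted, needs_review = triage(sample)
print([e["entity_group"] for e in accepted])      # ['PERSON']
print([e["entity_group"] for e in needs_review])  # ['FREQUENCY']
```

For anonymization, where a missed entity is costlier than a false alarm, flagged spans would usually still be redacted and reviewed afterwards rather than dropped.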

Acknowledgements

This model is fine-tuned from BureauBERTo, developed by CoLingLab at the University of Pisa. BureauBERTo adapts UmBERTo to Italian bureaucratic and administrative language.

@inproceedings{auriemma2023bureauberto,
  title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series = {CEUR Workshop Proceedings},
  volume = {3486},
  pages = {240--248},
  publisher = {CEUR-WS.org},
  year = {2023},
  url = {https://ceur-ws.org/Vol-3486/42.pdf}
}

Framework Versions

Transformers: 4.57.6
PyTorch: 2.11.0
Python: 3.13

License

Apache 2.0
