Nerone: Italian NER for Sensitive Data
Named Entity Recognition model for extracting and classifying sensitive personal information from Italian documents.
Model Description
Fine-tuned BureauBERTo (an Italian language model adapted from UmBERTo) for token classification with 70 entity types:
- Personal: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- Geographic: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- Contact: PHONE, EMAIL, URL
- Financial: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- Medical: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- Legal/Administrative: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- Cadastral: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- Technical: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- Codes: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- Temporal: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- Misc: ORGANIZATION
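The categories above can be captured as a lookup table, which is handy for filtering predictions by category. This is only an illustrative sketch: the names `ENTITY_CATEGORIES`, `TYPE_TO_CATEGORY`, and `category_of` are hypothetical and not shipped with the model, and the label strings are assumed to match the `entity_group` values returned by the pipeline.

```python
# Hypothetical lookup table transcribed from the entity list in this card.
ENTITY_CATEGORIES = {
    "Personal": ["PERSON", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION",
                 "BLOOD_TYPE", "FISCAL_CODE"],
    "Geographic": ["ADDRESS", "COUNTRY", "REGION", "PROVINCE", "MUNICIPALITY",
                   "ZIP_CODE", "LATITUDE", "LONGITUDE", "ALTITUDE"],
    "Contact": ["PHONE", "EMAIL", "URL"],
    "Financial": ["MONEY_AMOUNT", "PERCENTAGE", "CARD_NUMBER", "CVV",
                  "CHECK_NUMBER", "ACCOUNT_NUMBER", "IBAN", "BIC",
                  "VAT_NUMBER", "TAX_TYPE"],
    "Medical": ["DISEASE", "MEDICINE", "DOSAGE", "FORM", "MEDICAL_RECORD"],
    "Legal/Administrative": ["PASSPORT", "DRIVER_LICENSE", "LICENSE_NUMBER",
                             "LICENSE_PLATE", "LAW", "COURT", "ACT_NUMBER",
                             "PROTOCOL_NUMBER", "PROPERTY_REGIME"],
    "Cadastral": ["CADASTRAL_SHEET", "CADASTRAL_PARCEL", "CADASTRAL_MAP",
                  "CADASTRAL_SUB"],
    "Technical": ["IP", "IMEI", "MAC", "UUID", "VIN", "OTP_CODE", "PIN"],
    "Codes": ["ISBN", "CIG_CODE", "CUP_CODE", "REA_CODE", "SDI_CODE",
              "ATC_CODE", "ATECO_CODE", "ICD_CODE"],
    "Temporal": ["DATE", "DATE_RANGE", "TIME", "TIME_RANGE", "YEAR",
                 "DURATION", "FREQUENCY"],
    "Misc": ["ORGANIZATION"],
}

# Reverse index: entity type -> category name.
TYPE_TO_CATEGORY = {
    entity_type: category
    for category, entity_types in ENTITY_CATEGORIES.items()
    for entity_type in entity_types
}


def category_of(entity_group: str) -> str:
    """Return the category of a label, or "Unknown" if it is not listed."""
    return TYPE_TO_CATEGORY.get(entity_group, "Unknown")


total = sum(len(types) for types in ENTITY_CATEGORIES.values())
print(total)  # 70
```

Summing the per-category counts reproduces the 70 entity types stated above.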
Dataset
- Total samples: 122,625
- Split: 70% train / 15% validation / 15% test
- Source: Italian administrative documents
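For a rough sense of scale, the split percentages can be turned into approximate sample counts. The card does not state the exact per-split counts or how fractional samples were rounded, so the figures below are an estimate under the assumption that any remainder falls into the test split.

```python
TOTAL_SAMPLES = 122_625

# Integer arithmetic avoids floating-point rounding surprises.
train_size = TOTAL_SAMPLES * 70 // 100
val_size = TOTAL_SAMPLES * 15 // 100
# Assumption: whatever is left after train and validation goes to test.
test_size = TOTAL_SAMPLES - train_size - val_size

print(train_size, val_size, test_size)  # 85837 18393 18395
```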
Training
- Base model: colinglab/BureauBERTo
- Learning rate: 4e-5
- Batch size: 32
- Max sequence length: 256
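The hyperparameters above could be expressed for the Transformers `Trainer` roughly as follows. This is a sketch, not the author's training script: the card does not give the number of epochs, optimizer, or scheduler, so those are left at library defaults, and treating the batch size as a per-device value on a single device is an assumption. The names `TRAINING_KWARGS`, `TOKENIZER_KWARGS`, and `build_training_arguments` are hypothetical.

```python
# Values stated in this card.
BASE_MODEL = "colinglab/BureauBERTo"
MAX_SEQ_LENGTH = 256

# Keys follow transformers.TrainingArguments parameter names.
TRAINING_KWARGS = {
    "learning_rate": 4e-5,
    # Assumption: the stated batch size is per device, on a single device.
    "per_device_train_batch_size": 32,
}

# Keyword arguments for the tokenizer call: inputs longer than
# MAX_SEQ_LENGTH tokens are truncated.
TOKENIZER_KWARGS = {"truncation": True, "max_length": MAX_SEQ_LENGTH}


def build_training_arguments(output_dir: str):
    """Create TrainingArguments from the values above.

    transformers is imported lazily so the constants stay usable
    even where the library is not installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

Because inputs are truncated at 256 tokens under this setup, longer documents would need to be split into chunks before inference.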
Evaluation Results
| Metric | Score |
|---|---|
| F1 | 0.915 |
| Precision | 0.895 |
| Recall | 0.936 |
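As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported values agree with each other. The card does not say how the scores were averaged (for example, entity-level micro-averaging), so the snippet only verifies the arithmetic.

```python
precision = 0.895
recall = 0.936

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.915
```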
Usage
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hub.
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole-word entities,
# labelling each word with the tag of its first token.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

# Each result is a dict with entity_group, score, word, start, and end.
entities = ner(text)
print(entities)
Output:
[
{"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
{"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
{"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
{"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
{"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
{"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
{"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
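The list of dictionaries returned by the pipeline is easy to post-process. The sketch below groups detected words by entity type and drops low-confidence predictions; `group_entities` and the `min_score` threshold are hypothetical helpers for illustration, and the sample data is copied from two entries of the output above.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group entity words by type, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Two entries copied from the example output.
sample = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi",
     "start": 15, "end": 26},
    {"entity_group": "IBAN", "score": 0.99,
     "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
]

print(group_entities(sample))
# {'PERSON': ['Mario Rossi'], 'IBAN': ['IT60X0542811101000000123456']}
```

A suitable threshold depends on the application; for anonymization it is usually safer to keep low-confidence hits than to discard them.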
Intended Use
Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:
- Document anonymization
- GDPR compliance
- Data extraction from public administration documents
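For the anonymization use case, one simple approach is to replace each detected span with a placeholder built from its label. The function below is a minimal sketch, assuming the `start` and `end` values are character offsets into the same string passed to the pipeline and that spans do not overlap; `redact` is a hypothetical helper, and the short sample text and its offsets were written by hand for illustration.

```python
def redact(text, entities):
    """Replace each entity span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written sample in the same shape as the pipeline output.
sample_text = "Mario Rossi vive a Roma."
sample_entities = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi",
     "start": 0, "end": 11},
    {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma",
     "start": 19, "end": 23},
]

print(redact(sample_text, sample_entities))
# [PERSON] vive a [MUNICIPALITY].
```

Automated redaction of this kind should be reviewed before it is relied on for compliance purposes, since any missed entity remains in the text.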
Limitations
- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting
Acknowledgements
This model is fine-tuned from BureauBERTo, developed by CoLingLab at the University of Pisa. BureauBERTo adapts UmBERTo to Italian bureaucratic and administrative language.
@inproceedings{auriemma2023bureauberto,
title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
series = {CEUR Workshop Proceedings},
volume = {3486},
pages = {240--248},
publisher = {CEUR-WS.org},
year = {2023},
url = {https://ceur-ws.org/Vol-3486/42.pdf}
}
Framework Versions
- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13
License
Apache 2.0
