---
language:
- it
license: apache-2.0
tags:
- token-classification
- ner
- italian
- transformers
- pytorch
datasets:
- custom
metrics:
- f1
- precision
- recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
- text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
  example_title: "Documento anagrafico"
- text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
  example_title: "Documento medico"
---
# Nerone: Italian NER for Sensitive Data
Named Entity Recognition model for extracting and classifying sensitive personal information from Italian documents.
## Model Description
Nerone is a fine-tune of [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), an Italian encoder adapted from the RoBERTa-based UmBERTo, for token classification over 70 entity types:
- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION
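
The categories above are a documentation grouping; the model itself predicts the individual entity types. If downstream code needs to work at category level, a lookup table like the following can help. This is a minimal sketch: `ENTITY_TYPES`, `CATEGORY_OF`, and `filter_by_category` are hypothetical names, not part of the model or of Transformers.

```python
# Entity types copied from the list above, keyed by category.
ENTITY_TYPES = {
    "Personal": ["PERSON", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION",
                 "BLOOD_TYPE", "FISCAL_CODE"],
    "Geographic": ["ADDRESS", "COUNTRY", "REGION", "PROVINCE", "MUNICIPALITY",
                   "ZIP_CODE", "LATITUDE", "LONGITUDE", "ALTITUDE"],
    "Contact": ["PHONE", "EMAIL", "URL"],
    "Financial": ["MONEY_AMOUNT", "PERCENTAGE", "CARD_NUMBER", "CVV",
                  "CHECK_NUMBER", "ACCOUNT_NUMBER", "IBAN", "BIC",
                  "VAT_NUMBER", "TAX_TYPE"],
    "Medical": ["DISEASE", "MEDICINE", "DOSAGE", "FORM", "MEDICAL_RECORD"],
    "Legal/Administrative": ["PASSPORT", "DRIVER_LICENSE", "LICENSE_NUMBER",
                             "LICENSE_PLATE", "LAW", "COURT", "ACT_NUMBER",
                             "PROTOCOL_NUMBER", "PROPERTY_REGIME"],
    "Cadastral": ["CADASTRAL_SHEET", "CADASTRAL_PARCEL", "CADASTRAL_MAP",
                  "CADASTRAL_SUB"],
    "Technical": ["IP", "IMEI", "MAC", "UUID", "VIN", "OTP_CODE", "PIN"],
    "Codes": ["ISBN", "CIG_CODE", "CUP_CODE", "REA_CODE", "SDI_CODE",
              "ATC_CODE", "ATECO_CODE", "ICD_CODE"],
    "Temporal": ["DATE", "DATE_RANGE", "TIME", "TIME_RANGE", "YEAR",
                 "DURATION", "FREQUENCY"],
    "Misc": ["ORGANIZATION"],
}

# Reverse lookup: entity type -> category.
CATEGORY_OF = {
    label: category
    for category, labels in ENTITY_TYPES.items()
    for label in labels
}


def filter_by_category(entities, categories):
    """Keep pipeline results whose entity_group belongs to one of the categories."""
    wanted = set(categories)
    return [e for e in entities if CATEGORY_OF.get(e["entity_group"]) in wanted]


total_types = sum(len(labels) for labels in ENTITY_TYPES.values())
print(total_types)  # 70
```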
## Dataset
- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents
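
As a rough arithmetic check, the split percentages translate into approximate subset sizes. The sketch below assumes the train and validation sizes are floored and the remainder goes to the test set; the rounding actually used for this dataset is not stated, so the real counts may differ by a sample or two.

```python
TOTAL_SAMPLES = 122_625  # from the dataset summary above

# Assumption: floor train and validation, assign the remainder to test.
n_train = TOTAL_SAMPLES * 70 // 100
n_val = TOTAL_SAMPLES * 15 // 100
n_test = TOTAL_SAMPLES - n_train - n_val

print(n_train, n_val, n_test)  # 85837 18393 18395
```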
## Training
- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256
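
The hyperparameters above could be handed to the Hugging Face `Trainer` roughly as follows. This is a sketch, not the training script used for this model: the `output_dir` value is hypothetical, treating the batch size as per-device is an assumption, and settings not listed above (epochs, scheduler, weight decay) are left at library defaults.

```python
# Values taken from the list above.
BASE_MODEL = "colinglab/BureauBERTo"
LEARNING_RATE = 4e-5
BATCH_SIZE = 32
MAX_SEQ_LENGTH = 256

# Keyword arguments for transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "nerone-finetune",  # hypothetical path
    "learning_rate": LEARNING_RATE,
    "per_device_train_batch_size": BATCH_SIZE,  # assumption: 32 is per device
    "per_device_eval_batch_size": BATCH_SIZE,
}

# Keyword arguments for the tokenizer call; longer inputs are truncated.
tokenizer_kwargs = {"truncation": True, "max_length": MAX_SEQ_LENGTH}


def build_training_args():
    """Create TrainingArguments; imported lazily so the constants work without transformers."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```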
## Evaluation Results
| Metric | Score |
|-----------|-------|
| F1 | 0.915 |
| Precision | 0.895 |
| Recall | 0.936 |
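
F1 is the harmonic mean of precision and recall, so the three scores in the table can be cross-checked against each other. A minimal sketch follows; it says nothing about whether the scores are micro- or macro-averaged, which the table does not specify.

```python
precision = 0.895
recall = 0.936

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.915
```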


## Usage
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole words and
# labels each word with the tag of its first token.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

# Each result has entity_group, score, word, and start/end character offsets.
entities = ner(text)
print(entities)
```
**Output:**
```json
[
{"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
{"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
{"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
{"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
{"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
{"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
{"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```
## Intended Use
Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:
- Document anonymization
- GDPR compliance
- Data extraction from public administration documents
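
For the anonymization use case, one simple approach is to replace each detected span with a placeholder built from its entity type. The function below is a hypothetical helper (`redact` is not part of this model or of Transformers) and assumes non-overlapping spans in the format returned by the pipeline in the Usage section.

```python
def redact(text, entities):
    """Replace each entity span with a [LABEL] placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    while the string changes length.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


sample = "Mario Rossi, nato il 15/03/1985."
found = [  # hand-written spans for illustration, not model output
    {"entity_group": "PERSON", "start": 0, "end": 11},
    {"entity_group": "DATE", "start": 21, "end": 31},
]
print(redact(sample, found))  # [PERSON], nato il [DATE].
```

Replacing spans from the end of the string backwards avoids having to recompute offsets after each substitution.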
## Limitations
- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting
## Acknowledgements
This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.
```bibtex
@inproceedings{auriemma2023bureauberto,
title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
series = {CEUR Workshop Proceedings},
volume = {3486},
pages = {240--248},
publisher = {CEUR-WS.org},
year = {2023},
url = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```
## Framework Versions
- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13
## License
Apache 2.0