---
language:
- it
license: apache-2.0
tags:
- token-classification
- ner
- italian
- transformers
- pytorch
datasets:
- custom
metrics:
- f1
- precision
- recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
- text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
  example_title: "Documento anagrafico"
- text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
  example_title: "Documento medico"
---

# Nerone: Italian NER for Sensitive Data

Named Entity Recognition model for extracting and classifying sensitive personal information from Italian documents.

## Model Description

Fine-tuned [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo) (Italian BERT variant) for token classification with 70 entity types:

- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION

## Dataset

- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents

## Training

- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256

## Evaluation Results

| Metric    | Score |
|-----------|-------|
| F1        | 0.915 |
| Precision | 0.895 |
| Recall    | 0.936 |

![Entity-level metrics](label_metrics_entity.png)

![Confusion matrix](confusion_matrix_entity.png)

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985, residente in Via Garibaldi 42, 00153 Roma (RM), codice fiscale RSSMRA85C15H501Z, dichiara di essere titolare del conto corrente IBAN IT60X0542811101000000123456 presso Banca Intesa."""

entities = ner(text)
print(entities)
```

**Output:**

```json
[
  {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
  {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
  {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
  {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```

## Intended Use

Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data.

Primary use cases:

- Document anonymization
- GDPR compliance
- Data extraction from public administration documents

## Limitations

- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting

## Acknowledgements

This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.

```bibtex
@inproceedings{auriemma2023bureauberto,
  title     = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author    = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series    = {CEUR Workshop Proceedings},
  volume    = {3486},
  pages     = {240--248},
  publisher = {CEUR-WS.org},
  year      = {2023},
  url       = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```

## Framework Versions

- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13

## License

Apache 2.0
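
## Example: Masking Detected Entities

For the document anonymization use case listed under Intended Use, the `start` and `end` character offsets in the pipeline output can be used to replace each detected span with a placeholder. The following is a minimal sketch, not part of the model or of Transformers: the helper `mask_entities`, its `min_score` threshold, and the sample entities are hypothetical and hand-written for illustration. It assumes the spans do not overlap and that the offsets index directly into the input string, which is worth checking on your own data before relying on it.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Hypothetical helper: `entities` is expected in the format returned by the
    token-classification pipeline with an aggregation strategy set, i.e. dicts
    with "entity_group", "score", "start" and "end" keys.
    """
    # Drop low-confidence predictions (the 0.5 default is an arbitrary choice).
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text


# Hand-written entities in the pipeline's output format (not real model output).
sample = "Mario Rossi vive a Roma."
sample_entities = [
    {"entity_group": "PERSON", "score": 0.99, "start": 0, "end": 11},
    {"entity_group": "MUNICIPALITY", "score": 0.98, "start": 19, "end": 23},
]

print(mask_entities(sample, sample_entities))
```

In practice `entities` would be the list returned by `ner(text)` in the Usage section. Replacing spans from right to left avoids recomputing offsets after each substitution.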