---
language:
- it
license: apache-2.0
tags:
- token-classification
- ner
- italian
- transformers
- pytorch
datasets:
- custom
metrics:
- f1
- precision
- recall
base_model: colinglab/BureauBERTo
pipeline_tag: token-classification
widget:
- text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
  example_title: "Documento anagrafico"
- text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
  example_title: "Documento medico"
---

# Nerone: Italian NER for Sensitive Data

Named Entity Recognition (NER) model for extracting and classifying sensitive personal information from Italian documents.

## Model Description

Fine-tuned [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo) (Italian BERT variant) for token classification with 70 entity types:

- **Personal**: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
- **Geographic**: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
- **Contact**: PHONE, EMAIL, URL
- **Financial**: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
- **Medical**: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
- **Legal/Administrative**: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
- **Cadastral**: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
- **Technical**: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
- **Codes**: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
- **Temporal**: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
- **Misc**: ORGANIZATION
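
For downstream filtering it can help to have the taxonomy above in machine-readable form. The following is a minimal sketch that transcribes the category list into a Python dictionary and builds a reverse lookup. The names `ENTITY_CATEGORIES`, `CATEGORY_OF`, and `category_of` are illustrative assumptions; they are not shipped with the model or part of the Transformers API.

```python
# Entity taxonomy transcribed from the category list in this card.
# Illustrative helper only: these names are not part of the model or of Transformers.
ENTITY_CATEGORIES = {
    "Personal": ["PERSON", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION",
                 "BLOOD_TYPE", "FISCAL_CODE"],
    "Geographic": ["ADDRESS", "COUNTRY", "REGION", "PROVINCE", "MUNICIPALITY",
                   "ZIP_CODE", "LATITUDE", "LONGITUDE", "ALTITUDE"],
    "Contact": ["PHONE", "EMAIL", "URL"],
    "Financial": ["MONEY_AMOUNT", "PERCENTAGE", "CARD_NUMBER", "CVV", "CHECK_NUMBER",
                  "ACCOUNT_NUMBER", "IBAN", "BIC", "VAT_NUMBER", "TAX_TYPE"],
    "Medical": ["DISEASE", "MEDICINE", "DOSAGE", "FORM", "MEDICAL_RECORD"],
    "Legal/Administrative": ["PASSPORT", "DRIVER_LICENSE", "LICENSE_NUMBER",
                             "LICENSE_PLATE", "LAW", "COURT", "ACT_NUMBER",
                             "PROTOCOL_NUMBER", "PROPERTY_REGIME"],
    "Cadastral": ["CADASTRAL_SHEET", "CADASTRAL_PARCEL", "CADASTRAL_MAP", "CADASTRAL_SUB"],
    "Technical": ["IP", "IMEI", "MAC", "UUID", "VIN", "OTP_CODE", "PIN"],
    "Codes": ["ISBN", "CIG_CODE", "CUP_CODE", "REA_CODE", "SDI_CODE", "ATC_CODE",
              "ATECO_CODE", "ICD_CODE"],
    "Temporal": ["DATE", "DATE_RANGE", "TIME", "TIME_RANGE", "YEAR", "DURATION",
                 "FREQUENCY"],
    "Misc": ["ORGANIZATION"],
}

# Reverse lookup: entity type -> category name.
CATEGORY_OF = {
    label: category
    for category, labels in ENTITY_CATEGORIES.items()
    for label in labels
}


def category_of(entity_group: str) -> str:
    """Return the category of an entity type, or "Unknown" if it is not listed."""
    return CATEGORY_OF.get(entity_group, "Unknown")


print(len(CATEGORY_OF))     # number of distinct entity types
print(category_of("IBAN"))  # category of a single type
```

The lookup is meant to be applied to the `entity_group` field of aggregated pipeline results, for example to keep only financial or medical entities.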

## Dataset

- **Total samples**: 122,625
- **Split**: 70% train / 15% validation / 15% test
- **Source**: Italian administrative documents
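
As a rough guide to what the split means in sample counts, the percentages can be applied to the total. This sketch assumes integer rounding with the remainder assigned to the test set; the card does not state how the actual split was rounded, so the exact counts are illustrative. `split_sizes` is a hypothetical helper.

```python
TOTAL_SAMPLES = 122_625  # total from the Dataset section


def split_sizes(total: int, train_pct: int = 70, val_pct: int = 15):
    """Return (train, validation, test) counts for a percentage split.

    Assumption: train and validation are rounded down and the test set
    receives whatever is left, so the three counts always sum to `total`.
    """
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val
    return train, val, test


train, val, test = split_sizes(TOTAL_SAMPLES)
print(train, val, test)  # 85837 18393 18395
```

Because 70% of 122,625 is not a whole number, the stated percentages are necessarily approximate.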

## Training

- **Base model**: colinglab/BureauBERTo
- **Learning rate**: 4e-5
- **Batch size**: 32
- **Max sequence length**: 256

## Evaluation Results

| Metric    | Score |
|-----------|-------|
| F1        | 0.915 |
| Precision | 0.895 |
| Recall    | 0.936 |
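
The reported F1 is consistent with the precision and recall in the table, since F1 is their harmonic mean. A quick check, assuming the scores are the rounded values shown (the card does not say whether they are micro- or macro-averaged):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.895, 0.936)
print(round(f1, 3))  # 0.915
```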

## Usage

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

# aggregation_strategy="first" merges sub-word tokens into whole-word entities,
# labelling each word with the tag of its first token
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
residente in Via Garibaldi 42, 00153 Roma (RM),
codice fiscale RSSMRA85C15H501Z,
dichiara di essere titolare del conto corrente
IBAN IT60X0542811101000000123456 presso Banca Intesa."""

entities = ner(text)
print(entities)
```

**Output:**

```json
[
  {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
  {"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
  {"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
  {"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
  {"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
  {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
  {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
]
```
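
The pipeline returns a flat list of dictionaries, so post-processing is ordinary Python. As a sketch, the snippet below keeps entities above a confidence threshold and groups their surface forms by type. The sample list is copied from the output above and trimmed to three entries; `group_entities` and the 0.995 threshold are hypothetical choices for illustration, not recommendations from the model author.

```python
from collections import defaultdict

# Three entries copied from the example output in this card.
sample = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
    {"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
    {"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235},
]


def group_entities(entities, min_score=0.0):
    """Group entity surface forms by type, dropping those scored below min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


print(group_entities(sample))                   # all three entity types
print(group_entities(sample, min_score=0.995))  # the IBAN (score 0.99) is dropped
```

For anonymization a low threshold is usually the safer default, since a missed entity is costlier than an unnecessary redaction.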

## Intended Use

Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:

- Document anonymization
- GDPR compliance
- Data extraction from public administration documents
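
For the anonymization use case, the `start` and `end` character offsets returned by the pipeline can be used to mask detected spans. The following is a minimal sketch, not the author's pipeline: `span` and `redact` are hypothetical helpers, the entity list is hand-built in the pipeline's output format for a short example sentence, and overlapping spans are assumed not to occur.

```python
def span(text, value, entity_group):
    """Build a pipeline-style entity dict for the first occurrence of `value`."""
    start = text.index(value)
    return {"entity_group": entity_group, "word": value,
            "start": start, "end": start + len(value)}


def redact(text, entities):
    """Replace each detected span with a [ENTITY_GROUP] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


text = "Mario Rossi, codice fiscale RSSMRA85C15H501Z."

# Hand-built stand-in for model output; in practice this would be ner(text).
entities = [
    span(text, "Mario Rossi", "PERSON"),
    span(text, "RSSMRA85C15H501Z", "FISCAL_CODE"),
]

print(redact(text, entities))  # [PERSON], codice fiscale [FISCAL_CODE].
```

Keeping the entity type in the placeholder preserves the document's structure for later review; a production system would also need a policy for overlapping or low-confidence spans.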

## Limitations

- Optimized for formal Italian text (administrative, legal, medical documents)
- Performance may degrade on informal text, dialects, or non-standard formatting

## Acknowledgements

This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.

```bibtex
@inproceedings{auriemma2023bureauberto,
  title     = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
  author    = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
  booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
  series    = {CEUR Workshop Proceedings},
  volume    = {3486},
  pages     = {240--248},
  publisher = {CEUR-WS.org},
  year      = {2023},
  url       = {https://ceur-ws.org/Vol-3486/42.pdf}
}
```

## Framework Versions

- Transformers: 4.57.6
- PyTorch: 2.11.0
- Python: 3.13

## License

Apache 2.0