	---
	language:
	- it
	license: apache-2.0
	tags:
	- token-classification
	- ner
	- italian
	- transformers
	- pytorch
	datasets:
	- custom
	metrics:
	- f1
	- precision
	- recall
	base_model: colinglab/BureauBERTo
	pipeline_tag: token-classification
	widget:
	- text: "Mario Rossi, nato il 15/03/1985, residente in Via Roma 123, 00100 Roma, codice fiscale RSSMRA85C15H501Z."
	  example_title: "Documento anagrafico"
	- text: "Il paziente assume Tachipirina 1000mg due volte al giorno per 5 giorni."
	  example_title: "Documento medico"
	---

	# Nerone: Italian NER for Sensitive Data

	Named Entity Recognition model for extracting and classifying sensitive personal information from Italian documents.

	## Model Description

	[BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), an Italian model adapted from the RoBERTa-based UmBERTo, fine-tuned for token classification over 70 entity types:

	- Personal: PERSON, AGE, GENDER, MARITAL_STATUS, PROFESSION, BLOOD_TYPE, FISCAL_CODE
	- Geographic: ADDRESS, COUNTRY, REGION, PROVINCE, MUNICIPALITY, ZIP_CODE, LATITUDE, LONGITUDE, ALTITUDE
	- Contact: PHONE, EMAIL, URL
	- Financial: MONEY_AMOUNT, PERCENTAGE, CARD_NUMBER, CVV, CHECK_NUMBER, ACCOUNT_NUMBER, IBAN, BIC, VAT_NUMBER, TAX_TYPE
	- Medical: DISEASE, MEDICINE, DOSAGE, FORM, MEDICAL_RECORD
	- Legal/Administrative: PASSPORT, DRIVER_LICENSE, LICENSE_NUMBER, LICENSE_PLATE, LAW, COURT, ACT_NUMBER, PROTOCOL_NUMBER, PROPERTY_REGIME
	- Cadastral: CADASTRAL_SHEET, CADASTRAL_PARCEL, CADASTRAL_MAP, CADASTRAL_SUB
	- Technical: IP, IMEI, MAC, UUID, VIN, OTP_CODE, PIN
	- Codes: ISBN, CIG_CODE, CUP_CODE, REA_CODE, SDI_CODE, ATC_CODE, ATECO_CODE, ICD_CODE
	- Temporal: DATE, DATE_RANGE, TIME, TIME_RANGE, YEAR, DURATION, FREQUENCY
	- Misc: ORGANIZATION
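
For downstream filtering it can help to have the taxonomy above in machine-readable form. The following is a minimal sketch that transcribes the list into a dictionary; the names `ENTITY_CATEGORIES`, `LABEL_TO_CATEGORY`, and `category_of` are illustrative and are not part of the model or its configuration.

```python
# Category -> labels, transcribed from the entity list in this model card.
# These names are hypothetical helpers, not something the model ships with.
ENTITY_CATEGORIES = {
    "Personal": ["PERSON", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION",
                 "BLOOD_TYPE", "FISCAL_CODE"],
    "Geographic": ["ADDRESS", "COUNTRY", "REGION", "PROVINCE", "MUNICIPALITY",
                   "ZIP_CODE", "LATITUDE", "LONGITUDE", "ALTITUDE"],
    "Contact": ["PHONE", "EMAIL", "URL"],
    "Financial": ["MONEY_AMOUNT", "PERCENTAGE", "CARD_NUMBER", "CVV",
                  "CHECK_NUMBER", "ACCOUNT_NUMBER", "IBAN", "BIC",
                  "VAT_NUMBER", "TAX_TYPE"],
    "Medical": ["DISEASE", "MEDICINE", "DOSAGE", "FORM", "MEDICAL_RECORD"],
    "Legal/Administrative": ["PASSPORT", "DRIVER_LICENSE", "LICENSE_NUMBER",
                             "LICENSE_PLATE", "LAW", "COURT", "ACT_NUMBER",
                             "PROTOCOL_NUMBER", "PROPERTY_REGIME"],
    "Cadastral": ["CADASTRAL_SHEET", "CADASTRAL_PARCEL", "CADASTRAL_MAP",
                  "CADASTRAL_SUB"],
    "Technical": ["IP", "IMEI", "MAC", "UUID", "VIN", "OTP_CODE", "PIN"],
    "Codes": ["ISBN", "CIG_CODE", "CUP_CODE", "REA_CODE", "SDI_CODE",
              "ATC_CODE", "ATECO_CODE", "ICD_CODE"],
    "Temporal": ["DATE", "DATE_RANGE", "TIME", "TIME_RANGE", "YEAR",
                 "DURATION", "FREQUENCY"],
    "Misc": ["ORGANIZATION"],
}

# Reverse index: label -> category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ENTITY_CATEGORIES.items()
    for label in labels
}


def category_of(label: str) -> str:
    """Return the category of an entity label, or "Unknown" if unlisted."""
    return LABEL_TO_CATEGORY.get(label, "Unknown")


print(len(LABEL_TO_CATEGORY))  # 70
print(category_of("IBAN"))     # Financial
```

A lookup like this makes it straightforward to keep, say, only Financial and Personal entities when post-processing predictions.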

	## Dataset

	- Total samples: 122,625
	- Split: 70% train / 15% validation / 15% test
	- Source: Italian administrative documents

	## Training

	- Base model: colinglab/BureauBERTo
	- Learning rate: 4e-5
	- Batch size: 32
	- Max sequence length: 256
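
The split percentages and batch size listed above imply approximate split sizes and a number of optimizer steps per epoch. The sketch below works these out under stated assumptions: the exact per-split counts are not given in this card, so assigning the rounding remainder to the test split is a guess, and the step count assumes a single device with no gradient accumulation.

```python
import math

TOTAL_SAMPLES = 122_625  # from the Dataset section
BATCH_SIZE = 32          # from the Training section

# Integer arithmetic avoids floating-point rounding surprises.
# Giving the leftover samples to the test split is an assumption,
# not the authors' documented procedure.
train_size = TOTAL_SAMPLES * 70 // 100
val_size = TOTAL_SAMPLES * 15 // 100
test_size = TOTAL_SAMPLES - train_size - val_size

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)

print(train_size, val_size, test_size)  # 85837 18393 18395
print(steps_per_epoch)                  # 2683
```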

	## Evaluation Results

	| Metric    | Score |
	|-----------|-------|
	| F1        | 0.915 |
	| Precision | 0.895 |
	| Recall    | 0.936 |
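
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported scores can be verified against each other. This sketch assumes all three numbers use the same averaging scheme, which the card does not state.

```python
# Reported scores from the evaluation table.
precision = 0.895
recall = 0.936

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.915
```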

	![Entity-level metrics](label_metrics_entity.png)

	![Confusion matrix](confusion_matrix_entity.png)

	## Usage

	```python
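	from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

	# Load the fine-tuned weights and the matching tokenizer from the Hub.
	model = AutoModelForTokenClassification.from_pretrained("lcs06/nerone")
	tokenizer = AutoTokenizer.from_pretrained("lcs06/nerone")

	# "first" tags each word with the label of its first sub-token,
	# so a single word is never split across two entity types.
	ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")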

	text = """Il sottoscritto Mario Rossi, nato a Roma il 15/03/1985,
	residente in Via Garibaldi 42, 00153 Roma (RM),
	codice fiscale RSSMRA85C15H501Z,
	dichiara di essere titolare del conto corrente
	IBAN IT60X0542811101000000123456 presso Banca Intesa."""

	entities = ner(text)
	print(entities)
	```

	Output:
	```json
	[
	{"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi", "start": 15, "end": 26},
	{"entity_group": "MUNICIPALITY", "score": 1.0, "word": "Roma", "start": 35, "end": 39},
	{"entity_group": "DATE", "score": 1.0, "word": "15/03/1985", "start": 43, "end": 53},
	{"entity_group": "ADDRESS", "score": 1.0, "word": "Via Garibaldi 42, 00153 Roma (RM)", "start": 68, "end": 101},
	{"entity_group": "FISCAL_CODE", "score": 1.0, "word": "RSSMRA85C15H501Z", "start": 118, "end": 134},
	{"entity_group": "IBAN", "score": 0.99, "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
	{"entity_group": "ORGANIZATION", "score": 1.0, "word": "Banca Intesa", "start": 223, "end": 235}
	]
	```
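
The pipeline returns a flat list of dictionaries, so a small amount of post-processing is often useful. Below is a minimal sketch that drops low-confidence predictions and groups the remaining entity texts by label. The helper `group_entities` and the `min_score` threshold are hypothetical; a suitable threshold would need to be tuned on your own data.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group entity texts by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Two records copied from the example output above.
sample = [
    {"entity_group": "PERSON", "score": 1.0, "word": "Mario Rossi",
     "start": 15, "end": 26},
    {"entity_group": "IBAN", "score": 0.99,
     "word": "IT60X0542811101000000123456", "start": 188, "end": 215},
]

print(group_entities(sample, min_score=0.995))
# {'PERSON': ['Mario Rossi']}
```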

	## Intended Use

	Designed for processing Italian administrative and legal documents to identify and classify sensitive personal data. Primary use cases:

	- Document anonymization
	- Supporting GDPR compliance by locating personal data in documents
	- Data extraction from public administration documents
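
For the anonymization use case, the `start` and `end` character offsets in each prediction can be used to mask entities in the original text. The following is a minimal sketch, not a complete anonymization tool: the `redact` helper is hypothetical, it assumes entity spans do not overlap, and the sample entities are hand-written rather than produced by the model.

```python
def redact(text, entities):
    """Replace each entity span in text with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans for illustration; real input would come from the pipeline.
sample_text = "Mario Rossi vive a Roma."
sample_entities = [
    {"entity_group": "PERSON", "start": 0, "end": 11},
    {"entity_group": "MUNICIPALITY", "start": 19, "end": 23},
]

print(redact(sample_text, sample_entities))
# [PERSON] vive a [MUNICIPALITY].
```

Because recall is below 1.0, some sensitive spans will be missed, so redacted output should still be reviewed before release.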

	## Limitations

	- Optimized for formal Italian text (administrative, legal, medical documents)
	- Performance may degrade on informal text, dialects, or non-standard formatting

	## Acknowledgements

	This model is fine-tuned from [BureauBERTo](https://huggingface.co/colinglab/BureauBERTo), developed by CoLingLab at the University of Pisa. BureauBERTo adapts [UmBERTo](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) to Italian bureaucratic and administrative language.

	```bibtex
	@inproceedings{auriemma2023bureauberto,
	title = {{BureauBERTo}: adapting {UmBERTo} to the {Italian} bureaucratic language},
	author = {Auriemma, Serena and Madeddu, Mauro and Miliani, Martina and Bondielli, Alessandro and Passaro, Lucia C and Lenci, Alessandro},
	booktitle = {Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops (Ital IA 2023)},
	series = {CEUR Workshop Proceedings},
	volume = {3486},
	pages = {240--248},
	publisher = {CEUR-WS.org},
	year = {2023},
	url = {https://ceur-ws.org/Vol-3486/42.pdf}
	}
	```

	## Framework Versions

	- Transformers: 4.57.6
	- PyTorch: 2.11.0
	- Python: 3.13

	## License

	Apache 2.0