	---
	language:
	- multilingual
	license: apache-2.0
	base_model: google-bert/bert-base-multilingual-cased
	pipeline_tag: token-classification
	inference: false
	tags:
	- ner
	- pii-detection
	- anonymization
	- privacy
	- mapa
	- bert
	- hierarchical-ner
	---

	# MAPA Multilingual NER Model (Administrative Domain)

	This model is part of the MAPA (Multilingual Anonymisation for Public Administrations) toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme.

	It performs hierarchical Named Entity Recognition (NER) for the detection of personally identifiable information (PII) in multilingual text, with the goal of supporting anonymisation workflows in public administrations.

	## Model Description

	- Developed by: Pangeanic, as part of the MAPA EU Project
	- Base model: [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased), with extended vocabulary
	- Architecture: `EnhancedTwoFlatLevelsSequenceLabellingModel`, a custom BERT-based architecture with two parallel classification heads (one per annotation level)
	- Languages: Multilingual (24 EU official languages, via mBERT)
	- Domain: Administrative
	- Training data: EUR-Lex corpus
	- License: Apache 2.0

	### Hierarchical NER

	The model performs token classification at two levels simultaneously:

	- Level 1 (coarse-grained): 19 entity categories (e.g. `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, `ADDRESS`).
	- Level 2 (fine-grained): 117 entity subcategories (e.g. for a `PERSON`: `title`, `given name`, `family name`).

	Example output structure:

	```json
	{
	  "annotations": [
	    { "content": "señor Connelly", "value": "PERSON" },
	    { "content": "señor", "value": "title" },
	    { "content": "Connelly", "value": "family name" }
	  ]
	}
	```

	The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`.
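To illustrate how the two levels relate, the following minimal sketch groups the flat annotation list from the example above into a nested structure. It assumes that Level 1 labels are upper-case, that Level 2 labels are lower-case, and that a Level 2 annotation belongs to the Level 1 annotation whose content contains it. These rules and the function name `nest_annotations` are illustrative assumptions, not part of the MAPA toolkit, whose actual output may carry additional fields such as character offsets.

```python
import json

# Example payload copied from this model card; real MAPA output may differ.
payload = json.loads("""
{
  "annotations": [
    {"content": "señor Connelly", "value": "PERSON"},
    {"content": "señor", "value": "title"},
    {"content": "Connelly", "value": "family name"}
  ]
}
""")


def nest_annotations(annotations):
    """Group fine-grained annotations under the coarse annotation containing them.

    Assumption: Level 1 labels are upper-case and Level 2 labels are not,
    as in the example; containment is checked by substring match.
    """
    coarse = [dict(a, parts=[]) for a in annotations if a["value"].isupper()]
    fine = [a for a in annotations if not a["value"].isupper()]
    for part in fine:
        for entity in coarse:
            if part["content"] in entity["content"]:
                entity["parts"].append(part)
                break  # attach each part to the first matching entity only
    return coarse


nested = nest_annotations(payload["annotations"])
```

Substring matching is ambiguous when the same surface form occurs in several entities, so a production implementation would more likely rely on token or character offsets.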

	## Intended Use

	This model is intended to be used as part of the MAPA toolkit, which provides the full anonymisation pipeline (entity detection + entity replacement).

	The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. It is not directly loadable via `AutoModel.from_pretrained` from the `transformers` library without that code.

	To use this model, clone the MAPA toolkit:

	🔗 Repository: <https://github.com/PangeanicAI/MAPA-EU-Project>

	The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).
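As a rough illustration of the detection-then-replacement idea described in this section, the sketch below masks Level 1 spans with a bracketed label. The helper `mask_entities` and the placeholder format are hypothetical; the actual MAPA replacement module is more elaborate and relies on auxiliary resources that are not reproduced here.

```python
def mask_entities(text, annotations):
    """Replace each detected span with a bracketed label, e.g. "[PERSON]".

    Hypothetical helper for illustration only; it is not the MAPA
    entity replacement module.
    """
    # Process longer spans first so an enclosing span is masked before any
    # shorter span that it contains.
    ordered = sorted(annotations, key=lambda a: len(a["content"]), reverse=True)
    for ann in ordered:
        text = text.replace(ann["content"], f"[{ann['value']}]")
    return text


text = "El señor Connelly firmó el documento."
level1 = [{"content": "señor Connelly", "value": "PERSON"}]
masked = mask_entities(text, level1)
```

Plain string replacement masks every occurrence of a span's text, which may or may not be desirable; offset-based replacement would give finer control.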

	## Training Details

	- Training corpus: EUR-Lex (EU legal and administrative document database)
	- Epochs: 124
	- Iterations: 109,375
	- Base architecture: BERT-base multilingual cased, with extended vocabulary
	- Hidden size: 768
	- Layers: 12
	- Attention heads: 12
	- Vocabulary size: 119,547

	## Evaluation Metrics

	Final metrics reported at the end of training:

	| Metric            | Value  |
	|-------------------|--------|
	| Level 1 micro-F1  | 0.8374 |
	| Level 1 binary-F1 | 0.8669 |
	| Level 2 micro-F1  | 0.8467 |
	| Final loss        | 6.2376 |
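This card does not define the two F1 variants. The sketch below shows one common reading, stated here as an assumption: micro-F1 pools true positives, false positives and false negatives over all entity classes, while binary-F1 first collapses every label to entity versus non-entity. The token-level granularity, the `"O"` outside tag and the function names are also assumptions; the MAPA training code may compute these metrics differently.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def micro_f1(gold, pred, outside="O"):
    """Micro-averaged F1 over entity labels, pooling counts across classes."""
    pairs = list(zip(gold, pred))
    tp = sum(g == p and p != outside for g, p in pairs)
    fp = sum(p != outside and g != p for g, p in pairs)
    fn = sum(g != outside and g != p for g, p in pairs)
    return f1(tp, fp, fn)


def binary_f1(gold, pred, outside="O"):
    """F1 after collapsing every label to entity vs. non-entity."""
    def collapse(tags):
        return ["ENT" if t != outside else outside for t in tags]
    return micro_f1(collapse(gold), collapse(pred), outside)


gold = ["PERSON", "PERSON", "O", "DATE", "O"]
pred = ["PERSON", "ORGANISATION", "O", "DATE", "DATE"]
micro = micro_f1(gold, pred)    # 2 TP, 2 FP, 1 FN
binary = binary_f1(gold, pred)  # 3 TP, 1 FP, 0 FN
```

Under this reading, binary-F1 is never lower than micro-F1, because a token tagged with the wrong entity class still counts as a correctly detected entity. That matches the ordering of the two Level 1 values in the table.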

	## Limitations and Considerations

	- Not a standalone Hugging Face model: the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
	- Domain bias: the model was trained on EUR-Lex documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media).
	- Anonymisation is not guaranteed: as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
	- No additional safety testing: none has been performed for this Hugging Face release. The weights are published as they were produced during the original MAPA project.

	## Related Models

	Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains.

	## Acknowledgements

	The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.

	## Citation

	If you use this model, please cite the MAPA project:

	```bibtex
	@inproceedings{mapa2022,
	  title = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
	  author = {Gianola, Lucia and Ajausks, Ēmils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
	  booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
	  year = {2022}
	}
	```

	## Contact

	Read more:
	- [MAPA project](https://pangeanic.com/use-cases/mapa)
	- [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition)
	- [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool)

	For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/).