---
language:
- multilingual
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: token-classification
inference: false
tags:
- ner
- pii-detection
- anonymization
- privacy
- mapa
- bert
- hierarchical-ner
---

# MAPA Multilingual NER Model (Administrative Domain)

This model is part of the **MAPA (Multilingual Anonymisation for Public Administrations)** toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme.

It performs **hierarchical Named Entity Recognition (NER)** for the detection of personally identifiable information (PII) in multilingual text, with the goal of supporting anonymisation workflows in public administrations.

## Model Description

- **Developed by:** Pangeanic, as part of the MAPA EU Project
- **Base model:** [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased), with extended vocabulary
- **Architecture:** `EnhancedTwoFlatLevelsSequenceLabellingModel`, a custom BERT-based architecture with two parallel classification heads
- **Languages:** Multilingual (24 EU official languages, via mBERT)
- **Domain:** Administrative
- **Training data:** EUR-Lex corpus
- **License:** Apache 2.0

### Hierarchical NER

The model performs token classification at two levels simultaneously:

- **Level 1 (coarse-grained):** 19 entity categories (e.g. `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, `ADDRESS`...).
- **Level 2 (fine-grained):** 117 entity subcategories (e.g. for a `PERSON`: `title`, `given name`, `family name`...).

Example output structure:

```json
{
  "annotations": [
    { "content": "señor Connelly", "value": "PERSON" },
    { "content": "señor", "value": "title" },
    { "content": "Connelly", "value": "family name" }
  ]
}
```

The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`.

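
The example output above is a flat list in which a Level 1 annotation is followed by its Level 2 parts. A consumer may want to nest them. The following is a minimal sketch of one way to do that, not part of the MAPA toolkit: the function name `group_annotations`, the hard-coded `LEVEL1_LABELS` set (only the five categories named in this card), and the ordering and substring heuristic are all assumptions made for illustration.

```python
import json

# Level 1 labels named in this card. The full inventory lives in
# level1_tags_vocabulary.json, whose file format is not assumed here.
LEVEL1_LABELS = {"PERSON", "ORGANISATION", "LOCATION", "DATE", "ADDRESS"}


def group_annotations(payload):
    """Nest fine-grained annotations under the preceding coarse-grained one.

    Assumption (hypothetical heuristic): a Level 2 annotation follows its
    Level 1 parent in the list and its text is a substring of the parent's.
    """
    grouped = []
    for ann in payload["annotations"]:
        if ann["value"] in LEVEL1_LABELS:
            # Start a new coarse-grained entity.
            grouped.append(
                {"content": ann["content"], "level1": ann["value"], "level2": []}
            )
        elif grouped and ann["content"] in grouped[-1]["content"]:
            # Attach the fine-grained part to the most recent entity.
            grouped[-1]["level2"].append(
                {"content": ann["content"], "label": ann["value"]}
            )
        # Annotations with no identifiable parent are skipped in this sketch.
    return grouped


raw = """
{
  "annotations": [
    { "content": "señor Connelly", "value": "PERSON" },
    { "content": "señor", "value": "title" },
    { "content": "Connelly", "value": "family name" }
  ]
}
"""

entities = group_annotations(json.loads(raw))
print(json.dumps(entities, ensure_ascii=False, indent=2))
```

In practice the Level 1 label set would be read from `level1_tags_vocabulary.json` instead of being hard-coded, and character offsets, if the toolkit exposes them, would be a more robust way to match parts to parents than substring checks.
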
## Intended Use

This model is intended to be used **as part of the MAPA toolkit**, which provides the full anonymisation pipeline (entity detection + entity replacement).

The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. It is **not directly loadable** via `AutoModel.from_pretrained` from the `transformers` library without that code.

To use this model, clone the MAPA toolkit:

🔗 **Repository:** <https://gitlab.com/MAPA-EU-Project/mapa_project>

The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).

## Training Details

- **Training corpus:** EUR-Lex (EU legal and administrative document database)
- **Epochs:** 124
- **Iterations:** 109,375
- **Base architecture:** BERT-base multilingual cased, with extended vocabulary
- **Hidden size:** 768
- **Layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 119,547

## Evaluation Metrics

Final metrics reported at the end of training:

| Metric            | Value  |
|-------------------|--------|
| Level 1 micro-F1  | 0.8374 |
| Level 1 binary-F1 | 0.8669 |
| Level 2 micro-F1  | 0.8467 |
| Final loss        | 6.2376 |

## Limitations and Considerations

- **Not a standalone Hugging Face model:** the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
- **Domain bias:** the model was trained on EUR-Lex documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media).
- **Anonymisation is not guaranteed:** as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
- **No additional safety testing has been performed** in this Hugging Face release. The weights are published as they were produced during the original MAPA project.

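
To make the false-negative caveat and the micro-F1 figures under Evaluation Metrics more concrete, here is a small sketch of how micro-averaged F1 is commonly computed from per-label counts. It is not the MAPA evaluation script: this card does not state whether scoring was token-level or entity-level, and the function name `micro_f1` and the counts in `toy_counts` are invented for illustration only.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-label (tp, fp, fn) counts.

    Counts are pooled over all labels before precision and recall are
    computed, so frequent labels weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts, NOT MAPA results: (true positives, false positives, false negatives).
toy_counts = {"PERSON": (8, 1, 1), "DATE": (4, 1, 3)}

score = micro_f1(toy_counts)
print(round(score, 4))  # prints 0.8
```

Each false negative here is an entity left in the text, which is why recall matters more than the headline F1 when the output is meant to be shared.
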
## Related Models

Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains.

## Acknowledgements

The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.

## Citation

If you use this model, please cite the MAPA project:

```bibtex
@inproceedings{mapa2022,
  title     = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
  author    = {Gianola, Lucia and Ajausks, \=Emils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
  booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
  year      = {2022}
}
```

## Contact

Read more:

- [MAPA project](https://pangeanic.com/use-cases/mapa)
- [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition)
- [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool)

For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/).