| --- |
| language: |
| - multilingual |
| license: apache-2.0 |
| base_model: google-bert/bert-base-multilingual-cased |
| pipeline_tag: token-classification |
| inference: false |
| tags: |
| - ner |
| - pii-detection |
| - anonymization |
| - privacy |
| - mapa |
| - bert |
| - hierarchical-ner |
| --- |
| |
| # MAPA Multilingual NER Model (Administrative Domain) |
|
|
| This model is part of the **MAPA (Multilingual Anonymisation for Public Administrations)** toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme. |
|
|
| It performs **hierarchical Named Entity Recognition (NER)** for the detection of personally identifiable information (PII) in multilingual text, with the goal of supporting anonymisation workflows in public administrations. |
|
|
| ## Model Description |
|
|
| - **Developed by:** Pangeanic, as part of the MAPA EU Project |
| - **Base model:** [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased), with extended vocabulary |
| - **Architecture:** `EnhancedTwoFlatLevelsSequenceLabellingModel`, a custom BERT-based architecture with two parallel classification heads |
| - **Languages:** Multilingual (24 EU official languages, via mBERT) |
| - **Domain:** Administrative |
| - **Training data:** EUR-Lex corpus |
| - **License:** Apache 2.0 |
|
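As a rough illustration of what "two parallel classification heads" means, the sketch below feeds the same per-token encoder vector to two independent linear layers and takes the argmax of each. It is a toy in plain Python, not the `EnhancedTwoFlatLevelsSequenceLabellingModel` implementation: the hidden size of 2, the weights, the `O` (outside) tag, and the function names are all assumptions made up for the example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]  # one row of weights per label
Head = Tuple[Matrix, Vector]        # (weights, bias)


def linear(weights: Matrix, bias: Vector, hidden: Vector) -> List[float]:
    """Return one logit per label: weights @ hidden + bias."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def classify_token(hidden: Vector, head1: Head, head2: Head,
                   labels1: Sequence[str], labels2: Sequence[str]) -> Tuple[str, str]:
    """Apply both heads to the same hidden vector and pick one label per level."""
    logits1 = linear(head1[0], head1[1], hidden)
    logits2 = linear(head2[0], head2[1], hidden)
    return labels1[argmax(logits1)], labels2[argmax(logits2)]


# Hypothetical toy setup: hidden size 2, two labels per level.
LEVEL1 = ["O", "PERSON"]
LEVEL2 = ["O", "family name"]
HEAD1: Head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
HEAD2: Head = ([[0.5, 0.0], [0.0, 2.0]], [0.0, 0.0])

prediction = classify_token([0.1, 0.9], HEAD1, HEAD2, LEVEL1, LEVEL2)
print(prediction)
```

In the real model the hidden vector would come from the mBERT encoder (size 768) and each head would cover its full label inventory; the point here is only that both predictions are derived from one shared representation.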
|
| ### Hierarchical NER |
|
|
| The model performs token classification at two levels simultaneously: |
|
|
| - **Level 1 (coarse-grained):** 19 entity categories (e.g. `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, `ADDRESS`). |
| - **Level 2 (fine-grained):** 117 entity subcategories (e.g. for a `PERSON`: `title`, `given name`, `family name`). |
|
|
| Example output structure: |
|
|
| ```json |
| { |
| "annotations": [ |
| { "content": "señor Connelly", "value": "PERSON" }, |
| { "content": "señor", "value": "title" }, |
| { "content": "Connelly", "value": "family name" } |
| ] |
| } |
| ``` |
|
|
| The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`. |
|
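For downstream processing it can be convenient to regroup the flat annotation list so that each fine-grained span sits under its coarse-grained entity. The sketch below does this for the example output above. It is only an illustration: the helper name `nest_annotations`, the `parts` key, and matching by substring are assumptions, and the set of level-1 labels is passed in by hand rather than read from `level1_tags_vocabulary.json`, whose exact format is not described here.

```python
from typing import Dict, List, Set


def nest_annotations(annotations: List[Dict[str, str]],
                     level1_labels: Set[str]) -> List[dict]:
    """Attach each fine-grained annotation to the first coarse entity whose text contains it."""
    # First pass: collect the coarse-grained (level 1) entities.
    entities = [
        {"content": ann["content"], "value": ann["value"], "parts": []}
        for ann in annotations
        if ann["value"] in level1_labels
    ]
    # Second pass: place every remaining annotation under a containing entity.
    for ann in annotations:
        if ann["value"] in level1_labels:
            continue
        for entity in entities:
            if ann["content"] in entity["content"]:
                entity["parts"].append(
                    {"content": ann["content"], "value": ann["value"]}
                )
                break
    return entities


example = {
    "annotations": [
        {"content": "señor Connelly", "value": "PERSON"},
        {"content": "señor", "value": "title"},
        {"content": "Connelly", "value": "family name"},
    ]
}

nested = nest_annotations(example["annotations"], {"PERSON"})
print(nested)
```

Substring matching is a simplification: if the toolkit output carries character offsets, matching on offsets would be more robust for texts where the same string occurs in several entities.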
|
| ## Intended Use |
|
|
| This model is intended to be used **as part of the MAPA toolkit**, which provides the full anonymisation pipeline (entity detection + entity replacement). |
|
|
| The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. Without that code, it **cannot be loaded directly** with `AutoModel.from_pretrained` from the `transformers` library. |
|
|
| To use this model, clone the MAPA toolkit: |
|
|
| 🔗 **Repository:** <https://github.com/PangeanicAI/MAPA-EU-Project> |
|
|
| The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo). |
|
|
| ## Training Details |
|
|
| - **Training corpus:** EUR-Lex (EU legal and administrative document database) |
| - **Epochs:** 124 |
| - **Iterations:** 109,375 |
| - **Base architecture:** BERT-base multilingual cased, with extended vocabulary |
| - **Hidden size:** 768 |
| - **Layers:** 12 |
| - **Attention heads:** 12 |
| - **Vocabulary size:** 119,547 |
|
|
| ## Evaluation Metrics |
|
|
| Final metrics reported at the end of training: |
|
|
| | Metric | Value | |
| |-----------------------|--------| |
| | Level 1 micro-F1 | 0.8374 | |
| | Level 1 binary-F1 | 0.8669 | |
| | Level 2 micro-F1 | 0.8467 | |
| | Final loss | 6.2376 | |
|
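For readers unfamiliar with the metric names, micro-F1 pools true positives, false positives, and false negatives over all entity classes before computing F1. The sketch below shows one token-level way to compute it. How MAPA actually computes these scores (token level or entity level) is not stated here, and treating "binary-F1" as the same score after collapsing every entity label into a single class is an assumption; the `O` tag and the toy sequences are also made up for the example.

```python
from typing import List, Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Token-level micro-averaged F1 over all labels except the outside tag."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1  # correct entity label
        else:
            if p != outside:
                fp += 1  # predicted an entity label that is wrong
            if g != outside:
                fn += 1  # missed (or mislabelled) a gold entity token
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def binary_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Same score after collapsing every entity label into one 'ENTITY' class."""
    def collapse(tags: Sequence[str]) -> List[str]:
        return [outside if t == outside else "ENTITY" for t in tags]
    return micro_f1(collapse(gold), collapse(pred), outside)


gold = ["PERSON", "PERSON", "O", "DATE", "O"]
pred = ["PERSON", "LOCATION", "O", "DATE", "DATE"]

print(round(micro_f1(gold, pred), 4))
print(round(binary_f1(gold, pred), 4))
```

Under this reading the binary score is never lower than the micro score on the same predictions, since a token tagged with the wrong entity class still counts as a correctly detected entity.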
|
| ## Limitations and Considerations |
|
|
| - **Not a standalone Hugging Face model:** the custom architecture can only be instantiated and run with the MAPA toolkit code. The inference widget on this page is intentionally disabled. |
| - **Domain bias:** the model was trained on EUR-Lex documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media). |
| - **Anonymisation is not guaranteed:** as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content. |
| - **No additional safety testing has been performed** in this HuggingFace release. The weights are published as they were produced during the original MAPA project. |
|
|
| ## Related Models |
|
|
| Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains. |
|
|
| ## Acknowledgements |
|
|
| The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065. |
|
|
| ## Citation |
|
|
| If you use this model, please cite the MAPA project: |
|
|
| ```bibtex |
| @inproceedings{mapa2022, |
| title = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents}, |
| author = {Gianola, Lucia and Ajausks, Ēmils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini}, |
| booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)}, |
| year = {2022} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| Read more: |
| - [MAPA project](https://pangeanic.com/use-cases/mapa) |
| - [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition) |
| - [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool) |
|
|
| For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/). |
|
|