---
language:
- multilingual
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: token-classification
inference: false
tags:
- ner
- pii-detection
- anonymization
- privacy
- mapa
- bert
- hierarchical-ner
---

# MAPA Multilingual NER Model (Administrative Domain)

This model is part of the **MAPA (Multilingual Anonymisation for Public Administrations)** toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme.

It performs **hierarchical Named Entity Recognition (NER)** for the detection of personally identifiable information (PII) in multilingual text, with the goal of supporting anonymisation workflows in public administrations.

## Model Description

- **Developed by:** Pangeanic, as part of the MAPA EU Project
- **Base model:** [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased), with extended vocabulary
- **Architecture:** `EnhancedTwoFlatLevelsSequenceLabellingModel` — a custom BERT-based architecture with two parallel classification heads
- **Languages:** Multilingual (24 EU official languages, via mBERT)
- **Domain:** Administrative
- **Training data:** EUR-Lex corpus
- **License:** Apache 2.0

### Hierarchical NER

The model performs token classification at two levels simultaneously:

- **Level 1 (coarse-grained):** 19 entity categories (e.g. `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, `ADDRESS`).
- **Level 2 (fine-grained):** 117 entity subcategories (e.g. for a `PERSON`: `title`, `given name`, `family name`).
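
As a rough illustration of what two parallel classification heads produce, the sketch below collapses two per-token tag sequences, one per level, into labelled spans. It assumes a BIO tagging scheme and whitespace-joined tokens; neither is confirmed by this card, and `decode_bio` is a hypothetical helper, not part of the MAPA toolkit.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, str]  # (text, label)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Span]:
    """Collapse a BIO tag sequence into (text, label) spans.

    Assumption: tags look like "B-PERSON", "I-PERSON" or "O".
    An "I-" tag that does not continue the open span starts a new one.
    """
    spans: List[Span] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # Split on the first hyphen only, so labels may contain hyphens.
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the span that was open, if there was one.
        if current_label is not None:
            spans.append((" ".join(current_tokens), current_label))
        if prefix in ("B", "I") and label:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_label is not None:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical per-token outputs of the two heads for one sentence.
tokens = ["El", "señor", "Connelly", "llegó"]
level1_tags = ["O", "B-PERSON", "I-PERSON", "O"]
level2_tags = ["O", "B-title", "B-family name", "O"]

level1_spans = decode_bio(tokens, level1_tags)
level2_spans = decode_bio(tokens, level2_tags)
print(level1_spans)
print(level2_spans)
```

Both sequences are aligned to the same tokens, which is why a single coarse span can be covered by several fine-grained spans.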

Example output structure:

```json
{
  "annotations": [
    { "content": "señor Connelly", "value": "PERSON" },
    { "content": "señor",         "value": "title" },
    { "content": "Connelly",      "value": "family name" }
  ]
}
```

The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`.
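
The flat `annotations` list shown above can be regrouped so that each level 1 entity carries its level 2 parts. The following is a minimal sketch under two assumptions: level 2 entries directly follow their level 1 parent in the list (as in the example), and the set of level 1 labels is known. `LEVEL1_LABELS` here is only the subset named in this card, and `nest_annotations` is a hypothetical helper rather than a MAPA function.

```python
from typing import Dict, List, Set, Tuple

# Assumed subset of the 19 level 1 labels; the full set lives in
# level1_tags_vocabulary.json, whose exact format is not described here.
LEVEL1_LABELS: Set[str] = {"PERSON", "ORGANISATION", "LOCATION", "DATE", "ADDRESS"}


def nest_annotations(
    annotations: List[Dict[str, str]], level1_labels: Set[str]
) -> Tuple[List[dict], List[Dict[str, str]]]:
    """Group level 2 annotations under the preceding level 1 annotation.

    Returns (nested, orphans). A level 2 entry becomes an orphan when there
    is no preceding level 1 entry whose text contains it.
    """
    nested: List[dict] = []
    orphans: List[Dict[str, str]] = []
    for ann in annotations:
        if ann["value"] in level1_labels:
            nested.append(
                {"content": ann["content"], "value": ann["value"], "children": []}
            )
        elif nested and ann["content"] in nested[-1]["content"]:
            nested[-1]["children"].append(dict(ann))
        else:
            orphans.append(dict(ann))
    return nested, orphans


output = {
    "annotations": [
        {"content": "señor Connelly", "value": "PERSON"},
        {"content": "señor", "value": "title"},
        {"content": "Connelly", "value": "family name"},
    ]
}

nested, orphans = nest_annotations(output["annotations"], LEVEL1_LABELS)
print(nested)
```

Substring containment is a simplification: the example output carries no character offsets, so repeated strings in a real document would need a more careful match.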

## Intended Use

This model is intended to be used **as part of the MAPA toolkit**, which provides the full anonymisation pipeline (entity detection + entity replacement).

The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. It is **not directly loadable** via `AutoModel.from_pretrained` from the `transformers` library without that code.

To use this model, clone the MAPA toolkit:

🔗 **Repository:** <https://github.com/PangeanicAI/MAPA-EU-Project>

The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this Hugging Face repository).

## Training Details

- **Training corpus:** EUR-Lex (EU legal and administrative document database)
- **Epochs:** 124
- **Iterations:** 109,375
- **Base architecture:** BERT-base multilingual cased, with extended vocabulary
- **Hidden size:** 768
- **Layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 119,547

## Evaluation Metrics

Metrics reported at the end of training:

| Metric                | Value  |
|-----------------------|--------|
| Level 1 micro-F1      | 0.8374 |
| Level 1 binary-F1     | 0.8669 |
| Level 2 micro-F1      | 0.8467 |
| Final loss            | 6.2376 |
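
This card does not say how the scores above were computed. For readers unfamiliar with the terms, the sketch below shows one common token-level reading: micro-F1 pools true positives, false positives, and false negatives across all entity labels, while the binary variant only asks whether a token was flagged as an entity at all. Both the token-level granularity and this interpretation of "binary-F1" are assumptions, and `token_f1` is a hypothetical helper.

```python
from typing import List


def token_f1(
    gold: List[str], pred: List[str], binary: bool = False, outside: str = "O"
) -> float:
    """Token-level micro-F1 over entity labels.

    With binary=True every non-outside label is mapped to a single
    "ENTITY" class before scoring (assumed meaning of binary-F1).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if binary:
            g = outside if g == outside else "ENTITY"
            p = outside if p == outside else "ENTITY"
        if p != outside and p == g:
            tp += 1
        else:
            if p != outside:
                fp += 1  # predicted an entity label that is wrong
            if g != outside:
                fn += 1  # missed or mislabelled a gold entity token
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Illustrative labels only; not taken from the MAPA evaluation data.
gold = ["O", "PERSON", "PERSON", "O", "DATE"]
pred = ["O", "PERSON", "ORGANISATION", "O", "DATE"]

micro = token_f1(gold, pred)
binary = token_f1(gold, pred, binary=True)
print(round(micro, 4), round(binary, 4))
```

Under this reading the binary score is never lower than the micro score on the same predictions, since a token detected with the wrong category still counts as detected. That matches the ordering of the two level 1 figures in the table.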

## Limitations and Considerations

- **Not a standalone Hugging Face model:** the custom architecture can only be instantiated and run with the MAPA toolkit code. The inference widget on this page is intentionally disabled.
- **Domain bias:** the model was trained on EUR-Lex documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media).
- **Anonymisation is not guaranteed:** as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
- **No additional safety testing has been performed** for this Hugging Face release. The weights are published as they were produced during the original MAPA project.

## Related Models

Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains.

## Acknowledgements

The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.

## Citation

If you use this model, please cite the MAPA project:

```bibtex
@inproceedings{mapa2022,
  title     = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
  author    = {Gianola, Lucia and Ajausks, \=Emils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
  booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
  year      = {2022}
}
```

## Contact

Read more:

- [MAPA project](https://pangeanic.com/use-cases/mapa)
- [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition)
- [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool)

For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/).