---
language:
- multilingual
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: token-classification
inference: false
tags:
- ner
- pii-detection
- anonymization
- privacy
- mapa
- bert
- hierarchical-ner
---
# MAPA Multilingual NER Model (Administrative Domain)
This model is part of the **MAPA (Multilingual Anonymisation for Public Administrations)** toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme.
It performs **hierarchical Named Entity Recognition (NER)** for the detection of personally identifiable information (PII) in multilingual text, with the goal of supporting anonymisation workflows in public administrations.
## Model Description
- **Developed by:** Pangeanic, as part of the MAPA EU Project
- **Base model:** [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased), with extended vocabulary
- **Architecture:** `EnhancedTwoFlatLevelsSequenceLabellingModel` — a custom BERT-based architecture with two parallel classification heads
- **Languages:** Multilingual (24 EU official languages, via mBERT)
- **Domain:** Administrative
- **Training data:** EUR-Lex corpus
- **License:** Apache 2.0
### Hierarchical NER
The model performs token classification at two levels simultaneously:
- **Level 1 (coarse-grained):** 19 entity categories, such as `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, and `ADDRESS`.
- **Level 2 (fine-grained):** 117 entity subcategories; for a `PERSON`, for example, `title`, `given name`, and `family name`.
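To make the idea of two simultaneous levels concrete, here is a minimal pure-Python sketch of two parallel classification heads reading the same per-token representation. It is not the `EnhancedTwoFlatLevelsSequenceLabellingModel` code from the MAPA codebase: the function names, the toy 2-dimensional hidden states, and the tiny weight matrices are all hypothetical and chosen only for illustration.

```python
def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def linear(vec, weights, bias):
    """Apply one linear layer: one output score per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def two_head_predict(hidden_states, head1, head2):
    """Label every token twice, once per head, from the same hidden state."""
    predictions = []
    for hidden in hidden_states:
        level1_id = argmax(linear(hidden, *head1))  # coarse-grained label id
        level2_id = argmax(linear(hidden, *head2))  # fine-grained label id
        predictions.append((level1_id, level2_id))
    return predictions

# Hypothetical toy setup: 2 tokens, hidden size 2 (the real model uses 768),
# 2 Level 1 labels and 3 Level 2 labels (the real model uses 19 and 117).
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
head1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
head2 = ([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.5, 0.0, 0.0])

predictions = two_head_predict(hidden_states, head1, head2)
print(predictions)
```

Each token receives a pair of label ids, one from each head, which is what allows a single pass to yield both the coarse and the fine annotation.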
Example output structure:
```json
{
"annotations": [
{ "content": "señor Connelly", "value": "PERSON" },
{ "content": "señor", "value": "title" },
{ "content": "Connelly", "value": "family name" }
]
}
```
The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`.
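The flat annotation list in the example output can be regrouped so that each Level 2 annotation sits under its Level 1 parent. The sketch below is one possible way to do this, not part of the MAPA toolkit: `group_annotations`, the `parts` key, and the containment rule are assumptions, and `LEVEL1_LABELS` holds only the few Level 1 labels named in this card rather than the full inventory.

```python
import json

# Assumption: a small subset of the Level 1 labels named in this card;
# the full inventory lives in level1_tags_vocabulary.json.
LEVEL1_LABELS = {"PERSON", "ORGANISATION", "LOCATION", "DATE", "ADDRESS"}

def group_annotations(annotations):
    """Nest each Level 2 annotation under the preceding Level 1 annotation
    whose text contains it (a simplifying assumption of this sketch)."""
    grouped = []
    for ann in annotations:
        if ann["value"] in LEVEL1_LABELS:
            grouped.append({"content": ann["content"], "value": ann["value"], "parts": []})
        elif grouped and ann["content"] in grouped[-1]["content"]:
            grouped[-1]["parts"].append({"content": ann["content"], "value": ann["value"]})
    return grouped

raw = """
{
  "annotations": [
    { "content": "señor Connelly", "value": "PERSON" },
    { "content": "señor", "value": "title" },
    { "content": "Connelly", "value": "family name" }
  ]
}
"""

result = group_annotations(json.loads(raw)["annotations"])
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Matching on text containment is fragile when the same string occurs more than once; a real pipeline would more likely rely on character or token offsets, which the example output above does not show.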
## Intended Use
This model is intended to be used **as part of the MAPA toolkit**, which provides the full anonymisation pipeline (entity detection + entity replacement).
The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. It is **not directly loadable** via `AutoModel.from_pretrained` from the `transformers` library without that code.
To use this model, clone the MAPA toolkit:
🔗 **Repository:** <https://github.com/PangeanicAI/MAPA-EU-Project>
The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).
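As a rough illustration of the replacement step, the sketch below masks detected Level 1 entities with a bracketed label. This is a simplified stand-in, not the MAPA entity replacement module (which relies on auxiliary resources): `mask_entities`, the sample sentence, and the placeholder format are all hypothetical.

```python
def mask_entities(text, annotations, level1_labels):
    """Replace every detected Level 1 entity with a bracketed label placeholder."""
    masked = text
    # Replace longer spans first so a short span never clobbers part of a longer one.
    for ann in sorted(annotations, key=lambda a: len(a["content"]), reverse=True):
        if ann["value"] in level1_labels:
            masked = masked.replace(ann["content"], f"[{ann['value']}]")
    return masked

# Hypothetical input sentence; the annotations follow the output structure
# shown earlier in this card.
text = "El señor Connelly firmó el documento."
annotations = [
    {"content": "señor Connelly", "value": "PERSON"},
    {"content": "señor", "value": "title"},
    {"content": "Connelly", "value": "family name"},
]

masked = mask_entities(text, annotations, {"PERSON"})
print(masked)
```

Plain string replacement masks every occurrence of the entity text, which is usually the safer behaviour for anonymisation but can over-mask common words; offset-based replacement would avoid that.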
## Training Details
- **Training corpus:** EUR-Lex (EU legal and administrative document database)
- **Epochs:** 124
- **Iterations:** 109,375
- **Base architecture:** BERT-base multilingual cased, with extended vocabulary
- **Hidden size:** 768
- **Layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 119,547
## Evaluation Metrics
Final metrics reported at the end of training:
| Metric | Value |
|-----------------------|--------|
| Level 1 micro-F1 | 0.8374 |
| Level 1 binary-F1 | 0.8669 |
| Level 2 micro-F1 | 0.8467 |
| Final loss | 6.2376 |
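For readers unfamiliar with micro-averaging, the sketch below shows how a micro-F1 score is typically derived by pooling true positives, false positives, and false negatives across all labels before computing F1. This card does not document the exact evaluation script, so the function name and the per-label counts here are hypothetical and are not the counts behind the table above.

```python
def micro_f1(counts):
    """Compute micro-averaged F1 from per-label (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denominator = 2 * tp + fp + fn
    # With no predictions and no gold entities, F1 is defined here as 0.0.
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical counts: label -> (true positives, false positives, false negatives).
toy_counts = {
    "PERSON": (8, 2, 2),
    "DATE": (4, 0, 2),
}

score = micro_f1(toy_counts)
print(round(score, 4))
```

Because counts are pooled before the score is computed, frequent labels weigh more heavily in micro-F1 than rare ones.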
## Limitations and Considerations
- **Not a standalone HuggingFace model:** the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
- **Domain bias:** the model was trained on EUR-Lex documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media).
- **Anonymisation is not guaranteed:** as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
- **No additional safety testing has been performed** in this HuggingFace release. The weights are published as they were produced during the original MAPA project.
## Related Models
Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains.
## Acknowledgements
The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.
## Citation
If you use this model, please cite the MAPA project:
```bibtex
@inproceedings{mapa2022,
title = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
author = {Gianola, Lucia and Ajausks, \=Emils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
year = {2022}
}
```
## Contact
Read more:
- [MAPA project](https://pangeanic.com/use-cases/mapa)
- [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition)
- [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool)
For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/).