MAPA English NER Model (Administrative Domain)
This model is part of the MAPA (Multilingual Anonymisation for Public Administrations) toolkit, developed by Pangeanic and funded by the European Union through the Connecting Europe Facility (CEF) programme.
It performs hierarchical Named Entity Recognition (NER) for the detection of personally identifiable information (PII) in English administrative text, with the goal of supporting anonymisation workflows in public administrations.
Model Description
- Developed by: Pangeanic, as part of the MAPA EU Project
- Base model: google-bert/bert-base-multilingual-cased (mBERT cased)
- Architecture: EnhancedTwoFlatLevelsSequenceLabellingModel, a custom BERT-based architecture with two parallel classification heads
- Language: English (en)
- Domain: Administrative
- Training data: English administrative documents
- License: Apache 2.0
Hierarchical NER
The model performs token classification at two levels simultaneously:
- Level 1 (coarse-grained): 19 entity categories (e.g. PERSON, ORGANISATION, LOCATION, DATE, ADDRESS...).
- Level 2 (fine-grained): 117 entity subcategories (e.g. for a PERSON: title, given name, family name...).
Example output structure:
{
"annotations": [
{ "content": "Mr Connelly", "value": "PERSON" },
{ "content": "Mr", "value": "title" },
{ "content": "Connelly", "value": "family name" }
]
}
The full label inventories are included in this repository as level1_tags_vocabulary.json and level2_tags_vocabulary.json.
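The flat annotation list in the example output can be regrouped so that each fine-grained annotation sits under its coarse-grained entity. The following is a minimal sketch, not part of the MAPA toolkit: the function name nest_annotations and the "parts" key are hypothetical, and it assumes that level 1 labels are upper-case and level 2 labels are lower-case, as in the example output.

```python
from typing import Dict, List


def nest_annotations(annotations: List[Dict[str, str]]) -> List[Dict]:
    """Group level 2 annotations under the level 1 entity that contains them.

    Assumption: level 1 labels are upper-case (e.g. "PERSON") and level 2
    labels are lower-case (e.g. "family name"), as in the example output.
    """
    nested: List[Dict] = []
    for ann in annotations:
        if ann["value"].isupper():
            # A level 1 entity opens a new group.
            nested.append(
                {"content": ann["content"], "value": ann["value"], "parts": []}
            )
        elif nested and ann["content"] in nested[-1]["content"]:
            # A level 2 annotation is attached to the most recent level 1
            # entity when that entity's text contains it.
            nested[-1]["parts"].append(ann)
        # Level 2 annotations with no enclosing level 1 entity are ignored
        # in this sketch.
    return nested


example = {
    "annotations": [
        {"content": "Mr Connelly", "value": "PERSON"},
        {"content": "Mr", "value": "title"},
        {"content": "Connelly", "value": "family name"},
    ]
}

result = nest_annotations(example["annotations"])
for entity in result:
    print(entity["value"], "->", [part["value"] for part in entity["parts"]])
```

A production version would match on character offsets rather than on substring containment, if the toolkit output exposes them.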
Intended Use
This model is intended to be used as part of the MAPA toolkit, which provides the full anonymisation pipeline (entity detection + entity replacement).
The model uses a custom architecture (EnhancedTwoFlatLevelsSequenceLabellingModel) defined in the MAPA codebase. It is not directly loadable via AutoModel.from_pretrained from the transformers library without that code.
To use this model, clone the MAPA toolkit:
🔗 Repository: https://gitlab.com/MAPA-EU-Project/mapa_project
The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).
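To illustrate what the replacement step does conceptually, the sketch below substitutes each detected level 1 entity with a placeholder tag. It is a simplified stand-in and not the MAPA entity replacement module: the function name redact is hypothetical, the input follows the example output structure shown earlier, and level 1 labels are assumed to be upper-case.

```python
from typing import Dict, List


def redact(text: str, annotations: List[Dict[str, str]]) -> str:
    """Replace the text of each level 1 entity with a "[LABEL]" placeholder.

    Assumption: level 1 labels are upper-case; lower-case (level 2)
    annotations are skipped because their text is already covered by the
    enclosing level 1 entity.
    """
    redacted = text
    for ann in annotations:
        if ann["value"].isupper():
            # Naive string replacement: every occurrence of the entity
            # text is replaced, not only the detected one.
            redacted = redacted.replace(ann["content"], f"[{ann['value']}]")
    return redacted


annotations = [
    {"content": "Mr Connelly", "value": "PERSON"},
    {"content": "Mr", "value": "title"},
    {"content": "Connelly", "value": "family name"},
]

print(redact("Mr Connelly filed the request.", annotations))
```

Placeholder substitution is only one possible strategy; the MAPA replacement module relies on auxiliary resources and should be used for real anonymisation workflows.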
Training Details
- Training domain: English administrative documents
- Epochs: 116
- Iterations: 2,223
- Random seed: 42
- Base architecture: BERT-base multilingual cased
- Hidden size: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 119,547
Evaluation Metrics
Final metrics reported at the end of training:
| Metric | Value |
|---|---|
| Level 1 micro-F1 | 0.8406 |
| Level 1 binary-F1 | 0.9351 |
| Level 2 micro-F1 | 0.8464 |
| Final loss | 0.4631 |
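This card does not define how the F1 scores are computed. As a rough illustration only, the sketch below assumes token-level scoring, with micro-F1 counting a token as correct only when its exact label matches, and binary-F1 first collapsing every entity label into a single ENTITY class. Under those assumptions binary-F1 is never lower than micro-F1, which is consistent with the gap between the two Level 1 figures. The function names, the "O" outside label, and the toy labels are all hypothetical.

```python
from typing import List


def micro_f1(gold: List[str], pred: List[str], outside: str = "O") -> float:
    """Token-level micro-averaged F1 over all non-outside labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1  # entity token predicted with the correct label
        else:
            if p != outside:
                fp += 1  # predicted an entity label that is wrong
            if g != outside:
                fn += 1  # gold entity token missed or mislabelled
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def binary_f1(gold: List[str], pred: List[str], outside: str = "O") -> float:
    """F1 after collapsing every entity label into one ENTITY class."""
    def collapse(labels: List[str]) -> List[str]:
        return [outside if label == outside else "ENTITY" for label in labels]

    return micro_f1(collapse(gold), collapse(pred), outside)


# Toy example: one mislabelled entity token and one spurious entity token.
gold = ["PERSON", "PERSON", "O", "DATE", "O"]
pred = ["PERSON", "ORGANISATION", "O", "DATE", "DATE"]

print(f"micro-F1:  {micro_f1(gold, pred):.4f}")
print(f"binary-F1: {binary_f1(gold, pred):.4f}")
```

In the toy example the mislabelled PERSON token counts as an error for micro-F1 but as a hit for binary-F1, since it was still detected as an entity.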
Limitations and Considerations
- Not a standalone HuggingFace model: the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
- Domain bias: the model was trained on English administrative documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media) or on English variants underrepresented in the training data.
- Anonymisation is not guaranteed: as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
- No additional safety testing has been performed in this HuggingFace release. The weights are published as they were produced during the original MAPA project.
Related Models
Other models from the MAPA project are available under the Pangeanic organisation, covering additional languages and domains.
Acknowledgements
The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.
Citation
If you use this model, please cite the MAPA project:
@inproceedings{mapa2022,
title = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
author = {Gianola, Lucia and Ajausks, \=Emils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
year = {2022}
}
Contact
For questions about the MAPA toolkit, please refer to the project repository or contact Pangeanic.