MAPA English NER Model (Administrative Domain)

This model is part of the MAPA (Multilingual Anonymisation for Public Administrations) toolkit, developed by Pangeanic and funded by the European Union through the Connecting Europe Facility (CEF) programme.

It performs hierarchical Named Entity Recognition (NER) for the detection of personally identifiable information (PII) in English administrative text, with the goal of supporting anonymisation workflows in public administrations.

Model Description

Developed by: Pangeanic, as part of the MAPA EU Project
Base model: google-bert/bert-base-multilingual-cased (mBERT cased)
Architecture: EnhancedTwoFlatLevelsSequenceLabellingModel — a custom BERT-based architecture with two parallel classification heads
Language: English (en)
Domain: Administrative
Training data: English administrative documents
License: Apache 2.0

Hierarchical NER

The model performs token classification at two levels simultaneously:

Level 1 (coarse-grained): 19 entity categories (e.g. PERSON, ORGANISATION, LOCATION, DATE, ADDRESS).
Level 2 (fine-grained): 117 entity subcategories (e.g. for a PERSON: title, given name, family name).
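
As a rough illustration of what "two levels simultaneously" means at decoding time, the sketch below takes per-token scores from two independent heads and picks the best label at each level. It is not the MAPA implementation: the function names, the tiny label lists, the "O" (no entity) label and the hard-coded scores are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_two_levels(
    level1_scores: Sequence[Sequence[float]],
    level2_scores: Sequence[Sequence[float]],
    level1_labels: Sequence[str],
    level2_labels: Sequence[str],
) -> List[Tuple[str, str]]:
    """Pick one coarse and one fine label per token, each head decoded independently."""
    if len(level1_scores) != len(level2_scores):
        raise ValueError("both heads must score the same number of tokens")
    return [
        (level1_labels[argmax(s1)], level2_labels[argmax(s2)])
        for s1, s2 in zip(level1_scores, level2_scores)
    ]


# Hypothetical toy vocabularies; the real ones have 19 and 117 entries.
LEVEL1 = ["O", "PERSON"]
LEVEL2 = ["O", "title", "family name"]

tokens = ["Mr", "Connelly", "attended"]
# Made-up scores standing in for the outputs of the two classification heads.
level1_scores = [[0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
level2_scores = [[0.1, 0.8, 0.1], [0.05, 0.15, 0.8], [0.9, 0.05, 0.05]]

decoded = decode_two_levels(level1_scores, level2_scores, LEVEL1, LEVEL2)
for token, (coarse, fine) in zip(tokens, decoded):
    print(f"{token}\t{coarse}\t{fine}")
```

A real decoder may also use a tagging scheme such as BIO or enforce consistency between the two levels; the sketch leaves both out.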

Example output structure:

{
  "annotations": [
    { "content": "Mr Connelly", "value": "PERSON" },
    { "content": "Mr",          "value": "title" },
    { "content": "Connelly",    "value": "family name" }
  ]
}

The full label inventories are included in this repository as level1_tags_vocabulary.json and level2_tags_vocabulary.json.
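
The flat annotation list shown above can be regrouped so that each Level 1 entity carries its Level 2 parts. The sketch below is one possible post-processing step, not part of the MAPA toolkit. It assumes that fine-grained annotations follow their coarse-grained parent in the list, and it matches them by substring because the example output carries no character offsets. The function name nest_annotations and the "parts" key are hypothetical. The set of Level 1 labels is passed in by the caller (it could be read from level1_tags_vocabulary.json, whose exact layout is not assumed here).

```python
from typing import Dict, Iterable, List


def nest_annotations(
    annotations: Iterable[Dict[str, str]],
    level1_labels: Iterable[str],
) -> List[dict]:
    """Group fine-grained annotations under the preceding coarse-grained entity."""
    coarse = set(level1_labels)
    entities: List[dict] = []
    for ann in annotations:
        if ann["value"] in coarse:
            # A Level 1 label opens a new entity.
            entities.append(
                {"content": ann["content"], "value": ann["value"], "parts": []}
            )
        elif entities and ann["content"] in entities[-1]["content"]:
            # A Level 2 label whose text lies inside the latest entity.
            entities[-1]["parts"].append(dict(ann))
        # Fine-grained annotations without a matching parent are skipped here.
    return entities


output = {
    "annotations": [
        {"content": "Mr Connelly", "value": "PERSON"},
        {"content": "Mr", "value": "title"},
        {"content": "Connelly", "value": "family name"},
    ]
}

# Only the label needed for this example; the full inventory has 19 entries.
nested = nest_annotations(output["annotations"], level1_labels={"PERSON"})
print(nested)
```

Substring matching is fragile when the same text occurs in several entities, so a production pipeline would more likely rely on token or character offsets.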

Intended Use

This model is intended to be used as part of the MAPA toolkit, which provides the full anonymisation pipeline (entity detection + entity replacement).

The model uses a custom architecture (EnhancedTwoFlatLevelsSequenceLabellingModel) defined in the MAPA codebase, so it cannot be loaded with AutoModel.from_pretrained from the transformers library unless that code is available.

To use this model, clone the MAPA toolkit:

🔗 Repository: https://github.com/PangeanicAI/MAPA-EU-Project

The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).

Training Details

Training domain: English administrative documents
Epochs: 116
Iterations: 2,223
Random seed: 42
Base architecture: BERT-base multilingual cased
Hidden size: 768
Layers: 12
Attention heads: 12
Vocabulary size: 119,547

Evaluation Metrics

Metrics reported at the end of training:

Metric	Value
Level 1 micro-F1	0.8406
Level 1 binary-F1	0.9351
Level 2 micro-F1	0.8464
Final loss	0.4631
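
This card does not define how the F1 variants are computed. One common reading, assumed here, is that micro-F1 pools true positives, predicted positives and gold positives across all entity labels, while binary-F1 only asks whether a token was flagged as an entity at all, regardless of its type. The sketch below works at token level with a hypothetical "O" label for non-entity tokens; the MAPA evaluation code may differ, for instance by scoring whole spans.

```python
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Token-level micro-F1 over every label except the outside label."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


def binary_f1(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """F1 for entity vs. non-entity detection, ignoring the entity type."""
    def collapse(labels: Sequence[str]) -> list:
        return [outside if label == outside else "ENTITY" for label in labels]

    return micro_f1(collapse(gold), collapse(pred), outside)


# Toy labels, invented for the example.
gold = ["PERSON", "PERSON", "O", "DATE", "O"]
pred = ["PERSON", "ORGANISATION", "O", "DATE", "DATE"]

print(f"micro-F1:  {micro_f1(gold, pred):.4f}")   # 4/7
print(f"binary-F1: {binary_f1(gold, pred):.4f}")  # 6/7
```

Under this reading, a token tagged ORGANISATION instead of PERSON counts as an error for micro-F1 but as a hit for binary-F1, which would be consistent with the Level 1 binary score above being higher than the Level 1 micro score.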

Limitations and Considerations

Not a standalone Hugging Face model: the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
Domain bias: the model was trained on English administrative documents and may underperform on text from other domains (e.g. clinical notes, informal communication, social media) or on English variants underrepresented in the training data.
Anonymisation is not guaranteed: as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
No additional safety testing: none has been performed for this Hugging Face release. The weights are published as they were produced during the original MAPA project.

Related Models

Other models from the MAPA project are available under the Pangeanic organisation, covering additional languages and domains.

Acknowledgements

The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.

Citation

If you use this model, please cite the MAPA project:

@inproceedings{mapa2022,
  title     = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
  author    = {Gianola, Lucia and Ajausks, Ēmils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
  booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
  year      = {2022}
}

Contact
