---
language:
- mt
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
pipeline_tag: token-classification
inference: false
tags:
- ner
- pii-detection
- anonymization
- privacy
- mapa
- bert
- hierarchical-ner
- maltese
- administrative
---

# MAPA Maltese NER Model (Administrative Domain)

This model is part of the **MAPA (Multilingual Anonymisation for Public Administrations)** toolkit, developed by [Pangeanic](https://pangeanic.com/) and funded by the European Union through the Connecting Europe Facility (CEF) programme.

It performs **hierarchical Named Entity Recognition (NER)** to detect personally identifiable information (PII) in Maltese administrative text, with the goal of supporting anonymisation workflows in public administrations.

## Model Description

- **Developed by:** Pangeanic, as part of the MAPA EU Project
- **Base model:** [`google-bert/bert-base-multilingual-cased`](https://huggingface.co/google-bert/bert-base-multilingual-cased) (mBERT cased)
- **Architecture:** `EnhancedTwoFlatLevelsSequenceLabellingModel`, a custom BERT-based architecture with two parallel classification heads
- **Language:** Maltese (mt)
- **Domain:** Administrative
- **Training data:** Maltese administrative documents
- **License:** Apache 2.0

### Hierarchical NER

The model performs token classification at two levels simultaneously:

- **Level 1 (coarse-grained):** 19 entity categories (e.g. `PERSON`, `ORGANISATION`, `LOCATION`, `DATE`, `ADDRESS`...).
- **Level 2 (fine-grained):** 117 entity subcategories (e.g. for a `PERSON`: `title`, `given name`, `family name`...).

Example output structure:

```json
{
  "annotations": [
    { "content": "Sur Connelly", "value": "PERSON" },
    { "content": "Sur", "value": "title" },
    { "content": "Connelly", "value": "family name" }
  ]
}
```

The full label inventories are included in this repository as `level1_tags_vocabulary.json` and `level2_tags_vocabulary.json`.

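The flat annotation list shown above can be regrouped so that each Level 2 span sits under its Level 1 entity. The following is a minimal sketch, not part of the MAPA toolkit: `group_annotations`, `LEVEL1_TAGS`, and `EXAMPLE_OUTPUT` are hypothetical names, the Level 1 set contains only the five labels quoted in this card, and the code assumes that Level 2 annotations follow their Level 1 parent in the list, as they do in the example.

```python
from typing import Any, Dict, List, Set

# Assumption: only the Level 1 labels quoted in this card. The full inventory
# of 19 categories is in level1_tags_vocabulary.json.
LEVEL1_TAGS: Set[str] = {"PERSON", "ORGANISATION", "LOCATION", "DATE", "ADDRESS"}

# The example output structure from this card.
EXAMPLE_OUTPUT: Dict[str, Any] = {
    "annotations": [
        {"content": "Sur Connelly", "value": "PERSON"},
        {"content": "Sur", "value": "title"},
        {"content": "Connelly", "value": "family name"},
    ]
}


def group_annotations(
    annotations: List[Dict[str, str]],
    level1_tags: Set[str] = LEVEL1_TAGS,
) -> List[Dict[str, Any]]:
    """Nest each Level 2 annotation under the most recent Level 1 annotation.

    Assumption: Level 2 annotations appear directly after their Level 1
    parent and their text is contained in the parent's text.
    """
    entities: List[Dict[str, Any]] = []
    for ann in annotations:
        if ann["value"] in level1_tags:
            # Start a new coarse-grained entity.
            entities.append(
                {"content": ann["content"], "value": ann["value"], "parts": []}
            )
        elif entities and ann["content"] in entities[-1]["content"]:
            # Attach the fine-grained span to the current entity.
            entities[-1]["parts"].append(
                {"content": ann["content"], "value": ann["value"]}
            )
        # Any other annotation has no matching parent and is skipped here.
    return entities


grouped = group_annotations(EXAMPLE_OUTPUT["annotations"])
for entity in grouped:
    print(entity["value"], "->", [(p["value"], p["content"]) for p in entity["parts"]])
```

A nested view like this may be convenient when a replacement step needs to treat a whole `PERSON` span as one unit while still knowing which part is the title and which is the family name.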
## Intended Use

This model is intended to be used **as part of the MAPA toolkit**, which provides the full anonymisation pipeline (entity detection + entity replacement).

The model uses a custom architecture (`EnhancedTwoFlatLevelsSequenceLabellingModel`) defined in the MAPA codebase. It is **not directly loadable** via `AutoModel.from_pretrained` from the `transformers` library without that code.

To use this model, clone the MAPA toolkit:

🔗 **Repository:** [https://gitlab.com/MAPA-EU-Project/mapa_project](https://gitlab.com/MAPA-EU-Project/mapa_project)

The repository contains the model class definition, inference scripts, and the complete anonymisation pipeline (including the entity replacement module, which uses auxiliary resources not included in this HF repo).

## Training Details

- **Training domain:** Maltese administrative documents
- **Epochs:** 39
- **Iterations:** 3,760
- **Random seed:** 42
- **Base architecture:** BERT-base multilingual cased
- **Hidden size:** 768
- **Layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 119,547

## Evaluation Metrics

Final metrics reported at the end of training:

| Metric            | Value  |
|-------------------|--------|
| Level 1 micro-F1  | 0.9429 |
| Level 1 binary-F1 | 0.9532 |
| Level 2 micro-F1  | 0.9243 |
| Final loss        | 0.2235 |

## Limitations and Considerations

- **Not a standalone HuggingFace model:** the custom architecture requires the MAPA toolkit code to be instantiated and used. The inference widget on this page is intentionally disabled.
- **Domain bias:** the model was trained on Maltese administrative documents and may underperform on text from different domains (e.g. clinical notes, informal communication, social media) or on Maltese variants underrepresented in the training data.
- **Low-resource language considerations:** Maltese is comparatively underrepresented in the pretraining data of mBERT. While the fine-tuning corpus improves task-specific performance, the underlying language representations may be less robust than for higher-resource languages in the MAPA collection.
- **Anonymisation is not guaranteed:** as with any NER-based system, false negatives are possible. Outputs should be reviewed before publishing or sharing sensitive content.
- **No additional safety testing has been performed** in this HuggingFace release. The weights are published as they were produced during the original MAPA project.

## Related Models

Other models from the MAPA project are available under the [Pangeanic](https://huggingface.co/Pangeanic) organisation, covering additional languages and domains.

## Acknowledgements

The MAPA project was funded by the European Union under the Connecting Europe Facility (CEF) programme, grant agreement INEA/CEF/ICT/A2019/1927065.

## Citation

If you use this model, please cite the MAPA project:

```bibtex
@inproceedings{mapa2022,
  title     = {{MAPA} Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents},
  author    = {Gianola, Lucia and Ajausks, \=Emils and Arranz, Victoria and Bendi, Chomicha and Choukri, Khalid and Ciulla, Montse and Coheur, Luísa and Costa, Costanza and Cruz, Elena and Esplà-Gomis, Miquel and Garcia-Martinez, Mercedes and Herranz, Manuel and Iranzo-Sánchez, Javier and Klūga, Mārcis and Labaka, Gorka and Lagzdiņš, Artūrs and Lazar, Alina and Mahdi, Mohammed and Otero, Carla and Pinnis, Mārcis and Rigau, German and Ryšavá, Klára and Saint-Dizier, Patrick and Sosoni, Vilelmini},
  booktitle = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)},
  year      = {2022}
}
```

## Contact

Read more:

- [MAPA project](https://pangeanic.com/use-cases/mapa)
- [Named Entity Recognition services](https://pangeanic.com/nlp-solutions/named-entity-recognition)
- [Data Masking tools](https://pangeanic.com/nlp-solutions/data-masking/tool)

For questions about the MAPA toolkit, please refer to the [project repository](https://gitlab.com/MAPA-EU-Project/mapa_project) or contact [Pangeanic](https://pangeanic.com/).