---
license: apache-2.0
datasets:
- grammarly/pseudonymization-data
language:
- en
metrics:
- f1
- bleu
pipeline_tag: text2text-generation
---

# Model Card for Model ID

This repository contains files for two transformer-based Seq2Seq models used in our paper: https://arxiv.org/abs/2306.05561.

## Model Details

### Model Description

- **Developed by:** Oleksandr Yermilov, Vipul Raheja, Artem Chernodub
- **Model type:** Seq2Seq
- **Language (NLP):** English
- **License:** Apache license 2.0
- **Finetuned from model:** BART

### Model Sources

- **Paper:** https://arxiv.org/abs/2306.05561

## Uses

These models can be used for anonymizing English-language datasets.

## Bias, Risks, and Limitations

Please check the Limitations section in our paper.

## Training Details

### Training Data

https://huggingface.co/datasets/grammarly/pseudonymization-data/tree/main/seq2seq

### Training Procedure

1. Gather text data from Wikipedia.
2. Preprocess it using NER-based pseudonymization.
3. Fine-tune a BART model on a translation task that maps "original" text to "pseudonymized" text.

#### Training Hyperparameters

We train the models for 3 epochs using the `AdamW` optimizer with a learning rate of α = 2·10⁻⁵ and a batch size of 8.

## Evaluation

### Factors & Metrics

#### Factors

There is no ground truth for named entities in the data on which this model was trained. We check whether a word is a named entity using one of the NER systems (spaCy or FLAIR).

#### Metrics

We measure how much of the text is changed by our model. Specifically, we assign each word of the translated text to one of the following categories:

1. True positive (TP): a named entity that was changed to another named entity.
2. True negative (TN): a word that is not a named entity and was not changed.
3. False positive (FP): a word that is not a named entity and was changed to another word.
4. False negative (FN): a named entity that was not changed to another named entity.

We calculate the F₁ score from these counts.

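
The word-by-word bookkeeping above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: it assumes the original and pseudonymized texts are already split into the same number of words, and that the named-entity flags (which would come from spaCy or FLAIR) are supplied as precomputed booleans. The function names `count_categories` and `f1_score` are hypothetical, and F₁ is computed with the standard formula 2·TP / (2·TP + FP + FN).

```python
from typing import Dict, Sequence


def count_categories(
    source_words: Sequence[str],
    output_words: Sequence[str],
    source_is_entity: Sequence[bool],
    output_is_entity: Sequence[bool],
) -> Dict[str, int]:
    """Count TP/TN/FP/FN word by word (assumes a one-to-one word alignment)."""
    lengths = {
        len(source_words),
        len(output_words),
        len(source_is_entity),
        len(output_is_entity),
    }
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for src, out, src_ent, out_ent in zip(
        source_words, output_words, source_is_entity, output_is_entity
    ):
        changed = src != out
        if src_ent:
            # A named entity counts as TP only if it became a different named entity.
            counts["TP" if changed and out_ent else "FN"] += 1
        else:
            # A non-entity word should be left untouched.
            counts["FP" if changed else "TN"] += 1
    return counts


def f1_score(counts: Dict[str, int]) -> float:
    """Standard F1 from the counts; returns 0.0 when the denominator is zero."""
    denominator = 2 * counts["TP"] + counts["FP"] + counts["FN"]
    return 2 * counts["TP"] / denominator if denominator else 0.0


# Toy example with hand-written entity flags (assumed, not produced by a real NER system).
source = "John Smith lives in Paris".split()
output = "Mark Smith lives at Berlin".split()
entity_flags = [True, True, False, False, True]

example_counts = count_categories(source, output, entity_flags, entity_flags)
print(example_counts)
print(f"F1 = {f1_score(example_counts):.3f}")
```

In this toy example, "Smith" is a named entity that was left unchanged (FN) and "in" is an ordinary word that was altered (FP), so both kinds of error lower the score.
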
## Citation

**BibTeX:**

```
@misc{yermilov2023privacy,
      title={Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization},
      author={Oleksandr Yermilov and Vipul Raheja and Artem Chernodub},
      year={2023},
      eprint={2306.05561},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Contact

Oleksandr Yermilov (oleksandr.yermilov@ucu.edu.ua).