MITIE model for Polish

This model is a product of our work on the article "Enhancing Privacy While Preserving Context in Text Transformations by Large Language Models". We created an embedding model for the Polish language based on 45 GB of Universal Dependencies data for Polish.

Usage

The prepared model can be used with the MITIE library [1] for Named Entity Recognition (NER) tasks. If you want to get familiar with MITIE, you can find more information in the MITIE repository, which also contains usage examples. The model files are available in the root of this repository.
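As a rough sketch, loading the model with MITIE's Python bindings might look like the following. Note that `model_path` and the sample text are placeholders (the actual model filename is whatever is in this repository's root), and running the NER step requires `pip install mitie` plus the downloaded model file; only the small formatting helper is pure Python.

```python
def format_entities(tokens, entities):
    """Render MITIE entity tuples (token range, tag, score) as readable strings."""
    lines = []
    for rng, tag, score in entities:
        text = " ".join(tokens[i] for i in rng)
        lines.append(f"{tag}: {text} (score {score:.2f})")
    return lines


def run_ner(model_path, text):
    """Run MITIE NER over `text` using the model at `model_path` (requires mitie)."""
    from mitie import named_entity_extractor, tokenize

    ner = named_entity_extractor(model_path)
    # MITIE's tokenizer may return byte strings; decode them for display.
    tokens = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokenize(text)]
    entities = ner.extract_entities(tokens)
    return format_entities(tokens, entities)


# Example invocation (paths and text are illustrative):
# print(run_ner("ner_model.dat", "Jan Kowalski mieszka w Warszawie."))
```

The `extract_entities` call returns `(range, tag, score)` tuples, where `range` indexes into the token list; the helper above just joins those tokens back into surface strings.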

Citing the article

@Article{info16010049,
  AUTHOR = {Żarski, Tymon Lesław and Janicki, Artur},
  TITLE = {Enhancing Privacy While Preserving Context in Text Transformations by Large Language Models},
  JOURNAL = {Information},
  VOLUME = {16},
  YEAR = {2025},
  NUMBER = {1},
  ARTICLE-NUMBER = {49},
  URL = {https://www.mdpi.com/2078-2489/16/1/49},
  ISSN = {2078-2489},
  ABSTRACT = {Data security is a critical concern for Internet users, primarily as more people rely on social networks and online tools daily. Despite the convenience, many users are unaware of the risks posed to their sensitive and personal data. This study addresses this issue by presenting a comprehensive solution to prevent personal data leakage using online tools. We developed a conceptual solution that enhances user privacy by identifying and anonymizing named entity classes representing sensitive data while maintaining the original context by swapping source entities for functional data. Our approach utilizes natural language processing methods, combining machine learning tools such as MITIE and spaCy with rule-based text analysis. We employed regular expressions and large language models to anonymize text, preserving its context for further processing or enabling restoration to the original form after transformations. The results demonstrate the effectiveness of our custom-trained models, achieving an F1 score of 0.8292. Additionally, the proposed algorithms successfully preserved context in approximately 93.23% of test cases, indicating a promising solution for secure data handling in online environments.},
  DOI = {10.3390/info16010049}
}

References

[1] Davis E. King. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10, pp. 1755-1758, 2009.
