---
library_name: transformers
pipeline_tag: text-generation
---

# Model Card

This model is used for PII detection. It covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, French).

## Model Details

The model is built on top of a decoder transformer model (Qwen2-0.5B). We trained it on publicly available datasets with permissive licenses as well as synthetic data.

### Model Description

- **Developed by:** Betterdata.ai
- **Model type:** Decoder Transformer
- **License:** apache-2.0
- **Finetuned from model:** Qwen2-0.5B

## Uses

With the advent of ChatGPT, professionals and organizations use public chat interfaces for a wide range of applications. This often leads to leakage of PII, which raises privacy issues as users enter names, dates, or even API keys to give the model better context. With PII class tags, this confidential information can be masked out, enabling downstream models to understand the context without sensitive data leaving the server. Developers and teams building applications on third-party APIs can also use these models for better privacy. The image below illustrates this:

![Use Case](pii_image.png)

A PII model should add little latency and be able to take in long documents, so we chose Qwen 0.5B as the base model. Another consideration for the model size was that a model for privacy should be easy to run even on CPUs. We do have larger models in house with better performance. Our roadmap includes coverage for Southeast Asian languages and the ability for users to define custom classes. We will constantly be improving this model, so always pull the latest version.

### Out-of-Scope Use

1. The model currently replaces PII text with the respective class tags; in the future we plan to replace the data with synthetic class values instead.

## Bias, Risks, and Limitations

1.
The model may not always be 100% accurate, but we are working on it.
2. In an informal "vibe test" with out-of-distribution data, the model does very well on classes such as names, emails, and phone numbers. It can do better on classes such as API keys, credit card CVV numbers, and bank account numbers; we are creating more data for these classes to further improve performance, so you can expect improvement in the next iteration.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")

# List of the 29 PII class tags (truncated in the source)
classes_list = [...]
```
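The masking workflow described in the Uses section can be sketched as follows. This is a minimal illustration only: the span/label pairs and tag names below are hypothetical, hard-coded stand-ins, since this card does not show the model's exact output format.

```python
# Hypothetical detections for illustration; in practice these would come
# from the PII detection model's output.
text = "Contact Jane Doe at jane.doe@example.com before 2024-05-01."
detections = [
    ("Jane Doe", "NAME"),
    ("jane.doe@example.com", "EMAIL"),
    ("2024-05-01", "DATE"),
]

def mask_pii(text, detections):
    """Replace each detected PII span with its class tag, e.g. [NAME]."""
    for span, label in detections:
        text = text.replace(span, f"[{label}]")
    return text

masked = mask_pii(text, detections)
print(masked)  # Contact [NAME] at [EMAIL] before [DATE].
```

The masked text can then be sent to a third-party API or chat interface, so the downstream model sees the structure of the request without the underlying confidential values.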