Spaces: DeepActionPotential/PIIGuard

Build error

DeepActionPotential committed on Sep 29, 2025

Commit 73a7314 (verified) · 1 Parent(s): 7fa33ea

🚀 Initial upload of my app

Files changed (15)

.gitattributes +1 -0
LICENSE +21 -0
README.md +81 -20
__pycache__/model.cpython-311.pyc +0 -0
__pycache__/ui.cpython-311.pyc +0 -0
__pycache__/utils.cpython-311.pyc +0 -0
app.py +25 -0
demo/demo.mp4 +3 -0
demo/demo.png +0 -0
detecting-pii-with-bilstm-crf-f1-91.ipynb +0 -0
model.py +44 -0
models/best_bilstm_crf_model.pt +3 -0
requirements.txt +5 -3
ui.py +26 -0
utils.py +141 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+demo/demo.mp4 filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 Eslam Tarek
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,20 +1,81 @@
----
-title: PIIGuard
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: Deep Learning Model for PII Classification
-license: mit
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+# PIIDetector 🔒
+Detecting Personally Identifiable Information (PII) using a BiLSTM-CRF model.
+## 🚀 Demo
+![Demo Screenshot](./demo/demo.png)
+[Watch Demo Video](./demo/demo.mp4)
+## ✨ Features
+- **PII Detection**: Identify various types of Personally Identifiable Information in text
+- **BiLSTM-CRF Model**: Uses a bidirectional LSTM with a CRF output layer for sequence labeling
+- **Streamlit Web Interface**: User-friendly interface for easy interaction
+- **Multiple PII Types**: Detects various PII entities including names, addresses, financial information, and more
+## 📦 Installation
+1. **Clone the repository**
+   ```bash
+   git clone https://github.com/yourusername/PIIDetector.git
+   cd PIIDetector
+   ```
+2. **Create and activate a virtual environment**
+   ```bash
+   # Create a virtual environment
+   python -m venv .venv
+   # Activate it
+   # On Linux/Mac:
+   source .venv/bin/activate
+   # On Windows:
+   .venv\Scripts\activate
+   ```
+3. **Install dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```
+## 🚀 Usage
+1. **Run the Streamlit app**
+   ```bash
+   streamlit run app.py
+   ```
+2. **Enter text** in the text area and click "Analyze" to detect PII entities
+3. **View results** in the table showing tokens and their predicted PII labels
+## 🛠 Configuration
+The application uses a pre-trained BiLSTM-CRF model located in the `models/` directory. The model supports the following PII entity types:
+- Personal Information (names, age, gender, etc.)
+- Contact Information (emails, phone numbers, addresses)
+- Financial Information (credit cards, account numbers, IBAN, etc.)
+- Identification Numbers (SSN, passport numbers, etc.)
+- Network and device identifiers (IP addresses, MAC addresses, IMEI, user agents), crypto addresses, and more
+## 🤝 Contributing
+Contributions are welcome! Please feel free to submit a Pull Request.
+1. Fork the repository
+2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
+## 📄 License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## 🙏 Acknowledgements
+- [Hugging Face Transformers](https://huggingface.co/transformers/)
+- [PyTorch](https://pytorch.org/)
+- [Streamlit](https://streamlit.io/)

__pycache__/model.cpython-311.pyc ADDED

Binary file (2.2 kB). View file

__pycache__/ui.cpython-311.pyc ADDED

Binary file (1.74 kB). View file

__pycache__/utils.cpython-311.pyc ADDED

Binary file (5.57 kB). View file

app.py ADDED

	@@ -0,0 +1,25 @@

+import streamlit as st
+from utils import load_full_model_and_tokenizer
+from ui import render_ui
+from model import BiLSTMCRF  # not referenced directly; kept so the class is importable when the pickled model is loaded
+# Cache model and tokenizer
+@st.cache_resource
+def get_model_and_tokenizer():
+    return load_full_model_and_tokenizer("models/best_bilstm_crf_model.pt")
+model, tokenizer, idx2tag = get_model_and_tokenizer()
+def main():
+    st.title("🔒 Detecting PII with BiLSTM-CRF")
+    text = st.text_area("Enter text to analyze:", height=200)
+    if st.button("Analyze"):
+        if text.strip():
+            render_ui(text, model, tokenizer, idx2tag)
+        else:
+            st.warning("⚠️ Please enter some text.")
+if __name__ == "__main__":
+    main()
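
`@st.cache_resource` keeps the expensive model load from re-running on every Streamlit rerun. As a rough analogy only (this is not how Streamlit implements it), the standard library's `functools.lru_cache` shows the same load-once behaviour. `load_model` and `LOAD_CALLS` are hypothetical names used for illustration.

```python
from functools import lru_cache

LOAD_CALLS = []  # hypothetical log of how often the loader body actually runs


@lru_cache(maxsize=None)
def load_model(path: str) -> dict:
    """Stand-in for an expensive loader; the body runs once per distinct path."""
    LOAD_CALLS.append(path)
    return {"path": path, "weights": "..."}


first = load_model("models/best_bilstm_crf_model.pt")
second = load_model("models/best_bilstm_crf_model.pt")

# Both calls return the same cached object, and the loader body ran only once.
print(first is second, len(LOAD_CALLS))
```

The design point is the same in both cases: the cached object is shared between calls, so callers should treat it as read-only.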

demo/demo.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:79e2d0b8ad23dfd91431fb1299a7c3c380cefccfa6eacb91e28d8c7921ccaf61
+size 1011984
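
The three added lines are a Git LFS pointer file, not the video itself: each line is a key followed by a value. A minimal sketch of reading such a pointer, assuming exactly the `version`, `oid`, and `size` keys shown above (`parse_lfs_pointer` is a hypothetical helper, not part of this commit):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form '<hash-algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:79e2d0b8ad23dfd91431fb1299a7c3c380cefccfa6eacb91e28d8c7921ccaf61
size 1011984
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```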

demo/demo.png ADDED

detecting-pii-with-bilstm-crf-f1-91.ipynb ADDED

The diff for this file is too large to render. See raw diff

model.py ADDED

	@@ -0,0 +1,44 @@

+import torch.nn as nn
+from torchcrf import CRF
+class BiLSTMCRF(nn.Module):
+    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels, pad_idx=0, pad_label_id=-100):
+        super().__init__()
+        self.pad_label_id = pad_label_id
+        # Embedding layer for tokens
+        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
+        # BiLSTM layer
+        self.lstm = nn.LSTM(
+            input_size=embedding_dim,
+            hidden_size=hidden_dim,
+            num_layers=1,
+            bidirectional=True,
+            batch_first=True
+        )
+        # Linear layer for projecting to label space
+        self.hidden2tag = nn.Linear(hidden_dim * 2, num_labels)
+        # CRF layer
+        self.crf = CRF(num_labels, batch_first=True)
+    def forward(self, input_ids, tags=None, mask=None):
+        embeds = self.embedding(input_ids)            # [B, L, E]
+        lstm_out, _ = self.lstm(embeds)               # [B, L, 2*H]
+        emissions = self.hidden2tag(lstm_out)         # [B, L, num_labels]
+        if tags is not None:
+            # Replace ignored labels (pad_label_id) with 0 so the CRF sees valid indices; the mask should exclude them
+            crf_tags = tags.clone()
+            crf_tags[crf_tags == self.pad_label_id] = 0
+            # Negative log likelihood
+            loss = -self.crf(emissions, crf_tags, mask=mask, reduction='mean')
+            return loss
+        else:
+            # Decode (Viterbi) paths
+            return self.crf.decode(emissions, mask=mask)
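
When `tags` is `None`, `forward` returns `self.crf.decode(...)`, which the source comment describes as Viterbi decoding. The following is a minimal pure-Python sketch of that algorithm on a toy two-label problem. It ignores the start/end transition scores and the masking that a full CRF layer handles, and `viterbi_decode`, the emission scores, and the transition scores are illustrative assumptions, not the `torchcrf` implementation.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers: List[List[int]] = []  # best previous label for each (position, label)

    for t in range(1, len(emissions)):
        next_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            next_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
        score = next_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label to recover the path.
    best_last = max(range(num_labels), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: label 0 = "O", label 1 = "B-X".
emissions = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]]
transitions = [[0.0, 0.0], [0.0, -5.0]]  # strongly discourage B-X followed by B-X
print(viterbi_decode(emissions, transitions))  # -> [0, 1, 0]
```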

models/best_bilstm_crf_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:46bf9fbdb9930ab8f11eea77fa5e4a325fa04c841bd3c00f21a7435d7375d41c
+size 19073118

requirements.txt CHANGED

@@ -1,3 +1,5 @@
-altair
-pandas
-streamlit

+streamlit==1.31.0
+torch==2.2.1
+transformers==4.38.2
+pandas==2.1.4
+pytorch-crf==0.7.2
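
The new `requirements.txt` pins every dependency with `==`. A small sketch of turning such pins into a mapping, assuming only simple `name==version` lines with optional `#` comments (`parse_pins` is a hypothetical helper, not part of this commit):

```python
from typing import Dict


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Map package name -> pinned version for lines of the form 'name==version'."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


requirements = """streamlit==1.31.0
torch==2.2.1
transformers==4.38.2
pandas==2.1.4
pytorch-crf==0.7.2
"""
pins = parse_pins(requirements)
print(pins["torch"], len(pins))
```

The resulting mapping could then be compared against `importlib.metadata.version(name)` for each package to confirm that the running environment matches the pins.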

ui.py ADDED

	@@ -0,0 +1,26 @@

+import streamlit as st
+from utils import prepare_inputs
+import torch
+import pandas as pd
+def render_ui(text, model, tokenizer, idx2tag):
+    # Prepare inputs
+    input_ids, mask = prepare_inputs(text, tokenizer)
+    # Run model
+    with torch.no_grad():
+        predictions = model(input_ids, mask=mask)
+    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
+    labels = [idx2tag.get(tag, "O") for tag in predictions[0]]
+    # Build table data
+    rows = []
+    for token, label in zip(tokens, labels):
+        rows.append({"Token": token, "Predicted Label": label})
+    df = pd.DataFrame(rows)
+    # Show in Streamlit
+    st.subheader("🔍 Predictions")
+    st.dataframe(df, use_container_width=True)  # or st.table(df) for static table
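
`render_ui` shows one row per tokenizer token, so special tokens and `##` continuation pieces appear as separate rows. One possible post-processing step is to merge tokens and BIO labels into entity spans. The sketch below assumes BERT-style WordPiece conventions (`[CLS]`, `[SEP]`, `[PAD]`, `##` prefixes); `group_entities` is a hypothetical helper, and the token splits in the example are illustrative rather than actual tokenizer output.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge WordPiece tokens and BIO labels into (entity_type, text) spans."""
    entities: List[Tuple[str, str]] = []
    current_type, current_text = None, ""

    def flush() -> None:
        # Close the currently open span, if any.
        nonlocal current_type, current_text
        if current_type is not None:
            entities.append((current_type, current_text))
        current_type, current_text = None, ""

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            flush()
            continue
        if token.startswith("##") and current_type is not None:
            # Continuation piece: glue it onto the open span without a space.
            current_text += token[2:]
        elif label.startswith("B-"):
            flush()
            current_type, current_text = label[2:], token
        elif label.startswith("I-") and current_type == label[2:]:
            current_text += " " + token
        else:
            flush()  # "O", or an I- tag that does not continue the open span
    flush()
    return entities


tokens = ["[CLS]", "jo", "##han", "##sen", "lives", "in", "new", "york", "[SEP]"]
labels = ["O", "B-LASTNAME", "I-LASTNAME", "I-LASTNAME", "O", "O", "B-CITY", "I-CITY", "O"]
print(group_entities(tokens, labels))
# -> [('LASTNAME', 'johansen'), ('CITY', 'new york')]
```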

utils.py ADDED

	@@ -0,0 +1,141 @@

+import torch
+from transformers import BertTokenizerFast
+from model import BiLSTMCRF   # the class must be importable for torch.load to unpickle the full model
+def load_full_model_and_tokenizer(path):
+    """
+    Load the full pickled BiLSTM-CRF model (saved with torch.save(model, ...)) and the BERT tokenizer.
+    """
+    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
+    # Load the full pickled model; weights_only=False can execute pickled code, so only load trusted files
+    model = torch.load(path, map_location="cpu", weights_only=False)
+    model.eval()
+    # Define tag mapping (must match training)
+    idx2tag = {0: 'B-ACCOUNTNAME',
+ 1: 'B-ACCOUNTNUMBER',
+ 2: 'B-AGE',
+ 3: 'B-AMOUNT',
+ 4: 'B-BIC',
+ 5: 'B-BITCOINADDRESS',
+ 6: 'B-BUILDINGNUMBER',
+ 7: 'B-CITY',
+ 8: 'B-COMPANYNAME',
+ 9: 'B-COUNTY',
+ 10: 'B-CREDITCARDCVV',
+ 11: 'B-CREDITCARDISSUER',
+ 12: 'B-CREDITCARDNUMBER',
+ 13: 'B-CURRENCY',
+ 14: 'B-CURRENCYCODE',
+ 15: 'B-CURRENCYNAME',
+ 16: 'B-CURRENCYSYMBOL',
+ 17: 'B-DATE',
+ 18: 'B-DOB',
+ 19: 'B-EMAIL',
+ 20: 'B-ETHEREUMADDRESS',
+ 21: 'B-EYECOLOR',
+ 22: 'B-FIRSTNAME',
+ 23: 'B-GENDER',
+ 24: 'B-HEIGHT',
+ 25: 'B-IBAN',
+ 26: 'B-IP',
+ 27: 'B-IPV4',
+ 28: 'B-IPV6',
+ 29: 'B-JOBAREA',
+ 30: 'B-JOBTITLE',
+ 31: 'B-JOBTYPE',
+ 32: 'B-LASTNAME',
+ 33: 'B-LITECOINADDRESS',
+ 34: 'B-MAC',
+ 35: 'B-MASKEDNUMBER',
+ 36: 'B-MIDDLENAME',
+ 37: 'B-NEARBYGPSCOORDINATE',
+ 38: 'B-ORDINALDIRECTION',
+ 39: 'B-PASSWORD',
+ 40: 'B-PHONEIMEI',
+ 41: 'B-PHONENUMBER',
+ 42: 'B-PIN',
+ 43: 'B-PREFIX',
+ 44: 'B-SECONDARYADDRESS',
+ 45: 'B-SEX',
+ 46: 'B-SSN',
+ 47: 'B-STATE',
+ 48: 'B-STREET',
+ 49: 'B-TIME',
+ 50: 'B-URL',
+ 51: 'B-USERAGENT',
+ 52: 'B-USERNAME',
+ 53: 'B-VEHICLEVIN',
+ 54: 'B-VEHICLEVRM',
+ 55: 'B-ZIPCODE',
+ 56: 'I-ACCOUNTNAME',
+ 57: 'I-ACCOUNTNUMBER',
+ 58: 'I-AGE',
+ 59: 'I-AMOUNT',
+ 60: 'I-BIC',
+ 61: 'I-BITCOINADDRESS',
+ 62: 'I-BUILDINGNUMBER',
+ 63: 'I-CITY',
+ 64: 'I-COMPANYNAME',
+ 65: 'I-COUNTY',
+ 66: 'I-CREDITCARDCVV',
+ 67: 'I-CREDITCARDISSUER',
+ 68: 'I-CREDITCARDNUMBER',
+ 69: 'I-CURRENCY',
+ 70: 'I-CURRENCYCODE',
+ 71: 'I-CURRENCYNAME',
+ 72: 'I-CURRENCYSYMBOL',
+ 73: 'I-DATE',
+ 74: 'I-DOB',
+ 75: 'I-EMAIL',
+ 76: 'I-ETHEREUMADDRESS',
+ 77: 'I-EYECOLOR',
+ 78: 'I-FIRSTNAME',
+ 79: 'I-GENDER',
+ 80: 'I-HEIGHT',
+ 81: 'I-IBAN',
+ 82: 'I-IP',
+ 83: 'I-IPV4',
+ 84: 'I-IPV6',
+ 85: 'I-JOBAREA',
+ 86: 'I-JOBTITLE',
+ 87: 'I-JOBTYPE',
+ 88: 'I-LASTNAME',
+ 89: 'I-LITECOINADDRESS',
+ 90: 'I-MAC',
+ 91: 'I-MASKEDNUMBER',
+ 92: 'I-MIDDLENAME',
+ 93: 'I-NEARBYGPSCOORDINATE',
+ 94: 'I-PASSWORD',
+ 95: 'I-PHONEIMEI',
+ 96: 'I-PHONENUMBER',
+ 97: 'I-PIN',
+ 98: 'I-PREFIX',
+ 99: 'I-SECONDARYADDRESS',
+ 100: 'I-SSN',
+ 101: 'I-STATE',
+ 102: 'I-STREET',
+ 103: 'I-TIME',
+ 104: 'I-URL',
+ 105: 'I-USERAGENT',
+ 106: 'I-USERNAME',
+ 107: 'I-VEHICLEVIN',
+ 108: 'I-VEHICLEVRM',
+ 109: 'I-ZIPCODE',
+ 110: 'O'}
+    return model, tokenizer, idx2tag
+def prepare_inputs(text, tokenizer, max_length=128):
+    encoding = tokenizer(
+        text.split(),
+        is_split_into_words=True,
+        padding="max_length",
+        truncation=True,
+        max_length=max_length,
+        return_tensors="pt"
+    )
+    input_ids = encoding["input_ids"]
+    mask = encoding["attention_mask"].bool()
+    return input_ids, mask
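
`prepare_inputs` pads every input to `max_length` and derives a boolean mask from `attention_mask`. Because the CRF decodes only the unmasked positions, the prediction list can be shorter than the padded token list. A plain-Python sketch of the padding and mask logic follows; `pad_and_mask` and `pad_id=0` are assumptions for illustration, and unlike the real tokenizer it does not re-append a closing special token after truncation.

```python
from typing import List, Tuple


def pad_and_mask(ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[bool]]:
    """Truncate or right-pad ids to max_length and mark which positions are real tokens."""
    kept = ids[:max_length]
    padding = max_length - len(kept)
    padded = kept + [pad_id] * padding
    mask = [True] * len(kept) + [False] * padding
    return padded, mask


ids = [101, 7592, 2088, 102]  # illustrative token ids
padded, mask = pad_and_mask(ids, max_length=6)
print(padded)  # -> [101, 7592, 2088, 102, 0, 0]
print(mask)    # -> [True, True, True, True, False, False]
```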