DeepActionPotential committed on
Commit 73a7314 · verified · 1 Parent(s): 7fa33ea

🚀 Initial upload of my app

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ demo/demo.mp4 filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Eslam Tarek
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,20 +1,81 @@
- ---
- title: PIIGuard
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: Deep Learning Model for PII Classification
- license: mit
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # PIIDetector 🔒
+ Detecting Personally Identifiable Information (PII) using a BiLSTM-CRF model
+
+ ## 🚀 Demo
+
+ ![Demo Screenshot](./demo/demo.png)
+
+ [Watch Demo Video](./demo/demo.mp4)
+
+ ## ✨ Features
+
+ - **PII Detection**: Identify various types of Personally Identifiable Information in text
+ - **BiLSTM-CRF Model**: Utilizes a powerful deep learning model for sequence labeling
+ - **Streamlit Web Interface**: User-friendly interface for easy interaction
+ - **Multiple PII Types**: Detects various PII entities including names, addresses, financial information, and more
+
+ ## 📦 Installation
+
+ 1. **Clone the repository**
+    ```bash
+    git clone https://github.com/yourusername/PIIDetector.git
+    cd PIIDetector
+    ```
+
+ 2. **Create and activate a virtual environment**
+    ```bash
+    # Create a virtual environment
+    python -m venv .venv
+
+    # Activate it
+    # On Linux/Mac:
+    source .venv/bin/activate
+    # On Windows:
+    .venv\Scripts\activate
+    ```
+
+ 3. **Install dependencies**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ ## 🚀 Usage
+
+ 1. **Run the Streamlit app**
+    ```bash
+    streamlit run app.py
+    ```
+
+ 2. **Enter text** in the text area and click "Analyze" to detect PII entities
+
+ 3. **View results** in the table showing tokens and their predicted PII labels
+
+ ## 🛠 Configuration
+
+ The application uses a pre-trained BiLSTM-CRF model located in the `models/` directory. The model supports the following PII entity types:
+
+ - Personal Information (names, age, gender, etc.)
+ - Contact Information (emails, phone numbers, addresses)
+ - Financial Information (credit cards, account numbers, IBAN, etc.)
+ - Identification Numbers (SSN, passport numbers, etc.)
+ - And many more...
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## 🙏 Acknowledgements
+
+ - [Hugging Face Transformers](https://huggingface.co/transformers/)
+ - [PyTorch](https://pytorch.org/)
+ - [Streamlit](https://streamlit.io/)
__pycache__/model.cpython-311.pyc ADDED
Binary file (2.2 kB).
__pycache__/ui.cpython-311.pyc ADDED
Binary file (1.74 kB).
__pycache__/utils.cpython-311.pyc ADDED
Binary file (5.57 kB).
app.py ADDED
@@ -0,0 +1,25 @@
+ import streamlit as st
+ from utils import load_full_model_and_tokenizer
+ from ui import render_ui
+ from model import BiLSTMCRF
+
+ # Cache model and tokenizer
+ @st.cache_resource
+ def get_model_and_tokenizer():
+     return load_full_model_and_tokenizer("models/best_bilstm_crf_model.pt")
+
+ model, tokenizer, idx2tag = get_model_and_tokenizer()
+
+ def main():
+     st.title("🔒 Detecting PII with BiLSTM-CRF")
+
+     text = st.text_area("Enter text to analyze:", height=200)
+
+     if st.button("Analyze"):
+         if text.strip():
+             render_ui(text, model, tokenizer, idx2tag)
+         else:
+             st.warning("⚠️ Please enter some text.")
+
+ if __name__ == "__main__":
+     main()
demo/demo.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79e2d0b8ad23dfd91431fb1299a7c3c380cefccfa6eacb91e28d8c7921ccaf61
+ size 1011984
demo/demo.png ADDED
detecting-pii-with-bilstm-crf-f1-91.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
model.py ADDED
@@ -0,0 +1,44 @@
+
+
+ import torch.nn as nn
+ from torchcrf import CRF
+
+ class BiLSTMCRF(nn.Module):
+     def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels, pad_idx=0, pad_label_id=-100):
+         super().__init__()
+         self.pad_label_id = pad_label_id
+
+         # Embedding layer for tokens
+         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
+
+         # BiLSTM layer
+         self.lstm = nn.LSTM(
+             input_size=embedding_dim,
+             hidden_size=hidden_dim,
+             num_layers=1,
+             bidirectional=True,
+             batch_first=True
+         )
+
+         # Linear layer for projecting to label space
+         self.hidden2tag = nn.Linear(hidden_dim * 2, num_labels)
+
+         # CRF layer
+         self.crf = CRF(num_labels, batch_first=True)
+
+     def forward(self, input_ids, tags=None, mask=None):
+         embeds = self.embedding(input_ids)        # [B, L, E]
+         lstm_out, _ = self.lstm(embeds)           # [B, L, 2*H]
+         emissions = self.hidden2tag(lstm_out)     # [B, L, num_labels]
+
+         if tags is not None:
+             # Convert ignored labels to 0 for CRF
+             crf_tags = tags.clone()
+             crf_tags[crf_tags == self.pad_label_id] = 0
+
+             # Negative log likelihood
+             loss = -self.crf(emissions, crf_tags, mask=mask, reduction='mean')
+             return loss
+         else:
+             # Decode (Viterbi) paths
+             return self.crf.decode(emissions, mask=mask)
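The inference branch of `forward` hands the emission scores to `self.crf.decode`, which the comment identifies as Viterbi decoding. For readers unfamiliar with that step, here is a minimal pure-Python sketch for a single sequence. It is not pytorch-crf's implementation: the function name `viterbi_decode`, the toy score tables, and the omission of start/end scores and masking are all simplifying assumptions made for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path for one sequence.

    emissions[t][j]   -- score of tag j at position t (toy values below)
    transitions[i][j] -- score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # Best score of any path ending in each tag at position 0
    score = list(emissions[0])
    backpointers = []

    for t in range(1, len(emissions)):
        next_score, pointers = [], []
        for j in range(num_tags):
            # Pick the best previous tag for landing on tag j
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            next_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = next_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag
    best_last = max(range(num_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with two tags; the transition table heavily penalises 0 -> 1
emissions = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[0.0, -5.0], [0.0, 0.0]]
best_path = viterbi_decode(emissions, transitions)

# With neutral transitions the per-position best tags win instead
greedy_like_path = viterbi_decode(emissions, [[0.0, 0.0], [0.0, 0.0]])
```

The two calls differ only in the transition table, which shows why a CRF layer can change the output relative to picking the best tag at each position independently.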
models/best_bilstm_crf_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46bf9fbdb9930ab8f11eea77fa5e4a325fa04c841bd3c00f21a7435d7375d41c
+ size 19073118
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- altair
- pandas
- streamlit
+ streamlit==1.31.0
+ torch==2.2.1
+ transformers==4.38.2
+ pandas==2.1.4
+ pytorch-crf==0.7.2
ui.py ADDED
@@ -0,0 +1,26 @@
+ import streamlit as st
+ from utils import prepare_inputs
+ import torch
+ import pandas as pd
+
+ def render_ui(text, model, tokenizer, idx2tag):
+     # Prepare inputs
+     input_ids, mask = prepare_inputs(text, tokenizer)
+
+     # Run model
+     with torch.no_grad():
+         predictions = model(input_ids, mask=mask)
+
+     tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
+     labels = [idx2tag.get(tag, "O") for tag in predictions[0]]
+
+     # Build table data
+     rows = []
+     for token, label in zip(tokens, labels):
+         rows.append({"Token": token, "Predicted Label": label})
+
+     df = pd.DataFrame(rows)
+
+     # Show in Streamlit
+     st.subheader("🔍 Predictions")
+     st.dataframe(df, use_container_width=True)  # or st.table(df) for static table
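`render_ui` shows one row per token, so a multi-token entity appears as a `B-` row followed by `I-` rows, and special tokens such as `[CLS]` and `[SEP]` get rows too. A reader who wants whole entities could post-process the two lists along the following lines. This is a sketch only: `group_entities`, `SPECIAL_TOKENS`, and the sample tokens and labels are hypothetical and not part of this commit, and the `##` handling assumes BERT WordPiece continuation markers.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_entities(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []      # each item is [entity_type, text], extended in place
    prev_type = None   # type of the entity currently being extended, if any

    # zip stops at the shorter list, mirroring the loop in render_ui
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS or label == "O":
            prev_type = None
            continue

        prefix, entity_type = label.split("-", 1)
        if prefix == "I" and entity_type == prev_type:
            # Continue the current entity; glue WordPiece continuations back on
            if token.startswith("##"):
                entities[-1][1] += token[2:]
            else:
                entities[-1][1] += " " + token
        else:
            # A B- tag (or a stray I- tag) starts a new entity
            entities.append([entity_type, token])
        prev_type = entity_type

    return [tuple(entity) for entity in entities]


# Made-up tokens and labels, shaped like the lists built in render_ui
tokens = ["[CLS]", "my", "name", "is", "john", "sm", "##ith", "[SEP]", "[PAD]"]
labels = ["O", "O", "O", "O", "B-FIRSTNAME", "B-LASTNAME", "I-LASTNAME", "O"]
entities = group_entities(tokens, labels)
```

The label list here is one item shorter than the token list, as happens when decoding skips masked padding positions; `zip` simply drops the unmatched `[PAD]` token.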
utils.py ADDED
@@ -0,0 +1,141 @@
+ import torch
+ from transformers import BertTokenizerFast
+ from model import BiLSTMCRF  # make sure model.py exists
+
+ def load_full_model_and_tokenizer(path):
+     """
+     Loads the FULL BiLSTM-CRF model (torch.save(model, ...)) and tokenizer.
+     """
+     tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
+
+     # Load full model
+     model = torch.load(path, map_location="cpu", weights_only=False)
+     model.eval()
+
+     # Define tag mapping (must match training)
+     idx2tag = {0: 'B-ACCOUNTNAME',
+                1: 'B-ACCOUNTNUMBER',
+                2: 'B-AGE',
+                3: 'B-AMOUNT',
+                4: 'B-BIC',
+                5: 'B-BITCOINADDRESS',
+                6: 'B-BUILDINGNUMBER',
+                7: 'B-CITY',
+                8: 'B-COMPANYNAME',
+                9: 'B-COUNTY',
+                10: 'B-CREDITCARDCVV',
+                11: 'B-CREDITCARDISSUER',
+                12: 'B-CREDITCARDNUMBER',
+                13: 'B-CURRENCY',
+                14: 'B-CURRENCYCODE',
+                15: 'B-CURRENCYNAME',
+                16: 'B-CURRENCYSYMBOL',
+                17: 'B-DATE',
+                18: 'B-DOB',
+                19: 'B-EMAIL',
+                20: 'B-ETHEREUMADDRESS',
+                21: 'B-EYECOLOR',
+                22: 'B-FIRSTNAME',
+                23: 'B-GENDER',
+                24: 'B-HEIGHT',
+                25: 'B-IBAN',
+                26: 'B-IP',
+                27: 'B-IPV4',
+                28: 'B-IPV6',
+                29: 'B-JOBAREA',
+                30: 'B-JOBTITLE',
+                31: 'B-JOBTYPE',
+                32: 'B-LASTNAME',
+                33: 'B-LITECOINADDRESS',
+                34: 'B-MAC',
+                35: 'B-MASKEDNUMBER',
+                36: 'B-MIDDLENAME',
+                37: 'B-NEARBYGPSCOORDINATE',
+                38: 'B-ORDINALDIRECTION',
+                39: 'B-PASSWORD',
+                40: 'B-PHONEIMEI',
+                41: 'B-PHONENUMBER',
+                42: 'B-PIN',
+                43: 'B-PREFIX',
+                44: 'B-SECONDARYADDRESS',
+                45: 'B-SEX',
+                46: 'B-SSN',
+                47: 'B-STATE',
+                48: 'B-STREET',
+                49: 'B-TIME',
+                50: 'B-URL',
+                51: 'B-USERAGENT',
+                52: 'B-USERNAME',
+                53: 'B-VEHICLEVIN',
+                54: 'B-VEHICLEVRM',
+                55: 'B-ZIPCODE',
+                56: 'I-ACCOUNTNAME',
+                57: 'I-ACCOUNTNUMBER',
+                58: 'I-AGE',
+                59: 'I-AMOUNT',
+                60: 'I-BIC',
+                61: 'I-BITCOINADDRESS',
+                62: 'I-BUILDINGNUMBER',
+                63: 'I-CITY',
+                64: 'I-COMPANYNAME',
+                65: 'I-COUNTY',
+                66: 'I-CREDITCARDCVV',
+                67: 'I-CREDITCARDISSUER',
+                68: 'I-CREDITCARDNUMBER',
+                69: 'I-CURRENCY',
+                70: 'I-CURRENCYCODE',
+                71: 'I-CURRENCYNAME',
+                72: 'I-CURRENCYSYMBOL',
+                73: 'I-DATE',
+                74: 'I-DOB',
+                75: 'I-EMAIL',
+                76: 'I-ETHEREUMADDRESS',
+                77: 'I-EYECOLOR',
+                78: 'I-FIRSTNAME',
+                79: 'I-GENDER',
+                80: 'I-HEIGHT',
+                81: 'I-IBAN',
+                82: 'I-IP',
+                83: 'I-IPV4',
+                84: 'I-IPV6',
+                85: 'I-JOBAREA',
+                86: 'I-JOBTITLE',
+                87: 'I-JOBTYPE',
+                88: 'I-LASTNAME',
+                89: 'I-LITECOINADDRESS',
+                90: 'I-MAC',
+                91: 'I-MASKEDNUMBER',
+                92: 'I-MIDDLENAME',
+                93: 'I-NEARBYGPSCOORDINATE',
+                94: 'I-PASSWORD',
+                95: 'I-PHONEIMEI',
+                96: 'I-PHONENUMBER',
+                97: 'I-PIN',
+                98: 'I-PREFIX',
+                99: 'I-SECONDARYADDRESS',
+                100: 'I-SSN',
+                101: 'I-STATE',
+                102: 'I-STREET',
+                103: 'I-TIME',
+                104: 'I-URL',
+                105: 'I-USERAGENT',
+                106: 'I-USERNAME',
+                107: 'I-VEHICLEVIN',
+                108: 'I-VEHICLEVRM',
+                109: 'I-ZIPCODE',
+                110: 'O'}
+
+     return model, tokenizer, idx2tag
+
+ def prepare_inputs(text, tokenizer, max_length=128):
+     encoding = tokenizer(
+         text.split(),
+         is_split_into_words=True,
+         padding="max_length",
+         truncation=True,
+         max_length=max_length,
+         return_tensors="pt"
+     )
+     input_ids = encoding["input_ids"]
+     mask = encoding["attention_mask"].bool()
+     return input_ids, mask
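The hard-coded `idx2tag` in `load_full_model_and_tokenizer` follows a regular pattern: the tag strings are in plain sorted order (all `B-` tags, then all `I-` tags, then `O`), and two types, `ORDINALDIRECTION` and `SEX`, have a `B-` tag but no `I-` tag. Whether the training notebook produced the mapping this way is not visible in this diff, so the following is only a sketch of how such a table could be generated rather than typed by hand. `build_idx2tag` and its `begin_only` parameter are hypothetical names, and the three entity types are a small subset chosen for illustration.

```python
def build_idx2tag(entity_types, begin_only=()):
    """Build an index -> BIO tag mapping in sorted tag order.

    entity_types -- iterable of type names such as "EMAIL"
    begin_only   -- types that get a B- tag but no I- tag
    """
    tags = [f"B-{name}" for name in entity_types]
    tags += [f"I-{name}" for name in entity_types if name not in begin_only]
    tags.append("O")
    # Sorting puts B-* first, then I-*, then O, since "B" < "I" < "O"
    return dict(enumerate(sorted(tags)))


# Small illustrative subset; the committed mapping has 111 entries
idx2tag_subset = build_idx2tag(["SSN", "EMAIL", "SEX"], begin_only=("SEX",))
tag2idx_subset = {tag: idx for idx, tag in idx2tag_subset.items()}
```

Any generated table would still need to be checked against the label order used in training, since the model's output indices are only meaningful under that exact order.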