DeepActionPotential committed on
Commit e6ba27f · verified · 1 Parent(s): 7d8a038

🚀 Initial upload of my app
README.md CHANGED
@@ -1,20 +1,7 @@
- ---
- title: SaleSight – ML model for sales forecasting
- emoji: 📈
- colorFrom: indigo
- colorTo: green
- sdk: gradio
- sdk_version: "5.4.0"
- app_file: app.py
- pinned: false
- ---
-
- # Sales Forecasting with LightGBM
-
- A retail sales prediction application built with LightGBM and Gradio for interactive forecasting.
-
- ## 📊 Demo

  ![Demo Screenshot](./demo/demo.png)

@@ -22,24 +9,20 @@ A retail sales prediction application built with LightGBM and Gradio for interac

  ## ✨ Features

- - Interactive web interface for sales prediction
- - Takes into account various features including:
-   - Promotional events
-   - Holiday status
-   - Historical sales data (various lags and rolling means)
-   - Temporal features (day, month, year, day of week)
- - Built with LightGBM for fast and accurate predictions
- - Simple and intuitive user interface
-
- ## 🚀 Installation
-
- 1. Clone the repository:
  ```bash
- git clone https://github.com/yourusername/sales-forecasting.git
- cd sales-forecasting
  ```

- 2. Create and activate a virtual environment:
  ```bash
  # Create a virtual environment
  python -m venv .venv
@@ -51,34 +34,31 @@ A retail sales prediction application built with LightGBM and Gradio for interac
  .venv\Scripts\activate
  ```

- 3. Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```

- ## 🛠️ Usage
-
- 1. Run the application:
  ```bash
- python app.py
  ```

- 2. Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:7860)
-
- 3. Input the required information:
-    - Promo status (0 or 1)
-    - Holiday status (0 or 1)
-    - Date in YYYY-MM-DD format
-    - Sales lags and rolling means
-
- 4. Click "Predict Sales" to see the prediction
-
- ## 📦 Dependencies
-
- - gradio >= 3.50.0
- - joblib >= 1.3.0
- - lightgbm >= 4.0.0
- - pandas >= 2.0.0

  ## 🤝 Contributing

@@ -96,6 +76,6 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file

  ## 🙏 Acknowledgements

- - [LightGBM](https://github.com/microsoft/LightGBM) - The gradient boosting framework used for predictions
- - [Gradio](https://gradio.app/) - For the simple web interface
- - [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis
+ # PIIDetector 🔒
+ Detecting Personally Identifiable Information (PII) using a BiLSTM-CRF model
+
+ ## 🚀 Demo
+
  ![Demo Screenshot](./demo/demo.png)

  ## ✨ Features

+ - **PII Detection**: Identify various types of Personally Identifiable Information in text
+ - **BiLSTM-CRF Model**: Utilizes a powerful deep learning model for sequence labeling
+ - **Streamlit Web Interface**: User-friendly interface for easy interaction
+ - **Multiple PII Types**: Detects various PII entities including names, addresses, financial information, and more
+
+ ## 📦 Installation
+
+ 1. **Clone the repository**
  ```bash
+ git clone https://github.com/yourusername/PIIDetector.git
+ cd PIIDetector
  ```

+ 2. **Create and activate a virtual environment**
  ```bash
  # Create a virtual environment
  python -m venv .venv
  .venv\Scripts\activate
  ```

+ 3. **Install dependencies**
  ```bash
  pip install -r requirements.txt
  ```

+ ## 🚀 Usage
+
+ 1. **Run the Streamlit app**
  ```bash
+ streamlit run app.py
  ```

+ 2. **Enter text** in the text area and click "Analyze" to detect PII entities
+
+ 3. **View results** in the table showing tokens and their predicted PII labels
+
+ ## 🛠 Configuration
+
+ The application uses a pre-trained BiLSTM-CRF model located in the `models/` directory. The model supports the following PII entity types:
+
+ - Personal Information (names, age, gender, etc.)
+ - Contact Information (emails, phone numbers, addresses)
+ - Financial Information (credit cards, account numbers, IBAN, etc.)
+ - Identification Numbers (SSN, passport numbers, etc.)
+ - And many more...

  ## 🤝 Contributing

  ## 🙏 Acknowledgements

+ - [Hugging Face Transformers](https://huggingface.co/transformers/)
+ - [PyTorch](https://pytorch.org/)
+ - [Streamlit](https://streamlit.io/)
__pycache__/model.cpython-311.pyc ADDED
Binary file (2.2 kB).
 
__pycache__/ui.cpython-311.pyc CHANGED
Binary files a/__pycache__/ui.cpython-311.pyc and b/__pycache__/ui.cpython-311.pyc differ
 
__pycache__/utils.cpython-311.pyc CHANGED
Binary files a/__pycache__/utils.cpython-311.pyc and b/__pycache__/utils.cpython-311.pyc differ
 
app.py CHANGED
@@ -1,7 +1,25 @@
- from utils import load_artifacts, predict_sales
- from ui import build_ui
-
- if __name__ == "__main__":
-     model, feature_cols = load_artifacts()
-     iface = build_ui(model, feature_cols, predict_sales)
-     iface.launch()

+ import streamlit as st
+ from utils import load_full_model_and_tokenizer
+ from ui import render_ui
+ from model import BiLSTMCRF
+
+ # Cache model and tokenizer
+ @st.cache_resource
+ def get_model_and_tokenizer():
+     return load_full_model_and_tokenizer("models/best_bilstm_crf_model.pt")
+
+ model, tokenizer, idx2tag = get_model_and_tokenizer()
+
+ def main():
+     st.title("🔒 Detecting PII with BiLSTM-CRF")
+
+     text = st.text_area("Enter text to analyze:", height=200)
+
+     if st.button("Analyze"):
+         if text.strip():
+             render_ui(text, model, tokenizer, idx2tag)
+         else:
+             st.warning("⚠️ Please enter some text.")
+
+ if __name__ == "__main__":
+     main()
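A note on the `@st.cache_resource` decorator above: Streamlit re-executes the whole script on every interaction, and the decorator keeps the loaded model in memory across those reruns, so the ~19 MB checkpoint is deserialized once per process rather than on every click. The effect can be sketched with plain `functools.lru_cache` (hypothetical names; this is not the app's actual code):

```python
import functools

calls = []  # records each time the "expensive" loader body actually runs

@functools.lru_cache(maxsize=1)
def get_model_and_tokenizer_sketch():
    calls.append(1)  # stands in for deserializing the .pt checkpoint
    return ("model", "tokenizer", {110: "O"})

first = get_model_and_tokenizer_sketch()
second = get_model_and_tokenizer_sketch()
assert first is second  # the cached object is reused
assert len(calls) == 1  # the loader body ran only once
```

`st.cache_resource` additionally shares the object across user sessions, which `lru_cache` does not model, but the once-per-process loading behaviour is the same idea.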
 
 
demo/demo.mp4 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:709f027723ef11b7699671bfb67b904580a63b70330dbef4069ffed351f4af8f
- size 896228
+ oid sha256:79e2d0b8ad23dfd91431fb1299a7c3c380cefccfa6eacb91e28d8c7921ccaf61
+ size 1011984
demo/demo.png CHANGED
detecting-pii-with-bilstm-crf-f1-91.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
model.py ADDED
@@ -0,0 +1,44 @@
+ import torch.nn as nn
+ from torchcrf import CRF
+
+ class BiLSTMCRF(nn.Module):
+     def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels, pad_idx=0, pad_label_id=-100):
+         super().__init__()
+         self.pad_label_id = pad_label_id
+
+         # Embedding layer for tokens
+         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
+
+         # BiLSTM layer
+         self.lstm = nn.LSTM(
+             input_size=embedding_dim,
+             hidden_size=hidden_dim,
+             num_layers=1,
+             bidirectional=True,
+             batch_first=True
+         )
+
+         # Linear layer for projecting to label space
+         self.hidden2tag = nn.Linear(hidden_dim * 2, num_labels)
+
+         # CRF layer
+         self.crf = CRF(num_labels, batch_first=True)
+
+     def forward(self, input_ids, tags=None, mask=None):
+         embeds = self.embedding(input_ids)     # [B, L, E]
+         lstm_out, _ = self.lstm(embeds)        # [B, L, 2*H]
+         emissions = self.hidden2tag(lstm_out)  # [B, L, num_labels]
+
+         if tags is not None:
+             # Convert ignored labels to 0 for CRF
+             crf_tags = tags.clone()
+             crf_tags[crf_tags == self.pad_label_id] = 0
+
+             # Negative log likelihood
+             loss = -self.crf(emissions, crf_tags, mask=mask, reduction='mean')
+             return loss
+         else:
+             # Decode (Viterbi) paths
+             return self.crf.decode(emissions, mask=mask)
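The `tags is not None` branch above substitutes the `-100` padding sentinel with label `0` before calling the CRF, because `torchcrf.CRF` requires every tag index to lie in `[0, num_labels)`; the matching positions in `mask` are off, so the substituted values never contribute to the likelihood. A torch-free sketch of that substitution step (plain lists standing in for tensors):

```python
PAD_LABEL_ID = -100  # same sentinel the model uses

def replace_pad_labels(tags, pad_label_id=PAD_LABEL_ID):
    # Mirror of `crf_tags[crf_tags == self.pad_label_id] = 0`,
    # applied over nested lists instead of a tensor.
    return [[0 if t == pad_label_id else t for t in seq] for seq in tags]

batch = [[5, 7, -100, -100],
         [2, -100, -100, -100]]
mask  = [[1, 1, 0, 0],
         [1, 0, 0, 0]]  # CRF only scores positions where mask == 1

assert replace_pad_labels(batch) == [[5, 7, 0, 0], [2, 0, 0, 0]]
```

The choice of `0` as the replacement is arbitrary by design: any in-range index would do, since those positions are masked out of the loss.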
models/best_bilstm_crf_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46bf9fbdb9930ab8f11eea77fa5e4a325fa04c841bd3c00f21a7435d7375d41c
+ size 19073118
requirements.txt CHANGED
@@ -1,4 +1,5 @@
- gradio>=3.50.0
- joblib>=1.3.0
- lightgbm>=4.0.0
- pandas>=2.0.0
+ streamlit==1.31.0
+ torch==2.2.1
+ transformers==4.38.2
+ pandas==2.1.4
+ pytorch-crf==0.7.2
ui.py CHANGED
@@ -1,28 +1,26 @@
- import gradio as gr
-
- def build_ui(model, feature_cols, predict_fn):
-     with gr.Blocks() as demo:
-         gr.Markdown("## 🛒 Retail Sales Prediction App")
-
-         with gr.Row():
-             promo = gr.Radio([0, 1], label="Promo", value=0)
-             holiday = gr.Radio([0, 1], label="Holiday", value=0)
-
-         date = gr.Textbox(label="Date (YYYY-MM-DD)", value="2023-11-01")
-
-         with gr.Row():
-             lag_1 = gr.Number(label="Sales Lag 1 Day", value=100)
-             lag_7 = gr.Number(label="Sales Lag 7 Days", value=120)
-             mean_3 = gr.Number(label="Rolling Mean (3 Days)", value=110)
-             mean_7 = gr.Number(label="Rolling Mean (7 Days)", value=115)
-
-         predict_btn = gr.Button("Predict Sales")
-         output = gr.Number(label="Predicted Sales", precision=2)
-
-         predict_btn.click(
-             fn=lambda p, h, d, l1, l7, m3, m7: predict_fn(model, feature_cols, p, h, d, l1, l7, m3, m7),
-             inputs=[promo, holiday, date, lag_1, lag_7, mean_3, mean_7],
-             outputs=output
-         )
-
-     return demo

+ import streamlit as st
+ from utils import prepare_inputs
+ import torch
+ import pandas as pd
+
+ def render_ui(text, model, tokenizer, idx2tag):
+     # Prepare inputs
+     input_ids, mask = prepare_inputs(text, tokenizer)
+
+     # Run model
+     with torch.no_grad():
+         predictions = model(input_ids, mask=mask)
+
+     tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
+     labels = [idx2tag.get(tag, "O") for tag in predictions[0]]
+
+     # Build table data
+     rows = []
+     for token, label in zip(tokens, labels):
+         rows.append({"Token": token, "Predicted Label": label})
+
+     df = pd.DataFrame(rows)
+
+     # Show in Streamlit
+     st.subheader("🔍 Predictions")
+     st.dataframe(df, use_container_width=True)  # or st.table(df) for a static table
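`render_ui` reports one BIO label per wordpiece, which is exactly what the table in the README's usage section describes. If entity-level output were wanted instead, adjacent `B-`/`I-` labels could be merged into spans; a minimal sketch (the `merge_bio_spans` helper is illustrative, not part of this repo):

```python
def merge_bio_spans(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, text) spans."""
    spans, current_type, current_toks = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current_type:  # close any open span first
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            current_toks.append(tok)  # continue the open span
        else:  # "O", or an I- tag with no matching open span
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_type:  # flush a span that runs to the end of the sequence
        spans.append((current_type, " ".join(current_toks)))
    return spans

tokens = ["Contact", "jane", "@", "x.com", "now"]
labels = ["O", "B-EMAIL", "I-EMAIL", "I-EMAIL", "O"]
assert merge_bio_spans(tokens, labels) == [("EMAIL", "jane @ x.com")]
```

Joining with spaces is a simplification; a real version would detokenize wordpieces (e.g. strip `##` prefixes) before joining.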
 
 
 
 
utils.py CHANGED
@@ -1,53 +1,141 @@
- import joblib
- import lightgbm as lgb
- import pandas as pd
-
- # Load artifacts
- def load_artifacts():
-     model = lgb.Booster(model_file="models/lgb_sales_model.txt")
-     feature_cols = joblib.load("models/feature_cols.pkl")
-     return model, feature_cols
-
- # Preprocess new input row into model-ready features
- def preprocess_input(promo, holiday, date, past_sales):
      """
-     Args:
-         promo: int (0/1)
-         holiday: int (0/1)
-         date: datetime-like
-         past_sales: dict with keys ['lag_1','lag_7','mean_3','mean_7']
-
-     Returns:
-         pd.DataFrame with a single row ready for prediction
      """
-     date = pd.to_datetime(date)
-
-     features = {
-         "promo": promo,
-         "holiday": holiday,
-         "day": date.day,
-         "month": date.month,
-         "year": date.year,
-         "day_of_week": date.weekday(),
-         "is_weekend": 1 if date.weekday() >= 5 else 0,
-         "sales_lag_1": past_sales.get("lag_1", 0),
-         "sales_lag_7": past_sales.get("lag_7", 0),
-         "rolling_mean_3": past_sales.get("mean_3", 0),
-         "rolling_mean_7": past_sales.get("mean_7", 0),
-     }
-
-     return pd.DataFrame([features])
-
- # Prediction
- def predict_sales(model, feature_cols, promo, holiday, date, lag_1, lag_7, mean_3, mean_7):
-     past_sales = {
-         "lag_1": lag_1,
-         "lag_7": lag_7,
-         "mean_3": mean_3,
-         "mean_7": mean_7,
-     }
-     X = preprocess_input(promo, holiday, date, past_sales)
-     X = X[feature_cols]  # ensure correct column order
-     prediction = model.predict(X)[0]
-     return round(prediction, 2)
+ import torch
+ from transformers import BertTokenizerFast
+ from model import BiLSTMCRF  # make sure model.py exists
+
+ def load_full_model_and_tokenizer(path):
+     """
+     Loads the FULL BiLSTM-CRF model (torch.save(model, ...)) and tokenizer.
+     """
+     tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
+
+     # Load full model
+     model = torch.load(path, map_location="cpu", weights_only=False)
+     model.eval()
+
+     # Define tag mapping (must match training)
+     idx2tag = {
+         0: 'B-ACCOUNTNAME', 1: 'B-ACCOUNTNUMBER', 2: 'B-AGE', 3: 'B-AMOUNT',
+         4: 'B-BIC', 5: 'B-BITCOINADDRESS', 6: 'B-BUILDINGNUMBER', 7: 'B-CITY',
+         8: 'B-COMPANYNAME', 9: 'B-COUNTY', 10: 'B-CREDITCARDCVV',
+         11: 'B-CREDITCARDISSUER', 12: 'B-CREDITCARDNUMBER', 13: 'B-CURRENCY',
+         14: 'B-CURRENCYCODE', 15: 'B-CURRENCYNAME', 16: 'B-CURRENCYSYMBOL',
+         17: 'B-DATE', 18: 'B-DOB', 19: 'B-EMAIL', 20: 'B-ETHEREUMADDRESS',
+         21: 'B-EYECOLOR', 22: 'B-FIRSTNAME', 23: 'B-GENDER', 24: 'B-HEIGHT',
+         25: 'B-IBAN', 26: 'B-IP', 27: 'B-IPV4', 28: 'B-IPV6', 29: 'B-JOBAREA',
+         30: 'B-JOBTITLE', 31: 'B-JOBTYPE', 32: 'B-LASTNAME',
+         33: 'B-LITECOINADDRESS', 34: 'B-MAC', 35: 'B-MASKEDNUMBER',
+         36: 'B-MIDDLENAME', 37: 'B-NEARBYGPSCOORDINATE', 38: 'B-ORDINALDIRECTION',
+         39: 'B-PASSWORD', 40: 'B-PHONEIMEI', 41: 'B-PHONENUMBER', 42: 'B-PIN',
+         43: 'B-PREFIX', 44: 'B-SECONDARYADDRESS', 45: 'B-SEX', 46: 'B-SSN',
+         47: 'B-STATE', 48: 'B-STREET', 49: 'B-TIME', 50: 'B-URL',
+         51: 'B-USERAGENT', 52: 'B-USERNAME', 53: 'B-VEHICLEVIN',
+         54: 'B-VEHICLEVRM', 55: 'B-ZIPCODE',
+         56: 'I-ACCOUNTNAME', 57: 'I-ACCOUNTNUMBER', 58: 'I-AGE', 59: 'I-AMOUNT',
+         60: 'I-BIC', 61: 'I-BITCOINADDRESS', 62: 'I-BUILDINGNUMBER', 63: 'I-CITY',
+         64: 'I-COMPANYNAME', 65: 'I-COUNTY', 66: 'I-CREDITCARDCVV',
+         67: 'I-CREDITCARDISSUER', 68: 'I-CREDITCARDNUMBER', 69: 'I-CURRENCY',
+         70: 'I-CURRENCYCODE', 71: 'I-CURRENCYNAME', 72: 'I-CURRENCYSYMBOL',
+         73: 'I-DATE', 74: 'I-DOB', 75: 'I-EMAIL', 76: 'I-ETHEREUMADDRESS',
+         77: 'I-EYECOLOR', 78: 'I-FIRSTNAME', 79: 'I-GENDER', 80: 'I-HEIGHT',
+         81: 'I-IBAN', 82: 'I-IP', 83: 'I-IPV4', 84: 'I-IPV6', 85: 'I-JOBAREA',
+         86: 'I-JOBTITLE', 87: 'I-JOBTYPE', 88: 'I-LASTNAME',
+         89: 'I-LITECOINADDRESS', 90: 'I-MAC', 91: 'I-MASKEDNUMBER',
+         92: 'I-MIDDLENAME', 93: 'I-NEARBYGPSCOORDINATE', 94: 'I-PASSWORD',
+         95: 'I-PHONEIMEI', 96: 'I-PHONENUMBER', 97: 'I-PIN', 98: 'I-PREFIX',
+         99: 'I-SECONDARYADDRESS', 100: 'I-SSN', 101: 'I-STATE', 102: 'I-STREET',
+         103: 'I-TIME', 104: 'I-URL', 105: 'I-USERAGENT', 106: 'I-USERNAME',
+         107: 'I-VEHICLEVIN', 108: 'I-VEHICLEVRM', 109: 'I-ZIPCODE',
+         110: 'O'}
+
+     return model, tokenizer, idx2tag
+
+ def prepare_inputs(text, tokenizer, max_length=128):
+     encoding = tokenizer(
+         text.split(),
+         is_split_into_words=True,
+         padding="max_length",
+         truncation=True,
+         max_length=max_length,
+         return_tensors="pt"
+     )
+     input_ids = encoding["input_ids"]
+     mask = encoding["attention_mask"].bool()
+     return input_ids, mask
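The hardcoded 111-entry `idx2tag` table above is in alphabetical order (all `B-` tags, then `I-` tags, then `O`), so an equivalent mapping could be generated from the tag vocabulary itself, assuming training enumerated the tags in the same sorted order:

```python
def build_idx2tag(tag_names):
    # Enumerate the sorted tag vocabulary, mirroring the hardcoded table:
    # 'B-*' tags sort first, then 'I-*' tags, with 'O' last (ASCII order).
    return {i: tag for i, tag in enumerate(sorted(tag_names))}

# A small subset of the real label set, for illustration only.
tags = ["O", "B-EMAIL", "I-EMAIL", "B-SSN", "I-SSN"]
assert build_idx2tag(tags) == {
    0: 'B-EMAIL', 1: 'B-SSN', 2: 'I-EMAIL', 3: 'I-SSN', 4: 'O'
}
```

Generating the table from the training label list would also remove the risk of the comment's "must match training" invariant silently drifting if labels are ever added or renamed.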