DeepActionPotential committed on
Commit e6ba27f · verified · 1 Parent(s): 7d8a038

🚀 Initial upload of my app
README.md CHANGED
@@ -1,20 +1,7 @@
- ---
- title: SaleSight – ML model for sales forecasting
- emoji: 📈
- colorFrom: indigo
- colorTo: green
- sdk: gradio
- sdk_version: "5.4.0"
- app_file: app.py
- pinned: false
- ---
-
- # Sales Forecasting with LightGBM
-
- A retail sales prediction application built with LightGBM and Gradio for interactive forecasting.
-
- ## 📊 Demo

  ![Demo Screenshot](./demo/demo.png)

@@ -22,24 +9,20 @@ A retail sales prediction application built with LightGBM and Gradio for interac

  ## ✨ Features

- - Interactive web interface for sales prediction
- - Takes into account various features including:
-   - Promotional events
-   - Holiday status
-   - Historical sales data (various lags and rolling means)
-   - Temporal features (day, month, year, day of week)
- - Built with LightGBM for fast and accurate predictions
- - Simple and intuitive user interface
-
- ## 🚀 Installation
-
- 1. Clone the repository:
  ```bash
- git clone https://github.com/yourusername/sales-forecasting.git
- cd sales-forecasting
  ```

- 2. Create and activate a virtual environment:
  ```bash
  # Create a virtual environment
  python -m venv .venv
@@ -51,34 +34,31 @@ A retail sales prediction application built with LightGBM and Gradio for interac
  .venv\Scripts\activate
  ```

- 3. Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```

- ## 🛠️ Usage
-
- 1. Run the application:
  ```bash
- python app.py
  ```

- 2. Open your web browser and navigate to the URL shown in the terminal (typically http://localhost:7860)
-
- 3. Input the required information:
-    - Promo status (0 or 1)
-    - Holiday status (0 or 1)
-    - Date in YYYY-MM-DD format
-    - Sales lags and rolling means
-
- 4. Click "Predict Sales" to see the prediction
-
- ## 📦 Dependencies
-
- - gradio >= 3.50.0
- - joblib >= 1.3.0
- - lightgbm >= 4.0.0
- - pandas >= 2.0.0

  ## 🤝 Contributing

@@ -96,6 +76,6 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file

  ## 🙏 Acknowledgements

- - [LightGBM](https://github.com/microsoft/LightGBM) - The gradient boosting framework used for predictions
- - [Gradio](https://gradio.app/) - For the simple web interface
- - [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis
+ # PIIDetector 🔒
+ Detecting Personally Identifiable Information (PII) using a BiLSTM-CRF model
+
+ ## 🚀 Demo
+
  ![Demo Screenshot](./demo/demo.png)

  ## ✨ Features

+ - **PII Detection**: Identify various types of Personally Identifiable Information in text
+ - **BiLSTM-CRF Model**: Utilizes a powerful deep learning model for sequence labeling
+ - **Streamlit Web Interface**: User-friendly interface for easy interaction
+ - **Multiple PII Types**: Detects various PII entities including names, addresses, financial information, and more
+
+ ## 📦 Installation
+
+ 1. **Clone the repository**
  ```bash
+ git clone https://github.com/yourusername/PIIDetector.git
+ cd PIIDetector
  ```

+ 2. **Create and activate a virtual environment**
  ```bash
  # Create a virtual environment
  python -m venv .venv
  .venv\Scripts\activate
  ```

+ 3. **Install dependencies**
  ```bash
  pip install -r requirements.txt
  ```

+ ## 🚀 Usage
+
+ 1. **Run the Streamlit app**
  ```bash
+ streamlit run app.py
  ```

+ 2. **Enter text** in the text area and click "Analyze" to detect PII entities
+
+ 3. **View results** in the table showing tokens and their predicted PII labels
+
+ ## 🛠 Configuration
+
+ The application uses a pre-trained BiLSTM-CRF model located in the `models/` directory. The model supports the following PII entity types:
+
+ - Personal Information (names, age, gender, etc.)
+ - Contact Information (emails, phone numbers, addresses)
+ - Financial Information (credit cards, account numbers, IBAN, etc.)
+ - Identification Numbers (SSN, passport numbers, etc.)
+ - And many more...

  ## 🤝 Contributing

  ## 🙏 Acknowledgements

+ - [Hugging Face Transformers](https://huggingface.co/transformers/)
+ - [PyTorch](https://pytorch.org/)
+ - [Streamlit](https://streamlit.io/)
__pycache__/model.cpython-311.pyc ADDED
Binary file (2.2 kB).
 
__pycache__/ui.cpython-311.pyc CHANGED
Binary files a/__pycache__/ui.cpython-311.pyc and b/__pycache__/ui.cpython-311.pyc differ
 
__pycache__/utils.cpython-311.pyc CHANGED
Binary files a/__pycache__/utils.cpython-311.pyc and b/__pycache__/utils.cpython-311.pyc differ
 
app.py CHANGED
@@ -1,7 +1,25 @@
- from utils import load_artifacts, predict_sales
- from ui import build_ui
-
- if __name__ == "__main__":
-     model, feature_cols = load_artifacts()
-     iface = build_ui(model, feature_cols, predict_sales)
-     iface.launch()

+ import streamlit as st
+ from utils import load_full_model_and_tokenizer
+ from ui import render_ui
+ from model import BiLSTMCRF
+
+ # Cache model and tokenizer
+ @st.cache_resource
+ def get_model_and_tokenizer():
+     return load_full_model_and_tokenizer("models/best_bilstm_crf_model.pt")
+
+ model, tokenizer, idx2tag = get_model_and_tokenizer()
+
+ def main():
+     st.title("🔒 Detecting PII with BiLSTM-CRF")
+
+     text = st.text_area("Enter text to analyze:", height=200)
+
+     if st.button("Analyze"):
+         if text.strip():
+             render_ui(text, model, tokenizer, idx2tag)
+         else:
+             st.warning("⚠️ Please enter some text.")
+
+ if __name__ == "__main__":
+     main()
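A note on the `@st.cache_resource` decorator above: Streamlit re-executes the whole script on every interaction, and the decorator keeps the loaded model in memory across those reruns, so the ~19 MB checkpoint is deserialized once per process rather than on every click. The effect can be sketched with plain `functools.lru_cache` (hypothetical names; this is not the app's actual code):

```python
import functools

calls = []  # records each time the "expensive" loader body actually runs

@functools.lru_cache(maxsize=1)
def get_model_and_tokenizer_sketch():
    calls.append(1)  # stands in for deserializing the .pt checkpoint
    return ("model", "tokenizer", {110: "O"})

first = get_model_and_tokenizer_sketch()
second = get_model_and_tokenizer_sketch()
assert first is second  # the cached object is reused
assert len(calls) == 1  # the loader body ran only once
```

`st.cache_resource` additionally shares the object across user sessions, which `lru_cache` does not model, but the once-per-process loading behaviour is the same idea.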
 
 
demo/demo.mp4 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:709f027723ef11b7699671bfb67b904580a63b70330dbef4069ffed351f4af8f
- size 896228
+ oid sha256:79e2d0b8ad23dfd91431fb1299a7c3c380cefccfa6eacb91e28d8c7921ccaf61
+ size 1011984
demo/demo.png CHANGED
detecting-pii-with-bilstm-crf-f1-91.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
model.py ADDED
@@ -0,0 +1,44 @@
+ import torch.nn as nn
+ from torchcrf import CRF
+
+ class BiLSTMCRF(nn.Module):
+     def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels, pad_idx=0, pad_label_id=-100):
+         super().__init__()
+         self.pad_label_id = pad_label_id
+
+         # Embedding layer for tokens
+         self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
+
+         # BiLSTM layer
+         self.lstm = nn.LSTM(
+             input_size=embedding_dim,
+             hidden_size=hidden_dim,
+             num_layers=1,
+             bidirectional=True,
+             batch_first=True
+         )
+
+         # Linear layer for projecting to label space
+         self.hidden2tag = nn.Linear(hidden_dim * 2, num_labels)
+
+         # CRF layer
+         self.crf = CRF(num_labels, batch_first=True)
+
+     def forward(self, input_ids, tags=None, mask=None):
+         embeds = self.embedding(input_ids)     # [B, L, E]
+         lstm_out, _ = self.lstm(embeds)        # [B, L, 2*H]
+         emissions = self.hidden2tag(lstm_out)  # [B, L, num_labels]
+
+         if tags is not None:
+             # Convert ignored labels to 0 for CRF
+             crf_tags = tags.clone()
+             crf_tags[crf_tags == self.pad_label_id] = 0
+
+             # Negative log likelihood
+             loss = -self.crf(emissions, crf_tags, mask=mask, reduction='mean')
+             return loss
+         else:
+             # Decode (Viterbi) paths
+             return self.crf.decode(emissions, mask=mask)
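The `tags is not None` branch above substitutes the `-100` padding sentinel with label `0` before calling the CRF, because `torchcrf.CRF` requires every tag index to lie in `[0, num_labels)`; the matching positions in `mask` are off, so the substituted values never contribute to the likelihood. A torch-free sketch of that substitution step (plain lists standing in for tensors):

```python
PAD_LABEL_ID = -100  # same sentinel the model uses

def replace_pad_labels(tags, pad_label_id=PAD_LABEL_ID):
    # Mirror of `crf_tags[crf_tags == self.pad_label_id] = 0`,
    # applied over nested lists instead of a tensor.
    return [[0 if t == pad_label_id else t for t in seq] for seq in tags]

batch = [[5, 7, -100, -100],
         [2, -100, -100, -100]]
mask  = [[1, 1, 0, 0],
         [1, 0, 0, 0]]  # CRF only scores positions where mask == 1

assert replace_pad_labels(batch) == [[5, 7, 0, 0], [2, 0, 0, 0]]
```

The choice of `0` as the replacement is arbitrary by design: any in-range index would do, since those positions are masked out of the loss.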
models/best_bilstm_crf_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46bf9fbdb9930ab8f11eea77fa5e4a325fa04c841bd3c00f21a7435d7375d41c
+ size 19073118
requirements.txt CHANGED
@@ -1,4 +1,5 @@
- gradio>=3.50.0
- joblib>=1.3.0
- lightgbm>=4.0.0
- pandas>=2.0.0
+ streamlit==1.31.0
+ torch==2.2.1
+ transformers==4.38.2
+ pandas==2.1.4
+ pytorch-crf==0.7.2
ui.py CHANGED
@@ -1,28 +1,26 @@
- import gradio as gr
-
- def build_ui(model, feature_cols, predict_fn):
-     with gr.Blocks() as demo:
-         gr.Markdown("## 🛒 Retail Sales Prediction App")
-
-         with gr.Row():
-             promo = gr.Radio([0, 1], label="Promo", value=0)
-             holiday = gr.Radio([0, 1], label="Holiday", value=0)
-
-         date = gr.Textbox(label="Date (YYYY-MM-DD)", value="2023-11-01")
-
-         with gr.Row():
-             lag_1 = gr.Number(label="Sales Lag 1 Day", value=100)
-             lag_7 = gr.Number(label="Sales Lag 7 Days", value=120)
-             mean_3 = gr.Number(label="Rolling Mean (3 Days)", value=110)
-             mean_7 = gr.Number(label="Rolling Mean (7 Days)", value=115)
-
-         predict_btn = gr.Button("Predict Sales")
-         output = gr.Number(label="Predicted Sales", precision=2)
-
-         predict_btn.click(
-             fn=lambda p, h, d, l1, l7, m3, m7: predict_fn(model, feature_cols, p, h, d, l1, l7, m3, m7),
-             inputs=[promo, holiday, date, lag_1, lag_7, mean_3, mean_7],
-             outputs=output
-         )
-
-     return demo

+ import streamlit as st
+ from utils import prepare_inputs
+ import torch
+ import pandas as pd
+
+ def render_ui(text, model, tokenizer, idx2tag):
+     # Prepare inputs
+     input_ids, mask = prepare_inputs(text, tokenizer)
+
+     # Run model
+     with torch.no_grad():
+         predictions = model(input_ids, mask=mask)
+
+     tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
+     labels = [idx2tag.get(tag, "O") for tag in predictions[0]]
+
+     # Build table data
+     rows = []
+     for token, label in zip(tokens, labels):
+         rows.append({"Token": token, "Predicted Label": label})
+
+     df = pd.DataFrame(rows)
+
+     # Show in Streamlit
+     st.subheader("🔍 Predictions")
+     st.dataframe(df, use_container_width=True)  # or st.table(df) for a static table
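`render_ui` reports one BIO label per wordpiece, which is exactly what the table in the README's usage section describes. If entity-level output were wanted instead, adjacent `B-`/`I-` labels could be merged into spans; a minimal sketch (the `merge_bio_spans` helper is illustrative, not part of this repo):

```python
def merge_bio_spans(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, text) spans."""
    spans, current_type, current_toks = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current_type:  # close any open span first
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            current_toks.append(tok)  # continue the open span
        else:  # "O", or an I- tag with no matching open span
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_type:  # flush a span that runs to the end of the sequence
        spans.append((current_type, " ".join(current_toks)))
    return spans

tokens = ["Contact", "jane", "@", "x.com", "now"]
labels = ["O", "B-EMAIL", "I-EMAIL", "I-EMAIL", "O"]
assert merge_bio_spans(tokens, labels) == [("EMAIL", "jane @ x.com")]
```

Joining with spaces is a simplification; a real version would detokenize wordpieces (e.g. strip `##` prefixes) before joining.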
 
 
 
 
utils.py CHANGED
@@ -1,53 +1,141 @@
- import joblib
- import lightgbm as lgb
- import pandas as pd
-
- # Load artifacts
- def load_artifacts():
-     model = lgb.Booster(model_file="models/lgb_sales_model.txt")
-     feature_cols = joblib.load("models/feature_cols.pkl")
-     return model, feature_cols
-
- # Preprocess new input row into model-ready features
- def preprocess_input(promo, holiday, date, past_sales):
      """
-     Args:
-         promo: int (0/1)
-         holiday: int (0/1)
-         date: datetime-like
-         past_sales: dict with keys ['lag_1','lag_7','mean_3','mean_7']
-
-     Returns:
-         pd.DataFrame with a single row ready for prediction
      """
-     date = pd.to_datetime(date)
-
-     features = {
-         "promo": promo,
-         "holiday": holiday,
-         "day": date.day,
-         "month": date.month,
-         "year": date.year,
-         "day_of_week": date.weekday(),
-         "is_weekend": 1 if date.weekday() >= 5 else 0,
-         "sales_lag_1": past_sales.get("lag_1", 0),
-         "sales_lag_7": past_sales.get("lag_7", 0),
-         "rolling_mean_3": past_sales.get("mean_3", 0),
-         "rolling_mean_7": past_sales.get("mean_7", 0),
-     }
-
-     return pd.DataFrame([features])
-
- # Prediction
- def predict_sales(model, feature_cols, promo, holiday, date, lag_1, lag_7, mean_3, mean_7):
-     past_sales = {
-         "lag_1": lag_1,
-         "lag_7": lag_7,
-         "mean_3": mean_3,
-         "mean_7": mean_7,
-     }
-     X = preprocess_input(promo, holiday, date, past_sales)
-     X = X[feature_cols]  # ensure correct column order
-     prediction = model.predict(X)[0]
-     return round(prediction, 2)
+ import torch
+ from transformers import BertTokenizerFast
+ from model import BiLSTMCRF  # make sure model.py exists
+
+ def load_full_model_and_tokenizer(path):
+     """
+     Loads the FULL BiLSTM-CRF model (torch.save(model, ...)) and tokenizer.
+     """
+     tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
+
+     # Load full model
+     model = torch.load(path, map_location="cpu", weights_only=False)
+     model.eval()
+
+     # Define tag mapping (must match training)
+     idx2tag = {
+         0: 'B-ACCOUNTNAME', 1: 'B-ACCOUNTNUMBER', 2: 'B-AGE', 3: 'B-AMOUNT',
+         4: 'B-BIC', 5: 'B-BITCOINADDRESS', 6: 'B-BUILDINGNUMBER', 7: 'B-CITY',
+         8: 'B-COMPANYNAME', 9: 'B-COUNTY', 10: 'B-CREDITCARDCVV',
+         11: 'B-CREDITCARDISSUER', 12: 'B-CREDITCARDNUMBER', 13: 'B-CURRENCY',
+         14: 'B-CURRENCYCODE', 15: 'B-CURRENCYNAME', 16: 'B-CURRENCYSYMBOL',
+         17: 'B-DATE', 18: 'B-DOB', 19: 'B-EMAIL', 20: 'B-ETHEREUMADDRESS',
+         21: 'B-EYECOLOR', 22: 'B-FIRSTNAME', 23: 'B-GENDER', 24: 'B-HEIGHT',
+         25: 'B-IBAN', 26: 'B-IP', 27: 'B-IPV4', 28: 'B-IPV6', 29: 'B-JOBAREA',
+         30: 'B-JOBTITLE', 31: 'B-JOBTYPE', 32: 'B-LASTNAME',
+         33: 'B-LITECOINADDRESS', 34: 'B-MAC', 35: 'B-MASKEDNUMBER',
+         36: 'B-MIDDLENAME', 37: 'B-NEARBYGPSCOORDINATE', 38: 'B-ORDINALDIRECTION',
+         39: 'B-PASSWORD', 40: 'B-PHONEIMEI', 41: 'B-PHONENUMBER', 42: 'B-PIN',
+         43: 'B-PREFIX', 44: 'B-SECONDARYADDRESS', 45: 'B-SEX', 46: 'B-SSN',
+         47: 'B-STATE', 48: 'B-STREET', 49: 'B-TIME', 50: 'B-URL',
+         51: 'B-USERAGENT', 52: 'B-USERNAME', 53: 'B-VEHICLEVIN',
+         54: 'B-VEHICLEVRM', 55: 'B-ZIPCODE',
+         56: 'I-ACCOUNTNAME', 57: 'I-ACCOUNTNUMBER', 58: 'I-AGE', 59: 'I-AMOUNT',
+         60: 'I-BIC', 61: 'I-BITCOINADDRESS', 62: 'I-BUILDINGNUMBER', 63: 'I-CITY',
+         64: 'I-COMPANYNAME', 65: 'I-COUNTY', 66: 'I-CREDITCARDCVV',
+         67: 'I-CREDITCARDISSUER', 68: 'I-CREDITCARDNUMBER', 69: 'I-CURRENCY',
+         70: 'I-CURRENCYCODE', 71: 'I-CURRENCYNAME', 72: 'I-CURRENCYSYMBOL',
+         73: 'I-DATE', 74: 'I-DOB', 75: 'I-EMAIL', 76: 'I-ETHEREUMADDRESS',
+         77: 'I-EYECOLOR', 78: 'I-FIRSTNAME', 79: 'I-GENDER', 80: 'I-HEIGHT',
+         81: 'I-IBAN', 82: 'I-IP', 83: 'I-IPV4', 84: 'I-IPV6', 85: 'I-JOBAREA',
+         86: 'I-JOBTITLE', 87: 'I-JOBTYPE', 88: 'I-LASTNAME',
+         89: 'I-LITECOINADDRESS', 90: 'I-MAC', 91: 'I-MASKEDNUMBER',
+         92: 'I-MIDDLENAME', 93: 'I-NEARBYGPSCOORDINATE', 94: 'I-PASSWORD',
+         95: 'I-PHONEIMEI', 96: 'I-PHONENUMBER', 97: 'I-PIN', 98: 'I-PREFIX',
+         99: 'I-SECONDARYADDRESS', 100: 'I-SSN', 101: 'I-STATE', 102: 'I-STREET',
+         103: 'I-TIME', 104: 'I-URL', 105: 'I-USERAGENT', 106: 'I-USERNAME',
+         107: 'I-VEHICLEVIN', 108: 'I-VEHICLEVRM', 109: 'I-ZIPCODE',
+         110: 'O'}
+
+     return model, tokenizer, idx2tag
+
+ def prepare_inputs(text, tokenizer, max_length=128):
+     encoding = tokenizer(
+         text.split(),
+         is_split_into_words=True,
+         padding="max_length",
+         truncation=True,
+         max_length=max_length,
+         return_tensors="pt"
+     )
+     input_ids = encoding["input_ids"]
+     mask = encoding["attention_mask"].bool()
+     return input_ids, mask
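The hardcoded 111-entry `idx2tag` table above is in alphabetical order (all `B-` tags, then `I-` tags, then `O`), so an equivalent mapping could be generated from the tag vocabulary itself, assuming training enumerated the tags in the same sorted order:

```python
def build_idx2tag(tag_names):
    # Enumerate the sorted tag vocabulary, mirroring the hardcoded table:
    # 'B-*' tags sort first, then 'I-*' tags, with 'O' last (ASCII order).
    return {i: tag for i, tag in enumerate(sorted(tag_names))}

# A small subset of the real label set, for illustration only.
tags = ["O", "B-EMAIL", "I-EMAIL", "B-SSN", "I-SSN"]
assert build_idx2tag(tags) == {
    0: 'B-EMAIL', 1: 'B-SSN', 2: 'I-EMAIL', 3: 'I-SSN', 4: 'O'
}
```

Generating the table from the training label list would also remove the risk of the comment's "must match training" invariant silently drifting if labels are ever added or renamed.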