Space: Perth0603/Random-Forest-Model-for-PhishingDetection

Perth0603 committed on Oct 4, 2025

Commit b16cfad (verified) · 1 Parent(s): 99ed65e

Upload 7 files

Files changed (5)

README.md +31 -0
app.py +111 -23
autocalib_legit.csv +27 -0
autocalib_phishy.csv +19 -0
known_hosts.csv +14 -0

README.md CHANGED

@@ -13,6 +13,7 @@ This Space exposes two endpoints so the Flutter app can call them reliably:
   - `phishing_probability` is always the raw probability of phishing (0..1)
   - `label` is `PHISH` when `phishing_probability >= threshold`, else `LEGIT`
   - `score` is the confidence for the predicted label (for `LEGIT`, `score = 1 - phishing_probability`), which lets the app show "Safe Confidence" for legitimate URLs
+  - Also includes `predicted_label` (0→LEGIT, 1→PHISH) aligned to dataset polarity, and `raw_proba_class1` for debugging
 ## Files
 - Dockerfile - builds a small FastAPI server image
@@ -26,6 +27,9 @@ This Space exposes two endpoints so the Flutter app can call them reliably:
    - MODEL_ID = Perth0603/phishing-email-mobilebert
    - URL_REPO = Perth0603/Random-Forest-Model-for-PhishingDetection
    - URL_FILENAME = url_rf_model.joblib  (set to your artifact filename)
+   - Alternatively use: HF_URL_MODEL_ID, HF_URL_REPO_TYPE, HF_URL_FILENAME
+   - Optional: AUTOCALIB_PHISHY_CSV, AUTOCALIB_LEGIT_CSV, KNOWN_HOSTS_CSV
+   - Optional: URL_POSITIVE_CLASS (PHISH or LEGIT)
 4. Wait for the Space to build and become green. Test:
    - GET `/` should return `{ status: ok, model: ... }`
    - POST `/predict` with `{ "inputs": "Win an iPhone! Click here" }`
@@ -42,3 +46,30 @@ Run the app:
 ```
 flutter run --dart-define-from-file=hf.env.json
 ```
+## CSV configuration
+You can provide CSV files to customize autocalibration URLs and known host overrides.
+Formats:
+```
+# autocalib_phishy.csv
+url
+http://198.51.100.23/login/update?acc=123
+http://secure-login-account-update.example.com/session?id=123
+```
+```
+# autocalib_legit.csv
+url
+https://www.wikipedia.org/
+https://www.python.org/
+```
+```
+# known_hosts.csv
+host,label
+cjplogger.com,LEGIT
+bad-login-update.example.com,PHISH
+```
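
Note on the response fields described in the first README hunk: the relationship between `phishing_probability`, `label`, and `score` can be sketched as below. This is a minimal illustration, not the Space's actual handler. The function name `build_url_response` and the default threshold of 0.5 are assumptions, and `predicted_label` is assumed here to follow the same threshold decision. `raw_proba_class1` is left out because it depends on the model's class polarity.

```python
from typing import Any, Dict


def build_url_response(phishing_probability: float, threshold: float = 0.5) -> Dict[str, Any]:
    """Hypothetical helper that mirrors the README's field definitions."""
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("phishing_probability must be within 0..1")

    # PHISH when the probability reaches the threshold, else LEGIT.
    is_phish = phishing_probability >= threshold
    label = "PHISH" if is_phish else "LEGIT"

    # score is the confidence for the predicted label, so a LEGIT
    # result reports 1 - phishing_probability ("Safe Confidence").
    score = phishing_probability if is_phish else 1.0 - phishing_probability

    return {
        "label": label,
        "score": score,
        "phishing_probability": phishing_probability,
        "predicted_label": 1 if is_phish else 0,  # assumed: 0 -> LEGIT, 1 -> PHISH
    }


if __name__ == "__main__":
    print(build_url_response(0.25))
    print(build_url_response(0.90))
```

With this shape the client can always show `score` next to `label` without checking which class was predicted.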

app.py CHANGED

@@ -6,6 +6,7 @@ os.environ.setdefault("TRANSFORMERS_CACHE", "/data/.cache")
 os.environ.setdefault("TORCH_HOME", "/data/.cache")
 from typing import Optional, List, Dict, Any
+import csv
 from urllib.parse import urlparse
 import threading
 import re
@@ -72,29 +73,99 @@ _url_phish_is_positive: Optional[bool] = None
 # -------------------------
 # You can edit these lists to define which URLs are considered obviously phishy/legit
 # for polarity auto-calibration of classical URL models (e.g., XGBoost, scikit-learn).
-_AUTOCALIB_PHISHY_URLS: List[str] = [
-    "http://198.51.100.23/login/update?acc=123",
-    "http://secure-login-account-update.example.com/session?id=123",
-    "http://bank.verify-update-security.com/confirm",
-    "http://paypal.com.account-verify.cn/login",
-    "http://abc.xyz/downloads/invoice.exe",
-]
-_AUTOCALIB_LEGIT_URLS: List[str] = [
-    "https://www.wikipedia.org/",
-    "https://www.microsoft.com/",
-    "https://www.openai.com/",
-    "https://www.python.org/",
-    "https://www.gov.uk/",
-]
-# Known host overrides (editable): force certain domains as LEGIT or PHISH
-_KNOWN_LEGIT_HOSTS: List[str] = [
-    "cjplogger.com",
-    "www.cjplogger.com",
-]
-_KNOWN_PHISH_HOSTS: List[str] = [
-]
+# Loaded from CSV. Provide via AUTOCALIB_PHISHY_CSV or hf_space/autocalib_phishy.csv
+_AUTOCALIB_PHISHY_URLS: List[str] = []
+# Loaded from CSV. Provide via AUTOCALIB_LEGIT_CSV or hf_space/autocalib_legit.csv
+_AUTOCALIB_LEGIT_URLS: List[str] = []
+# Known host overrides (CSV-driven): hf_space/known_hosts.csv or KNOWN_HOSTS_CSV
+_KNOWN_LEGIT_HOSTS: List[str] = []
+_KNOWN_PHISH_HOSTS: List[str] = []
+# -------------------------
+# CSV configuration support (optional)
+# -------------------------
+def _read_urls_from_csv(path: str) -> List[str]:
+    urls: List[str] = []
+    try:
+        with open(path, newline="", encoding="utf-8") as f:
+            reader = csv.DictReader(f)
+            if "url" in (reader.fieldnames or []):
+                for row in reader:
+                    val = str(row.get("url", "")).strip()
+                    if val:
+                        urls.append(val)
+            else:
+                f.seek(0)
+                f2 = csv.reader(f)
+                for row in f2:
+                    if not row:
+                        continue
+                    val = str(row[0]).strip()
+                    if val.lower() == "url":
+                        continue
+                    if val:
+                        urls.append(val)
+    except Exception as e:
+        print(f"[csv] failed reading URLs from {path}: {e}")
+    return urls
+def _read_hosts_from_csv(path: str) -> Dict[str, str]:
+    host_to_label: Dict[str, str] = {}
+    try:
+        with open(path, newline="", encoding="utf-8") as f:
+            reader = csv.DictReader(f)
+            fields = [x.lower() for x in (reader.fieldnames or [])]
+            if "host" in fields and "label" in fields:
+                for row in reader:
+                    host = str(row.get("host", "")).strip().lower()
+                    label = str(row.get("label", "")).strip().upper()
+                    if host and label in ("PHISH", "LEGIT"):
+                        host_to_label[host] = label
+            else:
+                f.seek(0)
+                f2 = csv.reader(f)
+                for row in f2:
+                    if len(row) < 2:
+                        continue
+                    host = str(row[0]).strip().lower()
+                    label = str(row[1]).strip().upper()
+                    if host.lower() == "host" and label == "LABEL":
+                        continue
+                    if host and label in ("PHISH", "LEGIT"):
+                        host_to_label[host] = label
+    except Exception as e:
+        print(f"[csv] failed reading hosts from {path}: {e}")
+    return host_to_label
+def _load_csv_configs_if_any():
+    base_dir = os.path.dirname(__file__)
+    phishy_csv = os.environ.get("AUTOCALIB_PHISHY_CSV", os.path.join(base_dir, "autocalib_phishy.csv"))
+    legit_csv = os.environ.get("AUTOCALIB_LEGIT_CSV", os.path.join(base_dir, "autocalib_legit.csv"))
+    hosts_csv = os.environ.get("KNOWN_HOSTS_CSV", os.path.join(base_dir, "known_hosts.csv"))
+    if os.path.exists(phishy_csv):
+        urls = _read_urls_from_csv(phishy_csv)
+        if urls:
+            print(f"[csv] loaded phishy URLs: {len(urls)} from {phishy_csv}")
+            _AUTOCALIB_PHISHY_URLS[:] = urls
+    if os.path.exists(legit_csv):
+        urls = _read_urls_from_csv(legit_csv)
+        if urls:
+            print(f"[csv] loaded legit URLs: {len(urls)} from {legit_csv}")
+            _AUTOCALIB_LEGIT_URLS[:] = urls
+    if os.path.exists(hosts_csv):
+        mapping = _read_hosts_from_csv(hosts_csv)
+        if mapping:
+            print(f"[csv] loaded known hosts: {len(mapping)} from {hosts_csv}")
+            _KNOWN_LEGIT_HOSTS.clear()
+            _KNOWN_PHISH_HOSTS.clear()
+            for host, label in mapping.items():
+                if label == "LEGIT":
+                    _KNOWN_LEGIT_HOSTS.append(host)
+                elif label == "PHISH":
+                    _KNOWN_PHISH_HOSTS.append(host)
 # -------------------------
 # URL features (must match training)
@@ -194,6 +265,21 @@ def _auto_calibrate_phish_positive(bundle: Dict[str, Any], feature_cols: List[st
     phishy = _AUTOCALIB_PHISHY_URLS
     legit = _AUTOCALIB_LEGIT_URLS
+    # Guard: if CSVs are empty, fall back to safe defaults
+    if not phishy:
+        phishy = [
+            "http://198.51.100.23/login/update?acc=123",
+            "http://secure-login-account-update.example.com/session?id=123",
+            "http://bank.verify-update-security.com/confirm",
+            "http://paypal.com.account-verify.cn/login",
+        ]
+    if not legit:
+        legit = [
+            "https://www.wikipedia.org/",
+            "https://www.python.org/",
+            "https://www.microsoft.com/",
+            "https://www.openai.com/",
+        ]
     model = bundle.get("model")
     model_type: str = str(bundle.get("model_type") or "")
@@ -259,6 +345,8 @@ def _startup():
         print(f"[startup] text model load failed: {e}")
     try:
         _load_url_model()
+        # Load CSV-based config if present
+        _load_csv_configs_if_any()
         global _url_phish_is_positive
         b = _url_bundle
         if isinstance(b, dict) and _url_phish_is_positive is None:
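
Note on `_read_hosts_from_csv`: the header check lowercases the field names, but the row lookups use the literal keys `"host"` and `"label"`. A file whose header is written as `Host,Label` therefore appears to pass the check and then yield no rows. The sketch below shows one way to make the lookup tolerant of header case. It is not the commit's code: the name `read_hosts` and the stream-based signature are assumptions chosen so the example runs without touching the filesystem.

```python
import csv
import io
from typing import Dict, TextIO

VALID_LABELS = {"PHISH", "LEGIT"}


def read_hosts(stream: TextIO) -> Dict[str, str]:
    """Parse host,label rows, matching header names case-insensitively."""
    host_to_label: Dict[str, str] = {}
    for row in csv.DictReader(stream):
        # Normalize keys so "Host,Label" and "host,label" behave the same.
        # Missing cells arrive as None, hence the `or ""` guards.
        norm = {(k or "").strip().lower(): (v or "") for k, v in row.items()}
        host = str(norm.get("host", "")).strip().lower()
        label = str(norm.get("label", "")).strip().upper()
        if host and label in VALID_LABELS:
            host_to_label[host] = label
    return host_to_label


if __name__ == "__main__":
    sample = io.StringIO(
        "Host,Label\n"
        "Example.com,legit\n"
        "bad.example.net,PHISH\n"
        "skip.example.org,maybe\n"
    )
    print(read_hosts(sample))
```

When reading from disk, opening the file with `encoding="utf-8-sig"` would additionally tolerate a byte order mark in front of the header, which spreadsheet exports sometimes add.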

autocalib_legit.csv ADDED

@@ -0,0 +1,27 @@
+url
+https://www.wikipedia.org/
+https://www.microsoft.com/
+https://www.openai.com/
+https://www.python.org/
+https://www.gov.uk/
+https://www.google.com/
+https://www.apple.com/
+https://www.amazon.com/
+https://www.github.com/
+https://stackoverflow.com/
+https://www.nytimes.com/
+https://www.bbc.com/
+https://www.cnn.com/
+https://www.gov.sg/
+https://www.whitehouse.gov/
+https://www.europa.eu/
+https://www.cloudflare.com/
+https://www.dropbox.com/
+https://drive.google.com/
+https://www.paypal.com/
+https://www.facebook.com/
+https://www.linkedin.com/
+https://www.youtube.com/
+https://www.reddit.com/
+http://www.cjplogger.com/

autocalib_phishy.csv ADDED

@@ -0,0 +1,19 @@
+url
+http://198.51.100.23/login/update?acc=123
+http://secure-login-account-update.example.com/session?id=123
+http://bank.verify-update-security.com/confirm
+http://paypal.com.account-verify.cn/login
+http://abc.xyz/downloads/invoice.exe
+http://update-login-security-paypal.com/verify
+http://login-secure-paypa1.com/
+http://verify-account-bankof-usa.example.co/reset
+http://support.microsoft.com.example.net/reset-password
+http://secure.appleid.apple.com.example.co/login
+http://drive-google-com.example.org/share/document?id=123
+http://198.51.100.45/pay/confirm?trx=9988
+http://203.0.113.10/parcel/tracking/update
+http://signin-amazon.example.tk/refund
+http://security-update-facebook.example.in/login
+http://login-secure-outlook.example.biz/
+http://dropbox-login.example.co/downloads/setup.zip
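
Note on how the two autocalibration lists might be used: the body of `_auto_calibrate_phish_positive` is not part of this diff, so the following is only one plausible rule for deciding whether the model's class 1 means phishing. It compares the mean class-1 probability on the phishy examples with the mean on the legit examples. The function name `class1_means_phish` and the `proba_class1` callable are hypothetical stand-ins for the real model and feature pipeline.

```python
from statistics import mean
from typing import Callable, Sequence


def class1_means_phish(
    proba_class1: Callable[[str], float],
    phishy_urls: Sequence[str],
    legit_urls: Sequence[str],
) -> bool:
    """Return True when class 1 scores higher on phishy URLs than on legit ones."""
    if not phishy_urls or not legit_urls:
        raise ValueError("both URL lists must be non-empty")
    phishy_mean = mean(proba_class1(u) for u in phishy_urls)
    legit_mean = mean(proba_class1(u) for u in legit_urls)
    return phishy_mean > legit_mean


if __name__ == "__main__":
    # Toy scorer standing in for a trained model (assumption for the demo).
    def toy_scorer(url: str) -> float:
        return 0.9 if "login" in url else 0.1

    phishy = ["http://login-secure-paypa1.com/", "http://198.51.100.23/login/update?acc=123"]
    legit = ["https://www.wikipedia.org/", "https://www.python.org/"]
    print(class1_means_phish(toy_scorer, phishy, legit))
```

Under a rule like this, longer and more varied lists make the comparison less sensitive to any single URL, which is one reason to keep the examples in editable CSV files.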

known_hosts.csv ADDED

@@ -0,0 +1,14 @@
+host,label
+cjplogger.com,LEGIT
+www.cjplogger.com,LEGIT
+wikipedia.org,LEGIT
+www.wikipedia.org,LEGIT
+microsoft.com,LEGIT
+www.microsoft.com,LEGIT
+google.com,LEGIT
+www.google.com,LEGIT
+github.com,LEGIT
+www.github.com,LEGIT
+python.org,LEGIT
+www.python.org,LEGIT
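
Note on applying the overrides: the diff does not show where `_KNOWN_LEGIT_HOSTS` and `_KNOWN_PHISH_HOSTS` are consulted, so this is a hedged sketch of an exact-match lookup. The helper name `known_host_label`, the handling of scheme-less input, and the choice to check the PHISH list first are all assumptions.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def known_host_label(
    url: str,
    legit_hosts: Iterable[str],
    phish_hosts: Iterable[str],
) -> Optional[str]:
    """Return "PHISH" or "LEGIT" for an exactly matching host, else None."""
    # urlparse only fills in hostname when a scheme separator is present,
    # so prepend one for bare inputs such as "github.com/login".
    candidate = url if "://" in url else "http://" + url
    host = (urlparse(candidate).hostname or "").lower()
    if not host:
        return None
    if host in {h.lower() for h in phish_hosts}:
        return "PHISH"
    if host in {h.lower() for h in legit_hosts}:
        return "LEGIT"
    return None


if __name__ == "__main__":
    legit = ["cjplogger.com", "www.cjplogger.com", "github.com"]
    phish = ["bad-login-update.example.com"]
    print(known_host_label("https://www.cjplogger.com/path", legit, phish))
    print(known_host_label("github.com/login", legit, phish))
    print(known_host_label("https://unknown.example.org/", legit, phish))
```

With exact matching, `example.com` and `www.example.com` are distinct keys, which may be why the CSV above lists both forms for every host.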