Perth0603 committed on
Commit b057179 · verified · 1 Parent(s): ef42414

Upload 4 files

Files changed (4)
  1. Dockerfile +28 -0
  2. README.md +38 -49
  3. app.py +177 -0
  4. requirements.txt +14 -33
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.10-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1
+
+ WORKDIR /app
+
+ # Writable cache directory for HF/torch
+ RUN mkdir -p /data/.cache && chmod -R 777 /data
+ ENV HF_HOME=/data/.cache \
+     TRANSFORMERS_CACHE=/data/.cache \
+     TORCH_HOME=/data/.cache
+
+ # System deps (optional but helps with torch wheels)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential git && \
+     rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt /app/requirements.txt
+ RUN pip install -r /app/requirements.txt
+
+ COPY app.py /app/app.py
+
+ EXPOSE 7860
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,52 +1,41 @@
- Random Forest / XGBoost Model for URL Phishing Detection
-
- This repository contains a trained tree-based classifier for detecting phishing URLs. The model was trained on the `PhiUSIIL_Phishing_URL_Dataset.csv` with lightweight, URL-only lexical and structural features. On the held-out test split it achieved high accuracy and F1.
-
- Highlights
- - Backend: gradient-boosted trees via XGBoost (uses GPU if available; falls back to CPU).
- - Input: raw URL string only (no external DNS/WHOIS calls needed).
- - Features: length, character counts, digit ratio, IPv4 presence, common phishing tokens, scheme/TLD heuristics.
-
- Test metrics (from notebook)
- - accuracy: 0.9952
- - precision: 0.9928
- - recall: 0.9989
- - f1: 0.9958
- - roc_auc: 0.9976
-
- Files
- - `rf_url_phishing_xgboost_bst.joblib`: joblib bundle with the trained model and metadata.
- - `inference.py`: helpers to load the bundle and run `predict_url()`.
- - `requirements.txt`: minimal dependencies for local inference.
-
- Quick start (local)
- 1) Install dependencies
- ```bash
- pip install -r requirements.txt
- ```
-
- 2) Predict a single URL
- ```python
- from inference import load_bundle, predict_url
-
- bundle = load_bundle("rf_url_phishing_xgboost_bst.joblib")
- result = predict_url(
-     url="http://secure-login-account-update.example.com/session?id=123",
-     bundle=bundle,
-     threshold=0.5,
- )
- print(result)
- ```
-
- Bundle contents
- The joblib bundle contains:
- - `model`: trained XGBoost booster
- - `feature_cols`: ordered list of feature names expected by the model
- - `url_col`: original URL column name
- - `label_col`: label column name used in training
- - `model_type`: string identifying the backend (here: `xgboost_bst`)
-
- License
- This model is provided for research and educational purposes only. Evaluate thoroughly before use in production.
+ ---
+ title: PhishWatch Proxy
+ emoji: 🛡️
+ sdk: docker
+ ---
+
+ # Hugging Face Space - Phishing Text Classifier (Docker + FastAPI)
+
+ This Space exposes two endpoints so the Flutter app can call them reliably:
+
+ - `/predict` for text/email/SMS classification via Transformers
+ - `/predict-url` for URL classification via your scikit-learn Random Forest model
+
+ ## Files
+ - Dockerfile - builds a small FastAPI server image.
+ - app.py - FastAPI app that loads the model and returns `{ label, score }`.
+ - requirements.txt - Python dependencies.
+
+ ## How to deploy
+ 1. Create a new Space on Hugging Face (type: Docker).
+ 2. Upload the contents of this `hf_space/` folder to the Space root (including the Dockerfile).
+ 3. In Space Settings → Variables, add:
+    - MODEL_ID = Perth0603/phishing-email-mobilebert
+    - URL_REPO = Perth0603/Random-Forest-Model-for-PhishingDetection
+    - URL_FILENAME = url_rf_model.joblib (set to your artifact filename)
+ 4. Wait for the Space to build and become green. Test:
+    - GET `/` should return `{ status: ok, model: ... }`
+    - POST `/predict` with `{ "inputs": "Win an iPhone! Click here" }`
+    - POST `/predict-url` with `{ "url": "https://example.com/login" }`
+
+ ## Flutter app config
+ Set the Space URL in your env file so the app targets the Space instead of the Hosted Inference API:
+
+ ```
+ {"HF_SPACE_URL":"https://<your-space>.hf.space"}
+ ```
+
+ Run the app:
+ ```
+ flutter run --dart-define-from-file=hf.env.json
+ ```
app.py ADDED
@@ -0,0 +1,177 @@
+ import os
+ os.environ.setdefault("HOME", "/data")
+ os.environ.setdefault("XDG_CACHE_HOME", "/data/.cache")
+ os.environ.setdefault("HF_HOME", "/data/.cache")
+ os.environ.setdefault("TRANSFORMERS_CACHE", "/data/.cache")
+ os.environ.setdefault("TORCH_HOME", "/data/.cache")
+
+ from fastapi import FastAPI
+ from fastapi.responses import JSONResponse
+ from pydantic import BaseModel
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ from huggingface_hub import hf_hub_download
+ import joblib
+ import torch
+ import re
+ import numpy as np
+ import pandas as pd
+ try:
+     import xgboost as xgb  # type: ignore
+ except Exception:
+     xgb = None  # optional; required if the bundle uses xgboost
+
+
+ MODEL_ID = os.environ.get("MODEL_ID", "Perth0603/phishing-email-mobilebert")
+ URL_REPO = os.environ.get("URL_REPO", "Perth0603/Random-Forest-Model-for-PhishingDetection")
+ URL_REPO_TYPE = os.environ.get("URL_REPO_TYPE", "model")  # model|space|dataset
+ # NOTE: set to your artifact filename, e.g. rf_url_phishing_xgboost_bst.joblib
+ URL_FILENAME = os.environ.get("URL_FILENAME", "rf_url_phishing_xgboost_bst.joblib")
+
+ # Ensure writable cache directory for HF/torch inside Spaces Docker
+ CACHE_DIR = os.environ.get("HF_CACHE_DIR", "/data/.cache")
+ os.makedirs(CACHE_DIR, exist_ok=True)
+
+ app = FastAPI(title="Phishing Text Classifier", version="1.0.0")
+
+
+ class PredictPayload(BaseModel):
+     inputs: str
+
+
+ # Lazy singletons for model/tokenizer
+ _tokenizer = None
+ _model = None
+ _url_bundle = None  # holds dict: {model, feature_cols, url_col, label_col, model_type}
+
+
+ def _load_url_model():
+     global _url_bundle
+     if _url_bundle is None:
+         # Prefer local artifact if present (e.g., committed into the Space repo)
+         local_path = os.path.join(os.getcwd(), URL_FILENAME)
+         if os.path.exists(local_path):
+             _url_bundle = joblib.load(local_path)
+             return
+         # Download model artifact from HF Hub
+         model_path = hf_hub_download(
+             repo_id=URL_REPO,
+             filename=URL_FILENAME,
+             repo_type=URL_REPO_TYPE,
+             cache_dir=CACHE_DIR,
+         )
+         _url_bundle = joblib.load(model_path)
+
+
+ # URL feature engineering (must match training)
+ _SUSPICIOUS_TOKENS = ["login", "verify", "secure", "update", "bank", "pay", "account", "webscr"]
+ _ipv4_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')
+
+ def _engineer_features(df: pd.DataFrame, url_col: str, feature_cols: list[str] | None = None) -> pd.DataFrame:
+     s = df[url_col].astype(str)
+     out = pd.DataFrame(index=df.index)
+     out['url_len'] = s.str.len().fillna(0)
+     out['count_dot'] = s.str.count(r'\.')
+     out['count_hyphen'] = s.str.count('-')
+     out['count_digit'] = s.str.count(r'\d')
+     out['count_at'] = s.str.count('@')
+     out['count_qmark'] = s.str.count(r'\?')
+     out['count_eq'] = s.str.count('=')
+     out['count_slash'] = s.str.count('/')
+     out['digit_ratio'] = (out['count_digit'] / out['url_len'].replace(0, np.nan)).fillna(0)
+     out['has_ip'] = s.str.contains(_ipv4_pattern).astype(int)
+     for tok in _SUSPICIOUS_TOKENS:
+         out[f'has_{tok}'] = s.str.contains(tok, case=False, regex=False).astype(int)
+     out['starts_https'] = s.str.startswith('https').astype(int)
+     out['ends_with_exe'] = s.str.endswith('.exe').astype(int)
+     out['ends_with_zip'] = s.str.endswith('.zip').astype(int)
+     return out if feature_cols is None else out[feature_cols]
+
+
+ def _load_model():
+     global _tokenizer, _model
+     if _tokenizer is None or _model is None:
+         _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
+         _model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, cache_dir=CACHE_DIR)
+         # Warm-up
+         with torch.no_grad():
+             _ = _model(**_tokenizer(["warm up"], return_tensors="pt")).logits
+
+
+ @app.get("/")
+ def root():
+     return {"status": "ok", "model": MODEL_ID}
+
+
+ @app.post("/predict")
+ def predict(payload: PredictPayload):
+     try:
+         _load_model()
+         with torch.no_grad():
+             inputs = _tokenizer([payload.inputs], return_tensors="pt", truncation=True, max_length=512)
+             logits = _model(**inputs).logits
+             probs = torch.softmax(logits, dim=-1)[0]
+             score, idx = torch.max(probs, dim=0)
+     except Exception as e:
+         return JSONResponse(status_code=500, content={"error": str(e)})
+
+     # Map common ids to labels (kept generic; your config also has these)
+     id2label = {0: "LEGIT", 1: "PHISH"}
+     label = id2label.get(int(idx), str(int(idx)))
+     return {"label": label, "score": float(score)}
+
+
+ class PredictUrlPayload(BaseModel):
+     url: str
+
+
+ @app.post("/predict-url")
+ def predict_url(payload: PredictUrlPayload):
+     try:
+         _load_url_model()
+         bundle = _url_bundle
+         if not isinstance(bundle, dict) or 'model' not in bundle:
+             raise RuntimeError("Loaded URL artifact is not a bundle dict with 'model'.")
+         model = bundle['model']
+         # None (not []) so a bundle without feature_cols keeps all engineered columns
+         feature_cols = bundle.get('feature_cols') or None
+         url_col = bundle.get('url_col') or 'url'
+         model_type = bundle.get('model_type') or ''
+
+         row = pd.DataFrame({url_col: [payload.url]})
+         feats = _engineer_features(row, url_col, feature_cols)
+
+         score = None
+         label = None
+
+         if isinstance(model_type, str) and model_type == 'xgboost_bst':
+             if xgb is None:
+                 raise RuntimeError("xgboost is not installed but required for this model bundle.")
+             dmat = xgb.DMatrix(feats)
+             score = float(model.predict(dmat)[0])
+             label = "PHISH" if score >= 0.5 else "LEGIT"
+         elif hasattr(model, "predict_proba"):
+             proba = model.predict_proba(feats)[0]
+             if len(proba) == 2:
+                 score = float(proba[1])
+                 label = "PHISH" if score >= 0.5 else "LEGIT"
+             else:
+                 max_idx = int(np.argmax(proba))
+                 score = float(proba[max_idx])
+                 label = "PHISH" if max_idx == 1 else "LEGIT"
+         else:
+             pred = model.predict(feats)[0]
+             if isinstance(pred, (int, float, np.integer, np.floating)):
+                 label = "PHISH" if int(pred) == 1 else "LEGIT"
+                 score = 1.0 if label == "PHISH" else 0.0
+             else:
+                 up = str(pred).strip().upper()
+                 if up in ("PHISH", "PHISHING", "MALICIOUS"):
+                     label, score = "PHISH", 1.0
+                 else:
+                     label, score = "LEGIT", 0.0
+     except Exception as e:
+         return JSONResponse(status_code=500, content={"error": str(e)})
+
+     return {"label": label, "score": float(score)}
requirements.txt CHANGED
@@ -1,34 +1,15 @@
@@ -1,34 +1,15 @@
- # Core numerical and IO
- numpy<2.0
- pandas==2.2.2
- joblib>=1.3
-
- # CPU ML fallback
- scikit-learn>=1.3
-
- # GPU array library (imported as `cupy` in the notebook)
- # Pick the CUDA 12.x wheel that matches your GPU drivers
- cupy-cuda12x>=12.0.0
-
- # Optional: used only for the simple GPU availability check in the notebook
- # (You can skip installing torch if you remove the torch check cell.)
- torch>=2.1
-
- # IMPORTANT: RAPIDS (cuDF, cuML) install instructions (NOT via pip)
- # The notebook uses RAPIDS cuDF/cuML for GPU RandomForest. Install via Conda:
- #
- # conda create -n rapids-24.08 -c rapidsai -c conda-forge -c nvidia \
- #   rapids=24.08 python=3.10 cuda-version=12.2 -y
- # conda activate rapids-24.08
- #
- # This will install `cudf` and `cuml` compatible with CUDA 12.2. If you're on Windows,
- # use WSL2 (Ubuntu) for best support. Native Windows support for RAPIDS is limited.
-
- # Windows-friendly GPU fallback (no RAPIDS required)
- # XGBoost supports GPU acceleration on native Windows when built with CUDA.
- # pip install xgboost will fetch a prebuilt wheel with GPU support if available.
- xgboost>=2.0
-
- # (Optional) Another alternative with some GPU support (requires separate setup):
- # lightgbm
+ --extra-index-url https://download.pytorch.org/whl/cpu
+ fastapi==0.115.0
+ uvicorn==0.30.6
+ transformers==4.46.3
+ torch==2.3.1+cpu
+ accelerate>=0.33.0
+ safetensors>=0.4.3
+
+ # URL model dependencies
+ huggingface_hub>=0.23.0
+ scikit-learn>=1.3.0
+ joblib>=1.3.0
+ pandas>=2.0.0
+ xgboost>=2.0.0