nhellyercreek committed (verified)
Commit 18f3d70 · 1 Parent(s): ae17e7f

Upload URL Phishing Classifier Char model

Files changed (8):
  1. README.md +181 -0
  2. best_model.pt +3 -0
  3. config.json +15 -0
  4. model.pt +3 -0
  5. model.safetensors +3 -0
  6. model_config.json +10 -0
  7. tokenizer.json +209 -0
  8. training_info.json +34 -0
README.md ADDED
@@ -0,0 +1,181 @@
---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier (Char)

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model uses a custom character-level Transformer architecture (no pre-trained base) and was trained from scratch for URL phishing detection.
## Training Details

- **Base Model**: none (trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512

## Additional Training Parameters

- **Model Type**: character_level_transformer

## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1

## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can capture suspicious fragments and obfuscation that subword tokenization tends to smooth out.
- Very long or uncommon URL strings do not depend on pre-trained token vocabulary coverage.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0 = safe`, `1 = phishing`.
4. URLs are normalized by adding a scheme (`https://`) if one is missing.
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded/truncated to a fixed length.
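The encoding step (step 6) can be sketched as follows. The index layout matches the `char_to_idx` table shipped in this repo's `tokenizer.json` (`<PAD>`=0, `<UNK>`=1, printable ASCII `" "`..`"~"` at 2..96, then `\n`, `\t`, `\r` at 97..99); the function names here are illustrative, not the project's actual API.

```python
# Character encoding matching the vocabulary in tokenizer.json:
# <PAD>=0, <UNK>=1, printable ASCII " " (32) .. "~" (126) at 2..96,
# and "\n", "\t", "\r" at 97..99. Function names are illustrative.
PAD_IDX, UNK_IDX, MAX_LENGTH = 0, 1, 512

def build_char_to_idx():
    vocab = {"<PAD>": PAD_IDX, "<UNK>": UNK_IDX}
    for i, code in enumerate(range(32, 127)):   # " " .. "~"
        vocab[chr(code)] = 2 + i
    for j, ch in enumerate("\n\t\r"):
        vocab[ch] = 97 + j
    return vocab

def encode_url(url: str, max_length: int = MAX_LENGTH) -> list[int]:
    vocab = build_char_to_idx()
    # Unknown characters fall back to <UNK>; input is truncated, then
    # right-padded with <PAD> to the fixed sequence length.
    ids = [vocab.get(ch, UNK_IDX) for ch in url[:max_length]]
    ids += [PAD_IDX] * (max_length - len(ids))
    return ids
```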

### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
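The bullets above can be sketched as a small PyTorch module. This is a minimal reconstruction under the stated hyperparameters; the actual layer names in `model.pt` are unknown, so this sketch will not load the shipped `state_dict` directly.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Sketch of the architecture described above (names are assumptions,
    so this will not load the shipped checkpoint as-is)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # Learnable positional encoding up to max_length
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, input_ids):                      # (B, L) int64
        mask = input_ids.eq(self.pad_idx)              # True at padding
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=mask)
        # Masked global average pooling over valid (non-pad) positions
        valid = (~mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(1) / valid.sum(1).clamp(min=1.0)
        return self.head(pooled)                       # (B, num_labels) logits
```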

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)
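The warmup + cosine decay schedule can be expressed as a step-to-multiplier function (suitable for `torch.optim.lr_scheduler.LambdaLR`). The exact step counting and floor used by the training script are assumptions; this shows only the standard shape of the schedule.

```python
import math

def warmup_cosine(step, total_steps, warmup_ratio=0.1):
    """Warmup + cosine decay LR multiplier (sketch; exact details of the
    training script's schedule are assumptions)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)             # linear warmup 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay 1 -> 0
```

Pass this as the `lr_lambda` of a `LambdaLR` wrapped around the AdamW optimizer so the base learning rate of `0.0001` is scaled per step.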

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```

## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959
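The reported precision, recall, F1, and accuracy follow directly from the confusion-matrix counts; a quick cross-check using the test-set numbers above:

```python
# Test-set confusion counts reported in training_info.json
tp, tn, fp, fn = 70875, 127736, 10565, 8050

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# precision≈0.8703, recall≈0.8980, f1≈0.8839, accuracy≈0.9143 (matches the table)
print(precision, recall, f1, accuracy)
```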

## Usage

```python
import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```
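An end-to-end prediction could look like the sketch below. The model object and its forward signature are assumptions based on the architecture section of this README (the project's actual inference entry point is `predict_url_char.py`); only the `char_to_idx` table comes from the shipped `tokenizer.json`.

```python
import torch
import torch.nn.functional as F

# Hypothetical inference sketch: `model` is any module mapping (B, L) char IDs
# to (B, 2) logits, as described above. Not the project's shipped API.

def classify_url(url, model, char_to_idx, max_length=512,
                 pad_idx=0, unk_idx=1):
    ids = [char_to_idx.get(c, unk_idx) for c in url[:max_length]]
    ids += [pad_idx] * (max_length - len(ids))          # pad to fixed length
    with torch.no_grad():
        logits = model(torch.tensor([ids]))
        probs = F.softmax(logits, dim=-1)[0]            # (2,) class probs
    return {"safe": probs[0].item(), "phishing": probs[1].item()}

# Usage (assuming a loaded model object):
# with open("tokenizer.json", encoding="utf-8") as f:
#     char_to_idx = json.load(f)["char_to_idx"]
# print(classify_url("http://examp1e-login.example.com", model, char_to_idx))
```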

## Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={URL Phishing Classifier Char},
  author={nhellyercreek},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```
best_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a0651fa809250dc63c3ebb5cc903f06f3190dac27abb52bd3b6e1d75e3f8e65
size 2587362
config.json ADDED
@@ -0,0 +1,15 @@
{
  "model_type": "char_transformer_url_classifier",
  "architectures": [
    "CharLevelURLClassifier"
  ],
  "num_labels": 2,
  "vocab_size": 100,
  "hidden_size": 128,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "intermediate_size": 256,
  "max_position_embeddings": 512,
  "hidden_dropout_prob": 0.1,
  "torch_dtype": "float32"
}
model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4f323242da77ee68a57014af45fe7915cbc5331b3e234b56b6c71e88c1ec7d73
size 2583680
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9c0d91447a3d5d9afc02772ffcea297ee9299c70f5fcb044d4b4b828f17034b
size 2572656
model_config.json ADDED
@@ -0,0 +1,10 @@
{
  "vocab_size": 100,
  "embed_dim": 128,
  "num_heads": 8,
  "num_layers": 4,
  "hidden_dim": 256,
  "max_length": 512,
  "num_labels": 2,
  "dropout": 0.1
}
tokenizer.json ADDED
@@ -0,0 +1,209 @@
{
  "char_to_idx": {
    "<PAD>": 0,
    "<UNK>": 1,
    " ": 2,
    "!": 3,
    "\"": 4,
    "#": 5,
    "$": 6,
    "%": 7,
    "&": 8,
    "'": 9,
    "(": 10,
    ")": 11,
    "*": 12,
    "+": 13,
    ",": 14,
    "-": 15,
    ".": 16,
    "/": 17,
    "0": 18,
    "1": 19,
    "2": 20,
    "3": 21,
    "4": 22,
    "5": 23,
    "6": 24,
    "7": 25,
    "8": 26,
    "9": 27,
    ":": 28,
    ";": 29,
    "<": 30,
    "=": 31,
    ">": 32,
    "?": 33,
    "@": 34,
    "A": 35,
    "B": 36,
    "C": 37,
    "D": 38,
    "E": 39,
    "F": 40,
    "G": 41,
    "H": 42,
    "I": 43,
    "J": 44,
    "K": 45,
    "L": 46,
    "M": 47,
    "N": 48,
    "O": 49,
    "P": 50,
    "Q": 51,
    "R": 52,
    "S": 53,
    "T": 54,
    "U": 55,
    "V": 56,
    "W": 57,
    "X": 58,
    "Y": 59,
    "Z": 60,
    "[": 61,
    "\\": 62,
    "]": 63,
    "^": 64,
    "_": 65,
    "`": 66,
    "a": 67,
    "b": 68,
    "c": 69,
    "d": 70,
    "e": 71,
    "f": 72,
    "g": 73,
    "h": 74,
    "i": 75,
    "j": 76,
    "k": 77,
    "l": 78,
    "m": 79,
    "n": 80,
    "o": 81,
    "p": 82,
    "q": 83,
    "r": 84,
    "s": 85,
    "t": 86,
    "u": 87,
    "v": 88,
    "w": 89,
    "x": 90,
    "y": 91,
    "z": 92,
    "{": 93,
    "|": 94,
    "}": 95,
    "~": 96,
    "\n": 97,
    "\t": 98,
    "\r": 99
  },
  "idx_to_char": {
    "0": "<PAD>",
    "1": "<UNK>",
    "2": " ",
    "3": "!",
    "4": "\"",
    "5": "#",
    "6": "$",
    "7": "%",
    "8": "&",
    "9": "'",
    "10": "(",
    "11": ")",
    "12": "*",
    "13": "+",
    "14": ",",
    "15": "-",
    "16": ".",
    "17": "/",
    "18": "0",
    "19": "1",
    "20": "2",
    "21": "3",
    "22": "4",
    "23": "5",
    "24": "6",
    "25": "7",
    "26": "8",
    "27": "9",
    "28": ":",
    "29": ";",
    "30": "<",
    "31": "=",
    "32": ">",
    "33": "?",
    "34": "@",
    "35": "A",
    "36": "B",
    "37": "C",
    "38": "D",
    "39": "E",
    "40": "F",
    "41": "G",
    "42": "H",
    "43": "I",
    "44": "J",
    "45": "K",
    "46": "L",
    "47": "M",
    "48": "N",
    "49": "O",
    "50": "P",
    "51": "Q",
    "52": "R",
    "53": "S",
    "54": "T",
    "55": "U",
    "56": "V",
    "57": "W",
    "58": "X",
    "59": "Y",
    "60": "Z",
    "61": "[",
    "62": "\\",
    "63": "]",
    "64": "^",
    "65": "_",
    "66": "`",
    "67": "a",
    "68": "b",
    "69": "c",
    "70": "d",
    "71": "e",
    "72": "f",
    "73": "g",
    "74": "h",
    "75": "i",
    "76": "j",
    "77": "k",
    "78": "l",
    "79": "m",
    "80": "n",
    "81": "o",
    "82": "p",
    "83": "q",
    "84": "r",
    "85": "s",
    "86": "t",
    "87": "u",
    "88": "v",
    "89": "w",
    "90": "x",
    "91": "y",
    "92": "z",
    "93": "{",
    "94": "|",
    "95": "}",
    "96": "~",
    "97": "\n",
    "98": "\t",
    "99": "\r"
  },
  "vocab_size": 100,
  "pad_idx": 0,
  "unk_idx": 1
}
training_info.json ADDED
@@ -0,0 +1,34 @@
{
  "model_type": "character_level_transformer",
  "training_samples": 1629193,
  "validation_samples": 325839,
  "test_samples": 217226,
  "epochs": 5,
  "batch_size": 32,
  "learning_rate": 0.0001,
  "max_length": 512,
  "validation_metrics": {
    "loss": 0.20635717244524704,
    "accuracy": 0.9147401017066711,
    "f1": 0.8845532104106151,
    "precision": 0.8705777457853106,
    "recall": 0.8989846943947022,
    "roc_auc": 0.9754642003985297,
    "true_positives": 106429.0,
    "true_negatives": 191629.0,
    "false_positives": 15822.0,
    "false_negatives": 11959.0
  },
  "test_metrics": {
    "loss": 0.20780737601077962,
    "accuracy": 0.9143058381593364,
    "f1": 0.883921055093069,
    "precision": 0.8702725933202358,
    "recall": 0.8980044345898004,
    "roc_auc": 0.9751202532525032,
    "true_positives": 70875.0,
    "true_negatives": 127736.0,
    "false_positives": 10565.0,
    "false_negatives": 8050.0
  }
}