nhellyercreek committed (verified)
Commit 18f3d70 · 1 Parent(s): ae17e7f

Upload URL Phishing Classifier Char model

Files changed (8):
  1. README.md +181 -0
  2. best_model.pt +3 -0
  3. config.json +15 -0
  4. model.pt +3 -0
  5. model.safetensors +3 -0
  6. model_config.json +10 -0
  7. tokenizer.json +209 -0
  8. training_info.json +34 -0
README.md ADDED
@@ -0,0 +1,181 @@
---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier (Char)

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model uses a custom character-level Transformer architecture (no pre-trained base) and was trained from scratch for URL phishing detection.
## Training Details

- **Base Model**: none (trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512

## Additional Training Parameters

- **Model Type**: character_level_transformer

## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1

## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can capture suspicious fragments and obfuscation that subword tokenization tends to smooth out.
- Very long or uncommon URL strings do not depend on pre-trained token vocabulary coverage.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0 = safe`, `1 = phishing`.
4. URLs are normalized by adding a scheme (`https://`) if one is missing.
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded/truncated to a fixed length.
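The encoding step (step 6) can be sketched as follows. The index layout matches the `char_to_idx` table shipped in this repo's `tokenizer.json` (`<PAD>`=0, `<UNK>`=1, printable ASCII `" "`..`"~"` at 2..96, then `\n`, `\t`, `\r` at 97..99); the function names here are illustrative, not the project's actual API.

```python
# Character encoding matching the vocabulary in tokenizer.json:
# <PAD>=0, <UNK>=1, printable ASCII " " (32) .. "~" (126) at 2..96,
# and "\n", "\t", "\r" at 97..99. Function names are illustrative.
PAD_IDX, UNK_IDX, MAX_LENGTH = 0, 1, 512

def build_char_to_idx():
    vocab = {"<PAD>": PAD_IDX, "<UNK>": UNK_IDX}
    for i, code in enumerate(range(32, 127)):   # " " .. "~"
        vocab[chr(code)] = 2 + i
    for j, ch in enumerate("\n\t\r"):
        vocab[ch] = 97 + j
    return vocab

def encode_url(url: str, max_length: int = MAX_LENGTH) -> list[int]:
    vocab = build_char_to_idx()
    # Unknown characters fall back to <UNK>; input is truncated, then
    # right-padded with <PAD> to the fixed sequence length.
    ids = [vocab.get(ch, UNK_IDX) for ch in url[:max_length]]
    ids += [PAD_IDX] * (max_length - len(ids))
    return ids
```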

### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
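The bullets above can be sketched as a small PyTorch module. This is a minimal reconstruction under the stated hyperparameters; the actual layer names in `model.pt` are unknown, so this sketch will not load the shipped `state_dict` directly.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Sketch of the architecture described above (names are assumptions,
    so this will not load the shipped checkpoint as-is)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # Learnable positional encoding up to max_length
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, input_ids):                      # (B, L) int64
        mask = input_ids.eq(self.pad_idx)              # True at padding
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=mask)
        # Masked global average pooling over valid (non-pad) positions
        valid = (~mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(1) / valid.sum(1).clamp(min=1.0)
        return self.head(pooled)                       # (B, num_labels) logits
```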

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)
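The warmup + cosine decay schedule can be expressed as a step-to-multiplier function (suitable for `torch.optim.lr_scheduler.LambdaLR`). The exact step counting and floor used by the training script are assumptions; this shows only the standard shape of the schedule.

```python
import math

def warmup_cosine(step, total_steps, warmup_ratio=0.1):
    """Warmup + cosine decay LR multiplier (sketch; exact details of the
    training script's schedule are assumptions)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)             # linear warmup 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay 1 -> 0
```

Pass this as the `lr_lambda` of a `LambdaLR` wrapped around the AdamW optimizer so the base learning rate of `0.0001` is scaled per step.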

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```

## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959
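The reported precision, recall, F1, and accuracy follow directly from the confusion-matrix counts; a quick cross-check using the test-set numbers above:

```python
# Test-set confusion counts reported in training_info.json
tp, tn, fp, fn = 70875, 127736, 10565, 8050

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# precision≈0.8703, recall≈0.8980, f1≈0.8839, accuracy≈0.9143 (matches the table)
print(precision, recall, f1, accuracy)
```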

## Usage

```python
import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```
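An end-to-end prediction could look like the sketch below. The model object and its forward signature are assumptions based on the architecture section of this README (the project's actual inference entry point is `predict_url_char.py`); only the `char_to_idx` table comes from the shipped `tokenizer.json`.

```python
import torch
import torch.nn.functional as F

# Hypothetical inference sketch: `model` is any module mapping (B, L) char IDs
# to (B, 2) logits, as described above. Not the project's shipped API.

def classify_url(url, model, char_to_idx, max_length=512,
                 pad_idx=0, unk_idx=1):
    ids = [char_to_idx.get(c, unk_idx) for c in url[:max_length]]
    ids += [pad_idx] * (max_length - len(ids))          # pad to fixed length
    with torch.no_grad():
        logits = model(torch.tensor([ids]))
        probs = F.softmax(logits, dim=-1)[0]            # (2,) class probs
    return {"safe": probs[0].item(), "phishing": probs[1].item()}

# Usage (assuming a loaded model object):
# with open("tokenizer.json", encoding="utf-8") as f:
#     char_to_idx = json.load(f)["char_to_idx"]
# print(classify_url("http://examp1e-login.example.com", model, char_to_idx))
```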

## Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={URL Phishing Classifier Char},
  author={nhellyercreek},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```
best_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a0651fa809250dc63c3ebb5cc903f06f3190dac27abb52bd3b6e1d75e3f8e65
size 2587362
config.json ADDED
@@ -0,0 +1,15 @@
{
  "model_type": "char_transformer_url_classifier",
  "architectures": [
    "CharLevelURLClassifier"
  ],
  "num_labels": 2,
  "vocab_size": 100,
  "hidden_size": 128,
  "num_attention_heads": 8,
  "num_hidden_layers": 4,
  "intermediate_size": 256,
  "max_position_embeddings": 512,
  "hidden_dropout_prob": 0.1,
  "torch_dtype": "float32"
}
model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4f323242da77ee68a57014af45fe7915cbc5331b3e234b56b6c71e88c1ec7d73
size 2583680
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9c0d91447a3d5d9afc02772ffcea297ee9299c70f5fcb044d4b4b828f17034b
size 2572656
model_config.json ADDED
@@ -0,0 +1,10 @@
{
  "vocab_size": 100,
  "embed_dim": 128,
  "num_heads": 8,
  "num_layers": 4,
  "hidden_dim": 256,
  "max_length": 512,
  "num_labels": 2,
  "dropout": 0.1
}
tokenizer.json ADDED
@@ -0,0 +1,209 @@
{
  "char_to_idx": {
    "<PAD>": 0,
    "<UNK>": 1,
    " ": 2,
    "!": 3,
    "\"": 4,
    "#": 5,
    "$": 6,
    "%": 7,
    "&": 8,
    "'": 9,
    "(": 10,
    ")": 11,
    "*": 12,
    "+": 13,
    ",": 14,
    "-": 15,
    ".": 16,
    "/": 17,
    "0": 18,
    "1": 19,
    "2": 20,
    "3": 21,
    "4": 22,
    "5": 23,
    "6": 24,
    "7": 25,
    "8": 26,
    "9": 27,
    ":": 28,
    ";": 29,
    "<": 30,
    "=": 31,
    ">": 32,
    "?": 33,
    "@": 34,
    "A": 35,
    "B": 36,
    "C": 37,
    "D": 38,
    "E": 39,
    "F": 40,
    "G": 41,
    "H": 42,
    "I": 43,
    "J": 44,
    "K": 45,
    "L": 46,
    "M": 47,
    "N": 48,
    "O": 49,
    "P": 50,
    "Q": 51,
    "R": 52,
    "S": 53,
    "T": 54,
    "U": 55,
    "V": 56,
    "W": 57,
    "X": 58,
    "Y": 59,
    "Z": 60,
    "[": 61,
    "\\": 62,
    "]": 63,
    "^": 64,
    "_": 65,
    "`": 66,
    "a": 67,
    "b": 68,
    "c": 69,
    "d": 70,
    "e": 71,
    "f": 72,
    "g": 73,
    "h": 74,
    "i": 75,
    "j": 76,
    "k": 77,
    "l": 78,
    "m": 79,
    "n": 80,
    "o": 81,
    "p": 82,
    "q": 83,
    "r": 84,
    "s": 85,
    "t": 86,
    "u": 87,
    "v": 88,
    "w": 89,
    "x": 90,
    "y": 91,
    "z": 92,
    "{": 93,
    "|": 94,
    "}": 95,
    "~": 96,
    "\n": 97,
    "\t": 98,
    "\r": 99
  },
  "idx_to_char": {
    "0": "<PAD>",
    "1": "<UNK>",
    "2": " ",
    "3": "!",
    "4": "\"",
    "5": "#",
    "6": "$",
    "7": "%",
    "8": "&",
    "9": "'",
    "10": "(",
    "11": ")",
    "12": "*",
    "13": "+",
    "14": ",",
    "15": "-",
    "16": ".",
    "17": "/",
    "18": "0",
    "19": "1",
    "20": "2",
    "21": "3",
    "22": "4",
    "23": "5",
    "24": "6",
    "25": "7",
    "26": "8",
    "27": "9",
    "28": ":",
    "29": ";",
    "30": "<",
    "31": "=",
    "32": ">",
    "33": "?",
    "34": "@",
    "35": "A",
    "36": "B",
    "37": "C",
    "38": "D",
    "39": "E",
    "40": "F",
    "41": "G",
    "42": "H",
    "43": "I",
    "44": "J",
    "45": "K",
    "46": "L",
    "47": "M",
    "48": "N",
    "49": "O",
    "50": "P",
    "51": "Q",
    "52": "R",
    "53": "S",
    "54": "T",
    "55": "U",
    "56": "V",
    "57": "W",
    "58": "X",
    "59": "Y",
    "60": "Z",
    "61": "[",
    "62": "\\",
    "63": "]",
    "64": "^",
    "65": "_",
    "66": "`",
    "67": "a",
    "68": "b",
    "69": "c",
    "70": "d",
    "71": "e",
    "72": "f",
    "73": "g",
    "74": "h",
    "75": "i",
    "76": "j",
    "77": "k",
    "78": "l",
    "79": "m",
    "80": "n",
    "81": "o",
    "82": "p",
    "83": "q",
    "84": "r",
    "85": "s",
    "86": "t",
    "87": "u",
    "88": "v",
    "89": "w",
    "90": "x",
    "91": "y",
    "92": "z",
    "93": "{",
    "94": "|",
    "95": "}",
    "96": "~",
    "97": "\n",
    "98": "\t",
    "99": "\r"
  },
  "vocab_size": 100,
  "pad_idx": 0,
  "unk_idx": 1
}
training_info.json ADDED
@@ -0,0 +1,34 @@
{
  "model_type": "character_level_transformer",
  "training_samples": 1629193,
  "validation_samples": 325839,
  "test_samples": 217226,
  "epochs": 5,
  "batch_size": 32,
  "learning_rate": 0.0001,
  "max_length": 512,
  "validation_metrics": {
    "loss": 0.20635717244524704,
    "accuracy": 0.9147401017066711,
    "f1": 0.8845532104106151,
    "precision": 0.8705777457853106,
    "recall": 0.8989846943947022,
    "roc_auc": 0.9754642003985297,
    "true_positives": 106429.0,
    "true_negatives": 191629.0,
    "false_positives": 15822.0,
    "false_negatives": 11959.0
  },
  "test_metrics": {
    "loss": 0.20780737601077962,
    "accuracy": 0.9143058381593364,
    "f1": 0.883921055093069,
    "precision": 0.8702725933202358,
    "recall": 0.8980044345898004,
    "roc_auc": 0.9751202532525032,
    "true_positives": 70875.0,
    "true_negatives": 127736.0,
    "false_positives": 10565.0,
    "false_negatives": 8050.0
  }
}