---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---
# URL Phishing Classifier Char
This is a custom character-level Transformer model for URL phishing classification.
## Model Description
This is a custom character-level Transformer trained from scratch for phishing URL detection; it is not fine-tuned from a pre-trained base model.
## Training Details
- **Base Model**: None (trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512
## Additional Training Parameters
- **Model Type**: character_level_transformer
## Model Architecture Parameters
- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1
## Character-Level Approach (In Depth)
This repository uses a **character-based URL model**, not a token/subword transformer.
### Why Character-Level for URLs
- URLs contain signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can model suspicious fragments and obfuscation that tokenization can smooth out.
- Very long or uncommon URL strings do not depend on the vocabulary coverage of a pre-trained tokenizer.
### Data Processing Pipeline
1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0=safe`, `1=phishing`.
4. URLs are normalized by adding a scheme if missing (`https://`).
5. If sender metadata exists, sender domain may be prepended to URL text.
6. Final input is encoded character-by-character and padded/truncated to fixed length.
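Steps 4 and 6 above can be sketched as follows. This is illustrative only: the function names, special-token ids, and the toy vocabulary are assumptions, not the repository's actual preprocessing code.

```python
# Sketch of URL normalization and fixed-length character encoding.
# Assumed special-token ids: 0 = padding, 1 = unknown character.

def normalize_url(url: str) -> str:
    """Step 4: add a scheme if missing."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url

def encode_url(url: str, char_to_id: dict, max_length: int = 512) -> list:
    """Step 6: encode character-by-character, then pad/truncate to max_length."""
    pad_id, unk_id = 0, 1
    ids = [char_to_id.get(ch, unk_id) for ch in url][:max_length]
    ids += [pad_id] * (max_length - len(ids))
    return ids

# Toy vocabulary for demonstration only; the real one lives in tokenizer.json.
vocab = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz.:/-")}
ids = encode_url(normalize_url("example.com/login"), vocab, max_length=32)
print(len(ids))  # 32 -- always the fixed length
```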
### Model Architecture
- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
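A minimal PyTorch sketch of the architecture described above, using the listed hyperparameters. The class name and exact layer layout are assumptions; the repository's actual implementation may differ in detail.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Character-level Transformer encoder with masked mean pooling (sketch)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Learnable positional encoding up to max_length.
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # MLP head with GELU + dropout -> 2 logits.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=(attention_mask == 0))
        # Masked global average pooling over valid (non-padding) characters.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)

model = CharURLClassifier()
ids = torch.randint(1, 100, (2, 64))
logits = model(ids, torch.ones(2, 64, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 2])
```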
### Training Configuration
- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)
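The warmup + cosine schedule above can be sketched as a simple function of the step count. This illustrates the schedule shape only; the training script's exact implementation may differ.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(lr_at_step(0, total))     # 0.0 -- start of warmup
print(lr_at_step(100, total))   # 0.0001 -- peak at end of warmup
print(lr_at_step(1000, total))  # 0.0 -- fully decayed
```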
### Saved Artifacts
- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters
### Reproduce Training
```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```
## Evaluation Results
### Test Set Metrics
- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050
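These counts are internally consistent with the reported test metrics, as a quick arithmetic check confirms:

```python
# Recompute the test-set metrics from the reported confusion-matrix counts.
tp, tn, fp, fn = 70875, 127736, 10565, 8050

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(total)               # 217226 -- matches the test sample count
print(round(accuracy, 4))  # 0.9143
print(round(precision, 4)) # 0.8703
print(round(recall, 4))    # 0.898
print(round(f1, 4))        # 0.8839
```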
### Validation Set Metrics
- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959
## Usage
```python
import json
import torch
# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).
with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)
state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```
## Limitations
This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.
## Citation
If you use this model, please cite:
```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={URL Phishing Classifier Char},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```