---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier (Char)

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model is a custom character-level Transformer trained for URL phishing detection; it is not derived from a pre-trained base model.

## Training Details

- **Base Model**: Unknown
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512

## Additional Training Parameters

- **Model Type**: `character_level_transformer`

## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1
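
For a rough sense of scale, these hyperparameters pin down the parameter count of the embedding, positional encoding, and encoder stack. The classifier head's hidden width is not listed on this card, so the sketch below excludes it and gives a lower bound:

```python
# Parameter count implied by the listed hyperparameters
# (classifier head excluded: its hidden width is not given on this card).
vocab_size, embed_dim, num_layers = 100, 128, 4
hidden_dim, max_length = 256, 512

embedding = vocab_size * embed_dim    # character embedding table
positional = max_length * embed_dim   # learnable positional encoding

# One encoder layer: QKV + output projections, two feedforward
# linears (embed -> hidden -> embed), and two LayerNorms.
attention = 4 * (embed_dim * embed_dim + embed_dim)
feedforward = (embed_dim * hidden_dim + hidden_dim) + (hidden_dim * embed_dim + embed_dim)
layernorms = 2 * (2 * embed_dim)
per_layer = attention + feedforward + layernorms

total = embedding + positional + num_layers * per_layer
print(total)  # 608256 -- roughly 0.6M parameters before the head
```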

## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword Transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can capture suspicious fragments and obfuscation that subword tokenization tends to smooth out.
- Very long or uncommon URL strings do not depend on the vocabulary coverage of a pre-trained tokenizer.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0 = safe`, `1 = phishing`.
4. URLs are normalized by adding a scheme (`https://`) if one is missing.
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded or truncated to a fixed length.
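
Steps 4-6 above can be sketched as follows. The vocabulary here is hypothetical (`PAD=0`, `UNK=1`, then printable ASCII); the real mapping is stored in `tokenizer.json` and may differ:

```python
import string

# Hypothetical character vocabulary: PAD=0, UNK=1, then printable ASCII.
# The shipped tokenizer.json defines the actual mapping.
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(string.printable)}

def encode_url(url: str, max_length: int = 512):
    """Normalize, encode character by character, then pad/truncate."""
    if "://" not in url:                 # step 4: add a scheme if missing
        url = "https://" + url
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url[:max_length]]
    mask = [1] * len(ids)                # 1 = real character, 0 = padding
    pad = max_length - len(ids)
    return ids + [PAD] * pad, mask + [0] * pad

ids, mask = encode_url("paypa1-login.example.com/verify?acct=1")
print(len(ids), sum(mask))  # 512 46
```

The attention mask lets the model distinguish real characters from padding, which matters for the masked average pooling described below.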

### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
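
For illustration, those components map onto a PyTorch module roughly as below. This is a reconstruction from the listed hyperparameters, not the repository's actual code; in particular, reusing `hidden_dim` as the classifier head's width is an assumption:

```python
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    """Minimal sketch of the architecture described above (not the shipped module)."""
    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8, num_layers=4,
                 hidden_dim=256, max_length=512, num_labels=2, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))  # learnable positions
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(  # head width is an assumption
            nn.Linear(embed_dim, hidden_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(hidden_dim, num_labels))

    def forward(self, input_ids, attention_mask):
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=~attention_mask.bool())
        m = attention_mask.unsqueeze(-1).float()
        pooled = (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)  # masked average pooling
        return self.head(pooled)

logits = CharTransformer().eval()(
    torch.randint(0, 100, (2, 64)), torch.ones(2, 64, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 2])
```

The masked average pooling divides by the count of valid characters rather than the padded length, so short and long URLs are pooled consistently.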

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC AUC)
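
The warmup-plus-cosine schedule amounts to a step-dependent multiplier on the base learning rate. A generic sketch (not necessarily the exact scheduler used for this run):

```python
import math

def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to 1.0, then cosine decay toward 0 (generic sketch)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 1e-4
print(base_lr * lr_multiplier(0, 1000))     # 0.0  (start of warmup)
print(base_lr * lr_multiplier(100, 1000))   # 0.0001  (warmup complete)
print(base_lr * lr_multiplier(1000, 1000))  # ~0.0  (end of training)
```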

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```

## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050
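
As a sanity check, accuracy, precision, recall, and F1 can be recomputed directly from the confusion-matrix counts above (which also sum to the 217,226 test samples):

```python
# Recompute the headline test metrics from the confusion matrix.
tp, tn, fp, fn = 70875, 127736, 10565, 8050

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9143 0.8703 0.898 0.8839
```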

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959

## Usage

```python
import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project's inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```
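
Once the loaded model emits its two logits for an encoded URL, turning them into a verdict is a softmax plus argmax. A minimal sketch, using the label convention defined above (`0 = safe`, `1 = phishing`):

```python
import torch

LABELS = {0: "safe", 1: "phishing"}

def verdict(logits: torch.Tensor):
    """Map a (2,)-shaped logit tensor to a label and its probability."""
    probs = torch.softmax(logits, dim=-1)
    idx = int(torch.argmax(probs))
    return LABELS[idx], float(probs[idx])

label, prob = verdict(torch.tensor([-1.2, 2.3]))
print(label)  # phishing
```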

## Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={Url Phishing Classifier Char},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```