---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier (Char)

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model uses a custom character-level Transformer architecture trained from scratch for phishing detection; it is not fine-tuned from a pre-trained base model.

## Training Details

- **Base Model**: None (custom architecture, trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512

## Additional Training Parameters

- **Model Type**: character_level_transformer

## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1

## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can capture suspicious fragments and obfuscation that subword tokenization tends to smooth out.
- Very long or uncommon URL strings do not depend on pre-trained token vocabulary coverage.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0=safe`, `1=phishing`.
4. URLs are normalized by adding a scheme if missing (`https://`).
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded/truncated to a fixed length.
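Steps 4-6 of the pipeline above can be sketched as follows. The real character-to-id mapping lives in `tokenizer.json`; the vocabulary, special-token ids, and lowercasing below are illustrative assumptions, not the repository's exact preprocessing.

```python
# Illustrative sketch of URL normalization and character encoding.
# The actual vocabulary and special ids come from tokenizer.json;
# everything here is an assumption for demonstration purposes.

PAD_ID = 0
UNK_ID = 1
# Assumed character set, offset past the two special ids.
VOCAB = {ch: i + 2 for i, ch in enumerate(
    "abcdefghijklmnopqrstuvwxyz0123456789.-_/:?=&%@#~+")}

def encode_url(url: str, max_length: int = 512) -> list[int]:
    # Step 4: normalize by adding a scheme if missing.
    if "://" not in url:
        url = "https://" + url
    # Step 6: character-by-character encoding, then truncate and pad.
    ids = [VOCAB.get(ch, UNK_ID) for ch in url.lower()[:max_length]]
    return ids + [PAD_ID] * (max_length - len(ids))

ids = encode_url("paypal-login.example.com/verify?id=1", max_length=64)
```

The fixed-length output lets URLs of any length be batched directly into the Transformer without a pre-trained tokenizer.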
### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```

## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959

## Usage

```python
import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```

## Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always combine it with additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={Url Phishing Classifier Char},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```
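## Architecture Sketch

For reference, the architecture summarized in the Model Architecture section can be sketched in PyTorch as follows. This is an illustrative reconstruction from the hyperparameters listed above, not the repository's training code; module and attribute names are assumptions and will likely differ from the keys in `model.pt`, so check them against `model_config.json` before loading real weights.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Sketch of the card's character-level Transformer (names assumed)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        # Learnable positional encoding up to max_length.
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # MLP head with GELU + dropout -> 2 logits.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, input_ids):
        pad_mask = input_ids.eq(self.pad_id)          # (B, L), True = padding
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Masked global average pooling over valid (non-pad) characters.
        valid = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)
```

A forward pass on a batch of encoded URLs of shape `(batch, seq_len)` returns logits of shape `(batch, 2)`, matching `num_labels=2` (`0=safe`, `1=phishing`).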