---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---
# URL Phishing Classifier (Char)
This is a custom character-level Transformer model for URL phishing classification.
## Model Description
This model is a custom character-level Transformer trained from scratch (there is no pre-trained base model) for binary phishing detection on URLs.
## Training Details
- **Base Model**: None (trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512
## Additional Training Parameters
- **Model Type**: character_level_transformer
## Model Architecture Parameters
- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1
## Character-Level Approach (In Depth)
This repository uses a **character-based URL model**, not a token/subword transformer.
### Why Character-Level for URLs
- URLs contain signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can model suspicious fragments and obfuscation that tokenization can smooth out.
- Very long or uncommon URL strings do not rely on pre-trained token vocab coverage.
### Data Processing Pipeline
1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0=safe`, `1=phishing`.
4. URLs are normalized by adding a scheme if missing (`https://`).
5. If sender metadata exists, sender domain may be prepended to URL text.
6. Final input is encoded character-by-character and padded/truncated to fixed length.
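Steps 4-6 of the pipeline above can be sketched as follows. The vocabulary layout, special-token ids, and helper names here are illustrative assumptions; the actual character-to-id mapping lives in `tokenizer.json`.

```python
# Minimal sketch of URL normalization and character-level encoding.
# PAD_ID/UNK_ID and the ASCII-based vocab are assumptions for illustration.
PAD_ID, UNK_ID = 0, 1
# Hypothetical vocab: printable ASCII (32..126) mapped to ids 2..96,
# keeping the total under the model's vocab_size of 100.
VOCAB = {ch: i + 2 for i, ch in enumerate(map(chr, range(32, 127)))}

def normalize_url(url: str) -> str:
    """Step 4: prepend a scheme when the URL has none."""
    return url if "://" in url else "https://" + url

def encode_url(url: str, max_length: int = 512) -> list[int]:
    """Step 6: character-by-character encoding, padded/truncated to max_length."""
    ids = [VOCAB.get(ch, UNK_ID) for ch in normalize_url(url)][:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))

ids = encode_url("example.com/login?user=admin")
print(len(ids), ids[:8])
```

Unknown characters map to a single `UNK` id, so unusual Unicode in a URL degrades gracefully rather than breaking the encoder.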
### Model Architecture
- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
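The bullet points above translate into a compact PyTorch module. This is a sketch under the stated hyperparameters; the class and attribute names are assumptions and will not match the exact module names stored in `model.pt`.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Sketch of the architecture described above (names are illustrative)."""
    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Learnable positional encoding up to max_length.
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, hidden_dim,
                                           dropout, activation="gelu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # MLP head with GELU + dropout -> 2 logits.
        self.head = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                                  nn.Dropout(dropout),
                                  nn.Linear(embed_dim, num_labels))

    def forward(self, ids):                      # ids: (batch, seq)
        pad_mask = ids.eq(0)                     # True at padding positions
        x = self.embed(ids) + self.pos[:, :ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Masked global average pooling over valid (non-pad) characters.
        valid = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(1) / valid.sum(1).clamp(min=1)
        return self.head(pooled)                 # (batch, num_labels) logits

model = CharURLClassifier()
logits = model(torch.randint(1, 100, (2, 512)))
print(logits.shape)  # torch.Size([2, 2])
```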
### Training Configuration
- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)
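A minimal sketch of this training configuration, assuming a standard AdamW + `LambdaLR` setup. The `total_steps`, class counts, and stand-in model are placeholders, not values from the actual run:

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(128, 2)                # stand-in for the real classifier
total_steps, warmup_ratio = 1000, 0.1    # placeholder step budget
warmup_steps = int(total_steps * warmup_ratio)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

def lr_lambda(step):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Weighted cross-entropy: weight each class inversely to its frequency.
# Illustrative split of the 1,629,193 training samples, not the real counts.
n_safe, n_phish = 1_000_000, 629_193
weights = torch.tensor([1.0 / n_safe, 1.0 / n_phish])
weights = weights / weights.sum() * 2    # normalize to mean weight 1
criterion = nn.CrossEntropyLoss(weight=weights)
```

With the class weights normalized this way, the minority (phishing) class contributes proportionally more to the loss, which matches the weighted cross-entropy bullet above.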
### Saved Artifacts
- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters
### Reproduce Training
```bash
python train_url_classifier_char.py \
    --output_dir ./Models/url_classifier_char \
    --epochs 5 \
    --batch_size 32 \
    --lr 0.0001 \
    --max_length 512 \
    --embed_dim 128 \
    --num_heads 8 \
    --num_layers 4 \
    --hidden_dim 256 \
    --dropout 0.1
```
## Evaluation Results
### Test Set Metrics
- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **Roc Auc**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050
### Validation Set Metrics
- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **Roc Auc**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959
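The accuracy, precision, recall, and F1 values above follow directly from the confusion-matrix counts; a quick consistency check on the test set:

```python
# Test-set confusion-matrix counts from the table above.
tp, tn, fp, fn = 70875, 127736, 10565, 8050

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")
# acc=0.9143 p=0.8703 r=0.8980 f1=0.8839
```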
## Usage
```python
import json
import torch
# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).
with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```
## Limitations
This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.
## Citation
If you use this model, please cite:
```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
title={Url Phishing Classifier Char},
author={Noah Hellyer},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```