Url Phishing Classifier Char V3

This is a custom character-level Transformer model for URL phishing classification.

Model Description

This model is based on char-transformer-url-v3 and has been fine-tuned for phishing detection tasks.

Training Details

  • Base Model: char-transformer-url-v3
  • Training Samples: 90000
  • Validation Samples: 18000
  • Test Samples: 12000
  • Epochs: 2
  • Batch Size: 48
  • Learning Rate: 8e-05
  • Max Length: 512

Additional Training Parameters

  • Model Type: character_level_transformer_v3
  • Model Version: v3
  • Base Model: custom-char-transformer
  • Include HF Dataset: True
  • HF Dataset Name: kmack/Phishing_urls
  • HF Max Samples: 120000
  • Max Total Samples: 120000
  • Use Weighted Sampler: True
  • Label Smoothing: 0.05
  • Auto Optimize: True
  • FP16: True
  • GPU Name: NVIDIA GeForce RTX 3080
  • Is RTX 3080 Profile: True

Model Architecture Parameters

  • Vocab Size: 100
  • Embed Dim: 192
  • Num Heads: 8
  • Num Layers: 6
  • Hidden Dim: 384
  • Max Length: 512
  • Num Labels: 2
  • Dropout: 0.1

Character-Level Approach (In Depth)

This repository uses a character-based URL model, not a token/subword transformer.

Why Character-Level for URLs

  • URLs contain signal in punctuation and local patterns (., /, ?, =, %, @, homoglyph-like variants).
  • Character-level encoding can model suspicious fragments and obfuscation that tokenization can smooth out.
  • Very long or uncommon URL strings do not rely on pre-trained token vocab coverage.

Data Processing Pipeline

  1. CSV files are auto-discovered from Training Material/URLs.
  2. URL and label columns are inferred from common names (url, website_url, link, label, status, etc.).
  3. Labels are mapped to binary classes: 0=safe, 1=phishing.
  4. URLs are normalized by adding a scheme if missing (https://).
  5. If sender metadata exists, sender domain may be prepended to URL text.
  6. Final input is encoded character-by-character and padded/truncated to fixed length.
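Steps 4–6 above can be sketched as follows. This is a minimal illustration, not the training script's actual preprocessing: the special-token ids (PAD=0, UNK=1) and the printable-ASCII vocabulary are assumptions, and the authoritative character-to-id mapping ships in tokenizer.json.

```python
# Sketch of URL normalization and character-level encoding.
# Assumed conventions: PAD=0, UNK=1, remaining ids assigned to printable ASCII
# so that the total vocabulary stays at 100 entries (matching vocab_size=100).
import string

PAD_ID, UNK_ID = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(string.printable[:98])}

def normalize_url(url, sender_domain=None):
    """Add a scheme if missing; optionally prepend the sender domain."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    if sender_domain:
        url = sender_domain + " " + url
    return url

def encode_url(url, max_length=512):
    """Map each character to an id, then pad/truncate to a fixed length."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in url[:max_length]]
    return ids + [PAD_ID] * (max_length - len(ids))

ids = encode_url(normalize_url("example.com/login?user=a"))
print(len(ids))  # 512
```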

Model Architecture

  • Embedding layer: vocab_size=100, embed_dim=192
  • Learnable positional encoding up to max_length=512
  • Transformer encoder: num_layers=6, num_heads=8, feedforward hidden_dim=384
  • Pooling: masked global average pooling over valid characters
  • Classifier head: MLP with GELU + dropout (dropout=0.1) -> 2 logits
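The architecture above can be expressed as a compact PyTorch module. This is an illustrative sketch assembled from the listed hyperparameters; class and attribute names are assumptions, and the real definition lives in the training script.

```python
# Minimal sketch of the character-level Transformer classifier described above:
# embedding + learnable positions -> Transformer encoder -> masked mean pooling
# -> GELU MLP head producing 2 logits. PAD is assumed to be id 0.
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=192, num_heads=8,
                 num_layers=6, hidden_dim=384, max_length=512,
                 num_labels=2, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Learnable positional encoding up to max_length
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, ids):                      # ids: (batch, seq)
        pad_mask = ids.eq(0)                     # True at padding positions
        x = self.embed(ids) + self.pos[:, : ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Masked global average pooling over valid (non-pad) characters
        valid = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(1) / valid.sum(1).clamp(min=1.0)
        return self.head(pooled)                 # (batch, num_labels)

logits = CharURLClassifier()(torch.randint(1, 100, (2, 64)))
print(logits.shape)  # torch.Size([2, 2])
```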

Training Configuration

  • Epochs: 2
  • Batch size: 48
  • Learning rate: 8e-05
  • Weight decay: 0.01
  • Warmup ratio: 0.1
  • Gradient accumulation steps: 1
  • Optimizer: AdamW
  • LR schedule: warmup + cosine decay
  • Class balancing: weighted cross-entropy using computed class weights
  • Early stopping: patience of 3 epochs (based on validation ROC-AUC)

Saved Artifacts

  • best_model.pt: best checkpoint by validation ROC-AUC
  • model.pt: final model checkpoint
  • model_config.json: architecture hyperparameters
  • tokenizer.json: character vocabulary + tokenizer metadata
  • training_info.json: train/val/test metrics and key run parameters

Reproduce Training

python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char_v3 \
  --epochs 2 \
  --batch_size 48 \
  --lr 8e-05 \
  --max_length 512 \
  --embed_dim 192 \
  --num_heads 8 \
  --num_layers 6 \
  --hidden_dim 384 \
  --dropout 0.1

Evaluation Results

Test Set Metrics

  • Loss: 0.4197
  • Accuracy: 0.8178
  • F1: 0.8200
  • Precision: 0.7712
  • Recall: 0.8753
  • ROC-AUC: 0.9088
  • True Positives: 4978
  • True Negatives: 4836
  • False Positives: 1477
  • False Negatives: 709
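The reported test-set rates are consistent with the confusion counts and can be recomputed directly:

```python
# Cross-check of the test-set metrics from the confusion-matrix counts above.
tp, tn, fp, fn = 4978, 4836, 1477, 709

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.8178 0.7712 0.8753 0.8200
```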

Validation Set Metrics

  • Loss: 0.4235
  • Accuracy: 0.8151
  • F1: 0.8174
  • Precision: 0.7684
  • Recall: 0.8731
  • ROC-AUC: 0.9071
  • True Positives: 7448
  • True Negatives: 7224
  • False Positives: 2245
  • False Negatives: 1083

Usage

import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)

Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always use additional security measures in production environments.

Citation

If you use this model, please cite:

@misc{nhellyercreek_url_phishing_classifier_char_v3,
  title={Url Phishing Classifier Char V3},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char-v3}}
}