---
license: mit
tags:
- phishing-detection
- url-classification
- character-level
- pytorch
task: text-classification
datasets:
- custom
---

# URL Phishing Classifier (Char)

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model uses a custom character-level Transformer architecture trained from scratch for phishing detection; it is not fine-tuned from a pre-trained base model.

## Training Details

- **Base Model**: None (custom architecture, trained from scratch)
- **Training Samples**: 1,629,193
- **Validation Samples**: 325,839
- **Test Samples**: 217,226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512

## Additional Training Parameters

- **Model Type**: character_level_transformer

## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1

## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local patterns (`.`, `/`, `?`, `=`, `%`, `@`, homoglyph-like variants).
- Character-level encoding can capture suspicious fragments and obfuscation that subword tokenization tends to smooth out.
- Very long or uncommon URL strings do not depend on pre-trained token vocabulary coverage.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0=safe`, `1=phishing`.
4. URLs are normalized by adding a scheme if missing (`https://`).
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded/truncated to a fixed length.
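Steps 4-6 of the pipeline above can be sketched as follows. The real character-to-id mapping lives in `tokenizer.json`; the vocabulary, special-token ids, and lowercasing below are illustrative assumptions, not the repository's exact preprocessing.

```python
# Illustrative sketch of URL normalization and character encoding.
# The actual vocabulary and special ids come from tokenizer.json;
# everything here is an assumption for demonstration purposes.

PAD_ID = 0
UNK_ID = 1
# Assumed character set, offset past the two special ids.
VOCAB = {ch: i + 2 for i, ch in enumerate(
    "abcdefghijklmnopqrstuvwxyz0123456789.-_/:?=&%@#~+")}

def encode_url(url: str, max_length: int = 512) -> list[int]:
    # Step 4: normalize by adding a scheme if missing.
    if "://" not in url:
        url = "https://" + url
    # Step 6: character-by-character encoding, then truncate and pad.
    ids = [VOCAB.get(ch, UNK_ID) for ch in url.lower()[:max_length]]
    return ids + [PAD_ID] * (max_length - len(ids))

ids = encode_url("paypal-login.example.com/verify?id=1", max_length=64)
```

The fixed-length output lets URLs of any length be batched directly into the Transformer without a pre-trained tokenizer.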
### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```

## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70,875
- **True Negatives**: 127,736
- **False Positives**: 10,565
- **False Negatives**: 8,050

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106,429
- **True Negatives**: 191,629
- **False Positives**: 15,822
- **False Negatives**: 11,959

## Usage

```python
import json
import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

state_dict = torch.load("model.pt", map_location="cpu")
print("Loaded custom character-level URL classifier.")
print(config)
```

## Limitations

This model was trained on specific datasets and may not generalize to all types of phishing attempts. Always combine it with additional security measures in production environments.

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={Url Phishing Classifier Char},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```
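## Architecture Sketch

For reference, the architecture summarized in the Model Architecture section can be sketched in PyTorch as follows. This is an illustrative reconstruction from the hyperparameters listed above, not the repository's training code; module and attribute names are assumptions and will likely differ from the keys in `model.pt`, so check them against `model_config.json` before loading real weights.

```python
import torch
import torch.nn as nn

class CharURLClassifier(nn.Module):
    """Sketch of the card's character-level Transformer (names assumed)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8,
                 num_layers=4, hidden_dim=256, max_length=512,
                 num_labels=2, dropout=0.1, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        # Learnable positional encoding up to max_length.
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # MLP head with GELU + dropout -> 2 logits.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(embed_dim, num_labels))

    def forward(self, input_ids):
        pad_mask = input_ids.eq(self.pad_id)          # (B, L), True = padding
        x = self.embed(input_ids) + self.pos[:, :input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Masked global average pooling over valid (non-pad) characters.
        valid = (~pad_mask).unsqueeze(-1).float()
        pooled = (x * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)
```

A forward pass on a batch of encoded URLs of shape `(batch, seq_len)` returns logits of shape `(batch, 2)`, matching `num_labels=2` (`0=safe`, `1=phishing`).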