---

license: mit
tags:
  - phishing-detection
  - url-classification
  - character-level
  - pytorch
task: text-classification
datasets:
  - custom
---


# Url Phishing Classifier Char

This is a custom character-level Transformer model for URL phishing classification.

## Model Description

This model is a custom character-level Transformer encoder trained for binary URL phishing classification (`safe` vs. `phishing`). No pre-trained base model is recorded for it.

## Training Details

- **Base Model**: Unknown
- **Training Samples**: 1629193
- **Validation Samples**: 325839
- **Test Samples**: 217226
- **Epochs**: 5
- **Batch Size**: 32
- **Learning Rate**: 0.0001
- **Max Length**: 512


## Additional Training Parameters

- **Model Type**: character_level_transformer


## Model Architecture Parameters

- **Vocab Size**: 100
- **Embed Dim**: 128
- **Num Heads**: 8
- **Num Layers**: 4
- **Hidden Dim**: 256
- **Max Length**: 512
- **Num Labels**: 2
- **Dropout**: 0.1


## Character-Level Approach (In Depth)

This repository uses a **character-based URL model**, not a token/subword transformer.

### Why Character-Level for URLs

- URLs carry signal in punctuation and local character patterns (`.`, `/`, `?`, `=`, `%`, `@`), as well as in homoglyph-like substitutions.
- Character-level encoding can model suspicious fragments and obfuscation that subword tokenization can smooth out.
- Very long or uncommon URL strings do not depend on the coverage of a pre-trained token vocabulary.

### Data Processing Pipeline

1. CSV files are auto-discovered from `Training Material/URLs`.
2. URL and label columns are inferred from common names (`url`, `website_url`, `link`, `label`, `status`, etc.).
3. Labels are mapped to binary classes: `0=safe`, `1=phishing`.
4. URLs without a scheme are normalized by prepending `https://`.
5. If sender metadata exists, the sender domain may be prepended to the URL text.
6. The final input is encoded character by character and padded or truncated to a fixed length (`max_length=512`).
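
Steps 3 to 6 above can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: the character vocabulary, the `PAD_ID`/`UNK_ID` values, the accepted label spellings in `LABEL_MAP`, and the `|` separator between sender domain and URL are all assumptions. The real character mapping lives in `tokenizer.json`.

```python
import string
from typing import List, Optional, Tuple

# Assumed special ids; the real values are stored in tokenizer.json.
PAD_ID, UNK_ID = 0, 1

# Illustrative vocabulary: ASCII letters, digits, and punctuation.
_CHARS = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(_CHARS)}

# Hypothetical label spellings mapped to the binary classes 0=safe, 1=phishing.
LABEL_MAP = {"0": 0, "safe": 0, "legitimate": 0, "benign": 0,
             "1": 1, "phishing": 1, "malicious": 1}


def map_label(raw) -> int:
    """Map a raw label value to 0 (safe) or 1 (phishing)."""
    return LABEL_MAP[str(raw).strip().lower()]


def normalize_url(url: str, sender_domain: Optional[str] = None) -> str:
    """Add https:// when no scheme is present; optionally prepend the sender domain."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url
    if sender_domain:
        # The "|" separator is an assumption made for this sketch.
        url = f"{sender_domain}|{url}"
    return url


def encode(text: str, max_length: int = 512) -> Tuple[List[int], List[int]]:
    """Encode character by character, then truncate or pad to max_length."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text[:max_length]]
    n_valid = len(ids)
    ids = ids + [PAD_ID] * (max_length - n_valid)
    mask = [1] * n_valid + [0] * (max_length - n_valid)
    return ids, mask


text = normalize_url("example.com/login?id=1")
ids, mask = encode(text)
print(text, len(ids), sum(mask))
```

The attention mask returned next to the ids marks which positions hold real characters, which is what a masked pooling step needs downstream.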

### Model Architecture

- Embedding layer: `vocab_size=100`, `embed_dim=128`
- Learnable positional encoding up to `max_length=512`
- Transformer encoder: `num_layers=4`, `num_heads=8`, feedforward `hidden_dim=256`
- Pooling: masked global average pooling over valid characters
- Classifier head: MLP with GELU + dropout (`dropout=0.1`) -> 2 logits
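
The components listed above could be assembled in PyTorch roughly as shown below. This is a sketch under stated assumptions, not the module stored in `model.pt`: the class name `CharURLClassifier`, the attribute names, the padding id `0`, the width of the classifier head, and the encoder's activation (left at the PyTorch default) are not given in this card, so a real checkpoint's state dict will not necessarily load into it.

```python
import torch
import torch.nn as nn


class CharURLClassifier(nn.Module):
    """Illustrative character-level Transformer classifier (hypothetical layout)."""

    def __init__(self, vocab_size=100, embed_dim=128, num_heads=8, num_layers=4,
                 hidden_dim=256, max_length=512, num_labels=2, dropout=0.1,
                 pad_id=0):  # pad_id=0 is an assumption
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        # Learnable positional encoding, one vector per position up to max_length.
        self.pos = nn.Parameter(torch.zeros(1, max_length, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, batch_first=True)
        # Nested tensors are disabled so the output keeps the padded length.
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False)
        # Head width (hidden_dim) is an assumption; the card only says "MLP with GELU".
        self.head = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        valid = input_ids != self.pad_id                      # (batch, seq) bool
        x = self.embed(input_ids) + self.pos[:, : input_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=~valid)      # True marks padding
        # Masked global average pooling over valid characters only.
        m = valid.unsqueeze(-1).to(x.dtype)
        pooled = (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)                              # (batch, num_labels)


model = CharURLClassifier().eval()
batch = torch.randint(1, 100, (2, 16))
batch[0, 10:] = 0  # pad the tail of the first sequence
with torch.no_grad():
    logits = model(batch)
print(logits.shape)
```

Averaging only over unpadded positions keeps short URLs from being diluted by padding vectors, which matters when most inputs are far shorter than 512 characters.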

### Training Configuration

- Epochs: `5`
- Batch size: `32`
- Learning rate: `0.0001`
- Weight decay: `0.01`
- Warmup ratio: `0.1`
- Gradient accumulation steps: `1`
- Optimizer: AdamW
- LR schedule: warmup + cosine decay
- Class balancing: weighted cross-entropy using computed class weights
- Early stopping: patience of 3 epochs (based on validation ROC-AUC)
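
To make the schedule and class balancing concrete, here is a small sketch in plain Python. It assumes linear warmup, cosine decay to zero, and the common "balanced" inverse-frequency weighting `n_samples / (n_classes * class_count)`; the card does not state which exact variants the training script uses, and the helper names `lr_at_step` and `class_weights` are hypothetical. The step count assumes the last partial batch of each epoch is kept.

```python
import math
from collections import Counter


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-4,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay to zero (assumed variant)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def class_weights(labels):
    """Balanced weights: n_samples / (n_classes * count), rarer class weighs more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in sorted(counts)}


# Optimizer steps for this run: 1,629,193 samples, batch size 32, 5 epochs,
# gradient accumulation of 1.
steps_per_epoch = math.ceil(1629193 / 32)
total_steps = steps_per_epoch * 5
print(steps_per_epoch, total_steps)

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")

print(class_weights([0, 0, 0, 1]))
```

With a warmup ratio of `0.1`, roughly the first half of epoch 1 is spent ramping up to the peak learning rate of `0.0001`.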

### Saved Artifacts

- `best_model.pt`: best checkpoint by validation ROC-AUC
- `model.pt`: final model checkpoint
- `model_config.json`: architecture hyperparameters
- `tokenizer.json`: character vocabulary + tokenizer metadata
- `training_info.json`: train/val/test metrics and key run parameters

### Reproduce Training

```bash
python train_url_classifier_char.py \
  --output_dir ./Models/url_classifier_char \
  --epochs 5 \
  --batch_size 32 \
  --lr 0.0001 \
  --max_length 512 \
  --embed_dim 128 \
  --num_heads 8 \
  --num_layers 4 \
  --hidden_dim 256 \
  --dropout 0.1
```


## Evaluation Results

### Test Set Metrics

- **Loss**: 0.2078
- **Accuracy**: 0.9143
- **F1**: 0.8839
- **Precision**: 0.8703
- **Recall**: 0.8980
- **ROC-AUC**: 0.9751
- **True Positives**: 70875
- **True Negatives**: 127736
- **False Positives**: 10565
- **False Negatives**: 8050
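
The threshold-based metrics above follow directly from the four confusion-matrix counts. The sketch below recomputes them as a consistency check; the function name `metrics_from_counts` is introduced here for illustration only. Loss and ROC-AUC cannot be derived from the counts, so they are not included.

```python
def metrics_from_counts(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive (phishing) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "total": tp + tn + fp + fn,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-set counts as reported in this card.
test_metrics = metrics_from_counts(tp=70875, tn=127736, fp=10565, fn=8050)
for name, value in test_metrics.items():
    print(name, value if name == "total" else f"{value:.4f}")
```

The counts sum to the 217226 test samples, and the recomputed accuracy, precision, recall, and F1 agree with the reported values to four decimal places.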

### Validation Set Metrics

- **Loss**: 0.2064
- **Accuracy**: 0.9147
- **F1**: 0.8846
- **Precision**: 0.8706
- **Recall**: 0.8990
- **ROC-AUC**: 0.9755
- **True Positives**: 106429
- **True Negatives**: 191629
- **False Positives**: 15822
- **False Negatives**: 11959


## Usage

```python
import json

import torch

# This repository contains a custom PyTorch model:
# - model.pt (trained weights)
# - model_config.json (architecture hyperparameters)
# - tokenizer.json (character tokenizer)
#
# Load these files with your project inference code (e.g. predict_url_char.py).

# Read the architecture hyperparameters needed to rebuild the model.
with open("model_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Load the checkpoint onto the CPU. The model class itself comes from the
# project code, so this snippet only loads the saved weights.
state_dict = torch.load("model.pt", map_location="cpu")

print("Loaded custom character-level URL classifier.")
print(config)
```

## Limitations

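This model was trained on a custom collection of URL datasets and may not generalize to phishing patterns that are absent from that data. It sees only the URL string (and, where available, the sender domain), not the content of the page behind it. On the held-out test set it misses about 10% of phishing URLs (recall 0.8980) and flags about 7.6% of safe URLs as phishing (10565 false positives), so treat its output as one signal alongside other security measures in production environments.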

## Citation

If you use this model, please cite:

```bibtex
@misc{nhellyercreek_url_phishing_classifier_char,
  title={Url Phishing Classifier Char},
  author={Noah Hellyer},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nhellyercreek/url-phishing-classifier-char}}
}
```