---
datasets:
- ealvaradob/phishing-dataset
- cybersectony/PhishingEmailDetectionv2.0
- zefang-liu/phishing-email-dataset
- kmack/Phishing_urls
language:
- en
pipeline_tag: feature-extraction
tags:
- transformers
- pytorch
- distilbert
- text-classification
- feature-extraction
- security
- cybersecurity
- phishing
- malware
- url
- tiny
- lightweight
license: apache-2.0
---
urlbert-tiny-v5 is a lightweight model built on the DistilBERT architecture, designed to produce high-quality embeddings for text containing URLs.
Despite using the DistilBERT architecture, urlbert-tiny-v5 was not trained via knowledge distillation and is not a fine-tune of the original DistilBERT. Instead, the model was trained on MLM, text generation, token classification, and multi-class classification tasks.
### Key Specifications
- Architecture: DistilBERT (6 layers, 768 hidden dimensions, 12 attention heads)
- Parameters: ~58.2M
- Context Window: 512 tokens
- Vocabulary Size: 19,996
- Tensor type: F32
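The parameter count can be roughly cross-checked from the specifications above. The sketch below assumes standard DistilBERT internals that are not stated in the list (512 learned position embeddings and a feed-forward size of 4 × hidden = 3072), so treat it as an approximation:

```python
# Back-of-the-envelope parameter count from the specs above.
# Assumed (not stated above): 512 learned position embeddings,
# feed-forward (intermediate) size of 4 * hidden = 3072.
vocab, hidden, layers, ffn, max_pos = 19_996, 768, 6, 3_072, 512

embeddings = vocab * hidden + max_pos * hidden + 2 * hidden  # token + position + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)   # Q, K, V and output projections (with biases)
    + 2 * hidden                     # attention LayerNorm
    + (hidden * ffn + ffn)           # FFN up-projection
    + (ffn * hidden + hidden)        # FFN down-projection
    + 2 * hidden                     # output LayerNorm
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")
```

This lands within rounding distance of the ~58.2M figure listed above.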
### Usage
Here is a minimal example showing how to extract embeddings from text containing URLs:
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "CrabInHoney/urlbert-tiny-v5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Check that model: https://huggingface.co/CrabInHoney/urlbert-tiny-v5"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token vector as the sentence-level embedding
embedding = outputs.last_hidden_state[:, 0, :]

print(f"Embedding Shape: {embedding.shape}")
print(f"First 5 values: {embedding[0, :5]}")
```
Output:
```
Embedding Shape: torch.Size([1, 768])
First 5 values: tensor([-0.0206, -0.0150, -0.0403, 0.0814, 0.0638])
```
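The snippet above takes the [CLS] token vector as the embedding. If an embedding averaged over all non-padding tokens is preferred, mean pooling is a common alternative. The sketch below demonstrates the pooling arithmetic on dummy tensors; in real use, `last_hidden_state` and `attention_mask` would come from the model and tokenizer above:

```python
import torch

# Dummy stand-ins for outputs.last_hidden_state and inputs.attention_mask
last_hidden_state = torch.randn(1, 20, 768)
attention_mask = torch.ones(1, 20, dtype=torch.long)
attention_mask[0, 15:] = 0  # pretend the last 5 positions are padding

# Mean-pool over real tokens only, ignoring padding
mask = attention_mask.unsqueeze(-1).float()  # (1, 20, 1)
mean_emb = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(mean_emb.shape)  # torch.Size([1, 768])
```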
### Classification Heads
Given that urlbert-tiny-v5 generates high-quality embeddings suitable for classification "out-of-the-box," we decided not to release separate base and fine-tuned versions. Instead, only classification heads were trained for specific datasets, while the encoder weights remained frozen during the process.
There are 7 trained heads available in the `heads/` directory of this repository.
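The frozen-encoder setup can be sketched as follows. To keep the example runnable without downloading the model, a plain linear layer stands in for the urlbert encoder; the head mirrors the `Head` architecture used at inference (Linear → BatchNorm → ReLU → Dropout → Linear), while the optimizer and learning rate are illustrative assumptions, not the actual training recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the pretrained encoder (in practice: AutoModel.from_pretrained(...))
encoder = nn.Linear(768, 768)
for p in encoder.parameters():
    p.requires_grad_(False)  # encoder weights stay frozen

num_classes = 2
head = nn.Sequential(
    nn.Linear(768, 768), nn.BatchNorm1d(768), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(768, num_classes),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # only head params update

# One illustrative training step on random data
x = torch.randn(8, 768)                  # batch of CLS embeddings
labels = torch.randint(0, num_classes, (8,))
loss = nn.functional.cross_entropy(head(encoder(x)), labels)
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.3f}")
```

Because the encoder's parameters have `requires_grad=False` and are excluded from the optimizer, only the head's weights ever change.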
### Benchmark Results
The following table shows the performance of these heads on their respective test sets:
| Model Head File (`.safetensors`) | Dataset Source | Task Type | Samples | Accuracy | Macro F1 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| `MSMalicious-URLs-dataset_head` | [Kaggle: MS Malicious URLs](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset) | **4-Class:** (Benign, Defacement, Phishing, Malware) | 651,191 | **99.82%** | 0.9965 |
| `cyPhishing-Email-Detection_head` | [HF: Cybersectony Phishing v2.0](https://huggingface.co/datasets/cybersectony/PhishingEmailDetectionv2.0) | **4-Class:** (Legit/Phish Email, Legit/Phish URL) | 200,000 | **99.69%** | 0.9914 |
| `PSSpam-Email-Classification_head` | [Kaggle: Email Spam Classification](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset) | **Binary:** (Legit vs Spam Email) | 83,448 | **99.10%** | 0.9909 |
| `zlphishing-email-dataset_head` | [HF: ZL Phishing Email](https://huggingface.co/datasets/zefang-liu/phishing-email-dataset) | **Binary:** (Safe vs Phish Email) | 18,634 | **97.98%** | 0.9790 |
| `eaphishing-dataset_head` | [HF: EA Phishing (Combined)](https://huggingface.co/datasets/ealvaradob/phishing-dataset) | **Binary:** (Safe vs Phishing) | 77,677 | **96.67%** | 0.9660 |
| `kmPhishing-urls_head` | [HF: KMack Phishing URLs](https://huggingface.co/datasets/kmack/Phishing_urls) | **Binary:** (Safe vs Phishing URL) | 708,820 | **89.51%** | 0.8948 |
| `annotationGenHead` | *Unpublished Dataset* | *Annotation Generation* | - | - | - |
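The Macro F1 column is the unweighted mean of per-class F1 scores, so unlike accuracy it does not let a dominant class mask poor performance on a minority class. A minimal computation on toy labels (made up purely for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Class 0: F1 = 2/3; class 1: F1 = 4/5; macro = 11/15
print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.7333
```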
### Inference Example
This script loads the base model and **all** available heads to analyze a URL/text against every dataset simultaneously.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, EncoderDecoderModel, BertConfig
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# 1. Classifier Architecture
class Head(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.pre_classifier, self.bn = nn.Linear(768, 768), nn.BatchNorm1d(768)
        self.classifier = nn.Linear(768, c)

    def forward(self, x):
        return self.classifier(torch.dropout(torch.relu(self.bn(self.pre_classifier(x))), 0.3, False))

# 2. Config
REPO = "CrabInHoney/urlbert-tiny-v5"
CLS_HEADS = {
    "MSMalicious-URLs-dataset_head.safetensors": {0: "BENIGN", 1: "DEFACEMENT", 2: "PHISHING", 3: "MALWARE"},
    "cyPhishing-Email-Detection_head.safetensors": {0: "LEGIT EMAIL", 1: "PHISH EMAIL", 2: "LEGIT URL", 3: "PHISH URL"},
    "PSSpam-Email-Classification_head.safetensors": {0: "LEGIT EMAIL", 1: "SPAM EMAIL"},
    "kmPhishing-urls_head.safetensors": {0: "SAFE URL", 1: "PHISHING"},
    "eaphishing-dataset_head.safetensors": {0: "SAFE", 1: "PHISHING"},
    "zlphishing-email-dataset_head.safetensors": {0: "SAFE EMAIL", 1: "PHISH EMAIL"},
}
GEN_FILE = "heads/annotationGenHead.safetensors"

# 3. Load Models
print("Loading models...")
tok = AutoTokenizer.from_pretrained(REPO)
enc = AutoModel.from_pretrained(REPO)

# Load classifiers
models = {}
for f, lbls in CLS_HEADS.items():
    h = Head(len(lbls))
    h.load_state_dict(load_file(hf_hub_download(REPO, f"heads/{f}")))
    h.eval()
    models[f] = (h, lbls)

# Load generator
dec_conf = BertConfig(
    vocab_size=tok.vocab_size, hidden_size=256, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=1024,
    is_decoder=True, add_cross_attention=True,
)
gen_model = EncoderDecoderModel(encoder=enc, decoder=AutoModelForCausalLM.from_config(dec_conf))
gen_model.load_state_dict(load_file(hf_hub_download(REPO, GEN_FILE)), strict=False)
gen_model.eval()

# 4. Inference
text = "http://paypal-secure-login.update.com"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
print(f"Target: {text}\n")
print(f"{'HEAD':<30} {'VERDICT':<15} {'CONF'}")

with torch.no_grad():
    # Run classifiers on the [CLS] embedding
    emb = enc(**inputs).last_hidden_state[:, 0, :]
    for fname, (model, labels) in models.items():
        probs = F.softmax(model(emb), dim=1)[0]
        top_id = probs.argmax().item()
        verdict = labels[top_id]
        c = "\033[91m" if any(x in verdict for x in ["PHISH", "MALWARE", "SPAM", "DEFACE"]) else "\033[92m"
        print(f"{fname.split('_')[0]:<30} {c}{verdict:<15}\033[0m {probs[top_id]:.1%}")

    # Run generator (fix: explicitly pass decoder_start_token_id)
    print("-" * 55)
    out = gen_model.generate(
        inputs.input_ids,
        max_length=60,
        num_beams=5,
        decoder_start_token_id=tok.cls_token_id,
        eos_token_id=tok.sep_token_id,
    )
    desc = tok.decode(out[0], skip_special_tokens=True)
    print(f"Generated Description: \033[96m{desc}\033[0m")
```
Output:
```
HEAD VERDICT CONF
MSMalicious-URLs-dataset PHISHING 100.0%
cyPhishing-Email-Detection PHISH URL 99.1%
PSSpam-Email-Classification SPAM EMAIL 99.9%
kmPhishing-urls PHISHING 91.8%
eaphishing-dataset PHISHING 100.0%
zlphishing-email-dataset PHISH EMAIL 55.2%
-------------------------------------------------------
Generated Description: financial institution phishing
``` |