---
datasets:
- ealvaradob/phishing-dataset
- cybersectony/PhishingEmailDetectionv2.0
- zefang-liu/phishing-email-dataset
- kmack/Phishing_urls
language:
- en
pipeline_tag: feature-extraction
tags:
- transformers
- pytorch
- distilbert
- text-classification
- feature-extraction
- security
- cybersecurity
- phishing
- malware
- url
- tiny
- lightweight
license: apache-2.0
---
urlbert-tiny-v5 is a lightweight encoder built on the DistilBERT architecture and designed to produce high-quality embeddings for text that contains URLs.

Although it uses the DistilBERT architecture, urlbert-tiny-v5 was not trained via knowledge distillation and is not a fine-tune of the original DistilBERT checkpoint. Instead, it was trained on masked language modeling (MLM), text generation, token classification, and multi-class classification tasks.

### Key Specifications
- Architecture: DistilBERT (6 layers, 768 hidden dimensions, 12 attention heads)
- Parameters: ~58.2M
- Context Window: 512 tokens
- Vocabulary Size: 19,996
- Tensor type: F32
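
As a rough sanity check on the parameter count, the figures above can be plugged into the usual DistilBERT layout. This is only a back-of-the-envelope sketch: the feed-forward width of 3072 is the library default and is an assumption here, since the card does not state it, and the helper `linear` is purely illustrative.

```python
# Back-of-the-envelope parameter count for a DistilBERT-style encoder.
# ASSUMPTION: ffn_dim=3072 (the DistilBERT default); the card does not list it.
vocab_size, max_positions = 19_996, 512
hidden, layers, ffn_dim = 768, 6, 3072

def linear(n_in, n_out):
    """Weights plus biases of a dense layer (illustrative helper)."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Token embeddings + learned position embeddings + embedding LayerNorm
embeddings = vocab_size * hidden + max_positions * hidden + layer_norm

# Each block: Q, K, V and output projections, a two-layer FFN, two LayerNorms
attention = 4 * linear(hidden, hidden)
feed_forward = linear(hidden, ffn_dim) + linear(ffn_dim, hidden)
per_layer = attention + feed_forward + 2 * layer_norm

total = embeddings + layers * per_layer
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total comes to roughly 58.3M, within about 0.1M of the figure listed above. The token embedding table accounts for about a quarter of it.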
### Usage

Here is a minimal example showing how to extract embeddings from text containing URLs:
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "CrabInHoney/urlbert-tiny-v5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
text = "Check that model: https://huggingface.co/CrabInHoney/urlbert-tiny-v5"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding Shape: {embedding.shape}")
print(f"First 5 values: {embedding[0, :5]}")
```
Output:
```
Embedding Shape: torch.Size([1, 768])
First 5 values: tensor([-0.0206, -0.0150, -0.0403,  0.0814,  0.0638])
```
### Classification Heads

Because urlbert-tiny-v5 produces embeddings that are suitable for classification out of the box, we decided not to release separate base and fine-tuned versions. Instead, only classification heads were trained on each dataset, with the encoder weights kept frozen throughout.

Seven trained heads (six classifiers and one annotation generator) are available in the `heads/` directory of this repository.
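
To illustrate what training a head on a frozen encoder can look like, here is a minimal sketch. It is not the training script used for the released heads: the optimizer, learning rate, step count, and the random tensors standing in for precomputed `[CLS]` embeddings are all assumptions made for illustration. The `Head` layout mirrors the one used in the inference example in this card.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Head(nn.Module):
    """Same layer names as the Head class in the inference example."""
    def __init__(self, num_classes, dim=768):
        super().__init__()
        self.pre_classifier = nn.Linear(dim, dim)
        self.bn = nn.BatchNorm1d(dim)
        self.classifier = nn.Linear(dim, num_classes)
        self.drop = nn.Dropout(0.3)  # active only in train mode

    def forward(self, x):
        x = torch.relu(self.bn(self.pre_classifier(x)))
        return self.classifier(self.drop(x))

# HYPOTHETICAL data: random vectors stand in for embeddings that would be
# computed once, under torch.no_grad(), with the frozen encoder.
embeddings = torch.randn(64, 768)
labels = torch.randint(0, 2, (64,))

head = Head(num_classes=2)
# Only the head's parameters go to the optimizer; the encoder is never updated.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # assumed settings
loss_fn = nn.CrossEntropyLoss()

head.train()
losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the encoder never changes, its embeddings can be cached and reused across heads, which keeps training each additional head cheap.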

### Benchmark Results

The following table shows the performance of these heads on their respective test sets:

| Model Head File (`.safetensors`) | Dataset Source | Task Type | Samples | Accuracy | Macro F1 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| `MSMalicious-URLs-dataset_head` | [Kaggle: MS Malicious URLs](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset) | **4-Class:** (Benign, Defacement, Phishing, Malware) | 651,191 | **99.82%** | 0.9965 |
| `cyPhishing-Email-Detection_head` | [HF: Cybersectony Phishing v2.0](https://huggingface.co/datasets/cybersectony/PhishingEmailDetectionv2.0) | **4-Class:** (Legit/Phish Email, Legit/Phish URL) | 200,000 | **99.69%** | 0.9914 |
| `PSSpam-Email-Classification_head` | [Kaggle: Email Spam Classification](https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset) | **Binary:** (Legit vs Spam Email) | 83,448 | **99.10%** | 0.9909 |
| `zlphishing-email-dataset_head` | [HF: ZL Phishing Email](https://huggingface.co/datasets/zefang-liu/phishing-email-dataset) | **Binary:** (Safe vs Phish Email) | 18,634 | **97.98%** | 0.9790 |
| `eaphishing-dataset_head` | [HF: EA Phishing (Combined)](https://huggingface.co/datasets/ealvaradob/phishing-dataset) | **Binary:** (Safe vs Phishing) | 77,677 | **96.67%** | 0.9660 |
| `kmPhishing-urls_head` | [HF: KMack Phishing URLs](https://huggingface.co/datasets/kmack/Phishing_urls) | **Binary:** (Safe vs Phishing URL) | 708,820 | **89.51%** | 0.8948 |
| `annotationGenHead` | *Unpublished Dataset* | *Annotation Generation* | - | - | - |
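
For readers comparing the two metric columns, the sketch below shows the standard definitions of accuracy and macro F1 in plain Python. The card does not say which tooling produced the numbers above, so this illustrates the metrics themselves rather than the evaluation script; the labels and predictions are made-up toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, imbalanced example (hypothetical data): 8 safe samples, 2 phishing
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")
```

In this toy case a single missed phishing sample leaves accuracy at 0.9 but pulls macro F1 down to about 0.80, because every class counts equally regardless of its size. That is why macro F1 is worth reading next to accuracy on imbalanced datasets.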



### Inference Example

This script loads the base encoder and **all** available heads, then runs a single URL or text through every classification head and the annotation generator.

```python
import torch, torch.nn as nn, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, EncoderDecoderModel, BertConfig
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# 1. Classifier Architecture
class Head(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.pre_classifier, self.bn = nn.Linear(768, 768), nn.BatchNorm1d(768)
        self.classifier = nn.Linear(768, c)
    def forward(self, x):
        return self.classifier(torch.dropout(torch.relu(self.bn(self.pre_classifier(x))), 0.3, False))

# 2. Config
REPO = "CrabInHoney/urlbert-tiny-v5"
CLS_HEADS = {
    "MSMalicious-URLs-dataset_head.safetensors": {0: "BENIGN", 1: "DEFACEMENT", 2: "PHISHING", 3: "MALWARE"},
    "cyPhishing-Email-Detection_head.safetensors": {0: "LEGIT EMAIL", 1: "PHISH EMAIL", 2: "LEGIT URL", 3: "PHISH URL"},
    "PSSpam-Email-Classification_head.safetensors": {0: "LEGIT EMAIL", 1: "SPAM EMAIL"},
    "kmPhishing-urls_head.safetensors": {0: "SAFE URL", 1: "PHISHING"},
    "eaphishing-dataset_head.safetensors": {0: "SAFE", 1: "PHISHING"},
    "zlphishing-email-dataset_head.safetensors": {0: "SAFE EMAIL", 1: "PHISH EMAIL"}
}
GEN_FILE = "heads/annotationGenHead.safetensors"

# 3. Load Models
print("Loading models...")
tok = AutoTokenizer.from_pretrained(REPO)
enc = AutoModel.from_pretrained(REPO)

# Load Classifiers
models = {}
for f, lbls in CLS_HEADS.items():
    h = Head(len(lbls))
    h.load_state_dict(load_file(hf_hub_download(REPO, f"heads/{f}")))
    h.eval()
    models[f] = (h, lbls)

# Load Generator
dec_conf = BertConfig(vocab_size=tok.vocab_size, hidden_size=256, num_hidden_layers=4, num_attention_heads=4, intermediate_size=1024, is_decoder=True, add_cross_attention=True)
gen_model = EncoderDecoderModel(encoder=enc, decoder=AutoModelForCausalLM.from_config(dec_conf))
gen_model.load_state_dict(load_file(hf_hub_download(REPO, GEN_FILE)), strict=False)
gen_model.eval()

# 4. Inference
text = "http://paypal-secure-login.update.com"
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)

print(f"Target: {text}\n")
print(f"{'HEAD':<30} {'VERDICT':<15} {'CONF'}")

with torch.no_grad():
    # Run Classifiers
    emb = enc(**inputs).last_hidden_state[:, 0, :]
    for fname, (model, labels) in models.items():
        probs = F.softmax(model(emb), dim=1)[0]
        top_id = probs.argmax().item()
        verdict = labels[top_id]
        c = "\033[91m" if any(x in verdict for x in ["PHISH", "MALWARE", "SPAM", "DEFACE"]) else "\033[92m"
        print(f"{fname.split('_')[0]:<30} {c}{verdict:<15}\033[0m {probs[top_id]:.1%}")

    # Run Generator (the decoder start token is passed explicitly)
    print("-" * 55)
    out = gen_model.generate(
        inputs.input_ids, 
        max_length=60, 
        num_beams=5,
        decoder_start_token_id=tok.cls_token_id,
        eos_token_id=tok.sep_token_id
    )
    desc = tok.decode(out[0], skip_special_tokens=True)
    print(f"Generated Description: \033[96m{desc}\033[0m")
```

Output:

```
HEAD                           VERDICT         CONF
MSMalicious-URLs-dataset       PHISHING        100.0%
cyPhishing-Email-Detection     PHISH URL       99.1%
PSSpam-Email-Classification    SPAM EMAIL      99.9%
kmPhishing-urls                PHISHING        91.8%
eaphishing-dataset             PHISHING        100.0%
zlphishing-email-dataset       PHISH EMAIL     55.2%
-------------------------------------------------------
Generated Description: financial institution phishing
```
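
The `CONF` column is the largest softmax probability produced by each head. The plain-Python sketch below shows how such a value is derived from raw logits; the logits are hypothetical and chosen only to illustrate a near-tie on a binary head.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = {0: "SAFE EMAIL", 1: "PHISH EMAIL"}
logits = [0.0, 0.21]  # HYPOTHETICAL head outputs, not taken from the model

probs = softmax(logits)
top_id = max(range(len(probs)), key=probs.__getitem__)
print(f"{labels[top_id]:<15} {probs[top_id]:.1%}")
```

On a binary head, a confidence only slightly above 50% means the two logits are almost equal, so a verdict like the 55.2% reported by `zlphishing-email-dataset` above is best read as uncertain rather than as a firm detection.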