Commit 1b99900 by ccss17 (verified)
1 Parent(s): 83ca46c

Update README.md

Files changed (1)
  1. README.md +0 -172
README.md CHANGED
@@ -1,172 +0,0 @@
- ---
- tags:
- - security
- - dga-detection
- - malware
- - cybersecurity
- - domain-classification
- - transformer
- license: mit
- datasets:
- - extrahop/dga-training-data
- metrics:
- - f1
- - accuracy
- - precision
- - recall
- model-index:
- - name: dga-transformer-encoder
-   results:
-   - task:
-       type: text-classification
-       name: Domain Classification
-     dataset:
-       name: ExtraHop DGA Dataset
-       type: extrahop/dga-training-data
-     metrics:
-     - type: f1
-       value: 0.9678
-       name: F1 Score
-     - type: accuracy
-       value: 0.9678
-       name: Accuracy
- ---
-
- # DGA Transformer Encoder
-
- A custom transformer-based model for detecting domains produced by Domain Generation Algorithms (DGAs), which malware uses to locate its command-and-control (C2) infrastructure.
-
- ## Model Details
-
- - **Architecture**: Custom Transformer Encoder (4 layers, 256 dimensions, 4 attention heads)
- - **Parameters**: 3.2M
- - **Training Data**: ExtraHop DGA dataset (500K balanced samples)
- - **Performance**: 96.78% F1 score on test set
- - **Inference Speed**: <1ms per domain (GPU), ~10ms (CPU)
-
- ## Usage
-
- ```python
- from transformers import AutoModelForSequenceClassification
- import torch
-
- # Character encoding: index 0 is reserved for padding
- CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789-."
- CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARSET)}
- PAD = 0
-
- def encode_domain(domain: str, max_len: int = 64):
-     # Characters outside CHARSET map to PAD; output is truncated or padded to max_len
-     ids = [CHAR_TO_IDX.get(c, PAD) for c in domain.lower()]
-     ids = ids[:max_len]
-     ids = ids + [PAD] * (max_len - len(ids))
-     return ids
-
- # Load model
- model = AutoModelForSequenceClassification.from_pretrained("ccss17/dga-transformer-encoder")
- model.eval()
-
- # Classify a domain
- def predict(domain: str):
-     input_ids = torch.tensor([encode_domain(domain, max_len=64)])
-     with torch.no_grad():
-         logits = model(input_ids).logits
-         probs = torch.softmax(logits, dim=-1)
-         # Take the argmax over the class dimension (batch size is 1 here)
-         pred = torch.argmax(probs, dim=-1).item()
-
-     label = "Legitimate" if pred == 0 else "DGA (Malicious)"
-     confidence = probs[0, pred].item()
-     return label, confidence
-
- # Examples
- print(predict("google.com"))  # ('Legitimate', 0.998)
- print(predict("xjkd8f2h.com"))  # ('DGA (Malicious)', 0.976)
- ```
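To make the character encoding concrete, the sketch below repeats `CHARSET` and `encode_domain` from the usage snippet and shows the IDs produced for `google.com`. It uses only the standard library. `encode_batch` is a hypothetical helper added for illustration and is not part of the model repository.

```python
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARSET)}  # 0 is reserved for PAD
PAD = 0

def encode_domain(domain: str, max_len: int = 64) -> list:
    # Map each character to its index, truncate, then right-pad with PAD
    ids = [CHAR_TO_IDX.get(c, PAD) for c in domain.lower()][:max_len]
    return ids + [PAD] * (max_len - len(ids))

def encode_batch(domains, max_len: int = 64) -> list:
    # Hypothetical helper: every row has the same length, so the result
    # can be passed directly to torch.tensor(...)
    return [encode_domain(d, max_len) for d in domains]

ids = encode_domain("google.com")
print(ids[:10])  # [7, 15, 15, 7, 12, 5, 38, 3, 15, 13]
print(len(ids))  # 64
```

Because unknown characters share index 0 with padding, the model cannot distinguish an unsupported character from the end of the domain. That behavior is inherited from the original snippet.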
-
- ## Try it on HuggingFace Spaces
-
- 🚀 [Interactive Demo](https://huggingface.co/spaces/ccss17/dga-detector)
-
- ## Training Details
-
- - **Framework**: PyTorch + HuggingFace Transformers
- - **Optimizer**: AdamW
- - **Learning Rate**: 3e-4 with linear warmup
- - **Batch Size**: 2048 (effective, via gradient accumulation)
- - **Epochs**: 5 scheduled (early stopping at epoch 2.4)
- - **Loss**: CrossEntropyLoss
98
- ## Model Architecture
99
-
100
- ```
101
- Input: Domain string (e.g., "google.com")
102
-
103
- Character Tokenization: [g, o, o, g, l, e, ., c, o, m]
104
-
105
- Embedding Layer: 256-dim vectors
106
-
107
- Positional Encoding: Add position information
108
-
109
- Transformer Encoder (4 layers):
110
- - Multi-head Self-Attention (4 heads)
111
- - Feed-Forward Network (1024 hidden)
112
- - Layer Normalization
113
- - Residual Connections
114
-
115
- [CLS] Token Pooling: Extract sequence representation
116
-
117
- Classification Head: Linear(256 → 2)
118
-
119
- Output: [P(Legitimate), P(DGA)]
120
- ```
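As a rough sanity check on the stated model size, the sketch below counts parameters for the layout in the diagram. It assumes a standard encoder layer (four attention projections with biases, a two-layer feed-forward block, two LayerNorms), learned positional embeddings, a vocabulary of 40 (38 characters plus PAD and [CLS]), and 65 positions. None of these details are confirmed by the source, and the function names are hypothetical.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Q, K, V, and output projections, each d_model x d_model plus bias
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a weight and a bias vector
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms

def total_params(vocab_size: int = 40, max_len: int = 65, d_model: int = 256,
                 d_ff: int = 1024, n_layers: int = 4, n_classes: int = 2) -> int:
    embeddings = vocab_size * d_model            # character embedding table
    positions = max_len * d_model                # assumed learned positional table
    layers = n_layers * encoder_layer_params(d_model, d_ff)
    head = d_model * n_classes + n_classes       # Linear(256 -> 2)
    return embeddings + positions + layers + head

print(encoder_layer_params(256, 1024))  # 789760
print(total_params())                   # 3186434
```

Under these assumptions the total comes to about 3.19M, in line with the 3.2M figure listed under Model Details. Nearly all of it sits in the four encoder layers.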
-
- ## Performance
-
- | Metric | Score |
- |--------|-------|
- | F1 Score (Macro) | 96.78% |
- | F1 Score (Binary) | 96.78% |
- | Accuracy | 96.78% |
- | Precision | 96.5% |
- | Recall | 97.1% |
-
- **Confusion Matrix** (Test Set):
-
- | | Predicted Legit | Predicted DGA |
- |----------------|----------------|---------------|
- | **True Legit** | 24,180 | 820 |
- | **True DGA** | 790 | 24,210 |
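The headline metrics can be recomputed from the confusion matrix counts. The sketch below treats DGA as the positive class, which is an assumption about how the reported scores were calculated.

```python
# Counts taken from the confusion matrix (DGA = positive class, assumed)
tn, fp = 24_180, 820   # true legitimate: predicted legit / predicted DGA
fn, tp = 790, 24_210   # true DGA: predicted legit / predicted DGA

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fpr={false_positive_rate:.4f}")
# accuracy=0.9678 precision=0.9672 recall=0.9684 f1=0.9678 fpr=0.0328
```

Accuracy and F1 reproduce the reported 96.78%. Precision and recall derived this way come out at roughly 96.7% and 96.8%, close to but not identical with the 96.5% and 97.1% in the table, so those two figures may have been computed on a different split or rounded differently.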
-
- ## Limitations
-
- - Trained primarily on English domains
- - May not generalize to all DGA families (e.g., dictionary-based DGAs)
- - Expects a bare domain name as input (no protocol or path) for best performance
- - ~3% false positive rate on the test set (820 of 25,000 legitimate domains)
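Since the model works best on a domain without protocol or path, inputs taken from logs or URLs may need to be reduced to a hostname first. The following is a minimal sketch using the standard library. `normalize_domain` is a hypothetical helper, not something shipped with the model.

```python
from urllib.parse import urlsplit

def normalize_domain(raw: str) -> str:
    """Reduce a URL or host string to a bare, lowercase hostname."""
    raw = raw.strip()
    # urlsplit only fills in the host when it sees "//", so add it
    # for inputs that have no scheme
    if "://" not in raw:
        raw = "//" + raw
    # .hostname drops the port and any user info and lowercases the result
    host = urlsplit(raw).hostname or ""
    return host.rstrip(".")  # drop a trailing root dot, if any

print(normalize_domain("https://Example.com/path?q=1"))  # example.com
print(normalize_domain("example.com:8080"))              # example.com
```

The result can then be passed to `encode_domain`. Whether subdomains should be kept or stripped depends on how the training data was prepared, which the source does not state.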
-
- ## Citation
-
- If you use this model, please cite:
-
- ```bibtex
- @misc{dga-transformer-encoder,
-   author = {ccss17},
-   title = {DGA Transformer Encoder},
-   year = {2025},
-   publisher = {HuggingFace},
-   url = {https://huggingface.co/ccss17/dga-transformer-encoder}
- }
- ```
-
- ## References
-
- - [ExtraHop DGA Training Data](https://github.com/extrahop/dga-training-data)
- - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- - [Project Repository](https://github.com/ccss17/DGA-Transformer-Encoder)
-
- ## License
-
- MIT License
-
- ---
-
- **Built with ❤️ using PyTorch, HuggingFace Transformers, and Gradio**