---
license: apache-2.0
tags:
- domain-generation-algorithm
- cybersecurity
- domain-classification
- security
- malware-detection
language:
- en
library_name: transformers
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
---

# ModernBERT DGA Detector

This model classifies domain names as either legitimate or generated by a Domain Generation Algorithm (DGA).

## Model Description

- **Model Type:** BERT-based sequence classification
- **Task:** Binary classification (Legitimate vs DGA domains)
- **Base Model:** answerdotai/ModernBERT-base
- **Training Data:** Labeled domain names (legitimate and DGA-generated)
- **Authors:** Reynier Leyva La O, Carlos A. Catania

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")


# Example prediction
def predict_domain(domain):
    """Classify a single domain name as DGA or LEGITIMATE."""
    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the two logits: index 0 = LEGITIMATE, index 1 = DGA
    predictions = torch.softmax(outputs.logits, dim=-1)
    legit_prob = predictions[0][0].item()
    dga_prob = predictions[0][1].item()
    return {"prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
            "confidence": max(legit_prob, dga_prob)}


# Test examples
domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
for domain in domains:
    result = predict_domain(domain)
    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")
```

## Model Architecture

The model is based on ModernBERT and fine-tuned for domain classification:
- Input: Domain names (text)
- Output: Binary classification (0=LEGITIMATE, 1=DGA)
- Max sequence length: 64 tokens
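
To make the output mapping concrete, here is a minimal sketch in plain Python of how two raw logits become a label and a probability. The helper name `logits_to_label` and the sample logits are illustrative assumptions, not part of the model's API; only the index-to-label mapping (0=LEGITIMATE, 1=DGA) comes from this card.

```python
import math

# Index-to-label mapping as stated in this model card.
ID2LABEL = {0: "LEGITIMATE", 1: "DGA"}


def logits_to_label(logits):
    """Convert a pair of raw logits into (label, probability).

    Hypothetical helper for illustration: a softmax followed by an argmax.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # On an exact tie the first index (LEGITIMATE) wins.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, chosen only to exercise both classes.
print(logits_to_label([2.0, -1.0]))
print(logits_to_label([-0.5, 3.0]))
```

On a tie the sketch falls back to LEGITIMATE, which matches the `dga_prob > legit_prob` comparison in the usage example.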

## Training Details

This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:
- Base model: answerdotai/ModernBERT-base
- Framework: Transformers/PyTorch
- Task: Binary sequence classification

## Performance

Reported evaluation metrics:
- Accuracy: 0.9658 ± 0.0153
- Precision: 0.9704 ± 0.0253
- Recall: 0.9582 ± 0.0147
- F1-Score: 0.9579 ± 0.0167
- FPR: 0.0267 ± 0.0233
- TPR: 0.9582 ± 0.0147
- Query time: 0.1226 ± 0.0253 on CPU (no GPU required)
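
For readers comparing these figures with their own evaluation, the sketch below shows how each metric is derived from confusion-matrix counts. It treats DGA as the positive class, which is an assumption since the card does not state it explicitly. The helper `binary_metrics` is hypothetical, and the counts are invented purely to exercise the formulas; they are not the model's results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from raw counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and TPR are the same quantity
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fpr,
        "tpr": recall,
    }


# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Recall and TPR are the same quantity by definition, which is why the two entries in the list above carry identical values.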

## Use Cases

- **Cybersecurity**: Detect malicious domains generated by malware
- **Network Security**: Filter potentially harmful domains
- **Threat Intelligence**: Analyze domain patterns in security feeds
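
As a rough sketch of the filtering use case, the snippet below splits a batch of domains into blocked and allowed sets using a probability threshold. Everything in it is an assumption for illustration: the `triage_domains` helper, the `score_fn` callable (in practice, a wrapper that returns the model's DGA-class probability), the 0.9 default threshold, the lowercase normalization step, and the stand-in scorer with hard-coded values.

```python
from typing import Callable, Dict, Iterable, List


def triage_domains(
    domains: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.9,
) -> Dict[str, List[str]]:
    """Split domains into 'blocked' and 'allowed' by DGA probability."""
    result: Dict[str, List[str]] = {"blocked": [], "allowed": []}
    for domain in domains:
        # Normalize: trim whitespace, lowercase, drop a trailing root dot.
        name = domain.strip().lower().rstrip(".")
        bucket = "blocked" if score_fn(name) >= threshold else "allowed"
        result[bucket].append(name)
    return result


# Stand-in scorer with made-up probabilities; NOT real model output.
_FAKE_SCORES = {"google.com": 0.02, "xkvbzpqr.net": 0.97}


def fake_score(domain: str) -> float:
    return _FAKE_SCORES.get(domain, 0.5)


print(triage_domains(["Google.com.", "xkvbzpqr.net"], fake_score))
```

Raising the threshold blocks fewer domains, trading some recall for a lower false positive rate. The right value depends on how costly a wrongly blocked legitimate domain is in a given deployment.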

## Limitations

- This model is trained specifically for domain classification
- Performance may vary on domains from different TLDs or languages
- Regular retraining may be needed as DGA techniques evolve
- Model performance depends on the quality and diversity of training data

## Citation

If you use this model in your research or applications, please cite it appropriately.

## Related Models

Check out the author's other security models:
- [Llama3_8B-DGA-Detector](https://huggingface.co/Reynier/Llama3_8B-DGA-Detector)