
	---
	license: apache-2.0
	tags:
	- domain-generation-algorithm
	- cybersecurity
	- domain-classification
	- security
	- malware-detection
	language:
	- en
	library_name: transformers
	pipeline_tag: text-classification
	base_model: answerdotai/ModernBERT-base
	---

	# ModernBERT DGA Detector

	This model is designed to classify domains as either legitimate or generated by Domain Generation Algorithms (DGA).

	## Model Description

	- Model Type: BERT-based sequence classification
	- Task: Binary classification (Legitimate vs DGA domains)
	- Base Model: ModernBERT-base
	- Training Data: Domain names dataset
	- Authors: Reynier Leyva La O, Carlos A. Catania

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("Reynier/modernbert-dga-detector")
	model = AutoModelForSequenceClassification.from_pretrained("Reynier/modernbert-dga-detector")

	# Example prediction
	def predict_domain(domain):
	    # Tokenize a single domain name; inputs longer than 64 tokens are truncated
	    inputs = tokenizer(domain, return_tensors="pt", max_length=64, truncation=True, padding=True)
	    with torch.no_grad():  # inference only, no gradients needed
	        outputs = model(**inputs)
	    predictions = torch.softmax(outputs.logits, dim=-1)
	    legit_prob = predictions[0][0].item()  # index 0 = LEGITIMATE
	    dga_prob = predictions[0][1].item()    # index 1 = DGA
	    return {"prediction": "DGA" if dga_prob > legit_prob else "LEGITIMATE",
	            "confidence": max(legit_prob, dga_prob)}

	# Test examples
	domains = ["google.com", "xkvbzpqr.net", "facebook.com", "abcdef123456.com"]
	for domain in domains:
	    result = predict_domain(domain)
	    print(f"{domain} -> {result['prediction']} (confidence: {result['confidence']:.3f})")
	```

	## Model Architecture

	The model is based on ModernBERT and fine-tuned for domain classification:
	- Input: Domain names (text)
	- Output: Binary classification (0=LEGITIMATE, 1=DGA)
	- Max sequence length: 64 tokens
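
The output mapping above can be illustrated without loading the model. The following is a minimal sketch in plain Python that turns a pair of raw logits into a label and a probability, assuming the index order stated in this card (0=LEGITIMATE, 1=DGA). The helper names `softmax` and `decode` and the example logits are hypothetical and are not part of the model's API.

```python
import math

# Label order as stated in this model card: index 0 = LEGITIMATE, index 1 = DGA.
ID2LABEL = {0: "LEGITIMATE", 1: "DGA"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map raw logits to (label, probability of that label)."""
    probs = softmax(logits)
    # Pick the index with the highest probability; ties resolve to the lower index.
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]


# Hypothetical logits for illustration only, not real model output.
label, prob = decode([-1.2, 2.3])
print(label, round(prob, 3))
```

In practice the logits would come from `outputs.logits` of the fine-tuned model, and `torch.softmax` performs the same normalization.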

	## Training Details

	This model was fine-tuned on a dataset of legitimate and DGA-generated domains using:
	- Base model: answerdotai/ModernBERT-base
	- Framework: Transformers/PyTorch
	- Task: Binary sequence classification

	## Performance

	Reported evaluation metrics:
	- Accuracy: 0.9658 ± 0.0153
	- Precision: 0.9704 ± 0.0253
	- Recall: 0.9582 ± 0.0147
	- F1-Score: 0.9579 ± 0.0167
	- FPR: 0.0267 ± 0.0233
	- TPR: 0.9582 ± 0.0147
	- Query time: 0.1226 ± 0.0253 on CPU (no GPU required)
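
To put the reported rates in operational terms, the sketch below converts the mean FPR and TPR listed above into expected alert counts for a given traffic mix. The traffic volumes are hypothetical, and the function name `expected_alerts` is introduced here only for illustration. It also assumes the reported rates carry over to the deployment traffic, which this card does not establish.

```python
# Mean rates taken from the Performance list above.
FPR = 0.0267  # share of legitimate domains flagged as DGA
TPR = 0.9582  # share of DGA domains correctly flagged


def expected_alerts(n_legit, n_dga, fpr=FPR, tpr=TPR):
    """Estimate alert counts for a traffic mix, assuming the rates hold."""
    false_pos = n_legit * fpr
    true_pos = n_dga * tpr
    flagged = false_pos + true_pos
    return {
        "false_positives": false_pos,
        "true_positives": true_pos,
        "missed_dga": n_dga - true_pos,
        # Share of flagged domains that are really DGA at this mix.
        "precision_at_mix": true_pos / flagged if flagged else 0.0,
    }


# Hypothetical mix: 100,000 legitimate lookups and 100 DGA lookups.
stats = expected_alerts(100_000, 100)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

When DGA domains are rare, false positives can outnumber true positives even at a low FPR, so the precision observed in deployment may be far below the figure reported on the evaluation set.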

	## Use Cases

	- Cybersecurity: Detect malicious domains generated by malware
	- Network Security: Filter potentially harmful domains
	- Threat Intelligence: Analyze domain patterns in security feeds
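
For the filtering use case, a minimal sketch of thresholding a stream of domains is shown below. The scorer is passed in as a callable so the logic can be tried without downloading the model. `flag_domains`, `toy_score`, and the score table are hypothetical stand-ins. In a real pipeline the scorer would presumably wrap the model and return the softmax probability at index 1 (DGA).

```python
from typing import Callable, Iterable, List, Tuple


def flag_domains(
    domains: Iterable[str],
    dga_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (domain, score) pairs at or above the threshold, highest first."""
    flagged = []
    for domain in domains:
        score = dga_score(domain)
        if score >= threshold:
            flagged.append((domain, score))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for demonstration only; these are NOT model outputs.
HYPOTHETICAL_SCORES = {
    "google.com": 0.02,
    "xkvbzpqr.net": 0.97,
    "facebook.com": 0.01,
    "abcdef123456.com": 0.81,
}


def toy_score(domain: str) -> float:
    """Stand-in scorer that looks up a fixed table instead of running the model."""
    return HYPOTHETICAL_SCORES.get(domain, 0.0)


for domain, score in flag_domains(HYPOTHETICAL_SCORES, toy_score, threshold=0.9):
    print(f"{domain} flagged (score {score:.2f})")
```

Raising the threshold trades recall for fewer false alarms, which may be worth tuning when flagged domains are blocked rather than merely logged.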

	## Limitations

	- This model is trained specifically for domain classification
	- Performance may vary on domains from different TLDs or languages
	- Regular retraining may be needed as DGA techniques evolve
	- Model performance depends on the quality and diversity of training data
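
One way to check the TLD caveat on your own data is to break accuracy down by top-level domain. The sketch below assumes you already have labeled predictions as `(domain, true_label, predicted_label)` tuples. The function name `accuracy_by_tld` and the sample records are hypothetical. It groups by the last dot-separated label only, so multi-part suffixes such as `co.uk` are counted under `uk`.

```python
from collections import defaultdict


def accuracy_by_tld(records):
    """Compute accuracy per TLD from (domain, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, truth, predicted in records:
        # Drop a trailing root dot, then take the last label as the TLD.
        tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
        totals[tld] += 1
        correct[tld] += int(truth == predicted)
    return {tld: correct[tld] / totals[tld] for tld in totals}


# Hypothetical labeled predictions for illustration only.
records = [
    ("google.com", "LEGITIMATE", "LEGITIMATE"),
    ("abcdef123456.com", "DGA", "LEGITIMATE"),
    ("xkvbzpqr.net", "DGA", "DGA"),
    ("example.NET.", "LEGITIMATE", "LEGITIMATE"),
]

for tld, acc in sorted(accuracy_by_tld(records).items()):
    print(f".{tld}: {acc:.2f}")
```

A TLD whose accuracy falls well below the overall figure may indicate that it was underrepresented in training and could be a candidate for additional data when retraining.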

	## Citation

	If you use this model in your research or applications, please cite it appropriately.

	## Related Models

	Check out the author's other security models:
	- [Llama3_8B-DGA-Detector](https://huggingface.co/Reynier/Llama3_8B-DGA-Detector)