ccss17/dga-transformer-encoder

Safetensors

dga_encoder

ccss17 committed on Nov 3, 2025

Commit 1b99900 (verified) · 1 Parent(s): 83ca46c

Update README.md

Files changed (1)

README.md +0 -172

@@ -1,172 +0,0 @@
----
-tags:
-- security
-- dga-detection
-- malware
-- cybersecurity
-- domain-classification
-- transformer
-license: mit
-datasets:
-- extrahop/dga-training-data
-metrics:
-- f1
-- accuracy
-- precision
-- recall
-model-index:
-- name: dga-transformer-encoder
-  results:
-  - task:
-      type: text-classification
-      name: Domain Classification
-    dataset:
-      name: ExtraHop DGA Dataset
-      type: extrahop/dga-training-data
-    metrics:
-    - type: f1
-      value: 0.9678
-      name: F1 Score
-    - type: accuracy
-      value: 0.9678
-      name: Accuracy
----
-# DGA Transformer Encoder
-A custom transformer-based model for detecting domains produced by Domain Generation Algorithms (DGAs), which malware uses to locate its command-and-control (C2) infrastructure.
-## Model Details
-- **Architecture**: Custom Transformer Encoder (4 layers, 256 dimensions, 4 attention heads)
-- **Parameters**: 3.2M
-- **Training Data**: ExtraHop DGA dataset (500K balanced samples)
-- **Performance**: 96.78% F1 score on test set
-- **Inference Speed**: <1ms per domain (GPU), ~10ms (CPU)
-## Usage
-```python
-from transformers import AutoModelForSequenceClassification
-import torch
-# Character encoding
-CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789-."
-CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARSET)}
-PAD = 0
-def encode_domain(domain: str, max_len: int = 64):
-    ids = [CHAR_TO_IDX.get(c, PAD) for c in domain.lower()]
-    ids = ids[:max_len]
-    ids = ids + [PAD] * (max_len - len(ids))
-    return ids
-# Load model
-model = AutoModelForSequenceClassification.from_pretrained("ccss17/dga-transformer-encoder")
-model.eval()
-# Classify a domain
-def predict(domain: str):
-    input_ids = torch.tensor([encode_domain(domain, max_len=64)])
-    with torch.no_grad():
-        logits = model(input_ids).logits
-        probs = torch.softmax(logits, dim=-1)
-        pred = torch.argmax(probs, dim=-1).item()
-    label = "Legitimate" if pred == 0 else "DGA (Malicious)"
-    confidence = probs[0, pred].item()
-    return label, confidence
-# Examples
-print(predict("google.com"))        # ('Legitimate', 0.998)
-print(predict("xjkd8f2h.com"))      # ('DGA (Malicious)', 0.976)
-```
-## Try it on HuggingFace Spaces
-🚀 [Interactive Demo](https://huggingface.co/spaces/ccss17/dga-detector)
-## Training Details
-- **Framework**: PyTorch + HuggingFace Transformers
-- **Optimizer**: AdamW
-- **Learning Rate**: 3e-4 with linear warmup
-- **Batch Size**: 2048 effective (reached via gradient accumulation)
-- **Epochs**: up to 5 (early stopping triggered at epoch 2.4)
-- **Loss**: CrossEntropyLoss
-## Model Architecture
-```
-Input: Domain string (e.g., "google.com")
-  ↓
-Character Tokenization: [g, o, o, g, l, e, ., c, o, m]
-  ↓
-Embedding Layer: 256-dim vectors
-  ↓
-Positional Encoding: Add position information
-  ↓
-Transformer Encoder (4 layers):
-  - Multi-head Self-Attention (4 heads)
-  - Feed-Forward Network (1024 hidden)
-  - Layer Normalization
-  - Residual Connections
-  ↓
-[CLS] Token Pooling: Extract sequence representation
-  ↓
-Classification Head: Linear(256 → 2)
-  ↓
-Output: [P(Legitimate), P(DGA)]
-```
-## Performance
-| Metric | Score |
-|--------|-------|
-| F1 Score (Macro) | 96.78% |
-| F1 Score (Binary) | 96.78% |
-| Accuracy | 96.78% |
-| Precision | 96.5% |
-| Recall | 97.1% |
-**Confusion Matrix** (Test Set):
-|                | Predicted Legit | Predicted DGA |
-|----------------|----------------|---------------|
-| **True Legit** | 24,180         | 820           |
-| **True DGA**   | 790            | 24,210        |
-## Limitations
-- Trained primarily on English domains
-- May not generalize to all DGA families (e.g., dictionary-based DGAs)
-- Expects a bare domain name (no protocol or path) for best performance
-- ~3.3% false positive rate (820 of 25,000 legitimate test domains flagged as DGA)
-## Citation
-If you use this model, please cite:
-```bibtex
-@misc{dga-transformer-encoder,
-  author = {ccss17},
-  title = {DGA Transformer Encoder},
-  year = {2025},
-  publisher = {HuggingFace},
-  url = {https://huggingface.co/ccss17/dga-transformer-encoder}
-}
-```
-## References
-- [ExtraHop DGA Training Data](https://github.com/extrahop/dga-training-data)
-- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
-- [Project Repository](https://github.com/ccss17/DGA-Transformer-Encoder)
-## License
-MIT License
----
-**Built with ❤️ using PyTorch, HuggingFace Transformers, and Gradio**
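
The Performance table in the removed README can be cross-checked against its own confusion matrix. The sketch below recomputes the metrics from the four counts, assuming DGA is the positive class (the README does not say so explicitly); the variable names are illustrative and not taken from the project.

```python
# Counts copied from the confusion matrix in the removed README.
# Assumption: DGA is treated as the positive class.
tn, fp = 24_180, 820     # true legitimate: predicted legit, predicted DGA
fn, tp = 790, 24_210     # true DGA: predicted legit, predicted DGA

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1 (binary)", f1),
    ("false positive rate", false_positive_rate),
]:
    print(f"{name}: {value:.2%}")
# accuracy: 96.78%
# precision: 96.72%
# recall: 96.84%
# f1 (binary): 96.78%
# false positive rate: 3.28%
```

Under that assumption, accuracy and F1 reproduce the 96.78% in the table, and the false positive rate agrees with the roughly 3% noted under Limitations. Precision and recall come out near 96.7% and 96.8%, slightly different from the 96.5% and 97.1% listed, so those two rows may have been computed on a different split or rounded differently.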