---
language: en
license: apache-2.0
tags:
- cybersecurity
- log-analysis
- threat-detection
- roberta
---

# cyber_threat_log_classifier

## Overview

This model is a RoBERTa-base classifier fine-tuned to detect malicious patterns in raw HTTP server logs and system audit trails. It identifies common web-based attacks such as SQL injection and cross-site scripting (XSS) with high precision, so its labels can feed real-time security orchestration.

## Model Architecture

The model uses a Transformer encoder (RoBERTa-base) followed by a classification head:

- **Encoder:** 12-layer Transformer with a hidden size of 768 and 12 attention heads per layer.
- **Input:** Tokenized raw log strings, up to 512 tokens per sequence.
- **Classification Head:** A linear layer over the hidden state of the first token (`<s>`, RoBERTa's counterpart to BERT's `[CLS]`) that maps it to 5 threat categories.
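As a rough illustration of the classification head described above, the sketch below applies a single linear layer to a 768-dimensional pooled vector and turns the 5 logits into probabilities with a softmax. It is written in plain Python for readability and is not the model's actual implementation: `linear_head_param_count`, `softmax`, and `classify` are hypothetical helper names, and real weights would come from the fine-tuned checkpoint.

```python
import math

HIDDEN_SIZE = 768  # encoder hidden size, from the model card
NUM_LABELS = 5     # number of threat categories, from the model card


def linear_head_param_count(hidden_size=HIDDEN_SIZE, num_labels=NUM_LABELS):
    """Weights plus biases of one linear layer hidden_size -> num_labels."""
    return hidden_size * num_labels + num_labels


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Map the pooled <s> hidden state to category probabilities.

    pooled:  HIDDEN_SIZE floats (hidden state of the first token)
    weights: NUM_LABELS rows of HIDDEN_SIZE floats (hypothetical values)
    bias:    NUM_LABELS floats
    """
    logits = [
        sum(w * h for w, h in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)
```

Under these assumptions a single linear head has 768 × 5 + 5 = 3,845 parameters. With all-zero weights and biases, every category receives the same probability, which is a quick sanity check for the shapes involved.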

## Intended Use

- **SIEM Integration:** Automated labeling of incoming logs in Security Information and Event Management systems.
- **Incident Response:** Prioritizing security alerts based on the classified threat type.
- **Log Cleaning:** Filtering out high-volume benign noise from security dashboards.
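The three uses above can be combined into one triage step. The sketch below is an illustration under stated assumptions, not a shipped integration: `triage`, `PRIORITY`, `stub_classify`, and the label names `SQL_INJECTION`, `XSS`, and `BENIGN` are hypothetical, since this card does not list the five category names. The `classify` argument is any callable that takes a list of log lines and returns one `{"label": ..., "score": ...}` dict per line, which is the shape a `transformers` text-classification pipeline returns for a list of strings.

```python
# Hypothetical label names and priorities (lower number = more urgent).
PRIORITY = {"SQL_INJECTION": 1, "XSS": 2}
DEFAULT_PRIORITY = 50  # used for any label not listed above


def triage(lines, classify, benign_label="BENIGN"):
    """Label log lines, drop benign ones, and order the rest for responders."""
    predictions = classify(lines)
    alerts = []
    for line, pred in zip(lines, predictions):
        if pred["label"] == benign_label:
            continue  # log cleaning: benign noise never reaches the dashboard
        alerts.append({"log": line, "label": pred["label"], "score": pred["score"]})
    # Incident response: most urgent category first, then highest confidence.
    alerts.sort(key=lambda a: (PRIORITY.get(a["label"], DEFAULT_PRIORITY), -a["score"]))
    return alerts


def stub_classify(lines):
    """Keyword stand-in for the real model, used only to exercise triage()."""
    results = []
    for line in lines:
        lowered = line.lower()
        if "union select" in lowered:
            results.append({"label": "SQL_INJECTION", "score": 0.97})
        elif "<script>" in lowered:
            results.append({"label": "XSS", "score": 0.91})
        else:
            results.append({"label": "BENIGN", "score": 0.99})
    return results
```

In a real deployment `stub_classify` would be replaced by the loaded model, and the priority table would be set by the security team rather than hard-coded.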

## Limitations

- **Obfuscated Payloads:** Heavily encoded or polymorphic attack payloads may evade detection if similar examples are not represented in the training data.
- **Context Window:** Request bodies or multi-line log events longer than 512 tokens are truncated, so content past the limit is never seen by the model.
- **Adversarial Examples:** Sophisticated attackers may craft "log-injection" payloads specifically designed to mislead the classifier.
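The first two limitations can be partly mitigated before a log line reaches the model. The sketch below is a hedged illustration only: `decode_layers` undoes nested percent-encoding with the standard library's `urllib.parse.unquote`, and `sliding_windows` splits an over-long token sequence into overlapping windows instead of truncating it. Both helper names, the `max_rounds` cap, and the window and stride sizes are assumptions. Whether decoded input actually helps depends on how the training data was prepared, which this card does not state.

```python
from urllib.parse import unquote

# 512-token limit from the model card, minus the <s> and </s> tokens
# that a RoBERTa tokenizer adds around each sequence.
WINDOW_SIZE = 510


def decode_layers(text, max_rounds=5):
    """Undo nested percent-encoding, e.g. %2527 -> %27 -> ' (at most max_rounds passes)."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text


def sliding_windows(token_ids, size=WINDOW_SIZE, stride=255):
    """Split a token sequence into overlapping windows of at most `size` tokens."""
    if size <= 0 or not (0 < stride <= size):
        raise ValueError("require size > 0 and 0 < stride <= size")
    if len(token_ids) <= size:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + size]))
        if start + size >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride
    return windows
```

Each window can then be classified separately and the results combined, for example by keeping the most severe label found in any window. Neither helper addresses adversarial log-injection payloads.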