|
|
--- |
|
|
library_name: transformers |
|
|
license: mit |
|
|
base_model: |
|
|
- mistralai/Mistral-7B-Instruct-v0.3 |
|
|
--- |
|
|
|
|
|
# 🛡️ SIEM Multisource Log Generator |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://img.shields.io/badge/Transformers-HuggingFace-yellow" /> |
|
|
<img src="https://img.shields.io/badge/Base%20Model-Mistral--7B-blue" /> |
|
|
<img src="https://img.shields.io/badge/License-MIT-green" /> |
|
|
<img src="https://img.shields.io/badge/Domain-Cybersecurity-red" /> |
|
|
</p> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Why This Model? |
|
|
|
|
|
Security teams need **large volumes of realistic SIEM data** to build, test, and validate detections — but real logs are often **sensitive, restricted, or unavailable**. |
|
|
|
|
|
The **SIEM Multisource Log Generator** produces **high-quality synthetic security logs** that resemble real-world telemetry across multiple sources, enabling safer experimentation, faster iteration, and better analyst training without touching production data. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📌 Model Summary |
|
|
|
|
|
The **SIEM Multisource Log Generator** is a transformer-based language model fine-tuned to generate **synthetic Security Information and Event Management (SIEM) logs**. It is designed for cybersecurity research, detection engineering, SOC training, and SIEM validation workflows in non-production environments. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Model Details |
|
|
|
|
|
- **Developed by:** Adarsh Ranjan |
|
|
- **Model type:** Transformer-based causal language model |
|
|
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.3` |
|
|
- **Language:** English |
|
|
- **License:** MIT |
|
|
- **Framework:** 🤗 Hugging Face Transformers |
|
|
- **Model format:** Safetensors |
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 Associated Papers |
|
|
|
|
|
This model is fine-tuned from **Mistral 7B** and is grounded in the following foundational research: |
|
|
|
|
|
- **Mistral 7B** |
|
|
*Mistral 7B* — Efficient, high-performance open-weight language model |
|
|
https://huggingface.co/papers/2310.06825 |
|
|
|
|
|
- **QLoRA**


  *QLoRA: Efficient Finetuning of Quantized LLMs* — Memory-efficient instruction fine-tuning of large language models


  https://huggingface.co/papers/2305.14314
|
|
|
|
|
These papers describe the base architecture and training philosophy underlying the model. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📖 What Does It Generate? |
|
|
|
|
|
The model produces **structured and semi-structured SIEM-style logs**, including signals from: |
|
|
|
|
|
- 🔐 Authentication and identity systems |
|
|
- 🌐 Firewalls and network devices |
|
|
- 💻 Endpoint and host-based agents |
|
|
|
|
|
All outputs are **fully synthetic** and safe for testing and research. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Intended Use |
|
|
|
|
|
### ✅ Direct Use |
|
|
- Synthetic SIEM log generation |
|
|
- Detection rule and alert testing |
|
|
- Security analytics experimentation |
|
|
- SOC analyst training and simulations |
|
|
|
|
|
### 🔁 Downstream Use |
|
|
- Fine-tuning for organization-specific log formats |
|
|
- Integration into SIEM test or staging environments |
|
|
|
|
|
### 🚫 Out-of-Scope Use |
|
|
- Production ingestion of real security logs |
|
|
- Automated security decisions without human oversight |
|
|
- Real-world attack execution or facilitation |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Bias, Risks, and Limitations |
|
|
|
|
|
- Synthetic logs may not fully capture real attacker behavior |
|
|
- Rare or advanced attack techniques may be underrepresented |
|
|
- Benchmarks are qualitative and task-oriented |
|
|
- Outputs should be reviewed by cybersecurity professionals |
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Getting Started |
|
|
|
|
|
### 📦 Installation |
|
|
|
|
|
```bash |
|
|
pip install torch transformers safetensors
|
|
``` |
|
|
|
|
|
### 🧪 Basic Usage Example |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
model_name = "adarsh-aur/siem-multisource-log-generator" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained(model_name) |
|
|
|
|
|
prompt = ( |
|
|
"Generate SIEM logs for a suspicious login scenario.\n" |
|
|
"Include timestamp, source IP, username, host, and outcome." |
|
|
) |
|
|
|
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
outputs = model.generate(**inputs, max_new_tokens=256)
|
|
|
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Chain-of-Thought–Safe Prompting (Recommended) |
|
|
|
|
|
To remain policy-safe and improve output quality, **avoid asking for reasoning or explanations**. |
|
|
Instead, request **structured outputs directly**. |
|
|
|
|
|
### ✅ Preferred |
|
|
``` |
|
|
Generate SIEM logs showing a brute-force login attempt. |
|
|
Return only the logs in JSON format. |
|
|
``` |
|
|
|
|
|
### ❌ Avoid |
|
|
``` |
|
|
Explain step by step how an attacker performs a brute-force attack. |
|
|
``` |
|
|
|
|
|
The model is optimized for **output generation**, not procedural reasoning. |
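Because the preferred prompts ask for logs in JSON only, a lightweight post-processing step can reject malformed generations before they reach a test SIEM. A minimal sketch, assuming the model was asked for a JSON array; the `extract_json_logs` helper and the required field names are illustrative, not part of the model:

```python
import json

# Fields each synthetic record is expected to carry (illustrative choice,
# matching the basic usage prompt above).
REQUIRED_FIELDS = {"timestamp", "src_ip", "user", "host", "outcome"}

def extract_json_logs(text: str) -> list[dict]:
    """Parse model output as a JSON array of log records and keep only
    records that contain every required field."""
    try:
        records = json.loads(text)
    except json.JSONDecodeError:
        return []
    if isinstance(records, dict):
        records = [records]
    return [r for r in records if isinstance(r, dict) and REQUIRED_FIELDS <= r.keys()]

# Example output: one complete record, one missing fields.
raw = (
    '[{"timestamp": "2025-01-01T00:00:00Z", "src_ip": "10.0.0.5", '
    '"user": "alice", "host": "srv01", "outcome": "failure"},'
    ' {"src_ip": "10.0.0.6", "user": "bob"}]'
)
print(len(extract_json_logs(raw)))  # 1
```

Dropping invalid records rather than repairing them keeps the pipeline simple; a retry with the same prompt usually yields a parseable batch.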
|
|
|
|
|
--- |
|
|
|
|
|
## 🧩 Prompt Templates |
|
|
|
|
|
### 🔐 Authentication Anomaly |
|
|
``` |
|
|
Generate SIEM logs for multiple failed login attempts followed by a success. |
|
|
Include timestamp, username, source IP, host, and result. |
|
|
``` |
|
|
|
|
|
### 🌐 Firewall Activity |
|
|
``` |
|
|
Generate firewall logs showing blocked outbound traffic to malicious IPs. |
|
|
Include rule_id, destination_ip, port, protocol, and action. |
|
|
``` |
|
|
|
|
|
### 💻 Endpoint Detection |
|
|
``` |
|
|
Generate endpoint logs for suspicious PowerShell execution. |
|
|
Include process_name, command_line, parent_process, and severity. |
|
|
``` |
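The templates above can be kept in code and filled programmatically when generating batches of scenarios. A minimal sketch; the scenario keys and the `build_prompt` helper are illustrative, not part of the model's API:

```python
# Prompt templates keyed by scenario, mirroring the sections above.
TEMPLATES = {
    "auth_anomaly": (
        "Generate SIEM logs for multiple failed login attempts followed by a success.\n"
        "Include timestamp, username, source IP, host, and result."
    ),
    "firewall": (
        "Generate firewall logs showing blocked outbound traffic to malicious IPs.\n"
        "Include rule_id, destination_ip, port, protocol, and action."
    ),
    "endpoint": (
        "Generate endpoint logs for suspicious PowerShell execution.\n"
        "Include process_name, command_line, parent_process, and severity."
    ),
}

def build_prompt(scenario: str, output_format: str = "JSON") -> str:
    """Return the template for a scenario, with an explicit output-format line."""
    if scenario not in TEMPLATES:
        raise KeyError(f"unknown scenario: {scenario}")
    return f"{TEMPLATES[scenario]}\nReturn only the logs in {output_format} format."

print(build_prompt("firewall"))
```

Appending the "Return only the logs in ... format" line keeps every prompt aligned with the chain-of-thought-safe guidance above.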
|
|
|
|
|
--- |
|
|
|
|
|
## 📈 Detection Rule Examples |
|
|
|
|
|
### Example: Brute Force Detection (Pseudo-SPL) |
|
|
|
|
|
``` |
|
|
index=auth_logs action=failure |
|
|
| stats count by src_ip, user |
|
|
| where count > 5 |
|
|
``` |
|
|
|
|
|
### Example: Suspicious PowerShell Execution |
|
|
|
|
|
``` |
|
|
index=endpoint_logs process_name="powershell.exe" |
|
|
| search command_line="*EncodedCommand*" |
|
|
``` |
|
|
|
|
|
These rules can be validated using synthetic logs generated by this model. |
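The brute-force rule above reduces to a group-and-count over failed logins, so the same logic can be checked against synthetic records in plain Python before it is deployed to a SIEM. A minimal sketch; the record fields follow the pseudo-SPL above, and the `detect_brute_force` helper is illustrative:

```python
from collections import Counter

def detect_brute_force(events, threshold=5):
    """Mirror of the pseudo-SPL rule: count auth failures per
    (src_ip, user) pair and flag pairs exceeding the threshold."""
    counts = Counter(
        (e["src_ip"], e["user"])
        for e in events
        if e.get("action") == "failure"
    )
    return {pair: n for pair, n in counts.items() if n > threshold}

# Synthetic events: six failures from one pair, one from another,
# and a success that the filter ignores.
events = [{"src_ip": "10.0.0.9", "user": "alice", "action": "failure"}] * 6
events += [{"src_ip": "10.0.0.7", "user": "bob", "action": "failure"}]
events += [{"src_ip": "10.0.0.9", "user": "alice", "action": "success"}]

print(detect_brute_force(events))  # {('10.0.0.9', 'alice'): 6}
```

Running the Python mirror and the SPL rule over the same synthetic batch is a quick way to confirm the generated logs actually trigger the detection.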
|
|
|
|
|
--- |
|
|
|
|
|
## 🔬 Benchmarks (Qualitative) |
|
|
|
|
|
| Task | Result Summary | |
|
|
|------------------------------|------------------------------------| |
|
|
| Log Structure Consistency | High | |
|
|
| Field Coherence | High | |
|
|
| Scenario Diversity | Medium–High | |
|
|
| Detection Rule Compatibility | High | |
|
|
|
|
|
> Note: Benchmarks are qualitative and based on domain inspection. No automated scoring metrics are published. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧾 Dataset Card (Embedded) |
|
|
|
|
|
### Dataset Description |
|
|
- **Type:** Synthetic text-based SIEM logs |
|
|
- **Sources:** Authentication, network, endpoint-style events |
|
|
- **Sensitive Data:** None (fully synthetic) |
|
|
|
|
|
### Dataset Usage |
|
|
- SIEM testing and validation |
|
|
- Detection engineering |
|
|
- Cybersecurity research and education |
|
|
|
|
|
### Dataset Limitations |
|
|
- May not reflect organization-specific schemas |
|
|
- Rare attack patterns may be underrepresented |
|
|
|
|
|
--- |
|
|
|
|
|
## 🏋️ Training Details |
|
|
|
|
|
Exact training datasets, preprocessing steps, and hyperparameters have not been publicly disclosed. |
|
|
The model is assumed to be fine-tuned on curated or synthetic SIEM-style log text. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🌱 Environmental Impact |
|
|
|
|
|
Training-related carbon emissions were not recorded. |
|
|
|
|
|
Environmental impact can be estimated using: |
|
|
Lacoste et al. (2019), *Quantifying the Carbon Emissions of Machine Learning* |
|
|
https://arxiv.org/abs/1910.09700 |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚙️ Technical Specifications |
|
|
|
|
|
- **Architecture:** Transformer-based causal language model |
|
|
- **Objective:** Synthetic SIEM log generation |
|
|
- **Software:** Python, PyTorch, Hugging Face Transformers |
|
|
- **Hardware:** Not publicly documented |
|
|
|
|
|
--- |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{siem_multisource_log_generator, |
|
|
title={SIEM Multisource Log Generator}, |
|
|
author={Adarsh Ranjan}, |
|
|
year={2025}, |
|
|
howpublished={Hugging Face Model Hub}, |
|
|
url={https://huggingface.co/adarsh-aur/siem-multisource-log-generator} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 👤 Model Card Author |
|
|
|
|
|
Adarsh Ranjan |
|
|
|
|
|
--- |
|
|
|
|
|
## 💬 Contact |
|
|
|
|
|
For questions, feedback, or contributions, please use the Hugging Face model repository discussion page. |
|
|
|