---
library_name: transformers
license: mit
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
# 🛡️ SIEM Multisource Log Generator
<p align="center">
<img src="https://img.shields.io/badge/Transformers-HuggingFace-yellow" />
<img src="https://img.shields.io/badge/Base%20Model-Mistral--7B-blue" />
<img src="https://img.shields.io/badge/License-MIT-green" />
<img src="https://img.shields.io/badge/Domain-Cybersecurity-red" />
</p>
---
## 🚀 Why This Model?
Security teams need **large volumes of realistic SIEM data** to build, test, and validate detections — but real logs are often **sensitive, restricted, or unavailable**.
The **SIEM Multisource Log Generator** produces **high-quality synthetic security logs** that resemble real-world telemetry across multiple sources, enabling safer experimentation, faster iteration, and better analyst training without touching production data.
---
## 📌 Model Summary
The **SIEM Multisource Log Generator** is a transformer-based language model fine-tuned to generate **synthetic Security Information and Event Management (SIEM) logs**. It is designed for cybersecurity research, detection engineering, SOC training, and SIEM validation workflows in non-production environments.
---
## 🧠 Model Details
- **Developed by:** Adarsh Ranjan
- **Model type:** Transformer-based causal language model
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.3`
- **Language:** English
- **License:** MIT
- **Framework:** 🤗 Hugging Face Transformers
- **Model format:** Safetensors
---
## 📄 Associated Papers
This model is fine-tuned from **Mistral 7B** and is grounded in the following foundational research:
- **Mistral 7B**
*Mistral 7B* — Efficient, high-performance open-weight language model
https://huggingface.co/papers/2310.06825
- **Efficient Instruction Fine-Tuning**
*QLoRA: Efficient Finetuning of Quantized LLMs*
https://huggingface.co/papers/2305.14314
These papers describe the base architecture and training philosophy underlying the model.
---
## 📖 What Does It Generate?
The model produces **structured and semi-structured SIEM-style logs**, including signals from:
- 🔐 Authentication and identity systems
- 🌐 Firewalls and network devices
- 💻 Endpoint and host-based agents
All outputs are **fully synthetic** and safe for testing and research.
---
## 🎯 Intended Use
### ✅ Direct Use
- Synthetic SIEM log generation
- Detection rule and alert testing
- Security analytics experimentation
- SOC analyst training and simulations
### 🔁 Downstream Use
- Fine-tuning for organization-specific log formats
- Integration into SIEM test or staging environments
### 🚫 Out-of-Scope Use
- Production ingestion of real security logs
- Automated security decisions without human oversight
- Real-world attack execution or facilitation
---
## ⚠️ Bias, Risks, and Limitations
- Synthetic logs may not fully capture real attacker behavior
- Rare or advanced attack techniques may be underrepresented
- Benchmarks are qualitative and task-oriented
- Outputs should be reviewed by cybersecurity professionals
---
## 🚀 Getting Started
### 📦 Installation
```bash
pip install transformers torch safetensors
```
### 🧪 Basic Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "adarsh-aur/siem-multisource-log-generator"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to fit a 7B model on common GPUs
    device_map="auto",
)

prompt = (
    "Generate SIEM logs for a suspicious login scenario.\n"
    "Include timestamp, source IP, username, host, and outcome."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens bounds only the generated text; max_length would also
# count the prompt tokens and can silently truncate longer prompts.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## 🧠 Chain-of-Thought–Safe Prompting (Recommended)
To remain policy-safe and improve output quality, **avoid asking for reasoning or explanations**.
Instead, request **structured outputs directly**.
### ✅ Preferred
```
Generate SIEM logs showing a brute-force login attempt.
Return only the logs in JSON format.
```
### ❌ Avoid
```
Explain step by step how an attacker performs a brute-force attack.
```
The model is optimized for **output generation**, not procedural reasoning.
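Because the preferred prompts ask for JSON-only output, a lightweight parse check can catch responses that drift into prose. A minimal sketch, assuming logs are returned as a single JSON document (the `extract_json_logs` helper is illustrative, not part of the model's API):

```python
import json

def extract_json_logs(text: str):
    """Parse a model response expected to contain only JSON logs.

    Returns the parsed object, or None when the response contains
    anything that is not valid JSON (e.g. leaked explanations).
    """
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

# A well-formed response parses; a response with prose does not.
good = '[{"timestamp": "2025-01-01T00:00:00Z", "result": "failure"}]'
bad = "Sure! Here are the logs:"
assert extract_json_logs(good) is not None
assert extract_json_logs(bad) is None
```

Rejecting non-JSON responses up front keeps downstream SIEM ingestion and rule testing deterministic.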
---
## 🧩 Prompt Templates
### 🔐 Authentication Anomaly
```
Generate SIEM logs for multiple failed login attempts followed by a success.
Include timestamp, username, source IP, host, and result.
```
### 🌐 Firewall Activity
```
Generate firewall logs showing blocked outbound traffic to malicious IPs.
Include rule_id, destination_ip, port, protocol, and action.
```
### 💻 Endpoint Detection
```
Generate endpoint logs for suspicious PowerShell execution.
Include process_name, command_line, parent_process, and severity.
```
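The templates above share one shape: a scenario description plus the fields the logs must include. They can be assembled programmatically so prompts stay consistent across runs. A small sketch (the `build_prompt` helper is hypothetical, not shipped with the model):

```python
def build_prompt(scenario: str, fields: list[str]) -> str:
    """Compose a generation prompt from a scenario description
    and the log fields the output must include."""
    return (
        f"Generate SIEM logs for {scenario}.\n"
        f"Include {', '.join(fields)}.\n"
        "Return only the logs in JSON format."
    )

prompt = build_prompt(
    "multiple failed login attempts followed by a success",
    ["timestamp", "username", "source_ip", "host", "result"],
)
print(prompt)
```

The trailing "Return only the logs in JSON format." line applies the chain-of-thought-safe prompting guidance above to every generated prompt.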
---
## 📈 Detection Rule Examples
### Example: Brute Force Detection (Pseudo-SPL)
```
index=auth_logs action=failure
| stats count by src_ip, user
| where count > 5
```
### Example: Suspicious PowerShell Execution
```
index=endpoint_logs process_name="powershell.exe"
| search command_line="*EncodedCommand*"
```
These rules can be validated using synthetic logs generated by this model.
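The brute-force rule above (count failures per `src_ip`/`user` pair, flag counts above 5) can also be mirrored in plain Python to validate synthetic logs offline, without a SIEM. A sketch assuming each log is a dict with `action`, `src_ip`, and `user` fields:

```python
from collections import Counter

def detect_brute_force(logs, threshold=5):
    """Flag (src_ip, user) pairs with more than `threshold` failed
    authentication events, mirroring the pseudo-SPL:
    index=auth_logs action=failure | stats count by src_ip, user | where count > 5
    """
    counts = Counter(
        (log["src_ip"], log["user"])
        for log in logs
        if log.get("action") == "failure"
    )
    return {pair: n for pair, n in counts.items() if n > threshold}

# Synthetic example: six failures from one source should trip the rule,
# and the trailing success is ignored by the failure filter.
logs = [{"action": "failure", "src_ip": "10.0.0.5", "user": "alice"}] * 6
logs.append({"action": "success", "src_ip": "10.0.0.5", "user": "alice"})
hits = detect_brute_force(logs)
print(hits)  # {('10.0.0.5', 'alice'): 6}
```

Running such a check against generated logs confirms that the synthetic data actually exercises the detection logic before it is loaded into a SIEM test environment.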
---
## 🔬 Benchmarks (Qualitative)
| Task | Result Summary |
|------------------------------|------------------------------------|
| Log Structure Consistency | High |
| Field Coherence | High |
| Scenario Diversity | Medium–High |
| Detection Rule Compatibility | High |
> Note: Benchmarks are qualitative and based on domain inspection. No automated scoring metrics are published.
---
## 🧾 Dataset Card (Embedded)
### Dataset Description
- **Type:** Synthetic text-based SIEM logs
- **Sources:** Authentication, network, endpoint-style events
- **Sensitive Data:** None (fully synthetic)
### Dataset Usage
- SIEM testing and validation
- Detection engineering
- Cybersecurity research and education
### Dataset Limitations
- May not reflect organization-specific schemas
- Rare attack patterns may be underrepresented
---
## 🏋️ Training Details
Exact training datasets, preprocessing steps, and hyperparameters have not been publicly disclosed.
The model is assumed to be fine-tuned on curated or synthetic SIEM-style log text.
---
## 🌱 Environmental Impact
Training-related carbon emissions were not recorded.
Environmental impact can be estimated using:
Lacoste et al. (2019), *Quantifying the Carbon Emissions of Machine Learning*
https://arxiv.org/abs/1910.09700
---
## ⚙️ Technical Specifications
- **Architecture:** Transformer-based causal language model
- **Objective:** Synthetic SIEM log generation
- **Software:** Python, PyTorch, Hugging Face Transformers
- **Hardware:** Not publicly documented
---
## 📚 Citation
```bibtex
@misc{siem_multisource_log_generator,
title={SIEM Multisource Log Generator},
author={Adarsh Ranjan},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/adarsh-aur/siem-multisource-log-generator}
}
```
---
## 👤 Model Card Author
Adarsh Ranjan
---
## 💬 Contact
For questions, feedback, or contributions, please use the Hugging Face model repository discussion page.