---
library_name: transformers
license: mit
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
---
# 🛡️ SIEM Multisource Log Generator
<p align="center">
<img src="https://img.shields.io/badge/Transformers-HuggingFace-yellow" />
<img src="https://img.shields.io/badge/Base%20Model-Mistral--7B-blue" />
<img src="https://img.shields.io/badge/License-MIT-green" />
<img src="https://img.shields.io/badge/Domain-Cybersecurity-red" />
</p>
---
## 🚀 Why This Model?
Security teams need **large volumes of realistic SIEM data** to build, test, and validate detections — but real logs are often **sensitive, restricted, or unavailable**.
The **SIEM Multisource Log Generator** produces **high-quality synthetic security logs** that resemble real-world telemetry across multiple sources, enabling safer experimentation, faster iteration, and better analyst training without touching production data.
---
## 📌 Model Summary
The **SIEM Multisource Log Generator** is a transformer-based language model fine-tuned to generate **synthetic Security Information and Event Management (SIEM) logs**. It is designed for cybersecurity research, detection engineering, SOC training, and SIEM validation workflows in non-production environments.
---
## 🧠 Model Details
- **Developed by:** Adarsh Ranjan
- **Model type:** Transformer-based causal language model
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.3`
- **Language:** English
- **License:** MIT
- **Framework:** 🤗 Hugging Face Transformers
- **Model format:** Safetensors
---
## 📄 Associated Papers
This model is fine-tuned from **Mistral 7B** and is grounded in the following foundational research:
- **Mistral 7B**
*Mistral 7B* — Efficient, high-performance open-weight language model
https://huggingface.co/papers/2310.06825
- **Efficient Instruction Fine-Tuning**
*QLoRA: Efficient Finetuning of Quantized LLMs*
https://huggingface.co/papers/2305.14314
These papers describe the base architecture and training philosophy underlying the model.
---
## 📖 What Does It Generate?
The model produces **structured and semi-structured SIEM-style logs**, including signals from:
- 🔐 Authentication and identity systems
- 🌐 Firewalls and network devices
- 💻 Endpoint and host-based agents
All outputs are **fully synthetic** and safe for testing and research.
---
## 🎯 Intended Use
### ✅ Direct Use
- Synthetic SIEM log generation
- Detection rule and alert testing
- Security analytics experimentation
- SOC analyst training and simulations
### 🔁 Downstream Use
- Fine-tuning for organization-specific log formats
- Integration into SIEM test or staging environments
### 🚫 Out-of-Scope Use
- Production ingestion of real security logs
- Automated security decisions without human oversight
- Real-world attack execution or facilitation
---
## ⚠️ Bias, Risks, and Limitations
- Synthetic logs may not fully capture real attacker behavior
- Rare or advanced attack techniques may be underrepresented
- Benchmarks are qualitative and task-oriented
- Outputs should be reviewed by cybersecurity professionals
---
## 🚀 Getting Started
### 📦 Installation
```bash
pip install transformers torch safetensors
```
### 🧪 Basic Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "adarsh-aur/siem-multisource-log-generator"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to fit a 7B model on common GPUs
    device_map="auto",
)

prompt = (
    "Generate SIEM logs for a suspicious login scenario.\n"
    "Include timestamp, source IP, username, host, and outcome."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# max_new_tokens bounds only the generated text; max_length would also
# count the prompt tokens and can silently truncate longer prompts.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## 🧠 Chain-of-Thought–Safe Prompting (Recommended)
To remain policy-safe and improve output quality, **avoid asking for reasoning or explanations**.
Instead, request **structured outputs directly**.
### ✅ Preferred
```
Generate SIEM logs showing a brute-force login attempt.
Return only the logs in JSON format.
```
### ❌ Avoid
```
Explain step by step how an attacker performs a brute-force attack.
```
The model is optimized for **output generation**, not procedural reasoning.
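Because the preferred prompts ask for JSON-only output, a lightweight parse check can catch responses that drift into prose. A minimal sketch, assuming logs are returned as a single JSON document (the `extract_json_logs` helper is illustrative, not part of the model's API):

```python
import json

def extract_json_logs(text: str):
    """Parse a model response expected to contain only JSON logs.

    Returns the parsed object, or None when the response contains
    anything that is not valid JSON (e.g. leaked explanations).
    """
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        return None

# A well-formed response parses; a response with prose does not.
good = '[{"timestamp": "2025-01-01T00:00:00Z", "result": "failure"}]'
bad = "Sure! Here are the logs:"
assert extract_json_logs(good) is not None
assert extract_json_logs(bad) is None
```

Rejecting non-JSON responses up front keeps downstream SIEM ingestion and rule testing deterministic.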
---
## 🧩 Prompt Templates
### 🔐 Authentication Anomaly
```
Generate SIEM logs for multiple failed login attempts followed by a success.
Include timestamp, username, source IP, host, and result.
```
### 🌐 Firewall Activity
```
Generate firewall logs showing blocked outbound traffic to malicious IPs.
Include rule_id, destination_ip, port, protocol, and action.
```
### 💻 Endpoint Detection
```
Generate endpoint logs for suspicious PowerShell execution.
Include process_name, command_line, parent_process, and severity.
```
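The templates above share one shape: a scenario description plus the fields the logs must include. They can be assembled programmatically so prompts stay consistent across runs. A small sketch (the `build_prompt` helper is hypothetical, not shipped with the model):

```python
def build_prompt(scenario: str, fields: list[str]) -> str:
    """Compose a generation prompt from a scenario description
    and the log fields the output must include."""
    return (
        f"Generate SIEM logs for {scenario}.\n"
        f"Include {', '.join(fields)}.\n"
        "Return only the logs in JSON format."
    )

prompt = build_prompt(
    "multiple failed login attempts followed by a success",
    ["timestamp", "username", "source_ip", "host", "result"],
)
print(prompt)
```

The trailing "Return only the logs in JSON format." line applies the chain-of-thought-safe prompting guidance above to every generated prompt.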
---
## 📈 Detection Rule Examples
### Example: Brute Force Detection (Pseudo-SPL)
```
index=auth_logs action=failure
| stats count by src_ip, user
| where count > 5
```
### Example: Suspicious PowerShell Execution
```
index=endpoint_logs process_name="powershell.exe"
| search command_line="*EncodedCommand*"
```
These rules can be validated using synthetic logs generated by this model.
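The brute-force rule above (count failures per `src_ip`/`user` pair, flag counts above 5) can also be mirrored in plain Python to validate synthetic logs offline, without a SIEM. A sketch assuming each log is a dict with `action`, `src_ip`, and `user` fields:

```python
from collections import Counter

def detect_brute_force(logs, threshold=5):
    """Flag (src_ip, user) pairs with more than `threshold` failed
    authentication events, mirroring the pseudo-SPL:
    index=auth_logs action=failure | stats count by src_ip, user | where count > 5
    """
    counts = Counter(
        (log["src_ip"], log["user"])
        for log in logs
        if log.get("action") == "failure"
    )
    return {pair: n for pair, n in counts.items() if n > threshold}

# Synthetic example: six failures from one source should trip the rule,
# and the trailing success is ignored by the failure filter.
logs = [{"action": "failure", "src_ip": "10.0.0.5", "user": "alice"}] * 6
logs.append({"action": "success", "src_ip": "10.0.0.5", "user": "alice"})
hits = detect_brute_force(logs)
print(hits)  # {('10.0.0.5', 'alice'): 6}
```

Running such a check against generated logs confirms that the synthetic data actually exercises the detection logic before it is loaded into a SIEM test environment.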
---
## 🔬 Benchmarks (Qualitative)
| Task | Result Summary |
|------------------------------|------------------------------------|
| Log Structure Consistency | High |
| Field Coherence | High |
| Scenario Diversity | Medium–High |
| Detection Rule Compatibility | High |
> Note: Benchmarks are qualitative and based on domain inspection. No automated scoring metrics are published.
---
## 🧾 Dataset Card (Embedded)
### Dataset Description
- **Type:** Synthetic text-based SIEM logs
- **Sources:** Authentication, network, endpoint-style events
- **Sensitive Data:** None (fully synthetic)
### Dataset Usage
- SIEM testing and validation
- Detection engineering
- Cybersecurity research and education
### Dataset Limitations
- May not reflect organization-specific schemas
- Rare attack patterns may be underrepresented
---
## 🏋️ Training Details
Exact training datasets, preprocessing steps, and hyperparameters have not been publicly disclosed.
The model is assumed to be fine-tuned on curated or synthetic SIEM-style log text.
---
## 🌱 Environmental Impact
Training-related carbon emissions were not recorded.
Environmental impact can be estimated using:
Lacoste et al. (2019), *Quantifying the Carbon Emissions of Machine Learning*
https://arxiv.org/abs/1910.09700
---
## ⚙️ Technical Specifications
- **Architecture:** Transformer-based causal language model
- **Objective:** Synthetic SIEM log generation
- **Software:** Python, PyTorch, Hugging Face Transformers
- **Hardware:** Not publicly documented
---
## 📚 Citation
```bibtex
@misc{siem_multisource_log_generator,
title={SIEM Multisource Log Generator},
author={Adarsh Ranjan},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/adarsh-aur/siem-multisource-log-generator}
}
```
---
## 👤 Model Card Author
Adarsh Ranjan
---
## 💬 Contact
For questions, feedback, or contributions, please use the Hugging Face model repository discussion page.