metadata

library_name: transformers
license: mit
base_model:
  - mistralai/Mistral-7B-Instruct-v0.3

🛡️ SIEM Multisource Log Generator

🚀 Why This Model?

Security teams need large volumes of realistic SIEM data to build, test, and validate detections — but real logs are often sensitive, restricted, or unavailable.

The SIEM Multisource Log Generator produces high-quality synthetic security logs that resemble real-world telemetry across multiple sources, enabling safer experimentation, faster iteration, and better analyst training without touching production data.

📌 Model Summary

The SIEM Multisource Log Generator is a transformer-based language model fine-tuned to generate synthetic Security Information and Event Management (SIEM) logs. It is designed for cybersecurity research, detection engineering, SOC training, and SIEM validation workflows in non-production environments.

🧠 Model Details

Developed by: Adarsh Ranjan
Model type: Transformer-based causal language model
Base model: mistralai/Mistral-7B-Instruct-v0.3
Language: English
License: MIT
Framework: 🤗 Hugging Face Transformers
Model format: Safetensors

📄 Associated Papers

This model is fine-tuned from Mistral 7B and is grounded in the following foundational research:

Mistral 7B
Mistral 7B — Efficient, high-performance open-weight language model
https://huggingface.co/papers/2310.06825
Instruction Tuning & Open Foundation Models
QLoRA: Efficient Finetuning of Quantized LLMs
https://huggingface.co/papers/2305.14314

These papers describe the base architecture and training philosophy underlying the model.

📖 What Does It Generate?

The model produces structured and semi-structured SIEM-style logs, including signals from:

🔐 Authentication and identity systems
🌐 Firewalls and network devices
💻 Endpoint and host-based agents

All outputs are fully synthetic and safe for testing and research.

🎯 Intended Use

✅ Direct Use

Synthetic SIEM log generation
Detection rule and alert testing
Security analytics experimentation
SOC analyst training and simulations

🔁 Downstream Use

Fine-tuning for organization-specific log formats
Integration into SIEM test or staging environments

🚫 Out-of-Scope Use

Production ingestion of real security logs
Automated security decisions without human oversight
Real-world attack execution or facilitation

⚠️ Bias, Risks, and Limitations

Synthetic logs may not fully capture real attacker behavior
Rare or advanced attack techniques may be underrepresented
Benchmarks are qualitative and task-oriented
Outputs should be reviewed by cybersecurity professionals

🚀 Getting Started

📦 Installation

pip install transformers safetensors torch

🧪 Basic Usage Example

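from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "adarsh-aur/siem-multisource-log-generator"

# Download (or load from the local cache) the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Describe the scenario and the fields each log entry should carry.
prompt = (
    "Generate SIEM logs for a suspicious login scenario.\n"
    "Include timestamp, source IP, username, host, and outcome."
)

# Tokenize the prompt into PyTorch tensors.
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens caps only the generated tokens; max_length would also count the prompt.
outputs = model.generate(**inputs, max_new_tokens=256)

# The decoded text contains the prompt followed by the generated logs.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))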

🧠 Chain-of-Thought–Safe Prompting (Recommended)

To remain policy-safe and improve output quality, avoid asking for reasoning or explanations.
Instead, request structured outputs directly.

✅ Preferred

Generate SIEM logs showing a brute-force login attempt.
Return only the logs in JSON format.

❌ Avoid

Explain step by step how an attacker performs a brute-force attack.

The model is optimized for output generation, not procedural reasoning.
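
When prompting for JSON-only output, as in the preferred example above, it can help to check that the generated text actually parses before loading it into a test SIEM. The following is a minimal sketch, not part of the model's tooling. The function name `parse_json_logs`, the one-object-per-line layout, and the field names in `REQUIRED_FIELDS` are all assumptions made for illustration, since this card does not define a fixed output schema.

```python
import json

# Hypothetical required fields; the model card does not define a fixed schema.
REQUIRED_FIELDS = ("timestamp", "username", "source_ip", "host", "outcome")


def parse_json_logs(text, required=REQUIRED_FIELDS):
    """Split generated text into valid log records and rejected lines.

    Assumes one JSON object per line (an assumption, not a documented format).
    """
    valid, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        # Keep only JSON objects that carry every required field.
        if isinstance(record, dict) and all(f in record for f in required):
            valid.append(record)
        else:
            rejected.append(line)
    return valid, rejected


# Hand-written sample standing in for model output.
sample = "\n".join([
    '{"timestamp": "2025-01-01T00:00:00Z", "username": "alice", '
    '"source_ip": "203.0.113.7", "host": "web01", "outcome": "failure"}',
    "Here are the logs you asked for:",
    '{"timestamp": "2025-01-01T00:00:05Z", "username": "alice"}',
])
valid, rejected = parse_json_logs(sample)
print(len(valid), len(rejected))
```

Rejected lines are returned rather than discarded so that conversational filler or incomplete records can be inspected during review.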

🧩 Prompt Templates

🔐 Authentication Anomaly

Generate SIEM logs for multiple failed login attempts followed by a success.
Include timestamp, username, source IP, host, and result.

🌐 Firewall Activity

Generate firewall logs showing blocked outbound traffic to malicious IPs.
Include rule_id, destination_ip, port, protocol, and action.

💻 Endpoint Detection

Generate endpoint logs for suspicious PowerShell execution.
Include process_name, command_line, parent_process, and severity.
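
The three templates above share one shape: a scenario sentence followed by a field list. A small helper can keep them consistent across test runs. This is an illustrative sketch only. `TEMPLATES` and `build_prompt` are hypothetical names, and the optional JSON-only suffix follows the prompting recommendation in this card rather than any documented requirement of the model.

```python
# Scenario text and field lists copied from the templates in this card.
TEMPLATES = {
    "auth_anomaly": (
        "Generate SIEM logs for multiple failed login attempts followed by a success.",
        ["timestamp", "username", "source IP", "host", "result"],
    ),
    "firewall": (
        "Generate firewall logs showing blocked outbound traffic to malicious IPs.",
        ["rule_id", "destination_ip", "port", "protocol", "action"],
    ),
    "endpoint": (
        "Generate endpoint logs for suspicious PowerShell execution.",
        ["process_name", "command_line", "parent_process", "severity"],
    ),
}


def build_prompt(name, json_only=True):
    """Assemble a prompt from a named template (hypothetical helper)."""
    scenario, fields = TEMPLATES[name]
    # Join the fields as "a, b, c, and d".
    field_text = ", ".join(fields[:-1]) + ", and " + fields[-1]
    lines = [scenario, f"Include {field_text}."]
    if json_only:
        # Mirrors the "Preferred" prompting example in this card.
        lines.append("Return only the logs in JSON format.")
    return "\n".join(lines)


print(build_prompt("firewall"))
```

The returned string can be passed as `prompt` in the basic usage example.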

📈 Detection Rule Examples

Example: Brute Force Detection (Pseudo-SPL)

index=auth_logs action=failure
| stats count by src_ip, user
| where count > 5

Example: Suspicious PowerShell Execution

index=endpoint_logs process_name="powershell.exe"
| search command_line="*EncodedCommand*"

These rules can be validated using synthetic logs generated by this model.
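
To make that validation step concrete, the two pseudo-SPL rules can be approximated in plain Python and run against parsed synthetic records. This is a rough sketch under stated assumptions: records are dictionaries using the field names from the rules (`action`, `src_ip`, `user`, `process_name`, `command_line`), the function names are hypothetical, and the code does not attempt to reproduce SIEM features such as index selection or time windows. Case-insensitive matching in the second function is a design choice made here.

```python
from collections import Counter


def brute_force_hits(records, threshold=5):
    """Return (src_ip, user) pairs with more than `threshold` failures."""
    failures = Counter(
        (r.get("src_ip"), r.get("user"))
        for r in records
        if r.get("action") == "failure"
    )
    return {pair: n for pair, n in failures.items() if n > threshold}


def encoded_powershell_hits(records):
    """Return records where powershell.exe runs with an EncodedCommand flag."""
    return [
        r for r in records
        if r.get("process_name", "").lower() == "powershell.exe"
        and "encodedcommand" in r.get("command_line", "").lower()
    ]


# Hand-written stand-ins for parsed model output.
auth = [{"action": "failure", "src_ip": "198.51.100.4", "user": "bob"}] * 6
auth.append({"action": "success", "src_ip": "198.51.100.4", "user": "bob"})
endpoint = [
    {"process_name": "powershell.exe",
     "command_line": "powershell -EncodedCommand SQBFAFgA"},
    {"process_name": "cmd.exe", "command_line": "cmd /c dir"},
]

print(brute_force_hits(auth))
print(len(encoded_powershell_hits(endpoint)))
```

If a rule fires on records generated for the matching scenario and stays quiet on benign ones, that is a useful first signal before porting the rule to the target SIEM's query language.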

🔬 Benchmarks (Qualitative)

Task	Result Summary
Log Structure Consistency	High
Field Coherence	High
Scenario Diversity	Medium–High
Detection Rule Compatibility	High

Note: Benchmarks are qualitative and based on domain inspection. No automated scoring metrics are published.

🧾 Dataset Card (Embedded)

Dataset Description

Type: Synthetic text-based SIEM logs
Sources: Authentication, network, endpoint-style events
Sensitive Data: None (fully synthetic)

Dataset Usage

SIEM testing and validation
Detection engineering
Cybersecurity research and education

Dataset Limitations

May not reflect organization-specific schemas
Rare attack patterns may be underrepresented

🏋️ Training Details

Exact training datasets, preprocessing steps, and hyperparameters have not been publicly disclosed.
The model is assumed to be fine-tuned on curated or synthetic SIEM-style log text.

🌱 Environmental Impact

Training-related carbon emissions were not recorded.

Environmental impact can be estimated using:
Lacoste et al. (2019), Quantifying the Carbon Emissions of Machine Learning
https://arxiv.org/abs/1910.09700

⚙️ Technical Specifications

Architecture: Transformer-based causal language model
Objective: Synthetic SIEM log generation
Software: Python, PyTorch, Hugging Face Transformers
Hardware: Not publicly documented

📚 Citation

@misc{siem_multisource_log_generator,
  title={SIEM Multisource Log Generator},
  author={Adarsh Ranjan},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/adarsh-aur/siem-multisource-log-generator}
}

👤 Model Card Author

Adarsh Ranjan

💬 Contact

For questions, feedback, or contributions, please use the Hugging Face model repository discussion page.