---
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-generation
tags:
- cybersecurity
- siem
- log-analysis
- field-extraction
- security-automation
- fine-tuned
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/689df7f27100a16137c1ea74/O5bUR_i8GjNWqAB-nf15J.png" width="700">
# LLMSIEM/logem
LLMSIEM/logem is a specialized language model fine-tuned for Security Information and Event Management (SIEM) tasks, particularly excelling at structured field extraction from security logs and events.
## Model Details
### Model Description
LLMSIEM/logem is a fine-tuned version of Qwen3-0.6B, specifically optimized for cybersecurity applications. The model demonstrates that targeted fine-tuning can dramatically improve performance on domain-specific tasks, achieving superior results compared to much larger general-purpose models.
- **Developed by:** Hassan Shehata
- **Model type:** Causal Language Model (Fine-tuned)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen3-0.6B
- **Model size:** 1.2 GB (FP16), 396 MB (Q4_K_M quantized)
- **Parameters:** 0.6B
### Model Sources
- **Blog Post:** [LinkedIn/Blog Series Link]
## Performance Highlights
🏆 **Best-in-class performance** for SIEM field extraction tasks:
- **66.7% perfect matches** (FP16 version)
- **0.833 F1 score** - outperforms a 12B-parameter model
- **1.00s average response time** - 3x faster than larger alternatives
- **Zero complete failures** on standardized test suite
## Uses
### Direct Use
The model is designed for cybersecurity professionals and SIEM engineers who need to:
- Extract structured fields from security logs
- Parse and normalize security event data
- Automate log analysis workflows
- Generate structured outputs from unstructured security data
### Example Use Cases
```python
# Example: Extract fields from a security log
input_text = "Extract fields from: Failed login attempt from 192.168.1.100 for user admin at 2024-01-15T10:30:45Z"
# Model will output structured JSON with relevant fields:
# {
# "event_type": "failed_login",
# "source_ip": "192.168.1.100",
# "username": "admin",
# "timestamp": "2024-01-15T10:30:45Z"
# }
```
### Downstream Use
- Integration into SIEM platforms (Splunk, ELK, QRadar)
- Security orchestration and automated response (SOAR) workflows
- Threat hunting and incident response automation
- Security data lake processing pipelines
### Out-of-Scope Use
- General-purpose text generation
- Non-security related field extraction
- Real-time processing without proper input validation
- Decision-making for critical security responses without human oversight
## Bias, Risks, and Limitations
### Technical Limitations
- Optimized specifically for security log formats seen during training
- May struggle with completely novel log formats or schemas
- Performance may degrade on logs with unusual encoding or formatting
- Quantized version (Q4_K_M) shows 5% accuracy reduction vs FP16
### Security Considerations
- Model outputs should be validated before use in automated security workflows
- Not suitable for real-time critical security decisions without human oversight
- Training data may contain biases from specific security environments
- Should not be the sole source of truth for security incident classification
### Recommendations
- Always validate model outputs in production security environments
- Implement fallback mechanisms for handling novel or malformed inputs
- Regular retraining recommended as new log formats emerge
- Use FP16 version for maximum accuracy, Q4_K_M for resource-constrained deployments
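The validation and fallback recommendations above can be sketched as a thin wrapper around the model. The required field set and the two extractor callables are assumptions for illustration, not part of the model:

```python
import json

# Hypothetical set of fields a downstream SIEM pipeline requires.
REQUIRED_FIELDS = {"event_type", "source_ip", "timestamp"}

def validate_extraction(raw_output: str):
    """Parse model output; reject it unless it is a JSON object with all required fields."""
    try:
        fields = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # malformed output -> caller should fall back
    if not isinstance(fields, dict) or not REQUIRED_FIELDS.issubset(fields):
        return None
    return fields

def extract_with_fallback(log_line: str, model_extract, regex_extract):
    """Try the model first; fall back to a deterministic parser on failure."""
    result = validate_extraction(model_extract(log_line))
    return result if result is not None else regex_extract(log_line)
```

A deterministic regex- or grok-based parser makes a natural `regex_extract` fallback for novel or malformed inputs.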
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LLMSIEM/logem")
model = AutoModelForCausalLM.from_pretrained("LLMSIEM/logem")

# Example usage
prompt = "Extract security fields from the following log: [your log here]"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,       # cap on generated tokens, excluding the prompt
        do_sample=False,          # greedy decoding for deterministic extraction
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens, skipping the echoed prompt
result = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(result)
```
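In practice the generated text may contain prose around the JSON object. A small helper that pulls out the first parseable JSON object before handing it downstream is one way to post-process `result` (a sketch, not part of the model's official API):

```python
import json
import re

def first_json_object(text: str):
    """Return the first parseable JSON object embedded in model output, or None."""
    decoder = json.JSONDecoder()
    # Try each '{' as a candidate start of a JSON object.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            continue
    return None
```

`raw_decode` parses from a given index and ignores trailing text, which makes it a good fit for model output where the JSON is surrounded by explanation.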
### Using with Ollama (Recommended for Production)
```bash
# Pull the quantized version
ollama pull LLMSIEM/logem
# Run inference
ollama run LLMSIEM/logem "Extract fields from: SSH login from 10.0.0.5 by root"
```
## Training Details
### Training Data
The model was fine-tuned on a curated dataset of security logs and corresponding structured field extractions, including:
- Network security events (firewall, IDS/IPS)
- Authentication logs (successful/failed logins)
- System security events (file access, process execution)
- Application security logs (web servers, databases)
Dataset characteristics:
- 21 standardized test cases for evaluation
- Diverse log formats and security event types
- JSON-formatted target outputs for structured field extraction
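A supervised pair in this setup might look like the following (an illustrative example, not an actual record from the training dataset):

```json
{
  "prompt": "Extract fields from: Failed password for invalid user oracle from 203.0.113.7 port 52114 ssh2",
  "completion": "{\"event_type\": \"failed_login\", \"source_ip\": \"203.0.113.7\", \"source_port\": \"52114\", \"username\": \"oracle\", \"protocol\": \"ssh\"}"
}
```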
### Training Procedure
#### Training Hyperparameters
- **Base model:** Qwen3-0.6B
- **Training regime:** Mixed precision (fp16)
- **Fine-tuning approach:** Supervised fine-tuning on field extraction tasks
- **Optimization:** Task-specific training for SIEM applications
#### Model Variants
- **FP16 Version:** 1.2 GB, maximum accuracy (0.833 F1)
- **Q4_K_M Quantized:** 396 MB, production-optimized (0.800 F1)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- 21 standardized security log parsing test cases
- Diverse log formats from multiple security tools
- Ground truth structured outputs for comparison
#### Metrics
- **Perfect Match Rate:** Percentage of test cases with 100% accurate field extraction
- **F1 Score:** Harmonic mean of precision and recall for field detection
- **Precision:** Accuracy of extracted fields
- **Response Time:** Average inference latency
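One way these per-case metrics can be computed is to treat extraction as set overlap between predicted and ground-truth (field, value) pairs. This is a sketch of the general method, not the exact evaluation harness used:

```python
def field_scores(predicted: dict, truth: dict):
    """Precision, recall and F1 over exact (field, value) matches."""
    pred_pairs = set(predicted.items())
    true_pairs = set(truth.items())
    if not pred_pairs or not true_pairs:
        return 0.0, 0.0, 0.0
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs)   # fraction of extracted fields that are correct
    recall = hits / len(true_pairs)      # fraction of ground-truth fields recovered
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

A "perfect match" under this scheme corresponds to F1 = 1.0, i.e. every ground-truth field extracted with its exact value and nothing extra.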
### Results
| Model | Perfect Matches | Avg F1 | Precision | Speed | Size |
|-------|----------------|---------|-----------|-------|------|
| **LLMSIEM/logem (FP16)** | **14/21 (66.7%)** | **0.833** | **0.848** | **1.00s** | **1.2 GB** |
| LLMSIEM/logem (Q4_K_M) | 13/21 (61.9%) | 0.800 | 0.819 | 1.00s | 396 MB |
| Gemma:12B | 15/21 (71.4%) | 0.790 | 0.788 | 3.06s | 5.0 GB |
| Qwen3:0.6B (base) | 9/21 (42.9%) | 0.651 | 0.636 | 1.57s | 522 MB |
#### Key Findings
- **+28% F1 improvement** over base Qwen3-0.6B model
- **Outperforms 12B models** in F1 score despite being 20x smaller
- **3x faster** than comparable accuracy models
- **12.6x smaller** than Gemma while maintaining superior performance
## Environmental Impact
Training a specialized 0.6B parameter model requires significantly less computational resources compared to training larger models from scratch:
- **Hardware Type:** NVIDIA GPU (RTX3060)
- **Training approach:** Fine-tuning (more efficient than training from scratch)
- **Base model efficiency:** Starting from pre-trained Qwen3-0.6B reduces carbon footprint
- **Production efficiency:** Smaller model size reduces inference energy consumption
## Technical Specifications
### Model Architecture
- **Architecture:** Transformer decoder (Qwen3 family)
- **Parameters:** 0.6 billion
- **Context length:** Inherited from Qwen3-0.6B
- **Vocabulary size:** Inherited from Qwen3-0.6B
### Compute Infrastructure
- **Training:** Fine-tuning on security-specific datasets
- **Inference:** Optimized for CPU and GPU deployment
- **Quantization:** GGML Q4_K_M for edge deployment
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{llmsiem-logem-2025,
title={LLMSIEM/logem: A Fine-tuned Language Model for Security Log Analysis},
author={Hassan Shehata},
year={2025},
url={https://huggingface.co/LLMSIEM/logem},
note={Fine-tuned from Qwen3-0.6B for SIEM applications}
}
```
## Model Card Authors
Hassan Shehata (LLMSIEM)
## Model Card Contact
For questions about this model, please contact:
- **Email:** hassanshehata25895@gmail.com
- **LinkedIn:** https://www.linkedin.com/in/hassan-shehata-503272172/
- **GitHub:** [HassanShehata](https://github.com/HassanShehata)
---
*This model is part of the LLMSIEM research series exploring the application of Large Language Models in cybersecurity and SIEM workflows.*