---
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-generation
tags:
- cybersecurity
- siem
- log-analysis
- field-extraction
- security-automation
- fine-tuned
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/689df7f27100a16137c1ea74/O5bUR_i8GjNWqAB-nf15J.png" width="700">
# LLMSIEM/logem
LLMSIEM/logem is a specialized language model fine-tuned for Security Information and Event Management (SIEM) tasks, particularly excelling at structured field extraction from security logs and events.
## Model Details
### Model Description
LLMSIEM/logem is a fine-tuned version of Qwen3-0.6B, specifically optimized for cybersecurity applications. The model demonstrates that targeted fine-tuning can dramatically improve performance on domain-specific tasks, achieving higher F1 scores than much larger general-purpose models on SIEM field extraction.
- **Developed by:** Hassan Shehata
- **Model type:** Causal Language Model (Fine-tuned)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen3-0.6B
- **Model size:** 1.2 GB (FP16), 396 MB (Q4_K_M quantized)
- **Parameters:** 0.6B
### Model Sources
- **Blog Post:** [LinkedIn/Blog Series Link]
## Performance Highlights
🏆 **Best-in-class performance** for SIEM field extraction tasks:
- **66.7% perfect matches** (FP16 version)
- **0.833 F1 score** - outperforms 12B parameter models
- **1.00s average response time** - 3x faster than larger alternatives
- **Zero complete failures** on standardized test suite
## Uses
### Direct Use
The model is designed for cybersecurity professionals and SIEM engineers who need to:
- Extract structured fields from security logs
- Parse and normalize security event data
- Automate log analysis workflows
- Generate structured outputs from unstructured security data
### Example Use Cases
```python
# Example: Extract fields from a security log
input_text = "Extract fields from: Failed login attempt from 192.168.1.100 for user admin at 2024-01-15T10:30:45Z"
# Model will output structured JSON with relevant fields:
# {
# "event_type": "failed_login",
# "source_ip": "192.168.1.100",
# "username": "admin",
# "timestamp": "2024-01-15T10:30:45Z"
# }
```
### Downstream Use
- Integration into SIEM platforms (Splunk, ELK, QRadar)
- Security orchestration and automated response (SOAR) workflows
- Threat hunting and incident response automation
- Security data lake processing pipelines
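As one hypothetical sketch of such an integration, the model's extracted fields could be renamed onto a SIEM platform's normalized schema during ingest. The mapping below (ECS-style dotted field names, the `labels.` catch-all) is an illustrative assumption, not part of the model or any specific platform's API:

```python
# Hypothetical mapping from model-extracted field names to an
# ECS-style normalized schema, as a SIEM ingest step might apply.
ECS_MAP = {
    "source_ip": "source.ip",
    "username": "user.name",
    "timestamp": "@timestamp",
    "event_type": "event.action",
}

def to_normalized_event(extracted):
    """Rename extracted keys to the target schema, parking any
    unmapped fields under a catch-all namespace for later review."""
    event = {}
    for key, value in extracted.items():
        event[ECS_MAP.get(key, f"labels.{key}")] = value
    return event

event = to_normalized_event({
    "event_type": "failed_login",
    "source_ip": "192.168.1.100",
    "username": "admin",
})
```

Keeping the rename table in configuration rather than in the prompt lets the same model output feed Splunk, ELK, or QRadar pipelines with different target schemas.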
### Out-of-Scope Use
- General-purpose text generation
- Non-security related field extraction
- Real-time processing without proper input validation
- Decision-making for critical security responses without human oversight
## Bias, Risks, and Limitations
### Technical Limitations
- Optimized specifically for security log formats seen during training
- May struggle with completely novel log formats or schemas
- Performance may degrade on logs with unusual encoding or formatting
- Quantized version (Q4_K_M) shows 5% accuracy reduction vs FP16
### Security Considerations
- Model outputs should be validated before use in automated security workflows
- Not suitable for real-time critical security decisions without human oversight
- Training data may contain biases from specific security environments
- Should not be the sole source of truth for security incident classification
### Recommendations
- Always validate model outputs in production security environments
- Implement fallback mechanisms for handling novel or malformed inputs
- Regular retraining recommended as new log formats emerge
- Use FP16 version for maximum accuracy, Q4_K_M for resource-constrained deployments
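The validation and fallback recommendations above can be sketched as a small guard around the model's raw output. This is a minimal illustration (the function name, required-field list, and JSON-extraction heuristic are assumptions, not part of the model's contract):

```python
import json

def parse_extraction(raw_output, required_fields=("event_type",)):
    """Parse the model's raw text output into a validated field dict.

    Returns None when the output contains no parseable JSON object or
    is missing a required field, so the caller can fall back to a
    traditional regex/grok parser instead of trusting bad output.
    """
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        fields = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not all(key in fields for key in required_fields):
        return None
    return fields

# A well-formed response is accepted...
ok = parse_extraction('{"event_type": "failed_login", "source_ip": "192.168.1.100"}')
# ...while malformed output falls through to None instead of raising.
bad = parse_extraction("Sorry, I could not parse that log.")
```

Returning `None` rather than raising keeps the fallback path explicit in the calling workflow.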
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LLMSIEM/logem")
model = AutoModelForCausalLM.from_pretrained("LLMSIEM/logem")

# Example usage
prompt = "Extract security fields from the following log: [your log here]"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding gives deterministic extractions
        pad_token_id=tokenizer.eos_token_id,
    )

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
### Using with Ollama (Recommended for Production)
```bash
# Pull the quantized version
ollama pull LLMSIEM/logem
# Run inference
ollama run LLMSIEM/logem "Extract fields from: SSH login from 10.0.0.5 by root"
```
## Training Details
### Training Data
The model was fine-tuned on a curated dataset of security logs and corresponding structured field extractions, including:
- Network security events (firewall, IDS/IPS)
- Authentication logs (successful/failed logins)
- System security events (file access, process execution)
- Application security logs (web servers, databases)
Dataset characteristics:
- 21 standardized test cases for evaluation
- Diverse log formats and security event types
- JSON-formatted target outputs for structured field extraction
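For illustration, a supervised training example in this format pairs a raw log instruction with its JSON target. The specific pair below is hypothetical, invented for this card rather than drawn from the actual training set:

```python
import json

# Hypothetical (input, target) pair illustrating the supervised
# field-extraction format: raw log text in, JSON field string out.
example = {
    "input": ("Extract fields from: Blocked outbound connection to "
              "203.0.113.7:445 by firewall rule 1021"),
    "target": json.dumps({
        "event_type": "firewall_block",
        "destination_ip": "203.0.113.7",
        "destination_port": 445,
        "rule_id": 1021,
    }),
}

# Targets stay machine-checkable: they must round-trip through a
# JSON parser, which is also how outputs are scored at evaluation.
fields = json.loads(example["target"])
```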
### Training Procedure
#### Training Hyperparameters
- **Base model:** Qwen3-0.6B
- **Training regime:** Mixed precision (fp16)
- **Fine-tuning approach:** Supervised fine-tuning on field extraction tasks
- **Optimization:** Task-specific training for SIEM applications
#### Model Variants
- **FP16 Version:** 1.2 GB, maximum accuracy (0.833 F1)
- **Q4_K_M Quantized:** 396 MB, production-optimized (0.800 F1)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- 21 standardized security log parsing test cases
- Diverse log formats from multiple security tools
- Ground truth structured outputs for comparison
#### Metrics
- **Perfect Match Rate:** Percentage of test cases with 100% accurate field extraction
- **F1 Score:** Harmonic mean of precision and recall for field detection
- **Precision:** Accuracy of extracted fields
- **Response Time:** Average inference latency
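The field-level metrics above can be computed per test case roughly as follows. The exact matching rules used in this evaluation are not published here; this sketch assumes a field counts as correct only when both key and value exactly match the ground truth:

```python
def field_prf(predicted, gold):
    """Per-example precision, recall, and F1 for field extraction,
    counting a field as correct only on an exact key/value match."""
    correct = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = {"event_type": "failed_login", "source_ip": "192.168.1.100",
        "username": "admin"}
pred = {"event_type": "failed_login", "source_ip": "192.168.1.100",
        "username": "root"}  # one wrong value out of three fields
p, r, f1 = field_prf(pred, gold)
```

A perfect match is then simply the case where F1 equals 1.0, and the reported averages are means over the 21 test cases.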
### Results
| Model | Perfect Matches | Avg F1 | Precision | Speed | Size |
|-------|----------------|---------|-----------|-------|------|
| **LLMSIEM/logem (FP16)** | **14/21 (66.7%)** | **0.833** | **0.848** | **1.00s** | **1.2 GB** |
| LLMSIEM/logem (Q4_K_M) | 13/21 (61.9%) | 0.800 | 0.819 | 1.00s | 396 MB |
| Gemma:12B | 15/21 (71.4%) | 0.790 | 0.788 | 3.06s | 5.0 GB |
| Qwen3:0.6B (base) | 9/21 (42.9%) | 0.651 | 0.636 | 1.57s | 522 MB |
#### Key Findings
- **+28% F1 improvement** (0.651 → 0.833) over the base Qwen3-0.6B model
- **Outperforms 12B-parameter models on F1** despite having 20x fewer parameters
- **~3x faster** than the larger models of comparable accuracy (1.00s vs 3.06s)
- **12.6x smaller on disk** than Gemma:12B (396 MB Q4_K_M vs 5.0 GB) while retaining a higher F1
## Environmental Impact
Training a specialized 0.6B-parameter model requires significantly fewer computational resources than training larger models from scratch:
- **Hardware Type:** NVIDIA RTX 3060 GPU
- **Training approach:** Fine-tuning (more efficient than training from scratch)
- **Base model efficiency:** Starting from pre-trained Qwen3-0.6B reduces carbon footprint
- **Production efficiency:** Smaller model size reduces inference energy consumption
## Technical Specifications
### Model Architecture
- **Architecture:** Transformer decoder (Qwen3 family)
- **Parameters:** 0.6 billion
- **Context length:** [Inherited from Qwen3-0.6B]
- **Vocabulary size:** [Inherited from Qwen3-0.6B]
### Compute Infrastructure
- **Training:** Fine-tuning on security-specific datasets
- **Inference:** Optimized for CPU and GPU deployment
- **Quantization:** GGML Q4_K_M for edge deployment
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{llmsiem-logem-2025,
  title={LLMSIEM/logem: A Fine-tuned Language Model for Security Log Analysis},
  author={Hassan Shehata},
  year={2025},
  url={https://huggingface.co/LLMSIEM/logem},
  note={Fine-tuned from Qwen3-0.6B for SIEM applications}
}
```
## Model Card Authors
Hassan Shehata (LLMSIEM)
## Model Card Contact
For questions about this model, please contact:
- **Email:** hassanshehata25895@gmail.com
- **LinkedIn:** https://www.linkedin.com/in/hassan-shehata-503272172/
- **GitHub:** [HassanShehata](https://github.com/HassanShehata)
---
*This model is part of the LLMSIEM research series exploring the application of Large Language Models in cybersecurity and SIEM workflows.*