---
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
pipeline_tag: text-generation
tags:
- cybersecurity
- siem
- log-analysis
- field-extraction
- security-automation
- fine-tuned
---

# LLMSIEM/logem

LLMSIEM/logem is a specialized language model fine-tuned for Security Information and Event Management (SIEM) tasks, particularly structured field extraction from security logs and events.

## Model Details

### Model Description

LLMSIEM/logem is a fine-tuned version of Qwen3-0.6B, optimized for cybersecurity applications. The model demonstrates that targeted fine-tuning can dramatically improve performance on domain-specific tasks, achieving results superior to much larger general-purpose models.

- **Developed by:** Hassan Shehata
- **Model type:** Causal Language Model (Fine-tuned)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen3-0.6B
- **Model size:** 1.2 GB (FP16), 396 MB (Q4_K_M quantized)
- **Parameters:** 0.6B

### Model Sources

- **Blog Post:** [LinkedIn/Blog Series Link]

## Performance Highlights

🏆 **Best-in-class performance** for SIEM field extraction tasks:

- **66.7% perfect matches** (FP16 version)
- **0.833 F1 score** - outperforms 12B-parameter models
- **1.00s average response time** - 3x faster than larger alternatives
- **Zero complete failures** on the standardized test suite

## Uses

### Direct Use

The model is designed for cybersecurity professionals and SIEM engineers who need to:

- Extract structured fields from security logs
- Parse and normalize security event data
- Automate log analysis workflows
- Generate structured outputs from unstructured security data

### Example Use Cases

```python
# Example: extract fields from a security log
input_text = "Extract fields from: Failed login attempt from 192.168.1.100 for user admin at 2024-01-15T10:30:45Z"

# The model outputs structured JSON with the relevant fields:
# {
#   "event_type": "failed_login",
#   "source_ip": "192.168.1.100",
#   "username": "admin",
#   "timestamp": "2024-01-15T10:30:45Z"
# }
```

### Downstream Use

- Integration into SIEM platforms (Splunk, ELK, QRadar)
- Security orchestration, automation, and response (SOAR) workflows
- Threat hunting and incident response automation
- Security data lake processing pipelines

### Out-of-Scope Use

- General-purpose text generation
- Non-security-related field extraction
- Real-time processing without proper input validation
- Decision-making for critical security responses without human oversight

## Bias, Risks, and Limitations

### Technical Limitations

- Optimized specifically for the security log formats seen during training
- May struggle with completely novel log formats or schemas
- Performance may degrade on logs with unusual encoding or formatting
- The quantized version (Q4_K_M) shows a roughly 5-point lower perfect-match rate than FP16 (61.9% vs. 66.7%)

### Security Considerations

- Model outputs should be validated before use in automated security workflows
- Not suitable for real-time critical security decisions without human oversight
- Training data may contain biases from specific security environments
- Should not be the sole source of truth for security incident classification

### Recommendations

- Always validate model outputs in production security environments
- Implement fallback mechanisms for handling novel or malformed inputs (a sketch follows this list)
- Retrain regularly as new log formats emerge
- Use the FP16 version for maximum accuracy, Q4_K_M for resource-constrained deployments
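To make the validation and fallback recommendations concrete, here is a minimal sketch of an output guard. It assumes the model emits a flat JSON object like the one in the example above; the `REQUIRED_FIELDS` set and the `validate_extraction` helper are illustrative, not part of the model's API.

```python
import json
from typing import Optional

# Fields an automated workflow must see before trusting an extraction.
# These names are illustrative; adapt them to your own log schema.
REQUIRED_FIELDS = {"event_type", "source_ip"}

def validate_extraction(raw_output: str) -> Optional[dict]:
    """Return the parsed fields if the model output is usable, else None."""
    try:
        fields = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # malformed JSON: route to a fallback parser or human review
    if not isinstance(fields, dict) or not REQUIRED_FIELDS.issubset(fields):
        return None  # likely a novel log format the model did not fully parse
    return fields

# Usage: fall back to a rule-based parser (or an analyst queue) on failure
parsed = validate_extraction('{"event_type": "failed_login", "source_ip": "192.168.1.100"}')
if parsed is None:
    print("fallback: regex-based parser / analyst review")
else:
    print("validated:", parsed)
```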
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LLMSIEM/logem")
model = AutoModelForCausalLM.from_pretrained("LLMSIEM/logem")

# Example usage
prompt = "Extract security fields from the following log: [your log here]"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,  # pass input_ids and attention_mask together
        max_new_tokens=512,
        do_sample=False,  # greedy decoding keeps field extraction deterministic
        pad_token_id=tokenizer.eos_token_id,
    )

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

### Using with Ollama (Recommended for Production)

```bash
# Pull the quantized version
ollama pull LLMSIEM/logem

# Run inference
ollama run LLMSIEM/logem "Extract fields from: SSH login from 10.0.0.5 by root"
```

## Training Details

### Training Data

The model was fine-tuned on a curated dataset of security logs and corresponding structured field extractions, including:

- Network security events (firewall, IDS/IPS)
- Authentication logs (successful/failed logins)
- System security events (file access, process execution)
- Application security logs (web servers, databases)

Dataset characteristics:

- 21 standardized test cases for evaluation
- Diverse log formats and security event types
- JSON-formatted target outputs for structured field extraction

### Training Procedure

#### Training Hyperparameters

- **Base model:** Qwen3-0.6B
- **Training regime:** Mixed precision (fp16)
- **Fine-tuning approach:** Supervised fine-tuning on field extraction tasks
- **Optimization:** Task-specific training for SIEM applications

#### Model Variants

- **FP16 version:** 1.2 GB, maximum accuracy (0.833 F1)
- **Q4_K_M quantized:** 396 MB, production-optimized (0.800 F1)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 21 standardized security log parsing test cases
- Diverse log formats from multiple security tools
- Ground-truth structured outputs for comparison

#### Metrics

- **Perfect Match Rate:** Percentage of test cases with 100% accurate field extraction
- **F1 Score:** Harmonic mean of precision and recall for field detection (see the sketch after this list)
- **Precision:** Accuracy of the extracted fields
- **Response Time:** Average inference latency
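For clarity, here is a minimal sketch of how field-level precision, recall, and F1 can be computed against a ground-truth extraction. The exact-match scoring rule and the `field_f1` helper are assumptions for illustration; the actual evaluation harness behind the numbers below is not published here.

```python
def field_f1(predicted: dict, truth: dict) -> tuple[float, float, float]:
    """Field-level scores: a field counts as correct only when both the
    key and its value match the ground truth exactly."""
    correct = sum(1 for k, v in predicted.items() if truth.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: one of two predicted fields matches the two ground-truth fields
p, r, f1 = field_f1(
    {"source_ip": "10.0.0.5", "username": "admin"},
    {"source_ip": "10.0.0.5", "username": "root"},
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # all 0.50
```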
### Results

| Model | Perfect Matches | Avg F1 | Precision | Speed | Size |
|-------|-----------------|--------|-----------|-------|------|
| **LLMSIEM/logem (FP16)** | **14/21 (66.7%)** | **0.833** | **0.848** | **1.00s** | **1.2 GB** |
| LLMSIEM/logem (Q4_K_M) | 13/21 (61.9%) | 0.800 | 0.819 | 1.00s | 396 MB |
| Gemma:12B | 15/21 (71.4%) | 0.790 | 0.788 | 3.06s | 5.0 GB |
| Qwen3:0.6B (base) | 9/21 (42.9%) | 0.651 | 0.636 | 1.57s | 522 MB |

#### Key Findings

- **+28% relative F1 improvement** (0.651 → 0.833) over the base Qwen3-0.6B model
- **Outperforms 12B models in F1 score** despite having 20x fewer parameters
- **3x faster** than models of comparable accuracy
- **12.6x smaller** than Gemma 12B (396 MB Q4_K_M vs. 5.0 GB) while maintaining a higher F1

## Environmental Impact

Training a specialized 0.6B-parameter model requires significantly fewer computational resources than training a larger model from scratch:

- **Hardware type:** NVIDIA RTX 3060 GPU
- **Training approach:** Fine-tuning (more efficient than training from scratch)
- **Base model efficiency:** Starting from the pre-trained Qwen3-0.6B reduces the carbon footprint
- **Production efficiency:** The small model size reduces inference energy consumption

## Technical Specifications

### Model Architecture

- **Architecture:** Transformer decoder (Qwen3 family)
- **Parameters:** 0.6 billion
- **Context length:** [Inherited from Qwen3-0.6B]
- **Vocabulary size:** [Inherited from Qwen3-0.6B]

### Compute Infrastructure

- **Training:** Fine-tuning on security-specific datasets
- **Inference:** Optimized for CPU and GPU deployment
- **Quantization:** GGUF Q4_K_M for edge deployment

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{llmsiem-logem-2025,
  title={LLMSIEM/logem: A Fine-tuned Language Model for Security Log Analysis},
  author={Hassan Shehata},
  year={2025},
  url={https://huggingface.co/LLMSIEM/logem},
  note={Fine-tuned from Qwen3-0.6B for SIEM applications}
}
```

## Model Card Authors

Hassan Shehata (LLMSIEM)

## Model Card Contact

For questions about this model, please contact:

- **Email:** hassanshehata25895@gmail.com
- **LinkedIn:** https://www.linkedin.com/in/hassan-shehata-503272172/
- **GitHub:** https://github.com/HassanShehata

---

*This model is part of the LLMSIEM research series exploring the application of Large Language Models in cybersecurity and SIEM workflows.*