	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3-0.6B
	pipeline_tag: text-generation
	tags:
	- cybersecurity
	- siem
	- log-analysis
	- field-extraction
	- security-automation
	- fine-tuned
	---
	<img src="https://cdn-uploads.huggingface.co/production/uploads/689df7f27100a16137c1ea74/O5bUR_i8GjNWqAB-nf15J.png" width="700">


	# LLMSIEM/logem

	LLMSIEM/logem is a specialized language model fine-tuned for Security Information and Event Management (SIEM) tasks, particularly excelling at structured field extraction from security logs and events.

	## Model Details

	### Model Description

	LLMSIEM/logem is a fine-tuned version of Qwen3-0.6B optimized for cybersecurity applications. It shows that targeted fine-tuning can substantially improve performance on a domain-specific task: on the 21-case field-extraction benchmark reported under Evaluation, it scores a higher average F1 than the 12B general-purpose model it was compared against.

	- Developed by: Hassan Shehata
	- Model type: Causal Language Model (Fine-tuned)
	- Language(s): English
	- License: Apache 2.0
	- Finetuned from model: Qwen/Qwen3-0.6B
	- Model size: 1.2 GB (FP16), 396 MB (Q4_K_M quantized)
	- Parameters: 0.6B
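
The two file sizes listed above follow roughly from the parameter count. The sketch below is a back-of-the-envelope check only: it assumes the nominal 0.6B parameter figure, decimal units, and no file metadata overhead, none of which the card states explicitly.

```python
# Rough size check using the nominal figures from the model details list.
PARAMS = 0.6e9            # nominal parameter count (assumed exact for this estimate)
FP16_BYTES_PER_PARAM = 2  # FP16 stores each weight in 16 bits

fp16_size_gb = PARAMS * FP16_BYTES_PER_PARAM / 1e9
print(f"Estimated FP16 size: {fp16_size_gb:.1f} GB")

Q4_SIZE_BYTES = 396e6     # reported Q4_K_M file size
bits_per_weight = Q4_SIZE_BYTES * 8 / PARAMS
print(f"Implied Q4_K_M bits per weight: {bits_per_weight:.2f}")
```

An implied average above 4 bits per weight is plausible for a mixed-precision 4-bit scheme that also stores scaling factors, but the exact breakdown depends on the quantization tooling used.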

	### Model Sources

	- Blog Post: [LinkedIn/Blog Series Link]

	## Performance Highlights

	🏆 Highest average F1 among the models tested on SIEM field extraction:
	- 14 of 21 perfect matches (66.7%, FP16 version)
	- 0.833 average F1, above the 0.790 of Gemma:12B on the same test suite
	- 1.00s average response time, about 3x faster than Gemma:12B (3.06s)
	- Zero complete failures on the standardized test suite

	## Uses

	### Direct Use

	The model is designed for cybersecurity professionals and SIEM engineers who need to:
	- Extract structured fields from security logs
	- Parse and normalize security event data
	- Automate log analysis workflows
	- Generate structured outputs from unstructured security data

	### Example Use Cases

	```python
	# Example: Extract fields from a security log
	input_text = "Extract fields from: Failed login attempt from 192.168.1.100 for user admin at 2024-01-15T10:30:45Z"

	# Model will output structured JSON with relevant fields:
	# {
	# "event_type": "failed_login",
	# "source_ip": "192.168.1.100",
	# "username": "admin",
	# "timestamp": "2024-01-15T10:30:45Z"
	# }
	```

	### Downstream Use

	- Integration into SIEM platforms (Splunk, ELK, QRadar)
	- Security orchestration and automated response (SOAR) workflows
	- Threat hunting and incident response automation
	- Security data lake processing pipelines

	### Out-of-Scope Use

	- General-purpose text generation
	- Non-security related field extraction
	- Real-time processing without proper input validation
	- Decision-making for critical security responses without human oversight

	## Bias, Risks, and Limitations

	### Technical Limitations
	- Optimized specifically for security log formats seen during training
	- May struggle with completely novel log formats or schemas
	- Performance may degrade on logs with unusual encoding or formatting
	- The quantized version (Q4_K_M) scores about 4% lower F1 than FP16 (0.800 vs 0.833) and gets one fewer perfect match (13/21 vs 14/21)

	### Security Considerations
	- Model outputs should be validated before use in automated security workflows
	- Not suitable for real-time critical security decisions without human oversight
	- Training data may contain biases from specific security environments
	- Should not be the sole source of truth for security incident classification

	### Recommendations

	- Always validate model outputs in production security environments
	- Implement fallback mechanisms for handling novel or malformed inputs
	- Retrain regularly as new log formats emerge
	- Use FP16 version for maximum accuracy, Q4_K_M for resource-constrained deployments
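
The first two recommendations can be sketched as a small validation layer placed between the model and any automated workflow. This is a minimal illustration, not the author's pipeline: the `REQUIRED_FIELDS` schema, `validate_extraction`, and `process` are hypothetical names, and the checks shown (JSON parsing, required keys, IP and timestamp syntax) are only examples of what a deployment might enforce.

```python
import ipaddress
import json
from datetime import datetime
from typing import Optional

# Hypothetical schema, modeled on the failed-login example in this card.
REQUIRED_FIELDS = {"event_type", "source_ip", "username", "timestamp"}


def validate_extraction(raw_output: str) -> Optional[dict]:
    """Return the parsed fields if they pass basic checks, otherwise None."""
    try:
        fields = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(fields, dict) or not REQUIRED_FIELDS.issubset(fields):
        return None
    if not all(isinstance(fields[name], str) for name in REQUIRED_FIELDS):
        return None
    try:
        ipaddress.ip_address(fields["source_ip"])
        # fromisoformat() on older Python versions does not accept a trailing "Z".
        datetime.fromisoformat(fields["timestamp"].replace("Z", "+00:00"))
    except ValueError:
        return None
    return fields


def process(raw_output: str) -> dict:
    """Fallback path: anything that fails validation is routed to review."""
    fields = validate_extraction(raw_output)
    if fields is None:
        return {"status": "needs_review", "raw": raw_output}
    return {"status": "ok", "fields": fields}
```

Routing rejected outputs to a review queue rather than dropping them keeps a human in the loop for novel or malformed logs.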

	## How to Get Started with the Model

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# Load the model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("LLMSIEM/logem")
	model = AutoModelForCausalLM.from_pretrained("LLMSIEM/logem")

	# Example usage
	prompt = "Extract security fields from the following log: [your log here]"
	inputs = tokenizer(prompt, return_tensors="pt")

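	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,  # passes input_ids and attention_mask
	        max_new_tokens=512,  # limit on generated tokens, excluding the prompt
	        do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
	        pad_token_id=tokenizer.eos_token_id,
	    )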

	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)
	```
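
Because `generate` returns the prompt tokens followed by the completion, the decoded `result` above contains the prompt text as well as the model's answer. A minimal sketch of pulling the structured output back out is shown below; `extract_first_json_object` is a hypothetical helper, and it assumes the model emits a single JSON object somewhere in its output.

```python
import json


def extract_first_json_object(text: str):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


decoded = 'Extract security fields from the following log: SSH login\n{"source_ip": "10.0.0.5", "username": "root"}'
print(extract_first_json_object(decoded))
```

If the log line itself contains braces, this helper may match part of the prompt instead of the answer. In that case, decoding only the newly generated tokens, for example `outputs[0][inputs.input_ids.shape[1]:]`, is the safer option.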

	### Using with Ollama (Recommended for Production)

	```bash
	# Pull the quantized version
	ollama pull LLMSIEM/logem

	# Run inference
	ollama run LLMSIEM/logem "Extract fields from: SSH login from 10.0.0.5 by root"
	```
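
For pipeline integration, the same model can be queried through Ollama's local HTTP API instead of the CLI. The sketch below assumes an Ollama server on its default port with the model already pulled under the name used above; `build_request` and `extract_fields` are hypothetical helper names, and the prompt wording simply mirrors the CLI example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(log_line: str, model: str = "LLMSIEM/logem") -> urllib.request.Request:
    """Build a non-streaming generate request for one log line."""
    payload = {
        "model": model,
        "prompt": f"Extract fields from: {log_line}",
        "stream": False,                 # return one JSON body instead of chunks
        "options": {"temperature": 0},   # favor deterministic output
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_fields(log_line: str, timeout: float = 30.0) -> str:
    """Send the request and return the model's raw text response."""
    with urllib.request.urlopen(build_request(log_line), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Usage (requires a running Ollama server):
# print(extract_fields("SSH login from 10.0.0.5 by root"))
```

The returned string is still unvalidated model output and should be parsed and checked before it reaches any automated workflow.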

	## Training Details

	### Training Data

	The model was fine-tuned on a curated dataset of security logs and corresponding structured field extractions, including:
	- Network security events (firewall, IDS/IPS)
	- Authentication logs (successful/failed logins)
	- System security events (file access, process execution)
	- Application security logs (web servers, databases)

	Dataset characteristics:
	- 21 standardized test cases for evaluation
	- Diverse log formats and security event types
	- JSON-formatted target outputs for structured field extraction

	### Training Procedure

	#### Training Hyperparameters

	- Base model: Qwen3-0.6B
	- Training regime: Mixed precision (fp16)
	- Fine-tuning approach: Supervised fine-tuning on field extraction tasks
	- Optimization: Task-specific training for SIEM applications

	#### Model Variants

	- FP16 Version: 1.2 GB, maximum accuracy (0.833 F1)
	- Q4_K_M Quantized: 396 MB, production-optimized (0.800 F1)

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data
	- 21 standardized security log parsing test cases
	- Diverse log formats from multiple security tools
	- Ground truth structured outputs for comparison

	#### Metrics
	- Perfect Match Rate: Percentage of test cases with 100% accurate field extraction
	- F1 Score: Harmonic mean of precision and recall for field detection
	- Precision: Fraction of extracted fields that are correct
	- Response Time: Average inference latency
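
The card does not spell out the exact scoring procedure. One plausible reading of the metrics above, treating each (field, value) pair as the unit of comparison, can be sketched as follows. `score_case` is a hypothetical helper, and it assumes flat outputs with hashable values such as strings.

```python
def score_case(predicted: dict, expected: dict) -> dict:
    """Score one test case by exact match on (field, value) pairs."""
    pred_items = set(predicted.items())
    exp_items = set(expected.items())
    correct = len(pred_items & exp_items)

    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(exp_items) if exp_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "perfect": pred_items == exp_items,  # counts toward the perfect match rate
    }


expected = {
    "event_type": "failed_login",
    "source_ip": "192.168.1.100",
    "username": "admin",
    "timestamp": "2024-01-15T10:30:45Z",
}
# Two correct fields, one wrong value, one missing field.
predicted = {
    "event_type": "failed_login",
    "source_ip": "192.168.1.100",
    "username": "root",
}
print(score_case(predicted, expected))
```

Under this reading, the reported averages would be the mean of the per-case scores across the 21 test cases.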

	### Results

	| Model | Perfect Matches | Avg F1 | Precision | Avg Response Time | Size |
	|-------|-----------------|--------|-----------|-------------------|------|
	| LLMSIEM/logem (FP16) | 14/21 (66.7%) | 0.833 | 0.848 | 1.00s | 1.2 GB |
	| LLMSIEM/logem (Q4_K_M) | 13/21 (61.9%) | 0.800 | 0.819 | 1.00s | 396 MB |
	| Gemma:12B | 15/21 (71.4%) | 0.790 | 0.788 | 3.06s | 5.0 GB |
	| Qwen3:0.6B (base) | 9/21 (42.9%) | 0.651 | 0.636 | 1.57s | 522 MB |

	#### Key Findings
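	- 28% relative F1 improvement over the base Qwen3-0.6B model (0.833 vs 0.651)
	- Higher average F1 than Gemma:12B (0.833 vs 0.790) with 20x fewer parameters, although Gemma:12B achieves one more perfect match (15/21 vs 14/21)
	- About 3x faster than Gemma:12B (1.00s vs 3.06s average response time)
	- The Q4_K_M variant is 12.6x smaller than Gemma:12B (396 MB vs 5.0 GB) while still scoring a higher average F1 (0.800 vs 0.790)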
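
The ratios in these findings can be reproduced directly from the results table. The snippet below is a worked check that uses only the reported numbers; the dictionary keys are arbitrary labels chosen for this example.

```python
# Figures copied from the results table (sizes in MB, times in seconds).
results = {
    "logem_fp16": {"f1": 0.833, "seconds": 1.00, "size_mb": 1200},
    "logem_q4": {"f1": 0.800, "seconds": 1.00, "size_mb": 396},
    "gemma_12b": {"f1": 0.790, "seconds": 3.06, "size_mb": 5000},
    "qwen3_base": {"f1": 0.651, "seconds": 1.57, "size_mb": 522},
}

# Relative F1 gain of the fine-tuned FP16 model over the base model.
f1_gain = results["logem_fp16"]["f1"] / results["qwen3_base"]["f1"] - 1
# Response time ratio against Gemma:12B.
speedup = results["gemma_12b"]["seconds"] / results["logem_fp16"]["seconds"]
# File size ratio of Gemma:12B to the Q4_K_M variant.
size_ratio = results["gemma_12b"]["size_mb"] / results["logem_q4"]["size_mb"]
# Parameter count ratio (12B vs 0.6B).
param_ratio = 12e9 / 0.6e9
# Relative F1 cost of quantization.
quant_drop = 1 - results["logem_q4"]["f1"] / results["logem_fp16"]["f1"]

print(f"F1 gain over base: {f1_gain:.1%}")
print(f"Speedup vs Gemma:12B: {speedup:.2f}x")
print(f"Size ratio (Gemma:12B / Q4_K_M): {size_ratio:.1f}x")
print(f"Parameter ratio: {param_ratio:.0f}x")
print(f"F1 drop from quantization: {quant_drop:.1%}")
```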

	## Environmental Impact

	Training a specialized 0.6B parameter model requires significantly less computational resources compared to training larger models from scratch:

	- Hardware Type: NVIDIA GPU (RTX 3060)
	- Training approach: Fine-tuning (more efficient than training from scratch)
	- Base model efficiency: Starting from pre-trained Qwen3-0.6B reduces carbon footprint
	- Production efficiency: Smaller model size reduces inference energy consumption

	## Technical Specifications

	### Model Architecture
	- Architecture: Transformer decoder (Qwen3 family)
	- Parameters: 0.6 billion
	- Context length: [Inherited from Qwen3-0.6B]
	- Vocabulary size: [Inherited from Qwen3-0.6B]

	### Compute Infrastructure
	- Training: Fine-tuning on security-specific datasets
	- Inference: Optimized for CPU and GPU deployment
	- Quantization: GGML Q4_K_M for edge deployment

	## Citation

	If you use this model in your research or applications, please cite:

	```bibtex
	@misc{llmsiem-logem-2025,
	  title={LLMSIEM/logem: A Fine-tuned Language Model for Security Log Analysis},
	  author={Hassan Shehata},
	  year={2025},
	  url={https://huggingface.co/LLMSIEM/logem},
	  note={Fine-tuned from Qwen3-0.6B for SIEM applications}
	}
	```

	## Model Card Authors

	Hassan Shehata / LLMSIEM

	## Model Card Contact

	For questions about this model, please contact:
	- Email: hassanshehata25895@gmail.com
	- LinkedIn: https://www.linkedin.com/in/hassan-shehata-503272172/
	- GitHub: https://github.com/HassanShehata

	---

	This model is part of the LLMSIEM research series exploring the application of Large Language Models in cybersecurity and SIEM workflows.