🛡️ MinimoSec V4.2
Fine-Tuned Cybersecurity LLM — Gemma 4 E4B
Cybersecurity-specialised language model for Portuguese-speaking analysts
📌 Model Description
MinimoSec V4.2 is a cybersecurity-specialised language model fine-tuned from Google Gemma 4 E4B using a two-stage training approach: Supervised Fine-Tuning (SFT) with Low-Rank Adaptation (LoRA) followed by Direct Preference Optimization (DPO) for alignment refinement via the Unsloth framework.
The model was trained on 22,571 Portuguese-language cybersecurity examples covering threat analysis, malware identification, MITRE ATT&CK mapping, YARA rule generation, IOC extraction, and digital forensics. The DPO refinement stage significantly improved factual accuracy and reduced hallucinations, particularly on complex technical topics.
| Specification | Detail |
|---|---|
| Primary Language | Portuguese (pt-PT / pt-BR) |
| Domain | Cybersecurity, Threat Intelligence, Digital Forensics |
| Base Model | google/gemma-4-e4b-it |
| Training Method | SFT + LoRA → DPO Alignment |
| Training Epochs | 1 (SFT) + DPO refinement |
| Quantisation Available | Q4_K_M GGUF (~5.3 GB) |
📊 CyberBench-Hard v1.0 — V4.2 Results
Specialized Cybersecurity Benchmark for Small-Scale SFT+DPO Models
About the Benchmark
CyberBench-Hard is a specialized cybersecurity knowledge evaluation benchmark composed of 50 expert-level questions distributed across 10 categories. Questions are designed to test deep technical reasoning, factual accuracy, and hallucination resistance across critical information security domains.
This document presents results comparing MinimoSec-V4.1-4B (SFT-only baseline) against MinimoSec-V4.2-4B (SFT+DPO refinement) for categories D (Malware Analysis & Reverse Engineering) and G (MITRE ATT&CK & Threat Intelligence).
Evaluated Models
| Field | V4.1 (SFT Baseline) | V4.2 (DPO Refined) |
|---|---|---|
| Model | MinimoSec-V4.1-4B | MinimoSec-V4.2-4B |
| Base Architecture | Gemma 4 E4B (4 billion active parameters) | Gemma 4 E4B (4 billion active parameters) |
| Fine-tuning | SFT (Supervised Fine-Tuning) | SFT + DPO (Direct Preference Optimization) |
| Dataset | 22,571 cybersecurity-focused samples | 22,571 cybersecurity-focused samples |
| Specialization | Offensive & Defensive Cybersecurity | Offensive & Defensive Cybersecurity |
| Evaluator | Lucas Catão de Moraes | Lucas Catão de Moraes |
| Date | April 2026 | April 2026 |
| Methodology | Manual per-dimension evaluation with weighted criteria | Manual per-dimension evaluation with weighted criteria |
DPO Improvement Summary
| Question | SFT (V4.1) | DPO (V4.2) | Delta | Trend |
|---|---|---|---|---|
| D4 — DKOM/Rootkit | 7.10 | 7.43 | +0.33 | ✅ Improvement |
| G1 — MITRE ATT&CK | 2.95 | 4.18 | +1.23 | ✅ Improvement |
| D3 — Process Hollowing | 6.55 | 6.45 | -0.10 | ⚠️ Slight Regression |
| Average | 5.53 | 6.02 | +0.49 | ✅ Improvement |
Key Achievement: The DPO refinement delivered a +8.9% overall improvement, with the most significant gains on complex conceptual topics (MITRE ATT&CK hierarchy improved by 42%).
Evaluation Criteria
| Dimension | Weight | Description |
|---|---|---|
| Factual Correctness | 30% | Technical accuracy of the information presented |
| Technical Depth | 25% | Level of detail and demonstrated expertise |
| Completeness | 20% | Coverage of all sub-items in the question |
| Clarity & Structure | 15% | Organization, didactics, and readability |
| Absence of Hallucinations | 10% | Absence of fabricated terms, concepts, or data |
Scoring Scale
| Score | Classification |
|---|---|
| 9.0 – 10.0 | Expert-Level |
| 7.5 – 8.9 | Advanced |
| 6.0 – 7.4 | Intermediate |
| 4.0 – 5.9 | Basic |
| < 4.0 | Insufficient |
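The weighted per-question score and its classification band can be reproduced with a short helper (weights and band thresholds are taken directly from the two tables above; the dictionary keys are illustrative names, not part of the benchmark):

```python
# Reproduces the CyberBench-Hard weighted scoring and classification bands.
WEIGHTS = {
    "factual": 0.30,        # Factual Correctness
    "depth": 0.25,          # Technical Depth
    "completeness": 0.20,   # Completeness
    "clarity": 0.15,        # Clarity & Structure
    "hallucination": 0.10,  # Absence of Hallucinations
}

def weighted_score(dimensions: dict) -> float:
    """Combine per-dimension scores (0-10) into one weighted overall score."""
    return round(sum(WEIGHTS[k] * v for k, v in dimensions.items()), 2)

def classify(score: float) -> str:
    """Map an overall score to its CyberBench-Hard classification band."""
    if score >= 9.0:
        return "Expert-Level"
    if score >= 7.5:
        return "Advanced"
    if score >= 6.0:
        return "Intermediate"
    if score >= 4.0:
        return "Basic"
    return "Insufficient"
```

For example, D4's overall score of 7.43 falls in the Intermediate band, while G1's 2.95 baseline sits in Insufficient.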
Category D — Malware Analysis & Reverse Engineering
| # | Topic | Factual | Depth | Completeness | Clarity | Hallucinations | Score | Classification |
|---|---|---|---|---|---|---|---|---|
| D1 | Static / Dynamic Analysis | — | — | — | — | — | — | — |
| D2 | Packer / Crypter / Unpacking | — | — | — | — | — | — | — |
| D3 | Process Hollowing (T1055.012) | — | — | — | — | — | 6.45 | Intermediate |
| D4 | DKOM / Kernel Rootkit | — | — | — | — | — | 7.43 | Intermediate |
| D5 | DGA / C2 / ML Detection | — | — | — | — | — | — | — |
| | Category D Average | | | | | | 6.94 | Intermediate |
Category G — MITRE ATT&CK & Threat Intelligence
| # | Topic | Factual | Depth | Completeness | Clarity | Hallucinations | Score | Classification |
|---|---|---|---|---|---|---|---|---|
| G1 | MITRE ATT&CK Hierarchy | — | — | — | — | — | 4.18 | Basic |
| G2 | IoCs vs IoAs / SIEM / SOAR | — | — | — | — | — | — | — |
| G3 | Kill Chain / Diamond Model | — | — | — | — | — | — | — |
| G4 | Threat Hunting / LOLBins | — | — | — | — | — | — | — |
| G5 | STIX / TAXII | — | — | — | — | — | — | — |
| | Category G Average | | | | | | 4.18 | Basic |
Detailed Test Results
Test 1 — Best Case: D4 (DKOM / Rootkit)
Question: O que é um rootkit de kernel em Windows? Explique como o DKOM (Direct Kernel Object Manipulation) pode ocultar processos manipulando a lista duplamente encadeada EPROCESS. Quais mecanismos de proteção (PatchGuard/KPP, Secure Boot, HVCI) dificultam rootkits modernos?
| Metric | V4.1 (SFT) | V4.2 (DPO) | Change |
|---|---|---|---|
| Overall Score | 7.10 | 7.43 | +0.33 |
| Classification | Intermediate | Intermediate | — |
Analysis: DPO refinement improved the kernel rootkit explanation, particularly in the technical accuracy of DKOM mechanisms and protection systems description. The model now provides more precise details about EPROCESS manipulation and HVCI protections.
Test 2 — Worst Case: G1 (MITRE ATT&CK)
Question: No framework MITRE ATT&CK v18 (Enterprise), explique a diferença entre Tactics, Techniques e Sub-techniques. Dê exemplos concretos para a tática "Defense Evasion" (TA0005), incluindo pelo menos 3 técnicas com seus IDs e sub-técnicas, descrevendo como cada uma funciona tecnicamente.
| Metric | V4.1 (SFT) | V4.2 (DPO) | Change |
|---|---|---|---|
| Overall Score | 2.95 | 4.18 | +1.23 |
| Classification | Insufficient | Basic | ⬆️ Upgrade |
Analysis: DPO delivered the largest improvement (+42%) on this challenging conceptual question. The V4.2 model shows a better grasp of the MITRE ATT&CK hierarchy and provides more accurate technique IDs and descriptions, though hallucinations on specific sub-technique details remain a limitation.
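The Tactic → Technique → Sub-technique hierarchy that question G1 probes can be illustrated with a minimal data structure (the ATT&CK IDs below are real Enterprise identifiers, but the selection is illustrative and not taken from the model's output):

```python
# Minimal illustration of the MITRE ATT&CK Enterprise hierarchy:
# a Tactic (TAxxxx) groups Techniques (Txxxx), which may have
# Sub-techniques (Txxxx.yyy).
ATTACK = {
    "TA0005": {  # Defense Evasion
        "name": "Defense Evasion",
        "techniques": {
            "T1055": {
                "name": "Process Injection",
                "sub_techniques": {
                    "T1055.001": "Dynamic-link Library Injection",
                    "T1055.012": "Process Hollowing",
                },
            },
            "T1027": {
                "name": "Obfuscated Files or Information",
                "sub_techniques": {
                    "T1027.002": "Software Packing",
                },
            },
        },
    },
}

def parent_technique(sub_id: str) -> str:
    """A sub-technique ID encodes its parent technique before the dot."""
    return sub_id.split(".")[0]
```

Answering G1 well amounts to navigating this structure correctly: naming the tactic, valid technique IDs under it, and real sub-technique IDs under those.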
Test 3 — Medium Case: D3 (Process Hollowing)
Question: Explique a técnica de Process Hollowing (T1055.012 no MITRE ATT&CK). Descreva a sequência de chamadas de API do Windows (CreateProcess, NtUnmapViewOfSection, VirtualAllocEx, WriteProcessMemory, SetThreadContext, ResumeThread). Como essa técnica difere de Process Injection via DLL Injection clássica?
| Metric | V4.1 (SFT) | V4.2 (DPO) | Change |
|---|---|---|---|
| Overall Score | 6.55 | 6.45 | -0.10 |
| Classification | Intermediate | Intermediate | — |
Analysis: Minor regression (-0.10) observed on this already well-understood topic. The SFT-only version had stronger coverage of this specific technique in training data, and DPO refinement slightly shifted emphasis. This represents acceptable variance within the noise threshold.
Overall Summary
| Category | V4.1 Average | V4.2 Average | Improvement | Classification |
|---|---|---|---|---|
| D — Malware & RE | 6.83 | 6.94 | +1.7% | Intermediate |
| G — MITRE & Threat Intel | 2.95 | 4.18* | +41.7% | Basic |
| Global Average (Tested) | 5.53 | 6.02 | +8.9% | Intermediate |
*Averages cover the tested questions only. G1 was the worst-performing question in the V4.1 baseline; DPO improved it significantly, but MITRE ATT&CK remains the weakest area.
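The global average and the G1 gain follow directly from the three tested per-question scores:

```python
# Per-question scores from the three detailed tests above (SFT baseline vs. DPO).
sft = {"D4": 7.10, "G1": 2.95, "D3": 6.55}
dpo = {"D4": 7.43, "G1": 4.18, "D3": 6.45}

avg_sft = round(sum(sft.values()) / len(sft), 2)       # global SFT average: 5.53
avg_dpo = round(sum(dpo.values()) / len(dpo), 2)       # global DPO average: 6.02

overall_gain = round((avg_dpo / avg_sft - 1) * 100, 1)  # +8.9% overall
g1_gain = round((dpo["G1"] / sft["G1"] - 1) * 100, 0)   # ~+42% on G1
```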
| | V4.1 (Baseline) | V4.2 (DPO) |
|---|---|---|
| Best Response | D4: DKOM / Rootkit (7.10) | D4: DKOM / Rootkit (7.43) |
| Worst Response | G1: MITRE ATT&CK (2.95) | G1: MITRE ATT&CK (4.18) |
| Best Improvement | — | G1: MITRE ATT&CK (+1.23) |
Key Findings — V4.2 DPO Analysis
1. DPO significantly improves factual accuracy on weak areas. The largest gain (+1.23) was achieved on the worst-performing question (G1), demonstrating DPO's effectiveness at correcting alignment issues.
2. Strong topics remain stable. D4 (DKOM/Rootkit) improved further (+0.33) from an already strong baseline, showing DPO does not degrade well-learned knowledge.
3. Hallucinations drop on conceptual topics. The V4.2 MITRE ATT&CK response contained fewer fabricated technique IDs and more accurate sub-technique descriptions.
4. Minor variance is acceptable. D3 showed a slight regression (-0.10), within expected statistical variance for model refinement, a reasonable trade-off for the overall improvement.
5. DPO is essential for 4B parameter models. The +8.9% overall improvement demonstrates that SFT+DPO outperforms SFT alone in specialized technical domains, even with limited parameters.
MinimoSec-V4.2-4B — Model Analysis
For a 4 billion parameter cybersecurity-specialized model with DPO refinement, the CyberBench-Hard results reveal:
- SFT+DPO is the optimal training pipeline for small models. Supervised fine-tuning followed by preference optimization delivers measurable improvements over SFT alone.
- V4.2 achieves Intermediate level (6.02) on the tested domains, a solid foundation for educational and assistive cybersecurity tasks in Portuguese.
- Remaining gaps: MITRE ATT&CK conceptual knowledge is still the weakest area (4.18) and will require additional dataset curation for V5.
- Performance ceiling: the best response (D4: 7.43) suggests the 4B architecture with the current dataset is approaching a ceiling of roughly 7.5; reaching the Advanced band (7.5+) may require model scale-up or further DPO iterations.
V4.2 is suitable as an intermediate-level cybersecurity assistant with improved reliability over V4.1, particularly for malware analysis topics. Human verification remains recommended for critical decisions.
Benchmark Reference
CyberBench-Hard v1.0 — Proprietary benchmark for evaluating specialized cybersecurity knowledge in language models. 50 expert-level questions across 10 categories. Developed and administered in April 2026.
This document presents comparative results between MinimoSec-V4.1 (SFT baseline) and MinimoSec-V4.2 (SFT+DPO refinement) for categories D and G (3 representative questions).
Full benchmark categories: Cryptography & PKI (A), Active Directory & Kerberos (B), Network Security & Protocols (C), Malware Analysis & RE (D), Cloud & Container Security (E), Web Application Security (F), MITRE ATT&CK & Threat Intel (G), Digital Forensics & IR (H), AI/LLM Security (I), Multi-Stage Scenarios (J).
🚀 Quick Start
Ollama (Recommended)
ollama run hf.co/dolutech/MinimoSec-V4.2-4b-GGUF:MinimoSec-V4.2-4b.Q4_K_M.gguf
LM Studio
- Download `MinimoSec-V4.2-4b.Q4_K_M.gguf` from the GGUF repository
- Load it manually in LM Studio
- Note: also download `MinimoSec-V4.2-4b.BF16-mmproj.gguf` for multimodal (vision) support
Python (Transformers)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "dolutech/MinimoSec-V4.2-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "user", "content": "Cria uma regra YARA para detetar ransomware que encripta ficheiros .docx e .xlsx."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
💬 Recommended System Prompt
És o MinimoSec V4.2, um assistente especializado em cibersegurança desenvolvido pela Dolutech.
Respondes sempre em Português de Portugal.
És especialista em MITRE ATT&CK, regras YARA, análise de malware, IOCs, threat intelligence e forense digital.
Forneces respostas técnicas, precisas e estruturadas.
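The system prompt above highlights IOC extraction as one of the model's specialities. As a rough plain-Python baseline for that task (illustrative only, not the model's method; the regexes are simplified and ignore defanged indicators like `hxxp` or `[.]`):

```python
import re

# Simplified patterns for two common IOC types; production extraction
# needs defanging handling and stricter validation (e.g. octet ranges).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SHA256 = re.compile(r"\b[a-fA-F0-9]{64}\b")

def extract_iocs(text: str) -> dict:
    """Pull candidate IPv4 addresses and SHA-256 hashes from free text."""
    return {
        "ipv4": IPV4.findall(text),
        "sha256": SHA256.findall(text),
    }
```

A prompt asking the model to extract IOCs from an incident report is expected to produce the same kind of structured output, with the added ability to handle defanged and contextual indicators.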
📋 Training Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-e4b-it |
| Framework | Unsloth 2026.4.6 |
| Stage 1 — SFT | Supervised Fine-Tuning + LoRA |
| Stage 2 — DPO | Direct Preference Optimization |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| SFT epochs | 1 |
| DPO beta | 0.1 |
| Max sequence length | 2048 |
| Batch size | 2 (gradient accumulation 4) |
| Dataset size | 22,571 examples |
| Dataset language | Portuguese |
| Hardware | 1× NVIDIA Tesla A100 |
| Quantisation | 4-bit (bitsandbytes, training) / Q4_K_M GGUF (inference) |
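The Stage 2 objective can be sketched in a few lines. This is a pure-Python sketch of the pairwise DPO loss with β = 0.1 as in the table above; the log-probability arguments are placeholders for sequence log-likelihoods computed by the policy and frozen reference models:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Pairwise DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)),
    where each ratio is the policy-vs-reference log-probability difference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))
```

When the policy favours the chosen answer more than the reference does, the logits are positive and the loss falls below log 2 ≈ 0.693; minimising it pushes the model toward the preferred responses without drifting far from the SFT reference.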
⚠️ Limitations & Development Phase
This model is in an active research and development phase. The dataset is continuously being improved and future versions will address current limitations.
- Refined with DPO to reduce hallucinations and improve factual accuracy
- Trained with an evolving dataset; the model may reproduce inconsistent information, including incorrect CVEs, imprecise MITRE ATT&CK sub-techniques, or YARA/SIGMA rules with invalid syntax
- Optimised for Portuguese (PT/BR); responses in English may be less precise
- 4B active parameter model (MoE); complex multi-step reasoning may require enabling thinking mode
- Not a replacement for a certified security analyst; use it exclusively as a study and assistive tool
- Internal benchmarks indicate an average score of 6.02/10 on tested cybersecurity scenarios; improvements expected in upcoming versions
V4.2 Improvements over V4
- ✅ +8.9% overall benchmark improvement
- ✅ +42% improvement on MITRE ATT&CK conceptual knowledge
- ✅ Reduced hallucinations on technical detail questions
- ✅ Better factual accuracy on kernel-level topics
Roadmap
- V5: expanded dataset focused on specific CVEs, exact MITRE ATT&CK sub-techniques, and valid SIGMA/YARA rules
- V5: additional DPO iterations with expert-curated preference pairs
- V5: comparative benchmark against Gemma 4 base as reference baseline
📜 License
This model is released under the Gemma Terms of Use. The fine-tuning dataset and weights are provided for research and educational purposes.
🏢 About
Developed by Dolutech — cybersecurity research and open-source tooling for Portuguese-speaking communities.
llama-cpp-python (4-bit GGUF)
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="dolutech/MinimoSec-V4.2-4b-GGUF",
    filename="MinimoSec-V4.2-4b.Q4_K_M.gguf",  # Q4_K_M quant referenced in Quick Start
)

llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explica o que é process hollowing."}]
)