Qwen3-8B Cyber Knowledge

Qwen3-8B Cyber Knowledge is a cybersecurity domain-adapted language model fine-tuned from Qwen3-8B-Base for IT professionals and security practitioners. It was developed with a two-stage training pipeline, Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT), on curated, publicly available cybersecurity datasets covering MITRE ATT&CK, NIST frameworks, incident response, CVE analysis, and threat intelligence.

Model Details

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3-8B-Base |
| Model Type | Causal Language Model |
| Language | English |
| Fine-tune License | Apache 2.0 |
| Base Model License | Qwen License |
| Parameters | 8B |
| Precision | bfloat16 |
| Training Hardware | NVIDIA A100 80GB |

Intended Use

This model is designed to assist IT security professionals with:

- MITRE ATT&CK framework guidance and technique explanation
- Incident response planning and playbook guidance
- NIST Cybersecurity Framework application
- CVE vulnerability analysis and explanation
- SOC analyst support for threat investigations
- General cybersecurity knowledge queries
- Lateral movement and threat artifact identification

This model is not intended for:

- Offensive security tooling or attack assistance
- Replacing trained security analysts in production environments
- Authoritative CVE ID or technique ID lookup without independent verification
- Real-time threat intelligence

Training Pipeline

Stage 1: Continued Pre-Training (CPT)

The base Qwen3-8B-Base model underwent continued pre-training on a set of curated cybersecurity datasets to build domain-specific knowledge before instruction tuning.

CPT Datasets:

| Dataset | Examples | Coverage |
| --- | --- | --- |
| sambanovasystems/attackqa | 10,000 (sampled) | MITRE ATT&CK Q&A, technique descriptions, adversary TTPs |
| ethanolivertroy/nist-cybersecurity-training | 10,000 (sampled) | NIST frameworks, security controls, compliance guidance |
| trendmicro-ailab/Primus-Seed | 20,000 (sampled) | Broad cybersecurity knowledge corpus |
| Total | 40,000 | |

CPT Configuration:

| Parameter | Value |
| --- | --- |
| Method | LoRA (r=16, alpha=32) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 1e-5 |
| Scheduler | Cosine |
| Warmup Steps | 100 |
| Max Steps | 2000 |
| Batch Size | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 16 |
| Max Sequence Length | 2048 |
| Precision | bf16 |
| Optimizer | paged_adamw_32bit |
| Flash Attention | flash_attention_2 |
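
To make the scheduler settings concrete, the sketch below computes the learning rate at a given step using the CPT values from the table (peak 1e-5, 100 warmup steps, 2,000 total steps). It assumes the common shape of linear warmup followed by a half-cosine decay to zero; the card only says "Cosine", so the exact curve used in training may differ, and the function name `lr_at_step` is illustrative.

```python
import math


def lr_at_step(step, peak_lr=1e-5, warmup_steps=100, max_steps=2000):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumption: warmup is linear from 0 and the decay is a half cosine
    ending at 0; the card does not spell out either detail.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few points along the CPT run.
for s in (0, 50, 100, 1050, 2000):
    print(s, f"{lr_at_step(s):.2e}")
```

Under the same assumption, calling `lr_at_step(step, peak_lr=1e-4, warmup_steps=150, max_steps=1500)` would trace the SFT stage's schedule.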

CPT Results:

| Metric | Value |
| --- | --- |
| Starting Loss | 2.642 |
| Final Loss | 1.224 |
| Average Train Loss | 1.312 |
| Runtime | 3 hours ($5.37) |

Stage 2: Supervised Fine-Tuning (SFT)

The CPT-adapted model was further fine-tuned on structured instruction datasets to develop cybersecurity-specific question answering, incident response guidance, and knowledge retrieval capabilities.

SFT Datasets:

| Dataset | Base Examples | Oversampling | Final Examples | Coverage |
| --- | --- | --- | --- | --- |
| trendmicro-ailab/Primus-Instruct | 835 | ×4 | 3,340 | Cybersecurity task instructions |
| zefang-liu/secqa v1+v2 (dev/val/test splits) | 242 | ×10 | 2,420 | Structured cybersecurity Q&A |
| AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0 | 83,920 | ×1 | 83,920 | MITRE ATT&CK, NIST, IR, SIEM, cloud security, threat hunting |
| AlicanKiraz0/All-CVE-Records-Training-Dataset | 30,000 (sampled) | ×1 | 30,000 | CVE analysis, vulnerability assessment, remediation |
| Combined after filtering | | | 112,191 | |
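
The oversampling arithmetic in the table can be checked with a few lines of Python. The sketch below multiplies each base count by its factor, sums the result, and compares it with the reported post-filter total. It assumes filtering was applied to the oversampled mix; the card does not say which sources lost rows, so the percentages are shares of the pre-filter mix only.

```python
# (base examples, oversampling factor) as listed in the SFT dataset table
sft_sources = {
    "Primus-Instruct": (835, 4),
    "SecQA v1+v2": (242, 10),
    "Fenrir v2.0": (83_920, 1),
    "All-CVE-Records": (30_000, 1),
}

final_counts = {name: base * factor for name, (base, factor) in sft_sources.items()}
total_before_filtering = sum(final_counts.values())
combined_after_filtering = 112_191  # reported in the table
removed = total_before_filtering - combined_after_filtering

for name, count in final_counts.items():
    share = 100 * count / total_before_filtering
    print(f"{name:18s} {count:>7,d}  {share:5.1f}% of pre-filter mix")
print(f"pre-filter total: {total_before_filtering:,d}; removed by filtering: {removed:,d}")
```

Even after oversampling, the two small instruction sets together make up only a few percent of the mix, so the factors raise their presence without changing which datasets dominate by volume.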

SFT Configuration:

| Parameter | Value |
| --- | --- |
| Method | LoRA (r=32, alpha=64) |
| Target Modules | all-linear |
| Learning Rate | 1e-4 |
| Scheduler | Cosine |
| Warmup Steps | 150 |
| Max Steps | 1500 |
| Batch Size | 4 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 16 |
| Max Sequence Length | 1024 |
| Precision | bf16 |
| Optimizer | paged_adamw_32bit |
| Packing | False |
| Flash Attention | flash_attention_2 |
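
The step and batch settings also determine how much of each dataset a run actually sees. The sketch below works that out for both stages. It assumes a single GPU and one training example per sequence. Packing is listed as False for SFT, but the card does not say how CPT documents were chunked, so the CPT figure is a rough estimate.

```python
def sequences_seen(max_steps, batch_size, grad_accum):
    """Sequences consumed over a run, assuming one GPU and no packing."""
    return max_steps * batch_size * grad_accum


cpt_seen = sequences_seen(2000, 4, 4)
sft_seen = sequences_seen(1500, 4, 4)

cpt_epochs = cpt_seen / 40_000    # CPT corpus size from the CPT dataset table
sft_epochs = sft_seen / 112_191   # SFT size after filtering

print(f"CPT: {cpt_seen:,d} sequences, about {cpt_epochs:.2f} epochs")
print(f"SFT: {sft_seen:,d} sequences, about {sft_epochs:.2f} epochs")
```

Under these assumptions, neither stage completes a full pass over its data: both runs are bounded by max steps, not by epochs.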

SFT Results:

| Metric | Value |
| --- | --- |
| Starting Loss | 2.966 |
| Final Loss | 0.679 |
| Best Loss | 0.676 |
| Final Token Accuracy | 81.1% |
| Best Token Accuracy | 81.5% |
| Average Train Loss | 0.767 |
| Runtime | 2.2 hours ($3.94) |

Data Processing

All datasets underwent aggressive cleaning and filtering before training:

- ASCII normalization to remove non-standard characters
- HTML tag removal via regex matching
- Length filtering (200 to 10,000 characters) to remove malformed and excessively long examples
- AttackQA skip pattern filtering applied to raw text before cleaning to remove web-scraped exam content. Skip patterns included: brainly, chegg, quizlet, coursehero, choose two, choose one, which of the following, ccna, comptia, cissp, exam.
- Field-level pre-truncation on large AttackQA text fields (questions capped at 2,000 characters, answers at 3,000 characters)
- Consistent ChatML formatting across all SFT datasets using <|im_start|> and <|im_end|> tokens
- Literal \n replacement in CVE dataset assistant responses at the formatting stage before cleaning
- Consistent system prompt applied across Primus-Instruct and SecQA datasets to ensure uniform instruction following behavior across all dataset sources
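
The steps above can be sketched as a small standard-library pipeline. This is an illustrative reconstruction and not the author's actual script. The function names, the NFKD normalization step, the tag regex, and the case-insensitive substring matching are all assumptions; only the skip patterns, the length bounds, and the ChatML markers come from the list.

```python
import re
import unicodedata

# Skip patterns quoted from the list above (matched case-insensitively here).
SKIP_PATTERNS = [
    "brainly", "chegg", "quizlet", "coursehero", "choose two", "choose one",
    "which of the following", "ccna", "comptia", "cissp", "exam",
]
HTML_TAG = re.compile(r"<[^>]+>")  # assumed regex; the card only says "regex matching"
MIN_CHARS, MAX_CHARS = 200, 10_000


def has_skip_pattern(raw_text):
    """True if the raw (uncleaned) text contains any exam or web-scrape marker."""
    lowered = raw_text.lower()
    return any(pattern in lowered for pattern in SKIP_PATTERNS)


def clean_text(text):
    """ASCII-normalize the text, then strip HTML tags."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return HTML_TAG.sub("", text).strip()


def keep_example(raw_text):
    """Return the cleaned text, or None if the example should be dropped."""
    if has_skip_pattern(raw_text):  # applied to raw text, before cleaning
        return None
    cleaned = clean_text(raw_text)
    if not (MIN_CHARS <= len(cleaned) <= MAX_CHARS):
        return None
    return cleaned


def to_chatml(system, user, assistant):
    """Format one SFT example with ChatML role markers."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )
```

Two details matter in this sketch. Plain substring matching means "exam" also rejects text containing "example", so a real pipeline might prefer word-boundary matching. The tag regex would also strip ChatML markers, so cleaning has to run before `to_chatml`, not after.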

Training Infrastructure

| Component | Details |
| --- | --- |
| Cloud Provider | RunPod Secure Cloud |
| GPU | NVIDIA A100 80GB |
| CUDA Memory | PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| Attention | flash_attention_2 (training only) |
| CPT Cost | ~$5.37 |
| SFT Cost | ~$3.94 |
| Merge Cost | ~$0.50 |
| Total Cost | ~$9.81 |
| Total Training Time | ~5.2 hours |
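
As a quick consistency check on these figures, the sketch below derives the hourly rate implied by each stage's runtime and cost and recomputes the totals. The hourly rate is inferred from the reported numbers and is not stated anywhere in the card.

```python
# Runtimes and costs as reported in the CPT/SFT results and the table above
cpt_hours, cpt_cost = 3.0, 5.37
sft_hours, sft_cost = 2.2, 3.94
merge_cost = 0.50

cpt_rate = cpt_cost / cpt_hours  # implied USD per GPU-hour
sft_rate = sft_cost / sft_hours
total_cost = cpt_cost + sft_cost + merge_cost
total_hours = cpt_hours + sft_hours

print(f"implied rate: CPT ${cpt_rate:.2f}/h, SFT ${sft_rate:.2f}/h")
print(f"total: ${total_cost:.2f} over {total_hours:.1f} training hours")
```

Both stages imply the same rate to the cent, which suggests the per-stage costs were computed from runtime at a single GPU price.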

Performance

Inference Test Results

Zero artifacts were detected across all seven automated test prompts following training.

Qualitative performance by domain:

| Domain | Performance |
| --- | --- |
| Lateral movement artifact identification | Very Strong |
| Incident response guidance | Strong |
| NIST CSF five functions | Strong |
| General cybersecurity concepts | Strong |
| ATT&CK tactic and technique concepts | Strong |
| Vulnerability and exploit explanation | Strong |
| CVE vulnerability class explanation | Moderate |
| Specific ATT&CK technique ID recall | Limited |
| Specific CVE ID and version recall | Limited |

Training Progression

This model is the final run of an iterative fine-tuning effort spanning five Qwen model sizes (1.5B, 3B, 1.7B, 4B, and 8B), with progressively cleaner datasets and refined training configurations:

| Run | Model Size | Dataset Combination | SFT Loss | Token Accuracy | Artifacts |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.5B | Primus-FineWeb (noisy, web-scraped) | 1.089 | 75.3% | High |
| 2 | 1.5B | Cleaned curated datasets | 0.989 | 78.4% | Moderate |
| 3 | 3B | CVEfixes + DetectVul + CIRCL | 0.843 | 80.9% | Low |
| 4 | 3B | Primus-Seed curated | 0.914 | 78.5% | Low |
| 5 | 1.7B | AttackQA + NIST + Primus-Seed local | 0.919 | 78.2% | Low |
| 6 | 4B | AttackQA + NIST + Primus-Seed local | 0.914 | 78.5% | Low |
| 7 | 8B | AttackQA + NIST + Primus-Seed + Fenrir + CVE Records | 0.679 | 81.1% | None |

Key finding: Data quality had a larger effect on final performance than model size. Moving from noisy web-scraped datasets to curated, structured sources produced larger quality improvements than increasing parameter count alone. The pattern was consistent across the training runs for each model size, provided the data was thoroughly cleaned.
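
Two pairs of runs in the table isolate one variable each, which gives a rough check on this finding. The sketch below compares runs 1 and 2 (same 1.5B size, cleaner data) with runs 5 and 6 (same dataset combination, 1.7B versus 4B). Losses measured on different datasets are not strictly comparable, so the result is indicative only.

```python
# (run, model size in billions of parameters, SFT loss, token accuracy %)
runs = [
    (1, 1.5, 1.089, 75.3),
    (2, 1.5, 0.989, 78.4),
    (3, 3.0, 0.843, 80.9),
    (4, 3.0, 0.914, 78.5),
    (5, 1.7, 0.919, 78.2),
    (6, 4.0, 0.914, 78.5),
    (7, 8.0, 0.679, 81.1),
]
by_run = {run[0]: run for run in runs}


def loss_delta(run_a, run_b):
    """SFT loss change going from run_a to run_b (negative means improvement)."""
    return round(by_run[run_b][2] - by_run[run_a][2], 3)


data_effect = loss_delta(1, 2)  # same 1.5B model, noisy data -> cleaned data
size_effect = loss_delta(5, 6)  # same dataset combination, 1.7B -> 4B

print(f"cleaner data at fixed size: {data_effect:+.3f}")
print(f"larger model at fixed data: {size_effect:+.3f}")
```

In this comparison, the data-cleaning step moves the loss by roughly twenty times as much as the size increase.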

Usage

Requirements

pip install transformers torch accelerate

Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_path = "fiji55/qwen3-8b-cyber-knowledge"

tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

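# Treat the fill-in-the-middle and file-separator tokens as extra stop tokens
# so generation halts if the model emits one of them.
fim_tokens = [
    tokenizer.convert_tokens_to_ids("<|fim_middle|>"),
    tokenizer.convert_tokens_to_ids("<|fim_prefix|>"),
    tokenizer.convert_tokens_to_ids("<|fim_suffix|>"),
    tokenizer.convert_tokens_to_ids("<|file_sep|>"),
]
# Drop any token the tokenizer could not resolve to an ID.
fim_tokens = [t for t in fim_tokens if t is not None]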

model.generation_config = GenerationConfig(
    temperature=0.5,
    top_p=0.95,
    do_sample=True,
    max_new_tokens=1024,
    eos_token_id=[tokenizer.eos_token_id] + fim_tokens,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,
)

SYSTEM_PROMPT = "You are a knowledgeable cybersecurity assistant helping IT professionals with security analysis, threat intelligence, and incident response."

def clean_response(response):
    """Normalize escaped newlines, drop non-ASCII characters, and strip stray ChatML markers."""
    response = response.replace('\\n', '\n')
    response = response.encode("ascii", errors="ignore").decode("ascii")
    for artifact in ["<|im_end|>", "<|im_start|>"]:
        response = response.replace(artifact, "")
    return response.strip()

def chat(user_input):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            generation_config=model.generation_config,
        )
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return clean_response(response)

response = chat("What are the first steps when responding to a ransomware incident?")
print(response)

Limitations:

- Technique ID hallucination: The model demonstrates strong conceptual knowledge of MITRE ATT&CK techniques but may confuse or hallucinate specific technique IDs. Always verify technique IDs independently using the official MITRE ATT&CK website at attack.mitre.org.
- CVE ID accuracy: Specific CVE identifiers, affected versions, and patch dates may be inaccurate. This model should not be used as an authoritative source for CVE details — consult the NVD at nvd.nist.gov or the MITRE CVE database for accurate information.
- Knowledge cutoff: Training data has a fixed cutoff and does not reflect recent CVEs, newly discovered techniques, or emerging threat actor activity after the training data collection date.
- Not a replacement for analysts: This model is intended only as a knowledge support tool. All outputs should be reviewed by qualified security professionals before being acted upon in production environments.
- Factual recall vs conceptual knowledge: The model reliably explains cybersecurity concepts and frameworks but may hallucinate specific version numbers, dates, and identifiers. Use it only for conceptual guidance and verify all specific factual claims independently.
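
Because identifiers are the least reliable part of the output, one practical safeguard is to extract them from each response and queue them for manual lookup. The sketch below is a hypothetical helper and not part of the model's tooling. The function name, the regular expressions, and the sample sentence are illustrative, and the patterns only recognize the usual `T1234`, `T1234.001`, and `CVE-YYYY-NNNN` shapes.

```python
import re

# ATT&CK technique IDs look like T1021 or T1021.002; CVE IDs like CVE-2021-44228.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def identifiers_to_verify(response):
    """Collect identifiers in a model response that need an independent lookup."""
    return {
        "attack_techniques": sorted(set(TECHNIQUE_ID.findall(response))),
        "cves": sorted(set(CVE_ID.findall(response))),
    }


# Illustrative text standing in for a model response.
sample = "Adversaries may use SMB shares (T1021.002) after exploiting CVE-2021-44228."
print(identifiers_to_verify(sample))
```

The helper only finds identifiers; it cannot tell whether they are correct, so each one still has to be checked against attack.mitre.org or nvd.nist.gov.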

Known Strengths:

Based on inference testing across multiple cybersecurity domains:

- Lateral movement investigation guidance with specific Windows Event IDs, registry paths, memory forensics indicators, and behavioral analytics recommendations
- Structured incident response playbook guidance covering isolation, containment, eradication, and recovery phases
- Accurate NIST Cybersecurity Framework five-function explanation and practical application guidance
- ATT&CK tactic and technique conceptual explanations with detection strategy recommendations
- CVE vulnerability class and exploitation concept explanations
- SOC analyst workflow support for common investigation and triage tasks
- Zero output artifacts across all automated inference tests

Key Design Decisions:

- Data quality over quantity: Each dataset was carefully audited for web scraping artifacts, exam content, and formatting issues before inclusion. Datasets with confirmed artifact issues were excluded after empirical testing across seven training runs demonstrated consistent quality degradation from noisy sources.
- Iterative refinement: The final dataset combination and training configuration were determined through seven training runs across five model sizes (1.5B, 3B, 1.7B, 4B, and 8B) on progressively cleaner data. Each run informed the next through systematic evaluation of output quality, artifact presence, and loss trajectories.
- Balanced SFT dataset: Smaller high-quality datasets (Primus-Instruct ×4, SecQA ×10) were oversampled to prevent the larger Fenrir (83,920 examples) and CVE Records (30,000 examples) datasets from dominating the instruction following behavior of the final model.
- Conservative SFT learning rate: SFT used a reduced learning rate of 1e-4 compared to the more commonly used 2e-4 to mitigate catastrophic forgetting of the CPT knowledge base and reduce overfitting risk across 1,500 training steps.
- Consistent system prompt: A unified system prompt was applied across all SFT datasets to produce consistent assistant persona behavior regardless of which dataset’s instruction format was encountered during training.

Acknowledgements

This model was trained with open-source datasets and tools. Thanks to the teams behind AttackQA (SambaNova Systems), Primus-Seed and Primus-Instruct (TrendMicro AI Lab), the Fenrir v2.0 and CVE Records datasets (AlicanKiraz0), SecQA (Zefang Liu), and the NIST cybersecurity training data (Ethan Oliver Troy). Training ran on RunPod infrastructure. The model is built on Qwen3-8B-Base from Alibaba Group using the Hugging Face Transformers, PEFT, and TRL libraries.

License

The fine-tuned weights and training pipeline produced in the creation of this model are released under the Apache 2.0 license.

This model is built on Qwen3-8B-Base, which is subject to the Qwen License. Use of this model must comply with the terms of the Qwen License.

