Qwen3-8B Cyber Knowledge

Qwen3-8B Cyber Knowledge is a cybersecurity domain-adapted language model fine-tuned from Qwen3-8B-Base for IT professionals and security practitioners. It was developed with a two-stage training pipeline, Continued Pre-Training (CPT) followed by Supervised Fine-Tuning (SFT), on curated, publicly available cybersecurity datasets covering MITRE ATT&CK, NIST frameworks, incident response, CVE analysis, and threat intelligence.

Model Details

Property	Value
Base Model	Qwen/Qwen3-8B-Base
Model Type	Causal Language Model
Language	English
Fine-tune License	Apache 2.0
Base Model License	Qwen License
Parameters	8B
Precision	bfloat16
Training Hardware	NVIDIA A100 80GB

Intended Use

This model is designed to assist IT security professionals with:

- MITRE ATT&CK framework guidance and technique explanation
- Incident response planning and playbook guidance
- NIST Cybersecurity Framework application
- CVE vulnerability analysis and explanation
- SOC analyst support for threat investigations
- General cybersecurity knowledge queries
- Lateral movement and threat artifact identification

This model is not intended for:

- Offensive security tooling or attack assistance
- Replacing trained security analysts in production environments
- Authoritative CVE ID or technique ID lookup without independent verification
- Real-time threat intelligence

Training Pipeline

Stage 1: Continued Pre-Training (CPT)

The base Qwen3-8B-Base model underwent continued pre-training on a set of curated cybersecurity datasets to build domain-specific knowledge before instruction tuning.

CPT Datasets:

Dataset	Examples	Coverage
`sambanovasystems/attackqa`	10,000 (sampled)	MITRE ATT&CK Q&A, technique descriptions, adversary TTPs
`ethanolivertroy/nist-cybersecurity-training`	10,000 (sampled)	NIST frameworks, security controls, compliance guidance
`trendmicro-ailab/Primus-Seed`	20,000 (sampled)	Broad cybersecurity knowledge corpus
Total	40,000

CPT Configuration:

Parameter	Value
Method	LoRA (r=16, alpha=32)
Target Modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning Rate	1e-5
Scheduler	Cosine
Warmup Steps	100
Max Steps	2000
Batch Size	4
Gradient Accumulation	4
Effective Batch Size	16
Max Sequence Length	2048
Precision	bf16
Optimizer	paged_adamw_32bit
Flash Attention	flash_attention_2
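As a rough illustration of what the CPT LoRA settings imply, the sketch below counts the adapter parameters added at r=16 across the seven target modules and computes the LoRA scaling factor (alpha / r). The layer dimensions used here (hidden size 4096, MLP size 12288, 36 layers, 32 query heads, 8 key/value heads, head dimension 128) are assumptions about the Qwen3-8B architecture that this card does not state; check them against the model's config.json before relying on the result.

```python
# Assumed Qwen3-8B dimensions (NOT taken from this card; verify in config.json).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_OUT = 32 * 128   # assumed query heads * head dimension
KV_OUT = 8 * 128   # assumed key/value heads * head dimension

# (in_features, out_features) of each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(rank: int, alpha: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


print(f"CPT adapter parameters (r=16): {lora_param_count(16):,}")
print(f"CPT scaling: {lora_scaling(16, 32)}, SFT scaling: {lora_scaling(32, 64)}")
```

Under these assumptions both stages use the same scaling factor, so the SFT stage differs from CPT mainly in adapter capacity and learning rate.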

CPT Results:

Metric	Value
Starting Loss	2.642
Final Loss	1.224
Average Train Loss	1.312
Runtime	~3 hours (~$5.37)

Stage 2: Supervised Fine-Tuning (SFT)

The CPT-adapted model was then fine-tuned on structured instruction datasets to develop cybersecurity-specific question answering, incident response guidance, and knowledge retrieval capabilities.

SFT Datasets:

Dataset	Base Examples	Oversampling	Final Examples	Coverage
`trendmicro-ailab/Primus-Instruct`	835	×4	3,340	Cybersecurity task instructions
`zefang-liu/secqa` v1+v2 (dev/val/test splits)	242	×10	2,420	Structured cybersecurity Q&A
`AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0`	83,920	×1	83,920	MITRE ATT&CK, NIST, IR, SIEM, cloud security, threat hunting
`AlicanKiraz0/All-CVE-Records-Training-Dataset`	30,000 (sampled)	×1	30,000	CVE analysis, vulnerability assessment, remediation
Combined after filtering			112,191
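The oversampling arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch: it assumes oversampling means plain duplication and that filtering was applied after the "Final Examples" counts were computed, neither of which the card states explicitly. The short dataset labels and the `oversample` helper are illustrative names only.

```python
# Base example counts and oversampling factors from the SFT dataset table.
SFT_SOURCES = {
    "Primus-Instruct": (835, 4),
    "SecQA": (242, 10),
    "Fenrir-v2.0": (83_920, 1),
    "All-CVE-Records": (30_000, 1),
}
COMBINED_AFTER_FILTERING = 112_191  # reported in the table


def oversample(examples: list, factor: int) -> list:
    """Repeat a dataset `factor` times (assumed: simple duplication)."""
    return examples * factor


def mixture_counts(sources: dict) -> dict:
    """Final example count per dataset after oversampling."""
    return {name: base * factor for name, (base, factor) in sources.items()}


counts = mixture_counts(SFT_SOURCES)
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,} ({n / total:.1%} of the mix)")
print(f"Total before filtering: {total:,}")
print(f"Removed by filtering:   {total - COMBINED_AFTER_FILTERING:,}")
```

Even after oversampling, the two smaller datasets remain a small share of the mixture, so the factors raise their weight without letting them dominate.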

SFT Configuration:

Parameter	Value
Method	LoRA (r=32, alpha=64)
Target Modules	all-linear
Learning Rate	1e-4
Scheduler	Cosine
Warmup Steps	150
Max Steps	1500
Batch Size	4
Gradient Accumulation	4
Effective Batch Size	16
Max Sequence Length	1024
Precision	bf16
Optimizer	paged_adamw_32bit
Packing	False
Flash Attention	flash_attention_2
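The step and batch settings in the CPT and SFT configuration tables determine how much of each dataset a run actually covers. The sketch below estimates that coverage, assuming a single GPU and one example per training sequence. Packing is listed as False for SFT; for CPT the card does not say, so the CPT figure is only indicative.

```python
def sequences_seen(max_steps: int, batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Sequences processed over a run: steps * effective batch size."""
    return max_steps * batch_size * grad_accum * num_gpus


def approx_epochs(max_steps: int, batch_size: int, grad_accum: int, dataset_size: int) -> float:
    """Fraction of the dataset covered, assuming one example per sequence."""
    return sequences_seen(max_steps, batch_size, grad_accum) / dataset_size


# Values from the CPT and SFT configuration and dataset tables.
cpt_epochs = approx_epochs(max_steps=2000, batch_size=4, grad_accum=4, dataset_size=40_000)
sft_epochs = approx_epochs(max_steps=1500, batch_size=4, grad_accum=4, dataset_size=112_191)

print(f"CPT: {sequences_seen(2000, 4, 4):,} sequences, ~{cpt_epochs:.2f} epochs")
print(f"SFT: {sequences_seen(1500, 4, 4):,} sequences, ~{sft_epochs:.2f} epochs")
```

Under these assumptions neither stage completes a full pass over its dataset, which is worth keeping in mind when comparing the reported losses.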

SFT Results:

Metric	Value
Starting Loss	2.966
Final Loss	0.679
Best Loss	0.676
Final Token Accuracy	81.1%
Best Token Accuracy	81.5%
Average Train Loss	0.767
Runtime	~2.2 hours (~$3.94)

Data Processing

All datasets underwent aggressive cleaning and filtering before training:

- ASCII normalization to remove non-standard characters
- HTML tag removal via regex matching
- Length filtering (200 to 10,000 characters) to remove malformed and excessively long examples
- AttackQA skip pattern filtering applied to raw text before cleaning to remove web-scraped exam content. Skip patterns included: brainly, chegg, quizlet, coursehero, choose two, choose one, which of the following, ccna, comptia, cissp, exam.
- Field-level pre-truncation on large AttackQA text fields (questions capped at 2,000 characters, answers at 3,000 characters)
- Consistent ChatML formatting across all SFT datasets using <|im_start|> and <|im_end|> tokens
- Literal \n replacement in CVE dataset assistant responses at the formatting stage before cleaning
- Consistent system prompt applied across Primus-Instruct and SecQA datasets to ensure uniform instruction following behavior across all dataset sources
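The cleaning steps listed above can be sketched with the standard library alone. This is an illustrative reconstruction, not the pipeline that was actually run. The function names, the NFKD normalization step, case-insensitive matching, and the use of word boundaries for skip patterns are all assumptions. Plain substring matching would also flag words such as "example" for the pattern "exam".

```python
import re
import unicodedata

# Skip patterns and length bounds as listed in the Data Processing section.
SKIP_PATTERNS = [
    "brainly", "chegg", "quizlet", "coursehero", "choose two", "choose one",
    "which of the following", "ccna", "comptia", "cissp", "exam",
]
MIN_CHARS, MAX_CHARS = 200, 10_000

# Assumption: whole-word, case-insensitive matching.
SKIP_RE = re.compile(
    "|".join(r"\b" + re.escape(p) + r"\b" for p in SKIP_PATTERNS), re.IGNORECASE
)
# Note: this pattern would also strip ChatML tokens such as <|im_start|>,
# so it must run before ChatML formatting is applied.
HTML_TAG = re.compile(r"<[^>]+>")


def has_skip_pattern(raw_text: str) -> bool:
    """Checked on raw text, before any cleaning."""
    return SKIP_RE.search(raw_text) is not None


def clean_text(text: str) -> str:
    """Remove HTML tags, reduce to ASCII, and collapse runs of spaces."""
    text = HTML_TAG.sub(" ", text)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[ \t]+", " ", text).strip()


def keep_example(raw_text: str):
    """Return the cleaned text, or None if the example should be dropped."""
    if has_skip_pattern(raw_text):
        return None
    cleaned = clean_text(raw_text)
    if not (MIN_CHARS <= len(cleaned) <= MAX_CHARS):
        return None
    return cleaned


sample = "<p>Lateral movement</p> " + "a" * 250
print(keep_example(sample)[:16])
```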

Training Infrastructure

Component	Details
Cloud Provider	RunPod Secure Cloud
GPU	NVIDIA A100 80GB
CUDA Memory	PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Attention	flash_attention_2 (training only)
CPT Cost	~$5.37
SFT Cost	~$3.94
Merge Cost	~$0.50
Total Cost	~$9.81
Total Training Time	~5.2 hours
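As a quick consistency check on the table, the reported costs and runtimes can be combined to recover the totals and an implied hourly rate. The runtimes are approximate, so the hourly figure is only indicative, and the total training time covers the two training stages but not the merge step.

```python
# (hours, cost in USD) per stage, from the CPT and SFT results tables.
STAGES = {"CPT": (3.0, 5.37), "SFT": (2.2, 3.94)}
MERGE_COST = 0.50

for name, (hours, cost) in STAGES.items():
    print(f"{name}: ~${cost / hours:.2f} per hour")

total_cost = round(sum(cost for _, cost in STAGES.values()) + MERGE_COST, 2)
total_hours = round(sum(hours for hours, _ in STAGES.values()), 1)
print(f"Total cost: ~${total_cost}, total training time: ~{total_hours} hours")
```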

Performance

Inference Test Results

Zero artifacts were detected across all seven automated test prompts after training.

Qualitative performance by domain:

Domain	Performance
Lateral movement artifact identification	Very Strong
Incident response guidance	Strong
NIST CSF five functions	Strong
General cybersecurity concepts	Strong
ATT&CK tactic and technique concepts	Strong
Vulnerability and exploit explanation	Strong
CVE vulnerability class explanation	Moderate
Specific ATT&CK technique ID recall	Limited
Specific CVE ID and version recall	Limited

Training Progression

This model is the final iteration of a fine-tuning pipeline run across five Qwen model sizes (1.5B, 3B, 1.7B, 4B, and 8B) with progressively cleaner datasets and refined training configurations:

Run	Model Size	Dataset Combination	SFT Loss	Token Accuracy	Artifacts
1	1.5B	Primus-FineWeb — noisy web scraped	1.089	75.3%	High
2	1.5B	Cleaned curated datasets	0.989	78.4%	Moderate
3	3B	CVEfixes + DetectVul + CIRCL	0.843	80.9%	Low
4	3B	Primus-Seed curated	0.914	78.5%	Low
5	1.7B	AttackQA + NIST + Primus-Seed local	0.919	78.2%	Low
6	4B	AttackQA + NIST + Primus-Seed local	0.914	78.5%	Low
7	8B	AttackQA + NIST + Primus-Seed + Fenrir + CVE Records	0.679	81.1%	None

Key finding: Data quality affected final performance more than model size did. Moving from noisy web-scraped datasets to curated, structured sources produced larger quality improvements than increasing parameter count alone. This held consistently across the training runs for each model size, as did the benefit of thorough data cleaning.

Usage

Requirements

pip install transformers torch accelerate

Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_path = "fiji55/qwen3-8b-cyber-knowledge"

tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.bfloat16,  # older transformers releases expect torch_dtype= instead
    device_map="auto",
)
model.eval()

fim_tokens = [
    tokenizer.convert_tokens_to_ids("<|fim_middle|>"),
    tokenizer.convert_tokens_to_ids("<|fim_prefix|>"),
    tokenizer.convert_tokens_to_ids("<|fim_suffix|>"),
    tokenizer.convert_tokens_to_ids("<|file_sep|>"),
]
fim_tokens = [t for t in fim_tokens if t is not None]

model.generation_config = GenerationConfig(
    temperature=0.5,
    top_p=0.95,
    do_sample=True,
    max_new_tokens=1024,
    eos_token_id=[tokenizer.eos_token_id] + fim_tokens,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,
)

SYSTEM_PROMPT = "You are a knowledgeable cybersecurity assistant helping IT professionals with security analysis, threat intelligence, and incident response."

def clean_response(response):
    response = response.replace('\\n', '\n')
    response = response.encode("ascii", errors="ignore").decode("ascii")
    for artifact in ["<|im_end|>", "<|im_start|>"]:
        response = response.replace(artifact, "")
    return response.strip()

def chat(user_input):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            generation_config=model.generation_config,
        )
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return clean_response(response)

response = chat("What are the first steps when responding to a ransomware incident?")
print(response)

Limitations:

- Technique ID hallucination: The model demonstrates strong conceptual knowledge of MITRE ATT&CK techniques but may confuse or hallucinate specific technique IDs. Always verify technique IDs independently using the official MITRE ATT&CK website at attack.mitre.org.
- CVE ID accuracy: Specific CVE identifiers, affected versions, and patch dates may be inaccurate. This model should not be used as an authoritative source for CVE details — consult the NVD at nvd.nist.gov or the MITRE CVE database for accurate information.
- Knowledge cutoff: Training data has a fixed cutoff and does not reflect recent CVEs, newly discovered techniques, or emerging threat actor activity after the training data collection date.
- Not a replacement for analysts: This model is intended only as a knowledge support tool. All outputs should be reviewed by qualified security professionals before being acted upon in production environments.
- Factual recall vs conceptual knowledge: The model reliably explains cybersecurity concepts and frameworks but may hallucinate specific version numbers, dates, and identifiers. Use it only for conceptual guidance and verify all specific factual claims independently.
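One way to act on these limitations is to pull every technique ID and CVE ID out of a response so each can be checked against attack.mitre.org or nvd.nist.gov before use. The helper below is a hypothetical sketch, not part of the model's tooling. The regular expressions assume the usual `T1234` / `T1234.001` and `CVE-YYYY-NNNN` formats, and the sample sentence is made up for illustration.

```python
import re

# Assumed identifier formats: ATT&CK technique / sub-technique IDs and CVE IDs.
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def identifiers_to_verify(response: str) -> dict:
    """Collect the unique IDs in a model response for manual verification."""
    return {
        "attack_techniques": sorted(set(TECHNIQUE_ID.findall(response))),
        "cves": sorted(set(CVE_ID.findall(response))),
    }


sample = "The actor used T1021.002 and T1059, then exploited CVE-2021-44228."
print(identifiers_to_verify(sample))
# {'attack_techniques': ['T1021.002', 'T1059'], 'cves': ['CVE-2021-44228']}
```

The helper only finds identifiers; it does not validate them, so an empty result does not mean the response is free of factual errors.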

Known Strengths:

Based on inference testing across multiple cybersecurity domains:

- Lateral movement investigation guidance with specific Windows Event IDs, registry paths, memory forensics indicators, and behavioral analytics recommendations
- Structured incident response playbook guidance covering isolation, containment, eradication, and recovery phases
- Accurate NIST Cybersecurity Framework five-function explanation and practical application guidance
- ATT&CK tactic and technique conceptual explanations with detection strategy recommendations
- CVE vulnerability class and exploitation concept explanations
- SOC analyst workflow support for common investigation and triage tasks
- Zero output artifacts across all automated inference tests

Key Design Decisions:

- Data quality over quantity: Each dataset was carefully audited for web scraping artifacts, exam content, and formatting issues before inclusion. Datasets with confirmed artifact issues were excluded after empirical testing across seven training runs demonstrated consistent quality degradation from noisy sources.
- Iterative refinement: The final dataset combination and training configuration were determined through seven training runs across five model sizes (1.5B, 3B, 1.7B, 4B, and 8B) on progressively cleaner data. Each run informed the next through systematic evaluation of output quality, artifact presence, and loss trajectories.
- Balanced SFT dataset: Smaller high-quality datasets (Primus-Instruct ×4, SecQA ×10) were oversampled to prevent the larger Fenrir (83,920 examples) and CVE Records (30,000 examples) datasets from dominating the instruction following behavior of the final model.
- Conservative SFT learning rate: SFT used a reduced learning rate of 1e-4 compared to the more commonly used 2e-4 to mitigate catastrophic forgetting of the CPT knowledge base and reduce overfitting risk across 1,500 training steps.
- Consistent system prompt: A unified system prompt was applied across all SFT datasets to produce consistent assistant persona behavior regardless of which dataset’s instruction format was encountered during training.
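To make the unified system prompt and ChatML formatting concrete, here is a minimal sketch of how one instruction/response pair might be rendered for SFT. The `to_chatml` helper is hypothetical, the whitespace layout follows the common ChatML convention, and reusing the inference-time system prompt for training is an assumption; the card does not give the exact training prompt.

```python
# Assumption: the training system prompt matches the one used at inference.
SYSTEM_PROMPT = (
    "You are a knowledgeable cybersecurity assistant helping IT professionals "
    "with security analysis, threat intelligence, and incident response."
)


def to_chatml(user: str, assistant: str, system: str = SYSTEM_PROMPT) -> str:
    """Render one example as system, user, and assistant turns in ChatML."""
    turns = [("system", system), ("user", user), ("assistant", assistant)]
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns
    )


example = to_chatml(
    user="What is the first step in ransomware response?",
    assistant="Isolate affected hosts from the network.",
)
print(example)
```

Applying the same function to every source dataset is what keeps the assistant persona uniform regardless of each dataset's original instruction format.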

Acknowledgements

This model was trained using open source datasets and tools. Thanks to the teams behind AttackQA (SambaNova Systems), Primus-Seed and Primus-Instruct (TrendMicro AI Lab), the Fenrir v2.0 and CVE Records datasets (AlicanKiraz0), SecQA (Zefang Liu), and the NIST cybersecurity training data (Ethan Oliver Troy). Training ran on RunPod infrastructure. The model is built on Qwen3-8B-Base from Alibaba Group using the Hugging Face Transformers, PEFT, and TRL libraries.

License

The fine-tuned weights and training pipeline produced in the creation of this model are released under the Apache 2.0 license.

This model is built on Qwen3-8B-Base, which is subject to the Qwen License. Use of this model must comply with the terms of the Qwen License.
