metadata
license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - cybersecurity
  - soc
  - siem
  - mitre-attack
  - incident-response
  - threat-detection
  - security-operations
  - fine-tuned
  - qlora
  - unsloth
  - gguf
  - ollama
base_model: openai/gpt-oss-20b
model-index:
  - name: rhythmai-cybersec-20b
    results:
      - task:
          type: text-generation
          name: Cybersecurity Q&A
        metrics:
          - type: eval_loss
            value: 0.5773
            name: Validation Loss
          - type: train_loss
            value: 0.4873
            name: Training Loss
datasets:
  - AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0
  - Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset
pipeline_tag: text-generation

RhythmAI Cybersec 20B

A cybersecurity-specialized language model fine-tuned from OpenAI GPT-OSS-20B for Security Operations Center (SOC) tasks including alarm investigation, threat analysis, MITRE ATT&CK mapping, incident response, and log analysis.

Built for RhythmAI, an AI-powered SOC platform that integrates with LogRhythm SIEM.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/gpt-oss-20b (MoE, 21B total / 3.6B active params) |
| Architecture | Mixture of Experts (MoE) with MXFP4 native quantization |
| Fine-tuning Method | QLoRA (4-bit) via Unsloth |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training Precision | 4-bit QLoRA with BF16 compute |
| Context Length | 65,536 tokens (64K) |
| GGUF Format | MXFP4 (13 GB) |
| License | Apache 2.0 (inherited from GPT-OSS) |

Training Data

Fine-tuned on 9,702 curated cybersecurity examples drawn from 137,122 raw examples across the 2 public datasets listed below, aggressively filtered for SOC/SIEM relevance (7.1% acceptance rate):

| Source | Raw Size | After Filtering | Description |
|---|---|---|---|
| Fenrir v2.0 | 83,920 | ~5,000 | General cybersecurity Q&A |
| Trendyol Cybersecurity | 53,202 | ~5,000 | Instruction-tuned cybersecurity |

Filtering pipeline: keyword relevance scoring (at least 2 matches from a list of 60+ SOC-relevant terms), response length between 50 and 15,000 characters, and MD5-based deduplication. Average response length: 2,627 characters (~656 tokens).
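
The three filtering stages can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline: the keyword list is a hypothetical stand-in for the 60+ terms (which are not published here), the `prompt`/`response` field names are assumed, and hashing the normalized prompt plus response is only one plausible way to do MD5-based deduplication.

```python
import hashlib

# Hypothetical keyword list; the real pipeline uses 60+ SOC-relevant terms.
SOC_KEYWORDS = (
    "siem", "mitre", "incident response", "lateral movement",
    "event id", "firewall", "phishing", "malware",
)

MIN_KEYWORD_MATCHES = 2
MIN_RESPONSE_CHARS = 50
MAX_RESPONSE_CHARS = 15_000


def keyword_score(text: str) -> int:
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for keyword in SOC_KEYWORDS if keyword in lowered)


def filter_examples(examples):
    """Yield examples that pass the relevance, length, and duplicate checks."""
    seen_digests = set()
    for example in examples:
        prompt, response = example["prompt"], example["response"]
        # Stage 1: keep only examples with enough SOC-relevant keywords.
        if keyword_score(prompt + " " + response) < MIN_KEYWORD_MATCHES:
            continue
        # Stage 2: enforce the response length bounds (in characters).
        if not MIN_RESPONSE_CHARS <= len(response) <= MAX_RESPONSE_CHARS:
            continue
        # Stage 3 (assumption): deduplicate on the normalized prompt + response.
        normalized = (prompt + "\n" + response).strip().lower()
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        yield example
```

Substring matching is used here for brevity; a real keyword list would probably need word-boundary matching so that short terms do not match inside unrelated words.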

Split: 9,217 train (95%) / 485 validation (5%)
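
A 95/5 split of this size can be reproduced with a seeded shuffle, as in the sketch below. The seed value and the shuffle-then-slice approach are assumptions; the card does not say how the split was actually drawn.

```python
import random


def train_val_split(examples, val_fraction=0.05, seed=42):
    """Shuffle a copy of the examples and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Using integer IDs as stand-ins for the 9,702 curated examples.
train, val = train_val_split(range(9_702))
print(len(train), len(val))  # 9217 485
```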

Format: OpenAI-compatible chat format:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
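
For illustration, a record in this format can be written and checked with the standard library alone. The helper names below are hypothetical, and the sketch assumes single-turn examples stored one per line (JSONL) in system, user, assistant order.

```python
import json

EXPECTED_ROLES = ("system", "user", "assistant")


def to_chat_record(system: str, user: str, assistant: str) -> str:
    """Serialize one example as a single JSON line in the chat format."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }
    return json.dumps(record, ensure_ascii=False)


def is_valid_record(line: str) -> bool:
    """Check for exactly three messages with the expected roles and non-empty content."""
    try:
        messages = json.loads(line)["messages"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict):
            return False
        if message.get("role") != role or not message.get("content"):
            return False
    return True
```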

Cybersecurity Content Breakdown

MITRE ATT&CK Coverage

The training data references 424 unique MITRE ATT&CK technique IDs across all 14 tactics:

| Tactic | Examples | Coverage |
|---|---|---|
| Execution | 3,004 | 31.0% |
| Lateral Movement | 2,427 | 25.0% |
| Impact | 1,949 | 20.1% |
| Privilege Escalation | 1,637 | 16.9% |
| Persistence | 1,568 | 16.2% |
| Exfiltration | 1,425 | 14.7% |
| Defense Evasion | 1,277 | 13.2% |
| Collection | 1,080 | 11.1% |
| Reconnaissance | 900 | 9.3% |
| Discovery | 889 | 9.2% |
| Initial Access | 807 | 8.3% |
| Command and Control | 208 | 2.1% |
| Credential Access | 169 | 1.7% |
| Resource Development | 12 | 0.1% |

Most referenced techniques: T1078 (Valid Accounts, 1,451 examples), T1055 (Process Injection, 1,120), T1021 (Remote Services, 582), T1071 (Application Layer Protocol, 541), T1027 (Obfuscated Files or Information, 378), T1566 (Phishing, 378), T1059 (Command and Scripting Interpreter, 376), T1562 (Impair Defenses, 339), T1203 (Exploitation for Client Execution, 323), T1041 (Exfiltration Over C2 Channel, 322).
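
Per-technique example counts of this kind can be obtained by scanning each example for ATT&CK technique IDs. The sketch below is one plausible approach, not the method used for this card: it assumes IDs appear literally in the text and rolls sub-techniques (for example T1059.001) up to their parent technique. The `coverage` helper shows how the percentage columns relate to the 9,702-example total.

```python
import re
from collections import Counter

# Matches technique IDs such as T1078 and sub-technique IDs such as T1059.001.
TECHNIQUE_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def count_technique_examples(texts):
    """Count, per parent technique, how many texts mention it at least once."""
    counts = Counter()
    for text in texts:
        # A set ensures each technique is counted once per example.
        parents = {match.split(".")[0] for match in TECHNIQUE_RE.findall(text)}
        counts.update(parents)
    return counts


def coverage(example_count, total=9_702):
    """Share of the training set as a percentage, rounded to one decimal."""
    return round(100 * example_count / total, 1)


print(coverage(3_004))  # 31.0, the Execution row
```

Because a single example can reference several tactics or techniques, the coverage percentages overlap and add up to more than 100%.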

Attack Types & Threat Categories

| Attack Type | Examples | Coverage |
|---|---|---|
| Phishing & Social Engineering | 9,546 | 98.4% |
| Remote Code Execution | 5,620 | 57.9% |
| Lateral Movement | 2,427 | 25.0% |
| Privilege Escalation | 1,637 | 16.9% |
| PowerShell-based Attacks | 731 | 7.5% |
| Supply Chain Attacks | 653 | 6.7% |
| Credential Dumping (Mimikatz/LSASS) | 393 | 4.1% |
| Insider Threats | 376 | 3.9% |
| Zero-Day Exploits | 375 | 3.9% |
| Man-in-the-Middle | 294 | 3.0% |
| Brute Force / Credential Stuffing | 264 | 2.7% |
| C2 Communication | 228 | 2.4% |
| DDoS / Denial of Service | 217 | 2.2% |
| Backdoors | 203 | 2.1% |
| Rootkits | 180 | 1.9% |
| SQL Injection | 177 | 1.8% |
| Buffer Overflow | 144 | 1.5% |
| Cross-Site Scripting (XSS) | 127 | 1.3% |
| Fileless Malware | 116 | 1.2% |
| Living Off The Land (LOLBins) | 80 | 0.8% |
| DNS Tunneling | 57 | 0.6% |

Log Source & SIEM Knowledge

| Log Type | Examples | Coverage |
|---|---|---|
| Windows Event Logs (Event IDs) | 977 | 10.1% |
| Network Flow (NetFlow/PCAP) | 410 | 4.2% |
| IDS/IPS Alerts | 364 | 3.8% |
| Authentication Logs | 289 | 3.0% |
| Firewall Logs | 150 | 1.5% |
| DNS Logs | 123 | 1.3% |
| Syslog | 112 | 1.2% |

Security tools and platforms referenced (number of examples): Nmap (214), YARA rules (158), Microsoft Sentinel (120), Elastic/ELK (107), Wireshark (104), Splunk (70), Metasploit (65), Sigma rules (50), Snort/Suricata (45).

Compliance & Regulatory Frameworks

| Framework | Examples | Coverage |
|---|---|---|
| NIST (CSF/SP 800-series) | 9,620 | 99.2% |
| GDPR | 411 | 4.2% |
| HIPAA | 310 | 3.2% |
| OWASP | 304 | 3.1% |
| PCI-DSS | 152 | 1.6% |
| CIS Controls | 66 | 0.7% |
| ISO 27001 | 57 | 0.6% |
| SOC 2 | 35 | 0.4% |

Training Details

| Parameter | Value |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) |
| Framework | Unsloth 2026.3.3 + Transformers 5.2.0 |
| Epochs | 3 |
| Effective Batch Size | 8 (2 per device x 4 gradient accumulation) |
| Learning Rate | 2e-4 (cosine schedule, 5% warmup) |
| Optimizer | AdamW 8-bit |
| Weight Decay | 0.01 |
| Max Sequence Length | 4,096 (training) / 65,536 (inference) |
| Packing | Enabled (short examples packed together) |
| Gradient Checkpointing | Unsloth optimized (30% VRAM savings) |
| Total Steps | 3,459 |
| Training Time | ~12.5 hours |
| Trainable Parameters | 67M / 21B (0.32%) |
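
As a rough consistency check, the step count and learning-rate schedule can be derived from the values above. The sketch assumes each optimizer step consumes 8 unpacked training examples (packing would normally lower the count), that warmup steps are rounded up, and that the cosine schedule decays to zero after a linear warmup; the actual trainer configuration may differ.

```python
import math

TRAIN_EXAMPLES = 9_217
EFFECTIVE_BATCH = 2 * 4  # per-device batch size x gradient accumulation steps
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_RATIO = 0.05

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / EFFECTIVE_BATCH)  # 1153
total_steps = steps_per_epoch * EPOCHS                         # 3459
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)           # 173

# Share of parameters updated by the LoRA adapter: 67M out of 21B.
trainable_pct = 100 * 67e6 / 21e9  # about 0.32


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate peaks at 2e-4 at step 173 and reaches zero at step 3,459.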

Training Metrics

| Metric | Value |
|---|---|
| Final Training Loss | 0.4873 |
| Final Validation Loss | 0.5774 |
| Best Validation Loss | 0.5773 (step 3,000) |
| Initial Validation Loss | 0.7866 (step 100) |

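Validation loss improved from 0.7866 (step 100) to a best of 0.5773 (step 3,000) and was essentially unchanged at the end of training (0.5774). It finished about 0.09 above the final training loss (0.4873), so there is no sign of overfitting.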

Capabilities

This model is specialized for:

  • Alarm Investigation: Analyzing security alarms from SIEM platforms with contextual threat assessment
  • MITRE ATT&CK Mapping: Identifying tactics, techniques, and procedures (TTPs) from security events
  • Incident Response: Generating structured incident response playbooks and triage recommendations
  • Threat Analysis: Assessing threat severity, identifying indicators of compromise (IOCs)
  • Log Analysis: Interpreting Windows Event Logs, firewall logs, IDS/IPS alerts, and authentication logs
  • Detection Engineering: Suggesting detection rules and correlation logic
  • Compliance Guidance: NIST, PCI-DSS, HIPAA, GDPR security control recommendations

Usage

With Ollama (Recommended)

# Create the model from the GGUF (run inside gguf-q4_k_m_gguf/, where the Modelfile lives)
ollama create rhythmai-cybersec-20b -f Modelfile

# Run interactively
ollama run rhythmai-cybersec-20b "Analyze this security event: Multiple failed RDP login attempts from IP 203.0.113.45 targeting the domain controller, followed by a successful login and immediate PowerShell execution."

# Use via API
curl http://localhost:11434/api/chat -d '{
  "model": "rhythmai-cybersec-20b",
  "messages": [
    {"role": "system", "content": "You are a senior SOC analyst. Analyze security events and provide actionable recommendations."},
    {"role": "user", "content": "What MITRE ATT&CK techniques are associated with credential dumping?"}
  ]
}'
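
The same API call can be made from Python with only the standard library. This is a sketch that assumes Ollama is listening on its default local port; the helper names are illustrative. It sets `"stream": false` so the server returns a single JSON object rather than a stream of chunks, which is what `/api/chat` sends by default.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(user_prompt, system_prompt, model="rhythmai-cybersec-20b"):
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "stream": False,  # one JSON response instead of a stream of chunks
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt, system_prompt="You are a senior SOC analyst."):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_prompt, system_prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["message"]["content"]


# Example (requires a running Ollama server with the model created):
# print(chat("What MITRE ATT&CK techniques are associated with credential dumping?"))
```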

With Transformers + PEFT (Adapter)

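from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from the adapter/ subfolder
base_model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", device_map="auto")
model = PeftModel.from_pretrained(base_model, "SyedCode01/rhythmai-cybersec-20b", subfolder="adapter")
tokenizer = AutoTokenizer.from_pretrained("SyedCode01/rhythmai-cybersec-20b", subfolder="adapter")

messages = [
    {"role": "system", "content": "You are a cybersecurity expert specializing in SOC operations."},
    {"role": "user", "content": "Explain lateral movement in the context of MITRE ATT&CK."}
]
# add_generation_prompt=True appends the assistant header so the model produces a reply
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))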

Example Output

Prompt: "What is lateral movement in cybersecurity?"

Response: Lateral movement in cybersecurity refers to the techniques attackers use to navigate through a network or system after gaining initial access. Once an attacker compromises a single device or account -- often with relatively low privileges -- they employ lateral movement to expand their reach, elevate their access levels, and ultimately achieve broader objectives such as data exfiltration or system control. This process involves moving from one compromised asset to another, leveraging existing network paths, user credentials, and administrative tools. Attackers might use legitimate remote management protocols (e.g., RDP, SSH), stolen passwords, or exploit vulnerabilities in software to traverse the environment. It's a critical phase in advanced persistent threat (APT) campaigns, as it enables attackers to remain stealthy over extended periods by using normal administrative functions and legitimate credentials.

File Structure

rhythmai-cybersec-20b/
  adapter/                    # LoRA adapter weights (PEFT)
    adapter_config.json
    adapter_model.safetensors
    tokenizer.json
    tokenizer_config.json
  gguf-q4_k_m_gguf/          # GGUF for Ollama deployment
    gpt-oss-20b.MXFP4.gguf   # 13 GB quantized model
    Modelfile                 # Ollama model definition
  README.md                   # This model card

Intended Use

This model is designed for cybersecurity professionals, SOC analysts, and security teams who need AI assistance with:

  • Security alarm triage and investigation
  • Threat intelligence analysis
  • Incident response planning
  • Security posture assessment
  • MITRE ATT&CK framework mapping

Limitations

  • Domain-specific: Optimized for cybersecurity tasks; general knowledge may be less reliable than the base model
  • Not a replacement for human analysts: Outputs should be validated by qualified security professionals
  • Training data bias: Performance may vary for threats or attack patterns not well-represented in the training data
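  • Context window: Supports up to 65,536 tokens (64K). Fine-tuning used a maximum sequence length of 4,096 tokens, so the base model's long-context capability is expected to carry over, but behavior on much longer inputs was not exercised during training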
  • No real-time data: The model does not have access to real-time threat intelligence feeds

Citation

@misc{rhythmai-cybersec-20b,
  title={RhythmAI Cybersec 20B: A Fine-Tuned Cybersecurity Language Model},
  author={Syed Hasan Iqbal},
  year={2026},
  url={https://huggingface.co/SyedCode01/rhythmai-cybersec-20b},
  note={Fine-tuned from OpenAI GPT-OSS-20B for SOC operations}
}

Acknowledgments

  • OpenAI for the GPT-OSS-20B base model (Apache 2.0)
  • Unsloth for efficient QLoRA fine-tuning
  • AlicanKiraz0 for the Fenrir v2.0 cybersecurity dataset
  • Trendyol for the cybersecurity instruction tuning dataset