---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---
Cybersecurity NER Model
A spaCy NER model with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.
Entity Types (9)
| Entity | Description | F1 Score |
|---|---|---|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |
Overall: F1 69.4% | Precision 69.1% | Recall 69.8%
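The overall figures can be sanity-checked against each other, since F1 is the harmonic mean of precision and recall. A minimal sketch (the helper name `f1_score` is illustrative and not part of the model package):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall precision and recall as reported above.
overall_f1 = f1_score(0.691, 0.698)
print(f"{overall_f1:.3f}")  # 0.694
```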
Training Data
- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs, job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts
Usage
import spacy
# Load model
nlp = spacy.load("path/to/model")
# Extract entities
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
Output:
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
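For downstream use it is often convenient to group the extracted entities by label. The sketch below is one possible way to do that; `group_entities` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mirror the output above so the snippet runs without the model. In practice you would pass `doc.ents` directly.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities(ents):
    """Map each label to the unique entity texts found, in order of appearance.

    `ents` is any iterable of objects exposing `.text` and `.label_`,
    such as `doc.ents` from a spaCy pipeline.
    """
    grouped = defaultdict(list)
    for ent in ents:
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-ins mirroring the documented output; replace with `doc.ents`.
sample_ents = [
    SimpleNamespace(text="CISO", label_="SECURITY_ROLE"),
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="threat hunting", label_="TECHNICAL_SKILL"),
]
inventory = group_entities(sample_ents)
print(inventory["CERTIFICATION"])  # ['CISSP']
```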
Requirements
spacy>=3.8.0
spacy-transformers>=1.3.0
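Before loading the model, it can help to confirm that the installed packages meet these minimums. The following is a rough sketch using only the standard library; the helper names are illustrative, and the version parsing is deliberately simple (it compares leading numeric components only and ignores pre-release suffixes).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM_VERSIONS = {"spacy": "3.8.0", "spacy-transformers": "1.3.0"}

def parse_version(text):
    """Leading numeric components of a version string, e.g. '3.8.0' -> (3, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()

def at_least(installed, minimum):
    """True if `installed` >= `minimum`, padding the shorter version with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return {package: problem} for each missing or outdated package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not at_least(installed, minimum):
            problems[name] = f"found {installed}, need >={minimum}"
    return problems

print(check_requirements() or "all requirements satisfied")
```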
Use Cases
- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis
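As an illustration of the talent-matching use case, entities extracted from a CV can be compared with those extracted from a job description. This is only a sketch under simple assumptions: `skill_overlap` is a hypothetical helper, matching is exact after lowercasing, and the sample pairs are made up for the example. With the model loaded, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
def skill_overlap(cv_ents, job_ents):
    """Compare (text, label) pairs from a CV against those from a job description.

    Returns the matched pairs, the pairs the job asks for that the CV lacks,
    and the fraction of job entities covered by the CV.
    """
    cv = {(text.lower(), label) for text, label in cv_ents}
    job = {(text.lower(), label) for text, label in job_ents}
    matched = cv & job
    missing = job - cv
    coverage = len(matched) / len(job) if job else 0.0
    return matched, missing, coverage

# Illustrative inputs, not real model output.
cv_ents = [("CISSP", "CERTIFICATION"), ("Splunk", "SECURITY_TOOL")]
job_ents = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
    ("OSCP", "CERTIFICATION"),
]
matched, missing, coverage = skill_overlap(cv_ents, job_ents)
print(f"coverage: {coverage:.0%}")  # coverage: 50%
```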
Limitations
- SECURITY_DOMAIN has low recall (7%) and an F1 of 13%; it needs more training data
- SECURITY_ROLE and TECHNICAL_SKILL F1 scores (57.8% and 54.7%) are below target; improvement is ongoing
- Trained primarily on English text
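Until SECURITY_DOMAIN recall improves, one possible workaround is to supplement the model's predictions with a small term list. The sketch below is a plain-Python approximation of that idea and not part of the model: `DOMAIN_TERMS` and `find_domain_terms` are hypothetical names, and the two terms come from the entity table above. Character spans already predicted by the model can be passed as `known_spans`, for example `[(ent.start_char, ent.end_char) for ent in doc.ents]`. spaCy's `EntityRuler` component is an in-pipeline way to achieve something similar.

```python
import re

# Assumption: a hand-maintained term list; extend as needed.
DOMAIN_TERMS = ["Cloud Security", "Network Security"]

def find_domain_terms(text, known_spans=(), terms=DOMAIN_TERMS):
    """Return (start, end, 'SECURITY_DOMAIN') for term matches that do not
    overlap any (start, end) character span in `known_spans`."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end in known_spans)
            if not overlaps:
                found.append((m.start(), m.end(), "SECURITY_DOMAIN"))
    return sorted(found)

text = "Led cloud security and network security programs"
for start, end, label in find_domain_terms(text):
    print(f"{text[start:end]}: {label}")
```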
License
Apache 2.0
Citation
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {Hugging Face},
}