---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---

# Cybersecurity NER Model

A spaCy NER model with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.

## Entity Types (9)

| Entity | Description | F1 Score |
|--------|-------------|----------|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |

**Overall: F1 69.4% | Precision 69.1% | Recall 69.8%**

## Training Data

- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs, job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts

## Usage

```python
import spacy

# Load the trained pipeline from a local directory
nlp = spacy.load("path/to/model")

# Run the pipeline and print each extracted entity with its label
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

Output:

```
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
```

## Requirements

```
spacy>=3.8.0
spacy-transformers>=1.3.0
```

## Use Cases

- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis

## Limitations

- SECURITY_DOMAIN has low recall (7%) and needs more training data
- SECURITY_ROLE and TECHNICAL_SKILL F1 scores are below target; improvement is ongoing
- Trained primarily on English text

## License

Apache 2.0

## Citation

```bibtex
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {HuggingFace},
}
```
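
## Example: Grouping Entities by Label

For the skills inventory use case, it can help to collect the extracted entities per label. The following is a minimal sketch, not part of the model itself: `group_entities` is a hypothetical helper, and the sample pairs are copied from the Usage output rather than produced by running the model here.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [unique texts, first-seen order]}.

    Hypothetical helper for illustration; not shipped with the model.
    """
    grouped = defaultdict(list)
    for text, label in pairs:
        # Skip exact duplicates so each entity appears once per label
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Sample pairs taken from the Usage output. With a loaded pipeline, the same
# list could be built as [(ent.text, ent.label_) for ent in doc.ents].
sample = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]

inventory = group_entities(sample)
for label, texts in inventory.items():
    print(f"{label}: {', '.join(texts)}")
```

Deduplication here is exact-match only; case variants such as "Splunk" and "splunk" would be kept as separate entries unless the text is normalized first.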