# Cybersecurity NER Model v8

Named Entity Recognition model for cybersecurity domain text, trained with spaCy v3.8 on custom training data.

## Model Description

Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.

## Performance

**Test Results (v8):**

- Pass Rate: 94% (62/66 tests)
- Dev F1 Score: 98.58%
- Precision: 98.71%
- Recall: 98.46%
- Training Steps: 11,500 (early stopping)
- Training Data: 2,223 examples

**Entity Type Performance:**

| Entity Type | Test Pass Rate | Dev Set F1 |
|-------------|----------------|------------|
| CVE | 100% (3/3) | 100.00% |
| AUDIT_TERM | 75% (3/4) | 100.00% |
| SECURITY_TOOL | 100% (4/4) | 100.00% |
| CERTIFICATION | 100% (4/4) | 98.73% |
| SECURITY_ROLE | 100% (4/4) | 98.11% |
| FRAMEWORK | 100% (4/4) | 93.88% |
| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
| ACRONYM | 100% (4/4) | 100.00% |
| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
| THREAT_TYPE | 75% (3/4) | 95.24% |
| REGULATION | 75% (3/4) | 96.55% |
| CONTROL_ID | 100% (4/4) | - |

## Entity Types

1. **CVE** - CVE identifiers (e.g., CVE-2024-1234)
2. **CERTIFICATION** - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
3. **FRAMEWORK** - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
4. **ATTACK_TECHNIQUE** - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
5. **TECHNICAL_SKILL** - Technical skills (Incident Response, Forensics, Penetration Testing)
6. **AUDIT_TERM** - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
7. **SECURITY_ROLE** - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
8. **THREAT_TYPE** - Threat types (APT, ransomware, phishing, DDoS, malware)
9. **ACRONYM** - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
10. **SECURITY_DOMAIN** - Security domains (Cloud Security, Network Security, Application Security)
11. **REGULATION** - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
12. **SECURITY_TOOL** - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
13. **CONTROL_ID** - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)

## Usage

```python
import spacy

# Load model
nlp = spacy.load("path/to/model")

# Extract entities
text = "CISSP certified professional with experience in Splunk and Metasploit"
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```

**Output:**

```
CISSP -> CERTIFICATION
Splunk -> SECURITY_TOOL
Metasploit -> SECURITY_TOOL
```

## Training Data

**Sources:**

- v7 merged data: 1,448 examples
- v8 generated: 1,347 examples with multi-entity patterns, case variants
- Manually curated: 100 examples
- Final dataset: 2,223 unique examples (after validation and deduplication)

**v8 Improvements:**

- Multi-entity "X and Y" patterns (50 examples per entity type)
- Case variants, including title case (CISSP, cissp, Cissp)
- Comma-separated list patterns
- AUDIT_TERM edge cases (Compliance audit)

**Entity Distribution:**

- AUDIT_TERM: 326 (12.4%)
- CERTIFICATION: 295 (11.2%)
- SECURITY_TOOL: 293 (11.1%)
- ATTACK_TECHNIQUE: 282 (10.7%)
- THREAT_TYPE: 263 (10.0%)
- TECHNICAL_SKILL: 228 (8.6%)
- REGULATION: 222 (8.4%)
- CVE: 182 (6.9%)
- FRAMEWORK: 165 (6.3%)
- SECURITY_ROLE: 153 (5.8%)
- ACRONYM: 142 (5.4%)
- SECURITY_DOMAIN: 85 (3.2%)

## Training Configuration

- **Framework:** spaCy 3.8
- **Architecture:** tok2vec + TransitionBasedParser
- **GPU:** NVIDIA RTX 4090
- **Training steps:** 11,500 (early stopping)
- **Patience:** 5,000 steps
- **Learning rate:** 3e-05
- **Dropout:** 0.25
- **Batch size:** 1,000
- **Train/dev split:** 85/15

## Version History

**v8 (Current):**

- 94% pass rate (62/66)
- Multi-entity extraction improved
- Title case support added
- AUDIT_TERM edge cases fixed

**v7:**

- 86% pass rate (57/66)
- CVE detection restored
- SECURITY_ROLE improved to 100%
- IDS/IPS and DDoS fixed

**v6:**

- 74% pass rate (49/66)
- CVE regression (missing)
- AUDIT_TERM and SECURITY_ROLE issues

## Known Limitations

v8 has 4 remaining test failures:

1. Multi-entity extraction in specific contexts ("APT group using ransomware")
2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
4. "Gap analysis" edge case

## Use Cases

- CV/resume skill extraction
- Job description analysis
- Threat intelligence reports
- Compliance documentation
- Security audit reports
- Technical documentation
- Security training materials

## License

MIT

## Citation

```bibtex
@misc{cybersecurity-ner,
  title={Cybersecurity NER Model},
  author={PKI},
  year={2026},
  url={https://huggingface.co/pki/cybersecurity-ner}
}
```

## Contact

For issues or questions, please open an issue on GitHub.
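
## Example: Grouping Entities by Label

For use cases such as CV/resume skill extraction, it is often more convenient to collect the extracted entities per label than to print them one by one. The following is a minimal sketch, not part of the model package: `group_entities` is a hypothetical helper name, and the sample entities are stand-in objects that only mimic the `text` and `label_` attributes of spaCy spans so the snippet runs without the model. With the real model, `doc.ents` could be passed instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(ents):
    """Group entity texts by label (hypothetical helper).

    Keeps first-seen order and skips case-insensitive duplicates
    within the same label.
    """
    grouped = defaultdict(list)
    seen = set()
    for ent in ents:
        key = (ent.label_, ent.text.lower())
        if key in seen:
            continue
        seen.add(key)
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in entities (assumption: they mirror the Usage output, plus one
# lowercase duplicate). With the real model, pass doc.ents instead.
sample_ents = [
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="Metasploit", label_="SECURITY_TOOL"),
    SimpleNamespace(text="splunk", label_="SECURITY_TOOL"),
]

skills = group_entities(sample_ents)
print(skills)
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```

The case-insensitive deduplication is a design choice for this sketch: since v8 was trained on case variants, the same tool or certification may be extracted under several spellings from one document.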