| # Cybersecurity NER Model v8 | |
| Named Entity Recognition (NER) model for cybersecurity-domain text, trained with spaCy v3.8 on custom training data. | |
| ## Model Description | |
| Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents. | |
| ## Performance | |
| **Test Results (v8):** | |
| - Pass Rate: 94% (62/66 tests) | |
| - Dev F1 Score: 98.58% | |
| - Precision: 98.71% | |
| - Recall: 98.46% | |
| - Training Steps: 11,500 (early stopping) | |
| - Training Data: 2,223 examples | |
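The headline figures above can be cross-checked with a little arithmetic. The sketch below assumes the reported F1 is the usual harmonic mean of the listed precision and recall, and recomputes the pass rate from the 62/66 count; it uses only numbers stated in this card.

```python
# Sanity-check the reported metrics using only the numbers listed above.
precision = 98.71  # percent, as reported
recall = 98.46     # percent, as reported

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.2f}%")  # F1: 98.58%

# Pass rate is simply passed tests over total tests.
passed, total = 62, 66
pass_rate = 100 * passed / total
print(f"Pass rate: {pass_rate:.1f}%")  # Pass rate: 93.9%
```

The recomputed F1 agrees with the reported 98.58%, and 93.9% rounds to the 94% quoted above.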
| **Entity Type Performance:** | |
| | Entity Type | Test Pass Rate | Dev Set F1 | | |
| |-------------|----------------|------------| | |
| | CVE | 100% (3/3) | 100.00% | | |
| | AUDIT_TERM | 75% (3/4) | 100.00% | | |
| | SECURITY_TOOL | 100% (4/4) | 100.00% | | |
| | CERTIFICATION | 100% (4/4) | 98.73% | | |
| | SECURITY_ROLE | 100% (4/4) | 98.11% | | |
| | FRAMEWORK | 100% (4/4) | 93.88% | | |
| | TECHNICAL_SKILL | 100% (4/4) | 100.00% | | |
| | ACRONYM | 100% (4/4) | 100.00% | | |
| | SECURITY_DOMAIN | 100% (4/4) | 100.00% | | |
| | ATTACK_TECHNIQUE | 75% (3/4) | 98.70% | | |
| | THREAT_TYPE | 75% (3/4) | 95.24% | | |
| | REGULATION | 75% (3/4) | 96.55% | | |
| | CONTROL_ID | 100% (4/4) | - | | |
| ## Entity Types | |
| 1. **CVE** - CVE identifiers (e.g., CVE-2024-1234) | |
| 2. **CERTIFICATION** - Security certifications (CISSP, OSCP, CEH, CISM, Security+) | |
| 3. **FRAMEWORK** - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls) | |
| 4. **ATTACK_TECHNIQUE** - Attack methods (SQL injection, XSS, CSRF, buffer overflow) | |
| 5. **TECHNICAL_SKILL** - Technical skills (Incident Response, Forensics, Penetration Testing) | |
| 6. **AUDIT_TERM** - Audit/compliance terms (Risk assessment, Compliance audit, Security review) | |
| 7. **SECURITY_ROLE** - Job roles (CISO, SOC Analyst, Security Engineer, Pentester) | |
| 8. **THREAT_TYPE** - Threat types (APT, ransomware, phishing, DDoS, malware) | |
| 9. **ACRONYM** - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP) | |
| 10. **SECURITY_DOMAIN** - Security domains (Cloud Security, Network Security, Application Security) | |
| 11. **REGULATION** - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA) | |
| 12. **SECURITY_TOOL** - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark) | |
| 13. **CONTROL_ID** - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1) | |
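Some of these types have a rigid surface form that can be checked after extraction. As a minimal sketch, the helper below (a hypothetical name, not part of this model) tests whether a span labelled `CVE` matches the standard `CVE-YYYY-NNNN` layout with four or more sequence digits. It is a post-processing check under that assumption, not how the model itself recognises CVEs.

```python
import re

# CVE IDs follow "CVE-<4-digit year>-<4 or more digits>".
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def is_well_formed_cve(span_text: str) -> bool:
    """Return True if the whole span looks like a CVE identifier."""
    return CVE_PATTERN.fullmatch(span_text.strip()) is not None


print(is_well_formed_cve("CVE-2024-1234"))  # True
print(is_well_formed_cve("CVE-2024-12"))    # False (sequence number too short)
```

A check like this could flag spans where the model's boundaries drifted, for example a trailing punctuation mark included in the entity.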
| ## Usage | |
| ```python | |
| import spacy | |
| # Load model | |
| nlp = spacy.load("path/to/model") | |
| # Extract entities | |
| text = "CISSP certified professional with experience in Splunk and Metasploit" | |
| doc = nlp(text) | |
| for ent in doc.ents: | |
|     print(f"{ent.text} -> {ent.label_}") | |
| ``` | |
| **Output:** | |
| ``` | |
| CISSP -> CERTIFICATION | |
| Splunk -> SECURITY_TOOL | |
| Metasploit -> SECURITY_TOOL | |
| ``` | |
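For downstream use, such as building a skills profile from a CV, it is often handier to group the extracted entities by label. The sketch below assumes entities are passed as `(text, label)` pairs; `group_by_label` is an illustrative helper, not something shipped with the model.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [texts]}, keeping order and dropping duplicates."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With a loaded model, the pairs would come from:
#   pairs = [(ent.text, ent.label_) for ent in nlp(text).ents]
pairs = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Metasploit", "SECURITY_TOOL"),
]
print(group_by_label(pairs))
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```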
| ## Training Data | |
| **Sources:** | |
| - v7 merged data: 1,448 examples | |
| - v8 generated: 1,347 examples with multi-entity patterns and case variants | |
| - Manually curated: 100 examples | |
| - Final dataset: 2,223 unique examples (after validation and deduplication) | |
| **v8 Improvements:** | |
| - Multi-entity "X and Y" patterns (50 examples per entity type) | |
| - Case variants: upper, lower, and title case (CISSP, cissp, Cissp) | |
| - Comma-separated list patterns | |
| - AUDIT_TERM edge cases (Compliance audit) | |
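The generator used for v8 is not included in this card. As a rough sketch of how "X and Y" examples and case variants could be produced, the helpers below build a spaCy-style `(text, {"entities": [...]})` tuple with character offsets. The function names, the prefix text, and the joiner are all hypothetical.

```python
def case_variants(term):
    """Return the upper, lower, and title-case forms of a term, without duplicates."""
    return list(dict.fromkeys([term.upper(), term.lower(), term.title()]))


def make_pair_example(first, second, label, prefix="Experience with ", joiner=" and "):
    """Build one example: (text, {"entities": [(start, end, label), ...]})."""
    text = prefix + first + joiner + second
    # Offsets are computed from the pieces, so they stay correct for any terms.
    start_a = len(prefix)
    start_b = start_a + len(first) + len(joiner)
    entities = [
        (start_a, start_a + len(first), label),
        (start_b, start_b + len(second), label),
    ]
    return text, {"entities": entities}


print(case_variants("CISSP"))
# ['CISSP', 'cissp', 'Cissp']
print(make_pair_example("CISSP", "OSCP", "CERTIFICATION"))
# ('Experience with CISSP and OSCP', {'entities': [(16, 21, 'CERTIFICATION'), (26, 30, 'CERTIFICATION')]})
```

Computing offsets from the string pieces rather than searching the finished text avoids mislabelling when a term also appears inside the prefix.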
| **Entity Distribution:** | |
| - AUDIT_TERM: 326 (12.4%) | |
| - CERTIFICATION: 295 (11.2%) | |
| - SECURITY_TOOL: 293 (11.1%) | |
| - ATTACK_TECHNIQUE: 282 (10.7%) | |
| - THREAT_TYPE: 263 (10.0%) | |
| - TECHNICAL_SKILL: 228 (8.6%) | |
| - REGULATION: 222 (8.4%) | |
| - CVE: 182 (6.9%) | |
| - FRAMEWORK: 165 (6.3%) | |
| - SECURITY_ROLE: 153 (5.8%) | |
| - ACRONYM: 142 (5.4%) | |
| - SECURITY_DOMAIN: 85 (3.2%) | |
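The percentages above can be reproduced from the raw counts. The snippet below uses only the figures listed in this section; the total it prints is a count of entity annotations, which is presumably why it exceeds the 2,223 examples (one example can contain several entities).

```python
# Entity annotation counts, copied from the list above.
counts = {
    "AUDIT_TERM": 326,
    "CERTIFICATION": 295,
    "SECURITY_TOOL": 293,
    "ATTACK_TECHNIQUE": 282,
    "THREAT_TYPE": 263,
    "TECHNICAL_SKILL": 228,
    "REGULATION": 222,
    "CVE": 182,
    "FRAMEWORK": 165,
    "SECURITY_ROLE": 153,
    "ACRONYM": 142,
    "SECURITY_DOMAIN": 85,
}

total = sum(counts.values())
print(f"Total annotations: {total}")  # Total annotations: 2636

# Share of each label, rounded to one decimal place as in the list above.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {counts[label]} ({share}%)")
```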
| ## Training Configuration | |
| - **Framework:** spaCy 3.8 | |
| - **Architecture:** tok2vec + TransitionBasedParser (the architecture behind spaCy's transition-based `ner` component) | |
| - **GPU:** NVIDIA RTX 4090 | |
| - **Training steps:** 11,500 (early stopping) | |
| - **Patience:** 5,000 steps | |
| - **Learning rate:** 3e-05 | |
| - **Dropout:** 0.25 | |
| - **Batch size:** 1,000 | |
| - **Train/dev split:** 85/15 | |
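To illustrate how a patience of 5,000 steps relates to stopping at step 11,500, here is a minimal sketch of a generic patience rule: training stops at the first evaluation that lies at least `patience` steps past the best score seen so far. The evaluation interval (500 steps) and the scores are made up for illustration, and tie-breaking in spaCy's own training loop may differ.

```python
def stopping_step(evals, patience):
    """Return the step at which training stops, or None if the rule never triggers.

    evals: iterable of (step, score) pairs in increasing step order.
    """
    best_score, best_step = float("-inf"), 0
    for step, score in evals:
        if score > best_score:
            best_score, best_step = score, step
        # Stop once the best evaluation is at least `patience` steps behind us.
        if step - best_step >= patience:
            return step
    return None


# Made-up scores: steady improvement until step 6,500, then a plateau.
evals = [(s, 0.9858 if s >= 6500 else 0.90 + s / 100000) for s in range(500, 20001, 500)]
print(stopping_step(evals, patience=5000))  # 11500
```

Under this rule, stopping at 11,500 with a patience of 5,000 would put the best checkpoint near step 6,500; the card itself does not state the best step.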
| ## Version History | |
| **v8 (Current):** | |
| - 94% pass rate (62/66) | |
| - Multi-entity extraction improved | |
| - Title case support added | |
| - AUDIT_TERM edge cases fixed | |
| **v7:** | |
| - 86% pass rate (57/66) | |
| - CVE detection restored | |
| - SECURITY_ROLE improved to 100% | |
| - IDS/IPS and DDoS fixed | |
| **v6:** | |
| - 74% pass rate (49/66) | |
| - CVE regression (CVE entities not detected) | |
| - AUDIT_TERM and SECURITY_ROLE issues | |
| ## Known Limitations | |
| v8 has 4 remaining test failures: | |
| 1. Multi-entity extraction in specific contexts ("APT group using ransomware") | |
| 2. Span boundary issues with conjunctions ("XSS and CSRF mitigated") | |
| 3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards") | |
| 4. "Gap analysis" edge case | |
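Cases like these are easy to track with a small regression harness. The sketch below is hypothetical and is not the test suite behind the 62/66 figure: it compares predicted `(text, label)` pairs against expected ones and reports what was missed or added. The expected labels are inferred from the entity type list above, and "Gap analysis" is left out because its expected label is not stated here. `stub_predict` stands in for the real model so the example runs on its own.

```python
def run_cases(predict, cases):
    """Return (passed, total, failures) for a list of (text, expected) cases.

    predict: callable mapping text -> iterable of (entity_text, label) pairs.
    expected: set of (entity_text, label) pairs that must match exactly.
    Each failure is (text, missing_pairs, unexpected_pairs).
    """
    failures = []
    for text, expected in cases:
        found = set(predict(text))
        if found != expected:
            failures.append((text, expected - found, found - expected))
    return len(cases) - len(failures), len(cases), failures


cases = [
    ("APT group using ransomware", {("APT", "THREAT_TYPE"), ("ransomware", "THREAT_TYPE")}),
    ("XSS and CSRF mitigated", {("XSS", "ATTACK_TECHNIQUE"), ("CSRF", "ATTACK_TECHNIQUE")}),
    ("HIPAA and PCI-DSS standards", {("HIPAA", "REGULATION"), ("PCI-DSS", "REGULATION")}),
]

# Stand-in predictions; the second entry mimics a span that swallows the conjunction.
STUB_OUTPUT = {
    "APT group using ransomware": [("APT", "THREAT_TYPE"), ("ransomware", "THREAT_TYPE")],
    "XSS and CSRF mitigated": [("XSS and CSRF", "ATTACK_TECHNIQUE")],
    "HIPAA and PCI-DSS standards": [("HIPAA", "REGULATION"), ("PCI-DSS", "REGULATION")],
}


def stub_predict(text):
    return STUB_OUTPUT.get(text, [])


# With a loaded model, replace stub_predict with:
#   lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents]
passed, total, failures = run_cases(stub_predict, cases)
print(f"{passed}/{total} passed")  # 2/3 passed
for text, missing, unexpected in failures:
    print(f"FAIL: {text!r} missing={sorted(missing)} unexpected={sorted(unexpected)}")
```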
| ## Use Cases | |
| - CV/resume skill extraction | |
| - Job description analysis | |
| - Threat intelligence reports | |
| - Compliance documentation | |
| - Security audit reports | |
| - Technical documentation | |
| - Security training materials | |
| ## License | |
| MIT | |
| ## Citation | |
| ```bibtex | |
| @misc{cybersecurity-ner, | |
| title={Cybersecurity NER Model}, | |
| author={PKI}, | |
| year={2026}, | |
| url={https://huggingface.co/pki/cybersecurity-ner} | |
| } | |
| ``` | |
| ## Contact | |
| For issues or questions, please open an issue on GitHub. | |