# Cybersecurity NER Model v8
Named Entity Recognition (NER) model for cybersecurity text, trained with spaCy v3.8 on custom training data.
## Model Description
Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.
## Performance
**Test Results (v8):**
- Pass Rate: 94% (62/66 tests)
- Dev F1 Score: 98.58%
- Precision: 98.71%
- Recall: 98.46%
- Training Steps: 11,500 (early stopping)
- Training Data: 2,223 examples
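The reported F1 score is consistent with the precision and recall figures, since F1 is their harmonic mean. A minimal check, using only the rounded values listed above (the function name `f1_score` is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit for both inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dev-set figures from the list above, in percent.
precision = 98.71
recall = 98.46

f1 = f1_score(precision, recall)
print(f"F1: {f1:.2f}%")  # F1: 98.58%
```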
**Entity Type Performance:**
| Entity Type | Test Pass Rate | Dev Set F1 |
|-------------|----------------|------------|
| CVE | 100% (3/3) | 100.00% |
| AUDIT_TERM | 75% (3/4) | 100.00% |
| SECURITY_TOOL | 100% (4/4) | 100.00% |
| CERTIFICATION | 100% (4/4) | 98.73% |
| SECURITY_ROLE | 100% (4/4) | 98.11% |
| FRAMEWORK | 100% (4/4) | 93.88% |
| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
| ACRONYM | 100% (4/4) | 100.00% |
| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
| THREAT_TYPE | 75% (3/4) | 95.24% |
| REGULATION | 75% (3/4) | 96.55% |
| CONTROL_ID | 100% (4/4) | - |
## Entity Types
1. **CVE** - CVE identifiers (e.g., CVE-2024-1234)
2. **CERTIFICATION** - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
3. **FRAMEWORK** - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
4. **ATTACK_TECHNIQUE** - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
5. **TECHNICAL_SKILL** - Technical skills (Incident Response, Forensics, Penetration Testing)
6. **AUDIT_TERM** - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
7. **SECURITY_ROLE** - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
8. **THREAT_TYPE** - Threat types (APT, ransomware, phishing, DDoS, malware)
9. **ACRONYM** - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
10. **SECURITY_DOMAIN** - Security domains (Cloud Security, Network Security, Application Security)
11. **REGULATION** - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
12. **SECURITY_TOOL** - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
13. **CONTROL_ID** - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)
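Of the types above, CVE has a fixed syntax (`CVE-`, a four-digit year, then a sequence number of at least four digits), so predicted CVE spans can be sanity-checked after extraction. The sketch below is an optional post-filter, not part of the model; `is_well_formed_cve` is a hypothetical helper name.

```python
import re

# CVE identifier syntax: CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def is_well_formed_cve(text: str) -> bool:
    """Return True if the whole string is a syntactically valid CVE ID."""
    return CVE_PATTERN.fullmatch(text.strip()) is not None


print(is_well_formed_cve("CVE-2024-1234"))  # True
print(is_well_formed_cve("CVE-2024-12"))    # False
```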
## Usage
```python
import spacy

# Load the model from a local directory
nlp = spacy.load("path/to/model")

# Extract entities
text = "CISSP certified professional with experience in Splunk and Metasploit"
doc = nlp(text)

# Print each entity span with its predicted label
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```
**Output:**
```
CISSP -> CERTIFICATION
Splunk -> SECURITY_TOOL
Metasploit -> SECURITY_TOOL
```
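For use cases such as CV skill extraction, it is often convenient to group the extracted entities by label. The sketch below works on plain `(text, label)` pairs so that it runs without the model; with a loaded pipeline the pairs would come from `doc.ents`. The helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: [texts]}, dropping exact repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With a loaded pipeline:
#   entities = [(ent.text, ent.label_) for ent in doc.ents]
# Here the pairs are copied from the example output above.
entities = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Metasploit", "SECURITY_TOOL"),
]

print(group_by_label(entities))
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```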
## Training Data
**Sources:**
- v7 merged data: 1,448 examples
- v8 generated: 1,347 examples with multi-entity patterns and case variants
- Manually curated: 100 examples
- Final dataset: 2,223 unique examples (after validation and deduplication)
**v8 Improvements:**
- Multi-entity "X and Y" patterns (50 examples per entity type)
- Case variants: upper, lower, and title case (CISSP, cissp, Cissp)
- Comma-separated list patterns
- AUDIT_TERM edge cases (Compliance audit)
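The case-variant augmentation can be sketched as follows. This is an illustration of the idea, not the generation script used for v8: it rewrites only the entity span and keeps the character offsets, which is safe whenever the variant has the same length as the original. The function name and the `(text, {"entities": [...]})` tuple layout are assumptions chosen to resemble the dictionary form spaCy accepts for NER annotations.

```python
def entity_case_variants(text, start, end, label):
    """Return training tuples with the entity in upper, lower, and title case."""
    entity = text[start:end]
    examples = []
    seen = set()
    for variant in (entity.upper(), entity.lower(), entity.title()):
        # Skip duplicates and any variant that would shift the offsets.
        if variant in seen or len(variant) != len(entity):
            continue
        seen.add(variant)
        new_text = text[:start] + variant + text[end:]
        examples.append((new_text, {"entities": [(start, end, label)]}))
    return examples


for example_text, annotations in entity_case_variants(
    "CISSP certified analyst", 0, 5, "CERTIFICATION"
):
    print(example_text, annotations)
```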
**Entity Distribution** (share of the 2,636 annotated entities listed below):
- AUDIT_TERM: 326 (12.4%)
- CERTIFICATION: 295 (11.2%)
- SECURITY_TOOL: 293 (11.1%)
- ATTACK_TECHNIQUE: 282 (10.7%)
- THREAT_TYPE: 263 (10.0%)
- TECHNICAL_SKILL: 228 (8.6%)
- REGULATION: 222 (8.4%)
- CVE: 182 (6.9%)
- FRAMEWORK: 165 (6.3%)
- SECURITY_ROLE: 153 (5.8%)
- ACRONYM: 142 (5.4%)
- SECURITY_DOMAIN: 85 (3.2%)
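The percentages above are relative to the sum of the listed counts rather than to the number of training examples. A short recomputation from the counts in this section:

```python
# Entity counts copied from the distribution list above.
counts = {
    "AUDIT_TERM": 326,
    "CERTIFICATION": 295,
    "SECURITY_TOOL": 293,
    "ATTACK_TECHNIQUE": 282,
    "THREAT_TYPE": 263,
    "TECHNICAL_SKILL": 228,
    "REGULATION": 222,
    "CVE": 182,
    "FRAMEWORK": 165,
    "SECURITY_ROLE": 153,
    "ACRONYM": 142,
    "SECURITY_DOMAIN": 85,
}

total = sum(counts.values())
print(f"Total annotated entities: {total}")  # Total annotated entities: 2636

# Print each label with its share of the total, largest first.
for label, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {n} ({n / total:.1%})")
```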
## Training Configuration
- **Framework:** spaCy 3.8
- **Architecture:** tok2vec + TransitionBasedParser
- **GPU:** NVIDIA RTX 4090
- **Training steps:** 11,500 (early stopping)
- **Patience:** 5,000 steps
- **Learning rate:** 3e-05
- **Dropout:** 0.25
- **Batch size:** 1,000
- **Train/dev split:** 85/15
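The patience setting means training stops once the dev score has not improved for 5,000 steps. The sketch below illustrates that rule in plain Python; it is not spaCy's training loop, and the evaluation history is made up purely for illustration (the 500-step evaluation interval and the scores are assumptions, not values from the actual training log).

```python
def early_stop_step(history, patience):
    """Return the step at which patience-based early stopping triggers.

    history: (step, score) pairs in increasing step order.
    Returns None if the patience limit is never reached.
    """
    best_score = float("-inf")
    best_step = 0
    for step, score in history:
        if score > best_score:
            best_score, best_step = score, step
        elif step - best_step >= patience:
            return step
    return None


# Hypothetical history: the score rises until step 6,500 and then plateaus.
history = [(step, min(step, 6500) / 6500) for step in range(500, 20001, 500)]

print(early_stop_step(history, patience=5000))  # 11500
```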
## Version History
**v8 (Current):**
- 94% pass rate (62/66)
- Multi-entity extraction improved
- Title case support added
- AUDIT_TERM edge cases fixed
**v7:**
- 86% pass rate (57/66)
- CVE detection restored
- SECURITY_ROLE improved to 100%
- IDS/IPS and DDoS fixed
**v6:**
- 74% pass rate (49/66)
- CVE regression (CVE entities not detected)
- AUDIT_TERM and SECURITY_ROLE issues
## Known Limitations
v8 has 4 remaining test failures:
1. Multi-entity extraction in specific contexts ("APT group using ransomware")
2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
4. "Gap analysis" edge case
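Until the conjunction cases are fixed in the model, one possible workaround is a conservative post-processing step. The sketch below assumes the failure mode is a single predicted span that covers both items (for example "XSS and CSRF" as one entity); the actual failures may differ. It splits a span only when every part is found in a caller-supplied vocabulary, so that legitimate names containing "and" stay intact. `split_conjoined` and `known_terms` are hypothetical names.

```python
import re

# Separators between list items: "and", "or", or a comma.
_CONJUNCTION = re.compile(r"\s+(?:and|or)\s+|\s*,\s*", re.IGNORECASE)


def split_conjoined(text, label, known_terms):
    """Split an entity on conjunctions if all parts are known terms.

    known_terms: set of lowercase entity strings (supplied by the caller).
    Returns a list of (text, label) pairs.
    """
    parts = [part for part in _CONJUNCTION.split(text) if part]
    if len(parts) > 1 and all(part.lower() in known_terms for part in parts):
        return [(part, label) for part in parts]
    return [(text, label)]


known_terms = {"xss", "csrf", "hipaa", "pci-dss"}

print(split_conjoined("XSS and CSRF", "ATTACK_TECHNIQUE", known_terms))
# [('XSS', 'ATTACK_TECHNIQUE'), ('CSRF', 'ATTACK_TECHNIQUE')]
print(split_conjoined("Identity and Access Management", "SECURITY_DOMAIN", known_terms))
# [('Identity and Access Management', 'SECURITY_DOMAIN')]
```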
## Use Cases
- CV/resume skill extraction
- Job description analysis
- Threat intelligence reports
- Compliance documentation
- Security audit reports
- Technical documentation
- Security training materials
## License
MIT
## Citation
```bibtex
@misc{cybersecurity-ner,
title={Cybersecurity NER Model},
author={PKI},
year={2026},
url={https://huggingface.co/pki/cybersecurity-ner}
}
```
## Contact
For issues or questions, please open an issue on GitHub.