# Cybersecurity NER Model v8

Named Entity Recognition model for cybersecurity domain text, trained with spaCy v3.8 on custom training data.

## Model Description

Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.

## Performance

**Test Results (v8):**
- Pass Rate: 94% (62/66 tests)
- Dev F1 Score: 98.58%
- Precision: 98.71%
- Recall: 98.46%
- Training Steps: 11,500 (early stopping)
- Training Data: 2,223 examples
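
The F1 score is the harmonic mean of precision and recall, and the pass rate is the share of passing tests. A quick check that the figures above are mutually consistent, using only the numbers listed in this section:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures copied from the v8 results above (percentages and test counts).
precision, recall = 98.71, 98.46
passed, total = 62, 66

f1 = f1_score(precision, recall)
pass_rate = 100 * passed / total

print(f"F1: {f1:.2f}%")
print(f"Pass rate: {pass_rate:.1f}%")
```

This prints an F1 of 98.58% and a pass rate of 93.9%, which rounds to the 94% reported.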

**Entity Type Performance:**
| Entity Type | Test Pass Rate | Dev Set F1 |
|-------------|----------------|------------|
| CVE | 100% (3/3) | 100.00% |
| AUDIT_TERM | 75% (3/4) | 100.00% |
| SECURITY_TOOL | 100% (4/4) | 100.00% |
| CERTIFICATION | 100% (4/4) | 98.73% |
| SECURITY_ROLE | 100% (4/4) | 98.11% |
| FRAMEWORK | 100% (4/4) | 93.88% |
| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
| ACRONYM | 100% (4/4) | 100.00% |
| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
| THREAT_TYPE | 75% (3/4) | 95.24% |
| REGULATION | 75% (3/4) | 96.55% |
| CONTROL_ID | 100% (4/4) | - |
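
The per-type rows can be tallied to see which entity types still fail and how many tests the table accounts for. The sketch below uses only the counts from the table; the variable names are illustrative.

```python
# (passed, total) per entity type, copied from the table above.
per_type = {
    "CVE": (3, 3),
    "AUDIT_TERM": (3, 4),
    "SECURITY_TOOL": (4, 4),
    "CERTIFICATION": (4, 4),
    "SECURITY_ROLE": (4, 4),
    "FRAMEWORK": (4, 4),
    "TECHNICAL_SKILL": (4, 4),
    "ACRONYM": (4, 4),
    "SECURITY_DOMAIN": (4, 4),
    "ATTACK_TECHNIQUE": (3, 4),
    "THREAT_TYPE": (3, 4),
    "REGULATION": (3, 4),
    "CONTROL_ID": (4, 4),
}

# Entity types with at least one failing test.
failing = sorted(label for label, (ok, n) in per_type.items() if ok < n)
table_passed = sum(ok for ok, _ in per_type.values())
table_total = sum(n for _, n in per_type.values())

print(failing)
print(f"{table_passed}/{table_total} per-type tests pass")
```

The table covers 51 of the 66 tests (47 passing); this card does not break down the remaining 15 by entity type.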

## Entity Types

1. **CVE** - CVE identifiers (e.g., CVE-2024-1234)
2. **CERTIFICATION** - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
3. **FRAMEWORK** - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
4. **ATTACK_TECHNIQUE** - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
5. **TECHNICAL_SKILL** - Technical skills (Incident Response, Forensics, Penetration Testing)
6. **AUDIT_TERM** - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
7. **SECURITY_ROLE** - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
8. **THREAT_TYPE** - Threat types (APT, ransomware, phishing, DDoS, malware)
9. **ACRONYM** - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
10. **SECURITY_DOMAIN** - Security domains (Cloud Security, Network Security, Application Security)
11. **REGULATION** - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
12. **SECURITY_TOOL** - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
13. **CONTROL_ID** - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)
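
When integrating the model, it can help to confirm that the loaded pipeline exposes exactly these 13 labels. A minimal sketch follows; `check_labels` is a hypothetical helper, and the commented usage assumes the NER component is registered under spaCy's default name `"ner"`.

```python
EXPECTED_LABELS = frozenset({
    "CVE", "CERTIFICATION", "FRAMEWORK", "ATTACK_TECHNIQUE",
    "TECHNICAL_SKILL", "AUDIT_TERM", "SECURITY_ROLE", "THREAT_TYPE",
    "ACRONYM", "SECURITY_DOMAIN", "REGULATION", "SECURITY_TOOL",
    "CONTROL_ID",
})


def check_labels(model_labels):
    """Return (missing, unexpected) labels relative to EXPECTED_LABELS, sorted."""
    found = set(model_labels)
    missing = sorted(EXPECTED_LABELS - found)
    unexpected = sorted(found - EXPECTED_LABELS)
    return missing, unexpected


# Usage with a loaded pipeline (the path is a placeholder):
#   nlp = spacy.load("path/to/model")
#   missing, unexpected = check_labels(nlp.get_pipe("ner").labels)
print(check_labels(["CVE", "CERTIFICATION", "MALWARE_FAMILY"]))
```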

## Usage

```python
import spacy

# Load model
nlp = spacy.load("path/to/model")

# Extract entities
text = "CISSP certified professional with experience in Splunk and Metasploit"
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```

**Output:**
```
CISSP -> CERTIFICATION
Splunk -> SECURITY_TOOL
Metasploit -> SECURITY_TOOL
```
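
For downstream use such as CV skill extraction, entities are often easier to handle grouped by label. The sketch below uses a hypothetical helper, `group_by_label`, that works on anything with `text` and `label_` attributes; the stand-in objects mirror the output above so the snippet runs without the model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(ents):
    """Map each label to the list of entity texts, in document order."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins for doc.ents; with a loaded pipeline, pass nlp(text).ents instead.
ents = [
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="Metasploit", label_="SECURITY_TOOL"),
]

print(group_by_label(ents))
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```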

## Training Data

**Sources:**
- v7 merged data: 1,448 examples
- v8 generated: 1,347 examples with multi-entity patterns and case variants
- Manually curated: 100 examples
- Final dataset: 2,223 unique examples (after validation and deduplication)

**v8 Improvements:**
- Multi-entity "X and Y" patterns (50 examples per entity type)
- Case variants in upper, lower, and title case (CISSP, cissp, Cissp)
- Comma-separated list patterns
- AUDIT_TERM edge cases (Compliance audit)
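
The generation scripts are not part of this card, so the following is only a sketch of how "X and Y" and case-variant examples could be produced. The helper names and the template are assumptions; the annotation layout is the character-offset format that spaCy's `Example.from_dict` accepts.

```python
from itertools import combinations


def case_variants(term):
    """Upper, lower, and title-case forms of a term, without duplicates."""
    variants = []
    for variant in (term.upper(), term.lower(), term.title()):
        if variant not in variants:
            variants.append(variant)
    return variants


def and_pattern(first, second, label):
    """Build an 'X and Y' example with character offsets for both spans."""
    joiner = " and "
    text = f"{first}{joiner}{second}"
    second_start = len(first) + len(joiner)
    entities = [
        (0, len(first), label),
        (second_start, second_start + len(second), label),
    ]
    return text, {"entities": entities}


terms = ["CISSP", "OSCP", "CEH"]  # taken from the CERTIFICATION examples above
examples = [and_pattern(a, b, "CERTIFICATION") for a, b in combinations(terms, 2)]

print(case_variants("CISSP"))
print(examples[0])
```

Checking that `text[start:end]` reproduces each entity string is a cheap way to catch offset errors before training.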

**Entity Distribution** (share of the 2,636 annotations listed):
- AUDIT_TERM: 326 (12.4%)
- CERTIFICATION: 295 (11.2%)
- SECURITY_TOOL: 293 (11.1%)
- ATTACK_TECHNIQUE: 282 (10.7%)
- THREAT_TYPE: 263 (10.0%)
- TECHNICAL_SKILL: 228 (8.6%)
- REGULATION: 222 (8.4%)
- CVE: 182 (6.9%)
- FRAMEWORK: 165 (6.3%)
- SECURITY_ROLE: 153 (5.8%)
- ACRONYM: 142 (5.4%)
- SECURITY_DOMAIN: 85 (3.2%)
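
The percentages are each count divided by the sum of the listed counts. The sketch below recomputes them and reports the ratio between the most and least frequent labels, using only the counts above.

```python
# Annotation counts copied from the list above.
counts = {
    "AUDIT_TERM": 326,
    "CERTIFICATION": 295,
    "SECURITY_TOOL": 293,
    "ATTACK_TECHNIQUE": 282,
    "THREAT_TYPE": 263,
    "TECHNICAL_SKILL": 228,
    "REGULATION": 222,
    "CVE": 182,
    "FRAMEWORK": 165,
    "SECURITY_ROLE": 153,
    "ACRONYM": 142,
    "SECURITY_DOMAIN": 85,
}

total_annotations = sum(counts.values())
shares = {label: round(100 * n / total_annotations, 1) for label, n in counts.items()}
imbalance = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {counts[label]} ({share}%)")
print(f"Total: {total_annotations}, max/min ratio: {imbalance:.1f}")
```

The counts sum to 2,636 annotations, and the most frequent label (AUDIT_TERM) appears about 3.8 times as often as the least frequent (SECURITY_DOMAIN).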

## Training Configuration

- **Framework:** spaCy 3.8
- **Architecture:** tok2vec + TransitionBasedParser
- **GPU:** NVIDIA RTX 4090
- **Training steps:** 11,500 (early stopping)
- **Patience:** 5,000 steps
- **Learning rate:** 3e-05
- **Dropout:** 0.25
- **Batch size:** 1,000
- **Train/dev split:** 85/15
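
This card does not say how the 85/15 split was produced. One common approach is a seeded shuffle followed by a cut, sketched here under that assumption; `split_examples` and the seed value are hypothetical.

```python
import random


def split_examples(examples, train_fraction=0.85, seed=0):
    """Shuffle a copy of the examples and split it into (train, dev)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the 2,223 training examples.
train, dev = split_examples(range(2223))
print(len(train), len(dev))  # 1889 334
```

A fixed seed keeps the dev set stable across runs, which matters when comparing scores between model versions.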

## Version History

**v8 (Current):**
- 94% pass rate (62/66)
- Multi-entity extraction improved
- Title case support added
- AUDIT_TERM edge cases fixed

**v7:**
- 86% pass rate (57/66)
- CVE detection restored
- SECURITY_ROLE improved to 100%
- IDS/IPS and DDoS fixed

**v6:**
- 74% pass rate (49/66)
- CVE regression (missing)
- AUDIT_TERM and SECURITY_ROLE issues

## Known Limitations

v8 has 4 remaining test failures:
1. Multi-entity extraction in specific contexts ("APT group using ransomware")
2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
4. "Gap analysis" edge case
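
These cases can be kept as a small regression suite to rerun against future versions. In the sketch below the expected annotations are assumptions inferred from the entity type list, not the project's actual test fixtures, and `failing_cases` and `spacy_predict` are hypothetical helpers.

```python
# (text, expected set of (entity text, label)); expected values are assumed.
CASES = [
    ("APT group using ransomware",
     {("APT", "THREAT_TYPE"), ("ransomware", "THREAT_TYPE")}),
    ("XSS and CSRF mitigated",
     {("XSS", "ATTACK_TECHNIQUE"), ("CSRF", "ATTACK_TECHNIQUE")}),
    ("HIPAA and PCI-DSS standards",
     {("HIPAA", "REGULATION"), ("PCI-DSS", "REGULATION")}),
    ("Gap analysis", {("Gap analysis", "AUDIT_TERM")}),
]


def failing_cases(predict, cases=CASES):
    """Return the texts whose predicted (text, label) set differs from the expected one."""
    return [text for text, expected in cases if set(predict(text)) != expected]


def spacy_predict(nlp):
    """Adapt a loaded spaCy pipeline to the predict(text) interface used above."""
    return lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents]


# Stub predictor so the snippet runs without the model: it finds only "XSS".
def stub(text):
    return [("XSS", "ATTACK_TECHNIQUE")] if "XSS" in text else []


print(failing_cases(stub))
```

With a real pipeline, `failing_cases(spacy_predict(nlp))` returns the subset of these texts that is still mislabeled.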

## Use Cases

- CV/resume skill extraction
- Job description analysis
- Threat intelligence reports
- Compliance documentation
- Security audit reports
- Technical documentation
- Security training materials

## License

MIT

## Citation

```bibtex
@misc{cybersecurity-ner,
  title={Cybersecurity NER Model},
  author={PKI},
  year={2026},
  url={https://huggingface.co/pki/cybersecurity-ner}
}
```

## Contact

For issues or questions, please open an issue on GitHub.