---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---

# Cybersecurity NER Model

A spaCy NER pipeline with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text such as CVs and job descriptions.

## Entity Types (9)

| Entity | Description | F1 Score |
|--------|-------------|----------|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |

**Overall: F1 69.4% | Precision 69.1% | Recall 69.8%**
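
The overall figures can be sanity-checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and compares it with the unweighted mean of the per-entity F1 scores from the table. The variable names are illustrative only.

```python
# Reported overall precision and recall, as fractions.
precision, recall = 0.691, 0.698

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Per-entity F1 scores copied from the table above, as fractions.
per_label_f1 = {
    "SECURITY_ROLE": 0.578,
    "TECHNICAL_SKILL": 0.547,
    "SECURITY_TOOL": 1.0,
    "CERTIFICATION": 1.0,
    "FRAMEWORK": 1.0,
    "THREAT_TYPE": 0.90,
    "ATTACK_TECHNIQUE": 1.0,
    "REGULATION": 1.0,
    "SECURITY_DOMAIN": 0.13,
}

# Unweighted (macro) mean over the nine entity types.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

print(f"F1 from precision/recall: {f1:.3f}")
print(f"Unweighted mean of per-entity F1: {macro_f1:.3f}")
```

The unweighted mean (about 79.5%) is higher than the reported overall F1. This card does not state how the overall score was computed, but the gap would be consistent with a score pooled over all entity mentions, where the frequent, lower-scoring types carry more weight.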

## Training Data

- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs, job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts

## Usage

```python
import spacy

# Load model
nlp = spacy.load("path/to/model")

# Extract entities
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

Output:
```
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
```
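
Downstream code often wants entities grouped by type rather than as a flat list. A minimal sketch follows, assuming the entities have already been collected as `(text, label)` pairs. `group_entities` is a hypothetical helper, not part of this model or of spaCy.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [unique texts, in order seen]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With a loaded pipeline the pairs would come from:
#   pairs = [(ent.text, ent.label_) for ent in doc.ents]
# Here they are copied from the example output above.
pairs = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]

profile = group_entities(pairs)
for label, texts in profile.items():
    print(f"{label}: {', '.join(texts)}")
```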

## Requirements

```
spacy>=3.8.0
spacy-transformers>=1.3.0
```

## Use Cases

- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis
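
As an illustration of the talent-matching use case, the sketch below scores how many of a job description's extracted requirements also appear in a CV. The `coverage` function, the choice of labels, and the case-insensitive exact matching are all assumptions made for this example. A real matcher would likely need fuzzier comparison.

```python
def coverage(cv_ents, job_ents,
             labels=("CERTIFICATION", "SECURITY_TOOL", "TECHNICAL_SKILL")):
    """Return (score, missing): the share of job requirements found in the CV
    and the set of requirements that were not found."""
    def normalise(ents):
        # Lower-case the text so "Threat Hunting" matches "threat hunting".
        return {(text.lower(), label) for text, label in ents if label in labels}

    required = normalise(job_ents)
    found = normalise(cv_ents)
    if not required:
        return 1.0, set()
    missing = required - found
    return 1 - len(missing) / len(required), missing


# (text, label) pairs as they might be extracted from a CV and a job posting.
cv = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]
job = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Metasploit", "SECURITY_TOOL"),
    ("Threat Hunting", "TECHNICAL_SKILL"),
]

score, missing = coverage(cv, job)
print(f"Coverage: {score:.0%}, missing: {sorted(missing)}")
```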

## Limitations

- SECURITY_DOMAIN has low recall (7%) and an F1 of only 13%; it needs more training data
- SECURITY_ROLE (57.8%) and TECHNICAL_SKILL (54.7%) F1 scores are below target; improvement is ongoing
- Trained primarily on English text
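
Given the uneven per-entity scores, a consumer of the model may want to route predictions for weaker labels to manual review. The following is a minimal sketch. The `min_f1` threshold of 0.6 and the `split_by_reliability` helper are hypothetical, and the F1 values are copied from the table in this card.

```python
# Per-entity F1 from the Entity Types table, as fractions.
LABEL_F1 = {
    "SECURITY_ROLE": 0.578,
    "TECHNICAL_SKILL": 0.547,
    "SECURITY_TOOL": 1.0,
    "CERTIFICATION": 1.0,
    "FRAMEWORK": 1.0,
    "THREAT_TYPE": 0.90,
    "ATTACK_TECHNIQUE": 1.0,
    "REGULATION": 1.0,
    "SECURITY_DOMAIN": 0.13,
}


def split_by_reliability(pairs, min_f1=0.6):
    """Split (text, label) pairs into (trusted, needs_review) by label F1.

    Labels missing from LABEL_F1 are treated as unreliable.
    """
    trusted, needs_review = [], []
    for text, label in pairs:
        if LABEL_F1.get(label, 0.0) >= min_f1:
            trusted.append((text, label))
        else:
            needs_review.append((text, label))
    return trusted, needs_review


pairs = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]
trusted, needs_review = split_by_reliability(pairs)
print("trusted:", trusted)
print("needs review:", needs_review)
```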

## License

Apache 2.0

## Citation

```bibtex
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {Hugging Face},
}
```