---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---
# Cybersecurity NER Model
A spaCy NER model with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.
## Entity Types (9)
| Entity | Description | F1 Score |
|--------|-------------|----------|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |
**Overall: F1 69.4% | Precision 69.1% | Recall 69.8%**
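
As a quick sanity check on the overall numbers, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name `f1_score` is only illustrative and is not part of this model or of spaCy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported above, expressed as fractions.
overall_f1 = f1_score(0.691, 0.698)
print(f"{overall_f1:.3f}")  # 0.694
```

The result matches the reported overall F1 of 69.4%.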
## Training Data
- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs, job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts
## Usage
```python
import spacy

# Load the model from a local directory
nlp = spacy.load("path/to/model")

# Extract entities and print each one with its label
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```
Output:
```
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
```
## Requirements
```
spacy>=3.8.0
spacy-transformers>=1.3.0
```
## Use Cases
- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis
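
For the skills inventory and talent matching cases, one possible post-processing step is to group extracted entities by label. The following is a minimal sketch, not part of the model: `build_inventory` is a hypothetical helper, and it works on plain `(text, label)` pairs so it can run without the model loaded. The sample pairs mirror the usage output above, plus one made-up duplicate mention.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_inventory(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by label, skipping case-insensitive duplicates."""
    inventory: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for text, label in entities:
        key = (label, text.lower())
        if key in seen:
            continue
        seen.add(key)
        inventory[label].append(text)
    return dict(inventory)


# With a loaded pipeline, the pairs would come from:
#   [(ent.text, ent.label_) for ent in nlp(text).ents]
sample = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
    ("splunk", "SECURITY_TOOL"),  # hypothetical duplicate mention
]
inventory = build_inventory(sample)
print(inventory)
```

Keeping the first surface form seen for each entity is an arbitrary choice here; a real pipeline might normalize names against a reference list instead.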
## Limitations
- SECURITY_DOMAIN has low recall (7%) and needs more training data
- SECURITY_ROLE and TECHNICAL_SKILL F1 scores (57.8% and 54.7%) are below target; improvement is ongoing
- Trained primarily on English text
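
Given the uneven per-label scores, downstream code that needs reliable predictions may want to keep only the stronger entity types. The sketch below shows one way to do that, assuming the per-label F1 values from the table above. The helper `filter_by_label_f1` and the `0.8` threshold are illustrative assumptions, not recommendations from the model authors. Note that this only drops predictions from weaker labels; it does not recover entities the model missed.

```python
from typing import Iterable, List, Tuple

# Per-label F1 from the entity table, expressed as fractions.
LABEL_F1 = {
    "SECURITY_ROLE": 0.578,
    "TECHNICAL_SKILL": 0.547,
    "SECURITY_TOOL": 1.0,
    "CERTIFICATION": 1.0,
    "FRAMEWORK": 1.0,
    "THREAT_TYPE": 0.90,
    "ATTACK_TECHNIQUE": 1.0,
    "REGULATION": 1.0,
    "SECURITY_DOMAIN": 0.13,
}


def filter_by_label_f1(
    entities: Iterable[Tuple[str, str]], min_f1: float = 0.8
) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs whose label F1 is at least min_f1."""
    # Unknown labels are treated as F1 = 0.0 and therefore dropped.
    return [(text, label) for text, label in entities
            if LABEL_F1.get(label, 0.0) >= min_f1]


# With a loaded pipeline: [(ent.text, ent.label_) for ent in doc.ents]
sample = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Cloud Security", "SECURITY_DOMAIN"),
]
filtered = filter_by_label_f1(sample)
print(filtered)
```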
## License
Apache 2.0
## Citation
```bibtex
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {HuggingFace},
}
```