---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---

# Cybersecurity NER Model

A spaCy NER model with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.

## Entity Types (9)

| Entity | Description | F1 Score |
|--------|-------------|----------|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |

**Overall: F1 69.4% | Precision 69.1% | Recall 69.8%**

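As a quick sanity check on the numbers above, the overall F1 can be recomputed from the reported precision and recall as their harmonic mean. This is only an illustrative sketch: the helper name `f1_score` is made up for this example, and the per-entity values are copied from the table rather than recomputed from any evaluation data.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in this card.
overall_f1 = f1_score(0.691, 0.698)
print(f"Overall F1: {overall_f1:.3f}")

# Per-entity F1 scores (in %) copied from the entity table.
per_entity_f1 = {
    "SECURITY_ROLE": 57.8,
    "TECHNICAL_SKILL": 54.7,
    "SECURITY_TOOL": 100.0,
    "CERTIFICATION": 100.0,
    "FRAMEWORK": 100.0,
    "THREAT_TYPE": 90.0,
    "ATTACK_TECHNIQUE": 100.0,
    "REGULATION": 100.0,
    "SECURITY_DOMAIN": 13.0,
}

# Unweighted (macro) mean over the nine entity types.
macro_f1 = sum(per_entity_f1.values()) / len(per_entity_f1)
print(f"Unweighted mean of per-entity F1: {macro_f1:.1f}%")
```

The recomputed overall F1 rounds to 0.694, matching the reported figure. The unweighted mean of the per-entity scores comes out higher (79.5%). One plausible reading, and this is an assumption since the card does not say how scores are aggregated, is that the overall figure is weighted by entity frequency, so the weaker but more common types pull it down.
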
## Training Data

- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs and job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts

## Usage

```python
import spacy

# Load model
nlp = spacy.load("path/to/model")

# Extract entities
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

Output:
```
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
```

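For more than a handful of documents, it is usually more convenient to group entities by label and to process texts in batches. The sketch below is one possible way to do that: `group_entities` and `extract_batch` are hypothetical helper names, not part of this model or of spaCy, and `"path/to/model"` is the same placeholder used in the usage example. `nlp.pipe` is spaCy's standard batch interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs into {label: [texts...]}, keeping input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


def extract_batch(model_path: str, texts: Iterable[str]) -> List[Dict[str, List[str]]]:
    """Load the pipeline once, then run it over many texts with nlp.pipe."""
    import spacy  # imported lazily so group_entities can be used without spaCy

    nlp = spacy.load(model_path)
    results = []
    for doc in nlp.pipe(texts):
        pairs = [(ent.text, ent.label_) for ent in doc.ents]
        results.append(group_entities(pairs))
    return results


# Grouping the four entities shown in the example output of this card.
example_pairs = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]
print(group_entities(example_pairs))
```

Calling `extract_batch("path/to/model", list_of_texts)` would return one such dictionary per input text; loading the pipeline once and reusing it avoids paying the transformer start-up cost for every document.
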
## Requirements

```
spacy>=3.8.0
spacy-transformers>=1.3.0
```

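If you want to confirm that an environment meets these minimums before loading the model, a small check along the following lines may help. It is a simplified sketch: `parse_version` and `unmet_requirements` are hypothetical helpers written for this example, and the version comparison only looks at leading numeric components, so a dedicated library such as `packaging` is a safer choice for pre-release or unusual version strings.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.8.0' into (3, 8, 0); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def unmet_requirements(minimums: Dict[str, str]) -> List[str]:
    """Return a human-readable problem for each package below its minimum."""
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package} is not installed (need >={minimum})")
            continue
        if parse_version(installed) < parse_version(minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


# Minimum versions listed in this card.
MINIMUMS = {"spacy": "3.8.0", "spacy-transformers": "1.3.0"}
```

Running `unmet_requirements(MINIMUMS)` returns an empty list when both packages are present at a sufficient version, and a list of messages otherwise.
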
## Use Cases

- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis

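To make the talent-matching use case more concrete, here is a minimal sketch of how extracted entities from a CV and a job description might be compared. The function `coverage_by_label`, the case-insensitive matching rule, and the sample data are all assumptions made for illustration; the card does not prescribe any particular matching method.

```python
from typing import Dict, Set


def coverage_by_label(
    cv: Dict[str, Set[str]], job: Dict[str, Set[str]]
) -> Dict[str, float]:
    """For each label in the job description, return the fraction of its
    entities that also appear (case-insensitively) in the CV."""
    result: Dict[str, float] = {}
    for label, wanted in job.items():
        if not wanted:
            continue  # nothing requested for this label
        have = {entity.lower() for entity in cv.get(label, set())}
        matched = {entity for entity in wanted if entity.lower() in have}
        result[label] = len(matched) / len(wanted)
    return result


# Hypothetical entities, grouped by label, as the model might extract them.
cv_entities = {
    "CERTIFICATION": {"CISSP"},
    "SECURITY_TOOL": {"Splunk", "Metasploit"},
}
job_entities = {
    "CERTIFICATION": {"CISSP", "OSCP"},
    "SECURITY_TOOL": {"splunk"},
}

print(coverage_by_label(cv_entities, job_entities))
# {'CERTIFICATION': 0.5, 'SECURITY_TOOL': 1.0}
```

Exact string matching is deliberately naive here; in practice you would likely normalize aliases (for example "Pentester" versus "Penetration Tester") before comparing.
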
## Limitations

- SECURITY_DOMAIN has low recall (7%) and needs more training data
- SECURITY_ROLE and TECHNICAL_SKILL F1 scores are below target; improvement is ongoing
- Trained primarily on English text

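Given these limitations, one option in a downstream pipeline is to route predictions for the weaker labels to manual review instead of trusting them outright. The sketch below assumes entities are available as `(text, label)` pairs; the name `split_for_review` and the choice of which labels count as weak are illustrative assumptions, not recommendations from the model authors.

```python
from typing import Iterable, List, Set, Tuple

Entity = Tuple[str, str]  # (text, label)

# Labels this card reports as weaker; adjust to your own tolerance.
WEAK_LABELS: Set[str] = {"SECURITY_DOMAIN", "SECURITY_ROLE", "TECHNICAL_SKILL"}


def split_for_review(
    pairs: Iterable[Entity], weak_labels: Set[str] = WEAK_LABELS
) -> Tuple[List[Entity], List[Entity]]:
    """Split entities into (accepted, needs_review) based on their label."""
    accepted: List[Entity] = []
    needs_review: List[Entity] = []
    for text, label in pairs:
        if label in weak_labels:
            needs_review.append((text, label))
        else:
            accepted.append((text, label))
    return accepted, needs_review


accepted, needs_review = split_for_review(
    [("CISSP", "CERTIFICATION"), ("Cloud Security", "SECURITY_DOMAIN")]
)
print(accepted)      # [('CISSP', 'CERTIFICATION')]
print(needs_review)  # [('Cloud Security', 'SECURITY_DOMAIN')]
```

Note that this only affects entities the model did predict. Low recall means some entities are never produced at all, and filtering cannot recover those.
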
## License

Apache 2.0

## Citation

```bibtex
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {HuggingFace},
}
```