---
language:
- en
license: apache-2.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- cybersecurity
- infosec
- security
- token-classification
pipeline_tag: token-classification
datasets:
- custom
model-index:
- name: cybersec-ner-roberta
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.694
      name: F1
    - type: precision
      value: 0.691
      name: Precision
    - type: recall
      value: 0.698
      name: Recall
---

# Cybersecurity NER Model

A spaCy NER pipeline with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.

## Entity Types (9)

| Entity | Description | F1 Score |
|--------|-------------|----------|
| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |

**Overall: F1 69.4% | Precision 69.1% | Recall 69.8%**
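
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the overall figures above can be cross-checked against each other. The helper name `f1_from_pr` below is illustrative only and is not part of the model or of spaCy.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in this card.
overall_f1 = f1_from_pr(0.691, 0.698)
print(f"{overall_f1:.3f}")  # 0.694
```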

## Training Data

- 1,500+ unique cybersecurity entities
- 1,000 synthetic training examples (CVs, job descriptions)
- RoBERTa backbone domain-adapted on 40K security texts

## Usage

```python
import spacy

# Load the model from a local directory (replace the placeholder path with your own)
nlp = spacy.load("path/to/model")

# Extract entities
doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

Output:
```
CISO: SECURITY_ROLE
CISSP: CERTIFICATION
Splunk: SECURITY_TOOL
threat hunting: TECHNICAL_SKILL
```
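
For downstream use it is often handy to group the extracted entities by label. The sketch below assumes only that each entity exposes `.text` and `.label_` attributes, as spaCy's `doc.ents` spans do. The helper `group_entities` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that mirror the sample output above so the snippet runs without the model installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(ents):
    """Group entity texts by label, keeping first-seen order and dropping duplicates."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in entities mirroring the sample output above.
# With the real model, pass `doc.ents` instead.
sample_ents = [
    SimpleNamespace(text="CISO", label_="SECURITY_ROLE"),
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="threat hunting", label_="TECHNICAL_SKILL"),
]

for label, texts in group_entities(sample_ents).items():
    print(f"{label}: {texts}")
```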

## Requirements

```
spacy>=3.8.0
spacy-transformers>=1.3.0
```
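
To confirm that an environment meets these minimums before loading the model, a small standard-library check can help. This is a simplified sketch: `parse_version` and `check_requirements` are hypothetical helpers, and the parser compares only the leading numeric components, so it ignores pre-release suffixes. A dedicated version-parsing library would be more robust.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements listed above.
REQUIRED = {"spacy": (3, 8, 0), "spacy-transformers": (1, 3, 0)}


def parse_version(text: str) -> tuple:
    """Parse leading numeric components, e.g. '3.8.2' -> (3, 8, 2); pads to 3 parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(required=REQUIRED) -> dict:
    """Return {package: status}, where status is 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = parse_version(version(name))
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed >= minimum else "too old"
    return report


print(check_requirements())
```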

## Use Cases

- Threat intelligence parsing
- Security talent matching (CV/job analysis)
- Skills inventory extraction
- Compliance document analysis

## Limitations

- SECURITY_DOMAIN has low recall (7%) and an F1 of 13%; it needs more training data
- SECURITY_ROLE (57.8% F1) and TECHNICAL_SKILL (54.7% F1) are below target; improvement is ongoing
- Trained primarily on English text

## License

Apache 2.0

## Citation

```bibtex
@misc{cybersec-ner-2024,
  author = {PKI},
  title = {Cybersecurity NER Model},
  year = {2024},
  publisher = {Hugging Face},
}
```