---
language: en
tags:
- spacy
- ner
- cybersecurity
- token-classification
license: mit
model-index:
- name: ner-cybersecurity
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.9831
      name: F1
    - type: precision
      value: 0.9792
      name: Precision
    - type: recall
      value: 0.9869
      name: Recall
---

# Cybersecurity NER Model

A spaCy named entity recognition (NER) model for the cybersecurity domain. Overall F1: 98.31%.

## Model Details

- **Version:** v5
- **Framework:** spaCy 3.8+
- **Training Date:** 2025-12-29
- **Examples:** 1922 (stratified 80/10/10 split)
- **Backbone:** Domain-adapted RoBERTa

## Entities (13)

| Entity | F1 | Examples |
|--------|-----|----------|
| CERTIFICATION | 100% | CISSP, OSCP, CEH |
| SECURITY_ROLE | 100% | CISO, SOC Analyst |
| SECURITY_TOOL | 100% | Splunk, Metasploit |
| ATTACK_TECHNIQUE | 100% | SQL Injection, XSS |
| FRAMEWORK | 100% | NIST CSF, ISO 27001 |
| THREAT_TYPE | 100% | APT, ransomware |
| AUDIT_TERM | 100% | Compliance, Audit |
| CVE | 100% | CVE-2021-44228 |
| SECURITY_DOMAIN | 99.10% | Cloud Security |
| TECHNICAL_SKILL | 95.30% | Incident Response |
| REGULATION | 94.44% | GDPR, HIPAA |
| ACRONYM | 88.89% | SIEM, EDR |
| CONTROL_ID | 0% | See hybrid approach (CONTROL_ID Handling) |

## Performance

**Metrics:**
- F1: 98.31%
- Precision: 97.92%
- Recall: 98.69%
- Inference: ~60 ms per document

**v5 changes from v4:**
- Tuned hyperparameters (dropout 0.25, L2 0.02)
- Improved REGULATION (+6.64 pp) and ACRONYM (+22.22 pp)
- Overall F1 +0.25 pp

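The overall metrics in this section can be sanity-checked, since F1 is the harmonic mean of precision and recall. The sketch below is illustrative only: `f1_score` is a helper name introduced here, and the inputs are the rounded figures from this card, so the result can differ from the reported F1 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in this model card.
precision, recall = 0.9792, 0.9869
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9830"
```

Recomputing from rounded inputs gives about 98.30%, which agrees with the reported 98.31% to within rounding of the precision and recall values.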
## CONTROL_ID Handling

The model's F1 for CONTROL_ID is 0% because the training data contains only 25 examples of this entity.

**Solution:** A hybrid approach that uses regex extraction for CONTROL_ID in production.

The regex patterns cover ISO 27001, NIST CSF, CIS Controls, SOC 2, and PCI-DSS.

See the service implementation for details.

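To make the regex side of the hybrid approach more concrete, here is a minimal sketch using only Python's `re` module. The patterns are assumptions based on common identifier formats (for example `A.9.2.3`, `PR.AC-1`, `CC6.1`), not the patterns used in the service implementation. The names `CONTROL_ID_PATTERNS` and `extract_control_ids` are hypothetical.

```python
import re

# Hypothetical patterns for illustration; the production patterns are not shown in this card.
CONTROL_ID_PATTERNS = {
    "ISO 27001": re.compile(r"\bA\.\d{1,2}(?:\.\d{1,2}){1,2}\b"),
    "NIST CSF": re.compile(r"\b(?:ID|PR|DE|RS|RC)\.[A-Z]{2}-\d{1,2}\b"),
    "CIS Controls": re.compile(r"\bCIS (?:Control )?\d{1,2}(?:\.\d{1,2})?\b"),
    "SOC 2": re.compile(r"\bCC\d\.\d{1,2}\b"),
    "PCI-DSS": re.compile(
        r"\bPCI[- ]DSS (?:Req(?:uirement)?\.? )?\d{1,2}(?:\.\d{1,2}){0,2}\b"
    ),
}


def extract_control_ids(text):
    """Return (start, end, matched_text, framework) tuples sorted by position."""
    hits = []
    for framework, pattern in CONTROL_ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(), framework))
    return sorted(hits)


sample = "Map A.9.2.3 to PR.AC-1 and CC6.1, then review CIS Control 5.1."
for start, end, value, framework in extract_control_ids(sample):
    print(f"{value:20} | CONTROL_ID ({framework})")
```

The character offsets are returned so that matches could be merged with the model's entities, for instance by converting them to spans with spaCy's `Doc.char_span` and resolving overlaps with `spacy.util.filter_spans`. That merging step is left out here to keep the sketch dependency-free.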
## Usage

```bash
# Quote the specifiers so the shell does not treat ">" as a redirect
pip install "spacy>=3.7.0" "spacy-transformers>=1.3.0"
```

```python
import spacy

# spacy.load takes an installed package name or a directory path;
# this assumes the model files are available locally at this path.
nlp = spacy.load("pki/ner-cybersecurity")
doc = nlp("CISO with CISSP, expert in Splunk and ISO 27001")

# Print each entity's text (padded to 20 characters) and its label
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_}")
```

**Output:**
```
CISO                 | SECURITY_ROLE
CISSP                | CERTIFICATION
Splunk               | SECURITY_TOOL
ISO 27001            | FRAMEWORK
```

## Use Cases

- Job/CV matching
- Threat intelligence extraction
- Compliance documentation parsing
- Security policy analysis

## Training Config

```ini
max_steps = 8000
dropout = 0.25
L2 = 0.02
learning_rate = 0.00003
hidden_width = 128
maxout_pieces = 3
batch_size = 128
```

## Limitations

- ACRONYM: lower F1 (88.89%) due to limited examples (46)
- CONTROL_ID: not recognized by the model (0% F1); requires the hybrid regex approach
- Domain-specific: optimized for cybersecurity text
- Some terms are ambiguous and depend on context

## License

MIT

## Version History

| Version | Date | F1 | Examples | Notes |
|---------|------|-----|----------|-------|
| v5 | 2025-12-29 | 98.31% | 1922 | Hyperparameter tuning |
| v4 | 2025-12-29 | 98.06% | 1922 | Stratified split, domain RoBERTa |
| v3 | 2025-01 | 69.4% | 1000 | spaCy 3.x migration |
| v2 | 2024-12 | 99.5%* | 1805 | spaCy 2.x (*train accuracy) |

## Contact

Report issues in the model repository.