---
language: en
tags:
- spacy
- ner
- cybersecurity
- token-classification
license: mit
model-index:
- name: ner-cybersecurity
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.9831
      name: F1
    - type: precision
      value: 0.9792
      name: Precision
    - type: recall
      value: 0.9869
      name: Recall
---
# Cybersecurity NER Model
A spaCy named entity recognition (NER) model for cybersecurity text. It recognizes 13 entity types and reaches an overall F1 of 98.31%.
## Model Details
- **Version:** v5
- **Framework:** spaCy 3.8+
- **Training Date:** 2025-12-29
- **Examples:** 1922 (stratified 80/10/10 split)
- **Backbone:** Domain-adapted RoBERTa
## Entities (13)
| Entity | F1 | Sample mentions |
|--------|-----|----------|
| CERTIFICATION | 100% | CISSP, OSCP, CEH |
| SECURITY_ROLE | 100% | CISO, SOC Analyst |
| SECURITY_TOOL | 100% | Splunk, Metasploit |
| ATTACK_TECHNIQUE | 100% | SQL Injection, XSS |
| FRAMEWORK | 100% | NIST CSF, ISO 27001 |
| THREAT_TYPE | 100% | APT, ransomware |
| AUDIT_TERM | 100% | Compliance, Audit |
| CVE | 100% | CVE-2021-44228 |
| SECURITY_DOMAIN | 99.10% | Cloud Security |
| TECHNICAL_SKILL | 95.30% | Incident Response |
| REGULATION | 94.44% | GDPR, HIPAA |
| ACRONYM | 88.89% | SIEM, EDR |
| CONTROL_ID | 0% | See hybrid approach |
## Performance
**Metrics:**
- F1: 98.31%
- Precision: 97.92%
- Recall: 98.69%
- Inference: ~60 ms per document
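
As a quick consistency check, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall reported above. The helper name `f1_score` is illustrative and not part of this model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the metrics list above.
f1 = f1_score(0.9792, 0.9869)
print(f"{f1:.4f}")
```

Recomputing from the rounded inputs gives roughly 0.9830, which agrees with the reported 0.9831 to within rounding of the inputs.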
**v5 changes from v4:**
- Tuned hyperparameters (dropout 0.25, L2 0.02)
- Improved F1 on REGULATION (+6.64 pp) and ACRONYM (+22.22 pp)
- Overall F1 up 0.25 pp (98.06% to 98.31%)
## CONTROL_ID Handling
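The model scores 0% F1 on CONTROL_ID because the training data contains too few examples of it (25).
**Solution:** A hybrid approach. For production use, CONTROL_ID values are extracted with regular expressions instead of the model.
The patterns cover control identifiers from ISO 27001, NIST CSF, CIS Controls, SOC 2, and PCI-DSS.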
See service implementation for details.
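
The production patterns are not published in this README, so the following is only a minimal sketch of what the hybrid approach could look like. The regexes, the `CONTROL_ID_PATTERNS` table, and the `extract_control_ids` and `merge_entities` helpers are all hypothetical. They assume common citation styles such as `A.12.4.1` (ISO 27001 Annex A), `PR.AC-1` (NIST CSF), and `CC6.1` (SOC 2), and will need adjusting for real documents.

```python
import re

# Hypothetical patterns for illustration only; the real service may differ.
CONTROL_ID_PATTERNS = {
    "ISO 27001": re.compile(r"\bA\.\d{1,2}(?:\.\d{1,2}){1,2}\b"),            # A.12.4.1
    "NIST CSF": re.compile(r"\b(?:GV|ID|PR|DE|RS|RC)\.[A-Z]{2}-\d{1,2}\b"),  # PR.AC-1
    "CIS Controls": re.compile(r"\bCIS (?:Control|Safeguard) \d{1,2}(?:\.\d{1,2})?\b"),
    "SOC 2": re.compile(r"\bCC\d\.\d{1,2}\b"),                               # CC6.1
    "PCI-DSS": re.compile(
        r"\bPCI[- ]DSS (?:Requirement|Req\.?) \d{1,2}(?:\.\d{1,2}){0,2}\b"
    ),
}


def extract_control_ids(text):
    """Return (start, end, label, framework) tuples sorted by position."""
    hits = []
    for framework, pattern in CONTROL_ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), "CONTROL_ID", framework))
    return sorted(hits)


def merge_entities(model_ents, regex_hits):
    """Combine (start, end, label) model spans with regex hits.

    Model spans that overlap a regex hit are dropped, so the regex wins
    for CONTROL_ID, which the model does not predict reliably.
    """
    merged = [(start, end, label) for start, end, label, _ in regex_hits]
    for start, end, label in model_ents:
        overlaps = any(
            start < r_end and r_start < end for r_start, r_end, *_ in regex_hits
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Mapped A.12.4.1 to PR.AC-1 and CC6.1."
hits = extract_control_ids(text)
print([text[start:end] for start, end, _, _ in hits])
```

The character offsets returned here can be attached to a spaCy `Doc` with `doc.char_span(start, end, label="CONTROL_ID")` if the regex results should appear alongside the model's entities.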
## Usage
```bash
# Quote the requirements so the shell does not treat ">" as a redirect
pip install "spacy>=3.7.0" "spacy-transformers>=1.3.0"
```
```python
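import spacy
from huggingface_hub import snapshot_download

# spacy.load() takes an installed package name or a local path, not a Hub repo ID.
# Assumption: the repository root is a spaCy pipeline directory (contains config.cfg).
model_dir = snapshot_download("pki/ner-cybersecurity")
nlp = spacy.load(model_dir)

doc = nlp("CISO with CISSP, expert in Splunk and ISO 27001")
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_}")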
```
**Output:**
```
CISO                 | SECURITY_ROLE
CISSP                | CERTIFICATION
Splunk               | SECURITY_TOOL
ISO 27001            | FRAMEWORK
```
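
Downstream code often needs the entities grouped by type rather than listed in document order. The sketch below shows one way to do that with plain `(text, label)` pairs, so it runs without loading the model. The `group_by_label` helper is illustrative and not part of this repository.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each label to its entity texts, de-duplicated in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Pairs as they would come from [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("ISO 27001", "FRAMEWORK"),
]
print(group_by_label(entities))
```

With a loaded pipeline, the same helper can be fed directly from `doc.ents`.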
## Use Cases
- Job/CV matching
- Threat intelligence extraction
- Compliance documentation parsing
- Security policy analysis
## Training Config
```ini
max_steps = 8000
dropout = 0.25
L2 = 0.02
learning_rate = 0.00003
hidden_width = 128
maxout_pieces = 3
batch_size = 128
```
## Limitations
- ACRONYM: Lower F1 (88.89%) - limited examples (46)
- CONTROL_ID: The model alone scores 0% F1; use the hybrid regex approach
- Domain-specific: Optimized for cybersecurity text
- Some terms are ambiguous, and their label depends on context
## License
MIT
## Version History
| Version | Date | F1 | Examples | Notes |
|---------|------|-----|----------|-------|
| v5 | 2025-12-29 | 98.31% | 1922 | Hyperparameter tuning |
| v4 | 2025-12-29 | 98.06% | 1922 | Stratified split, domain RoBERTa |
| v3 | 2025-01 | 69.4% | 1000 | spaCy 3.x migration |
| v2 | 2024-12 | 99.5%* | 1805 | spaCy 2.x (*train accuracy) |
## Contact
Report issues in the model repository.