# Cybersecurity NER Model v8
Named Entity Recognition (NER) model for cybersecurity domain text, trained with spaCy v3.8 on custom training data.
## Model Description
Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.
## Performance
**Test Results (v8):**
- Pass Rate: 94% (62/66 tests)
- Dev F1 Score: 98.58%
- Precision: 98.71%
- Recall: 98.46%
- Training Steps: 11,500 (early stopping)
- Training Data: 2,223 examples
**Entity Type Performance:**
| Entity Type | Test Pass Rate | Dev Set F1 |
|-------------|----------------|------------|
| CVE | 100% (3/3) | 100.00% |
| AUDIT_TERM | 75% (3/4) | 100.00% |
| SECURITY_TOOL | 100% (4/4) | 100.00% |
| CERTIFICATION | 100% (4/4) | 98.73% |
| SECURITY_ROLE | 100% (4/4) | 98.11% |
| FRAMEWORK | 100% (4/4) | 93.88% |
| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
| ACRONYM | 100% (4/4) | 100.00% |
| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
| THREAT_TYPE | 75% (3/4) | 95.24% |
| REGULATION | 75% (3/4) | 96.55% |
| CONTROL_ID | 100% (4/4) | - |
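
The headline figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not part of the model's evaluation code; `f1_score` is a helper name introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dev-set precision and recall reported above, in percent.
print(f"F1 = {f1_score(98.71, 98.46):.2f}%")  # F1 = 98.58%

# Test-suite pass rate: 62 of 66 cases, which rounds to the reported 94%.
print(f"Pass rate = {62 / 66:.1%}")  # Pass rate = 93.9%
```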
## Entity Types
1. **CVE** - CVE identifiers (e.g., CVE-2024-1234)
2. **CERTIFICATION** - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
3. **FRAMEWORK** - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
4. **ATTACK_TECHNIQUE** - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
5. **TECHNICAL_SKILL** - Technical skills (Incident Response, Forensics, Penetration Testing)
6. **AUDIT_TERM** - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
7. **SECURITY_ROLE** - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
8. **THREAT_TYPE** - Threat types (APT, ransomware, phishing, DDoS, malware)
9. **ACRONYM** - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
10. **SECURITY_DOMAIN** - Security domains (Cloud Security, Network Security, Application Security)
11. **REGULATION** - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
12. **SECURITY_TOOL** - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
13. **CONTROL_ID** - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)
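
Some entity types have a rigid surface form. CVE identifiers, for instance, follow the pattern `CVE-<year>-<sequence>`, so a regular expression can serve as an independent sanity check on the model's CVE output. The sketch below is an illustration only: the model itself is statistical, and `find_cve_ids` is a hypothetical helper, not something shipped with it.

```python
import re

# CVE IDs: "CVE-", a 4-digit year, "-", then a sequence number of 4 or more digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def find_cve_ids(text: str) -> list[str]:
    """Return every CVE identifier found in the text, in order of appearance."""
    return CVE_PATTERN.findall(text)


print(find_cve_ids("Patched CVE-2024-1234 and CVE-2021-44228 last week."))
# ['CVE-2024-1234', 'CVE-2021-44228']
```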
## Usage
```python
import spacy
# Load model
nlp = spacy.load("path/to/model")
# Extract entities
text = "CISSP certified professional with experience in Splunk and Metasploit"
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```
**Output:**
```
CISSP -> CERTIFICATION
Splunk -> SECURITY_TOOL
Metasploit -> SECURITY_TOOL
```
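
For use cases such as CV skill extraction, it is often more convenient to collect entities by label across several documents. The following is a minimal sketch under the assumption that each document exposes `.ents` with `.text` and `.label_`, as spaCy `Doc` objects do; `group_entities` is a helper name introduced here, not part of the model.

```python
from collections import defaultdict
from typing import Iterable


def group_entities(docs: Iterable) -> dict[str, list[str]]:
    """Collect entity texts by label, keeping first-seen order and dropping repeats."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        for ent in doc.ents:
            if ent.text not in grouped[ent.label_]:
                grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Usage with a loaded pipeline (nlp.pipe processes texts in batches):
#   texts = ["CISSP certified analyst", "Experience with Splunk and Metasploit"]
#   skills = group_entities(nlp.pipe(texts))
```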
## Training Data
**Sources:**
- v7 merged data: 1,448 examples
- v8 generated: 1,347 examples with multi-entity patterns, case variants
- Manual curated: 100 examples
- Final dataset: 2,223 unique examples (after validation and deduplication)
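
The validation and deduplication step is not described in detail here. One plausible version is sketched below, assuming each example is a `(text, spans)` pair with character-offset spans `(start, end, label)`; the data format and the helper names `is_valid` and `clean_dataset` are assumptions made for illustration.

```python
def is_valid(text, spans):
    """Spans must be in bounds, non-empty, and non-overlapping."""
    last_end = 0
    for start, end, _label in sorted(spans):
        if not (0 <= start < end <= len(text)) or start < last_end:
            return False
        last_end = end
    return True


def clean_dataset(examples):
    """Drop invalid examples and exact duplicates (same text and same spans)."""
    seen = set()
    cleaned = []
    for text, spans in examples:
        key = (text, tuple(sorted(spans)))
        if key in seen or not is_valid(text, spans):
            continue
        seen.add(key)
        cleaned.append((text, spans))
    return cleaned
```

Deduplicating on exact text plus spans is the simplest choice; a stricter pipeline might also normalize whitespace or case before comparing.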
**v8 Improvements:**
- Multi-entity "X and Y" patterns (50 examples per entity type)
- Case variants in upper, lower, and title case (CISSP, cissp, Cissp)
- Comma-separated list patterns
- AUDIT_TERM edge cases (Compliance audit)
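
The generation script for these patterns is not included in this README. As a rough sketch of how "X and Y" examples and case variants could be produced with correct character offsets, assuming a `(text, spans)` output format and a hypothetical sentence template:

```python
def case_variants(term: str) -> list[str]:
    """Upper, lower, and title-case forms of a term, without duplicates."""
    variants = []
    for variant in (term.upper(), term.lower(), term.title()):
        if variant not in variants:
            variants.append(variant)
    return variants


def pair_example(first, second, label, prefix="Experience with ", suffix="."):
    """Build an "X and Y" sentence and the character spans of both entities."""
    start_a = len(prefix)
    end_a = start_a + len(first)
    start_b = end_a + len(" and ")
    end_b = start_b + len(second)
    text = f"{prefix}{first} and {second}{suffix}"
    return text, [(start_a, end_a, label), (start_b, end_b, label)]


print(case_variants("cissp"))  # ['CISSP', 'cissp', 'Cissp']
text, spans = pair_example("XSS", "CSRF", "ATTACK_TECHNIQUE")
print(text)  # Experience with XSS and CSRF.
```

Computing offsets from the known prefix length, rather than searching the finished sentence, avoids mislabeling when a term also occurs inside the template text.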
**Entity Distribution (2,636 annotations in total):**
- AUDIT_TERM: 326 (12.4%)
- CERTIFICATION: 295 (11.2%)
- SECURITY_TOOL: 293 (11.1%)
- ATTACK_TECHNIQUE: 282 (10.7%)
- THREAT_TYPE: 263 (10.0%)
- TECHNICAL_SKILL: 228 (8.6%)
- REGULATION: 222 (8.4%)
- CVE: 182 (6.9%)
- FRAMEWORK: 165 (6.3%)
- SECURITY_ROLE: 153 (5.8%)
- ACRONYM: 142 (5.4%)
- SECURITY_DOMAIN: 85 (3.2%)
## Training Configuration
- **Framework:** spaCy 3.8
- **Architecture:** tok2vec + TransitionBasedParser
- **GPU:** NVIDIA RTX 4090
- **Training steps:** 11,500 (early stopping)
- **Patience:** 5,000 steps
- **Learning rate:** 3e-05
- **Dropout:** 0.25
- **Batch size:** 1,000
- **Train/dev split:** 85/15
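
The exact procedure used for the 85/15 split is not documented here. A seeded shuffle is one common way to get a reproducible split; the sketch below assumes that approach, and the seed and function name are illustrative.

```python
import random


def train_dev_split(examples, dev_fraction=0.15, seed=0):
    """Shuffle with a fixed seed, then hold out dev_fraction of the examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# With the 2,223 examples in the final dataset:
train, dev = train_dev_split(range(2223))
print(len(train), len(dev))  # 1890 333
```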
## Version History
**v8 (Current):**
- 94% pass rate (62/66)
- Multi-entity extraction improved
- Case variant support added (upper, lower, and title case)
- AUDIT_TERM edge cases fixed
**v7:**
- 86% pass rate (57/66)
- CVE detection restored
- SECURITY_ROLE improved to 100%
- IDS/IPS and DDoS fixed
**v6:**
- 74% pass rate (49/66)
- CVE regression (CVE entities no longer detected)
- AUDIT_TERM and SECURITY_ROLE issues
## Known Limitations
v8 has 4 remaining test failures:
1. Multi-entity extraction in specific contexts ("APT group using ransomware")
2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
4. "Gap analysis" edge case
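
To reproduce failures like these, a small harness can compare extracted entities against expected ones. The sketch below is not the test suite behind the reported pass rate: the case format, the pass criterion, and the expected labels (inferred from the entity type definitions above) are all assumptions.

```python
from typing import Callable

Entity = tuple[str, str]  # (entity text, label)

# Expected labels are assumptions based on the entity type definitions.
CASES = [
    ("APT group using ransomware", [("APT", "THREAT_TYPE"), ("ransomware", "THREAT_TYPE")]),
    ("XSS and CSRF mitigated", [("XSS", "ATTACK_TECHNIQUE"), ("CSRF", "ATTACK_TECHNIQUE")]),
    ("HIPAA and PCI-DSS standards", [("HIPAA", "REGULATION"), ("PCI-DSS", "REGULATION")]),
]


def run_case(extract: Callable[[str], list[Entity]], text: str, expected: list[Entity]) -> bool:
    """A case passes only if every expected (text, label) pair is extracted."""
    found = set(extract(text))
    return all(pair in found for pair in expected)


def pass_rate(extract: Callable[[str], list[Entity]], cases) -> float:
    """Fraction of cases that pass."""
    passed = sum(run_case(extract, text, expected) for text, expected in cases)
    return passed / len(cases)


# Usage with a loaded pipeline:
#   extract = lambda text: [(ent.text, ent.label_) for ent in nlp(text).ents]
#   print(f"{pass_rate(extract, CASES):.0%}")
```

Requiring exact text matches means a span that swallows the conjunction, such as "XSS and CSRF" tagged as a single entity, counts as a failure.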
## Use Cases
- CV/resume skill extraction
- Job description analysis
- Threat intelligence reports
- Compliance documentation
- Security audit reports
- Technical documentation
- Security training materials
## License
MIT
## Citation
```bibtex
@misc{cybersecurity-ner,
title={Cybersecurity NER Model},
author={PKI},
year={2026},
url={https://huggingface.co/pki/cybersecurity-ner}
}
```
## Contact
For issues or questions, please open an issue on GitHub.