	---
	language:
	- en
	license: apache-2.0
	library_name: spacy
	tags:
	- spacy
	- ner
	- named-entity-recognition
	- cybersecurity
	- infosec
	- security
	- token-classification
	pipeline_tag: token-classification
	datasets:
	- custom
	model-index:
	- name: cybersec-ner-roberta
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    metrics:
	    - type: f1
	      value: 0.694
	      name: F1
	    - type: precision
	      value: 0.691
	      name: Precision
	    - type: recall
	      value: 0.698
	      name: Recall
	---

	# Cybersecurity NER Model

	A spaCy NER model with a RoBERTa transformer backbone, trained to extract cybersecurity entities from text.

	## Entity Types (9)

	| Entity | Description | F1 Score |
	|--------|-------------|----------|
	| SECURITY_ROLE | Job titles (CISO, SOC Analyst, Pentester) | 57.8% |
	| TECHNICAL_SKILL | Skills (Incident Response, Threat Hunting) | 54.7% |
	| SECURITY_TOOL | Tools (Splunk, CrowdStrike, Metasploit) | 100% |
	| CERTIFICATION | Certs (CISSP, OSCP, CEH) | 100% |
	| FRAMEWORK | Frameworks (NIST, MITRE ATT&CK, ISO 27001) | 100% |
	| THREAT_TYPE | Threats (APT, ransomware, phishing) | 90% |
	| ATTACK_TECHNIQUE | Attacks (SQL injection, XSS, RCE) | 100% |
	| REGULATION | Regulations (GDPR, HIPAA, PCI-DSS) | 100% |
	| SECURITY_DOMAIN | Domains (Cloud Security, Network Security) | 13% |

	Overall: F1 69.4% | Precision 69.1% | Recall 69.8%
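
As a quick consistency check, the overall F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This is a minimal sketch; the helper name `f1_score` is illustrative and not part of the model package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in this card.
overall_f1 = f1_score(0.691, 0.698)
print(f"{overall_f1:.3f}")  # 0.694
```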

	## Training Data

	- 1,500+ unique cybersecurity entities
	- 1,000 synthetic training examples (CVs, job descriptions)
	- RoBERTa backbone domain-adapted on 40K security texts

	## Usage

	```python
	import spacy

	# Load model
	nlp = spacy.load("path/to/model")

	# Extract entities
	doc = nlp("CISO with CISSP certification, expert in Splunk SIEM and threat hunting")

	for ent in doc.ents:
	    print(f"{ent.text}: {ent.label_}")
	```

	Output:
	```
	CISO: SECURITY_ROLE
	CISSP: CERTIFICATION
	Splunk: SECURITY_TOOL
	threat hunting: TECHNICAL_SKILL
	```
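
Downstream code often wants entities grouped by type rather than as a flat list. The sketch below shows one way to do that, assuming only that each entity exposes `text` and `label_` attributes, as spaCy spans do. The function name `group_by_label` is hypothetical, and the sample objects are hand-written stand-ins that mirror the example output above, so the snippet runs without the model installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(ents):
    """Map each entity label to its entity texts, in document order."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins for spaCy spans (assumption: not real model output).
# With a loaded pipeline you would pass doc.ents instead.
sample_ents = [
    SimpleNamespace(text="CISO", label_="SECURITY_ROLE"),
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="threat hunting", label_="TECHNICAL_SKILL"),
]

for label, texts in group_by_label(sample_ents).items():
    print(f"{label}: {texts}")
```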

	## Requirements

	```
	spacy>=3.8.0
	spacy-transformers>=1.3.0
	```
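
To confirm that an environment meets these minimums before loading the model, something like the following could be used. It is a rough sketch built on the standard library's `importlib.metadata`; the helper names are hypothetical, and the version comparison only looks at the leading numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"spacy": "3.8.0", "spacy-transformers": "1.3.0"}


def version_tuple(version: str) -> tuple:
    """Leading numeric components padded to three, e.g. '3.8' -> (3, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = [int(p) for p in match.group().split(".")] if match else []
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return {package: status} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = f"{installed} (ok)"
        else:
            report[name] = f"{installed} (below {minimum})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```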

	## Use Cases

	- Threat intelligence parsing
	- Security talent matching (CV/job analysis)
	- Skills inventory extraction
	- Compliance document analysis
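
As an illustration of the talent-matching use case, extracted entities from a CV and a job description can be compared as sets. The sketch below is one possible scoring scheme, not the method used by this project: `overlap_score` is a hypothetical helper, and the `(text, label)` pairs are hand-written examples rather than model output.

```python
def entity_set(pairs, labels):
    """Lower-cased entity texts whose label is in `labels`."""
    return {text.lower() for text, label in pairs if label in labels}


def overlap_score(cv_pairs, job_pairs, labels):
    """Fraction of the job's entities that also appear in the CV."""
    wanted = entity_set(job_pairs, labels)
    if not wanted:
        return 0.0
    return len(wanted & entity_set(cv_pairs, labels)) / len(wanted)


# Hypothetical (text, label) pairs, e.g. built from doc.ents.
cv = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("threat hunting", "TECHNICAL_SKILL"),
]
job = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Metasploit", "SECURITY_TOOL"),
    ("Incident Response", "TECHNICAL_SKILL"),
]
match_labels = {"CERTIFICATION", "SECURITY_TOOL", "TECHNICAL_SKILL"}

print(overlap_score(cv, job, match_labels))  # 0.5
```

Given the per-entity scores reported above, restricting `match_labels` to the better-performing entity types may give more reliable matches.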

	## Limitations

	- SECURITY_DOMAIN has low recall (7%) and an F1 of 13%; it needs more training data
	- SECURITY_ROLE (57.8%) and TECHNICAL_SKILL (54.7%) F1 scores are below target; improvement is ongoing
	- Trained primarily on English text

	## License

	Apache 2.0

	## Citation

	```bibtex
	@misc{cybersec-ner-2024,
	  author = {PKI},
	  title = {Cybersecurity NER Model},
	  year = {2024},
	  publisher = {HuggingFace},
	}
	```