	# Cybersecurity NER Model v8

	Named entity recognition (NER) model for cybersecurity text, trained with spaCy v3.8 on custom training data.

	## Model Description

	Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.

	## Performance

	Test Results (v8):
	- Pass Rate: 94% (62/66 tests)
	- Dev F1 Score: 98.58%
	- Precision: 98.71%
	- Recall: 98.46%
	- Training Steps: 11,500 (early stopping)
	- Training Data: 2,223 examples
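
The F1 score is the harmonic mean of precision and recall, and the pass rate is the passed/total ratio rounded to a whole percent. A quick check of that arithmetic, using only the figures listed above (the helper name `f1_score` is illustrative, not part of the model):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(98.71, 98.46)   # dev-set precision and recall from the list above
pass_rate = 62 / 66 * 100     # passed tests / total tests

print(f"F1: {f1:.2f}%")                # F1: 98.58%
print(f"Pass rate: {pass_rate:.0f}%")  # Pass rate: 94%
```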

	Entity Type Performance:
	| Entity Type | Test Pass Rate | Dev Set F1 |
	|-------------|----------------|------------|
	| CVE | 100% (3/3) | 100.00% |
	| AUDIT_TERM | 75% (3/4) | 100.00% |
	| SECURITY_TOOL | 100% (4/4) | 100.00% |
	| CERTIFICATION | 100% (4/4) | 98.73% |
	| SECURITY_ROLE | 100% (4/4) | 98.11% |
	| FRAMEWORK | 100% (4/4) | 93.88% |
	| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
	| ACRONYM | 100% (4/4) | 100.00% |
	| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
	| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
	| THREAT_TYPE | 75% (3/4) | 95.24% |
	| REGULATION | 75% (3/4) | 96.55% |
	| CONTROL_ID | 100% (4/4) | - |

	## Entity Types

	1. CVE - CVE identifiers (e.g., CVE-2024-1234)
	2. CERTIFICATION - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
	3. FRAMEWORK - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
	4. ATTACK_TECHNIQUE - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
	5. TECHNICAL_SKILL - Technical skills (Incident Response, Forensics, Penetration Testing)
	6. AUDIT_TERM - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
	7. SECURITY_ROLE - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
	8. THREAT_TYPE - Threat types (APT, ransomware, phishing, DDoS, malware)
	9. ACRONYM - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
	10. SECURITY_DOMAIN - Security domains (Cloud Security, Network Security, Application Security)
	11. REGULATION - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
	12. SECURITY_TOOL - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
	13. CONTROL_ID - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)

	## Usage

	```python
	import spacy

	# Load model
	nlp = spacy.load("path/to/model")

	# Extract entities
	text = "CISSP certified professional with experience in Splunk and Metasploit"
	doc = nlp(text)

	# Print each detected entity with its label
	for ent in doc.ents:
	    print(f"{ent.text} -> {ent.label_}")
	```

	Output:
	```
	CISSP -> CERTIFICATION
	Splunk -> SECURITY_TOOL
	Metasploit -> SECURITY_TOOL
	```
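
For downstream tasks such as CV skill extraction, it can be convenient to group the extracted entities by label. The following is a minimal sketch: `group_entities` is a hypothetical helper, not part of this model or of spaCy, and the stand-in object only mimics the `.ents`, `.text`, and `.label_` attributes so that the snippet runs without the model files.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities(doc):
    """Group entity texts by label for any object exposing spaCy-style `.ents`."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in for a processed Doc (assumption: mirrors the output shown above).
# With the real pipeline, pass `nlp(text)` instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="CISSP", label_="CERTIFICATION"),
    SimpleNamespace(text="Splunk", label_="SECURITY_TOOL"),
    SimpleNamespace(text="Metasploit", label_="SECURITY_TOOL"),
])

print(group_entities(fake_doc))
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```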

	## Training Data

	Sources:
	- v7 merged data: 1,448 examples
	- v8 generated: 1,347 examples covering multi-entity patterns and case variants
	- Manually curated: 100 examples
	- Final dataset: 2,223 unique examples (after validation and deduplication)

	v8 Improvements:
	- Multi-entity "X and Y" patterns (50 examples per entity type)
	- Case variants, including title case (CISSP, cissp, Cissp)
	- Comma-separated list patterns
	- AUDIT_TERM edge cases (Compliance audit)

	Entity Distribution:
	- AUDIT_TERM: 326 (12.4%)
	- CERTIFICATION: 295 (11.2%)
	- SECURITY_TOOL: 293 (11.1%)
	- ATTACK_TECHNIQUE: 282 (10.7%)
	- THREAT_TYPE: 263 (10.0%)
	- TECHNICAL_SKILL: 228 (8.6%)
	- REGULATION: 222 (8.4%)
	- CVE: 182 (6.9%)
	- FRAMEWORK: 165 (6.3%)
	- SECURITY_ROLE: 153 (5.8%)
	- ACRONYM: 142 (5.4%)
	- SECURITY_DOMAIN: 85 (3.2%)
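
The percentages above are consistent with each count divided by the sum of the twelve listed counts (2,636 entity annotations), rather than by the 2,223 examples. A short check of that arithmetic, using only the counts listed here:

```python
# Entity annotation counts copied from the distribution list above
counts = {
    "AUDIT_TERM": 326, "CERTIFICATION": 295, "SECURITY_TOOL": 293,
    "ATTACK_TECHNIQUE": 282, "THREAT_TYPE": 263, "TECHNICAL_SKILL": 228,
    "REGULATION": 222, "CVE": 182, "FRAMEWORK": 165,
    "SECURITY_ROLE": 153, "ACRONYM": 142, "SECURITY_DOMAIN": 85,
}

total = sum(counts.values())
# Share of each entity type, in percent, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)                      # 2636
print(shares["AUDIT_TERM"])       # 12.4
print(shares["SECURITY_DOMAIN"])  # 3.2
```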

	## Training Configuration

	- Framework: spaCy 3.8
	- Architecture: tok2vec + TransitionBasedParser
	- GPU: NVIDIA RTX 4090
	- Training steps: 11,500 (early stopping)
	- Patience: 5,000 steps
	- Learning rate: 3e-05
	- Dropout: 0.25
	- Batch size: 1,000
	- Train/dev split: 85/15
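
This README does not say how the 85/15 split was produced. One common approach, shown here as a sketch only, is a seeded shuffle followed by a cut at 85% of the data. The function name `train_dev_split`, the seed, and the choice to truncate at the boundary are all assumptions for illustration.

```python
import random

def train_dev_split(examples, train_fraction=0.85, seed=0):
    """Shuffle a copy of the examples and cut it into train and dev parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)        # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)    # truncates toward zero
    return shuffled[:cut], shuffled[cut:]

# Placeholder items standing in for the 2,223 annotated examples
examples = list(range(2223))
train, dev = train_dev_split(examples)

print(len(train), len(dev))  # 1889 334
```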

	## Version History

	v8 (Current):
	- 94% pass rate (62/66)
	- Multi-entity extraction improved
	- Title case support added
	- AUDIT_TERM edge cases fixed

	v7:
	- 86% pass rate (57/66)
	- CVE detection restored
	- SECURITY_ROLE improved to 100%
	- IDS/IPS and DDoS fixed

	v6:
	- 74% pass rate (49/66)
	- CVE regression (CVE entities not detected)
	- AUDIT_TERM and SECURITY_ROLE issues

	## Known Limitations

	v8 has 4 remaining test failures:
	1. Multi-entity extraction in specific contexts ("APT group using ransomware")
	2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
	3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
	4. "Gap analysis" edge case
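
Until these cases are fixed in the model, a post-processing step may help with the conjunction failures. The sketch below assumes the failure mode is a single predicted span that swallows the conjunction (for example "XSS and CSRF" returned as one entity), which this README does not confirm. `split_conjoined` is a hypothetical helper that works on plain `(text, label)` pairs.

```python
import re

def split_conjoined(text: str, label: str) -> list[tuple[str, str]]:
    """Split a span on a standalone 'and', keeping the label for each part."""
    parts = [p.strip() for p in re.split(r"\s+and\s+", text)]
    return [(p, label) for p in parts if p]

print(split_conjoined("XSS and CSRF", "ATTACK_TECHNIQUE"))
# [('XSS', 'ATTACK_TECHNIQUE'), ('CSRF', 'ATTACK_TECHNIQUE')]
print(split_conjoined("Splunk", "SECURITY_TOOL"))
# [('Splunk', 'SECURITY_TOOL')]
```

Note that this would also split names that legitimately contain "and", so it is best applied only to labels where that is unlikely.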

	## Use Cases

	- CV/resume skill extraction
	- Job description analysis
	- Threat intelligence reports
	- Compliance documentation
	- Security audit reports
	- Technical documentation
	- Security training materials

	## License

	MIT

	## Citation

	```bibtex
	@misc{cybersecurity-ner,
	  title={Cybersecurity NER Model},
	  author={PKI},
	  year={2026},
	  url={https://huggingface.co/pki/cybersecurity-ner}
	}
	```

	## Contact

	For issues or questions, please open an issue on GitHub.