---
language: en
tags:
- spacy
- ner
- cybersecurity
- token-classification
license: mit
model-index:
- name: ner-cybersecurity
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: f1
      value: 0.9806
      name: F1
    - type: precision
      value: 0.9782
      name: Precision
    - type: recall
      value: 0.9830
      name: Recall
---

# Cybersecurity NER Model

Named Entity Recognition model for the cybersecurity domain with **98.06% F1 score**.

## Model Description

This spaCy 3.x model extracts cybersecurity-specific entities from job descriptions, CVs, threat reports, and security documentation. It uses a domain-adapted RoBERTa as the transformer backbone.

- **Version:** v4
- **Framework:** spaCy 3.8+
- **Training Date:** 2025-12-29
- **Training Examples:** 1922 (stratified 80/10/10 split)

## Entity Types (13)

| Entity Type | F1 Score | Examples |
|-------------|----------|----------|
| CERTIFICATION | 100.00% | CISSP, OSCP, CEH, CISM, Security+ |
| SECURITY_ROLE | 100.00% | CISO, SOC Analyst, Penetration Tester |
| TECHNICAL_SKILL | 100.00% | Incident Response, Threat Hunting, Forensics |
| SECURITY_TOOL | 100.00% | Splunk, Metasploit, Burp Suite, Wireshark |
| AUDIT_TERM | 100.00% | Compliance, Audit, Assessment |
| CONTROL_ID | 100.00% | CIS Controls, NIST 800-53 |
| SECURITY_DOMAIN | 97.83% | Cloud Security, Network Security, AppSec |
| THREAT_TYPE | 97.44% | APT, ransomware, phishing, malware |
| FRAMEWORK | 96.97% | NIST CSF, ISO 27001, OWASP, MITRE ATT&CK |
| ATTACK_TECHNIQUE | 96.30% | SQL Injection, XSS, Buffer Overflow |
| REGULATION | 87.80% | GDPR, HIPAA, PCI-DSS, SOX |
| CVE | 85.71% | CVE-2021-44228 (Log4Shell) |
| ACRONYM | 66.67% | SIEM, EDR, XDR, SOAR |

## Performance

**Overall Metrics:**

- F1: **98.06%**
- Precision: 97.82%
- Recall: 98.30%
- Inference Speed: ~68 ms per document

**Improvements over v3:**

- Overall F1: 69.4% → 98.06% (+28.6 points)
- SECURITY_DOMAIN: 13% → 97.83% (+84.8 points)
- Training examples: 1000 → 1922 (+92%)

## Training Data

- **Source:** Prodigy 1.9.9 annotations (cybersecurity job descriptions, CVs, threat intelligence)
- **Split:** Stratified by entity type (80% train, 10% dev, 10% test)
- **Examples:** 1922 total
- **Transformer:** Domain-adapted RoBERTa (roberta-cybersecurity)

## Usage

### Installation

```bash
# Quote the requirements so the shell does not treat ">=" as a redirect
pip install "spacy>=3.7.0" "spacy-transformers>=1.3.0"
```

### Load Model

```python
import spacy

# spacy.load() resolves an installed package name or a local path,
# so the model must be installed or downloaded locally first.
nlp = spacy.load("pki/ner-cybersecurity")

doc = nlp("CISO with CISSP certification, expert in Splunk and ISO 27001")

# Print each entity's text (padded to 20 characters) and its label
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_}")
```

**Output:**

```
CISO                 | SECURITY_ROLE
CISSP                | CERTIFICATION
Splunk               | SECURITY_TOOL
ISO 27001            | FRAMEWORK
```

### FastAPI Service

```python
from fastapi import FastAPI
import spacy

app = FastAPI()

# Load the pipeline once at startup rather than on every request
nlp = spacy.load("pki/ner-cybersecurity")


# `text` is a plain str parameter, so FastAPI reads it from the
# query string: POST /extract?text=...
@app.post("/extract")
async def extract_entities(text: str):
    doc = nlp(text)
    return {
        "entities": [
            {
                "text": ent.text,
                "label": ent.label_,
                "start": ent.start_char,  # character offsets into `text`
                "end": ent.end_char,
            }
            for ent in doc.ents
        ]
    }
```

## Use Cases

- **Job Matching:** Extract skills, certifications, and tools from CVs and job descriptions
- **Threat Intelligence:** Identify attack techniques, threat types, and CVEs in reports
- **Compliance:** Extract regulatory frameworks, controls, and audit terms
- **Security Documentation:** Analyze security policies, procedures, and assessments

## Training Configuration

```ini
[training]
max_steps = 8000
hidden_width = 128
warmup_steps = 500
batch_size = 128

[transformer]
name = roberta-cybersecurity (domain-adapted)
```

## Limitations

- **ACRONYM entities:** Lower F1 (66.67%) due to limited training examples (46 total)
- **Context dependency:** Some terms may be ambiguous without context (e.g., "SOAR" as tool vs role)
- **Domain specificity:** Optimized for cybersecurity; may underperform on general text

## Model Card Authors

PKI Team

## Citation

```bibtex
@misc{ner-cybersecurity-v4,
  title={Cybersecurity NER Model v4},
  author={PKI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/pki/ner-cybersecurity}
}
```

## License

MIT License

## Version History

| Version | Date | F1 | Training Examples | Notes |
|---------|------|-----|-------------------|-------|
| v4 | 2025-12-29 | 98.06% | 1922 | spaCy 3.x, stratified split, domain RoBERTa |
| v3 | 2025-01 | 69.4% | 1000 | spaCy 3.x migration |
| v2 | 2024-12 | 99.5%* | 1805 | spaCy 2.x (*train accuracy, not F1) |
| v1 | 2024-11 | N/A | N/A | Initial Prodigy training |

## Contact

For issues or questions, open an issue at the model repository.
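
## Appendix: Grouping Extracted Entities

For the Job Matching use case, the flat entity list returned by the `/extract` endpoint in the Usage section often needs to be grouped by label before it can be compared against a job description. The following is a minimal sketch using only the standard library. The helper name `group_entities` and the sample payload are hypothetical; the payload simply mirrors the response shape and example output shown in the Usage section and is not real model output.

```python
from collections import defaultdict


def group_entities(response: dict) -> dict:
    """Group entity texts by label, skipping case-insensitive duplicates.

    `response` is assumed to have the shape returned by /extract:
    {"entities": [{"text": ..., "label": ..., "start": ..., "end": ...}]}
    """
    grouped = defaultdict(list)
    seen = set()
    for ent in response.get("entities", []):
        key = (ent["label"], ent["text"].lower())
        if key in seen:
            continue  # same mention already recorded for this label
        seen.add(key)
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hypothetical payload built from the example sentence in the Usage section
text = "CISO with CISSP certification, expert in Splunk and ISO 27001"
mentions = [
    ("CISO", "SECURITY_ROLE"),
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("ISO 27001", "FRAMEWORK"),
]
sample_response = {
    "entities": [
        {
            "text": mention,
            "label": label,
            "start": text.index(mention),
            "end": text.index(mention) + len(mention),
        }
        for mention, label in mentions
    ]
}

profile = group_entities(sample_response)
print(profile)
```

Deduplicating on the lowercased text keeps the first surface form seen for each label, which is usually enough for skill or tool lists; a stricter normalization step may be needed for acronyms and their expansions.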