	---
	license: apache-2.0
	language:
	- en
	base_model:
	- cisco-ai/SecureBERT2.0-base
	pipeline_tag: token-classification
	library_name: transformers
	tags:
	- NER
	- SecureBERT2
	- CyberNER
	- token-classification
	- cybersecurity
	---

	# Model Card for cisco-ai/SecureBERT2.0-NER

	The Secure Modern BERT NER Model is a fine-tuned transformer based on [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base), designed for Named Entity Recognition (NER) in cybersecurity text.

	It extracts domain-specific entities such as Indicators, Malware, Organizations, Systems, and Vulnerabilities from unstructured data sources like threat reports, incident analyses, advisories, and blogs.

	NER in cybersecurity enables:
	- Automated extraction of indicators of compromise (IOCs)
	- Structuring of unstructured threat intelligence text
	- Improved situational awareness for analysts
	- Faster incident response and vulnerability triage

	---

	## Model Details

	### Model Description

	- Developed by: Cisco AI
	- Model Type: ModernBertForTokenClassification
	- Framework: PyTorch / Transformers
	- Tokenizer Type: PreTrainedTokenizerFast
	- Number of Labels: 11
	- Task: Named Entity Recognition (NER)
	- License: Apache-2.0
	- Language: English
	- Base Model: [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

	#### Supported Entity Labels

	| Entity | Description |
	|:--------|:-------------|
	| `B-Indicator`, `I-Indicator` | Indicators of Compromise (e.g., IPs, domains, hashes) |
	| `B-Malware`, `I-Malware` | Malware or exploit names |
	| `B-Organization`, `I-Organization` | Companies or groups mentioned |
	| `B-System`, `I-System` | Affected software or platforms |
	| `B-Vulnerability`, `I-Vulnerability` | Specific CVEs or flaw descriptions |
	| `O` | Outside token |
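
The table above implies 11 labels: a `B-`/`I-` pair for each of the five entity types plus `O`. A minimal sketch of how such a label set could be enumerated is shown below. The helper name `build_bio_labels` and the index order are assumptions made for illustration; the authoritative `id2label` mapping is the one stored in the checkpoint's configuration and may be ordered differently.

```python
# Entity types as listed in the table above.
ENTITY_TYPES = ["Indicator", "Malware", "Organization", "System", "Vulnerability"]


def build_bio_labels(entity_types):
    """Return the BIO label list: "O" plus a B-/I- pair per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)

# NOTE: this index order is illustrative only (an assumption); the real
# mapping is the `id2label` entry in the model's config and may differ.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))  # 11
```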

	#### Model Configuration

	| Parameter | Value |
	|:-----------|:-------|
	| Hidden size | 768 |
	| Intermediate size | 1152 |
	| Hidden layers | 22 |
	| Attention heads | 12 |
	| Max sequence length | 8192 |
	| Vocabulary size | 50368 |
	| Activation | GELU |
	| Dropout | 0.0 (embedding, attention, MLP, classifier) |

	---

	## Uses

	### Direct Use

	- Named Entity Recognition (NER) on cybersecurity text
	- Threat intelligence enrichment
	- IOC extraction and normalization
	- Incident report analysis
	- Vulnerability mention detection

	### Downstream Use

	This model can be integrated into:
	- Threat intelligence platforms (TIPs)
	- SOC automation tools
	- Cybersecurity knowledge graphs
	- Vulnerability management and CVE monitoring systems

	### Out-of-Scope Use

	- Non-technical or general-domain NER tasks
	- Generative or conversational AI applications

	---

	## Benchmark Cybersecurity NER Corpus

	### Dataset Overview

	| Aspect | Description |
	|:-------|:-------------|
	| Purpose | Benchmark dataset for extracting cybersecurity entities from unstructured reports |
	| Data Source | Curated threat intelligence documents emphasizing malware and system analysis |
	| Annotation Methodology | Fully hand-labeled by domain experts |
	| Entity Types | Malware, Indicator, System, Organization, Vulnerability |
	| Size | 3.4k training samples + 717 test samples |

	---

	## How to Get Started with the Model

	### Example Usage (Transformers)

	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

	model_name = "cisco-ai/SecureBERT2.0-NER"

	# Load the tokenizer and the fine-tuned token-classification model (PyTorch weights).
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Build a token-classification ("ner") pipeline that returns per-token predictions.
	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

	text = "Stealc malware targets browser cookies and passwords."
	entities = ner_pipeline(text)
	print(entities)
	```
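
The pipeline above returns token-level tags, so multi-token entities still need to be grouped into spans. The sketch below shows one way to merge BIO tags by hand; `merge_bio_tags` is a hypothetical helper, and the tags in the example are made up for illustration rather than actual model output. In practice, passing `aggregation_strategy="simple"` to `pipeline(...)` performs a similar grouping inside Transformers.

```python
def merge_bio_tags(tokens, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_span = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_span:
            # A B- tag (or a stray I- tag of a different type) opens a new span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tags, chosen by hand for illustration (not model output).
tokens = ["Stealc", "malware", "targets", "Google", "Chrome", "cookies"]
tags = ["B-Malware", "O", "O", "B-System", "I-System", "O"]

spans = merge_bio_tags(tokens, tags)
print(spans)  # [('Malware', 'Stealc'), ('System', 'Google Chrome')]
```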

	## Training Details

	### Training Objective and Procedure

	`SecureBERT2.0-NER` was fine-tuned for token-level classification on cybersecurity text using cross-entropy loss.
	Training focused on accurately classifying entity boundaries and types across five cybersecurity-specific categories: Malware, Indicator, System, Organization, and Vulnerability.

	The AdamW optimizer was used with a linear learning rate scheduler, and gradient clipping ensured stability during fine-tuning.
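
To make the scheduler and clipping behaviour concrete, here is a dependency-free sketch of a linear learning-rate decay and of gradient clipping by global norm. It uses the learning rate (1e-5) and clipping norm (1.0) reported in this card; the function names, the absence of warmup, and the flat list of gradient values are assumptions for illustration, not the actual training code.

```python
import math


def linear_lr(step, total_steps, base_lr=1e-5, warmup_steps=0):
    """Linear schedule: optional linear warmup, then linear decay to zero.

    Assumption: the card does not state a warmup length, so it defaults to 0.
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm does not exceed max_norm.

    Simplified sketch: real frameworks operate on tensors and add a small
    epsilon to the denominator.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


# Learning rate at the start, midpoint, and end of a 100-step run.
print(linear_lr(0, 100), linear_lr(50, 100), linear_lr(100, 100))

# A gradient with norm 5.0 is scaled down to norm 1.0.
clipped, original_norm = clip_by_global_norm([3.0, 4.0])
print(original_norm, clipped)
```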

	### Training Configuration

	| Setting | Value |
	|:---------|:------:|
	| Objective | Token-wise Cross Entropy |
	| Optimizer | AdamW |
	| Learning Rate | 1e-5 |
	| Weight Decay | 0.001 |
	| Batch Size per GPU | 8 |
	| Epochs | 20 |
	| Max Sequence Length | 1024 |
	| Gradient Clipping Norm | 1.0 |
	| Scheduler | Linear |
	| Mixed Precision | fp16 |
	| Framework | PyTorch / Transformers |

	### Training Dataset

	The model was fine-tuned on a cybersecurity-specific NER corpus, containing annotated threat intelligence reports, advisories, and technical documentation.

	| Property | Description |
	|:----------|:-------------|
	| Dataset Type | Manually annotated corpus |
	| Language | English |
	| Entity Types | Malware, Indicator, System, Organization, Vulnerability |
	| Train Size | 3,400 samples |
	| Test Size | 717 samples |
	| Annotation Method | Expert hand-labeling for accuracy and consistency |

	### Preprocessing

	- Texts were tokenized using the `PreTrainedTokenizerFast` tokenizer from SecureBERT 2.0.
	- All sequences were truncated or padded to 1024 tokens.
	- Labels were aligned with subword tokens to maintain token–label consistency.
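
The alignment and padding steps above can be sketched as follows. The input `word_ids` has the shape returned by a fast tokenizer's `word_ids()` method (one entry per subword, `None` for special tokens). The card does not say how continuation subwords were labeled, so this sketch assumes a common convention: only the first subword of each word keeps the label, and everything else is masked with `-100`, the default ignore index of PyTorch's cross-entropy loss. The function name `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids, word_labels, max_length=1024):
    """Map word-level label ids onto subword positions, then truncate/pad.

    Assumption: continuation subwords and special tokens are masked with
    IGNORE_INDEX; the original training code may have chosen differently.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids[:max_length]:  # truncate to max_length
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS]/[SEP]
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous_word = word_id
    # Pad with IGNORE_INDEX so padded positions do not contribute to the loss.
    aligned.extend([IGNORE_INDEX] * (max_length - len(aligned)))
    return aligned


# Three words; the first is split into two subwords. Label ids are illustrative.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [3, 0, 0]

aligned = align_labels(word_ids, word_labels, max_length=8)
print(aligned)  # [-100, 3, -100, 0, 0, -100, -100, -100]
```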

	### Hardware and Training Setup

	| Component | Description |
	|:-----------|:-------------|
	| GPUs Used | 8× NVIDIA A100 |
	| Precision | Mixed precision (fp16) |
	| Batch Size | 8 per GPU |
	| Framework | Transformers (PyTorch backend) |
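
As a rough back-of-the-envelope check, the figures in this card imply the following effective batch size and step counts. This assumes plain data parallelism across all 8 GPUs, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these details are stated in the card.

```python
import math

# Values taken from this card.
num_gpus = 8
per_gpu_batch_size = 8
train_samples = 3400
epochs = 20

# Assumptions: data parallelism, no gradient accumulation, partial batch kept.
effective_batch = num_gpus * per_gpu_batch_size
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 64
print(steps_per_epoch)   # 54
print(total_steps)       # 1080
```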

	### Optimization Summary

	The model converged after approximately 20 epochs, with loss stabilizing at a low level.
	Validation metrics (F1, precision, recall) showed steady improvement from epoch 3 onward, confirming effective domain-specific adaptation.



	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	Evaluation was conducted on a cybersecurity-specific NER benchmark corpus containing annotated threat reports, advisories, and incident analysis texts.
	This benchmark includes five key entity types: Malware, Indicator, System, Organization, and Vulnerability.

	#### Metrics

	The following metrics were used to assess model performance:
	- F1-score: Harmonic mean of precision and recall
	- Recall: Measures how many true entities were correctly identified
	- Precision: Measures how many predicted entities were correct
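
For reference, the three metrics can be computed from true-positive, false-positive, and false-negative counts as sketched below. The card does not state whether scores were computed per token or per entity, nor how they were averaged across entity types, so the helper `precision_recall_f1` and the counts in the example are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct predictions, 10 spurious, 30 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```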

	### Results

	| Model | F1 | Recall | Precision |
	|:------|:---:|:-------:|:-----------:|
	| CyBERT | 0.351 | 0.281 | 0.467 |
	| SecureBERT | 0.734 | 0.759 | 0.717 |
	| SecureBERT 2.0 (Ours) | 0.945 | 0.965 | 0.927 |

	#### Summary

	The SecureBERT 2.0 NER model significantly outperforms both CyBERT and the original SecureBERT across all metrics.

	- It achieves an F1-score of 0.945, an absolute improvement of 0.211 (about 21 points) over SecureBERT.
	- Its recall (0.965) indicates excellent coverage of cybersecurity entities.
	- Its precision (0.927) shows strong accuracy and low false-positive rates.

	This indicates that domain-adaptive pretraining and fine-tuning on cybersecurity corpora substantially improve NER performance compared to the earlier models evaluated here.
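
The F1 gains can be recomputed directly from the results table. The short sketch below distinguishes the absolute gain (difference in F1) from the relative gain (difference divided by the baseline F1); the variable names are illustrative.

```python
# F1 scores copied from the results table.
f1_scores = {"CyBERT": 0.351, "SecureBERT": 0.734, "SecureBERT 2.0": 0.945}

ours = f1_scores["SecureBERT 2.0"]
gains = {}
for baseline in ("CyBERT", "SecureBERT"):
    absolute = ours - f1_scores[baseline]               # difference in F1
    relative = 100 * absolute / f1_scores[baseline]     # percent of baseline F1
    gains[baseline] = (round(absolute, 3), round(relative, 1))
    print(f"vs {baseline}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```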

	---
	## Reference
	```
	@article{aghaei2025securebert,
	title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
	author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
	journal={arXiv preprint arXiv:2510.00240},
	year={2025}
	}
	```

	---

	## Model Card Authors

	Cisco AI

	## Model Card Contact

	For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com)