---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- SecureBERT2
- CyberNER
- token-classification
- cybersecurity
---

# Model Card for cisco-ai/SecureBERT2.0-NER

The **Secure Modern BERT NER Model** is a fine-tuned transformer based on [**SecureBERT 2.0**](https://huggingface.co/cisco-ai/SecureBERT2.0-base), designed for **Named Entity Recognition (NER)** in cybersecurity text.

It extracts domain-specific entities such as **Indicators, Malware, Organizations, Systems, and Vulnerabilities** from unstructured data sources like threat reports, incident analyses, advisories, and blogs.

NER in cybersecurity enables:
- Automated extraction of indicators of compromise (IOCs)
- Structuring of unstructured threat intelligence text
- Improved situational awareness for analysts
- Faster incident response and vulnerability triage

---

## Model Details

### Model Description

- **Developed by:** Cisco AI
- **Model Type:** ModernBertForTokenClassification
- **Framework:** PyTorch / Transformers
- **Tokenizer Type:** PreTrainedTokenizerFast
- **Number of Labels:** 11
- **Task:** Named Entity Recognition (NER)
- **License:** Apache-2.0
- **Language:** English
- **Base Model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

#### Supported Entity Labels

| Entity | Description |
|:--------|:-------------|
| `B-Indicator`, `I-Indicator` | Indicators of Compromise (e.g., IPs, domains, hashes) |
| `B-Malware`, `I-Malware` | Malware or exploit names |
| `B-Organization`, `I-Organization` | Companies or groups mentioned |
| `B-System`, `I-System` | Affected software or platforms |
| `B-Vulnerability`, `I-Vulnerability` | Specific CVEs or flaw descriptions |
| `O` | Outside token |

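The eleven labels follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks everything else. The sketch below shows one way to rebuild the label set and turn a tag sequence into entity spans. The label ordering and the helper name `bio_to_spans` are assumptions made for illustration; the authoritative mapping is the `id2label` field shipped in the model's `config.json`.

```python
from typing import List, Tuple

ENTITY_TYPES = ["Indicator", "Malware", "Organization", "System", "Vulnerability"]

# Assumed ordering for illustration only; read `id2label` from the model
# config to get the real index-to-label mapping.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, 0
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # "I-X" directly after a span of type X extends that span.
        if prefix == "I" and etype == current_type:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        # "B-X" opens a span; a stray "I-X" is treated leniently as an opener.
        if prefix in ("B", "I"):
            current_type, start = etype, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


tags = ["B-Malware", "I-Malware", "O", "B-System", "B-Indicator", "I-Indicator"]
print(len(LABELS))         # 11
print(bio_to_spans(tags))  # [('Malware', 0, 2), ('System', 3, 4), ('Indicator', 4, 6)]
```

Treating a stray `I-` tag as the start of a new entity is a lenient choice; a strict decoder would discard it instead.
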
#### Model Configuration

| Parameter | Value |
|:-----------|:-------|
| Hidden size | 768 |
| Intermediate size | 1152 |
| Hidden layers | 22 |
| Attention heads | 12 |
| Max sequence length | 8192 |
| Vocabulary size | 50368 |
| Activation | GELU |
| Dropout | 0.0 (embedding, attention, MLP, classifier) |

---

## Uses

### Direct Use

- Named Entity Recognition (NER) on cybersecurity text
- Threat intelligence enrichment
- IOC extraction and normalization
- Incident report analysis
- Vulnerability mention detection

### Downstream Use

This model can be integrated into:
- Threat intelligence platforms (TIPs)
- SOC automation tools
- Cybersecurity knowledge graphs
- Vulnerability management and CVE monitoring systems

### Out-of-Scope Use

- Non-technical or general-domain NER tasks
- Generative or conversational AI applications

---

## Benchmark Cybersecurity NER Corpus

### Dataset Overview

| Aspect | Description |
|:-------|:-------------|
| **Purpose** | Benchmark dataset for extracting cybersecurity entities from unstructured reports |
| **Data Source** | Curated threat intelligence documents emphasizing malware and system analysis |
| **Annotation Methodology** | Fully hand-labeled by domain experts |
| **Entity Types** | Malware, Indicator, System, Organization, Vulnerability |
| **Size** | 3.4k training samples + 717 test samples |

---

## How to Get Started with the Model

### Example Usage (Transformers)

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name = "cisco-ai/SecureBERT2.0-NER"

# Load the tokenizer and the fine-tuned token-classification model (PyTorch)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Build the NER pipeline; by default it returns one prediction per non-`O` token
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

text = "Stealc malware targets browser cookies and passwords."
entities = ner_pipeline(text)
print(entities)
```

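The example above prints token-level predictions. For downstream use it is often more convenient to merge subword tokens into whole entities and group them by type. The sketch below does that with the pipeline's `aggregation_strategy="simple"` option, which yields dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys. The helper names `group_entities` and `extract` and the 0.5 score threshold are illustrative assumptions, not part of the released model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_entities(predictions: Iterable[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect unique entity strings per type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] < min_score:  # hypothetical threshold; tune per use case
            continue
        text = pred["word"].strip()
        if text and text not in grouped[pred["entity_group"]]:
            grouped[pred["entity_group"]].append(text)
    return dict(grouped)


def extract(text: str, model_name: str = "cisco-ai/SecureBERT2.0-NER") -> Dict[str, List[str]]:
    """Run the NER pipeline with span aggregation and group the results."""
    # Imported lazily so group_entities can be used without transformers installed.
    from transformers import pipeline

    ner = pipeline("token-classification", model=model_name, aggregation_strategy="simple")
    return group_entities(ner(text))
```

Calling `extract("Stealc malware targets browser cookies and passwords.")` downloads the model on first use and returns a dictionary keyed by entity type; the exact contents depend on the model's predictions.
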
## Training Details

### Training Objective and Procedure

`SecureBERT2.0-NER` was fine-tuned for **token-level classification** on cybersecurity text using **Cross Entropy Loss**.
Training focused on accurately classifying entity boundaries and types across five cybersecurity-specific categories: *Malware, Indicator, System, Organization,* and *Vulnerability*.

The **AdamW** optimizer was used with a **linear learning rate scheduler**, and gradient clipping was applied to keep fine-tuning stable.

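To make the objective and the schedule concrete, the following dependency-free sketch computes a token-wise cross-entropy loss and a linear learning-rate schedule. It is an illustration under stated assumptions rather than the training code: the `-100` ignore label, the optional warmup, and the function names are assumptions, and only the `1e-5` base rate comes from the training configuration. Gradient clipping is not covered here.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100  # assumed "skip this position" label (PyTorch's default ignore_index)


def token_cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses: List[float] = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        peak = max(row)  # subtract the max before exponentiating, for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses) if losses else 0.0


def linear_lr(step: int, total_steps: int, base_lr: float = 1e-5, warmup_steps: int = 0) -> float:
    """Optional linear warmup (an assumption) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Two positions with uniform logits over two classes; the second one is ignored.
print(token_cross_entropy([[0.0, 0.0], [5.0, 1.0]], [1, IGNORE_INDEX]))  # log(2), about 0.693
print(linear_lr(0, 100), linear_lr(100, 100))  # starts at the base rate, ends at 0.0
```
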
### Training Configuration

| Setting | Value |
|:---------|:------:|
| Objective | Token-wise Cross Entropy |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Weight Decay | 0.001 |
| Batch Size per GPU | 8 |
| Epochs | 20 |
| Max Sequence Length | 1024 |
| Gradient Clipping Norm | 1.0 |
| Scheduler | Linear |
| Mixed Precision | fp16 |
| Framework | PyTorch / Transformers |

### Training Dataset

The model was fine-tuned on a **cybersecurity-specific NER corpus** containing annotated threat intelligence reports, advisories, and technical documentation.

| Property | Description |
|:----------|:-------------|
| **Dataset Type** | Manually annotated corpus |
| **Language** | English |
| **Entity Types** | Malware, Indicator, System, Organization, Vulnerability |
| **Train Size** | 3,400 samples |
| **Test Size** | 717 samples |
| **Annotation Method** | Expert hand-labeling for accuracy and consistency |

### Preprocessing

- Texts were tokenized using the `PreTrainedTokenizerFast` tokenizer from SecureBERT 2.0.
- All sequences were truncated or padded to 1024 tokens.
- Labels were aligned with subword tokens to maintain token–label consistency.

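The label-alignment step can be sketched as follows. A fast tokenizer exposes `word_ids()` on its encoding, which maps each subword position to the index of the word it came from (`None` for special and padding tokens). The strategy shown here, labeling only the first subword of each word and masking the rest with `-100`, is a common convention and an assumption; the exact alignment rule used for this model is not stated above.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this label by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword positions.

    The first subword of each word keeps the word's label; special tokens,
    padding, and continuation subwords are masked with IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special or padding token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Hypothetical encoding: [CLS], word 0 split in two, word 1, word 2 split in three, [SEP]
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
print(align_labels(word_ids, [3, 0, 1]))  # [-100, 3, -100, 0, 1, -100, -100, -100]
```

In practice `word_ids` would come from `tokenizer(words, is_split_into_words=True, truncation=True, max_length=1024).word_ids()`.
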
### Hardware and Training Setup

| Component | Description |
|:-----------|:-------------|
| GPUs Used | 8× NVIDIA A100 |
| Precision | Mixed precision (fp16) |
| Batch Size | 8 per GPU |
| Framework | Transformers (PyTorch backend) |

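As a rough sanity check on the scale of the run, the figures above imply an effective batch size and step count that can be worked out directly. This assumes plain data parallelism across all 8 GPUs, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these details are stated in the card, and `training_steps` is a hypothetical helper.

```python
import math
from typing import Tuple


def training_steps(num_samples: int, per_gpu_batch: int, num_gpus: int,
                   epochs: int, grad_accum: int = 1) -> Tuple[int, int, int]:
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_gpu_batch * num_gpus * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# 3,400 training samples, batch size 8 per GPU, 8 GPUs, 20 epochs
print(training_steps(3400, 8, 8, 20))  # (64, 54, 1080)
```
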
### Optimization Summary

The model converged after approximately **20 epochs**, with loss stabilizing at a low level.
Validation metrics (F1, precision, recall) showed steady improvement from epoch 3 onward, confirming effective domain-specific adaptation.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation was conducted on a **cybersecurity-specific NER benchmark corpus** containing annotated threat reports, advisories, and incident analysis texts.
This benchmark includes five key entity types: **Malware, Indicator, System, Organization, and Vulnerability**.

#### Metrics

The following metrics were used to assess model performance:
- **F1-score:** Harmonic mean of precision and recall
- **Recall:** Measures how many true entities were correctly identified
- **Precision:** Measures how many predicted entities were correct

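The three metrics can be computed from sets of gold and predicted entity spans, as in the sketch below. It uses exact-match, entity-level scoring, which is one common convention; whether the reported numbers are entity-level or token-level, and how they are averaged across entity types, is not specified here, so treat this as an illustration of the definitions only. The `Span` type and `prf1` name are assumptions.

```python
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def prf1(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {("Malware", 0, 1), ("System", 3, 4), ("Indicator", 6, 8)}
predicted = {("Malware", 0, 1), ("System", 3, 5)}  # second span has a boundary error
print(prf1(gold, predicted))  # precision 0.5, recall about 0.333, F1 about 0.4
```
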
### Results

| Model | F1 | Recall | Precision |
|:------|:---:|:-------:|:-----------:|
| **CyBERT** | 0.351 | 0.281 | 0.467 |
| **SecureBERT** | 0.734 | 0.759 | 0.717 |
| **SecureBERT 2.0 (Ours)** | **0.945** | **0.965** | **0.927** |

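The table can be sanity-checked with a few lines of arithmetic: the harmonic mean of each row's precision and recall should land close to the reported F1, and the gap between rows can be expressed as either an absolute or a relative gain. The helper names below are illustrative. Small differences from the reported F1 are expected, since the averaging scheme is not stated and the published values are rounded.

```python
from typing import Tuple

RESULTS = {
    "CyBERT": {"f1": 0.351, "recall": 0.281, "precision": 0.467},
    "SecureBERT": {"f1": 0.734, "recall": 0.759, "precision": 0.717},
    "SecureBERT 2.0": {"f1": 0.945, "recall": 0.965, "precision": 0.927},
}


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def gains(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of `new` over `old`."""
    absolute = new - old
    return absolute, absolute / old


for name, row in RESULTS.items():
    recomputed = harmonic_f1(row["precision"], row["recall"])
    print(f"{name}: reported F1 {row['f1']:.3f}, harmonic mean {recomputed:.3f}")

absolute, relative = gains(RESULTS["SecureBERT 2.0"]["f1"], RESULTS["SecureBERT"]["f1"])
print(f"Gain over SecureBERT: {absolute:.3f} absolute, {relative:.1%} relative")
```

The absolute F1 gain over SecureBERT is 0.211, which corresponds to a relative gain of roughly 29%.
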
#### Summary

The **SecureBERT 2.0 NER model** outperforms both CyBERT and the original SecureBERT on all three metrics.

- It achieves an **F1-score of 0.945**, an **absolute gain of 0.211 (about 21 points)** over SecureBERT.
- Its **recall (0.965)** indicates that nearly all annotated cybersecurity entities are found.
- Its **precision (0.927)** indicates that most predicted entities are correct, keeping false positives low.

These results indicate that **domain-adaptive pretraining and fine-tuning** on cybersecurity corpora substantially improves NER performance compared to general or earlier models.

---

## Reference

```
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```

---

## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com).