---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: token-classification
library_name: transformers
tags:
- NER
- SecureBERT2
- CyberNER
- token-classification
- cybersecurity
---
# Model Card for cisco-ai/SecureBERT2.0-NER
The **Secure Modern BERT NER Model** is a fine-tuned transformer based on [**SecureBERT 2.0**](https://huggingface.co/cisco-ai/SecureBERT2.0-base), designed for **Named Entity Recognition (NER)** in cybersecurity text.
It extracts domain-specific entities such as **Indicators, Malware, Organizations, Systems, and Vulnerabilities** from unstructured data sources like threat reports, incident analyses, advisories, and blogs.
NER in cybersecurity enables:
- Automated extraction of indicators of compromise (IOCs)
- Structuring of unstructured threat intelligence text
- Improved situational awareness for analysts
- Faster incident response and vulnerability triage
---
## Model Details
### Model Description
- **Developed by:** Cisco AI
- **Model Type:** ModernBertForTokenClassification
- **Framework:** PyTorch / Transformers
- **Tokenizer Type:** PreTrainedTokenizerFast
- **Number of Labels:** 11
- **Task:** Named Entity Recognition (NER)
- **License:** Apache-2.0
- **Language:** English
- **Base Model:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)
#### Supported Entity Labels
| Entity | Description |
|:--------|:-------------|
| `B-Indicator`, `I-Indicator` | Indicators of Compromise (e.g., IPs, domains, hashes) |
| `B-Malware`, `I-Malware` | Malware or exploit names |
| `B-Organization`, `I-Organization` | Companies or groups mentioned |
| `B-System`, `I-System` | Affected software or platforms |
| `B-Vulnerability`, `I-Vulnerability` | Specific CVEs or flaw descriptions |
| `O` | Token outside any entity span |
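The labels above follow the standard BIO scheme: `B-` marks the first token of an entity and `I-` marks continuation tokens. Decoding a tag sequence into entity spans can be sketched in a few lines of plain Python (the tokens and helper below are illustrative, not part of the model's API):

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])          # start a new entity
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)              # continue the current entity
        else:
            if current:
                spans.append(current)
            current = None                      # O tag or inconsistent I- tag
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["Stealc", "targets", "CVE-2023-1234", "on", "Windows"]
tags   = ["B-Malware", "O", "B-Vulnerability", "O", "B-System"]
print(bio_to_spans(tokens, tags))
# [('Malware', 'Stealc'), ('Vulnerability', 'CVE-2023-1234'), ('System', 'Windows')]
```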
#### Model Configuration
| Parameter | Value |
|:-----------|:-------|
| Hidden size | 768 |
| Intermediate size | 1152 |
| Hidden layers | 22 |
| Attention heads | 12 |
| Max sequence length | 8192 |
| Vocabulary size | 50368 |
| Activation | GELU |
| Dropout | 0.0 (embedding, attention, MLP, classifier) |
---
## Uses
### Direct Use
- Named Entity Recognition (NER) on cybersecurity text
- Threat intelligence enrichment
- IOC extraction and normalization
- Incident report analysis
- Vulnerability mention detection
### Downstream Use
This model can be integrated into:
- Threat intelligence platforms (TIPs)
- SOC automation tools
- Cybersecurity knowledge graphs
- Vulnerability management and CVE monitoring systems
### Out-of-Scope Use
- Non-technical or general-domain NER tasks
- Generative or conversational AI applications
---
## Benchmark Cybersecurity NER Corpus
### Dataset Overview
| Aspect | Description |
|:-------|:-------------|
| **Purpose** | Benchmark dataset for extracting cybersecurity entities from unstructured reports |
| **Data Source** | Curated threat intelligence documents emphasizing malware and system analysis |
| **Annotation Methodology** | Fully hand-labeled by domain experts |
| **Entity Types** | Malware, Indicator, System, Organization, Vulnerability |
| **Size** | 3,400 training samples and 717 test samples |
---
## How to Get Started with the Model
### Example Usage (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "cisco-ai/SecureBERT2.0-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges subword pieces into whole-entity spans
ner_pipeline = pipeline(
    "ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple"
)

text = "Stealc malware targets browser cookies and passwords."
entities = ner_pipeline(text)
print(entities)
```
## Training Details
### Training Objective and Procedure
The `SecureBERT2.0-NER` was fine-tuned for **token-level classification** on cybersecurity text using **Cross Entropy Loss**.
Training focused on accurately classifying entity boundaries and types across five cybersecurity-specific categories: *Malware, Indicator, System, Organization,* and *Vulnerability*.
The **AdamW** optimizer was used with a **linear learning rate scheduler**, and gradient clipping ensured stability during fine-tuning.
### Training Configuration
| Setting | Value |
|:---------|:------:|
| Objective | Token-wise Cross Entropy |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Weight Decay | 0.001 |
| Batch Size per GPU | 8 |
| Epochs | 20 |
| Max Sequence Length | 1024 |
| Gradient Clipping Norm | 1.0 |
| Scheduler | Linear |
| Mixed Precision | fp16 |
| Framework | PyTorch / Transformers |
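The settings above map roughly onto Hugging Face `TrainingArguments` as sketched below. This is a hedged reconstruction, not the published training script: only the values shown in the table are taken from the source, and everything else is left at library defaults.

```python
from transformers import TrainingArguments

# Values from the training configuration table; all other
# arguments are illustrative defaults, not confirmed settings.
args = TrainingArguments(
    output_dir="securebert2-ner",       # hypothetical output path
    learning_rate=1e-5,
    weight_decay=0.001,
    per_device_train_batch_size=8,
    num_train_epochs=20,
    max_grad_norm=1.0,                  # gradient clipping
    lr_scheduler_type="linear",
    fp16=True,                          # mixed precision (requires a GPU)
)
```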
### Training Dataset
The model was fine-tuned on a **cybersecurity-specific NER corpus**, containing annotated threat intelligence reports, advisories, and technical documentation.
| Property | Description |
|:----------|:-------------|
| **Dataset Type** | Manually annotated corpus |
| **Language** | English |
| **Entity Types** | Malware, Indicator, System, Organization, Vulnerability |
| **Train Size** | 3,400 samples |
| **Test Size** | 717 samples |
| **Annotation Method** | Expert hand-labeling for accuracy and consistency |
### Preprocessing
- Texts were tokenized using the `PreTrainedTokenizerFast` tokenizer from SecureBERT 2.0.
- All sequences were truncated or padded to 1024 tokens.
- Labels were aligned with subword tokens to maintain token–label consistency.
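Token–label alignment typically follows the common fast-tokenizer pattern: only the first subword of each word keeps the word's label, and remaining subwords (and special tokens) receive `-100` so cross-entropy ignores them. The helper below is an illustrative sketch of that pattern, operating on the `word_ids()` mapping a fast tokenizer produces, not the exact training code:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids: per-subword word indices as returned by a fast tokenizer's
    word_ids() — None for special tokens. Only the first subword of each
    word keeps its label; the rest get ignore_index so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)        # [CLS]/[SEP]/padding
        elif wid != prev:
            aligned.append(word_labels[wid])    # first subword of the word
        else:
            aligned.append(ignore_index)        # continuation subword
        prev = wid
    return aligned

# "Stealc" split into two subwords; special tokens map to None
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 0]  # e.g. 1 = B-Malware, 0 = O
print(align_labels(word_ids, word_labels))
# [-100, 1, -100, 0, 0, -100]
```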
### Hardware and Training Setup
| Component | Description |
|:-----------|:-------------|
| GPUs Used | 8× NVIDIA A100 |
| Precision | Mixed precision (fp16) |
| Batch Size | 8 per GPU |
| Framework | Transformers (TensorFlow backend) |
### Optimization Summary
The model converged within the 20-epoch schedule, with training loss plateauing late in training.
Validation metrics (F1, precision, recall) improved steadily from epoch 3 onward, indicating effective domain-specific adaptation.
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Evaluation was conducted on a **cybersecurity-specific NER benchmark corpus** containing annotated threat reports, advisories, and incident analysis texts.
This benchmark includes five key entity types: **Malware, Indicator, System, Organization, and Vulnerability**.
#### Metrics
The following metrics were used to assess model performance:
- **F1-score:** Harmonic mean of precision and recall
- **Recall:** Measures how many true entities were correctly identified
- **Precision:** Measures how many predicted entities were correct
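For NER these metrics are typically computed at the entity level (as in the widely used `seqeval` library, assumed here rather than stated in the source): a prediction counts as correct only when both the span and the entity type match exactly. A minimal illustration:

```python
def entity_f1(gold_spans, pred_spans):
    """Micro precision/recall/F1 over exact (type, start, end) matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Malware", 0, 1), ("System", 4, 5)]
pred = [("Malware", 0, 1), ("Indicator", 2, 3)]  # one hit, one false positive
print(entity_f1(gold, pred))
# (0.5, 0.5, 0.5)
```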
### Results
| Model | F1 | Recall | Precision |
|:------|:---:|:-------:|:-----------:|
| **CyBERT** | 0.351 | 0.281 | 0.467 |
| **SecureBERT** | 0.734 | 0.759 | 0.717 |
| **SecureBERT 2.0 (Ours)** | **0.945** | **0.965** | **0.927** |
#### Summary
The **SecureBERT 2.0 NER model** significantly outperforms both CyBERT and the original SecureBERT across all metrics.
- It achieves an **F1-score of 0.945**, a **+0.21 absolute improvement** over SecureBERT (0.734).
- Its **recall (0.965)** indicates excellent coverage of cybersecurity entities.
- Its **precision (0.927)** shows strong accuracy and low false-positive rates.
This demonstrates that **domain-adaptive pretraining and fine-tuning** on cybersecurity corpora dramatically improves NER performance compared to general or earlier models.
---
## Reference
```
@article{aghaei2025securebert,
  title   = {SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author  = {Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal = {arXiv preprint arXiv:2510.00240},
  year    = {2025}
}
```
---
## Model Card Authors
Cisco AI
## Model Card Contact
For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com)