This is a **Named Entity Recognition (NER) model** fine-tuned on top of [**SecureBERT 2.0**](cisco-ehsan/SecureBERT2.0). It is designed for extracting cybersecurity entities such as Indicators, Malware, Organizations, Systems, and Vulnerabilities from text.
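The model predicts token-level labels, so downstream code typically merges contiguous B-/I- tags into entity spans before use. A minimal sketch of that grouping step, assuming a standard BIO tag scheme over the five entity types (the tokens and tags below are illustrative, not actual model output):

```python
# Group token-level BIO predictions into (entity_type, text) spans.
# Tokens and tags here are illustrative, not real model output.
def group_bio(tokens, tags):
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["The", "Emotet", "malware", "hit", "Acme", "Corp"]
tags   = ["O", "B-Malware", "O", "O", "B-Organization", "I-Organization"]
print(group_bio(tokens, tags))
# → [('Malware', 'Emotet'), ('Organization', 'Acme Corp')]
```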
Named Entity Recognition (NER) is crucial in cybersecurity for automatically extracting and classifying entities from unstructured text, such as threat reports, advisories, and incident logs. Identifying indicators of compromise (IOCs), malware names, affected systems, organizations, and vulnerabilities enables automated threat analysis, improves situational awareness, and supports rapid incident response. NER also facilitates structured knowledge bases, enriches threat intelligence platforms, and converts raw text into actionable insights.
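Feeding a threat-intelligence platform usually means folding the flat entity list into one structured record per document. A minimal sketch of that step (the schema and entity values are hypothetical, not part of the model):

```python
from collections import defaultdict

# Fold a flat list of (type, text) entities into one structured record
# per document; the field names are illustrative, not a fixed schema.
def to_record(doc_id, entities):
    fields = defaultdict(list)
    for etype, text in entities:
        if text not in fields[etype]:  # de-duplicate per entity type
            fields[etype].append(text)
    return {"doc_id": doc_id, "entities": dict(fields)}

record = to_record("report-001", [
    ("Malware", "Emotet"),
    ("Indicator", "45.77.12.9"),
    ("Malware", "Emotet"),  # duplicate mention, kept once
])
print(record["entities"]["Malware"])  # → ['Emotet']
```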

---

## Benchmark Cybersecurity NER Corpus

To evaluate SecureBERT 2.0, we used several cybersecurity-specific benchmark corpora with annotated entities. These datasets are essential for assessing the model’s ability to identify and classify domain-specific entities.

| Aspect | Description |
|--------|-------------|
| **Purpose** | A manually annotated benchmark dataset for extracting cybersecurity concepts from unstructured threat intelligence reports; designed as a foundational resource for training and evaluating NER models. |
| **Data Source** | Derived primarily from high-quality, noise-free threat intelligence reports, with emphasis on malware analysis. |
| **Annotation Methodology** | Fully hand-labeled by domain experts to ensure accuracy, consistency, and contextual relevance. |
| **Entity Types** | Defines five entity categories: *Malware, Indicator, System, Organization, Vulnerability*. |
| **Size** | Contains 3.4k samples in the combined training set and 717 samples in the test set. |

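Benchmarks of this kind are conventionally scored with entity-level precision, recall, and F1, where a predicted entity counts as correct only if both its span and its type match the gold annotation. A minimal sketch of that metric, independent of any particular evaluation library (the gold/predicted spans below are made up for illustration):

```python
# Entity-level precision/recall/F1: an entity counts as correct only if
# its (type, start, end) triple matches a gold annotation exactly.
def entity_f1(gold, pred):
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [("Malware", 1, 2), ("Indicator", 7, 8)]
pred = [("Malware", 1, 2), ("System", 4, 5)]  # one hit, one spurious
p, r, f = entity_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.5 0.5 0.5
```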
---

## Training Details

The model was trained using **token-wise Cross Entropy loss** with the **AdamW optimizer** and a **linear learning rate scheduler**. Gradient clipping with a maximum norm of 1.0 was applied for stability.

**Key hyperparameters:**

- Maximum sequence length: 1024
- Per-GPU batch size: 8
- Learning rate: 1e-5
- Weight decay: 0.001
- Number of epochs: 20
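"Token-wise cross entropy" means the loss is averaged over real tokens only, with padding positions masked out of the objective. A small numpy sketch of that computation (the shapes and the `-100` ignore label follow common convention and are assumptions here, not details taken from the training code):

```python
import numpy as np

# Token-wise cross-entropy over token logits, skipping positions labeled
# -100 (the conventional "ignore" index used for padding/special tokens).
def token_ce(logits, labels, ignore_index=-100):
    # logits: (num_tokens, num_labels); labels: (num_tokens,)
    mask = labels != ignore_index
    z = logits[mask]
    y = labels[mask]
    # numerically stabilized log-softmax
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.0, 0.0]])
labels = np.array([0, 1, -100])  # last position is padding: excluded
print(float(token_ce(logits, labels)))
```

Because the padded position is masked, its logits contribute nothing: the loss equals the loss computed on the two real tokens alone.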

---

## Model Details