cisco-ehsan committed · Commit 299e961 (verified) · 1 Parent(s): f77711b

Update README.md

  1. README.md +33 -0
README.md CHANGED
@@ -16,6 +16,39 @@ library_name: transformers
 
  This is a **Named Entity Recognition (NER) model** fine-tuned on top of [**SecureBERT 2.0**](cisco-ehsan/SecureBERT2.0). It is designed for extracting cybersecurity entities such as Indicators, Malware, Organizations, Systems, and Vulnerabilities from text.
 
+
+ Named Entity Recognition (NER) is crucial in cybersecurity for automatically extracting and classifying entities from unstructured text such as threat reports, advisories, and incident logs. Identifying indicators of compromise (IOCs), malware names, affected systems, organizations, and vulnerabilities enables automated threat analysis, improves situational awareness, and supports rapid incident response. NER also helps populate structured knowledge bases, enrich threat intelligence platforms, and convert raw text into actionable insights.
+
+ ---
+
+ ## Benchmark Cybersecurity NER Corpus
+
+ To evaluate SecureBERT 2.0, we used several cybersecurity-specific benchmark corpora with annotated entities. These datasets are essential for assessing the model’s ability to identify and classify domain-specific entities.
+
+ | Aspect | Description |
+ |--------|-------------|
+ | **Purpose** | A manually annotated benchmark dataset for extracting cybersecurity concepts from unstructured threat intelligence reports; designed as a foundational resource for training and evaluating NER models. |
+ | **Data Source** | Derived primarily from high-quality, noise-free threat intelligence reports, with emphasis on malware analysis. |
+ | **Annotation Methodology** | Fully hand-labeled by domain experts to ensure accuracy, consistency, and contextual relevance. |
+ | **Entity Types** | Five entity categories: *Malware, Indicator, System, Organization, Vulnerability*. |
+ | **Size** | 3.4k samples in the combined training set and 717 samples in the test set. |
+
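The five entity categories in the table map naturally onto a token-classification label set. The sketch below assumes a BIO tagging scheme (`B-`/`I-` prefixes plus `O`), which the README does not state; the helper name `decode_entities` and the example tokens (including the placeholder malware name and CVE identifier) are purely illustrative.

```python
# Entity categories listed in the corpus table.
ENTITY_TYPES = ["Malware", "Indicator", "System", "Organization", "Vulnerability"]

# Assumption: BIO tagging. The README does not say which scheme the corpus uses.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def decode_entities(tokens, tags):
    """Group per-token BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts here; flush the one in progress, if any.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical tokens and tags, not taken from the benchmark corpus.
tokens = ["FooLoader", "exploits", "CVE-0000-0000", "on", "Windows", "Server"]
tags = ["B-Malware", "O", "B-Vulnerability", "O", "B-System", "I-System"]
print(decode_entities(tokens, tags))
```

Under this assumed scheme the classifier head would have 11 output labels (two per entity type plus `O`); a different scheme such as BIOES would change that count.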
+ ---
+
+ ## Training Details
+
+ The model was trained using **token-wise cross-entropy loss** with the **AdamW optimizer** and a **linear learning rate scheduler**. Gradient clipping with a maximum norm of 1.0 was applied for stability.
+
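To make the loss and clipping terms concrete, here is a minimal pure-Python sketch of token-wise cross-entropy and global-norm gradient clipping. It is not the training code: the `-100` ignore value for padded or special tokens is an assumption borrowed from common PyTorch practice, and the function names are hypothetical. In a PyTorch setup these steps would typically correspond to `torch.nn.CrossEntropyLoss` and `torch.nn.utils.clip_grad_norm_`, though the README does not say which implementation was used.

```python
import math

IGNORE_INDEX = -100  # Assumption: label value for positions excluded from the loss.


def token_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over tokens; logits is [seq_len][num_labels]."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # Skip padding / special tokens.
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    return sum(losses) / len(losses) if losses else 0.0


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Toy inputs: three tokens, three labels, middle token ignored.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [0, IGNORE_INDEX, 1]
loss = token_cross_entropy(logits, labels)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(loss, clipped)
```

The clipping function here is the idealized form; library implementations usually add a small epsilon to the norm before dividing.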
+ **Key hyperparameters:**
+
+ - Maximum sequence length: 1024
+ - Per-GPU batch size: 8
+ - Learning rate: 1e-5
+ - Weight decay: 0.001
+ - Number of epochs: 20
+
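As a rough illustration of how these values interact, the sketch below collects them in a config dictionary and computes a linear learning-rate schedule. Several details are assumptions not stated in the README: a single GPU, no warmup, decay to zero, and a training set of exactly 3,400 samples (rounded from "3.4k"). The names `CONFIG`, `total_training_steps`, and `linear_lr` are hypothetical; the decay formula follows the shape commonly used by linear schedulers such as `transformers.get_linear_schedule_with_warmup`.

```python
# Values from the hyperparameter list and the gradient-clipping sentence.
CONFIG = {
    "max_seq_length": 1024,
    "per_gpu_batch_size": 8,
    "learning_rate": 1e-5,
    "weight_decay": 0.001,
    "num_epochs": 20,
    "max_grad_norm": 1.0,
}

# Assumptions (not stated in the README): one GPU, no warmup, decay to zero.
NUM_GPUS = 1
WARMUP_STEPS = 0
TRAIN_SAMPLES = 3400  # "3.4k" from the corpus table, rounded.


def total_training_steps(num_samples, config, num_gpus=NUM_GPUS):
    """Optimizer steps across all epochs, counting a partial final batch."""
    global_batch = config["per_gpu_batch_size"] * num_gpus
    steps_per_epoch = -(-num_samples // global_batch)  # ceiling division
    return steps_per_epoch * config["num_epochs"]


def linear_lr(step, total_steps, base_lr, warmup_steps=WARMUP_STEPS):
    """Linear warmup (if any), then linear decay from base_lr to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total = total_training_steps(TRAIN_SAMPLES, CONFIG)
for step in (0, total // 2, total):
    print(step, linear_lr(step, total, CONFIG["learning_rate"]))
```

With more GPUs or gradient accumulation the number of optimizer steps, and therefore the slope of the decay, would change accordingly.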
+ ---
+
  ---
 
  ## Model Details