cisco-ehsan committed on
Commit
e7c1709
·
verified ·
1 Parent(s): f634250

Update README.md

  1. README.md +152 -40
README.md CHANGED
@@ -13,49 +13,95 @@ tags:
13
  - docembedding
14
  ---
15
 
 
16
 
17
- # SecureBERT 2.0 Cross-Encoder Fine-Tuned for Cybersecurity
18
-
19
- This is a **Cross Encoder** model fine-tuned on top of [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-base), a cybersecurity domain-specific BERT model. It computes similarity scores for pairs of texts, which can be used for **text reranking, semantic search, or other cybersecurity-related NLP tasks**.
20
 
21
  ---
22
 
23
  ## Model Details
24
 
25
- - **Model Type:** Cross Encoder
 
 
 
 
26
  - **Max Sequence Length:** 1024 tokens
27
- - **Output Labels:** 1
28
  - **Language:** English
29
  - **License:** Apache-2.0
30
 
31
  ---
32
- ## Usage
33
- Sentence Transformers API
34
 
35
- Install the library:
 
 
36
 
 
37
  ```bash
38
  pip install -U sentence-transformers
39
  ```
40
- Load the model and run inference:
41
 
 
42
  ```python
43
  from sentence_transformers import CrossEncoder
44
 
45
  # Load the model
46
- model = CrossEncoder("cross_encoder_model_id")
47
 
48
- # Score pairs of cybersecurity text
49
  pairs = [
50
- ["How does Stealc malware extract browser data?", "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
51
- ["Best practices for post-acquisition cybersecurity integration?", "Conduct security assessment, align policies, integrate security technologies, and train employees."],
 
 
52
  ]
53
 
 
54
  scores = model.predict(pairs)
55
  print(scores)
56
  ```
57
 
58
- Rank a set of candidate responses based on similarity to a query:
59
  ```python
60
  query = "How to prevent Kerberoasting attacks?"
61
  candidates = [
@@ -65,32 +111,53 @@ candidates = [
65
  ]
66
  ranking = model.rank(query, candidates)
67
  print(ranking)
68
-
69
  ```
70
 
71
- ### Framework Versions
72
- - Python: 3.10.10
73
- - Sentence Transformers: 5.0.0
74
- - Transformers: 4.52.4
75
- - PyTorch: 2.7.0+cu128
76
- - Accelerate: 1.9.0
77
- - Datasets: 3.6.0
78
- - Tokenizers: 0.21.1
79
- -
 
 
80
  ## Training Details
81
 
82
  ### Training Dataset
83
 
 
84
 
85
- - **Size:** 35,705 samples
86
  - **Columns:** `sentence1`, `sentence2`, `label`
87
- - **Approximate statistics (first 1000 samples):**
88
- | Field | Sentence1 | Sentence2 | Label |
89
- |-------|-----------|-----------|-------|
90
- | Type | string | string | float |
91
- | Mean Length | 98.46 | 1468.34 | 1.0 |
92
 
93
  - **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)
 
 
94
  ```json
95
  {
96
  "scale": 10.0,
@@ -98,15 +165,60 @@ print(ranking)
98
  "activation_fn": "torch.nn.modules.activation.Sigmoid",
99
  "mini_batch_size": 24
100
  }
101
-
102
  ```
103
- ## Reference
104
 
105
- ```
106
- @article{aghaei2025securebert,
107
- title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
108
- author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
109
- journal={arXiv preprint arXiv:2510.00240},
110
- year={2025}
111
- }
112
- ```
13
  - docembedding
14
  ---
15
 
16
+ # Model Card for CiscoAITeam/SecureBERT2.0-cross-encoder
17
 
18
+ The **SecureBERT 2.0 Cross-Encoder** is a cybersecurity domain-specific model fine-tuned from [SecureBERT 2.0](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base).
19
+ It computes **pairwise similarity scores** between two texts, enabling use in **text reranking, semantic search, and cybersecurity intelligence retrieval** tasks.
 
20
 
21
  ---
22
 
23
  ## Model Details
24
 
25
+ ### Model Description
26
+
27
+ - **Developed by:** Cisco AI Team
28
+ - **Model type:** Cross Encoder (Sentence Similarity)
29
+ - **Architecture:** ModernBERT (fine-tuned via Sentence Transformers)
30
  - **Max Sequence Length:** 1024 tokens
31
+ - **Output Labels:** 1 (similarity score)
32
  - **Language:** English
33
  - **License:** Apache-2.0
34
+ - **Finetuned from model:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
35
+
36
+
37
+ ## Uses
38
+
39
+ ### Direct Use
40
+
41
+ - Semantic text similarity in cybersecurity contexts
42
+ - Text and code reranking for information retrieval (IR)
43
+ - Threat intelligence question–answer relevance scoring
44
+ - Cybersecurity report and log correlation
45
+
46
+ ### Downstream Use
47
+
48
+ Can be integrated into:
49
+ - Cyber threat intelligence search engines
50
+ - SOC automation pipelines
51
+ - Cybersecurity knowledge graph enrichment
52
+ - Threat hunting and incident response systems
53
+
54
+ ### Out-of-Scope Use
55
+
56
+ - Generic text similarity outside the cybersecurity domain
57
+ - Tasks requiring generative reasoning or open-domain question answering
58
+
59
+ ---
60
+
61
+ ## Bias, Risks, and Limitations
62
+
63
+ The model reflects the distribution of cybersecurity-related data used during fine-tuning.
64
+ Potential risks include:
65
+ - Overrepresentation of specific malware, technologies, or threat actors
66
+ - Bias toward technical English sources
67
+ - Reduced performance on non-English or mixed technical/natural text
68
+
69
+ ### Recommendations
70
+
71
+ Users should evaluate results for domain alignment and, when applying the model to non-cybersecurity contexts, combine it with other retrieval models or heuristic filters.
72
 
73
  ---
 
 
74
 
75
+ ## How to Get Started with the Model
76
+
77
+ ### Using the Sentence Transformers API
78
 
79
+ #### Install dependencies
80
  ```bash
81
  pip install -U sentence-transformers
82
  ```
 
83
 
84
+ #### Run Inference
85
  ```python
86
  from sentence_transformers import CrossEncoder
87
 
88
  # Load the model
89
+ model = CrossEncoder("CiscoAITeam/SecureBERT2.0-cross-encoder")
90
 
91
+ # Example pairs
92
  pairs = [
93
+ ["How does Stealc malware extract browser data?",
94
+ "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
95
+ ["Best practices for post-acquisition cybersecurity integration?",
96
+ "Conduct security assessment, align policies, integrate security technologies, and train employees."],
97
  ]
98
 
99
+ # Compute similarity scores
100
  scores = model.predict(pairs)
101
  print(scores)
102
  ```
103
 
104
+ #### Rank Candidate Responses
105
  ```python
106
  query = "How to prevent Kerberoasting attacks?"
107
  candidates = [
 
111
  ]
112
  ranking = model.rank(query, candidates)
113
  print(ranking)
 
114
  ```
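
`CrossEncoder.rank` returns a list of dictionaries, each holding a `corpus_id` (the index into the candidate list) and a `score`, ordered from most to least relevant. The sketch below shows one way to map that output back to text. The helper name `top_k_documents`, the candidate strings, and the scores are illustrative assumptions, not real model output.

```python
from typing import Dict, List, Tuple


def top_k_documents(
    ranking: List[Dict[str, float]], candidates: List[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Map rank() output back to candidate text, keeping the k best entries."""
    # Sort defensively by score in case the list was modified after ranking.
    ordered = sorted(ranking, key=lambda item: item["score"], reverse=True)
    return [
        (candidates[int(item["corpus_id"])], float(item["score"]))
        for item in ordered[:k]
    ]


# Hypothetical candidates and scores, for illustration only.
example_candidates = ["candidate A", "candidate B", "candidate C"]
example_ranking = [
    {"corpus_id": 2, "score": 0.91},
    {"corpus_id": 0, "score": 0.40},
    {"corpus_id": 1, "score": 0.05},
]

for text, score in top_k_documents(example_ranking, example_candidates, k=2):
    print(f"{score:.2f}  {text}")
```

Passing `return_documents=True` to `rank` adds a `text` field to each entry, which avoids the index lookup.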
115
 
116
+ ## Framework Versions
117
+
118
+ * python: 3.10.10
119
+ * sentence_transformers: 5.0.0
120
+ * transformers: 4.52.4
121
+ * PyTorch: 2.7.0+cu128
122
+ * accelerate: 1.9.0
123
+ * datasets: 3.6.0
124
+
125
+ ---
126
+
127
  ## Training Details
128
 
129
  ### Training Dataset
130
 
131
+ The model was fine-tuned on a **cybersecurity sentence-pair similarity dataset** for cross-encoder training.
132
 
133
+ - **Dataset Size:** 35,705 samples
134
  - **Columns:** `sentence1`, `sentence2`, `label`
135
+
136
+ #### Average Lengths (first 1000 samples)
137
+
138
+ | Field | Mean Length |
139
+ |:------|:-------------:|
140
+ | Sentence1 | 98.46 |
141
+ | Sentence2 | 1468.34 |
142
+ | Label | 1.0 (mean value) |
143
+
144
+ #### Example Schema
145
+
146
+ | Field | Type | Description |
147
+ |:------|:------|:------------|
148
+ | sentence1 | string | Query or document text |
149
+ | sentence2 | string | Paired document or candidate response |
150
+ | label | float | Similarity score between the two inputs |
151
+
152
+ ---
153
+
154
+ ### Training Objective and Loss
155
+
156
+ The model was trained using a **contrastive ranking objective** to learn high-quality similarity scores between cybersecurity-related text pairs.
157
 
158
  - **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)
159
+
160
+ #### Loss Parameters
161
  ```json
162
  {
163
  "scale": 10.0,
 
165
  "activation_fn": "torch.nn.modules.activation.Sigmoid",
166
  "mini_batch_size": 24
167
  }
 
168
  ```
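
To make the role of `scale` and `activation_fn` concrete, here is a simplified sketch of a multiple-negatives ranking loss for a single query. It assumes the raw cross-encoder scores for one positive pair (index 0) and a few negatives are already available. It does not implement the gradient caching that `mini_batch_size` controls, and the function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic activation, matching the Sigmoid named in the loss parameters."""
    return 1.0 / (1.0 + math.exp(-x))


def ranking_loss(raw_scores: List[float], scale: float = 10.0) -> float:
    """Cross-entropy over scaled, sigmoid-activated scores.

    raw_scores[0] is the score of the positive pair; the rest are negatives.
    """
    logits = [scale * sigmoid(s) for s in raw_scores]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum_exp - logits[0]


# Illustrative raw scores: the positive pair scored high vs. scored low.
print(ranking_loss([4.0, -3.0, -2.0]))
print(ranking_loss([-4.0, 3.0, 2.0]))
```

Because the sigmoid bounds each score to (0, 1), the `scale` factor sets how sharply the softmax separates the positive from the negatives.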
 
169
 
170
+
171
+ ## Evaluation
172
+
173
+ ### Testing Data, Factors & Metrics
174
+
175
+ #### Testing Data
176
+
177
+ The evaluation was performed on a **held-out test set** of cybersecurity-related question–answer pairs and document retrieval tasks.
178
+ Data includes:
179
+ - Threat intelligence descriptions and related advisories
180
+ - Exploit procedure and mitigation text pairs
181
+ - Cybersecurity Q&A and incident analysis examples
182
+
183
+ #### Factors
184
+
185
+ Evaluation considered multiple aspects of similarity and relevance:
186
+ - **Domain diversity:** different cybersecurity subfields (malware, vulnerabilities, network defense)
187
+ - **Task diversity:** retrieval, reranking, and relevance scoring
188
+ - **Pair length:** from short queries to long technical documents
189
+
190
+ #### Metrics
191
+
192
+ The model was evaluated using standard information retrieval metrics:
193
+ - **Mean Average Precision (mAP):** averages, over all queries, the precision measured at the rank of each relevant result
194
+ - **Recall@1 (R@1):** measures the proportion of correct top-1 matches
195
+ - **Normalized Discounted Cumulative Gain (NDCG@10):** evaluates ranking quality up to the 10th result
196
+ - **Mean Reciprocal Rank (MRR@10):** averages the reciprocal rank of the first correct answer within the top 10 results
197
+
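
The metrics above can be sketched for a single query as follows, assuming binary relevance labels listed in ranked order (1 = relevant, 0 = not). This is a minimal illustration, not the evaluation code used for the reported results. The function names are hypothetical, and both average precision and the ideal DCG are computed only from the documents present in the list.

```python
import math
from typing import List


def recall_at_1(relevance: List[int]) -> float:
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return float(bool(relevance) and relevance[0] == 1)


def reciprocal_rank_at_k(relevance: List[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: List[int]) -> float:
    """Mean of precision values taken at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(relevance: List[int], k: int = 10) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering."""
    def dcg(rels: List[int]) -> float:
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


# Illustrative ranked list: relevant results at ranks 2 and 4.
ranked = [0, 1, 0, 1]
print(recall_at_1(ranked), reciprocal_rank_at_k(ranked),
      average_precision(ranked), ndcg_at_k(ranked))
```

mAP and MRR@10 are then the means of `average_precision` and `reciprocal_rank_at_k` over all queries.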
198
+ ### Results
199
+
200
+ | Model | mAP | R@1 | NDCG@10 | MRR@10 |
201
+ |:------|:----:|:---:|:--------:|:--------:|
202
+ | **ms-marco-TinyBERT-L2** | 0.920 | 0.849 | 0.964 | 0.955 |
203
+ | **SecureBERT 2.0 Cross-Encoder** | **0.955** | **0.948** | **0.986** | **0.983** |
204
+
205
+ #### Summary
206
+
207
+ The **SecureBERT 2.0 Cross-Encoder** outperforms the baseline on **all four reported retrieval and ranking metrics** for cybersecurity text similarity tasks.
208
+
209
+ Compared to the general-purpose `ms-marco-TinyBERT-L2` baseline:
210
+ - Improves **mAP** by 0.035 (0.920 to 0.955)
211
+ - Raises **R@1** from 0.849 to 0.948 and **MRR@10** from 0.955 to 0.983, indicating more accurate top-1 retrieval
212
+ - Raises **NDCG@10** from 0.964 to 0.986, reflecting better ranking quality across the top 10 results
213
+
214
+ These results suggest that **domain-specific pretraining and fine-tuning** improve retrieval and ranking quality in cybersecurity applications.
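
The per-metric gains over the baseline can be derived from the results table. The values below are copied from that table; the variable names are illustrative.

```python
# Scores copied from the results table.
baseline = {"mAP": 0.920, "R@1": 0.849, "NDCG@10": 0.964, "MRR@10": 0.955}
securebert = {"mAP": 0.955, "R@1": 0.948, "NDCG@10": 0.986, "MRR@10": 0.983}

# Absolute gain of the SecureBERT 2.0 Cross-Encoder over ms-marco-TinyBERT-L2.
gains = {metric: round(securebert[metric] - baseline[metric], 3) for metric in baseline}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.3f}")
```

The largest absolute gain is in R@1, the metric most sensitive to whether the single top-ranked result is correct.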
215
+
216
+ ---
217
+
218
+ ## Model Card Authors
219
+
220
+ Cisco AI Team
221
+
222
+ ## Model Card Contact
223
+
224
+ For inquiries, please contact the [Cisco AI Team](mailto:eaghaei@cisco.com).