Update README.md
README.md
CHANGED
|
@@ -24,33 +24,52 @@ base_model:
|
|
| 24 |
- CiscoAITeam/SecureBERT2.0-base
|
| 25 |
---
|
| 26 |
|
| 27 |
-
#
|
| 28 |
|
| 29 |
-
This is a [
|
| 30 |
|
| 31 |
-
## Model Details
|
| 32 |
|
| 33 |
-
|
| 34 |
-
- **Model Type:** Sentence Transformer
|
| 35 |
-
- **Maximum Sequence Length:** 1024 tokens
|
| 36 |
-
- **Output Dimensionality:** 768 dimensions
|
| 37 |
-
- **Similarity Function:** Cosine Similarity
|
| 38 |
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
-
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
|
| 44 |
-
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
```
|
| 49 |
-
SentenceTransformer(
|
| 50 |
-
(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
|
| 51 |
-
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
|
| 52 |
-
)
|
| 53 |
-
```
|
| 54 |
|
| 55 |
## Usage
|
| 56 |
|
|
|
|
| 24 |
- CiscoAITeam/SecureBERT2.0-base
|
| 25 |
---
|
| 26 |
|
| 27 |
+
# SecureBERT 2.0 Document Embedding and Similarity Search Model (bi-encoder)
|
| 28 |
|
| 29 |
+
This is a bi-encoder model fine-tuned from [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-code-vuln-detection), a cybersecurity domain-specific model. It computes similarity scores for pairs of texts, which can be used for text ranking, semantic search, document embedding, and other cybersecurity-related natural language tasks.
|
| 30 |
|
| 31 |
|
| 32 |
+
# SecureBERT 2.0 Bi-Encoder for Cybersecurity
|
| 33 |
|
| 34 |
+
Document embeddings are central to modern cybersecurity pipelines, enabling efficient use of large and complex text corpora. They power applications such as **Retrieval-Augmented Generation (RAG)**, semantic search, ranking, and threat intelligence retrieval.
|
| 35 |
|
| 36 |
+
- In **RAG**, embeddings retrieve contextually relevant documents that improve generation accuracy.
|
| 37 |
+
- Embedding-based ranking prioritizes advisories, vulnerability reports, and incident descriptions.
|
| 38 |
+
- Unlike keyword-based search, embedding-driven retrieval supports **semantic matching** for tasks such as threat hunting, compliance checking, and knowledge management.
|
| 39 |
|
| 40 |
+
---
|
| 41 |
|
| 42 |
+
## Architecture
|
| 43 |
+
|
| 44 |
+
- **Bi-encoder:** encodes queries and documents independently into a shared vector space, enabling scalable approximate nearest-neighbor retrieval for initial candidate selection.
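
The retrieval pattern described above can be sketched in a few lines. This is a minimal illustration, not the model's actual pipeline: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the real encoder, `DIM` and `rank` are names invented for the sketch, and the search is exact brute force rather than approximate nearest-neighbor. The point it shows is that documents are embedded once, independently of any query, and each query is then compared against the stored vectors by cosine similarity.

```python
import math
import zlib
from collections import Counter

DIM = 64  # toy size chosen for this sketch, not the model's embedding size


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for the encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are unit vectors (or all zeros), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def rank(query: str, doc_vectors: list[list[float]], top_k: int = 3):
    """Return (document index, score) pairs, best match first."""
    q = toy_encode(query)
    scored = [(i, cosine(q, d)) for i, d in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


documents = [
    "Patch the OpenSSL vulnerability on all web servers",
    "Rotate API keys every quarter",
    "Phishing email with a malicious attachment",
]

# Documents are encoded once, without seeing any query.
doc_vectors = [toy_encode(d) for d in documents]

# Queries are encoded separately and compared against the stored vectors.
for idx, score in rank("malicious attachment in a phishing email", doc_vectors):
    print(f"{score:.3f}  {documents[idx]}")
```

In a real deployment the stored vectors would typically live in a vector index so that the comparison step scales beyond a linear scan; that part is outside the scope of this sketch.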
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## Datasets
|
| 49 |
+
|
| 50 |
+
We fine-tuned embedding models using multiple cybersecurity-specific datasets:
|
| 51 |
+
|
| 52 |
+
| Dataset Category | Number of Records |
|
| 53 |
+
|-----------------|-----------------|
|
| 54 |
+
| Cybersecurity QA corpus | 43,000 |
|
| 55 |
+
| Security governance QA corpus | 60,000 |
|
| 56 |
+
| Cybersecurity instruction–response corpus | 25,000 |
|
| 57 |
+
| Cybersecurity rules corpus (evaluation) | 5,000 |
|
| 58 |
+
|
| 59 |
+
### Dataset Descriptions
|
| 60 |
+
|
| 61 |
+
- **Cybersecurity QA corpus:**
|
| 62 |
+
~43k records of question–answer pairs, incident reports, and domain knowledge across subdomains such as network security, malware analysis, cryptography, and cloud security. Provides varied technical text for both precise retrieval and contextual understanding.
|
| 63 |
+
|
| 64 |
+
- **Security governance QA corpus:**
|
| 65 |
+
~60k curated QA pairs on governance, compliance, vulnerability management, and exploit analysis. Emphasizes concise, expert-validated answers for robust semantic generalization.
|
| 66 |
+
|
| 67 |
+
- **Cybersecurity instruction–response corpus:**
|
| 68 |
+
~25k instruction–response pairs (e.g., “Describe mitigation techniques for cross-site scripting”), designed for instruction-following and contextual reasoning. Supports improved reranking and semantic search.
|
| 69 |
+
|
| 70 |
+
- **Cybersecurity rules corpus (evaluation):**
|
| 71 |
+
~5k structured cybersecurity policies, guidelines, and best practices. Used as an evaluation benchmark for assessing retrieval quality against standards and compliance rules.
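
The README does not say which metrics were used on the evaluation corpus. As an illustration only, the sketch below computes two common retrieval-quality measures, recall@k and mean reciprocal rank (MRR), from ranked result lists. The function names and the document IDs in the example are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for two queries against a rules corpus.
runs = [
    (["rule-3", "rule-1", "rule-7"], ["rule-1", "rule-9"]),
    (["rule-2", "rule-5"], ["rule-2"]),
]
print(recall_at_k(*runs[0], k=2))
print(mean_reciprocal_rank(runs))
```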
|
| 72 |
|
| 73 |
|
| 74 |
## Usage
|
| 75 |
|