cisco-ehsan committed
Commit 3b04abe · verified · 1 Parent(s): 3c873ad

Update README.md

Files changed (1):
  1. README.md +38 -19

README.md CHANGED
@@ -24,33 +24,52 @@ base_model:
  - CiscoAITeam/SecureBERT2.0-base
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Maximum Sequence Length:** 1024 tokens
- - **Output Dimensionality:** 768 dimensions
- - **Similarity Function:** Cosine Similarity

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture

- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```
+ # SecureBERT 2.0 Document Embedding and Similarity Search Model (bi-encoder)
+
+ This is a bi-encoder model fine-tuned on top of [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-code-vuln-detection), a cybersecurity domain-specific language model. It maps texts into a dense vector space and computes similarity scores for pairs of texts, which can be used for text ranking, semantic search, document embedding, and other cybersecurity-related natural language tasks.
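The pairwise scoring described above reduces to cosine similarity between embedding vectors. A minimal plain-Python sketch of that scoring step, using tiny stand-in vectors rather than real model outputs (the repo id in the comment is an assumption, not confirmed by this card):

```python
import math

# A real pipeline would obtain the vectors from the model, e.g. (repo id assumed):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-base")
#   vec_a, vec_b = model.encode([text_a, text_b])
# Tiny stand-in vectors are used here to show only the scoring step.

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

advisory_1 = [0.2, 0.9, 0.1]   # stand-in embedding of an advisory
advisory_2 = [0.1, 0.8, 0.2]   # stand-in embedding of a related advisory
unrelated  = [0.9, 0.1, 0.0]   # stand-in embedding of an unrelated text

related_score = cosine_similarity(advisory_1, advisory_2)
unrelated_score = cosine_similarity(advisory_1, unrelated)
```

Higher scores indicate more semantically similar texts; ranking candidates by this score is the basis of the retrieval tasks listed above.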
+
+ ## SecureBERT 2.0 Bi-Encoder for Cybersecurity
+
+ Document embeddings are central to modern cybersecurity pipelines, enabling efficient use of large and complex text corpora. They power applications such as **Retrieval-Augmented Generation (RAG)**, semantic search, ranking, and threat intelligence retrieval.
+
+ - In **RAG**, embeddings retrieve contextually relevant documents that improve generation accuracy.
+ - Embedding-based ranking prioritizes advisories, vulnerability reports, and incident descriptions.
+ - Unlike keyword-based search, embedding-driven retrieval supports **semantic matching** for tasks such as threat hunting, compliance checking, and knowledge management.
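The RAG step in the bullets above can be sketched as: score the query embedding against precomputed document embeddings, keep the top-k documents, and assemble them into the generation prompt. A minimal illustration with hypothetical names and stand-in, already-normalized vectors (not real model outputs):

```python
def top_k_documents(query_vec, doc_vecs, docs, k=2):
    """Rank documents by dot product with the query (vectors assumed normalized)."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_rag_prompt(question, context_docs):
    """Assemble the retrieved documents into a generation prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Stand-in documents and roughly unit-length embeddings.
docs = ["CVE-2024-0001 advisory", "phishing report", "kernel patch notes"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
query_vec = [0.9, 0.436]

prompt = build_rag_prompt("Which advisory is relevant?",
                          top_k_documents(query_vec, doc_vecs, docs, k=2))
```

Only the top-scoring documents reach the generator, which is how embedding quality directly affects generation accuracy.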
+
+ ---
+
+ ## Architecture
+
+ - **Bi-encoders** encode queries and documents independently into a shared vector space, enabling scalable approximate nearest-neighbor retrieval for initial candidate selection.
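The scalability property can be made concrete: documents are encoded once offline into an index, and at query time only the query is encoded before a dot-product scan (or an ANN lookup in a real system) selects candidates. A toy sketch in which `toy_encode` is a hypothetical bag-of-words stand-in for the model's encoder, not the actual model:

```python
VOCAB = ["sql", "injection", "phishing", "overflow", "email", "login"]

def toy_encode(text):
    """Stand-in encoder: a real system would call the bi-encoder model here."""
    words = text.lower().split()
    counts = [words.count(w) for w in VOCAB]
    norm = sum(x * x for x in counts) ** 0.5 or 1.0  # avoid division by zero
    return [x / norm for x in counts]

# Offline: encode the corpus once; the vectors form the retrieval index.
corpus = [
    "buffer overflow in parser",
    "phishing email campaign",
    "sql injection in login form",
]
index = [toy_encode(doc) for doc in corpus]

# Online: encode only the query, then rank index vectors by dot product.
def search(query):
    q = toy_encode(query)
    scores = [sum(a * b for a, b in zip(q, v)) for v in index]
    return corpus[max(range(len(corpus)), key=scores.__getitem__)]
```

Because query and document encoders run independently, the document side never needs re-encoding as queries arrive, which is what makes bi-encoders suitable for first-stage candidate selection.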
+
+ ---
+
+ ## Datasets
+
+ We fine-tuned embedding models using multiple cybersecurity-specific datasets:
+
+ | Dataset Category | Number of Records |
+ |------------------|-------------------|
+ | Cybersecurity QA corpus | 43,000 |
+ | Security governance QA corpus | 60,000 |
+ | Cybersecurity instruction–response corpus | 25,000 |
+ | Cybersecurity rules corpus (evaluation) | 5,000 |
+
+ ### Dataset Descriptions
+
+ - **Cybersecurity QA corpus:**
+   ~43k records of question–answer pairs, incident reports, and domain knowledge across subdomains such as network security, malware analysis, cryptography, and cloud security. Provides varied technical text for both precise retrieval and contextual understanding.
+
+ - **Security governance QA corpus:**
+   ~60k curated QA pairs on governance, compliance, vulnerability management, and exploit analysis. Emphasizes concise, expert-validated answers for robust semantic generalization.
+
+ - **Cybersecurity instruction–response corpus:**
+   ~25k instruction–response pairs (e.g., “Describe mitigation techniques for cross-site scripting”), designed for instruction-following and contextual reasoning. Supports improved reranking and semantic search.
+
+ - **Cybersecurity rules corpus (evaluation):**
+   ~5k structured cybersecurity policies, guidelines, and best practices. Used as an evaluation benchmark for assessing retrieval quality against standards and compliance rules.
+
  ## Usage