cisco-ai/SecureBERT2.0-cross_encoder

@@ -13,49 +13,95 @@ tags:
 - docembedding
 ---
-# SecureBERT 2.0 Cross-Encoder Fine-Tuned for Cybersecurity
-This is a **Cross Encoder** model fine-tuned on top of [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-base), a cybersecurity domain-specific BERT model. It computes similarity scores for pairs of texts, which can be used for **text reranking, semantic search, or other cybersecurity-related NLP tasks**.
 ---
 ## Model Details
-- **Model Type:** Cross Encoder
 - **Max Sequence Length:** 1024 tokens
-- **Output Labels:** 1
 - **Language:** English
 - **License:** Apache-2.0
 ---
-## Usage
-Sentence Transformers API
-Install the library:
 ```bash
 pip install -U sentence-transformers
 ```
-Load the model and run inference:
 ```python
 from sentence_transformers import CrossEncoder
 # Load the model
-model = CrossEncoder("cross_encoder_model_id")
-# Score pairs of cybersecurity text
 pairs = [
-    ["How does Stealc malware extract browser data?", "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
-    ["Best practices for post-acquisition cybersecurity integration?", "Conduct security assessment, align policies, integrate security technologies, and train employees."],
 ]
 scores = model.predict(pairs)
 print(scores)
 ```
-Rank a set of candidate responses based on similarity to a query:
 ```python
 query = "How to prevent Kerberoasting attacks?"
 candidates = [
@@ -65,32 +111,53 @@ candidates = [
 ]
 ranking = model.rank(query, candidates)
 print(ranking)
 ```
-### Framework Versions
-- Python: 3.10.10
-- Sentence Transformers: 5.0.0
-- Transformers: 4.52.4
-- PyTorch: 2.7.0+cu128
-- Accelerate: 1.9.0
-- Datasets: 3.6.0
-- Tokenizers: 0.21.1
--
 ## Training Details
 ### Training Dataset
-- **Size:** 35,705 samples
 - **Columns:** `sentence1`, `sentence2`, `label`
-- **Approximate statistics (first 1000 samples):**
-  | Field | Sentence1 | Sentence2 | Label |
-  |-------|-----------|-----------|-------|
-  | Type | string | string | float |
-  | Mean Length | 98.46 | 1468.34 | 1.0 |
 - **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)
 ```json
 {
     "scale": 10.0,
@@ -98,15 +165,60 @@ print(ranking)
     "activation_fn": "torch.nn.modules.activation.Sigmoid",
     "mini_batch_size": 24
 }
 ```
-## Reference
-```
-@article{aghaei2025securebert,
-  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
-  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
-  journal={arXiv preprint arXiv:2510.00240},
-  year={2025}
-}
-```

 - docembedding
 ---
+# Model Card for CiscoAITeam/SecureBERT2.0-cross-encoder
+The **SecureBERT 2.0 Cross-Encoder** is a cybersecurity domain-specific model fine-tuned from [SecureBERT 2.0](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base).
+It computes **pairwise similarity scores** between two texts, enabling use in **text reranking, semantic search, and cybersecurity intelligence retrieval** tasks.
 ---
 ## Model Details
+### Model Description
+- **Developed by:** Cisco AI Team
+- **Model type:** Cross Encoder (Sentence Similarity)
+- **Architecture:** ModernBERT (fine-tuned via Sentence Transformers)
 - **Max Sequence Length:** 1024 tokens
+- **Output Labels:** 1 (similarity score)
 - **Language:** English
 - **License:** Apache-2.0
+- **Finetuned from model:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
+## Uses
+### Direct Use
+- Semantic text similarity in cybersecurity contexts
+- Text and code reranking for information retrieval (IR)
+- Threat intelligence question–answer relevance scoring
+- Cybersecurity report and log correlation
+### Downstream Use
+The model can be integrated into:
+- Cyber threat intelligence search engines
+- SOC automation pipelines
+- Cybersecurity knowledge graph enrichment
+- Threat hunting and incident response systems
+### Out-of-Scope Use
+- Generic text similarity outside the cybersecurity domain
+- Tasks requiring generative reasoning or open-domain question answering
+---
+## Bias, Risks, and Limitations
+The model reflects the distribution of cybersecurity-related data used during fine-tuning.
+Potential risks include:
+- Overrepresentation of specific malware, technologies, or threat actors
+- Bias toward technical English sources
+- Reduced performance on non-English or mixed technical/natural text
+### Recommendations
+Users should check that results align with their domain and, when applying the model to non-cybersecurity contexts, combine it with other retrieval models or heuristic filters.
 ---
+## How to Get Started with the Model
+### Using the Sentence Transformers API
+#### Install dependencies
 ```bash
 pip install -U sentence-transformers
 ```
+#### Run Inference
 ```python
 from sentence_transformers import CrossEncoder
 # Load the model
+model = CrossEncoder("CiscoAITeam/SecureBERT2.0-cross-encoder")
+# Example pairs
 pairs = [
+    ["How does Stealc malware extract browser data?",
+     "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
+    ["Best practices for post-acquisition cybersecurity integration?",
+     "Conduct security assessment, align policies, integrate security technologies, and train employees."],
 ]
+# Compute similarity scores
 scores = model.predict(pairs)
 print(scores)
 ```
+#### Rank Candidate Responses
 ```python
 query = "How to prevent Kerberoasting attacks?"
 candidates = [
 ]
 ranking = model.rank(query, candidates)
 print(ranking)
 ```
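The candidate list is elided in this diff, so the sketch below uses hypothetical candidates and made-up scores to illustrate the shape of a ranking result: a list of dicts sorted by score, each carrying a `corpus_id` (index into the candidates) and a `score`. The helper `rank_from_scores` is an illustrative stand-in, not part of the library; with the real model, the scores would come from `model.predict` on `[query, candidate]` pairs.

```python
from typing import Dict, List, Optional, Sequence, Union


def rank_from_scores(
    scores: Sequence[float], top_k: Optional[int] = None
) -> List[Dict[str, Union[int, float]]]:
    """Sort candidate indices by score, highest first (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [{"corpus_id": i, "score": float(scores[i])} for i in order]
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical candidates and scores, for illustration only.
candidates = [
    "Use long, random passwords for service accounts.",
    "Disable unused printer drivers.",
    "Prefer AES encryption over RC4 for Kerberos tickets.",
]
example_scores = [0.91, 0.07, 0.84]  # made-up values, not model output

ranking = rank_from_scores(example_scores, top_k=2)
for hit in ranking:
    # Map each hit back to its candidate text through corpus_id.
    print(f"{hit['score']:.2f}  {candidates[hit['corpus_id']]}")
```

Keeping only the top few hits is a common choice when the reranker feeds a downstream step that has a limited context budget.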
+## Framework Versions
+- Python: 3.10.10
+- Sentence Transformers: 5.0.0
+- Transformers: 4.52.4
+- PyTorch: 2.7.0+cu128
+- Accelerate: 1.9.0
+- Datasets: 3.6.0
+---
 ## Training Details
 ### Training Dataset
+The model was fine-tuned on a **cybersecurity sentence-pair similarity dataset** for cross-encoder training.
+- **Dataset Size:** 35,705 samples
 - **Columns:** `sentence1`, `sentence2`, `label`
+#### Summary Statistics (first 1000 samples)
+| Field | Mean |
+|:------|:----:|
+| Sentence1 (length) | 98.46 |
+| Sentence2 (length) | 1468.34 |
+| Label (value) | 1.0 |
+#### Example Schema
+| Field | Type | Description |
+|:------|:------|:------------|
+| sentence1 | string | Query or document text |
+| sentence2 | string | Paired document or candidate response |
+| label | float | Similarity score between the two inputs |
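A single record in this schema could be represented as below. This is a minimal sketch: the `PairExample` class and its validation rules are hypothetical, and the row reuses a pair from the inference example with an assumed label of 1.0 rather than an actual training sample.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PairExample:
    """One training row with the three columns listed in the schema."""

    sentence1: str
    sentence2: str
    label: float

    def __post_init__(self) -> None:
        # Basic sanity checks; the real dataset's constraints are not documented.
        if not self.sentence1 or not self.sentence2:
            raise ValueError("both sentences must be non-empty")
        if not isinstance(self.label, float):
            raise TypeError("label must be a float")


example = PairExample(
    sentence1="How does Stealc malware extract browser data?",
    sentence2=(
        "Stealc uses Sqlite3 DLL to query browser databases and "
        "retrieve cookies, passwords, and history."
    ),
    label=1.0,  # assumed value for illustration
)
row = asdict(example)
print(sorted(row))
```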
+---
+### Training Objective and Loss
+The model was trained using a **contrastive ranking objective** to learn high-quality similarity scores between cybersecurity-related text pairs.
 - **Loss Function:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#cachedmultiplenegativesrankingloss)
+#### Loss Parameters
 ```json
 {
     "scale": 10.0,
     "activation_fn": "torch.nn.modules.activation.Sigmoid",
     "mini_batch_size": 24
 }
 ```
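To make the role of `scale` and `activation_fn` concrete, here is a simplified sketch of a multiple-negatives ranking loss for a single query. It assumes the positive candidate sits at index 0, that raw scores pass through a sigmoid and are multiplied by the scale, and that the loss is the cross-entropy of picking the positive. The caching and mini-batch mechanics of the cached variant are omitted, the function name is hypothetical, and the raw scores are made up.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ranking_loss(raw_scores: Sequence[float], scale: float = 10.0) -> float:
    """Cross-entropy of selecting index 0 (the positive) among all candidates."""
    # Squash raw scores into (0, 1), then stretch them by the scale factor.
    logits = [scale * sigmoid(s) for s in raw_scores]
    # Numerically stable log-sum-exp over the candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum_exp - logits[0]


# Made-up raw scores: one positive followed by three negatives.
well_separated = ranking_loss([4.0, -3.0, -2.5, -4.0])
poorly_separated = ranking_loss([0.2, 0.1, 0.0, 0.3])
print(f"{well_separated:.4f} {poorly_separated:.4f}")
```

Because the sigmoid bounds each score to (0, 1), the scale sets how sharply the softmax can separate the positive from the negatives: with a scale of 10, the largest possible gap between two logits is 10.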
+## Evaluation
+### Testing Data, Factors & Metrics
+#### Testing Data
+The evaluation was performed on a **held-out test set** of cybersecurity-related question–answer pairs and document retrieval tasks.
+Data includes:
+- Threat intelligence descriptions and related advisories
+- Exploit procedure and mitigation text pairs
+- Cybersecurity Q&A and incident analysis examples
+#### Factors
+Evaluation considered multiple aspects of similarity and relevance:
+- **Domain diversity:** different cybersecurity subfields (malware, vulnerabilities, network defense)
+- **Task diversity:** retrieval, reranking, and relevance scoring
+- **Pair length:** from short queries to long technical documents
+#### Metrics
+The model was evaluated using standard information retrieval metrics:
+- **Mean Average Precision (mAP):** the mean over queries of average precision, which rewards placing every relevant result near the top of the ranking
+- **Recall@1 (R@1):** the proportion of queries whose top-ranked result is correct
+- **Normalized Discounted Cumulative Gain (NDCG@10):** ranking quality over the top 10 results, with gains discounted by rank position
+- **Mean Reciprocal Rank (MRR@10):** the mean reciprocal rank of the first correct answer within the top 10 results
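The metrics above can be computed from ranked relevance lists as in the following sketch. It assumes binary relevance and that every relevant document appears somewhere in the ranked list. The card does not describe its exact evaluation protocol, so these are common textbook formulations rather than the authors' evaluation code, and the two example queries are made up.

```python
import math
from typing import Dict, List, Sequence


def average_precision(rels: Sequence[int]) -> float:
    """AP for one query; rels[i] is 1 if the result at rank i+1 is relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def recall_at_1(rels: Sequence[int]) -> float:
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if rels and rels[0] else 0.0


def reciprocal_rank(rels: Sequence[int], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0.0."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg(rels: Sequence[int], k: int = 10) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering."""
    def dcg(values: Sequence[int]) -> float:
        return sum(v / math.log2(r + 1) for r, v in enumerate(values, start=1))

    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal else 0.0


def evaluate(runs: List[Sequence[int]]) -> Dict[str, float]:
    """Average each per-query metric over all queries."""
    n = len(runs)
    return {
        "mAP": sum(average_precision(r) for r in runs) / n,
        "R@1": sum(recall_at_1(r) for r in runs) / n,
        "NDCG@10": sum(ndcg(r) for r in runs) / n,
        "MRR@10": sum(reciprocal_rank(r) for r in runs) / n,
    }


# Two made-up queries: relevance of each result in ranked order.
report = evaluate([[1, 0, 1, 0], [0, 1, 0, 0]])
print({name: round(value, 3) for name, value in report.items()})
```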
+### Results
+| Model | mAP | R@1 | NDCG@10 | MRR@10 |
+|:------|:----:|:---:|:--------:|:--------:|
+| **ms-marco-TinyBERT-L2** | 0.920 | 0.849 | 0.964 | 0.955 |
+| **SecureBERT 2.0 Cross-Encoder** | **0.955** | **0.948** | **0.986** | **0.983** |
+#### Summary
+The **SecureBERT 2.0 Cross-Encoder** outperforms the general-purpose `ms-marco-TinyBERT-L2` baseline on all four metrics for cybersecurity text similarity tasks.
+Compared to the baseline:
+- **mAP** improves by 0.035 (0.920 to 0.955)
+- **R@1** improves by 0.099 (0.849 to 0.948), the largest gain, indicating more accurate top-1 retrieval
+- **NDCG@10** and **MRR@10** improve by 0.022 and 0.028, reaching 0.986 and 0.983
+These results suggest that **domain-specific pretraining and fine-tuning** improve semantic matching and information retrieval in cybersecurity applications.
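The gains over the baseline follow directly from the Results table. The short sketch below recomputes the per-metric differences. The numbers are copied from the table; the variable names are illustrative.

```python
# Scores copied from the Results table.
baseline = {"mAP": 0.920, "R@1": 0.849, "NDCG@10": 0.964, "MRR@10": 0.955}
securebert = {"mAP": 0.955, "R@1": 0.948, "NDCG@10": 0.986, "MRR@10": 0.983}

# Absolute improvement per metric, rounded to the table's precision.
deltas = {name: round(securebert[name] - baseline[name], 3) for name in baseline}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.3f}")
```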
+---
+## Model Card Authors
+Cisco AI Team
+## Model Card Contact
+For inquiries, please contact the [Cisco AI Team](mailto:eaghaei@cisco.com).