---
license: mit
datasets:
- mahdin70/balanced_merged_bigvul_primevul
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- microsoft/codebert-base
pipeline_tag: text-classification
library_name: transformers
---

# CodeBERT-Primevul-BigVul Model Card

## Model Overview

`CodeBERT-Primevul-BigVul` is a multi-task model based on Microsoft's `codebert-base`, fine-tuned to detect vulnerabilities (`vul`) and classify Common Weakness Enumeration (CWE) types in code snippets. It was developed by [mahdin70](https://huggingface.co/mahdin70) and trained on a balanced dataset that combines the BigVul and PrimeVul datasets. The model performs binary classification for vulnerability detection and multi-class classification for CWE identification.

- **Model Repository**: [mahdin70/CodeBERT-Primevul-BigVul](https://huggingface.co/mahdin70/CodeBERT-Primevul-BigVul)
- **Base Model**: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
- **Tasks**: Vulnerability Detection (Binary), CWE Classification (Multi-class)
- **License**: MIT
- **Date**: Trained and uploaded as of April 22, 2025

## Model Architecture

The model extends `codebert-base` with two task-specific heads:
- **Vulnerability Head**: A linear layer mapping 768-dimensional hidden states to 2 classes (vulnerable or not).
- **CWE Head**: A linear layer mapping 768-dimensional hidden states to 135 classes (134 CWE types + 1 for "no CWE").

The architecture is implemented as a custom `MultiTaskCodeBERT` class in PyTorch, with the loss computed as the sum of cross-entropy losses for both tasks.
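
The two heads and the summed loss described above can be sketched as follows. This is a minimal illustration, not the repository's `MultiTaskCodeBERT` code: the class name `MultiTaskHeads` is hypothetical, and the 768-dimensional input is a random stand-in for whatever pooled hidden state the real model takes from the encoder (which token or pooling it uses is not stated here).

```python
import torch
from torch import nn


class MultiTaskHeads(nn.Module):
    """Hypothetical sketch: two linear heads over one 768-d representation."""

    def __init__(self, hidden_size=768, num_cwe_classes=135):
        super().__init__()
        self.vul_head = nn.Linear(hidden_size, 2)                # vulnerable or not
        self.cwe_head = nn.Linear(hidden_size, num_cwe_classes)  # 134 CWEs + "no CWE"
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, pooled, vul_labels=None, cwe_labels=None):
        vul_logits = self.vul_head(pooled)
        cwe_logits = self.cwe_head(pooled)
        out = {"vul_logits": vul_logits, "cwe_logits": cwe_logits}
        if vul_labels is not None and cwe_labels is not None:
            # Total loss is the plain sum of the two cross-entropy terms.
            out["loss"] = self.loss_fn(vul_logits, vul_labels) + self.loss_fn(cwe_logits, cwe_labels)
        return out


# Stand-in for encoder output: a batch of 4 pooled vectors (assumption).
pooled = torch.randn(4, 768)
vul_labels = torch.tensor([0, 1, 1, 0])
cwe_labels = torch.tensor([0, 12, 57, 0])  # class 0 means "no CWE"

heads = MultiTaskHeads()
out = heads(pooled, vul_labels, cwe_labels)
print(out["vul_logits"].shape, out["cwe_logits"].shape)
```

Because the two losses are summed without weights, both tasks contribute equally to the gradient regardless of how many classes each has.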

## Training Dataset

The model was trained on the `mahdin70/balanced_merged_bigvul_primevul` dataset, which combines:
- **BigVul**: A dataset of real-world vulnerabilities from open-source projects.
- **PrimeVul**: A vulnerability detection dataset of C/C++ functions, built with stricter labeling and de-duplication than earlier benchmarks.

### Dataset Details
- **Splits**:
  - Train: 124,780 samples
  - Validation: 26,740 samples
  - Test: 26,738 samples
  
- **Features**:
  - `func`: Code snippet (text)
  - `vul`: Binary label (0 = non-vulnerable, 1 = vulnerable)
  - `CWE ID`: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
    
- **Preprocessing**:
  - CWE labels were encoded using a `LabelEncoder` with 134 unique CWE classes identified across the dataset.
  - Non-vulnerable samples were assigned a CWE label of -1 (mapped to class 0 in the model).

The dataset is balanced to ensure a fair representation of vulnerable and non-vulnerable samples, with a maximum of 10 samples per commit where applicable.
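
The CWE label encoding described under Preprocessing can be sketched in plain Python. The helper names and the sample CWE IDs are hypothetical; `fit_cwe_classes` is a stand-in that orders classes the way scikit-learn's `LabelEncoder` does (sorted unique values), and the `+ 1` shift is inferred from the statement that -1 maps to class 0.

```python
def fit_cwe_classes(cwe_ids):
    """Sorted unique CWE IDs, mirroring how LabelEncoder orders its classes."""
    return sorted({c for c in cwe_ids if c is not None})


def encode_cwe(cwe_id, classes):
    """Return the model-side class: 0 for "no CWE", encoder index + 1 otherwise."""
    encoder_index = -1 if cwe_id is None else classes.index(cwe_id)
    return encoder_index + 1


# Hypothetical samples; None marks a non-vulnerable function.
samples = ["CWE-89", None, "CWE-119", "CWE-20", None, "CWE-119"]
classes = fit_cwe_classes(samples)
labels = [encode_cwe(c, classes) for c in samples]

print(classes)  # ['CWE-119', 'CWE-20', 'CWE-89']
print(labels)   # [3, 0, 1, 2, 0, 1]
```

Note that the ordering is lexicographic on the string, so `CWE-119` sorts before `CWE-20`; an encoder index therefore says nothing about the numeric CWE ID.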

## Training Details

### Training Arguments
The model was trained using the Hugging Face `Trainer` API with the following arguments:
- **Evaluation Strategy**: Per epoch
- **Save Strategy**: Per epoch
- **Learning Rate**: 2e-5
- **Batch Size**: 8 (per device, train and eval)
- **Epochs**: 3
- **Weight Decay**: 0.01
- **Logging**: Every 10 steps, logged to `./logs`
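
As a rough guide, the arguments listed above might map to `TrainingArguments` keyword arguments as shown below. The `output_dir` value is an assumption (it is not given in this card), and `eval_strategy` is the parameter name used by recent Transformers releases; the original training script may have been written differently.

```python
# Keyword arguments mirroring the list above.
training_kwargs = {
    "output_dir": "./results",            # assumption: not stated in this card
    "eval_strategy": "epoch",             # evaluate once per epoch
    "save_strategy": "epoch",             # checkpoint once per epoch
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "logging_dir": "./logs",
    "logging_steps": 10,
}


def build_training_arguments():
    """Construct TrainingArguments lazily so the dict can be inspected on its own."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

The resulting object would then be passed to `Trainer` together with the model, the datasets, and a metrics function.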

### Training Environment
- **Hardware**: 2x NVIDIA Tesla T4 GPU
- **Framework**: PyTorch 2.5.1+cu121, Transformers 4.47.0
- **Duration**: 6 hours, 23 minutes, 18 seconds (23,397 steps)
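
The reported step count is consistent with the batch size, GPU count, and training split size, assuming plain data parallelism across both GPUs and no gradient accumulation (neither is stated explicitly in this card):

```python
import math

train_samples = 124_780
per_device_batch = 8
num_gpus = 2
epochs = 3

# Assumption: each optimizer step consumes one batch per GPU.
effective_batch = per_device_batch * num_gpus
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 7799 23397

# Average wall-clock time per step from the reported duration.
duration_seconds = 6 * 3600 + 23 * 60 + 18
seconds_per_step = duration_seconds / total_steps
print(f"{seconds_per_step:.2f} s/step")
```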

### Training Metrics
Training loss and validation metrics after each epoch:

| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1   | CWE Accuracy |
|-------|---------------|-----------------|--------------|---------------|------------|----------|--------------|
| 1     | 0.4275        | 0.5737          | 0.9519       | 0.7753        | 0.4795     | 0.5925   | 0.0656       |
| 2     | 0.7608        | 0.5450          | 0.9537       | 0.7766        | 0.5133     | 0.6181   | 0.1349       |
| 3     | 0.5624        | 0.5443          | 0.9545       | 0.7669        | 0.5400     | 0.6338   | 0.1749       |
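
As a quick consistency check, the F1 column can be recomputed from the precision and recall columns using the harmonic mean. The values below are copied from the table; small differences are expected because the table entries are rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# epoch -> (precision, recall, reported F1), taken from the table above
reported = {
    1: (0.7753, 0.4795, 0.5925),
    2: (0.7766, 0.5133, 0.6181),
    3: (0.7669, 0.5400, 0.6338),
}

for epoch, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    agrees = abs(recomputed - reported_f1) < 5e-4
    print(f"epoch {epoch}: recomputed={recomputed:.4f} reported={reported_f1:.4f} agrees={agrees}")
```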


## Usage

### Installation
Install the required libraries:
```bash
pip install transformers torch datasets huggingface_hub
```

### Sample Code Snippet
Below is an example of how to use the model for inference on a code snippet:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("mahdin70/CodeBERT-Primevul-BigVul", trust_remote_code=True)
model.eval()

# Example code snippet
code = """
bool DebuggerFunction::InitTabContents() {
Value* debuggee;
EXTENSION_FUNCTION_VALIDATE(args_->Get(0, &debuggee));

DictionaryValue* dict = static_cast<DictionaryValue*>(debuggee);
EXTENSION_FUNCTION_VALIDATE(dict->GetInteger(keys::kTabIdKey, &tab_id_));

contents_ = NULL;
TabContentsWrapper* wrapper = NULL;
bool result = ExtensionTabUtil::GetTabById(
tab_id_, profile(), include_incognito(), NULL, NULL, &wrapper, NULL);
if (!result || !wrapper) {
error_ = ExtensionErrorUtils::FormatErrorMessage(
keys::kNoTabError,
base::IntToString(tab_id_));
return false;
}
contents_ = wrapper->web_contents();

if (ChromeWebUIControllerFactory::GetInstance()->HasWebUIScheme(
contents_->GetURL())) {
error_ = ExtensionErrorUtils::FormatErrorMessage(
keys::kAttachToWebUIError,
contents_->GetURL().scheme());
return false;
}

return true;
}
"""

# Tokenize input
inputs = tokenizer(code, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    vul_logits = outputs["vul_logits"]
    cwe_logits = outputs["cwe_logits"]

    # Vulnerability prediction
    vul_pred = torch.argmax(vul_logits, dim=1).item()
    print(f"Vulnerability: {'Vulnerable' if vul_pred == 1 else 'Not Vulnerable'}")

    # CWE prediction (if vulnerable)
    if vul_pred == 1:
        cwe_pred = torch.argmax(cwe_logits, dim=1).item() - 1  # Head class 0 means "no CWE", so subtract 1 to get the encoder index
        print(f"Predicted CWE: {cwe_pred if cwe_pred >= 0 else 'None'}")
```

### Output Example:
```text
Vulnerability: Vulnerable
Predicted CWE: 120  # Encoder index, not a CWE ID; decode it with the training LabelEncoder
```

## Notes
- The CWE prediction is an integer encoder index (0 to 133), not a CWE number. To map it to a specific CWE ID (e.g., CWE-120), you need the `LabelEncoder` fitted during dataset preprocessing.
- Pass `trust_remote_code=True` when loading the model, because it uses custom code from the repository.
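
Decoding a prediction back to a CWE ID could look like the sketch below. The `classes` list is a three-entry hypothetical stand-in for the 134 classes of the fitted encoder, and `decode_cwe` is an illustrative helper, not part of the repository. With a real scikit-learn `LabelEncoder`, `encoder.inverse_transform([index])` performs the same lookup.

```python
def decode_cwe(head_index, classes):
    """Map the CWE head's argmax to a CWE ID, or None for the "no CWE" class."""
    encoder_index = head_index - 1  # head class 0 is "no CWE"
    if encoder_index < 0:
        return None
    return classes[encoder_index]


# Hypothetical encoder classes in sorted order; the real list has 134 entries.
classes = ["CWE-119", "CWE-20", "CWE-89"]

print(decode_cwe(3, classes))  # CWE-89
print(decode_cwe(0, classes))  # None
```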

## Limitations
- **CWE Accuracy**: The model has low CWE classification accuracy (17.49%), likely due to class imbalance or complexity in distinguishing similar CWE types.
- **Recall**: Moderate recall (54.00%) for vulnerability detection suggests some vulnerable samples may be missed.
- **Generalization**: Because the model was trained only on BigVul and PrimeVul, performance may vary on out-of-domain codebases.

## Future Improvements
- Increase training epochs or dataset size to improve CWE accuracy.
- Experiment with class weighting to address CWE imbalance.
- Fine-tune on additional datasets for broader generalization.
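
One possible starting point for the class-weighting experiment is inverse-frequency weighting, sketched below. This is an illustration of a common heuristic (the same formula scikit-learn uses for "balanced" weights), not something the current model was trained with; the label list is made up.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    # Classes absent from the data get weight 0.0 so they do not affect the loss.
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Made-up, imbalanced labels: class 0 is common, class 2 is rare.
labels = [0, 0, 0, 0, 1, 1, 2]
weights = inverse_frequency_weights(labels, num_classes=3)
print([round(w, 4) for w in weights])
```

The resulting list could be wrapped in a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss` for the CWE head, so that rare CWE classes contribute more to the loss.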