---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
datasets:
- custom
- CyberNative/Code_Vulnerability_Security_DPO
- doss1232/vulnerable-code
metrics:
- ROUGE-L F1
- BLEU
tags:
- llama-3.2-1B-Instruct
---
# Model Card for `merged-vuln-detector`
## Model Details
- **Base Model:** `llama-3.2-1B-Instruct`
- **Fine-tuned Model:** `merged-vuln-detector`
- **Model Type:** Causal Language Model fine-tuned for vulnerability detection in code.
## Model Description
This model is a fine-tuned version of `llama-3.2-1B-Instruct`, trained on code snippets paired with vulnerability analyses. It is intended to act as a security assistant that analyzes code and identifies potential vulnerabilities.
## Training Data
The model was fine-tuned on the [CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO) dataset.
Each example is formatted as follows, prompting the model to analyze the security of a given code snippet:
```
Analyze the security vulnerabilities in the following code.
[CODE SNIPPET]
Analysis:
[VULNERABILITY DESCRIPTION]
```
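A minimal sketch of constructing one such training string in Python is shown below; the code snippet and analysis are placeholders, not actual dataset contents.
```python
# Illustrative construction of a single training example in the format above.
# The snippet and analysis strings are placeholders, not dataset contents.
snippet = "strcpy(dest, src);  /* no bounds check */"
analysis = "Potential buffer overflow: strcpy does not check the destination size."

example = (
    "Analyze the security vulnerabilities in the following code.\n\n"
    f"{snippet}\n\n"
    "Analysis:\n"
    f"{analysis}"
)
print(example)
```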
## Training Procedure
The model was fine-tuned using QLoRA on a single GPU, with the `trl` library's `SFTTrainer`. The hyperparameters are listed below, followed by a configuration sketch.
### Hyperparameters
- **Quantization:** 4-bit (`nf4`)
- **LoRA `r`:** 16
- **LoRA `alpha`:** 32
- **LoRA `dropout`:** 0.1
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`
- **Batch Size:** 1 per device, with gradient accumulation over 8 steps (effective batch size 8)
- **Optimizer:** `paged_adamw_8bit`
- **Precision:** `fp16`
- **Max Steps:** 240
- **Learning Rate:** `2e-4`
- **Max Sequence Length:** 1024
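The following is a minimal sketch of how these hyperparameters map onto a QLoRA + `SFTTrainer` setup. It assumes the base model id `meta-llama/Llama-3.2-1B-Instruct` and illustrative dataset field names; argument names vary between `trl` versions, so treat this as a sketch rather than the actual training script.
```python
# Sketch of the QLoRA setup described above, assuming trl's ~0.7/0.8-era
# SFTTrainer signature (newer versions move max_seq_length into SFTConfig).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

base_model = "meta-llama/Llama-3.2-1B-Instruct"

# 4-bit nf4 quantization with fp16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapters on the attention projections.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Field names "question"/"chosen" are assumptions about the dataset schema.
dataset = load_dataset("CyberNative/Code_Vulnerability_Security_DPO", split="train")
dataset = dataset.map(lambda ex: {
    "text": "Analyze the security vulnerabilities in the following code.\n\n"
            f"{ex['question']}\n\nAnalysis:\n{ex['chosen']}"
})

training_args = TrainingArguments(
    output_dir="vuln-detector-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size 8
    optim="paged_adamw_8bit",
    fp16=True,
    max_steps=240,
    learning_rate=2e-4,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
```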
## Evaluation Results
The fine-tuned model and the base model were both evaluated on the `doss1232/vulnerable-code` dataset:
| Model | ROUGE-L F1 | BLEU |
|-------------------------|------------|--------|
| `llama-3.2-1B-Instruct` | 0.0933 | 0.0061 |
| `merged-vuln-detector` | 0.1335 | 0.0219 |
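The exact evaluation script is not published in this card; a plausible sketch of the metric computation with the `evaluate` library, using placeholder strings, would look like:
```python
# Hedged sketch of the metric computation; the card's actual evaluation
# script is not included. Prediction/reference strings are placeholders.
import evaluate

rouge = evaluate.load("rouge")  # reports rougeL (F-measure) among its scores
bleu = evaluate.load("bleu")

predictions = ["The code has a buffer overflow vulnerability ..."]
references = ["copyString writes past the end of the destination buffer ..."]

print(rouge.compute(predictions=predictions, references=references)["rougeL"])
print(bleu.compute(predictions=predictions, references=references)["bleu"])
```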
## How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "merged-vuln-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use a raw string so the '\0' in the C++ snippet stays a literal
# backslash-zero instead of becoming a Python null-character escape.
code = r"""
#include <cstring>

void copyString(char* dest, const char* src) {
    while (*src != '\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
"""

# Same prompt format as used during fine-tuning.
prompt = f"Analyze the security vulnerabilities in the following code.\n\n{code}\n\nAnalysis:\n"

inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds the generated analysis independently of prompt length.
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
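On a GPU, it is common to pass `device_map="auto"` (requires the `accelerate` package) and `torch_dtype=torch.float16` to `from_pretrained` to reduce memory use; in that case, move `inputs` to the model's device before calling `generate`.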
### Example Output
**Input Code:**
```cpp
#include <cstring>

void copyString(char* dest, const char* src) {
    while (*src != '\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
```
**Model Output:**
> The code has a buffer overflow vulnerability due to the lack of bounds checking on the destination buffer size.
## Model Card Authors
Seokhee Chang
## Model Card Contact
cycloevan97@gmail.com