---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
datasets:
- custom
- CyberNative/Code_Vulnerability_Security_DPO
- doss1232/vulnerable-code
metrics:
- ROUGE-L F1
- BLEU
tags:
- llama-3.2-1B-Instruct
---

# Model Card for `merged-vuln-detector`

## Model Details

- **Base Model:** `llama-3.2-1B-Instruct`
- **Fine-tuned Model:** `merged-vuln-detector`
- **Model Type:** Causal language model fine-tuned for vulnerability detection in code

## Model Description

This model is a fine-tuned version of `llama-3.2-1B-Instruct`, trained on code snippets paired with their corresponding vulnerability analyses. It is intended to act as a security expert that can analyze code and identify potential vulnerabilities.

## Training Data

The model was fine-tuned on the `CyberNative/Code_Vulnerability_Security_DPO` dataset, available on Hugging Face at https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO.

The data is formatted as follows, where the model is prompted to analyze the security of a given code snippet:

```
Analyze the security vulnerabilities in the following code.

[CODE SNIPPET]

Analysis:
[VULNERABILITY DESCRIPTION]
```

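Inference prompts should match this training format exactly. A minimal helper that reproduces it (the function name is illustrative, not part of the released code):

```python
def build_prompt(code: str) -> str:
    # Mirrors the training format above; the trailing "Analysis:" cues the model
    # to begin its vulnerability description.
    return (
        "Analyze the security vulnerabilities in the following code.\n\n"
        f"{code}\n\n"
        "Analysis:\n"
    )
```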
## Training Procedure

The model was fine-tuned using QLoRA on a single GPU. The training script uses the `trl` library's `SFTTrainer`.

### Hyperparameters

- **Quantization:** 4-bit (`nf4`)
- **LoRA `r`:** 16
- **LoRA `alpha`:** 32
- **LoRA `dropout`:** 0.1
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`
- **Batch Size:** 1 (with gradient accumulation steps of 8)
- **Optimizer:** `paged_adamw_8bit`
- **Precision:** `fp16`
- **Max Steps:** 240
- **Learning Rate:** `2e-4`
- **Max Sequence Length:** 1024

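These hyperparameters correspond roughly to the following `transformers`/`peft` configuration. This is a sketch, not the exact training script: `SFTTrainer` argument names vary across `trl` versions, and `output_dir` is illustrative.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapters on the attention projection matrices only
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    optim="paged_adamw_8bit",
    fp16=True,
    max_steps=240,
    learning_rate=2e-4,
    output_dir="outputs",  # illustrative
)
```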
## Evaluation Results

The model was evaluated on the `doss1232/vulnerable-code` dataset against the base model. The results are as follows:

| Model                   | ROUGE-L F1 | BLEU   |
|-------------------------|------------|--------|
| `llama-3.2-1B-Instruct` | 0.0933     | 0.0061 |
| `merged-vuln-detector`  | 0.1335     | 0.0219 |

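ROUGE-L F1 scores the longest common subsequence (LCS) of tokens between the reference analysis and the model output. A minimal pure-Python version for intuition (whitespace tokenization; the reported numbers were presumably computed with a proper evaluation library such as `rouge-score`):

```python
def lcs_length(a: list, b: list) -> int:
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```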
## How to use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "merged-vuln-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Raw string so the C escape sequence '\0' reaches the prompt verbatim
# instead of being interpreted as a Python null byte.
code = r"""
#include <cstring>

void copyString(char* dest, const char* src) {
    while (*src != '\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
"""

prompt = f"Analyze the security vulnerabilities in the following code.\n\n{code}\n\nAnalysis:\n"

inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds the generated analysis regardless of prompt length
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Example Output

**Input Code:**

```c
#include <cstring>

void copyString(char* dest, const char* src) {
    while (*src != '\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
```

**Model Output:**

> The code has a buffer overflow vulnerability due to the lack of bounds checking on the destination buffer size.

## Model Card Authors

Seokhee Chang

## Model Card Contact

cycloevan97@gmail.com