---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
datasets:
- custom
- CyberNative/Code_Vulnerability_Security_DPO
- doss1232/vulnerable-code
metrics:
- ROUGE-L F1
- BLEU
tags:
- llama-3.2-1B-Instruct
---

# Model Card for `merged-vuln-detector`

## Model Details

- **Base Model:** `llama-3.2-1B-Instruct`
- **Fine-tuned Model:** `merged-vuln-detector`
- **Model Type:** Causal language model fine-tuned for vulnerability detection in code.

## Model Description

This model is a fine-tuned version of `llama-3.2-1B-Instruct`, trained on a dataset of code snippets paired with their corresponding vulnerability analyses. It is intended to act as a security expert that analyzes code and identifies potential vulnerabilities.

## Training Data

The model was fine-tuned on the `CyberNative/Code_Vulnerability_Security_DPO` dataset, available on Hugging Face at https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO. Each training example prompts the model to analyze the security of a given code snippet, using the following template (a formatting sketch appears in the appendix below):

```
Analyze the security vulnerabilities in the following code.

[CODE SNIPPET]

Analysis:
[VULNERABILITY DESCRIPTION]
```

## Training Procedure

The model was fine-tuned with QLoRA on a single GPU. The training script uses the `trl` library's `SFTTrainer` (see the training sketch in the appendix below).

### Hyperparameters

- **Quantization:** 4-bit (`nf4`)
- **LoRA `r`:** 16
- **LoRA `alpha`:** 32
- **LoRA `dropout`:** 0.1
- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`
- **Batch Size:** 1 per device, with 8 gradient accumulation steps (effective batch size 8)
- **Optimizer:** `paged_adamw_8bit`
- **Precision:** `fp16`
- **Max Steps:** 240
- **Learning Rate:** `2e-4`
- **Max Sequence Length:** 1024

## Evaluation Results

The model was evaluated against the base model on the `doss1232/vulnerable-code` dataset (an evaluation sketch appears in the appendix below):

| Model                   | ROUGE-L F1 | BLEU   |
|-------------------------|------------|--------|
| `llama-3.2-1B-Instruct` | 0.0933     | 0.0061 |
| `merged-vuln-detector`  | 0.1335     | 0.0219 |

## How to use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "merged-vuln-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Deliberately vulnerable C snippet: copyString performs no bounds checking
# and the destination buffer is smaller than the source string.
# Note the escaped backslash in '\\0' so the snippet passed to the model
# contains the two characters \0 rather than a literal NUL byte.
code = """
#include <stdio.h>

void copyString(char* dest, const char* src) {
    while (*src != '\\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
"""

prompt = f"Analyze the security vulnerabilities in the following code.\n\n{code}\n\nAnalysis:\n"
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds only the generated analysis, independent of prompt length
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Example Output

**Input Code:**

```c
#include <stdio.h>

void copyString(char* dest, const char* src) {
    while (*src != '\0') {
        *dest = *src;
        dest++;
        src++;
    }
}

int main() {
    char source[10] = "Hello!";
    char destination[5];
    copyString(destination, source);
    return 0;
}
```

**Model Output:**

> The code has a buffer overflow vulnerability due to the lack of bounds checking on the destination buffer size.

## Model Card Authors

Seokhee Chang

## Model Card Contact

cycloevan97@gmail.com
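
## Appendix: Reference Sketches

### Prompt Formatting

A minimal sketch of how each training example can be rendered into the template shown in the Training Data section. The field names `code` and `analysis` are hypothetical placeholders, not the dataset's actual column names; map them to the columns of the split you load.

```python
def format_example(example: dict) -> str:
    """Render one example into the training prompt template above.

    The keys "code" and "analysis" are hypothetical placeholders;
    substitute the real column names of the dataset you load.
    """
    return (
        "Analyze the security vulnerabilities in the following code.\n\n"
        f"{example['code']}\n\n"
        "Analysis:\n"
        f"{example['analysis']}"
    )
```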
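### Training

A hedged sketch of the QLoRA + `SFTTrainer` setup implied by the hyperparameters above, not the exact training script. It assumes the base model's Hub id is `meta-llama/Llama-3.2-1B-Instruct`, uses a hypothetical output directory, and targets the older `trl` API in which `dataset_text_field` and `max_seq_length` are passed to `SFTTrainer` directly; newer `trl` releases move these into `SFTConfig`.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

base_model = "meta-llama/Llama-3.2-1B-Instruct"  # assumed Hub id for the base model

# 4-bit nf4 quantization, as listed in the hyperparameters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters on the attention projections
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("CyberNative/Code_Vulnerability_Security_DPO", split="train")
# Build a "text" column using format_example from the sketch above
dataset = dataset.map(lambda ex: {"text": format_example(ex)})

training_args = TrainingArguments(
    output_dir="vuln-detector-qlora",  # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=240,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # column holding the formatted prompt
    max_seq_length=1024,
    tokenizer=tokenizer,
)
trainer.train()
```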
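### Evaluation

A minimal sketch of how the ROUGE-L F1 and BLEU scores in the table above can be computed with the Hugging Face `evaluate` library. The two example strings are illustrative stand-ins for the model's generated analyses and the gold analyses from `doss1232/vulnerable-code`.

```python
import evaluate

# Stand-in data: in practice, generate one analysis per eval example
# and pair it with the reference analysis from the dataset.
predictions = ["The code has a buffer overflow vulnerability."]
references = ["Buffer overflow caused by missing bounds checking."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# "rougeL" is the ROUGE-L F-measure reported in the table above
rouge_result = rouge.compute(predictions=predictions, references=references)
# BLEU accepts one or more references per prediction
bleu_result = bleu.compute(
    predictions=predictions,
    references=[[ref] for ref in references],
)

print(f"ROUGE-L F1: {rouge_result['rougeL']:.4f}")
print(f"BLEU:       {bleu_result['bleu']:.4f}")
```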