---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---
# CodeLLaMA-Linux-BugFix
A fine-tuned version of `CodeLLaMA-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches based on buggy C code and commit messages.
---
## Overview
This project targets automated Linux kernel bug fixing by:
- Mining real commit data from kernel Git history
- Training a QLoRA model to generate Git-style fixes
- Evaluating performance using BLEU and ROUGE
- Supporting integration into code review pipelines
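The commit-mining step above can be sketched as a simple keyword filter over commit messages. This is an illustrative sketch only; the actual logic lives in the `dataset_builder/` scripts, and the keyword list comes from the Dataset section below.

```python
# Illustrative sketch of the bug-fix commit filter (the real scripts are in
# dataset_builder/). Keywords match the filter described in the Dataset section.
BUGFIX_KEYWORDS = ("fix", "null", "race", "panic")

def is_bugfix_commit(message: str) -> bool:
    """Return True if a commit message looks like a bug fix."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in BUGFIX_KEYWORDS)

# In the real pipeline the messages come from `git log` over the kernel tree,
# e.g.: git log --no-merges --pretty=format:"%H%x09%s"
messages = [
    "net: fix null pointer dereference in tcp_input",
    "Merge branch 'for-linus'",
    "mm: close a race in page reclaim",
]
print([m for m in messages if is_bugfix_commit(m)])
```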
---
## Performance Results
**BLEU Score**: 33.87
**ROUGE Scores**:
- ROUGE-1: P=0.3775, R=0.7306, F1=0.4355
- ROUGE-2: P=0.2898, R=0.6096, F1=0.3457
- ROUGE-L: P=0.3023, R=0.6333, F1=0.3612
The high recall (0.61–0.73) indicates the model recovers most of the content of the ground-truth patches; the lower precision reflects extra generated tokens beyond the reference diff.
---
## Model Configuration
- **Base model**: `CodeLLaMA-7B-Instruct`
- **Fine-tuning**: QLoRA (LoRA r=64, α=16, dropout=0.1)
- **Quantization**: 4-bit NF4
- **Training**: 3 epochs, batch size 64, LR 2e-4
- **Precision**: bfloat16 with gradient checkpointing
- **Hardware**: 1× NVIDIA H200 (144 GB VRAM)
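The hyperparameters above translate roughly to the following `peft`/`transformers` setup. This is a sketch, not the exact training config; in particular, the `target_modules` list is an assumption (the attention projections commonly targeted for LLaMA-family models).

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA r=64, alpha=16, dropout=0.1; target_modules is an assumed choice
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```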
---
## Dataset
- 100,000 samples from Linux kernel Git commits
- Format: JSONL with `"prompt"` and `"completion"` fields
- Content: C code segments + commit messages β Git diffs
- Source: Bug-fix commits filtered by keywords like `fix`, `null`, `race`, `panic`
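A record in this JSONL format looks roughly like the following. The code snippet and diff here are invented for illustration; only the `"prompt"`/`"completion"` field names come from the dataset description above.

```python
import json

# Hypothetical record in the dataset's JSONL format; the C code and the
# diff body are invented for illustration.
record = {
    "prompt": (
        "Given the following original C code:\n"
        "if (!file->filter)\n\treturn;\n"
        "Instruction: Fix the null pointer dereference\n"
        "Return the diff that fixes it:"
    ),
    "completion": (
        "--- a/kernel/trace/trace_events.c\n"
        "+++ b/kernel/trace/trace_events.c\n"
        "@@ -1,2 +1,2 @@\n"
        "-if (!file->filter)\n"
        "+if (!file || !file->filter)\n"
        " \treturn;"
    ),
}

line = json.dumps(record)      # one record per line in the .jsonl file
parsed = json.loads(line)
print(sorted(parsed.keys()))   # ['completion', 'prompt']
```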
---
## Usage
````python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
model = PeftModel.from_pretrained(model, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

prompt = '''
Given the following original C code:
```c
if (!file->filter)
    return;
```
Instruction: Fix the null pointer dereference
Return the diff that fixes it:
'''

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512, do_sample=True, temperature=0.1)
fix = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix)
````
---
## Structure
```
CodeLLaMA-Linux-BugFix/
├── dataset/            # Raw and processed JSONL files
├── dataset_builder/    # Scripts for mining & formatting commits
├── train/              # Training scripts & checkpoints
├── evaluate/           # Evaluation scripts & results
└── requirements.txt    # Dependencies
```
---
## Metrics
| Metric | Score |
|----------|--------|
| BLEU | 33.87 |
| ROUGE-1 | 0.4355 |
| ROUGE-2 | 0.3457 |
| ROUGE-L | 0.3612 |
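For intuition on what these overlap metrics measure, a simplified whitespace-tokenized ROUGE-1 F1 can be sketched in a few lines. This is illustrative only; the reported scores come from the scripts in `evaluate/`, not from this sketch.

```python
# Simplified ROUGE-1 F1 over whitespace tokens (illustrative only).
def rouge1_f1(candidate: str, reference: str) -> float:
    cand = candidate.split()
    ref = reference.split()
    overlap = 0
    ref_pool = list(ref)
    for token in cand:
        if token in ref_pool:
            ref_pool.remove(token)  # count each reference token at most once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("- if (!file->filter)", "- if (!file->filter) return"), 4))
```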
---
## Use Cases
- Kernel patch suggestion tools
- Code review assistants
- Bug localization + repair research
- APR benchmarks for kernel code
---
## License
MIT License
---
## References
- [CodeLLaMA](https://arxiv.org/abs/2308.12950)
- [QLoRA](https://arxiv.org/abs/2305.14314)
- [LoRA](https://arxiv.org/abs/2106.09685)