Maaac's picture
Update README.md
7b3863a verified
|
raw
history blame
3.31 kB
metadata
license: mit
tags:
  - codellama
  - linux
  - bugfix
  - lora
  - qlora
  - git-diff
base_model: codellama/CodeLLaMA-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation

CodeLLaMA-Linux-BugFix

A fine-tuned version of CodeLLaMA-7B-Instruct, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches based on buggy C code and commit messages.


🎯 Overview

This project targets automated Linux kernel bug fixing by:

  • Mining real commit data from kernel Git history
  • Training a QLoRA model to generate Git-style fixes
  • Evaluating performance using BLEU and ROUGE
  • Supporting integration into code review pipelines

πŸ“Š Performance Results

BLEU Score: 33.87

ROUGE Scores:

  • ROUGE-1: P=0.3775, R=0.7306, F1=0.4355
  • ROUGE-2: P=0.2898, R=0.6096, F1=0.3457
  • ROUGE-L: P=0.3023, R=0.6333, F1=0.3612

These results show that the model generates high-quality diffs with good semantic similarity to ground-truth patches.


🧠 Model Configuration

  • Base model: CodeLLaMA-7B-Instruct
  • Fine-tuning: QLoRA (LoRA r=64, Ξ±=16, dropout=0.1)
  • Quantization: 4-bit NF4
  • Training: 3 epochs, batch size 64, LR 2e-4
  • Precision: bfloat16 with gradient checkpointing
  • Hardware: 1Γ— NVIDIA H200 (144 GB VRAM)

πŸ—ƒοΈ Dataset

  • 100,000 samples from Linux kernel Git commits
  • Format: JSONL with "prompt" and "completion" fields
  • Content: C code segments + commit messages β†’ Git diffs
  • Source: Bug-fix commits filtered by keywords like fix, null, race, panic

πŸš€ Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLLaMA-7b-Instruct-hf")
model = PeftModel.from_pretrained(model, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLLaMA-7b-Instruct-hf")

prompt = '''
Given the following original C code:
```c
if (!file->filter)
    return;

Instruction: Fix the null pointer dereference

Return the diff that fixes it: '''

inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate(**inputs, max_length=512, temperature=0.1) fix = tokenizer.decode(outputs[0], skip_special_tokens=True) print(fix)


---

## πŸ“ Structure

CodeLLaMA-Linux-BugFix/ β”œβ”€β”€ dataset/ # Raw and processed JSONL files β”œβ”€β”€ dataset_builder/ # Scripts for mining & formatting commits β”œβ”€β”€ train/ # Training scripts & checkpoints β”œβ”€β”€ evaluate/ # Evaluation scripts & results └── requirements.txt # Dependencies


---

## πŸ“ˆ Metrics

| Metric   | Score  |
|----------|--------|
| BLEU     | 33.87  |
| ROUGE-1  | 0.4355 |
| ROUGE-2  | 0.3457 |
| ROUGE-L  | 0.3612 |

---

## πŸ”¬ Use Cases

- Kernel patch suggestion tools
- Code review assistants
- Bug localization + repair research
- APR benchmarks for kernel code

---

## πŸ“„ License

MIT License

---

## πŸ“š References

- [CodeLLaMA](https://arxiv.org/abs/2308.12950)
- [QLoRA](https://arxiv.org/abs/2305.14314)
- [LoRA](https://arxiv.org/abs/2106.09685)