---

license: mit
tags:
  - codellama
  - linux
  - bugfix
  - lora
  - qlora
  - git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---


# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLLaMA-7B-Instruct` for Linux kernel bug fixing, trained with QLoRA (Quantized Low-Rank Adaptation). Given buggy C code and a commit message, the model generates a Git diff patch.

---

## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection

---

## 🧠 Model Configuration

- **Base model**: `CodeLLaMA-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, LR: 2e-4, Epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: Optimized for NVIDIA H200 GPUs
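
To make the numbers above concrete, here is a minimal sketch of what they imply, using only the standard library. The `QLoRAHyperparams` class and its helper methods are illustrative names, not part of the training script, and the 4096-wide layer is an assumed example size. It relies on the standard LoRA convention of scaling the low-rank update by `alpha / r`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QLoRAHyperparams:
    """Hyperparameters as listed in this model card (hypothetical container)."""

    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    batch_size: int = 64
    learning_rate: float = 2e-4
    epochs: int = 3

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
        return self.lora_alpha / self.lora_r

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # One adapted linear layer adds A (r x d_in) and B (d_out x r).
        return self.lora_r * (d_in + d_out)


hp = QLoRAHyperparams()
print(hp.lora_scaling)                # 0.25
print(hp.adapter_params(4096, 4096))  # 524288 (assumed 4096 x 4096 projection)
```

With r=64 and alpha=16 the adapter update is scaled by 0.25, so the adapters contribute a relatively damped correction on top of the frozen base weights.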

---

## 📊 Dataset

Custom dataset extracted from Linux kernel Git history.

### Filtering Criteria
Bug-fix commits containing:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
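
A keyword filter of this kind could be sketched as follows. This is an assumption about how such a filter might look, not the code in `extract_linux_bugfixes.py`; the function name `looks_like_bugfix` is hypothetical, and the keyword tuple covers only the terms listed above (the original list ends with "etc.").

```python
import re

# Keywords named in this README; the real list may be longer.
BUGFIX_KEYWORDS = (
    "fix", "bug", "crash", "memory", "null",
    "panic", "overflow", "race", "corruption",
)

# Anchor each keyword at a word start so "fixes" and "NULL" match,
# while "ftrace" does not trigger the "race" keyword.
_PATTERN = re.compile(r"\b(?:" + "|".join(BUGFIX_KEYWORDS) + r")", re.IGNORECASE)


def looks_like_bugfix(commit_message: str) -> bool:
    """Return True if the commit message contains any bug-fix keyword."""
    return _PATTERN.search(commit_message) is not None


print(looks_like_bugfix("net: fix NULL pointer dereference in foo()"))  # True
print(looks_like_bugfix("docs: add ftrace overview"))                   # False
```

Keyword matching is a coarse heuristic: it will admit some non-fix commits (for example, ones that merely mention memory) and miss fixes that use none of these words.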

### Structure
- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

* **File**: `training_data_100k.jsonl` (100,000 samples)
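
As a rough illustration of how one record in this format could be turned into a prompt/completion pair, consider the sketch below. The prompt template and the function name `to_prompt_completion` are assumptions made for this example; the template actually used by `format_for_training.py` may differ. The sample record is invented for illustration.

```python
import json


def to_prompt_completion(record: dict) -> dict:
    """Flatten one dataset record into a prompt/completion pair.

    The section headers in the prompt are a hypothetical template.
    """
    source = record["input"]["original code"]
    instruction = record["input"]["instruction"]
    diff = record["output"]["diff codes"]
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Buggy code:\n{source}\n\n"
        "### Fix (git diff):\n"
    )
    return {"prompt": prompt, "completion": diff}


# One JSONL line in the structure shown above (made-up content).
line = json.dumps({
    "input": {
        "original code": "kfree(ptr);\nptr->len = 0;",
        "instruction": "Fix use-after-free of ptr",
    },
    "output": {
        "diff codes": "--- a/foo.c\n+++ b/foo.c\n-kfree(ptr);\n ptr->len = 0;\n+kfree(ptr);",
    },
})

example = to_prompt_completion(json.loads(line))
print(example["prompt"])
print(example["completion"])
```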

---

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
# Run from the repository root
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
# Run from the repository root
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
# Run from the repository root
cd evaluate
python evaluate_linux_bugfix_model.py
```

---

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```

---

## 🧩 Features

* 🔧 **Efficient fine-tuning**: QLoRA trains small LoRA adapters on top of a 4-bit quantized base model, which sharply reduces GPU memory use
* 🧠 **Real-world commits**: From actual Linux kernel development
* 💡 **Context-aware**: Code context extraction around bug lines
* 💻 **Output-ready**: Emits fixes as Git-style diffs
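
The context extraction mentioned above (10 lines before and after the change, per the Dataset section) can be sketched roughly like this. The function `context_window` and its signature are hypothetical and are not taken from the dataset builder.

```python
def context_window(
    lines: list[str],
    first_changed: int,
    last_changed: int,
    context: int = 10,
) -> list[str]:
    """Return the changed lines plus `context` lines on either side.

    Indices are 0-based and inclusive; the window is clamped to the file.
    """
    start = max(0, first_changed - context)
    end = min(len(lines), last_changed + context + 1)
    return lines[start:end]


source = [f"line {i}" for i in range(100)]

snippet = context_window(source, 40, 42)
print(len(snippet))             # 23 = 10 before + 3 changed + 10 after
print(snippet[0], snippet[-1])  # line 30 line 52

# Near the top of the file the window is simply shorter.
print(len(context_window(source, 2, 2)))  # 13
```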

---

## 📈 Evaluation Metrics

* **BLEU**: n-gram precision of generated diffs against reference diffs
* **ROUGE**: Recall-oriented n-gram and subsequence overlap with the reference fix
* **Human Evaluation**: Manual inspection of patch quality
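
For intuition about what an overlap metric measures on diffs, below is a small standard-library sketch of ROUGE-L (F1 form), based on the longest common subsequence of whitespace-separated tokens. It is an illustration only; the evaluation script presumably relies on an established metrics library, and the tokenization here is a simplifying assumption.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (simplified tokenization)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "+ if (!dev) return -ENODEV;"
candidate = "+ if (!dev) return -EINVAL;"
print(round(rouge_l_f1(candidate, reference), 2))  # 0.8
print(rouge_l_f1(reference, reference))            # 1.0
```

The example also shows why human inspection stays in the loop: a patch that returns the wrong error code still scores 0.8.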

---

## 🧪 Use Cases

* Automated kernel bug fixing
* Code review assistance
* Teaching/debugging kernel code
* Research in automated program repair (APR)

---

## 🔬 Technical Highlights

### Memory & Speed Optimizations

* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation
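
A back-of-the-envelope sketch of why these techniques matter is given below. It assumes a nominal 7 billion parameters, ignores quantization constants, activations, and optimizer state, and uses a hypothetical per-device batch size of 8; none of these figures are measurements from this project.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


def accumulation_steps(
    effective_batch: int, per_device_batch: int, num_devices: int = 1
) -> int:
    """Gradient accumulation steps needed to reach an effective batch size."""
    per_step = per_device_batch * num_devices
    if effective_batch % per_step:
        raise ValueError("effective batch must be a multiple of the per-step batch")
    return effective_batch // per_step


NOMINAL_PARAMS = 7e9  # assumed round figure for a "7B" model

print(round(weight_memory_gib(NOMINAL_PARAMS, 16), 1))  # 13.0 (bfloat16 weights)
print(round(weight_memory_gib(NOMINAL_PARAMS, 4), 1))   # 3.3  (4-bit weights)

# Reaching a batch of 64 with a hypothetical per-device batch of 8 on one GPU.
print(accumulation_steps(64, 8))  # 8
```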

---

## 🤝 Contributing

1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🙌

---

## 📄 License

MIT License – see `LICENSE` file for details.

---

## 🙏 Acknowledgments

* Meta for CodeLLaMA
* Hugging Face for Transformers + PEFT
* The Linux kernel community for open access to commit data
* Microsoft for introducing LoRA

---

## 📚 References

* [Code Llama (Rozière et al., 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)