---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLlama-7B-Instruct`, specialized for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). Given buggy C code and a commit message, the model generates a Git diff patch for the fix.

---

## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection

---

## 🔧 Model Configuration

- **Base model**: `CodeLlama-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, learning rate: 2e-4, epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: optimized for NVIDIA H200 GPUs
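
As a rough sense of why this setup is lightweight, the trainable LoRA parameter count can be estimated with simple arithmetic. The sketch below assumes adapters on the `q_proj` and `v_proj` attention projections (a common default, not confirmed from this repo's training script):

```python
# Back-of-the-envelope count of trainable LoRA parameters for a
# 7B LLaMA-style model (hidden size 4096, 32 transformer blocks).
# Targeting q_proj and v_proj is an illustrative assumption.
HIDDEN = 4096
LAYERS = 32
R = 64  # LoRA rank used in this project

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

per_layer = 2 * lora_params(HIDDEN, HIDDEN, R)  # q_proj + v_proj
trainable = LAYERS * per_layer
print(f"trainable LoRA params: {trainable:,}")              # 33,554,432
print(f"fraction of a 6.7B base: {trainable / 6.7e9:.2%}")  # 0.50%
```

Under these assumptions only about half a percent of the base model's weights are trained, which combined with 4-bit quantization of the frozen base is what makes fine-tuning fit on a single GPU.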

---

## 📊 Dataset

A custom dataset extracted from the Linux kernel Git history.

### Filtering Criteria

Bug-fix commits whose messages contain keywords such as:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`
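
A commit-message filter along these lines reproduces the criteria (keyword list taken from above; the function name is illustrative, not the extraction script's actual code):

```python
# Keywords from the filtering criteria above.
BUGFIX_KEYWORDS = {
    "fix", "bug", "crash", "memory", "null",
    "panic", "overflow", "race", "corruption",
}

def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message mentions any bug-fix keyword."""
    lowered = message.lower()
    return any(kw in lowered for kw in BUGFIX_KEYWORDS)

print(is_bugfix_commit("mm: fix NULL pointer dereference in slab"))  # True
print(is_bugfix_commit("docs: update maintainers list"))             # False
```

The substring match is deliberately loose (e.g. "prefix" would match "fix"), favoring recall in the mining pass over precision.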

### Structure

- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

* **File**: `training_data_100k.jsonl` (100,000 samples)
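
To make the schema concrete, here is one way a record could be turned into a prompt/completion pair for fine-tuning (the prompt template is an illustrative assumption; `format_for_training.py` may use different wording):

```python
import json

# A minimal record following the schema above.
record_line = json.dumps({
    "input": {
        "original code": "kfree(ptr);\nuse(ptr);",
        "instruction": "Fix use-after-free of ptr",
    },
    "output": {"diff codes": "-use(ptr);\n+/* ptr freed above */"},
})

def to_prompt_completion(line: str) -> tuple[str, str]:
    """Split one JSONL record into (prompt, completion) strings."""
    rec = json.loads(line)
    prompt = (
        f"### Instruction:\n{rec['input']['instruction']}\n\n"
        f"### Buggy code:\n{rec['input']['original code']}\n\n"
        "### Fix (git diff):\n"
    )
    return prompt, rec["output"]["diff codes"]

prompt, completion = to_prompt_completion(record_line)
print(prompt + completion)
```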

---

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

---

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```

---

## 🧩 Features

* 🧠 **Efficient fine-tuning**: QLoRA + 4-bit quantization = massive memory savings
* 🐧 **Real-world commits**: mined from actual Linux kernel development
* 💡 **Context-aware**: extracts code context around the buggy lines
* 💻 **Output-ready**: generates valid Git-style diffs
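
The context extraction behind the "context-aware" point can be sketched as a simple clamped window (the 10-line radius comes from the Dataset section; the function name is illustrative):

```python
def extract_context(lines: list[str], changed: int, radius: int = 10) -> list[str]:
    """Return the changed line plus up to `radius` lines on each side,
    clamped at file boundaries."""
    start = max(0, changed - radius)
    end = min(len(lines), changed + radius + 1)
    return lines[start:end]

source = [f"line {i}" for i in range(100)]
print(len(extract_context(source, changed=50)))  # 21 lines: 10 + 1 + 10
print(len(extract_context(source, changed=2)))   # 13 lines, clamped at the top
```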

---

## 📈 Evaluation Metrics

* **BLEU**: n-gram match against reference diffs
* **ROUGE**: overlap with the reference fix content
* **Human evaluation**: subjective patch quality
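
The evaluation script presumably uses library implementations of BLEU and ROUGE; as a rough illustration of what token-overlap metrics measure on diffs, here is a unigram F1 (a simplification of ROUGE-1, not the script's actual code):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped token-count overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

exact = unigram_f1("- kfree(ptr); + ptr = NULL;", "- kfree(ptr); + ptr = NULL;")
print(exact)  # 1.0 for an exact match
```

Such overlap scores reward surface similarity to the reference patch, which is why human inspection is kept as a complementary check.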

---

## 🧪 Use Cases

* Automated kernel bug fixing
* Code review assistance
* Teaching and debugging kernel code
* Research in automated program repair (APR)

---

## 🔬 Technical Highlights

### Memory & Speed Optimizations

* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation
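
Gradient accumulation is what lets a batch size of 64 fit in memory: several small micro-batches are processed before each optimizer step. The per-device value below is an assumed example; only the target of 64 comes from this card:

```python
# Effective batch = per-device batch * accumulation steps * data-parallel GPUs.
per_device_batch = 4         # assumed for illustration
num_gpus = 1                 # assumed for illustration
target_effective_batch = 64  # from this card's training setup

accum_steps = target_effective_batch // (per_device_batch * num_gpus)
print(accum_steps)  # 16 micro-batches per optimizer step
```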

---

## 🤝 Contributing

1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🎉

---

## 📄 License

MIT License. See the `LICENSE` file for details.

---

## 🙏 Acknowledgments

* Meta for CodeLLaMA
* Hugging Face for Transformers + PEFT
* The Linux kernel community for open access to commit data
* Microsoft for introducing LoRA

---

## 📚 References

* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)