---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---
# CodeLLaMA-Linux-BugFix
A fine-tuned version of `CodeLLaMA-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches based on buggy C code and commit messages.
---
## 🎯 Overview
This project targets automated Linux kernel bug fixing by:
- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection
---
## 🧠 Model Configuration
- **Base model**: `CodeLLaMA-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, LR: 2e-4, epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: Optimized for NVIDIA H200 GPUs
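The setup above corresponds roughly to the following PEFT/bitsandbytes configuration. This is a sketch: `target_modules` and the compute dtype wiring are assumptions, not stated in the training script itself.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter hyperparameters from the list above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; not specified above
    task_type="CAUSAL_LM",
)
```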
---
## 📊 Dataset
A custom dataset extracted from the Linux kernel Git history.
### Filtering Criteria
Commits are selected when their messages contain bug-fix keywords such as:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`
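A minimal sketch of such a keyword filter over commit subjects (the actual matching logic in `extract_linux_bugfixes.py` may differ):

```python
import re

BUGFIX_KEYWORDS = (
    "fix", "bug", "crash", "memory", "null",
    "panic", "overflow", "race", "corruption",
)
_PATTERN = re.compile("|".join(BUGFIX_KEYWORDS), re.IGNORECASE)

def is_bugfix_commit(subject: str) -> bool:
    """Return True if a commit subject matches any bug-fix keyword."""
    return bool(_PATTERN.search(subject))

print(is_bugfix_commit("net: fix NULL pointer dereference in tcp_close"))  # True
print(is_bugfix_commit("Documentation: update maintainers list"))          # False
```

Substring matching like this is deliberately permissive; a production filter would likely also exclude merge commits and reverts.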
### Structure
- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:
```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
* **File**: `training_data_100k.jsonl` (100,000 samples)
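Each JSONL record can be flattened into a prompt/completion pair (presumably what `format_for_training.py` produces for `training_data_prompt_completion.jsonl`; the exact template below is illustrative):

```python
import json

# One synthetic record in the dataset schema shown above
record_line = json.dumps({
    "input": {
        "original code": "if (!ptr)\n    return;\nuse(ptr);",
        "instruction": "Fix NULL pointer dereference",
    },
    "output": {"diff codes": "@@ -1,3 +1,3 @@\n-use(ptr);\n+if (ptr)\n+    use(ptr);\n"},
})

def to_prompt_completion(line: str) -> dict:
    """Flatten one dataset record into a prompt/completion pair."""
    rec = json.loads(line)
    prompt = (
        f"### Instruction:\n{rec['input']['instruction']}\n\n"
        f"### Buggy code:\n{rec['input']['original code']}\n\n"
        f"### Fix (git diff):\n"
    )
    return {"prompt": prompt, "completion": rec["output"]["diff codes"]}

pair = to_prompt_completion(record_line)
print(pair["prompt"])
```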
---
## 🚀 Quick Start
### Install dependencies
```bash
pip install -r requirements.txt
```
### 1. Build the Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Fine-tune the Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Run Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
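Once an adapter has been trained, inference follows the standard PEFT pattern. This is a sketch: the adapter path `train/output` and the prompt template are assumptions, so adjust both to match your training run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load the frozen base model, then attach the fine-tuned LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "train/output")  # adapter path: assumed

prompt = (
    "### Instruction:\nFix NULL pointer dereference\n\n"
    "### Buggy code:\nuse(ptr);\n\n"
    "### Fix (git diff):\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```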
---
## πŸ“ Project Structure
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```
---
## 🧩 Features
* 🔧 **Efficient Fine-tuning**: QLoRA + 4-bit quantization = massive memory savings
* 🧠 **Real-world commits**: From actual Linux kernel development
* 💡 **Context-aware**: Code context extraction around bug lines
* 💻 **Output-ready**: Generates valid Git-style diffs
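The context extraction can be sketched as a simple window over file lines (10 lines on each side, per the dataset description; the function name is illustrative):

```python
def context_window(lines, changed_idx, radius=10):
    """Return the lines within `radius` of a changed line (0-based index)."""
    start = max(0, changed_idx - radius)
    end = min(len(lines), changed_idx + radius + 1)
    return lines[start:end]

source = [f"line {i}" for i in range(100)]
snippet = context_window(source, changed_idx=50)
print(len(snippet))  # 21: the changed line plus 10 on each side
```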
---
## 📈 Evaluation Metrics
* **BLEU**: n-gram overlap between generated and reference diffs
* **ROUGE**: Content overlap with the reference fix
* **Human Evaluation**: Subjective patch quality by inspection
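As an illustration of the overlap-style metrics, here is a minimal ROUGE-1-style F1 over whitespace tokens. A real evaluation would use a library such as `sacrebleu` or `rouge-score`; this toy version only conveys the idea.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated diff and the reference diff."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("- kfree(ptr); + kfree_safe(ptr);",
                "- kfree(ptr); + kfree(ptr); ptr = NULL;"))
```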
---
## 🧪 Use Cases
* Automated kernel bug fixing
* Code review assistance
* Teaching/debugging kernel code
* Research in automated program repair (APR)
---
## 🔬 Technical Highlights
### Memory & Speed Optimizations
* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation
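Rough memory arithmetic shows why 4-bit quantization matters for a 7B-parameter base model (adapter weights and optimizer-state overhead are ignored for simplicity):

```python
params = 7e9  # approximate parameter count of CodeLLaMA-7B

fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in fp16/bf16
nf4_gb = params * 0.5 / 1024**3  # 4 bits = 0.5 bytes per weight in NF4

print(f"fp16 weights: ~{fp16_gb:.1f} GB")  # ~13.0 GB
print(f"NF4 weights:  ~{nf4_gb:.1f} GB")   # ~3.3 GB
```

The roughly 4x reduction in resident weight memory is what lets the frozen base model and the small bf16 LoRA adapters fit comfortably on a single GPU.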
---
## 🤝 Contributing
1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🙌
---
## 📄 License
MIT License – see `LICENSE` file for details.
---
## πŸ™ Acknowledgments
* Meta for CodeLLaMA
* Hugging Face for Transformers + PEFT
* The Linux kernel community for open access to commit data
* Microsoft for introducing LoRA
---
## 📚 References
* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)