---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---
# CodeLLaMA-Linux-BugFix
A fine-tuned version of `CodeLlama-7B-Instruct`, specialized for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
---
## Overview
This project targets automated Linux kernel bug fixing by:
- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection
The model performs well at generating plausible Linux kernel bug fixes, making it a useful aid for automated code review and bug detection.
---
## Performance Results
### Evaluation Metrics
**BLEU Score**: 33.87

**ROUGE Scores**:
- **ROUGE-1**: P=0.3775, R=0.7306, F1=0.4355
- **ROUGE-2**: P=0.2898, R=0.6096, F1=0.3457
- **ROUGE-L**: P=0.3023, R=0.6333, F1=0.3612
These results demonstrate the model's ability to:
- Generate syntactically correct Git diff patches
- Maintain semantic similarity to reference fixes
- Produce meaningful code changes that address the underlying bugs
---
## Model Configuration
- **Base model**: `CodeLlama-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
- LoRA r=64, alpha=16, dropout=0.1
- Batch size: 64, LR: 2e-4, Epochs: 3
- Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: Optimized for NVIDIA H200 GPUs
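
For reference, the settings above correspond roughly to the following `transformers` + `peft` setup. This is a minimal sketch, not the project's training script (`train/train_codellama_qlora_linux_bugfix.py`); in particular, the `target_modules` list is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # also enables gradient checkpointing

# LoRA r=64, alpha=16, dropout=0.1 (values from the list above)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```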
---
## Dataset
A custom dataset extracted from the Linux kernel Git history.
### Filtering Criteria
Commits are selected when their messages contain bug-fix keywords such as:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
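
As an illustration of this filtering, here is a minimal sketch that shells out to the `git` CLI; the actual (parallelized) extraction lives in `dataset_builder/extract_linux_bugfixes_parallel.py`, and the repository path below is a placeholder:

```python
import subprocess

KEYWORDS = ["fix", "bug", "crash", "memory", "null",
            "panic", "overflow", "race", "corruption"]

def bugfix_commits(repo_path):
    """Return hashes of commits whose messages match any bug-fix keyword."""
    cmd = ["git", "-C", repo_path, "log", "--format=%H", "-i"]
    cmd += [f"--grep={kw}" for kw in KEYWORDS]  # multiple --grep patterns are OR-ed
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.split()

commits = bugfix_commits("/path/to/linux")  # placeholder: a local kernel clone
print(f"{len(commits)} candidate bug-fix commits")
```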
### Structure
- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:
```json
{
"input": {
"original code": "C code snippet with bug",
"instruction": "Commit message or fix description"
},
"output": {
"diff codes": "Git diff showing the fix"
}
}
```
* **File**: `training_data_100k.jsonl` (100,000 samples)
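
For illustration, a sketch of loading these records and turning them into prompt/completion pairs. The prompt template below mirrors the inference prompt in Quick Start; the canonical conversion is `dataset_builder/format_for_training.py`, whose exact template may differ:

```python
import json

def to_prompt_completion(path):
    """Yield prompt/completion pairs from the JSONL schema shown above."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            prompt = (
                "Given the following original C code:\n"
                f"{rec['input']['original code']}\n"
                f"Instruction: {rec['input']['instruction']}\n"
                "Return the diff that fixes it:\n"
            )
            yield {"prompt": prompt, "completion": rec["output"]["diff codes"]}

pairs = list(to_prompt_completion("dataset/training_data_100k.jsonl"))
```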
---
## Quick Start
### Prerequisites
- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ RAM
- 50GB+ disk space
### Install dependencies
```bash
pip install -r requirements.txt
```
### 1. Build the Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes_parallel.py
python format_for_training.py
```
### 2. Fine-tune the Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Run Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
### 4. Use the Model
````python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model and apply the fine-tuned LoRA adapter
base = "codellama/CodeLlama-7b-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained(base)

# Generate a bug fix
prompt = """
Given the following original C code:
```c
if (!file->filter)
    return;
```
Instruction: Fix the null pointer dereference
Return the diff that fixes it:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, skipping the echoed prompt
fix = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(fix)
````
---
## Project Structure
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes_parallel.py    # Parallel extraction of bug fixes
│   ├── format_for_training.py                # Format data for training
│   └── build_dataset.py                      # Main dataset builder
├── dataset/
│   ├── training_data_100k.jsonl              # 100K training samples
│   └── training_data_prompt_completion.jsonl # Formatted training data
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py       # Simplified training
│   ├── download_codellama_model.py           # Model download utility
│   └── output/
│       └── qlora-codellama-bugfix/           # Trained model checkpoints
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py        # Evaluation script
│   ├── test_samples.jsonl                    # Test dataset
│   └── output/                               # Evaluation results
│       ├── eval_results.csv                  # Detailed results
│       └── eval_results.json                 # JSON format results
├── requirements.txt                          # Python dependencies
├── README.md                                 # This file
└── PROJECT_STRUCTURE.md                      # Detailed project overview
```
---
## Features
* **Efficient fine-tuning**: QLoRA + 4-bit quantization = massive memory savings
* **Real-world commits**: Mined from actual Linux kernel development
* **Context-aware**: Code context extraction around bug lines
* **Output-ready**: Generates valid Git-style diffs
* **Strong performance**: BLEU score of 33.87 with good ROUGE metrics
* **Production-ready**: Optimized for real-world deployment
---
## Evaluation Metrics
* **BLEU**: Translation-style match to reference diffs
* **ROUGE**: Overlap in fix content and semantic similarity
* **Human Evaluation**: Subjective patch quality assessment
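
For reference, a minimal sketch of computing these scores with the `sacrebleu` and `rouge-score` packages; the project's own pipeline is `evaluate/evaluate_linux_bugfix_model.py`, and this is an illustrative approximation rather than its exact code:

```python
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["diff --git a/f.c b/f.c ..."]  # model-generated diffs
references = ["diff --git a/f.c b/f.c ..."]   # ground-truth diffs

# Corpus-level BLEU against the reference patches
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1/2/L precision, recall, and F1 per sample
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for pred, ref in zip(predictions, references):
    scores = scorer.score(ref, pred)  # argument order: (target, prediction)
    print({name: round(s.fmeasure, 4) for name, s in scores.items()})
```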
### Current Performance
- **BLEU Score**: 33.87 (strong for code-generation tasks)
- **ROUGE-1 F1**: 0.4355 (good unigram overlap)
- **ROUGE-2 F1**: 0.3457 (reasonable bigram matching)
- **ROUGE-L F1**: 0.3612 (good longest-common-subsequence overlap)
---
## Use Cases
* **Automated kernel bug fixing**: Generate fixes for common kernel bugs
* **Code review assistance**: Help reviewers identify potential issues
* **Teaching/debugging kernel code**: Educational tool for kernel development
* **Research in automated program repair (APR)**: Academic research applications
* **CI/CD integration**: Automated testing and fixing in development pipelines
---
## Technical Highlights
### Memory & Speed Optimizations
* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation
* LoRA parameter efficiency
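
These optimizations map onto Hugging Face `TrainingArguments` roughly as sketched below; the split of the effective batch size 64 into per-device batch and accumulation steps is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="train/output/qlora-codellama-bugfix",
    per_device_train_batch_size=8,   # assumption: 8 x 8 accumulation = effective 64
    gradient_accumulation_steps=8,   # gradient accumulation
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                       # mixed precision (bfloat16)
    gradient_checkpointing=True,     # trades compute for memory
)
```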
### Training Efficiency
* **QLoRA**: Reduces memory usage by ~75%
* **4-bit quantization**: Further memory optimization
* **Gradient checkpointing**: Trades compute for memory
* **Mixed precision**: Faster training with maintained accuracy
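
As a rough sanity check on the ~75% figure: storing 7B weights in 16-bit precision takes about 7B × 2 bytes ≈ 14 GB, while 4-bit NF4 storage takes about 7B × 0.5 bytes ≈ 3.5 GB, a 75% reduction in weight memory even before counting the optimizer-state savings of training only the small LoRA adapters.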
---
## Advanced Usage
### Custom Training
```bash
# Train with custom parameters
python train_codellama_qlora_linux_bugfix.py \
--learning_rate 1e-4 \
--num_epochs 5 \
--batch_size 32 \
--lora_r 32 \
--lora_alpha 16
```
### Evaluation on Custom Data
```bash
# Evaluate on your own test set
python evaluate_linux_bugfix_model.py \
--test_file your_test_data.jsonl \
--output_dir custom_eval_results
```
---
## Contributing
1. Fork this repo
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Guidelines
- Follow PEP 8 style guidelines
- Add tests for new features
- Update documentation for API changes
- Ensure all tests pass before submitting PR
---
## License
MIT License; see the `LICENSE` file for details.
---
## Acknowledgments
* **Meta** for the CodeLlama base model
* **Hugging Face** for the Transformers and PEFT libraries
* **The Linux kernel community** for open access to commit data
* **Microsoft** for introducing the LoRA technique
* **University of Washington** for the QLoRA research
---
## References
* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
* [Automated Program Repair: A Survey](https://ieeexplore.ieee.org/document/8449519)
---
## Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Check the project documentation
- Review the evaluation results in `evaluate/output/`
---
## Version History
- **v1.0.0**: Initial release with QLoRA training
- **v1.1.0**: Added parallel dataset extraction
- **v1.2.0**: Improved evaluation metrics and documentation