# CodeLLaMA-Linux-BugFix

A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

## 🎯 Project Overview

This project addresses the challenging task of automated Linux kernel bug fixing by:

- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance

## 🏗️ Architecture

### Base Model

- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for the H200 GPU with bfloat16 precision

### Training Configuration

- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed-precision training

## 📊 Dataset

The project creates a specialized dataset from Linux kernel commits.

### Data Extraction Process

1. **Commit Filtering**: Identifies bug-fix commits using keywords:
   - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
   - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
   - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
2. **Code Context Extraction**:
   - Focuses on C source and header files (`.c`, `.h`)
   - Extracts 10 lines before/after the bug location
   - Captures relevant code context
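The two extraction steps above can be sketched in pure Python. Function and keyword names here are illustrative only, not the actual API of `extract_linux_bugfixes.py`:

```python
import re

# Keywords used to flag likely bug-fix commits (the list from step 1 above)
BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}


def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message contains any bug-fix keyword."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return bool(words & BUGFIX_KEYWORDS)


def extract_context(source: str, bug_line: int, window: int = 10) -> str:
    """Return up to `window` lines before and after a 1-based bug line."""
    lines = source.splitlines()
    start = max(0, bug_line - 1 - window)
    end = min(len(lines), bug_line + window)
    return "\n".join(lines[start:end])
```

A keyword set-intersection check like this is fast but coarse; the real pipeline may apply additional filtering (e.g. restricting to commits that touch only `.c`/`.h` files).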
### Data Format

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

### Dataset Statistics

- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository

## 🚀 Quick Start

### Prerequisites

```bash
pip install -r requirements.txt
```

### 1. Build Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Train Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Evaluate Model

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/                          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py             # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py    # Parallelized version
│   └── format_for_training.py
├── dataset/                                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/                               # Trained model checkpoints
├── evaluate/                                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py        # Model evaluation
│   ├── test_samples.jsonl                    # Evaluation dataset
│   └── output/                               # Evaluation results
└── requirements.txt                          # Python dependencies
```

## 🔧 Key Features

### Efficient Training

- **QLoRA**: Reduces memory requirements by 75% while maintaining performance
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training

### Real-world Data

- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format
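The training configuration described above maps onto a QLoRA setup roughly like the following sketch. This is a configuration fragment, not the project's actual `train/train_codellama_qlora_linux_bugfix.py`; it assumes `transformers`, `peft`, and `bitsandbytes` are installed, and the `target_modules` choice is a common convention that may differ from the real script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute (the standard QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Casts norm/embedding layers and enables gradient checkpointing by default
model = prepare_model_for_kbit_training(model)

# LoRA adapter matching the configuration above: r=64, alpha=16, dropout=0.1
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; check script
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

With this setup only the LoRA adapter weights are trained; the 4-bit base model stays frozen, which is what keeps memory usage low enough for a single GPU.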
### Evaluation

- **BLEU Score**: Measures n-gram overlap between generated and reference diffs
- **ROUGE Score**: Evaluates text generation accuracy
- **Comprehensive Metrics**: JSON and CSV output formats

## 🎯 Use Cases

The fine-tuned model can assist with:

1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques

## 📈 Performance

The model is evaluated using:

- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality

## 🔬 Technical Details

### Model Architecture

- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates patches in Git diff format

### Training Process

1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning
4. **Checkpointing**: Save model states for evaluation

### Memory Optimization

- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments

- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique

## 📚 References

- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)