# CodeLLaMA-Linux-BugFix
A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
## 🎯 Project Overview
This project addresses the challenging task of automated Linux kernel bug fixing by:
- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance
## πŸ—οΈ Architecture
### Base Model
- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for H200 GPU with bfloat16 precision
### Training Configuration
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed precision training
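The configuration listed above could be expressed with Hugging Face `peft` and `bitsandbytes` roughly as follows. This is a sketch, not the project's actual training script; in particular the `target_modules` list is an assumption about which attention projections receive LoRA adapters:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bfloat16 compute (standard QLoRA setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings matching the values listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```

These two objects would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.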
## πŸ“Š Dataset
The project creates a specialized dataset from Linux kernel commits:
### Data Extraction Process
1. **Commit Filtering**: Identifies bug-fix commits using keywords:
- `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
- `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
- `security`, `vulnerability`, `exploit`, `buffer`, `stack`
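A minimal keyword filter in the spirit of the list above might look like this. The function name and tokenization details are illustrative, not taken from the project's extraction script:

```python
# Keyword-based heuristic for spotting bug-fix commits
BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix_commit(message: str) -> bool:
    """Return True if a commit message contains any bug-fix keyword."""
    words = message.lower().split()
    return any(word.strip(".,:;!?()[]") in BUGFIX_KEYWORDS for word in words)

print(is_bugfix_commit("mm: fix use-after-free in page allocator"))  # True
```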
2. **Code Context Extraction**:
- Focuses on C and header files (`.c`, `.h`)
- Extracts 10 lines before/after bug location
- Captures relevant code context
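The windowing step can be sketched as a small helper; the function name and the 0-based line indexing are assumptions for illustration:

```python
def extract_context(lines: list[str], bug_line: int, window: int = 10) -> list[str]:
    """Return up to `window` lines before and after a 0-based bug line index,
    clamped to the bounds of the file."""
    start = max(0, bug_line - window)
    end = min(len(lines), bug_line + window + 1)
    return lines[start:end]
```

Near the start or end of a file the window is simply truncated rather than padded.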
3. **Data Format**:
```json
{
"input": {
"original code": "C code snippet with bug",
"instruction": "Bug fix instruction from commit message"
},
"output": {
"diff codes": "Git diff showing the fix"
}
}
```
### Dataset Statistics
- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository
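JSONL stores one record per line, so the dataset can be streamed with the standard `json` module. The sample values below are invented for illustration; only the key names come from the format shown above:

```python
import io
import json

# One training record in the format described above (values are made up)
sample = {
    "input": {
        "original code": "if (!ptr) return;",
        "instruction": "Fix NULL pointer dereference",
    },
    "output": {"diff codes": "-if (!ptr) return;\n+if (!ptr) return -EINVAL;"},
}

# Write and read back a JSONL stream (an in-memory buffer stands in for the file)
buf = io.StringIO()
buf.write(json.dumps(sample) + "\n")
buf.seek(0)
records = [json.loads(line) for line in buf]
print(records[0]["input"]["instruction"])  # Fix NULL pointer dereference
```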
## πŸš€ Quick Start
### Prerequisites
```bash
pip install -r requirements.txt
```
### 1. Build Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Train Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Evaluate Model
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
## πŸ“ Project Structure
```
CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/ # Dataset creation scripts
β”‚ β”œβ”€β”€ extract_linux_bugfixes.py # Main dataset extraction
β”‚ β”œβ”€β”€ extract_linux_bugfixes_parallel.py # Parallelized version
β”‚ └── format_for_training.py
β”œβ”€β”€ dataset/ # Generated datasets
β”‚ β”œβ”€β”€ training_data_100k.jsonl
β”‚ └── training_data_prompt_completion.jsonl
β”œβ”€β”€ train/ # Training scripts and outputs
β”‚ β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py # Main training script
β”‚ β”œβ”€β”€ train_codellama_qlora_simple.py
β”‚ β”œβ”€β”€ download_codellama_model.py
β”‚ └── output/ # Trained model checkpoints
β”œβ”€β”€ evaluate/ # Evaluation scripts and results
β”‚ β”œβ”€β”€ evaluate_linux_bugfix_model.py # Model evaluation
β”‚ β”œβ”€β”€ test_samples.jsonl # Evaluation dataset
β”‚ └── output/ # Evaluation results
└── requirements.txt # Python dependencies
```
## πŸ”§ Key Features
### Efficient Training
- **QLoRA**: Cuts weight memory by roughly 75% (4-bit vs. 16-bit weights) while preserving fine-tuning quality
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training
### Real-world Data
- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format
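Unified diffs in the style of the training targets can be produced with Python's `difflib`; the file names and code lines below are made up for illustration:

```python
import difflib

before = ["int *p = get_buf();", "use(p);"]
after = ["int *p = get_buf();", "if (!p)", "\treturn -ENOMEM;", "use(p);"]

# Git-style unified diff with a/ and b/ path prefixes
diff = difflib.unified_diff(
    before, after,
    fromfile="a/drivers/example.c",
    tofile="b/drivers/example.c",
    lineterm="",
)
print("\n".join(diff))
```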
### Evaluation
- **BLEU Score**: Measures n-gram overlap between generated and reference diffs
- **ROUGE Score**: Measures recall-oriented overlap with the reference fix
- **Comprehensive Metrics**: JSON and CSV output formats
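To make the metrics concrete, here is a minimal token-level n-gram precision in the spirit of BLEU. The evaluation script itself presumably uses a standard implementation (e.g. `nltk` or `sacrebleu`); this hand-rolled sketch only illustrates the idea:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference."""
    def ngrams(tokens: list[str], n: int) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not cand:
        return 0.0
    # Clipped counts: each candidate n-gram credits at most its reference count
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())

print(ngram_precision("+ if ( ! ptr ) return ;", "+ if ( ! ptr ) return ;"))  # 1.0
```

Full BLEU additionally combines precisions for n = 1..4 and applies a brevity penalty.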
## 🎯 Use Cases
The fine-tuned model can assist with:
1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques
## πŸ“ˆ Performance
The model is evaluated using:
- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality
## πŸ”¬ Technical Details
### Model Architecture
- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates Git diff format patches
### Training Process
1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning over a 4-bit quantized base model
4. **Checkpointing**: Save model states for evaluation
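Step 2 above implies assembling each record into a single prompt string before tokenization. The template below is an assumption based on CodeLLaMA's `[INST] ... [/INST]` instruction format, not the project's actual one:

```python
from typing import Optional

def build_prompt(original_code: str, instruction: str,
                 diff: Optional[str] = None) -> str:
    """Assemble a prompt; with `diff` given, append the training target."""
    prompt = (
        "[INST] Fix the following Linux kernel bug.\n"
        f"Instruction: {instruction}\n"
        f"Code:\n{original_code}\n[/INST]\n"
    )
    if diff is not None:  # training example: target diff follows the prompt
        prompt += diff
    return prompt
```

At inference time the model is given only the `[INST] ... [/INST]` portion and generates the diff itself.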
### Memory Optimization
- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training
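Gradient accumulation simulates a large batch by summing gradients over several micro-batches before a single optimizer step. A minimal PyTorch sketch with a toy model (all sizes are illustrative, not the project's):

```python
import torch

# Toy model: 4 micro-batches of 16 simulate an effective batch size of 64
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
accum_steps = 4

for step in range(accum_steps):
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads average out
opt.step()  # one optimizer update per 64 samples
```

After `opt.step()` the gradients would be cleared with `opt.zero_grad()` before the next accumulation cycle.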
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## πŸ“„ License
This project is licensed under the MIT License - see the LICENSE file for details.
## πŸ™ Acknowledgments
- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique
## πŸ“š References
- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)