# CodeLLaMA-Linux-BugFix

A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

## 🎯 Project Overview

This project addresses the challenging task of automated Linux kernel bug fixing by:

- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance

## 🏗️ Architecture

### Base Model

- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for the H200 GPU with bfloat16 precision

### Training Configuration

- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed-precision training

## 📊 Dataset

The project creates a specialized dataset from Linux kernel commits.

### Data Extraction Process

1. **Commit Filtering**: Identifies bug-fix commits using keywords:
   - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
   - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
   - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
2. **Code Context Extraction**:
   - Focuses on C source and header files (`.c`, `.h`)
   - Extracts 10 lines before/after the bug location
   - Captures relevant code context
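The two extraction steps above can be sketched in pure Python. Function and keyword names here are illustrative only, not the actual API of `extract_linux_bugfixes.py`:

```python
import re

# Keywords used to flag likely bug-fix commits (the list from step 1 above)
BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}


def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message contains any bug-fix keyword."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return bool(words & BUGFIX_KEYWORDS)


def extract_context(source: str, bug_line: int, window: int = 10) -> str:
    """Return up to `window` lines before and after a 1-based bug line."""
    lines = source.splitlines()
    start = max(0, bug_line - 1 - window)
    end = min(len(lines), bug_line + window)
    return "\n".join(lines[start:end])
```

A keyword set-intersection check like this is fast but coarse; the real pipeline may apply additional filtering (e.g. restricting to commits that touch only `.c`/`.h` files).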
### Data Format

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

### Dataset Statistics

- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository

## 🚀 Quick Start

### Prerequisites

```bash
pip install -r requirements.txt
```

### 1. Build Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Train Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Evaluate Model

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/                          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py             # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py    # Parallelized version
│   └── format_for_training.py
├── dataset/                                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/                               # Trained model checkpoints
├── evaluate/                                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py        # Model evaluation
│   ├── test_samples.jsonl                    # Evaluation dataset
│   └── output/                               # Evaluation results
└── requirements.txt                          # Python dependencies
```

## 🔧 Key Features

### Efficient Training

- **QLoRA**: Reduces memory requirements by 75% while maintaining performance
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training

### Real-world Data

- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format
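The training configuration described above maps onto a QLoRA setup roughly like the following sketch. This is a configuration fragment, not the project's actual `train/train_codellama_qlora_linux_bugfix.py`; it assumes `transformers`, `peft`, and `bitsandbytes` are installed, and the `target_modules` choice is a common convention that may differ from the real script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute (the standard QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Casts norm/embedding layers and enables gradient checkpointing by default
model = prepare_model_for_kbit_training(model)

# LoRA adapter matching the configuration above: r=64, alpha=16, dropout=0.1
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; check script
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

With this setup only the LoRA adapter weights are trained; the 4-bit base model stays frozen, which is what keeps memory usage low enough for a single GPU.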
### Evaluation

- **BLEU Score**: Measures n-gram overlap between generated and reference diffs
- **ROUGE Score**: Evaluates text generation accuracy
- **Comprehensive Metrics**: JSON and CSV output formats

## 🎯 Use Cases

The fine-tuned model can assist with:

1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques

## 📈 Performance

The model is evaluated using:

- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality

## 🔬 Technical Details

### Model Architecture

- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates patches in Git diff format

### Training Process

1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning
4. **Checkpointing**: Save model states for evaluation

### Memory Optimization

- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments

- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique

## 📚 References

- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)