# CodeLLaMA-Linux-BugFix
A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
## 🎯 Project Overview
This project addresses the challenging task of automated Linux kernel bug fixing by:
- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance
## 🏗️ Architecture
### Base Model
- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for an H200 GPU with bfloat16 precision
### Training Configuration
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed precision training (see the setup sketch below)
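A minimal sketch of this setup using the standard `transformers`/`peft`/`bitsandbytes` stack. The NF4 quantization type and the attention-projection `target_modules` are assumptions, not settings confirmed by the training script:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base model; compute runs in bfloat16 as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # assumption: NF4 is the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the 4-bit model for gradient checkpointing

# LoRA adapter with the hyperparameters listed above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable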
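```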
## 📊 Dataset
The project creates a specialized dataset from Linux kernel commits:
### Data Extraction Process
1. **Commit Filtering**: Identifies bug-fix commits using keywords (a keyword-matching sketch follows this list):
   - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
   - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
   - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
2. **Code Context Extraction**:
   - Focuses on C source and header files (`.c`, `.h`)
   - Extracts 10 lines before/after the bug location
   - Captures relevant code context
3. **Data Format**:
```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
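A hedged sketch of the commit-filtering step. The function names and the exact `git log` invocation are illustrative assumptions; the real logic lives in `extract_linux_bugfixes.py`:
```python
import subprocess

BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix_commit(subject: str) -> bool:
    # Simple keyword match on the lowercased subject line.
    subject = subject.lower()
    return any(kw in subject for kw in BUGFIX_KEYWORDS)

def list_bugfix_commits(repo: str, limit: int = 5000) -> list[str]:
    # "%H|%s" prints each commit's hash and subject on one line.
    log = subprocess.run(
        ["git", "-C", repo, "log", f"-{limit}", "--pretty=%H|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        line.split("|", 1)[0]
        for line in log.splitlines()
        if "|" in line and is_bugfix_commit(line.split("|", 1)[1])
    ]
```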
### Dataset Statistics
- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL, one JSON object per line (see the inspection snippet below)
- **Source**: Linux kernel Git repository
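Because each line is a self-contained JSON object, the file can be sanity-checked a record at a time (field names taken from the data format above):
```python
import json

with open("dataset/training_data_100k.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)  # one JSON object per line
        print(record["input"]["instruction"][:80])
        if i == 2:  # peek at the first three records
            break
```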
## 🚀 Quick Start
### Prerequisites
```bash
pip install -r requirements.txt
```
### 1. Build Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Train Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Evaluate Model
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
## 📁 Project Structure
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/                           # Dataset creation scripts
│   ├── extract_linux_bugfixes.py              # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py     # Parallelized version
│   └── format_for_training.py
├── dataset/                                   # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                                     # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py  # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/                                # Trained model checkpoints
├── evaluate/                                  # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py         # Model evaluation
│   ├── test_samples.jsonl                     # Evaluation dataset
│   └── output/                                # Evaluation results
└── requirements.txt                           # Python dependencies
```
## 🔧 Key Features
### Efficient Training
- **QLoRA**: Reduces memory requirements by roughly 75% while maintaining performance
- **4-bit Quantization**: Shrinks the base model's weight footprint enough to train on a single GPU
- **Gradient Checkpointing**: Trades recomputation for lower activation memory during training
### Real-world Data
- **Authentic Bug Fixes**: Extracted from actual Linux kernel development history
- **Contextual Understanding**: Captures relevant code context around each bug
- **Git Integration**: Outputs patches in standard Git diff format
### Evaluation
- **BLEU Score**: Measures n-gram overlap between generated and reference diffs
- **ROUGE Score**: Measures recall-oriented overlap with the reference fix
- **Comprehensive Metrics**: Results written in JSON and CSV formats
## 🎯 Use Cases
The fine-tuned model can assist with:
1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques
## 📈 Performance
The model is evaluated using:
- **BLEU Score**: Measures how well generated diffs match reference fixes (a scoring sketch follows this list)
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality
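A sketch of how such scores can be computed with the Hugging Face `evaluate` library; whether `evaluate_linux_bugfix_model.py` uses this exact library, and the toy diff strings below, are assumptions:
```python
import evaluate  # pip install evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Toy generated/reference patches for illustration only.
generated = ["@@ -1,2 +1,2 @@\n-if (ptr)\n+if (ptr != NULL)"]
reference = ["@@ -1,2 +1,2 @@\n-if (ptr)\n+if (ptr != NULL)"]

# BLEU expects a list of candidate references per prediction.
print(bleu.compute(predictions=generated, references=[[r] for r in reference])["bleu"])
print(rouge.compute(predictions=generated, references=reference)["rougeL"])
```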
## 🔬 Technical Details
### Model Architecture
- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Git-diff-format patches (an inference sketch follows this list)
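A hypothetical inference sketch: load the base model, attach the trained LoRA adapter, and decode a patch. The adapter path under `train/output` and the prompt wording are assumptions:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "train/output")  # assumption: adapter checkpoint dir
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

prompt = "[INST] Fix the following Linux kernel bug.\n...buggy code and instruction...\n[/INST]\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and print only the generated diff.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```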
### Training Process
1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert records to the model's input format (a prompt sketch follows this list)
3. **QLoRA Training**: Parameter-efficient fine-tuning on the quantized base model
4. **Checkpointing**: Save model states for evaluation
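The exact prompt template is not documented here; a hypothetical formatting of one training record might look like this (the `[INST]` tags follow CodeLlama's instruct format, but the template wording is an assumption):
```python
def build_example(record: dict) -> str:
    # Concatenate instruction + buggy code as the prompt, diff as the completion.
    return (
        "[INST] Fix the following Linux kernel bug.\n"
        f"Instruction: {record['input']['instruction']}\n"
        f"Code:\n{record['input']['original code']}\n"
        "[/INST]\n"
        f"{record['output']['diff codes']}"
    )
```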
### Memory Optimization
- **4-bit Quantization**: Reduces the stored model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes (see the sketch below)
- **Mixed Precision**: Uses bfloat16 for faster training
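These settings map onto `transformers.TrainingArguments` roughly as follows; the per-device batch size / accumulation split is an assumption, and only their product (the batch size of 64 stated above) comes from this README:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumption
    gradient_accumulation_steps=8,   # 8 x 8 = effective batch size 64
    learning_rate=2e-4,
    bf16=True,                       # bfloat16 mixed precision
    gradient_checkpointing=True,     # recompute activations to save memory
    logging_steps=50,
    save_strategy="epoch",
)
```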
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License; see the LICENSE file for details.
## 🙏 Acknowledgments
- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique
## 🔗 References
- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)