---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLlama-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

---

## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection

---

## 🧠 Model Configuration

- **Base model**: `CodeLlama-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, LR: 2e-4, Epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: Optimized for NVIDIA H200 GPUs

---

## 📊 Dataset

Custom dataset extracted from the Linux kernel Git history.

### Filtering Criteria

Bug-fix commits whose messages contain keywords such as `fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, and `corruption`.

### Structure

- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

* **File**: `training_data_100k.jsonl` (100,000 samples)

---

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

---

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```

---

## 🧩 Features

* 🔧 **Efficient fine-tuning**: QLoRA + 4-bit quantization for substantial memory savings
* 🧠 **Real-world commits**: Mined from actual Linux kernel development
* 💡 **Context-aware**: Extracts the surrounding code context around bug lines
* 💻 **Output-ready**: Generates valid Git-style diffs

---

## 📈 Evaluation Metrics

* **BLEU**: n-gram precision against reference diffs
* **ROUGE**: n-gram overlap with the reference fix
* **Human evaluation**: Subjective patch quality

---

## 🧪 Use Cases

* Automated kernel bug fixing
* Code review assistance
* Teaching/debugging kernel code
* Research in automated program repair (APR)

---

## 🔬 Technical Highlights

### Memory & Speed Optimizations

* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation

---

## 🤝 Contributing

1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🙌

---

## 📄 License

MIT License – see the `LICENSE` file for details.
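---

## 📐 Metric Sketch

The overlap metrics used for evaluation can be made concrete with a tiny, self-contained function. This is **not** the code in `evaluate/evaluate_linux_bugfix_model.py`; it is a simplified ROUGE-1-style F1 over whitespace tokens, and the diff strings below are invented examples, shown only to illustrate how a generated patch is scored against a reference patch.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated diff and the reference diff.

    A simplified stand-in for a full ROUGE implementation: tokens are
    whitespace-separated, and overlap counts respect token multiplicity.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection of candidate and reference tokens.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented reference fix vs. a near-miss model output.
reference = "- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"
generated = "- kfree(ptr);\n+ if (ptr != NULL)\n+     kfree(ptr);"
print(round(rouge1_f(generated, reference), 3))  # → 0.75
```

A real evaluation would also compute BLEU (n-gram precision with a brevity penalty), but the intuition is the same: higher token overlap with the reference diff yields a higher score.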
---

## 🙏 Acknowledgments

* Meta for CodeLLaMA
* Hugging Face for Transformers + PEFT
* The Linux kernel community for open access to commit data
* Microsoft for introducing LoRA

---

## 📚 References

* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)
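---

## 📎 Appendix: Prompt Construction Sketch

Each JSONL record pairs the buggy code and commit message (input) with a Git diff (output). The sketch below shows how such a record might be flattened into a prompt string for instruction tuning. The field names match the dataset format described above, but the template text itself is a hypothetical illustration; the actual template is defined in `train/train_codellama_qlora_linux_bugfix.py`.

```python
# Hypothetical prompt template -- the real one lives in the training
# script; this only illustrates the record-to-prompt mapping.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Buggy code:\n{code}\n\n"
    "### Fix (git diff):\n"
)

def build_prompt(sample: dict) -> str:
    """Turn one training_data_100k.jsonl record into a prompt string."""
    return PROMPT_TEMPLATE.format(
        instruction=sample["input"]["instruction"],
        code=sample["input"]["original code"],
    )

# Invented example record in the dataset's schema.
record = {
    "input": {
        "original code": "kfree(ptr);\nkfree(ptr);",
        "instruction": "fix double free in driver teardown",
    },
    "output": {"diff codes": "-kfree(ptr);\n kfree(ptr);"},
}
print(build_prompt(record))
```

During training, the `output["diff codes"]` string would be appended after the prompt as the completion the model learns to generate.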