---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLLaMA-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLLaMA-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

---

## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

- Mining real commit data from the kernel's Git history
- Training a QLoRA adapter to generate Git-style fixes
- Evaluating performance using BLEU and ROUGE
- Supporting integration into code review pipelines

---

## 📊 Performance Results

**BLEU Score**: 33.87

**ROUGE Scores**:

- ROUGE-1: P=0.3775, R=0.7306, F1=0.4355
- ROUGE-2: P=0.2898, R=0.6096, F1=0.3457
- ROUGE-L: P=0.3023, R=0.6333, F1=0.3612

The consistently high recall indicates that generated diffs recover most of the ground-truth patch content; the lower precision reflects extra tokens in the generations.
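As a rough illustration of how these ROUGE figures are computed, the sketch below scores a generated diff against a reference diff at the token level. This is a simplified stand-in, not the project's actual evaluation code; the toy diff strings are invented for illustration.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1):
    """Simplified token-level ROUGE-N between two diff strings."""
    def ngrams(text: str, n: int) -> Counter:
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())      # clipped n-gram matches
    p = overlap / max(sum(cand.values()), 1)  # precision: matches / candidate size
    r = overlap / max(sum(ref.values()), 1)   # recall: matches / reference size
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Toy example: the generated diff contains the reference hunk plus an extra line
reference = "+ if (!file || !file->filter)"
candidate = "+ if (!file || !file->filter) + return;"
p, r, f1 = rouge_n(reference, candidate, n=1)
```

As in the reported scores, recall exceeds precision here because the candidate diff includes tokens beyond the reference patch.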
---

## 🧠 Model Configuration

- **Base model**: `CodeLLaMA-7B-Instruct`
- **Fine-tuning**: QLoRA (LoRA r=64, α=16, dropout=0.1)
- **Quantization**: 4-bit NF4
- **Training**: 3 epochs, batch size 64, LR 2e-4
- **Precision**: bfloat16 with gradient checkpointing
- **Hardware**: 1× NVIDIA H200 (144 GB VRAM)

---

## 🗃️ Dataset

- 100,000 samples from Linux kernel Git commits
- Format: JSONL with `"prompt"` and `"completion"` fields
- Content: C code segments + commit messages → Git diffs
- Source: bug-fix commits filtered by keywords such as `fix`, `null`, `race`, `panic`

---

## 🚀 Usage

````python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLLaMA-7b-Instruct-hf")
model = PeftModel.from_pretrained(model, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLLaMA-7b-Instruct-hf")

prompt = '''
Given the following original C code:
```c
if (!file->filter)
    return;
```
Instruction: Fix the null pointer dereference
Return the diff that fixes it:
'''

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_length=512, do_sample=True, temperature=0.1)
fix = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix)
````

---

## 📁 Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset/           # Raw and processed JSONL files
├── dataset_builder/   # Scripts for mining & formatting commits
├── train/             # Training scripts & checkpoints
├── evaluate/          # Evaluation scripts & results
└── requirements.txt   # Dependencies
```

---

## 📈 Metrics

ROUGE values are F1 scores.

| Metric  | Score  |
|---------|--------|
| BLEU    | 33.87  |
| ROUGE-1 | 0.4355 |
| ROUGE-2 | 0.3457 |
| ROUGE-L | 0.3612 |

---

## 🔬 Use Cases

- Kernel patch suggestion tools
- Code review assistants
- Bug localization + repair research
- APR (automated program repair) benchmarks for kernel code

---

## 📄 License

MIT License

---

## 📚 References

- [CodeLLaMA](https://arxiv.org/abs/2308.12950)
- [QLoRA](https://arxiv.org/abs/2305.14314)
- [LoRA](https://arxiv.org/abs/2106.09685)
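To make the dataset format described above concrete, here is a sketch of a single training record as it might appear in one of the JSONL files. The prompt and diff contents are invented for illustration and are not taken from the actual dataset.

```python
import json

# Hypothetical training record: the prompt pairs buggy C code with an
# instruction derived from the commit message, and the completion is the
# Git-style diff that fixes it.
record = {
    "prompt": (
        "Given the following original C code:\n"
        "if (!file->filter)\n"
        "    return;\n"
        "Instruction: Fix the null pointer dereference\n"
        "Return the diff that fixes it:"
    ),
    "completion": (
        "-if (!file->filter)\n"
        "+if (!file || !file->filter)\n"
        "     return;"
    ),
}

# Each sample occupies exactly one line of the .jsonl file.
line = json.dumps(record)
restored = json.loads(line)
```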