---
license: mit
tags:
- codellama
- linux
- bugfix
- lora
- qlora
- git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of `CodeLlama-7B-Instruct`, specialized for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). Given buggy C code and a commit message, the model generates a Git diff patch for the fix.

---

## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

- **Mining real commit data** from the kernel Git history
- **Training a specialized QLoRA model** on diff-style fixes
- **Generating Git patches** in response to bug-prone code
- **Evaluating results** using BLEU, ROUGE, and human inspection

---

## 🔧 Model Configuration

- **Base model**: `CodeLlama-7B-Instruct`
- **Fine-tuning method**: QLoRA with 4-bit quantization
- **Training setup**:
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, learning rate: 2e-4, epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware**: optimized for NVIDIA H200 GPUs
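
As a rough sense of why this setup is lightweight, the trainable LoRA parameter count can be estimated with simple arithmetic. The sketch below assumes adapters on the `q_proj` and `v_proj` attention projections (a common default, not confirmed from this repo's training script):

```python
# Back-of-the-envelope count of trainable LoRA parameters for a
# 7B LLaMA-style model (hidden size 4096, 32 transformer blocks).
# Targeting q_proj and v_proj is an illustrative assumption.
HIDDEN = 4096
LAYERS = 32
R = 64  # LoRA rank used in this project

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

per_layer = 2 * lora_params(HIDDEN, HIDDEN, R)  # q_proj + v_proj
trainable = LAYERS * per_layer
print(f"trainable LoRA params: {trainable:,}")              # 33,554,432
print(f"fraction of a 6.7B base: {trainable / 6.7e9:.2%}")  # 0.50%
```

Under these assumptions only about half a percent of the base model's weights are trained, which combined with 4-bit quantization of the frozen base is what makes fine-tuning fit on a single GPU.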

---

## 📊 Dataset

A custom dataset extracted from the Linux kernel Git history.

### Filtering Criteria

Bug-fix commits whose messages contain keywords such as:
`fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`
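
A commit-message filter along these lines reproduces the criteria (keyword list taken from above; the function name is illustrative, not the extraction script's actual code):

```python
# Keywords from the filtering criteria above.
BUGFIX_KEYWORDS = {
    "fix", "bug", "crash", "memory", "null",
    "panic", "overflow", "race", "corruption",
}

def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message mentions any bug-fix keyword."""
    lowered = message.lower()
    return any(kw in lowered for kw in BUGFIX_KEYWORDS)

print(is_bugfix_commit("mm: fix NULL pointer dereference in slab"))  # True
print(is_bugfix_commit("docs: update maintainers list"))             # False
```

The substring match is deliberately loose (e.g. "prefix" would match "fix"), favoring recall in the mining pass over precision.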

### Structure

- Language: C (`.c`, `.h`)
- Context: 10 lines before/after the change
- Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

* **File**: `training_data_100k.jsonl` (100,000 samples)
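
To make the schema concrete, here is one way a record could be turned into a prompt/completion pair for fine-tuning (the prompt template is an illustrative assumption; `format_for_training.py` may use different wording):

```python
import json

# A minimal record following the schema above.
record_line = json.dumps({
    "input": {
        "original code": "kfree(ptr);\nuse(ptr);",
        "instruction": "Fix use-after-free of ptr",
    },
    "output": {"diff codes": "-use(ptr);\n+/* ptr freed above */"},
})

def to_prompt_completion(line: str) -> tuple[str, str]:
    """Split one JSONL record into (prompt, completion) strings."""
    rec = json.loads(line)
    prompt = (
        f"### Instruction:\n{rec['input']['instruction']}\n\n"
        f"### Buggy code:\n{rec['input']['original code']}\n\n"
        "### Fix (git diff):\n"
    )
    return prompt, rec["output"]["diff codes"]

prompt, completion = to_prompt_completion(record_line)
print(prompt + completion)
```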

---

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

---

## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```

---

## 🧩 Features

* 🧠 **Efficient fine-tuning**: QLoRA + 4-bit quantization = massive memory savings
* 🐧 **Real-world commits**: mined from actual Linux kernel development
* 💡 **Context-aware**: extracts code context around the buggy lines
* 💻 **Output-ready**: generates valid Git-style diffs
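
The context extraction behind the "context-aware" point can be sketched as a simple clamped window (the 10-line radius comes from the Dataset section; the function name is illustrative):

```python
def extract_context(lines: list[str], changed: int, radius: int = 10) -> list[str]:
    """Return the changed line plus up to `radius` lines on each side,
    clamped at file boundaries."""
    start = max(0, changed - radius)
    end = min(len(lines), changed + radius + 1)
    return lines[start:end]

source = [f"line {i}" for i in range(100)]
print(len(extract_context(source, changed=50)))  # 21 lines: 10 + 1 + 10
print(len(extract_context(source, changed=2)))   # 13 lines, clamped at the top
```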

---

## 📈 Evaluation Metrics

* **BLEU**: n-gram match against reference diffs
* **ROUGE**: overlap with the reference fix content
* **Human evaluation**: subjective patch quality
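
The evaluation script presumably uses library implementations of BLEU and ROUGE; as a rough illustration of what token-overlap metrics measure on diffs, here is a unigram F1 (a simplification of ROUGE-1, not the script's actual code):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped token-count overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

exact = unigram_f1("- kfree(ptr); + ptr = NULL;", "- kfree(ptr); + ptr = NULL;")
print(exact)  # 1.0 for an exact match
```

Such overlap scores reward surface similarity to the reference patch, which is why human inspection is kept as a complementary check.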

---

## 🧪 Use Cases

* Automated kernel bug fixing
* Code review assistance
* Teaching and debugging kernel code
* Research in automated program repair (APR)

---

## 🔬 Technical Highlights

### Memory & Speed Optimizations

* 4-bit quantization (NF4)
* Gradient checkpointing
* Mixed precision (bfloat16)
* Gradient accumulation
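
Gradient accumulation is what lets a batch size of 64 fit in memory: several small micro-batches are processed before each optimizer step. The per-device value below is an assumed example; only the target of 64 comes from this card:

```python
# Effective batch = per-device batch * accumulation steps * data-parallel GPUs.
per_device_batch = 4         # assumed for illustration
num_gpus = 1                 # assumed for illustration
target_effective_batch = 64  # from this card's training setup

accum_steps = target_effective_batch // (per_device_batch * num_gpus)
print(accum_steps)  # 16 micro-batches per optimizer step
```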

---

## 🤝 Contributing

1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🎉

---

## 📄 License

MIT License. See the `LICENSE` file for details.

---

## 🙏 Acknowledgments

* Meta for CodeLLaMA
* Hugging Face for Transformers + PEFT
* The Linux kernel community for open access to commit data
* Microsoft for introducing LoRA

---

## 📚 References

* [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
* [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
* [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)