# CodeLLaMA-Linux-BugFix
A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
## 🎯 Project Overview
This project addresses the challenging task of automated Linux kernel bug fixing by:
- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance
## πŸ—οΈ Architecture
### Base Model
- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for H200 GPU with bfloat16 precision
### Training Configuration
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed precision training
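The configuration listed above could be expressed with Hugging Face `peft` and `bitsandbytes` roughly as follows. This is a sketch, not the project's actual training script; in particular the `target_modules` list is an assumption about which attention projections receive LoRA adapters:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bfloat16 compute (standard QLoRA setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings matching the values listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```

These two objects would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.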
## πŸ“Š Dataset
The project creates a specialized dataset from Linux kernel commits:
### Data Extraction Process
1. **Commit Filtering**: Identifies bug-fix commits using keywords:
- `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
- `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
- `security`, `vulnerability`, `exploit`, `buffer`, `stack`
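A minimal keyword filter in the spirit of the list above might look like this. The function name and tokenization details are illustrative, not taken from the project's extraction script:

```python
# Keyword-based heuristic for spotting bug-fix commits
BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix_commit(message: str) -> bool:
    """Return True if a commit message contains any bug-fix keyword."""
    words = message.lower().split()
    return any(word.strip(".,:;!?()[]") in BUGFIX_KEYWORDS for word in words)

print(is_bugfix_commit("mm: fix use-after-free in page allocator"))  # True
```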
2. **Code Context Extraction**:
- Focuses on C and header files (`.c`, `.h`)
- Extracts 10 lines before/after bug location
- Captures relevant code context
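The windowing step can be sketched as a small helper; the function name and the 0-based line indexing are assumptions for illustration:

```python
def extract_context(lines: list[str], bug_line: int, window: int = 10) -> list[str]:
    """Return up to `window` lines before and after a 0-based bug line index,
    clamped to the bounds of the file."""
    start = max(0, bug_line - window)
    end = min(len(lines), bug_line + window + 1)
    return lines[start:end]
```

Near the start or end of a file the window is simply truncated rather than padded.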
3. **Data Format**:
```json
{
"input": {
"original code": "C code snippet with bug",
"instruction": "Bug fix instruction from commit message"
},
"output": {
"diff codes": "Git diff showing the fix"
}
}
```
### Dataset Statistics
- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository
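JSONL stores one record per line, so the dataset can be streamed with the standard `json` module. The sample values below are invented for illustration; only the key names come from the format shown above:

```python
import io
import json

# One training record in the format described above (values are made up)
sample = {
    "input": {
        "original code": "if (!ptr) return;",
        "instruction": "Fix NULL pointer dereference",
    },
    "output": {"diff codes": "-if (!ptr) return;\n+if (!ptr) return -EINVAL;"},
}

# Write and read back a JSONL stream (an in-memory buffer stands in for the file)
buf = io.StringIO()
buf.write(json.dumps(sample) + "\n")
buf.seek(0)
records = [json.loads(line) for line in buf]
print(records[0]["input"]["instruction"])  # Fix NULL pointer dereference
```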
## πŸš€ Quick Start
### Prerequisites
```bash
pip install -r requirements.txt
```
### 1. Build Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Train Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Evaluate Model
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
## πŸ“ Project Structure
```
CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/ # Dataset creation scripts
β”‚ β”œβ”€β”€ extract_linux_bugfixes.py # Main dataset extraction
β”‚ β”œβ”€β”€ extract_linux_bugfixes_parallel.py # Parallelized version
β”‚ └── format_for_training.py
β”œβ”€β”€ dataset/ # Generated datasets
β”‚ β”œβ”€β”€ training_data_100k.jsonl
β”‚ └── training_data_prompt_completion.jsonl
β”œβ”€β”€ train/ # Training scripts and outputs
β”‚ β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py # Main training script
β”‚ β”œβ”€β”€ train_codellama_qlora_simple.py
β”‚ β”œβ”€β”€ download_codellama_model.py
β”‚ └── output/ # Trained model checkpoints
β”œβ”€β”€ evaluate/ # Evaluation scripts and results
β”‚ β”œβ”€β”€ evaluate_linux_bugfix_model.py # Model evaluation
β”‚ β”œβ”€β”€ test_samples.jsonl # Evaluation dataset
β”‚ └── output/ # Evaluation results
└── requirements.txt # Python dependencies
```
## πŸ”§ Key Features
### Efficient Training
- **QLoRA**: Cuts weight memory by roughly 75% (4-bit vs. 16-bit weights) while preserving fine-tuning quality
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training
### Real-world Data
- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format
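Unified diffs in the style of the training targets can be produced with Python's `difflib`; the file names and code lines below are made up for illustration:

```python
import difflib

before = ["int *p = get_buf();", "use(p);"]
after = ["int *p = get_buf();", "if (!p)", "\treturn -ENOMEM;", "use(p);"]

# Git-style unified diff with a/ and b/ path prefixes
diff = difflib.unified_diff(
    before, after,
    fromfile="a/drivers/example.c",
    tofile="b/drivers/example.c",
    lineterm="",
)
print("\n".join(diff))
```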
### Evaluation
- **BLEU Score**: Measures n-gram overlap between generated and reference diffs
- **ROUGE Score**: Measures recall-oriented overlap with the reference fix
- **Comprehensive Metrics**: JSON and CSV output formats
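To make the metrics concrete, here is a minimal token-level n-gram precision in the spirit of BLEU. The evaluation script itself presumably uses a standard implementation (e.g. `nltk` or `sacrebleu`); this hand-rolled sketch only illustrates the idea:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference."""
    def ngrams(tokens: list[str], n: int) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    if not cand:
        return 0.0
    # Clipped counts: each candidate n-gram credits at most its reference count
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())

print(ngram_precision("+ if ( ! ptr ) return ;", "+ if ( ! ptr ) return ;"))  # 1.0
```

Full BLEU additionally combines precisions for n = 1..4 and applies a brevity penalty.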
## 🎯 Use Cases
The fine-tuned model can assist with:
1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques
## πŸ“ˆ Performance
The model is evaluated using:
- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality
## πŸ”¬ Technical Details
### Model Architecture
- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates Git diff format patches
### Training Process
1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning over a 4-bit quantized base model
4. **Checkpointing**: Save model states for evaluation
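Step 2 above implies assembling each record into a single prompt string before tokenization. The template below is an assumption based on CodeLLaMA's `[INST] ... [/INST]` instruction format, not the project's actual one:

```python
from typing import Optional

def build_prompt(original_code: str, instruction: str,
                 diff: Optional[str] = None) -> str:
    """Assemble a prompt; with `diff` given, append the training target."""
    prompt = (
        "[INST] Fix the following Linux kernel bug.\n"
        f"Instruction: {instruction}\n"
        f"Code:\n{original_code}\n[/INST]\n"
    )
    if diff is not None:  # training example: target diff follows the prompt
        prompt += diff
    return prompt
```

At inference time the model is given only the `[INST] ... [/INST]` portion and generates the diff itself.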
### Memory Optimization
- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training
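Gradient accumulation simulates a large batch by summing gradients over several micro-batches before a single optimizer step. A minimal PyTorch sketch with a toy model (all sizes are illustrative, not the project's):

```python
import torch

# Toy model: 4 micro-batches of 16 simulate an effective batch size of 64
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
accum_steps = 4

for step in range(accum_steps):
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads average out
opt.step()  # one optimizer update per 64 samples
```

After `opt.step()` the gradients would be cleared with `opt.zero_grad()` before the next accumulation cycle.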
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## πŸ“„ License
This project is licensed under the MIT License - see the LICENSE file for details.
## πŸ™ Acknowledgments
- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique
## πŸ“š References
- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)