	---
	license: mit
	base_model: microsoft/phi-3-mini-4k-instruct
	tags:
	- llm
	- code-generation
	- bug-fixing
	- lora
	- peft
	- python
	datasets:
	- mbpp
	metrics:
	- exact_match
	- similarity
	---

	# DebugGPT LoRA Adapter for Phi-3 Mini

	A lightweight LoRA adapter fine-tuned on synthetic Python bug-fixing tasks derived from the MBPP dataset. It improves Phi-3 Mini's ability to detect and correct common Python syntax errors while preserving general language capabilities.

	---

	## Model Description

	- Base Model: microsoft/phi-3-mini-4k-instruct
	- Fine-Tuning Method: QLoRA (Low-Rank Adaptation with 4-bit quantization)
	- Task: Automated Python bug fixing

	The model takes buggy Python code as input and generates the corrected version.

	---

	## Intended Use

	This model is designed for:

	- Python debugging assistance
	- Educational coding tools
	- AI-assisted code correction
	- Research experiments in code repair

	### Out-of-Scope Use

	- Production-critical systems
	- Security-sensitive applications
	- Complex multi-file debugging

	---

	## Dataset

	We use the MBPP (Mostly Basic Python Problems) dataset. Since MBPP contains correct code, we generate a bug-fixing dataset by injecting synthetic bugs.

	### Data Format

	Each example follows an instruction-tuning format:

	```json
	{
	  "instruction": "Fix the bug in the following Python code",
	  "input": "<buggy code>",
	  "output": "<correct code>"
	}
	```
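
A record in this format can be assembled and serialized as one JSON line, as in the sketch below. The `make_record` helper and the sample function are illustrative assumptions, not taken from the training code.

```python
import json

INSTRUCTION = "Fix the bug in the following Python code"


def make_record(buggy_code: str, correct_code: str) -> dict:
    """Build one instruction-tuning example (hypothetical helper)."""
    return {
        "instruction": INSTRUCTION,
        "input": buggy_code,
        "output": correct_code,
    }


# Illustrative pair: the buggy version uses `-` where `+` is intended.
correct = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

record = make_record(buggy, correct)
line = json.dumps(record)  # one record per line yields a JSON Lines file
print(line)
```

Writing one such line per example keeps the dataset easy to stream and to load with common dataset tooling.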

	### Bug Injection Strategy

	We introduce controlled bugs such as:

	- Operator replacement (`+` → `-`)
	- Comparison changes (`>` → `<`)
	- Removal of return statements
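
The three bug types listed above could be injected with simple text transformations, roughly as follows. This is a minimal sketch: the function names and the regex-based approach are assumptions for illustration, and the actual injection code may work differently (for example, on the AST).

```python
import re


def swap_operator(code: str) -> str:
    """Replace the first binary `+` with `-`."""
    return re.sub(r" \+ ", " - ", code, count=1)


def flip_comparison(code: str) -> str:
    """Replace the first `>` comparison with `<`."""
    return re.sub(r" > ", " < ", code, count=1)


def drop_return(code: str) -> str:
    """Remove the first line that is a return statement."""
    lines = code.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if re.match(r"\s*return\b", line):
            del lines[i]
            break
    return "".join(lines)


# Illustrative correct solution to corrupt.
correct = (
    "def max_plus_one(a, b):\n"
    "    if a > b:\n"
    "        return a + 1\n"
    "    return b + 1\n"
)

for inject in (swap_operator, flip_comparison, drop_return):
    print(f"--- {inject.__name__} ---")
    print(inject(correct))
```

Note that removing a `return` that is the only statement in a block leaves code that no longer parses, so a real pipeline would likely need to decide whether to keep or filter such samples.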

	### Dataset Size

	| Split      | Samples |
	|------------|---------|
	| Train      | ~374    |
	| Validation | ~90     |
	| Test       | ~500    |

	---

	## Training Procedure

	### Method: QLoRA

	To enable efficient training on limited hardware:

	- Base model loaded in 4-bit precision (NF4)
	- Base weights frozen
	- Only LoRA adapters trained
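
The setup described above might look roughly like this with Transformers and PEFT. It is a sketch under assumptions: the FP16 compute dtype is inferred from the training precision, `device_map="auto"` is a convenience choice, and `load_frozen_4bit_base` is a hypothetical helper name. The heavy libraries are imported inside the function so the settings can be inspected without a GPU stack installed.

```python
# 4-bit NF4 settings taken from the bullets above.
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}


def load_frozen_4bit_base(model_id: str = "microsoft/phi-3-mini-4k-instruct"):
    """Load the base model in 4-bit and prepare it for adapter training."""
    import torch
    from peft import prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # assumption: matches FP16 training
        **BNB_KWARGS,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # assumption: let the library place the weights
    )
    # Freezes the base weights and readies the model for k-bit training.
    return prepare_model_for_kbit_training(model)
```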

	### LoRA Configuration

	| Parameter      | Value                          |
	|----------------|--------------------------------|
	| Rank (r)       | 16                             |
	| Alpha          | 32                             |
	| Dropout        | 0.05                           |
	| Target Modules | q_proj, k_proj, v_proj, o_proj |
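
Expressed with PEFT, the values in the table would translate to something like the sketch below. The `task_type` value and the `attach_lora` helper are assumptions; the rest mirrors the table.

```python
# LoRA hyperparameters from the table above.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: causal language modeling
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scaling factor: {scaling}")


def attach_lora(model):
    """Wrap a loaded base model with trainable LoRA adapters (hypothetical helper)."""
    from peft import LoraConfig, get_peft_model

    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Module names differ between model implementations, so it is worth checking the target list against `model.named_modules()` and confirming the trainable parameter count with `print_trainable_parameters()` before training.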

	### Training Configuration

	| Parameter             | Value |
	|-----------------------|-------|
	| Epochs                | 3     |
	| Learning Rate         | 2e-4  |
	| Batch Size            | 1     |
	| Gradient Accumulation | 8     |
	| Precision             | FP16  |
	| Optimizer             | AdamW |
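
As a quick sanity check on these settings, the effective batch size and the approximate number of optimizer steps can be worked out directly. The sketch assumes a single GPU and the approximate training split size from the Dataset Size table; the exact step count depends on how the trainer handles the final partial accumulation window.

```python
import math

train_samples = 374        # approximate, from the Dataset Size table
per_device_batch_size = 1
gradient_accumulation = 8
epochs = 3

# Gradients from 8 micro-batches of size 1 are accumulated before each update.
effective_batch_size = per_device_batch_size * gradient_accumulation

# Upper bound: counts the last, partially filled accumulation window as a step.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: about {steps_per_epoch}")
print(f"optimizer steps in total: about {total_steps}")
```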

	---

	## Hardware & Frameworks

	- GPU: NVIDIA Tesla T4
	- Frameworks: Hugging Face Transformers, PEFT (LoRA), TRL (SFTTrainer), Weights & Biases

	---

	## Evaluation Results

	### Performance Summary

	| Metric                 | Base Model   | Fine-Tuned Model    |
	|------------------------|--------------|---------------------|
	| Syntax Fix Accuracy    | Low          | Noticeably Higher   |
	| Indentation Correction | Inconsistent | Reliable            |
	| Variable Error Fixing  | Occasional   | Improved            |
	| Complex Logic Bugs     | Limited      | Limited (unchanged) |
	| Instruction Adherence  | Moderate     | High                |

	> Note: Quantitative metrics (e.g., exact match accuracy, CodeBLEU) were not computed due to dataset and tooling constraints.
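
For anyone who wants to add numbers later, the `exact_match` and `similarity` metrics named in the card metadata could be computed along these lines. This is one plausible reading using only the standard library, not the author's evaluation code; the whitespace normalization is an assumption.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Drop surrounding blank lines and trailing whitespace on each line."""
    return "\n".join(line.rstrip() for line in code.strip("\n").splitlines())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def similarity(prediction: str, reference: str) -> float:
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, normalize(prediction), normalize(reference)).ratio()


reference = "for i in range(5):\n    print(i)\n"
good_fix = "for i in range(5):\n    print(i)"
unchanged = "for i in range(5)\n    print(i)\n"

print(exact_match(good_fix, reference), similarity(good_fix, reference))
print(exact_match(unchanged, reference), round(similarity(unchanged, reference), 3))
```

Averaging `exact_match` over the test split gives an accuracy figure, while `similarity` gives partial credit for near misses; neither checks whether the fixed code actually passes the MBPP tests.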

	---

	## Example

	### Input — Buggy Code

	```python
	for i in range(5)
	    print(i)
	```

	### Output — Fixed Code

	```python
	for i in range(5):
	    print(i)
	```
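
A syntax-level fix like this one can be checked mechanically by asking Python to parse the code. The sketch below uses the standard `ast` module; the `parses` helper is a hypothetical name.

```python
import ast


def parses(code: str) -> bool:
    """Return True if the code is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


buggy = "for i in range(5)\n    print(i)\n"   # missing colon
fixed = "for i in range(5):\n    print(i)\n"

print(parses(buggy), parses(fixed))  # False True
```

This only confirms that the output is valid syntax; it says nothing about whether the logic is correct.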

	---

	## Limitations

	- Small dataset size limits generalization
	- Focused primarily on syntax-level bugs
	- Limited performance on complex logical errors
	- Not evaluated on large-scale real-world codebases

	---

	## Discussion

	### What Worked Well

	- QLoRA enabled efficient fine-tuning on limited hardware
	- Significant improvement in syntax correction tasks
	- Strong adherence to instruction format

	### Challenges

	- Limited dataset size
	- Lack of quantitative evaluation metrics
	- Difficulty handling complex multi-line logic bugs

	### Ethical Considerations

	- The model may generate incorrect fixes for complex bugs
	- Should be used as an assistive tool, not a final authority
	- Users should validate outputs before deployment

	---

	## How to Use

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load the base model and its tokenizer
	base_model = AutoModelForCausalLM.from_pretrained(
	    "microsoft/phi-3-mini-4k-instruct"
	)
	tokenizer = AutoTokenizer.from_pretrained(
	    "microsoft/phi-3-mini-4k-instruct"
	)

	# Attach the LoRA adapter weights to the base model
	model = PeftModel.from_pretrained(
	    base_model,
	    "Sud1212/phi3-debug-llm-lora"
	)

	# Buggy snippet: the for statement is missing its colon
	prompt = "Fix the bug:\nfor i in range(5)\n    print(i)"

	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=100)

	# The decoded text contains the prompt followed by the model's completion
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	---

	## Resources

	- GitHub Repository: [Phi3-debugLLM-LoRA](https://github.com/suddhumaddi/Phi3-debugLLM-LoRA)
	- Weights & Biases Dashboard: [W&B Project](https://wandb.ai/suddhumaddi-woxsen-university/huggingface)
	- Dataset (MBPP): [Hugging Face Datasets](https://huggingface.co/datasets/mbpp)

	---

	## Author

	Sudarshan Maddi
	Woxsen University

	---

	## License

	MIT License