	---
	language: en
	license: mit
	tags:
	- code
	- git
	- commit-message
	- qwen2
	- lora
	datasets:
	- bigcode/commitpackft
	---

	# Git Commit Message Generator

	Fine-tuned Qwen-0.5B model for generating professional Git commit messages from code diffs.

	## Model Description

	This model was fine-tuned using LoRA (Low-Rank Adaptation) on the CommitPackFT dataset to generate concise, professional commit messages from git diffs.

	- Base Model: Qwen-0.5B
	- Fine-tuning Method: LoRA (r=16, alpha=32)
	- Training Data: 55K filtered commits from CommitPackFT
	- Languages: Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more

	## Intended Use

	Generate commit messages for staged changes in a Git repository.

	### Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Load model and tokenizer
	model_name = "rajtiwariee/auto-commit"
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

	# Prepare your diff
	diff = """
	Diff:
	File: src/auth.py
	Language: Python

	Old content:
	def login(username, password):
	    user = get_user(username)
	    if user.password == password:
	        return True
	    return False

	New content:
	def login(username, password):
	    user = get_user(username)
	    if user and user.password == password:
	        return True
	    return False
	"""

	# Generate commit message
	prompt = f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"
	inputs = tokenizer(prompt, return_tensors="pt")

	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=30,
	        do_sample=False,  # Deterministic
	        pad_token_id=tokenizer.eos_token_id,
	    )

	message = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(message.split("Commit message:")[-1].strip())
	# Output: "Check for user existence before accessing password"
	```
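
The prompt in the Quick Start follows a fixed layout: a `Diff:` header, `File:` and `Language:` lines, then `Old content:` and `New content:` sections. A small helper can assemble it for arbitrary files. This is a minimal sketch, assuming the layout shown above is what the model expects; the function name `build_prompt` and its parameters are hypothetical and not part of the released code.

```python
def build_prompt(path: str, language: str, old: str, new: str) -> str:
    """Assemble a prompt in the layout used by the Quick Start example.

    Assumption: the field order and labels mirror the example above;
    the exact training-time template is not documented in this card.
    """
    diff = (
        "\nDiff:\n"
        f"File: {path}\n"
        f"Language: {language}\n\n"
        f"Old content:\n{old.rstrip()}\n\n"
        f"New content:\n{new.rstrip()}\n"
    )
    return f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"


# Hypothetical before/after snippets, for illustration only.
prompt = build_prompt(
    "src/auth.py",
    "Python",
    "def login(username, password):\n    return check(username, password)\n",
    "def login(username, password):\n    return bool(check(username, password))\n",
)
print(prompt)
```

The returned string can be passed to `tokenizer(prompt, return_tensors="pt")` exactly as in the Quick Start.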

	### CLI Tool

	For easier usage, install the companion CLI tool from the [GitHub repository](https://github.com/rajtiwariee/GitCommitGenerator):

	```bash
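	# An editable install needs a local checkout of the repository
	git clone https://github.com/rajtiwariee/GitCommitGenerator
	cd GitCommitGenerator
	pip install -e .
	commit-gen generate --commit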
	```

	## Training Details

	### Training Data

	- Dataset: CommitPackFT (filtered subset)
	- Training samples: 55,730
	- Validation samples: 6,966
	- Test samples: 6,967
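
The counts above correspond to roughly an 80/10/10 split. A quick check of the proportions, using only the numbers listed in this section:

```python
# Sample counts taken from the list above.
splits = {"train": 55_730, "validation": 6_966, "test": 6_967}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {splits[name]:>6}  {frac:.1%}")
print(f"{'total':<10} {total:>6}")
```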

	### Training Procedure

	- Epochs: 3
	- Batch Size: 4 (effective batch size: 32 with gradient accumulation)
	- Learning Rate: 5e-5
	- Optimizer: AdamW
	- LoRA Config:
	  - r: 16
	  - alpha: 32
	  - dropout: 0.05
	  - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
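
A few quantities follow directly from the settings above. The sketch below derives them, assuming a single GPU, a final partial batch that still triggers an optimizer step, and the standard LoRA scaling of `alpha / r`. The layer dimensions passed to `lora_params` are hypothetical and chosen only to illustrate the formula; they are not taken from the base model.

```python
import math

# Values from the lists above.
train_samples = 55_730
per_device_batch = 4
effective_batch = 32
epochs = 3
lora_r, lora_alpha = 16, 32

# Micro-batches accumulated before each optimizer step.
accumulation_steps = effective_batch // per_device_batch

# Assumption: single GPU, and the last partial batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Standard LoRA scales the low-rank update B @ A by alpha / r.
lora_scale = lora_alpha / lora_r


def lora_params(d_in: int, d_out: int, r: int = lora_r) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


print("accumulation steps:", accumulation_steps)
print("optimizer steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
print("LoRA scale:", lora_scale)
# Hypothetical 1024 x 1024 projection, for illustration only.
print("adapter params for a 1024x1024 matrix:", lora_params(1024, 1024))
```

With a per-step batch of 4 and an effective batch of 32, gradients are accumulated over 8 micro-batches before each update.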

	### Hardware

	- GPU: NVIDIA Tesla T4 (16GB)
	- Precision: Mixed Precision (FP32 weights + FP16 compute)
	- Training Time: ~7.5 hours

	## Evaluation Results

	- BLEU Score: 0.0244
	- ROUGE-1: 0.1968
	- ROUGE-2: 0.0420
	- ROUGE-L: 0.1816
	- Exact Match Rate: 0.00%
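
A 0.00% exact match rate alongside a non-zero ROUGE-1 is expected when predictions paraphrase the reference instead of reproducing it. The sketch below shows the idea with a simplified unigram-overlap F1. This card does not state which metric implementation produced the numbers above, and standard ROUGE tooling applies its own tokenization and optional stemming, so treat this as an illustration only. The reference string is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the two messages are identical after trimming."""
    return prediction.strip() == reference.strip()


pred = "Check for user existence before accessing password"
ref = "Check that user exists before comparing password"  # hypothetical reference

score = rouge1_f1(pred, ref)
print(f"ROUGE-1 F1 (simplified): {score:.4f}")
print("Exact match:", exact_match(pred, ref))
```

Here four of the seven words overlap, so the simplified score is 4/7 even though the messages are not an exact match.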


	## Limitations

	- The model is trained primarily on English commit messages
	- Best suited for code changes in common programming languages
	- May not handle large diffs (more than 384 tokens) well
	- Generated messages should be reviewed before committing
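
One way to stay within the 384-token range mentioned above is to trim a diff before building the prompt. The sketch below drops trailing lines until a caller-supplied token counter reports a length within budget. The helper name `fit_to_budget` and the whitespace-based counter are assumptions for illustration; words are not model tokens, so a real counter based on the model's tokenizer should be substituted.

```python
from typing import Callable

MAX_DIFF_TOKENS = 384  # limit mentioned in the Limitations list


def fit_to_budget(
    diff: str,
    count_tokens: Callable[[str], int],
    budget: int = MAX_DIFF_TOKENS,
) -> str:
    """Drop trailing lines until the diff fits the token budget."""
    lines = diff.splitlines()
    while lines and count_tokens("\n".join(lines)) > budget:
        lines.pop()
    return "\n".join(lines)


# Stand-in counter for illustration; whitespace words are not model tokens.
def word_count(text: str) -> int:
    return len(text.split())


long_diff = "\n".join(f"line {i} changed" for i in range(200))
trimmed = fit_to_budget(long_diff, word_count)
print(word_count(long_diff), "->", word_count(trimmed))
```

With the tokenizer from the Quick Start, the counter could be `lambda s: len(tokenizer(s)["input_ids"])`.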

	## Ethical Considerations

	This model is intended to assist developers in writing commit messages, not replace human judgment. Users should:
	- Review generated messages for accuracy
	- Ensure messages accurately describe the changes
	- Follow their team's commit message conventions

	## Citation

	```bibtex
	@misc{git-commit-generator,
	  author = {Raj Tiwari},
	  title = {Git Commit Message Generator},
	  year = {2024},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/rajtiwariee/auto-commit}},
	}
	```

	## License

	MIT License