---
language: en
license: mit
tags:
- code
- git
- commit-message
- qwen2
- lora
datasets:
- bigcode/commitpackft
---

# Git Commit Message Generator

A fine-tuned Qwen-0.5B model that generates professional Git commit messages from code diffs.

## Model Description

This model was fine-tuned with LoRA (Low-Rank Adaptation) on the CommitPackFT dataset to generate concise, professional commit messages from Git diffs.

- **Base Model**: Qwen-0.5B
- **Fine-tuning Method**: LoRA (r=16, alpha=32)
- **Training Data**: 55,730 filtered commits from CommitPackFT
- **Languages**: Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more

## Intended Use

Generate commit messages for staged changes in a Git repository.

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "rajtiwariee/auto-commit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Prepare your diff
diff = """
Diff:
File: src/auth.py
Language: Python

Old content:
def login(username, password):
    user = get_user(username)
    if user.password == password:
        return True
    return False

New content:
def login(username, password):
    user = get_user(username)
    if user and user.password == password:
        return True
    return False
"""

# Generate commit message
prompt = f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=False,  # Deterministic
        pad_token_id=tokenizer.eos_token_id,
    )

# The decoded text includes the prompt, so keep only what follows the marker
message = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(message.split("Commit message:")[-1].strip())
# Output: "Check for user existence before accessing password"
```
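
The prompt in the Quick Start example follows a fixed layout: a `Diff:` header, the file path, the language, then the old and new file contents. A minimal sketch of a helper that assembles the same layout from its parts is shown here. The function name `build_prompt` and its parameters are hypothetical, and the layout is inferred from the example above; whether the training data used exactly this layout is an assumption.

```python
def build_prompt(path: str, language: str, old: str, new: str) -> str:
    """Assemble a prompt in the layout used by the Quick Start example.

    Hypothetical helper: the layout is copied from the example, not from
    the training code.
    """
    diff = (
        "\nDiff:\n"
        f"File: {path}\n"
        f"Language: {language}\n"
        "\n"
        f"Old content:\n{old.rstrip()}\n"
        "\n"
        f"New content:\n{new.rstrip()}\n"
    )
    return f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"


old_src = "def login(username, password):\n    user = get_user(username)\n"
new_src = "def login(username, password):\n    user = get_user(username)\n    if user is None:\n        return False\n"

prompt = build_prompt("src/auth.py", "Python", old_src, new_src)
print(prompt)
```

The returned string can be passed to the tokenizer in place of the hand-written `prompt` above.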

### CLI Tool

For easier usage, clone the [GitHub repository](https://github.com/rajtiwariee/GitCommitGenerator) and install the companion CLI tool from it:

```bash
git clone https://github.com/rajtiwariee/GitCommitGenerator
cd GitCommitGenerator
pip install -e .
commit-gen generate --commit
```

## Training Details

### Training Data

- **Dataset**: CommitPackFT (filtered subset)
- **Training samples**: 55,730
- **Validation samples**: 6,966
- **Test samples**: 6,967
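
As a quick check on the counts above, the three splits add up to 69,663 examples, which works out to roughly an 80/10/10 split. The arithmetic is sketched here; how the split was drawn (randomly, by repository, or otherwise) is not stated in this card.

```python
# Split sizes copied from the list above
splits = {"train": 55_730, "validation": 6_966, "test": 6_967}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```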

### Training Procedure

- **Epochs**: 3
- **Batch Size**: 4 (effective batch size: 32 with gradient accumulation)
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **LoRA Config**:
  - r: 16
  - alpha: 32
  - dropout: 0.05
  - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
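
The hyperparameters above can be written down as a PEFT configuration, and a few derived quantities follow from them directly. The sketch below is not the original training script. The `bias` and `task_type` values are assumptions that this card does not state, the derived numbers assume a single GPU, and `build_lora_config` is a hypothetical helper that needs the `peft` package installed.

```python
import math

# LoRA settings copied from the list above
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    # Assumptions, not stated in this card:
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

PER_DEVICE_BATCH = 4
EFFECTIVE_BATCH = 32
TRAIN_SAMPLES = 55_730

# Micro-batches accumulated before each optimizer update (single GPU assumed)
grad_accum_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH

# Approximate optimizer updates per epoch; the exact count depends on how
# the trainer handles the last partial batch
updates_per_epoch = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH)

# Standard LoRA scales the low-rank update by alpha / r
lora_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def build_lora_config():
    """Hypothetical helper: build a peft.LoraConfig from the values above."""
    from peft import LoraConfig  # imported lazily so the arithmetic runs without peft

    return LoraConfig(**LORA_KWARGS)


print(grad_accum_steps, updates_per_epoch, lora_scale)
```

With a batch size of 4 and an effective batch size of 32, each optimizer update accumulates gradients over 8 micro-batches, which is what lets the run fit on a 16GB GPU.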

### Hardware

- **GPU**: NVIDIA Tesla T4 (16GB)
- **Precision**: Mixed Precision (FP32 weights + FP16 compute)
- **Training Time**: ~7.5 hours

## Evaluation Results

- **BLEU Score**: 0.0244
- **ROUGE-1**: 0.1968
- **ROUGE-2**: 0.0420
- **ROUGE-L**: 0.1816
- **Exact Match Rate**: 0.00%
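
To give a sense of what these overlap metrics measure, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) and exact match in plain Python. It is an illustration only: this card does not say which metric implementation, tokenization, or normalization produced the scores above, and the reference message in the example is made up.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection counts each shared token at most min(ref, cand) times
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def exact_match(reference: str, candidate: str) -> bool:
    """True only when the two messages are identical after trimming whitespace."""
    return reference.strip() == candidate.strip()


reference = "Fix null check in login"  # hypothetical human-written message
candidate = "Check for user existence before accessing password"

print(round(rouge1_f1(reference, candidate), 3))  # 0.167
print(exact_match(reference, candidate))          # False
```

The example shows why exact match can be 0% while ROUGE scores stay above zero: a generated message can describe the same change in different words and share only a few tokens with the reference.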

## Limitations

- The model was trained primarily on English commit messages
- It is best suited to code changes in common programming languages
- Large diffs (more than 384 tokens) may be handled poorly
- Generated messages should be reviewed before committing
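
One way to work within the length limit is to count tokens before generation and truncate long diffs. The sketch below is a hedged illustration: `truncate_diff` is a hypothetical helper, and applying the 384-token budget to the diff alone (rather than to the whole prompt) is an assumption. A whitespace tokenizer stands in for the real one so the example runs without downloading the model.

```python
from typing import Callable, List, Tuple

MAX_DIFF_TOKENS = 384  # threshold quoted in the list above


def truncate_diff(
    diff: str,
    encode: Callable[[str], List],
    decode: Callable[[List], str],
    max_tokens: int = MAX_DIFF_TOKENS,
) -> Tuple[str, bool]:
    """Return (diff, was_truncated), keeping at most max_tokens tokens."""
    tokens = encode(diff)
    if len(tokens) <= max_tokens:
        return diff, False
    return decode(tokens[:max_tokens]), True


# Stand-in tokenizer for the demo: split on whitespace, join with spaces
long_diff = " ".join(f"line{i}" for i in range(500))
short_diff, was_truncated = truncate_diff(long_diff, str.split, " ".join)

print(was_truncated, len(short_diff.split()))
```

With the real tokenizer from the Quick Start example, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Truncation drops the tail of the diff, so the generated message may miss changes near the end of the file.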

## Ethical Considerations

This model is intended to assist developers in writing commit messages, not to replace human judgment. Users should:
- Review generated messages and confirm that they accurately describe the changes
- Follow their team's commit message conventions

## Citation

```bibtex
@misc{git-commit-generator,
  author = {Raj Tiwari},
  title = {Git Commit Message Generator},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rajtiwariee/auto-commit}},
}
```

## License

MIT License