---
language: en
license: mit
tags:
- code
- git
- commit-message
- qwen2
- lora
datasets:
- bigcode/commitpackft
---

# Git Commit Message Generator

A fine-tuned Qwen-0.5B model that generates professional Git commit messages from code diffs.

## Model Description

This model was fine-tuned with LoRA (Low-Rank Adaptation) on the CommitPackFT dataset to generate concise, professional commit messages from git diffs.

- **Base Model**: Qwen-0.5B
- **Fine-tuning Method**: LoRA (r=16, alpha=32)
- **Training Data**: 55K filtered commits from CommitPackFT
- **Languages**: Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more

## Intended Use

Generate commit messages for staged changes in a Git repository.

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "rajtiwariee/auto-commit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Prepare your diff
diff = """
Diff:
File: src/auth.py
Language: Python
Old content:
def login(username, password):
    user = get_user(username)
    if user.password == password:
        return True
    return False
New content:
def login(username, password):
    user = get_user(username)
    if user and user.password == password:
        return True
    return False
"""

# Generate commit message
prompt = f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=False,  # Greedy decoding, so the output is deterministic
        pad_token_id=tokenizer.eos_token_id,
    )

# The decoded text includes the prompt, so keep only what follows the marker
message = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(message.split("Commit message:")[-1].strip())
# Output: "Check for user existence before accessing password"
```

### CLI Tool

For easier usage, install the companion CLI tool from the [GitHub repository](https://github.com/rajtiwariee/GitCommitGenerator):

```bash
# Run from the root of a local clone of the repository
pip install -e .
commit-gen generate --commit
```

## Training Details

### Training Data

- **Dataset**: CommitPackFT (filtered subset)
- **Training samples**: 55,730
- **Validation samples**: 6,966
- **Test samples**: 6,967

### Training Procedure

- **Epochs**: 3
- **Batch Size**: 4 (effective batch size: 32 with gradient accumulation)
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW
- **LoRA Config**:
  - r: 16
  - alpha: 32
  - dropout: 0.05
  - target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

### Hardware

- **GPU**: NVIDIA Tesla T4 (16 GB)
- **Precision**: Mixed Precision (FP32 weights + FP16 compute)
- **Training Time**: ~7.5 hours

## Evaluation Results

- **BLEU Score**: 0.0244
- **ROUGE-1**: 0.1968
- **ROUGE-2**: 0.0420
- **ROUGE-L**: 0.1816
- **Exact Match Rate**: 0.00%

## Limitations

- The model is trained primarily on English commit messages
- Best suited for code changes in common programming languages
- May not handle large diffs well (more than 384 tokens)
- Generated messages should be reviewed before committing

## Ethical Considerations

This model is intended to assist developers in writing commit messages, not to replace human judgment. Users should:

- Review generated messages for accuracy
- Ensure messages accurately describe the changes
- Follow their team's commit message conventions

## Citation

```bibtex
@misc{git-commit-generator,
  author = {Raj Tiwari},
  title = {Git Commit Message Generator},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rajtiwariee/auto-commit}},
}
```

## License

MIT License
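
## Appendix: Prompt Helper Sketch

The Quick Start example assembles its prompt by hand and recovers the message by splitting on the `Commit message:` marker. The sketch below wraps those two steps in small functions. The names `build_prompt` and `extract_message` are hypothetical and not part of this model or its CLI, and the exact whitespace used during training is not documented here, so treat the layout as an approximation of the Quick Start prompt rather than a specification.

```python
def build_prompt(path: str, language: str, old: str, new: str) -> str:
    """Assemble a prompt that mirrors the Quick Start layout (assumed format)."""
    diff = (
        "Diff:\n"
        f"File: {path}\n"
        f"Language: {language}\n"
        # Strip only surrounding newlines so code indentation is preserved.
        "Old content:\n" + old.strip("\n") + "\n"
        "New content:\n" + new.strip("\n") + "\n"
    )
    return f"Write a git commit message:\n\n{diff}\n\nCommit message:\n"


def extract_message(generated: str) -> str:
    """Return the first non-empty line after the last 'Commit message:' marker."""
    tail = generated.split("Commit message:")[-1]
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    prompt = build_prompt("src/auth.py", "Python", "a = 1\n", "a = 2\n")
    # Simulate decoded model output: the prompt followed by generated text.
    print(extract_message(prompt + "Update value of a\n"))
```

Keeping only the first non-empty line is a design choice for this sketch: with `max_new_tokens=30` the model may continue past the subject line, and a single line is usually what a commit subject needs.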