---
base_model: Qwen/Qwen2.5-Coder-0.5B
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- transformers
- qlora
- commit-message-generation
- code-summarization
- generated_from_trainer
license: cc-by-nc-4.0
datasets:
- Maxscha/commitbench
language:
- en
---
# QLoRA Adapter for Commit Message Generation
Fine-tuned LoRA adapter for **Qwen2.5-Coder-0.5B** that generates clear, concise Git commit messages from code diffs.
### Model Description
This model is a **QLoRA (4-bit quantized LoRA)** adapter trained on the Qwen2.5-Coder-0.5B base model to automatically generate commit messages from Git diffs. The adapter learns to summarize code changes into human-readable descriptions, understanding programming patterns and translating technical modifications into natural language.
**Key characteristics:**
- Uses the **PT (Pretrained/Base)** version of Qwen2.5-Coder for cleaner, more controllable outputs
- Trained with 4-bit NF4 quantization for efficient fine-tuning on consumer hardware
- Only the LoRA adapter weights are included (a few MB); the base model is required for inference
- Optimized for diff-to-message generation, not chat or instruction following
- **Developed by:** Mamoun Yosef
- **Model type:** Causal Language Model (Decoder-only Transformer) with LoRA adapters
- **Language(s):** English
- **License:** CC BY-NC 4.0 (non-commercial for this trained adapter)
- **Base model license:** Apache 2.0 (`Qwen/Qwen2.5-Coder-0.5B`)
- **Finetuned from model:** Qwen/Qwen2.5-Coder-0.5B
### Model Sources
- **Repository:** [commit-message-llm](https://github.com/mamounyosef/commit-message-llm)
- **Base Model:** [Qwen/Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
## License and Usage
- This adapter was trained using **CommitBench** (`Maxscha/commitbench`), licensed **CC BY-NC 4.0**.
- This trained adapter is therefore licensed for **non-commercial use only**.
- The base model (`Qwen/Qwen2.5-Coder-0.5B`) remains licensed under **Apache-2.0**.
## Uses
### Direct Use
This adapter is designed for **automated commit message generation** from Git diffs. It can be used to:
- Generate commit messages for staged changes in Git repositories
- Suggest descriptive summaries for code modifications
- Automate documentation of code changes in CI/CD pipelines
- Assist developers in writing clear, consistent commit messages
**Example input (Git diff):**
```diff
diff --git a/src/utils.py b/src/utils.py
index abc123..def456 100644
--- a/src/utils.py
+++ b/src/utils.py
@@ -10,6 +10,9 @@ def process_data(data):
     return result
+def validate_input(data):
+    return data is not None and len(data) > 0
+
 def save_output(output, filename):
```
**Example output:**
```
Add input validation function
```
### Downstream Use
Can be integrated into:
- Git hooks (pre-commit, commit-msg)
- IDE extensions for code editors
- Code review tools
- Developer productivity applications
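
As a rough sketch of how such an integration might prepare its input, the helper below reads the staged diff and builds a prompt in the `diff + "\n\nCommit message:\n"` form used in the getting-started example of this card. The names `staged_diff` and `build_prompt` are hypothetical, and the 50 to 8000 character bounds are borrowed from the training-data filter described in this card; the model itself does not enforce them.

```python
import subprocess

SEPARATOR = "\n\nCommit message:\n"  # same separator as the getting-started example
MIN_CHARS, MAX_CHARS = 50, 8000      # diff length range used to filter the training data


def staged_diff() -> str:
    """Return the diff of staged changes (assumes `git` is on PATH and cwd is inside a repo)."""
    result = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    )
    return result.stdout


def build_prompt(diff: str):
    """Normalize newlines and build a prompt, or return None if the diff is out of range."""
    diff = diff.replace("\r\n", "\n")
    if not (MIN_CHARS <= len(diff) <= MAX_CHARS):
        return None  # too short or too long: leave the message to the developer
    return diff + SEPARATOR
```

A hook or editor extension could call `build_prompt(staged_diff())`, pass the result to the model, and offer the output as a suggestion for the developer to edit.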
### Out-of-Scope Use
**Not suitable for:**
- General text generation or chat
- Generating code from descriptions (reverse direction)
- Diffs of non-code content
- Extremely large diffs (>8000 characters)
- Commit messages requiring deep domain knowledge beyond code structure
- Commercial usage of this trained adapter
## Bias, Risks, and Limitations
**Limitations:**
- Trained only on English commit messages
- May struggle with very complex multi-file changes
- Trained only on diffs of 50 to 8000 characters, with sequences truncated to 512 tokens
- Performance depends on code quality and diff clarity
- May generate generic messages for trivial changes
- Does not understand business context or domain-specific terminology
**Risks:**
- Generated messages may not capture the full intent of changes
- Should be reviewed by developers before committing
- May miss important security or breaking change implications
### Recommendations
- Always review generated commit messages before use
- Use as a suggestion tool, not as a fully automated solution
- Combine with manual editing for complex changes
- Test on your codebase to evaluate quality
## How to Get Started with the Model
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the base model in 4-bit (NF4)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-0.5B",
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Attach the LoRA adapter and load the base model's tokenizer
model = PeftModel.from_pretrained(base_model, "mamounyosef/commit-message-llm")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B")

# Build the prompt: the diff followed by the separator
diff = """diff --git a/file.py b/file.py
--- a/file.py
+++ b/file.py
@@ -1,3 +1,4 @@
+import os
 def main():
     print("Hello")
"""
prompt = diff + "\n\nCommit message:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; generation stops at EOS or after 30 new tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,
    num_beams=1,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens instead of slicing the decoded
# string by len(prompt), which breaks if decoding does not round-trip exactly
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
message = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(message)
```
## Training Details
### Training Data
**Dataset:** [Maxscha/commitbench](https://huggingface.co/datasets/Maxscha/commitbench)
**Preprocessing:**
- Removed trivial messages (fix, update, wip, etc.)
- Filtered out reference-only commits (fix #123)
- Removed placeholder tokens (`<HASH>`, `<URL>`)
- Kept diffs between 50-8000 characters
- Required messages with semantic content (>=3 words)
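
The filters above could be sketched roughly as follows. This is an illustration, not the author's preprocessing script: the trivial-word set contains only the three words the card names, the reference-only pattern is a guess at what "fix #123" style messages look like, and placeholder tokens are assumed to be stripped from messages rather than causing the sample to be dropped.

```python
import re

TRIVIAL = {"fix", "update", "wip"}  # the card lists these plus "etc."; the full list is not given
PLACEHOLDER = re.compile(r"<HASH>|<URL>")
REFERENCE_ONLY = re.compile(r"^(?:\w+\s+)?#\d+\.?$")  # hypothetical pattern, e.g. "fix #123"


def clean_message(message: str) -> str:
    """Strip placeholder tokens and collapse the whitespace they leave behind."""
    return " ".join(PLACEHOLDER.sub(" ", message).split())


def keep_sample(diff: str, message: str) -> bool:
    """Return True if a (diff, message) pair passes all of the listed filters."""
    msg = clean_message(message)
    if msg.lower().rstrip(".") in TRIVIAL:
        return False  # trivial one-word message
    if REFERENCE_ONLY.match(msg):
        return False  # only an issue reference
    if not (50 <= len(diff) <= 8000):
        return False  # diff outside the kept length range
    return len(msg.split()) >= 3  # require at least three words
```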
**Final dataset sizes:**
- Training: 120,000 samples
- Validation: 15,000 samples
- Test: 15,000 samples
### Training Procedure
**Format:**
```
{diff content}
Commit message:
{target message}<eos>
```
Prompt tokens (diff + separator) are masked with the label `-100`, so the loss is computed only on the commit message tokens.
#### Preprocessing
1. Normalize newlines (CRLF -> LF)
2. Tokenize diff + separator + message
3. Mask prompt labels to `-100`
4. Truncate to `max_length=512` tokens
5. Append EOS token to target
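
A minimal sketch of the masking step, operating on already-tokenized id lists so it runs without a tokenizer. The function name `build_example` is hypothetical, and one detail is an assumption: when the pair exceeds `max_length`, this sketch shortens the diff so that the separator, target, and EOS still fit. The card does not say which side the actual pipeline truncates.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(diff_ids, sep_ids, target_ids, eos_id, max_length=512):
    """Build input_ids and labels with the prompt (diff + separator) masked out."""
    target = list(target_ids) + [eos_id]
    # Assumption: shorten the diff, not the target, when the sequence is too long.
    room = max(max_length - len(sep_ids) - len(target), 0)
    prompt = list(diff_ids)[:room] + list(sep_ids)
    input_ids = (prompt + target)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt) + target)[:max_length]
    return {"input_ids": input_ids, "labels": labels}
```

With this layout, every position belonging to the diff or separator contributes nothing to the loss, while each commit message token (and the final EOS) is a training target.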
#### Training Hyperparameters
**QLoRA Configuration:**
- Quantization: 4-bit NF4
- Compute dtype: bfloat16
- LoRA rank (r): 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
**Training Parameters:**
- Max sequence length: 512 tokens
- Per-device train batch size: 6
- Per-device eval batch size: 6
- Gradient accumulation steps: 8
- **Effective batch size: 48**
- Learning rate: 1.8e-4
- LR scheduler: Cosine with 4% warmup
- Total training steps: 6000
- Epochs: ~2
- Optimizer: paged_adamw_8bit
- Gradient clipping: 1.0
- **Training regime:** bf16 mixed precision
**Memory Optimizations:**
- Gradient checkpointing enabled
- SDPA (Scaled Dot-Product Attention) for efficient attention
- 8-bit paged optimizer
- Group by length for efficient batching
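
For readers who want to reproduce a similar setup, the settings above map approximately onto `peft.LoraConfig` and `transformers.TrainingArguments` keyword arguments as sketched below (argument names as in Transformers 4.x). The `task_type` value, the mapping of "4% warmup" to `warmup_ratio=0.04`, and the `build_configs` helper are assumptions, not taken from the author's training script. SDPA attention is normally selected when loading the model (`attn_implementation="sdpa"` in `from_pretrained`), so it does not appear here.

```python
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: not stated in the card
)

TRAINING_KWARGS = dict(
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    gradient_accumulation_steps=8,
    learning_rate=1.8e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.04,  # assumption: "4% warmup" expressed as a ratio
    max_steps=6000,
    optim="paged_adamw_8bit",
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    group_by_length=True,
)


def build_configs(output_dir="outputs"):
    """Build the config objects; imports are deferred so the dicts above work standalone."""
    from peft import LoraConfig
    from transformers import TrainingArguments

    return LoraConfig(**LORA_KWARGS), TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


# Derived quantities, assuming a single GPU and the 120,000-sample training split.
effective_batch = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
samples_seen = effective_batch * TRAINING_KWARGS["max_steps"]
epochs = samples_seen / 120_000
print(effective_batch, samples_seen, epochs)
```

The arithmetic gives an effective batch size of 48 and 288,000 samples over 6000 steps, or about 2.4 passes over the training split if every step uses a full batch.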
#### Speeds, Sizes, Times
- **Hardware:** NVIDIA RTX 4060 (8GB VRAM)
- **Total training time:** ~13 hours
- **Checkpoint size:** a few MB (LoRA adapters only)
- **Peak VRAM usage:** <8GB
- **Training throughput:** ~2500 samples/hour
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
**Test split from Maxscha/commitbench:**
- 15,000 cleaned samples
- Same preprocessing as training data
- No overlap with training/validation sets
#### Metrics
- **Loss:** Cross-entropy loss on commit message tokens
- **Perplexity:** exp(loss), the exponential of the mean per-token cross-entropy
  - Lower perplexity = better prediction quality
  - Perplexity ~17 is strong for this task
### Results
| Split | Loss | Perplexity |
|-------|------|------------|
| Validation | 2.8583 | 17.43 |
| Test | 2.8501 | 17.29 |
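
The perplexity column follows directly from the loss column via perplexity = exp(loss). The short check below recomputes it from the reported losses; the helper name `perplexity` is only for illustration.

```python
import math

# (loss, reported perplexity) pairs copied from the results table
reported = {"validation": (2.8583, 17.43), "test": (2.8501, 17.29)}


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy loss."""
    return math.exp(mean_loss)


for split, (loss, ppl) in reported.items():
    print(f"{split}: exp({loss}) = {perplexity(loss):.2f} (reported {ppl})")
```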
**Qualitative Example:**
```diff
diff --git a/src/client/core/commands/menu.js
+ 'core/settings'
+], function (_, hr, MenubarView, box, panels, tabs, session, localfs, settings) {
+ }).menuSection({
+ 'id': "themes.settings",
+ 'title': "Settings",
+ 'action': function() {
+ settings.open("themes"...
```
- **Ground truth:** Add command to open themes settings in view menu
- **Model output:** Add theme settings to the menu
The model output captures the main purpose of the change (adding theme settings to a menu) in a concise form, though it omits that the change adds a command to the view menu, as the ground truth states.
## Environmental Impact
- **Hardware Type:** NVIDIA RTX 4060 (8GB VRAM)
- **Hours used:** ~13 hours
- **Cloud Provider:** N/A (local training)
- **Compute Region:** N/A
- **Carbon Emitted:** Not quantified; expected to be small (single consumer GPU, ~13 hours of training)
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** Qwen2.5-Coder-0.5B (Decoder-only Transformer)
- **Adapter Type:** LoRA (Low-Rank Adaptation)
- **Objective:** Causal language modeling with masked prompts
- **Loss Function:** Cross-entropy on commit message tokens only
### Compute Infrastructure
#### Hardware
- GPU: NVIDIA RTX 4060
- VRAM: 8GB
- System RAM: 16GB
- Storage: SSD recommended for dataset loading
#### Software
- **Framework:** PyTorch, Hugging Face Transformers
- **PEFT Version:** 0.18.1
- **Key Libraries:**
  - `transformers` (model loading, training)
  - `peft` (LoRA adapters)
  - `bitsandbytes` (4-bit quantization)
  - `datasets` (data loading)
  - `torch` (deep learning backend)
## Model Card Authors
Mamoun Yosef
### Framework Versions
- PEFT 0.18.1
- Transformers 4.x
- PyTorch 2.x
- bitsandbytes 0.x