---
base_model: unsloth/LFM2-350M-unsloth-bnb-4bit
library_name: peft
pipeline_tag: text-generation
tags:
  - "base_model:adapter:unsloth/LFM2-350M-unsloth-bnb-4bit"
  - lora
  - qlora
  - sft
  - transformers
  - trl
  - conventional-commits
  - code
---


# lfm2_350m_commit_diff_summarizer (LoRA)

A lightweight **helper model** that turns Git diffs into **Conventional Commit–style** messages.
It outputs **strict JSON** with a short `title` (≤ 65 chars) and up to 3 `bullets`, so your CLI/agents can parse it deterministically.
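
For illustration, a typical output has this shape (the diff and message here are hypothetical; only the two-field schema is guaranteed):

```python
import json

# Hypothetical model output for a small parser fix (illustrative only)
raw = (
    '{"title": "fix(parser): handle empty diff hunks", '
    '"bullets": ["guard against zero-length hunks", "add a regression test"]}'
)

obj = json.loads(raw)
assert len(obj["title"]) <= 65        # title length constraint
assert 0 <= len(obj["bullets"]) <= 3  # at most three bullets
print(obj["title"])
```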

## Model Details

### Model Description

* **Purpose:** Summarize `git diff` patches into concise, Conventional Commit–compliant titles with optional bullets.
* **I/O format:**

  * **Input:** prompt containing the diff (plain text).
  * **Output:** JSON object: `{"title": "...", "bullets": ["...", "..."]}`.
* **Developed by:** Ethan (HF: `ethanke`)
* **Shared by:** Ethan (HF: `ethanke`)
* **Model type:** LoRA adapter for causal LM (text generation)
* **Language(s):** English (commit message conventions)
* **License:** Inherits base model’s license; dataset has **non-commercial** terms (see **Training Data**). Review before production/commercial use.
* **Finetuned from:** `unsloth/LFM2-350M-unsloth-bnb-4bit` (4-bit quantized base, trained with QLoRA)

### Model Sources

* **Repository:** This model card + adapter on the Hub under `ethanke/lfm2_350m_commit_diff_summarizer`

## Uses

### Direct Use

* Convert patch diffs into Conventional Commit messages for PR titles, commits, and changelogs.
* Provide human-readable summaries in agent UIs with guaranteed JSON structure.

### Downstream Use

* Plug into CI to auto-suggest commit titles after tests pass.
* Use as a **helper** in a larger agent system (router/planner stays in a bigger model).

### Out-of-Scope Use

* General code generation or deep refactoring explanations.
* Non-English commit conventions.
* Knowledge-intensive narrative summaries.

## Bias, Risks, and Limitations

* Trained on public commits filtered to Conventional Commit titles; may **prefer certain styles/projects**.
* Long diffs are truncated to `max_length`; summarization may miss edge changes.
* Dataset license may restrict **commercial** usage; verify for your case.

### Recommendations

* Enforce JSON validation; if invalid, retry with a JSON-repair prompt.
* Keep a regex gate for Conventional Commit titles in your pipeline.
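
A minimal validation gate along these lines might look as follows (the regex mirrors the training filter; the function name is illustrative, not part of any released API):

```python
import json
import re

# Conventional Commit title pattern (same shape as the dataset filter)
CC_TITLE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?(!)?:\s.+$"
)

def validate_commit_json(text: str):
    """Return the parsed object if it passes all gates, else None (caller retries)."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    title, bullets = obj.get("title", ""), obj.get("bullets", [])
    if not CC_TITLE.match(title) or len(title) > 65:
        return None
    if not isinstance(bullets, list) or len(bullets) > 3:
        return None
    return obj

# A passing and a failing example
assert validate_commit_json('{"title": "feat(cli): add --json flag", "bullets": []}')
assert validate_commit_json('{"title": "added a flag", "bullets": []}') is None
```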

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch, json

BASE = "unsloth/LFM2-350M-unsloth-bnb-4bit"
ADAPTER = "ethanke/lfm2_350m_commit_diff_summarizer"  # replace with your repo id

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
mdl = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
mdl = PeftModel.from_pretrained(mdl, ADAPTER)

diff = "...your git diff text..."
prompt = (
  "You are a commit message summarizer.\n"
  "Return a concise JSON object with fields 'title' (<=65 chars) and 'bullets' (0-3 items).\n"
  "Follow the Conventional Commit style for the title.\n\n"
  "### DIFF\n" + diff + "\n\n### OUTPUT JSON\n"
)

inputs = tok(prompt, return_tensors="pt").to(mdl.device)
with torch.no_grad():
    out = mdl.generate(**inputs, max_new_tokens=200, do_sample=False)
# decode only the newly generated tokens so the prompt text cannot pollute extraction
gen = out[0][inputs["input_ids"].shape[-1]:]
text = tok.decode(gen, skip_special_tokens=True)

# naive JSON extraction: first "{" through last "}"
js = text[text.find("{"): text.rfind("}") + 1]
obj = json.loads(js)
print(obj)
```

## Training Details

### Training Data

* **Dataset:** `Maxscha/commitbench` (diff → commit message).
* **Filtering:** kept only samples whose **first non-empty line** of the message matches Conventional Commits:
  `^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([^)]+\))?(!)?:\s.+$`
* **Note:** The dataset card indicates non-commercial licensing. Confirm before commercial deployment.
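
The filter described above can be sketched like this (a simplified reconstruction for illustration, not the exact training script):

```python
import re

CC_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?(!)?:\s.+$"
)

def is_conventional(message: str) -> bool:
    """True if the first non-empty line of a commit message is Conventional Commit style."""
    for line in message.splitlines():
        if line.strip():
            return bool(CC_RE.match(line.strip()))
    return False

# keep = [ex for ex in dataset if is_conventional(ex["message"])]
assert is_conventional("fix(api): return 404 on missing id\n\ndetails...")
assert not is_conventional("Update README")
```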

### Training Procedure

* **Method:** Supervised fine-tuning (SFT) with TRL `SFTTrainer` + **QLoRA** (PEFT).
* **Prompting:** Instruction + `### DIFF` + `### OUTPUT JSON` target (title/bullets).
* **Precision:** fp16 compute on 4-bit base.
* **Hyperparameters (v0.1):**

  * `max_length=2048`, `per_device_train_batch_size=2`, `grad_accum=4`
  * `lr=2e-4`, `scheduler=cosine`, `warmup_ratio=0.03`
  * `epochs=1` over capped subset
  * LoRA: `r=16`, `alpha=32`, `dropout=0.05`, targets: q/k/v/o + MLP proj

### Evaluation

* **Validation:** filtered split from CommitBench.
* **Metrics (example run):**

  * `eval_loss ≈ 1.18`  → perplexity ≈ 3.26
  * `eval_mean_token_accuracy ≈ 0.77`
  * Suggested task metrics: JSON validity rate, CC-title compliance, title length ≤ 65 chars, bullets ≤ 3.
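
Those task metrics can be computed over a batch of generations with a small helper (illustrative sketch, not part of the released code):

```python
import json
import re

CC_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^)]+\))?(!)?:\s.+$"
)

def task_metrics(outputs):
    """Fraction of raw model outputs that are valid JSON / CC-compliant / within limits."""
    n = len(outputs)
    valid = compliant = short = few = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        valid += 1
        title = obj.get("title", "")
        bullets = obj.get("bullets", [])
        compliant += bool(CC_RE.match(title))
        short += len(title) <= 65
        few += isinstance(bullets, list) and len(bullets) <= 3
    return {
        "json_validity": valid / n,
        "cc_title": compliant / n,
        "title_len_ok": short / n,
        "bullets_ok": few / n,
    }

m = task_metrics([
    '{"title": "fix: correct off-by-one", "bullets": []}',
    "not json at all",
])
assert m["json_validity"] == 0.5
```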

## Environmental Impact

* **Hardware:** 1× NVIDIA RTX 3060 12 GB (local)
* **Hours used:** ~1–2 h (prototype)

## Technical Specifications

* **Architecture:** LFM2-350M (decoder-only) + LoRA adapter
* **Libraries:** `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `unsloth`

## Citation

If you use this model, please cite the base model and dataset authors according to their cards.

## Model Card Authors

* Ethan (`ethanke`) and contributors

## Contact

* Open an issue on the Hub repo or message `ethanke` on Hugging Face.

### Framework versions

* PEFT 0.17.1
* TRL (SFTTrainer)
* Transformers (recent version)