Update README.md

README.md (changed sections)
tags:
- commit-message-generation
- code-summarization
- generated_from_trainer
license: cc-by-nc-4.0
datasets:
- Maxscha/commitbench
language:
- en
---

[...]

- **Developed by:** Mamoun Yosef
- **Model type:** Causal Language Model (Decoder-only Transformer) with LoRA adapters
- **Language(s):** English
- **License:** CC BY-NC 4.0 (non-commercial for this trained adapter)
- **Base model license:** Apache 2.0 (`Qwen/Qwen2.5-Coder-0.5B`)
- **Finetuned from model:** Qwen/Qwen2.5-Coder-0.5B

### Model Sources

- **Repository:** [commit-message-llm](https://github.com/mamounyosef/commit-message-llm)
- **Base Model:** [Qwen/Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)

## License and Usage

- This adapter was trained on **CommitBench** (`Maxscha/commitbench`), which is licensed **CC BY-NC 4.0**.
- This trained adapter is therefore for **non-commercial use only**.
- The base model (`Qwen/Qwen2.5-Coder-0.5B`) remains licensed under **Apache-2.0**.

## Uses

### Direct Use

[...]

- Diffs from non-programming languages
- Extremely large diffs (>8000 characters)
- Commit messages requiring deep domain knowledge beyond code structure
- Commercial usage of this trained adapter

## Bias, Risks, and Limitations

[...]

**Preprocessing:**
- Removed trivial messages (fix, update, wip, etc.)
- Filtered out reference-only commits (fix #123)
- Removed placeholder tokens (`<HASH>`, `<URL>`)
- Kept diffs between 50 and 8000 characters
- Required messages with semantic content (>=3 words)
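
The filtering heuristics above could be sketched roughly as follows (an illustration, not the actual preprocessing script; the exact trivial-message list and the reference-only pattern are assumptions):

```python
import re

# Hypothetical set of trivial messages; the actual list used is not shown here.
TRIVIAL = {"fix", "update", "wip", "cleanup", "typo"}

def keep_example(diff: str, message: str) -> bool:
    """Return True if a (diff, message) pair passes the filters described above."""
    msg = message.strip().lower()
    # Drop trivial one-word messages such as "fix", "update", "wip".
    if msg in TRIVIAL:
        return False
    # Drop reference-only commits such as "fix #123".
    if re.fullmatch(r"(fix|close[sd]?|resolve[sd]?)\s+#\d+", msg):
        return False
    # Drop messages still containing placeholder tokens.
    if "<hash>" in msg or "<url>" in msg:
        return False
    # Keep only diffs between 50 and 8000 characters.
    if not 50 <= len(diff) <= 8000:
        return False
    # Require at least three words of semantic content.
    if len(msg.split()) < 3:
        return False
    return True
```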

**Final dataset sizes:**
- Training: 120,000 samples

[...]

#### Preprocessing

1. Normalize newlines (CRLF -> LF)
2. Tokenize diff + separator + message
3. Mask prompt labels to `-100`
4. Truncate to `max_length=512` tokens
5. Append EOS token to target
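
The steps above can be sketched on plain token-id lists (a minimal illustration; `prompt_ids` stands in for the tokenized diff plus separator, and the real pipeline uses the base model's tokenizer):

```python
def normalize_newlines(text: str) -> str:
    # Step 1: CRLF -> LF
    return text.replace("\r\n", "\n")

def build_example(prompt_ids: list, target_ids: list, eos_id: int, max_len: int = 512) -> dict:
    target = target_ids + [eos_id]               # step 5: append EOS to the target
    input_ids = prompt_ids + target              # step 2: diff + separator + message
    labels = [-100] * len(prompt_ids) + target   # step 3: mask prompt positions
    return {                                     # step 4: truncate to max_length
        "input_ids": input_ids[:max_len],
        "labels": labels[:max_len],
    }
```

Because the prompt positions are labeled `-100`, the cross-entropy loss is computed only over the commit message tokens.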

#### Training Hyperparameters

[...]

- **Loss:** Cross-entropy loss on commit message tokens
- **Perplexity:** exp(loss); measures how well the model predicts the reference messages
- Lower perplexity = better prediction quality
- Perplexity of ~17 is strong for this task
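
Since perplexity is just the exponential of the loss, the reported value maps directly back to a loss figure (a cross-entropy loss of about 2.83 corresponds to a perplexity of about 17):

```python
import math

def perplexity(loss: float) -> float:
    # Perplexity is the exponential of the mean cross-entropy loss.
    return math.exp(loss)
```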

### Results