# Experiment 1 – Latin Square 1: CCT5 & COME on MCMD

This repository contains the artifacts for **Latin Square 1 of Experiment 1**, which corresponds to the **reproduction of the original experiment** by Wu et al. (2025) on the **MCMD dataset** using the DNN-based commit message generation baselines **CCT5** and **COME**.

***

## Models

### CCT5
CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights. It is further specialized through pre-training on **CodeChangeNet**, a commit-diff dataset containing roughly 40GB of diff and commit message pairs (~1.5M pairs). It was released at ESEC/FSE 2023.

- Base: `T5-base` → `CodeT5` → `CCT5`
- Pre-training data: CodeChangeNet (40GB, 1.5M diff/commit pairs)
- For MCMD: the released checkpoint, fine-tuned on MCMD by the original authors, is reused

### COME
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN approach that combines:
- A **fine-tuned CodeT5** component for natural language generation
- **Modification embedding** to represent code changes as numerical vectors
- An **SVM-based decision algorithm** to select between generated and retrieved candidate messages

It does not perform additional large-scale pre-training on top of CodeT5. Released at ISSTA 2023.

- For MCMD: reused language-specific checkpoints released by original COME authors (one per language)

***

## Dataset

**MCMD** – Multilingual Commit Message Dataset

| Property | Details |
|----------|---------|
| Languages | Java, C++, C#, Python, JavaScript |
| Repositories | Top 100 most-starred GitHub repos per language (500 total) |
| Total commits | ~1,094,115 |
| Date range | Up to January 1st, 2022 |
| Split | 80% train / 10% validation / 10% test |
| Authors | Liu et al. (2020) |

***

## Repository Structure

Each run folder corresponds to a **programming language** evaluated in this Latin Square:

```
experiment1_ls1/
├── run_java/
│   ├── checkpoint/          # CCT5 and COME checkpoints fine-tuned on MCMD (Java)
│   ├── predictions/         # Generated commit messages on MCMD Java test set
│   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_cpp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_csharp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_python/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_javascript/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
```
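
A minimal sketch for checking a local copy against the layout above. The folder names come from the tree; the helper names `expected_dirs` and `missing_dirs` and the default root path are illustrative assumptions, not part of the released artifacts.

```python
from pathlib import Path

# Folder names taken from the tree above.
LANGUAGES = ["java", "cpp", "csharp", "python", "javascript"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def expected_dirs(root="experiment1_ls1"):
    """Return every run_<language>/<subfolder> path the layout describes."""
    return [
        Path(root) / f"run_{lang}" / sub
        for lang in LANGUAGES
        for sub in SUBFOLDERS
    ]


def missing_dirs(root="experiment1_ls1"):
    """Return the expected directories that do not exist under root."""
    return [path for path in expected_dirs(root) if not path.is_dir()]


if __name__ == "__main__":
    for path in missing_dirs():
        print(f"missing: {path}")
```

Running the script from the directory that contains `experiment1_ls1/` prints one line per absent folder and nothing when the layout is complete.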

### `checkpoint/`
Contains the model checkpoint files for CCT5 and COME reused from the original authors' repositories, fine-tuned on the MCMD training set for the corresponding language.

### `predictions/`
Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language, stored as `.txt` files with one prediction per line aligned to the reference messages.
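
Because predictions are line-aligned with the references, a line-count mismatch usually signals a corrupted or truncated file. The sketch below pairs the two files and fails early on a mismatch. The function name `load_aligned` and the file names in the usage comment are hypothetical; the actual file names in each `predictions/` folder may differ.

```python
from pathlib import Path


def load_aligned(pred_path, ref_path):
    """Read two line-aligned text files and return (prediction, reference) pairs."""
    preds = Path(pred_path).read_text(encoding="utf-8").splitlines()
    refs = Path(ref_path).read_text(encoding="utf-8").splitlines()
    if len(preds) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))


# Hypothetical usage; both file names are placeholders:
# pairs = load_aligned("run_java/predictions/cct5.txt", "java_test_references.txt")
```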

### `metrics/`
Contains the computed evaluation metric scores for each model-language combination. Metrics are calculated by comparing predictions against the reference messages in the MCMD test set.

***

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |

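To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a simplified sketch. It assumes whitespace tokenization and a balanced F1 combination of LCS precision and recall; evaluation toolkits often tokenize differently and weight recall more heavily, so this is not the scoring script used for the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Balanced F1 over LCS precision and recall (simplifying assumption)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("fix null pointer in parser",
                     "fix null pointer exception in parser"))
```

In the example call, all five predicted tokens appear in order in the six-token reference, so precision is 1 and recall is 5/6.
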
### Reported Results (Original Paper – Wu et al., 2025)

| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|----------|-------|------|--------|---------|-------|
| Java | CCT5 | 17.19 | 14.95 | 26.08 | 1.06 |
| Java | COME | 27.17 | 23.36 | 34.59 | 1.90 |
| C++ | CCT5 | 15.65 | 14.11 | 24.15 | 0.90 |
| C++ | COME | 27.29 | 23.29 | 33.33 | 1.91 |
| C# | CCT5 | 12.06 | 11.05 | 18.92 | 0.61 |
| C# | COME | 20.80 | 17.72 | 27.01 | 1.25 |
| Python | CCT5 | 15.12 | 13.70 | 23.79 | 0.85 |
| Python | COME | 23.17 | 19.99 | 30.48 | 1.50 |
| JavaScript | CCT5 | 19.76 | 17.51 | 28.73 | 1.33 |
| JavaScript | COME | 26.91 | 23.02 | 34.44 | 1.92 |
| **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** |
| **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** |
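
The **Average** rows are consistent with an unweighted mean over the five languages, rounded to two decimals. The short check below recomputes them from the per-language rows of the table above; the dictionary layout and the name `macro_average` are illustrative choices.

```python
# Per-language scores copied from the table above: (BLEU, METEOR, ROUGE-L, CIDEr).
REPORTED = {
    "CCT5": {
        "Java": (17.19, 14.95, 26.08, 1.06),
        "C++": (15.65, 14.11, 24.15, 0.90),
        "C#": (12.06, 11.05, 18.92, 0.61),
        "Python": (15.12, 13.70, 23.79, 0.85),
        "JavaScript": (19.76, 17.51, 28.73, 1.33),
    },
    "COME": {
        "Java": (27.17, 23.36, 34.59, 1.90),
        "C++": (27.29, 23.29, 33.33, 1.91),
        "C#": (20.80, 17.72, 27.01, 1.25),
        "Python": (23.17, 19.99, 30.48, 1.50),
        "JavaScript": (26.91, 23.02, 34.44, 1.92),
    },
}


def macro_average(scores_by_language):
    """Unweighted mean of each metric column, rounded to two decimals."""
    columns = zip(*scores_by_language.values())
    return tuple(round(sum(col) / len(col), 2) for col in columns)


if __name__ == "__main__":
    for model, scores in REPORTED.items():
        print(model, macro_average(scores))
```

Each language counts equally in this mean, regardless of how many test commits it contributes.
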

***

## References

- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
- Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
- He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
- Liu et al. (2020). *MCMD dataset.*