# Experiment 2 – Latin Square 2: CCT5 & COME on MCMD-NL (Redesigned)

This repository contains the artifacts for **Latin Square 2 of Experiment 2**, the **redesigned and reimplemented experiment** that evaluates the DNN-based commit message generation baselines **CCT5** and **COME** on the **MCMD-NL dataset**.
Both models were retrained for each language on MCMD-NL and then evaluated with the BLEU, METEOR, ROUGE-L, and CIDEr metrics.

***

## Models

### CCT5
CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights and further pre-trained on **CodeChangeNet** (~40GB, 1.5M diff/commit pairs). Released at ESEC/FSE 2023.

- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
- For MCMD-NL: **new checkpoint trained by fine-tuning the pre-trained CCT5 model on the MCMD-NL training set**, then evaluated on the MCMD-NL test set

### COME
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
- **Modification embedding**: converts code changes into numerical vectors capturing code evolution
- **Fine-tuned CodeT5**: generates candidate commit messages from the embedded representation
- **SVM-based decision algorithm**: selects between the generated and retrieved candidate messages

Released at ISSTA 2023. Does not include additional large-scale pre-training beyond CodeT5.

- For MCMD-NL: **new checkpoint trained by fine-tuning the pre-trained COME model on the MCMD-NL training set**, then evaluated on the MCMD-NL test set

***

## Dataset

**MCMD-NL** – Part of MCMD-New; commits from repositories with programming languages **not present** in the original MCMD dataset.

| Property | Details |
|----------|---------|
| Languages | PHP, R, TypeScript, Swift, Objective-C |
| Repositories | 329 new repositories (not in MCMD) |
| Total commits | 135,699 |
| Date range | January 1st, 2022 onwards |
| Split | 80% train / 10% validation / 10% test |
| Authors | Wu et al. (2025) |

MCMD-NL was constructed to test model generalization to **entirely new programming languages**, requiring full fine-tuning from the pre-trained model checkpoints rather than reuse of existing MCMD-trained weights.

***

## Repository Structure

Each run folder corresponds to a **programming language** evaluated in this Latin Square. Both CCT5 and COME were fine-tuned on MCMD-NL and evaluated independently for each language.

```
experiment2_ls2/
├── run_php/
│   ├── checkpoint/          # CCT5 and COME checkpoints fine-tuned on MCMD-NL (PHP)
│   ├── predictions/         # Generated commit messages on MCMD-NL PHP test set
│   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_r/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_typescript/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_swift/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_objectivec/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
```
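
As a quick sanity check on the layout above, the following minimal sketch lists any expected sub-folder that is missing. The function name `missing_artifacts` is hypothetical and not part of the repository; the folder names are taken from the tree.

```python
from pathlib import Path

# Language suffixes and sub-folder names taken from the tree above.
LANGUAGES = ["php", "r", "typescript", "swift", "objectivec"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def missing_artifacts(root: Path) -> list[str]:
    """Return relative paths of expected sub-folders that do not exist under root."""
    missing = []
    for lang in LANGUAGES:
        for sub in SUBFOLDERS:
            if not (root / f"run_{lang}" / sub).is_dir():
                missing.append(f"run_{lang}/{sub}")
    return missing
```

Calling `missing_artifacts(Path("experiment2_ls2"))` on a complete copy of the repository should return an empty list.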

### `checkpoint/`
Contains the model checkpoint files produced after fine-tuning CCT5 and COME on the MCMD-NL training set for the corresponding language. These are **newly trained checkpoints**, not reused from prior work. The best checkpoint selected during validation is stored here.
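
The selection rule itself is not spelled out in this README. As an illustration only, the sketch below assumes that one validation score (for example BLEU) is recorded per saved checkpoint and that the highest score wins; the function name and the checkpoint labels are hypothetical.

```python
def select_best_checkpoint(val_scores: dict[str, float]) -> str:
    """Return the checkpoint name with the highest validation score.

    Ties go to the checkpoint inserted first, i.e. the earliest one
    when the dict is filled in training order.
    """
    if not val_scores:
        raise ValueError("no validation scores recorded")
    return max(val_scores, key=val_scores.get)


# Hypothetical scores, for illustration only.
best = select_best_checkpoint({"epoch_1": 30.1, "epoch_2": 31.4, "epoch_3": 31.4})
```

Breaking ties toward the earlier checkpoint is a design choice made here for determinism, not a documented property of the original training scripts.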

### `predictions/`
Contains the generated commit messages produced by each model on the MCMD-NL test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
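
Because predictions are aligned with references by line number, a length check before scoring can catch truncated or shifted files. The sketch below assumes the references are also available as a plain-text file with one message per line; the function name and both paths are hypothetical.

```python
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Read a text file as one entry per line, dropping only the newline."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_aligned(pred_path: Path, ref_path: Path) -> list[tuple[str, str]]:
    """Pair each prediction with the reference on the same line number."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"{len(preds)} predictions but {len(refs)} references; files are not aligned"
        )
    return list(zip(preds, refs))
```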

### `metrics/`
Contains the evaluation metric scores computed by comparing the predictions against the MCMD-NL test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.

***

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
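
To make the ROUGE-L row concrete, here is a simplified sentence-level sketch based on the longest common subsequence. It assumes whitespace tokenization and an F-measure with `beta = 1.0`; it is not the evaluation script used in this experiment, whose tokenization and weighting may differ.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate: str, reference: str, beta: float = 1.0) -> float:
    """LCS-based F-measure between one candidate and one reference message."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For the candidate `fix typo in readme` and the reference `fix a typo in the readme file`, the LCS has 4 tokens, so precision is 4/4, recall is 4/7, and the F-measure with `beta = 1.0` is 8/11.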

### Reported Results (Original Paper – Wu et al., 2025)

| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|----------|-------|------|--------|---------|-------|
| PHP | CCT5 | 31.96 | 27.31 | 37.99 | 2.26 |
| PHP | COME | 34.68 | 30.51 | 40.27 | 2.59 |
| R | CCT5 | 33.02 | 28.92 | 37.17 | 2.19 |
| R | COME | 35.56 | 31.99 | 38.06 | 2.66 |
| TypeScript | CCT5 | 32.33 | 27.92 | 43.62 | 2.24 |
| TypeScript | COME | 35.72 | 30.97 | 47.38 | 2.61 |
| Swift | CCT5 | 29.29 | 24.58 | 37.09 | 1.98 |
| Swift | COME | 31.72 | 27.54 | 39.32 | 2.36 |
| Objective-C | CCT5 | 28.57 | 24.62 | 31.63 | 1.68 |
| Objective-C | COME | 33.43 | 29.44 | 38.32 | 2.17 |
| **Average** | **CCT5** | **31.02** | **26.67** | **37.50** | **2.06** |
| **Average** | **COME** | **34.22** | **30.09** | **40.67** | **2.47** |

These values serve as the reference for comparison with the results produced under the redesigned protocol.
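
For comparison purposes, the averages can be recomputed from the per-language rows. The sketch below copies the values from the table and takes the unweighted mean over the five languages; treating the reported averages as unweighted means is an assumption, and the function name is hypothetical.

```python
from statistics import mean

METRICS = ("BLEU", "METEOR", "ROUGE-L", "CIDEr")

# Per-language scores copied from the table above, in METRICS order.
REPORTED = {
    "CCT5": {
        "PHP": (31.96, 27.31, 37.99, 2.26),
        "R": (33.02, 28.92, 37.17, 2.19),
        "TypeScript": (32.33, 27.92, 43.62, 2.24),
        "Swift": (29.29, 24.58, 37.09, 1.98),
        "Objective-C": (28.57, 24.62, 31.63, 1.68),
    },
    "COME": {
        "PHP": (34.68, 30.51, 40.27, 2.59),
        "R": (35.56, 31.99, 38.06, 2.66),
        "TypeScript": (35.72, 30.97, 47.38, 2.61),
        "Swift": (31.72, 27.54, 39.32, 2.36),
        "Objective-C": (33.43, 29.44, 38.32, 2.17),
    },
}


def macro_average(per_language: dict[str, tuple[float, ...]]) -> dict[str, float]:
    """Unweighted mean of each metric over all languages."""
    columns = zip(*per_language.values())
    return {metric: mean(column) for metric, column in zip(METRICS, columns)}


averages = {model: macro_average(rows) for model, rows in REPORTED.items()}
```

Averages recomputed from the rounded table entries can differ from the reported ones in the last digit, so a comparison should allow a small tolerance.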

***

## Methodological Differences from Experiment 1

This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

- **Explicit random seed documentation** for all fine-tuning runs
- **Fully documented fine-tuning procedure**: hyperparameters, batch size, learning rate, number of epochs, and hardware specifications
- **Best checkpoint selection criteria** explicitly defined using the validation set
- **Controlled evaluation procedure** with clearly specified evaluation script versions
- **Full documentation of execution conditions** (hardware, software versions, environment)
- **Explicit treatment of validity threats** including language-specific variability and training randomness
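
One way to keep the seed, hyperparameters, and execution environment together is to write a small record per fine-tuning run. The sketch below is illustrative only: the `RunRecord` fields and all values are placeholders rather than the settings used in this experiment, and it seeds only Python's built-in `random` module, so a deep learning framework would need its own seeding call.

```python
import json
import platform
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Settings documented for one fine-tuning run (hypothetical schema)."""
    model: str
    language: str
    seed: int
    learning_rate: float
    batch_size: int
    epochs: int


def start_run(record: RunRecord) -> str:
    """Seed Python's RNG and return the run settings plus environment as JSON."""
    random.seed(record.seed)
    payload = asdict(record)
    payload["python_version"] = platform.python_version()
    payload["platform"] = platform.platform()
    return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values, not the hyperparameters used in the experiment.
example = RunRecord("CCT5", "php", seed=42, learning_rate=5e-5, batch_size=32, epochs=10)
```

Storing the output of `start_run(example)` next to the checkpoint makes it possible to trace each reported score back to its seed and settings.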

***

## Important Notes

- The fine-tuning procedure for MCMD-NL is **not a reuse of existing checkpoints**: both models were trained from their pre-trained weights on the MCMD-NL training partition.
- The original paper does not clarify whether a single multilingual checkpoint or separate per-language checkpoints were trained for MCMD-NL; this ambiguity is addressed and documented in the thesis.
- MCMD-NL scores are generally **higher than MCMD scores** across all metrics, likely due to the different commit style distributions across the new languages.

***

## References

- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
- Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
- He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
- Vegas & Elbaum (2023). *Pitfalls in Experiments with DNN4SE.* ESEC/FSE 2023.