# Experiment 1, Latin Square 2: CCT5 & COME on MCMD-NT

This repository contains the artifacts for **Latin Square 2 of Experiment 1**, the **reproduction of the original experiment** by Wu et al. (2025) on the **MCMD-NT dataset** using the DNN-based commit message generation baselines **CCT5** and **COME**.

***

## Models

### CCT5
CCT5 is a code-change-oriented pre-trained model built on the **T5 architecture** and initialized from **CodeT5** weights. It is further pre-trained on **CodeChangeNet**, a commit-diff dataset of roughly 40GB containing about 1.5M diff and commit message pairs. It was released at ESEC/FSE 2023.

- Base: `T5-base` → `CodeT5` → `CCT5`
- Pre-training data: CodeChangeNet (40GB, 1.5M diff/commit pairs)
- For MCMD-NT: reused the MCMD-trained checkpoint released by the original authors (the same checkpoint as for MCMD, since MCMD-NT shares the same languages and structure)

### COME
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN approach that combines:
- A **fine-tuned CodeT5** component for natural language generation
- **Modification embedding** to represent code changes as numerical vectors
- An **SVM-based decision algorithm** to select between generated and retrieved candidate messages

It does not perform additional large-scale pre-training on top of CodeT5. It was released at ISSTA 2023.

- For MCMD-NT: reused the language-specific MCMD-trained checkpoints released by the original COME authors (one per language)

***

## Dataset

**MCMD-NT** is part of MCMD-New and consists of newer commits from repositories that are also present in the original MCMD dataset.

| Property | Details |
|----------|---------|
| Languages | Java, C++, C#, Python, JavaScript |
| Repositories | 367 repositories shared with the MCMD dataset |
| Total commits | 229,492 |
| Date range | January 1st, 2022 onwards (newer than MCMD) |
| Split | 80% train / 10% validation / 10% test |
| Authors | Wu et al. (2025) |

MCMD-NT was constructed to reduce the risk of **data leakage**: it uses newer commits from the same repositories as MCMD to test how well models generalize to more recent data, without introducing new programming languages.

***

## Repository Structure

Each run folder corresponds to a **programming language** evaluated in this Latin Square:

```
experiment1_ls2/
├── run_java/
│   ├── checkpoint/     # CCT5 and COME checkpoints (reused MCMD-trained, Java)
│   ├── predictions/    # Generated commit messages on MCMD-NT Java test set
│   └── metrics/        # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_cpp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_csharp/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_python/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_javascript/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
```
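
The layout above can also be checked programmatically. The following is a minimal sketch, assuming a local checkout rooted at `experiment1_ls2/` exactly as drawn in the tree; the helper name `expected_dirs` is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Language suffixes and subfolder names copied from the tree above.
LANGUAGES = ["java", "cpp", "csharp", "python", "javascript"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def expected_dirs(root):
    """Return every run subfolder expected under `root`, in tree order."""
    root = Path(root)
    return [root / f"run_{lang}" / sub for lang in LANGUAGES for sub in SUBFOLDERS]


if __name__ == "__main__":
    # Report which of the expected folders exist in the local checkout.
    for path in expected_dirs("experiment1_ls2"):
        status = "ok" if path.is_dir() else "missing"
        print(f"{status:8} {path}")
```

Running the script from the directory that contains `experiment1_ls2/` prints one line per expected folder, which makes an incomplete download easy to spot.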

### `checkpoint/`
Contains the model checkpoint files for CCT5 and COME. These are the **same checkpoints used for MCMD** (LS1), reused here since MCMD-NT shares the same languages and format as MCMD.

### `predictions/`
Contains the generated commit messages produced by each model on the MCMD-NT test set for the corresponding language, stored as `.txt` files with one prediction per line aligned to the reference messages.
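
Because predictions and references are aligned by line number, a length check is worth running before any scoring. The sketch below assumes the references are also available as a plain-text file with one message per line; the function names and any file paths passed to them are hypothetical, since this README does not list the actual file names.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list, one entry per line, newline removed."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def pair_predictions(pred_path, ref_path):
    """Pair each prediction with the reference found on the same line number."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        # A length mismatch means the line-by-line alignment is broken.
        raise ValueError(
            f"misaligned files: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))
```

Empty lines are kept on purpose: a model that produced an empty message still occupies its line, so dropping blanks would shift every later pair.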

### `metrics/`
Contains the computed evaluation metric scores for each model-language combination. Metrics are calculated by comparing predictions against the reference messages in the MCMD-NT test set.

***

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: combines unigram precision and recall with stemming and synonym matching |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
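
To make the ROUGE-L row concrete, here is a minimal sentence-level sketch based on the longest common subsequence. It is illustrative only and is not the scorer used to produce the numbers in this repository: lowercasing, whitespace tokenization, and the balanced F1 weighting are all assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Sentence-level ROUGE-L F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

For the prediction `fix null pointer` and the reference `fix a null pointer bug`, the LCS has length 3, so precision is 1.0, recall is 0.6, and the F1 works out to 0.75. The function returns values between 0 and 1.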

### Reported Results (Original Paper: Wu et al., 2025)

| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|----------|-------|------|--------|---------|-------|
| Java | CCT5 | 22.15 | 19.05 | 30.18 | 1.48 |
| Java | COME | 31.46 | 26.41 | 39.53 | 2.41 |
| C++ | CCT5 | 16.94 | 13.15 | 23.52 | 0.86 |
| C++ | COME | 25.60 | 20.47 | 31.68 | 1.74 |
| C# | CCT5 | 15.26 | 13.22 | 21.27 | 0.79 |
| C# | COME | 28.83 | 25.02 | 34.90 | 1.95 |
| Python | CCT5 | 19.02 | 16.12 | 30.47 | 0.98 |
| Python | COME | 25.95 | 22.55 | 36.78 | 1.75 |
| JavaScript | CCT5 | 24.72 | 21.66 | 34.42 | 1.73 |
| JavaScript | COME | 31.30 | 27.06 | 39.77 | 2.41 |
| **Average** | **CCT5** | **19.62** | **16.64** | **27.97** | **1.17** |
| **Average** | **COME** | **28.63** | **24.30** | **36.53** | **2.05** |
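
The Average rows can be cross-checked against the per-language rows. The sketch below assumes they are unweighted means over the five languages, which this README does not state explicitly; the helper names are hypothetical and the numbers are copied from the table above.

```python
# Per-language scores in the order Java, C++, C#, Python, JavaScript.
PER_LANGUAGE = {
    "CCT5": {
        "BLEU": [22.15, 16.94, 15.26, 19.02, 24.72],
        "METEOR": [19.05, 13.15, 13.22, 16.12, 21.66],
        "ROUGE-L": [30.18, 23.52, 21.27, 30.47, 34.42],
        "CIDEr": [1.48, 0.86, 0.79, 0.98, 1.73],
    },
    "COME": {
        "BLEU": [31.46, 25.60, 28.83, 25.95, 31.30],
        "METEOR": [26.41, 20.47, 25.02, 22.55, 27.06],
        "ROUGE-L": [39.53, 31.68, 34.90, 36.78, 39.77],
        "CIDEr": [2.41, 1.74, 1.95, 1.75, 2.41],
    },
}

# Values from the two Average rows.
REPORTED_AVERAGE = {
    "CCT5": {"BLEU": 19.62, "METEOR": 16.64, "ROUGE-L": 27.97, "CIDEr": 1.17},
    "COME": {"BLEU": 28.63, "METEOR": 24.30, "ROUGE-L": 36.53, "CIDEr": 2.05},
}


def unweighted_mean(values):
    """Plain arithmetic mean, giving every language the same weight."""
    return sum(values) / len(values)


def check_averages(tolerance=0.005):
    """Return (model, metric, mean, reported) for every entry that disagrees."""
    mismatches = []
    for model, metrics in PER_LANGUAGE.items():
        for metric, values in metrics.items():
            mean = unweighted_mean(values)
            reported = REPORTED_AVERAGE[model][metric]
            # 0.005 is half of the last reported decimal place.
            if abs(mean - reported) > tolerance:
                mismatches.append((model, metric, mean, reported))
    return mismatches
```

With the values above, `check_averages()` returns an empty list, which is consistent with the Average rows being unweighted means over the five languages.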

***

## References

- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.