| # Experiment 2 - Latin Square 1: CCT5 & COME on MCMD (Redesigned) |
|
|
| This repository contains the artifacts for **Latin Square 1 of Experiment 2**: the **redesigned and reimplemented experiment** that evaluates the DNN-based commit message generation baselines **CCT5** and **COME** on the **MCMD dataset**. It follows a more explicit and controlled evaluation protocol than the original study by Wu et al. (2025). |
|
|
| *** |
| |
| ## Models |
| |
| ### CCT5 |
| CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights and further pre-trained on **CodeChangeNet** (~40 GB, 1.5M diff/commit pairs). It was published at ESEC/FSE 2023. |
|
|
| - Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`) |
| - Pre-training corpus: CodeChangeNet (code diffs paired with commit messages) |
| - For MCMD: fine-tuned checkpoint from original CCT5 authors, trained on MCMD training set |
|
|
| ### COME |
| COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components: |
| - **Modification embedding**: converts code changes into numerical vectors capturing code evolution |
| - **Fine-tuned CodeT5**: generates candidate commit messages from the embedded representation |
| - **SVM-based decision algorithm**: selects between the generated and retrieved candidate messages |
|
|
| COME was published at ISSTA 2023. It does not include additional large-scale pre-training beyond CodeT5. |
|
|
| - For MCMD: language-specific checkpoints released by original COME authors (one per language) |
|
|
| *** |
| |
| ## Dataset |
| |
| **MCMD**: Multilingual Commit Message Dataset |
|
|
| | Property | Details | |
| |----------|---------| |
| | Languages | Java, C++, C#, Python, JavaScript | |
| | Repositories | Top 100 most-starred GitHub repos per language (500 total) | |
| | Total commits | ~1,094,115 | |
| | Date range | Up to January 1st, 2022 | |
| | Split | 80% train / 10% validation / 10% test | |
| | Authors | Liu et al. (2020) | |
|
|
| *** |
| |
| ## Repository Structure |
| |
| Each run folder corresponds to a **programming language** evaluated in this Latin Square. Unlike Experiment 1, this experiment follows a more controlled protocol with explicit random seed documentation and multiple evaluation runs where applicable. |
|
|
| ``` |
| experiment2_ls1/ |
| ├── run_java/ |
| │   ├── checkpoint/     # CCT5 and COME checkpoints for MCMD (Java) |
| │   ├── predictions/    # Generated commit messages on MCMD Java test set |
| │   └── metrics/        # BLEU, METEOR, ROUGE-L, CIDEr scores |
| ├── run_cpp/ |
| │   ├── checkpoint/ |
| │   ├── predictions/ |
| │   └── metrics/ |
| ├── run_csharp/ |
| │   ├── checkpoint/ |
| │   ├── predictions/ |
| │   └── metrics/ |
| ├── run_python/ |
| │   ├── checkpoint/ |
| │   ├── predictions/ |
| │   └── metrics/ |
| └── run_javascript/ |
|     ├── checkpoint/ |
|     ├── predictions/ |
|     └── metrics/ |
| ``` |
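The layout above can be checked mechanically before running any evaluation. The following is a minimal sketch, assuming the folder names shown in the tree; the function name `missing_paths` and the constants are hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names taken from the tree above; the constants themselves are illustrative.
LANGUAGES = ["java", "cpp", "csharp", "python", "javascript"]
SUBDIRS = ["checkpoint", "predictions", "metrics"]


def missing_paths(root):
    """Return the expected run sub-folders that do not exist under `root`."""
    root = Path(root)
    missing = []
    for lang in LANGUAGES:
        for sub in SUBDIRS:
            path = root / f"run_{lang}" / sub
            if not path.is_dir():
                # Report paths relative to the root, with forward slashes.
                missing.append(path.relative_to(root).as_posix())
    return missing
```

Calling `missing_paths("experiment2_ls1")` returns an empty list when every run folder contains all three sub-folders.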
|
|
| ### `checkpoint/` |
| Contains the model checkpoint files used for evaluation. For MCMD, these are the fine-tuned checkpoints released by the original authors of CCT5 and COME, trained on the MCMD training set for the corresponding language. |
|
|
| ### `predictions/` |
| Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set. |
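Because predictions are aligned to references by line position, a length mismatch silently corrupts every metric. The sketch below shows one way to load both files and fail early; the helper names and any file paths passed to them are assumptions, not names used in this repository.

```python
from pathlib import Path


def load_lines(path):
    """Read a one-item-per-line text file, keeping empty lines in place."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line.rstrip("\r") for line in text.split("\n")]
    # A trailing newline produces one spurious empty entry at the end.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def load_aligned(pred_path, ref_path):
    """Return (prediction, reference) pairs, or raise if the files differ in length."""
    preds = load_lines(pred_path)
    refs = load_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"{len(preds)} predictions but {len(refs)} references: files are not aligned"
        )
    return list(zip(preds, refs))
```

Empty lines are preserved on purpose: a model that produces an empty message still occupies its line, and dropping it would shift every later pair.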
|
|
| ### `metrics/` |
| Contains the evaluation metric scores computed by comparing the predictions against the MCMD test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol. |
|
|
| *** |
| |
| ## Evaluation Metrics |
| |
| | Metric | Description | |
| |--------|-------------| |
| | **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages | |
| | **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: combines unigram precision and recall, with stemming and WordNet synonym matching | |
| | **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap | |
| | **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages | |
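
To make the ROUGE-L row concrete, here is a minimal sentence-level sketch based on the longest common subsequence. It assumes whitespace tokenization and `beta = 1.0` (a plain F1); the evaluation scripts used in this experiment may tokenize differently or weight recall differently, so this is an illustration of the idea rather than the scoring code behind the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score in [0, 1] (assumed whitespace tokens)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For example, `rouge_l("fix null pointer in parser", "fix null pointer exception in parser")` has an LCS of 5 tokens, precision 1, recall 5/6, and therefore an F1 of 10/11. Multiplying by 100 gives the scale used in the results table.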
|
|
| ### Reference Results (Original Paper: Wu et al., 2025) |
|
|
| | Language | Model | BLEU | METEOR | ROUGE-L | CIDEr | |
| |----------|-------|------|--------|---------|-------| |
| | Java | CCT5 | 17.19 | 14.95 | 26.08 | 1.06 | |
| | Java | COME | 27.17 | 23.36 | 34.59 | 1.90 | |
| | C++ | CCT5 | 15.65 | 14.11 | 24.15 | 0.90 | |
| | C++ | COME | 27.29 | 23.29 | 33.33 | 1.91 | |
| | C# | CCT5 | 12.06 | 11.05 | 18.92 | 0.61 | |
| | C# | COME | 20.80 | 17.72 | 27.01 | 1.25 | |
| | Python | CCT5 | 15.12 | 13.70 | 23.79 | 0.85 | |
| | Python | COME | 23.17 | 19.99 | 30.48 | 1.50 | |
| | JavaScript | CCT5 | 19.76 | 17.51 | 28.73 | 1.33 | |
| | JavaScript | COME | 26.91 | 23.02 | 34.44 | 1.92 | |
| | **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** | |
| | **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** | |
|
|
| These values serve as the baseline reference for comparison with the results produced under the redesigned protocol. |
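
The **Average** rows can be reproduced as an unweighted mean over the five languages. The sketch below recomputes them from the per-language values in the table; the variable and function names are illustrative, and the same helper could be applied to scores obtained under the redesigned protocol.

```python
METRICS = ("BLEU", "METEOR", "ROUGE-L", "CIDEr")

# Per-language reference scores copied from the table above.
REFERENCE = {
    "CCT5": {
        "Java": (17.19, 14.95, 26.08, 1.06),
        "C++": (15.65, 14.11, 24.15, 0.90),
        "C#": (12.06, 11.05, 18.92, 0.61),
        "Python": (15.12, 13.70, 23.79, 0.85),
        "JavaScript": (19.76, 17.51, 28.73, 1.33),
    },
    "COME": {
        "Java": (27.17, 23.36, 34.59, 1.90),
        "C++": (27.29, 23.29, 33.33, 1.91),
        "C#": (20.80, 17.72, 27.01, 1.25),
        "Python": (23.17, 19.99, 30.48, 1.50),
        "JavaScript": (26.91, 23.02, 34.44, 1.92),
    },
}


def macro_average(per_language):
    """Unweighted mean of each metric across languages, rounded to 2 decimals."""
    rows = list(per_language.values())
    return tuple(round(sum(col) / len(rows), 2) for col in zip(*rows))


for model, scores in REFERENCE.items():
    print(model, dict(zip(METRICS, macro_average(scores))))
```

Each language counts equally in this mean, regardless of how many test commits it contributes.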
|
|
| *** |
| |
| ## Methodological Differences from Experiment 1 |
| |
| This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase: |
| |
| - **Explicit random seed documentation** for all runs |
| - **Controlled evaluation procedure** with clearly specified script versions |
| - **Full documentation of execution conditions** (hardware, software versions, environment) |
| - **Explicit treatment of validity threats** at each stage of the evaluation |
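
As an illustration of the first and third points, the sketch below fixes a seed and writes the execution conditions to a JSON file next to the run outputs. It is an assumption about how such a record could look, not the script used in this experiment: the function name and the output path are hypothetical, and only Python's built-in `random` module is seeded here. Frameworks such as NumPy or PyTorch keep their own random state and would need to be seeded separately.

```python
import json
import platform
import random
import sys


def record_run(seed, path):
    """Seed Python's RNG and save the seed plus basic environment details."""
    random.seed(seed)  # Only the standard-library generator is covered here.
    info = {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2)
    return info
```

Storing the record alongside `predictions/` and `metrics/` keeps each reported score traceable to the seed and environment that produced it.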
|
|
| *** |
| |
| ## References |
| |
| - Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904. |
| - Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023. |
| - He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023. |
| - Liu et al. (2020). *MCMD dataset.* |
| - Vegas & Elbaum (2023). *Pitfalls in Experiments with DNN4SE.* ESEC/FSE 2023. |