Experiment 2 – Latin Square 2: CCT5 & COME on MCMD-NL (Redesigned)

This repository contains the artifacts for Latin Square 2 of Experiment 2, the redesigned and reimplemented experiment that evaluates the DNN-based commit message generation baselines CCT5 and COME on the MCMD-NL dataset. Both models were retrained for each language on MCMD-NL and then evaluated with the BLEU, METEOR, ROUGE-L, and CIDEr metrics.


Models

CCT5

CCT5 is a code-change-oriented pre-trained model built on top of the T5 architecture, initialized from CodeT5 weights and further pre-trained on CodeChangeNet (~40GB, 1.5M diff/commit pairs). It was presented at ESEC/FSE 2023.

  • Architecture: Encoder-decoder Transformer (T5-base → CodeT5 → CCT5)
  • Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
  • For MCMD-NL: new checkpoint trained by fine-tuning the pre-trained CCT5 model on the MCMD-NL training set, then evaluated on the MCMD-NL test set

COME

COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:

  • Modification embedding: converts code changes into numerical vectors capturing code evolution
  • Fine-tuned CodeT5: generates candidate commit messages from the embedded representation
  • SVM-based decision algorithm: selects between the generated and retrieved candidate messages

COME was presented at ISSTA 2023. It does not include additional large-scale pre-training beyond CodeT5.

  • For MCMD-NL: new checkpoint trained by fine-tuning the pre-trained COME model on the MCMD-NL training set, then evaluated on the MCMD-NL test set

Dataset

MCMD-NL – Part of MCMD-New; commits from repositories with programming languages not present in the original MCMD dataset.

| Property | Details |
| --- | --- |
| Languages | PHP, R, TypeScript, Swift, Objective-C |
| Repositories | 329 new repositories (not in MCMD) |
| Total commits | 135,699 |
| Date range | January 1st, 2022 onwards |
| Split | 80% train / 10% validation / 10% test |
| Authors | Wu et al. (2025) |

MCMD-NL was constructed to test model generalization to entirely new programming languages, requiring full fine-tuning from the pre-trained model checkpoints rather than reuse of existing MCMD-trained weights.
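The 80/10/10 split can be illustrated with a short sketch. How the commits were actually partitioned (randomly, per repository, or by date) is not stated here, so the shuffled split, the seed value, and the function name `split_commits` are assumptions made for illustration only.

```python
import random

def split_commits(commits, seed=42, train=0.8, valid=0.1):
    """Shuffle and cut a list of commits into train/validation/test parts.

    Assumption: a plain random split; the real MCMD-NL partitioning
    strategy may differ (for example, per repository or by date).
    """
    items = list(commits)
    random.Random(seed).shuffle(items)  # local RNG keeps the split reproducible
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

train_set, valid_set, test_set = split_commits(range(1000))
print(len(train_set), len(valid_set), len(test_set))  # 800 100 100
```

Using a dedicated `random.Random(seed)` instance rather than the global generator means the split does not change when other code draws random numbers.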


Repository Structure

Each run folder corresponds to a programming language evaluated in this Latin Square. Both CCT5 and COME were fine-tuned on MCMD-NL and evaluated independently for each language.

experiment2_ls2/
├── run_php/
│   ├── checkpoint/          # CCT5 and COME checkpoints fine-tuned on MCMD-NL (PHP)
│   ├── predictions/         # Generated commit messages on MCMD-NL PHP test set
│   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_r/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_typescript/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_swift/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_objectivec/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
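A quick way to check a local copy against this layout is to list the expected sub-folders that are absent. This is a minimal sketch: the folder names are taken from the tree above, while the helper name `missing_dirs` and the assumption that the root folder sits in the current working directory are illustrative.

```python
from pathlib import Path

RUNS = ("run_php", "run_r", "run_typescript", "run_swift", "run_objectivec")
SUBDIRS = ("checkpoint", "predictions", "metrics")

def missing_dirs(root):
    """List expected sub-folders that do not exist under `root`."""
    root = Path(root)
    return [f"{run}/{sub}" for run in RUNS for sub in SUBDIRS
            if not (root / run / sub).is_dir()]

# Assumes the repository was downloaded into ./experiment2_ls2
print(len(missing_dirs("experiment2_ls2")), "expected folders are missing")
```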

checkpoint/

Contains the model checkpoint files produced after fine-tuning CCT5 and COME on the MCMD-NL training set for the corresponding language. These are newly trained checkpoints, not reused from prior work. The best checkpoint selected during validation is stored here.
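One way to picture the selection step is to score every saved checkpoint on the validation set and keep the highest. The selection metric (validation BLEU here), the record layout, and the checkpoint names are hypothetical placeholders, not values taken from this repository.

```python
def select_best_checkpoint(records, metric="bleu"):
    """Return the record with the highest validation score for `metric`.

    `records` is a list of dicts such as {"name": ..., "bleu": ...}.
    Ties are resolved in favour of the earliest checkpoint.
    """
    if not records:
        raise ValueError("no checkpoints to choose from")
    best = records[0]
    for rec in records[1:]:
        if rec[metric] > best[metric]:  # strict '>' keeps the earliest on ties
            best = rec
    return best

# Placeholder validation scores, for illustration only.
history = [
    {"name": "epoch-1", "bleu": 10.0},
    {"name": "epoch-2", "bleu": 12.5},
    {"name": "epoch-3", "bleu": 12.5},
]
print(select_best_checkpoint(history)["name"])  # epoch-2
```

Stating the tie-breaking rule explicitly matters for reproducibility, because two epochs with equal validation scores would otherwise make the stored checkpoint depend on iteration order.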

predictions/

Contains the generated commit messages produced by each model on the MCMD-NL test set for the corresponding language. Files are stored as .txt with one prediction per line, aligned to the reference messages in the test set.
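Because predictions and references are aligned by line, they can be paired by position. The sketch below assumes UTF-8 text files and uses hypothetical file names; it raises an error when the line counts differ, since a silent misalignment would invalidate every metric.

```python
import tempfile
from pathlib import Path

def load_pairs(pred_path, ref_path):
    """Pair each predicted message with the reference on the same line."""
    preds = Path(pred_path).read_text(encoding="utf-8").splitlines()
    refs = Path(ref_path).read_text(encoding="utf-8").splitlines()
    if len(preds) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))

# Demo with throwaway files; the names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    pred = Path(tmp) / "predictions.txt"
    ref = Path(tmp) / "references.txt"
    pred.write_text("fix typo\nadd tests\n", encoding="utf-8")
    ref.write_text("fix typo in readme\nadd unit tests\n", encoding="utf-8")
    pairs = load_pairs(pred, ref)
print(pairs[0])  # ('fix typo', 'fix typo in readme')
```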

metrics/

Contains the evaluation metric scores computed by comparing the predictions against the MCMD-NL test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.


Evaluation Metrics

| Metric | Description |
| --- | --- |
| BLEU | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
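To make the ROUGE-L entry concrete, here is a minimal sentence-level sketch based on the longest common subsequence (LCS). It assumes whitespace tokenization and an F-measure with beta = 1; the evaluation scripts used in this repository may tokenize, weight, or aggregate differently, so its output is not comparable to the reported scores.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score on whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

score = rouge_l("fix null pointer in parser", "fix null pointer bug in parser")
print(round(score, 3))  # 0.909
```

In this example all five candidate tokens appear in order in the six-token reference, so precision is 1, recall is 5/6, and the F-score is about 0.909.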

Reported Results (Original Paper – Wu et al., 2025)

| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- |
| PHP | CCT5 | 31.96 | 27.31 | 37.99 | 2.26 |
| PHP | COME | 34.68 | 30.51 | 40.27 | 2.59 |
| R | CCT5 | 33.02 | 28.92 | 37.17 | 2.19 |
| R | COME | 35.56 | 31.99 | 38.06 | 2.66 |
| TypeScript | CCT5 | 32.33 | 27.92 | 43.62 | 2.24 |
| TypeScript | COME | 35.72 | 30.97 | 47.38 | 2.61 |
| Swift | CCT5 | 29.29 | 24.58 | 37.09 | 1.98 |
| Swift | COME | 31.72 | 27.54 | 39.32 | 2.36 |
| Objective-C | CCT5 | 28.57 | 24.62 | 31.63 | 1.68 |
| Objective-C | COME | 33.43 | 29.44 | 38.32 | 2.17 |
| Average | CCT5 | 31.02 | 26.67 | 37.50 | 2.06 |
| Average | COME | 34.22 | 30.09 | 40.67 | 2.47 |

These values serve as the reference for comparison with the results produced under the redesigned protocol.
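As a small worked example of such a comparison, the sketch below computes per-metric differences between two score rows. It takes the two reported Average rows as input; the function name `metric_deltas` is illustrative, and the same function could receive reproduced scores as its first argument.

```python
METRICS = ("BLEU", "METEOR", "ROUGE-L", "CIDEr")

# Average rows from the reported results (Wu et al., 2025).
CCT5_AVG = dict(zip(METRICS, (31.02, 26.67, 37.50, 2.06)))
COME_AVG = dict(zip(METRICS, (34.22, 30.09, 40.67, 2.47)))

def metric_deltas(scores, baseline):
    """Return scores minus baseline for each metric, rounded to 2 decimals."""
    return {m: round(scores[m] - baseline[m], 2) for m in METRICS}

print(metric_deltas(COME_AVG, CCT5_AVG))
# {'BLEU': 3.2, 'METEOR': 3.42, 'ROUGE-L': 3.17, 'CIDEr': 0.41}
```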


Methodological Differences from Experiment 1

This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

  • Explicit random seed documentation for all fine-tuning runs
  • Fully documented fine-tuning procedure: hyperparameters, batch size, learning rate, number of epochs, and hardware specifications
  • Best checkpoint selection criteria explicitly defined using the validation set
  • Controlled evaluation procedure with clearly specified evaluation script versions
  • Full documentation of execution conditions (hardware, software versions, environment)
  • Explicit treatment of validity threats including language-specific variability and training randomness
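A minimal sketch of how a seed and the execution environment could be recorded alongside a run. It relies only on the Python standard library; a real fine-tuning run would also need to seed its deep learning framework and log GPU details, which is omitted here. The function name, field names, and hyperparameter values are hypothetical.

```python
import json
import platform
import random
import sys

def make_run_manifest(seed, hyperparameters):
    """Seed Python's RNG and describe the run in a JSON-serialisable dict."""
    random.seed(seed)
    return {
        "seed": seed,
        "hyperparameters": dict(hyperparameters),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }

# Placeholder values, not the settings used in this experiment.
manifest = make_run_manifest(seed=1234, hyperparameters={"batch_size": 8, "epochs": 3})
print(json.dumps(manifest, indent=2))
```

Writing such a manifest next to each checkpoint keeps the seed, hyperparameters, and software versions tied to the exact artifact they produced.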

Important Notes

  • The fine-tuning procedure for MCMD-NL does not reuse existing checkpoints: both models were trained from their pre-trained weights on the MCMD-NL training partition.
  • The original paper does not clarify whether a single multilingual checkpoint or separate per-language checkpoints were trained for MCMD-NL; this ambiguity is addressed and documented in the thesis.
  • MCMD-NL scores are generally higher than MCMD scores on all four metrics, likely because commit style distributions differ across the new languages.

References

  • Wu et al. (2025). An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning. arXiv:2502.18904.
  • Lin et al. (2023). CCT5: A Code-Change-Oriented Pre-Trained Model. ESEC/FSE 2023.
  • He et al. (2023). COME: Commit Message Generation with Modification Embedding. ISSTA 2023.
  • Vegas & Elbaum (2023). Pitfalls in Experiments with DNN4SE. ESEC/FSE 2023.