Experiment 2 – Latin Square 2: CCT5 & COME on MCMD-NL (Redesigned)

This repository contains the artifacts for Latin Square 2 of Experiment 2, the redesigned and reimplemented experiment that evaluates the DNN-based commit message generation baselines CCT5 and COME on the MCMD-NL dataset. Both models were retrained for each language on MCMD-NL and then evaluated with the BLEU, METEOR, ROUGE-L, and CIDEr metrics.

Models

CCT5

CCT5 is a code-change-oriented pre-trained model built on the T5 architecture, initialized from CodeT5 weights and further pre-trained on CodeChangeNet (~40 GB, 1.5M diff/commit pairs). It was released at ESEC/FSE 2023.

Architecture: Encoder-decoder Transformer (T5-base → CodeT5 → CCT5)
Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
For MCMD-NL: new checkpoint trained by fine-tuning the pre-trained CCT5 model on the MCMD-NL training set, then evaluated on the MCMD-NL test set

COME

COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:

Modification embedding: converts code changes into numerical vectors capturing code evolution
Fine-tuned CodeT5: generates candidate commit messages from the embedded representation
SVM-based decision algorithm: selects between the generated and retrieved candidate messages

Released at ISSTA 2023. Does not include additional large-scale pre-training beyond CodeT5.

For MCMD-NL: new checkpoint trained by fine-tuning the pre-trained COME model on the MCMD-NL training set, then evaluated on the MCMD-NL test set

Dataset

MCMD-NL – Part of MCMD-New; commits from repositories with programming languages not present in the original MCMD dataset.

Property	Details
Languages	PHP, R, TypeScript, Swift, Objective-C
Repositories	329 new repositories (not in MCMD)
Total commits	135,699
Date range	January 1st, 2022 onwards
Split	80% train / 10% validation / 10% test
Authors	Wu et al. (2025)

MCMD-NL was constructed to test model generalization to entirely new programming languages, requiring full fine-tuning from the pre-trained model checkpoints rather than reuse of existing MCMD-trained weights.
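
The 80% / 10% / 10% split listed in the table can be illustrated with a minimal sketch. It assumes a seeded random shuffle over commit indices; this README does not state how the original split was drawn (random, chronological, or per repository), so the function name `split_indices` and the seed are illustrative only.

```python
import random

def split_indices(n_items, seed, train_frac=0.8, valid_frac=0.1):
    """Shuffle indices 0..n_items-1 with a fixed seed and cut them 80/10/10."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_train = int(n_items * train_frac)
    n_valid = int(n_items * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# 135,699 is the total commit count from the table; seed 0 is an arbitrary choice.
train, valid, test = split_indices(135_699, seed=0)
print(len(train), len(valid), len(test))
```

Because the fractions do not divide 135,699 evenly, the exact partition sizes depend on the rounding rule, which is one more detail worth recording alongside the seed.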

Repository Structure

Each run folder corresponds to a programming language evaluated in this Latin Square. Both CCT5 and COME were fine-tuned on MCMD-NL and evaluated independently for each language.

experiment2_ls2/
├── run_php/
│   ├── checkpoint/          # CCT5 and COME checkpoints fine-tuned on MCMD-NL (PHP)
│   ├── predictions/         # Generated commit messages on MCMD-NL PHP test set
│   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_r/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_typescript/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_swift/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_objectivec/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
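
As a quick sanity check of the layout above, the following sketch lists any expected folder that is missing from a local copy. The folder names are taken from the tree; the helper name `missing_folders` and the idea of passing the root path as an argument are assumptions for illustration.

```python
from pathlib import Path

# Folder names copied from the tree above.
LANGUAGES = ["php", "r", "typescript", "swift", "objectivec"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]

def missing_folders(root):
    """Return the expected run sub-folders that do not exist under root."""
    root = Path(root)
    missing = []
    for language in LANGUAGES:
        for sub in SUBFOLDERS:
            folder = root / f"run_{language}" / sub
            if not folder.is_dir():
                missing.append(folder)
    return missing

if __name__ == "__main__":
    for folder in missing_folders("experiment2_ls2"):
        print("missing:", folder)
```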

`checkpoint/`

Contains the model checkpoint files produced after fine-tuning CCT5 and COME on the MCMD-NL training set for the corresponding language. These are newly trained checkpoints, not reused from prior work. The best checkpoint selected during validation is stored here.

`predictions/`

Contains the generated commit messages produced by each model on the MCMD-NL test set for the corresponding language. Files are stored as .txt with one prediction per line, aligned to the reference messages in the test set.
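
Since the predictions are aligned to the references line by line, a length check before scoring catches truncated or shifted files. The sketch below assumes both predictions and references are available as UTF-8 text files with one message per line; the function names and the file paths passed in are hypothetical, as this README does not list the actual file names.

```python
def read_lines(path):
    """Read a text file and return one entry per line, keeping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

def load_aligned_pairs(predictions_path, references_path):
    """Pair prediction i with reference i; fail loudly if the counts differ."""
    predictions = read_lines(predictions_path)
    references = read_lines(references_path)
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predictions but {len(references)} references"
        )
    return list(zip(predictions, references))
```

Empty lines are kept on purpose: a model that emits an empty message still occupies its slot, so dropping blanks would silently shift every later pair.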

`metrics/`

Contains the evaluation metric scores computed by comparing the predictions against the MCMD-NL test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.

Evaluation Metrics

Metric	Description
BLEU	Bilingual Evaluation Understudy: measures n-gram precision of the generated message against the reference, with a brevity penalty for short outputs
METEOR	Metric for Evaluation of Translation with Explicit Ordering: combines unigram precision and recall, with stemming and synonym matching via WordNet
ROUGE-L	Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap
CIDEr	Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages
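
To make the ROUGE-L row concrete, here is a minimal sentence-level sketch based on the longest common subsequence (LCS). It assumes whitespace tokenization and an unweighted F1 (beta = 1); the evaluation scripts used in this experiment may tokenize differently or weight recall more heavily, so scores from this sketch are not expected to match the reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_l_f1(prediction, reference):
    """Sentence-level ROUGE-L F1 on whitespace tokens (beta = 1, an assumption)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# LCS is "fix typo in" (3 tokens); both messages have 4 tokens.
print(rouge_l_f1("fix typo in readme", "fix typo in docs"))  # 0.75
```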

Reported Results (Original Paper – Wu et al., 2025)

Language	Model	BLEU	METEOR	ROUGE-L	CIDEr
PHP	CCT5	31.96	27.31	37.99	2.26
PHP	COME	34.68	30.51	40.27	2.59
R	CCT5	33.02	28.92	37.17	2.19
R	COME	35.56	31.99	38.06	2.66
TypeScript	CCT5	32.33	27.92	43.62	2.24
TypeScript	COME	35.72	30.97	47.38	2.61
Swift	CCT5	29.29	24.58	37.09	1.98
Swift	COME	31.72	27.54	39.32	2.36
Objective-C	CCT5	28.57	24.62	31.63	1.68
Objective-C	COME	33.43	29.44	38.32	2.17
Average	CCT5	31.02	26.67	37.50	2.06
Average	COME	34.22	30.09	40.67	2.47

These values serve as the reference for comparison with the results produced under the redesigned protocol.
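
One way to use the table as a reference is to report each reproduced score as a signed difference from the published value. The sketch below does this for BLEU only; the dictionary holds the reported BLEU values from the table, while the function name `delta_from_reference` is hypothetical and no reproduced scores are included here.

```python
# Reported BLEU scores copied from the table above (Wu et al., 2025).
REFERENCE_BLEU = {
    ("PHP", "CCT5"): 31.96, ("PHP", "COME"): 34.68,
    ("R", "CCT5"): 33.02, ("R", "COME"): 35.56,
    ("TypeScript", "CCT5"): 32.33, ("TypeScript", "COME"): 35.72,
    ("Swift", "CCT5"): 29.29, ("Swift", "COME"): 31.72,
    ("Objective-C", "CCT5"): 28.57, ("Objective-C", "COME"): 33.43,
}

def delta_from_reference(language, model, reproduced_bleu):
    """Return reproduced minus reported BLEU, rounded to two decimals."""
    reported = REFERENCE_BLEU[(language, model)]
    return round(reproduced_bleu - reported, 2)
```

A positive value means the redesigned run scored above the published figure, a negative value below it. The same pattern extends to METEOR, ROUGE-L, and CIDEr with one dictionary per metric.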

Methodological Differences from Experiment 1

This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

Explicit random seed documentation for all fine-tuning runs
Fully documented fine-tuning procedure: hyperparameters, batch size, learning rate, number of epochs, and hardware specifications
Best checkpoint selection criteria explicitly defined using the validation set
Controlled evaluation procedure with clearly specified evaluation script versions
Full documentation of execution conditions (hardware, software versions, environment)
Explicit treatment of validity threats including language-specific variability and training randomness
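
The first and fifth points above (seed documentation and execution conditions) can be sketched as follows. This is an illustration under assumptions, not the procedure used in the thesis: the helper names are hypothetical, NumPy and PyTorch are seeded only if they happen to be installed, and the recorded fields are a minimal subset.

```python
import json
import platform
import random
import sys

def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
    except ImportError:
        pass

def run_record(seed, hyperparameters):
    """Collect the seed, hyperparameters, and software environment for one run."""
    return {
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

def write_run_record(path, seed, hyperparameters):
    """Store the run record as JSON next to the run's other artifacts."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(run_record(seed, hyperparameters), handle, indent=2)
```

Seeding alone does not guarantee bit-identical results on GPU hardware, which is one reason the remaining training randomness is treated as a validity threat rather than assumed away.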

Important Notes

Fine-tuning for MCMD-NL does not reuse existing MCMD-trained checkpoints: both models were trained from their pre-trained weights on the MCMD-NL training partition.
The original paper does not clarify whether a single multilingual checkpoint or separate per-language checkpoints were trained for MCMD-NL; this ambiguity is addressed and documented in the thesis.
MCMD-NL scores are generally higher than MCMD scores on the reported metrics, possibly because commit style distributions differ across the new languages.

References

Wu et al. (2025). An Empirical Study on Commit Message Generation using LLMs via In-Context Learning. arXiv:2502.18904.
Lin et al. (2023). CCT5: A Code-Change-Oriented Pre-Trained Model. ESEC/FSE 2023.
He et al. (2023). COME: Commit Message Generation with Modification Embedding. ISSTA 2023.
Vegas & Elbaum (2023). Pitfalls in Experiments with DNN4SE. ESEC/FSE 2023.
