	# Experiment 1 – Latin Square 2: CCT5 & COME on MCMD-NT

	This repository contains the artifacts for Latin Square 2 of Experiment 1, a reproduction of the original experiment by Wu et al. (2025) on the MCMD-NT dataset using the DNN-based commit message generation baselines CCT5 and COME.

	***

	## Models

	### CCT5
	CCT5 is a code-change-oriented pre-trained model built on the T5 architecture and initialized from CodeT5 weights. It is further pre-trained on CodeChangeNet, a dataset of roughly 1.5M diff and commit message pairs (about 40 GB). It was released at ESEC/FSE 2023.

	- Base: `T5-base` → `CodeT5` → `CCT5`
	- Pre-training data: CodeChangeNet (about 40 GB, roughly 1.5M diff and commit message pairs)
	- For MCMD-NT: reused the MCMD-trained checkpoint released by the original authors (the same checkpoint as for MCMD, since MCMD-NT shares the same languages and structure)

	### COME
	COME (Commit Message Generation with Modification Embedding) is a hybrid DNN approach that combines:
	- A fine-tuned CodeT5 component for natural language generation
	- Modification embedding to represent code changes as numerical vectors
	- An SVM-based decision algorithm to select between generated and retrieved candidate messages

	It does not perform additional large-scale pre-training on top of CodeT5. It was released at ISSTA 2023.

	- For MCMD-NT: reused the language-specific MCMD-trained checkpoints released by the original COME authors (one per language)
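
The selection step in the third component can be illustrated with a toy sketch. This is not COME's implementation: a hand-written linear decision function stands in for a trained SVM, and the feature vector, weights, bias, and function names are all hypothetical.

```python
def decision_value(features, weights, bias):
    """Linear decision function w.x + b, standing in for a trained linear SVM."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def choose_message(generated, retrieved, features, weights, bias):
    """Return the retrieved candidate when the decision value is positive,
    otherwise fall back to the generated candidate.

    `features` is a hypothetical feature vector, e.g. a similarity score
    between the input change and the retrieved change.
    """
    if decision_value(features, weights, bias) > 0:
        return retrieved
    return generated
```

With a single similarity feature, a weight of `1.0`, and a bias of `-0.5`, a similarity of `0.9` selects the retrieved message and `0.2` selects the generated one.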

	***

	## Dataset

	MCMD-NT is part of MCMD-New and contains newer commits from repositories that are also present in the original MCMD dataset.

	| Property | Details |
	|----------|---------|
	| Languages | Java, C++, C#, Python, JavaScript |
	| Repositories | 367 repositories shared with the MCMD dataset |
	| Total commits | 229,492 |
	| Date range | January 1st, 2022 onwards (newer than MCMD) |
	| Split | 80% train / 10% validation / 10% test |
	| Authors | Wu et al. (2025) |

	MCMD-NT was constructed to reduce the risk of data leakage. It uses newer commits from the same repositories as MCMD, so it tests how well models generalize to more recent data without introducing new programming languages.
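
The selection rule described above (same repositories, commits from January 1st, 2022 onwards) can be sketched as follows. This is only an illustration, not the dataset authors' pipeline: the record layout with `repo` and `date` keys and the function name are assumptions.

```python
from datetime import date

# Cutoff taken from the table above: commits from January 1st, 2022 onwards.
CUTOFF = date(2022, 1, 1)


def select_newer_commits(commits, mcmd_repos):
    """Keep commits that belong to an MCMD repository and are on or after the cutoff.

    `commits` is an iterable of dicts with hypothetical keys "repo" and "date".
    """
    return [
        commit
        for commit in commits
        if commit["repo"] in mcmd_repos and commit["date"] >= CUTOFF
    ]
```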

	***

	## Repository Structure

	Each run folder corresponds to a programming language evaluated in this Latin Square:

	```
	experiment1_ls2/
	├── run_java/
	│   ├── checkpoint/     # CCT5 and COME checkpoints (reused MCMD-trained, Java)
	│   ├── predictions/    # Generated commit messages on MCMD-NT Java test set
	│   └── metrics/        # BLEU, METEOR, ROUGE-L, CIDEr scores
	├── run_cpp/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	├── run_csharp/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	├── run_python/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	└── run_javascript/
	    ├── checkpoint/
	    ├── predictions/
	    └── metrics/
	```
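
A small helper can confirm that a local copy matches the layout above. This is a minimal sketch: the folder names come from the tree, while the function names and the default root path are assumptions.

```python
from pathlib import Path

# Folder names taken from the tree above.
LANGUAGES = ["java", "cpp", "csharp", "python", "javascript"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def expected_dirs(root="experiment1_ls2"):
    """List every run_<language>/<subfolder> directory the layout describes."""
    root = Path(root)
    return [root / f"run_{lang}" / sub for lang in LANGUAGES for sub in SUBFOLDERS]


def missing_dirs(root="experiment1_ls2"):
    """Return the expected directories that do not exist under `root`."""
    return [path for path in expected_dirs(root) if not path.is_dir()]
```

Calling `missing_dirs()` from the directory that contains `experiment1_ls2/` returns an empty list when all fifteen folders are present.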

	### `checkpoint/`
	Contains the model checkpoint files for CCT5 and COME. These are the same checkpoints used for MCMD (LS1), reused here since MCMD-NT shares the same languages and format as MCMD.

	### `predictions/`
	Contains the generated commit messages produced by each model on the MCMD-NT test set for the corresponding language, stored as `.txt` files with one prediction per line aligned to the reference messages.
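
Because predictions and references are aligned line by line, they can be paired by line index. The sketch below assumes both are plain UTF-8 text files with one message per line. The function name is illustrative and the file paths are left to the caller, since this README does not list the actual file names.

```python
def load_aligned(pred_path, ref_path):
    """Read two one-message-per-line files and pair them by line index."""
    with open(pred_path, encoding="utf-8") as f:
        preds = [line.rstrip("\n") for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.rstrip("\n") for line in f]
    # A length mismatch means the files are no longer aligned.
    if len(preds) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))
```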

	### `metrics/`
	Contains the computed evaluation metric scores for each model-language combination. Metrics are calculated by comparing predictions against the reference messages in the MCMD-NT test set.

	***

	## Evaluation Metrics

	| Metric | Description |
	|--------|-------------|
	| BLEU | Bilingual Evaluation Understudy: measures n-gram precision of generated messages against reference messages |
	| METEOR | Metric for Evaluation of Translation with Explicit Ordering: combines unigram precision and recall, with stemming and synonym matching |
	| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
	| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
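
To make the ROUGE-L row concrete, here is a minimal sketch of a sentence-level score built on the longest common subsequence. It assumes whitespace tokenization and a plain F1 combination of precision and recall. The scores in this repository come from the evaluation scripts, which may tokenize and weight differently, so this sketch is not expected to reproduce them exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F1 between a predicted and a reference message."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction `fix null pointer in parser` against the reference `fix null pointer exception in the parser` shares a 5-token subsequence, giving precision 1, recall 5/7, and an F1 of about 0.83.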

	### Reported Results (Original Paper – Wu et al., 2025)

	| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
	|----------|-------|------|--------|---------|-------|
	| Java | CCT5 | 22.15 | 19.05 | 30.18 | 1.48 |
	| Java | COME | 31.46 | 26.41 | 39.53 | 2.41 |
	| C++ | CCT5 | 16.94 | 13.15 | 23.52 | 0.86 |
	| C++ | COME | 25.60 | 20.47 | 31.68 | 1.74 |
	| C# | CCT5 | 15.26 | 13.22 | 21.27 | 0.79 |
	| C# | COME | 28.83 | 25.02 | 34.90 | 1.95 |
	| Python | CCT5 | 19.02 | 16.12 | 30.47 | 0.98 |
	| Python | COME | 25.95 | 22.55 | 36.78 | 1.75 |
	| JavaScript | CCT5 | 24.72 | 21.66 | 34.42 | 1.73 |
	| JavaScript | COME | 31.30 | 27.06 | 39.77 | 2.41 |
	| Average | CCT5 | 19.62 | 16.64 | 27.97 | 1.17 |
	| Average | COME | 28.63 | 24.30 | 36.53 | 2.05 |
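
The Average rows are consistent with an unweighted mean over the five languages, rounded to two decimals. The sketch below recomputes them from the per-language rows of the table. It only checks arithmetic consistency and assumes nothing about how the original paper aggregated its scores. The variable and function names are illustrative.

```python
# Per-language scores copied from the table above: (BLEU, METEOR, ROUGE-L, CIDEr).
REPORTED = {
    "CCT5": {
        "Java": (22.15, 19.05, 30.18, 1.48),
        "C++": (16.94, 13.15, 23.52, 0.86),
        "C#": (15.26, 13.22, 21.27, 0.79),
        "Python": (19.02, 16.12, 30.47, 0.98),
        "JavaScript": (24.72, 21.66, 34.42, 1.73),
    },
    "COME": {
        "Java": (31.46, 26.41, 39.53, 2.41),
        "C++": (25.60, 20.47, 31.68, 1.74),
        "C#": (28.83, 25.02, 34.90, 1.95),
        "Python": (25.95, 22.55, 36.78, 1.75),
        "JavaScript": (31.30, 27.06, 39.77, 2.41),
    },
}

# Average rows as printed in the table.
REPORTED_AVERAGE = {
    "CCT5": (19.62, 16.64, 27.97, 1.17),
    "COME": (28.63, 24.30, 36.53, 2.05),
}


def unweighted_mean(per_language):
    """Average each metric column over languages, rounded to two decimals."""
    rows = list(per_language.values())
    return tuple(round(sum(column) / len(rows), 2) for column in zip(*rows))


for model, per_language in REPORTED.items():
    matches = unweighted_mean(per_language) == REPORTED_AVERAGE[model]
    print(model, unweighted_mean(per_language), "matches table:", matches)
```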

	***

	## References

	- Wu et al. (2025). An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning. arXiv:2502.18904.