	# Experiment 2 – Latin Square 1: CCT5 & COME on MCMD (Redesigned)

	This repository contains the artifacts for Latin Square 1 of Experiment 2: the redesigned and reimplemented experiment, which evaluates the DNN-based commit message generation baselines CCT5 and COME on the MCMD dataset. It follows a more explicit and controlled evaluation protocol than the original study by Wu et al. (2025).

	***

	## Models

	### CCT5
	CCT5 is a code-change-oriented pre-trained model built on the T5 architecture. It is initialized from CodeT5 weights and further pre-trained on CodeChangeNet (~40 GB, 1.5M diff/commit pairs). It was presented at ESEC/FSE 2023.

	- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
	- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
	- For MCMD: fine-tuned checkpoint from original CCT5 authors, trained on MCMD training set

	### COME
	COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
	- Modification embedding: converts code changes into numerical vectors capturing code evolution
	- Fine-tuned CodeT5: generates candidate commit messages from the embedded representation
	- SVM-based decision algorithm: selects between the generated and retrieved candidate messages

	COME was presented at ISSTA 2023. It does not include additional large-scale pre-training beyond CodeT5.

	- For MCMD: language-specific checkpoints released by original COME authors (one per language)

	***

	## Dataset

	MCMD – Multilingual Commit Message Dataset

	| Property | Details |
	|----------|---------|
	| Languages | Java, C++, C#, Python, JavaScript |
	| Repositories | Top 100 most-starred GitHub repos per language (500 total) |
	| Total commits | ~1,094,115 |
	| Date range | Up to January 1st, 2022 |
	| Split | 80% train / 10% validation / 10% test |
	| Authors | Liu et al. (2020) |

	***

	## Repository Structure

	Each run folder corresponds to a programming language evaluated in this Latin Square. Unlike Experiment 1, this experiment follows a more controlled protocol with explicit random seed documentation and multiple evaluation runs where applicable.

	```
	experiment2_ls1/
	├── run_java/
	│ ├── checkpoint/ # CCT5 and COME checkpoints for MCMD (Java)
	│ ├── predictions/ # Generated commit messages on MCMD Java test set
	│ └── metrics/ # BLEU, METEOR, ROUGE-L, CIDEr scores
	├── run_cpp/
	│ ├── checkpoint/
	│ ├── predictions/
	│ └── metrics/
	├── run_csharp/
	│ ├── checkpoint/
	│ ├── predictions/
	│ └── metrics/
	├── run_python/
	│ ├── checkpoint/
	│ ├── predictions/
	│ └── metrics/
	└── run_javascript/
	  ├── checkpoint/
	  ├── predictions/
	  └── metrics/
	```
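
The layout above can be checked programmatically. The following is a minimal sketch, assuming the repository root is a local folder named `experiment2_ls1`; the helper names `expected_dirs` and `missing_dirs` are hypothetical and not part of the repository. It only checks directories, since this README does not specify file names inside them.

```python
from pathlib import Path

# Run folder names exactly as listed in the tree above.
RUN_FOLDERS = ["run_java", "run_cpp", "run_csharp", "run_python", "run_javascript"]
# Subfolders that every run folder is expected to contain.
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def expected_dirs(root):
    """Return the directories implied by the layout (hypothetical helper)."""
    root = Path(root)
    return [root / run / sub for run in RUN_FOLDERS for sub in SUBFOLDERS]


def missing_dirs(root):
    """Return the expected directories that do not exist under `root`."""
    return [d for d in expected_dirs(root) if not d.is_dir()]


if __name__ == "__main__":
    # The root path is an assumption; adjust it to the local checkout.
    for d in missing_dirs("experiment2_ls1"):
        print(f"missing: {d}")
```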

	### `checkpoint/`
	Contains the model checkpoint files used for evaluation. For MCMD, these are the fine-tuned checkpoints released by the original authors of CCT5 and COME, trained on the MCMD training set for the corresponding language.

	### `predictions/`
	Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
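
Because predictions and references are aligned by line, a length mismatch silently corrupts every downstream score. The sketch below shows one way to load the two files and enforce the alignment. The function names and file paths are illustrative assumptions, not the repository's evaluation code.

```python
def read_lines(path):
    """Read a text file into a list of lines, keeping empty lines.

    Empty predictions must be kept so that line i still matches reference i.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_aligned(pred_path, ref_path):
    """Pair prediction i with reference i; fail loudly if the counts differ."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"{len(preds)} predictions but {len(refs)} references; "
            "the files are not aligned"
        )
    return list(zip(preds, refs))
```

A call such as `load_aligned("predictions/cct5.txt", "test_refs.txt")` would return a list of `(prediction, reference)` tuples; both file names here are hypothetical.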

	### `metrics/`
	Contains the evaluation metric scores computed by comparing the predictions against the MCMD test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.

	***

	## Evaluation Metrics

	| Metric | Description |
	|--------|-------------|
	| BLEU | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
	| METEOR | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
	| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
	| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
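
To make the LCS-based definition of ROUGE-L concrete, here is a minimal sentence-level sketch. It is not the evaluation script used in this experiment. Whitespace tokenization and the balanced F1 weighting are assumptions, so its numbers will not necessarily match the scores in `metrics/`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Sentence-level ROUGE-L as the F1 of LCS precision and LCS recall."""
    pred_tokens = prediction.split()  # assumption: whitespace tokenization
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the prediction `fix bug in parser` and the reference `fix parser bug`, the LCS has length 2, giving precision 2/4, recall 2/3, and an F1 of 4/7.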

	### Reference Results (Original Paper – Wu et al., 2025)

	| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
	|----------|-------|------|--------|---------|-------|
	| Java | CCT5 | 17.19 | 14.95 | 26.08 | 1.06 |
	| Java | COME | 27.17 | 23.36 | 34.59 | 1.90 |
	| C++ | CCT5 | 15.65 | 14.11 | 24.15 | 0.90 |
	| C++ | COME | 27.29 | 23.29 | 33.33 | 1.91 |
	| C# | CCT5 | 12.06 | 11.05 | 18.92 | 0.61 |
	| C# | COME | 20.80 | 17.72 | 27.01 | 1.25 |
	| Python | CCT5 | 15.12 | 13.70 | 23.79 | 0.85 |
	| Python | COME | 23.17 | 19.99 | 30.48 | 1.50 |
	| JavaScript | CCT5 | 19.76 | 17.51 | 28.73 | 1.33 |
	| JavaScript | COME | 26.91 | 23.02 | 34.44 | 1.92 |
	| Average | CCT5 | 15.96 | 14.26 | 24.33 | 0.95 |
	| Average | COME | 25.07 | 21.48 | 31.97 | 1.70 |

	These values serve as the baseline reference for comparison with the results produced under the redesigned protocol.
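
The Average rows in the table are consistent with an unweighted mean over the five languages, which the sketch below verifies. That the original authors computed them this way is an inference from the numbers, not something stated in this README. The same check can be reused on reproduced scores.

```python
# Per-language reference scores from the table: (BLEU, METEOR, ROUGE-L, CIDEr).
REFERENCE = {
    "CCT5": {
        "Java": (17.19, 14.95, 26.08, 1.06),
        "C++": (15.65, 14.11, 24.15, 0.90),
        "C#": (12.06, 11.05, 18.92, 0.61),
        "Python": (15.12, 13.70, 23.79, 0.85),
        "JavaScript": (19.76, 17.51, 28.73, 1.33),
    },
    "COME": {
        "Java": (27.17, 23.36, 34.59, 1.90),
        "C++": (27.29, 23.29, 33.33, 1.91),
        "C#": (20.80, 17.72, 27.01, 1.25),
        "Python": (23.17, 19.99, 30.48, 1.50),
        "JavaScript": (26.91, 23.02, 34.44, 1.92),
    },
}

# The Average rows as reported in the table.
REPORTED_AVERAGE = {
    "CCT5": (15.96, 14.26, 24.33, 0.95),
    "COME": (25.07, 21.48, 31.97, 1.70),
}


def macro_average(per_language):
    """Unweighted mean of each metric over languages, rounded to 2 decimals."""
    rows = list(per_language.values())
    return tuple(round(sum(col) / len(rows), 2) for col in zip(*rows))


if __name__ == "__main__":
    for model, scores in REFERENCE.items():
        print(model, macro_average(scores), REPORTED_AVERAGE[model])
```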

	***

	## Methodological Differences from Experiment 1

	This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

	- Explicit random seed documentation for all runs
	- Controlled evaluation procedure with clearly specified script versions
	- Full documentation of execution conditions (hardware, software versions, environment)
	- Explicit treatment of validity threats at each stage of the evaluation
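
As an illustration of the first and third points, the sketch below seeds a run and records the conditions it executed under. It uses only the Python standard library. The function name, record fields, and seed value are assumptions; a real run of CCT5 or COME would also need to seed the numerical and deep learning libraries in use and to record their versions and the GPU model.

```python
import json
import platform
import random
import sys


def start_run(seed):
    """Seed Python's RNG and return a record of the execution conditions.

    Only the standard-library RNG is seeded here; framework-specific
    seeding is intentionally left out of this sketch.
    """
    random.seed(seed)
    return {
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    # The seed value is arbitrary and chosen only for illustration.
    record = start_run(seed=42)
    print(json.dumps(record, indent=2))
```

Storing the resulting JSON next to each `metrics/` file would tie every score to the seed and environment that produced it.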

	***

	## References

	- Wu et al. (2025). An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning. arXiv:2502.18904.
	- Lin et al. (2023). CCT5: A Code-Change-Oriented Pre-Trained Model. ESEC/FSE 2023.
	- He et al. (2023). COME: Commit Message Generation with Modification Embedding. ISSTA 2023.
	- Liu et al. (2020). MCMD dataset.
	- Vegas & Elbaum (2023). Pitfalls in Experiments with DNN4SE. ESEC/FSE 2023.