	# Experiment 2 – Latin Square 2: CCT5 & COME on MCMD-NL (Redesigned)

	This repository contains the artifacts for Latin Square 2 of Experiment 2, the redesigned and reimplemented experiment that evaluates the DNN-based commit message generation baselines CCT5 and COME on the MCMD-NL dataset.
	Both models were fine-tuned separately for each language on MCMD-NL and then evaluated with the BLEU, METEOR, ROUGE-L, and CIDEr metrics.

	***

	## Models

	### CCT5
	CCT5 is a code-change-oriented pre-trained model built on top of the T5 architecture, initialized from CodeT5 weights and further pre-trained on CodeChangeNet (~40GB, 1.5M diff/commit pairs). Released at ESEC/FSE 2023.

	- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
	- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
	- For MCMD-NL: new checkpoint trained by fine-tuning the pre-trained CCT5 model on the MCMD-NL training set, then evaluated on the MCMD-NL test set

	### COME
	COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
	- Modification embedding: converts code changes into numerical vectors capturing code evolution
	- Fine-tuned CodeT5: generates candidate commit messages from the embedded representation
	- SVM-based decision algorithm: selects between the generated and retrieved candidate messages

	Released at ISSTA 2023. Does not include additional large-scale pre-training beyond CodeT5.

	- For MCMD-NL: new checkpoint trained by fine-tuning COME, starting from its CodeT5-based weights, on the MCMD-NL training set, then evaluated on the MCMD-NL test set

	***

	## Dataset

	MCMD-NL – Part of MCMD-New; commits from repositories with programming languages not present in the original MCMD dataset.

	| Property | Details |
	|----------|---------|
	| Languages | PHP, R, TypeScript, Swift, Objective-C |
	| Repositories | 329 new repositories (not in MCMD) |
	| Total commits | 135,699 |
	| Date range | January 1st, 2022 onwards |
	| Split | 80% train / 10% validation / 10% test |
	| Authors | Wu et al. (2025) |

	MCMD-NL was constructed to test model generalization to entirely new programming languages, requiring full fine-tuning from the pre-trained model checkpoints rather than reuse of existing MCMD-trained weights.
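
The 80/10/10 split listed in the table can be illustrated with a seeded shuffle. This is a minimal sketch under assumptions: the README does not say whether the original split was random or chronological, and the function name `split_commits` and the seed value are hypothetical.

```python
import random

def split_commits(commits, seed=42, train=0.8, valid=0.1):
    """Shuffle with a fixed seed, then cut into train/validation/test parts."""
    items = list(commits)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],  # the remainder becomes the test set
    )

# Stand-in data: integers instead of real commits.
train_set, valid_set, test_set = split_commits(range(1000))
print(len(train_set), len(valid_set), len(test_set))  # 800 100 100
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible regardless of any other random calls in the same process.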

	***

	## Repository Structure

	Each run folder corresponds to a programming language evaluated in this Latin Square. Both CCT5 and COME were fine-tuned on MCMD-NL and evaluated independently for each language.

	```
	experiment2_ls2/
	├── run_php/
	│   ├── checkpoint/    # CCT5 and COME checkpoints fine-tuned on MCMD-NL (PHP)
	│   ├── predictions/   # Generated commit messages on MCMD-NL PHP test set
	│   └── metrics/       # BLEU, METEOR, ROUGE-L, CIDEr scores
	├── run_r/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	├── run_typescript/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	├── run_swift/
	│   ├── checkpoint/
	│   ├── predictions/
	│   └── metrics/
	└── run_objectivec/
	    ├── checkpoint/
	    ├── predictions/
	    └── metrics/
	```
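
A downloaded copy can be checked against this layout with a few lines of Python. The sketch below only assumes the folder names shown in the tree; the helper names `expected_dirs` and `missing_dirs` are illustrative and not part of the repository.

```python
from pathlib import Path

LANGUAGES = ["php", "r", "typescript", "swift", "objectivec"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]

def expected_dirs(root="experiment2_ls2"):
    """Return every run subfolder implied by the documented layout."""
    return [Path(root) / f"run_{lang}" / sub
            for lang in LANGUAGES for sub in SUBFOLDERS]

def missing_dirs(root="experiment2_ls2"):
    """List the expected folders that are absent on disk."""
    return [path for path in expected_dirs(root) if not path.is_dir()]

# Prints one line per folder that is not found under the given root.
for path in missing_dirs():
    print("missing:", path)
```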

	### `checkpoint/`
	Contains the model checkpoint files produced after fine-tuning CCT5 and COME on the MCMD-NL training set for the corresponding language. These are newly trained checkpoints, not reused from prior work. The best checkpoint selected during validation is stored here.

	### `predictions/`
	Contains the generated commit messages produced by each model on the MCMD-NL test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
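
Because predictions are aligned to references by line, a loader should refuse files whose line counts differ. The following is a sketch under assumptions: the file names are hypothetical, and the demonstration writes throwaway files rather than reading the real ones.

```python
import tempfile
from pathlib import Path

def read_messages(path):
    """Read one message per line, keeping empty lines so alignment is preserved."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

def load_pairs(pred_path, ref_path):
    """Pair prediction i with reference i; fail loudly on a length mismatch."""
    preds = read_messages(pred_path)
    refs = read_messages(ref_path)
    if len(preds) != len(refs):
        raise ValueError(f"{len(preds)} predictions vs {len(refs)} references")
    return list(zip(preds, refs))

# Demonstration on temporary files; actual names in predictions/ may differ.
with tempfile.TemporaryDirectory() as tmp:
    pred_file = Path(tmp) / "cct5_predictions.txt"  # hypothetical name
    ref_file = Path(tmp) / "references.txt"         # hypothetical name
    pred_file.write_text("fix typo in readme\nadd unit test\n", encoding="utf-8")
    ref_file.write_text("fix readme typo\nadd unit tests\n", encoding="utf-8")
    pairs = load_pairs(pred_file, ref_file)

print(pairs[0])  # ('fix typo in readme', 'fix readme typo')
```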

	### `metrics/`
	Contains the evaluation metric scores computed by comparing the predictions against the MCMD-NL test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.

	***

	## Evaluation Metrics

	| Metric | Description |
	|--------|-------------|
	| BLEU | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
	| METEOR | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
	| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
	| CIDEr | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
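
To make the ROUGE-L row concrete, the sketch below computes a sentence-level score from the longest common subsequence (LCS). It is an illustration, not the evaluation script used in this experiment: it assumes whitespace tokenization and a balanced F1, whereas other ROUGE-L implementations may weight recall more heavily.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Balanced F-score over LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS is 2 tokens, so precision = 2/4, recall = 2/3, F1 = 4/7.
print(round(rouge_l_f1("fix typo in readme", "fix readme typo"), 3))  # 0.571
```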

	### Reported Results (Original Paper – Wu et al., 2025)

	| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
	|----------|-------|------|--------|---------|-------|
	| PHP | CCT5 | 31.96 | 27.31 | 37.99 | 2.26 |
	| PHP | COME | 34.68 | 30.51 | 40.27 | 2.59 |
	| R | CCT5 | 33.02 | 28.92 | 37.17 | 2.19 |
	| R | COME | 35.56 | 31.99 | 38.06 | 2.66 |
	| TypeScript | CCT5 | 32.33 | 27.92 | 43.62 | 2.24 |
	| TypeScript | COME | 35.72 | 30.97 | 47.38 | 2.61 |
	| Swift | CCT5 | 29.29 | 24.58 | 37.09 | 1.98 |
	| Swift | COME | 31.72 | 27.54 | 39.32 | 2.36 |
	| Objective-C | CCT5 | 28.57 | 24.62 | 31.63 | 1.68 |
	| Objective-C | COME | 33.43 | 29.44 | 38.32 | 2.17 |
	| Average | CCT5 | 31.02 | 26.67 | 37.50 | 2.06 |
	| Average | COME | 34.22 | 30.09 | 40.67 | 2.47 |

	These values serve as the reference for comparison with the results produced under the redesigned protocol.
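
A comparison against these reference values reduces to a per-metric difference. The sketch below uses a hypothetical helper, `score_deltas`, and demonstrates it on the two Average rows of the table; the same function could take reproduced scores as its first argument.

```python
# Average rows copied from the reported-results table.
REFERENCE_AVERAGE = {
    "CCT5": {"BLEU": 31.02, "METEOR": 26.67, "ROUGE-L": 37.50, "CIDEr": 2.06},
    "COME": {"BLEU": 34.22, "METEOR": 30.09, "ROUGE-L": 40.67, "CIDEr": 2.47},
}

def score_deltas(scores, baseline):
    """Per-metric difference scores - baseline, rounded to two decimals."""
    return {metric: round(scores[metric] - baseline[metric], 2)
            for metric in baseline}

# Gap between the two baselines according to the original paper.
gap = score_deltas(REFERENCE_AVERAGE["COME"], REFERENCE_AVERAGE["CCT5"])
print(gap)
```

Positive values mean the first argument scores higher than the baseline on that metric.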

	***

	## Methodological Differences from Experiment 1

	This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

	- Explicit random seed documentation for all fine-tuning runs
	- Fully documented fine-tuning procedure: hyperparameters (batch size, learning rate, number of epochs) and hardware specifications
	- Best checkpoint selection criteria explicitly defined using the validation set
	- Controlled evaluation procedure with clearly specified evaluation script versions
	- Full documentation of execution conditions (hardware, software versions, environment)
	- Explicit treatment of validity threats including language-specific variability and training randomness
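
One way to meet the seed and execution-condition points above is to write a small run record next to each checkpoint. The sketch below is an assumption about how that could look, not the procedure used in this experiment: the function name `make_run_record`, the seed, and all hyperparameter values are placeholders.

```python
import json
import platform
import random
import sys

def make_run_record(seed, hyperparameters):
    """Seed Python's RNG and collect the settings needed to rerun the job."""
    # Only the standard-library RNG is seeded here; a real training script
    # would also need to seed NumPy and the deep learning framework.
    random.seed(seed)
    return {
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }

record = make_run_record(
    seed=1234,  # placeholder, not the seed used in the experiment
    hyperparameters={"batch_size": 32, "learning_rate": 5e-5, "epochs": 10},  # placeholders
)
print(json.dumps(record, indent=2))
```

Storing the record as JSON keeps it diffable, so two runs that differ only in seed are easy to tell apart.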

	***

	## Important Notes

	- The MCMD-NL checkpoints are not reused from prior work: both models were fine-tuned from their pre-trained weights on the MCMD-NL training partition.
	- The original paper does not clarify whether a single multilingual checkpoint or separate per-language checkpoints were trained for MCMD-NL; this ambiguity is addressed and documented in the thesis.
	- MCMD-NL scores are generally higher than MCMD scores across all metrics, likely due to the different commit style distributions across the new languages.

	***

	## References

	- Wu et al. (2025). An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning. arXiv:2502.18904.
	- Lin et al. (2023). CCT5: A Code-Change-Oriented Pre-Trained Model. ESEC/FSE 2023.
	- He et al. (2023). COME: Commit Message Generation with Modification Embedding. ISSTA 2023.
	- Vegas & Elbaum (2023). Pitfalls in Experiments with DNN4SE. ESEC/FSE 2023.