Create README.md
README.md
ADDED
@@ -0,0 +1,141 @@
# Experiment 2, Latin Square 2: CCT5 & COME on MCMD-NL (Redesigned)

This repository contains the artifacts for **Latin Square 2 of Experiment 2**, the **redesigned and reimplemented experiment** evaluated on the **MCMD-NL dataset** using the DNN-based commit message generation baselines **CCT5** and **COME**.
The models were retrained for each language on the MCMD-NL dataset and then evaluated with the BLEU, METEOR, ROUGE-L, and CIDEr metrics.

***

## Models

### CCT5
CCT5 is a code-change-oriented pre-trained model built on the **T5 architecture**, initialized from **CodeT5** weights and further pre-trained on **CodeChangeNet** (~40GB, 1.5M diff/commit pairs). It was released at ESEC/FSE 2023.

- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
- For MCMD-NL: **new checkpoint trained by fine-tuning the pre-trained CCT5 model on the MCMD-NL training set**, then evaluated on the MCMD-NL test set

### COME
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
- **Modification embedding**: converts code changes into numerical vectors capturing code evolution
- **Fine-tuned CodeT5**: generates candidate commit messages from the embedded representation
- **SVM-based decision algorithm**: selects between the generated and retrieved candidate messages

COME was released at ISSTA 2023 and does not include additional large-scale pre-training beyond CodeT5.

- For MCMD-NL: **new checkpoint trained by fine-tuning the pre-trained COME model on the MCMD-NL training set**, then evaluated on the MCMD-NL test set

***

## Dataset

**MCMD-NL**: part of MCMD-New, containing commits from repositories in programming languages **not present** in the original MCMD dataset.

| Property | Details |
|----------|---------|
| Languages | PHP, R, TypeScript, Swift, Objective-C |
| Repositories | 329 new repositories (not in MCMD) |
| Total commits | 135,699 |
| Date range | January 1st, 2022 onwards |
| Split | 80% train / 10% validation / 10% test |
| Authors | Wu et al. (2025) |

MCMD-NL was constructed to test model generalization to **entirely new programming languages**, requiring full fine-tuning from the pre-trained model checkpoints rather than reuse of existing MCMD-trained weights.
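
The 80/10/10 partition listed in the table can be illustrated with a minimal sketch. It assumes a seeded random shuffle; how the original split was actually drawn is not stated here. The function name `split_80_10_10` and the `commits` list are hypothetical stand-ins.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the split is repeatable
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no item is dropped
    return train, valid, test


# Hypothetical stand-in for the commits of one language (assumption, not real data).
commits = [f"commit_{i}" for i in range(1000)]
train, valid, test = split_80_10_10(commits, seed=42)
print(len(train), len(valid), len(test))
```

Using a local `random.Random(seed)` instead of the global generator keeps the split independent of any other random calls made in the same process.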

***

## Repository Structure

Each run folder corresponds to a **programming language** evaluated in this Latin Square. Both CCT5 and COME were fine-tuned on MCMD-NL and evaluated independently for each language.

```
experiment2_ls2/
├── run_php/
│   ├── checkpoint/       # CCT5 and COME checkpoints fine-tuned on MCMD-NL (PHP)
│   ├── predictions/      # Generated commit messages on MCMD-NL PHP test set
│   └── metrics/          # BLEU, METEOR, ROUGE-L, CIDEr scores
├── run_r/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_typescript/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
├── run_swift/
│   ├── checkpoint/
│   ├── predictions/
│   └── metrics/
└── run_objectivec/
    ├── checkpoint/
    ├── predictions/
    └── metrics/
```
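
A quick way to confirm that a local copy matches the layout above is to walk the expected folders. The sketch below is only an illustration: the helper name `missing_folders` is hypothetical, and it assumes the folder names are exactly those shown in the tree.

```python
from pathlib import Path

# Folder names taken from the tree above.
LANGUAGES = ["php", "r", "typescript", "swift", "objectivec"]
SUBFOLDERS = ["checkpoint", "predictions", "metrics"]


def missing_folders(root):
    """Return the expected run sub-folders that do not exist under `root`."""
    root = Path(root)
    missing = []
    for lang in LANGUAGES:
        for sub in SUBFOLDERS:
            folder = root / f"run_{lang}" / sub
            if not folder.is_dir():
                missing.append(folder.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains experiment2_ls2/.
    for path in missing_folders("experiment2_ls2"):
        print("missing:", path)
```

An empty result means all fifteen sub-folders are present; it says nothing about whether the files inside them are complete.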

### `checkpoint/`
Contains the model checkpoint files produced after fine-tuning CCT5 and COME on the MCMD-NL training set for the corresponding language. These are **newly trained checkpoints**, not reused from prior work. The best checkpoint selected during validation is stored here.

### `predictions/`
Contains the generated commit messages produced by each model on the MCMD-NL test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
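
Because predictions are matched to references purely by line position, it helps to verify the alignment before scoring. The following is a minimal sketch, assuming both files are UTF-8 text with one message per line; the helper name `load_aligned` and any file names passed to it are hypothetical.

```python
def read_lines(path):
    """Read a text file as a list of lines, keeping empty lines in place."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_aligned(pred_path, ref_path):
    """Pair predictions with references line by line, or fail if the counts differ."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"{len(preds)} predictions but {len(refs)} references; files are not aligned"
        )
    return list(zip(preds, refs))
```

Empty lines are kept on purpose: a model that generates an empty message still occupies a line, and dropping it would shift every later pair.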

### `metrics/`
Contains the evaluation metric scores computed by comparing the predictions against the MCMD-NL test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.

***

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: combines unigram precision and recall with stemming and synonym matching via WordNet |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
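
To make two of these definitions concrete, here is a toy sketch of clipped n-gram precision (the building block of BLEU) and an LCS-based ROUGE-L F1. It assumes whitespace tokenization and a single reference, and it is not the evaluation script used in this experiment: full BLEU also applies a brevity penalty and combines several n-gram orders, and ROUGE-L implementations may weight recall differently. The example messages are made up.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of BLEU."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the harmonic mean of LCS precision and LCS recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up messages for illustration only.
reference = "fix null pointer in parser"
candidate = "fix null pointer bug in lexer"
print(ngram_precision(candidate, reference, 1))  # 4 of 6 unigrams match
print(ngram_precision(candidate, reference, 2))  # 2 of 5 bigrams match
print(rouge_l_f1(candidate, reference))          # LCS is "fix null pointer in"
```

The inserted word "bug" breaks two bigrams but leaves the longest common subsequence intact, which is why ROUGE-L penalizes this candidate less than bigram precision does.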

### Reported Results (Original Paper, Wu et al., 2025)

| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|----------|-------|------|--------|---------|-------|
| PHP | CCT5 | 31.96 | 27.31 | 37.99 | 2.26 |
| PHP | COME | 34.68 | 30.51 | 40.27 | 2.59 |
| R | CCT5 | 33.02 | 28.92 | 37.17 | 2.19 |
| R | COME | 35.56 | 31.99 | 38.06 | 2.66 |
| TypeScript | CCT5 | 32.33 | 27.92 | 43.62 | 2.24 |
| TypeScript | COME | 35.72 | 30.97 | 47.38 | 2.61 |
| Swift | CCT5 | 29.29 | 24.58 | 37.09 | 1.98 |
| Swift | COME | 31.72 | 27.54 | 39.32 | 2.36 |
| Objective-C | CCT5 | 28.57 | 24.62 | 31.63 | 1.68 |
| Objective-C | COME | 33.43 | 29.44 | 38.32 | 2.17 |
| **Average** | **CCT5** | **31.02** | **26.67** | **37.50** | **2.06** |
| **Average** | **COME** | **34.22** | **30.09** | **40.67** | **2.47** |

These values serve as the reference for comparison with the results produced under the redesigned protocol.
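
When comparing against these reference values, the per-language gap between the two models is often easier to read than the raw scores. The sketch below copies the per-language rows from the table above into a dictionary and computes COME minus CCT5 for each metric; the names `REPORTED` and `come_minus_cct5` are hypothetical.

```python
# Values copied from the table above: (BLEU, METEOR, ROUGE-L, CIDEr) per language.
METRICS = ("BLEU", "METEOR", "ROUGE-L", "CIDEr")
REPORTED = {
    "PHP": {"CCT5": (31.96, 27.31, 37.99, 2.26), "COME": (34.68, 30.51, 40.27, 2.59)},
    "R": {"CCT5": (33.02, 28.92, 37.17, 2.19), "COME": (35.56, 31.99, 38.06, 2.66)},
    "TypeScript": {"CCT5": (32.33, 27.92, 43.62, 2.24), "COME": (35.72, 30.97, 47.38, 2.61)},
    "Swift": {"CCT5": (29.29, 24.58, 37.09, 1.98), "COME": (31.72, 27.54, 39.32, 2.36)},
    "Objective-C": {"CCT5": (28.57, 24.62, 31.63, 1.68), "COME": (33.43, 29.44, 38.32, 2.17)},
}


def come_minus_cct5(language):
    """Per-metric difference COME - CCT5 for one language, rounded to 2 decimals."""
    cct5 = REPORTED[language]["CCT5"]
    come = REPORTED[language]["COME"]
    return {m: round(b - a, 2) for m, a, b in zip(METRICS, cct5, come)}


for language in REPORTED:
    print(language, come_minus_cct5(language))
```

In the reported values, COME scores above CCT5 on every metric for every language, so all differences printed by this loop are positive.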

***

## Methodological Differences from Experiment 1

This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:

- **Explicit random seed documentation** for all fine-tuning runs
- **Fully documented fine-tuning procedure**: hyperparameters, batch size, learning rate, number of epochs, and hardware specifications
- **Best checkpoint selection criteria** explicitly defined using the validation set
- **Controlled evaluation procedure** with clearly specified evaluation script versions
- **Full documentation of execution conditions** (hardware, software versions, environment)
- **Explicit treatment of validity threats** including language-specific variability and training randomness
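
One possible way to implement the seed and environment documentation listed above is to write a small JSON record at the start of every run. This is a sketch under stated assumptions, not the procedure used in this experiment: the function name `start_run` and the record fields are hypothetical, and it only seeds Python's standard `random` module, whereas a real fine-tuning run would also need to seed the numerical and deep learning libraries it uses.

```python
import json
import platform
import random
import sys


def start_run(seed, hyperparameters, out_path):
    """Seed Python's RNG and write the run's settings and environment to JSON."""
    random.seed(seed)
    record = {
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

A call such as `start_run(42, {"learning_rate": 5e-5, "epochs": 10}, "run_config.json")` would produce one record per run; the values shown are placeholders, not the settings used for CCT5 or COME.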

***

## Important Notes

- The fine-tuning procedure for MCMD-NL is **not a reuse of existing checkpoints**: both models were trained from their pre-trained weights on the MCMD-NL training partition.
- The original paper does not clarify whether a single multilingual checkpoint or separate per-language checkpoints were trained for MCMD-NL; this ambiguity is addressed and documented in the thesis.
- MCMD-NL scores are generally **higher than MCMD scores** on all four metrics, which may reflect different commit style distributions in the new languages.

***

## References

- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
- Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
- He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
- Vegas & Elbaum (2023). *Pitfalls in Experiments with DNN4SE.* ESEC/FSE 2023.