Update README.md
README.md
CHANGED
|
@@ -1,27 +1,27 @@
|
|
| 1 |
-
# Experiment
|
| 2 |
|
| 3 |
-
This repository contains the artifacts for **Latin Square 1 of Experiment
|
| 4 |
|
| 5 |
***
|
| 6 |
|
| 7 |
## Models
|
| 8 |
|
| 9 |
### CCT5
|
| 10 |
-
CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights
|
| 11 |
|
| 12 |
-
-
|
| 13 |
-
- Pre-training
|
| 14 |
-
- For MCMD:
|
| 15 |
|
| 16 |
### COME
|
| 17 |
-
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN
|
| 18 |
-
-
|
| 19 |
-
- **
|
| 20 |
-
-
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
- For MCMD:
|
| 25 |
|
| 26 |
***
|
| 27 |
|
|
@@ -42,12 +42,12 @@ It does not perform additional large-scale pre-training on top of CodeT5. Releas
|
|
| 42 |
|
| 43 |
## Repository Structure
|
| 44 |
|
| 45 |
-
Each run folder corresponds to a **programming language** evaluated in this Latin Square
|
| 46 |
|
| 47 |
```
|
| 48 |
-
|
| 49 |
├── run_java/
|
| 50 |
-
│   ├── checkpoint/   # CCT5 and COME checkpoints
|
| 51 |
│   ├── predictions/  # Generated commit messages on MCMD Java test set
|
| 52 |
│   └── metrics/      # BLEU, METEOR, ROUGE-L, CIDEr scores
|
| 53 |
├── run_cpp/
|
|
@@ -69,13 +69,13 @@ experiment1_ls1/
|
|
| 69 |
```
|
| 70 |
|
| 71 |
### `checkpoint/`
|
| 72 |
-
Contains the model checkpoint files for
|
| 73 |
|
| 74 |
### `predictions/`
|
| 75 |
-
Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language
|
| 76 |
|
| 77 |
### `metrics/`
|
| 78 |
-
Contains the
|
| 79 |
|
| 80 |
***
|
| 81 |
|
|
@@ -84,11 +84,11 @@ Contains the computed evaluation metric scores for each model-language combinati
|
|
| 84 |
| Metric | Description |
|
| 85 |
|--------|-------------|
|
| 86 |
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
|
| 87 |
-
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching |
|
| 88 |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
|
| 89 |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
|
| 90 |
|
| 91 |
-
###
|
| 92 |
|
| 93 |
| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|
| 94 |
|----------|-------|------|--------|---------|-------|
|
|
@@ -105,14 +105,18 @@ Contains the computed evaluation metric scores for each model-language combinati
|
|
| 105 |
| **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** |
|
| 106 |
| **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** |
|
| 107 |
|
| 108 |
***
|
| 109 |
|
| 110 |
-
##
|
| 111 |
|
| 112 |
-
-
|
| 113 |
-
-
|
| 114 |
-
-
|
| 115 |
-
-
|
| 116 |
|
| 117 |
***
|
| 118 |
|
|
@@ -121,4 +125,5 @@ Contains the computed evaluation metric scores for each model-language combinati
|
|
| 121 |
- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
|
| 122 |
- Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
|
| 123 |
- He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
|
| 124 |
-
- Liu et al. (2020). *MCMD dataset.*
|
|
|
|
|
|
| 1 |
+
# Experiment 2 - Latin Square 1: CCT5 & COME on MCMD (Redesigned)
|
| 2 |
|
| 3 |
+
This repository contains the artifacts for **Latin Square 1 of Experiment 2**, the **redesigned and reimplemented experiment** that evaluates the DNN-based commit message generation baselines **CCT5** and **COME** on the **MCMD dataset**. It follows a more explicit and controlled evaluation protocol than the original study by Wu et al. (2025).
|
| 4 |
|
| 5 |
***
|
| 6 |
|
| 7 |
## Models
|
| 8 |
|
| 9 |
### CCT5
|
| 10 |
+
CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights and further pre-trained on **CodeChangeNet** (~40GB, 1.5M diff/commit pairs). Released at ESEC/FSE 2023.
|
| 11 |
|
| 12 |
+
- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
|
| 13 |
+
- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
|
| 14 |
+
- For MCMD: fine-tuned checkpoint from original CCT5 authors, trained on MCMD training set
|
| 15 |
|
| 16 |
### COME
|
| 17 |
+
COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
|
| 18 |
+
- **Modification embedding**: converts code changes into numerical vectors capturing code evolution
|
| 19 |
+
- **Fine-tuned CodeT5**: generates candidate commit messages from the embedded representation
|
| 20 |
+
- **SVM-based decision algorithm**: selects between the generated and retrieved candidate messages
|
| 21 |
|
| 22 |
+
COME was released at ISSTA 2023. It does not perform additional large-scale pre-training beyond CodeT5.
|
| 23 |
|
| 24 |
+
- For MCMD: language-specific checkpoints released by original COME authors (one per language)
|
| 25 |
|
| 26 |
***
|
| 27 |
|
|
|
|
| 42 |
|
| 43 |
## Repository Structure
|
| 44 |
|
| 45 |
+
Each run folder corresponds to a **programming language** evaluated in this Latin Square. Unlike Experiment 1, this experiment follows a more controlled protocol with explicit random seed documentation and multiple evaluation runs where applicable.
|
| 46 |
|
| 47 |
```
|
| 48 |
+
experiment2_ls1/
|
| 49 |
├── run_java/
|
| 50 |
+
│   ├── checkpoint/   # CCT5 and COME checkpoints for MCMD (Java)
|
| 51 |
│   ├── predictions/  # Generated commit messages on MCMD Java test set
|
| 52 |
│   └── metrics/      # BLEU, METEOR, ROUGE-L, CIDEr scores
|
| 53 |
├── run_cpp/
|
|
|
|
| 69 |
```
|
| 70 |
|
| 71 |
### `checkpoint/`
|
| 72 |
+
Contains the model checkpoint files used for evaluation. For MCMD, these are the fine-tuned checkpoints released by the original authors of CCT5 and COME, trained on the MCMD training set for the corresponding language.
|
| 73 |
|
| 74 |
### `predictions/`
|
| 75 |
+
Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
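
Because predictions are stored one per line and aligned to the references, a consumer can pair them by line index. The following is a minimal sketch of that pairing, not the repository's own loader; the function names and any file paths passed to it are hypothetical.

```python
def read_lines(path):
    """Return the lines of a UTF-8 text file without their trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_aligned(pred_path, ref_path):
    """Pair prediction i with reference i; fail loudly if the files differ in length."""
    preds = read_lines(pred_path)
    refs = read_lines(ref_path)
    if len(preds) != len(refs):
        raise ValueError(
            f"misaligned files: {len(preds)} predictions vs {len(refs)} references"
        )
    return list(zip(preds, refs))
```

Checking the line counts up front guards against a silently shifted alignment, which would otherwise distort every metric computed afterwards.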
|
| 76 |
|
| 77 |
### `metrics/`
|
| 78 |
+
Contains the evaluation metric scores computed by comparing the predictions against the MCMD test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.
|
| 79 |
|
| 80 |
***
|
| 81 |
|
|
|
|
| 84 |
| Metric | Description |
|
| 85 |
|--------|-------------|
|
| 86 |
| **BLEU** | Bilingual Evaluation Understudy: measures n-gram precision between generated and reference messages |
|
| 87 |
+
| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering: extends BLEU with recall, stemming, and synonym matching via WordNet |
|
| 88 |
| **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant): measures longest common subsequence overlap |
|
| 89 |
| **CIDEr** | Consensus-based Image Description Evaluation: TF-IDF-weighted n-gram similarity against reference messages |
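
To make the BLEU and ROUGE-L descriptions above concrete, here is a simplified, whitespace-tokenized sketch of their core quantities: clipped n-gram precision and an LCS-based F1. It is an illustration only. It omits BLEU's brevity penalty, smoothing, and geometric mean over n-gram orders, it uses a plain F1 where ROUGE-L implementations often weight recall, and it is not the evaluation script used in this experiment, so it will not reproduce the reported scores.

```python
from collections import Counter


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Unweighted F1 over LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `ngram_precision("fix null pointer bug", "fix null pointer exception")` credits three of the four candidate unigrams, and `rouge_l_f1` on the same pair finds a common subsequence of three tokens.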
|
| 90 |
|
| 91 |
+
### Reference Results (Original Paper: Wu et al., 2025)
|
| 92 |
|
| 93 |
| Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
|
| 94 |
|----------|-------|------|--------|---------|-------|
|
|
|
|
| 105 |
| **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** |
|
| 106 |
| **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** |
|
| 107 |
|
| 108 |
+
These values serve as the baseline reference for comparison with the results produced under the redesigned protocol.
|
| 109 |
+
|
| 110 |
***
|
| 111 |
|
| 112 |
+
## Methodological Differences from Experiment 1
|
| 113 |
+
|
| 114 |
+
This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:
|
| 115 |
|
| 116 |
+
- **Explicit random seed documentation** for all runs
|
| 117 |
+
- **Controlled evaluation procedure** with clearly specified script versions
|
| 118 |
+
- **Full documentation of execution conditions** (hardware, software versions, environment)
|
| 119 |
+
- **Explicit treatment of validity threats** at each stage of the evaluation
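
As an illustration of the first and third points above, a run can fix its seed and record its execution conditions before any evaluation starts. The sketch below uses only the Python standard library; the function name and the record's field names are hypothetical and not taken from this repository, and any framework-specific random number generators (for example NumPy or PyTorch) would need to be seeded separately.

```python
import json
import platform
import random
import sys


def start_run(seed):
    """Seed Python's RNG and return a record of the execution conditions."""
    random.seed(seed)
    return {
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    # Print the record so it can be stored alongside the run's outputs.
    print(json.dumps(start_run(seed=42), indent=2))
```

Storing such a record next to each run's predictions and metrics makes it possible to trace a reported score back to the seed and environment that produced it.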
|
| 120 |
|
| 121 |
***
|
| 122 |
|
|
|
|
| 125 |
- Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
|
| 126 |
- Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
|
| 127 |
- He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
|
| 128 |
+
- Liu et al. (2020). *MCMD dataset.*
|
| 129 |
+
- Vegas & Elbaum (2023). *Pitfalls in Experiments with DNN4SE.* ESEC/FSE 2023.
|