clouds125/TFM_EXP2_MCMD-NL_LS1

clouds125 committed 12 days ago

Commit 5977264 (verified) · 1 parent: 9d1d203

Update README.md

Files changed (1): README.md +31 -26

@@ -1,27 +1,27 @@
-# Experiment 1 – Latin Square 1: CCT5 & COME on MCMD
-This repository contains the artifacts for **Latin Square 1 of Experiment 1**, which corresponds to the **reproduction of the original experiment** by Wu et al. (2025) on the **MCMD dataset** using the DNN-based commit message generation baselines **CCT5** and **COME**.
 ***
 ## Models
 ### CCT5
-CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights. It is further specialized through pre-training on **CodeChangeNet**, a commit-diff dataset containing roughly 40GB of diff and commit message pairs (~1.5M pairs). It was released at ESEC/FSE 2023.
-- Base: `T5-base` → `CodeT5` → `CCT5`
-- Pre-training data: CodeChangeNet (40GB, 1.5M diff/commit pairs)
-- For MCMD: reused released checkpoint fine-tuned on MCMD by original authors
 ### COME
-COME (Commit Message Generation with Modification Embedding) is a hybrid DNN approach that combines:
-- A **fine-tuned CodeT5** component for natural language generation
-- **Modification embedding** to represent code changes as numerical vectors
-- An **SVM-based decision algorithm** to select between generated and retrieved candidate messages
-It does not perform additional large-scale pre-training on top of CodeT5. Released at ISSTA 2023.
-- For MCMD: reused language-specific checkpoints released by original COME authors (one per language)
 ***
@@ -42,12 +42,12 @@ It does not perform additional large-scale pre-training on top of CodeT5. Releas
 ## Repository Structure
-Each run folder corresponds to a **programming language** evaluated in this Latin Square:
 ```
-experiment1_ls1/
 ├── run_java/
-│   ├── checkpoint/          # CCT5 and COME checkpoints fine-tuned on MCMD (Java)
 │   ├── predictions/         # Generated commit messages on MCMD Java test set
 │   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
 ├── run_cpp/
@@ -69,13 +69,13 @@ experiment1_ls1/
 ```
 ### `checkpoint/`
-Contains the model checkpoint files for CCT5 and COME reused from the original authors' repositories, fine-tuned on the MCMD training set for the corresponding language.
 ### `predictions/`
-Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language, stored as `.txt` files with one prediction per line aligned to the reference messages.
 ### `metrics/`
-Contains the computed evaluation metric scores for each model-language combination. Metrics are calculated by comparing predictions against the reference messages in the MCMD test set.
 ***
@@ -84,11 +84,11 @@ Contains the computed evaluation metric scores for each model-language combinati
 | Metric | Description |
 |--------|-------------|
 | **BLEU** | Bilingual Evaluation Understudy — measures n-gram precision between generated and reference messages |
-| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering — extends BLEU with recall, stemming, and synonym matching |
 | **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant) — measures longest common subsequence overlap |
 | **CIDEr** | Consensus-based Image Description Evaluation — TF-IDF-weighted n-gram similarity against reference messages |
-### Reported Results (Original Paper – Wu et al., 2025)
 | Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
 |----------|-------|------|--------|---------|-------|
@@ -105,14 +105,18 @@ Contains the computed evaluation metric scores for each model-language combinati
 | **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** |
 | **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** |
 ***
-## Notes
-- Checkpoints were **reused** from the original authors' repositories; no retraining was performed for this dataset.
-- No random seeds or repeated runs were documented in the original experiment.
-- Results in this repository correspond to the **reproduction attempt** of the original reported values.
-- Discrepancies between reproduced and reported results are documented in the thesis.
 ***
@@ -121,4 +125,5 @@ Contains the computed evaluation metric scores for each model-language combinati
 - Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
 - Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
 - He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
-- Liu et al. (2020). *MCMD dataset.*

+# Experiment 2 – Latin Square 1: CCT5 & COME on MCMD (Redesigned)
+This repository contains the artifacts for **Latin Square 1 of Experiment 2**, which corresponds to the **redesigned and reimplemented experiment** evaluated on the **MCMD dataset** using the DNN-based commit message generation baselines **CCT5** and **COME**. This experiment was conducted under a more explicit and controlled evaluation protocol than the original study by Wu et al. (2025).
 ***
 ## Models
 ### CCT5
+CCT5 is a code-change-oriented pre-trained model built on top of the **T5 architecture**, initialized from **CodeT5** weights and further pre-trained on **CodeChangeNet** (~40GB, 1.5M diff/commit pairs). Released at ESEC/FSE 2023.
+- Architecture: Encoder-decoder Transformer (`T5-base` → `CodeT5` → `CCT5`)
+- Pre-training corpus: CodeChangeNet (code diffs paired with commit messages)
+- For MCMD: fine-tuned checkpoint from original CCT5 authors, trained on MCMD training set
 ### COME
+COME (Commit Message Generation with Modification Embedding) is a hybrid DNN system built on top of CodeT5 with three core components:
+- **Modification embedding**: converts code changes into numerical vectors capturing code evolution
+- **Fine-tuned CodeT5**: generates candidate commit messages from the embedded representation
+- **SVM-based decision algorithm**: selects between the generated and retrieved candidate messages
+Released at ISSTA 2023. Does not include additional large-scale pre-training beyond CodeT5.
+- For MCMD: language-specific checkpoints released by original COME authors (one per language)
 ***
 ## Repository Structure
+Each run folder corresponds to a **programming language** evaluated in this Latin Square. Unlike Experiment 1, this experiment follows a more controlled protocol with explicit random seed documentation and multiple evaluation runs where applicable.
 ```
+experiment2_ls1/
 ├── run_java/
+│   ├── checkpoint/          # CCT5 and COME checkpoints for MCMD (Java)
 │   ├── predictions/         # Generated commit messages on MCMD Java test set
 │   └── metrics/             # BLEU, METEOR, ROUGE-L, CIDEr scores
 ├── run_cpp/
 ```
 ### `checkpoint/`
+Contains the model checkpoint files used for evaluation. For MCMD, these are the fine-tuned checkpoints released by the original authors of CCT5 and COME, trained on the MCMD training set for the corresponding language.
 ### `predictions/`
+Contains the generated commit messages produced by each model on the MCMD test set for the corresponding language. Files are stored as `.txt` with one prediction per line, aligned to the reference messages in the test set.
 ### `metrics/`
+Contains the evaluation metric scores computed by comparing the predictions against the MCMD test set reference messages. Each file records BLEU, METEOR, ROUGE-L, and CIDEr scores per model and language under the redesigned evaluation protocol.
 ***
 | Metric | Description |
 |--------|-------------|
 | **BLEU** | Bilingual Evaluation Understudy — measures n-gram precision between generated and reference messages |
+| **METEOR** | Metric for Evaluation of Translation with Explicit Ordering — extends BLEU with recall, stemming, and synonym matching via WordNet |
 | **ROUGE-L** | Recall-Oriented Understudy for Gisting Evaluation (LCS variant) — measures longest common subsequence overlap |
 | **CIDEr** | Consensus-based Image Description Evaluation — TF-IDF-weighted n-gram similarity against reference messages |
+### Reference Results (Original Paper – Wu et al., 2025)
 | Language | Model | BLEU | METEOR | ROUGE-L | CIDEr |
 |----------|-------|------|--------|---------|-------|
 | **Average** | **CCT5** | **15.96** | **14.26** | **24.33** | **0.95** |
 | **Average** | **COME** | **25.07** | **21.48** | **31.97** | **1.70** |
+These values serve as the baseline reference for comparison with the results produced under the redesigned protocol.
 ***
+## Methodological Differences from Experiment 1
+This experiment was redesigned to address the validity and reproducibility concerns identified during the Experiment 1 reproduction phase:
+- **Explicit random seed documentation** for all runs
+- **Controlled evaluation procedure** with clearly specified script versions
+- **Full documentation of execution conditions** (hardware, software versions, environment)
+- **Explicit treatment of validity threats** at each stage of the evaluation
 ***
 - Wu et al. (2025). *An Empirical Study on Commit Message Generation with Large Language Models via In-Context Learning.* arXiv:2502.18904.
 - Lin et al. (2023). *CCT5: A Code-Change-Oriented Pre-Trained Model.* ESEC/FSE 2023.
 - He et al. (2023). *COME: Commit Message Generation with Modification Embedding.* ISSTA 2023.
+- Liu et al. (2020). *MCMD dataset.*
+- Vegas & Elbaum (2023). *Pitfalls in Experiments with DNN4SE.* ESEC/FSE 2023.
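
The updated README describes `predictions/` files as one prediction per line, aligned to the reference messages, and describes ROUGE-L as longest-common-subsequence overlap. The sketch below illustrates that idea on in-memory lists. It is not the evaluation script used in this repository: the function names are hypothetical, tokenization is plain whitespace splitting, and the score is a simple F1 over the LCS, whereas published ROUGE-L implementations may weight recall differently.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall.

    Assumption: whitespace tokenization and an unweighted F1.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_aligned(predictions: List[str], references: List[str]) -> List[float]:
    """Score line-aligned predictions against references, one score per line."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same number of lines")
    return [rouge_l_f1(p, r) for p, r in zip(predictions, references)]


if __name__ == "__main__":
    # Illustrative strings only, not taken from MCMD.
    preds = ["fix null pointer bug", "update readme"]
    refs = ["fix pointer bug in parser", "update readme"]
    print(score_aligned(preds, refs))
```

Checking the line counts before scoring matters because a missing or extra line in a predictions file would otherwise silently pair every later prediction with the wrong reference.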