unknown committed
Commit 33264ad · 1 parent: 3708a78 · Initial

Changed files:
- README.md (+49, −32)
- Scripts/Exp/Acc/{calculate_accuracy.py → gen_accuracy.py} (renamed, no content changes)
- Scripts/Exp/Acc/{calculate_purple.py → gen_purple.py} (renamed, no content changes)
- Scripts/Exp/Correction/{calculate_correction.py → gen_correct.py} (renamed, no content changes)
- Scripts/Exp/ForkFlow/{calculate_forkflow.py → gen_forkflow.py} (renamed, no content changes)
- Scripts/Exp/Perf/{calculate_perf.py → gen_perf.py} (renamed, no content changes)
- Scripts/Exp/Time/{calculate_time.py → gen_time.py} (renamed, no content changes)

README.md (as changed by this commit):
[...]

## 2. Hardware Dependency

- 8 Nvidia Tesla V100 GPUs, each with 16 GB memory

## 3. Software Dependency

- CUDA == 11.4
- python version == 3.8.1
- Conda (any version that supports the installation of Python 3.8.1)

## 4. Installation

- Download the artifact from https://huggingface.co/docz-ict/VEGA_AE.

```
$ git lfs clone https://huggingface.co/docz-ict/VEGA_AE
$ cd VEGA_AE
```

- Install Python (version 3.8.1) in a Conda environment.

```
$ conda create -n vega_ae python=3.8.1
$ conda activate vega_ae
$ pip install -r requirements.txt
```

## 5. Code Generation

We have provided a fine-tuned model using data from ```./dataset/train.jsonl``` and ```./dataset/valid.jsonl``` in ```./models/FT_Model```. The ```train.jsonl``` and ```valid.jsonl``` files contain function templates, feature vectors and ground truth for 98 backends in our dataset.
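The dataset files are in JSON Lines format (one JSON object per line). A minimal loader sketch for such files; the helper name is ours and any field names you access on the records depend on the artifact's actual schema:

```python
import json

def load_jsonl(path: str) -> list:
    """Read a .jsonl file: one JSON object per line, blank lines skipped."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# e.g. records = load_jsonl("./dataset/train.jsonl")
```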
[...]

Note that if a ```./models/FT_Model/result.jsonl``` file already exists, it will be **overwritten** after the execution of ```run_function_test.sh``` or ```run_test.sh```.

## 6. Fine-Tuning (**Optional**)

We provide the original UnixCoder-base-nine in ```./models/UnixCoder```. The original UnixCoder-base-nine can also be downloaded from HuggingFace: https://huggingface.co/microsoft/unixcoder-base-nine.
[...]

The fine-tuned model will be saved in ```--output_dir```.

## 7. Reproducing Results in the Experiment

We provide the scripts to reproduce each Figure/Table from the paper, along with the corresponding output result files, in the following table:

| Script | Description | Output | Figure/Table |
| --- | --- | --- | --- |
| ./Scripts/Exp/Time/gen_time.py | Calculate the time overhead. | ./Scripts/Exp/Time/Fig7.csv | Fig. 7 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the function-level accuracy. | ./Scripts/Exp/Acc/Fig8_Acc.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_purple.py | Calculate the percentage of functions accurately synthesized from the statements of various existing targets (purple bar in Fig. 8). | ./Scripts/Exp/Acc/Fig8_Purple.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the percentage of the three types of errors. | ./Scripts/Exp/Acc/Table2.csv | Table 2 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the statement-level accuracy of VEGA and ForkFlow. | ./Scripts/Exp/ForkFlow/Fig9.csv | Fig. 9 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the number of accurate statements of VEGA. | ./Scripts/Exp/ForkFlow/Table3.csv | Table 3 |
| ./Scripts/Exp/Correction/gen_correct.py | Calculate the time required by two developers to modify the VEGA-generated RISC-V backend. | ./Scripts/Exp/Correction/Table4.csv | Table 4 |
| ./Scripts/Exp/Perf/gen_perf.py | Calculate the speedup of LLVM-Base (-O3) and LLVM-VEGA (-O3) over LLVM-Base (-O0). | ./Scripts/Exp/Perf/Fig10.csv | Fig. 10 |

### 7.1 Results for Fig. 7

In the code generation process, we set a batch size of 256 on 8 Nvidia Tesla V100 GPUs (each with 16 GB memory), meaning each batch contains 256 statements. Since each batch may include statements from different function modules, we did not directly measure the generation time for each function module of the three targets (RISC-V, RI5CY, xCORE) during execution. Instead, we calculated the average inference time of each batch (25 seconds) and then derived the inference time of each statement (25/256 seconds). With the total number of statements within each function module of each target, we then calculated the total inference time required for each function module of each target.
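The derivation above can be restated in a few lines. Only the 25-second batch time and 256-statement batch size come from the text; the module names and statement counts below are made up for illustration:

```python
# Per-module inference time as derived in the text:
# time(module) = num_statements * (batch_time / batch_size)
BATCH_TIME_S = 25.0  # average inference time per batch (from the text)
BATCH_SIZE = 256     # statements per batch (from the text)

def module_time_seconds(num_statements: int) -> float:
    """Total inference time attributed to one function module."""
    return num_statements * BATCH_TIME_S / BATCH_SIZE

# Hypothetical statement counts, purely for illustration.
counts = {"ISelLowering": 2048, "RegisterInfo": 512}
times = {module: module_time_seconds(n) for module, n in counts.items()}
```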

- Command:
```
$ python ./Scripts/Exp/Time/gen_time.py
```

- Results:
```
$ cat ./Scripts/Exp/Time/Fig7.csv
```

### 7.2 Results for Fig. 8

In our experiment, we employed the Pass@1 evaluation metric, which involves replacing each VEGA-generated function individually within the official LLVM (LLVM-Base), then running regression tests to verify the correctness of the replaced function. This process is highly time-intensive, as a single regression test run generally takes about half an hour. Thus, sequentially testing all 1,454 VEGA-generated functions across three targets would require approximately 727 hours.
[...]

- Command:
```
$ cp ./models/FT_Model/result.jsonl ./Scripts/Exp/Acc
$ python ./Scripts/Exp/Acc/gen_accuracy.py
```

This script automatically analyzes VEGA's output in ```result.jsonl``` and compares the generated code and confidence scores with the ground truth. Based on this comparison, it determines whether each function is correct.
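The function-level Exact Match rule can be sketched as follows. This is our own illustration of the idea (a function counts as correct only if every generated statement matches the ground truth), not the actual logic of ```gen_accuracy.py```:

```python
def function_correct(generated: list, ground_truth: list) -> bool:
    """Exact Match: a function is correct only if every generated
    statement exactly matches its ground-truth counterpart."""
    if len(generated) != len(ground_truth):
        return False
    return all(g.strip() == t.strip() for g, t in zip(generated, ground_truth))

def function_accuracy(pairs: list) -> float:
    """Fraction of (generated, ground_truth) function pairs that match."""
    if not pairs:
        return 0.0
    return sum(function_correct(g, t) for g, t in pairs) / len(pairs)
```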
[...]

- Command:
```
$ python ./Scripts/Exp/Acc/gen_purple.py
```

- Results:
```
$ cat ./Scripts/Exp/Acc/Fig8_Purple.csv
```

### 7.3 Results for Table 2

Executing the script in 7.2 will also yield the proportion of the three types of errors for each target.

- Command:
```
$ python ./Scripts/Exp/Acc/gen_accuracy.py
```

- Results:
```
$ cat ./Scripts/Exp/Acc/Table2.csv
```

### 7.4 Results for Fig. 9

We modified the functions generated by VEGA and functions in the MIPS backend (ForkFlow) to ensure they can correctly run on the RISC-V, RI5CY, and xCORE backends respectively. We have reserved function code for the MIPS backend in the ```./Scripts/Exp/ForkFlow/Mips_Code``` directory, along with manually modified code for the RISC-V, RI5CY, and xCORE LLVM backends in ```./Scripts/Exp/ForkFlow/Std_Code```. Additionally, the script in 7.2 will automatically write the VEGA-generated code from ```result.jsonl``` into the ```./Scripts/Exp/ForkFlow/VEGA_Code``` directory for comparison. By executing the following script, the proportion of accurate and modified statements of the VEGA-generated functions and ForkFlow processes will be automatically calculated.
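Counting accurate versus modified statements amounts to aligning each generated function against its manually corrected version. A sketch of that idea using the standard library's `difflib`; this is our approximation, not the actual ```gen_forkflow.py``` logic:

```python
import difflib

def statement_stats(candidate: list, reference: list) -> tuple:
    """Align candidate statements against the corrected reference and
    count statements kept verbatim vs. statements needing modification."""
    sm = difflib.SequenceMatcher(a=candidate, b=reference, autojunk=False)
    accurate = sum(size for _, _, size in sm.get_matching_blocks())
    modified = len(candidate) - accurate
    return accurate, modified
```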

- Command:
```
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
```

- Results:
```
$ cat ./Scripts/Exp/ForkFlow/Fig9.csv
```

### 7.5 Results for Table 3

Executing the script in 7.4 will also output the number of statements accurately generated and requiring manual correction by VEGA across seven function modules for RISC-V, RI5CY, and xCORE.

- Command:
```
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
```

- Results:
```
$ cat ./Scripts/Exp/ForkFlow/Table3.csv
```

### 7.6 Results for Table 4

The data in Table 4 show the time two developers needed to modify the VEGA-generated RISC-V backend. As this is a human-based experiment, only the modification times recorded for each function are reported.
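The per-developer totals are a straightforward sum of the recorded per-function times. A sketch with made-up numbers; the function names and minutes are hypothetical, not data from the artifact:

```python
def total_minutes(times_by_function: dict) -> float:
    """Sum one developer's recorded per-function modification times."""
    return sum(times_by_function.values())

# Hypothetical recordings (minutes), for illustration only.
dev_a = {"getPointerRegClass": 12.0, "copyPhysReg": 30.5}
dev_b = {"getPointerRegClass": 9.0, "copyPhysReg": 41.0}
totals = {"A": total_minutes(dev_a), "B": total_minutes(dev_b)}
```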
[...]

- Command:
```
$ python ./Scripts/Exp/Correction/gen_correct.py
```

- Results:
```
$ cat ./Scripts/Exp/Correction/Table4.csv
```

### 7.7 Results for Fig. 10

Due to commercial licensing restrictions, we cannot provide the source code for the SPEC 2017 CPU benchmark used in this experiment. Additionally, testing all benchmarks including SPEC 2017 CPU is time-intensive, requiring around 565 hours in total. To address these constraints, we provide our recorded experimental data.
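Speedup here is a ratio of wall-clock times against the LLVM-Base (-O0) baseline, as in the table in Section 7. A sketch with hypothetical timings (the artifact's script reads the recorded experimental data instead):

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup of an optimized build over the -O0 baseline."""
    return baseline_s / optimized_s

# Hypothetical benchmark timings in seconds, illustration only.
o0_time, base_o3_time, vega_o3_time = 100.0, 40.0, 41.0
fig10_row = {
    "LLVM-Base (-O3)": speedup(o0_time, base_o3_time),
    "LLVM-VEGA (-O3)": speedup(o0_time, vega_o3_time),
}
```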
[...]

- Command:
```
$ python ./Scripts/Exp/Perf/gen_perf.py
```

- Results:
```
$ cat ./Scripts/Exp/Perf/Fig10.csv
```

## 8. Experiment Customization

Users can run this experiment in different environments, but they must ensure that the PyTorch version is compatible with the CUDA version in those environments. The experiment can also be conducted on different hardware, but the batch sizes for fine-tuning and inference must be adjusted to the available GPU memory. We have fixed the random seed and parameters in the provided scripts to ensure consistent code generation accuracy within the same hardware and software environment. However, when the experiment is executed in a different hardware or software environment, the accuracy may fluctuate slightly.
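Seed fixing of the kind described above can be sketched as follows. This uses only the standard library; as the comment notes, scripts like these typically also seed numpy and torch, but we have not inspected the artifact's exact seeding code:

```python
import random

def set_seed(seed: int) -> None:
    """Fix the Python RNG for reproducibility. Training scripts of this
    kind usually also call np.random.seed(seed), torch.manual_seed(seed),
    and torch.cuda.manual_seed_all(seed)."""
    random.seed(seed)

# Same seed -> identical sample sequence.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
```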