Commit 7a9df24 (parent 33264ad): "Initial", updating README.md.
- 8 Nvidia Tesla V100 GPUs, each with 16 GB memory

## 3. Software Dependency

- CUDA == 11.7
- Python == 3.8.1
- Conda (any version that supports the installation of Python 3.8.1)
## 5. Code Generation

We have provided a fine-tuned model in ```./models/FT_Model```, which is fine-tuned with ```./dataset/train.jsonl``` and ```./dataset/valid.jsonl```. The ```train.jsonl``` and ```valid.jsonl``` files contain function templates, feature vectors, and ground truth for 98 backends (excluding RISC-V, RI5CY, and xCORE) in our dataset.

We have also provided a script for a functionality test, which only generates a single function for RI5CY (recorded as PULP in our dataset), taking less than 3 minutes with 8 Nvidia Tesla V100 GPUs.

- **Run functionality test with:**
Check the generated code with:

```
$ cat ./models/FT_Model/result.jsonl
```
In the `result.jsonl` file, the meaning of each item in an entry can be found in the following table:

| Item | Description |
| --------- | ------------------------------------------------ |
| vega_code | The model-generated code. |
| ans_code | The ground truth of the code. |
| vega_pre | The model-generated confidence score. |
| ans_pre | The ground truth of the confidence score. |
| File | The file to which this item belongs. |
| Function | The function to which this item belongs. |
| Module | The function module to which this item belongs. |
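As a quick illustration of these fields, a single line of `result.jsonl` can be parsed as follows (the values below are made up for the example, not taken from the actual file):

```python
import json

# One illustrative entry with the fields described in the table above.
line = json.dumps({
    "vega_code": "return 4;",
    "ans_code": "return 4;",
    "vega_pre": 0.97,
    "ans_pre": 1.0,
    "File": "RISCVInstrInfo.cpp",
    "Function": "getInstSizeInBytes",
    "Module": "InstrInfo",
})

entry = json.loads(line)
# Compare the generated code against the ground truth for this statement.
is_exact_match = entry["vega_code"] == entry["ans_code"]
```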
```
$ bash run_test.sh
```

Customize parameters for code generation by modifying the following options in ```run_test.sh```:

```
--model_name_or_path ../../models/UnixCoder \
--test_filename ../../dataset/test.jsonl \
```
We provide the scripts to reproduce each Figure/Table from the paper:

| Script | Description | Output | Figure/Table |
| ------ | ----------- | ------ | ------------ |
| ./Scripts/Exp/Time/gen_time.py | Calculate the time overhead for VEGA to generate three backends. | ./Scripts/Exp/Time/Fig7.csv | Fig. 7 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the function-level accuracy of the three VEGA-generated backends. | ./Scripts/Exp/Acc/Fig8_Acc.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_purple.py | Calculate the results of the purple bar in Fig. 8. | ./Scripts/Exp/Acc/Fig8_Purple.csv | Fig. 8 |
| ./Scripts/Exp/Acc/gen_accuracy.py | Calculate the percentage of three types of errors in the three VEGA-generated backends. | ./Scripts/Exp/Acc/Table2.csv | Table 2 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the statement-level accuracy of VEGA-generated and ForkFlow-generated backends. | ./Scripts/Exp/ForkFlow/Fig9.csv | Fig. 9 |
| ./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the number of statements accurately generated and requiring manual correction by VEGA for the three backends. | ./Scripts/Exp/ForkFlow/Table3.csv | Table 3 |
| ./Scripts/Exp/Correction/gen_correct.py | Calculate the time required by two developers to modify the VEGA-generated RISC-V backend. | ./Scripts/Exp/Correction/Table4.csv | Table 4 |
| ./Scripts/Exp/Perf/gen_perf.py | Calculate the speedup of LLVM-Base (-O3) and LLVM-VEGA (-O3) over LLVM-Base (-O0) on three benchmarks. | ./Scripts/Exp/Perf/Fig10.csv | Fig. 10 |
### 7.1 Results for Fig. 7

In the code generation process, we set a batch size of 256 on 8 Nvidia Tesla V100 GPUs (each with 16 GB memory), meaning each batch contains 256 statements. Since each batch may include statements from different function modules, we did not directly measure the generation time for each function module of the three targets (RISC-V, RI5CY, xCORE) during execution. Instead, we calculated the average inference time of each batch (25 seconds) and then derived the inference time of each statement (25/256 seconds). With the total number of statements within each function module of each target, we subsequently calculated the total inference time required for each function module of each target.
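This derivation is simple arithmetic; the sketch below reproduces it (the module names and statement counts are hypothetical placeholders, not our dataset's real figures):

```python
# Average measured inference time per batch, and statements per batch.
BATCH_TIME_S = 25.0
BATCH_SIZE = 256
PER_STATEMENT_S = BATCH_TIME_S / BATCH_SIZE  # seconds per statement

def module_inference_time(num_statements: int) -> float:
    """Total inference time (seconds) for one function module."""
    return num_statements * PER_STATEMENT_S

# Hypothetical statement counts per function module, for illustration only.
example_counts = {"InstrInfo": 1024, "ISelLowering": 2048}
example_times = {m: module_inference_time(n) for m, n in example_counts.items()}
```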
### 7.2 Results for Fig. 8

In our experiment, we employed the Pass@1 evaluation metric, which involves replacing each VEGA-generated function individually within the official LLVM (LLVM-Base) and then running regression tests to verify the correctness of the replaced function. This process is highly time-consuming, as a single regression test run generally takes about half an hour. Thus, sequentially testing all 1,454 VEGA-generated functions across three targets would require approximately 727 hours.

To simplify this process, we recorded the ground truth for each statement based on the Pass@1 experiment results. Additionally, we documented a list of functions containing Err-Def errors (i.e., errors due to missing necessary statements in the function template; functions with Err-Def cannot pass all regression tests). This allowed us to transform the Pass@1 testing process into an Exact Match evaluation.
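A minimal sketch of this Exact Match reduction, using the field names from `result.jsonl` (the Err-Def handling shown is our simplified reading of the evaluation, not the script's exact code):

```python
import json

def exact_match_accuracy(result_path, err_def_functions):
    """Statement-level Exact Match accuracy: a statement counts as correct
    only if its generated code equals the ground truth and its enclosing
    function is not on the Err-Def list."""
    total = correct = 0
    with open(result_path) as f:
        for raw in f:
            entry = json.loads(raw)
            total += 1
            if entry["Function"] in err_def_functions:
                continue  # Err-Def functions cannot pass regression tests
            if entry["vega_code"].strip() == entry["ans_code"].strip():
                correct += 1
    return correct / total if total else 0.0
```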
### 7.4 Results for Fig. 9

We modified the functions generated by VEGA and the functions in the MIPS backend (ForkFlow) to ensure they can correctly run on the RISC-V, RI5CY, and xCORE backends respectively. We have reserved the function code for the MIPS backend in the ```./Scripts/Exp/ForkFlow/Mips_Code``` directory, along with manually fixed code for the RISC-V, RI5CY, and xCORE LLVM backends in ```./Scripts/Exp/ForkFlow/Std_Code```. Additionally, the script in 7.2 will automatically write the VEGA-generated code from ```result.jsonl``` into the ```./Scripts/Exp/ForkFlow/VEGA_Code``` directory for comparison. By executing the following script, the proportion of accurate and modified statements of the VEGA-generated functions and ForkFlow processes will be automatically calculated.
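The comparison between the ```VEGA_Code``` and ```Std_Code``` directories boils down to counting which generated lines survive manual fixing. A simplified, hypothetical stand-in for that statement-level comparison:

```python
import difflib

def accurate_line_ratio(generated: str, fixed: str) -> float:
    """Share of generated lines that appear unchanged in the manually fixed
    version, computed with a line-level diff (a simplification of the
    script's statement-level accounting)."""
    gen_lines = generated.splitlines()
    fix_lines = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=gen_lines, b=fix_lines)
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return unchanged / len(gen_lines) if gen_lines else 0.0
```

For example, for a three-line function where one line needed fixing, the ratio is 2/3.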

- Command:
### 7.6 Results for Table 4

The data in Table 4 show the time two developers needed to modify the VEGA-generated RISC-V backend. As a human-based experiment, only the recorded modification times for each function are provided.

The following script computes the total time spent by Developers A and B to modify each **function module** in the VEGA-generated RISC-V backend, based on the recorded times for each **function**.
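The aggregation is a grouped sum; a sketch with hypothetical developer/module/time records (the real recorded times live in the repository data, not here):

```python
from collections import defaultdict

def module_totals(records):
    """Sum per-function modification times (in minutes) into per-module
    totals for each developer. `records` holds
    (developer, module, function, minutes) tuples."""
    totals = defaultdict(float)
    for developer, module, _function, minutes in records:
        totals[(developer, module)] += minutes
    return dict(totals)

# Hypothetical records, for illustration only.
records = [
    ("A", "InstrInfo", "copyPhysReg", 12.0),
    ("A", "InstrInfo", "storeRegToStackSlot", 8.0),
    ("B", "ISelLowering", "LowerCall", 30.0),
]
totals = module_totals(records)
```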

Due to commercial licensing restrictions, we cannot provide the source code for the SPEC 2017 CPU benchmark used in this experiment. Additionally, testing all benchmarks, including SPEC 2017 CPU, is time-intensive, requiring around 565 hours in total. To address these constraints, we provide our recorded experimental data.

Running the following script will automatically calculate the speedup of the VEGA-generated LLVM backend (LLVM-VEGA) with the "-O3" optimization over the performance of the official LLVM backend (LLVM-Base) with "-O0", as well as the speedup of LLVM-Base with "-O3" over its own performance with "-O0".
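Both speedups are plain ratios of recorded run times; a sketch with placeholder timings (the numbers below are invented, the real measurements are in the provided data):

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup of an optimized configuration over a baseline run."""
    return baseline_seconds / optimized_seconds

# Invented run times for one benchmark, illustration only.
t_base_o0 = 120.0  # LLVM-Base, -O0 (common baseline)
t_base_o3 = 40.0   # LLVM-Base, -O3
t_vega_o3 = 48.0   # LLVM-VEGA, -O3

base_speedup = speedup(t_base_o0, t_base_o3)  # 3.0
vega_speedup = speedup(t_base_o0, t_vega_o3)  # 2.5
```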

- Command:
## 8. Experiment Customization

Users can run this experiment in different software environments, but they must ensure that the PyTorch version is compatible with the CUDA version in those environments. The experiment can also be conducted in different hardware environments, but adjustments to the batch size for fine-tuning and inference are necessary based on the available GPU memory. We have fixed the random seed and parameters in the provided scripts to ensure consistent code generation accuracy within the same hardware and software environment. However, when the experiment is executed in different hardware or software environments, the accuracy may experience some fluctuations.