Commit 7e94263 · Parent(s): d99a73a

Update README to describe realworld set and training command

Files changed:
- README.md +44 -0
- real_world_dataset/generate.sh +2 -2
- real_world_dataset/preprocess.sh +0 -21

README.md (CHANGED)
````diff
@@ -29,6 +29,20 @@ The tokenized files are also preprocessed using `fairseq-preprocess`.
 Extract only `tokenized.zip` if you just want to use the synthetic data to train new models.
 Extract the `dataset.zip` if you want to tokenize in a different way or want to modify the data before processing.
 
+## Using the real world dataset
+
+Follow these steps to compile and evaluate the real-world dataset:
+
+1. Run `make.sh` to compile the source files into ELF and assembly files
+2. Run `python3 collect_dataset.py` to disassemble the ELF functions for REMEND processing
+3. Run `generate.sh` to run REMEND, generate equations, and evaluate them for correctness
+
+The dataset will be present in `dataset/<arch>.[eqn,asm]`.
+The results will be present in `generated/base/<arch>_res_<beamsize>.txt`.
+
+The folder `real_world_dataset/related_evals` contains scripts to evaluate the related works BTC, SLaDE, and Nova.
+Each of the related works needs to be set up before evaluating; see each script for further instructions.
+
 ## Replicate the synthetic dataset
 
 Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
````
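The section added in this hunk describes a three-step pipeline. As a minimal sketch of an end-to-end run, assuming the scripts are invoked from `real_world_dataset/` and take no arguments (neither is confirmed by this commit):

```bash
#!/bin/bash
# Hypothetical end-to-end run of the real world dataset pipeline added above;
# the working directory and argument-free invocation are assumptions.
set -e
cd real_world_dataset

./make.sh                   # 1. compile source files into ELF and assembly files
python3 collect_dataset.py  # 2. disassemble the ELF functions for REMEND processing
./generate.sh               # 3. run REMEND, generate equations, evaluate correctness

ls dataset/                 # per the README: <arch>.eqn / <arch>.asm
ls generated/base/          # per the README: <arch>_res_<beamsize>.txt
```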
````diff
@@ -38,3 +52,33 @@ Following are the steps to recreate the dataset present in `dataset.zip` and `tokenized.zip`
 3. Compile the equations and disassemble them via the REMEND disassembler: `./compile.sh`
 4. Combine the compiled equations and assembly files, remove duplicates, and split: `./combine.sh`
 5. Tokenize the split assembly files: `./tokenize.sh`
+
+## Training command
+
+The [fairseq](https://github.com/facebookresearch/fairseq) library is used for training the Transformer.
+The following command, with different parameters, is used to train each model:
+
+
+```
+fairseq-train <tokenized-dataset> --task translation --arch transformer \
+    --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
+    --max-source-positions 1024 --max-target-positions 1024 \
+    --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
+    --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
+    --max-tokens 20000 --max-update 100000 \
+    --no-epoch-checkpoints --keep-best-checkpoints 3 \
+    --save-dir <save-dir> --log-file <save-dir>/training.log
+```
+
+The following command runs a trained model and generates translations:
+
+```
+fairseq-generate <tokenized-dataset> --task translation --arch transformer \
+    --max-source-positions 1024 --max-target-positions 1024 \
+    --path <checkpoint> --results-path <out-dir> --gen-subset <train/test/valid> --beam 1
+```
+
+## Ablations
+
+The `ablations` folder contains the data for the three ablations presented in the paper: REMEND without constant identification, REMEND with equations in postfix, and REMEND trained with REMaQE data included.
+
````
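For illustration, a concrete run might fill the placeholders as follows; the paths `tokenized/x64`, `trained_x64_base`, and `results_x64` are assumed names chosen for this sketch, not values taken from the repository (`checkpoint_best.pt` is fairseq's default best-checkpoint name, also used by `preprocess.sh` below):

```bash
# Hypothetical instantiation of the training and generation commands above;
# all paths are illustrative assumptions.
TOK=tokenized/x64     # output of fairseq-preprocess
OUT=trained_x64_base  # checkpoint directory

fairseq-train ${TOK} --task translation --arch transformer \
    --optimizer adam --weight-decay 0.001 --lr 0.0005 --lr-scheduler inverse_sqrt \
    --max-source-positions 1024 --max-target-positions 1024 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 --encoder-embed-dim 384 --decoder-embed-dim 128 \
    --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 512 --decoder-output-dim 128 --dropout 0.05 \
    --max-tokens 20000 --max-update 100000 \
    --no-epoch-checkpoints --keep-best-checkpoints 3 \
    --save-dir ${OUT} --log-file ${OUT}/training.log

# Translate the test split with the best checkpoint:
fairseq-generate ${TOK} --task translation --arch transformer \
    --max-source-positions 1024 --max-target-positions 1024 \
    --path ${OUT}/checkpoint_best.pt --results-path results_x64 --gen-subset test --beam 1
```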
real_world_dataset/generate.sh (CHANGED)

```diff
@@ -1,8 +1,8 @@
 #!/bin/bash
 
 ARCHS=( arm32 aarch64 x64 )
-TOKENIZERS
-MODELS
+TOKENIZERS=../tokenized
+MODELS=../models
 MODEL=base
 DS=dataset
 GEN=generated/${MODEL}
```
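This hunk repairs the two broken assignments with relative defaults. Judging from how the deleted `preprocess.sh` below consumed the same variables, `generate.sh` presumably resolves paths along these lines; this is an inference from that script, not something this hunk shows:

```bash
#!/bin/bash
# Sketch of the path layout implied by the new relative defaults, inferred
# from the deleted preprocess.sh below; generate.sh's actual logic may differ.
TOKENIZERS=../tokenized
MODELS=../models
MODEL=base

for arch in arm32 aarch64 x64; do
    tok=${TOKENIZERS}/${arch}/tokenized_dlsm_${arch}            # fairseq-preprocess dir
    ckpt=${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt  # trained model
    echo "${arch}: ${tok} ${ckpt}"
done
```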
real_world_dataset/preprocess.sh (DELETED)

```diff
@@ -1,21 +0,0 @@
-#!/bin/bash
-
-ARCHS=( arm32 aarch64 x64 )
-TOKENIZERS=$HOME/projects/decode_ML/dlsym/tokenized
-MODELS=$HOME/projects/decode_ML/dlsym/ablation
-MODEL=base
-DS=dataset
-GEN=generated/${MODEL}
-
-mkdir -p ${GEN}
-
-for arch in ${ARCHS[@]}
-do
-    tok=${TOKENIZERS}/${arch}/tokenized_dlsm_${arch}
-    echo python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-    python3 -m remend.tools.bpe_apply -t ${tok}/asm_tokens.json -i ${DS}/${arch}.asm -o ${GEN}/${arch}_tokenized.asm
-    fairseq-interactive ${tok} --beam 1 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam1.txt 2>/dev/null
-    fairseq-interactive ${tok} --beam 5 --path ${MODELS}/trained_${arch}_${MODEL}/checkpoint_best.pt < ${GEN}/${arch}_tokenized.asm > ${GEN}/${arch}_generated_beam5.txt 2>/dev/null
-    python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam1.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam1.txt
-    python3 eval_dataset.py -g ${GEN}/${arch}_generated_beam5.txt -i ${DS}/${arch}.info -r ${GEN}/${arch}_res_beam5.txt
-done
```