# ARM64BERT-embedding 🦾

## General

### What is the purpose of the model?
The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a given ARM64 function.
This task is known as _binary code similarity detection_, which is similar to the _sentence similarity_ task in natural language processing.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed in Wang et al.
This architecture has subsequently been finetuned for semantic search purposes. We have followed the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).
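
In this setting, semantic search reduces to nearest-neighbour lookup in the embedding space. A minimal sketch in plain NumPy, using randomly generated stand-in vectors where the real inputs would be embeddings produced by this model:

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus embeddings most similar to the query."""
    # Normalise so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    return np.argsort(scores)[::-1][:k]

# Stand-in data: 1,000 "function embeddings" and one query, 768 dimensions each.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768))
query = corpus[42] + 0.01 * rng.normal(size=768)  # near-duplicate of function 42

print(top_k_similar(query, corpus)[0])  # function 42 should rank first
```

With a real index, the same ranking would be done over embeddings of every function in the database of known functions.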

### What is the output of the model?
The model returns an embedding vector of 768 dimensions for each function it is given. These embeddings can be compared to
get an indication of which functions are similar to each other.
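
Such a comparison is typically done with cosine similarity. A sketch with made-up vectors (in practice both inputs would be 768-dimensional embeddings produced by this model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 suggest similar functions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
emb_a = rng.normal(size=768)                 # embedding of function A (stand-in)
emb_b = emb_a + 0.1 * rng.normal(size=768)   # a slightly perturbed copy of A
emb_c = rng.normal(size=768)                 # an unrelated function (stand-in)

# The near-duplicate scores far higher than the unrelated function.
assert cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c)
```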
### How does the model perform?

### To which problems is the model applicable?
The model has been designed to find similar ARM64 functions in a database of known functions.
We do not see other applications for this model.

### To what problems is the model not applicable?
This model has been finetuned on the semantic search task.
For the base ARM64BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64BERT) we have published.

## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. created BinaryCorp.
A large set of source code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/).
All this code is split into functions that are compiled into binary code with different optimization levels
(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
This results in a maximum of 10 (5×2) different functions which are semantically similar, i.e. they represent the same functionality but have different machine code.
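
The arithmetic behind that maximum of 10 is simply the cross product of the two build settings; a sketch:

```python
from itertools import product

opt_levels = ["O0", "O1", "O2", "O3", "Os"]
fortify_settings = ["fortify", "no-fortify"]

# Each combination yields a differently compiled, semantically equivalent binary
# of the same source function.
variants = list(product(opt_levels, fortify_settings))
print(len(variants))  # 5 × 2 = 10 semantically similar variants per function
```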
The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of
either the train or the test set, not both. We have not performed any deduplication.

| split | number of functions |
|-------|---------------------|
| train | 18,083,285 |
| test | 3,375,741 |
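
A project-level split (as opposed to a random per-function split) can be sketched as follows; the project names, counts, and 80/20 ratio here are illustrative stand-ins, not the actual dataset:

```python
import random

# Hypothetical mapping from function id to the project it was compiled from.
functions = {f"func_{i}": f"project_{i % 20}" for i in range(200)}

# Shuffle and split the *projects*, not the functions.
projects = sorted(set(functions.values()))
random.Random(0).shuffle(projects)
cutoff = int(0.8 * len(projects))
train_projects = set(projects[:cutoff])

train = [f for f, p in functions.items() if p in train_projects]
test = [f for f, p in functions.items() if p not in train_projects]

# No project contributes functions to both sides of the split.
assert {functions[f] for f in train}.isdisjoint({functions[f] for f in test})
```

Splitting at project level prevents near-identical functions from the same codebase leaking between train and test.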

For our training and evaluation code, see our [GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers).

### By whom was the dataset collected and annotated?
The dataset was collected by our team.