# ARM64BERT-embedding 🦾

## General

### What is the purpose of the model?
The model is a BERT model of ARM64 assembly code that can be used to find similar ARM64 functions to a given ARM64 function.
This task is known as _binary code similarity detection_, which is similar to the _sentence similarity_ task in natural language processing.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed in Wang et al.
This architecture has subsequently been finetuned for semantic search purposes. We have followed the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).
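
In this setting, semantic search reduces to nearest-neighbour lookup in the embedding space. A minimal sketch in plain NumPy, using randomly generated stand-in vectors where the real inputs would be embeddings produced by this model:

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k corpus embeddings most similar to the query."""
    # Normalise so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    return np.argsort(scores)[::-1][:k]

# Stand-in data: 1,000 "function embeddings" and one query, 768 dimensions each.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768))
query = corpus[42] + 0.01 * rng.normal(size=768)  # near-duplicate of function 42

print(top_k_similar(query, corpus)[0])  # function 42 should rank first
```

With a real index, the same ranking would be done over embeddings of every function in the database of known functions.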

### What is the output of the model?
The model returns an embedding vector of 768 dimensions for each function it is given. These embeddings can be compared to
get an indication of which functions are similar to each other.
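
Such a comparison is typically done with cosine similarity. A sketch with made-up vectors (in practice both inputs would be 768-dimensional embeddings produced by this model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 suggest similar functions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
emb_a = rng.normal(size=768)                 # embedding of function A (stand-in)
emb_b = emb_a + 0.1 * rng.normal(size=768)   # a slightly perturbed copy of A
emb_c = rng.normal(size=768)                 # an unrelated function (stand-in)

# The near-duplicate scores far higher than the unrelated function.
assert cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c)
```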
### How does the model perform?

### To which problems is the model applicable?
The model has been designed to find similar ARM64 functions in a database of known functions.
We do not see other applications for this model.

### To what problems is the model not applicable?
This model has been finetuned on the semantic search task.
For the base ARM64BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64BERT) we have published.

## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. created BinaryCorp.
A large set of source code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/).
All this code is split into functions that are compiled into binary code with different optimization levels
(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
This results in a maximum of 10 (5×2) different functions which are semantically similar, i.e. they represent the same functionality but have different machine code.
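
The arithmetic behind that maximum of 10 is simply the cross product of the two build settings; a sketch:

```python
from itertools import product

opt_levels = ["O0", "O1", "O2", "O3", "Os"]
fortify_settings = ["fortify", "no-fortify"]

# Each combination yields a differently compiled, semantically equivalent binary
# of the same source function.
variants = list(product(opt_levels, fortify_settings))
print(len(variants))  # 5 × 2 = 10 semantically similar variants per function
```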
The dataset is split into a train and a test set. This is done on project level, so all binaries and functions belonging to one project are part of
either the train or the test set, not both. We have not performed any deduplication.

| split | number of functions |
|-------|---------------------|
| train | 18,083,285 |
| test | 3,375,741 |
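
A project-level split (as opposed to a random per-function split) can be sketched as follows; the project names, counts, and 80/20 ratio here are illustrative stand-ins, not the actual dataset:

```python
import random

# Hypothetical mapping from function id to the project it was compiled from.
functions = {f"func_{i}": f"project_{i % 20}" for i in range(200)}

# Shuffle and split the *projects*, not the functions.
projects = sorted(set(functions.values()))
random.Random(0).shuffle(projects)
cutoff = int(0.8 * len(projects))
train_projects = set(projects[:cutoff])

train = [f for f, p in functions.items() if p in train_projects]
test = [f for f, p in functions.items() if p not in train_projects]

# No project contributes functions to both sides of the split.
assert {functions[f] for f in train}.isdisjoint({functions[f] for f in test})
```

Splitting at project level prevents near-identical functions from the same codebase leaking between train and test.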

For our training and evaluation code, see our [GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers).

### By whom was the dataset collected and annotated?
The dataset was collected by our team.