---
license: eupl-1.2
language: code
---

# ARM64BERT 🦾

[GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers)

## General

### What is the purpose of the model?

The model is a BERT model for ARM64 assembly code. It has NOT been fine-tuned for semantic similarity; for that task you most likely want to use our [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding). The main purpose of ARM64BERT is to serve as a baseline against which the fine-tuned model can be compared.

### What does the model architecture look like?

The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model (Devlin et al., 2019), although the usual Next Sentence Prediction pre-training task has been replaced with Jump Target Prediction, as proposed by Wang et al.
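
To make Jump Target Prediction more concrete, the sketch below shows one simplified way to turn jump targets into prediction labels, loosely following the idea in jTrans: the raw address operand of an intra-function branch is replaced by a token that points at the position of the target instruction, and during pre-training the model has to predict that position. The whitespace/comma tokenisation, the `JUMP_<n>` token format, the branch mnemonic set and the function names are assumptions made for illustration only; the actual preprocessing lives in the GitHub repository.

```python
from typing import Dict, List, Tuple

# Assumed set of intra-function branch mnemonics whose last operand is an address.
# Calls (bl) are left out on purpose: they usually jump to other functions.
BRANCH_MNEMONICS = {"b", "cbz", "cbnz", "tbz", "tbnz"}


def is_branch(mnemonic: str) -> bool:
    # Conditional branches are written as b.eq, b.ne, b.lt, ...
    return mnemonic in BRANCH_MNEMONICS or mnemonic.startswith("b.")


def tokenize_with_jump_targets(
    instructions: List[Tuple[int, str]],
) -> Tuple[List[str], Dict[int, int]]:
    """Tokenise (address, instruction) pairs and build jump-target labels.

    Returns the token list and a mapping from the index of each jump token
    to the index of the first token of the instruction it jumps to.
    """
    split = [text.replace(",", " ").split() for _, text in instructions]

    # First pass: token position at which every instruction starts.
    start_of: Dict[int, int] = {}
    position = 0
    for (address, _), toks in zip(instructions, split):
        start_of[address] = position
        position += len(toks)

    # Second pass: replace in-function branch targets by positional tokens.
    tokens: List[str] = []
    labels: Dict[int, int] = {}
    for toks in split:
        if is_branch(toks[0]):
            target = int(toks[-1], 16)
            if target in start_of:
                jump_index = len(tokens) + len(toks) - 1
                labels[jump_index] = start_of[target]
                toks = toks[:-1] + [f"JUMP_{start_of[target]}"]
        tokens.extend(toks)
    return tokens, labels
```

For the snippet `cmp x0, #0`, `b.eq 0x100c`, `mov x0, #1`, `ret` at addresses 0x1000 to 0x100c, the branch operand becomes `JUMP_8` and the labels are `{4: 8}`: the token at index 4 points at index 8, where `ret` starts. In a JTP-style objective, that jump token would be masked and the model trained to recover position 8.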

### What is the output of the model?

The model is a BERT base model: it outputs contextual embeddings for each input token, and these outputs are not meant to be used directly.

### How does the model perform?

We compared this model against the model fine-tuned specifically for semantic similarity. To make this comparison, we initialised this base model as a SentenceTransformer model.
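
A minimal sketch of wrapping the base model as a SentenceTransformer, assuming the `sentence-transformers` package is installed. The helper name `build_arm64bert_sentence_model` and the choice of mean pooling are assumptions: this card does not state which pooling was used, and the exact input format expected by the tokenizer is defined by the preprocessing in the GitHub repository.

```python
from typing import Any

MODEL_ID = "NetherlandsForensicInstitute/ARM64BERT"


def build_arm64bert_sentence_model(model_id: str = MODEL_ID) -> Any:
    """Wrap the ARM64BERT base model in a SentenceTransformer with mean pooling."""
    # Imported lazily so this module can be imported without the package installed.
    from sentence_transformers import SentenceTransformer, models

    # Token-level BERT encoder, downloaded from the Hugging Face Hub on first use.
    encoder = models.Transformer(model_id)
    # Average the token embeddings into one vector per function (assumed pooling).
    pooling = models.Pooling(encoder.get_word_embedding_dimension(), pooling_mode="mean")
    return SentenceTransformer(modules=[encoder, pooling])
```

Calling `build_arm64bert_sentence_model().encode(functions)` on a list of preprocessed function strings then returns one embedding per function, which can be compared with cosine similarity.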

The model was then evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and [Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
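
Both metrics are computed from the rank at which the positive example ends up for each query. A minimal sketch, assuming exactly one positive example per query and 1-based ranks:

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (rank 1 = positive ranked first)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def recall_at_1(ranks: Sequence[int]) -> float:
    """Fraction of queries whose positive example is ranked first."""
    return sum(1 for rank in ranks if rank == 1) / len(ranks)


ranks = [1, 2, 4, 1]
print(mean_reciprocal_rank(ranks))  # 0.6875
print(recall_at_1(ranks))  # 0.5
```

With a single positive per query, Recall@1 is simply the share of queries answered correctly at the top, and MRR is always at least as high because it also gives partial credit for ranks 2, 3, and so on.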

When the model has to pick the positive example out of a pool of 32 functions, the fine-tuned ARM64BERT-embedding model almost always ranks it first (Recall@1 of 0.99), while this base model does so in 72% of cases. When the pool is enlarged to 10,000 functions, Recall@1 drops to 0.83 for the fine-tuned model and to 0.56 for this base model.

| Model               | Pool size | MRR  | Recall@1 |
|---------------------|-----------|------|----------|
| ARM64BERT           | 32        | 0.78 | 0.72     |
| ARM64BERT-embedding | 32        | 0.99 | 0.99     |
| ARM64BERT           | 10,000    | 0.58 | 0.56     |
| ARM64BERT-embedding | 10,000    | 0.87 | 0.83     |
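
The pool-based evaluation can be sketched as follows: the query function's embedding is compared with the embedding of its positive example and with those of the other functions in the pool, and the rank of the positive example is recorded. This sketch assumes cosine similarity on plain Python lists; how the pool members were sampled is not described in this card, and ties are counted in the positive example's favour.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_of_positive(query: Vector, positive: Vector, negatives: List[Vector]) -> int:
    """1-based rank of the positive among the positive plus the negatives."""
    positive_score = cosine(query, positive)
    # Only negatives that score strictly higher push the positive down.
    return 1 + sum(1 for neg in negatives if cosine(query, neg) > positive_score)
```

For a pool of 32, `negatives` would hold the 31 other functions in the pool; collecting these ranks over all queries gives the inputs for MRR and Recall@1.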

## Purpose and use of the model

### For which problem has the model been designed?

The model has been designed to act as a base model for the ARM64 assembly language.

### What else could the model be used for?

When initialised as a SentenceTransformer model, it can also be used to find similar ARM64 functions in a database of known ARM64 functions.
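
Searching a database of known functions then amounts to a nearest-neighbour lookup over embeddings. A minimal sketch, assuming the embeddings have already been computed with the SentenceTransformer-wrapped model; the dictionary used as a "database" and the function names are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalise(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def search(
    query: Sequence[float], database: Dict[str, List[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k known functions most similar to the query, best first."""
    q = normalise(query)
    scores = (
        (name, sum(a * b for a, b in zip(q, normalise(embedding))))
        for name, embedding in database.items()
    )
    return heapq.nlargest(k, scores, key=lambda item: item[1])
```

This linear scan is fine for small collections; for millions of functions a dedicated vector index is usually the better choice.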

### To what problems is the model not applicable?

Although the model performs reasonably well on the semantic search task, it has NOT been fine-tuned for that task. For a fine-tuned ARM64BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert-embedding) published alongside this one.

## Data

### What data was used for training and evaluation?

The dataset was created in the same way as Wang et al. created BinaryCorp. A large set of source code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/). All this code is compiled with different optimisation levels (`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify) and split into functions. This results in up to 10 (5×2) variants of each function that are semantically similar, i.e. they represent the same functionality, but usually have different machine code.
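
The 5×2 grid of compilation settings can be enumerated as below. This is an illustration only: the cross-compiler name `aarch64-linux-gnu-gcc`, the flags used to represent the fortify and no-fortify settings, and the output naming scheme are assumptions, not a description of the actual build pipeline.

```python
from itertools import product
from typing import List

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
# Assumed flags for the two security settings (fortify / no-fortify).
FORTIFY = {"fortify": ["-D_FORTIFY_SOURCE=2"], "no-fortify": ["-U_FORTIFY_SOURCE"]}


def compile_commands(source: str) -> List[List[str]]:
    """One hypothetical compiler invocation per optimisation/security combination."""
    commands = []
    for opt, (name, flags) in product(OPT_LEVELS, FORTIFY.items()):
        output = f"{source.rsplit('.', 1)[0]}{opt}-{name}.o"
        commands.append(["aarch64-linux-gnu-gcc", "-c", opt, *flags, source, "-o", output])
    return commands
```

With glibc, `_FORTIFY_SOURCE` only takes effect when optimisation is enabled, so at `-O0` the two security settings can yield identical code for a function.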

The dataset is split into a train set and a test set. This is done at the project level: all binaries and functions belonging to one project end up in either the train set or the test set, never both. We did not perform any deduplication on the training data.
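
A project-level split can be sketched by assigning every project, rather than every function, to one side. The hash-based assignment and the default 15% test fraction below are illustrative assumptions; the card does not state how projects were assigned.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def split_of(project: str, test_fraction: float = 0.15) -> str:
    """Deterministically assign a whole project to 'train' or 'test'."""
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_functions(
    functions: Iterable[Tuple[str, str]], test_fraction: float = 0.15
) -> Dict[str, List[Tuple[str, str]]]:
    """Split (project, function) records so that no project appears in both sets."""
    splits: Dict[str, List[Tuple[str, str]]] = {"train": [], "test": []}
    for project, function in functions:
        splits[split_of(project, test_fraction)].append((project, function))
    return splits
```

Because the decision depends only on the project name, every function of a project lands on the same side, which prevents near-identical code from the same project leaking from training into evaluation.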

| Set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

For our training and evaluation code, see our [GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers).

### By whom was the dataset collected and annotated?

The dataset was collected by our team. The annotation of similar and non-similar functions follows from the different compilation settings: what we consider "similar functions" is in fact the same function compiled in different ways.
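
In other words, the labels come from the build process itself: two records form a positive pair when they are the same source function compiled with different settings. A minimal sketch, assuming each record carries the project, the function name, the compilation setting and the code (the `CompiledFunction` type and its field names are hypothetical), and that function names are unique within a project:

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class CompiledFunction(NamedTuple):
    project: str
    name: str
    setting: str  # e.g. "O2-fortify"
    code: str


def positive_pairs(
    records: List[CompiledFunction],
) -> List[Tuple[CompiledFunction, CompiledFunction]]:
    """All pairs of compiled variants of the same source function."""
    groups: Dict[Tuple[str, str], List[CompiledFunction]] = defaultdict(list)
    for record in records:
        groups[(record.project, record.name)].append(record)
    # Any two variants within a group form a "similar" pair; functions from
    # different groups are treated as non-similar.
    return [pair for group in groups.values() for pair in combinations(group, 2)]
```

With the maximum of 10 variants per function, this gives 45 positive pairs for that function.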

### Any remarks on data quality and bias?

The way we classify functions as similar may have implications. For example, two different ways of compiling the same function sometimes produce identical code. We did not remove such duplicates from the training data, but we did implement checks in the evaluation stage, and the model does not appear to have suffered from these trivially easy training examples.
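
A simple check along these lines (a sketch, not necessarily the check used in the evaluation) is to measure how many positive pairs consist of identical instructions once whitespace differences are ignored. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def normalise_code(code: str) -> str:
    """Collapse whitespace so formatting differences do not hide duplicates."""
    return " ".join(code.split())


def identical_fraction(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (code_a, code_b) positive pairs whose instructions are identical."""
    pair_list = list(pairs)
    same = sum(1 for a, b in pair_list if normalise_code(a) == normalise_code(b))
    return same / len(pair_list) if pair_list else 0.0
```

Such pairs can then be reported separately or excluded when computing MRR and Recall@1, so that trivial matches do not inflate the scores.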

After training this base model, we found out that something had gone wrong when building our dataset: the last instruction of each function was also included in the next function. Because of the long training process, and the good performance of the model despite this mistake, we have decided not to retrain the model.