---
license: apache-2.0
---
# Virtual Compiler Is All You Need For Assembly Code Search
## Introduction
This repo contains the models and the corresponding evaluation datasets of the ACL 2024 paper "Virtual Compiler Is All You Need For Assembly Code Search".

A virtual compiler is an LLM that is capable of compiling any programming language into the underlying assembly code. The virtual compiler model is available at [elsagranger/VirtualCompiler](https://huggingface.co/elsagranger/VirtualCompiler), based on CodeLlama 34B.

We evaluate the similarity between the virtual assembly code generated by the virtual compiler and the real assembly code using forced execution with the script [force_exec.py](./force_exec.py); the corresponding evaluation dataset is available at [virtual_assembly_and_ground_truth](./virtual_assembly_and_ground_truth).

We evaluate the effectiveness of the virtual compiler through a downstream task -- assembly code search. The evaluation dataset is available at [elsagranger/AssemblyCodeSearchEval](https://huggingface.co/datasets/elsagranger/AssemblyCodeSearchEval).
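For a quick look at the search evaluation data, the dataset can be pulled with the `datasets` library. This is a minimal sketch; the available splits and column names are defined on the dataset card, not spelled out here:

```python
from datasets import load_dataset

# Load the assembly code search evaluation set from the Hugging Face Hub.
# Inspect the returned DatasetDict to see the actual splits and columns.
ds = load_dataset("elsagranger/AssemblyCodeSearchEval")
print(ds)
```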
## Usage
We use FastChat with a vLLM worker to host the model. Run the following commands in separate terminals, for example in separate `tmux` panes.
```shell
LOGDIR="" python3 -m fastchat.serve.openai_api_server \
--host 0.0.0.0 --port 8080 \
--controller-address http://localhost:21000
LOGDIR="" python3 -m fastchat.serve.controller \
--host 0.0.0.0 --port 21000
LOGDIR="" RAY_LOG_TO_STDERR=1 \
python3 -m fastchat.serve.vllm_worker \
--model-path ./VirtualCompiler \
--num-gpus 8 \
--controller http://localhost:21000 \
--max-num-batched-tokens 40960 \
--disable-log-requests \
--host 0.0.0.0 --port 22000 \
--worker-address http://localhost:22000 \
--model-names "VirtualCompiler"
```
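Once all three processes are running, you can confirm the stack is up by listing the served models through the OpenAI-compatible endpoint (a minimal sketch):

```python
import requests

# Query the FastChat OpenAI-compatible server started above;
# "VirtualCompiler" should appear once the worker has registered.
print(requests.get("http://localhost:8080/v1/models", timeout=30).json())
```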
Then, with the model hosted, use `do_request.py` to make requests to the model.
```shell
~/C/VirtualCompiler (main)> python3 do_request.py
test rdx, rdx
setz al
movzx eax, al
neg eax
retn
```
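The released `do_request.py` is the reference client. As a rough illustration, a request against the OpenAI-compatible endpoint started above could look like the sketch below; the prompt is illustrative and is not necessarily what `do_request.py` sends:

```python
import requests

# Minimal sketch of a chat-completion request against the FastChat
# OpenAI-compatible server on port 8080. The source snippet below is
# a hypothetical example input for the virtual compiler.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "VirtualCompiler",
        "messages": [
            {
                "role": "user",
                "content": "Compile the following C function to x86-64 assembly:\n"
                           "int is_zero(long x) { return -(x == 0); }",
            }
        ],
        "temperature": 0.0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```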
## Assembly Code Search Encoder
As Hugging Face does not support loading a remote model from inside a subfolder, we host the encoder trained on the assembly code search dataset augmented by the Virtual Compiler at [vic-encoder](https://cloud.vul337.team:9443/s/t5Ltt8gy7kPfyw8). You can use the bundled `model.py` to test loading the custom model.

Here is an example of using the text encoder and the asm encoder (the `text_encoder`, `asm_encoder`, and tokenizer objects below come from `model.py`). Please refer to this script for how to extract assembly code from a binary: [process_asm.py](https://github.com/Hustcw/CLAP/blob/main/scripts/process_asm.py).
```python
import torch

# text_encoder / asm_encoder and their tokenizers are loaded via the
# model.py shipped with the vic-encoder release linked above.

def calc_map_at_k(logits, pos_cnt, ks=(10,)):
    """Rank the positive candidates and report per-query MRR and hit ratios at each k."""
    _, indices = torch.sort(logits, dim=1, descending=True)
    # Positions of the pos_cnt positive candidates in the sorted list,
    # shape [batch_size, pos_cnt]
    ranks = torch.nonzero(
        indices < pos_cnt,
        as_tuple=False
    )[:, 1].reshape(logits.shape[0], -1)
    # Mean reciprocal rank per query
    mrr = torch.mean(1.0 / (ranks.float() + 1), dim=1)
    res = {}
    for k in ks:
        # Fraction of positives ranked within the top k
        res[k] = (
            torch.sum((ranks < k).float(), dim=1) / min(k, pos_cnt)
        ).cpu().numpy()
    return ranks.cpu().numpy(), res, mrr.cpu().numpy()

pos_asm_cnt = 1
query = ["List all files in a directory"]

# Extracted by the process_asm.py script mentioned above
anchor_asm = [{"1": "endbr64", "2": "mov eax, 0"}, ...]
neg_anchor_asm = [{"1": "push rbp", "2": "mov rbp, rsp", ...}, ...]

kwargs = dict(padding=True, pad_to_multiple_of=8, return_tensors="pt")
query_embs = text_encoder(**text_tokenizer(query, **kwargs))
anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(pos) for pos in anchor_asm], **kwargs)
neg_anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(neg) for neg in neg_anchor_asm], **kwargs)

asm_embs = asm_encoder(**anchor_asm_ids)
asm_neg_emb = asm_encoder(**neg_anchor_asm_ids)

# query_embs: [query_cnt, emb_dim]
# asm_embs:   [pos_asm_cnt, emb_dim]
# logits_pos: [query_cnt, pos_asm_cnt]
logits_pos = torch.einsum("ic,jc->ij", [query_embs, asm_embs])
# logits_neg: [query_cnt, neg_asm_cnt], skipping the first pos_asm_cnt negatives
logits_neg = torch.einsum("ic,jc->ij", [query_embs, asm_neg_emb[pos_asm_cnt:]])

logits = torch.cat([logits_pos, logits_neg], dim=1)
ranks, map_at_k, mrr = calc_map_at_k(
    logits, pos_asm_cnt, [1, 5, 10, 20, 50, 100])
```
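The `einsum` calls above score query/assembly pairs by raw dot product. If the encoder outputs are not already L2-normalized, you may want to normalize them first so the scores become cosine similarities; this is an optional sketch, assuming raw embedding vectors:

```python
import torch.nn.functional as F

# Optional: L2-normalize embeddings so the einsum dot products become
# cosine similarities (assumes the encoders return unnormalized vectors).
query_embs = F.normalize(query_embs, dim=-1)
asm_embs = F.normalize(asm_embs, dim=-1)
asm_neg_emb = F.normalize(asm_neg_emb, dim=-1)
```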