---
license: apache-2.0
---

# Virtual Compiler Is All You Need For Assembly Code Search

## Introduction

This repo contains the models and the corresponding evaluation datasets of the ACL 2024 paper "Virtual Compiler Is All You Need For Assembly Code Search".

A virtual compiler is an LLM capable of compiling any programming language into the underlying assembly code. The virtual compiler model is available at [elsagranger/VirtualCompiler](https://huggingface.co/elsagranger/VirtualCompiler) and is based on the 34B CodeLlama.

We evaluate the similarity between the virtual assembly code generated by the virtual compiler and the real assembly code using forced execution (see [force_exec.py](./force_exec.py)); the corresponding evaluation dataset is available at [virtual_assembly_and_ground_truth](./virtual_assembly_and_ground_truth).

We evaluate the effectiveness of the virtual compiler through a downstream task -- assembly code search. The evaluation dataset is available at [elsagranger/AssemblyCodeSearchEval](https://huggingface.co/datasets/elsagranger/AssemblyCodeSearchEval).
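
To browse the evaluation data locally, it can be fetched with the standard `datasets` library. The snippet below is only a minimal sketch that assumes the repository id shown above; the exact configurations, splits, and field names are documented on the dataset card and may differ.

```python
from datasets import load_dataset

# Load the assembly code search evaluation set from the Hugging Face Hub.
# Depending on how the dataset is organized, a configuration name or a
# specific split may need to be passed explicitly.
ds = load_dataset("elsagranger/AssemblyCodeSearchEval")
print(ds)  # inspect the available splits and columns
```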

## Usage

We use FastChat with a vLLM worker to host the model. Run the following commands in separate terminals (for example, separate `tmux` panes).

```shell
LOGDIR="" python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 --port 8080 \
    --controller-address http://localhost:21000

LOGDIR="" python3 -m fastchat.serve.controller \
    --host 0.0.0.0 --port 21000

LOGDIR="" RAY_LOG_TO_STDERR=1 \
    python3 -m fastchat.serve.vllm_worker \
    --model-path ./VirtualCompiler \
    --num-gpus 8 \
    --controller http://localhost:21000 \
    --max-num-batched-tokens 40960 \
    --disable-log-requests \
    --host 0.0.0.0 --port 22000 \
    --worker-address http://localhost:22000 \
    --model-names "VirtualCompiler"
```

Then, with the model hosted, use `do_request.py` to make a request to the model.

```shell
~/C/VirtualCompiler (main)> python3 do_request.py
test rdx, rdx
setz al
movzx eax, al
neg eax
retn
```
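
Since FastChat exposes an OpenAI-compatible API on port 8080, the hosted model can also be queried directly. The sketch below is only an illustration: the C function and the plain prompt are placeholders, and the exact prompt format the model expects (how the source code is wrapped) is defined in `do_request.py`, so treat that script as the reference client.

```python
from openai import OpenAI

# FastChat's openai_api_server (started above) serves an OpenAI-compatible
# API; no real API key is required for a local deployment.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# Placeholder source snippet -- the real prompt format lives in do_request.py.
source = "int is_zero(long x) { return x == 0 ? -1 : 0; }"

resp = client.chat.completions.create(
    model="VirtualCompiler",
    messages=[{"role": "user", "content": source}],
    temperature=0.0,
)
print(resp.choices[0].message.content)  # virtual assembly for the snippet
```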


## Assembly Code Search Encoder

As Hugging Face does not support loading a remote model from a subfolder, we host the model trained on the assembly code search dataset augmented by the Virtual Compiler at [vic-encoder](https://cloud.vul337.team:9443/s/t5Ltt8gy7kPfyw8). You can use `model.py` to test the custom model loading.
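
As a rough sketch of how the downloaded encoder might be wired up before running the example below (every module, class, and path here is an assumption for illustration; the actual loading logic and checkpoint layout are defined by `model.py` inside the archive):

```python
import torch
from transformers import AutoTokenizer

# Hypothetical module and class names -- consult model.py from the
# vic-encoder archive for the real ones.
from model import TextEncoder, AsmEncoder

text_tokenizer = AutoTokenizer.from_pretrained("./vic-encoder/text_tokenizer")
asm_tokenizer = AutoTokenizer.from_pretrained("./vic-encoder/asm_tokenizer")

text_encoder = TextEncoder.from_pretrained("./vic-encoder/text_encoder").eval()
asm_encoder = AsmEncoder.from_pretrained("./vic-encoder/asm_encoder").eval()
```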

Here is an example of using the text encoder and the asm encoder. Please refer to [process_asm.py](https://github.com/Hustcw/CLAP/blob/main/scripts/process_asm.py) for how to extract the assembly code from a binary.

```python
import torch

def calc_map_at_k(logits, pos_cnt, ks=[10,]):
    _, indices = torch.sort(logits, dim=1, descending=True)

    # ranks of the positive candidates, [batch_size, pos_cnt]
    ranks = torch.nonzero(
        indices < pos_cnt,
        as_tuple=False
    )[:, 1].reshape(logits.shape[0], -1)

    # mean reciprocal rank per query, [batch_size]
    mrr = torch.mean(1 / (ranks + 1), dim=1)

    res = {}

    for k in ks:
        res[k] = (
            torch.sum((ranks < k).float(), dim=1) / min(k, pos_cnt)
        ).cpu().numpy()

    return ranks.cpu().numpy(), res, mrr.cpu().numpy()

pos_asm_cnt = 1

query = ["List all files in a directory"]

# Extracted by the process_asm.py script mentioned above
anchor_asm = [ {"1": "endbr64", "2": "mov eax, 0" }, ... ]
neg_anchor_asm = [ {"1": "push rbp", "2": "mov rbp, rsp", ... }, ... ]

query_embs = text_encoder(**text_tokenizer(query, padding=True, return_tensors="pt"))

kwargs = dict(padding=True, pad_to_multiple_of=8, return_tensors="pt")
anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(pos) for pos in anchor_asm], **kwargs)
neg_anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(neg) for neg in neg_anchor_asm], **kwargs)

asm_embs = asm_encoder(**anchor_asm_ids)
asm_neg_emb = asm_encoder(**neg_anchor_asm_ids)

# query_embs: [query_cnt, emb_dim]
# asm_embs: [pos_asm_cnt, emb_dim]

# logits_pos: [query_cnt, pos_asm_cnt]
logits_pos = torch.einsum(
    "ic,jc->ij", [query_embs, asm_embs])
# logits_neg: [query_cnt, neg_asm_cnt]
logits_neg = torch.einsum(
    "ic,jc->ij", [query_embs, asm_neg_emb]
)
logits = torch.cat([logits_pos, logits_neg], dim=1)

ranks, map_at_k, mrr = calc_map_at_k(
    logits, pos_asm_cnt, [1, 5, 10, 20, 50, 100])
```
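
The returned arrays can be reduced to the usual scalar retrieval metrics; a small follow-up, assuming one positive per query as in the snippet above:

```python
print("MRR:", mrr.mean())
for k, scores in map_at_k.items():
    print(f"MAP@{k}:", scores.mean())
```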