Improve model card: Add pipeline tag, library name, and paper link

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +157 -142
README.md CHANGED
@@ -3,173 +3,188 @@ license: mit
3
  tags:
4
  - decompile
5
  - binary
6
  ---
7
 
8
  ### 1. Introduction of LLM4Decompile
9
 
10
  LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series are trained with a larger dataset (2B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous model.
11
 
12
- - **Github Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
13
 
14
 
15
  ### 2. Evaluation Results
16
 
17
  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
18
- |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
19
- | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
20
- | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
21
- | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
22
- | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
23
- | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
24
- | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
25
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |
26
 
27
  ### 3. How to Use
28
  Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).
29
 
30
- 1. Install Ghidra
31
- Download [Ghidra](https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip) to the current folder. You can also check the [page](https://github.com/NationalSecurityAgency/ghidra/releases) for other versions. Unzip the package to the current folder.
32
- In bash, you can use the following:
33
- ```bash
34
- cd LLM4Decompile/ghidra
35
- wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
36
- unzip ghidra_11.0.3_PUBLIC_20240410.zip
37
- ```
38
- 2. Install Java-SDK-17
39
- Ghidra 11 is dependent on Java-SDK-17, a simple way to install the SDK on Ubuntu:
40
- ```bash
41
- apt-get update
42
- apt-get upgrade
43
- apt install openjdk-17-jdk openjdk-17-jre
44
- ```
45
- Please check [Ghidra install guide](https://htmlpreview.github.io/?https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.1.1_build/GhidraDocs/InstallationGuide.html) for other platforms.
46
-
47
- 3. Use Ghidra Headless to decompile binary (demo.py)
48
-
49
- Note: **Replace** func0 with the function name you want to decompile.
50
-
51
- **Preprocessing:** Compile the C code into binary, and disassemble the binary into assembly instructions.
52
- ```python
53
- import os
54
- import subprocess
55
- from tqdm import tqdm,trange
56
-
57
- OPT = ["O0", "O1", "O2", "O3"]
58
- timeout_duration = 10
59
-
60
- ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"#path to the headless analyzer, change the path accordingly
61
- postscript = "./decompile.py"#path to the decompiler helper function, change the path accordingly
62
- project_path = "."#path to temp folder for analysis, change the path accordingly
63
- project_name = "tmp_ghidra_proj"
64
- func_path = "../samples/sample.c"#path to c code for compiling and decompiling, change the path accordingly
65
- fileName = "sample"
66
-
67
- with tempfile.TemporaryDirectory() as temp_dir:
68
- pid = os.getpid()
69
- asm_all = {}
70
- for opt in [OPT[0]]:
71
- executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
72
- cmd = f'gcc -{opt} -o {executable_path} {func_path} -lm'
73
- subprocess.run(
74
- cmd.split(' '),
75
- check=True,
76
- stdout=subprocess.DEVNULL, # Suppress stdout
77
- stderr=subprocess.DEVNULL, # Suppress stderr
78
- timeout=timeout_duration,
79
- )
80
-
81
- output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
82
- command = [
83
- ghidra_path,
84
- temp_dir,
85
- project_name,
86
- "-import", executable_path,
87
- "-postScript", postscript, output_path,
88
- "-deleteProject", # WARNING: This will delete the project after analysis
89
- ]
90
- result = subprocess.run(command, text=True, capture_output=True, check=True)
91
- with open(output_path,'r') as f:
92
- c_decompile = f.read()
93
- c_func = []
94
- flag = 0
95
- for line in c_decompile.split('\n'):
96
- if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
97
- flag = 1
98
- c_func.append(line)
99
- continue
100
- if flag:
101
- if '// Function:' in line:
102
- if len(c_func) > 1:
103
- break
104
- c_func.append(line)
105
- if flag == 0:
106
- raise ValueError('bad case no function found')
107
- for idx_tmp in range(1,len(c_func)):##########remove the comments
108
- if 'func0' in c_func[idx_tmp]:
109
- break
110
- c_func = c_func[idx_tmp:]
111
- input_asm = '\n'.join(c_func).strip()
112
-
113
- before = f"# This is the assembly code:\n"#prompt
114
- after = "\n# What is the source code?\n"#prompt
115
- input_asm_prompt = before+input_asm.strip()+after
116
- with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
117
- f.write(input_asm_prompt)
118
- ```
119
-
120
- Ghidra pseudo-code may look like this:
121
- ```c
122
- undefined4 func0(float param_1,long param_2,int param_3)
123
- {
124
- int local_28;
125
- int local_24;
126
-
127
- local_24 = 0;
128
- do {
129
- local_28 = local_24;
130
- if (param_3 <= local_24) {
131
- return 0;
132
- }
133
- while (local_28 = local_28 + 1, local_28 < param_3) {
134
- if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
135
- *(float *)(param_2 + (long)local_28 * 4)) &
136
- SUB168(_DAT_00402010,0)) < (double)param_1) {
137
- return 1;
138
- }
139
  }
140
- local_24 = local_24 + 1;
141
- } while( true );
142
- }
143
- ```
144
- 4. Refine pseudo-code using LLM4Decompile (demo.py)
145
 
146
- **Decompilation:** Use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C:
147
- ```python
148
- from transformers import AutoTokenizer, AutoModelForCausalLM
149
- import torch
150
 
151
- model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
152
- tokenizer = AutoTokenizer.from_pretrained(model_path)
153
- model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()
154
 
155
- with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
156
- asm_func = f.read()
157
- inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
158
- with torch.no_grad():
159
- outputs = model.generate(**inputs, max_new_tokens=2048)### max length to 4096, max new tokens should be below the range
160
- c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
161
 
162
- with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
163
- func = f.read()
164
 
165
- print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
166
- print(f'refined function:\n{c_func_decompile}')
167
 
168
- ```
169
 
170
  ### 4. License
171
- This code repository is licensed under the MIT License.
172
 
173
  ### 5. Contact
174
 
175
- If you have any questions, please raise an issue.
 
3
  tags:
4
  - decompile
5
  - binary
6
+ pipeline_tag: text-generation
7
+ library_name: transformers
8
  ---
9
 
10
+ # LLM4Decompile: Decompiling Binary Code with Large Language Models
11
+
12
+ This repository contains the `LLM4Decompile` model, which is part of the work presented in [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).
13
+
14
+ The project's code and more details can be found on its [GitHub repository](https://github.com/albertan017/LLM4Decompile).
15
+
16
  ### 1. Introduction of LLM4Decompile
17
 
18
  LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves performance by up to 100% compared to the previous model.
19
 
20
+ - **GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
21
 
22
 
23
  ### 2. Evaluation Results
24
 
25
  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
26
+ |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
27
+ | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
28
+ | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
29
+ | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
30
+ | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
31
+ | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
32
+ | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
33
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |
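
The AVG column appears to be the plain arithmetic mean of the four optimization levels. The sketch below checks that reading for the re-executability rates; the dictionary `reexec` and the helper `mean` are illustrative names, and the numbers are copied from the table above.

```python
# Re-executability rates per optimization level (O0, O1, O2, O3) and the
# AVG value reported in the table, copied from the model card.
reexec = {
    "LLM4Decompile-End-6.7B": ([0.6805, 0.3951, 0.3671, 0.3720], 0.4537),
    "Ghidra": ([0.3476, 0.1646, 0.1524, 0.1402], 0.2012),
    "+GPT-4o": ([0.4695, 0.3415, 0.2866, 0.3110], 0.3522),
    "+LLM4Decompile-Ref-1.3B": ([0.6890, 0.3720, 0.4085, 0.3720], 0.4604),
    "+LLM4Decompile-Ref-6.7B": ([0.7439, 0.4695, 0.4756, 0.4207], 0.5274),
    "+LLM4Decompile-Ref-33B": ([0.7073, 0.4756, 0.4390, 0.4146], 0.5091),
}


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


for name, (levels, reported) in reexec.items():
    # Print the recomputed average next to the reported one for comparison.
    print(f"{name}: computed {mean(levels):.4f}, reported {reported:.4f}")
```

Under this reading, every recomputed average agrees with the reported AVG to within rounding at the fourth decimal place.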
34
 
35
  ### 3. How to Use
36
  Here is an example of how to use our model. It applies only to V2; for previous models, please check the corresponding model page on Hugging Face.
37
 
38
+ 1. Install Ghidra
39
+ Download [Ghidra](https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip) to the current folder. You can also check the [page](https://github.com/NationalSecurityAgency/ghidra/releases) for other versions. Unzip the package to the current folder.
40
+ In bash, you can use the following:
41
+ ```bash
42
+ cd LLM4Decompile/ghidra
43
+ wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
44
+ unzip ghidra_11.0.3_PUBLIC_20240410.zip
45
+ ```
46
+ 2. Install Java-SDK-17
47
+ Ghidra 11 depends on JDK 17. A simple way to install it on Ubuntu:
48
+ ```bash
49
+ apt-get update
50
+ apt-get upgrade
51
+ apt install openjdk-17-jdk openjdk-17-jre
52
+ ```
53
+ Please check [Ghidra install guide](https://htmlpreview.github.io/?https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.1.1_build/GhidraDocs/InstallationGuide.html) for other platforms.
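
Before running the demo, it can help to confirm that the tools used in this walkthrough are reachable. The following is a minimal sketch and not part of the repository: `missing_prerequisites` is a hypothetical helper, the tool names `java` and `gcc` are assumptions about what is on your `PATH`, and the `analyzeHeadless` path is the one used in the demo script and should be adjusted to your setup.

```python
import shutil
from pathlib import Path


def missing_prerequisites(ghidra_headless, tools=("java", "gcc")):
    """Return the names of tools and files that could not be found.

    `tools` are looked up on PATH; `ghidra_headless` is checked as a file path.
    """
    missing = [tool for tool in tools if shutil.which(tool) is None]
    if not Path(ghidra_headless).is_file():
        missing.append(str(ghidra_headless))
    return missing


if __name__ == "__main__":
    # Path assumed from the demo script; change it to match your install.
    missing = missing_prerequisites("./ghidra_11.0.3_PUBLIC/support/analyzeHeadless")
    print("missing:", missing if missing else "nothing")
```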
54
+
55
+ 3. Use Ghidra Headless to decompile binary (demo.py)
56
+
57
+ Note: **Replace** func0 with the function name you want to decompile.
58
+
59
+ **Preprocessing:** Compile the C code into a binary, then decompile the binary into pseudo-code with Ghidra's headless analyzer.
60
+ ```python
61
+ import os
62
+ import subprocess
63
+ import tempfile  # needed for tempfile.TemporaryDirectory() below
64
+
65
+ OPT = ["O0", "O1", "O2", "O3"]
66
+ timeout_duration = 10
67
+
68
+ ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"#path to the headless analyzer, change the path accordingly
69
+ postscript = "./decompile.py"#path to the decompiler helper function, change the path accordingly
70
+ project_path = "."#path to temp folder for analysis, change the path accordingly
71
+ project_name = "tmp_ghidra_proj"
72
+ func_path = "../samples/sample.c"#path to c code for compiling and decompiling, change the path accordingly
73
+ fileName = "sample"
74
+
75
+ with tempfile.TemporaryDirectory() as temp_dir:
76
+ pid = os.getpid()
77
+ asm_all = {}
78
+ for opt in [OPT[0]]:
79
+ executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
80
+ cmd = f'gcc -{opt} -o {executable_path} {func_path} -lm'
81
+ subprocess.run(
82
+ cmd.split(' '),
83
+ check=True,
84
+ stdout=subprocess.DEVNULL, # Suppress stdout
85
+ stderr=subprocess.DEVNULL, # Suppress stderr
86
+ timeout=timeout_duration,
87
+ )
88
+
89
+ output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
90
+ command = [
91
+ ghidra_path,
92
+ temp_dir,
93
+ project_name,
94
+ "-import", executable_path,
95
+ "-postScript", postscript, output_path,
96
+ "-deleteProject", # WARNING: This will delete the project after analysis
97
+ ]
98
+ result = subprocess.run(command, text=True, capture_output=True, check=True)
99
+ with open(output_path,'r') as f:
100
+ c_decompile = f.read()
101
+ c_func = []
102
+ flag = 0
103
+ for line in c_decompile.split('\n'):
105
+ if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
106
+ flag = 1
107
+ c_func.append(line)
108
+ continue
109
+ if flag:
110
+ if '// Function:' in line:
111
+ if len(c_func) > 1:
112
+ break
113
+ c_func.append(line)
114
+ if flag == 0:
115
+ raise ValueError('bad case no function found')
116
+ for idx_tmp in range(1,len(c_func)):##########remove the comments
117
+ if 'func0' in c_func[idx_tmp]:
118
+ break
119
+ c_func = c_func[idx_tmp:]
120
+ input_asm = '\n'.join(c_func).strip()
122
+
123
+ before = f"# This is the assembly code:\n"#prompt
125
+ after = "\n# What is the source code?\n"#prompt
128
+ input_asm_prompt = before+input_asm.strip()+after
129
+ with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
130
+ f.write(input_asm_prompt)
131
+ ```
132
+
133
+ Ghidra pseudo-code may look like this:
134
+ ```c
135
+ undefined4 func0(float param_1,long param_2,int param_3)
136
+ {
137
+ int local_28;
138
+ int local_24;
139
+
140
+ local_24 = 0;
141
+ do {
142
+ local_28 = local_24;
143
+ if (param_3 <= local_24) {
144
+ return 0;
145
+ }
146
+ while (local_28 = local_28 + 1, local_28 < param_3) {
147
+ if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
148
+ *(float *)(param_2 + (long)local_28 * 4)) &
149
+ SUB168(_DAT_00402010,0)) < (double)param_1) {
150
+ return 1;
151
+ }
152
+ }
153
+ local_24 = local_24 + 1;
154
+ } while( true );
155
  }
156
+ ```
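
The function-extraction and prompt-building logic from step 3 can be isolated into two small helpers. This is a sketch under assumptions, not the repository's code: `extract_function` and `build_prompt` are hypothetical names, and the input is assumed to separate functions with `// Function: <name>` marker lines, as the demo script's string checks suggest. Unlike the demo, the sketch matches the function name exactly rather than by substring.

```python
def extract_function(decompiled, name):
    """Return the text of function `name` from marker-delimited Ghidra output."""
    marker = "// Function:"
    collected = []
    found = False
    for line in decompiled.split("\n"):
        if line.startswith(marker):
            if found:
                break  # the next function starts here
            found = line[len(marker):].strip() == name
            continue
        if found:
            collected.append(line)
    if not found:
        raise ValueError(f"function {name!r} not found")
    # Skip leading comment lines until the signature that mentions the name.
    start = next((i for i, l in enumerate(collected) if name in l), 0)
    return "\n".join(collected[start:]).strip()


def build_prompt(pseudo_code):
    """Wrap pseudo-code in the prompt strings used by the demo script."""
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    return before + pseudo_code.strip() + after


SAMPLE = (
    "// Function: helper\nint helper(void)\n{\n  return 1;\n}\n\n"
    "// Function: func0\n/* comment */\nint func0(int a)\n{\n  return a;\n}\n\n"
    "// Function: main\nint main(void)\n{\n  return 0;\n}\n"
)

if __name__ == "__main__":
    print(build_prompt(extract_function(SAMPLE, "func0")))
```

Keeping extraction separate from prompt construction makes it easier to test each piece without running Ghidra.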
157
+ 4. Refine pseudo-code using LLM4Decompile (demo.py)
158
 
159
+ **Decompilation:** Use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C:
160
+ ```python
161
+ from transformers import AutoTokenizer, AutoModelForCausalLM
162
+ import torch
163
 
164
+ model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
165
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
166
+ model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()
167
 
168
+ with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
169
+ asm_func = f.read()
170
+ inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
171
+ with torch.no_grad():
172
+ outputs = model.generate(**inputs, max_new_tokens=2048)  # maximum length is 4096 tokens; keep prompt tokens plus max_new_tokens within that range
173
+ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
174
 
175
+ with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
176
+ func = f.read()
177
 
178
+ print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
180
+ print(f'refined function:\n{c_func_decompile}')
182
 
183
+ ```
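
The comment on `max_new_tokens` in the snippet above refers to the 4,096-token maximum length. One way to respect it is to cap the generation budget by what is left after the prompt. The sketch below assumes that prompt tokens and generated tokens share the same 4,096-token window; `generation_budget` is a hypothetical helper, not part of the repository or of `transformers`.

```python
CONTEXT_LENGTH = 4096  # maximum token length stated in the model card


def generation_budget(prompt_tokens, requested=2048, context_length=CONTEXT_LENGTH):
    """Return how many new tokens can be generated without exceeding the window."""
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room in a "
            f"{context_length}-token window"
        )
    return min(requested, remaining)


if __name__ == "__main__":
    # With a 3000-token prompt, only 1096 tokens remain in the window.
    print(generation_budget(3000))
```

With the demo's variables, the result could be passed as `max_new_tokens=generation_budget(inputs["input_ids"].shape[1])`.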
184
 
185
  ### 4. License
186
+ This code repository is licensed under the MIT and DeepSeek licenses.
187
 
188
  ### 5. Contact
189
 
190
+ If you have any questions, please raise an issue.