Commit 9d3026d (verified) · committed by nielsr (HF Staff) · 1 Parent(s): be2ac0b

Improve model card: Add pipeline tag, library name, and paper link

This PR enhances the model card for `LLM4Binary/llm4decompile-6.7b-v2` by:

- Adding `pipeline_tag: text-generation` to improve model discoverability on the Hugging Face Hub, as the model's core task is to decompile binary code into human-readable C source code.
- Adding `library_name: transformers` based on the explicit usage of `AutoTokenizer` and `AutoModelForCausalLM` from the `transformers` library in the provided sample code, which enables an automated "how to use" snippet.
- Adding a prominent link to the associated paper, [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668), at the top of the model card.
- Adding a prominent link to the project's [GitHub repository](https://github.com/albertan017/LLM4Decompile) at the top of the model card.
- Updating the license text within the content to "MIT and DeepSeek License" to accurately reflect the information provided in the GitHub README.

These updates will provide clearer information for users and better integrate the model within the Hugging Face ecosystem.
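
The two metadata keys described above live in the README's YAML front matter. As a rough illustration, the sketch below reads scalar keys from such a header using only the standard library. The helper `parse_front_matter` and the `README_HEAD` string are hypothetical; in particular, the first two lines (`---` and `license: mit`) are not shown in the diff and are assumed from the hunk header.

```python
def parse_front_matter(text):
    """Return top-level `key: value` pairs from a README's YAML front matter.

    Deliberately minimal: it keeps scalar values only and skips list items,
    so it is not a substitute for a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter of the front matter
            break
        if line.startswith((" ", "-")) or ":" not in line:
            continue  # list item or continuation line
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


# Assumed header after this PR; only the lines from `tags:` onward appear in the diff.
README_HEAD = """---
license: mit
tags:
- decompile
- binary
pipeline_tag: text-generation
library_name: transformers
---
"""

meta = parse_front_matter(README_HEAD)
print(meta["pipeline_tag"], meta["library_name"])
```

A check like this can confirm that both keys are present before the card is pushed; the Hub itself reads the same header to pick the task filter and the usage snippet.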

Files changed (1)
  1. README.md +157 -142
README.md CHANGED
@@ -3,173 +3,188 @@ license: mit
  tags:
  - decompile
  - binary
  ---

  ### 1. Introduction of LLM4Decompile

  LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series are trained with a larger dataset (2B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous model.

- - **Github Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)


  ### 2. Evaluation Results

  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
- |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
- | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
- | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
- | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
- | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
- | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
- | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |

  ### 3. How to Use
  Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

- 1. Install Ghidra
- Download [Ghidra](https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip) to the current folder. You can also check the [page](https://github.com/NationalSecurityAgency/ghidra/releases) for other versions. Unzip the package to the current folder.
- In bash, you can use the following:
- ```bash
- cd LLM4Decompile/ghidra
- wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
- unzip ghidra_11.0.3_PUBLIC_20240410.zip
- ```
- 2. Install Java-SDK-17
- Ghidra 11 is dependent on Java-SDK-17, a simple way to install the SDK on Ubuntu:
- ```bash
- apt-get update
- apt-get upgrade
- apt install openjdk-17-jdk openjdk-17-jre
- ```
- Please check [Ghidra install guide](https://htmlpreview.github.io/?https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.1.1_build/GhidraDocs/InstallationGuide.html) for other platforms.
-
- 3. Use Ghidra Headless to decompile binary (demo.py)
-
- Note: **Replace** func0 with the function name you want to decompile.
-
- **Preprocessing:** Compile the C code into binary, and disassemble the binary into assembly instructions.
- ```python
- import os
- import subprocess
- import tempfile
- from tqdm import tqdm,trange
-
- OPT = ["O0", "O1", "O2", "O3"]
- timeout_duration = 10
-
- ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"#path to the headless analyzer, change the path accordingly
- postscript = "./decompile.py"#path to the decompiler helper function, change the path accordingly
- project_path = "."#path to temp folder for analysis, change the path accordingly
- project_name = "tmp_ghidra_proj"
- func_path = "../samples/sample.c"#path to c code for compiling and decompiling, change the path accordingly
- fileName = "sample"
-
- with tempfile.TemporaryDirectory() as temp_dir:
-     pid = os.getpid()
-     asm_all = {}
-     for opt in [OPT[0]]:
-         executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
-         cmd = f'gcc -{opt} -o {executable_path} {func_path} -lm'
-         subprocess.run(
-             cmd.split(' '),
-             check=True,
-             stdout=subprocess.DEVNULL, # Suppress stdout
-             stderr=subprocess.DEVNULL, # Suppress stderr
-             timeout=timeout_duration,
-         )
-
-         output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
-         command = [
-             ghidra_path,
-             temp_dir,
-             project_name,
-             "-import", executable_path,
-             "-postScript", postscript, output_path,
-             "-deleteProject", # WARNING: This will delete the project after analysis
-         ]
-         result = subprocess.run(command, text=True, capture_output=True, check=True)
-         with open(output_path,'r') as f:
-             c_decompile = f.read()
-         c_func = []
-         flag = 0
-         for line in c_decompile.split('\n'):
-             if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
-                 flag = 1
-                 c_func.append(line)
-                 continue
-             if flag:
-                 if '// Function:' in line:
-                     if len(c_func) > 1:
-                         break
-                 c_func.append(line)
-         if flag == 0:
-             raise ValueError('bad case no function found')
-         for idx_tmp in range(1,len(c_func)):##########remove the comments
-             if 'func0' in c_func[idx_tmp]:
-                 break
-         c_func = c_func[idx_tmp:]
-         input_asm = '\n'.join(c_func).strip()
-
-         before = f"# This is the assembly code:\n"#prompt
-         after = "\n# What is the source code?\n"#prompt
-         input_asm_prompt = before+input_asm.strip()+after
-         with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
-             f.write(input_asm_prompt)
- ```
-
- Ghidra pseudo-code may look like this:
- ```c
- undefined4 func0(float param_1,long param_2,int param_3)
- {
-   int local_28;
-   int local_24;
-
-   local_24 = 0;
-   do {
-     local_28 = local_24;
-     if (param_3 <= local_24) {
-       return 0;
-     }
-     while (local_28 = local_28 + 1, local_28 < param_3) {
-       if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
-                                   *(float *)(param_2 + (long)local_28 * 4)) &
-                   SUB168(_DAT_00402010,0)) < (double)param_1) {
-         return 1;
-       }
      }
-     local_24 = local_24 + 1;
-   } while( true );
- }
- ```
- 4. Refine pseudo-code using LLM4Decompile (demo.py)

- **Decompilation:** Use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C:
- ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch

- model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

- with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
-     asm_func = f.read()
- inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
- with torch.no_grad():
-     outputs = model.generate(**inputs, max_new_tokens=2048)### max length to 4096, max new tokens should be below the range
- c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

- with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
-     func = f.read()

- print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
- print(f'refined function:\n{c_func_decompile}')

- ```

  ### 4. License
- This code repository is licensed under the MIT License.

  ### 5. Contact

- If you have any questions, please raise an issue.

  tags:
  - decompile
  - binary
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

+ # LLM4Decompile: Decompiling Binary Code with Large Language Models
+
+ This repository contains the `LLM4Decompile` model, which is part of the work presented in [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).
+
+ The project's code and more details can be found on its [GitHub repository](https://github.com/albertan017/LLM4Decompile).
+
  ### 1. Introduction of LLM4Decompile

  LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series are trained with a larger dataset (2B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous model.

+ - **Github Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)


  ### 2. Evaluation Results

  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
+ |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
+ | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
+ | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
+ | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
+ | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
+ | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
+ | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |

  ### 3. How to Use
  Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

+ 1. Install Ghidra
+ Download [Ghidra](https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip) to the current folder. You can also check the [page](https://github.com/NationalSecurityAgency/ghidra/releases) for other versions. Unzip the package to the current folder.
+ In bash, you can use the following:
+ ```bash
+ cd LLM4Decompile/ghidra
+ wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
+ unzip ghidra_11.0.3_PUBLIC_20240410.zip
+ ```
+ 2. Install Java-SDK-17
+ Ghidra 11 is dependent on Java-SDK-17, a simple way to install the SDK on Ubuntu:
+ ```bash
+ apt-get update
+ apt-get upgrade
+ apt install openjdk-17-jdk openjdk-17-jre
+ ```
+ Please check [Ghidra install guide](https://htmlpreview.github.io/?https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.1.1_build/GhidraDocs/InstallationGuide.html) for other platforms.
+
+ 3. Use Ghidra Headless to decompile binary (demo.py)
+
+ Note: **Replace** func0 with the function name you want to decompile.
+
+ **Preprocessing:** Compile the C code into binary, and disassemble the binary into assembly instructions.
+ ```python
+ import os
+ import subprocess
+ import tempfile
+ from tqdm import tqdm,trange
+
+ OPT = ["O0", "O1", "O2", "O3"]
+ timeout_duration = 10
+
+ ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"#path to the headless analyzer, change the path accordingly
+ postscript = "./decompile.py"#path to the decompiler helper function, change the path accordingly
+ project_path = "."#path to temp folder for analysis, change the path accordingly
+ project_name = "tmp_ghidra_proj"
+ func_path = "../samples/sample.c"#path to c code for compiling and decompiling, change the path accordingly
+ fileName = "sample"
+
+ with tempfile.TemporaryDirectory() as temp_dir:
+     pid = os.getpid()
+     asm_all = {}
+     for opt in [OPT[0]]:
+         executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
+         cmd = f'gcc -{opt} -o {executable_path} {func_path} -lm'
+         subprocess.run(
+             cmd.split(' '),
+             check=True,
+             stdout=subprocess.DEVNULL, # Suppress stdout
+             stderr=subprocess.DEVNULL, # Suppress stderr
+             timeout=timeout_duration,
+         )
+
+         output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
+         command = [
+             ghidra_path,
+             temp_dir,
+             project_name,
+             "-import", executable_path,
+             "-postScript", postscript, output_path,
+             "-deleteProject", # WARNING: This will delete the project after analysis
+         ]
+         result = subprocess.run(command, text=True, capture_output=True, check=True)
+         with open(output_path,'r') as f:
+             c_decompile = f.read()
+         c_func = []
+         flag = 0
+         for line in c_decompile.split('\n'):
+             if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
+                 flag = 1
+                 c_func.append(line)
+                 continue
+             if flag:
+                 if '// Function:' in line:
+                     if len(c_func) > 1:
+                         break
+                 c_func.append(line)
+         if flag == 0:
+             raise ValueError('bad case no function found')
+         for idx_tmp in range(1,len(c_func)):##########remove the comments
+             if 'func0' in c_func[idx_tmp]:
+                 break
+         c_func = c_func[idx_tmp:]
+         input_asm = '\n'.join(c_func).strip()
+
+         before = f"# This is the assembly code:\n"#prompt
+         after = "\n# What is the source code?\n"#prompt
+         input_asm_prompt = before+input_asm.strip()+after
+         with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
+             f.write(input_asm_prompt)
+ ```
+
+ Ghidra pseudo-code may look like this:
+ ```c
+ undefined4 func0(float param_1,long param_2,int param_3)
+ {
+   int local_28;
+   int local_24;
+
+   local_24 = 0;
+   do {
+     local_28 = local_24;
+     if (param_3 <= local_24) {
+       return 0;
+     }
+     while (local_28 = local_28 + 1, local_28 < param_3) {
+       if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
+                                   *(float *)(param_2 + (long)local_28 * 4)) &
+                   SUB168(_DAT_00402010,0)) < (double)param_1) {
+         return 1;
+       }
+     }
+     local_24 = local_24 + 1;
+   } while( true );
  }
+ ```
+ 4. Refine pseudo-code using LLM4Decompile (demo.py)

+ **Decompilation:** Use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch

+ model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

+ with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
+     asm_func = f.read()
+ inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=2048)### max length to 4096, max new tokens should be below the range
+ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

+ with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
+     func = f.read()

+ print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
+ print(f'refined function:\n{c_func_decompile}')

+ ```

  ### 4. License
+ This code repository is licensed under the MIT and DeepSeek License.

  ### 5. Contact

+ If you have any questions, please raise an issue.
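
The preprocessing step in the README diff above picks one function out of Ghidra's output and wraps it in the model prompt. The sketch below isolates that logic so it can be tried without Ghidra or gcc. It assumes the helper script writes a `// Function: <name>` marker line before each function, which is inferred from the string checks in the README code rather than confirmed; `extract_function`, `build_prompt`, and `SAMPLE` are hypothetical names.

```python
def extract_function(decompiled, name):
    """Return the text of `name` from output delimited by '// Function: <name>' lines."""
    marker = f"// Function: {name}"
    collected = []
    inside = False
    for line in decompiled.split("\n"):
        if line.strip() == marker:
            inside = True  # start collecting after the marker line
            continue
        if inside and line.startswith("// Function:"):
            break  # the next function begins here
        if inside:
            collected.append(line)
    if not inside:
        raise ValueError(f"function {name!r} not found")
    return "\n".join(collected).strip()


def build_prompt(pseudo_code):
    """Wrap pseudo-code in the prompt strings used by the README code."""
    return "# This is the assembly code:\n" + pseudo_code.strip() + "\n# What is the source code?\n"


# Hypothetical decompiler output with three functions.
SAMPLE = """// Function: helper
int helper(void)
{
  return 1;
}

// Function: func0
int func0(int param_1)
{
  return param_1 + 1;
}

// Function: main
int main(void)
{
  return 0;
}
"""

prompt = build_prompt(extract_function(SAMPLE, "func0"))
print(prompt)
```

Matching the whole marker line, rather than testing for a substring as the README code does, avoids picking up `func01` when `func0` is requested; that stricter match is a design choice of this sketch, not the README's behaviour.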