Improve model card: Add pipeline tag, library name, and paper link
This PR enhances the model card for `LLM4Binary/llm4decompile-6.7b-v2` by:
- Adding `pipeline_tag: text-generation` to improve model discoverability on the Hugging Face Hub, as the model's core task is to decompile binary code into human-readable C source code.
- Adding `library_name: transformers` based on the explicit usage of `AutoTokenizer` and `AutoModelForCausalLM` from the `transformers` library in the provided sample code, which enables an automated "how to use" snippet.
- Adding a prominent link to the associated paper, [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668), at the top of the model card.
- Adding a prominent link to the project's [GitHub repository](https://github.com/albertan017/LLM4Decompile) at the top of the model card.
- Updating the license text within the content to "MIT and DeepSeek License" to accurately reflect the information provided in the GitHub README.
These updates will provide clearer information for users and better integrate the model within the Hugging Face ecosystem.
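As a rough illustration of the metadata change, the sketch below checks that a model card's YAML front matter carries the two keys this PR adds. The helper names (`front_matter_keys`, `missing_keys`) and the sample card text are hypothetical, and the parser only handles flat `key: value` lines, so it is a sanity check under those assumptions rather than a full YAML reader.

```python
REQUIRED_KEYS = ("pipeline_tag", "library_name")  # the two keys this PR adds


def front_matter_keys(card_text: str) -> dict:
    """Return top-level `key: value` pairs from a leading '---' block."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    pairs = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of the metadata block
            break
        # Skip list items ("- decompile") and indented continuation lines.
        if line.startswith(("-", " ", "\t")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def missing_keys(card_text: str) -> list:
    """List the required keys that are absent or empty in the front matter."""
    found = front_matter_keys(card_text)
    return [key for key in REQUIRED_KEYS if not found.get(key)]


# Hypothetical card text mirroring the metadata after this PR.
card = """---
license: mit
tags:
- decompile
- binary
pipeline_tag: text-generation
library_name: transformers
---
# LLM4Decompile
"""
print(missing_keys(card))
```

Running the same check against the pre-PR metadata (only `license` and `tags`) would report both keys as missing.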
@@ -3,173 +3,188 @@ license: mit
tags:
- decompile
- binary
pipeline_tag: text-generation
library_name: transformers
---

# LLM4Decompile: Decompiling Binary Code with Large Language Models

This repository contains the `LLM4Decompile` model, which is part of the work presented in [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).

The project's code and more details can be found on its [GitHub repository](https://github.com/albertan017/LLM4Decompile).

### 1. Introduction of LLM4Decompile

LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves performance by up to 100% compared to the previous model.

- **GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)


### 2. Evaluation Results

| Metrics                 | Re-executability Rate |         |         |         |         | Edit Similarity |         |         |         |         |
|:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
| Optimization Level      | O0                    | O1      | O2      | O3      | AVG     | O0              | O1      | O2      | O3      | AVG     |
| LLM4Decompile-End-6.7B  | 0.6805                | 0.3951  | 0.3671  | 0.3720  | 0.4537  | 0.1557          | 0.1292  | 0.1293  | 0.1269  | 0.1353  |
| Ghidra                  | 0.3476                | 0.1646  | 0.1524  | 0.1402  | 0.2012  | 0.0699          | 0.0613  | 0.0619  | 0.0547  | 0.0620  |
| +GPT-4o                 | 0.4695                | 0.3415  | 0.2866  | 0.3110  | 0.3522  | 0.0660          | 0.0563  | 0.0567  | 0.0499  | 0.0572  |
| +LLM4Decompile-Ref-1.3B | 0.6890                | 0.3720  | 0.4085  | 0.3720  | 0.4604  | 0.1517          | 0.1325  | 0.1292  | 0.1267  | 0.1350  |
| +LLM4Decompile-Ref-6.7B | 0.7439                | 0.4695  | 0.4756  | 0.4207  | 0.5274  | 0.1559          | 0.1353  | 0.1342  | 0.1273  | 0.1382  |
| +LLM4Decompile-Ref-33B  | 0.7073                | 0.4756  | 0.4390  | 0.4146  | 0.5091  | 0.1540          | 0.1379  | 0.1363  | 0.1307  | 0.1397  |

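The AVG columns in the table above appear to be plain arithmetic means over the four optimization levels. The sketch below recomputes the re-executability averages for three rows under that assumption; the dictionary layout is only illustrative, and the numbers are copied from the table.

```python
from statistics import mean

# Re-executability rates (O0, O1, O2, O3) copied from the table above.
reexec = {
    "LLM4Decompile-End-6.7B": [0.6805, 0.3951, 0.3671, 0.3720],
    "Ghidra": [0.3476, 0.1646, 0.1524, 0.1402],
    "+LLM4Decompile-Ref-6.7B": [0.7439, 0.4695, 0.4756, 0.4207],
}

# Assumption: AVG is the unweighted mean, rounded to four decimals.
avg = {name: round(mean(rates), 4) for name, rates in reexec.items()}

for name, value in avg.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values match the AVG entries reported for these rows, which suggests each optimization level is weighted equally.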
### 3. How to Use
Here is an example of how to use our model (V2 only; for previous models, please check the corresponding model page on Hugging Face).

1. Install Ghidra
Download [Ghidra](https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip) to the current folder and unzip the package there. You can also check the [releases page](https://github.com/NationalSecurityAgency/ghidra/releases) for other versions.
In bash, you can use the following:
```bash
cd LLM4Decompile/ghidra
wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
unzip ghidra_11.0.3_PUBLIC_20240410.zip
```
2. Install Java-SDK-17
Ghidra 11 depends on Java-SDK-17. A simple way to install the SDK on Ubuntu:
```bash
apt-get update
apt-get upgrade
apt install openjdk-17-jdk openjdk-17-jre
```
Please check the [Ghidra install guide](https://htmlpreview.github.io/?https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_11.1.1_build/GhidraDocs/InstallationGuide.html) for other platforms.

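Before running Ghidra it can help to confirm that the `java` on the PATH is version 17. The sketch below parses the output of `java -version`; the helper names are hypothetical, and the version-string formats it handles (`"17.0.10"` and the legacy `"1.8.0_292"` style) are the common OpenJDK ones, assumed here rather than taken from the model card.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output: str):
    """Extract the major version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Releases before Java 9 report themselves as 1.x (e.g. "1.8.0_292").
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` on PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    # `java -version` usually writes to stderr, so read both streams.
    proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(proc.stderr + proc.stdout)


# Example with a typical OpenJDK 17 banner line.
print(parse_java_major('openjdk version "17.0.10" 2024-01-16'))
```

Calling `installed_java_major()` and comparing the result with 17 gives a quick pre-flight check before invoking `analyzeHeadless`.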
3. Use Ghidra Headless to decompile binary (demo.py)

Note: **Replace** func0 with the function name you want to decompile.

**Preprocessing:** Compile the C code into a binary, then decompile the binary with Ghidra Headless into pseudo-code.
```python
import os
import subprocess
import tempfile

OPT = ["O0", "O1", "O2", "O3"]
timeout_duration = 10

ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"  # path to the headless analyzer, change the path accordingly
postscript = "./decompile.py"  # path to the decompiler helper function, change the path accordingly
project_path = "."  # path to temp folder for analysis, change the path accordingly
project_name = "tmp_ghidra_proj"
func_path = "../samples/sample.c"  # path to c code for compiling and decompiling, change the path accordingly
fileName = "sample"

with tempfile.TemporaryDirectory() as temp_dir:
    pid = os.getpid()
    asm_all = {}
    for opt in [OPT[0]]:
        # Compile the C source at the chosen optimization level.
        executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
        cmd = ["gcc", f"-{opt}", "-o", executable_path, func_path, "-lm"]
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,  # Suppress stdout
            stderr=subprocess.DEVNULL,  # Suppress stderr
            timeout=timeout_duration,
        )

        # Run Ghidra headless; the post-script writes the pseudo-code to output_path.
        output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
        command = [
            ghidra_path,
            temp_dir,
            project_name,
            "-import", executable_path,
            "-postScript", postscript, output_path,
            "-deleteProject",  # WARNING: This will delete the project after analysis
        ]
        result = subprocess.run(command, text=True, capture_output=True, check=True)
        with open(output_path, 'r') as f:
            c_decompile = f.read()

        # Collect the lines that belong to the target function.
        c_func = []
        flag = 0
        for line in c_decompile.split('\n'):
            if "Function: func0" in line:  # **Replace** func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)
                continue
            if flag:
                if '// Function:' in line:
                    if len(c_func) > 1:
                        break
                c_func.append(line)
        if flag == 0:
            raise ValueError('bad case no function found')
        # Drop the comment lines that precede the function signature.
        for idx_tmp in range(1, len(c_func)):
            if 'func0' in c_func[idx_tmp]:
                break
        c_func = c_func[idx_tmp:]
        input_asm = '\n'.join(c_func).strip()

        before = "# This is the assembly code:\n"  # prompt
        after = "\n# What is the source code?\n"  # prompt
        input_asm_prompt = before + input_asm.strip() + after
        with open(fileName + '_' + opt + '.pseudo', 'w', encoding='utf-8') as f:
            f.write(input_asm_prompt)
```

Ghidra pseudo-code may look like this:
```c
undefined4 func0(float param_1,long param_2,int param_3)
{
  int local_28;
  int local_24;

  local_24 = 0;
  do {
    local_28 = local_24;
    if (param_3 <= local_24) {
      return 0;
    }
    while (local_28 = local_28 + 1, local_28 < param_3) {
      if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
                                  *(float *)(param_2 + (long)local_28 * 4)) &
                  SUB168(_DAT_00402010,0)) < (double)param_1) {
        return 1;
      }
    }
    local_24 = local_24 + 1;
  } while( true );
}
```
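The extraction loop in the preprocessing script can be tried without Ghidra installed. The sketch below is a simplified variant, assuming every function in the post-script output is preceded by a `// Function: <name>` comment (the marker the script searches for). `extract_function`, `build_prompt`, and the sample text are hypothetical names and data; unlike the original loop, this version matches the function name exactly instead of by substring.

```python
def extract_function(decompiled: str, func_name: str) -> str:
    """Return the pseudo-code of `func_name` from Ghidra-style output.

    Assumes each function is preceded by a `// Function: <name>` comment.
    """
    kept = []
    inside = False
    for line in decompiled.split("\n"):
        if line.strip() == f"// Function: {func_name}":
            inside = True  # start collecting after the marker line
            continue
        if inside:
            if line.startswith("// Function:"):  # the next function begins
                break
            kept.append(line)
    if not inside:
        raise ValueError(f"function {func_name!r} not found")
    return "\n".join(kept).strip()


def build_prompt(pseudo_code: str) -> str:
    """Wrap pseudo-code in the prompt format used by the preprocessing script."""
    return (
        "# This is the assembly code:\n"
        f"{pseudo_code.strip()}"
        "\n# What is the source code?\n"
    )


# Hypothetical post-script output containing three functions.
sample = """// Function: helper
int helper(void)
{
  return 1;
}

// Function: func0
int func0(int param_1)
{
  return param_1 + 1;
}

// Function: main
int main(void)
{
  return 0;
}
"""
prompt = build_prompt(extract_function(sample, "func0"))
print(prompt)
```

Exact matching avoids picking up `func01` when asking for `func0`, at the cost of being stricter about the marker format.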
4. Refine pseudo-code using LLM4Decompile (demo.py)

**Decompilation:** Use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# `fileName` and `OPT` are defined in the preprocessing script above.
model_path = 'LLM4Binary/llm4decompile-6.7b-v2'  # V2 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + '_' + OPT[0] + '.pseudo', 'r') as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Max length is 4096, so max_new_tokens should stay below that range.
    outputs = model.generate(**inputs, max_new_tokens=2048)
# Keep only the newly generated tokens: skip the prompt and drop the final token.
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

with open(fileName + '_' + OPT[0] + '.pseudo', 'r') as f:  # original file
    func = f.read()

# Note we only decompile one function, where the original file may contain multiple functions
print(f'pseudo function:\n{func}')
print(f'refined function:\n{c_func_decompile}')
```

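Two details of the decompilation script can be shown on plain lists, with no model download: keeping the prompt plus completion inside the 4,096-token limit, and slicing the prompt off the generated ids. The sketch assumes the 4,096 limit covers prompt and completion together; `new_token_budget`, `strip_prompt`, and the token ids are hypothetical stand-ins.

```python
MAX_CONTEXT = 4096  # maximum token length stated in the introduction


def new_token_budget(prompt_len: int, requested: int = 2048) -> int:
    """Cap `max_new_tokens` so that prompt + completion fits the context."""
    remaining = MAX_CONTEXT - prompt_len
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the context window")
    return min(requested, remaining)


def strip_prompt(output_ids: list, prompt_len: int) -> list:
    """Mirror `outputs[0][len(inputs[0]):-1]` on a plain list of ids."""
    return output_ids[prompt_len:-1]


prompt_ids = [101, 7, 8, 9]            # stand-in token ids, not real ones
generated = prompt_ids + [42, 43, 2]   # the output echoes the prompt; 2 plays the end token

print(strip_prompt(generated, len(prompt_ids)))  # → [42, 43]
print(new_token_budget(3000))                    # → 1096
```

One caveat of the `:-1` slice: if generation stops because it hit `max_new_tokens` rather than emitting an end token, the last position holds a real token and is dropped as well.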
### 4. License
This code repository is licensed under the MIT and DeepSeek License.

### 5. Contact

If you have any questions, please raise an issue.