---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V2 CodeBERT

![SmartBERT](./framework.png)

## Overview

SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.  
It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective.  
This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.

SmartBERT V2 can be used for tasks such as:

- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval

SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.  
For production use or general smart contract representation tasks, we recommend **[SmartBERT V3](https://huggingface.co/web3se/SmartBERT-v3)**.

---

## Training Data

SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.

To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.

For benchmarking in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**.  
The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, ensuring that downstream evaluations remain unbiased and free from data leakage.

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**.  
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
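
The normalization described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing script: the function name `normalize_function_code` is hypothetical, and it assumes each newline or tab is replaced one-for-one by a space (runs of whitespace are not collapsed).

```python
def normalize_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space.

    Assumption: replacement is character-for-character, so consecutive
    newlines or tabs yield consecutive spaces.
    """
    return code.replace("\n", " ").replace("\t", " ")


# A small Solidity function written across several lines
raw = "function f()\n{\n\treturn 1;\n}"
flat = normalize_function_code(raw)
```

Applying the same normalization at inference time keeps inputs consistent with what the model saw during pretraining.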

---

## Base Model

SmartBERT V2 is initialized from:

- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

CodeBERT is a transformer-based model trained on source code and natural language pairs.  
SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.

During training:

- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
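
To make the masking step concrete, here is a small, library-free sketch of token masking. The 15% masking rate and the 80/10/10 split (mask token, random token, unchanged) follow the common BERT convention and are assumptions here; the rates actually used for SmartBERT V2 are not stated above. The names `mask_tokens` and `MASK_TOKEN` are illustrative only.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token (assumed for illustration)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)         # replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # replace with a random token
            else:
                inputs.append(tok)                # keep the original token
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
vocab = sorted(set(tokens))
masked_inputs, labels = mask_tokens(tokens, vocab, seed=0)
```

In practice, the Transformers library provides `DataCollatorForLanguageModeling`, which performs this kind of masking on token IDs.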

---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework with the following configuration:

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts

Example training configuration:

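```python
from transformers import TrainingArguments

# Placeholders: set these for your own environment.
OUTPUT_DIR = "./output"  # directory where checkpoints are written
checkpoint = None        # path of a checkpoint to resume from, or None

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,               # save a checkpoint every 10,000 steps
    save_total_limit=2,             # keep only the two most recent checkpoints
    evaluation_strategy="steps",    # named `eval_strategy` in recent Transformers releases
    eval_steps=10000,               # evaluate every 10,000 steps
    resume_from_checkpoint=checkpoint
)
```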

---

## Evaluation

The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

---

## How to Use

You can load SmartBERT V2 using the **HuggingFace Transformers** library.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
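
When several functions are embedded in one batch with padding, a plain `.mean(dim=1)` also averages over padding positions. A common remedy is to weight the average by the attention mask. The sketch below uses toy tensors in place of real model outputs, and `masked_mean_pool` is a hypothetical helper name, not part of the SmartBERT code base.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions only."""
    # (batch, seq_len) -> (batch, seq_len, 1), matching the embedding dtype
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts


# Toy stand-ins: batch of 2 sequences, 3 positions, hidden size 2.
# The last position of the first sequence is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
```

With the model loaded as above, the equivalent call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` after tokenizing a list of functions with `padding=True`.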

---

## GitHub Repository

To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:

[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

---

## Citation

If you use **SmartBERT** in your research, please cite:

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)