---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
- web3se/SmartBERT-v2
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V3 CodeBERT

![SmartBERT](https://huggingface.co/web3se/SmartBERT-v2/resolve/main/framework.png)

## Overview

**SmartBERT V3** is a domain-adapted pre-trained programming language model for **smart contract code understanding**, built upon **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.

Training continues from the **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** checkpoint on a substantially larger corpus of smart contracts, which improves robustness and yields richer semantic representations of **function-level smart contract code**.

SmartBERT V3 is particularly suitable for tasks such as:

- Smart contract intent detection
- Code representation learning
- Code similarity analysis
- Vulnerability detection
- Smart contract classification

Compared with **SmartBERT V2**, this version expands the training corpus from 16,000 to 80,000 smart contracts, improving the model’s ability to capture semantic patterns in smart contract functions.

---

## Training Data

SmartBERT V3 was trained on a total of **80,000 smart contracts**, including:

- **16,000 contracts** used in **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)**
- **64,000 additional smart contracts** collected from public blockchain repositories

All contracts are primarily written in **Solidity** and processed at the **function level** to better capture fine-grained semantic structures of smart contract code.
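
Function-level processing can be illustrated with a naive splitter. This is only a sketch: the helper name `extract_functions` is hypothetical, the actual extraction tooling used for SmartBERT is not described here, and the brace matching below ignores braces inside strings and comments.

```python
import re


def extract_functions(source: str) -> list[str]:
    """Naively split Solidity source into function definitions.

    Hypothetical helper for illustration only; a real pipeline would
    more likely rely on a Solidity parser.
    """
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        depth = 0
        for i in range(start, len(source)):
            ch = source[i]
            if ch == ";" and depth == 0:
                # Declaration without a body, e.g. inside an interface.
                functions.append(source[start:i + 1])
                break
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Closing brace of the function body.
                    functions.append(source[start:i + 1])
                    break
    return functions
```

Each returned string is one function definition or declaration, which can then be tokenized as a separate training sample.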

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as **CodeBERT**.

During training:

- A subset of tokens in the input code is randomly masked
- The model learns to predict these masked tokens from surrounding context

This process enables the model to learn deeper **syntactic and semantic representations** of smart contract programs.
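
The masking step can be sketched in plain Python. The 15% masking rate is an assumption borrowed from the common BERT default and is not stated for SmartBERT V3; the function name `mask_tokens` is hypothetical, and the sketch skips refinements such as occasionally keeping or randomizing a selected token.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as in the usage example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere, so a loss would be computed
    only on masked positions. mask_prob=0.15 is an assumed default.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In practice the Transformers library provides `DataCollatorForLanguageModeling`, which performs this step on token IDs during batching.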

---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework.

- **Hardware:** 2 × NVIDIA A100 (80 GB)
- **Training Duration:** Over **30 hours**
- **Training Dataset:** 80,000 smart contracts
- **Evaluation Dataset:** 1,500 smart contracts

Example training configuration:

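```python
from transformers import TrainingArguments

# OUTPUT_DIR and checkpoint are placeholders defined elsewhere in the training script.
training_args = TrainingArguments(
  output_dir=OUTPUT_DIR,
  overwrite_output_dir=True,
  num_train_epochs=20,
  per_device_train_batch_size=64,
  save_steps=10000,
  save_total_limit=2,
  evaluation_strategy="steps",  # named eval_strategy in recent Transformers releases
  eval_steps=10000,
  resume_from_checkpoint=checkpoint
)
```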

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the *function* code were replaced with a single space to ensure a consistent input format for tokenization.
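
A minimal sketch of this normalization, assuming each newline or tab becomes exactly one space and that runs of whitespace are not collapsed further (the helper name `flatten_function_code` is hypothetical):

```python
# Map each newline and tab character to a single space.
_WHITESPACE_TABLE = str.maketrans({"\n": " ", "\t": " "})


def flatten_function_code(code: str) -> str:
    """Replace every newline and tab in the function code with one space."""
    return code.translate(_WHITESPACE_TABLE)
```

Applying the same normalization to inputs at inference time keeps them consistent with the format seen during training.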

---

## Base Model

SmartBERT V3 builds upon the following models:

* **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
* **Intermediate Model**: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)

---

## Usage

Example usage with HuggingFace Transformers:

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

code_example = "function totalSupply() external view <mask> (uint256);"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example)
print(outputs)
```
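
For a single `<mask>` input, the `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. A small helper for reading the top candidates might look like the sketch below; the name `summarize_predictions` is hypothetical.

```python
def summarize_predictions(outputs, top_k=3):
    """Return (token, score) pairs for the highest-scoring candidates.

    outputs is the list of dicts produced by a fill-mask pipeline call
    on an input that contains a single mask token.
    """
    ranked = sorted(outputs, key=lambda item: item["score"], reverse=True)
    # RoBERTa tokens often carry a leading space, so strip it for display.
    return [(item["token_str"].strip(), round(item["score"], 4)) for item in ranked[:top_k]]
```

Calling `summarize_predictions(outputs)` on the `outputs` list from the example above yields the three most likely replacements for `<mask>` with their scores.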

---

## Training and Deployment

To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

---

## Contributors

* [Youwei Huang](https://www.devil.ren)
* [Sen Fang](https://github.com/TomasAndersonFang)

---

## Citation

```bibtex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)