SmartBERT-v3 / README.md
devilyouwei's picture
Update README.md
2f9f8f1 verified
|
raw
history blame
4.5 kB
---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
- web3se/SmartBERT-v2
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---
# SmartBERT V3 CodeBERT
![SmartBERT](https://huggingface.co/web3se/SmartBERT-v2/resolve/main/framework.png)
## Overview
**SmartBERT V3** is a domain-adapted pre-trained programming language model for **smart contract code understanding**, built upon **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
The model is further trained on **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** with a substantially larger corpus of smart contracts, enabling improved robustness and richer semantic representations of **function-level smart contract code**.
SmartBERT V3 is particularly suitable for tasks such as:
- Smart contract intent detection
- Code representation learning
- Code similarity analysis
- Vulnerability detection
- Smart contract classification
Compared with **SmartBERT V2**, this version significantly expands the training corpus and improves the model’s ability to capture semantic patterns in smart contract functions.
---
## Training Data
SmartBERT V3 was trained on a total of **80,000 smart contracts**, including:
- **16,000 contracts** used in **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)**
- **64,000 additional smart contracts** collected from public blockchain repositories
All contracts are primarily written in **Solidity** and processed at the **function level** to better capture fine-grained semantic structures of smart contract code.
---
## Training Objective
The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as **CodeBERT**.
During training:
- A subset of tokens in the input code is randomly masked
- The model learns to predict these masked tokens from surrounding context
This process enables the model to learn deeper **syntactic and semantic representations** of smart contract programs.
---
## Training Setup
Training was conducted using the **HuggingFace Transformers** framework.
- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** Over **30 hours**
- **Training Dataset:** 80,000 smart contracts
- **Evaluation Dataset:** 1,500 smart contracts
Example training configuration:
```python
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
overwrite_output_dir=True,
num_train_epochs=20,
per_device_train_batch_size=64,
save_steps=10000,
save_total_limit=2,
evaluation_strategy="steps",
eval_steps=10000,
resume_from_checkpoint=checkpoint
)
````
---
## Preprocessing
During preprocessing, all newline (`\n`) and tab (`\t`) characters in the *function* code were replaced with a single space to ensure a consistent input format for tokenization.
---
## Base Model
SmartBERT V3 builds upon the following models:
* **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
* **Intermediate Model**: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)
---
## Usage
Example usage with HuggingFace Transformers:
```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
code_example = "function totalSupply() external view <mask> (uint256);"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(code_example)
print(outputs)
```
---
## How to Use
To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
---
## Contributor
* [Youwei Huang](https://www.devil.ren)
* [Sen Fang](https://github.com/TomasAndersonFang)
---
## Citation
```tex
@article{huang2025smart,
title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
journal={arXiv preprint arXiv:2508.20086},
year={2025}
}
```
---
## Acknowledgement
- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)