---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---
# SmartBERT V2 CodeBERT
![SmartBERT](./framework.png)
## Overview
SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.
The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective.
This domain-adaptive pretraining enables the model to capture **semantic patterns, structure, and intent** within smart contract functions more effectively than general-purpose code models.
SmartBERT V2 can be used for tasks such as:
- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval
SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.
For production use or general smart contract representation tasks, we recommend **SmartBERT V3**: https://huggingface.co/web3se/SmartBERT-v3
---
## Training Data
SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.
To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.
For benchmarking in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**.
The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, ensuring that downstream evaluations remain unbiased and free from data leakage.
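The function-level processing described above can be illustrated with a small sketch. The extraction tooling actually used for SmartBERT is not described here, so the helper below (`extract_functions`, a hypothetical name) is only an assumption: a naive brace-matching heuristic that ignores comments and string literals.

```python
import re


def extract_functions(source: str) -> list[str]:
    """Return each Solidity function (definition or bodiless declaration) as a string.

    Heuristic only: braces inside comments or string literals are not handled.
    """
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        depth = 0
        for i in range(start, len(source)):
            ch = source[i]
            if ch == ";" and depth == 0:
                # Declaration without a body, e.g. in an interface.
                functions.append(source[start:i + 1])
                break
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Closing brace of the function body.
                    functions.append(source[start:i + 1])
                    break
    return functions


contract = (
    "contract Token { uint256 supply; "
    "function mint(uint256 n) public { if (n > 0) { supply += n; } } "
    "function totalSupply() external view returns (uint256); }"
)
for fn in extract_functions(contract):
    print(fn)
```

A parser-based approach would be more robust on real contracts; the sketch only shows what a function-level unit looks like.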
---
## Preprocessing
During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**.
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
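A minimal sketch of this normalization, assuming each newline or tab is replaced one-for-one with a space (whether runs of whitespace are collapsed further is not stated). The function name `normalize_function_code` is hypothetical.

```python
def normalize_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space."""
    return code.replace("\n", " ").replace("\t", " ")


raw = "function totalSupply()\n\texternal view\n\treturns (uint256);"
print(normalize_function_code(raw))
```

Under this assumption, a newline followed by a tab becomes two consecutive spaces. Applying the same normalization at inference time keeps inputs consistent with what the model saw during pretraining.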
---
## Base Model
SmartBERT V2 is initialized from:
- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
CodeBERT is a transformer-based model pre-trained on paired natural language and source code.
SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.
---
## Training Objective
The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.
During training:
- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
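The masking step above can be sketched in plain Python. The 15% masking rate and the 80/10/10 split (mask token, random token, unchanged) are the common BERT-style defaults and are an assumption here; the exact settings used for SmartBERT V2 are not stated. `mask_tokens` is a hypothetical helper that works on token strings rather than token IDs.

```python
import random

MASK = "<mask>"  # RoBERTa-style mask token string
IGNORE = None    # label for positions that do not contribute to the loss


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) for one token sequence."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
vocab = ["uint256", "address", "public", "return"]
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```

In Hugging Face Transformers, this step is typically handled by `DataCollatorForLanguageModeling`, which operates on token IDs.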
---
## Training Setup
Training was conducted using the **HuggingFace Transformers** framework with the following configuration:
- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts
Example training configuration:
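```python
from transformers import TrainingArguments

# Placeholders: set these for your own environment.
OUTPUT_DIR = "./smartbert-v2"  # directory where checkpoints are written
checkpoint = None              # path of a checkpoint to resume from, or None

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",  # named `eval_strategy` in recent Transformers releases
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```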
---
## Evaluation
The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.
SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.
---
## How to Use
You can load SmartBERT V2 using the **HuggingFace Transformers** library.
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding (first token of each sequence)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
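When several functions are embedded in one padded batch, a plain `.mean(dim=1)` also averages the padding positions. The sketch below shows mask-aware mean pooling followed by a cosine-similarity comparison. It runs on toy tensors standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`, and `masked_mean_pool` is a hypothetical helper, not part of the SmartBERT repository.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The last position of the first sequence is padding.
hidden = torch.tensor([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
similarity = F.cosine_similarity(pooled[0:1], pooled[1:2])
print(pooled)
print(similarity)
```

With real model outputs, call `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` after tokenizing a batch with `padding=True`.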
---
## GitHub Repository
To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:
[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)
---
## Citation
If you use **SmartBERT** in your research, please cite:
```tex
@article{huang2025smart,
title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
journal={arXiv preprint arXiv:2508.20086},
year={2025}
}
```
---
## Acknowledgement
- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)