---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---
# SmartBERT V2 CodeBERT

## Overview
SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.
The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective.
This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.
SmartBERT V2 can be used for tasks such as:
- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval
SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.
For production use or general smart contract representation tasks, we recommend **SmartBERT V3**: https://huggingface.co/web3se/SmartBERT-v3
---
## Training Data
SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.
To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.
For benchmarking in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**.
The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, ensuring that downstream evaluations remain unbiased and free from data leakage.
---
## Preprocessing
During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**.
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
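The normalization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' actual preprocessing script: the function name `normalize_function_code` is hypothetical, and the sketch assumes each newline or tab is replaced one-for-one, without collapsing runs of whitespace.

```python
import re


def normalize_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space."""
    # Assumption: one-for-one replacement; consecutive whitespace is not collapsed.
    return re.sub(r"[\n\t]", " ", code)


raw = "function totalSupply()\n\texternal view returns (uint256);"
flat = normalize_function_code(raw)
print(flat)
```

Applying the same normalization at inference time keeps inputs consistent with what the model saw during pretraining.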
---
## Base Model
SmartBERT V2 is initialized from:
- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
CodeBERT is a transformer-based model trained on source code and natural language pairs.
SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.
---
## Training Objective
The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.
During training:
- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
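The masking step can be illustrated with a simplified, standard-library sketch. Several things here are assumptions rather than details from this model card: the 15% masking rate (the value commonly used for BERT-style models), whitespace tokenization in place of the real subword tokenizer, and the helper name `mask_tokens`. Production MLM pipelines also sometimes substitute a random token or leave the selected token unchanged, which this sketch omits.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,     # assumed rate; not stated in this model card
    mask_token: str = "<mask>",  # RoBERTa-style mask token
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the loss is only computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Whitespace tokenization stands in for the real subword tokenizer.
tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

The model receives `masked` as input and is trained to recover the non-`None` entries of `labels` from the surrounding context.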
---
## Training Setup
Training was conducted using the **HuggingFace Transformers** framework with the following configuration:
- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts
Example training configuration:
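```python
from transformers import TrainingArguments

# Placeholders: set these to your own output directory and checkpoint path.
OUTPUT_DIR = "./smartbert-v2"
checkpoint = None  # or the path of a saved checkpoint to resume from

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",  # renamed to `eval_strategy` in newer Transformers releases
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```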
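As a rough check on the configuration above, the effective batch size per optimizer step follows from the per-device batch size and the number of GPUs. The small calculation below assumes both A100s were used for data-parallel training and that gradient accumulation was left at the Transformers default of 1; neither detail is stated explicitly in this model card.

```python
num_gpus = 2                      # from the hardware list above
per_device_train_batch_size = 64  # from the example configuration
gradient_accumulation_steps = 1   # assumed: the Transformers default

effective_batch_size = (
    num_gpus * per_device_train_batch_size * gradient_accumulation_steps
)
print(effective_batch_size)  # 128
```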
---
## Evaluation
The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.
SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.
---
## How to Use
You can load SmartBERT V2 using the **HuggingFace Transformers** library.
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512
)

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```
Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
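The plain `.mean(dim=1)` in the snippet above averages over every position, which is fine for a single unpadded input. When several functions are batched with padding, the padded positions would otherwise dilute the average. The sketch below shows one common way to handle this by weighting with the attention mask, followed by a cosine-similarity comparison. It uses small toy tensors in place of real model outputs so it runs without downloading the model, and the helper name `masked_mean_pool` is hypothetical.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(
    last_hidden_state: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """Average token vectors over real tokens only, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # (batch, 1)
    return summed / counts


# Toy stand-ins for model outputs: batch of 2, 3 tokens each, hidden size 2.
hidden = torch.tensor(
    [
        [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
        [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
    ]
)
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
similarity = F.cosine_similarity(pooled[0:1], pooled[1:2])
print(pooled)
print(similarity)
```

With real inputs, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from a tokenizer call that uses `padding=True`.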
---
## GitHub Repository
To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:
[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)
---
## Citation
If you use **SmartBERT** in your research, please cite:
```tex
@article{huang2025smart,
title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
journal={arXiv preprint arXiv:2508.20086},
year={2025}
}
```
---
## Acknowledgement
- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)