Fill-Mask
Transformers
PyTorch
English
roberta
smart-contract
web3
software-engineering
embedding
codebert
Update README.md
README.md
CHANGED
@@ -15,6 +15,8 @@ tags:
 - embedding
 - codebert
 library_name: transformers
 ---
 
 # SmartBERT V3 CodeBERT
@@ -23,38 +25,57 @@ library_name: transformers
 
 ## Overview
 
-**SmartBERT V3** is a pre-trained programming language model,
 
-
-- **Hardware:** Utilized 2 Nvidia A100 80G GPUs.
-- **Training Duration:** Over 30 hours.
-- **Evaluation Data:** Evaluated on **1,500** (starts from 96425) smart contracts.
 
-
 
-
-
 
-model
-tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
 
-
-fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
 
-
-print(outputs)
-```
 
-
 
-
 
-
 
-
 
 ## Training Setup
 
 ```python
 training_args = TrainingArguments(
     output_dir=OUTPUT_DIR,
@@ -67,18 +88,58 @@ training_args = TrainingArguments(
     eval_steps=10000,
     resume_from_checkpoint=checkpoint
 )
 ```
 
 ## How to Use
 
 To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
 
-
 
-
-
 
-
 
 ```tex
 @article{huang2025smart,
@@ -89,7 +150,9 @@ To train and deploy the SmartBERT V3 model for Web API services, please refer to
 }
 ```
 
-
 
 - [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
--

- embedding
- codebert
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V3 CodeBERT

## Overview

**SmartBERT V3** is a domain-adapted pre-trained programming language model for **smart contract code understanding**, built upon **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.

The model continues training from **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** on a substantially larger corpus of smart contracts, which improves robustness and yields richer semantic representations of **function-level smart contract code**.

SmartBERT V3 is particularly suitable for tasks such as:

- Smart contract intent detection
- Code representation learning
- Code similarity analysis
- Vulnerability detection
- Smart contract classification

Compared with **SmartBERT V2**, this version significantly expands the training corpus and improves the model’s ability to capture semantic patterns in smart contract functions.
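
For tasks such as code similarity analysis, one common approach (an assumption here, not something this model card prescribes) is to mean-pool the encoder's token vectors into a single function embedding and compare embeddings with cosine similarity. The sketch below shows only that pooling and comparison arithmetic on plain Python lists. The helper names `mean_pool` and `cosine_similarity` are hypothetical, and the toy vectors are made-up numbers; in practice the token vectors would come from the model's last hidden state.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have equal length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "token vectors" (made-up numbers, not model output).
# The third vector of the first function is padding and is masked out.
func_a = mean_pool([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0])
func_b = mean_pool([[0.0, 2.0], [0.0, 4.0]], [1, 1])
print(func_a, func_b, cosine_similarity(func_a, func_b))
```

Mean pooling is only one choice; using the first token's vector is another common option, and which works better for smart contract functions would need to be checked empirically.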

---

## Training Data

SmartBERT V3 was trained on a total of **80,000 smart contracts**, including:

- **16,000 contracts** used in **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)**
- **64,000 additional smart contracts** collected from public blockchain repositories

The contracts are written primarily in **Solidity** and are processed at the **function level** to better capture fine-grained semantic structures of smart contract code.
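
To illustrate what function-level processing could look like, here is a minimal sketch that splits Solidity source into individual `function` definitions by matching braces. It is not the authors' extraction pipeline, which this README does not describe. The name `extract_functions` and the sample contract are hypothetical, and the approach is deliberately naive: braces or the word `function` inside comments and string literals are not handled.

```python
import re
from typing import List

# Matches the keyword that starts a Solidity function definition.
FUNCTION_START = re.compile(r"\bfunction\b")


def extract_functions(source: str) -> List[str]:
    """Return each `function ...` definition found in `source` as its own string."""
    functions = []
    pos = 0
    while True:
        match = FUNCTION_START.search(source, pos)
        if match is None:
            break
        start = match.start()
        depth = 0
        end = None
        for i in range(start, len(source)):
            ch = source[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1  # closing brace of the function body
                    break
            elif ch == ";" and depth == 0:
                end = i + 1  # body-less declaration (interface or abstract)
                break
        if end is None:
            break  # unterminated definition: stop rather than guess
        functions.append(source[start:end])
        pos = end
    return functions


# Hypothetical sample input, written for this sketch only.
contract = """
contract Token {
    uint256 private _supply;
    function totalSupply() external view returns (uint256) { return _supply; }
    function mint(uint256 amount) external {
        if (amount > 0) { _supply += amount; }
    }
}
interface IToken { function balanceOf(address who) external view returns (uint256); }
"""
for fn in extract_functions(contract):
    print(fn)
```

A production pipeline would more likely rely on a real Solidity parser, which also copes with comments, string literals, constructors, and modifiers.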

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as **CodeBERT**.

During training:

- A subset of tokens in the input code is randomly masked
- The model learns to predict these masked tokens from the surrounding context

This process enables the model to learn deeper **syntactic and semantic representations** of smart contract programs.
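
The two masking steps listed above can be sketched in a few lines of plain Python. This is a simplified illustration, not the training code: the 15% masking rate is an assumption (a common default for BERT-style models; this README does not state the rate used), tokens are split on whitespace rather than by the model's tokenizer, and every selected token is replaced by `<mask>`, whereas standard MLM collators also leave some selected tokens unchanged or swap in random tokens. The name `mask_tokens` is hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token


def mask_tokens(tokens: List[str], mask_prob: float = 0.15,
                seed: Optional[int] = None) -> Tuple[List[str], List[Optional[str]]]:
    """Randomly replace tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the masked positions.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Whitespace tokenization is used here only to keep the sketch self-contained.
tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```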

---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework.

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** Over **30 hours**
- **Training Dataset:** 80,000 smart contracts
- **Evaluation Dataset:** 1,500 smart contracts

Example training configuration:

```python
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    # ... (arguments between these lines are not shown in this diff hunk)
    eval_steps=10000,
    resume_from_checkpoint=checkpoint
)
```

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the *function* code were replaced with a single space to ensure a consistent input format for tokenization.
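
A minimal sketch of this normalization step, assuming each newline or tab character is replaced individually by one space (so consecutive whitespace characters are not collapsed, and other characters such as `\r` are left untouched). The function name `normalize_function_code` is hypothetical. Applying the same step to inputs at inference time keeps them consistent with the format described above.

```python
def normalize_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space."""
    return code.replace("\n", " ").replace("\t", " ")


raw = "function totalSupply() external view\n\treturns (uint256);"
print(normalize_function_code(raw))
```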

---

## Base Model

SmartBERT V3 builds upon the following models:

* **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
* **Intermediate Model**: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)

---

## Usage

Example usage with HuggingFace Transformers:

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

# Load the pre-trained weights and the matching tokenizer from the Hub.
model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

# <mask> marks the token the model should predict.
code_example = "function totalSupply() external view <mask> (uint256);"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Each entry in outputs is one candidate token with its score.
outputs = fill_mask(code_example)
print(outputs)
```
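
For an input with a single `<mask>`, the `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce such a list to the best candidates. The helper `top_predictions` is hypothetical, and the sample entries are placeholders with made-up values so that the snippet runs without downloading the model; they are not real predictions from `web3se/SmartBERT-v3`.

```python
from typing import Dict, List, Tuple


def top_predictions(outputs: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs, best first."""
    ranked = sorted(outputs, key=lambda item: item["score"], reverse=True)
    # token_str may carry a leading space, so strip it for display.
    return [(item["token_str"].strip(), item["score"]) for item in ranked[:k]]


# Placeholder entries shaped like fill-mask output (made-up values).
example_outputs = [
    {"score": 0.10, "token": 2, "token_str": " public", "sequence": "..."},
    {"score": 0.80, "token": 1, "token_str": " returns", "sequence": "..."},
    {"score": 0.05, "token": 3, "token_str": " view", "sequence": "..."},
]
print(top_predictions(example_outputs, k=2))
```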

---

## How to Use

To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

---

## Contributors

* [Youwei Huang](https://www.devil.ren)
* [Sen Fang](https://github.com/TomasAndersonFang)

---

## Citation

```tex
@article{huang2025smart,
  % ... (fields between these lines are not shown in this diff hunk)
}
```

---

## Acknowledgment

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)