---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
datasets:
- web3se/smart-contract-intent-vul-dataset
---

# SmartBERT V2 CodeBERT

## Overview

SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective.
This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.

SmartBERT V2 can be used for tasks such as:

- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval

SmartBERT V2 is a pre-trained model specifically developed for **[SmartIntent V2](https://github.com/web3se-lab/web3-sekit)**. It was trained on **16,000 smart contracts**, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.

For production use or general smart contract representation tasks, we recommend **[SmartBERT V3](https://huggingface.co/web3se/SmartBERT-v3)**.

---

## Training Data

SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.

To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.

For benchmarking in **SmartIntent V2**, the pretraining corpus was intentionally limited to this **16,000-contract dataset**.
The **evaluation dataset (4,000 smart contracts)** was strictly held out and **not included in the pretraining data**, so that downstream evaluations remain unbiased and free from data leakage.

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**.
This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.

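
The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing script: the function name `normalize_function_code` is hypothetical, and it assumes each newline or tab is replaced one-for-one (the card does not say whether runs of whitespace are collapsed afterwards).

```python
def normalize_function_code(code: str) -> str:
    """Replace every newline and tab character with a single space.

    Assumption: one-for-one replacement; consecutive whitespace is not collapsed.
    """
    return code.replace("\n", " ").replace("\t", " ")


# A multi-line Solidity function signature becomes a single line.
raw = "function totalSupply()\n\texternal view\n\treturns (uint256);"
normalized = normalize_function_code(raw)
print(normalized)
```

Applying the same normalization at inference time keeps the input format consistent with what the model saw during pretraining.
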

---

## Base Model

SmartBERT V2 is initialized from:

- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

CodeBERT is a transformer-based model trained on source code and natural language pairs.
SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.

During training:

- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.

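
The masking step listed above can be illustrated with a small, library-free sketch. It is a simplification under stated assumptions, not the training code: the helper `mask_tokens` is hypothetical, tokens are split on whitespace instead of using the model's subword tokenizer, the 15% default rate and the `<mask>` token follow common BERT/RoBERTa conventions that this card does not confirm, and the usual "keep or randomly replace some selected tokens" refinement is omitted.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Randomly mask tokens for an MLM-style objective.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    where a token was hidden. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # hide the token from the model
            labels.append(tok)         # the model must predict this token
        else:
            masked.append(tok)         # token stays visible as context
            labels.append(None)        # no prediction target here
    return masked, labels


# Whitespace tokenization is used only to keep the example self-contained.
tokens = "function totalSupply ( ) external view returns ( uint256 ) ;".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```
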

---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework with the following configuration:

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts

Example training configuration:

```python
from transformers import TrainingArguments

# Placeholders, not part of the original configuration: set these for your own run.
OUTPUT_DIR = "./output"  # directory where checkpoints are written
checkpoint = None        # path of a checkpoint to resume from, or None

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10000,
    save_total_limit=2,
    evaluation_strategy="steps",  # newer Transformers releases name this eval_strategy
    eval_steps=10000,
    resume_from_checkpoint=checkpoint,
)
```
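
To relate `per_device_train_batch_size`, `save_steps`, and `eval_steps` to the hardware listed above, the following sketch works through the arithmetic. It assumes both GPUs were used for data-parallel training and that gradient accumulation was left at the library default of 1; neither is stated in this card. The example count passed to the hypothetical helper `steps_per_epoch` is purely illustrative, because the number of function-level training examples is not given.

```python
import math

NUM_GPUS = 2                 # from the hardware list
PER_DEVICE_BATCH_SIZE = 64   # per_device_train_batch_size in the configuration
GRAD_ACCUM_STEPS = 1         # assumption: library default, not stated in the card

# Number of examples consumed per optimizer step under these assumptions.
effective_batch_size = NUM_GPUS * PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS


def steps_per_epoch(num_examples: int) -> int:
    """Optimizer steps per epoch, keeping the last partial batch."""
    return math.ceil(num_examples / effective_batch_size)


print(effective_batch_size)

# Illustrative only: 100,000 is NOT the real number of training functions.
illustrative_examples = 100_000
print(steps_per_epoch(illustrative_examples))
```

With `save_steps=10000` and `eval_steps=10000`, a checkpoint and an evaluation pass happen once every 10,000 such optimizer steps.
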

---

## Evaluation

The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

---

## How to Use

You can load SmartBERT V2 using the **HuggingFace Transformers** library.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

# A single smart contract function, passed as one string
code = "function totalSupply() external view returns (uint256);"

inputs = tokenizer(
    code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding (hidden state of the first token)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is often recommended when using the model for **code representation or similarity tasks**.

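
The snippet above encodes a single, unpadded input, so a plain mean over the sequence dimension is sufficient. When several functions are encoded in one padded batch, padding positions would otherwise leak into the average. The sketch below shows one common way to handle this, followed by a cosine similarity between two pooled vectors. The helper name `masked_mean_pool` is hypothetical, and the tensors are small stand-ins for `outputs.last_hidden_state` and `inputs["attention_mask"]` so that the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions only.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1.0)         # avoid division by zero
    return summed / counts


# Stand-in tensors: batch of 2 sequences, 3 positions, hidden size 2.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)

# Cosine similarity between the two pooled embeddings.
similarity = F.cosine_similarity(pooled[0:1], pooled[1:2])
print(pooled)
print(similarity.item())
```

With real model outputs, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs["attention_mask"]` from a tokenizer call made with `padding=True`.
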

---

## GitHub Repository

To train, fine-tune, or deploy SmartBERT as a Web API service, please refer to our GitHub repository:

[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

---

## Citation

If you use **SmartBERT** in your research, please cite:

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- [Macau University of Science and Technology](http://www.must.edu.mo)
- CAS Mino (中科劢诺)