Update README.md

2f9f8f1 verified 3 months ago

4.5 kB

	---
	license: mit
	language:
	- en
	inference: true
	base_model:
	- microsoft/codebert-base-mlm
	- web3se/SmartBERT-v2
	pipeline_tag: fill-mask
	tags:
	- fill-mask
	- smart-contract
	- web3
	- software-engineering
	- embedding
	- codebert
	library_name: transformers
	datasets:
	- web3se/smart-contract-intent-vul-dataset
	---

	# SmartBERT V3 CodeBERT

	![SmartBERT](https://huggingface.co/web3se/SmartBERT-v2/resolve/main/framework.png)

	## Overview

	SmartBERT V3 is a domain-adapted pre-trained programming language model for smart contract code understanding, built upon [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm).

	The model is further trained on [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2) with a substantially larger corpus of smart contracts, enabling improved robustness and richer semantic representations of function-level smart contract code.

	SmartBERT V3 is particularly suitable for tasks such as:

	- Smart contract intent detection
	- Code representation learning
	- Code similarity analysis
	- Vulnerability detection
	- Smart contract classification

	Compared with SmartBERT V2, this version significantly expands the training corpus and improves the model’s ability to capture semantic patterns in smart contract functions.

	---

	## Training Data

	SmartBERT V3 was trained on a total of 80,000 smart contracts, including:

	- 16,000 contracts used in [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)
	- 64,000 additional smart contracts collected from public blockchain repositories

	All contracts are primarily written in Solidity and processed at the function level to better capture fine-grained semantic structures of smart contract code.

	---

	## Training Objective

	The model is trained using the Masked Language Modeling (MLM) objective, following the same training paradigm as CodeBERT.

	During training:

	- A subset of tokens in the input code is randomly masked
	- The model learns to predict these masked tokens from surrounding context

	This process enables the model to learn deeper syntactic and semantic representations of smart contract programs.

	---

	## Training Setup

	Training was conducted using the HuggingFace Transformers framework.

	- Hardware: 2 × Nvidia A100 (80GB)
	- Training Duration: Over 30 hours
	- Training Dataset: 80,000 smart contracts
	- Evaluation Dataset: 1,500 smart contracts

	Example training configuration:

	```python
	training_args = TrainingArguments(
	output_dir=OUTPUT_DIR,
	overwrite_output_dir=True,
	num_train_epochs=20,
	per_device_train_batch_size=64,
	save_steps=10000,
	save_total_limit=2,
	evaluation_strategy="steps",
	eval_steps=10000,
	resume_from_checkpoint=checkpoint
	)
	````

	---

	## Preprocessing

	During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure a consistent input format for tokenization.

	---

	## Base Model

	SmartBERT V3 builds upon the following models:

	* Original Model: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
	* Intermediate Model: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)

	---

	## Usage

	Example usage with HuggingFace Transformers:

	```python
	from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

	model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
	tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

	code_example = "function totalSupply() external view <mask> (uint256);"
	fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

	outputs = fill_mask(code_example)
	print(outputs)
	```

	---

	## How to Use

	To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

	---

	## Contributor

	* [Youwei Huang](https://www.devil.ren)
	* [Sen Fang](https://github.com/TomasAndersonFang)

	---

	## Citation

	```tex
	@article{huang2025smart,
	title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
	author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
	journal={arXiv preprint arXiv:2508.20086},
	year={2025}
	}
	```

	---

	## Acknowledgement

	- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
	- [Macau University of Science and Technology](http://www.must.edu.mo)
	- CAS Mino (中科劢诺)