	---
	license: mit
	language:
	- en
	inference: true
	base_model:
	- microsoft/codebert-base-mlm
	pipeline_tag: feature-extraction
	tags:
	- smart-contract
	- web3
	- software-engineering
	- embedding
	- codebert
	- solidity
	- code-understanding
	library_name: transformers
	datasets:
	- web3se/smart-contract-intent-vul-dataset
	---

	# SmartBERT V2 CodeBERT

	![SmartBERT](./framework.png)

	## Overview

	SmartBERT V2 CodeBERT is a domain-adapted pre-trained model built on top of [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm).
	It is designed to learn high-quality semantic representations of smart contract code, particularly at the function level.

	The model is further pre-trained on a large corpus of smart contracts using the Masked Language Modeling (MLM) objective.
	This domain-adaptive pretraining enables the model to better capture semantic patterns, structure, and intent within smart contract functions compared to general-purpose code models.

	SmartBERT V2 can be used for tasks such as:

	- Smart contract intent detection
	- Code similarity analysis
	- Vulnerability analysis
	- Smart contract classification
	- Code embedding and retrieval

	SmartBERT V2 is a pre-trained model specifically developed for [SmartIntent V2](https://github.com/web3se-lab/web3-sekit). It was trained on 16,000 smart contracts, with no overlap with the SmartIntent V2 evaluation dataset to avoid data leakage.
	For production use or general smart contract representation tasks, we recommend [SmartBERT V3](https://huggingface.co/web3se/SmartBERT-v3).

	---

	## Training Data

	SmartBERT V2 was trained on a corpus of approximately 16,000 smart contracts, primarily written in Solidity and collected from public blockchain repositories.

	To better model smart contract behavior, contracts were processed at the function level, enabling the model to learn fine-grained semantic representations of smart contract functions.

	For benchmarking in SmartIntent V2, the pretraining corpus was intentionally limited to this 16,000-contract dataset.
	The evaluation dataset (4,000 smart contracts) was strictly held out and not included in the pretraining data, ensuring that downstream evaluations remain unbiased and free from data leakage.
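
One simple way to sanity-check such a held-out split is to fingerprint each contract and look for matches across the two sets. The sketch below is an illustration only, not the procedure used to build the SmartBERT V2 corpus: the function names `fingerprint` and `find_overlap` and the toy contract strings are hypothetical, and collapsing whitespace before hashing is an assumption about what should count as a duplicate.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Hash a contract after collapsing whitespace, so formatting-only
    differences do not hide an exact duplicate."""
    canonical = " ".join(source.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_overlap(pretrain_sources, eval_sources):
    """Return the evaluation contracts whose fingerprint also occurs
    in the pretraining corpus."""
    pretrain_hashes = {fingerprint(s) for s in pretrain_sources}
    return [s for s in eval_sources if fingerprint(s) in pretrain_hashes]


# Toy data (hypothetical): the second evaluation contract duplicates
# a pretraining contract up to whitespace.
pretrain = ["contract A { function f() public {} }", "contract B {}"]
evaluation = ["contract C {}", "contract  B   {}"]

leaked = find_overlap(pretrain, evaluation)
print(f"{len(leaked)} leaked contract(s)")
```

A check like this only catches exact duplicates after whitespace normalization. Near-duplicates, such as forked contracts with renamed identifiers, would need a fuzzier comparison.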

	---

	## Preprocessing

	During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a single space.
	This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
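
A minimal sketch of this normalization is shown below, so that inputs at inference time can be prepared the same way. The helper name `normalize_whitespace` is hypothetical, and the sketch assumes each newline or tab is replaced one-for-one (runs of whitespace are not collapsed), which the description above does not specify.

```python
def normalize_whitespace(function_code: str) -> str:
    """Replace every newline and tab character with a single space."""
    return function_code.replace("\n", " ").replace("\t", " ")


raw = "function totalSupply()\n\texternal view\n\treturns (uint256);"
flat = normalize_whitespace(raw)
print(flat)
```

Applying the same step before tokenization keeps inference inputs consistent with what the model saw during pretraining.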

	---

	## Base Model

	SmartBERT V2 is initialized from:

	- Base Model: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

	CodeBERT is a RoBERTa-style Transformer encoder pre-trained on paired natural-language and source-code data.
	SmartBERT V2 further adapts this model to the smart contract domain through continued pretraining.

	---

	## Training Objective

	The model is trained using the Masked Language Modeling (MLM) objective, following the same training paradigm as the original CodeBERT model.

	During training:

	- A subset of tokens is randomly masked.
	- The model learns to predict the masked tokens based on surrounding context.
	- This encourages the model to learn deeper structural and semantic representations of smart contract code.
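
The masking step above can be sketched in plain Python. This is a simplified illustration, not the training code: the 15% masking rate is an assumption (a common default, not stated in this card), the token ids and `MASK_ID` are hypothetical, and every selected token is replaced by the mask token, whereas BERT-style recipes also leave some selected tokens unchanged or swap in random tokens.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, seed=None):
    """Randomly mask tokens and build MLM labels.

    Returns (inputs, labels): masked positions hold `mask_token_id` in
    `inputs` and the original id in `labels`; all other positions keep
    their id in `inputs` and get IGNORE_INDEX in `labels`.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_token_id)
            labels.append(tok)  # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # position excluded from the loss
    return inputs, labels


token_ids = list(range(100, 120))  # hypothetical token ids
MASK_ID = 4  # hypothetical mask-token id
masked_inputs, labels = mask_tokens(token_ids, MASK_ID, seed=0)
print(masked_inputs)
print(labels)
```

In practice the Transformers library provides `DataCollatorForLanguageModeling`, which performs this masking on batches.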

	---

	## Training Setup

	Training was conducted using the HuggingFace Transformers framework with the following configuration:

	- Hardware: 2 × NVIDIA A100 (80 GB)
	- Training Duration: ~10 hours
	- Training Dataset: 16,000 smart contracts
	- Evaluation Dataset: 4,000 smart contracts

	Example training configuration:

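	```python
	from transformers import TrainingArguments

	# Placeholders (not part of the original configuration): set these for your environment.
	OUTPUT_DIR = "./output"
	checkpoint = None  # or the path of a saved checkpoint to resume from

	training_args = TrainingArguments(
	    output_dir=OUTPUT_DIR,
	    overwrite_output_dir=True,
	    num_train_epochs=20,
	    per_device_train_batch_size=64,
	    save_steps=10000,
	    save_total_limit=2,
	    evaluation_strategy="steps",  # named `eval_strategy` in newer Transformers releases
	    eval_steps=10000,
	    resume_from_checkpoint=checkpoint,
	)
	```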

	---

	## Evaluation

	The model was evaluated on a held-out dataset of approximately 4,000 smart contracts to monitor training stability and generalization during pretraining.

	SmartBERT V2 is primarily intended as a representation learning model, providing high-quality embeddings for downstream smart contract analysis tasks.

	---

	## How to Use

	You can load SmartBERT V2 using the HuggingFace Transformers library.

	```python
	import torch
	from transformers import RobertaTokenizer, RobertaModel

	tokenizer = RobertaTokenizer.from_pretrained("web3se/SmartBERT-v2")
	model = RobertaModel.from_pretrained("web3se/SmartBERT-v2")

	code = "function totalSupply() external view returns (uint256);"

	# Tokenize a single function, truncating to the 512-token input limit
	inputs = tokenizer(
	    code,
	    return_tensors="pt",
	    truncation=True,
	    max_length=512,
	)

	# Inference only, so gradients are not needed
	with torch.no_grad():
	    outputs = model(**inputs)

	# Option 1: CLS embedding (hidden state of the first token)
	cls_embedding = outputs.last_hidden_state[:, 0, :]

	# Option 2: Mean pooling (recommended for code representation)
	# Averages over all tokens, which is fine for a single unpadded input
	mean_embedding = outputs.last_hidden_state.mean(dim=1)
	```

	Mean pooling is often recommended when using the model for code representation or similarity tasks.
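
When several functions are encoded in one padded batch, a plain `.mean(dim=1)` also averages over the padding positions. The sketch below shows one common way to avoid that by weighting with the tokenizer's attention mask. The helper name `masked_mean_pool` is hypothetical and not part of the SmartBERT or Transformers API, and the tensors are toy values standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch


def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) tokens only.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


# Toy example: one sequence with two real tokens and one padding token.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With a real batch, the call would be `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` after tokenizing with `padding=True`.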

	---

	## GitHub Repository

	To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:

	[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

	---

	## Citation

	If you use SmartBERT in your research, please cite:

	```tex
	@article{huang2025smart,
	  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
	  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
	  journal={arXiv preprint arXiv:2508.20086},
	  year={2025}
	}
	```

	---

	## Acknowledgement

	- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
	- [Macau University of Science and Technology](http://www.must.edu.mo)
	- CAS Mino (中科劢诺)