Update README.md
README.md
CHANGED
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: feature-extraction
tags:
- smart-contract
- web3
- software-engineering
- embedding
- codebert
- solidity
- code-understanding
library_name: transformers
---

## Overview

SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective. This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.

SmartBERT V2 can be used for tasks such as:

- Smart contract intent detection
- Code similarity analysis
- Vulnerability analysis
- Smart contract classification
- Code embedding and retrieval

---

## Training Data

SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.

To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of individual functions; a hypothetical splitter is sketched below.
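
The function-level processing is described only at this high level; as a rough illustration, a corpus builder might split Solidity sources with naive brace matching like the hypothetical helper below (real tooling would use a proper parser, and this sketch ignores braces inside strings and comments):

```python
import re

def extract_functions(solidity_source: str) -> list[str]:
    """Naive function-level splitter: find each `function` keyword and
    capture up to its matching closing brace. Illustrative only."""
    functions = []
    for match in re.finditer(r"\bfunction\b", solidity_source):
        start = match.start()
        depth = 0
        for i in range(start, len(solidity_source)):
            if solidity_source[i] == "{":
                depth += 1
            elif solidity_source[i] == "}":
                depth -= 1
                if depth == 0:
                    functions.append(solidity_source[start:i + 1])
                    break
    return functions
```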

---

## Preprocessing

During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**. This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
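
A minimal sketch of this normalization rule (the helper name is illustrative; runs of adjacent newlines/tabs are collapsed to one space, which is one natural reading of the rule):

```python
import re

def normalize_whitespace(code: str) -> str:
    # Replace newline/tab runs with a single space, per the preprocessing rule.
    return re.sub(r"[\n\t]+", " ", code).strip()

print(normalize_whitespace("function f() public {\n\treturn 1;\n}"))
# -> "function f() public { return 1; }"
```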

---

## Base Model

SmartBERT V2 is initialized from:

- **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

CodeBERT is a transformer-based model trained on source code and natural language pairs. SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.
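
In Transformers terms, continued pretraining starts by loading the base checkpoint together with its MLM head; a minimal sketch using standard `from_pretrained` calls (not the project's exact script):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Start from the CodeBERT MLM checkpoint, then continue pretraining
# on the smart-contract corpus described above.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
```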

---

## Training Objective

The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.

During training (see the sketch after this list):

- A subset of tokens is randomly masked.
- The model learns to predict the masked tokens based on surrounding context.
- This encourages the model to learn deeper structural and semantic representations of smart contract code.
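
The snippet below illustrates the masking scheme with the standard Transformers data collator; the 15% masking rate is the library default, not a documented SmartBERT hyperparameter:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 0.15 is the library default
)

code = "function balanceOf(address owner) public view returns (uint256)"
batch = collator([tokenizer(code)])

# `labels` is -100 everywhere except at masked positions, which hold
# the original token ids the model must recover.
print(batch["input_ids"][0])
print(batch["labels"][0])
```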

---

## Training Setup

Training was conducted using the **HuggingFace Transformers** framework with the following configuration:

- **Hardware:** 2 × Nvidia A100 (80GB)
- **Training Duration:** ~10 hours
- **Training Dataset:** 16,000 smart contracts
- **Evaluation Dataset:** 4,000 smart contracts

Example training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    # ... intermediate arguments elided in this diff (README lines 94-101) ...
    eval_steps=10000,
    resume_from_checkpoint=checkpoint  # `checkpoint` is defined earlier in the full script
)
```
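
These arguments would then be wired into a `Trainer` together with the model and the MLM data collator; a hedged outline in which `model`, `collator`, and the dataset variables are placeholders built as in the earlier sections:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # MLM model initialized from the base checkpoint
    args=training_args,
    data_collator=collator,       # masks tokens on the fly
    train_dataset=train_dataset,  # ~16k contracts, function-level samples
    eval_dataset=eval_dataset,    # ~4k held-out contracts
)
trainer.train()
```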

---

## Evaluation

The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

---

## How to Use

You can load SmartBERT V2 using the **HuggingFace Transformers** library.

```python
import torch

# ... model/tokenizer loading and tokenization elided in this diff
# (README lines 123-139); they produce `model` and `inputs` below.

with torch.no_grad():
    outputs = model(**inputs)

# Option 1: CLS embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: Mean pooling (recommended for code representation)
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```

Mean pooling is often recommended when using the model for **code representation or similarity tasks**; a mask-aware variant for batched inputs is sketched below.
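
Plain `.mean(dim=1)` also averages over padding tokens when inputs are batched. A common refinement (not prescribed by the original README) is to weight the mean by the attention mask; `outputs` and `inputs` are reused from the block above:

```python
import torch.nn.functional as F

def masked_mean(last_hidden_state, attention_mask):
    # Average only over real (non-padding) tokens.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

embeddings = masked_mean(outputs.last_hidden_state, inputs["attention_mask"])

# Example downstream use: cosine similarity between the first two
# functions in the batch (assumes a batch of at least two inputs).
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```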

---

## GitHub Repository

To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:

[https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)

---

## Citation

If you use **SmartBERT** in your research, please cite:

```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```

---

## Acknowledgement

This project was supported by:

* **Institute of Intelligent Computing Technology, Suzhou, CAS**
  [http://iict.ac.cn/](http://iict.ac.cn/)

* **CAS Mino (中科劢诺)**
|