devilyouwei committed on
Commit 4a41f9f · verified · 1 Parent(s): 0960430

Update README.md

Files changed (1):
  1. README.md +106 -14

README.md CHANGED
@@ -5,14 +5,15 @@ language:
  inference: true
  base_model:
  - microsoft/codebert-base-mlm
- pipeline_tag: fill-mask
  tags:
- - fill-mask
  - smart-contract
  - web3
  - software-engineering
  - embedding
  - codebert
  library_name: transformers
  ---

@@ -22,23 +23,71 @@ library_name: transformers

  ## Overview

- SmartBERT V2 CodeBERT is a pre-trained model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**, designed to transfer **Smart Contract** function-level code into embeddings effectively.

- - **Training Data:** Trained on **16,000** smart contracts.
- - **Hardware:** Utilized 2 Nvidia A100 80G GPUs.
- - **Training Duration:** More than 10 hours.
- - **Evaluation Data:** Evaluated on **4,000** smart contracts.

  ## Preprocessing

- All newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure consistency in the input data format.

  ## Base Model

- - **Base Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)

  ## Training Setup

  ```python
  from transformers import TrainingArguments

@@ -53,13 +102,21 @@ training_args = TrainingArguments(
  eval_steps=10000,
  resume_from_checkpoint=checkpoint
  )
- ```

- ## How to Use

- To train and deploy the SmartBERT V2 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).

- Or use pipeline:

  ```python
  import torch
@@ -83,7 +140,42 @@ with torch.no_grad():
  # Option 1: CLS embedding
  cls_embedding = outputs.last_hidden_state[:, 0, :]

- # Option 2: Mean pooling (often better for code)
  mean_embedding = outputs.last_hidden_state.mean(dim=1)
  ```
 
  inference: true
  base_model:
  - microsoft/codebert-base-mlm
+ pipeline_tag: feature-extraction
  tags:
  - smart-contract
  - web3
  - software-engineering
  - embedding
  - codebert
+ - solidity
+ - code-understanding
  library_name: transformers
  ---

  ## Overview

+ SmartBERT V2 CodeBERT is a **domain-adapted pre-trained model** built on top of **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
+ It is designed to learn high-quality semantic representations of **smart contract code**, particularly at the **function level**.

+ The model is further pre-trained on a large corpus of smart contracts using the **Masked Language Modeling (MLM)** objective.
+ This domain-adaptive pretraining enables the model to better capture **semantic patterns, structure, and intent** within smart contract functions compared to general-purpose code models.
+
+ SmartBERT V2 can be used for tasks such as:
+
+ - Smart contract intent detection
+ - Code similarity analysis
+ - Vulnerability analysis
+ - Smart contract classification
+ - Code embedding and retrieval
+
+ ---
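The "code similarity analysis" use case above reduces to comparing function embeddings, most commonly with cosine similarity. A minimal, dependency-free sketch of that comparison (the toy 3-d vectors stand in for real SmartBERT embeddings, which are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A vector and a scaled copy point in the same direction -> similarity 1.0.
print(cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0]))  # 1.0
# Orthogonal vectors -> similarity 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```

Because cosine similarity ignores vector magnitude, it is a natural fit for comparing embeddings of functions of different lengths.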
+ ## Training Data
+
+ SmartBERT V2 was trained on a corpus of approximately **16,000 smart contracts**, primarily written in **Solidity** and collected from public blockchain repositories.
+
+ To better model smart contract behavior, contracts were processed at the **function level**, enabling the model to learn fine-grained semantic representations of smart contract functions.
+
+ ---
 
  ## Preprocessing

+ During preprocessing, all newline (`\n`) and tab (`\t`) characters in the function code were normalized by replacing them with a **single space**.
+ This ensures a consistent input format for the tokenizer and avoids unnecessary token fragmentation.
+
+ ---
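A minimal sketch of this normalization rule (the function name is ours, for illustration; the actual preprocessing code lives in the SmartBERT repository):

```python
def normalize_function_code(code: str) -> str:
    # Replace each newline and tab with a single space, per the
    # preprocessing rule described above.
    return code.replace("\n", " ").replace("\t", " ")

src = "function ping()\npublic {}"
print(normalize_function_code(src))  # function ping() public {}
```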
 
  ## Base Model

+ SmartBERT V2 is initialized from:
+
+ - **Base Model:** [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
+
+ CodeBERT is a transformer-based model trained on source code and natural language pairs.
+ SmartBERT V2 further adapts this model to the **smart contract domain** through continued pretraining.
+
+ ---
+
+ ## Training Objective
+
+ The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as the original CodeBERT model.
+
+ During training:
+
+ - A subset of tokens is randomly masked.
+ - The model learns to predict the masked tokens based on surrounding context.
+ - This encourages the model to learn deeper structural and semantic representations of smart contract code.
+
+ ---
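The masking steps above can be sketched in a few lines. This is an illustration only: with HuggingFace Transformers, masking is normally delegated to `DataCollatorForLanguageModeling`, which also applies the BERT-style 80/10/10 replacement scheme omitted here, and the 15% masking ratio is the conventional default rather than a figure stated in this card.

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, seed=1):
    """BERT-style MLM corruption: hide a random subset of tokens."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)  # the model must predict this position
            labels.append(tok)         # ground truth kept as the label
        else:
            masked.append(tok)
            labels.append(None)        # position is ignored in the MLM loss
    return masked, labels

tokens = "function transfer ( address to , uint256 amount )".split()
masked, labels = mask_tokens(tokens)
# With seed=1, "function" and ")" happen to be masked.
```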
 
  ## Training Setup

+ Training was conducted using the **HuggingFace Transformers** framework with the following configuration:
+
+ - **Hardware:** 2 × Nvidia A100 (80GB)
+ - **Training Duration:** ~10 hours
+ - **Training Dataset:** 16,000 smart contracts
+ - **Evaluation Dataset:** 4,000 smart contracts
+
+ Example training configuration:
+
  ```python
  from transformers import TrainingArguments

 
  # ...
  eval_steps=10000,
  resume_from_checkpoint=checkpoint
  )
+ ```

+ ---
+
+ ## Evaluation
+
+ The model was evaluated on a held-out dataset of approximately **4,000 smart contracts** to monitor training stability and generalization during pretraining.

+ SmartBERT V2 is primarily intended as a **representation learning model**, providing high-quality embeddings for downstream smart contract analysis tasks.

+ ---
+
+ ## How to Use
+
+ You can load SmartBERT V2 using the **HuggingFace Transformers** library.
 
  ```python
  import torch

  # ...

  # Option 1: CLS embedding
  cls_embedding = outputs.last_hidden_state[:, 0, :]

+ # Option 2: Mean pooling (recommended for code representation)
  mean_embedding = outputs.last_hidden_state.mean(dim=1)
  ```
 
+ Mean pooling is often recommended when using the model for **code representation or similarity tasks**.
+
+ ---
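One caveat: `outputs.last_hidden_state.mean(dim=1)` averages over every position, including padding when sequences are batched. A mask-aware mean counts only real tokens. Here is the arithmetic in plain Python (in practice you would compute the same thing with `attention_mask` and tensor operations):

```python
def masked_mean_pooling(hidden_states, attention_mask):
    """Average token vectors, counting only real (non-padding) tokens.

    hidden_states: [seq_len][dim] token embeddings for one sequence.
    attention_mask: [seq_len] with 1 for real tokens, 0 for padding.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                totals[i] += v
    return [t / count for t in totals]

# Two real tokens plus one padding vector that must not affect the mean.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pooling(hidden, mask))  # [2.0, 3.0]
```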
+ ## GitHub Repository
+
+ To train, fine-tune, or deploy SmartBERT for Web API services, please refer to our GitHub repository:
+
+ [https://github.com/web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT)
+
+ ---
+
+ ## Citation
+
+ If you use **SmartBERT** in your research, please cite:
+
+ ```tex
+ @article{huang2025smart,
+   title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
+   author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
+   journal={arXiv preprint arXiv:2508.20086},
+   year={2025}
+ }
+ ```
+
+ ---
+
+ ## Acknowledgement
+
+ This project was supported by:
+
+ * **Institute of Intelligent Computing Technology, Suzhou, CAS**
+   [http://iict.ac.cn/](http://iict.ac.cn/)
+
+ * **CAS Mino (中科劢诺)**