devilyouwei committed on
Commit
ea8cc34
·
verified ·
1 Parent(s): 6547e30

Update README.md

Files changed (1)
  1. README.md +88 -25
README.md CHANGED
@@ -15,6 +15,8 @@ tags:
15
  - embedding
16
  - codebert
17
  library_name: transformers
 
 
18
  ---
19
 
20
  # SmartBERT V3 CodeBERT
@@ -23,38 +25,57 @@ library_name: transformers
23
 
24
  ## Overview
25
 
26
- **SmartBERT V3** is a pre-trained programming language model, initialized with **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**. It has been further trained on [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2) with an additional **64,000** smart contracts, to enhance its robustness in representing smart contract code at the _function_ level.
27
 
28
- - **Training Data:** Trained on a total of **80,000** smart contracts, including **16,000** from **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** and **64,000** (starts from 30001) new contracts.
29
- - **Hardware:** Utilized 2 Nvidia A100 80G GPUs.
30
- - **Training Duration:** Over 30 hours.
31
- - **Evaluation Data:** Evaluated on **1,500** (starts from 96425) smart contracts.
32
 
33
- ## Usage
34
 
35
- ```python
36
- from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
 
 
 
37
 
38
- model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
39
- tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
40
 
41
- code_example = "function totalSupply() external view <mask> (uint256);"
42
- fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
43
 
44
- outputs = fill_mask(code_example)
45
- print(outputs)
46
- ```
47
 
48
- ## Preprocessing
49
 
50
- All newline (`\n`) and tab (`\t`) characters in the _function_ code were replaced with a single space to ensure consistency in the input data format.
 
51
 
52
- ## Base Model
 
 
53
 
54
- - **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
 
 
 
 
 
 
 
 
 
 
 
55
 
56
  ## Training Setup
57
 
 
 
 
 
 
 
 
 
 
58
  ```python
59
  training_args = TrainingArguments(
60
  output_dir=OUTPUT_DIR,
@@ -67,18 +88,58 @@ training_args = TrainingArguments(
67
  eval_steps=10000,
68
  resume_from_checkpoint=checkpoint
69
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
70
  ```
71
 
 
 
72
  ## How to Use
73
 
74
  To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
75
 
76
- ## Contributors
 
 
77
 
78
- - [Youwei Huang](https://www.devil.ren)
79
- - [Sen Fang](https://github.com/TomasAndersonFang)
80
 
81
- ## Citations
 
 
82
 
83
  ```tex
84
  @article{huang2025smart,
@@ -89,7 +150,9 @@ To train and deploy the SmartBERT V3 model for Web API services, please refer to
89
  }
90
  ```
91
 
92
- ## Sponsors
 
 
93
 
94
  - [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
95
- - CAS Mino (中科劢诺)
 
15
  - embedding
16
  - codebert
17
  library_name: transformers
18
+ datasets:
19
+ - web3se/smart-contract-intent-vul-dataset
20
  ---
21
 
22
  # SmartBERT V3 CodeBERT
 
25
 
26
  ## Overview
27
 
28
+ **SmartBERT V3** is a domain-adapted pre-trained programming language model for **smart contract code understanding**, built upon **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**.
29
 
30
+ Training continues from the **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)** checkpoint on a substantially larger corpus of smart contracts, which improves robustness and yields richer semantic representations of **function-level smart contract code**.
 
 
 
31
 
32
+ SmartBERT V3 is particularly suitable for tasks such as:
33
 
34
+ - Smart contract intent detection
35
+ - Code representation learning
36
+ - Code similarity analysis
37
+ - Vulnerability detection
38
+ - Smart contract classification
39
 
40
+ Compared with **SmartBERT V2**, this version expands the training corpus from 16,000 to 80,000 smart contracts to better capture semantic patterns in smart contract functions.
 
41
 
42
+ ---
 
43
 
44
+ ## Training Data
 
 
45
 
46
+ SmartBERT V3 was trained on a total of **80,000 smart contracts**, including:
47
 
48
+ - **16,000 contracts** used in **[SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)**
49
+ - **64,000 additional smart contracts** collected from public blockchain repositories
50
 
51
+ The contracts are written primarily in **Solidity** and are processed at the **function level** to capture the fine-grained semantic structure of smart contract code.
52
+
53
+ ---
54
 
55
+ ## Training Objective
56
+
57
+ The model is trained using the **Masked Language Modeling (MLM)** objective, following the same training paradigm as **CodeBERT**.
58
+
59
+ During training:
60
+
61
+ - A subset of tokens in the input code is randomly masked
62
+ - The model learns to predict these masked tokens from surrounding context
63
+
64
+ This process enables the model to learn deeper **syntactic and semantic representations** of smart contract programs.
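
The masking step described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' training code: the helper `mask_tokens`, the 15% default masking rate (the common BERT/RoBERTa convention), and the whitespace tokenization are all assumptions, and the usual 80/10/10 replacement scheme is left out for brevity.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as in the Usage example


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace a subset of tokens with MASK_TOKEN.

    Returns (masked_tokens, labels). labels holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the masked positions. mask_prob=0.15 is an assumed default.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Whitespace splitting stands in for the real subword tokenizer.
tokens = "function totalSupply() external view returns (uint256);".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

In an actual Hugging Face training run this step is normally delegated to a data collator rather than written by hand.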
65
+
66
+ ---
67
 
68
  ## Training Setup
69
 
70
+ Training was conducted using the **Hugging Face Transformers** framework.
71
+
72
+ - **Hardware:** 2 × NVIDIA A100 (80 GB)
73
+ - **Training Duration:** Over **30 hours**
74
+ - **Training Dataset:** 80,000 smart contracts
75
+ - **Evaluation Dataset:** 1,500 smart contracts
76
+
77
+ Example training configuration:
78
+
79
  ```python
80
  training_args = TrainingArguments(
81
  output_dir=OUTPUT_DIR,
 
88
  eval_steps=10000,
89
  resume_from_checkpoint=checkpoint
90
  )
91
+ ```
92
+
93
+ ---
94
+
95
+ ## Preprocessing
96
+
97
+ During preprocessing, all newline (`\n`) and tab (`\t`) characters in the *function* code were replaced with a single space to ensure a consistent input format for tokenization.
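
A minimal sketch of this normalization, assuming each newline and tab is replaced one-for-one; the helper name `flatten_function` is hypothetical, and whether the original pipeline also collapses runs of spaces is not stated here.

```python
def flatten_function(code: str) -> str:
    """Replace every newline and tab character with a single space."""
    return code.replace("\n", " ").replace("\t", " ")


solidity_function = "function totalSupply()\n\texternal view\n\treturns (uint256);"
print(flatten_function(solidity_function))
```

Because the replacement is one-for-one, a newline followed by a tab becomes two spaces. Inputs passed to the model at inference time should be flattened the same way so they match the training format.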
98
+
99
+ ---
100
+
101
+ ## Base Model
102
+
103
+ SmartBERT V3 builds upon the following models:
104
+
105
+ - **Original Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
106
+ - **Intermediate Model**: [SmartBERT V2](https://huggingface.co/web3se/SmartBERT-v2)
107
+
108
+ ---
109
+
110
+ ## Usage
111
+
112
+ Example usage with Hugging Face Transformers:
113
+
114
+ ```python
115
+ from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
116
+
117
+ model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
118
+ tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
119
+
120
+ code_example = "function totalSupply() external view <mask> (uint256);"
121
+ fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
122
+
123
+ outputs = fill_mask(code_example)
124
+ print(outputs)
125
  ```
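
The `fill-mask` pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys for a single-mask input. The sketch below ranks such a result; the helper `top_predictions` is hypothetical, and the sample values are hand-written for illustration rather than real model output.

```python
def top_predictions(outputs, k=3):
    """Return the k highest-scoring (token_str, score) pairs from a fill-mask result."""
    ranked = sorted(outputs, key=lambda o: o["score"], reverse=True)
    return [(o["token_str"].strip(), round(o["score"], 4)) for o in ranked[:k]]


# Hypothetical result in the pipeline's output format (scores and token ids are made up).
sample_outputs = [
    {"score": 0.05, "token": 101, "token_str": " virtual",
     "sequence": "function totalSupply() external view virtual (uint256);"},
    {"score": 0.90, "token": 102, "token_str": " returns",
     "sequence": "function totalSupply() external view returns (uint256);"},
    {"score": 0.03, "token": 103, "token_str": " override",
     "sequence": "function totalSupply() external view override (uint256);"},
]

print(top_predictions(sample_outputs, k=2))
```

With a real model, `outputs` from the example above can be passed to the helper directly.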
126
 
127
+ ---
128
+
129
  ## How to Use
130
 
131
  To train and deploy the SmartBERT V3 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
132
 
133
+ ---
134
+
135
+ ## Contributors
136
 
137
+ - [Youwei Huang](https://www.devil.ren)
138
+ - [Sen Fang](https://github.com/TomasAndersonFang)
139
 
140
+ ---
141
+
142
+ ## Citation
143
 
144
  ```tex
145
  @article{huang2025smart,
 
150
  }
151
  ```
152
 
153
+ ---
154
+
155
+ ## Acknowledgment
156
 
157
  - [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
158
+ - [Macau University of Science and Technology](http://www.must.edu.mo)