---
license: mit
language:
- en
metrics:
- accuracy
- perplexity
- f1
- precision
- recall
tags:
- code
---
# Model Card for VerilogProtoModel

VerilogProtoModel is a next-token prediction model for Verilog, designed to serve as a foundation for future Verilog code copilots. It aims to improve coding efficiency and accuracy for hardware description languages.

## Model Details

### Model Description

VerilogProtoModel predicts the next token in Verilog code, aiming to enhance coding efficiency and accuracy. The model was fine-tuned on a large dataset of Verilog code, with significant preprocessing to clean and anonymize the data. It achieved 52% accuracy in predicting the correct next token out of approximately 40,000 possibilities, showing its potential to improve the coding process for hardware description languages.

- **Developed by:** Von Davis
- **Model type:** GPT-2 (causal language model)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** GPT-2

### Model Sources

- **Hugging Face repository:** Von-R/VerilogProtoToken
- **GitHub repository:** https://github.com/Von-R/VerilogProtoToken

## Uses

### Direct Use

The model can be used directly for next-token prediction in Verilog code, assisting developers in writing more efficient and accurate code.

### Downstream Use

The model can be fine-tuned for specific Verilog coding standards or integrated into a larger code-completion system.

### Out-of-Scope Use

The model is not intended for non-Verilog programming languages or general text prediction. It should not be used to generate Verilog code for safety-critical systems without thorough validation.

## Bias, Risks, and Limitations

The model's predictions are based on its training data and may not generalize well to all possible Verilog coding scenarios. The reduced vocabulary size might limit its ability to predict less common tokens accurately.

### Recommendations

Users should validate the model's predictions in the context of their specific applications and be aware of its limitations. Continuous monitoring and fine-tuning may be required to maintain performance.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Von-R/VerilogProtoToken")
model = AutoModelForCausalLM.from_pretrained("Von-R/VerilogProtoToken")

# Tokenize a Verilog snippet and generate a continuation
inputs = tokenizer("input Verilog code here", return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

The model was trained on a dataset of Verilog code extracted from GitHub. The data was cleaned, anonymized, and preprocessed to ensure high quality.

https://huggingface.co/datasets/Von-R/verilog_preprocessed_anonymized/tree/main/data?show_file_info=data%2Ftrain-00000-of-00001.parquet

#### Preprocessing

Data extraction involved removing non-synthesizable code, comments, and duplicates. Identifiers were anonymized to reduce vocabulary size and improve model efficiency.
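The identifier-anonymization step can be illustrated with a minimal sketch. The actual preprocessing pipeline is not reproduced here; the abridged keyword list and the `v0`, `v1`, … renaming scheme below are assumptions for illustration only.

```python
import re

# Verilog keywords that must not be renamed (abridged, illustrative list)
KEYWORDS = {
    "module", "endmodule", "input", "output", "wire", "reg",
    "assign", "always", "begin", "end", "if", "else", "posedge", "negedge",
}

def anonymize_identifiers(code: str) -> str:
    """Rename user-defined identifiers to v0, v1, ... in order of first use."""
    mapping = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, code)

src = "module counter(input clk, output reg [3:0] count);"
print(anonymize_identifiers(src))  # → module v0(input v1, output reg [3:0] v2);
```

Renaming identifiers this way collapses arbitrary user-chosen names into a small shared set, which is one way the vocabulary size could have been reduced.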

#### Training Hyperparameters

- **Training regime:** fp32
- **Learning rate:** 5e-5
- **Batch size:** 16
- **Epochs:** 1
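These hyperparameters map directly onto a Hugging Face `TrainingArguments` configuration; a sketch is below. The `output_dir` value and any arguments not listed above are assumptions, not the actual training setup.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="verilog-proto-model",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=1,
    fp16=False,  # fp32 training regime
)
```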

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The testing data was held out from the same Verilog dataset and consists of code not seen during training.

https://huggingface.co/datasets/Von-R/verilog_preprocessed_anonymized/tree/main/data?show_file_info=data%2Ftest-00000-of-00001.parquet

#### Factors

Evaluation focused on predicting the correct next token in various Verilog coding scenarios.

#### Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Next-token prediction loss | 0.8176 | Average cross-entropy loss per predicted token |
| Perplexity | 2.2650 | How well the model predicts the sample |
| Accuracy | 0.5219 | Fraction of correct next-token predictions |
| Precision | 0.0233 | Accuracy of positive predictions |
| Recall | 0.0239 | Ability to identify all relevant instances |
| F1 score | 0.0235 | Balance of precision and recall |
| Top-5 accuracy | 0.5611 | How often the correct token is within the top 5 predictions |
| Entropy | 0.8340 | Uncertainty in the predictions |
| Prediction confidence | 0.8294 | Model confidence in its predictions |
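As a quick consistency check, the reported perplexity agrees with exponentiating the reported next-token loss, as expected when perplexity is computed as exp of the mean cross-entropy:

```python
import math

# Perplexity is exp(mean cross-entropy loss); the reported numbers agree.
next_token_loss = 0.8175709573030472
perplexity = math.exp(next_token_loss)
print(round(perplexity, 4))  # ≈ 2.265
```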

### Results

The model achieved 52% top-1 accuracy in predicting the next token out of approximately 40,000 possibilities.

#### Summary

The model demonstrates significant potential to improve Verilog coding efficiency and accuracy.

## Technical Specifications

### Model Architecture and Objective

The model is based on the GPT-2 architecture and fine-tuned for next-token prediction in Verilog code.

### Compute Infrastructure

Training and evaluation were performed on high-performance GPUs to handle the computational demands of fine-tuning a large language model.

## Citation

**BibTeX:**

```bibtex
@article{Davis2024VerilogProtoModel,
  title={VerilogProtoModel: A Predictive Model for Verilog Next-Token Prediction},
  author={Von Davis},
  journal={GitHub Repository},
  year={2024}
}
```

**APA:**

Davis, V. (2024). VerilogProtoModel: A Predictive Model for Verilog Next-Token Prediction. GitHub Repository.

## Model Card Authors

Von Davis

## Model Card Contact

- Von.Roth.1991@gmail.com
- https://github.com/Von-R
- https://www.linkedin.com/in/daelonvondavis/