mstaron/SingBERTa

Fill-Mask

Transformers

PyTorch

roberta

mstaron committed on Feb 7, 2023

Commit da26daa

1 Parent(s): 34cd09a

First version of the SingBeRTa model for singleton analysis.

Files changed (1)

README.md +44 -1

@@ -4,4 +4,47 @@ license: cc-by-4.0
 This model is a RoBERTa model trained on program code: WolfSSL plus examples of Singletons mixed with Linux Kernel code. The model is pre-trained to understand the concept of a singleton in code.
-The programming language is C/C++, but inference can also be applied to other languages.
+The programming language is C/C++, but inference can also be applied to other languages.
+The model can be used to fill in masked tokens as follows:
+```python
+from transformers import pipeline
+unmasker = pipeline('fill-mask', model='mstaron/SingBERTa')
+unmasker("Hello I'm a <mask> model.")
+```
+Embeddings for a downstream task can be obtained as follows:
+```python
+# import the tokenizer and model classes from the Hugging Face transformers library
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+# load the tokenizer for the pretrained SingBERTa
+tokenizer = AutoTokenizer.from_pretrained('mstaron/SingBERTa')
+# load the model
+model = AutoModelForMaskedLM.from_pretrained("mstaron/SingBERTa")
+# import the feature extraction pipeline
+from transformers import pipeline
+# create the pipeline, which will extract the embedding vectors
+# the model is already pre-trained, so we do not need to train anything here
+# (the Hub id is used here; a local directory holding the saved model also works)
+features = pipeline(
+    "feature-extraction",
+    model="mstaron/SingBERTa",
+    tokenizer="mstaron/SingBERTa",
+    return_tensors=False
+)
+# extract the features == embeddings
+lstFeatures = features('Class SingletonX1')
+# take the embedding of the first token, <s> (RoBERTa's counterpart of [CLS]),
+# which is often used as an approximation of the whole-sentence embedding;
+# mean pooling, np.mean(lstFeatures[0], axis=0), is an alternative
+lstFeatures[0][0]
+```
+To use the model for a downstream task, it first needs to be fine-tuned on that task.
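
The README snippet reduces the feature-extraction output to a single vector by taking the first token's embedding, and mentions averaging over tokens as well. These are two different pooling strategies that generally give different vectors. The sketch below shows both side by side. It is not part of the commit: `sentence_embedding` and `toy_output` are hypothetical names, and the code assumes the pipeline output for one input string is a nested list shaped `[1][num_tokens][hidden_size]`. A small hand-written list stands in for the real output so the example runs without downloading the model.

```python
from statistics import fmean


def sentence_embedding(pipeline_output, strategy="first"):
    """Reduce feature-extraction output for one input to a single vector.

    pipeline_output is assumed to be a nested list shaped
    [1][num_tokens][hidden_size] (hypothetical helper, not from the README).
    """
    token_vectors = pipeline_output[0]
    if strategy == "first":
        # embedding of the first token (<s> for RoBERTa-style tokenizers)
        return list(token_vectors[0])
    if strategy == "mean":
        # average each hidden dimension over all tokens
        return [fmean(column) for column in zip(*token_vectors)]
    raise ValueError(f"unknown strategy: {strategy!r}")


# toy stand-in for features('Class SingletonX1'): 3 tokens, hidden size 2
toy_output = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

first_vec = sentence_embedding(toy_output, "first")
mean_vec = sentence_embedding(toy_output, "mean")
print(first_vec)  # [1.0, 2.0]
print(mean_vec)   # [3.0, 4.0]
```

With the real pipeline, `lstFeatures` would be passed in place of `toy_output`. Which strategy works better for a given downstream task is an empirical question.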