---
license: cc-by-4.0
---

This model is a RoBERTa model trained on C/C++ source code from WolfSSL, together with examples of cybersecurity vulnerabilities related to input validation, mixed with Linux kernel code. The model is pre-trained to understand the concept of a singleton in the code.

The training corpus is C/C++, but inference can also be applied to code written in other languages.

Using the model to unmask tokens can be done in the following way:

```python
from transformers import pipeline

# create a fill-mask pipeline backed by the pre-trained CyLBERT model
unmasker = pipeline('fill-mask', model='mstaron/CyLBERT')
unmasker("Hello I'm a <mask> model.")
```
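
Since the model is trained on code, the same pipeline also works on masked source-code tokens. A minimal sketch; the C snippet below is an illustrative, made-up input, not taken from the training data:

```python
# unmask a token inside a C-style snippet; the pipeline returns a list of
# candidate completions, each a dict with 'token_str' and 'score'
predictions = unmasker("if (buffer == <mask>) { return -1; }")
for p in predictions:
    print(p["token_str"], p["score"])
```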

Obtaining the embeddings for a downstream task can be done in the following way:

```python
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModel, pipeline

# load the tokenizer for the pre-trained CyLBERT
tokenizer = AutoTokenizer.from_pretrained('mstaron/CyLBERT')

# load the base encoder (AutoModel rather than the masked-LM head),
# so that the pipeline returns hidden-state embeddings instead of
# vocabulary logits
model = AutoModel.from_pretrained("mstaron/CyLBERT")

# create the pipeline, which will extract the embedding vectors
# the models are already pre-trained, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')

# print the first token's embedding ([CLS]),
# which is often used as an approximation of the whole sentence embedding;
# an alternative is mean pooling: np.mean(lstFeatures[0], axis=0)
lstFeatures[0][0]
```
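
For a single fixed-size vector per code fragment, mean pooling over all token embeddings is a common alternative to the first-token embedding. A minimal sketch; the second snippet is a made-up example, and the vector size (768 for a RoBERTa-base configuration) is an assumption:

```python
import numpy as np

# mean-pool the token embeddings of each fragment into one vector
snippets = ['Class HTTP::X1', 'int validate(char *buf, size_t len)']
vectors = [np.mean(features(s)[0], axis=0) for s in snippets]

# each vector has the model's hidden size (assumed 768 for RoBERTa-base)
print(vectors[0].shape)
```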

To use the model for a specific downstream task, it needs to be fine-tuned on that task.
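
For example, fine-tuning for binary sequence classification (e.g., flagging potentially vulnerable lines) could look like the sketch below. This is a minimal illustration, not the authors' training setup: the two labelled examples are made up and the hyperparameters are placeholders.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('mstaron/CyLBERT')
# load the encoder with a freshly initialized two-label classification head
model = AutoModelForSequenceClassification.from_pretrained(
    'mstaron/CyLBERT', num_labels=2)

# toy dataset: (code line, label) pairs; the labels are hypothetical
texts = ["strcpy(dst, src);", "strncpy(dst, src, sizeof(dst) - 1);"]
labels = [1, 0]

class LineDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cylbert-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=LineDataset(texts, labels),
)
trainer.train()
```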