# StackOBERTflow-comments-small

StackOBERTflow is a RoBERTa model trained on StackOverflow comments.
A byte-level BPE tokenizer with dropout was used (via the `tokenizers` package).

The model is *small*: it has only 6 layers, and the maximum sequence length was restricted to 256 tokens.
The model was trained for 6 epochs on several GB of comments from the StackOverflow corpus.
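
To make "BPE with dropout" concrete, here is a minimal pure-Python sketch of the idea: during segmentation, each candidate merge is skipped with some probability, so the same word can be split in different ways across training passes. This is a toy illustration, not the `tokenizers` implementation. The merge table `TOY_MERGES`, the function name `bpe_encode`, and the dropout value 0.5 are all assumptions made for this example; this card does not state the dropout rate or vocabulary used for the model.

```python
import random

# Hypothetical ranked merge table (earlier entries have higher priority).
TOY_MERGES = [("q", "u"), ("e", "r"), ("qu", "er"), ("quer", "y"), ("j", "query")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with a ranked merge list.

    Each candidate merge is skipped with probability `dropout`.
    dropout=0.0 gives deterministic BPE; dropout=1.0 leaves single characters.
    """
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


print(bpe_encode("jquery", TOY_MERGES))               # ['jquery']
print(bpe_encode("jquery", TOY_MERGES, dropout=1.0))  # ['j', 'q', 'u', 'e', 'r', 'y']
print(bpe_encode("jquery", TOY_MERGES, dropout=0.5))  # varies from run to run
```

Whatever the dropout value, joining the returned pieces always reproduces the input word; only the segmentation changes.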

## Quick start: masked language modeling prediction

```python
from transformers import pipeline
from pprint import pprint

COMMENT = "You really should not do it this way, I would use <mask> instead."

fill_mask = pipeline(
    "fill-mask",
    model="./StackOBERTflow-comments-small-v1",
    tokenizer="./StackOBERTflow-comments-small-v1"
)

pprint(fill_mask(COMMENT))
# [{'score': 0.019997311756014824,
#   'sequence': '<s> You really should not do it this way, I would use jQuery instead.</s>',
#   'token': 1738},
#  {'score': 0.01693696901202202,
#   'sequence': '<s> You really should not do it this way, I would use arrays instead.</s>',
#   'token': 2844},
#  {'score': 0.013411642983555794,
#   'sequence': '<s> You really should not do it this way, I would use CSS instead.</s>',
#   'token': 2254},
#  {'score': 0.013224546797573566,
#   'sequence': '<s> You really should not do it this way, I would use it instead.</s>',
#   'token': 300},
#  {'score': 0.011984303593635559,
#   'sequence': '<s> You really should not do it this way, I would use classes instead.</s>',
#   'token': 1779}]
```
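
The sample output wraps each sequence in `<s>` and `</s>` markers. If you only want the completed sentences, a small post-processing step can strip them. The sketch below works on two entries copied from the output above, so it runs without loading the model. The helper names `clean_sequence` and `summarize` are hypothetical, and the exact output fields may differ between `transformers` versions.

```python
# Two entries copied from the sample output above.
results = [
    {"score": 0.01693696901202202,
     "sequence": "<s> You really should not do it this way, I would use arrays instead.</s>",
     "token": 2844},
    {"score": 0.019997311756014824,
     "sequence": "<s> You really should not do it this way, I would use jQuery instead.</s>",
     "token": 1738},
]


def clean_sequence(sequence, bos="<s>", eos="</s>"):
    """Remove a leading `bos` and trailing `eos` marker, if present."""
    text = sequence.strip()
    if text.startswith(bos):
        text = text[len(bos):]
    if text.endswith(eos):
        text = text[:-len(eos)]
    return text.strip()


def summarize(predictions):
    """Return (score, cleaned sentence) pairs, highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["score"], clean_sequence(p["sequence"])) for p in ranked]


for score, sentence in summarize(results):
    print(f"{score:.4f}  {sentence}")
```

With a live pipeline, `summarize(fill_mask(COMMENT))` should work the same way, assuming the pipeline returns a list of dicts with `score` and `sequence` keys as in the sample output.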