Understanding vocab.txt
#1
by krishnagarg09 - opened
While looking at vocab.txt, I was left wondering why the numbers in the vocabulary are not continuous.
For instance, see the sample below:
...
dice 63328
)@@ 63327
struggled 63326
wraps 63324
Investors 63312
#summer@@ 63305
...
As you can see, after 63305, we have 63312, followed by 63324... what about the numbers in between?
Also, it seems a bit strange that the vocabulary starts at around 3800:
르@@ 3800
utory 3798
...
Any explanation would be really appreciated.
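- Each number is the frequency count of the corresponding word, i.e. how many times it appears in the pre-training corpus. It is not an index, so gaps between consecutive numbers are expected.
- Only the top 64K words are included in the vocab, which is why the lowest counts are around 3800 rather than 1.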
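To make the two points above concrete, here is a minimal Python sketch. It assumes each line of vocab.txt has the form `<token> <count>`, as in the sample quoted in the question. The helper `parse_vocab_lines` and the `toy_counts` data are hypothetical, for illustration only; this is not the code used to build the actual vocabulary.

```python
from collections import Counter

# Lines copied from the sample in the question: "<token> <count>" per line.
SAMPLE = """\
dice 63328
)@@ 63327
struggled 63326
wraps 63324
Investors 63312
#summer@@ 63305
"""


def parse_vocab_lines(text):
    """Hypothetical helper: split each line into (token, frequency count)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last space so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        entries.append((token, int(count)))
    return entries


entries = parse_vocab_lines(SAMPLE)
counts = [count for _, count in entries]

# Counts are frequencies, so they are sorted but need not be consecutive.
is_sorted_by_frequency = all(a >= b for a, b in zip(counts, counts[1:]))
gaps = [a - b for a, b in zip(counts, counts[1:])]

# Toy illustration (made-up data) of a top-k cutoff: keeping only the k most
# frequent words means the smallest count kept is whatever the k-th word has.
toy_counts = Counter({"a": 10, "b": 7, "c": 7, "d": 3, "e": 1})
top3 = toy_counts.most_common(3)
cutoff = top3[-1][1]

print(is_sorted_by_frequency, gaps, cutoff)
```

In the same way, a cutoff at the top 64K words of a large corpus leaves a smallest count well above 1, which matches the value of about 3800 at the end of the file.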
dqnguyen changed discussion status to closed