Understanding vocab.txt

by krishnagarg09 - opened Aug 30, 2022

While looking at vocab.txt, I was left wondering why the numbers in the vocabulary are not consecutive.
For instance, see the sample below:

...
dice 63328
)@@ 63327
struggled 63326
wraps 63324
Investors 63312
#summer@@ 63305
...

As you can see, after 63305 comes 63312, followed by 63324. What about the numbers in between?

르@@ 3800
utory 3798
...

Any explanations will be really appreciated.

VinAI Research org Aug 31, 2022

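Each number is the frequency count of the corresponding word, i.e. how many times it appears in the pre-training corpus. The numbers are counts rather than indices, so gaps between them are expected.
Only the top 64k words are included in the vocab.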
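
To check this yourself, here is a minimal sketch that parses the sample lines quoted in the question and confirms that the numbers behave like frequency counts: they never increase from one line to the next, and the differences between them are arbitrary. It assumes each line has the form "token count" separated by a single space, as in the excerpt; the helper name `parse_vocab_lines` is made up for illustration and is not part of any library.

```python
def parse_vocab_lines(lines):
    """Parse 'token count' lines into (token, count) pairs.

    Assumes one entry per line with the count as the last
    space-separated field, as in the excerpt from vocab.txt.
    """
    entries = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the "..." placeholders from the excerpt.
        if not line or line == "...":
            continue
        token, count = line.rsplit(" ", 1)
        entries.append((token, int(count)))
    return entries


# Sample lines copied from the question above.
sample = [
    "dice 63328",
    ")@@ 63327",
    "struggled 63326",
    "wraps 63324",
    "Investors 63312",
    "#summer@@ 63305",
]

entries = parse_vocab_lines(sample)
counts = [count for _, count in entries]

# Frequency counts sorted from most to least frequent never increase.
is_sorted_desc = all(a >= b for a, b in zip(counts, counts[1:]))

# Differences between neighbouring counts; nothing forces these to be 1.
gaps = [a - b for a, b in zip(counts, counts[1:])]

print(is_sorted_desc)
print(gaps)
```

To run it on the real file, read its lines (for example with `open("vocab.txt", encoding="utf-8")`) and pass them to the same function.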

dqnguyen changed discussion status to closed Aug 31, 2022
