This is all well and good for the naive setting of a character-level language model. But in practice, in state-of-the-art language models, people use considerably more complicated schemes for constructing these token vocabularies. In particular, these schemes operate not at the character level, but at the level of character chunks. These chunk vocabularies are constructed using algorithms such as the Byte Pair Encoding (BPE) algorithm, which we are going to cover in detail below.
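As a preview, the core of one BPE training step is to count the most frequent adjacent pair of token ids and merge every occurrence of that pair into a single new id. A minimal sketch in Python (illustrative only; the function names here are assumptions, not the implementation covered below):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# One merge step on a toy byte sequence:
ids = list(b"aaabdaaabac")
pair = most_frequent_pair(ids)   # (97, 97), i.e. the byte pair for "aa"
ids = merge(ids, pair, 256)      # replace each "aa" with a new token id 256
```

Repeating this step grows the vocabulary one merged chunk at a time, which is how the character-chunk vocabularies mentioned above are built up.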