add first tokenizer
#1 by hac541309 - opened
Important tokenizer details:
ByteLevel() pre-tokenizer
BPE algorithm with a vocabulary size of 102400, including added tokens and special tokens
Training corpus: Korean, English, code
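For readers unfamiliar with the setup described above, here is a minimal pure-Python sketch of byte-level BPE training in which the vocabulary budget counts special tokens. It is an illustration only, not the training script used for this PR and not the Hugging Face tokenizers implementation. The function name `train_byte_level_bpe`, the `<|endoftext|>` special token, the whitespace splitting, and the tiny corpus and vocabulary size are all assumptions made for the example.

```python
from collections import Counter


def train_byte_level_bpe(texts, vocab_size, special_tokens=()):
    """Toy byte-level BPE trainer (hypothetical helper); returns (vocab, merges)."""
    # Special tokens count toward the vocabulary budget, so they get ids first.
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    # Byte-level base alphabet: all 256 byte values, so any UTF-8 text
    # (Korean, English, code) is representable without an unknown token.
    for b in range(256):
        vocab[bytes([b])] = len(vocab)

    # Simplification: split on whitespace, then view each word as single bytes.
    words = Counter()
    for text in texts:
        for word in text.split():
            words[tuple(bytes([b]) for b in word.encode("utf-8"))] += 1

    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        merges.append(best)
        if merged not in vocab:
            vocab[merged] = len(vocab)

        # Apply the chosen merge to every word.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges


# Tiny illustrative corpus and budget; the tokenizer in this PR uses 102400.
corpus = ["hello hello 안녕 안녕 def f(): pass"]
vocab, merges = train_byte_level_bpe(
    corpus, vocab_size=270, special_tokens=["<|endoftext|>"]
)
```

Starting from raw bytes is what lets a single vocabulary cover Korean, English, and code: frequent byte sequences are merged into longer tokens, while rare characters fall back to their individual bytes.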
hac541309 changed pull request status to merged