special tokens in prompt with ggml/examples/starcoder

by mljxy - opened Jun 12, 2023

Jun 12, 2023

Using the starcoder example in ggml, the special tokens in the prompt do not get tokenized correctly. For example:

main: token[0] =     46, <
main: token[1] =    110, |
main: token[2] =   2946, system
main: token[3] =  28318, |>

The correct tokenization should map <|system|> to the single token 49152 instead. The same incorrect tokenization happens with <|user|>, <|assistant|>, and <|end|>.
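
To illustrate the expected behavior, here is a minimal Python sketch of one common approach: split the prompt on the special-token strings first, map each of those directly to its ID, and run the ordinary tokenizer only on the text in between. This is not the ggml code. `toy_bpe` is a hypothetical stand-in for the real BPE tokenizer, and only the `<|system|>` ID (49152) comes from the log above; the other three IDs are assumed for illustration.

```python
import re

# <|system|> -> 49152 is taken from the post above.
# The remaining IDs are ASSUMED for illustration only.
SPECIAL_TOKENS = {
    "<|system|>": 49152,
    "<|user|>": 49153,
    "<|assistant|>": 49154,
    "<|end|>": 49155,
}


def toy_bpe(text):
    """Hypothetical stand-in for the ordinary tokenizer: one fake ID per character."""
    return [ord(ch) for ch in text]


def tokenize_with_specials(prompt, specials=SPECIAL_TOKENS, base_tokenizer=toy_bpe):
    """Tokenize `prompt`, keeping each special token as a single ID."""
    # Longest strings first, so a shorter special token can never split a longer one.
    ordered = sorted(specials, key=len, reverse=True)
    # The capturing group makes re.split keep the matched special tokens in the output.
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"

    ids = []
    for piece in re.split(pattern, prompt):
        if not piece:
            continue  # re.split yields empty strings next to adjacent matches
        if piece in specials:
            ids.append(specials[piece])  # special token: one ID, never split
        else:
            ids.extend(base_tokenizer(piece))  # ordinary text: normal tokenizer
    return ids


if __name__ == "__main__":
    print(tokenize_with_specials("<|system|>hi<|end|>"))
```

With this pre-split step, `<|system|>` comes out as one ID rather than the four pieces `<`, `|`, `system`, `|>` shown in the log.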

mike-ravkine

Jun 26, 2023

This was fixed last week: https://github.com/ggerganov/ggml/commit/e456108433017d5586b35fd36ce781b4c3aed631

But only kinda-sorta fixed, I think. There's still something up here: I can't get SantaCoder to spit out token 49152 (<|end|>), and the GGML inference diverges from what the HF model does.
