remove `<|endoftext|>`

#31
by Jaffe2718 - opened

My newly submitted code removes the hard-coded vocabulary mechanism so that custom vocabularies are supported. However, that code fails the Ruby test. After troubleshooting, I found that the cause is not the code itself but a special token, `<|endoftext|>`, in the vocabulary of the English ggml models, which causes the position of the special token's ID to be calculated incorrectly. Similar to https://github.com/ggml-org/whisper.cpp/pull/725, this pull request removes the special token from the English models. The models were converted with my modified script, and in my testing they are compatible with the current latest official whisper.cpp release, v1.8.2. Removing special tokens from the vocabulary is necessary for code correctness, so I hope your team will adopt this pull request.
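
To illustrate the kind of failure described above, here is a toy sketch. It is not whisper.cpp's actual loader: the function `eot_id_by_position`, the helper `strip_special`, and the three-word vocabulary are all hypothetical. It only assumes that a loader derives the special token's ID from the size of the base vocabulary, so a special token that is already stored in the base vocabulary shifts the result by one.

```python
# Toy model only: names and the positional rule are assumptions for
# illustration, not the real whisper.cpp implementation.
SPECIAL = "<|endoftext|>"


def eot_id_by_position(base_vocab):
    """Hypothetical rule: the special token's ID is the first ID after the base vocabulary."""
    return len(base_vocab)


def strip_special(base_vocab, special=SPECIAL):
    """Return a copy of the vocabulary without the special token."""
    return [tok for tok in base_vocab if tok != special]


clean = ["hello", "world", "!"]
polluted = clean + [SPECIAL]  # base vocabulary that already contains the special token

print(eot_id_by_position(clean))                    # 3
print(eot_id_by_position(polluted))                 # 4, off by one
print(eot_id_by_position(strip_special(polluted)))  # 3 again after removal
```

Under this assumption, stripping the special token from the stored vocabulary restores the expected position, which is the effect the re-converted models are meant to have.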

In addition, the English q5_1 and q8_0 quantized models also contain the special token `<|endoftext|>`, so they need to be re-quantized and the old model files overwritten.
