NEW: LLama.cpp: Using `ngram-mod` to Get 2x Speed Boost on Long-Chats/Agent!
#20
by PussyHut - opened
https://github.com/ggml-org/llama.cpp/pull/19164
Basically, all you need to do is add
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
and you now get at least a 2x speed boost in agent, coding, and long-chat use!
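For anyone unsure where those flags go, here is a rough sketch of a full command line. It assumes you launch through `llama-server` with a build recent enough to include the PR linked above. The model path is a hypothetical placeholder, and the `MODEL`, `SPEC_FLAGS`, and `CMD` variable names are only for illustration. The four speculative-decoding flags are copied unchanged from this post.

```shell
#!/bin/sh
# Hypothetical model path; point this at your own GGUF file.
MODEL="${MODEL:-./model.gguf}"

# Speculative-decoding flags exactly as given in the post.
SPEC_FLAGS="--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64"

# Assemble the command and print it so it can be checked before running.
CMD="llama-server -m $MODEL $SPEC_FLAGS"
echo "$CMD"

# To actually start the server (assumes llama-server is on PATH), uncomment:
# exec $CMD
```

The script only prints the command by default, so you can paste the output into your existing launch script alongside whatever other options you already use.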
Works with all MoE models!