A few observations - Memory estimation & thinking is getting stuck in loops
#1
by spanspek
I'm running the Q4_K_M in LM Studio 0.3.39 on Linux Mint, with llama.cpp v1.104.1 (based on llama.cpp release b7779, commit 6df686b).
- The memory estimation is not correct. It suggests that the full 202,752-token context requires 22.3 GB, but the model then fails to load into 24 GB of VRAM.
- To add to this, 33,000 tokens of context load into 23 GB of total VRAM (model weights + KV cache), with no quantization applied to the K and V caches.
- Using the settings suggested by LM Studio (temp = 0.2, etc.), the model thinks well up to a point but then gets stuck in a loop. This has happened three times in a row on the same prompt.
I don't mind the first point: this model does seem to have larger-than-normal attention heads, so I can live with trial and error there. The thinking loop, however, makes the model unusable.
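For what it's worth, the KV-cache footprint can be sanity-checked by hand, which may explain the gap between the estimate and actual VRAM use. A minimal sketch of the standard formula (the layer count, KV-head count, and head dimension below are placeholder values, not this model's actual config):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for an unquantized (fp16) cache.

    Factor of 2 covers both the K and the V tensors; bytes_per_elem is 2
    for fp16, 1 for a Q8_0-quantized cache (roughly).
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# Hypothetical config chosen only for illustration:
# 48 layers, 8 KV heads, head_dim 128, fp16 cache.
full_ctx = kv_cache_bytes(48, 8, 128, 202752)   # full advertised context
small_ctx = kv_cache_bytes(48, 8, 128, 33000)   # the context that did fit

print(f"202,752 tokens: {full_ctx / 2**30:.1f} GiB")
print(f" 33,000 tokens: {small_ctx / 2**30:.1f} GiB")
```

Under these placeholder numbers the full context alone needs ~37 GiB of cache before weights, so even a modest mis-estimate of per-token cost by the loader compounds quickly at 200k context. Plugging in the real values from the model's config (`n_layer`, `n_head_kv`, `head_dim`) would show whether the 22.3 GB estimate is plausible.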
Any advice?