This model's lit!

by Nesaliti

The first model that actually made me smile with its answers. Characters and worldbuilding are lively, the narration is noticeable, and the storytelling is interesting to follow. Yes, the usual templates and repetition still show up sometimes, and logic faults can seep through (most likely due to the low quant), but it's the least dry iteration I can remember.

I messed around a bit, and for those who are interested, here's what I found to be the optimal setup:

(16 GB VRAM + 64 GB RAM, koboldcpp nocuda (AMD), Vulkan)
Quant: IQ3_XS (not XXS)
Offload 24 layers (BLAS batch lowered to 256 or it won't fit), MMAP and FlashAttention on. GPU ID set to 'all' despite having only one GPU (no idea why, but it struggles otherwise)
Context 16k; generation speed varies from ~1.26 tps at the start to ~0.89 tps with a filled context
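
For anyone launching from the command line instead of the GUI, a rough equivalent of the settings above might look like this. This is a sketch, not something I've verified: the flag names are taken from koboldcpp's --help, the model filename is a placeholder, mmap should already be on by default (it's disabled via --nommap), and I don't know of a CLI counterpart for the 'all' GPU ID selection, so double-check against your build:

```
REM Sketch of a CLI launch matching the GUI settings above.
REM Model filename is a placeholder -- swap in your actual IQ3_XS gguf.
koboldcpp_nocuda.exe Model-IQ3_XS.gguf ^
  --usevulkan ^
  --gpulayers 24 ^
  --blasbatchsize 256 ^
  --flashattention ^
  --contextsize 16384
```

On Linux the same flags should work with the Python script (`python koboldcpp.py ...`) and `\` line continuations instead of `^`.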
