Request for 3bpw model (target: 8GB VRAM Cards)
Hello bartowski,
would it be possible to get a 3bpw version of this model? At this size, 8GB GPUs are able to run 13B models at 4k context.
Thanks,
Sebastian
Hey @Sebastian651, I made 3.0, 3.25, and 3.5 bpw versions so you can experiment and find the sweet spot! Let me know which one works best and I'll add it to the 13B quants :)
@bartowski thanks for the quick reaction!
I tested the models today; the 3.0bpw is the best I can do with my 8GB RTX 3050 at 4k context and 8-bit cache. With these settings it uses 7.4GB VRAM and generates ~9 tokens/s, which is a huge step up and actually usable compared to the ~2.7 tokens/s I was getting from a GGUF Q4_K_M model only partially offloaded to the GPU.
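For anyone wanting to reproduce this, here is a minimal sketch of loading an exl2 quant with 4k context and the 8-bit cache through the exllamav2 Python API. The model directory is a placeholder and the sampler values are arbitrary examples, not the exact settings used above:

```python
# Minimal sketch: load a 3.0bpw exl2 quant with 4k context and an 8-bit KV cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-3.0bpw-exl2"  # placeholder: downloaded exl2 folder
config.prepare()
config.max_seq_len = 4096  # 4k context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache halves KV memory vs FP16
model.load_autosplit(cache)                    # load weights, allocating the cache as it goes

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7  # arbitrary example values
settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", settings, 64))
```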
Using the 3.25bpw I can go up to 3k context with 8-bit cache, but it gave me a bluescreen with a memory management error on the first try. After a reboot it works, also at 7.4GB of VRAM. For me, 4k context is the minimum I want to use, so I will stick with the 3.0bpw version :)
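As a rough sanity check on these memory numbers (my own estimate, not measured), a back-of-envelope calculation assuming Llama-2-13B dimensions (40 layers, hidden size 5120) and the 8-bit cache, ignoring activations and runtime overhead:

```python
# Back-of-envelope VRAM estimate; assumes Llama-2-13B dimensions and ignores
# activations, fragmentation, and anything else sharing the GPU.
params = 13e9
layers, hidden = 40, 5120

def weights_gb(bpw):
    return params * bpw / 8 / 1e9           # quantized weight size in GB

def kv_cache_gb(tokens, bytes_per_elem=1):   # 1 byte/element for the 8-bit cache
    return 2 * layers * hidden * bytes_per_elem * tokens / 1e9  # K and V per token

for bpw, ctx in [(3.0, 4096), (3.25, 3072)]:
    total = weights_gb(bpw) + kv_cache_gb(ctx)
    print(f"{bpw}bpw @ {ctx} ctx: ~{total:.1f} GB + overhead")
# ~6.6 GB and ~6.5 GB respectively, so both land just under 8 GB once
# activations and CUDA overhead are added, consistent with the ~7.4 GB observed.
```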