Request for 3bpw model (target: 8GB VRAM Cards)
Hello bartowski,
would it be possible to get a 3bpw version of this model? At this size, 8GB GPUs are able to run 13B models at 4k context.
Thanks,
Sebastian
Hey @Sebastian651, I made 3.0, 3.25, and 3.5 bpw versions so you can experiment and find the sweet spot! Let me know which one works best and I'll add it to the 13B quants :)
@bartowski thanks for the quick reaction!
I tested the models today; the 3.0bpw is the best I can do with my 8GB RTX 3050 at 4k context and 8-bit cache. With these settings it uses 7.4GB VRAM and generates ~9 tokens/s, which is a huge step up and actually usable compared to the ~2.7 tokens/s I was getting from a GGUF Q4_K_M model only partially offloaded to the GPU.
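For anyone wanting to reproduce this, here is a minimal sketch of loading an exl2 quant with 4k context and the 8-bit cache through the exllamav2 Python API. The model directory is a placeholder and the sampler values are arbitrary examples, not the exact settings used above:

```python
# Minimal sketch: load a 3.0bpw exl2 quant with 4k context and an 8-bit KV cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-3.0bpw-exl2"  # placeholder: downloaded exl2 folder
config.prepare()
config.max_seq_len = 4096  # 4k context

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit cache halves KV memory vs FP16
model.load_autosplit(cache)                    # load weights, allocating the cache as it goes

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7  # arbitrary example values
settings.top_p = 0.9

print(generator.generate_simple("Hello, my name is", settings, 64))
```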
Using the 3.25bpw I can go up to 3k context with 8-bit cache, but it gave me a bluescreen with a memory management error on the first try. After a reboot it works, also at 7.4GB of VRAM. For me, 4k context is the minimum I want to use, so I will stick with the 3.0bpw version :)
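As a rough sanity check on these memory numbers (my own estimate, not measured), a back-of-envelope calculation assuming Llama-2-13B dimensions (40 layers, hidden size 5120) and the 8-bit cache, ignoring activations and runtime overhead:

```python
# Back-of-envelope VRAM estimate; assumes Llama-2-13B dimensions and ignores
# activations, fragmentation, and anything else sharing the GPU.
params = 13e9
layers, hidden = 40, 5120

def weights_gb(bpw):
    return params * bpw / 8 / 1e9           # quantized weight size in GB

def kv_cache_gb(tokens, bytes_per_elem=1):   # 1 byte/element for the 8-bit cache
    return 2 * layers * hidden * bytes_per_elem * tokens / 1e9  # K and V per token

for bpw, ctx in [(3.0, 4096), (3.25, 3072)]:
    total = weights_gb(bpw) + kv_cache_gb(ctx)
    print(f"{bpw}bpw @ {ctx} ctx: ~{total:.1f} GB + overhead")
# ~6.6 GB and ~6.5 GB respectively, so both land just under 8 GB once
# activations and CUDA overhead are added, consistent with the ~7.4 GB observed.
```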