Work with llama.cpp?
Hi, thanks for this GGUF.
Does this work with llama.cpp? Can it run on LM studio?
I've got 2 RTX 8000s with NVLink (96 GB VRAM) and 336 GB of system RAM, so by splitting it across GPU and CPU I could probably just barely fit this. Just not sure if it works yet.
@AIMuddle It's a work in progress. I have several branches, but they're all experiencing degenerate generation after about 2k context. I've been trying to resolve the issue for months. The simplest branch is: https://github.com/createthis/llama.cpp/pull/31
There are no CUDA kernels in that branch; it just uses normal GGML ops plus a CPU radix top-k implementation.
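For anyone curious what a CPU radix top-k looks like, here's a minimal standalone sketch of the general technique (not the code from my branch): map each float's bits to an order-preserving unsigned key, then do byte-at-a-time passes from the most significant byte down, keeping whole buckets that are certainly in the top-k and recursing only into the one boundary bucket. Function and variable names are my own for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float's bit pattern to an unsigned key whose integer ordering
// matches the float's numeric ordering (standard radix-sort-floats trick):
// set the sign bit for positives, flip all bits for negatives.
static uint32_t float_to_key(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Return the k largest values (order unspecified) via 8-bit radix passes.
std::vector<float> radix_top_k(const std::vector<float>& data, size_t k) {
    std::vector<float> candidates = data;
    std::vector<float> result;
    for (int shift = 24; shift >= 0 && result.size() < k && !candidates.empty();
         shift -= 8) {
        // Histogram the current byte of every remaining candidate.
        size_t counts[256] = {0};
        for (float f : candidates)
            counts[(float_to_key(f) >> shift) & 0xFF]++;
        // Scan buckets from the highest byte value (largest floats) down;
        // buckets that fit entirely within the remaining quota are kept whole.
        size_t need = k - result.size();
        size_t acc = 0;
        int boundary = -1;  // bucket straddling the k-th largest value
        for (int b = 255; b >= 0; b--) {
            if (acc + counts[b] <= need) {
                acc += counts[b];
            } else {
                boundary = b;
                break;
            }
        }
        std::vector<float> next;
        for (float f : candidates) {
            int b = (float_to_key(f) >> shift) & 0xFF;
            if (boundary < 0 || b > boundary) result.push_back(f);
            else if (b == boundary) next.push_back(f);  // recurse on next byte
        }
        candidates.swap(next);
    }
    // Any candidates left after the last pass are exact ties on all bytes.
    for (size_t i = 0; i < candidates.size() && result.size() < k; i++)
        result.push_back(candidates[i]);
    return result;
}
```

The appeal over a heap or `std::partial_sort` is that each pass is a linear scan with no comparisons, which vectorizes well; the boundary bucket shrinks fast, so most elements are discarded in the first pass.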
I also have a branch where I vendored the vLLM top-k CUDA kernel: https://github.com/createthis/llama.cpp/pull/9
I have a single Blackwell 6000 Pro and 768 GB of system RAM, and I haven't tried either of those branches on anything else. I think the vLLM top-k branch is hard coded to only compile for sm_120a (my Blackwell card).