Work with llama.cpp?

#2 opened by AIMuddle

Hi, thanks for this GGUF.

Does this work with llama.cpp? Can it run in LM Studio?

I've got two RTX 8000s with NVLink (96 GB VRAM) and 336 GB of system RAM, so by splitting the model across GPU and CPU I could probably just barely fit it. I'm just not sure whether it works yet.

@AIMuddle It's a work in progress. I have several branches, but they all exhibit degenerate generation after about 2k tokens of context, and I've been trying to resolve the issue for months. The simplest branch is: https://github.com/createthis/llama.cpp/pull/31

There are no CUDA kernels in that branch. It just uses normal GGML ops plus a CPU radix top-k implementation.
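For anyone curious what that means, here is a minimal sketch of the general radix-select top-k technique on the CPU. This is an illustration only, not the actual code in the branch; the name `radix_topk` and the 8-bit digit width are my own choices for the example:

```cpp
// Minimal sketch of CPU radix-select top-k over float logits.
// Illustrative only; assumes 0 < k <= vals.size().
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float to a uint32 whose unsigned ordering matches float ordering.
static uint32_t float_to_sortable(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Return indices of the k largest values, narrowing the candidate set
// one 8-bit digit at a time from the most significant byte down.
std::vector<int> radix_topk(const std::vector<float> & vals, int k) {
    std::vector<int> cand(vals.size());
    for (int i = 0; i < (int) vals.size(); ++i) cand[i] = i;

    std::vector<int> out;
    for (int shift = 24; shift >= 0 && (int) out.size() < k; shift -= 8) {
        size_t hist[256] = {0};
        for (int i : cand) hist[(float_to_sortable(vals[i]) >> shift) & 0xFF]++;

        // Find the bucket that straddles the k-th largest element.
        int need = k - (int) out.size();
        size_t seen = 0;
        int b = 255;
        for (; b >= 0; --b) {
            if (seen + hist[b] >= (size_t) need) break;
            seen += hist[b];
        }

        std::vector<int> next;
        for (int i : cand) {
            int d = (float_to_sortable(vals[i]) >> shift) & 0xFF;
            if (d > b) out.push_back(i);        // definitely in the top-k
            else if (d == b) next.push_back(i); // tie bucket, still undecided
        }
        cand.swap(next);

        if (shift == 0) { // last digit: fill what's needed from the ties
            for (int i : cand) {
                if ((int) out.size() >= k) break;
                out.push_back(i);
            }
        }
    }
    return out;
}
```

Each pass either settles elements into the top-k outright or shrinks the tie bucket, so the selection finishes in at most four linear passes instead of a full sort.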

I also have a branch where I vendored the vLLM top-k CUDA kernel: https://github.com/createthis/llama.cpp/pull/9

I have a single Blackwell 6000 Pro and 768 GB of system RAM, and I haven't tried either branch on any other hardware. I think the vLLM top-k branch is hard-coded to compile only for sm_120a (my Blackwell card).
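For context, that kind of arch restriction usually comes down to a preprocessor guard along these lines. This is an illustrative sketch, not the actual guard in the branch; I'm assuming the usual mapping where `-arch=sm_120a` defines `__CUDA_ARCH__` as 1200 in the device compilation pass:

```cpp
// Illustrative sketch only; the real branch may gate differently.
// __CUDA_ARCH__ is only defined during device compilation, so this
// guard is a no-op in the host pass and fires for any device pass
// that isn't targeting compute capability 12.0 (sm_120a).
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ != 1200
#error "vendored vLLM top-k kernel is only built for sm_120a (Blackwell)"
#endif
```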
