Work with llama.cpp?

#2 opened by AIMuddle

Hi, thanks for this GGUF.

Does this work with llama.cpp? Can it run in LM Studio?

I've got two RTX 8000s with NVLink (96 GB VRAM) and 336 GB of system RAM, so by splitting the model across GPU and CPU I could probably just barely fit it. I'm just not sure whether it works yet.

@AIMuddle It's a work in progress. I have several branches, but they all exhibit degenerate generation after about 2k tokens of context, and I've been trying to resolve the issue for months. The simplest branch is: https://github.com/createthis/llama.cpp/pull/31

There are no CUDA kernels in that branch. It just uses normal GGML ops plus a CPU radix top-k implementation.
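For anyone curious what that means, here is a minimal sketch of the general radix-select top-k technique on the CPU. This is an illustration only, not the actual code in the branch; the name `radix_topk` and the 8-bit digit width are my own choices for the example:

```cpp
// Minimal sketch of CPU radix-select top-k over float logits.
// Illustrative only; assumes 0 < k <= vals.size().
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float to a uint32 whose unsigned ordering matches float ordering.
static uint32_t float_to_sortable(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Return indices of the k largest values, narrowing the candidate set
// one 8-bit digit at a time from the most significant byte down.
std::vector<int> radix_topk(const std::vector<float> & vals, int k) {
    std::vector<int> cand(vals.size());
    for (int i = 0; i < (int) vals.size(); ++i) cand[i] = i;

    std::vector<int> out;
    for (int shift = 24; shift >= 0 && (int) out.size() < k; shift -= 8) {
        size_t hist[256] = {0};
        for (int i : cand) hist[(float_to_sortable(vals[i]) >> shift) & 0xFF]++;

        // Find the bucket that straddles the k-th largest element.
        int need = k - (int) out.size();
        size_t seen = 0;
        int b = 255;
        for (; b >= 0; --b) {
            if (seen + hist[b] >= (size_t) need) break;
            seen += hist[b];
        }

        std::vector<int> next;
        for (int i : cand) {
            int d = (float_to_sortable(vals[i]) >> shift) & 0xFF;
            if (d > b) out.push_back(i);        // definitely in the top-k
            else if (d == b) next.push_back(i); // tie bucket, still undecided
        }
        cand.swap(next);

        if (shift == 0) { // last digit: fill what's needed from the ties
            for (int i : cand) {
                if ((int) out.size() >= k) break;
                out.push_back(i);
            }
        }
    }
    return out;
}
```

Each pass either settles elements into the top-k outright or shrinks the tie bucket, so the selection finishes in at most four linear passes instead of a full sort.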

I also have a branch where I vendored the vLLM top-k CUDA kernel: https://github.com/createthis/llama.cpp/pull/9

I have a single Blackwell 6000 Pro and 768 GB of system RAM, and I haven't tried either branch on any other hardware. I think the vLLM top-k branch is hard-coded to compile only for sm_120a (my Blackwell card).
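For context, that kind of arch restriction usually comes down to a preprocessor guard along these lines. This is an illustrative sketch, not the actual guard in the branch; I'm assuming the usual mapping where `-arch=sm_120a` defines `__CUDA_ARCH__` as 1200 in the device compilation pass:

```cpp
// Illustrative sketch only; the real branch may gate differently.
// __CUDA_ARCH__ is only defined during device compilation, so this
// guard is a no-op in the host pass and fires for any device pass
// that isn't targeting compute capability 12.0 (sm_120a).
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ != 1200
#error "vendored vLLM top-k kernel is only built for sm_120a (Blackwell)"
#endif
```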
