chatllm.cpp supports this model now

#11
by J22 - opened

chatllm.cpp supports this model now.

Quantized model

Decoding options (set with --set OPTION VALUE):

Note: this model is very sensitive to sampling parameters. The results may be completely unacceptable with improper parameters.

Performance

On CPU, when generating ~300 tokens, we see a 50%+ performance boost with the customized sampling algorithm. Unfortunately, I can't see any performance boost on GPU; maybe a larger block_size would help?

Run in AR mode

> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equation x^2 - 4 = 0" --set block-size 0

To solve the equation \(x^2 - 4 = 0\), we can follow these steps:

1. **Isolate the term involving \(x\)**:
   The equation is already in a form where the term involving \(x\) is isolated on one side of the equation. So, we have:
   \[
   x^2 - 4 = 0
   \]

...

timings: prompt eval time =       631.03 ms /    32 tokens (    19.72 ms per token,    50.71 tokens per second)
timings:        eval time =     45880.58 ms /   310 tokens (   148.00 ms per token,     6.76 tokens per second)
timings:       total time =     46511.61 ms /   342 tokens

Run in parallel decoding mode

> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equation x^2 - 4 = 0"

To solve the equation \( x^2 - 4 = 0 \), we can follow these steps:

1. **Recognize the equation as a difference of squares:**
   The \( x^2 - 4 \) can be written as \( x^2 - 2^2 \), which is a difference of squares. The difference of squares formula is \( a^2 - b^2 = (a - b)(a + b) \). Here, \( a = x \) and \( b = 2 \). So, we can rewrite the equation as:
   \[
   x^2 - 4 = (x - 2)(x + 2) = 0
   \]

...

timings: prompt eval time =      1579.78 ms /    64 tokens (    24.68 ms per token,    40.51 tokens per second)
timings:        eval time =     38127.28 ms /   373 tokens (   102.22 ms per token,     9.78 tokens per second)
timings:       total time =     39707.06 ms /   437 tokens
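The eval-phase timings of the two runs above can be compared directly. A quick back-of-the-envelope check in plain Python, with the numbers copied from the two logs:

```python
# Eval-phase throughput, numbers taken from the two timing logs above.
ar_tok_per_s = 310 / (45880.58 / 1000.0)   # AR mode: 310 tokens in 45880.58 ms
par_tok_per_s = 373 / (38127.28 / 1000.0)  # parallel mode: 373 tokens in 38127.28 ms
speedup = par_tok_per_s / ar_tok_per_s
print(f"AR: {ar_tok_per_s:.2f} tok/s, parallel: {par_tok_per_s:.2f} tok/s, "
      f"speedup: {speedup:.2f}x")
```

On this CPU run the eval-phase speedup comes out to about 1.45x, roughly in line with the ~50% figure quoted above.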
Tencent org

Thank you for your work; it looks good. Could you please compare the tokens per step on both CPU and GPU? Also, does this align with the models in our original repository? We will also help check this when we have time.

Inference with the Qwen3 arch is OK, but the entropy calculation is not aligned with PyTorch (the rest can be kept as-is, but this part does not work):

https://github.com/foldl/chatllm.cpp/blob/27f076460363b08f1cb7a5481b81d66b560dfe33/src/layers.cpp#L1176
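For reference, the numerically stable entropy of softmax(logits), H = logsumexp(z) - sum_i p_i * z_i, which is what a PyTorch implementation would produce, can be sketched in plain Python. This is an illustrative reference for checking alignment, not the actual chatllm.cpp code:

```python
import math

def softmax_entropy(logits):
    """Entropy of softmax(logits) via the stable identity
    H = logsumexp(z) - sum(p_i * z_i), subtracting max(z) first.
    (Reference sketch for cross-checking, not the chatllm.cpp code.)"""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    lse = m + math.log(s)                  # log-sum-exp of the logits
    probs = [e / s for e in exps]
    return lse - sum(p * z for p, z in zip(probs, logits))

# Uniform distribution over 4 logits has entropy log(4) ~ 1.3863.
print(softmax_entropy([0.0, 0.0, 0.0, 0.0]))
```

A common source of misalignment is computing entropy after temperature/top-k filtering instead of on the raw logits, or accumulating the sum in float16 rather than float32.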

I am working on other models now. Hope to come back to this later.
