chatllm.cpp supports this model now

#11
by J22 - opened

chatllm.cpp supports this model now.

Quantized model

Decoding options (set with --set OPTION VALUE):

Note: this model is very sensitive to sampling parameters. The results may be completely unacceptable with improper parameters.

Performance

On CPU, when generating ~300 tokens, we see a 50%+ performance boost with the customized sampling algorithm. Unfortunately, I can't see any performance boost on GPU; maybe a larger block_size would help?

Run in AR mode

> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equation x^2 - 4 = 0" --set block-size 0

To solve the equation \(x^2 - 4 = 0\), we can follow these steps:

1. **Isolate the term involving \(x\)**:
   The equation is already in a form where the term involving \(x\) is isolated on one side of the equation. So, we have:
   \[
   x^2 - 4 = 0
   \]

...

timings: prompt eval time =       631.03 ms /    32 tokens (    19.72 ms per token,    50.71 tokens per second)
timings:        eval time =     45880.58 ms /   310 tokens (   148.00 ms per token,     6.76 tokens per second)
timings:       total time =     46511.61 ms /   342 tokens

Run in parallel decoding mode

> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equation x^2 - 4 = 0"

To solve the equation \( x^2 - 4 = 0 \), we can follow these steps:

1. **Recognize the equation as a difference of squares:**
   The \( x^2 - 4 \) can be written as \( x^2 - 2^2 \), which is a difference of squares. The difference of squares formula is \( a^2 - b^2 = (a - b)(a + b) \). Here, \( a = x \) and \( b = 2 \). So, we can rewrite the equation as:
   \[
   x^2 - 4 = (x - 2)(x + 2) = 0
   \]

...

timings: prompt eval time =      1579.78 ms /    64 tokens (    24.68 ms per token,    40.51 tokens per second)
timings:        eval time =     38127.28 ms /   373 tokens (   102.22 ms per token,     9.78 tokens per second)
timings:       total time =     39707.06 ms /   437 tokens
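The eval-phase timings of the two runs above can be compared directly. A quick back-of-the-envelope check in plain Python, with the numbers copied from the two logs:

```python
# Eval-phase throughput, numbers taken from the two timing logs above.
ar_tok_per_s = 310 / (45880.58 / 1000.0)   # AR mode: 310 tokens in 45880.58 ms
par_tok_per_s = 373 / (38127.28 / 1000.0)  # parallel mode: 373 tokens in 38127.28 ms
speedup = par_tok_per_s / ar_tok_per_s
print(f"AR: {ar_tok_per_s:.2f} tok/s, parallel: {par_tok_per_s:.2f} tok/s, "
      f"speedup: {speedup:.2f}x")
```

On this CPU run the eval-phase speedup comes out to about 1.45x, roughly in line with the ~50% figure quoted above.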
Tencent org

Thank you for your work; it looks good. Could you please compare the tokens per step on both CPU and GPU? Also, does this align with the models in our original repository? We will also help check this when we have time.

Inference with the Qwen3 arch is OK, but the entropy calculation is not aligned with PyTorch (the rest can be kept as-is, but this part does not work):

https://github.com/foldl/chatllm.cpp/blob/27f076460363b08f1cb7a5481b81d66b560dfe33/src/layers.cpp#L1176
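For reference, the numerically stable entropy of softmax(logits), H = logsumexp(z) - sum_i p_i * z_i, which is what a PyTorch implementation would produce, can be sketched in plain Python. This is an illustrative reference for checking alignment, not the actual chatllm.cpp code:

```python
import math

def softmax_entropy(logits):
    """Entropy of softmax(logits) via the stable identity
    H = logsumexp(z) - sum(p_i * z_i), subtracting max(z) first.
    (Reference sketch for cross-checking, not the chatllm.cpp code.)"""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    lse = m + math.log(s)                  # log-sum-exp of the logits
    probs = [e / s for e in exps]
    return lse - sum(p * z for p, z in zip(probs, logits))

# Uniform distribution over 4 logits has entropy log(4) ~ 1.3863.
print(softmax_entropy([0.0, 0.0, 0.0, 0.0]))
```

A common source of misalignment is computing entropy after temperature/top-k filtering instead of on the raw logits, or accumulating the sum in float16 rather than float32.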

I am working on other models now. Hope to come back to this later.
