chatllm.cpp supports this model now.
Decoding options (--set OPTION VALUE):
block_size: default 16. When set to <= 1, it falls back to autoregressive decoding.
accept_algo: default 2
- 0: entropy algo: https://github.com/Tencent/WeDLM/blob/d4481cab821044b8ebd5f78bc37f23787a6275ed/wedlm/engine/sampler.py#L169
- 1: prob algo: https://huggingface.co/tencent/WeDLM-8B-Instruct/blob/main/modeling_wedlm.py#L694
- 2: custom algo: sampling + prob
threshold: default 0.7. For algo 0, tokens are accepted if entropy is less than the threshold; for the other algos, tokens are accepted when probability (or confidence level) is larger than it.
pos_penalty_factor: default 0.02 (used by entropy algo)
Note: this model is very sensitive to sampling parameters. The results may be completely unacceptable with improper parameters.
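The three acceptance algorithms can be sketched roughly as follows. This is a minimal illustration, not chatllm.cpp's actual code: the function names, the per-position penalty form, and the way algo 2 combines sampling with the probability check are all assumptions; see the linked sampler.py and modeling_wedlm.py for the real logic.

```python
import math

def entropy(probs):
    """Shannon entropy of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def accept_prefix(block_probs, algo=2, threshold=0.7, pos_penalty_factor=0.02):
    """Return how many leading tokens of a decoded block to commit.

    Sketch only; real implementations differ in details:
      algo 0: accept while entropy < threshold (position-penalized, assumed form)
      algo 1: accept while top-token probability > threshold
      algo 2: assumed here to be the probability rule applied after sampling
              (the "sampling + prob" combination)
    """
    accepted = 0
    for i, probs in enumerate(block_probs):
        if algo == 0:
            # Assumption: later positions must clear a stricter entropy bar.
            ok = entropy(probs) < threshold - pos_penalty_factor * i
        else:
            ok = max(probs) > threshold
        if not ok:
            break
        accepted += 1
    return accepted
```

For example, with a confident first position and a uniform second one (`[[0.99, 0.01], [0.5, 0.5]]`), only one token is committed per step, which degenerates to AR-like behavior; the parallel speedup comes from blocks where many positions clear the bar at once.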
Performance
On CPU, when generating ~300 tokens, we can see a 50+% performance boost with the customized sampling algo. Unfortunately, I can't see any performance boost on GPU. Maybe a larger block_size would help?
Run in AR mode
> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equaltion x^2 - 4 = 0" --set block-size 0
To solve the equation \(x^2 - 4 = 0\), we can follow these steps:
1. **Isolate the term involving \(x\)**:
The equation is already in a form where the term involving \(x\) is isolated on one side of the equation. So, we have:
\[
x^2 - 4 = 0
\]
...
timings: prompt eval time = 631.03 ms / 32 tokens ( 19.72 ms per token, 50.71 tokens per second)
timings: eval time = 45880.58 ms / 310 tokens ( 148.00 ms per token, 6.76 tokens per second)
timings: total time = 46511.61 ms / 342 tokens
Run in parallel decoding mode
> main.exe -m quantized\wedlm-8b-it.bin --max-length 4000 -p "solve the equaltion x^2 - 4 = 0"
To solve the equation \( x^2 - 4 = 0 \), we can follow these steps:
1. **Recognize the equation as a difference of squares:**
The \( x^2 - 4 \) can be written as \( x^2 - 2^2 \), which is a difference of squares. The difference of squares formula is \( a^2 - b^2 = (a - b)(a + b) \). Here, \( a = x \) and \( b = 2 \). So, we can rewrite the equation as:
\[
x^2 - 4 = (x - 2)(x + 2) = 0
\]
...
timings: prompt eval time = 1579.78 ms / 64 tokens ( 24.68 ms per token, 40.51 tokens per second)
timings: eval time = 38127.28 ms / 373 tokens ( 102.22 ms per token, 9.78 tokens per second)
timings: total time = 39707.06 ms / 437 tokens
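For reference, the eval-phase speedup implied by the two timing logs above works out to roughly 45% for this particular run (the 50+% figure quoted earlier presumably comes from other runs):

```python
# Eval-phase throughput taken from the two timing logs above.
ar_tps  = 310 / 45.88058   # AR mode: 310 tokens in 45880.58 ms (~6.76 tok/s)
par_tps = 373 / 38.12728   # parallel decoding: 373 tokens in 38127.28 ms (~9.78 tok/s)
speedup = par_tps / ar_tps - 1.0
print(f"eval speedup: {speedup:.1%}")   # prints: eval speedup: 44.8%
```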
Thank you for your work; it looks good. Could you please compare the tokens per step on both CPU and GPU? Also, does this align with the models in our original repository? We will also help check this when we have time.
Inference of the Qwen3 arch is OK, but the calculation of entropy is not aligned with PyTorch (everything else can be kept as is, but this part does not work).
I am working on other models now. Hope to come back to this later.