Thanks again! Here is my review of an 8× RTX 5090 setup.

#2
by crystech - opened

On my RTX 5090 setup I observed slower TPS with MTP enabled:
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
Avg ~25 TPS
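For reference, recent vLLM versions also accept the speculative config as a single JSON blob instead of dotted flags. A rough sketch of an equivalent launch; the model name, tensor-parallel size, and exact flag spelling here are assumptions, so check them against the docs for your installed vLLM version:

```shell
# Sketch only: <your-model> is a placeholder, and --tensor-parallel-size 8
# assumes one rank per GPU; verify flag names against your vLLM version.
vllm serve <your-model> \
  --tensor-parallel-size 8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```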

With the limited VRAM of 256 GB (8 × 32 GB), I think RTX 5090 users should turn MTP off to free up memory for KV cache and get better performance?

Once I disabled MTP I got 35-41 TPS (Ubuntu with GDM running; I'd expect slightly higher TPS on a headless setup).

---- Without MTP ----
(APIServer pid=21036) INFO: 127.0.0.1:50526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=21036) INFO 12-25 02:01:12 [loggers.py:257] Engine 000: Avg prompt throughput: 1383.7 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%

---- With MTP ----
(APIServer pid=19990) INFO: Application startup complete.
(APIServer pid=19990) INFO: 127.0.0.1:35818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=19990) INFO 12-25 01:51:58 [loggers.py:257] Engine 000: Avg prompt throughput: 1260.1 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:51:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 0.77 tokens/s, Drafted throughput: 0.86 tokens/s, Accepted: 62 tokens, Drafted: 70 tokens, Per-position acceptance rate: 0.886, Avg Draft acceptance rate: 88.6%
(APIServer pid=19990) INFO 12-25 01:52:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.6%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.30 tokens/s, Accepted: 119 tokens, Drafted: 153 tokens, Per-position acceptance rate: 0.778, Avg Draft acceptance rate: 77.8%
(APIServer pid=19990) INFO 12-25 01:52:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.20 tokens/s, Accepted: 119 tokens, Drafted: 152 tokens, Per-position acceptance rate: 0.783, Avg Draft acceptance rate: 78.3%
(APIServer pid=19990) INFO 12-25 01:52:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.79, Accepted throughput: 12.60 tokens/s, Drafted throughput: 15.90 tokens/s, Accepted: 126 tokens, Drafted: 159 tokens, Per-position acceptance rate: 0.792, Avg Draft acceptance rate: 79.2%
(APIServer pid=19990) INFO 12-25 01:52:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.30 tokens/s, Drafted throughput: 0.30 tokens/s, Accepted: 3 tokens, Drafted: 3 tokens, Per-position acceptance rate: 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=19990) INFO 12-25 01:52:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
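A quick back-of-the-envelope check on these logs: with a mean acceptance length of ~1.78 tokens per verification step, MTP only breaks even if a combined draft + verify step costs no more than ~1.78× a plain decode step. Plugging in the observed ~27 TPS (MTP) vs ~40 TPS (baseline) suggests the combined step here costs roughly 2.6× a plain step, which would explain the slowdown. The simple one-step cost model below is my own assumption, not something vLLM reports:

```python
# Back-of-the-envelope cost model for speculative decoding.
# Assumption: the baseline emits 1 token per decode step, while MTP
# emits `acceptance_length` tokens per (draft + verify) step.
def implied_step_cost_ratio(tps_mtp, tps_base, acceptance_length):
    """How much more expensive an MTP step must be than a plain
    decode step to produce the observed throughputs.

    tps_mtp  = acceptance_length / t_mtp
    tps_base = 1 / t_base
    =>  t_mtp / t_base = acceptance_length * tps_base / tps_mtp
    """
    return acceptance_length * tps_base / tps_mtp

# Numbers taken from the logs above.
ratio = implied_step_cost_ratio(tps_mtp=27.0, tps_base=40.0,
                                acceptance_length=1.78)
print(f"implied MTP step cost: {ratio:.2f}x a plain decode step")
# -> about 2.64x, i.e. well above the 1.78x break-even point
```

By this reading, either a cheaper draft path or a higher acceptance length would be needed before MTP pays off on this hardware.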

Am I doing anything wrong? Asking just in case.
