Update README.md (PR #1, opened by mbatuhanunverdi)
# Qwen3-Coder-480B-A35B-Instruct Model Comparison Analysis
## Test Configuration
- Full Precision Model: 4× B300 GPU
- NVFP4 Quantized Model: 2× B300 GPU
- Tested Concurrency Levels: 1, 2, 4, 8, 16, 32
- Engine: TensorRT-LLM
## Use it with TensorRT-LLM

```shell
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
    OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --tp_size 2 \
    --kv_cache_free_gpu_memory_fraction 0.9
```
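With the port mapping above, the server is reachable on the host at `http://localhost:8050`. A minimal request sketch in Python; the `/v1/chat/completions` route follows the usual OpenAI-compatible convention of `trtllm-serve` and is an assumption here, not verified against this exact image:

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8050"):
    """Build an OpenAI-style chat completion request for the served model.

    The route and payload shape follow common OpenAI-compatible server
    conventions (assumed, not verified against this container); adjust
    if your deployment differs.
    """
    payload = {
        "model": "OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires the container above to be running:
# resp = urllib.request.urlopen(build_chat_request("Write FizzBuzz in Python."))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```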
## Performance Metrics Comparison
### 1) Time to First Token (TTFT) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 73.46 | 92.56 | +19.10 | +26.0% |
| 2 | 136.82 | 173.48 | +36.66 | +26.8% |
| 4 | 130.01 | 163.84 | +33.83 | +26.0% |
| 8 | 136.87 | 177.42 | +40.55 | +29.6% |
| 16 | 163.07 | 174.25 | +11.18 | +6.9% |
| 32 | 134.69 | 169.11 | +34.42 | +25.6% |
#### TTFT Insights
- Averaged across all six concurrency levels, the NVFP4 model shows ~23.5% higher TTFT (around +26% at most individual levels).
- The largest TTFT regression is at concurrency 8 with +29.6%.
- The smallest TTFT regression is at concurrency 16, at only +6.9%.
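The per-level overheads and their average can be recomputed directly from the table above:

```python
# TTFT (ms) from the table: {concurrency: (full-precision, NVFP4)}.
ttft = {
    1: (73.46, 92.56),
    2: (136.82, 173.48),
    4: (130.01, 163.84),
    8: (136.87, 177.42),
    16: (163.07, 174.25),
    32: (134.69, 169.11),
}

# Relative overhead of NVFP4 vs. full precision, in percent.
overheads = {c: (q - f) / f * 100 for c, (f, q) in ttft.items()}
mean_overhead = sum(overheads.values()) / len(overheads)
print(f"mean TTFT overhead: {mean_overhead:.1f}%")        # prints "mean TTFT overhead: 23.5%"
print(max(overheads, key=overheads.get))                   # 8 -> worst level
print(min(overheads, key=overheads.get))                   # 16 -> mildest level
```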
### 2) Inter-Token Latency (ITL) — ms
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 8.31 | 8.99 | +0.68 | +8.2% |
| 2 | 9.92 | 10.01 | +0.09 | +0.9% |
| 4 | 12.11 | 11.52 | -0.59 | -4.9% |
| 8 | 14.99 | 13.66 | -1.33 | -8.9% |
| 16 | 18.42 | 15.68 | -2.74 | -14.9% |
| 32 | 22.12 | 18.03 | -4.09 | -18.5% |
#### ITL Insights
- At low concurrency (1–2), NVFP4 has slightly higher token-to-token latency.
- At medium-to-high concurrency (8–32), NVFP4 performs better.
- At concurrency 32, NVFP4 achieves 18.5% lower ITL.
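The crossover point where NVFP4 becomes faster per token can be read off programmatically from the ITL table:

```python
# ITL (ms) from the table: {concurrency: (full-precision, NVFP4)}.
itl = {1: (8.31, 8.99), 2: (9.92, 10.01), 4: (12.11, 11.52),
       8: (14.99, 13.66), 16: (18.42, 15.68), 32: (22.12, 18.03)}

# Relative change of NVFP4 vs. full precision; negative means NVFP4 is faster.
change = {c: (q - f) / f * 100 for c, (f, q) in itl.items()}
crossover = min(c for c, pct in change.items() if pct < 0)
print(crossover)             # prints 4 -> first level where NVFP4 wins
print(f"{change[32]:.1f}%")  # prints "-18.5%" at concurrency 32
```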
### 3) Tokens Per Second (TPS) — tokens/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 112.61 | 103.54 | -9.07 | -8.1% |
| 2 | 91.60 | 88.53 | -3.07 | -3.3% |
| 4 | 76.61 | 78.11 | +1.50 | +2.0% |
| 8 | 62.58 | 66.77 | +4.19 | +6.7% |
| 16 | 51.03 | 58.03 | +7.00 | +13.7% |
| 32 | 43.37 | 51.75 | +8.38 | +19.3% |
#### TPS Insights
- At low concurrency (1–2), the full-precision model delivers higher per-stream TPS.
- At medium-to-high concurrency (4–32), NVFP4 achieves higher TPS.
- At concurrency 32, NVFP4 delivers 19.3% higher TPS.
### 4) Total Latency — seconds
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 1.12 | 1.23 | +0.11 | +9.8% |
| 2 | 1.40 | 1.45 | +0.05 | +3.6% |
| 4 | 1.66 | 1.61 | -0.05 | -3.0% |
| 8 | 2.03 | 1.90 | -0.13 | -6.4% |
| 16 | 2.49 | 2.15 | -0.34 | -13.7% |
| 32 | 2.94 | 2.43 | -0.51 | -17.3% |
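The total-latency figures are consistent with a simple decomposition, total ≈ TTFT + (N − 1) × ITL, assuming roughly N ≈ 127 output tokens per request. That output length is inferred from the tables, not a stated test parameter; the sketch below checks the decomposition against the full-precision rows:

```python
# Full-precision rows from the tables: {concurrency: (TTFT ms, ITL ms, total s)}.
full = {1: (73.46, 8.31, 1.12), 2: (136.82, 9.92, 1.40), 4: (130.01, 12.11, 1.66),
        8: (136.87, 14.99, 2.03), 16: (163.07, 18.42, 2.49), 32: (134.69, 22.12, 2.94)}

N = 127  # assumed output length in tokens (inferred, not stated in the benchmark)
for c, (ttft, itl, total) in full.items():
    predicted = ttft / 1000 + (N - 1) * itl / 1000  # seconds
    # Every level matches the reported total to within ~0.03 s.
    assert abs(predicted - total) < 0.03, (c, predicted, total)
```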
#### Latency Insights
- At low concurrency, the full-precision model completes requests faster.
- At medium-to-high concurrency, NVFP4 provides lower total latency.
### 5) Throughput (RPS) — requests/s
| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Change (%) |
|---|---|---|---|---|
| 1 | 0.90 | 0.81 | -0.09 | -10.0% |
| 2 | 0.72 | 0.69 | -0.03 | -4.2% |
| 4 | 0.60 | 0.62 | +0.02 | +3.3% |
| 8 | 0.49 | 0.53 | +0.04 | +8.2% |
| 16 | 0.40 | 0.46 | +0.06 | +15.0% |
| 32 | 0.34 | 0.41 | +0.07 | +20.6% |
#### Throughput Insights
- At low concurrency, the full-precision model has higher throughput.
- At medium-to-high concurrency, NVFP4 achieves better throughput overall.
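The RPS figures appear to be per-client (the reciprocal of total latency) rather than aggregate server throughput; under that reading, system-wide throughput at a given concurrency is RPS × concurrency. This is an interpretation of the tables, not something the benchmark states. A quick consistency check:

```python
# {concurrency: (full-precision, NVFP4)} from the latency and RPS tables.
total_latency = {1: (1.12, 1.23), 2: (1.40, 1.45), 4: (1.66, 1.61),
                 8: (2.03, 1.90), 16: (2.49, 2.15), 32: (2.94, 2.43)}
rps = {1: (0.90, 0.81), 2: (0.72, 0.69), 4: (0.60, 0.62),
       8: (0.49, 0.53), 16: (0.40, 0.46), 32: (0.34, 0.41)}

# Reported RPS matches 1 / total_latency at every level, for both models.
for c, (lat_full, lat_q) in total_latency.items():
    assert abs(1 / lat_full - rps[c][0]) < 0.01
    assert abs(1 / lat_q - rps[c][1]) < 0.01

# Implied aggregate throughput at concurrency 32 under this interpretation:
print(f"full: {rps[32][0] * 32:.1f} req/s, NVFP4: {rps[32][1] * 32:.1f} req/s")
# prints "full: 10.9 req/s, NVFP4: 13.1 req/s"
```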