# Qwen3-Coder-480B-A35B-Instruct Model Comparison Analysis

## Test Configuration

- **Full-Precision Model:** 4× B300 GPUs
- **NVFP4 Quantized Model:** 2× B300 GPUs
- **Tested Concurrency Levels:** 1, 2, 4, 8, 16, 32
- **Engine:** TensorRT-LLM

---
## Use it with TensorRT-LLM
```bash
docker run \
--gpus all \
--rm \
--ipc=host \
--ulimit memlock=-1:-1 \
--ulimit stack=67108864 \
--shm-size=64G \
-p 8050:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-w /app/tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
trtllm-serve \
OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--tp_size 2 \
--kv_cache_free_gpu_memory_fraction 0.9
```
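`trtllm-serve` exposes an OpenAI-compatible HTTP API, so the container above can be queried like any OpenAI-style endpoint. A minimal standard-library sketch; the localhost URL (host port 8050 from the `-p 8050:8000` mapping above) and the sampling parameters are illustrative, not prescribed:

```python
import json
import urllib.request

# Host port 8050 comes from the `-p 8050:8000` mapping in the docker command above.
BASE_URL = "http://localhost:8050/v1/chat/completions"
MODEL = "OPENZEKA/Qwen3-Coder-480B-A35B-Instruct-NVFP4"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for trtllm-serve."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # illustrative sampling setting
    }

def chat(prompt: str) -> str:
    """POST the request and return the generated text (requires a running server)."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a Python function that reverses a string.")` returns the model's completion.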

## Performance Metrics Comparison

### 1) Time to First Token (TTFT) — ms

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 73.46 | 92.56 | +19.10 | +26.0% |
| 2 | 136.82 | 173.48 | +36.66 | +26.8% |
| 4 | 130.01 | 163.84 | +33.83 | +26.0% |
| 8 | 136.87 | 177.42 | +40.55 | +29.6% |
| 16 | 163.07 | 174.25 | +11.18 | +6.9% |
| 32 | 134.69 | 169.11 | +34.42 | +25.6% |

**TTFT Insights**
- The NVFP4 model shows **~23.5% higher TTFT on average** across the tested concurrency levels (about 26% if the concurrency-16 outlier is excluded).
- The largest TTFT regression is at **concurrency 8** with **+29.6%**.
- The best relative TTFT result is at **concurrency 16** with **+6.9%**.
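The relative-change column in each table is the NVFP4 value against the full-precision baseline, i.e. `(nvfp4 - full) / full`. A short sketch reproducing the TTFT column from the table above:

```python
# TTFT values (ms) copied from the table: concurrency -> (full_precision, nvfp4)
TTFT = {
    1: (73.46, 92.56),
    2: (136.82, 173.48),
    4: (130.01, 163.84),
    8: (136.87, 177.42),
    16: (163.07, 174.25),
    32: (134.69, 169.11),
}

def rel_diff_pct(full: float, nvfp4: float) -> float:
    """Percentage change of NVFP4 relative to the full-precision baseline."""
    return (nvfp4 - full) / full * 100.0

diffs = {c: round(rel_diff_pct(f, q), 1) for c, (f, q) in TTFT.items()}
avg = sum(diffs.values()) / len(diffs)
print(diffs)          # per-concurrency TTFT regression in %
print(round(avg, 1))  # mean regression across the tested levels
```

The same formula applies to the ITL, TPS, latency, and RPS tables; for the throughput metrics a positive value is a gain for NVFP4 rather than a loss.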

---

### 2) Inter-Token Latency (ITL) — ms

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 8.31 | 8.99 | +0.68 | +8.2% |
| 2 | 9.92 | 10.01 | +0.09 | +0.9% |
| 4 | 12.11 | 11.52 | -0.59 | -4.9% |
| 8 | 14.99 | 13.66 | -1.33 | -8.9% |
| 16 | 18.42 | 15.68 | -2.74 | -14.9% |
| 32 | 22.12 | 18.03 | -4.09 | -18.5% |

**ITL Insights**
- At low concurrency (1–2), NVFP4 has **slightly higher** token-to-token latency.
- At medium-to-high concurrency (8–32), NVFP4 performs **better**.
- At **concurrency 32**, NVFP4 achieves **18.5% lower ITL**.

---

### 3) Tokens Per Second (TPS) — tokens/s

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 112.61 | 103.54 | -9.07 | -8.1% |
| 2 | 91.60 | 88.53 | -3.07 | -3.3% |
| 4 | 76.61 | 78.11 | +1.50 | +2.0% |
| 8 | 62.58 | 66.77 | +4.19 | +6.7% |
| 16 | 51.03 | 58.03 | +7.00 | +13.7% |
| 32 | 43.37 | 51.75 | +8.38 | +19.3% |

**TPS Insights**
- At low concurrency (1–2), the full-precision model is better.
- At medium-to-high concurrency (4–32), NVFP4 achieves **higher TPS**.
- At **concurrency 32**, NVFP4 delivers **19.3% higher TPS**.

---

### 4) Total Latency — seconds

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 1.12 | 1.23 | +0.11 | +9.8% |
| 2 | 1.40 | 1.45 | +0.05 | +3.6% |
| 4 | 1.66 | 1.61 | -0.05 | -3.0% |
| 8 | 2.03 | 1.90 | -0.13 | -6.4% |
| 16 | 2.49 | 2.15 | -0.34 | -13.7% |
| 32 | 2.94 | 2.43 | -0.51 | -17.3% |

**Latency Insights**
- At low concurrency, the full-precision model is better.
- At medium-to-high concurrency, NVFP4 provides **lower total latency**.

---

### 5) Throughput (RPS) — requests/s

| Concurrency | Full Model | NVFP4 Model | Diff (NVFP4 - Full) | Relative Change |
|---:|---:|---:|---:|---:|
| 1 | 0.90 | 0.81 | -0.09 | -10.0% |
| 2 | 0.72 | 0.69 | -0.03 | -4.2% |
| 4 | 0.60 | 0.62 | +0.02 | +3.3% |
| 8 | 0.49 | 0.53 | +0.04 | +8.2% |
| 16 | 0.40 | 0.46 | +0.06 | +15.0% |
| 32 | 0.34 | 0.41 | +0.07 | +20.6% |

**Throughput Insights**
- At low concurrency, the full-precision model has higher throughput.
- At medium-to-high concurrency, NVFP4 achieves better throughput overall.
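Since the test configuration runs NVFP4 on 2 GPUs versus 4 for full precision, the raw numbers can also be normalized per GPU. A rough sketch using the concurrency-32 TPS figures from the table above; this treats the reported tokens/s as directly comparable across the two setups, which is an assumption:

```python
# Concurrency-32 TPS and GPU counts, from the tables and Test Configuration above.
FULL_TPS, FULL_GPUS = 43.37, 4     # full-precision model on 4x B300
NVFP4_TPS, NVFP4_GPUS = 51.75, 2   # NVFP4 model on 2x B300

full_per_gpu = FULL_TPS / FULL_GPUS      # ~10.8 tokens/s per GPU
nvfp4_per_gpu = NVFP4_TPS / NVFP4_GPUS   # ~25.9 tokens/s per GPU
ratio = nvfp4_per_gpu / full_per_gpu     # ~2.4x per-GPU advantage for NVFP4
print(f"{full_per_gpu:.1f} {nvfp4_per_gpu:.1f} {ratio:.1f}")
```

Under that assumption, NVFP4 delivers roughly 2.4× the per-GPU throughput at concurrency 32 while using half the hardware.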
