Shoolife committed
Commit 035370c · verified · 1 parent: b8ccd4e

Update comparison table: add BF16 variant, set BF16 as baseline

Files changed (1): README.md (+6 -5)
README.md CHANGED

@@ -173,13 +173,14 @@ These numbers are local measurements from one machine and should be treated as r
 
 ## Local Comparison
 
-The table below compares three locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=1024`, `max_num_tokens=256`).
+The table below compares locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=1024`, `max_num_tokens=256`).
 
-| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Quick-check change vs FP16 | Practical reading |
+| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Quick-check change vs BF16 | Practical reading |
 |---|---:|---:|---:|---:|---:|---:|---|---|
-| `FP16` | `3.4 GB` | `3.4 GB` | `105.48 tok/s` | `105.49 tok/s` | `105.70 tok/s` | `0.75` | `baseline` | Most conservative variant |
-| `FP8` | `2.1 GB` | `2.2 GB` | `166.72 tok/s` | `144.37 tok/s` | `151.36 tok/s` | `0.75` | `no drop on this quick-check` | Best balance in these local tests |
-| `NVFP4` | `1.6 GB` | `1.2 GB` | `199.58 tok/s` | `200.09 tok/s` | `200.28 tok/s` | `0.60` | `-15 pts on this quick-check` | Fastest and smallest, but with visible quality drop |
+| `BF16` | `3.4 GB` | `3.4 GB` | `105.27 tok/s` | `105.37 tok/s` | `105.57 tok/s` | `0.725` | `baseline` | Same size and speed as FP16, better numerical stability for training |
+| `FP16` | `3.4 GB` | `3.4 GB` | `105.48 tok/s` | `105.49 tok/s` | `105.70 tok/s` | `0.75` | `+2.5 pts on this quick-check` | Most conservative variant |
+| `FP8` | `2.1 GB` | `2.2 GB` | `166.72 tok/s` | `144.37 tok/s` | `151.36 tok/s` | `0.75` | `+2.5 pts on this quick-check` | Best balance in these local tests |
+| `NVFP4` | `1.6 GB` | `1.2 GB` | `199.58 tok/s` | `200.09 tok/s` | `200.28 tok/s` | `0.60` | `-12.5 pts on this quick-check` | Fastest and smallest, but with visible quality drop |
 
 This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
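The speedups and point deltas implied by the updated table can be double-checked with a short sketch. The numbers below are copied from the table (`short_chat_42_64` throughput and the quick-check overall score); treating the quick-check score as a 0–1 fraction, so that a 0.025 difference equals 2.5 points, is an assumption about how the quick-check is reported.

```python
# Sanity-check the deltas in the comparison table above.
# Throughputs (tok/s, short_chat_42_64) and quick-check scores are
# copied verbatim from the table; the 0-1 score scale is an assumption.
variants = {
    "BF16":  {"tok_s": 105.27, "quick_check": 0.725},
    "FP16":  {"tok_s": 105.48, "quick_check": 0.75},
    "FP8":   {"tok_s": 166.72, "quick_check": 0.75},
    "NVFP4": {"tok_s": 199.58, "quick_check": 0.60},
}

baseline = variants["BF16"]  # BF16 is the table's new baseline
for name, v in variants.items():
    speedup = v["tok_s"] / baseline["tok_s"]
    delta_pts = (v["quick_check"] - baseline["quick_check"]) * 100
    print(f"{name}: {speedup:.2f}x throughput vs BF16, {delta_pts:+.1f} pts")
```

Under that assumption the printed deltas match the table: FP16 and FP8 come out at +2.5 pts and NVFP4 at -12.5 pts, with NVFP4 roughly 1.9x the BF16 throughput on this prompt.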