# Qwen3-4B TensorRT-LLM Checkpoint (FP8)

This repository contains a community-converted TensorRT-LLM checkpoint for `Qwen/Qwen3-4B`.

It is a TensorRT-LLM checkpoint-format repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For
This repository is for users who already work with TensorRT-LLM and want a ready-made TensorRT-LLM checkpoint that they can turn into a local engine for their own GPU.
It is not:
- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use

- Download this repository from Hugging Face.
- Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
- Run inference with the engine you built.

The Build Example section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics

- Base model: `Qwen/Qwen3-4B`
- License: `apache-2.0`
- Architecture: `Qwen3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `40960`
- Hidden size: `2560`
- Intermediate size: `9728`
- Layers: `36`
- Attention heads: `32`
- KV heads: `8`
- Vocabulary size: `151936`
These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
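As a rough illustration of what these hyperparameters imply in practice, the sketch below estimates per-token KV-cache memory for the FP8 KV cache described in this README. It assumes the head dimension is `hidden_size / num_attention_heads` (2560 / 32 = 80), which is the usual convention but is not stated explicitly in the upstream config.

```python
# Rough KV-cache size estimate from the hyperparameters listed above.
# Assumption: head_dim = hidden_size / num_attention_heads = 2560 / 32 = 80.
num_layers = 36
num_kv_heads = 8
head_dim = 2560 // 32          # 80
bytes_per_value = 1            # FP8 KV cache stores 1 byte per element

# K and V are each cached per layer, per KV head, per head_dim element.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                 # 46080 bytes, i.e. 45 KiB per token

# At the validated engine's max_seq_len of 1024 tokens:
print(kv_bytes_per_token * 1024 / 2**20)  # 45.0 MiB per sequence
```

This back-of-envelope number helps explain why the FP8 KV cache fits comfortably in the 7.53 GiB budget reported for the benchmark host below.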
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `FP8` (weights and activations)
- KV cache quantization: `FP8`
- Calibration dataset: `cnn_dailymail` (64 samples, max seq length 256)
- Tensor parallel size: `1`
- Checkpoint files: `config.json`, `rank0.safetensors`, plus tokenizer and generation files copied from the upstream Hugging Face model
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
## Build Example

The following command is the validated local engine build used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```shell
huggingface-cli download Shoolife/Qwen3-4B-TensorRT-LLM-Checkpoint-FP8 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```
If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
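After the build succeeds, one common way to smoke-test the engine is the `run.py` example script shipped in the TensorRT-LLM repository. The script location and flags vary between TensorRT-LLM versions, so treat this as a sketch rather than an exact command:

```shell
# Hypothetical smoke test using TensorRT-LLM's examples/run.py script.
# The script path, flag names, and defaults may differ in your
# TensorRT-LLM version; consult the version you built against.
python3 examples/run.py \
  --engine_dir ./engine \
  --tokenizer_dir ./checkpoint \
  --max_output_len 64 \
  --input_text "Explain FP8 quantization in one sentence."
```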
## Quantization

This checkpoint was produced using TensorRT-LLM quantization tooling:

```shell
python quantize.py \
  --model_dir ./Qwen3-4B \
  --output_dir ./checkpoint_fp8 \
  --dtype bfloat16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048
```
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- Runtime: TensorRT-LLM `1.2.0rc6`
## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | 4.9 GB |
| Built engine size | 4.9 GB |
| Tested GPU | NVIDIA GeForce RTX 5070 Laptop GPU |
| GPU memory reported by benchmark host | 7.53 GiB |
| Engine build `max_batch_size` | 1 |
| Engine build `max_input_len` | 512 |
| Engine build `max_seq_len` | 1024 |
| Engine build `max_num_tokens` | 256 |
Important: the limits above (such as `max_seq_len` 1024 and `max_num_tokens` 256) belong only to this particular local engine build; they are not the intrinsic context or generation limits of Qwen3-4B itself, and they will change if you rebuild with different TensorRT-LLM settings and memory budgets.
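A minimal sketch of how these build-time limits constrain request shapes. The constants and function name are illustrative only, not a TensorRT-LLM API:

```python
# Illustrative check against the validated engine build limits above.
# These names are not part of any TensorRT-LLM API.
MAX_INPUT_LEN = 512   # --max_input_len from the build command
MAX_SEQ_LEN = 1024    # --max_seq_len (prompt + generated tokens)

def fits_engine(input_len: int, output_len: int) -> bool:
    """Return True if a request shape fits this particular engine build."""
    return input_len <= MAX_INPUT_LEN and input_len + output_len <= MAX_SEQ_LEN

print(fits_engine(192, 64))    # True  (long_prompt_192_64 profile below)
print(fits_engine(512, 512))   # True  (exactly at both limits)
print(fits_engine(600, 64))    # False (prompt exceeds max_input_len)
```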
## Benchmark Snapshot
Local single-GPU measurements from the validated local engine on RTX 5070 Laptop GPU, using TensorRT-LLM synthetic fixed-length requests, 20 requests per profile, 2 warmup requests, and concurrency=1.
| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---|---|---|---|---|---|
| `tiny_16_32` | 16 | 32 | 18.05 ms | 15.29 ms | 65.0 | 492.2 ms |
| `short_chat_42_64` | 42 | 64 | 17.68 ms | 14.11 ms | 70.6 | 906.8 ms |
| `balanced_128_128` | 128 | 128 | 25.25 ms | 14.72 ms | 67.6 | 1894.7 ms |
| `long_prompt_192_64` | 192 | 64 | 39.44 ms | 15.33 ms | 63.6 | 1005.5 ms |
| `long_generation_42_192` | 42 | 192 | 18.32 ms | 14.47 ms | 69.0 | 2782.6 ms |
These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
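The reported averages are internally consistent: at concurrency 1, average latency should approximately equal TTFT plus TPOT for every decode token after the first. The check below reproduces each profile's reported latency from its TTFT and TPOT to within about 0.1%:

```python
# Sanity check on the benchmark table above:
# avg latency ~= TTFT + TPOT * (output_tokens - 1) at concurrency 1.
profiles = {
    # name: (output tokens, TTFT ms, TPOT ms, reported avg latency ms)
    "tiny_16_32":             (32,  18.05, 15.29, 492.2),
    "short_chat_42_64":       (64,  17.68, 14.11, 906.8),
    "balanced_128_128":       (128, 25.25, 14.72, 1894.7),
    "long_prompt_192_64":     (64,  39.44, 15.33, 1005.5),
    "long_generation_42_192": (192, 18.32, 14.47, 2782.6),
}
for name, (out_tokens, ttft, tpot, reported) in profiles.items():
    estimated = ttft + tpot * (out_tokens - 1)
    # All five profiles agree with the reported average within ~0.1%.
    assert abs(estimated - reported) / reported < 0.001, name
```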
## Quick Parity Check
A quick parity check was run on ARC-Challenge (20 examples) and OpenBookQA (20 examples) to verify that the TensorRT-LLM FP8 engine produces comparable answers to the upstream Hugging Face model.
| Benchmark | HF Accuracy | TRT Accuracy | Agreement |
|---|---|---|---|
| `arc_challenge` | 0.85 | 0.85 | 1.00 |
| `openbookqa` | 0.85 | 0.75 | 0.90 |
| Overall | 0.85 | 0.80 | 0.95 |
The FP8 quantized engine closely tracks the HF baseline with 95% agreement on this subset.
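For clarity on how the Overall row is derived: since both subsets contain 20 examples, it is simply the unweighted mean of the two benchmark rows, as this sketch confirms:

```python
# The "Overall" row above is the unweighted mean of the two benchmarks
# (equal weights are valid here because both subsets have 20 examples).
hf_acc    = {"arc_challenge": 0.85, "openbookqa": 0.85}
trt_acc   = {"arc_challenge": 0.85, "openbookqa": 0.75}
agreement = {"arc_challenge": 1.00, "openbookqa": 0.90}

overall_hf    = sum(hf_acc.values()) / len(hf_acc)        # 0.85
overall_trt   = sum(trt_acc.values()) / len(trt_acc)      # 0.80
overall_agree = sum(agreement.values()) / len(agreement)  # 0.95
print(overall_hf, overall_trt, overall_agree)
```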
## Local Comparison
The table below compares locally validated TensorRT-LLM variants that fit on RTX 5070 Laptop GPU (7.53 GiB). BF16 and FP16 checkpoints (~8.3 GB) exceed the GPU memory for engine building on this hardware.
| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Practical reading |
|---|---|---|---|---|---|---|---|
| FP8 | 4.9 GB | 4.9 GB | 70.6 tok/s | 67.6 tok/s | 69.0 tok/s | 0.80 | Best quality among quantized variants |
| NVFP4 | 3.4 GB | 2.7 GB | 95.8 tok/s | 95.8 tok/s | 96.0 tok/s | 0.775 | ~38% faster, slightly lower quality |
BF16/FP16 checkpoints are available but require a GPU with >8 GB VRAM to build engines.
This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
## Notes
- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.