# Qwen3-4B TensorRT-LLM Checkpoint (FP8)

This repository contains a community-converted TensorRT-LLM checkpoint for `Qwen/Qwen3-4B`.

It is a TensorRT-LLM checkpoint-format repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For
This repository is for users who already work with TensorRT-LLM and want a ready-made TensorRT-LLM checkpoint that they can turn into a local engine for their own GPU.
It is not:
- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use

- Download this repository from Hugging Face.
- Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
- Run inference with the engine you built.

The Build Example section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics

- Base model: `Qwen/Qwen3-4B`
- License: `apache-2.0`
- Architecture: `Qwen3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `40960`
- Hidden size: `2560`
- Intermediate size: `9728`
- Layers: `36`
- Attention heads: `32`
- KV heads: `8`
- Vocabulary size: `151936`
These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
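As a rough illustration of what these hyperparameters imply in practice, the sketch below estimates per-token KV-cache memory for the FP8 KV cache described in this README. It assumes the head dimension is `hidden_size / num_attention_heads` (2560 / 32 = 80), which is the usual convention but is not stated explicitly in the upstream config.

```python
# Rough KV-cache size estimate from the hyperparameters listed above.
# Assumption: head_dim = hidden_size / num_attention_heads = 2560 / 32 = 80.
num_layers = 36
num_kv_heads = 8
head_dim = 2560 // 32          # 80
bytes_per_value = 1            # FP8 KV cache stores 1 byte per element

# K and V are each cached per layer, per KV head, per head_dim element.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                 # 46080 bytes, i.e. 45 KiB per token

# At the validated engine's max_seq_len of 1024 tokens:
print(kv_bytes_per_token * 1024 / 2**20)  # 45.0 MiB per sequence
```

This back-of-envelope number helps explain why the FP8 KV cache fits comfortably in the 7.53 GiB budget reported for the benchmark host below.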
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `FP8` (weights and activations)
- KV cache quantization: `FP8`
- Calibration dataset: `cnn_dailymail` (64 samples, max seq length 256)
- Tensor parallel size: `1`
- Checkpoint files: `config.json`, `rank0.safetensors`, plus tokenizer and generation files copied from the upstream Hugging Face model
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
## Build Example

The following command is the validated local engine build used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```shell
huggingface-cli download Shoolife/Qwen3-4B-TensorRT-LLM-Checkpoint-FP8 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```
If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
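After the build succeeds, one common way to smoke-test the engine is the `run.py` example script shipped in the TensorRT-LLM repository. The script location and flags vary between TensorRT-LLM versions, so treat this as a sketch rather than an exact command:

```shell
# Hypothetical smoke test using TensorRT-LLM's examples/run.py script.
# The script path, flag names, and defaults may differ in your
# TensorRT-LLM version; consult the version you built against.
python3 examples/run.py \
  --engine_dir ./engine \
  --tokenizer_dir ./checkpoint \
  --max_output_len 64 \
  --input_text "Explain FP8 quantization in one sentence."
```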
## Quantization

This checkpoint was produced using TensorRT-LLM quantization tooling:

```shell
python quantize.py \
  --model_dir ./Qwen3-4B \
  --output_dir ./checkpoint_fp8 \
  --dtype bfloat16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048
```
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- Runtime: TensorRT-LLM `1.2.0rc6`
## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | 4.9 GB |
| Built engine size | 4.9 GB |
| Tested GPU | NVIDIA GeForce RTX 5070 Laptop GPU |
| GPU memory reported by benchmark host | 7.53 GiB |
| Engine build `max_batch_size` | 1 |
| Engine build `max_input_len` | 512 |
| Engine build `max_seq_len` | 1024 |
| Engine build `max_num_tokens` | 256 |
Important: the limits above (such as `max_seq_len` 1024 and `max_num_tokens` 256) belong only to this particular local engine build; they are not the intrinsic context or generation limits of Qwen3-4B itself, and they will change if you rebuild with different TensorRT-LLM settings and memory budgets.
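A minimal sketch of how these build-time limits constrain request shapes. The constants and function name are illustrative only, not a TensorRT-LLM API:

```python
# Illustrative check against the validated engine build limits above.
# These names are not part of any TensorRT-LLM API.
MAX_INPUT_LEN = 512   # --max_input_len from the build command
MAX_SEQ_LEN = 1024    # --max_seq_len (prompt + generated tokens)

def fits_engine(input_len: int, output_len: int) -> bool:
    """Return True if a request shape fits this particular engine build."""
    return input_len <= MAX_INPUT_LEN and input_len + output_len <= MAX_SEQ_LEN

print(fits_engine(192, 64))    # True  (long_prompt_192_64 profile below)
print(fits_engine(512, 512))   # True  (exactly at both limits)
print(fits_engine(600, 64))    # False (prompt exceeds max_input_len)
```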
## Benchmark Snapshot
Local single-GPU measurements from the validated local engine on RTX 5070 Laptop GPU, using TensorRT-LLM synthetic fixed-length requests, 20 requests per profile, 2 warmup requests, and concurrency=1.
| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---|---|---|---|---|---|
| `tiny_16_32` | 16 | 32 | 18.05 ms | 15.29 ms | 65.0 | 492.2 ms |
| `short_chat_42_64` | 42 | 64 | 17.68 ms | 14.11 ms | 70.6 | 906.8 ms |
| `balanced_128_128` | 128 | 128 | 25.25 ms | 14.72 ms | 67.6 | 1894.7 ms |
| `long_prompt_192_64` | 192 | 64 | 39.44 ms | 15.33 ms | 63.6 | 1005.5 ms |
| `long_generation_42_192` | 42 | 192 | 18.32 ms | 14.47 ms | 69.0 | 2782.6 ms |
These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
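The reported averages are internally consistent: at concurrency 1, average latency should approximately equal TTFT plus TPOT for every decode token after the first. The check below reproduces each profile's reported latency from its TTFT and TPOT to within about 0.1%:

```python
# Sanity check on the benchmark table above:
# avg latency ~= TTFT + TPOT * (output_tokens - 1) at concurrency 1.
profiles = {
    # name: (output tokens, TTFT ms, TPOT ms, reported avg latency ms)
    "tiny_16_32":             (32,  18.05, 15.29, 492.2),
    "short_chat_42_64":       (64,  17.68, 14.11, 906.8),
    "balanced_128_128":       (128, 25.25, 14.72, 1894.7),
    "long_prompt_192_64":     (64,  39.44, 15.33, 1005.5),
    "long_generation_42_192": (192, 18.32, 14.47, 2782.6),
}
for name, (out_tokens, ttft, tpot, reported) in profiles.items():
    estimated = ttft + tpot * (out_tokens - 1)
    # All five profiles agree with the reported average within ~0.1%.
    assert abs(estimated - reported) / reported < 0.001, name
```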
## Quick Parity Check
A quick parity check was run on ARC-Challenge (20 examples) and OpenBookQA (20 examples) to verify that the TensorRT-LLM FP8 engine produces comparable answers to the upstream Hugging Face model.
| Benchmark | HF Accuracy | TRT Accuracy | Agreement |
|---|---|---|---|
| `arc_challenge` | 0.85 | 0.85 | 1.00 |
| `openbookqa` | 0.85 | 0.75 | 0.90 |
| Overall | 0.85 | 0.80 | 0.95 |
The FP8 quantized engine closely tracks the HF baseline with 95% agreement on this subset.
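For clarity on how the Overall row is derived: since both subsets contain 20 examples, it is simply the unweighted mean of the two benchmark rows, as this sketch confirms:

```python
# The "Overall" row above is the unweighted mean of the two benchmarks
# (equal weights are valid here because both subsets have 20 examples).
hf_acc    = {"arc_challenge": 0.85, "openbookqa": 0.85}
trt_acc   = {"arc_challenge": 0.85, "openbookqa": 0.75}
agreement = {"arc_challenge": 1.00, "openbookqa": 0.90}

overall_hf    = sum(hf_acc.values()) / len(hf_acc)        # 0.85
overall_trt   = sum(trt_acc.values()) / len(trt_acc)      # 0.80
overall_agree = sum(agreement.values()) / len(agreement)  # 0.95
print(overall_hf, overall_trt, overall_agree)
```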
## Local Comparison
The table below compares locally validated TensorRT-LLM variants that fit on RTX 5070 Laptop GPU (7.53 GiB). BF16 and FP16 checkpoints (~8.3 GB) exceed the GPU memory for engine building on this hardware.
| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Practical reading |
|---|---|---|---|---|---|---|---|
| FP8 | 4.9 GB | 4.9 GB | 70.6 tok/s | 67.6 tok/s | 69.0 tok/s | 0.80 | Best quality among quantized variants |
| NVFP4 | 3.4 GB | 2.7 GB | 95.8 tok/s | 95.8 tok/s | 96.0 tok/s | 0.775 | ~38% faster, slightly lower quality |
BF16/FP16 checkpoints are available but require a GPU with >8 GB VRAM to build engines.
This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
## Notes
- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.