# Qwen3-0.6B TensorRT-LLM Checkpoint (FP8)
This repository contains a community-converted TensorRT-LLM checkpoint for Qwen/Qwen3-0.6B.
It is a TensorRT-LLM checkpoint-format repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For
This repository is for users who already work with TensorRT-LLM and want a ready-made TensorRT-LLM checkpoint that they can turn into a local engine for their own GPU.
It is not:
- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use

- Download this repository from Hugging Face.
- Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
- Run inference with the engine you built.
The Build Example section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics

- Base model: `Qwen/Qwen3-0.6B`
- License: `apache-2.0`
- Architecture: `Qwen3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `40960`
- Hidden size: `1024`
- Intermediate size: `3072`
- Layers: `28`
- Attention heads: `16`
- KV heads: `8`
- Vocabulary size: `151936`
These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
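As a rough sanity check, the dimensions listed above approximately reproduce the advertised 0.6B parameter count. One assumption not stated in this README: Qwen3 sets `head_dim` explicitly to 128 in its upstream config, decoupled from `hidden_size / num_attention_heads`; the sketch also assumes tied input/output embeddings and ignores the small norm weights.

```python
# Rough parameter count from the configuration listed above. Assumes
# head_dim = 128 (Qwen3 sets this explicitly; it is NOT hidden/heads)
# and tied input/output embeddings; small norm weights are ignored.
hidden, intermediate, layers = 1024, 3072, 28
heads, kv_heads, head_dim = 16, 8, 128
vocab = 151936

attn = (hidden * heads * head_dim            # Q projection
        + 2 * hidden * kv_heads * head_dim   # K and V projections
        + heads * head_dim * hidden)         # output projection
mlp = 3 * hidden * intermediate              # gate, up, and down projections
embedding = vocab * hidden                   # shared with the LM head when tied

total = layers * (attn + mlp) + embedding
print(f"~{total / 1e9:.2f}B parameters")     # → ~0.60B parameters
```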
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `FP8` (weights and activations)
- KV cache quantization: `FP8`
- Calibration dataset: `cnn_dailymail` (64 samples, max seq length 256)
- Tensor parallel size: `1`
- Checkpoint files: `config.json`, `rank0.safetensors`, plus tokenizer and generation files copied from the upstream Hugging Face model
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
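A quick way to confirm a download is complete is to check for the files listed above. A minimal sketch (the `missing_files` helper is illustrative, not part of any TensorRT-LLM or Hugging Face API):

```python
from pathlib import Path

# The files this repository is documented to contain.
EXPECTED = {
    "config.json", "rank0.safetensors", "generation_config.json",
    "tokenizer.json", "tokenizer_config.json", "merges.txt", "vocab.json",
}

def missing_files(checkpoint_dir: str) -> list[str]:
    """List expected checkpoint files that are absent from checkpoint_dir."""
    present = {p.name for p in Path(checkpoint_dir).iterdir()}
    return sorted(EXPECTED - present)

# Against an empty scratch directory, everything is reported missing.
import tempfile
assert missing_files(tempfile.mkdtemp()) == sorted(EXPECTED)
```

After downloading into `./checkpoint`, `missing_files("./checkpoint")` returning an empty list confirms the download is complete.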
## Build Example
The following command is the validated local engine build used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.
Build an engine locally with TensorRT-LLM:
```bash
huggingface-cli download Shoolife/Qwen3-0.6B-TensorRT-LLM-Checkpoint-FP8 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```
If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
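The KV-cache cost of those limits can be estimated from the model shape. A back-of-the-envelope sketch, assuming `head_dim = 128` (the value Qwen3 sets explicitly in its upstream config, not `hidden_size / num_heads`) and 1 byte per element for the FP8 KV cache:

```python
# Per-token KV-cache cost, assuming head_dim = 128 (Qwen3's upstream
# value, set explicitly in its config) and 1 byte/element for FP8.
layers, kv_heads, head_dim = 28, 8, 128
bytes_per_elem = 1                       # FP8 KV cache
max_batch_size, max_seq_len = 1, 1024    # limits from the build above

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K plus V
kv_total = max_batch_size * max_seq_len * per_token

print(f"{per_token} bytes/token, ~{kv_total / 2**20:.0f} MiB at the build limits")
# → 57344 bytes/token, ~56 MiB at the build limits
# The same arithmetic at the full 40960-token upstream context gives ~2.2 GiB.
```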
## Quantization
This checkpoint was produced using TensorRT-LLM quantization tooling:
```bash
python quantize.py \
  --model_dir ./Qwen3-0.6B \
  --output_dir ./checkpoint_fp8 \
  --dtype bfloat16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048
```
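Conceptually, FP8 quantization of this kind scales each tensor so its largest magnitude maps near the FP8 E4M3 maximum of 448, then rounds values onto the coarse FP8 grid; the calibration pass estimates those maxima for activations. The following self-contained sketch illustrates only the scale-and-round idea, it is an idealized model and not TensorRT-LLM's actual implementation (which handles subnormals, exponent limits, and calibration details):

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def quantize_e4m3(x: float, scale: float) -> float:
    """Scale x into FP8 range and round to a 3-mantissa-bit grid.
    Simplified: ignores subnormals and the E4M3 exponent floor."""
    v = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
    if v == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(v)))
    step = 2.0 ** (exp - 3)        # 3 mantissa bits: 8 steps per binade
    return round(v / step) * step

weights = [0.013, -0.4, 2.7, -31.0]                # toy tensor
scale = max(abs(w) for w in weights) / E4M3_MAX    # per-tensor amax scaling
dequantized = [quantize_e4m3(w, scale) * scale for w in weights]
```

With 3 mantissa bits the worst-case relative rounding error is about 2⁻⁴ (6.25%), which is why FP8 weights stay close to the original values on this toy tensor.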
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: `NVIDIA GeForce RTX 5070 Laptop GPU`
- Runtime: TensorRT-LLM `1.2.0rc6`
## Validated Local Engine Characteristics
Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:
| Property | Value |
|---|---|
| Checkpoint size | 1014 MB |
| Built engine size | 1.1 GB |
| Tested GPU | NVIDIA GeForce RTX 5070 Laptop GPU |
| GPU memory reported by benchmark host | 7.53 GiB |
| Engine build `max_batch_size` | 1 |
| Engine build `max_input_len` | 512 |
| Engine build `max_seq_len` | 1024 |
| Engine build `max_num_tokens` | 256 |
Important: the 1024 / 256 limits above belong only to this particular local engine build, not to Qwen3-0.6B itself, and they will change if you rebuild with different TensorRT-LLM settings and memory budgets.
## Benchmark Snapshot
Local single-GPU measurements from the validated local engine on the RTX 5070 Laptop GPU, using TensorRT-LLM synthetic fixed-length requests: 20 requests per profile, 2 warmup requests, concurrency=1. TTFT is time to first token; TPOT is time per output token.
| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---|---|---|---|---|---|
| `tiny_16_32` | 16 | 32 | 5.06 ms | 2.99 ms | 327.55 | 97.67 ms |
| `short_chat_42_64` | 42 | 64 | 5.44 ms | 3.01 ms | 327.84 | 195.19 ms |
| `balanced_128_128` | 128 | 128 | 5.75 ms | 3.02 ms | 329.16 | 388.84 ms |
| `long_prompt_192_64` | 192 | 64 | 6.54 ms | 3.02 ms | 325.29 | 196.72 ms |
| `long_generation_42_192` | 42 | 192 | 5.44 ms | 3.01 ms | 330.29 | 581.28 ms |
These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
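The snapshot is internally consistent: for every profile, average latency closely matches TTFT plus TPOT for each remaining output token. A quick check against the table values:

```python
# (TTFT ms, TPOT ms, output tokens, reported avg latency ms) per profile
profiles = {
    "tiny_16_32":             (5.06, 2.99, 32,  97.67),
    "short_chat_42_64":       (5.44, 3.01, 64,  195.19),
    "balanced_128_128":       (5.75, 3.02, 128, 388.84),
    "long_prompt_192_64":     (6.54, 3.02, 64,  196.72),
    "long_generation_42_192": (5.44, 3.01, 192, 581.28),
}

errors = {}
for name, (ttft, tpot, n_out, reported) in profiles.items():
    predicted = ttft + (n_out - 1) * tpot  # the first token is the TTFT
    errors[name] = abs(predicted - reported) / reported

assert max(errors.values()) < 0.005  # every profile agrees within 0.5%
```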
## Quick Parity Check
A quick parity check was run on ARC-Challenge (20 examples) and OpenBookQA (20 examples) to verify that the TensorRT-LLM FP8 engine produces comparable answers to the upstream Hugging Face model.
| Benchmark | HF Accuracy | TRT Accuracy | Agreement |
|---|---|---|---|
| `arc_challenge` | 0.55 | 0.60 | 0.85 |
| `openbookqa` | 0.65 | 0.55 | 0.80 |
| Overall | 0.60 | 0.575 | 0.825 |
The FP8 quantized engine shows minor variance from the HF baseline on this small subset.
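Since both subsets contain 20 examples, the overall row is simply the unweighted mean of the two benchmark rows, which a quick check confirms:

```python
# (HF accuracy, TRT accuracy, agreement) for each 20-example subset
rows = {
    "arc_challenge": (0.55, 0.60, 0.85),
    "openbookqa":    (0.65, 0.55, 0.80),
}

# Equal sample counts make the micro- and macro-averages identical.
overall = [round(sum(col) / len(rows), 3) for col in zip(*rows.values())]
print(overall)  # → [0.6, 0.575, 0.825]
```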
## Local Comparison
The table below compares locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (max_batch_size=1, max_seq_len=1024, max_num_tokens=256).
| Variant | Checkpoint size | Engine size | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Quick-check change vs BF16 | Practical reading |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1.5 GB | 1.5 GB | 239.49 tok/s | 238.27 tok/s | 239.96 tok/s | 0.60 | baseline | Native precision, best numerical stability |
| FP16 | 1.5 GB | 1.5 GB | 239.53 tok/s | 238.39 tok/s | 239.94 tok/s | 0.60 | same | Equivalent precision, identical results |
| FP8 | 1014 MB | 1.1 GB | 327.84 tok/s | 329.16 tok/s | 330.29 tok/s | 0.575 | -2.5 pts on this quick-check | ~37% faster, minor quality variance |
| NVFP4 | 830 MB | 567 MB | 310.09 tok/s | 311.15 tok/s | 312.89 tok/s | 0.45 | -15 pts on this quick-check | Smallest and lightest, but with a visible quality drop |
This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
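The "~37% faster" reading follows directly from the throughput columns: dividing FP8 by BF16 tok/s per profile gives a consistent ratio.

```python
# Output tok/s from the comparison table above.
bf16 = {"short_chat_42_64": 239.49, "balanced_128_128": 238.27,
        "long_generation_42_192": 239.96}
fp8 = {"short_chat_42_64": 327.84, "balanced_128_128": 329.16,
       "long_generation_42_192": 330.29}

speedups = {name: fp8[name] / bf16[name] for name in bf16}
assert all(1.35 < s < 1.40 for s in speedups.values())  # ~37-38% on each profile
```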
## Notes
- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.