# Qwen3-1.7B TensorRT-LLM Checkpoint (FP16)
This repository contains a community-converted TensorRT-LLM checkpoint for Qwen/Qwen3-1.7B.
It is a TensorRT-LLM checkpoint-format repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For
This repository is for users who already work with TensorRT-LLM and want a ready-made TensorRT-LLM checkpoint that they can turn into a local engine for their own GPU.
It is not:
- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use
1. Download this repository from Hugging Face.
2. Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
3. Run inference with the engine you built.
The Build Example section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics
- Base model: `Qwen/Qwen3-1.7B`
- License: `apache-2.0`
- Architecture: `Qwen3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `40960`
- Hidden size: `2048`
- Intermediate size: `6144`
- Layers: `28`
- Attention heads: `16`
- KV heads: `8`
- Vocabulary size: `151936`
These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
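As a quick sanity check for memory planning, the configuration values above can be plugged into the standard K+V cache-size formula (one K and one V tensor per layer, `num_kv_heads × head_dim` values each, 2 bytes per value in FP16). This is a generic back-of-envelope sketch, not output from any TensorRT-LLM tool; `head_dim = hidden_size / num_attention_heads = 128` is an assumption derived from the config above.

```python
# Hedged sketch: estimate FP16 KV-cache memory per token from the model
# characteristics listed above (generic formula, not a TensorRT-LLM utility).
num_layers = 28
num_kv_heads = 8
hidden_size = 2048
num_attention_heads = 16
head_dim = hidden_size // num_attention_heads  # 128, assumed from the config

bytes_per_value = 2  # float16
# Leading factor of 2: one K cache and one V cache per layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)         # 114688 bytes
print(kv_bytes_per_token / 1024)  # 112.0 KiB per token
```

With grouped-query attention (8 KV heads vs 16 attention heads), this is half of what a model with full multi-head KV caching at the same hidden size would need.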
## Checkpoint Details
- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `float16`
- Quantization: none
- KV cache quantization: none
- Tensor parallel size: `1`
- Checkpoint files: `config.json`, `rank0.safetensors`, plus tokenizer and generation files copied from the upstream Hugging Face model
For this FP16 checkpoint, `float16` is the primary checkpoint dtype; no separate low-bit quantization recipe is applied.
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
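Before attempting an engine build, it can be worth confirming the download is complete. A minimal sketch (the file list comes from this README; the helper name and directory argument are illustrative, not part of any tooling):

```python
# Hedged sketch: check that a downloaded checkpoint directory contains
# the files listed above before attempting an engine build.
from pathlib import Path

EXPECTED_FILES = [
    "config.json",            # TensorRT-LLM checkpoint config
    "rank0.safetensors",      # TensorRT-LLM checkpoint weights
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "merges.txt",
    "vocab.json",
]

def missing_files(checkpoint_dir: str) -> list[str]:
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_files("./checkpoint")  # path from the download step
    if missing:
        print("incomplete download, missing:", missing)
    else:
        print("all expected files present")
```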
## Build Example
The following command is the validated local engine build used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.
Build an engine locally with TensorRT-LLM:
```bash
huggingface-cli download Shoolife/Qwen3-1.7B-TensorRT-LLM-Checkpoint-FP16 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```
If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
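One way to reason about those limits is the FP16 KV-cache footprint they imply. A rough back-of-envelope using the standard K+V formula (assuming `head_dim = 128` from the upstream config; this is an estimate, not a TensorRT-LLM memory report, and ignores weights, activations, and allocator overhead):

```python
# Hedged back-of-envelope: FP16 KV-cache upper bound implied by the
# engine build limits above (not a TensorRT-LLM memory report).
num_layers, num_kv_heads, head_dim = 28, 8, 128  # head_dim assumed: 2048 / 16
bytes_per_value = 2  # float16

max_batch_size = 1
max_seq_len = 1024

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_total = max_batch_size * max_seq_len * kv_bytes_per_token
print(f"{kv_total / 2**20:.0f} MiB")  # 112 MiB for one full-length sequence
```

On a 7.53 GiB host like the validated laptop GPU, this leaves most of the memory budget for the 3.9 GB engine weights and runtime buffers, which is why these conservative limits were chosen.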
## Conversion
This checkpoint was produced from the upstream model with TensorRT-LLM Qwen conversion tooling:
```bash
python convert_checkpoint.py \
  --model_dir ./Qwen3-1.7B \
  --output_dir ./checkpoint_fp16 \
  --dtype float16
```
## Validation
The checkpoint was validated by building a local engine and running inference on:
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- Runtime: TensorRT-LLM `1.2.0rc6`
## Validated Local Engine Characteristics
Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:
| Property | Value |
|---|---|
| Checkpoint size | 3.9 GB |
| Built engine size | 3.9 GB |
| Tested GPU | NVIDIA GeForce RTX 5070 Laptop GPU |
| GPU memory reported by benchmark host | 7.53 GiB |
| Engine build `max_batch_size` | 1 |
| Engine build `max_input_len` | 512 |
| Engine build `max_seq_len` | 1024 |
| Engine build `max_num_tokens` | 256 |
Important: the `max_seq_len=1024` / `max_num_tokens=256` limits above belong only to this particular local engine build. They are not intrinsic context or generation limits of Qwen3-1.7B itself, and they will change if you rebuild with different TensorRT-LLM settings and memory budgets.
## Benchmark Snapshot
Local single-GPU measurements from the validated local engine on RTX 5070 Laptop GPU, using TensorRT-LLM synthetic fixed-length requests, 20 requests per profile, 2 warmup requests, and concurrency=1.
| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---|---|---|---|---|---|
| `tiny_16_32` | 16 | 32 | 12.75 ms | 10.10 ms | 98.2 | 325.9 ms |
| `short_chat_42_64` | 42 | 64 | 13.21 ms | 10.11 ms | 98.5 | 650.0 ms |
| `balanced_128_128` | 128 | 128 | 14.18 ms | 10.21 ms | 97.7 | 1310.6 ms |
| `long_prompt_192_64` | 192 | 64 | 18.28 ms | 10.27 ms | 96.2 | 665.4 ms |
| `long_generation_42_192` | 42 | 192 | 13.21 ms | 10.15 ms | 98.4 | 1952.2 ms |
These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
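The table is internally consistent under the usual latency decomposition: average latency is approximately TTFT plus TPOT times the remaining output tokens. A small check (values copied from the table; the `TTFT + TPOT × (out − 1)` decomposition is an assumed convention, not documented output of the benchmark tool):

```python
# Hedged consistency check: avg latency ≈ TTFT + TPOT * (output_tokens - 1).
# The decomposition is an assumed convention; values are copied from the table.
profiles = {
    #                         (out, TTFT ms, TPOT ms, reported avg ms)
    "tiny_16_32":             (32,  12.75, 10.10,  325.9),
    "short_chat_42_64":       (64,  13.21, 10.11,  650.0),
    "balanced_128_128":       (128, 14.18, 10.21, 1310.6),
    "long_prompt_192_64":     (64,  18.28, 10.27,  665.4),
    "long_generation_42_192": (192, 13.21, 10.15, 1952.2),
}
for name, (out, ttft, tpot, avg) in profiles.items():
    predicted = ttft + tpot * (out - 1)
    print(f"{name}: predicted {predicted:.1f} ms vs reported {avg} ms")
```

Every profile reconstructs to within about a millisecond of the reported average, which suggests the averages were measured rather than derived.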
## Quick Parity Check
A quick parity check was run on ARC-Challenge (20 examples) and OpenBookQA (20 examples) to verify that the TensorRT-LLM FP16 engine produces the same answers as the upstream Hugging Face model.
| Benchmark | HF Accuracy | TRT Accuracy | Agreement |
|---|---|---|---|
| `arc_challenge` | 0.75 | 0.75 | 1.00 |
| `openbookqa` | 0.60 | 0.60 | 1.00 |
| Overall | 0.675 | 0.675 | 1.00 |
The TRT FP16 engine matches the HF baseline with 100% agreement on this subset.
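The overall row follows directly from the two subsets: it is the example-weighted mean of the per-benchmark accuracies (plain arithmetic on values copied from the table):

```python
# Hedged arithmetic check: the "Overall" row is the example-weighted mean
# of the two 20-example subsets reported above.
subsets = {"arc_challenge": (20, 0.75), "openbookqa": (20, 0.60)}
total_examples = sum(n for n, _ in subsets.values())
overall = sum(n * acc for n, acc in subsets.values()) / total_examples
print(overall)  # 0.675
```

Note that 40 examples is a smoke test for conversion correctness, not a statistically meaningful accuracy evaluation.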
## Local Comparison
The table below compares locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (max_batch_size=1, max_seq_len=1024, max_num_tokens=256).
| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Quick-check change vs BF16 | Practical reading |
|---|---|---|---|---|---|---|---|---|
| BF16 | 3.9 GB | 3.9 GB | 98.5 tok/s | 97.7 tok/s | 98.4 tok/s | 0.675 | baseline | Native precision, best numerical stability |
| FP16 | 3.9 GB | 3.9 GB | 98.5 tok/s | 97.7 tok/s | 98.4 tok/s | 0.675 | same | Equivalent precision, identical results |
| FP8 | 2.5 GB | 2.6 GB | 131.6 tok/s | 132.2 tok/s | 138.7 tok/s | 0.675 | same | ~35% faster, same accuracy on this subset |
| NVFP4 | 1.9 GB | 1.4 GB | 176.5 tok/s | 176.8 tok/s | 177.2 tok/s | 0.25 | -42.5 pts on this quick-check | Fastest and smallest, but severe quality drop |
This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
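The "~35% faster" reading for FP8 is just the ratio of the throughput columns against FP16 (plain arithmetic on values copied from the table above):

```python
# Hedged arithmetic check behind the "~35% faster" reading for FP8, using
# the short_chat / balanced / long_generation throughput columns above.
fp16_toks = [98.5, 97.7, 98.4]
fp8_toks = [131.6, 132.2, 138.7]
speedups = [fp8 / fp16 for fp16, fp8 in zip(fp16_toks, fp8_toks)]
print([f"{s:.2f}x" for s in speedups])
print(f"mean speedup: {sum(speedups) / len(speedups):.2f}x")
```

The per-profile speedups range from roughly 1.34x to 1.41x on this machine, consistent with the ~35% summary; the NVFP4 row shows the same arithmetic would give ~1.8x, but at a large quality cost on this quick check.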
## Notes
- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.