# Qwen2.5-1.5B-Instruct TensorRT-LLM Checkpoint (BF16)
This repository contains a community-converted TensorRT-LLM checkpoint for Qwen/Qwen2.5-1.5B-Instruct.
It is a TensorRT-LLM checkpoint-format repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For
This repository is for users who already work with TensorRT-LLM and want a ready-made checkpoint they can turn into a local engine for their own GPU.
It is not:
- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use
- Download this repository from Hugging Face.
- Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
- Run inference with the engine you built.
The Build Example section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics
- Base model: `Qwen/Qwen2.5-1.5B-Instruct`
- License: `apache-2.0`
- Architecture: `Qwen2ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `32768`
- Hidden size: `1536`
- Intermediate size: `8960`
- Layers: `28`
- Attention heads: `12`
- KV heads: `2`
- Vocabulary size: `151936`
These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
## Checkpoint Details
- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: none
- KV cache quantization: none
- Tensor parallel size: `1`
- Checkpoint files: `config.json`, `rank0.safetensors`, plus tokenizer and generation files copied from the upstream Hugging Face model
This BF16 checkpoint stores weights natively in bfloat16; no separate low-bit quantization recipe is applied.
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
## Build Example
The following commands are the validated local engine build used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.
Build an engine locally with TensorRT-LLM:
```bash
huggingface-cli download Shoolife/Qwen2.5-1.5B-Instruct-TensorRT-LLM-Checkpoint-BF16 --local-dir ./checkpoint

trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --max_batch_size 1 \
    --max_input_len 512 \
    --max_seq_len 1024 \
    --max_num_tokens 256 \
    --workers 1 \
    --monitor_memory
```
If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
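Once built, one way to smoke-test the engine is with the `run.py` example script that ships in the TensorRT-LLM repository. This is a minimal sketch: the script location and flag names can vary between TensorRT-LLM versions, so adjust it to match your installation.

```bash
# Minimal sketch: smoke-test the engine with the TensorRT-LLM example runner.
# Script path and flags vary by TensorRT-LLM version; adjust as needed.
python3 examples/run.py \
    --engine_dir ./engine \
    --tokenizer_dir ./checkpoint \
    --input_text "Briefly explain what a TensorRT engine is." \
    --max_output_len 64
```

Keep the prompt within this engine's runtime input budget (256 tokens on the validated build) and choose `--max_output_len` so the total sequence stays under the build's `max_seq_len` of 1024.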
## Conversion
This checkpoint was produced from the upstream model with TensorRT-LLM Qwen conversion tooling:
```bash
python convert_checkpoint.py \
    --model_dir ./Qwen2.5-1.5B-Instruct \
    --output_dir ./checkpoint_bf16 \
    --dtype bfloat16
```
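The `convert_checkpoint.py` script comes from the Qwen example directory of the TensorRT-LLM repository (its exact path varies by version), and it assumes the upstream model has already been downloaded locally, for example:

```bash
# Sketch: fetch the upstream Hugging Face model before conversion.
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ./Qwen2.5-1.5B-Instruct
```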
## Validation
The checkpoint was validated by building a local engine and running inference on:
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- Runtime: TensorRT-LLM `1.2.0rc6`
## Validated Local Engine Characteristics
Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:
| Property | Value |
|---|---|
| Checkpoint size | 3.4 GB |
| Built engine size | 3.4 GB |
| Tested GPU | NVIDIA GeForce RTX 5070 Laptop GPU |
| GPU memory reported by benchmark host | 7.53 GiB |
| Engine build `max_batch_size` | 1 |
| Engine build `max_input_len` | 512 |
| Engine build `max_seq_len` | 1024 |
| Engine build `max_num_tokens` | 256 |
| Runtime effective max input length | 256 |
| Engine load footprint | ~3.4 GiB |
| Paged KV cache allocation | ~3.36 GiB |
| Practical total GPU footprint on this setup | ~6.8-7.0 GiB |
Important: the 1024 / 256 limits above apply only to this particular local engine build. They are not the intrinsic maximum context or generation limits of Qwen2.5-1.5B-Instruct itself.
The runtime effective input length dropped to 256 on this build because TensorRT-LLM enabled packed inputs and context FMHA, which clamps the usable prompt budget to the engine's `max_num_tokens` budget.
These values are specific to the local engine build used for validation and will change if you rebuild with different TensorRT-LLM settings and memory budgets.
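For example, if you want the full built `max_input_len` of 512 to be usable at runtime, rebuilding with a token budget at least as large as the input length should lift the clamp. This is an illustrative sketch, not a configuration validated on this setup:

```bash
# Illustrative sketch (not validated here): raise max_num_tokens so the
# packed-input token budget no longer clamps the usable prompt length.
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine_wider \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --max_batch_size 1 \
    --max_input_len 512 \
    --max_seq_len 1024 \
    --max_num_tokens 512
```

A larger token budget generally increases activation memory at build and run time, so verify it still fits alongside the KV cache on an 8 GiB-class GPU.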
## Benchmark Snapshot
Local single-GPU measurements from the validated local engine on RTX 5070 Laptop GPU, using TensorRT-LLM synthetic fixed-length requests, 20 requests per profile, 2 warmup requests, and concurrency=1.
| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---|---|---|---|---|---|
| `tiny_16_32` | 16 | 32 | 12.01 ms | 9.47 ms | 104.72 | 305.54 ms |
| `short_chat_42_64` | 42 | 64 | 12.36 ms | 9.45 ms | 105.27 | 607.96 ms |
| `balanced_128_128` | 128 | 128 | 13.32 ms | 9.46 ms | 105.37 | 1214.79 ms |
| `long_prompt_192_64` | 192 | 64 | 16.57 ms | 9.46 ms | 104.43 | 612.84 ms |
| `long_generation_42_192` | 42 | 192 | 12.19 ms | 9.46 ms | 105.57 | 1818.66 ms |
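As a rough consistency check, each profile's average latency decomposes as TTFT plus (output tokens - 1) × TPOT: for `tiny_16_32`, 12.01 ms + 31 × 9.47 ms ≈ 305.6 ms, which matches the measured 305.54 ms. The near-constant TPOT of ~9.46 ms across profiles is also what produces the flat ~105 tok/s output rate.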
These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
## Quick Parity Check
A quick parity check was run on ARC-Challenge (20 examples) and OpenBookQA (20 examples) to verify that the TensorRT-LLM BF16 engine produces the same answers as the upstream Hugging Face model.
| Benchmark | HF Accuracy | TRT Accuracy | Agreement |
|---|---|---|---|
| `arc_challenge` | 0.65 | 0.65 | 1.00 |
| `openbookqa` | 0.80 | 0.80 | 1.00 |
| Overall | 0.725 | 0.725 | 1.00 |
The TRT BF16 engine matches the HF baseline with 100% agreement on this subset.
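Agreement here is the fraction of examples where the TRT engine and the HF baseline select the same answer choice, independent of correctness. With only 20 examples per task, a single flipped answer would register as a 0.05 drop, so treat this as a smoke test rather than a full evaluation.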
## Local Comparison
The table below compares locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=1024`, `max_num_tokens=256`).
| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Quick-check change vs BF16 | Practical reading |
|---|---|---|---|---|---|---|---|---|
| BF16 | 3.4 GB | 3.4 GB | 105.27 tok/s | 105.37 tok/s | 105.57 tok/s | 0.725 | baseline | Native precision, best numerical stability |
| FP16 | 3.4 GB | 3.4 GB | 105.48 tok/s | 105.49 tok/s | 105.70 tok/s | 0.75 | +2.5 pts on this quick-check | Most conservative variant |
| FP8 | 2.1 GB | 2.2 GB | 166.72 tok/s | 144.37 tok/s | 151.36 tok/s | 0.75 | +2.5 pts on this quick-check | Best balance in these local tests |
| NVFP4 | 1.6 GB | 1.2 GB | 199.58 tok/s | 200.09 tok/s | 200.28 tok/s | 0.60 | -12.5 pts on this quick-check | Fastest and smallest, but with visible quality drop |
This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.
## Notes
- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.