---
license: mit
base_model: microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
library_name: tensorrt-llm
tags:
- phi
- phi-4
- tensorrt-llm
- text-generation
- fp8
- checkpoint
---

# Phi-4-mini-instruct TensorRT-LLM Checkpoint (FP8)
This repository contains a community-converted TensorRT-LLM checkpoint for [`microsoft/Phi-4-mini-instruct`](https://huggingface.co/microsoft/Phi-4-mini-instruct).

It is a TensorRT-LLM **checkpoint-format** repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For

This repository is for users who already work with TensorRT-LLM and want a ready-made **TensorRT-LLM checkpoint** they can turn into a local engine for their own GPU.

It is **not**:

- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use

1. Download this repository from Hugging Face.
2. Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
3. Run inference with the engine you built.

The `Build Example` section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics

- Base model: `microsoft/Phi-4-mini-instruct`
- License: `mit`
- Architecture: `Phi3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `131072`
- Hidden size: `3072`
- Intermediate size: `8192`
- Layers: `32`
- Attention heads: `24`
- KV heads: `8`
- Vocabulary size: `200064`

These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
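The characteristics above imply a head dimension of `3072 / 24 = 128`, and with grouped-query attention only the `8` KV heads are cached. A back-of-the-envelope sketch of the per-token KV-cache cost under FP8; this is my arithmetic from the config values, not a number reported by TensorRT-LLM:

```python
# Per-token KV-cache footprint implied by the model characteristics above,
# assuming an FP8 KV cache (1 byte per stored value).
layers = 32
kv_heads = 8
head_dim = 3072 // 24          # hidden_size / attention_heads
bytes_per_value = 1            # FP8

# Both K and V are cached, hence the factor of 2.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                # 65536 bytes = 64 KiB per token
print(kv_bytes_per_token * 512 / 2**20)  # 32.0 MiB for one full 512-token sequence
```

Note that the much larger paged KV-cache allocation reported later in this README is a pre-allocated runtime pool, not per-sequence usage.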
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `FP8`
- KV cache quantization: `FP8`
- Tensor parallel size: `1`
- Checkpoint files:
  - `config.json`
  - `rank0.safetensors`
  - tokenizer and generation files copied from the upstream Hugging Face model
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
- `added_tokens.json`: upstream added tokens
- `special_tokens_map.json`: upstream special tokens map
## Build Example

The following command is the **validated local engine build** used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```bash
huggingface-cli download Shoolife/Phi-4-mini-instruct-TensorRT-LLM-Checkpoint-FP8 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```

If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
## Conversion

This checkpoint was produced from the upstream model with TensorRT-LLM FP8 quantization tooling:

```bash
python /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir ./Phi-4-mini-instruct \
  --output_dir ./checkpoint_fp8 \
  --dtype bfloat16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048 \
  --device cpu \
  --device_map cpu
```

Then build the engine:

```bash
trtllm-build \
  --checkpoint_dir ./checkpoint_fp8 \
  --output_dir ./engine_fp8 \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: `NVIDIA GeForce RTX 5070 Laptop GPU`
- Runtime: `TensorRT-LLM 1.2.0rc6`

Smoke-test prompt:

```text
Explain the four basic arithmetic operations in one short sentence each.
```

Observed response:

```text
Addition combines two or more numbers into a larger total. Subtraction removes one number from another to find the difference. Multiplication calculates the total of repeated addition. Division splits a number into equal parts.
```
## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | `5.3 GB` |
| Built engine size | `5.4 GB` |
| Tested GPU | `NVIDIA GeForce RTX 5070 Laptop GPU` |
| GPU memory reported by benchmark host | `7.53 GiB` |
| Engine build `max_batch_size` | `1` |
| Engine build `max_input_len` | `256` |
| Engine build `max_seq_len` | `512` |
| Engine build `max_num_tokens` | `128` |
| Runtime effective max input length | `128` |
| Engine load footprint | `~5.5 GiB` |
| Paged KV cache allocation | `~1.39 GiB` |
| Practical total GPU footprint on this setup | `~6.9-7.0 GiB` |

Important: the `256` / `512` / `128` limits above belong only to this particular local engine build. They are not the intrinsic maximum context or generation limits of `Phi-4-mini-instruct` itself.

The runtime effective input length dropped to `128` on this build because TensorRT-LLM enabled packed input and context FMHA, which clamps the usable prompt budget to the engine token budget (`max_num_tokens=128`).

These values are specific to the local engine build used for validation and will change if you rebuild with different TensorRT-LLM settings and memory budgets.
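As a rough consistency check on the table above, the two dominant allocations nearly account for the observed total footprint; the remainder is runtime workspace and overhead. This is simple arithmetic on the reported values, not an additional measurement:

```python
# The engine load footprint plus the paged KV-cache pool should land close to
# the observed practical total GPU footprint (~6.9-7.0 GiB).
engine_gib = 5.5       # engine load footprint from the table
kv_pool_gib = 1.39     # paged KV-cache allocation from the table
print(round(engine_gib + kv_pool_gib, 2))   # 6.89
```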
## Benchmark Snapshot

Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop GPU`, using TensorRT-LLM synthetic fixed-length requests, `20` requests per profile, `2` warmup requests, and `concurrency=1`.

| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---:|---:|---:|---:|---:|---:|
| `tiny_16_32` | 16 | 32 | `21.95 ms` | `18.15 ms` | `54.73` | `584.68 ms` |
| `short_chat_42_64` | 42 | 64 | `24.21 ms` | `18.12 ms` | `54.88` | `1166.07 ms` |
| `balanced_64_64` | 64 | 64 | `23.10 ms` | `17.56 ms` | `56.68` | `1129.17 ms` |
| `long_prompt_96_32` | 96 | 32 | `26.17 ms` | `18.24 ms` | `54.08` | `591.63 ms` |
| `long_generation_32_96` | 32 | 96 | `23.01 ms` | `17.91 ms` | `55.68` | `1724.03 ms` |

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
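The profiles above are internally consistent: average latency is approximately TTFT plus `(output_tokens - 1) * TPOT`, and output tok/s is roughly `1000 / TPOT`. A small sketch that re-derives the measured latencies from the table values:

```python
# Re-derive average latency from TTFT and TPOT for three profiles in the table:
# avg_latency ≈ TTFT + (output_tokens - 1) * TPOT.
profiles = {
    # name: (ttft_ms, tpot_ms, output_tokens, measured_avg_latency_ms)
    "tiny_16_32":            (21.95, 18.15, 32, 584.68),
    "balanced_64_64":        (23.10, 17.56, 64, 1129.17),
    "long_generation_32_96": (23.01, 17.91, 96, 1724.03),
}
for name, (ttft, tpot, out_tokens, measured) in profiles.items():
    predicted = ttft + (out_tokens - 1) * tpot
    assert abs(predicted - measured) < 1.0   # matches to within 1 ms
    print(f"{name}: predicted {predicted:.2f} ms, measured {measured} ms")
```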
## Quick Parity Check

A small public sanity check was run against the upstream Hugging Face baseline on `20` validation examples from `ARC-Challenge` and `20` validation examples from `OpenBookQA`.

| Benchmark | HF baseline | TRT FP8 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.90` | `0.90` | `0.90` |
| `OpenBookQA` | `0.80` | `0.80` | `0.90` |
| `Overall` | `0.85` | `0.85` | `0.90` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show how closely this conversion tracks the upstream baseline on a small public subset.
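For reference, a minimal sketch of how such a parity check can be scored, assuming `Agreement` means the fraction of questions on which both models pick the same option (my reading of the table, not stated by the original run). The answer letters below are made-up illustration data, not the actual run outputs:

```python
# Hypothetical scoring for a multiple-choice parity check. The answer letters
# are illustration data only; "agreement" is read as the fraction of questions
# where both models chose the same option, right or wrong.
gold     = ["A", "B", "C", "D", "A"]
hf_pred  = ["A", "B", "C", "A", "A"]
trt_pred = ["A", "B", "D", "A", "A"]

def accuracy(pred, ref):
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

agreement = sum(h == t for h, t in zip(hf_pred, trt_pred)) / len(gold)
print(accuracy(hf_pred, gold), accuracy(trt_pred, gold), agreement)  # 0.8 0.6 0.8
```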
## FP8 vs NVFP4

The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).

| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs TRT FP8 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---|---|
| `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | `baseline` | Best balance in these local tests |
| `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5 pts on this quick-check` | Faster and smaller, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.

On the same `40`-question subset, the upstream HF baseline and the local TensorRT-LLM FP8 engine both scored `0.85`.
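Reading the table as ratios (simple arithmetic on the reported numbers, not a new measurement), NVFP4 delivers roughly a 1.4x throughput gain and a substantially smaller engine, at the cost of the quick-check score dropping from `0.85` to `0.775`:

```python
# Throughput and size ratios of NVFP4 relative to FP8, from the table above.
fp8   = {"short_chat_42_64": 54.88, "balanced_64_64": 56.68, "long_generation_32_96": 55.68}
nvfp4 = {"short_chat_42_64": 78.33, "balanced_64_64": 78.24, "long_generation_32_96": 82.27}

for profile in fp8:
    print(f"{profile}: {nvfp4[profile] / fp8[profile]:.2f}x")  # ~1.38x to ~1.48x

# NVFP4 engine size relative to the FP8 engine (3.0 GB vs 5.4 GB).
print(f"engine size ratio: {3.0 / 5.4:.2f}")  # 0.56
```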
## Notes

- This is not an official Microsoft or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.