---
license: mit
base_model: microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
library_name: tensorrt-llm
tags:
- phi
- phi-4
- tensorrt-llm
- text-generation
- fp8
- checkpoint
---

# Phi-4-mini-instruct TensorRT-LLM Checkpoint (FP8)
This repository contains a community-converted TensorRT-LLM checkpoint for [`microsoft/Phi-4-mini-instruct`](https://huggingface.co/microsoft/Phi-4-mini-instruct).

It is a TensorRT-LLM **checkpoint-format** repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.
## Who This Repo Is For

This repository is for users who already work with TensorRT-LLM and want a ready-made **TensorRT-LLM checkpoint** they can turn into a local engine for their own GPU.

It is **not**:

- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download
## How to Use

1. Download this repository from Hugging Face.
2. Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
3. Run inference with the engine you built.

The `Build Example` section below shows the validated local command used for the benchmark snapshot in this README.
## Model Characteristics

- Base model: `microsoft/Phi-4-mini-instruct`
- License: `mit`
- Architecture: `Phi3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `131072`
- Hidden size: `3072`
- Intermediate size: `8192`
- Layers: `32`
- Attention heads: `24`
- KV heads: `8`
- Vocabulary size: `200064`

These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.
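The characteristics above imply a head dimension of `3072 / 24 = 128`, and with grouped-query attention only the `8` KV heads are cached. A back-of-the-envelope sketch of the per-token KV-cache cost under FP8; this is my arithmetic from the config values, not a number reported by TensorRT-LLM:

```python
# Per-token KV-cache footprint implied by the model characteristics above,
# assuming an FP8 KV cache (1 byte per stored value).
layers = 32
kv_heads = 8
head_dim = 3072 // 24          # hidden_size / attention_heads
bytes_per_value = 1            # FP8

# Both K and V are cached, hence the factor of 2.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                # 65536 bytes = 64 KiB per token
print(kv_bytes_per_token * 512 / 2**20)  # 32.0 MiB for one full 512-token sequence
```

Note that the much larger paged KV-cache allocation reported later in this README is a pre-allocated runtime pool, not per-sequence usage.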
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `FP8`
- KV cache quantization: `FP8`
- Tensor parallel size: `1`
- Checkpoint files:
  - `config.json`
  - `rank0.safetensors`
  - tokenizer and generation files copied from the upstream Hugging Face model
## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
- `added_tokens.json`: upstream added tokens
- `special_tokens_map.json`: upstream special tokens map
## Build Example

The following command is the **validated local engine build** used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```bash
huggingface-cli download Shoolife/Phi-4-mini-instruct-TensorRT-LLM-Checkpoint-FP8 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```

If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.
## Conversion

This checkpoint was produced from the upstream model with TensorRT-LLM FP8 quantization tooling:

```bash
python /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir ./Phi-4-mini-instruct \
  --output_dir ./checkpoint_fp8 \
  --dtype bfloat16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048 \
  --device cpu \
  --device_map cpu
```

Then build the engine:

```bash
trtllm-build \
  --checkpoint_dir ./checkpoint_fp8 \
  --output_dir ./engine_fp8 \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: `NVIDIA GeForce RTX 5070 Laptop GPU`
- Runtime: `TensorRT-LLM 1.2.0rc6`

Smoke-test prompt:

```text
Explain the four basic arithmetic operations in one short sentence each.
```

Observed response:

```text
Addition combines two or more numbers into a larger total. Subtraction removes one number from another to find the difference. Multiplication calculates the total of repeated addition. Division splits a number into equal parts.
```
## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | `5.3 GB` |
| Built engine size | `5.4 GB` |
| Tested GPU | `NVIDIA GeForce RTX 5070 Laptop GPU` |
| GPU memory reported by benchmark host | `7.53 GiB` |
| Engine build `max_batch_size` | `1` |
| Engine build `max_input_len` | `256` |
| Engine build `max_seq_len` | `512` |
| Engine build `max_num_tokens` | `128` |
| Runtime effective max input length | `128` |
| Engine load footprint | `~5.5 GiB` |
| Paged KV cache allocation | `~1.39 GiB` |
| Practical total GPU footprint on this setup | `~6.9-7.0 GiB` |

Important: the `256` / `512` / `128` limits above belong only to this particular local engine build. They are not the intrinsic maximum context or generation limits of `Phi-4-mini-instruct` itself.

The runtime effective input length dropped to `128` on this build because TensorRT-LLM enabled packed input and context FMHA, which clamps the usable prompt budget to the engine token budget (`max_num_tokens=128`).

These values are specific to the local engine build used for validation and will change if you rebuild with different TensorRT-LLM settings and memory budgets.
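As a rough consistency check on the table above, the two dominant allocations nearly account for the observed total footprint; the remainder is runtime workspace and overhead. This is simple arithmetic on the reported values, not an additional measurement:

```python
# The engine load footprint plus the paged KV-cache pool should land close to
# the observed practical total GPU footprint (~6.9-7.0 GiB).
engine_gib = 5.5       # engine load footprint from the table
kv_pool_gib = 1.39     # paged KV-cache allocation from the table
print(round(engine_gib + kv_pool_gib, 2))   # 6.89
```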
## Benchmark Snapshot

Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop GPU`, using TensorRT-LLM synthetic fixed-length requests, `20` requests per profile, `2` warmup requests, and `concurrency=1`.

| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---:|---:|---:|---:|---:|---:|
| `tiny_16_32` | 16 | 32 | `21.95 ms` | `18.15 ms` | `54.73` | `584.68 ms` |
| `short_chat_42_64` | 42 | 64 | `24.21 ms` | `18.12 ms` | `54.88` | `1166.07 ms` |
| `balanced_64_64` | 64 | 64 | `23.10 ms` | `17.56 ms` | `56.68` | `1129.17 ms` |
| `long_prompt_96_32` | 96 | 32 | `26.17 ms` | `18.24 ms` | `54.08` | `591.63 ms` |
| `long_generation_32_96` | 32 | 96 | `23.01 ms` | `17.91 ms` | `55.68` | `1724.03 ms` |

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.
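The profiles above are internally consistent: average latency is approximately TTFT plus `(output_tokens - 1) * TPOT`, and output tok/s is roughly `1000 / TPOT`. A small sketch that re-derives the measured latencies from the table values:

```python
# Re-derive average latency from TTFT and TPOT for three profiles in the table:
# avg_latency ≈ TTFT + (output_tokens - 1) * TPOT.
profiles = {
    # name: (ttft_ms, tpot_ms, output_tokens, measured_avg_latency_ms)
    "tiny_16_32":            (21.95, 18.15, 32, 584.68),
    "balanced_64_64":        (23.10, 17.56, 64, 1129.17),
    "long_generation_32_96": (23.01, 17.91, 96, 1724.03),
}
for name, (ttft, tpot, out_tokens, measured) in profiles.items():
    predicted = ttft + (out_tokens - 1) * tpot
    assert abs(predicted - measured) < 1.0   # matches to within 1 ms
    print(f"{name}: predicted {predicted:.2f} ms, measured {measured} ms")
```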
## Quick Parity Check

A small public sanity check was run against the upstream Hugging Face baseline on `20` validation examples from `ARC-Challenge` and `20` validation examples from `OpenBookQA`.

| Benchmark | HF baseline | TRT FP8 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.90` | `0.90` | `0.90` |
| `OpenBookQA` | `0.80` | `0.80` | `0.90` |
| `Overall` | `0.85` | `0.85` | `0.90` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show how closely this conversion tracks the upstream baseline on a small public subset.
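For reference, a minimal sketch of how such a parity check can be scored, assuming `Agreement` means the fraction of questions on which both models pick the same option (my reading of the table, not stated by the original run). The answer letters below are made-up illustration data, not the actual run outputs:

```python
# Hypothetical scoring for a multiple-choice parity check. The answer letters
# are illustration data only; "agreement" is read as the fraction of questions
# where both models chose the same option, right or wrong.
gold     = ["A", "B", "C", "D", "A"]
hf_pred  = ["A", "B", "C", "A", "A"]
trt_pred = ["A", "B", "D", "A", "A"]

def accuracy(pred, ref):
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

agreement = sum(h == t for h, t in zip(hf_pred, trt_pred)) / len(gold)
print(accuracy(hf_pred, gold), accuracy(trt_pred, gold), agreement)  # 0.8 0.6 0.8
```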
## FP8 vs NVFP4

The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).

| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs TRT FP8 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---|---|
| `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | `baseline` | Best balance in these local tests |
| `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5 pts on this quick-check` | Faster and smaller, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.

On the same `40`-question subset, the upstream HF baseline and the local TensorRT-LLM FP8 engine both scored `0.85`.
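Reading the table as ratios (simple arithmetic on the reported numbers, not a new measurement), NVFP4 delivers roughly a 1.4x throughput gain and a substantially smaller engine, at the cost of the quick-check score dropping from `0.85` to `0.775`:

```python
# Throughput and size ratios of NVFP4 relative to FP8, from the table above.
fp8   = {"short_chat_42_64": 54.88, "balanced_64_64": 56.68, "long_generation_32_96": 55.68}
nvfp4 = {"short_chat_42_64": 78.33, "balanced_64_64": 78.24, "long_generation_32_96": 82.27}

for profile in fp8:
    print(f"{profile}: {nvfp4[profile] / fp8[profile]:.2f}x")  # ~1.38x to ~1.48x

# NVFP4 engine size relative to the FP8 engine (3.0 GB vs 5.4 GB).
print(f"engine size ratio: {3.0 / 5.4:.2f}")  # 0.56
```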
## Notes

- This is not an official Microsoft or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.