Add files using upload-large-folder tool
- .gitattributes +1 -0
- README.md +226 -0
- added_tokens.json +12 -0
- config.json +167 -0
- generation_config.json +10 -0
- merges.txt +0 -0
- special_tokens_map.json +30 -0
- tokenizer.json +3 -0
- tokenizer_config.json +111 -0
- vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,226 @@
---
license: mit
base_model: microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
library_name: tensorrt-llm
tags:
- phi
- phi-4
- tensorrt-llm
- text-generation
- nvfp4
- checkpoint
---

# Phi-4-mini-instruct TensorRT-LLM Checkpoint (NVFP4)

This repository contains a community-converted TensorRT-LLM checkpoint for [`microsoft/Phi-4-mini-instruct`](https://huggingface.co/microsoft/Phi-4-mini-instruct).

It is a TensorRT-LLM **checkpoint-format** repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.

## Who This Repo Is For

This repository is for users who already work with TensorRT-LLM and want a ready-made **TensorRT-LLM checkpoint** that they can turn into a local engine for their own GPU.

It is **not**:

- a prebuilt TensorRT engine
- a plain Transformers checkpoint
- an Ollama model
- a one-click chat model that can be run directly after download

## How to Use

1. Download this repository from Hugging Face.
2. Build a local engine with `trtllm-build` for your own GPU and TensorRT-LLM version.
3. Run inference with the engine you built (a minimal run sketch follows below).

The `Build Example` section below shows the validated local command used for the benchmark snapshot in this README.
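
As a sketch for step 3, TensorRT-LLM ships an example runner script that can drive a built engine from the command line. The paths below assume the layout from the `Build Example` section, and the script location (`examples/run.py` in a TensorRT-LLM source checkout) can differ between releases:

```bash
# Sketch only: single-prompt inference against the locally built engine.
# Assumes a TensorRT-LLM source checkout for examples/run.py and the
# ./checkpoint and ./engine paths used elsewhere in this README.
python examples/run.py \
  --engine_dir ./engine \
  --tokenizer_dir ./checkpoint \
  --input_text "Explain the four basic arithmetic operations in one short sentence each." \
  --max_output_len 32
```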

## Model Characteristics

- Base model: `microsoft/Phi-4-mini-instruct`
- License: `mit`
- Architecture: `Phi3ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `131072`
- Hidden size: `3072`
- Intermediate size: `8192`
- Layers: `32`
- Attention heads: `24`
- KV heads: `8`
- Vocabulary size: `200064`

These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.

## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `NVFP4`
- KV cache quantization: `FP8`
- Tensor parallel size: `1`
- Checkpoint files:
  - `config.json`
  - `rank0.safetensors`
  - tokenizer and generation files copied from the upstream Hugging Face model

## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary
- `added_tokens.json`: upstream added tokens
- `special_tokens_map.json`: upstream special tokens map
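
To confirm you have the NVFP4 variant before building, the quantization settings are recorded in the checkpoint's `config.json` (reproduced in full later in this commit). A minimal sketch using `jq`, assuming the checkpoint was downloaded to `./checkpoint` as in the Build Example below:

```bash
# Sketch only: inspect the quantization block of the downloaded checkpoint.
# Expected for this repo: "NVFP4" weights with an "FP8" KV cache
# (see the config.json contents later in this commit).
jq '.quantization | {quant_algo, kv_cache_quant_algo, group_size}' \
  ./checkpoint/config.json
```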

## Build Example

The following command is the **validated local engine build** used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```bash
huggingface-cli download Shoolife/Phi-4-mini-instruct-TensorRT-LLM-Checkpoint-NVFP4 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```

If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly; an illustrative variant follows below.
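
For instance, an unvalidated rebuild that widens the request shapes might look like this. The limit values here are illustrative assumptions, not tested on the benchmark machine, and larger budgets increase engine and KV-cache memory, so verify they fit your GPU:

```bash
# Sketch only: illustrative rebuild with larger, unvalidated limits.
# Keeping max_num_tokens >= max_input_len avoids the prompt-budget
# clamp described under "Validated Local Engine Characteristics".
trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine_long \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 1024 \
  --max_seq_len 2048 \
  --max_num_tokens 1024
```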

## Conversion

This checkpoint was produced from the upstream model with TensorRT-LLM NVFP4 quantization tooling:

```bash
python /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir ./Phi-4-mini-instruct \
  --output_dir ./checkpoint_nvfp4 \
  --dtype bfloat16 \
  --qformat nvfp4 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048 \
  --device cpu \
  --device_map cpu
```

Then build the engine:

```bash
trtllm-build \
  --checkpoint_dir ./checkpoint_nvfp4 \
  --output_dir ./engine_nvfp4 \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 256 \
  --max_seq_len 512 \
  --max_num_tokens 128 \
  --workers 1 \
  --monitor_memory
```

## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: `NVIDIA GeForce RTX 5070 Laptop GPU`
- Runtime: `TensorRT-LLM 1.2.0rc6`

Smoke-test prompt:

```text
Explain the four basic arithmetic operations in one short sentence each.
```

Observed response excerpt (`max_tokens=32`):

```text
Addition combines numbers to find a total. Subtraction removes a quantity from another. Multiplication calculates the product of numbers. Division divides a number by another to find
```

## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | `4.0 GB` |
| Built engine size | `3.0 GB` |
| Tested GPU | `NVIDIA GeForce RTX 5070 Laptop GPU` |
| GPU memory reported by benchmark host | `7.53 GiB` |
| Engine build `max_batch_size` | `1` |
| Engine build `max_input_len` | `256` |
| Engine build `max_seq_len` | `512` |
| Engine build `max_num_tokens` | `128` |
| Runtime effective max input length | `128` |
| Engine load footprint | `~3.0 GiB` |
| Paged KV cache allocation | `~3.58 GiB` |
| Practical total GPU footprint on this setup | `~6.6 GiB` |

Important: the `256` / `512` / `128` limits above belong only to this particular local engine build. They are not the intrinsic maximum context or generation limits of `Phi-4-mini-instruct` itself.

The runtime effective input length became `128` on this build because TensorRT-LLM enabled packed input and context FMHA and clamped the usable prompt budget to the engine token budget: with `max_num_tokens=128` below `max_input_len=256`, the smaller value wins.

These values are specific to the local engine build used for validation and will change if you rebuild with different TensorRT-LLM settings and memory budgets.

## Benchmark Snapshot

Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop GPU`, using TensorRT-LLM synthetic fixed-length requests, `20` requests per profile, `2` warmup requests, and `concurrency=1`.

| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---:|---:|---:|---:|---:|---:|
| `tiny_16_32` | 16 | 32 | `12.53 ms` | `10.17 ms` | `97.59` | `327.88 ms` |
| `short_chat_42_64` | 42 | 64 | `15.57 ms` | `12.72 ms` | `78.33` | `817.03 ms` |
| `balanced_64_64` | 64 | 64 | `15.66 ms` | `12.74 ms` | `78.24` | `818.01 ms` |
| `long_prompt_96_32` | 96 | 32 | `15.58 ms` | `12.48 ms` | `79.50` | `402.50 ms` |
| `long_generation_32_96` | 32 | 96 | `14.90 ms` | `12.13 ms` | `82.27` | `1166.89 ms` |

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.

## Quick Parity Check

A small public sanity check was run against the upstream Hugging Face baseline on `20` validation examples from `ARC-Challenge` and `20` validation examples from `OpenBookQA`.

| Benchmark | HF baseline | TRT NVFP4 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.90` | `0.80` | `0.85` |
| `OpenBookQA` | `0.80` | `0.75` | `0.85` |
| `Overall` | `0.85` | `0.775` | `0.85` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show the practical tradeoff of this conversion on a small public subset.

## FP8 vs NVFP4

The table below compares two locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=512`, `max_num_tokens=128`).

| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_64_64` | `long_generation_32_96` | Quick-check overall | Quick-check change vs FP8 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---|---|
| `FP8` | `5.3 GB` | `5.4 GB` | `54.88 tok/s` | `56.68 tok/s` | `55.68 tok/s` | `0.85` | `baseline` | Best balance in these local tests |
| `NVFP4` | `4.0 GB` | `3.0 GB` | `78.33 tok/s` | `78.24 tok/s` | `82.27 tok/s` | `0.775` | `-7.5 pts on this quick check` | Faster and smaller, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.

On that same `40`-question subset, the upstream Hugging Face baseline scored `0.85`, and the local TensorRT-LLM FP8 engine matched it at `0.85`.

## Notes

- This is not an official Microsoft or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.
- `NVFP4` is attractive for speed and engine size on Blackwell GPUs, but this local quick check showed a meaningful quality drop relative to `FP8`.

added_tokens.json ADDED

@@ -0,0 +1,12 @@
{
  "<|/tool_call|>": 200026,
  "<|/tool|>": 200024,
  "<|assistant|>": 200019,
  "<|end|>": 200020,
  "<|system|>": 200022,
  "<|tag|>": 200028,
  "<|tool_call|>": 200025,
  "<|tool_response|>": 200027,
  "<|tool|>": 200023,
  "<|user|>": 200021
}

config.json ADDED

@@ -0,0 +1,167 @@
{
  "producer": {
    "name": "modelopt",
    "version": "0.37.0"
  },
  "architecture": "Phi3ForCausalLM",
  "dtype": "bfloat16",
  "logits_dtype": "float16",
  "num_hidden_layers": 32,
  "num_attention_heads": 24,
  "num_key_value_heads": 8,
  "hidden_size": 3072,
  "norm_epsilon": 1e-05,
  "vocab_size": 200064,
  "max_position_embeddings": 131072,
  "hidden_act": "swiglu",
  "use_parallel_embedding": true,
  "embedding_sharding_dim": 0,
  "head_size": 128,
  "intermediate_size": 8192,
  "position_embedding_type": "long_rope",
  "share_embedding_table": false,
  "residual_mlp": false,
  "bias": false,
  "rotary_pct": 0.75,
  "rank": 0,
  "decoder": "phi3",
  "rmsnorm": true,
  "lm_head_bias": false,
  "rotary_base": 10000.0,
  "rotary_scaling": null,
  "runtime_defaults": null,
  "mapping": {
    "world_size": 1,
    "gpus_per_node": 8,
    "cp_size": 1,
    "tp_size": 1,
    "pp_size": 1,
    "moe_tp_size": 1,
    "moe_cluster_size": 1,
    "moe_ep_size": 1,
    "attn_tp_size": 1,
    "attn_cp_size": 1,
    "cp_config": {},
    "enable_attention_dp": false,
    "enable_lm_head_tp_in_adp": false
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8",
    "group_size": 16,
    "smoothquant_val": 0.5,
    "clamp_val": null,
    "use_meta_recipe": false,
    "has_zero_point": false,
    "pre_quant_scale": false,
    "exclude_modules": [
      "lm_head"
    ],
    "mamba_ssm_cache_dtype": null
  },
  "qk_layernorm": false,
  "rotary_embedding_dim": 96,
  "tie_word_embeddings": true,
  "original_max_position_embeddings": 4096,
  "longrope_scaling_short_factors": [
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
  ],
  "longrope_scaling_long_factors": [
    1, 1.118320672, 1.250641126, 1.398617824, 1.564103225, 1.74916897,
    1.956131817, 2.187582649, 2.446418898, 2.735880826, 3.059592084,
    3.421605075, 3.826451687, 4.279200023, 4.785517845, 5.351743533,
    5.984965424, 6.693110555, 7.485043894, 8.370679318, 9.36110372,
    10.4687158, 11.70738129, 13.09260651, 14.64173252, 16.37415215,
    18.31155283, 20.47818807, 22.90118105, 25.61086418, 28.64115884,
    32.03, 32.1, 32.13, 32.23, 32.6, 32.61, 32.64, 32.66, 32.7, 32.71,
    32.93, 32.97, 33.28, 33.49, 33.5, 44.16, 47.77
  ],
  "model_type": "phi3"
}

generation_config.json ADDED

@@ -0,0 +1,10 @@
{
  "_from_model_config": true,
  "bos_token_id": 199999,
  "eos_token_id": [
    200020,
    199999
  ],
  "pad_token_id": 199999,
  "transformers_version": "4.45.0"
}

merges.txt ADDED

The diff for this file is too large to render. See the raw diff.

special_tokens_map.json ADDED

@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.json ADDED

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:382cc235b56c725945e149cc25f191da667c836655efd0857b004320e90e91ea
size 15524095

tokenizer_config.json ADDED

@@ -0,0 +1,111 @@
{
  "add_bos_token": false,
  "add_eos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "199999": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "200018": {
      "content": "<|endofprompt|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "200019": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "200020": {
      "content": "<|end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "200021": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "200022": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "200023": {
      "content": "<|tool|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "200024": {
      "content": "<|/tool|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "200025": {
      "content": "<|tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "200026": {
      "content": "<|/tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "200027": {
      "content": "<|tool_response|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "200028": {
      "content": "<|tag|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' and 'tools' in message and message['tools'] is not none %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|tool|>' + message['tools'] + '<|/tool|>' + '<|end|>' }}{% else %}{{ '<|' + message['role'] + '|>' + message['content'] + '<|end|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>' }}{% else %}{{ eos_token }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}

vocab.json ADDED

The diff for this file is too large to render. See the raw diff.