---
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- 'no'
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
license: mit
license_name: mit
license_link: https://huggingface.co/microsoft/Phi-4-mini-instruct/resolve/main/LICENSE
name: RedHatAI/Phi-4-mini-instruct-FP8-dynamic
base_model:
- microsoft/Phi-4-mini-instruct
provider: Microsoft
description: This model was obtained by quantizing the activations and weights of Phi-4-mini-instruct to the FP8 data type.
validated_on:
- RHOAI 3.4 EA1
- RHAIIS 3.4 EA1
readme: https://huggingface.co/RedHatAI/Phi-4-mini-instruct-FP8-dynamic/blob/main/README.md
pipeline_tag: text-generation
tags:
- nlp
- code
- red hat
- FP8
- compressed-tensors
- llm-compressor
---


## Model Overview

- **Model Architecture:** Phi3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** The model is intended for broad multilingual commercial and research use. It is suited to general-purpose AI systems and applications that require:
  1. Memory/compute constrained environments.
  2. Latency bound scenarios.
  3. Math reasoning and logic.
- **Release Date:** 03/03/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x. Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
```

```python
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.chat.completions.create(
    model="RedHatAI/Phi-4-mini-instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
print(completion.choices[0].message.content)
```
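vLLM can also run the model directly from Python, without a separate server process. A minimal sketch, assuming a recent vLLM version that exposes `LLM.chat`; the sampling parameters here are illustrative choices, not values from this card:

```python
from vllm import LLM, SamplingParams

# Load the quantized model in-process (no API server needed).
llm = LLM(model="RedHatAI/Phi-4-mini-instruct-FP8-dynamic", max_model_len=131072)

# Greedy decoding with an illustrative output budget (assumed values).
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```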
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model_stub = "microsoft/Phi-4-mini-instruct"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
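For intuition about the `FP8_dynamic` scheme configured above, the sketch below shows the underlying idea in plain PyTorch: static, symmetric per-channel scales for weights and dynamic, symmetric per-token scales for activations. This is an illustrative toy, not the llm-compressor or vLLM kernel code; the tensor shapes and helper names are assumptions.

```python
# Toy illustration of symmetric FP8 quantization (not library internals).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for the E4M3 format

def quantize_weight_per_channel(w: torch.Tensor):
    """Static scheme: one scale per output channel (row), computed once offline."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic scheme: one scale per token (row), recomputed at each forward pass."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

w = torch.randn(16, 64)   # [out_features, in_features] weight of a linear layer
x = torch.randn(4, 64)    # [tokens, in_features] activations
qw, w_scale = quantize_weight_per_channel(w)
qx, x_scale = quantize_activation_per_token(x)

# Dequantize and multiply: approximates the original linear op x @ w.T.
y_approx = (qx.float() * x_scale) @ (qw.float() * w_scale).T
print((y_approx - x @ w.T).abs().max())  # small FP8 rounding error
```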
## Evaluation

The model was evaluated on the Math 500 benchmark using [lighteval](https://github.com/huggingface/lighteval), and on GSM8k-Platinum, MMLU CoT, MMLU-Pro, and IFEval using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). In both cases, [vLLM](https://vllm.ai) was used as the backend. The evaluation commands are listed below.
### Start vLLM server

```bash
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
```

### lm-evaluation-harness

```bash
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --num_fewshot 5 \
  --fewshot_as_multiturn \
  --output_path gsm8k_platinum_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks mmlu_cot_llama \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --output_path mmlu_cot_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --num_fewshot 5 \
  --fewshot_as_multiturn \
  --output_path mmlu_pro_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --output_path ifeval_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

### lighteval

`litellm_config.yaml`:

```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-mini-instruct-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 600
  concurrent_requests: 128
generation_parameters:
  temperature: 0.0
  max_new_tokens: 16000
```

```bash
lighteval endpoint litellm litellm_config.yaml \
  "math_500|0" \
  --output-dir phi4_mini_instruct_fp8_dynamic \
  --save-details
```
### Accuracy
| Benchmark | Phi-4-mini-instruct | Phi-4-mini-instruct-FP8-dynamic<br>(this model) | Recovery |
|---|---|---|---|
| Math 500 | 57.60 | 58.20 | 101.0% |
| GSM8k-Platinum | 84.12 | 84.70 | 100.7% |
| MMLU CoT | 67.01 | 66.97 | 99.9% |
| MMLU-Pro | 46.75 | 45.60 | 97.5% |
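Recovery is the quantized model's score expressed as a percentage of the baseline score. A quick sketch of the computation, using the scores from the table above:

```python
# Recovery = quantized score / baseline score, as a percentage.
baseline = {"Math 500": 57.60, "GSM8k-Platinum": 84.12, "MMLU CoT": 67.01, "MMLU-Pro": 46.75}
quantized = {"Math 500": 58.20, "GSM8k-Platinum": 84.70, "MMLU CoT": 66.97, "MMLU-Pro": 45.60}

for task, base in baseline.items():
    print(f"{task}: {100.0 * quantized[task] / base:.1f}%")
```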