---
license: mit
license_link: https://huggingface.co/microsoft/Phi-4-reasoning/resolve/main/LICENSE
language:
- en
base_model:
- microsoft/Phi-4-reasoning
pipeline_tag: text-generation
tags:
- phi
- nlp
- math
- code
- chat
- conversational
- reasoning
- red hat
- FP8
- compressed-tensors
- llm-compressor
---

## Model Overview
- **Model Architecture:** Phi3ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** This model is designed to accelerate research on language models and to serve as a building block for generative AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:
    1. Memory/compute constrained environments.
    2. Latency bound scenarios.
    3. Math reasoning and logic.
- **Release Date:** 01/26/2026
- **Version:** 1.0
- **Model Developers:** Red Hat

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Phi-4-reasoning](https://huggingface.co/microsoft/Phi-4-reasoning) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
```

```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
```

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "microsoft/Phi-4-reasoning"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
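
To make the quantization scheme more concrete, the following is a minimal pure-Python sketch of symmetric per-row FP8 quantization. It is an illustration only and not the llm-compressor implementation. It assumes the E4M3 variant of FP8 (largest finite value 448), and the helper names `round_to_e4m3`, `quantize_rows` and `dequantize_rows` are hypothetical. The same routine is applied to a weight matrix (one scale per output channel, computed once offline, hence "static") and to an activation matrix (one scale per token, recomputed at runtime, hence "dynamic").

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # floor(log2(mag)), clamped to the smallest normal exponent (-6)
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row (channel or token)."""
    q_matrix, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        q_matrix.append([round_to_e4m3(v / scale) for v in row])
    return q_matrix, scales


def dequantize_rows(q_matrix, scales):
    """Map quantized values back to the original range."""
    return [[v * s for v in row] for row, s in zip(q_matrix, scales)]


# Static per-channel weights: shape [out_features, in_features],
# scales are computed once and stored with the checkpoint.
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -4.0]]
q_weights, w_scales = quantize_rows(weights)

# Dynamic per-token activations: shape [tokens, hidden],
# scales are recomputed for every input at inference time.
activations = [[0.02, -0.7, 0.3], [5.0, 1.5, -2.5]]
q_acts, a_scales = quantize_rows(activations)

print("weight scales:", w_scales)
print("activation scales:", a_scales)
print("dequantized weights:", dequantize_rows(q_weights, w_scales))
```

Because the scheme is symmetric, no zero point is stored. Each row needs only a single scale, chosen so that its largest absolute value maps to the edge of the FP8 range.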

## Evaluation

The model was evaluated on the AIME25, GPQA Diamond and Math 500 benchmarks using [lighteval](https://github.com/huggingface/lighteval), and on MMLU-Pro using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
In both cases, [vLLM](https://vllm.ai) is used as the backend.

### Start vLLM server

```bash
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
```

### lm-evaluation-harness

```bash
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/Phi-4-reasoning-FP8-dynamic,max_length=32000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=2400,tokenizer_backend=None" \
  --apply_chat_template \
  --num_fewshot 0 \
  --output_path mmlu_pro_phi4_reasoning_fp8_dynamic \
  --gen_kwargs "do_sample=True,temperature=0.8,top_k=50,top_p=0.95,max_gen_toks=24000"
```

### lighteval

`litellm_config.yaml`

```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-reasoning-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 1200
  concurrent_requests: 64
  generation_parameters:
    temperature: 0.8
    top_k: 50
    top_p: 0.95
    max_new_tokens: 24000
```

```bash
# The task list is quoted so the shell does not treat "|" as a pipe.
lighteval endpoint litellm litellm_config.yaml \
  "gpqa:diamond|0,math_500|0,aime25|0" \
  --output-dir phi4_reasoning_fp8_dynamic \
  --save-details
```

### Accuracy

| Benchmark | Phi-4-reasoning | Phi-4-reasoning FP8-dynamic (this model) | Recovery |
|---|---|---|---|
| AIME25 | 61.25 | 64.58 | 105.4% |
| GPQA Diamond | 64.65 | 66.50 | 102.9% |
| Math 500 | 90.01 | 88.60 | 98.4% |
| MMLU-Pro | 76.49 | 76.85 | 100.5% |
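
The Recovery figures appear to be the quantized model's score divided by the baseline score, expressed as a percentage. The short sketch below reproduces them from the reported scores under that assumption. The `scores` dictionary and the `recovery` helper are hypothetical names introduced for illustration.

```python
# Reported scores: (Phi-4-reasoning baseline, FP8-dynamic model)
scores = {
    "AIME25": (61.25, 64.58),
    "GPQA Diamond": (64.65, 66.50),
    "Math 500": (90.01, 88.60),
    "MMLU-Pro": (76.49, 76.85),
}


def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for benchmark, (baseline, quantized) in scores.items():
    print(f"{benchmark}: {recovery(baseline, quantized):.1f}%")
```

A recovery above 100% means the quantized model scored higher than the baseline on that benchmark.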