	---
	language:
	- multilingual
	- ar
	- zh
	- cs
	- da
	- nl
	- en
	- fi
	- fr
	- de
	- he
	- hu
	- it
	- ja
	- ko
	- 'no'
	- pl
	- pt
	- ru
	- es
	- sv
	- th
	- tr
	- uk
	license: mit
	license_name: mit
	license_link: https://huggingface.co/microsoft/Phi-4-mini-instruct/resolve/main/LICENSE
	name: RedHatAI/Phi-4-mini-instruct-FP8-dynamic
	base_model:
	- microsoft/Phi-4-mini-instruct
	provider: Microsoft
	description: This model was obtained by quantizing the activations and weights of Phi-4-mini-instruct to the FP8 data type.
	validated_on:
	- RHOAI 3.4 EA1
	- RHAIIS 3.4 EA1
	readme: https://huggingface.co/RedHatAI/Phi-4-mini-instruct-FP8-dynamic/blob/main/README.md
	pipeline_tag: text-generation
	tags:
	- nlp
	- code
	- red hat
	- FP8
	- compressed-tensors
	- llm-compressor
	---

	<h1 align="center" style="display: flex; align-items: center; gap: 10px; margin: 0;">
	Phi-4-mini-instruct-FP8-dynamic
	<img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
	</h1>
	<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
	<img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
	</a>


	## Model Overview
	- Model Architecture: Phi3ForCausalLM
	- Input: Text
	- Output: Text
	- Model Optimizations:
	  - Activation quantization: FP8
	  - Weight quantization: FP8
	- Intended Use Cases: The model is intended for broad multilingual commercial and research use. It is suited to general-purpose AI systems and applications that involve:
	  1. Memory- or compute-constrained environments.
	  2. Latency-bound scenarios.
	  3. Math reasoning and logic.
	- Release Date: 03/03/2025
	- Version: 1.0
	- Model Developers: Red Hat


	### Model Optimizations

	This model was obtained by quantizing the activations and weights of [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) to the FP8 data type.
	This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
	Weight quantization also reduces disk size requirements by approximately 50%.
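
The 50% figure follows from simple arithmetic on bits per weight. The sketch below is only an illustration: the parameter count and the fraction of quantized parameters are hypothetical inputs, not measurements of this model. Layers that stay at 16 bits reduce the overall saving.

```python
def weight_memory_gb(num_params, quantized_fraction, bits_before=16, bits_after=8):
    """Return (before, after) weight memory in GB for a partially quantized model."""
    bytes_before = num_params * bits_before / 8
    quantized = num_params * quantized_fraction
    unquantized = num_params - quantized
    bytes_after = (quantized * bits_after + unquantized * bits_before) / 8
    return bytes_before / 1e9, bytes_after / 1e9


# Hypothetical 4-billion-parameter model with every weight quantized.
before, after = weight_memory_gb(num_params=4e9, quantized_fraction=1.0)
print(f"{before:.1f} GB -> {after:.1f} GB")

# Same model with 20% of the parameters left at 16 bits.
_, after_partial = weight_memory_gb(num_params=4e9, quantized_fraction=0.8)
print(f"partial quantization: {after_partial:.1f} GB")
```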

	Only weights and activations of the linear operators within transformer blocks are quantized.
	Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
	The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
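
To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric quantization with one scale per row. It is not the llm-compressor implementation. It assumes the E4M3 variant of FP8 (maximum finite value 448), and the helper names `round_to_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical. The same routine serves both cases: for weights, each row is an output channel and the scales are computed once offline (static); for activations, each row is a token and the scales are recomputed on every forward pass (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (the E4M3 format is an assumption)


def round_to_e4m3(x):
    """Simulate rounding a float to the nearest FP8 E4M3 value."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate out-of-range values
    # floor(log2(|x|)), clamped to -6, the smallest normal exponent in E4M3
    exponent = max(math.frexp(abs(x))[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row (no zero point)."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Static per-channel: weight rows are output channels, scales are stored with the model.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
q_weight, weight_scales = quantize_rows(weight)

# Dynamic per-token: activation rows are tokens, scales are computed at inference time.
activations = [[3.0, -0.75, 1.5], [0.02, 0.01, -0.04]]
q_act, act_scales = quantize_rows(activations)

print(dequantize_rows(q_weight, weight_scales))
```

Because each row gets its own scale, a token or channel with small values still uses the full FP8 range, which is the main reason to prefer per-token and per-channel scales over a single per-tensor scale.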

	## Deployment

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

	```bash
	vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
	```

	```python
	from openai import OpenAI
	# Set OpenAI's API key and API base to use vLLM's API server.
	openai_api_key = "EMPTY"
	openai_api_base = "http://localhost:8000/v1"

	client = OpenAI(
	    api_key=openai_api_key,
	    base_url=openai_api_base,
	)

	generated_text = client.chat.completions.create(
	    model="RedHatAI/Phi-4-mini-instruct-FP8-dynamic",
	    messages=[
	        {"role": "user", "content": "Give me a short introduction to large language model."},
	    ],
	)
	print(generated_text.choices[0].message.content)
	```

	## Creation

	<details>
	<summary>Creation details</summary>
	This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from llmcompressor.modifiers.quantization import QuantizationModifier
	from llmcompressor import oneshot

	# Load model
	model_stub = "microsoft/Phi-4-mini-instruct"
	model_name = model_stub.split("/")[-1]

	tokenizer = AutoTokenizer.from_pretrained(model_stub)

	model = AutoModelForCausalLM.from_pretrained(
	    model_stub,
	    device_map="auto",
	    torch_dtype="auto",
	)

	# Configure the quantization algorithm and scheme
	recipe = QuantizationModifier(
	    targets="Linear",
	    scheme="FP8_dynamic",
	    ignore=["lm_head"],
	)

	# Apply quantization
	oneshot(
	    model=model,
	    recipe=recipe,
	)

	# Save to disk in compressed-tensors format
	save_path = model_name + "-FP8-dynamic"
	model.save_pretrained(save_path)
	tokenizer.save_pretrained(save_path)
	print(f"Model and tokenizer saved to: {save_path}")
	```
	</details>



	## Evaluation

	The model was evaluated on the Math 500 benchmark using [lighteval](https://github.com/huggingface/lighteval), and on GSM8k-Platinum, MMLU CoT, MMLU-Pro, and IFEval using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
	In both cases, [vLLM](https://vllm.ai) is used as the backend.

	<details>
	<summary>Evaluation commands</summary>

	### Start vLLM server
	```bash
	vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
	```

	### lm-evaluation-harness
	```bash
	lm_eval --model local-chat-completions \
	--tasks gsm8k_platinum_cot_llama \
	--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
	--apply_chat_template \
	--num_fewshot 5 \
	--fewshot_as_multiturn \
	--output_path gsm8k_platinum_phi4_mini_instruct_fp8_dynamic \
	--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
	```

	```bash
	lm_eval --model local-chat-completions \
	--tasks mmlu_cot_llama \
	--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
	--apply_chat_template \
	--output_path mmlu_cot_phi4_mini_instruct_fp8_dynamic \
	--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
	```

	```bash
	lm_eval --model local-chat-completions \
	--tasks mmlu_pro_chat \
	--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
	--apply_chat_template \
	--num_fewshot 5 \
	--fewshot_as_multiturn \
	--output_path mmlu_pro_phi4_mini_instruct_fp8_dynamic \
	--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
	```

	```bash
	lm_eval --model local-chat-completions \
	--tasks ifeval \
	--model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
	--apply_chat_template \
	--output_path ifeval_phi4_mini_instruct_fp8_dynamic \
	--gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
	```

	### lighteval
	litellm_config.yaml
	```yaml
	model_parameters:
	  provider: "hosted_vllm"
	  model_name: "hosted_vllm/RedHatAI/Phi-4-mini-instruct-FP8-dynamic"
	  base_url: "http://0.0.0.0:8000/v1"
	  api_key: ""
	  timeout: 600
	  concurrent_requests: 128
	  generation_parameters:
	    temperature: 0.0
	    max_new_tokens: 16000
	```

	```bash
	lighteval endpoint litellm litellm_config.yaml \
	math_500\|0 \
	--output-dir phi4_mini_instruct_fp8_dynamic \
	--save-details
	```
	</details>

	### Accuracy

	<table>
	<tr>
	<td><strong>Benchmark</strong>
	</td>
	<td><strong>Phi-4-mini-instruct</strong>
	</td>
	<td><strong>Phi-4-mini-instruct-FP8-dynamic<br>(this model)</strong>
	</td>
	<td><strong>Recovery</strong>
	</td>
	</tr>
	<tr>
	<td>Math 500
	</td>
	<td>57.60
	</td>
	<td>58.20
	</td>
	<td>101.7%
	</td>
	</tr>
	<tr>
	<td>GSM8k-Platinum
	</td>
	<td>84.12
	</td>
	<td>84.70
	</td>
	<td>100.7%
	</td>
	</tr>
	<tr>
	<td>MMLU CoT
	</td>
	<td>67.01
	</td>
	<td>66.97
	</td>
	<td>99.9%
	</td>
	</tr>
	<tr>
	<td>MMLU-Pro
	</td>
	<td>46.75
	</td>
	<td>45.60
	</td>
	<td>97.5%
	</td>
	</tr>
	</table>
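
The Recovery column can be read as the quantized score expressed as a percentage of the baseline score. That definition is an assumption, since the card does not state it, but it reproduces the rows used in the short sketch below. The function name `recovery` is illustrative only.

```python
def recovery(baseline_score, quantized_score):
    """Quantized score as a percentage of the baseline score, to one decimal place."""
    return round(100.0 * quantized_score / baseline_score, 1)


# (baseline, quantized) pairs copied from three rows of the table above
scores = {
    "GSM8k-Platinum": (84.12, 84.70),
    "MMLU CoT": (67.01, 66.97),
    "MMLU-Pro": (46.75, 45.60),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized)}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark.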