Phi-3-small-128k-instruct ONNX
This repository hosts optimized versions of microsoft/Phi-3-small-128k-instruct to accelerate inference with DirectML and ONNX Runtime. Phi-3-small-128k-instruct is a state-of-the-art, lightweight open model developed by Microsoft, featuring 7B parameters.
Key Features:
- Parameter Count: 7B
- Tokenizer: Utilizes the tiktoken tokenizer for improved multilingual tokenization, with a vocabulary size of 100,352 tokens.
- Context Length: Default context length of 128k tokens.
Attention Mechanism:
- Implements grouped-query attention, with 4 query heads sharing 1 key/value head, to reduce the KV cache footprint.
- Alternates layers of dense attention with a novel blocksparse attention to further reduce KV cache usage while maintaining long-context retrieval performance.
- Multilingual Capability: Trained with an additional 10% of multilingual data to enhance performance across languages.
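To illustrate why grouped-query attention shrinks the KV cache, the sketch below compares cache sizes for full multi-head attention versus the 4:1 query-to-key grouping described above. The layer count, sequence length, and head dimensions are illustrative assumptions, not the model's actual configuration.

```python
# Illustrative KV-cache sizing: grouped-query attention (GQA) caches
# key/value tensors for fewer heads than multi-head attention (MHA).
# All dimensions below are illustrative assumptions, not Phi-3's config.

def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys AND values across all layers (fp16)."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

layers, seq_len, head_dim = 32, 4096, 128
q_heads = 32                      # query heads (assumed)
kv_heads = q_heads // 4           # 4 queries share 1 key/value head (GQA)

mha = kv_cache_bytes(layers, seq_len, q_heads, head_dim)   # MHA baseline
gqa = kv_cache_bytes(layers, seq_len, kv_heads, head_dim)  # GQA cache

print(f"MHA cache: {mha / 2**20:.0f} MiB")   # 2048 MiB
print(f"GQA cache: {gqa / 2**20:.0f} MiB ({mha // gqa}x smaller)")
```

With these assumed dimensions, the 4:1 grouping cuts the cache from 2 GiB to 512 MiB per 4k tokens of context, which matters even more at the model's 128k context length.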
ONNX Models
Here are some of the optimized configurations we have added:
- ONNX model for int4 DirectML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
- ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile devices using int4 quantization via RTN. Two versions are provided to trade off latency against accuracy: acc-level-1 targets improved accuracy, while acc-level-4 targets improved performance. For mobile devices, we recommend the acc-level-4 model.
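For intuition on the RTN scheme mentioned above: round-to-nearest quantization maps each group of floating-point weights to 4-bit integers with a shared scale. The sketch below is a minimal symmetric-quantization illustration of the idea; the actual ONNX quantizer uses its own grouping, zero-points, and bit packing.

```python
# Minimal sketch of round-to-nearest (RTN) int4 quantization with a
# shared per-group scale. Illustrative only; the real ONNX quantizer
# differs in grouping, zero-point handling, and weight packing.

def quantize_rtn_int4(weights):
    """Symmetric int4 RTN: map floats to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0  # 7 = largest positive int4
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.56, 0.33, 0.70, -0.21, 0.05, -0.49, 0.28]
q, scale = quantize_rtn_int4(weights)
approx = dequantize(q, scale)
print("quantized:", q)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, approx)))
```

Each weight now needs only 4 bits plus a share of the group's scale, which is where the roughly 4x size reduction over fp16 comes from.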
Usage
Installation and Setup
To use the Phi-3-small-128k-instruct ONNX model on Windows with DirectML, follow these steps:
- Create and activate a Conda environment:
conda create -n onnx python=3.10
conda activate onnx
- Install Git LFS:
winget install -e --id GitHub.GitLFS
- Install Hugging Face CLI:
pip install huggingface-hub[cli]
- Download the model:
huggingface-cli download EmbeddedLLM/Phi-3-small-128k-instruct-onnx --include="onnx/directml/*" --local-dir .\Phi-3-small-128k-instruct
- Install necessary Python packages:
pip install numpy==1.26.4
pip install onnxruntime-directml
pip install --pre onnxruntime-genai-directml
- Install Visual Studio 2015 runtime:
conda install conda-forge::vs2015_runtime
- Download the example script:
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py" -OutFile "phi3-qa.py"
- Run the example script:
python phi3-qa.py -m .\Phi-3-small-128k-instruct
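The example script wraps your input in the Phi-3 chat template before generation. If you later drive the model yourself through the onnxruntime-genai API, you need to build the same template; a minimal helper is sketched below. The special tokens follow the prompt format published in the Phi-3 model card, so verify them against the tokenizer configuration shipped with this model before relying on them.

```python
# Build a Phi-3-style chat prompt. The special tokens below follow the
# format published in the Phi-3 model card; confirm against the
# tokenizer config shipped with the model before relying on them.

def build_phi3_prompt(user_message, system_message=None):
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    parts.append(f"<|user|>\n{user_message}<|end|>\n")
    parts.append("<|assistant|>\n")  # generation continues from here
    return "".join(parts)

prompt = build_phi3_prompt("Summarize ONNX Runtime in one sentence.")
print(prompt)
```

The tokenized prompt would then be fed to the generator, which produces tokens after the `<|assistant|>` marker until it emits an end token.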
Hardware Requirements
Minimum Configuration:
- Windows: DirectX 12-capable GPU (AMD/Nvidia/Intel)
- CPU: x86_64 / ARM64
Tested Configurations:
- GPU: AMD Ryzen 8000 Series iGPU (DirectML)
- CPU: AMD Ryzen CPU
Hardware Supported
The model has been tested on:
- GPU SKU: RTX 4090 (DirectML)
Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 10GB of combined RAM
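The 10GB figure is roughly consistent with back-of-the-envelope sizing: 7B parameters at 4 bits each is about 3.5GB of weights before KV cache, activations, and runtime overhead. The numbers below are estimates, not measurements.

```python
# Back-of-the-envelope memory estimate for the int4 model weights.
# Rough estimate only; excludes KV cache, activations, and runtime
# overhead, which grow with context length.

params = 7e9                      # parameter count
weight_gb = params * 0.5 / 1e9    # int4 = 0.5 bytes per parameter
print(f"int4 weights: ~{weight_gb:.1f} GB")  # ~3.5 GB
```

The remaining headroom in the 10GB minimum covers the KV cache, which at the 128k context length can itself reach several gigabytes.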
Model Description
- Developed by: Microsoft
- Model type: ONNX
- Language(s) (NLP): Python, C, C++
- License: MIT
- Model Description: This is a conversion of the Phi-3 Small 128K Instruct model for ONNX Runtime inference.
Additional Details
- Phi-3 Small, Medium, and Vision Blog
- Phi-3 Model Blog Link
- Phi-3 Model Card
- Phi-3 Technical Report
- Phi-3 on Azure AI Studio
License
The model is licensed under the MIT license.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.