---
license: mit
pipeline_tag: text-generation
tags:
- ONNX
- DML
- ONNXRuntime
- phi3
- nlp
- conversational
- custom_code
- DirectML
inference: false
language:
- en
---

# Phi-3-small-128k-instruct ONNX

This repository hosts optimized versions of [microsoft/Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct) to accelerate inference with DirectML and ONNX Runtime.
Phi-3-small-128k-instruct is a state-of-the-art, lightweight open model developed by Microsoft, featuring 7B parameters.

Key Features:
- Parameter count: 7B
- Tokenizer: uses the tiktoken tokenizer for improved multilingual tokenization, with a vocabulary size of 100,352 tokens.
- Context length: default context length of 128K tokens.

Attention Mechanism:
- Implements grouped-query attention, with 4 queries sharing 1 key, to minimize the KV cache footprint.
- Uses alternating layers of dense attention and a novel blocksparse attention to further reduce KV cache usage while maintaining long-context retrieval performance.

Multilingual Capability:
- Trained on an additional 10% of multilingual data to enhance performance across languages.

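As a rough illustration of why grouped-query attention shrinks the KV cache, the sketch below compares cache sizes when every query head has its own key/value head versus when 4 queries share 1. The layer count, head count, head dimension, and dtype width here are hypothetical placeholders, not the model's actual configuration:

```python
# Hedged sketch: KV cache size under multi-head vs. grouped-query attention.
# All dimensions below are illustrative, not Phi-3-small's real config.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers (hence the factor of 2)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical config: 32 query heads, head_dim 128, fp16 cache.
mha = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=8, head_dim=128)  # 4 queries share 1 KV head

print(mha // gqa)  # grouped-query uses 4x less KV cache memory
```

The savings ratio is simply the number of queries sharing each key/value head, independent of sequence length.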
## ONNX Models

Here are some of the optimized configurations we have added:
- **ONNX model for int4 DirectML:** ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 via AWQ.
- **ONNX model for int4 CPU and Mobile:** ONNX model for CPU and mobile devices using int4 quantization via RTN. Two versions are uploaded to balance latency and accuracy: acc-level-1 targets improved accuracy, while acc-level-4 targets improved performance. For mobile devices, we recommend the acc-level-4 model.

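The RTN (round-to-nearest) method mentioned above can be sketched in a few lines. This toy version maps a block of float weights to symmetric 4-bit integers with a single scale; real quantization tooling additionally handles zero-points, bit-packing, and block layout, so treat this as illustrative only:

```python
# Hedged sketch of RTN (round-to-nearest) int4 quantization with one
# per-block scale. Real ONNX quantization tooling is more involved
# (zero-points, packing, per-block grouping); this is a simplification.

def rtn_int4(weights):
    """Symmetric int4 RTN: map floats to integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # avoid div-by-zero
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequant(q, scale):
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]

w = [0.1, -0.7, 0.3, 0.02]
q, s = rtn_int4(w)
w_hat = dequant(q, s)  # close to w, up to rounding error
```

The rounding error introduced here is exactly the accuracy/size trade-off the acc-level settings tune.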
## Usage

### Installation and Setup

To use the Phi-3-small-128k-instruct ONNX model on Windows with DirectML, follow these steps:

1. **Create and activate a Conda environment:**
   ```sh
   conda create -n onnx python=3.10
   conda activate onnx
   ```

2. **Install Git LFS:**
   ```sh
   winget install -e --id GitHub.GitLFS
   ```

3. **Install the Hugging Face CLI:**
   ```sh
   pip install huggingface-hub[cli]
   ```

4. **Download the model:**
   ```sh
   huggingface-cli download EmbeddedLLM/Phi-3-small-128k-instruct-onnx --include="onnx/directml/*" --local-dir .\Phi-3-small-128k-instruct
   ```

5. **Install the necessary Python packages:**
   ```sh
   pip install numpy==1.26.4
   pip install onnxruntime-directml
   pip install --pre onnxruntime-genai-directml
   ```

6. **Install the Visual Studio 2015 runtime:**
   ```sh
   conda install conda-forge::vs2015_runtime
   ```

7. **Download the example script:**
   ```sh
   Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py" -OutFile "phi3-qa.py"
   ```

8. **Run the example script:**
   ```sh
   python phi3-qa.py -m .\Phi-3-small-128k-instruct
   ```

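The example script handles tokenization and the generation loop; conceptually, each question is wrapped in the Phi-3 family's chat template before being passed to the model. A minimal sketch of that formatting step is below; the exact `<|user|>`/`<|end|>`/`<|assistant|>` marker strings are an assumption drawn from the Phi-3 family's documented chat format, so verify them against this model's tokenizer configuration:

```python
# Hedged sketch of the chat formatting applied before generation.
# The marker tokens follow the Phi-3 family's documented chat format and
# are an assumption here; check the model's tokenizer config to confirm.

def format_prompt(user_input: str) -> str:
    """Wrap a raw question in Phi-3-style chat markers."""
    return f"<|user|>\n{user_input} <|end|>\n<|assistant|>"

prompt = format_prompt("What is DirectML?")
```

The model then generates tokens after the `<|assistant|>` marker until it emits an end-of-turn token.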
### Hardware Requirements

**Minimum Configuration:**
- **Windows:** DirectX 12-capable GPU (AMD/Nvidia/Intel) with a minimum of 10GB of combined RAM
- **CPU:** x86_64 / ARM64

**Tested Configurations:**
- **GPU:** AMD Ryzen 8000 Series iGPU (DirectML); NVIDIA RTX 4090 (DirectML)
- **CPU:** AMD Ryzen CPU

### Model Description

- **Developed by:** Microsoft
- **Model type:** ONNX
- **Language(s) (NLP):** Python, C, C++
- **License:** MIT
- **Model Description:** This is a conversion of the Phi-3-small-128k-instruct model for ONNX Runtime inference.

## Additional Details
- [**Phi-3 Small, Medium, and Vision Blog**](https://aka.ms/phi3_ONNXBuild24)
- [**Phi-3 Model Blog Link**](https://aka.ms/phi3blog-april)
- [**Phi-3 Model Card**](https://aka.ms/phi3-medium-4k-instruct)
- [**Phi-3 Technical Report**](https://aka.ms/phi3-tech-report)
- [**Phi-3 on Azure AI Studio**](https://aka.ms/phi3-azure-ai)

## License

The model is licensed under the [MIT license](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/LICENSE).

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.