Phi-3-small-128k-instruct ONNX

This repository hosts optimized versions of microsoft/Phi-3-small-128k-instruct to accelerate inference with DirectML and ONNX Runtime. Phi-3-Small-128K-Instruct is a lightweight, state-of-the-art open model developed by Microsoft, featuring 7B parameters.

Key Features:

  • Parameter Count: 7B
  • Tokenizer: Utilizes the tiktoken tokenizer for improved multilingual tokenization, with a vocabulary size of 100,352 tokens.
  • Context Length: Default context length of 128k tokens.

Attention Mechanism:

  • Implements grouped-query attention to minimize the KV cache footprint, with groups of 4 query heads sharing a single key/value head.
  • Uses alternating layers of dense attention and a novel blocksparse attention to further reduce KV cache usage while maintaining long-context retrieval performance.
  • Multilingual Capability: Trained with an additional 10% of multilingual data to improve performance across languages.
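The KV-cache saving from grouped-query attention can be illustrated with a back-of-the-envelope calculation. The layer count, head count, and head dimension below are illustrative assumptions for a 7B-class model, not published specs for this checkpoint:

```python
# Rough KV-cache sizing: grouped-query attention (GQA) vs. full
# multi-head attention (MHA). All dimensions are illustrative.

def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key tensor plus one value tensor per layer.
    return 2 * num_layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed configuration: 32 layers, 32 query heads, head_dim 128,
# fp16 cache, 8k tokens of context.
layers, q_heads, head_dim, seq = 32, 32, 128, 8192

mha = kv_cache_bytes(layers, q_heads, head_dim, seq)       # 1 KV head per query head
gqa = kv_cache_bytes(layers, q_heads // 4, head_dim, seq)  # 4 query heads share 1 KV head

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # 4.0 GiB
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # 1.0 GiB, i.e. 4x smaller
```

With a 4:1 query-to-KV-head ratio, the cache shrinks by exactly 4x, which is what makes 128k-token contexts tractable on consumer GPUs.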

ONNX Models

Here are some of the optimized configurations we have added:

  • ONNX model for int4 DirectML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
  • ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via RTN. Two versions are uploaded to balance latency against accuracy: Acc=1 targets improved accuracy, while Acc=4 targets improved performance. For mobile devices, we recommend the acc-level-4 model.
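Individual variants can also be fetched programmatically with the huggingface_hub Python API instead of the CLI. The subfolder naming below (cpu-int4-rtn-block-32-acc-level-N) is an assumption based on the convention used by the Phi-3 ONNX releases; verify it against the actual repository file listing before relying on it:

```python
def include_pattern(acc_level: int) -> str:
    """Build an --include-style glob for an int4 RTN CPU/mobile variant.

    The folder layout is assumed from the Phi-3 ONNX release convention
    and may differ in this repository.
    """
    return f"cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-{acc_level}/*"

if __name__ == "__main__":
    # Import here so the helper above stays dependency-free.
    from huggingface_hub import snapshot_download

    # Acc=4 trades a little accuracy for latency; recommended for mobile.
    snapshot_download(
        repo_id="EmbeddedLLM/Phi-3-small-128k-instruct-onnx",
        allow_patterns=[include_pattern(4)],
        local_dir="Phi-3-small-128k-instruct",
    )
```

`allow_patterns` plays the same role as the CLI's `--include` flag, so only the selected variant is downloaded.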

Usage

Installation and Setup

To use the Phi-3-small-128k-instruct ONNX model on Windows with DirectML, follow these steps:

  1. Create and activate a Conda environment:
conda create -n onnx python=3.10
conda activate onnx
  2. Install Git LFS:
winget install -e --id GitHub.GitLFS
  3. Install the Hugging Face CLI:
pip install huggingface-hub[cli]
  4. Download the model:
huggingface-cli download EmbeddedLLM/Phi-3-small-128k-instruct-onnx --include="onnx/directml/*" --local-dir .\Phi-3-small-128k-instruct
  5. Install the necessary Python packages:
pip install numpy==1.26.4
pip install onnxruntime-directml
pip install --pre onnxruntime-genai-directml
  6. Install the Visual Studio 2015 runtime:
conda install conda-forge::vs2015_runtime
  7. Download the example script:
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py" -OutFile "phi3-qa.py"
  8. Run the example script:
python phi3-qa.py -m .\Phi-3-small-128k-instruct
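The phi3-qa.py script wraps a token-by-token generation loop. A condensed sketch of the same flow is below; the generator interface has shifted between onnxruntime-genai releases, so treat the og.* calls as an approximation of the pattern in phi3-qa.py rather than a pinned API. The prompt wrapper follows the Phi-3 chat template:

```python
def format_prompt(user_text: str) -> str:
    """Wrap user input in the Phi-3 chat template."""
    return f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"

def main():
    # Deferred import: onnxruntime-genai is only needed to run the model.
    import onnxruntime_genai as og

    model = og.Model(r".\Phi-3-small-128k-instruct")
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    params.input_ids = tokenizer.encode(format_prompt("What is DirectML?"))

    # Stream tokens to stdout as they are generated.
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)

if __name__ == "__main__":
    main()
```

phi3-qa.py adds an interactive loop and configurable search options on top of this skeleton.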

Hardware Requirements

Minimum Configuration:

  • Windows: DirectX 12-capable GPU (AMD/Nvidia/Intel)
  • CPU: x86_64 / ARM64

Tested Configurations:

  • GPU: AMD Ryzen 8000 Series iGPU (DirectML)
  • CPU: AMD Ryzen CPU

Hardware Supported

The model has been tested on:

  • GPU SKU: RTX 4090 (DirectML)

Minimum Configuration Required:

  • Windows: DirectX 12-capable GPU and a minimum of 10GB of combined RAM
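The 10GB figure can be sanity-checked with rough arithmetic: int4 weights take about half a byte per parameter, and the KV cache, activations, and runtime overhead consume the rest, growing with context length. The numbers below are approximations, not measured values:

```python
# Back-of-the-envelope memory estimate for the int4 model.
params = 7e9                 # ~7B parameters, per the model card
weight_bytes = params * 0.5  # int4 ≈ 0.5 bytes/param (ignoring scales/zero-points)

print(f"int4 weights: ~{weight_bytes / 1e9:.1f} GB")  # ~3.5 GB
# The remaining headroom in a 10GB budget goes to the KV cache,
# activations, and runtime overhead, which grow with context length.
```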

Model Description

  • Developed by: Microsoft
  • Model type: ONNX
  • Language(s) (NLP): Python, C, C++
  • License: MIT
  • Model Description: This is a conversion of the Phi-3 Small 128K Instruct model for ONNX Runtime inference.

Additional Details

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
