---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
inference: false
base_model: OpenPipe/Qwen3-14B-Instruct
language:
- en
- zh
---

# Qwen3-14B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-14B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It is designed for **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, making it ideal for integration into **C# / .NET applications** using ONNX Runtime + DirectML.

---
## Model Details

- Base model: `OpenPipe/Qwen3-14B-Instruct`
- Quantization: INT4 (`MatMulNBits`)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, A750, 130V, etc.)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (via DirectML)

---
## Files

Main inference files:

- `model.onnx`
- `model.onnx.data` – INT4 weights (≈ 9 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`

---
## Usage in C# (DirectML)

The model is consumed through the `Microsoft.ML.OnnxRuntimeGenAI` API; for DirectML, install the `Microsoft.ML.OnnxRuntimeGenAI.DirectML` NuGet package.

Example (ONNX Runtime GenAI):
```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// Folder containing model.onnx, model.onnx.data and genai_config.json.
// The DirectML execution provider is selected via genai_config.json.
var modelPath = @"Qwen3-14B-Instruct-DirectML-INT4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until max_length or an end-of-sequence token.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
## Prompt Format

This model supports standard chat-style prompts and works well with Hermes-style system prompts and tool calling.

The included `chat_template.jinja` can be used to format multi-role conversations.
## Performance Notes

- INT4 quantization allows the 14B model to run on GPUs with roughly 12–16 GB of available memory (e.g. Arc 130V, RTX 3060, RX 6800).
- Throughput depends heavily on the DirectML backend and driver quality.
- First-token latency may be high due to one-time graph compilation.
## License & Attribution

**Base model:**

- Qwen3-14B-Instruct by Alibaba / OpenPipe
- License: see the original model card

**Conversion:**

- ONNX export and INT4 quantization performed by Wekkel using Microsoft Olive.
- This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.