---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
inference: false
base_model: Qwen/Qwen3-14B-Instruct
language:
- en
- zh
---

# Qwen3-14B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-14B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It is designed for **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server. It is ideal for integration into **C# / .NET applications** using ONNX Runtime + DirectML.

---

## Model Details

- Base model: `OpenPipe/Qwen3-14B-Instruct`
- Quantization: INT4 (MatMul NBits)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, A750, 130V, etc.)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (via DirectML)

---

## Files

Main inference files:

- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 9 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`

---

## Usage in C# (DirectML)

Example (ONNX Runtime GenAI). Note that the execution provider (DirectML) is selected via the bundled `genai_config.json`, not in C# code:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-14B-Instruct-DirectML-INT4";

// The DirectML provider is configured in genai_config.json.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

using var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until the model finishes.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```

## Prompt Format

This model supports standard chat-style prompts and works well with Hermes-style system prompts and tool calling.
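For reference, Qwen instruct models follow the ChatML convention; a formatted single-turn prompt (roughly what the bundled chat template produces) looks like the sketch below. The exact special tokens should always be taken from `chat_template.jinja` rather than hard-coded:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what a Dutch mortgage deed is.<|im_end|>
<|im_start|>assistant
```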
The included `chat_template.jinja` can be used to format multi-role conversations.

---

## Performance Notes

INT4 quantization allows the 14B model to run on:

- GPUs with roughly 16 GB of VRAM or shared GPU memory (e.g. Intel Arc 130V, NVIDIA RTX 3060, AMD RX 6800)

Throughput depends heavily on the DirectML backend and driver quality. First-token latency may be high due to graph compilation.

---

## License & Attribution

- Base model: Qwen3-14B-Instruct by Alibaba / OpenPipe
- License: see the original model card
- Conversion: ONNX export and INT4 quantization performed by Wekkel using Microsoft Olive

This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.
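As a sketch of how provider selection works: the execution provider is declared in `genai_config.json` rather than in application code, so the same C# program can target different backends by editing the config. The exact field names vary between onnxruntime-genai versions, but a DirectML configuration typically contains a fragment along these lines:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```

If first-token latency from graph compilation is a concern, keeping the `Model` instance alive for the lifetime of the application (rather than reloading per request) avoids paying that cost repeatedly.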