---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
inference: false
base_model: Qwen/Qwen3-14B-Instruct
language:
- en
- zh
---

# Qwen3-14B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-14B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It is designed for **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server. It is ideal for integration into **C# / .NET applications** using ONNX Runtime + DirectML.

---

## Model Details

- Base model: `OpenPipe/Qwen3-14B-Instruct`
- Quantization: INT4 (MatMul NBits)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, A750, 130V, etc.)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (via DirectML)

---

## Files

Main inference files:

- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 9 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`

---

## Usage in C# (DirectML)

Example (ONNX Runtime GenAI). Note that the execution provider (DirectML) is selected via the bundled `genai_config.json`, not in C# code:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-14B-Instruct-DirectML-INT4";

// The DirectML provider is configured in genai_config.json.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

using var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until the model finishes.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```

## Prompt Format

This model supports standard chat-style prompts and works well with Hermes-style system prompts and tool calling.
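For reference, Qwen instruct models follow the ChatML convention; a formatted single-turn prompt (roughly what the bundled chat template produces) looks like the sketch below. The exact special tokens should always be taken from `chat_template.jinja` rather than hard-coded:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what a Dutch mortgage deed is.<|im_end|>
<|im_start|>assistant
```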
The included `chat_template.jinja` can be used to format multi-role conversations.

---

## Performance Notes

INT4 quantization allows the 14B model to run on:

- GPUs with roughly 16 GB of VRAM or shared GPU memory (e.g. Intel Arc 130V, NVIDIA RTX 3060, AMD RX 6800)

Throughput depends heavily on the DirectML backend and driver quality. First-token latency may be high due to graph compilation.

---

## License & Attribution

- Base model: Qwen3-14B-Instruct by Alibaba / OpenPipe
- License: see the original model card
- Conversion: ONNX export and INT4 quantization performed by Wekkel using Microsoft Olive

This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.
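As a sketch of how provider selection works: the execution provider is declared in `genai_config.json` rather than in application code, so the same C# program can target different backends by editing the config. The exact field names vary between onnxruntime-genai versions, but a DirectML configuration typically contains a fragment along these lines:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```

If first-token latency from graph compilation is a concern, keeping the `Model` instance alive for the lifetime of the application (rather than reloading per request) avoids paying that cost repeatedly.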