---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
inference: false
base_model: OpenPipe/Qwen3-14B-Instruct
language:
- en
- zh
---
# Qwen3-14B-Instruct – DirectML INT4 (ONNX Runtime)
This repository provides **Qwen3-14B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.
It is designed for **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server.
Ideal for integration into **C# / .NET applications** via ONNX Runtime + DirectML.
---
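A typical setup pulls the DirectML build of the ONNX Runtime GenAI package from NuGet and the model files from this repository. A sketch (the `<repo-id>` placeholder stands for this repository's id, which is not filled in here):

```shell
# Add the DirectML-enabled ONNX Runtime GenAI package to a .NET project
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML

# Download the model files (replace <repo-id> with this repository's id)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <repo-id> --local-dir Qwen3-14B-Instruct-DirectML-INT4
```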
## Model Details
- Base model: `OpenPipe/Qwen3-14B-Instruct`
- Quantization: INT4 (MatMul NBits)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, A750, 130V, etc.)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (via DirectML)
---
## Files
Main inference files:
- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 9 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`
---
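The `genai_config.json` file is what binds the model to DirectML: ONNX Runtime GenAI reads the execution-provider choice from it rather than from application code. In Olive DirectML exports, the relevant portion typically looks like the sketch below (field names can vary between onnxruntime-genai versions; illustrative only):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```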
## Usage in C# (DirectML)
Example (ONNX Runtime GenAI):
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// DirectML is selected via the genai_config.json shipped with the model,
// so no execution provider needs to be configured in code.
var modelPath = @"Qwen3-14B-Instruct-DirectML-INT4";
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate tokens until max_length or an end-of-sequence token is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
---
## Prompt Format
This model supports standard chat-style prompts and works well with Hermes-style system prompts and tool calling.
The included `chat_template.jinja` can be used to format multi-role conversations.
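For reference, Qwen-family chat models follow the ChatML convention, which is the layout `chat_template.jinja` encodes. A minimal hand-rolled sketch of a single-turn prompt (the helper name is illustrative; in practice, prefer applying the included template):

```csharp
using System;

// Illustrative ChatML layout for one system + user turn, ending with the
// assistant header so the model continues from there.
static string BuildChatPrompt(string systemPrompt, string userMessage) =>
    "<|im_start|>system\n" + systemPrompt + "<|im_end|>\n" +
    "<|im_start|>user\n" + userMessage + "<|im_end|>\n" +
    "<|im_start|>assistant\n";

Console.WriteLine(BuildChatPrompt("You are a helpful assistant.", "Hello!"));
```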
---
## Performance Notes
- INT4 quantization allows the 14B model to run on 16 GB VRAM GPUs (Arc 130V, RTX 3060, RX 6800).
- Throughput depends heavily on the DirectML backend and driver quality.
- First-token latency may be high due to graph compilation.
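Given the first-token latency noted above, streaming output token by token gives users earlier feedback. The GenAI API exposes a `TokenizerStream` for incremental decoding; a sketch under the same model-folder assumption as the earlier example:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model(@"Qwen3-14B-Instruct-DirectML-INT4");
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Decode and print each token as soon as it is generated.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var seq = generator.GetSequence(0);
    Console.Write(tokenizerStream.Decode(seq[seq.Length - 1]));
}
```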
---
## License & Attribution
**Base model:**
- Qwen3-14B-Instruct by Alibaba / OpenPipe
- License: see the original model card

**Conversion:**
- ONNX export and INT4 quantization performed by Wekkel using Microsoft Olive.
- This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.