---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
inference: false
base_model: OpenPipe/Qwen3-14B-Instruct
language:
- en
- zh
---
# Qwen3-14B-Instruct – DirectML INT4 (ONNX Runtime)
This repository provides **Qwen3-14B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.
It is designed for **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server.
Ideal for integration into **C# / .NET applications** via ONNX Runtime + DirectML.
---
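A typical setup pulls the DirectML build of the ONNX Runtime GenAI package from NuGet and the model files from this repository. A sketch (the `<repo-id>` placeholder stands for this repository's id, which is not filled in here):

```shell
# Add the DirectML-enabled ONNX Runtime GenAI package to a .NET project
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML

# Download the model files (replace <repo-id> with this repository's id)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <repo-id> --local-dir Qwen3-14B-Instruct-DirectML-INT4
```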
## Model Details
- Base model: `OpenPipe/Qwen3-14B-Instruct`
- Quantization: INT4 (MatMul NBits)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, A750, 130V, etc.)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (via DirectML)
---
## Files
Main inference files:
- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 9 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`
---
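The `genai_config.json` file is what binds the model to DirectML: ONNX Runtime GenAI reads the execution-provider choice from it rather than from application code. In Olive DirectML exports, the relevant portion typically looks like the sketch below (field names can vary between onnxruntime-genai versions; illustrative only):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```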
## Usage in C# (DirectML)
Example (ONNX Runtime GenAI):
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// DirectML is selected via the genai_config.json shipped with the model,
// so no execution provider needs to be configured in code.
var modelPath = @"Qwen3-14B-Instruct-DirectML-INT4";
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate tokens until max_length or an end-of-sequence token is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
---
## Prompt Format
This model supports standard chat-style prompts and works well with Hermes-style system prompts and tool calling.
The included `chat_template.jinja` can be used to format multi-role conversations.
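For reference, Qwen-family chat models follow the ChatML convention, which is the layout `chat_template.jinja` encodes. A minimal hand-rolled sketch of a single-turn prompt (the helper name is illustrative; in practice, prefer applying the included template):

```csharp
using System;

// Illustrative ChatML layout for one system + user turn, ending with the
// assistant header so the model continues from there.
static string BuildChatPrompt(string systemPrompt, string userMessage) =>
    "<|im_start|>system\n" + systemPrompt + "<|im_end|>\n" +
    "<|im_start|>user\n" + userMessage + "<|im_end|>\n" +
    "<|im_start|>assistant\n";

Console.WriteLine(BuildChatPrompt("You are a helpful assistant.", "Hello!"));
```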
---
## Performance Notes
- INT4 quantization allows the 14B model to run on 16 GB VRAM GPUs (Arc 130V, RTX 3060, RX 6800).
- Throughput depends heavily on the DirectML backend and driver quality.
- First-token latency may be high due to graph compilation.
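Given the first-token latency noted above, streaming output token by token gives users earlier feedback. The GenAI API exposes a `TokenizerStream` for incremental decoding; a sketch under the same model-folder assumption as the earlier example:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model(@"Qwen3-14B-Instruct-DirectML-INT4");
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

var sequences = tokenizer.Encode("Explain what a Dutch mortgage deed is.");
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Decode and print each token as soon as it is generated.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var seq = generator.GetSequence(0);
    Console.Write(tokenizerStream.Decode(seq[seq.Length - 1]));
}
```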
---
## License & Attribution
**Base model:**
- Qwen3-14B-Instruct by Alibaba / OpenPipe
- License: see the original model card

**Conversion:**
- ONNX export and INT4 quantization performed by Wekkel using Microsoft Olive.
- This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.