TensorRT-LLM — Native Windows Build for RTX 40-Series (Ada / SM89)

A native Windows x64 build of NVIDIA TensorRT-LLM — no WSL, no Docker, no compatibility layer. Built for raw inference speed on consumer RTX 40-series GPUs, plus a ready-to-serve prebuilt INT4 engine.

⚠️ HARDWARE: RTX 40-series (Ada / SM89) ONLY — read this first

Built and tested on an RTX 4060 (Ada Lovelace, SM 8.9). The prebuilt tensorrt_llm.dll and the prebuilt engine are compiled for SM89. They will not run on:

RTX 30-series (Ampere / SM86) or older
RTX 50-series (Blackwell / SM120) or newer
Any non-Ada NVIDIA GPU

To run on a different architecture you must rebuild TRT-LLM for your SM and rebuild the engine — see patches/ + BUILD_README.md.
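To check a machine against the hardware note above, the following is a minimal sketch that reads each GPU's compute capability from `nvidia-smi`. It assumes a driver recent enough to support the `compute_cap` query field; the function names are illustrative and not part of this repo.

```python
import subprocess

REQUIRED_SM = (8, 9)  # Ada Lovelace (SM89), per the hardware note above


def parse_compute_caps(csv_text):
    """Parse 'name, compute_cap' CSV rows into (name, (major, minor)) tuples."""
    gpus = []
    for row in csv_text.strip().splitlines():
        # Split on the last comma so a GPU name containing commas stays intact.
        name, cap = (field.strip() for field in row.rsplit(",", 1))
        major, minor = cap.split(".")
        gpus.append((name, (int(major), int(minor))))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_caps(result.stdout)


def is_supported(cap):
    """True only for the exact SM the prebuilt binaries target."""
    return cap == REQUIRED_SM
```

Calling `query_gpus()` and passing each capability to `is_supported` shows whether the prebuilt DLL and engine are candidates for that GPU, or whether a rebuild for a different SM is needed.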

⚠️ Disclaimer — please read

This is a passion project and proof of concept, built with AI assistance. I'm not a professional developer. It works and it's genuinely fast, but it is not a supported product and may or may not be maintained. Anyone is free to fork it or build on it. Not affiliated with or endorsed by NVIDIA.

What this is

On a Windows machine, TensorRT-LLM normally requires WSL2 or a separate Linux install. This is a from-source native Windows build (TRT-LLM is Apache-2.0, so freely redistributable) that runs directly on Windows 11, plus a prebuilt INT4 engine ready to serve.

Built/tested on: Windows 11 Pro x64 · RTX 4060 (Ada, SM89) · CUDA 13.2 · MSVC 14.44 / clang-cl · TensorRT-LLM 1.3.x

Requirements

Windows 11 x64
RTX 40-series GPU (Ada / SM89) — see the hardware warning above
NVIDIA CUDA 13.2 + TensorRT installed — you obtain these from NVIDIA. This repo does not bundle NVIDIA's closed libraries; tensorrt_llm.dll links them at runtime.
~5–6 GB free VRAM for the 4B INT4 model at 16k context
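As a rough sanity check on the VRAM figure above, here is a back-of-the-envelope estimate of weights plus KV cache. The model shape (about 4e9 parameters, 36 layers, 8 KV heads, head dimension 128) and the FP16 KV cache are assumptions about Qwen3-4B, not values taken from this repo, and the estimate ignores activations, workspace, and the CUDA context.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the quantized weights (scales and zeros ignored)."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes for the KV cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed model shape (not read from engine/config.json).
N_PARAMS = 4.0e9
N_LAYERS = 36
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 16 * 1024  # 16k context, batch 1

weights = weight_bytes(N_PARAMS, bits_per_weight=4)
kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"weights : {weights / GIB:.2f} GiB")
print(f"kv cache: {kv / GIB:.2f} GiB")
print(f"subtotal: {(weights + kv) / GIB:.2f} GiB (before runtime overhead)")
```

Under these assumptions the subtotal lands a little above 4 GiB, which leaves the remainder of the stated budget for runtime overhead.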

What's in here

Path	Contents
`dll/tensorrt_llm.dll`	The native Windows TensorRT-LLM runtime DLL (SM89)
`dll/tensorrt_llm.lib`	Import library
`engine/rank0.engine`	Prebuilt INT4 engine — Josiefied-Qwen3-4B (abliterated Qwen3-4B), 16k context, batch 1, ready to serve
`engine/config.json`	Engine build config
`patches/`	Source patches that make TRT-LLM compile natively on Windows (CCCL/C++20 fix, NUMA / GDRCopy / ifstream stubs, ninja `ccbin` fixes, vcvars wrapper, ODR fix, etc.)
`scripts/`	Serving + benchmark scripts (int4 / awq / fp8 / gptq / medusa)
`BUILD_README.md`	The full build log — every issue hit and how it was solved
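After downloading, the layout in the table above can be checked with a short script. This is only a sketch: it tests that each listed path exists and does not validate file contents. The function name is illustrative.

```python
from pathlib import Path

# Paths listed in the table above, relative to the repo root.
EXPECTED = [
    "dll/tensorrt_llm.dll",
    "dll/tensorrt_llm.lib",
    "engine/rank0.engine",
    "engine/config.json",
    "patches",
    "scripts",
    "BUILD_README.md",
]


def missing_entries(repo_root):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from `missing_entries(".")` means every path in the table is present. Large files such as the engine may be stored separately by the hosting platform, so a missing entry can also indicate an incomplete download.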

Status — what works vs what doesn't

✅ AOT engine + C++ runtime (the fast path) — PROVEN. Build the engine ahead of time, load it through the DLL runtime, get raw-speed inference. This is the point of the repo.
⚠️ JIT Python backend / server scripts — EXPERIMENTAL / INCOMPLETE. The start_server_*.py scripts were a work in progress and were not fully finished or verified. Included for reference only — don't expect them to just work.
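For the AOT path, a useful first check is whether tensorrt_llm.dll and its CUDA / TensorRT dependencies resolve at all. The sketch below tries to load the DLL with Python's `ctypes`; it does not call into the runtime. The `extra_dirs` parameter is a hypothetical convenience for passing the CUDA and TensorRT library directories of your own install, since Python 3.8+ does not search `PATH` for dependent DLLs.

```python
import ctypes
import os
import sys
from pathlib import Path


def try_load_dll(dll_path, extra_dirs=()):
    """Return (ok, message) after attempting to load a DLL on Windows."""
    path = Path(dll_path)
    if not path.is_file():
        return False, f"not found: {path}"
    if sys.platform != "win32":
        return False, "this check only applies on Windows"
    handles = []
    try:
        # Register directories holding dependent DLLs (CUDA, TensorRT).
        for directory in extra_dirs:
            handles.append(os.add_dll_directory(str(Path(directory).resolve())))
        ctypes.WinDLL(str(path.resolve()))
        return True, "loaded"
    except OSError as exc:
        # Typically a missing dependency or an architecture mismatch.
        return False, f"load failed: {exc}"
    finally:
        for handle in handles:
            handle.close()
```

A `load failed` result usually points at a missing or mismatched NVIDIA library rather than at the DLL itself, so the error text is worth reading before rebuilding anything.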

Using the prebuilt engine

engine/rank0.engine is a ready-to-serve TensorRT-LLM engine for Josiefied-Qwen3-4B-abliterated-v1 (INT4, 16k context, batch 1). Load it with a TensorRT-LLM runtime built against the included tensorrt_llm.dll. See scripts/ for serving examples, keeping in mind that the start_server_*.py scripts are experimental (see Status above).

⚠️ This engine is an abliterated (uncensored) model. It's built from Josiefied-Qwen3-4B-abliterated-v1, not stock Qwen3-4B. If you want a stock/aligned model, build your own engine from the base weights using patches/ and BUILD_README.md.
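Before serving, it can help to confirm that engine/config.json matches what this README states (INT4, 16k context, batch 1). The sketch below reads a few fields from it. The key names (`pretrained_config`, `build_config`, `quantization`, `quant_algo`, `max_batch_size`, `max_seq_len`) are assumptions about the config layout written by TensorRT-LLM; missing keys simply come back as `None`.

```python
import json
from pathlib import Path


def summarize_engine_config(config_path):
    """Pull a few build settings out of an engine config.json, if present."""
    cfg = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Assumed top-level sections; fall back to empty dicts when absent or null.
    pretrained = cfg.get("pretrained_config") or {}
    build = cfg.get("build_config") or {}
    quant = pretrained.get("quantization") or {}
    return {
        "architecture": pretrained.get("architecture"),
        "quant_algo": quant.get("quant_algo"),
        "max_batch_size": build.get("max_batch_size"),
        "max_seq_len": build.get("max_seq_len"),
    }
```

Calling `summarize_engine_config("engine/config.json")` returns a small dict to compare against the README. If the fields are all `None`, the config layout differs from what this sketch assumes and the file is best read directly.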

Building from source

Everything needed to rebuild TRT-LLM natively on Windows (for your own SM) is in patches/, documented step-by-step in BUILD_README.md.
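When rebuilding for another SM, the GPU's compute capability has to be turned into an architecture value for the build. The sketch below produces the form that CMake's `CMAKE_CUDA_ARCHITECTURES` accepts (for example `89-real`). How the steps in BUILD_README.md actually pass this value is not stated here, so treat this helper as an illustration rather than as part of the documented build.

```python
def cmake_cuda_arch(compute_cap, real_only=True):
    """Turn a compute capability such as '8.9' into a CMake arch such as '89-real'."""
    major, minor = compute_cap.strip().split(".")
    arch = f"{int(major)}{int(minor)}"
    # '-real' asks for device code for that exact SM only, without PTX.
    return f"{arch}-real" if real_only else arch
```

For the GPUs named in the hardware warning, this gives `86-real` for Ampere (8.6), `89-real` for Ada (8.9), and `120-real` for Blackwell (12.0).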

Credits / attribution

NVIDIA TensorRT-LLM — Apache-2.0 · https://github.com/NVIDIA/TensorRT-LLM
Qwen3 — Apache-2.0 (Alibaba). The engine is the abliterated Josiefied-Qwen3-4B.
Built with AI assistance (Claude + GitHub Copilot).

License

Apache-2.0 (matching upstream TensorRT-LLM) — see LICENSE. NVIDIA CUDA / TensorRT and the Qwen3 weights are covered by their own respective licenses.

Model tree for xThr45hx/TensorRT-LLM-Windows-RTX40

Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B
Finetuned: Goekdeniz-Guelmez/Josiefied-Qwen3-4B-abliterated-v1
Finetuned: this model