TensorRT-LLM: Native Windows Build for RTX 40-Series (Ada / SM89)

A native Windows x64 build of NVIDIA TensorRT-LLM: no WSL, no Docker, no compatibility layer. Built for raw inference speed on consumer RTX 40-series GPUs, plus a ready-to-serve prebuilt INT4 engine.

⚠️ HARDWARE: RTX 40-series (Ada / SM89) ONLY (read this first)

Built and tested on an RTX 4060 (Ada Lovelace, SM 8.9). The prebuilt tensorrt_llm.dll and the prebuilt engine are compiled for SM89. They will not run on:

  • RTX 30-series (Ampere / SM86) or older
  • RTX 50-series (Blackwell / SM120) or newer
  • Any non-Ada NVIDIA GPU

To run on a different architecture you must rebuild TRT-LLM for your SM and rebuild the engine; see patches/ and BUILD_README.md.
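A quick way to check whether a machine matches the SM89 requirement is to ask the driver for each GPU's compute capability. The sketch below is not part of this repo. It assumes nvidia-smi is on PATH and that the installed driver supports the compute_cap query field; the function names are made up for illustration.

```python
import subprocess

# SM89 (Ada), the only architecture the prebuilt DLL and engine target.
REQUIRED_CC = (8, 9)


def parse_compute_caps(text):
    """Parse nvidia-smi output with one 'major.minor' value per GPU line."""
    caps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, _, minor = line.partition(".")
        caps.append((int(major), int(minor or 0)))
    return caps


def query_compute_caps():
    """Run nvidia-smi and return a list of (major, minor) tuples."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_caps(result.stdout)


def is_supported(cap):
    """True only for an exact SM89 match; anything else needs a rebuild."""
    return cap == REQUIRED_CC


def report():
    """Print one line per GPU saying whether the prebuilt binaries apply."""
    for index, cap in enumerate(query_compute_caps()):
        status = "OK" if is_supported(cap) else "rebuild needed for this SM"
        print(f"GPU {index}: SM {cap[0]}.{cap[1]} -> {status}")
```

Calling report() from a Python prompt prints the result for every installed GPU. The check is an exact match on purpose, since the warning above rules out both older and newer architectures.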

⚠️ Disclaimer (please read)

This is a passion-project / proof-of-concept, built with AI assistance. I'm not a professional developer. It works and it's genuinely fast, but it is not a supported product, and it may or may not be maintained. Anyone is free to build from it or fork it. Not affiliated with or endorsed by NVIDIA.

What this is

Running TensorRT-LLM on a Windows machine normally means going through WSL2 or switching to Linux. This is a from-source native Windows build (TRT-LLM is Apache-2.0, so freely redistributable) that runs directly on Windows 11, plus a prebuilt INT4 engine ready to serve.

Built/tested on: Windows 11 Pro x64 · RTX 4060 (Ada, SM89) · CUDA 13.2 · MSVC 14.44 / clang-cl · TensorRT-LLM 1.3.x

Requirements

  • Windows 11 x64
  • RTX 40-series GPU (Ada / SM89); see the hardware warning above
  • NVIDIA CUDA 13.2 + TensorRT installed (you obtain these from NVIDIA). This repo does not bundle NVIDIA's closed libraries; tensorrt_llm.dll links them at runtime.
  • ~5–6 GB free VRAM for the 4B INT4 model at 16k context
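The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic: quantized weights plus a KV cache for the full context. The sketch below is an estimate, not a measurement from this repo. The layer count, KV-head count, and head dimension are assumed values for Qwen3-4B, and the FP16 KV cache is also an assumption; check them against engine/config.json and the model's own config before relying on the numbers.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Size of the weights alone, ignoring quantization scales and zero points."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_elem=2, batch=1):
    """KV cache size; the leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens * batch


# Assumed shape for Qwen3-4B (not taken from this repo).
N_PARAMS = 4e9          # nominal "4B" parameter count
N_LAYERS = 36           # assumption
N_KV_HEADS = 8          # assumption (grouped-query attention)
HEAD_DIM = 128          # assumption
CONTEXT = 16 * 1024     # "16k context", read here as 16384 tokens

weights = weight_bytes(N_PARAMS, bits_per_weight=4)
kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT)  # FP16 cache assumed

print(f"weights : {weights / GIB:.2f} GiB")
print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"subtotal: {(weights + kv) / GIB:.2f} GiB")
```

Under these assumptions the subtotal lands somewhat below the stated 5–6 GB. The gap is plausibly taken up by items the sketch ignores, such as activations, TensorRT workspace, the CUDA context, and any tensors kept at higher precision.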

What's in here

Path                    What
dll/tensorrt_llm.dll    The native Windows TensorRT-LLM runtime DLL (SM89)
dll/tensorrt_llm.lib    Import library
engine/rank0.engine     Prebuilt INT4 engine: Josiefied-Qwen3-4B (abliterated Qwen3-4B), 16k context, batch 1, ready to serve
engine/config.json      Engine build config
patches/                Source patches that make TRT-LLM compile natively on Windows (CCCL/C++20 fix, NUMA / GDRCopy / ifstream stubs, ninja ccbin fixes, vcvars wrapper, ODR fix, etc.)
scripts/                Serving + benchmark scripts (int4 / awq / fp8 / gptq / medusa)
BUILD_README.md         The full build log: every issue hit and how it was solved
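After downloading, it can help to confirm that the files from the listing above actually landed where expected. This is a minimal sketch, not a script shipped in the repo; the helper name missing_files is hypothetical, and only the paths named in the listing are checked.

```python
from pathlib import Path

# Relative paths taken from the file listing above.
EXPECTED_FILES = [
    "dll/tensorrt_llm.dll",
    "dll/tensorrt_llm.lib",
    "engine/rank0.engine",
    "engine/config.json",
    "BUILD_README.md",
]


def missing_files(repo_root):
    """Return the expected paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from missing_files(".") means the layout is complete. It says nothing about whether the DLL can find the CUDA and TensorRT libraries at load time.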

Status β€” what works vs what doesn't

  • ✅ AOT engine + C++ runtime (the fast path): PROVEN. Build the engine ahead of time, load it through the DLL runtime, get raw-speed inference. This is the point of the repo.
  • ⚠️ JIT Python backend / server scripts: EXPERIMENTAL / INCOMPLETE. The start_server_*.py scripts were a work in progress and were not fully finished or verified. Included for reference only; don't expect them to just work.

Using the prebuilt engine

engine/rank0.engine is a ready-to-serve TensorRT-LLM engine for Josiefied-Qwen3-4B-abliterated-v1 (INT4, 16k context, batch 1). Point a TensorRT-LLM runtime built against the included tensorrt_llm.dll at it. See scripts/ for serving examples, keeping in mind that those scripts are experimental (see the status section above).
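Before pointing a runtime at the engine, it can be useful to read engine/config.json and confirm the limits it was built with. The sketch below assumes the usual TensorRT-LLM config layout with build_config and pretrained_config sections; those key names are assumptions and were not verified against this repo's file, so missing keys are reported as None rather than raising.

```python
import json
from pathlib import Path


def load_engine_config(path="engine/config.json"):
    """Read the engine's JSON config from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_engine_config(config):
    """Pull out the fields that constrain serving; absent keys become None."""
    build = config.get("build_config") or {}
    pretrained = config.get("pretrained_config") or {}
    quant = pretrained.get("quantization") or {}
    return {
        "max_seq_len": build.get("max_seq_len"),
        "max_batch_size": build.get("max_batch_size"),
        "quant_algo": quant.get("quant_algo"),
    }
```

For example, summarize_engine_config(load_engine_config()) run from the repo root returns a small dict. If the values disagree with the 16k context and batch 1 described above, the key names assumed here probably do not match the file.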

⚠️ This engine is an abliterated (uncensored) model. It's built from Josiefied-Qwen3-4B-abliterated-v1, not stock Qwen3-4B. If you want a stock/aligned model, build your own engine from the base weights using patches/ and BUILD_README.md.

Building from source

Everything needed to rebuild TRT-LLM natively on Windows (for your own SM) is in patches/, documented step-by-step in BUILD_README.md.

Credits / attribution

  • NVIDIA TensorRT-LLM: Apache-2.0 · https://github.com/NVIDIA/TensorRT-LLM
  • Qwen3: Apache-2.0 (Alibaba). The engine is the abliterated Josiefied-Qwen3-4B.
  • Built with AI assistance (Claude + GitHub Copilot).

License

Apache-2.0 (matching upstream TensorRT-LLM); see LICENSE. NVIDIA CUDA / TensorRT and the Qwen3 weights are covered by their own respective licenses.

