TensorRT-LLM: Native Windows Build for RTX 40-Series (Ada / SM89)
A native Windows x64 build of NVIDIA TensorRT-LLM: no WSL, no Docker, no compatibility layer. Built for raw inference speed on consumer RTX 40-series GPUs, plus a ready-to-serve prebuilt INT4 engine.
⚠️ HARDWARE: RTX 40-series (Ada / SM89) ONLY. Read this first
Built and tested on an RTX 4060 (Ada Lovelace, SM 8.9). The prebuilt tensorrt_llm.dll and the prebuilt engine are compiled for SM89. They will not run on:
- RTX 30-series (Ampere / SM86) or older
- RTX 50-series (Blackwell / SM120) or newer
- Any non-Ada NVIDIA GPU
To run on a different architecture you must rebuild TRT-LLM for your SM and rebuild the engine; see patches/ and BUILD_README.md.
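A quick way to check whether a machine matches the SM89 requirement is to ask nvidia-smi for each GPU's compute capability. The sketch below is not part of this repo; it assumes nvidia-smi is on PATH and that the installed driver supports the compute_cap query field. The helper names are hypothetical.

```python
import subprocess

REQUIRED_SM = "8.9"  # Ada Lovelace (SM89), per the hardware warning above


def parse_compute_caps(csv_text):
    """Parse 'name, compute_cap' CSV rows into (name, cap) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas survive.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, cap))
    return gpus


def is_supported(cap):
    """True only for the architecture the prebuilt binaries target."""
    return cap == REQUIRED_SM


def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_caps(result.stdout)


def report():
    """Print one line per GPU saying whether the prebuilt files apply."""
    for name, cap in query_gpus():
        status = "OK" if is_supported(cap) else "rebuild needed for this SM"
        print(f"{name}: compute capability {cap} -> {status}")
```

Calling report() on the target machine prints one line per GPU; anything other than 8.9 means the prebuilt DLL and engine do not apply and a rebuild is needed.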
⚠️ Disclaimer: please read
This is a passion-project / proof-of-concept, built with AI assistance. I'm not a professional developer. It works and it's genuinely fast, but it is not a supported product, and it may or may not be maintained. Anyone is free to build from it or fork it. Not affiliated with or endorsed by NVIDIA.
What this is
On Windows, TensorRT-LLM normally requires WSL2 (or a separate Linux install). This is a from-source native Windows build (TRT-LLM is Apache-2.0, so freely redistributable) that runs directly on Windows 11, plus a prebuilt INT4 engine ready to serve.
Built/tested on: Windows 11 Pro x64 · RTX 4060 (Ada, SM89) · CUDA 13.2 · MSVC 14.44 / clang-cl · TensorRT-LLM 1.3.x
Requirements
- Windows 11 x64
- RTX 40-series GPU (Ada / SM89); see the hardware warning above
- NVIDIA CUDA 13.2 + TensorRT installed. You obtain these from NVIDIA. This repo does not bundle NVIDIA's closed libraries; tensorrt_llm.dll links them at runtime.
- ~5–6 GB free VRAM for the 4B INT4 model at 16k context
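The VRAM figure can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate, not a measurement from this build. It assumes roughly 4 billion weights at 4 bits each, an FP16 KV cache, "16k" meaning 16384 tokens, and the layer shape published for the base Qwen3-4B model (36 layers, 8 KV heads, head dimension 128). None of these values are read from the shipped engine.

```python
GIB = 1024 ** 3


def weight_bytes(params, bits_per_weight=4):
    """Storage for the quantized weights alone (ignores scales and embeddings)."""
    return params * bits_per_weight // 8


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values (the factor 2) for every layer, KV head and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed values; see the lead-in for where each one comes from.
weights = weight_bytes(4_000_000_000)
kv = kv_cache_bytes(tokens=16 * 1024, layers=36, kv_heads=8, head_dim=128)

print(f"weights  ~{weights / GIB:.2f} GiB")
print(f"KV cache ~{kv / GIB:.2f} GiB")
print(f"subtotal ~{(weights + kv) / GIB:.2f} GiB before activations and runtime overhead")
```

Under these assumptions the weights come to about 1.86 GiB and the KV cache to 2.25 GiB. Quantization scales, unquantized layers, activations and runtime workspace add to that, which is broadly in line with the ~5–6 GB stated above.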
What's in here
| Path | What |
|---|---|
| dll/tensorrt_llm.dll | The native Windows TensorRT-LLM runtime DLL (SM89) |
| dll/tensorrt_llm.lib | Import library |
| engine/rank0.engine | Prebuilt INT4 engine: Josiefied-Qwen3-4B (abliterated Qwen3-4B), 16k context, batch 1, ready to serve |
| engine/config.json | Engine build config |
| patches/ | Source patches that make TRT-LLM compile natively on Windows (CCCL/C++20 fix, NUMA / GDRCopy / ifstream stubs, ninja ccbin fixes, vcvars wrapper, ODR fix, etc.) |
| scripts/ | Serving + benchmark scripts (int4 / awq / fp8 / gptq / medusa) |
| BUILD_README.md | The full build log: every issue hit and how it was solved |
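After cloning or downloading the repo, a short check against the layout listed above can catch an incomplete download (for example, a large engine file that was skipped). This is a minimal sketch. The paths come from the table, while the helper name and the idea of running it from a local clone are assumptions.

```python
from pathlib import Path

# Paths taken from the table above, relative to the repo root.
EXPECTED = [
    "dll/tensorrt_llm.dll",
    "dll/tensorrt_llm.lib",
    "engine/rank0.engine",
    "engine/config.json",
    "patches",
    "scripts",
    "BUILD_README.md",
]


def missing_paths(repo_root):
    """Return the expected entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

For example, missing_paths(".") run from the clone's root returns an empty list when everything is in place.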
Status: what works vs what doesn't
- ✅ AOT engine + C++ runtime (the fast path): PROVEN. Build the engine ahead of time, load it through the DLL runtime, get raw-speed inference. This is the point of the repo.
- ⚠️ JIT Python backend / server scripts: EXPERIMENTAL / INCOMPLETE. The start_server_*.py scripts were a work in progress and were not fully finished or verified. Included for reference only; don't expect them to just work.
Using the prebuilt engine
engine/rank0.engine is a ready-to-serve TensorRT-LLM engine for Josiefied-Qwen3-4B-abliterated-v1 (INT4, 16k context, batch 1). Point a TensorRT-LLM runtime built against the included tensorrt_llm.dll at it. See scripts/ for serving examples.
⚠️ This engine is an abliterated (uncensored) model. It's built from Josiefied-Qwen3-4B-abliterated-v1, not stock Qwen3-4B. If you want a stock/aligned model, build your own engine from the base weights using patches/ and BUILD_README.md.
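Before pointing a runtime at the engine, it can help to confirm what engine/config.json says about batch size, sequence length and quantization. The sketch below reads a few fields defensively. The key names (build_config, pretrained_config, quantization, and so on) are assumptions based on the layout recent TensorRT-LLM versions write, so any key that is absent in this file simply comes back as None.

```python
import json
from pathlib import Path


def summarize_engine_config(config):
    """Pull a few fields out of a parsed engine config dict.

    Key names are assumed, not confirmed against this repo's config.json;
    missing keys are reported as None instead of raising.
    """
    build = config.get("build_config") or {}
    pretrained = config.get("pretrained_config") or {}
    quant = pretrained.get("quantization") or {}
    return {
        "architecture": pretrained.get("architecture"),
        "quant_algo": quant.get("quant_algo"),
        "max_batch_size": build.get("max_batch_size"),
        "max_seq_len": build.get("max_seq_len"),
    }


def load_engine_config(path="engine/config.json"):
    """Read and parse the engine config from a local clone."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Something like print(summarize_engine_config(load_engine_config())) from the repo root would show whether the values match the "16k context, batch 1" description.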
Building from source
Everything needed to rebuild TRT-LLM natively on Windows (for your own SM) is in patches/, documented step-by-step in BUILD_README.md.
Credits / attribution
- NVIDIA TensorRT-LLM: Apache-2.0 · https://github.com/NVIDIA/TensorRT-LLM
- Qwen3: Apache-2.0 (Alibaba). The engine is the abliterated Josiefied-Qwen3-4B.
- Built with AI assistance (Claude + GitHub Copilot).
License
Apache-2.0 (matching upstream TensorRT-LLM); see LICENSE. NVIDIA CUDA / TensorRT and the Qwen3 weights are covered by their own respective licenses.
Model tree for xThr45hx/TensorRT-LLM-Windows-RTX40
Base model
Qwen/Qwen3-4B-Base