# TensorRT-LLM [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is optimizes LLM inference on NVIDIA GPUs. It compiles models into a TensorRT engine with in-flight batching, paged KV caching, and tensor parallelism. [AutoDeploy](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) accepts Transformers models without requiring any changes. It automatically converts the model to an optimized runtime. Pass a model id from the Hub to [build_and_run_ad.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/auto_deploy/build_and_run_ad.py) to run a Transformers model. ```bash cd examples/auto_deploy python build_and_run_ad.py --model meta-llama/Llama-3.2-1B ``` Under the hood, AutoDeploy creates an [LLM](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.LLM) class. It loads the model configuration with [`AutoConfig.from_pretrained`] and extracts any parallelism metadata stored in `tp_plan`. [`AutoModelForCausalLM.from_pretrained`] loads the model with the config and enables Transformers' built-in tensor parallelism. ```py from tensorrt_llm._torch.auto_deploy import LLM llm = LLM(model="meta-llama/Llama-3.2-1B") ``` TensorRT-LLM extracts the model graph with `torch.export` and applies optimizations. It replaces Transformers attention with TensorRT-LLM [attention kernels](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/attention_backend) and compiles the model into an optimized execution backend. ## Resources - [TensorRT-LLM docs](https://nvidia.github.io/TensorRT-LLM/) for more detailed usage guides. - [AutoDeploy guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) explains how it works with advanced examples.