# GR00T Deployment & Inference Guide

Run inference with PyTorch or TensorRT acceleration for the GR00T N1.7 policy.

---

## Prerequisites

- Model checkpoint: `nvidia/GR00T-N1.7-3B`
- Dataset in LeRobot format (e.g., `demo_data/libero_demo`)
- CUDA-enabled GPU
- Set up the uv environment following README.md

| Platform | Installation |
|----------|-------------|
| **dGPU** (H100, A100, RTX 4090/5090, L20, RTX Pro 5000/6000, etc.) | `uv sync` — GPU deps (`flash-attn`, `onnx`, `tensorrt`) included |
| **[Jetson Thor](https://developer.nvidia.com/embedded/jetson)** | [Jetson Thor Setup](#jetson-thor-setup) (Docker or bare metal) |
| **[DGX Spark](https://developer.nvidia.com/dgx-spark)** | [DGX Spark Setup](#dgx-spark-setup) (Docker or bare metal) |
| **[Jetson Orin](https://developer.nvidia.com/embedded/jetson)** | [Jetson Orin Setup](#jetson-orin-setup) (Docker or bare metal) |

- dGPU local environment: use the installation commands below, then use the PyTorch or TensorRT commands in this guide
- Thor Docker or bare metal: skip to [Jetson Thor Setup](#jetson-thor-setup)
- Spark Docker or bare metal: skip to [DGX Spark Setup](#dgx-spark-setup)
- Orin Docker or bare metal: skip to [Jetson Orin Setup](#jetson-orin-setup)

### dGPU Installation

```bash
uv sync
```

GPU dependencies (`flash-attn`, `onnx`, `tensorrt`) are included in the default install.

## Download Model and Dataset

Download the finetuned model to a local directory (HuggingFace does not support nested repo paths directly):

```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO \
  --include "libero_10/config.json" "libero_10/embodiment_id.json" \
  "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" \
  "libero_10/processor_config.json" "libero_10/statistics.json" \
  --local-dir checkpoints/GR00T-N1.7-LIBERO
```

For demo dataset setup, see the [Data Format section in the main README](../../README.md#data-format).

---

## Quick Start: PyTorch Inference

Run inference on demo trajectories using PyTorch (no TRT setup needed):

```bash
uv run python scripts/deployment/standalone_inference_script.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 1 2 3 4 \
  --inference-mode pytorch \
  --action-horizon 8
```

---

## TensorRT Acceleration

The `trt_full_pipeline` mode (passed via `--inference-mode trt_full_pipeline` in `standalone_inference_script.py`) accelerates all model components with TRT engines. Speedup varies by platform — see the benchmark tables below for measured results on each device. The same pipeline is referred to as `n17_full_pipeline` inside the engine-loading and build scripts (`trt_model_forward.py`, `build_trt_pipeline.py`); the two names describe the same set of engines.

| Component | Engine | Notes |
|-----------|--------|-------|
| ViT | **TRT** | Qwen3-VL Vision (24 blocks, FP32 for accuracy) |
| LLM | **TRT** | Qwen3-VL Text Model (16 layers, with deepstack injection) |
| VL Self-Attention | **TRT** | SelfAttentionTransformer (4 layers, if present) |
| State Encoder | **TRT** | CategorySpecificMLP |
| Action Encoder | **TRT** | MultiEmbodimentActionEncoder |
| DiT | **TRT** | AlternateVLDiT (32 layers) |
| Action Decoder | **TRT** | CategorySpecificMLP |

Lightweight ops remain in PyTorch: `embed_tokens`, `masked_scatter`, `get_rope_index`, VLLN.
#### DiT-only mode (legacy from N1.6)

The `dit_only` export mode (`--export-mode dit_only`) optimizes only the action head DiT, leaving the backbone in PyTorch. This was the default in N1.6. For N1.7, **full_pipeline is recommended** as it accelerates the backbone (ViT + LLM), which dominates inference time.
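For reference, a DiT-only export and build uses the same pipeline command shown in the next section, with only the export mode switched; the checkpoint and demo dataset paths below reuse the ones from this guide:

```bash
# DiT-only export/build: action head runs in TRT, backbone stays in PyTorch
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --export-mode dit_only
```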
### Build TRT Engines

The unified `build_trt_pipeline.py` script runs all steps (export ONNX → build engines → verify accuracy → benchmark) in a single command:

```bash
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA
```

> **Finetuned models:** Replace `--model-path` with your checkpoint path. The pipeline is identical for base and finetuned models.

> **Note:** Engine build takes ~2-5 minutes depending on GPU. Engines are GPU-architecture-specific and must be rebuilt for different GPUs.

> **Batch size:** The `--batch-size` value is baked as a **static** dimension into the ONNX and TRT models. Engines built with one batch size cannot be used with a different batch size at runtime. If you need a different batch size, re-run the full pipeline (`--steps export,build,verify`) with the new `--batch-size` value (see the example after this section).

You can also run a subset of steps:

```bash
# Export + build only (skip verify and benchmark)
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --steps export,build
```
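As noted above, changing the batch size means re-exporting and rebuilding. For example, to produce engines for batch size 4 (reusing the checkpoint and demo dataset paths from this guide):

```bash
# Re-export, rebuild, and re-verify engines with a new static batch size
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --batch-size 4 \
  --steps export,build,verify
```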
#### What each step does

The pipeline runs 4 steps in sequence:

1. **Export to ONNX** (`export`) — Exports all model components (LLM, VL Self-Attention, State Encoder, Action Encoder, DiT, Action Decoder) to ONNX format under `<output-dir>/onnx/`.
2. **Build TensorRT Engines** (`build`) — Compiles each ONNX model into a GPU-specific TensorRT engine under `<output-dir>/engines/`.
3. **Verify Accuracy** (`verify`) — Runs PyTorch vs TRT output comparison. Expected: `Cosine Similarity: 0.999+` (PASS).
4. **Benchmark** (`benchmark`) — Measures E2E latency for PyTorch Eager, torch.compile, and TRT modes.

Each step can be run individually via the `--steps` flag (see the example below). Verbose logs are written to `<output-dir>/pipeline.log`.
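For instance, to re-run just the accuracy check and the benchmark against engines that were already built, and then inspect the log, something like the following works (paths reuse the checkpoint and demo dataset from this guide; the log location assumes the default `--output-dir` of `./gr00t_trt_deployment`):

```bash
# Re-verify and re-benchmark existing engines without re-exporting ONNX
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --steps verify,benchmark

# Detailed logs land under the output directory
tail -n 50 ./gr00t_trt_deployment/pipeline.log
```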
---

## Performance

### Benchmark Results

GR00T N1.7 Inference Timing (4 denoising steps, 1 camera):

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency | E2E Speedup |
|--------|------|-----------------|----------|-------------|-----|-----------|-------------|
| **dGPU** | | | | | | | |
| H100 80GB HBM3 | PyTorch Eager | 6.2 ms | 31.3 ms | 48.2 ms | 85.8 ms | 11.7 Hz | 1.00x |
| | torch.compile | 6.2 ms | 30.4 ms | 12.0 ms | 48.6 ms | 20.6 Hz | 1.77x |
| | **TensorRT (Full Pipeline)** | **6.2 ms** | **8.8 ms** | **12.3 ms** | **27.9 ms** | **35.9 Hz** | **3.08x** |
| H20 96GB HBM3 | PyTorch Eager | 5.33 ms | 30.8 ms | 47.3 ms | 83.4 ms | 12.0 Hz | 1.00x |
| | torch.compile | 5.33 ms | 31.1 ms | 13.3 ms | 49.7 ms | 20.1 Hz | 1.68x |
| | **TensorRT (Full Pipeline)** | **5.33 ms** | **14.2 ms** | **14.5 ms** | **34.0 ms** | **29.4 Hz** | **2.45x** |
| RTX Pro 6000 Blackwell | PyTorch Eager | 4.8 ms | 29.3 ms | 44.0 ms | 78.4 ms | 12.8 Hz | 1.00x |
| | torch.compile | 4.8 ms | 29.4 ms | 16.5 ms | 50.7 ms | 19.7 Hz | 1.55x |
| | **TensorRT (Full Pipeline)** | **4.8 ms** | **9.9 ms** | **13.2 ms** | **27.9 ms** | **35.9 Hz** | **2.81x** |
| RTX Pro 5000 72GB | PyTorch Eager | 8.85 ms | 54.01 ms | 63.19 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 8.85 ms | 55.74 ms | 20.38 ms | 84.9 ms | 11.8 Hz | 1.49x |
| | **TensorRT (Full Pipeline)** | **8.85 ms** | **14.37 ms** | **17.33 ms** | **40.5 ms** | **24.7 Hz** | **3.13x** |
| L40 | PyTorch Eager | 6.6 ms | 42.8 ms | 78.9 ms | 128.3 ms | 7.8 Hz | 1.00x |
| | torch.compile | 6.6 ms | 42.7 ms | 19.8 ms | 69.0 ms | 14.5 Hz | 1.86x |
| | **TensorRT (Full Pipeline)** | **6.6 ms** | **13.1 ms** | **18.8 ms** | **38.4 ms** | **26.0 Hz** | **3.34x** |
| L20 | PyTorch Eager | 5.7 ms | 47.58 ms | 86.92 ms | 140.3 ms | 7.1 Hz | 1.00x |
| | torch.compile | 5.7 ms | 47.2 ms | 20.18 ms | 73.1 ms | 13.7 Hz | 1.92x |
| | **TensorRT (Full Pipeline)** | **5.7 ms** | **17.27 ms** | **19.79 ms** | **42.8 ms** | **23.3 Hz** | **3.28x** |
| **Jetson / Spark** | | | | | | | |
| DGX Spark | PyTorch Eager | 13.14 ms | 38.22 ms | 74.94 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 13.14 ms | 39.23 ms | 56.49 ms | 108.8 ms | 9.2 Hz | 1.16x |
| | **TensorRT (Full Pipeline)** | **13.14 ms** | **33.43 ms** | **52.37 ms** | **98.6 ms** | **10.1 Hz** | **1.28x** |
| AGX Thor | PyTorch Eager | 8.21 ms | 55.26 ms | 81.65 ms | 144.9 ms | 6.9 Hz | 1.00x |
| | torch.compile | 8.21 ms | 55.59 ms | 64.66 ms | 128.4 ms | 7.8 Hz | 1.13x |
| | **TensorRT (Full Pipeline)** | **8.21 ms** | **28.89 ms** | **56.64 ms** | **93.8 ms** | **10.7 Hz** | **1.54x** |
| Orin | PyTorch Eager | 9.45 ms | 127.6 ms | 205.39 ms | 342.8 ms | 2.9 Hz | 1.00x |
| | torch.compile | 9.45 ms | 128.59 ms | 78.94 ms | 217.0 ms | 4.6 Hz | 1.58x |
| | **TensorRT (DiT-only)** | **9.45 ms** | **128.38 ms** | **78.6 ms** | **216.5 ms** | **4.6 Hz** | **1.58x** |

> **Note:** Orin uses DiT-only TensorRT (`--inference-mode tensorrt`) because TRT 10.3 does not support the backbone engine. All other platforms use the full pipeline (`--inference-mode trt_full_pipeline`).
#### Raw benchmark output (H100 80GB HBM3)

```
Hardware: NVIDIA H100 80GB HBM3
Model: checkpoints/GR00T-N1.7-LIBERO/libero_10
1 camera, Denoising Steps: 4

PyTorch Eager:             E2E: 85.8 ms (11.7 Hz)
  Data Processing: 6.2 ms | Backbone: 31.3 ms | Action Head: 48.2 ms

torch.compile:             E2E: 48.6 ms (20.6 Hz), 1.77x speedup
  Data Processing: 6.2 ms | Backbone: 30.4 ms | Action Head: 12.0 ms

TensorRT (Full Pipeline):  E2E: 27.9 ms (35.9 Hz), 3.08x speedup
  Data Processing: 6.2 ms | Backbone: 8.8 ms | Action Head: 12.3 ms
```
### Standalone Inference with TRT

The standalone inference script serves as both an accuracy validation and a reference for deploying TRT inference in your own code. It runs per-step inference on real trajectories and compares action predictions:

```bash
uv run python scripts/deployment/standalone_inference_script.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 1 2 3 4 \
  --inference-mode trt_full_pipeline \
  --trt-engine-path ./gr00t_trt_deployment/engines \
  --save-plot-path ./output/trt_inference.png
```

Expected accuracy: MSE/MAE match PyTorch within noise. TRT produces identical action quality. Speedup varies by platform — run `build_trt_pipeline.py --steps benchmark` on your hardware for exact numbers.

### Optional: LIBERO Closed-Loop Sim Evaluation

To validate TRT accuracy in end-to-end robotic tasks, run the LIBERO closed-loop evaluation. This requires a separate environment setup (~10-30 min, MuJoCo simulator + dependencies).
#### Setup, commands, and results (H100, 20 episodes)

Task: `KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it`, 20 episodes:

| Mode | Success Rate |
|------|-------------|
| PyTorch | 100% (20/20) |
| TRT (n17_full_pipeline) | 95% (19/20) |

Difference is within simulation noise (p >> 0.05).

> **Note:** Use `--n-envs 1` for TRT evaluation (ViT engine has static shapes for single-observation inference).

```bash
# One-time LIBERO setup (~10 min)
bash gr00t/eval/sim/LIBERO/setup_libero.sh

# Activate LIBERO venv and install additional deps
source gr00t/eval/sim/LIBERO/libero_uv/.venv/bin/activate
uv pip install diffusers transformers accelerate safetensors torchcodec

# TRT full pipeline evaluation
python gr00t/eval/rollout_policy.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --env-name "libero_sim/KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it" \
  --n-episodes 20 --n-envs 1 --max-episode-steps 504 \
  --trt-engine-path ./gr00t_trt_deployment/engines \
  --trt-mode n17_full_pipeline
```
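To reproduce the PyTorch row of the table above, the same rollout presumably runs with the TRT flags omitted. This is an assumption based on the results table rather than a documented command; check the script's help output for the exact defaults:

```bash
# PyTorch baseline rollout (assumed: dropping the TRT flags keeps the policy in PyTorch)
python gr00t/eval/rollout_policy.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --env-name "libero_sim/KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it" \
  --n-episodes 20 --n-envs 1 --max-episode-steps 504
```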
> Run `python scripts/deployment/build_trt_pipeline.py --steps benchmark` to generate benchmarks for your hardware.

---

## Platform-Specific Setup

> Jetson and Spark platforms use different dependency stacks than dGPU. Thor and Spark use CUDA 13 with PyTorch 2.10.0 from the [Jetson AI Lab cu130 index](https://pypi.jetson-ai-lab.io/sbsa/cu130). Orin uses CUDA 12.6 with PyTorch 2.10.0 from the [Jetson AI Lab cu126 index](https://pypi.jetson-ai-lab.io/jp6/cu126).

### Jetson Thor Setup

Thor uses CUDA 13 and Python 3.12, which require a different dependency stack than x86 or Orin. Tested with JetPack 7.1. There are two ways to run on Thor: Docker (recommended) or bare metal.
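Before choosing a path, it can help to confirm the L4T/JetPack release and CUDA toolkit on the device. The commands below use locations that are typical on Jetson systems and are offered only as a quick sanity check; `nvcc` may require `/usr/local/cuda/bin` on your PATH:

```bash
# Print the L4T/JetPack release string and the installed CUDA toolkit version
cat /etc/nv_tegra_release
nvcc --version   # or: /usr/local/cuda/bin/nvcc --version
```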
#### Docker (Recommended)

Build the Thor container from the repo root:

```bash
cd docker && bash build.sh --profile=thor && cd ..
```

Download the finetuned model (run once, on the host):

```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```

Start an interactive Docker session (recommended for multi-step TRT work):

```bash
docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-thor \
  bash
```

Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):

```bash
python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA
```
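If you only need a single run (for example, just the benchmark step), the same image can be invoked non-interactively by replacing the trailing `bash` with the pipeline command. A sketch, assuming the `gr00t-thor` image and mounts from above:

```bash
# One-shot, non-interactive run of the benchmark step in the Thor container
docker run --rm --runtime nvidia --gpus all \
  --ipc=host --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-thor \
  python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --steps benchmark
```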
#### Bare Metal

```bash
# One-time install (temporarily copies the Thor pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, and builds torchcodec from source against the
# system FFmpeg runtime)
bash scripts/deployment/thor/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_thor.sh
```

Then run the TRT pipeline or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above. The activation script exports the PyTorch and CUDA library/include paths that `torchcodec` and `torch.compile` need on Thor.
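After activation, a quick import check confirms that the GPU build of PyTorch and the source-built torchcodec are being picked up. This is a generic sanity check, not part of the install scripts:

```bash
# Verify the environment sees the GPU and the source-built torchcodec imports cleanly
python -c "import torch, torchcodec; print(torch.__version__, torch.cuda.is_available())"
```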
---

### DGX Spark Setup

Spark uses CUDA 13 and Python 3.12 like Thor, but requires a dedicated dependency stack and source-built `flash-attn` for `sm121`. There are two ways to run on Spark: Docker (recommended) or bare metal.
#### Docker (Recommended)

Build the Spark container from the repo root:

```bash
cd docker && bash build.sh --profile=spark && cd ..
```

Download the finetuned model (run once, on the host):

```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```

Start an interactive Docker session (recommended for multi-step TRT work):

```bash
docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-spark \
  bash
```

Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):

```bash
python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA
```
#### Bare Metal

```bash
# One-time install (temporarily copies the Spark pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, source-builds flash-attn for sm121, and builds
# torchcodec from source against the system FFmpeg runtime)
bash scripts/deployment/spark/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_spark.sh
```

Then run the TRT pipeline or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above. If you later rerun `uv sync`, rerun `bash scripts/deployment/spark/install_deps.sh` so the Spark-specific `flash-attn` build is restored and revalidated.
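Because a later `uv sync` can replace the `sm121` `flash-attn` build, a quick way to confirm the package still imports in the active environment is to print its version. This is a generic check; the exact version string depends on what the install script built:

```bash
# Confirm flash-attn imports cleanly and report the installed version
python -c "import flash_attn; print(flash_attn.__version__)"
```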
---

### Jetson Orin Setup

> **Note:** On Orin, only the DiT (action head) TRT export is currently supported. Use `--export-mode dit_only` instead of `full_pipeline`. Full pipeline support is in progress.

Orin uses CUDA 12.6 and Python 3.10 (JetPack 6.2), which require a different dependency stack than x86 or Thor. Tested with JetPack 6.2. There are two ways to run on Orin: Docker (recommended) or bare metal.
#### Docker (Recommended)

Build the Orin container from the repo root:

```bash
cd docker && bash build.sh --profile=orin && cd ..
```

Download the finetuned model (run once, on the host):

```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```

Start an interactive Docker session (recommended for multi-step TRT work):

```bash
docker run -it --rm --runtime nvidia --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --network host \
  -v "$(pwd)":/workspace/repo \
  -v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
  -w /workspace/repo \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  gr00t-orin \
  bash
```

Then inside the container, run the TRT pipeline (DiT-only on Orin):

```bash
python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --export-mode dit_only
```
#### Bare Metal

```bash
# One-time install (temporarily copies the Orin pyproject.toml and uv.lock to repo root,
# installs uv, Python deps, and builds torchcodec from source against JetPack's FFmpeg
# runtime)
bash scripts/deployment/orin/install_deps.sh

# In each new shell
source .venv/bin/activate
source scripts/activate_orin.sh
```

Then run the TRT pipeline (with `--export-mode dit_only`) or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above. The activation script exports the PyTorch and CUDA library/include paths that `torchcodec` and `torch.compile` need on Orin.
> **Orin storage tip:** If your eMMC root is low on space, redirect the HuggingFace cache to an NVMe SSD with `export HF_HOME=/path/to/ssd/.cache/huggingface` before downloading models.

> **Orin TRT limitations:** TRT 10.3 on Orin does not support the backbone (LLM) engine — the build step will report a failure for `llm_bf16.engine`, and that is expected. The remaining 6 engines build successfully. Use `--export-mode action_head` for verification and `--inference-mode tensorrt` (DiT-only TRT, backbone runs in PyTorch) for inference:
>
> ```bash
> python scripts/deployment/build_trt_pipeline.py \
>   --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
>   --dataset-path demo_data/libero_demo \
>   --export-mode action_head \
>   --steps verify
>
> python scripts/deployment/standalone_inference_script.py \
>   --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
>   --dataset-path demo_data/libero_demo \
>   --embodiment-tag LIBERO_PANDA \
>   --traj-ids 0 \
>   --inference-mode tensorrt \
>   --trt-engine-path ./gr00t_n1d7_engines
> ```

---

## Command-Line Arguments

### `build_trt_pipeline.py`

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | `demo_data/libero_demo` | Path to dataset (LeRobot format) |
| `--embodiment-tag` | Auto-detected | Embodiment tag (auto-detected from processor_config.json if single embodiment) |
| `--output-dir` | `./gr00t_trt_deployment` | Root output directory. ONNX → `<output-dir>/onnx/`, engines → `<output-dir>/engines/` |
| `--precision` | `bf16` | Precision for ONNX export and TRT engine build (`bf16`, `fp16`, `fp32`) |
| `--batch-size` | `1` | Batch size baked into exported ONNX/TRT models (static — see the note in [Build TRT Engines](#build-trt-engines)) |
| `--export-mode` | `full_pipeline` | Export mode: `dit_only`, `action_head`, or `full_pipeline` |
| `--video-backend` | `torchcodec` | Video backend for dataset loading |
| `--workspace` | `8192` | TRT builder workspace size in MB |
| `--num-iterations` | `20` | Number of benchmark iterations |
| `--warmup` | `5` | Number of warmup iterations |
| `--skip-compile` | `false` | Skip torch.compile benchmark |
| `--steps` | `all` | Steps to run: `all` or comma-separated subset of `export,build,verify,benchmark` |
| `--log-file` | `<output-dir>/pipeline.log` | Log file path |

### `standalone_inference_script.py`

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | `demo_data/droid_sample` | Path to dataset (LeRobot format) |
| `--embodiment-tag` | Auto-detected | Robot embodiment tag |
| `--traj-ids` | `[0]` | Episode indices to evaluate (space-separated) |
| `--steps` | `200` | Max steps per trajectory (capped by actual length) |
| `--action-horizon` | `16` | Action prediction horizon |
| `--inference-mode` | `pytorch` | `pytorch`, `tensorrt` (DiT-only TRT), or `trt_full_pipeline` (all engines) |
| `--trt-engine-path` | `./gr00t_n1d7_engines` | Directory containing pre-built TRT engines |
| `--denoising-steps` | `4` | Diffusion denoising iterations |
| `--save-plot-path` | `None` | Save per-trajectory GT-vs-predicted comparison plots |
| `--video-backend` | `torchcodec` | Video decoder: `torchcodec`, `decord`, or `torchvision_av` |
| `--skip-timing-steps` | `1` | Initial steps excluded from timing stats (warmup) |
| `--host` / `--port` | `127.0.0.1` / `5555` | Server address (when using client mode without `--model-path`) |
| `--seed` | `42` | Random seed for reproducibility |
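The `--host`/`--port` arguments in the table above suggest the script can also act as a client to a running policy server when `--model-path` is omitted. The invocation below is an assumption based solely on that argument table, not a documented command; verify the exact usage against the script's help output:

```bash
# Hypothetical client-mode invocation: no --model-path, point at a running policy server
uv run python scripts/deployment/standalone_inference_script.py \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --traj-ids 0 \
  --host 127.0.0.1 --port 5555
```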
## Files

| File | Description |
|------|-------------|
| `build_trt_pipeline.py` | Unified pipeline: export ONNX, build engines, verify, benchmark |
| `standalone_inference_script.py` | Main inference script (PyTorch + DiT-only TensorRT) |
| `trt_torch.py` | TRT Engine wrapper class (load, bind, execute) |
| `trt_model_forward.py` | TRT forward functions and setup (backbone + action head) |

---

## Troubleshooting

### Engine Build Fails

- Ensure you have enough GPU memory (16GB+ recommended for the full pipeline)
- Try reducing the workspace size: `--workspace 4096`
- Ensure your TensorRT version matches your CUDA version
- LLM engine requires `batch_size` dimension handling when using custom shape profiles

### ONNX Export Issues

- If export fails with a COMPLEX128 error: ensure `_simple_causal_mask` is used (not HuggingFace's `create_causal_mask`)
- If the `masked_scatter` size assertion fails: ensure `visual_pos_masks` has the correct number of True values matching the deepstack tensor size
- Check that the dataset path is valid and contains at least one trajectory

### Accuracy Issues

- If cosine similarity < 0.99: check that the LLM export does NOT include the final RMSNorm (the backbone returns the pre-norm `hidden_states[-1]`)
- If the output magnitude is ~12x too small: this is the RMSNorm bug described in the previous bullet
- Run `build_trt_pipeline.py --steps verify --export-mode action_head` first to isolate backbone vs action head drift (see the example below)
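As referenced in the last bullet, a concrete action-head-only verification run, reusing the checkpoint and demo dataset paths from this guide, looks like:

```bash
# Verify the action-head engines in isolation to rule out backbone drift
uv run python scripts/deployment/build_trt_pipeline.py \
  --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
  --dataset-path demo_data/libero_demo \
  --embodiment-tag LIBERO_PANDA \
  --export-mode action_head \
  --steps verify
```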