# GR00T Deployment & Inference Guide
Run inference with PyTorch or TensorRT acceleration for the GR00T N1.7 policy.
---
## Prerequisites
- Model checkpoint: `nvidia/GR00T-N1.7-3B`
- Dataset in LeRobot format (e.g., `demo_data/libero_demo`)
- CUDA-enabled GPU
- Set up the uv environment following the main README.md
| Platform | Installation |
|----------|-------------|
| **dGPU** (H100, A100, RTX 4090/5090, L20, RTX Pro 5000/6000, etc.) | `uv sync` — GPU deps (`flash-attn`, `onnx`, `tensorrt`) included |
| **[Jetson Thor](https://developer.nvidia.com/embedded/jetson)** | [Jetson Thor Setup](#jetson-thor-setup) (Docker or bare metal) |
| **[DGX Spark](https://developer.nvidia.com/dgx-spark)** | [DGX Spark Setup](#dgx-spark-setup) (Docker or bare metal) |
| **[Jetson Orin](https://developer.nvidia.com/embedded/jetson)** | [Jetson Orin Setup](#jetson-orin-setup) (Docker or bare metal) |
- dGPU local environment: run the installation command below, then use the PyTorch or TensorRT commands in this guide
- Thor Docker or bare metal: skip to [Jetson Thor Setup](#jetson-thor-setup)
- Spark Docker or bare metal: skip to [DGX Spark Setup](#dgx-spark-setup)
- Orin Docker or bare metal: skip to [Jetson Orin Setup](#jetson-orin-setup)
### dGPU Installation
```bash
uv sync
```
GPU dependencies (`flash-attn`, `onnx`, `tensorrt`) are included in the default install.
## Download Model and Dataset
Download the finetuned model to a local directory (HuggingFace does not support nested repo paths directly):
```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO \
--include "libero_10/config.json" "libero_10/embodiment_id.json" \
"libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" \
"libero_10/processor_config.json" "libero_10/statistics.json" \
--local-dir checkpoints/GR00T-N1.7-LIBERO
```
For demo dataset setup, see the [Data Format section in the main README](../../README.md#data-format).
---
## Quick Start: PyTorch Inference
Run inference on demo trajectories using PyTorch (no TRT setup needed):
```bash
uv run python scripts/deployment/standalone_inference_script.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 1 2 3 4 \
--inference-mode pytorch \
--action-horizon 8
```
---
## TensorRT Acceleration
The `trt_full_pipeline` mode (passed via `--inference-mode trt_full_pipeline`
in `standalone_inference_script.py`) accelerates all model components with
TRT engines. Speedup varies by platform — see the benchmark tables below for
measured results on each device. Inside the engine-loading and build scripts
(`trt_model_forward.py`, `build_trt_pipeline.py`) the same pipeline is named
`n17_full_pipeline`; the two names refer to the same set of engines.
| Component | Engine | Notes |
|-----------|--------|-------|
| ViT | **TRT** | Qwen3-VL Vision (24 blocks, FP32 for accuracy) |
| LLM | **TRT** | Qwen3-VL Text Model (16 layers, with deepstack injection) |
| VL Self-Attention | **TRT** | SelfAttentionTransformer (4 layers, if present) |
| State Encoder | **TRT** | CategorySpecificMLP |
| Action Encoder | **TRT** | MultiEmbodimentActionEncoder |
| DiT | **TRT** | AlternateVLDiT (32 layers) |
| Action Decoder | **TRT** | CategorySpecificMLP |
Lightweight ops remain in PyTorch: `embed_tokens`, `masked_scatter`, `get_rope_index`, VLLN.
<details>
<summary>DiT-only mode (legacy from N1.6)</summary>
The `dit_only` export mode (`--export-mode dit_only`) optimizes only the action head DiT, leaving the backbone in PyTorch. This was the default in N1.6. For N1.7, **full_pipeline is recommended** as it accelerates the backbone (ViT + LLM) which dominates inference time.
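
For reference, a DiT-only build uses the same pipeline command with the export mode switched (a sketch reusing the illustrative paths from this guide):

```bash
# DiT-only export/build; the backbone (ViT + LLM) stays in PyTorch
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --export-mode dit_only
```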
</details>
### Build TRT Engines
The unified `build_trt_pipeline.py` script runs all steps (export ONNX → build engines → verify accuracy → benchmark) in a single command:
```bash
uv run python scripts/deployment/build_trt_pipeline.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA
```
> **Finetuned models:** Replace `--model-path` with your checkpoint path. The pipeline is identical for base and finetuned models.
> **Note:** Engine build takes ~2-5 minutes depending on GPU. Engines are GPU-architecture-specific and must be rebuilt for different GPUs.
> **Batch size:** The `--batch-size` value is baked as a **static** dimension into the ONNX and TRT models. Engines built with one batch size cannot be used with a different batch size at runtime. If you need a different batch size, re-run the full pipeline (`--steps export,build,verify`) with the new `--batch-size` value.
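
For example, rebuilding the engines for a different batch size (the value here is illustrative) looks like:

```bash
# Re-export, rebuild, and re-verify engines with a new static batch size
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --batch-size 4 \
    --steps export,build,verify
```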
You can also run a subset of steps:
```bash
# Export + build only (skip verify and benchmark)
uv run python scripts/deployment/build_trt_pipeline.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA \
--steps export,build
```
<details>
<summary>What each step does</summary>
The pipeline runs 4 steps in sequence:
1. **Export to ONNX** (`export`) — Exports all model components (LLM, VL Self-Attention, State Encoder, Action Encoder, DiT, Action Decoder) to ONNX format under `<output-dir>/onnx/`.
2. **Build TensorRT Engines** (`build`) — Compiles each ONNX model into a GPU-specific TensorRT engine under `<output-dir>/engines/`.
3. **Verify Accuracy** (`verify`) — Runs PyTorch vs TRT output comparison. Expected: `Cosine Similarity: 0.999+` (PASS).
4. **Benchmark** (`benchmark`) — Measures E2E latency for PyTorch Eager, torch.compile, and TRT modes.
Each step can be run individually via `--steps <step>`. Verbose logs are written to `<output-dir>/pipeline.log`.
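
For instance, a sketch of re-running only the accuracy check against previously built engines and then inspecting the log (default output directory assumed):

```bash
# Verify only; expects ONNX models and engines already under ./gr00t_trt_deployment
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --steps verify

# Verbose logs are written to <output-dir>/pipeline.log
tail -n 50 ./gr00t_trt_deployment/pipeline.log
```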
</details>
---
## Performance
### Benchmark Results
GR00T N1.7 Inference Timing (4 denoising steps, 1 camera):
| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency | E2E Speedup |
|--------|------|-----------------|----------|-------------|-----|-----------|-------------|
| **dGPU** | | | | | | | |
| H100 80GB HBM3 | PyTorch Eager | 6.2 ms | 31.3 ms | 48.2 ms | 85.8 ms | 11.7 Hz | 1.00x |
| | torch.compile | 6.2 ms | 30.4 ms | 12.0 ms | 48.6 ms | 20.6 Hz | 1.77x |
| | **TensorRT (Full Pipeline)** | **6.2 ms** | **8.8 ms** | **12.3 ms** | **27.9 ms** | **35.9 Hz** | **3.08x** |
| H20 96GB HBM3 | PyTorch Eager | 5.33 ms | 30.8 ms | 47.3 ms | 83.4 ms | 12.0 Hz | 1.00x |
| | torch.compile | 5.33 ms | 31.1 ms | 13.3 ms | 49.7 ms | 20.1 Hz | 1.68x |
| | **TensorRT (Full Pipeline)** | **5.33 ms** | **14.2 ms** | **14.5 ms** | **34.0 ms** | **29.4 Hz** | **2.45x** |
| RTX Pro 6000 Blackwell | PyTorch Eager | 4.8 ms | 29.3 ms | 44.0 ms | 78.4 ms | 12.8 Hz | 1.00x |
| | torch.compile | 4.8 ms | 29.4 ms | 16.5 ms | 50.7 ms | 19.7 Hz | 1.55x |
| | **TensorRT (Full Pipeline)** | **4.8 ms** | **9.9 ms** | **13.2 ms** | **27.9 ms** | **35.9 Hz** | **2.81x** |
| RTX Pro 5000 72GB | PyTorch Eager | 8.85 ms | 54.01 ms | 63.19 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 8.85 ms | 55.74 ms | 20.38 ms | 84.9 ms | 11.8 Hz | 1.49x |
| | **TensorRT (Full Pipeline)** | **8.85 ms** | **14.37 ms** | **17.33 ms** | **40.5 ms** | **24.7 Hz** | **3.13x** |
| L40 | PyTorch Eager | 6.6 ms | 42.8 ms | 78.9 ms | 128.3 ms | 7.8 Hz | 1.00x |
| | torch.compile | 6.6 ms | 42.7 ms | 19.8 ms | 69.0 ms | 14.5 Hz | 1.86x |
| | **TensorRT (Full Pipeline)** | **6.6 ms** | **13.1 ms** | **18.8 ms** | **38.4 ms** | **26.0 Hz** | **3.34x** |
| L20 | PyTorch Eager | 5.7 ms | 47.58 ms | 86.92 ms | 140.3 ms | 7.1 Hz | 1.00x |
| | torch.compile | 5.7 ms | 47.2 ms | 20.18 ms | 73.1 ms | 13.7 Hz | 1.92x |
| | **TensorRT (Full Pipeline)** | **5.7 ms** | **17.27 ms** | **19.79 ms** | **42.8 ms** | **23.3 Hz** | **3.28x** |
| **Jetson / Spark** | | | | | | | |
| DGX Spark | PyTorch Eager | 13.14 ms | 38.22 ms | 74.94 ms | 126.4 ms | 7.9 Hz | 1.00x |
| | torch.compile | 13.14 ms | 39.23 ms | 56.49 ms | 108.8 ms | 9.2 Hz | 1.16x |
| | **TensorRT (Full Pipeline)** | **13.14 ms** | **33.43 ms** | **52.37 ms** | **98.6 ms** | **10.1 Hz** | **1.28x** |
| AGX Thor | PyTorch Eager | 8.21 ms | 55.26 ms | 81.65 ms | 144.9 ms | 6.9 Hz | 1.00x |
| | torch.compile | 8.21 ms | 55.59 ms | 64.66 ms | 128.4 ms | 7.8 Hz | 1.13x |
| | **TensorRT (Full Pipeline)** | **8.21 ms** | **28.89 ms** | **56.64 ms** | **93.8 ms** | **10.7 Hz** | **1.54x** |
| Orin | PyTorch Eager | 9.45 ms | 127.6 ms | 205.39 ms | 342.8 ms | 2.9 Hz | 1.00x |
| | torch.compile | 9.45 ms | 128.59 ms | 78.94 ms | 217.0 ms | 4.6 Hz | 1.58x |
| | **TensorRT (DiT-only)** | **9.45 ms** | **128.38 ms** | **78.6 ms** | **216.5 ms** | **4.6 Hz** | **1.58x** |
> **Note:** Orin uses DiT-only TensorRT (`--inference-mode tensorrt`) because TRT 10.3 does not support the backbone engine. All other platforms use the full pipeline (`--inference-mode trt_full_pipeline`).
<details>
<summary>Raw benchmark output (H100 80GB HBM3)</summary>
```
Hardware: NVIDIA H100 80GB HBM3
Model: checkpoints/GR00T-N1.7-LIBERO/libero_10
1 camera, Denoising Steps: 4
PyTorch Eager:
E2E: 85.8 ms (11.7 Hz)
Data Processing: 6.2 ms | Backbone: 31.3 ms | Action Head: 48.2 ms
torch.compile:
E2E: 48.6 ms (20.6 Hz), 1.77x speedup
Data Processing: 6.2 ms | Backbone: 30.4 ms | Action Head: 12.0 ms
TensorRT (Full Pipeline):
E2E: 27.9 ms (35.9 Hz), 3.08x speedup
Data Processing: 6.2 ms | Backbone: 8.8 ms | Action Head: 12.3 ms
```
</details>
### Standalone Inference with TRT
The standalone inference script serves as both an accuracy validation and a reference for deploying TRT inference in your own code. It runs per-step inference on real trajectories and compares action predictions:
```bash
uv run python scripts/deployment/standalone_inference_script.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA \
--traj-ids 0 1 2 3 4 \
--inference-mode trt_full_pipeline \
--trt-engine-path ./gr00t_trt_deployment/engines \
--save-plot-path ./output/trt_inference.png
```
Expected accuracy: MSE/MAE match PyTorch within noise, i.e., TRT preserves action quality. Speedup varies by platform — run `build_trt_pipeline.py --steps benchmark` on your hardware for exact numbers.
### Optional: LIBERO Closed-Loop Sim Evaluation
To validate TRT accuracy in end-to-end robotic tasks, run the LIBERO closed-loop evaluation. This requires a separate environment setup (~10-30 min, MuJoCo simulator + dependencies).
<details>
<summary>Setup, commands, and results (H100, 20 episodes)</summary>
Task: `KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it`, 20 episodes:
| Mode | Success Rate |
|------|-------------|
| PyTorch | 100% (20/20) |
| TRT (n17_full_pipeline) | 95% (19/20) |
Difference is within simulation noise (p >> 0.05).
> **Note:** Use `--n-envs 1` for TRT evaluation (ViT engine has static shapes for single-observation inference).
```bash
# One-time LIBERO setup (~10 min)
bash gr00t/eval/sim/LIBERO/setup_libero.sh
# Activate LIBERO venv and install additional deps
source gr00t/eval/sim/LIBERO/libero_uv/.venv/bin/activate
uv pip install diffusers transformers accelerate safetensors torchcodec
# TRT full pipeline evaluation
python gr00t/eval/rollout_policy.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--env-name "libero_sim/KITCHEN_SCENE3_turn_on_the_stove_and_put_the_moka_pot_on_it" \
--n-episodes 20 --n-envs 1 --max-episode-steps 504 \
--trt-engine-path ./gr00t_trt_deployment/engines \
--trt-mode n17_full_pipeline
```
</details>
> Run `python scripts/deployment/build_trt_pipeline.py --steps benchmark` to generate benchmarks for your hardware.
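
A benchmark-only invocation might look like the following (illustrative paths; engines are assumed to already exist under the output directory):

```bash
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --steps benchmark
```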
---
## Platform-Specific Setup
> Jetson and Spark platforms use different dependency stacks than dGPU. Thor and Spark use CUDA 13 with PyTorch 2.10.0 from the [Jetson AI Lab cu130 index](https://pypi.jetson-ai-lab.io/sbsa/cu130). Orin uses CUDA 12.6 with PyTorch 2.10.0 from the [Jetson AI Lab cu126 index](https://pypi.jetson-ai-lab.io/jp6/cu126).
### Jetson Thor Setup
Thor uses CUDA 13 and Python 3.12, which require a different dependency stack than x86 or Orin.
Tested with JetPack 7.1.
There are two ways to run on Thor: Docker (recommended) or bare metal.
<details>
<summary><strong>Docker (Recommended)</strong></summary>
Build the Thor container from the repo root:
```bash
cd docker && bash build.sh --profile=thor && cd ..
```
Download the finetuned model (run once, on the host):
```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```
Start an interactive Docker session (recommended for multi-step TRT work):
```bash
docker run -it --rm --runtime nvidia --gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--network host \
-v "$(pwd)":/workspace/repo \
-v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
-w /workspace/repo \
-e HF_TOKEN="${HF_TOKEN:-}" \
gr00t-thor \
bash
```
Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):
```bash
python scripts/deployment/build_trt_pipeline.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA
```
</details>
<details>
<summary><strong>Bare Metal</strong></summary>
```bash
# One-time install (temporarily copies the Thor pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, and builds torchcodec from source against the
# system FFmpeg runtime)
bash scripts/deployment/thor/install_deps.sh
# In each new shell
source .venv/bin/activate
source scripts/activate_thor.sh
```
Then run the TRT pipeline or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above.
The activation script exports the PyTorch and CUDA library/include paths that `torchcodec`
and `torch.compile` need on Thor.
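
With the venv activated, commands run without the `uv run` prefix; a quick PyTorch sanity check might look like this (a sketch, assuming the activated environment provides `python` directly):

```bash
python scripts/deployment/standalone_inference_script.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 \
    --inference-mode pytorch
```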
</details>
---
### DGX Spark Setup
Spark uses CUDA 13 and Python 3.12 like Thor, but requires a dedicated dependency stack and
source-built `flash-attn` for `sm121`. There are two ways to run on Spark: Docker (recommended)
or bare metal.
<details>
<summary><strong>Docker (Recommended)</strong></summary>
Build the Spark container from the repo root:
```bash
cd docker && bash build.sh --profile=spark && cd ..
```
Download the finetuned model (run once, on the host):
```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```
Start an interactive Docker session (recommended for multi-step TRT work):
```bash
docker run -it --rm --runtime nvidia --gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--network host \
-v "$(pwd)":/workspace/repo \
-v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
-w /workspace/repo \
-e HF_TOKEN="${HF_TOKEN:-}" \
gr00t-spark \
bash
```
Then inside the container, run the full TRT pipeline (export, build, verify, benchmark):
```bash
python scripts/deployment/build_trt_pipeline.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA
```
</details>
<details>
<summary><strong>Bare Metal</strong></summary>
```bash
# One-time install (temporarily copies the Spark pyproject.toml and uv.lock to repo root,
# installs NVPL libs, uv, Python deps, source-builds flash-attn for sm121, and builds
# torchcodec from source against the system FFmpeg runtime)
bash scripts/deployment/spark/install_deps.sh
# In each new shell
source .venv/bin/activate
source scripts/activate_spark.sh
```
Then run the TRT pipeline or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above.
If you later rerun `uv sync`, rerun `bash scripts/deployment/spark/install_deps.sh` so the
Spark-specific `flash-attn` build is restored and revalidated.
</details>
---
### Jetson Orin Setup
> **Note:** On Orin, only the DiT (action head) TRT export is currently supported. Use `--export-mode dit_only` instead of `full_pipeline`. Full pipeline support is in progress.
Orin uses CUDA 12.6 and Python 3.10, which require a different dependency stack than x86 or Thor.
Tested with JetPack 6.2.
There are two ways to run on Orin: Docker (recommended) or bare metal.
<details>
<summary><strong>Docker (Recommended)</strong></summary>
Build the Orin container from the repo root:
```bash
cd docker && bash build.sh --profile=orin && cd ..
```
Download the finetuned model (run once, on the host):
```bash
uv run hf download nvidia/GR00T-N1.7-LIBERO --include "libero_10/config.json" "libero_10/embodiment_id.json" "libero_10/model-*.safetensors" "libero_10/model.safetensors.index.json" "libero_10/processor_config.json" "libero_10/statistics.json" --local-dir checkpoints/GR00T-N1.7-LIBERO
```
Start an interactive Docker session (recommended for multi-step TRT work):
```bash
docker run -it --rm --runtime nvidia --gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--network host \
-v "$(pwd)":/workspace/repo \
-v "${HF_HOME:-${HOME}/.cache/huggingface}":/root/.cache/huggingface \
-w /workspace/repo \
-e HF_TOKEN="${HF_TOKEN:-}" \
gr00t-orin \
bash
```
Then inside the container, run the TRT pipeline (DiT-only on Orin):
```bash
python scripts/deployment/build_trt_pipeline.py \
--model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
--dataset-path demo_data/libero_demo \
--embodiment-tag LIBERO_PANDA \
--export-mode dit_only
```
</details>
<details>
<summary><strong>Bare Metal</strong></summary>
```bash
# One-time install (temporarily copies the Orin pyproject.toml and uv.lock to repo root,
# installs uv, Python deps, and builds torchcodec from source against JetPack's FFmpeg
# runtime)
bash scripts/deployment/orin/install_deps.sh
# In each new shell
source .venv/bin/activate
source scripts/activate_orin.sh
```
Then run the TRT pipeline (with `--export-mode dit_only`) or PyTorch inference as shown in the [TensorRT Acceleration](#tensorrt-acceleration) and [Quick Start](#quick-start-pytorch-inference) sections above.
The activation script exports the PyTorch and CUDA library/include paths that `torchcodec`
and `torch.compile` need on Orin.
</details>
> **Orin storage tip:** If your eMMC root is low on space, redirect the HuggingFace cache to an NVMe SSD with `export HF_HOME=/path/to/ssd/.cache/huggingface` before downloading models.
> **Orin TRT limitations:** TRT 10.3 on Orin does not support the backbone (LLM) engine — the build step will report a failure for `llm_bf16.engine` and that is expected. The remaining 6 engines build successfully. Use `--export-mode action_head` for verification and `--inference-mode tensorrt` (DiT-only TRT, backbone runs in PyTorch) for inference:
> ```bash
> python scripts/deployment/build_trt_pipeline.py \
> --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
> --dataset-path demo_data/libero_demo \
> --export-mode action_head \
> --steps verify
>
> python scripts/deployment/standalone_inference_script.py \
> --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
> --dataset-path demo_data/libero_demo \
> --embodiment-tag LIBERO_PANDA \
> --traj-ids 0 \
> --inference-mode tensorrt \
> --trt-engine-path ./gr00t_n1d7_engines
> ```
---
## Command-Line Arguments
### `build_trt_pipeline.py`
| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | `demo_data/libero_demo` | Path to dataset (LeRobot format) |
| `--embodiment-tag` | Auto-detected | Embodiment tag (auto-detected from processor_config.json if single embodiment) |
| `--output-dir` | `./gr00t_trt_deployment` | Root output directory. ONNX → `<output-dir>/onnx/`, engines → `<output-dir>/engines/` |
| `--precision` | `bf16` | Precision for ONNX export and TRT engine build (`bf16`, `fp16`, `fp32`) |
| `--batch-size` | `1` | Batch size baked into exported ONNX/TRT models (static — see note below) |
| `--export-mode` | `full_pipeline` | Export mode: `dit_only`, `action_head`, or `full_pipeline` |
| `--video-backend` | `torchcodec` | Video backend for dataset loading |
| `--workspace` | `8192` | TRT builder workspace size in MB |
| `--num-iterations` | `20` | Number of benchmark iterations |
| `--warmup` | `5` | Number of warmup iterations |
| `--skip-compile` | `false` | Skip torch.compile benchmark |
| `--steps` | `all` | Steps to run: `all` or comma-separated subset of `export,build,verify,benchmark` |
| `--log-file` | `<output-dir>/pipeline.log` | Log file path |
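
As an illustration of combining these flags (values are arbitrary examples, not tuned recommendations):

```bash
# FP16 engines, batch size 2, custom output directory, smaller TRT workspace
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --precision fp16 \
    --batch-size 2 \
    --output-dir ./my_trt_build \
    --workspace 4096
```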
### `standalone_inference_script.py`
| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | `demo_data/droid_sample` | Path to dataset (LeRobot format) |
| `--embodiment-tag` | Auto-detected | Robot embodiment tag |
| `--traj-ids` | `[0]` | Episode indices to evaluate (space-separated) |
| `--steps` | `200` | Max steps per trajectory (capped by actual length) |
| `--action-horizon` | `16` | Action prediction horizon |
| `--inference-mode` | `pytorch` | `pytorch`, `tensorrt` (DiT-only TRT), or `trt_full_pipeline` (all engines) |
| `--trt-engine-path` | `./gr00t_n1d7_engines` | Directory containing pre-built TRT engines |
| `--denoising-steps` | `4` | Diffusion denoising iterations |
| `--save-plot-path` | `None` | Save per-trajectory GT-vs-predicted comparison plots |
| `--video-backend` | `torchcodec` | Video decoder: `torchcodec`, `decord`, or `torchvision_av` |
| `--skip-timing-steps` | `1` | Initial steps excluded from timing stats (warmup) |
| `--host` / `--port` | `127.0.0.1` / `5555` | Server address (when using client mode without `--model-path`) |
| `--seed` | `42` | Random seed for reproducibility |
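
For example, a PyTorch run that saves comparison plots and uses an alternate video backend (flag values are illustrative):

```bash
uv run python scripts/deployment/standalone_inference_script.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --traj-ids 0 1 \
    --inference-mode pytorch \
    --denoising-steps 4 \
    --action-horizon 8 \
    --video-backend decord \
    --save-plot-path ./output/pytorch_inference.png
```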
## Files
| File | Description |
|------|-------------|
| `build_trt_pipeline.py` | Unified pipeline: export ONNX, build engines, verify, benchmark |
| `standalone_inference_script.py` | Main inference script (PyTorch + DiT-only TensorRT) |
| `trt_torch.py` | TRT Engine wrapper class (load, bind, execute) |
| `trt_model_forward.py` | TRT forward functions and setup (backbone + action head) |
---
## Troubleshooting
### Engine Build Fails
- Ensure you have enough GPU memory (16GB+ recommended for full pipeline)
- Try reducing workspace size: `--workspace 4096`
- Ensure TensorRT version matches your CUDA version
- LLM engine requires `batch_size` dimension handling when using custom shape profiles
### ONNX Export Issues
- If export fails with COMPLEX128 error: ensure `_simple_causal_mask` is used (not HuggingFace's `create_causal_mask`)
- If `masked_scatter` size assertion fails: ensure `visual_pos_masks` has the correct number of True values matching deepstack tensor size
- Check that the dataset path is valid and contains at least one trajectory
### Accuracy Issues
- If cosine < 0.99: check that LLM export does NOT include the final RMSNorm (backbone returns pre-norm `hidden_states[-1]`)
- If output magnitude is ~12x too small: this is the norm bug — see above
- Run `build_trt_pipeline.py --steps verify --export-mode action_head` first to isolate backbone vs action head drift
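
A sketch of that isolation step, reusing the illustrative paths from this guide:

```bash
# Verify the action-head engines alone; if this passes while full-pipeline
# verification fails, the drift is coming from the backbone (ViT/LLM) engines.
uv run python scripts/deployment/build_trt_pipeline.py \
    --model-path checkpoints/GR00T-N1.7-LIBERO/libero_10 \
    --dataset-path demo_data/libero_demo \
    --embodiment-tag LIBERO_PANDA \
    --export-mode action_head \
    --steps verify
```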