My current issues with setting up GR00T
What I'm Trying to Do
I'm building an end-to-end Vision-Language-Action (VLA) pipeline using Isaac Lab Arena + GR00T N1.5-3B for a Unitree G1 humanoid robot doing tabletop manipulation tasks (chess piece placement, simple pick-and-place). The full pipeline is:
- Define custom environment in Isaac Lab Arena (table, objects, G1 fixed-base robot)
- Teleoperate with Quest 3 hand tracking via CloudXR → record demos as HDF5
- Convert HDF5 demos to LeRobot format (parquet + MP4 video)
- Fine-tune GR00T N1.5-3B on the demos using LoRA
- Closed-loop evaluation → run the fine-tuned model in the sim, feeding camera images + joint states → get joint position targets → step physics
Steps 1-4 work. Step 5 (closed-loop eval) is where it breaks: the robot doesn't move at all, despite GR00T producing non-trivial action outputs. I'm looking for guidance on the correct way to close the loop.
My Setup
- Hardware: RTX 5070 Ti (16GB), Intel i7-13700F, 32GB RAM, Meta Quest 3
- Software: Isaac Sim 5.1, Isaac Lab 0.47.2, Isaac Lab Arena 0.1.1, Isaac-GR00T (NVIDIA's repo), Python 3.11
- Robot: Unitree G1, 43-joint articulation, fixed-base (no locomotion), 29 active DOF
- Teleop: Quest 3 → CloudXR → OpenXR hand tracking → G1TriHandUpperBodyRetargeterCfg → PINK IK solver (28D task-space → joint-space)
What's Working
- Custom Arena Environment
Built a chess piece placement environment in Arena with custom assets (chess board USD from Blender, chess piece cylinder, green target square). Also built a simpler pick-and-place variant using only Isaac Lab primitives (no external USD).
- Hand Tracking Teleop (Fully Working)
Quest 3 hand tracking controls the G1 robot through CloudXR + PINK IK. The retargeter outputs 21D hand poses, which feed into PinkInverseKinematicsActionCfg (28D) that solves IK for the upper body. This works great.
Important: I had to abandon Arena's G1DecoupledWBCPinkActionCfg (23D action space) because it's incompatible with the hand-tracking retargeter output (21D). I switched to Isaac Lab's built-in PinkInverseKinematicsActionCfg (28D), which accepts the retargeter output directly.
- Demo Recording & Conversion
Recorded 5 teleoperated demos using record_demos.py. The HDF5 contains:
- obs/robot_joint_pos → 43D joint positions (full articulation state)
- obs/processed_actions → 31D post-IK joint positions (upper body only: waist + arms + hands)
- obs/robot_head_cam_rgb → 480×640 ego-view camera images
Converted to LeRobot format using a custom converter that:
- Remaps 43D sim joint order → GR00T policy body-group order (left_leg → right_leg → waist → left_arm → left_hand → right_arm → right_hand)
- Zero-fills leg joints in actions (since we only have upper body actions from PINK IK)
- Encodes video as MP4 with proper episode chunking
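The remap and zero-fill steps above boil down to a gather on the observation side and a scatter on the action side. A minimal sketch, with made-up joint-name lists standing in for the real 43-joint sim ordering and GR00T body-group ordering:

```python
import numpy as np

# Illustrative joint names, NOT the real G1 ordering: in practice the sim
# order comes from the articulation and the policy order from GR00T's
# body-group layout (legs, waist, arms, hands).
SIM_ORDER = ["left_hip", "right_hip", "waist", "left_shoulder", "right_shoulder"]
POLICY_ORDER = ["waist", "left_shoulder", "right_shoulder"]  # upper body only

# Precompute the index map once; name-based lookup avoids silent misordering.
POLICY_TO_SIM = [SIM_ORDER.index(name) for name in POLICY_ORDER]

def obs_sim_to_policy(joint_pos_sim: np.ndarray) -> np.ndarray:
    """Gather the policy-controlled joints out of the sim-ordered state."""
    return joint_pos_sim[..., POLICY_TO_SIM]

def action_policy_to_sim(action_policy: np.ndarray) -> np.ndarray:
    """Scatter policy actions into a full sim-ordered vector, zero-filling
    the joints (here the legs) that the policy does not control."""
    full = np.zeros(action_policy.shape[:-1] + (len(SIM_ORDER),))
    full[..., POLICY_TO_SIM] = action_policy
    return full

state = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
assert obs_sim_to_policy(state).tolist() == [0.3, 0.4, 0.5]
assert action_policy_to_sim(np.array([0.3, 0.4, 0.5])).tolist() == [0.0, 0.0, 0.3, 0.4, 0.5]
```

One pitfall worth ruling out in the frozen-robot symptom: if the same index map is accidentally applied in the same direction for both observations and actions, the round trip silently scrambles joints.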
Dataset modality.json maps flat parquet columns to GR00T body group keys (e.g., state.waist = indices 12-14 of observation.state).
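For concreteness, a modality.json in the shape the upstream Isaac-GR00T examples use might look like the fragment below. The start/end indices are illustrative (assuming legs occupy indices 0-11 and end is exclusive, which makes state.waist = 12-14 inclusive as above); the annotation/video original_key values are assumptions to verify against your converter:

```json
{
  "state": {
    "waist":      {"start": 12, "end": 15},
    "left_arm":   {"start": 15, "end": 22},
    "left_hand":  {"start": 22, "end": 29},
    "right_arm":  {"start": 29, "end": 36},
    "right_hand": {"start": 36, "end": 43}
  },
  "action": {
    "waist":      {"start": 12, "end": 15},
    "left_arm":   {"start": 15, "end": 22},
    "left_hand":  {"start": 22, "end": 29},
    "right_arm":  {"start": 29, "end": 36},
    "right_hand": {"start": 36, "end": 43}
  },
  "video": {
    "ego_view": {"original_key": "observation.images.ego_view"}
  },
  "annotation": {
    "human.task_description": {}
  }
}
```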
- GR00T Fine-Tuning (Completed)
Fine-tuned GR00T N1.5-3B with LoRA (rank 16, alpha 32) on the 5 demos:
python scripts/gr00t_finetune.py \
    --dataset_path /home/ray/datasets/simple_pick_and_place/simple_pick_and_place_demos/lerobot \
    --data_config isaaclab_arena_gr00t.data_config:UnitreeG1FixedBaseDataConfig \
    --batch_size 1 --max_steps 10000 --lora_rank 16 --lora_alpha 32 \
    --output_dir /home/ray/models/simple-pick-and-place \
    --report_to tensorboard
Custom UnitreeG1FixedBaseDataConfig defines:
- State keys: state.waist, state.left_arm, state.left_hand, state.right_arm, state.right_hand (31D total)
- Action keys: same groups as state (31D)
- Video: video.ego_view
- Action horizon: 16 steps
Training completed in ~10 min on RTX 5070 Ti. Merged LoRA adapter into full checkpoint with merge_and_unload().
Note: I had to manually rename LoRA keys during adapter loading (lora_A.weight → lora_A.default.weight) because PEFT's saved format differs from what get_peft_model() expects.
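That rename can be done with a small state-dict pass before loading. A sketch along these lines, where the inserted adapter name ("default") and the exact key substrings are assumptions that depend on the PEFT version:

```python
def rename_lora_keys(state_dict: dict) -> dict:
    """Insert the PEFT adapter name ("default") into saved LoRA keys so they
    match the module names get_peft_model() creates, e.g.
    'lora_A.weight' -> 'lora_A.default.weight'. Non-LoRA keys pass through."""
    renamed = {}
    for key, value in state_dict.items():
        for stem in ("lora_A", "lora_B"):
            key = key.replace(f"{stem}.weight", f"{stem}.default.weight")
        renamed[key] = value
    return renamed

sd = {
    "action_head.q_proj.lora_A.weight": 1,
    "action_head.q_proj.lora_B.weight": 2,
    "action_head.q_proj.base_layer.weight": 3,
}
out = rename_lora_keys(sd)
assert "action_head.q_proj.lora_A.default.weight" in out
assert "action_head.q_proj.lora_B.default.weight" in out
assert "action_head.q_proj.base_layer.weight" in out  # untouched
```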
Where It Breaks: Closed-Loop Evaluation
The Core Problem: Action Space Mismatch
The training data contains processed_actions: post-IK joint positions (31D absolute joint angles output by the PINK IK solver). So GR00T learned to predict joint positions.
But the environment's action space uses PinkInverseKinematicsActionCfg which expects pre-IK EEF targets (task-space wrist poses). You can't feed joint positions back into an IK solver that expects Cartesian targets.
What I Tried
Attempt 1: Bypass the action manager entirely
Created a custom eval env config with JointPositionActionCfg (direct joint position control) instead of PINK IK. Then in the eval loop:
- Read robot.data.joint_pos (43D) and camera.data.output["rgb"]
- Remap joint positions from sim order → policy body-group order
- Run GR00T inference → get action chunks (16 steps × 31D)
- Remap actions back from policy order → sim order (43D, zero-fill legs)
- Call robot.set_joint_position_target(target_pos) + manual physics stepping
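One detail to check in the manual-stepping variant: in Isaac Lab, set_joint_position_target() only buffers targets on the host; they reach PhysX when robot.write_data_to_sim() is called before sim.step(), as in the manual-stepping tutorials. A stubbed sketch of the call order (the Stub classes are stand-ins for the real articulation/sim objects, used here only so the snippet runs anywhere):

```python
class StubRobot:
    """Stand-in for the Isaac Lab Articulation, recording call order."""
    def __init__(self):
        self.calls = []
    def set_joint_position_target(self, targets):
        self.calls.append("set_target")   # buffers targets on the host
    def write_data_to_sim(self):
        self.calls.append("write")        # pushes buffered targets to PhysX
    def update(self, dt):
        self.calls.append("update")       # refreshes robot.data views

class StubSim:
    """Stand-in for the SimulationContext."""
    def __init__(self, robot):
        self.robot = robot
    def step(self):
        self.robot.calls.append("step")   # advances physics one tick

robot = StubRobot()
sim = StubSim(robot)
dt = 1 / 200
for target in [[0.1], [0.2]]:             # one joint-position target per control step
    robot.set_joint_position_target(target)
    robot.write_data_to_sim()             # without this, targets never reach PhysX
    sim.step()
    robot.update(dt)

assert robot.calls == ["set_target", "write", "step", "update"] * 2
```

If write_data_to_sim() is missing from the manual loop, the robot stays frozen exactly as described. When going through env.step() instead, the action manager handles the write, so a frozen robot there more likely points at actuator configuration (e.g. implicit PD actuators with zero stiffness track position targets with zero torque).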
Result: GR00T produces non-trivial outputs (waist: [-0.7, 0.6], arms: [-1.1, 1.0], hands: [-2.2, 2.0]) but the robot doesn't move at all. Object distance stays constant at 0.203m for all 300 steps.
Attempt 2: Use env.step() through the action manager
Same setup but pass the 43D joint targets through env.step(action) instead of manual stepping.
Result: Same β robot doesn't move. The JointPositionActionCfg action manager reports 43D action space, but joints don't change between steps.
Debug Output (Attempt 1)
[DEBUG] GR00T action_dict keys: ['action.waist', 'action.left_arm', 'action.left_hand', 'action.right_arm', 'action.right_hand']
[DEBUG] action.waist: shape=(1, 16, 3), min=-0.7357, max=0.5740
[DEBUG] action.left_arm: shape=(1, 16, 7), min=-0.6041, max=0.3572
[DEBUG] action.right_arm: shape=(1, 16, 7), min=-1.1013, max=1.0109
[DEBUG] Current joint_pos shape: torch.Size([1, 43])
[DEBUG] Current joint_pos[0, :5]: [-0.1, -0.1, 0.0, 0.0, 0.0]
[DEBUG] action_chunk shape: torch.Size([1, 16, 43])
[DEBUG] action_chunk[0, 0, :5]: [0.0, 0.0, -0.405, 0.0, 0.0]
The targets ARE different from the current positions, but joints don't move.
Other Issues Encountered Along the Way
- PINK IK C++ binding crash: TypeError: No Python class registered for C++ class std::vector<std::__cxx11::basic_string<...>> when PinkIKController initializes pinocchio. This is why I had to bypass PINK IK for eval.
- --enable_cameras required: TiledCamera crashes without this flag, but it's easy to forget since the training env doesn't need it.
- OOM with full fine-tuning: 550M param action head + AdamW states exceeded 16GB VRAM at batch_size=2. Fixed with LoRA rank=16 + batch_size=1 (11.3GB).
- Arena action space incompatibility: Arena's G1DecoupledWBCPinkActionCfg (23D) doesn't match the hand tracking retargeter output (21D). Had to build environments using built-in Isaac Lab PinkInverseKinematicsActionCfg (28D) instead.
My Questions
- How should the closed-loop eval work? GR00T was trained on post-IK joint positions. What's the correct way to apply these as actions in the simulator? Is JointPositionActionCfg the right approach, and if so, why might the robot not respond to set_joint_position_target()?
- Should the training data be pre-IK or post-IK? Should I instead record raw EEF targets as actions (pre-IK) and train GR00T to predict those, then use the PINK IK action manager during eval? The NVIDIA examples seem to use joint-space actions, but the documentation is sparse on this.
- Joint ordering / normalization: The GR00T policy uses min-max normalization during training. During inference, Gr00tPolicy.get_action() should denormalize. Are the denormalized values in radians? Do they match what robot.data.joint_pos reports?
- Action horizon alignment: Training uses action_horizon=16. During eval I consume one action per physics step (200 Hz sim / decimation 4 = 50 Hz control). Is this the right rate, or should each GR00T action correspond to multiple physics steps?
- Has anyone successfully closed the loop with GR00T + Isaac Lab? I'm looking for a reference implementation of the eval script, especially for humanoid manipulation with PINK IK training data.
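On the horizon question, the usual receding-horizon pattern is to execute a prefix of each predicted chunk at the control rate and then replan, rather than running inference every physics step. A sketch with a fake policy; execute_steps=8 is a tunable assumption, not a GR00T-prescribed value:

```python
def run_receding_horizon(policy, get_obs, apply_action, n_control_steps,
                         horizon=16, execute_steps=8):
    """Execute the first `execute_steps` actions of each predicted chunk at
    the control rate, then re-run inference on fresh observations."""
    executed = []
    step = 0
    while step < n_control_steps:
        chunk = policy(get_obs())          # shape: (horizon, act_dim)
        for action in chunk[:execute_steps]:
            apply_action(action)           # one control step (e.g. 50 Hz)
            executed.append(action)
            step += 1
            if step >= n_control_steps:
                break
    return executed

# Fake policy: returns a constant 16-step chunk and counts replans.
replans = []
fake_policy = lambda obs: (replans.append(obs) or [[0.0]] * 16)
out = run_receding_horizon(fake_policy, lambda: "obs", lambda a: None, 20)
assert len(out) == 20 and len(replans) == 3   # 8 + 8 + 4 actions, 3 inferences
```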
Relevant Code
Data config (UnitreeG1FixedBaseDataConfig):
class UnitreeG1FixedBaseDataConfig(BaseDataConfig):
    video_keys = ["video.ego_view"]
    state_keys = ["state.waist", "state.left_arm", "state.left_hand", "state.right_arm", "state.right_hand"]
    action_keys = ["action.waist", "action.left_arm", "action.left_hand", "action.right_arm", "action.right_hand"]
    language_keys = ["annotation.human.task_description"]
    observation_indices = [0]
    action_indices = list(range(16))
Eval env config (replacing PINK IK with direct joint control):
@configclass
class EvalActionsCfg:
    joint_pos = mdp.JointPositionActionCfg(
        asset_name="robot",
        joint_names=[".*"],
        scale=1.0,
        use_default_offset=False,
    )
GR00T inference + action application:
action_dict = policy.get_action(groot_obs) # Returns body-group actions
action_sim = remap_policy_joints_to_sim_joints(action_dict, policy_cfg, sim_cfg, device)
action_chunk = action_sim.get_joints_pos() # (1, 16, 43)
target_pos = action_chunk[:, action_idx, :] # (1, 43)
obs, rew, terminated, truncated, info = env.step(target_pos) # Robot doesn't move
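On the normalization question, one cheap sanity check is to denormalize by hand from the dataset statistics and compare against robot.data.joint_pos and the joint limits. A sketch assuming min-max normalization to [-1, 1]; the per-dimension min/max here are made up, and whether the policy has already denormalized internally is exactly what this check reveals (verify the convention against the statistics saved with your checkpoint):

```python
import numpy as np

def minmax_denormalize(a_norm, stat_min, stat_max):
    """Map [-1, 1]-normalized actions back to radians using per-dimension
    dataset min/max statistics (illustrative values below)."""
    a_norm = np.asarray(a_norm, dtype=float)
    stat_min = np.asarray(stat_min, dtype=float)
    stat_max = np.asarray(stat_max, dtype=float)
    return (a_norm + 1.0) / 2.0 * (stat_max - stat_min) + stat_min

# If Gr00tPolicy.get_action() already denormalizes, applying this a second
# time pushes targets outside the joint limits -- an easy symptom to spot.
lo, hi = [-0.5, -1.5], [0.5, 1.5]
assert minmax_denormalize([-1.0, 1.0], lo, hi).tolist() == [-0.5, 1.5]
assert minmax_denormalize([0.0, 0.0], lo, hi).tolist() == [0.0, 0.0]
```

If the debug values above (e.g. hands spanning [-2.2, 2.0] rad) exceed the G1's actual joint limits, that would suggest a double-denormalization or a units mismatch rather than a dead actuator.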
Any guidance on closing this loop would be hugely appreciated. Happy to share the full eval script or dataset structure if helpful.