My current issues with setting up GR00T
What I'm Trying to Do
I'm building an end-to-end Vision-Language-Action (VLA) pipeline using Isaac Lab Arena + GR00T N1.5-3B for a Unitree G1 humanoid robot doing tabletop manipulation tasks (chess piece placement, simple pick-and-place). The full pipeline is:
- Define custom environment in Isaac Lab Arena (table, objects, G1 fixed-base robot)
- Teleoperate with Quest 3 hand tracking via CloudXR → record demos as HDF5
- Convert HDF5 demos to LeRobot format (parquet + MP4 video)
- Fine-tune GR00T N1.5-3B on the demos using LoRA
- Closed-loop evaluation → run the fine-tuned model in the sim, feeding camera images + joint states → get joint position targets → step physics
Steps 1-4 work. Step 5 (closed-loop eval) is where it breaks: the robot doesn't move at all, despite GR00T producing non-trivial action outputs. I'm looking for guidance on the correct way to close the loop.
My Setup
- Hardware: RTX 5070 Ti (16GB), Intel i7-13700F, 32GB RAM, Meta Quest 3
- Software: Isaac Sim 5.1, Isaac Lab 0.47.2, Isaac Lab Arena 0.1.1, Isaac-GR00T (NVIDIA's repo), Python 3.11
- Robot: Unitree G1, 43-joint articulation, fixed-base (no locomotion), 29 active DOF
- Teleop: Quest 3 → CloudXR → OpenXR hand tracking → G1TriHandUpperBodyRetargeterCfg → PINK IK solver (28D task-space → joint-space)
What's Working
- Custom Arena Environment
Built a chess piece placement environment in Arena with custom assets (chess board USD from Blender, chess piece cylinder, green target square). Also built a simpler pick-and-place variant using only Isaac Lab primitives (no external USD).
- Hand Tracking Teleop (Fully Working)
Quest 3 hand tracking controls the G1 robot through CloudXR + PINK IK. The retargeter outputs 21D hand poses, which feed into PinkInverseKinematicsActionCfg (28D) that solves IK for the upper body. This works great.
Important: I had to abandon Arena's G1DecoupledWBCPinkActionCfg (23D action space) because it's incompatible with the hand-tracking retargeter output (21D). I switched to Isaac Lab's built-in PinkInverseKinematicsActionCfg (28D), which accepts the retargeter output directly.
- Demo Recording & Conversion
Recorded 5 teleoperated demos using record_demos.py. The HDF5 contains:
- obs/robot_joint_pos → 43D joint positions (full articulation state)
- obs/processed_actions → 31D post-IK joint positions (upper body only: waist + arms + hands)
- obs/robot_head_cam_rgb → 480×640 ego-view camera images
Converted to LeRobot format using a custom converter that:
- Remaps 43D sim joint order → GR00T policy body-group order (left_leg → right_leg → waist → left_arm → left_hand → right_arm → right_hand)
- Zero-fills leg joints in actions (since we only have upper body actions from PINK IK)
- Encodes video as MP4 with proper episode chunking
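The remap and zero-fill steps above boil down to a gather on the observation side and a scatter on the action side. A minimal sketch, with made-up joint-name lists standing in for the real 43-joint sim ordering and GR00T body-group ordering:

```python
import numpy as np

# Illustrative joint names, NOT the real G1 ordering: in practice the sim
# order comes from the articulation and the policy order from GR00T's
# body-group layout (legs, waist, arms, hands).
SIM_ORDER = ["left_hip", "right_hip", "waist", "left_shoulder", "right_shoulder"]
POLICY_ORDER = ["waist", "left_shoulder", "right_shoulder"]  # upper body only

# Precompute the index map once; name-based lookup avoids silent misordering.
POLICY_TO_SIM = [SIM_ORDER.index(name) for name in POLICY_ORDER]

def obs_sim_to_policy(joint_pos_sim: np.ndarray) -> np.ndarray:
    """Gather the policy-controlled joints out of the sim-ordered state."""
    return joint_pos_sim[..., POLICY_TO_SIM]

def action_policy_to_sim(action_policy: np.ndarray) -> np.ndarray:
    """Scatter policy actions into a full sim-ordered vector, zero-filling
    the joints (here the legs) that the policy does not control."""
    full = np.zeros(action_policy.shape[:-1] + (len(SIM_ORDER),))
    full[..., POLICY_TO_SIM] = action_policy
    return full

state = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
assert obs_sim_to_policy(state).tolist() == [0.3, 0.4, 0.5]
assert action_policy_to_sim(np.array([0.3, 0.4, 0.5])).tolist() == [0.0, 0.0, 0.3, 0.4, 0.5]
```

One pitfall worth ruling out in the frozen-robot symptom: if the same index map is accidentally applied in the same direction for both observations and actions, the round trip silently scrambles joints.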
Dataset modality.json maps flat parquet columns to GR00T body group keys (e.g., state.waist = indices 12-14 of observation.state).
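For concreteness, a modality.json in the shape the upstream Isaac-GR00T examples use might look like the fragment below. The start/end indices are illustrative (assuming legs occupy indices 0-11 and end is exclusive, which makes state.waist = 12-14 inclusive as above); the annotation/video original_key values are assumptions to verify against your converter:

```json
{
  "state": {
    "waist":      {"start": 12, "end": 15},
    "left_arm":   {"start": 15, "end": 22},
    "left_hand":  {"start": 22, "end": 29},
    "right_arm":  {"start": 29, "end": 36},
    "right_hand": {"start": 36, "end": 43}
  },
  "action": {
    "waist":      {"start": 12, "end": 15},
    "left_arm":   {"start": 15, "end": 22},
    "left_hand":  {"start": 22, "end": 29},
    "right_arm":  {"start": 29, "end": 36},
    "right_hand": {"start": 36, "end": 43}
  },
  "video": {
    "ego_view": {"original_key": "observation.images.ego_view"}
  },
  "annotation": {
    "human.task_description": {}
  }
}
```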
- GR00T Fine-Tuning (Completed)
Fine-tuned GR00T N1.5-3B with LoRA (rank 16, alpha 32) on the 5 demos:
python scripts/gr00t_finetune.py \
    --dataset_path /home/ray/datasets/simple_pick_and_place/simple_pick_and_place_demos/lerobot \
    --data_config isaaclab_arena_gr00t.data_config:UnitreeG1FixedBaseDataConfig \
    --batch_size 1 --max_steps 10000 --lora_rank 16 --lora_alpha 32 \
    --output_dir /home/ray/models/simple-pick-and-place \
    --report_to tensorboard
Custom UnitreeG1FixedBaseDataConfig defines:
- State keys: state.waist, state.left_arm, state.left_hand, state.right_arm, state.right_hand (31D total)
- Action keys: same groups as state (31D)
- Video: video.ego_view
- Action horizon: 16 steps
Training completed in ~10 min on RTX 5070 Ti. Merged LoRA adapter into full checkpoint with merge_and_unload().
Note: I had to manually rename LoRA keys during adapter loading (lora_A.weight → lora_A.default.weight) because PEFT's saved format differs from what get_peft_model() expects.
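That rename can be done with a small state-dict pass before loading. A sketch along these lines, where the inserted adapter name ("default") and the exact key substrings are assumptions that depend on the PEFT version:

```python
def rename_lora_keys(state_dict: dict) -> dict:
    """Insert the PEFT adapter name ("default") into saved LoRA keys so they
    match the module names get_peft_model() creates, e.g.
    'lora_A.weight' -> 'lora_A.default.weight'. Non-LoRA keys pass through."""
    renamed = {}
    for key, value in state_dict.items():
        for stem in ("lora_A", "lora_B"):
            key = key.replace(f"{stem}.weight", f"{stem}.default.weight")
        renamed[key] = value
    return renamed

sd = {
    "action_head.q_proj.lora_A.weight": 1,
    "action_head.q_proj.lora_B.weight": 2,
    "action_head.q_proj.base_layer.weight": 3,
}
out = rename_lora_keys(sd)
assert "action_head.q_proj.lora_A.default.weight" in out
assert "action_head.q_proj.lora_B.default.weight" in out
assert "action_head.q_proj.base_layer.weight" in out  # untouched
```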
Where It Breaks: Closed-Loop Evaluation
The Core Problem: Action Space Mismatch
The training data contains processed_actions: post-IK joint positions (31D absolute joint angles output by the PINK IK solver). So GR00T learned to predict joint positions.
But the environment's action space uses PinkInverseKinematicsActionCfg which expects pre-IK EEF targets (task-space wrist poses). You can't feed joint positions back into an IK solver that expects Cartesian targets.
What I Tried
Attempt 1: Bypass the action manager entirely
Created a custom eval env config with JointPositionActionCfg (direct joint position control) instead of PINK IK. Then in the eval loop:
- Read robot.data.joint_pos (43D) and camera.data.output["rgb"]
- Remap joint positions from sim order → policy body-group order
- Run GR00T inference → get action chunks (16 steps × 31D)
- Remap actions back from policy order → sim order (43D, zero-fill legs)
- Call robot.set_joint_position_target(target_pos) + manual physics stepping
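One detail to check in the manual-stepping variant: in Isaac Lab, set_joint_position_target() only buffers targets on the host; they reach PhysX when robot.write_data_to_sim() is called before sim.step(), as in the manual-stepping tutorials. A stubbed sketch of the call order (the Stub classes are stand-ins for the real articulation/sim objects, used here only so the snippet runs anywhere):

```python
class StubRobot:
    """Stand-in for the Isaac Lab Articulation, recording call order."""
    def __init__(self):
        self.calls = []
    def set_joint_position_target(self, targets):
        self.calls.append("set_target")   # buffers targets on the host
    def write_data_to_sim(self):
        self.calls.append("write")        # pushes buffered targets to PhysX
    def update(self, dt):
        self.calls.append("update")       # refreshes robot.data views

class StubSim:
    """Stand-in for the SimulationContext."""
    def __init__(self, robot):
        self.robot = robot
    def step(self):
        self.robot.calls.append("step")   # advances physics one tick

robot = StubRobot()
sim = StubSim(robot)
dt = 1 / 200
for target in [[0.1], [0.2]]:             # one joint-position target per control step
    robot.set_joint_position_target(target)
    robot.write_data_to_sim()             # without this, targets never reach PhysX
    sim.step()
    robot.update(dt)

assert robot.calls == ["set_target", "write", "step", "update"] * 2
```

If write_data_to_sim() is missing from the manual loop, the robot stays frozen exactly as described. When going through env.step() instead, the action manager handles the write, so a frozen robot there more likely points at actuator configuration (e.g. implicit PD actuators with zero stiffness track position targets with zero torque).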
Result: GR00T produces non-trivial outputs (waist: [-0.7, 0.6], arms: [-1.1, 1.0], hands: [-2.2, 2.0]) but the robot doesn't move at all. Object distance stays constant at 0.203m for all 300 steps.
Attempt 2: Use env.step() through the action manager
Same setup but pass the 43D joint targets through env.step(action) instead of manual stepping.
Result: Same β robot doesn't move. The JointPositionActionCfg action manager reports 43D action space, but joints don't change between steps.
Debug Output (Attempt 1)
[DEBUG] GR00T action_dict keys: ['action.waist', 'action.left_arm', 'action.left_hand', 'action.right_arm', 'action.right_hand']
[DEBUG] action.waist: shape=(1, 16, 3), min=-0.7357, max=0.5740
[DEBUG] action.left_arm: shape=(1, 16, 7), min=-0.6041, max=0.3572
[DEBUG] action.right_arm: shape=(1, 16, 7), min=-1.1013, max=1.0109
[DEBUG] Current joint_pos shape: torch.Size([1, 43])
[DEBUG] Current joint_pos[0, :5]: [-0.1, -0.1, 0.0, 0.0, 0.0]
[DEBUG] action_chunk shape: torch.Size([1, 16, 43])
[DEBUG] action_chunk[0, 0, :5]: [0.0, 0.0, -0.405, 0.0, 0.0]
The targets ARE different from the current positions, but joints don't move.
Other Issues Encountered Along the Way
- PINK IK C++ binding crash: TypeError: No Python class registered for C++ class std::vector<std::__cxx11::basic_string<...>> when PinkIKController initializes pinocchio. This is why I had to bypass PINK IK for eval.
- --enable_cameras required: TiledCamera crashes without this flag, but it's easy to forget since the training env doesn't need it.
- OOM with full fine-tuning: 550M param action head + AdamW states exceeded 16GB VRAM at batch_size=2. Fixed with LoRA rank=16 + batch_size=1 (11.3GB).
- Arena action space incompatibility: Arena's G1DecoupledWBCPinkActionCfg (23D) doesn't match the hand tracking retargeter output (21D). Had to build environments using built-in Isaac Lab PinkInverseKinematicsActionCfg (28D) instead.
My Questions
- How should the closed-loop eval work? GR00T was trained on post-IK joint positions. What's the correct way to apply these as actions in the simulator? Is JointPositionActionCfg the right approach, and if so, why might the robot not respond to set_joint_position_target()?
- Should the training data be pre-IK or post-IK? Should I instead record raw EEF targets as actions (pre-IK) and train GR00T to predict those, then use the PINK IK action manager during eval? The NVIDIA examples seem to use joint-space actions, but the documentation is sparse on this.
- Joint ordering / normalization: The GR00T policy uses min-max normalization during training. During inference, Gr00tPolicy.get_action() should denormalize. Are the denormalized values in radians? Do they match what robot.data.joint_pos reports?
- Action horizon alignment: Training uses action_horizon=16. During eval I consume one action per physics step (200 Hz sim / decimation 4 = 50 Hz control). Is this the right rate, or should each GR00T action correspond to multiple physics steps?
- Has anyone successfully closed the loop with GR00T + Isaac Lab? I'm looking for a reference implementation of the eval script, especially for humanoid manipulation with PINK IK training data.
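On the horizon question, the usual receding-horizon pattern is to execute a prefix of each predicted chunk at the control rate and then replan, rather than running inference every physics step. A sketch with a fake policy; execute_steps=8 is a tunable assumption, not a GR00T-prescribed value:

```python
def run_receding_horizon(policy, get_obs, apply_action, n_control_steps,
                         horizon=16, execute_steps=8):
    """Execute the first `execute_steps` actions of each predicted chunk at
    the control rate, then re-run inference on fresh observations."""
    executed = []
    step = 0
    while step < n_control_steps:
        chunk = policy(get_obs())          # shape: (horizon, act_dim)
        for action in chunk[:execute_steps]:
            apply_action(action)           # one control step (e.g. 50 Hz)
            executed.append(action)
            step += 1
            if step >= n_control_steps:
                break
    return executed

# Fake policy: returns a constant 16-step chunk and counts replans.
replans = []
fake_policy = lambda obs: (replans.append(obs) or [[0.0]] * 16)
out = run_receding_horizon(fake_policy, lambda: "obs", lambda a: None, 20)
assert len(out) == 20 and len(replans) == 3   # 8 + 8 + 4 actions, 3 inferences
```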
Relevant Code
Data config (UnitreeG1FixedBaseDataConfig):
class UnitreeG1FixedBaseDataConfig(BaseDataConfig):
    video_keys = ["video.ego_view"]
    state_keys = ["state.waist", "state.left_arm", "state.left_hand", "state.right_arm", "state.right_hand"]
    action_keys = ["action.waist", "action.left_arm", "action.left_hand", "action.right_arm", "action.right_hand"]
    language_keys = ["annotation.human.task_description"]
    observation_indices = [0]
    action_indices = list(range(16))
Eval env config (replacing PINK IK with direct joint control):
@configclass
class EvalActionsCfg:
    joint_pos = mdp.JointPositionActionCfg(
        asset_name="robot",
        joint_names=[".*"],
        scale=1.0,
        use_default_offset=False,
    )
GR00T inference + action application:
action_dict = policy.get_action(groot_obs) # Returns body-group actions
action_sim = remap_policy_joints_to_sim_joints(action_dict, policy_cfg, sim_cfg, device)
action_chunk = action_sim.get_joints_pos() # (1, 16, 43)
target_pos = action_chunk[:, action_idx, :] # (1, 43)
obs, rew, terminated, truncated, info = env.step(target_pos) # Robot doesn't move
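On the normalization question, one cheap sanity check is to denormalize by hand from the dataset statistics and compare against robot.data.joint_pos and the joint limits. A sketch assuming min-max normalization to [-1, 1]; the per-dimension min/max here are made up, and whether the policy has already denormalized internally is exactly what this check reveals (verify the convention against the statistics saved with your checkpoint):

```python
import numpy as np

def minmax_denormalize(a_norm, stat_min, stat_max):
    """Map [-1, 1]-normalized actions back to radians using per-dimension
    dataset min/max statistics (illustrative values below)."""
    a_norm = np.asarray(a_norm, dtype=float)
    stat_min = np.asarray(stat_min, dtype=float)
    stat_max = np.asarray(stat_max, dtype=float)
    return (a_norm + 1.0) / 2.0 * (stat_max - stat_min) + stat_min

# If Gr00tPolicy.get_action() already denormalizes, applying this a second
# time pushes targets outside the joint limits -- an easy symptom to spot.
lo, hi = [-0.5, -1.5], [0.5, 1.5]
assert minmax_denormalize([-1.0, 1.0], lo, hi).tolist() == [-0.5, 1.5]
assert minmax_denormalize([0.0, 0.0], lo, hi).tolist() == [0.0, 0.0]
```

If the debug values above (e.g. hands spanning [-2.2, 2.0] rad) exceed the G1's actual joint limits, that would suggest a double-denormalization or a units mismatch rather than a dead actuator.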
Any guidance on closing this loop would be hugely appreciated. Happy to share the full eval script or dataset structure if helpful.