
	---
	tags:
	- OpenRAL
	- rskill
	- diffuser-actor
	- 3d-diffuser-actor
	- rlbench
	- coppeliasim
	- peract
	- manipulation
	- franka
	license: mit
	language:
	- en
	---

	<!--
	rSkill README — 3D Diffuser Actor (RLBench PerAct setup).
	Discovery + provenance card; mirrors rskill.yaml. ADR-0062.
	-->

	# rskill-3d-diffuser-actor-rlbench

	3D Diffuser Actor — a diffusion policy over end-effector keyposes for RLBench,
	running on the CoppeliaSim/PyRep RLBench benchmark backend (ADR-0062).

	## What this skill does

	Predicts the next end-effector keypose (position + orientation + gripper) from
	multi-view RGB-D, conditioned on a language instruction. Used to benchmark
	3D/keyframe manipulation on the RLBench PerAct 18-task suite. Ships the three
	live-verified starter tasks: `open_drawer`, `meat_off_grill`, `close_jar`.

	| Field | Value |
	|---|---|
	| Actions | open, close, pick, place (generalist keyframe policy) |
	| Objects | drawer, grill/meat, jar (PerAct task objects) |
	| Scenes | tabletop (RLBench / CoppeliaSim) |
	| Embodiment | franka_panda |

	## How it works

	3D Diffuser Actor lifts the four RLBench camera RGB-D streams into a 3D point-cloud
	scene token field, attends over it with a relative-position transformer, and runs a
	DDPM diffusion head (100 denoising steps) to denoise an end-effector keypose
	trajectory. RLBench executes each predicted keypose with its sampling-based
	motion planner (`EndEffectorPoseViaPlanning`); the policy then re-observes and
	predicts the next keypose. The policy and the CoppeliaSim/PyRep scene run in an
	out-of-process py3.10 sidecar (ZMQ + msgpack); the openral adapter
	(`openral_sim.policies.rlbench_3dda`) forks it transparently.
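
The predict, execute, re-observe cycle described above can be sketched as a small closed loop. This is an illustration only: `ToyEnv`, `toy_policy`, `run_episode`, `GOAL`, and `MAX_STEP` are hypothetical stand-ins, not the openral adapter, the sidecar protocol, or the RLBench API. The 25-keypose cap mirrors the evaluation setting stated in this card.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Keypose = Tuple[float, ...]  # [x, y, z, qx, qy, qz, qw, gripper_open]
GOAL = (0.3, 0.0, 0.2)       # hypothetical target position for the toy task
MAX_STEP = 0.1               # hypothetical per-axis step limit of the toy policy


@dataclass
class ToyEnv:
    """Hypothetical simulator stand-in: success once the gripper reaches GOAL."""
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)

    def observe(self) -> Dict[str, Tuple[float, float, float]]:
        # A real observation would also carry RGB images and point clouds.
        return {"gripper_position": self.position}

    def execute(self, keypose: Keypose) -> bool:
        # Pretend the motion planner reached the keypose, then check the goal.
        self.position = (keypose[0], keypose[1], keypose[2])
        return all(abs(p - g) < 1e-3 for p, g in zip(self.position, GOAL))


def toy_policy(obs: Dict[str, Tuple[float, float, float]], instruction: str) -> Keypose:
    """Hypothetical policy: step toward GOAL, at most MAX_STEP per axis."""
    moved = tuple(
        p + max(-MAX_STEP, min(MAX_STEP, g - p))
        for p, g in zip(obs["gripper_position"], GOAL)
    )
    return moved + (0.0, 0.0, 0.0, 1.0, 1.0)  # identity rotation, gripper open


def run_episode(env, policy, instruction: str, max_keyposes: int = 25):
    """Predict a keypose, execute it, re-observe; stop on success or budget."""
    for step in range(1, max_keyposes + 1):
        keypose = policy(env.observe(), instruction)
        if env.execute(keypose):
            return True, step
    return False, max_keyposes


success, steps = run_episode(ToyEnv(), toy_policy, "open the drawer")
print(success, steps)
```

The policy never sees the planner's intermediate waypoints. It only observes the scene again after each keypose has been executed, which is why the episode budget is counted in keyposes rather than low-level control steps.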

	### Observation → action contract

	| dir | key | shape / dtype | notes |
	|---|---|---|---|
	| in | `observation.images.{left_shoulder,right_shoulder,wrist,front}` | `(H, W, 3) uint8` | RLBench PerAct cameras, 256×256 |
	| in | `observation.point_clouds.{…}` | `(H, W, 3) float32` | per-camera world-frame point clouds |
	| in | `observation.gripper_pose` | `(7,) float32` | `[x y z qx qy qz qw]` |
	| out | keyframe action | `(8,) float32` | `[x y z qx qy qz qw gripper_open]` (world frame) |
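
To make the 8-value output layout concrete, here is a minimal sketch that splits a keyframe action into its parts. `KeyframeAction` and `decode_action` are hypothetical helpers, not part of openral. The 0.5 gripper threshold and the quaternion tolerance are assumptions chosen for illustration.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class KeyframeAction(NamedTuple):
    position: Tuple[float, float, float]                # world-frame x, y, z
    quaternion_xyzw: Tuple[float, float, float, float]  # unit quaternion
    gripper_open: bool


def decode_action(action: Sequence[float], quat_tol: float = 1e-2) -> KeyframeAction:
    """Split [x y z qx qy qz qw gripper_open] and sanity-check the rotation."""
    if len(action) != 8:
        raise ValueError(f"expected 8 values, got {len(action)}")
    x, y, z, qx, qy, qz, qw, grip = (float(v) for v in action)
    norm = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    if abs(norm - 1.0) > quat_tol:
        raise ValueError(f"quaternion norm {norm:.4f} is not close to 1")
    # Renormalise so small numerical drift does not reach the motion planner.
    quat = (qx / norm, qy / norm, qz / norm, qw / norm)
    # Assumption: values above 0.5 mean "open"; the card does not state a threshold.
    return KeyframeAction((x, y, z), quat, grip > 0.5)


decoded = decode_action([0.3, 0.0, 0.9, 0.0, 0.0, 0.0, 1.0, 1.0])
print(decoded)
```

Rejecting a badly scaled quaternion early keeps a malformed prediction from surfacing later as an opaque planner failure.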

	## Upstream model / training

	The weights are the authors' published RLBench PerAct multi-task checkpoint
	(`diffuser_actor_peract.pth`), loaded verbatim and not retrained. The authors
	trained it on the PerAct 18-task RLBench demonstrations (multi-view RGB-D with
	keypose supervision).

	| Field | Value |
	|---|---|
	| Source repo | [`nickgkan/3d_diffuser_actor`](https://github.com/nickgkan/3d_diffuser_actor) |
	| Weights | [`katefgroup/3d_diffuser_actor`](https://huggingface.co/katefgroup/3d_diffuser_actor), file `diffuser_actor_peract.pth` (168 MB) |
	| Paper | [arxiv:2402.10885](https://arxiv.org/abs/2402.10885), "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations" |
	| License | MIT (code + checkpoints), commercially permissive |
	| Parameters | ~55 M |
	| Training data | RLBench PerAct 18-task demonstrations |

	## Supported robots

	| Robot | Scene | Status | Notes |
	|---|---|---|---|
	| franka_panda | RLBench (CoppeliaSim) | ✓ validated | open_drawer 4/4, meat_off_grill 3/3, close_jar solved (8 GB Ada host, 2026-06-19) |

	## Sensors required

	| key | modality | resolution | dtype |
	|---|---|---|---|
	| `observation.images.left_shoulder` | RGB | 256 × 256 | `uint8` |
	| `observation.images.right_shoulder` | RGB | 256 × 256 | `uint8` |
	| `observation.images.wrist` | RGB | 256 × 256 | `uint8` |
	| `observation.images.front` | RGB | 256 × 256 | `uint8` |
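
A pre-flight check against the table above might look like the sketch below. It assumes the observation arrives as a flat mapping keyed by the dotted names and that each value exposes `.shape` and `.dtype` the way a NumPy array does. `check_images` is a hypothetical helper, and `SimpleNamespace` only stands in for real image arrays.

```python
from types import SimpleNamespace
from typing import Any, List, Mapping

CAMERAS = ("left_shoulder", "right_shoulder", "wrist", "front")
IMAGE_SHAPE = (256, 256, 3)  # H, W, RGB channels
IMAGE_DTYPE = "uint8"


def check_images(observation: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    for camera in CAMERAS:
        key = f"observation.images.{camera}"
        image = observation.get(key)
        if image is None:
            problems.append(f"missing {key}")
            continue
        if tuple(image.shape) != IMAGE_SHAPE:
            problems.append(f"{key}: shape {tuple(image.shape)} != {IMAGE_SHAPE}")
        if str(image.dtype) != IMAGE_DTYPE:
            problems.append(f"{key}: dtype {image.dtype} != {IMAGE_DTYPE}")
    return problems


# Stand-in frames: only the attributes inspected by the check are provided.
frame = SimpleNamespace(shape=(256, 256, 3), dtype="uint8")
observation = {f"observation.images.{c}": frame for c in CAMERAS}
print(check_images(observation))
```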

	## Manifest summary

	| Field | Value |
	|---|---|
	| `name` | `OpenRAL/rskill-3d-diffuser-actor-rlbench` |
	| `version` | `0.1.0` |
	| `license` | `mit` |
	| `role` | `s1` |
	| `model_family` | `diffuser_actor` |
	| `embodiment_tags` | `franka_panda` |
	| `runtime` | `pytorch` |
	| `weights_uri` | `hf://katefgroup/3d_diffuser_actor` |
	| `action_contract.dim` | `8` |
	| `latency_budget.per_chunk_ms` | `3000.0` |

	## Reproduction

	```bash
	# One-time: provision CoppeliaSim 4.1.0 + PyRep + RLBench@peract + the checkpoint
	# in the py3.10 sidecar venv (see docs/adr/0062-rlbench-benchmark-backend.md).
	openral benchmark scene \
	--config scenes/benchmark/rlbench_open_drawer.yaml \
	--rskill rskills/3d-diffuser-actor-rlbench
	```

	Inference VRAM peaks at about 0.43 GB, so the skill runs comfortably on an 8 GB
	GPU. CoppeliaSim is proprietary (free EDU license) and is never vendored; it is
	an externally provisioned dependency (CLAUDE.md §1.9 / ADR-0062).
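
To run all three starter tasks in sequence, the command above can be generated per task. The sketch below only builds and prints the commands. It reuses the CLI invocation and the scene paths listed in this card; `scene_command` is a hypothetical helper, not an openral API.

```python
import shlex
from typing import List

STARTER_TASKS = ("open_drawer", "meat_off_grill", "close_jar")
RSKILL_DIR = "rskills/3d-diffuser-actor-rlbench"


def scene_command(task: str, rskill: str = RSKILL_DIR) -> List[str]:
    """Build the argv for one single-scene benchmark run."""
    return [
        "openral", "benchmark", "scene",
        "--config", f"scenes/benchmark/rlbench_{task}.yaml",
        "--rskill", rskill,
    ]


commands = [scene_command(task) for task in STARTER_TASKS]
for argv in commands:
    # Dry run: print a shell-safe command line instead of executing it.
    print(shlex.join(argv))
```

To execute rather than print, each list can be passed to `subprocess.run(argv, check=True)` once the sidecar environment has been provisioned.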

	## Evaluation

	[`eval/rlbench.json`](eval/rlbench.json) holds the full official-protocol
	result (`reproduced_locally: true`), produced by the canonical
	`openral benchmark run` (ADR-0009 PR D) on an 8 GB Ada host (2026-06-20), with
	25 episodes per task, seeds 0–24, and a maximum of 25 macro-keyposes:

	| Task | Success rate |
	|---|---|
	| `open_drawer` | 22/25 = 0.88 |
	| `meat_off_grill` | 24/25 = 0.96 |
	| `close_jar` | 19/25 = 0.76 |
	| Average | 0.867 |

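	Mean step latency is about 946 ms. The three-task average is broadly in line with
	the 3D Diffuser Actor paper's ~0.81 RLBench PerAct average, which covers the full
	18-task suite rather than only these three tasks. Reproduce with: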

	```bash
	openral benchmark run --suite rlbench --rskill rskills/3d-diffuser-actor-rlbench
	```

	> Note on variance. RLBench's sampling-based `EndEffectorPoseViaPlanning`
	> mover is non-deterministic, so per-task rates vary run-to-run; 3 of the 75
	> episodes hit a planner path-failure and are counted as failed episodes (the
	> sidecar handles them gracefully rather than aborting the run — ADR-0062).
	> Per-task paper baselines (Ke et al., 2402.10885, Table 1) are intentionally
	> not transcribed into the artifact to avoid mis-citation.
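
To give a feel for how much run-to-run spread 25 episodes per task leaves room for, the sketch below recomputes the reported rates and attaches a 95% Wilson score interval to each. The interval is an illustration added here, not part of the evaluation protocol or the artifact. It treats episodes as independent trials, which is an assumption.

```python
import math
from typing import Dict, Tuple

# (successes, episodes) per task, copied from the results table in this card.
RESULTS: Dict[str, Tuple[int, int]] = {
    "open_drawer": (22, 25),
    "meat_off_grill": (24, 25),
    "close_jar": (19, 25),
}


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


rates = {task: s / n for task, (s, n) in RESULTS.items()}
average = sum(rates.values()) / len(rates)
intervals = {task: wilson_interval(s, n) for task, (s, n) in RESULTS.items()}

for task, rate in rates.items():
    low, high = intervals[task]
    print(f"{task}: {rate:.2f} (95% Wilson interval {low:.2f} to {high:.2f})")
print(f"average: {average:.3f}")
```

With only 25 episodes, each interval spans well over ten percentage points, so small per-task differences between runs should not be over-interpreted.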

	## License

	OpenRAL wrapper files in this repository follow the project Apache-2.0 license.
	The wrapped upstream 3D Diffuser Actor code and released
	`diffuser_actor_peract.pth` checkpoint are MIT-licensed; the manifest therefore
	uses `license: mit` for the consumer-visible weight/runtime posture.

	## See also

	- `scenes/benchmark/rlbench_open_drawer.yaml`
	- `scenes/benchmark/rlbench_meat_off_grill.yaml`
	- `scenes/benchmark/rlbench_close_jar.yaml`
	- `benchmarks/rlbench.yaml`
	- `docs/adr/0062-rlbench-benchmark-backend.md`