# Code for Phantom and Masquerade
<hr style="border: 2px solid gray;"></hr>
This repository contains the code used to process human videos in [Phantom: Training Robots Without Robots Using Only Human Videos](https://phantom-human-videos.github.io/) and [Masquerade: Learning from In-the-wild Human Videos using Data-Editing](https://masquerade-robot.github.io/).
<table>
<tr>
<td align="center" width="50%">
<h3><a href="https://phantom-human-videos.github.io/">Phantom: Training Robots Without Robots Using Only Human Videos</a></h3>
<p><em><a href=https://marionlepert.github.io/>Marion Lepert</a></em>, <em><a href=https://jiayingfang.github.io/>Jiaying Fang</a></em>, <em><a href=https://web.stanford.edu/~bohg/>Jeannette Bohg</a></em></p>
<a href="https://phantom-human-videos.github.io/">
<img src="docs/teaser_phantom.png" alt="Phantom Teaser" width="90%">
</a>
</td>
<td align="center" width="50%">
<h3><a href="https://masquerade-robot.github.io/">Masquerade: Learning from In-the-wild Human Videos using Data-Editing</a></h3>
<p><em><a href=https://marionlepert.github.io/>Marion Lepert*</a></em>, <em><a href=https://jiayingfang.github.io/>Jiaying Fang*</a></em>, <em><a href=https://web.stanford.edu/~bohg/>Jeannette Bohg</a></em></p>
<img src="docs/teaser_masquerade.png" alt="Masquerade Teaser" width="90%">
</td>
</tr>
</table>
Both projects use data editing to convert human videos into “robotized” demonstrations. They share much of the same codebase, with some differences in the processing pipeline:
**Phantom**
* Input: RGBD videos with a single left hand visible in every frame.
* Data editing: inpaint the single human arm, overlay a rendered robot arm in the same pose.
* Action labels: extract full 3D end-effector pose (position, orientation, gripper)
**Masquerade**
* Input: RGB videos from [Epic Kitchens](https://epic-kitchens.github.io/2025); one or both hands may be visible, sometimes occluded.
* Data editing: segment and inpaint both arms, overlay a bimanual robot whose end effectors follow the estimated poses (with a 3-4 cm error along the depth direction because no depth data is available)
* Action labels: use 2D projected waypoints as auxiliary supervision only (not full 3D actions)
## Installation
1. Clone this repo recursively
```bash
git clone --recursive git@github.com:MarionLepert/phantom.git
```
2. Run the following script from the root directory to install the required conda environment.
```bash
./install.sh
```
3. Download the MANO hand models. Register on the [MANO website](https://mano.is.tue.mpg.de/), download the left and right hand models, and move `MANO_LEFT.pkl` and `MANO_RIGHT.pkl` into the `$ROOT_DIR/submodules/phantom-hamer/_DATA/data/mano/` folder.
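Before running the pipeline, it can help to confirm that both MANO files ended up in the expected folder. The snippet below is a minimal sketch and not part of the repository: `missing_mano_models` is a hypothetical helper, and it only assumes the folder and file names given in step 3.

```python
from pathlib import Path

# File names and sub-folder are taken from the installation step above.
MANO_FILES = ("MANO_LEFT.pkl", "MANO_RIGHT.pkl")
MANO_SUBDIR = Path("submodules") / "phantom-hamer" / "_DATA" / "data" / "mano"


def missing_mano_models(root_dir):
    """Return the MANO file names that are not present under root_dir.

    Hypothetical helper: root_dir is the repository root ($ROOT_DIR).
    """
    mano_dir = Path(root_dir) / MANO_SUBDIR
    return [name for name in MANO_FILES if not (mano_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_mano_models(".")
    if missing:
        print("Missing MANO models:", ", ".join(missing))
    else:
        print("Both MANO models found.")
```

Run it from the repository root; an empty result means both models are in place.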
## Getting Started
Process **Phantom** sample data (manually collected in-lab videos)
```bash
conda activate phantom
python process_data.py demo_name=pick_and_place data_root_dir=../data/raw processed_data_root_dir=../data/processed mode=all
```
Process **Masquerade** sample data ([Epic Kitchens](https://epic-kitchens.github.io/2025) video)
```bash
conda activate phantom
python process_data.py demo_name=epic data_root_dir=../data/raw processed_data_root_dir=../data/processed mode=all --config-name=epic
```
## Codebase Overview
### Process data
Each video is processed using the following steps:
1. **Extract human hand bounding boxes**: `bbox_processor.py`
   * `mode=bbox`
2. **Extract 2D human hand poses**: `hand_processor.py`
   * `mode=hand2d`: extract the 2D hand pose
3. **Extract human hand and arm segmentation masks**: `segmentation_processor.py`
   * `mode=hand_segmentation`: used for depth alignment in hand pose refinement (only works for `hand3d`)
   * `mode=arm_segmentation`: needed in all cases to inpaint the human
4. **Extract 3D human hand poses**: `hand_processor.py`
   * `mode=hand3d`: extract the 3D hand pose (note: requires depth, and was only tested on the left hand)
5. **Retarget human actions to robot actions**: `action_processor.py`
   * `mode=action`
6. **Smooth human poses**: `smoothing_processor.py`
   * `mode=smoothing`
7. **Remove hand from videos using inpainting**: `handinpaint_processor.py`
   * `mode=hand_inpaint`
   * Uses the [E2FGVI](https://arxiv.org/pdf/2204.02663) inpainting method.
8. **Overlay virtual robot on video**: `robotinpaint_processor.py`
   * `mode=robot_inpaint`: overlay a single robot (default) or a bimanual robot (epic mode) on the image
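To run only some of the steps above from a script, the command line can be assembled programmatically. The following is a minimal sketch, not code from the repository: `build_process_command` is a hypothetical helper that mirrors the `key=value` override style of the Getting Started commands and the `mode=[bbox,hand2d]` list syntax from the config reference. The default directories are copied from those examples.

```python
# Mode names as listed in this README.
VALID_MODES = (
    "bbox", "hand2d", "hand3d", "hand_segmentation", "arm_segmentation",
    "action", "smoothing", "hand_inpaint", "robot_inpaint", "all",
)


def build_process_command(demo_name, modes,
                          data_root_dir="../data/raw",
                          processed_data_root_dir="../data/processed",
                          config_name=None):
    """Build an argv list for process_data.py (hypothetical helper)."""
    if not modes:
        raise ValueError("at least one mode is required")
    unknown = [m for m in modes if m not in VALID_MODES]
    if unknown:
        raise ValueError(f"unknown mode(s): {unknown}")
    # A single mode is passed as mode=name, several as mode=[a,b].
    if len(modes) == 1:
        mode_arg = f"mode={modes[0]}"
    else:
        mode_arg = "mode=[" + ",".join(modes) + "]"
    cmd = [
        "python", "process_data.py",
        f"demo_name={demo_name}",
        f"data_root_dir={data_root_dir}",
        f"processed_data_root_dir={processed_data_root_dir}",
        mode_arg,
    ]
    if config_name is not None:
        cmd.append(f"--config-name={config_name}")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_process_command("pick_and_place", ["bbox", "hand2d"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the `phantom` conda environment active. Because the list bypasses the shell, the quoting around `mode=[...]` is not needed.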
### Config reference (see configuration files in `configs/`)
| Flag | Type | Required | Choices | Description |
|------|------|----------|---------|-------------|
| `--demo_name` | `str` | ✅ | - | Name of the demonstration/dataset to process |
| `--mode` | `str` (multiple) | ✅ | `bbox`, `hand2d`, `hand3d`, `hand_segmentation`, `arm_segmentation`, `action`, `smoothing`, `hand_inpaint`, `robot_inpaint`, `all` | Processing modes to run (can specify multiple with e.g. `'mode=[bbox,hand2d]'`) |
| `--robot_name` | `str` | ✅ | `Panda`, `Kinova3`, `UR5e`, `IIWA`, `Jaco` | Type of robot to use for overlays |
| `--gripper_name` | `str` | ❌ | `Robotiq85` | Type of gripper to use |
| `--data_root_dir` | `str` | ❌ | - | Root directory containing raw video data |
| `--processed_data_root_dir` | `str` | ❌ | - | Root directory to save processed data |
| `--epic` | `bool` | ❌ | - | Use Epic-Kitchens dataset processing mode |
| `--bimanual_setup` | `str` | ❌ | `single_arm`, `shoulders` | Bimanual setup configuration to use (`shoulders` corresponds to the bimanual hardware configuration used in Masquerade) |
| `--target_hand` | `str` | ❌ | `left`, `right`, `both` | Which hand(s) to target for processing |
| `--camera_intrinsics` | `str` | ❌ | - | Path to camera intrinsics file |
| `--camera_extrinsics` | `str` | ❌ | - | Path to camera extrinsics file |
| `--input_resolution` | `int` | ❌ | - | Resolution of input videos |
| `--output_resolution` | `int` | ❌ | - | Resolution of output videos |
| `--depth_for_overlay` | `bool` | ❌ | - | Use depth information for overlays |
| `--demo_num` | `str` | ❌ | - | Process a single demo number instead of all demos |
| `--debug_cameras` | `str` (multiple) | ❌ | - | Additional camera names to include for debugging |
| `--constrained_hand` | `bool` | ❌ | - | Use constrained hand processing |
| `--render` | `bool` | ❌ | - | Render the robot overlay on the video |
**Note**: For a single-arm setup, specify `--bimanual_setup single_arm` together with `--target_hand left` or `--target_hand right`. For bimanual setups, use `--bimanual_setup shoulders`.
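The rule in the note can be checked before launching a long processing run. This is a minimal sketch with a hypothetical helper, `check_arm_config`. It encodes only what the note and the table state, and it makes no assumption about which `target_hand` values the `shoulders` setup accepts.

```python
# Choices as listed in the config reference table.
BIMANUAL_SETUPS = ("single_arm", "shoulders")
TARGET_HANDS = ("left", "right", "both")


def check_arm_config(bimanual_setup, target_hand):
    """Return a list of problems with the given combination (empty if none).

    Hypothetical helper, not part of the repository.
    """
    problems = []
    if bimanual_setup not in BIMANUAL_SETUPS:
        problems.append(f"unknown bimanual_setup: {bimanual_setup!r}")
    if target_hand not in TARGET_HANDS:
        problems.append(f"unknown target_hand: {target_hand!r}")
    # Per the note: single_arm goes with target_hand left or right.
    if bimanual_setup == "single_arm" and target_hand == "both":
        problems.append("single_arm requires target_hand left or right")
    return problems


if __name__ == "__main__":
    for setup, hand in [("single_arm", "left"), ("single_arm", "both")]:
        print(setup, hand, "->", check_arm_config(setup, hand) or "ok")
```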
### Camera details
* **Phantom**: a Zed2 camera was used to capture the sample data at HD1080 resolution.
* **Masquerade**: Epic-Kitchens videos were used together with the camera intrinsics provided in the dataset. To use videos captured with a different camera resolution, update the camera intrinsics and extrinsics files in `$ROOT_DIR/phantom/camera/`.
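When the same camera is used at a different resolution, pinhole intrinsics are commonly rescaled by the resize ratio. The sketch below illustrates that general rule only. It assumes a pure resize (no cropping) and ignores half-pixel conventions. `scale_intrinsics` is a hypothetical helper, the numeric values are made up, and the format of the files in `$ROOT_DIR/phantom/camera/` is not assumed here.

```python
def scale_intrinsics(fx, fy, cx, cy, src_size, dst_size):
    """Rescale pinhole intrinsics for a resized image.

    src_size and dst_size are (width, height) in pixels.
    Focal lengths and principal point scale with the resize ratio per axis.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / src_w
    sy = dst_h / src_h
    return fx * sx, fy * sy, cx * sx, cy * sy


if __name__ == "__main__":
    # Illustrative values only: HD1080 (1920x1080) downscaled by a factor of two.
    print(scale_intrinsics(1000.0, 1000.0, 960.0, 540.0, (1920, 1080), (960, 540)))
```

A different physical camera needs its own calibration. Rescaling only covers the same camera at another resolution.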
### Train policy
After processing the video data, the edited data can be used to train a policy. The following files should be used:
* Observations
  * Phantom samples: extract RGB images from `data/processed/pick_and_place/*/video_overlay_Panda_single_arm.mkv`
  * Epic (in-the-wild data) samples: extract RGB images from `data/processed/epic/*/video_overlay_Kinova3_shoulders.mkv`
* Actions
  * Phantom samples: all data stored in `data/processed/pick_and_place/*/inpaint_processor/training_data_single_arm.npz`
  * Epic (in-the-wild data) samples: all data stored in `data/processed/epic/*/inpaint_processor/training_data_shoulders.npz`
In Phantom, [Diffusion Policy](https://github.com/real-stanford/diffusion_policy) was used for policy training.
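The observation and action files listed above can be paired per demo before building a training set. This is a minimal sketch with a hypothetical helper, `collect_training_pairs`. The file-name pattern `video_overlay_<robot>_<setup>.mkv` is an assumption generalized from the two example paths, and the contents of the `.npz` files are not assumed.

```python
from pathlib import Path


def collect_training_pairs(processed_root, demo_name, robot_name, bimanual_setup):
    """Return (video_path, actions_path) pairs, one per processed demo folder.

    Hypothetical helper: folders missing either file are skipped.
    """
    demo_dir = Path(processed_root) / demo_name
    video_name = f"video_overlay_{robot_name}_{bimanual_setup}.mkv"
    actions_name = f"training_data_{bimanual_setup}.npz"
    pairs = []
    for video in sorted(demo_dir.glob(f"*/{video_name}")):
        actions = video.parent / "inpaint_processor" / actions_name
        if actions.is_file():
            pairs.append((video, actions))
    return pairs


if __name__ == "__main__":
    for video, actions in collect_training_pairs(
            "data/processed", "pick_and_place", "Panda", "single_arm"):
        print(video, actions)
```

Each `.npz` file can then be opened with `numpy.load`, whose `.files` attribute lists the stored array names.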
## Citation
```bibtex
@article{lepert2025phantomtrainingrobotsrobots,
title={Phantom: Training Robots Without Robots Using Only Human Videos},
author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
year={2025},
eprint={2503.00779},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2503.00779},
}
```
```bibtex
@misc{lepert2025masqueradelearninginthewildhuman,
title={Masquerade: Learning from In-the-wild Human Videos using Data-Editing},
author={Marion Lepert and Jiaying Fang and Jeannette Bohg},
year={2025},
eprint={2508.09976},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2508.09976},
}
```