# Generalist-IDM-1B
Generalist Inverse Dynamics Model for predicting keyboard and mouse actions from gameplay video.
Project Page · Paper (arXiv) · GitHub · Demo
## Model Description

Generalist-IDM-1B is a vision-action model trained on the D2E dataset: 267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, it predicts the missing actions between observations, acting as an inverse dynamics model (IDM).
- Architecture: Based on InternVL with 0.9B parameters
- Input: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
- Output: Predicted keyboard and mouse events for gaps in the trajectory
- Training Data: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)
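Conceptually, an IDM query pairs observed frames with the input events around a gap, and the model fills in that gap. A minimal sketch of such a trajectory record follows; the field and event names are hypothetical and only illustrative, the actual schema is defined in the D2E repository:

```python
# Hypothetical trajectory record for an inverse-dynamics query.
# Field and event names are illustrative; the real schema lives in the D2E repo.
trajectory = {
    "frames": ["frame_000.png", "frame_001.png"],      # 448x448 screen captures
    "events": [
        {"t_ns": 0,          "type": "key_down",   "key": "w"},
        {"t_ns": 16_666_667, "type": "mouse_move", "dx": 4, "dy": -2},
    ],
    "gap_ns": (16_666_667, 33_333_333),  # interval whose actions the model predicts
}
print(len(trajectory["events"]))  # 2
```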
## Quick Start
The easiest way to run inference is using the standalone script from the D2E repository:
```shell
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies are auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```
### Prerequisites

- uv
- FFmpeg
- CUDA-capable GPU (~8 GB+ VRAM)
### Options

```shell
uv run inference.py input_video.mp4 output.mcap --device cuda      # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu       # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30  # Limit to the first 30 seconds
```
⏱️ Inference time: on an H100, processing 1 second of video takes about 6 seconds, so expect roughly 6 minutes for a 1-minute video.
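That ~6× realtime factor can be used to budget longer jobs. A small helper, where the default factor of 6.0 is the H100 figure quoted above and will differ on other GPUs:

```python
def estimate_inference_seconds(video_seconds: float, factor: float = 6.0) -> float:
    """Estimate wall-clock inference time from video duration.

    factor is seconds of compute per second of video: ~6 on an H100
    (the figure quoted above); other GPUs will differ.
    """
    return video_seconds * factor

print(estimate_inference_seconds(60))  # 360.0 -> ~6 minutes for a 1-minute clip
```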
## Output Format
The output is an MCAP file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the Dataset Visualizer.
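Because event timestamps are nanoseconds relative to the video start, aligning a predicted event with its source frame is a simple division, assuming a constant frame rate (D2E-480p recordings are 60 fps; pass the fps of your own input video):

```python
def event_frame_index(t_ns: int, fps: float = 60.0) -> int:
    """Map a nanosecond event timestamp to the index of the video frame it falls on.

    Assumes a constant frame rate; D2E-480p recordings are 60 fps,
    but pass the fps of your own input video.
    """
    return int(t_ns * fps / 1_000_000_000)

print(event_frame_index(500_000_000))  # 30 -> half a second into a 60 fps video
```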
## Programmatic Usage
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```
For the full inference pipeline with video preprocessing and MCAP output, see inference.py.
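The model takes 448×448 frames, so raw gameplay video must be resized before it is passed to the processor. A minimal preprocessing sketch, assuming a plain resize (inference.py may letterbox or crop instead):

```python
from PIL import Image

def to_model_frame(img: Image.Image, size: int = 448) -> Image.Image:
    """Resize a decoded gameplay frame to the model's 448x448 input size.

    Assumption: a plain bilinear resize; inference.py may use a
    different crop or letterbox strategy.
    """
    return img.resize((size, size), Image.BILINEAR)

frame = Image.new("RGB", (1920, 1080))  # stand-in for a decoded video frame
print(to_model_frame(frame).size)  # (448, 448)
```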
## Training Data
This model was trained on the D2E dataset:
| Dataset | Resolution | Description |
|---|---|---|
| D2E-480p | 480p 60fps | 267 hours from 29 PC games |
| D2E-Original | FHD/QHD | Original resolution recordings |
## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```
## License
Apache 2.0