Generalist-IDM-1B

Generalist Inverse Dynamics Model for predicting keyboard and mouse actions from gameplay video.

Project Page · Paper (arXiv) · GitHub · Demo

Model Description

Generalist-IDM-1B is a vision-action model trained on the D2E dataset: 267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, the model predicts the missing actions between observations (an Inverse Dynamics Model).

  • Architecture: Based on InternVL with 0.9B parameters
  • Input: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
  • Output: Predicted keyboard and mouse events for gaps in the trajectory
  • Training Data: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)
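The input/output contract above can be pictured with a minimal sketch. Note that the dataclass names and field layout here are illustrative assumptions, not the model's actual input schema:

```python
from dataclasses import dataclass

# Hypothetical, simplified trajectory element types -- illustrative only.
@dataclass
class Frame:
    t_ns: int  # nanosecond timestamp of the screen capture

@dataclass
class Event:
    t_ns: int  # nanosecond timestamp of the input event
    kind: str  # e.g. "key_down", "mouse_move"

def action_gaps(items):
    """Return (start, end) timestamp pairs of consecutive frames with no
    recorded events in between -- the spans an IDM is asked to fill."""
    frames = sorted(i.t_ns for i in items if isinstance(i, Frame))
    events = sorted(i.t_ns for i in items if isinstance(i, Event))
    gaps = []
    for a, b in zip(frames, frames[1:]):
        if not any(a < t < b for t in events):
            gaps.append((a, b))
    return gaps

traj = [Frame(0), Event(5, "key_down"), Frame(10), Frame(20)]
print(action_gaps(traj))  # the (10, 20) span has no observed actions
```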

Quick Start

The easiest way to run inference is using the standalone script from the D2E repository:

# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap

Prerequisites

  • uv
  • FFmpeg
  • CUDA-capable GPU (at least ~8 GB VRAM)

Options

uv run inference.py input_video.mp4 output.mcap --device cuda        # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu         # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30    # Limit to 30 seconds

⏱️ Inference Time: On an H100, processing 1 second of video takes ~6 seconds, so a 1-minute video takes roughly 6 minutes.
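
The cost scales linearly with clip length at that rate; a quick back-of-the-envelope helper (the constant is taken from the H100 figure above and will differ on other hardware):

```python
H100_SECONDS_PER_VIDEO_SECOND = 6  # ~6 s of compute per 1 s of video (H100)

def estimated_inference_seconds(video_seconds: float) -> float:
    """Rough H100 wall-clock estimate for a clip of the given length."""
    return video_seconds * H100_SECONDS_PER_VIDEO_SECOND

print(estimated_inference_seconds(60))  # 1-minute clip -> 360 s (~6 min)
```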

Output Format

The output is an MCAP file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the Dataset Visualizer.
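
To give a feel for how nanosecond timestamps line up with video frames, here is a small pure-Python sketch; the function name and signature are hypothetical, and the actual inference script defines its own mapping:

```python
NS_PER_SECOND = 1_000_000_000

def frame_to_timestamp_ns(frame_index: int, fps: float, start_ns: int = 0) -> int:
    """Nanosecond timestamp of a frame, relative to the video start.
    Illustrative only -- not the mapping used by inference.py."""
    return start_ns + round(frame_index * NS_PER_SECOND / fps)

# Frame 90 of a 60 fps video lands 1.5 s into the clip:
print(frame_to_timestamp_ns(90, 60.0))  # 1500000000
```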

Dataset Visualizer Preview

Programmatic Usage

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)

For the full inference pipeline with video preprocessing and MCAP output, see inference.py.

Training Data

This model was trained on the D2E dataset:

| Dataset      | Resolution | Description                    |
|--------------|------------|--------------------------------|
| D2E-480p     | 480p 60fps | 267 hours from 29 PC games     |
| D2E-Original | FHD/QHD    | Original-resolution recordings |

Citation

@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}

License

Apache 2.0
