---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-action
- inverse-dynamics-model
- embodied-ai
- game-ai
- internvl
datasets:
- open-world-agents/D2E-480p
- open-world-agents/D2E-Original
arxiv: 2510.05684
---

# Generalist-IDM-1B

**Generalist Inverse Dynamics Model** for predicting keyboard and mouse actions from gameplay video.

[Project Page](https://worv-ai.github.io/d2e/) · [Paper (arXiv)](https://arxiv.org/abs/2510.05684) · [GitHub](https://github.com/worv-ai/D2E) · [Demo](https://huggingface.co/spaces/lastdefiance20/Generalist-IDM)

## Model Description

Generalist-IDM-1B is a vision-action model trained on the [D2E dataset](https://huggingface.co/datasets/open-world-agents/D2E-480p): 267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, the model predicts the missing actions between observations (an inverse dynamics model, IDM).

- **Architecture**: Based on InternVL, with 0.9B parameters
- **Input**: Trajectory of screen frames (448×448) and keyboard/mouse events with timestamps
- **Output**: Predicted keyboard and mouse events for the gaps in the trajectory
- **Training Data**: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)
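
At its core this is the standard inverse dynamics objective. In the generic single-transition form (a textbook sketch, not necessarily the paper's exact conditioning), the model recovers the action that explains the change between consecutive observations:

$$
\hat{a}_t = \arg\max_{a} \; p_\theta(a \mid o_t, o_{t+1})
$$

Here `o_t` is a screen frame and `a_t` the keyboard/mouse events emitted between `o_t` and `o_{t+1}`; as described above, Generalist-IDM-1B extends this to a full timestamped trajectory context rather than a single transition.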

## Quick Start

The easiest way to run inference is to use the standalone script from the [D2E repository](https://github.com/worv-ai/D2E):

```bash
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies are auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```

### Prerequisites

- [uv](https://docs.astral.sh/uv/)
- FFmpeg
- CUDA-capable GPU (8GB+ VRAM)

### Options

```bash
uv run inference.py input_video.mp4 output.mcap --device cuda     # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu      # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30 # Limit to the first 30 seconds
```

> ⏱️ **Inference Time**: On an H100, processing 1 second of video takes ~6 seconds; for a 1-minute video, expect ~6 minutes of inference time.

## Output Format

The output is an [MCAP](https://mcap.dev/) file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the [Dataset Visualizer](https://huggingface.co/spaces/open-world-agents/visualize_dataset).

<img src="https://github.com/open-world-agents/owa-dataset-visualizer/blob/main/.github/assets/viewer.png?raw=true" alt="Dataset Visualizer Preview" width="600">
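
The events can also be inspected programmatically with the [`mcap`](https://pypi.org/project/mcap/) Python package (`pip install mcap`). A minimal sketch; the topic names and payload encoding in a given file are not assumed here, so it just lists the channels and dumps raw bytes:

```python
from mcap.reader import make_reader

with open("output.mcap", "rb") as f:
    reader = make_reader(f)

    # List the channels (topics) present in the file.
    summary = reader.get_summary()
    if summary is not None:
        for channel in summary.channels.values():
            print("channel:", channel.topic)

    # Dump the first few events; log_time is in nanoseconds,
    # synchronized to the input video.
    for i, (schema, channel, message) in enumerate(reader.iter_messages()):
        print(channel.topic, message.log_time, message.data[:60])
        if i >= 4:
            break
```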

## Programmatic Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```
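
Once loaded, generation follows the standard `transformers` API. The snippet below is only an illustrative sketch: the frame and the prompt are placeholders, since the actual trajectory/prompt template is defined by the model's custom processor; see [`inference.py`](https://github.com/worv-ai/D2E/blob/main/inference.py) for the real format.

```python
from PIL import Image

# Placeholder inputs: a single 448x448 frame and a dummy prompt.
# The model's real trajectory format is defined by its custom processor.
frame = Image.open("frame.png").convert("RGB").resize((448, 448))
inputs = processor(images=frame, text="<placeholder prompt>", return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```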

For the full inference pipeline, including video preprocessing and MCAP output, see [`inference.py`](https://github.com/worv-ai/D2E/blob/main/inference.py).

## Training Data

This model was trained on the D2E dataset:

| Dataset | Resolution | Description |
|---------|------------|-------------|
| [D2E-480p](https://huggingface.co/datasets/open-world-agents/D2E-480p) | 480p, 60 fps | 267 hours from 29 PC games |
| [D2E-Original](https://huggingface.co/datasets/open-world-agents/D2E-Original) | FHD/QHD | Original-resolution recordings |

## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```

## License

Apache 2.0