---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-action
- inverse-dynamics-model
- embodied-ai
- game-ai
- internvl
datasets:
- open-world-agents/D2E-480p
- open-world-agents/D2E-Original
arxiv: 2510.05684
---

# Generalist-IDM-1B

**Generalist Inverse Dynamics Model** for predicting keyboard and mouse actions from gameplay video.

[Project Page](https://worv-ai.github.io/d2e/) · [Paper (arXiv)](https://arxiv.org/abs/2510.05684) · [GitHub](https://github.com/worv-ai/D2E) · [Demo](https://huggingface.co/spaces/lastdefiance20/Generalist-IDM)

## Model Description

Generalist-IDM-1B is a vision-action model trained on the [D2E dataset](https://huggingface.co/datasets/open-world-agents/D2E-480p)—267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, the model predicts the missing actions between observations (an inverse dynamics model).

- **Architecture**: Based on InternVL with 0.9B parameters
- **Input**: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
- **Output**: Predicted keyboard and mouse events for gaps in the trajectory
- **Training Data**: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)

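To make the "gaps in the trajectory" idea concrete: the model is asked to fill frame-to-frame intervals where no input events were observed. The sketch below is only an illustration of that task framing; the field names and event types are hypothetical, not the actual D2E/OWA schema (see the dataset and `inference.py` for the real serialization).

```python
# Hypothetical interleaved frame/event trajectory. Field names are
# illustrative only -- not the actual D2E/OWA schema.
def action_gaps(trajectory):
    """Return (start_ns, end_ns) spans between consecutive frames that
    contain no recorded input events -- the gaps an inverse dynamics
    model is asked to fill with predicted actions."""
    frames = [e for e in trajectory if e["type"] == "frame"]
    events = [e for e in trajectory if e["type"] != "frame"]
    gaps = []
    for a, b in zip(frames, frames[1:]):
        if not any(a["t_ns"] < e["t_ns"] < b["t_ns"] for e in events):
            gaps.append((a["t_ns"], b["t_ns"]))
    return gaps

trajectory = [
    {"type": "frame", "t_ns": 0},
    {"type": "key_down", "t_ns": 5_000_000, "key": "w"},
    {"type": "frame", "t_ns": 16_666_667},   # ~60 fps frame spacing
    {"type": "frame", "t_ns": 33_333_333},   # no event in this interval
]

print(action_gaps(trajectory))  # [(16666667, 33333333)]
```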
## Quick Start

The easiest way to run inference is to use the standalone script from the [D2E repository](https://github.com/worv-ai/D2E):

```bash
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```

### Prerequisites

- [uv](https://docs.astral.sh/uv/)
- FFmpeg
- CUDA-capable GPU (~8GB+ VRAM)

### Options

```bash
uv run inference.py input_video.mp4 output.mcap --device cuda       # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu        # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30   # Limit to 30 seconds
```

> ⏱️ **Inference Time**: On an H100, processing 1 second of video takes ~6 seconds, so a 1-minute video takes roughly 6 minutes.

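For planning longer runs, the ~6× realtime figure above translates directly into a back-of-the-envelope estimate (a rough guide only, assuming the stated H100 throughput; actual speed varies with hardware and settings):

```python
def estimate_inference_seconds(video_seconds: float, slowdown: float = 6.0) -> float:
    """Rough wall-clock estimate: the model processes video at about
    `slowdown` x realtime (~6x on an H100, per the note above)."""
    return video_seconds * slowdown

print(estimate_inference_seconds(60))  # 360.0 -> ~6 minutes for a 1-minute video
```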
## Output Format

The output is an [MCAP](https://mcap.dev/) file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the [Dataset Visualizer](https://huggingface.co/spaces/open-world-agents/visualize_dataset).

<img src="https://github.com/open-world-agents/owa-dataset-visualizer/blob/main/.github/assets/viewer.png?raw=true" alt="Dataset Visualizer Preview" width="600">

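Whatever schema the event payloads use, the nanosecond log times map straightforwardly back to positions in the video. A minimal stdlib-only sketch (the timestamps below are made up for illustration):

```python
def to_video_time(log_time_ns: int, video_start_ns: int = 0) -> float:
    """Convert a nanosecond MCAP log time to seconds from video start."""
    return (log_time_ns - video_start_ns) / 1_000_000_000

# Made-up event timestamps for illustration.
events_ns = [0, 16_666_667, 1_000_000_000]
print([round(to_video_time(t), 3) for t in events_ns])  # [0.0, 0.017, 1.0]
```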
## Programmatic Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```

For the full inference pipeline, including video preprocessing and MCAP output, see [`inference.py`](https://github.com/worv-ai/D2E/blob/main/inference.py).

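Once the model and processor are loaded, generation follows the usual `transformers` pattern. The exact prompt/response format this model expects is defined by its processor and chat template and is best taken from `inference.py`; the message structure below is only a hypothetical illustration of the common chat-template convention:

```python
# Hypothetical message construction following the common transformers
# chat-template convention; the model's actual prompt format is defined
# by its processor/chat template (see inference.py).
def build_messages(frames, instruction):
    """Interleave video frames (PIL images) with a text instruction."""
    content = [{"type": "image", "image": f} for f in frames]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

messages = build_messages(
    frames=["<frame0>", "<frame1>"],  # stand-ins for PIL images
    instruction="Predict the input events between these frames.",
)
print(len(messages[0]["content"]))  # 3 entries: two images + one text

# With the loaded processor/model, one would then typically call:
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt").to(model.device)
#   out = model.generate(**inputs, max_new_tokens=256)
```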
## Training Data

This model was trained on the D2E dataset:

| Dataset | Resolution | Description |
|---------|------------|-------------|
| [D2E-480p](https://huggingface.co/datasets/open-world-agents/D2E-480p) | 480p, 60 fps | 267 hours from 29 PC games |
| [D2E-Original](https://huggingface.co/datasets/open-world-agents/D2E-Original) | FHD/QHD | Original-resolution recordings |

## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```

## License

Apache 2.0