# Generalist-IDM-1B
Generalist Inverse Dynamics Model for predicting keyboard and mouse actions from gameplay video.
Project Page · Paper (arXiv) · GitHub · Demo
## Model Description

Generalist-IDM-1B is a vision-action model trained on the D2E dataset: 267 hours of synchronized gameplay video and input events from 29 PC games. Given a trajectory of screen frames and actions, it predicts the missing actions between observations, acting as an inverse dynamics model (IDM).
- Architecture: Based on InternVL with 0.9B parameters
- Input: Trajectory containing screen frames (448×448) and keyboard/mouse events with timestamps
- Output: Predicted keyboard and mouse events for gaps in the trajectory
- Training Data: 29 PC games across diverse genres (FPS, open-world, sandbox, roguelike, etc.)
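Conceptually, an IDM query pairs observed frames with the input events around a gap, and the model fills in that gap. A minimal sketch of such a trajectory record follows; the field and event names are hypothetical and only illustrative, the actual schema is defined in the D2E repository:

```python
# Hypothetical trajectory record for an inverse-dynamics query.
# Field and event names are illustrative; the real schema lives in the D2E repo.
trajectory = {
    "frames": ["frame_000.png", "frame_001.png"],      # 448x448 screen captures
    "events": [
        {"t_ns": 0,          "type": "key_down",   "key": "w"},
        {"t_ns": 16_666_667, "type": "mouse_move", "dx": 4, "dy": -2},
    ],
    "gap_ns": (16_666_667, 33_333_333),  # interval whose actions the model predicts
}
print(len(trajectory["events"]))  # 2
```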
## Quick Start
The easiest way to run inference is using the standalone script from the D2E repository:
```shell
# Clone the repository
git clone https://github.com/worv-ai/D2E.git
cd D2E

# Run inference (dependencies are auto-installed by uv)
uv run inference.py input_video.mp4 output.mcap
```
### Prerequisites

- uv
- FFmpeg
- CUDA-capable GPU (~8 GB+ VRAM)
### Options

```shell
uv run inference.py input_video.mp4 output.mcap --device cuda      # GPU inference (default)
uv run inference.py input_video.mp4 output.mcap --device cpu       # CPU inference
uv run inference.py input_video.mp4 output.mcap --max-duration 30  # Limit to the first 30 seconds
```
⏱️ Inference time: on an H100, processing 1 second of video takes about 6 seconds, so expect roughly 6 minutes for a 1-minute video.
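That ~6× realtime factor can be used to budget longer jobs. A small helper, where the default factor of 6.0 is the H100 figure quoted above and will differ on other GPUs:

```python
def estimate_inference_seconds(video_seconds: float, factor: float = 6.0) -> float:
    """Estimate wall-clock inference time from video duration.

    factor is seconds of compute per second of video: ~6 on an H100
    (the figure quoted above); other GPUs will differ.
    """
    return video_seconds * factor

print(estimate_inference_seconds(60))  # 360.0 -> ~6 minutes for a 1-minute clip
```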
## Output Format
The output is an MCAP file containing predicted keyboard and mouse events with nanosecond timestamps synchronized to the input video. You can visualize the output using the Dataset Visualizer.
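Because event timestamps are nanoseconds relative to the video start, aligning a predicted event with its source frame is a simple division, assuming a constant frame rate (D2E-480p recordings are 60 fps; pass the fps of your own input video):

```python
def event_frame_index(t_ns: int, fps: float = 60.0) -> int:
    """Map a nanosecond event timestamp to the index of the video frame it falls on.

    Assumes a constant frame rate; D2E-480p recordings are 60 fps,
    but pass the fps of your own input video.
    """
    return int(t_ns * fps / 1_000_000_000)

print(event_frame_index(500_000_000))  # 30 -> half a second into a 60 fps video
```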
## Programmatic Usage
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "open-world-agents/Generalist-IDM-1B",
    trust_remote_code=True,
)
```
For the full inference pipeline with video preprocessing and MCAP output, see inference.py.
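The model takes 448×448 frames, so raw gameplay video must be resized before it is passed to the processor. A minimal preprocessing sketch, assuming a plain resize (inference.py may letterbox or crop instead):

```python
from PIL import Image

def to_model_frame(img: Image.Image, size: int = 448) -> Image.Image:
    """Resize a decoded gameplay frame to the model's 448x448 input size.

    Assumption: a plain bilinear resize; inference.py may use a
    different crop or letterbox strategy.
    """
    return img.resize((size, size), Image.BILINEAR)

frame = Image.new("RGB", (1920, 1080))  # stand-in for a decoded video frame
print(to_model_frame(frame).size)  # (448, 448)
```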
## Training Data
This model was trained on the D2E dataset:
| Dataset | Resolution | Description |
|---|---|---|
| D2E-480p | 480p 60fps | 267 hours from 29 PC games |
| D2E-Original | FHD/QHD | Original resolution recordings |
## Citation

```bibtex
@article{choi2025d2e,
  title={D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI},
  author={Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Park, Yubeen and Yu, Youngjae and Lee, Yunsung},
  journal={arXiv preprint arXiv:2510.05684},
  year={2025}
}
```
## License
Apache 2.0