Owl IDM - VPT-v0

Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.

Model Description

  • Input: Sequence of RGB frames (128x128), normalized to [-1, 1]
  • Output:
    • Button predictions (20 outputs): W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
    • Mouse movement (dx, dy in pixels)
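The input contract above (128x128 RGB, values in [-1, 1]) can be satisfied with a short preprocessing step. This is a sketch, not part of the package: the `preprocess` helper and its frame layout are assumptions, shown only to illustrate the normalization.

```python
import torch
import torch.nn.functional as F

def preprocess(frames: torch.Tensor) -> torch.Tensor:
    """Convert uint8 RGB frames [T, H, W, 3] to model input [T, 3, 128, 128] in [-1, 1]."""
    x = frames.permute(0, 3, 1, 2).float() / 255.0   # [T, 3, H, W] in [0, 1]
    x = F.interpolate(x, size=(128, 128), mode="bilinear", align_corners=False)
    return x * 2.0 - 1.0                             # rescale [0, 1] -> [-1, 1]

# e.g. 32 frames of 720p gameplay
frames = torch.randint(0, 256, (32, 720, 1280, 3), dtype=torch.uint8)
x = preprocess(frames)
```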

Architecture

The architecture is based on the OpenAI VPT IDM, with some general improvements.

  • Backbone: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
  • Temporal model: Transformer (d_model=1024, 12 layers)
  • Window size: 32 frames
  • Model size: N/A
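The pipeline above can be sketched as a single module. Everything here is illustrative: the class name, the small conv stack standing in for the ResNet encoder, the attention-based spatial pooling, and the head layout are assumptions, not the released implementation (the defaults mirror the card's d_model=1024, 12 layers).

```python
import torch
import torch.nn as nn

class IDMSketch(nn.Module):
    """Illustrative VPT-style IDM: Conv3D mixer -> spatial encoder -> pooled transformer."""
    def __init__(self, d_model=1024, n_layers=12, n_buttons=20):
        super().__init__()
        # Conv3D temporal mixer: blends information across neighboring frames
        self.temporal_mixer = nn.Conv3d(3, 64, kernel_size=(5, 1, 1), padding=(2, 0, 0))
        # Small conv stack standing in for the ResNet spatial encoder
        self.spatial = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned spatial pooling: a trained query attends over spatial positions
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Temporal transformer over per-frame embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.button_head = nn.Linear(d_model, n_buttons)
        self.mouse_head = nn.Linear(d_model, 2)

    def forward(self, video):                           # [B, T, 3, H, W]
        B, T = video.shape[:2]
        x = self.temporal_mixer(video.transpose(1, 2))  # [B, 64, T, H, W]
        x = x.transpose(1, 2).flatten(0, 1)             # [B*T, 64, H, W]
        x = self.spatial(x)                             # [B*T, D, h, w]
        x = x.flatten(2).transpose(1, 2)                # [B*T, h*w, D]
        q = self.pool_query.expand(x.shape[0], -1, -1)
        x, _ = self.pool_attn(q, x, x)                  # [B*T, 1, D]
        x = self.transformer(x.view(B, T, -1))          # [B, T, D]
        return self.button_head(x), self.mouse_head(x)

# Tiny config for a quick shape check (the real model uses d_model=1024, 12 layers)
model = IDMSketch(d_model=64, n_layers=2)
buttons, mouse = model(torch.rand(1, 4, 3, 32, 32) * 2 - 1)
```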

Training

  • Dataset: FPS gameplay recordings
  • Preprocessing: frames scaled to [-1, 1]; mouse deltas scaled with log1p
  • Loss: BCE with class-balancing pos_weight for buttons, Huber for mouse
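The loss described above can be sketched as follows. The `symlog1p` form of the mouse scaling, the way the two terms are summed, and the press-rate-derived `pos_weight` values are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def symlog1p(x):
    """Sign-preserving log1p scaling for signed mouse deltas (assumed form)."""
    return torch.sign(x) * torch.log1p(x.abs())

def idm_loss(button_logits, mouse_pred, button_targets, mouse_targets, pos_weight):
    # Class-balanced BCE: pos_weight upweights rarely-pressed buttons
    btn = F.binary_cross_entropy_with_logits(
        button_logits, button_targets, pos_weight=pos_weight)
    # Huber (smooth L1) on log1p-scaled pixel deltas damps large mouse movements
    mouse = F.huber_loss(mouse_pred, symlog1p(mouse_targets))
    return btn + mouse

# pos_weight derived from per-button press frequencies (hypothetical rates)
press_rate = torch.full((20,), 0.05)
pos_weight = (1 - press_rate) / press_rate
```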

Usage

Installation

pip install git+https://github.com/overworld/owl-idm-3.git

Inference

from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-vpt-v0",
    device="cuda"
)

# video: [batch, frames, channels, height, width], values in [-1, 1]
video = torch.rand(1, 256, 3, 128, 128) * 2 - 1

button_preds, mouse_preds = pipeline(video)
# button_preds: [1, 256, 20] bool  - order: W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
# mouse_preds:  [1, 256, 2]  float - (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
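Since the mouse outputs are per-frame (dx, dy) deltas in pixels, a cumulative sum recovers the cursor's relative trajectory over the clip. A minimal sketch (the random tensor stands in for the pipeline's `mouse_preds` output):

```python
import torch

mouse_preds = torch.randn(1, 256, 2)           # stand-in for pipeline output
trajectory = torch.cumsum(mouse_preds, dim=1)  # [1, 256, 2] relative cursor path
```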

Model Files

  • config.yml: Full training configuration
  • model.pt: EMA model weights (state_dict, ready for load_state_dict)
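Because model.pt stores a plain state_dict, loading follows the standard PyTorch pattern. A sketch with a placeholder module (the actual model class ships with the `owl_idms` package and is not named on this card):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Placeholder standing in for the real model class
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pt")
    torch.save(model.state_dict(), path)           # mimic the released checkpoint
    state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)              # strict=True verifies every key matches
```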

License

MIT License
