Owl IDM - VPT-v0

Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.

Model Description

  • Input: Sequence of RGB frames (128x128), normalized to [-1, 1]
  • Output:
    • Button predictions (20 outputs): W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
    • Mouse movement (dx, dy in pixels)
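The input contract above (128x128 RGB, values in [-1, 1]) can be satisfied with a short preprocessing step. This is a sketch, not part of the package: the `preprocess` helper and its frame layout are assumptions, shown only to illustrate the normalization.

```python
import torch
import torch.nn.functional as F

def preprocess(frames: torch.Tensor) -> torch.Tensor:
    """Convert uint8 RGB frames [T, H, W, 3] to model input [T, 3, 128, 128] in [-1, 1]."""
    x = frames.permute(0, 3, 1, 2).float() / 255.0   # [T, 3, H, W] in [0, 1]
    x = F.interpolate(x, size=(128, 128), mode="bilinear", align_corners=False)
    return x * 2.0 - 1.0                             # rescale [0, 1] -> [-1, 1]

# e.g. 32 frames of 720p gameplay
frames = torch.randint(0, 256, (32, 720, 1280, 3), dtype=torch.uint8)
x = preprocess(frames)
```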

Architecture

The architecture is based on the OpenAI VPT IDM, with some general improvements.

  • Backbone: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
  • Temporal model: Transformer (d_model=1024, 12 layers)
  • Window size: 32 frames
  • Model size: N/A
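The pipeline above can be sketched as a single module. Everything here is illustrative: the class name, the small conv stack standing in for the ResNet encoder, the attention-based spatial pooling, and the head layout are assumptions, not the released implementation (the defaults mirror the card's d_model=1024, 12 layers).

```python
import torch
import torch.nn as nn

class IDMSketch(nn.Module):
    """Illustrative VPT-style IDM: Conv3D mixer -> spatial encoder -> pooled transformer."""
    def __init__(self, d_model=1024, n_layers=12, n_buttons=20):
        super().__init__()
        # Conv3D temporal mixer: blends information across neighboring frames
        self.temporal_mixer = nn.Conv3d(3, 64, kernel_size=(5, 1, 1), padding=(2, 0, 0))
        # Small conv stack standing in for the ResNet spatial encoder
        self.spatial = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned spatial pooling: a trained query attends over spatial positions
        self.pool_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Temporal transformer over per-frame embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.button_head = nn.Linear(d_model, n_buttons)
        self.mouse_head = nn.Linear(d_model, 2)

    def forward(self, video):                           # [B, T, 3, H, W]
        B, T = video.shape[:2]
        x = self.temporal_mixer(video.transpose(1, 2))  # [B, 64, T, H, W]
        x = x.transpose(1, 2).flatten(0, 1)             # [B*T, 64, H, W]
        x = self.spatial(x)                             # [B*T, D, h, w]
        x = x.flatten(2).transpose(1, 2)                # [B*T, h*w, D]
        q = self.pool_query.expand(x.shape[0], -1, -1)
        x, _ = self.pool_attn(q, x, x)                  # [B*T, 1, D]
        x = self.transformer(x.view(B, T, -1))          # [B, T, D]
        return self.button_head(x), self.mouse_head(x)

# Tiny config for a quick shape check (the real model uses d_model=1024, 12 layers)
model = IDMSketch(d_model=64, n_layers=2)
buttons, mouse = model(torch.rand(1, 4, 3, 32, 32) * 2 - 1)
```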

Training

  • Dataset: FPS gameplay recordings
  • Preprocessing: frames scaled to [-1, 1]; mouse deltas scaled with log1p
  • Loss: BCE with class-balancing pos_weight for buttons, Huber for mouse
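The loss described above can be sketched as follows. The `symlog1p` form of the mouse scaling, the way the two terms are summed, and the press-rate-derived `pos_weight` values are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def symlog1p(x):
    """Sign-preserving log1p scaling for signed mouse deltas (assumed form)."""
    return torch.sign(x) * torch.log1p(x.abs())

def idm_loss(button_logits, mouse_pred, button_targets, mouse_targets, pos_weight):
    # Class-balanced BCE: pos_weight upweights rarely-pressed buttons
    btn = F.binary_cross_entropy_with_logits(
        button_logits, button_targets, pos_weight=pos_weight)
    # Huber (smooth L1) on log1p-scaled pixel deltas damps large mouse movements
    mouse = F.huber_loss(mouse_pred, symlog1p(mouse_targets))
    return btn + mouse

# pos_weight derived from per-button press frequencies (hypothetical rates)
press_rate = torch.full((20,), 0.05)
pos_weight = (1 - press_rate) / press_rate
```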

Usage

Installation

pip install git+https://github.com/overworld/owl-idm-3.git

Inference

from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-vpt-v0",
    device="cuda"
)

# video: [batch, frames, channels, height, width], values in [-1, 1]
video = torch.rand(1, 256, 3, 128, 128) * 2 - 1

button_preds, mouse_preds = pipeline(video)
# button_preds: [1, 256, 20] bool  - order: W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
# mouse_preds:  [1, 256, 2]  float - (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
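Since the mouse outputs are per-frame (dx, dy) deltas in pixels, a cumulative sum recovers the cursor's relative trajectory over the clip. A minimal sketch (the random tensor stands in for the pipeline's `mouse_preds` output):

```python
import torch

mouse_preds = torch.randn(1, 256, 2)           # stand-in for pipeline output
trajectory = torch.cumsum(mouse_preds, dim=1)  # [1, 256, 2] relative cursor path
```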

Model Files

  • config.yml: Full training configuration
  • model.pt: EMA model weights (state_dict, ready for load_state_dict)
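Because model.pt stores a plain state_dict, loading follows the standard PyTorch pattern. A sketch with a placeholder module (the actual model class ships with the `owl_idms` package and is not named on this card):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Placeholder standing in for the real model class
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pt")
    torch.save(model.state_dict(), path)           # mimic the released checkpoint
    state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)              # strict=True verifies every key matches
```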

License

MIT License
