# Owl IDM - VPT-v0
Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.
## Model Description
- Input: Sequence of RGB frames (128x128), normalized to [-1, 1]
- Output:
  - Button predictions (20 outputs): W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
  - Mouse movement (dx, dy in pixels)
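Assuming raw capture frames arrive as uint8 RGB in `[T, H, W, C]` layout, the input contract above can be sketched as follows (`preprocess_frames` is a hypothetical helper for illustration, not part of the `owl_idms` package):

```python
import torch
import torch.nn.functional as F

def preprocess_frames(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: map uint8 RGB frames [T, H, W, C] to the
    model's expected [T, C, 128, 128] layout in [-1, 1]."""
    x = frames_uint8.permute(0, 3, 1, 2).float() / 255.0  # [T, C, H, W] in [0, 1]
    x = F.interpolate(x, size=(128, 128), mode="bilinear", align_corners=False)
    return x * 2.0 - 1.0  # shift [0, 1] -> [-1, 1]

frames = torch.randint(0, 256, (32, 720, 1280, 3), dtype=torch.uint8)
video = preprocess_frames(frames)  # [32, 3, 128, 128], values in [-1, 1]
```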
## Architecture
The architecture is based on the OpenAI VPT IDM, with some general improvements.
- Backbone: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
- Temporal model: Transformer (d_model=1024, 12 layers)
- Window size: 32 frames
- Model size: N/A parameters
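The components listed above can be sketched as a single module; layer sizes below are illustrative stand-ins (a plain conv stack in place of the ResNet), not the released architecture:

```python
import torch
import torch.nn as nn

class IDMSketch(nn.Module):
    """Illustrative sketch: Conv3D temporal mixer -> spatial encoder ->
    learned spatial pooling -> transformer over frames -> two heads."""
    def __init__(self, d_model=1024, n_layers=12):
        super().__init__()
        # Conv3D temporal mixer: mixes information across neighboring frames
        self.temporal_mix = nn.Conv3d(3, 64, kernel_size=(5, 1, 1), padding=(2, 0, 0))
        # Stand-in spatial encoder (the real model uses a ResNet here)
        self.spatial = nn.Sequential(
            nn.Conv2d(64, 256, 4, stride=4), nn.ReLU(),
            nn.Conv2d(256, d_model, 4, stride=4), nn.ReLU(),
        )
        # Learned spatial pooling: softmax attention over spatial positions
        self.pool_logits = nn.Linear(d_model, 1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.button_head = nn.Linear(d_model, 20)  # one logit per button
        self.mouse_head = nn.Linear(d_model, 2)    # (dx, dy)

    def forward(self, video):  # video: [B, T, C, H, W] in [-1, 1]
        B, T, C, H, W = video.shape
        x = self.temporal_mix(video.transpose(1, 2)).transpose(1, 2)  # [B, T, 64, H, W]
        x = self.spatial(x.reshape(B * T, -1, H, W))                  # [B*T, d, h, w]
        x = x.flatten(2).transpose(1, 2)                              # [B*T, h*w, d]
        w = self.pool_logits(x).softmax(dim=1)                        # pooling weights
        x = (w * x).sum(dim=1).reshape(B, T, -1)                      # one vector per frame
        x = self.temporal(x)
        return self.button_head(x), self.mouse_head(x)
```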
## Training
- Dataset: FPS gameplay recordings
- Preprocessing: frames scaled to [-1, 1]; mouse deltas compressed with log1p scaling
- Loss: BCE with class-balancing pos_weight for buttons, Huber for mouse
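The training objective can be sketched in PyTorch as follows; the 5% press rate used to derive `pos_weight` is an assumed placeholder, not a value from the actual dataset:

```python
import torch
import torch.nn as nn

# Hypothetical per-button positive rate; pos_weight upweights rare presses
pos_rate = torch.full((20,), 0.05)
pos_weight = (1 - pos_rate) / pos_rate  # ~19x for a 5% press rate

button_loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
mouse_loss_fn = nn.HuberLoss()

# Dummy predictions and targets: [batch, frames, ...]
button_logits = torch.randn(1, 256, 20)
button_targets = torch.randint(0, 2, (1, 256, 20)).float()
mouse_preds = torch.randn(1, 256, 2)
mouse_targets = torch.randn(1, 256, 2)

loss = button_loss_fn(button_logits, button_targets) + mouse_loss_fn(mouse_preds, mouse_targets)
```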
## Usage

### Installation

```bash
pip install git+https://github.com/overworld/owl-idm-3.git
```
### Inference

```python
from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-vpt-v0",
    device="cuda"
)

# video: [batch, frames, channels, height, width] in range [-1, 1]
video = torch.randn(1, 256, 3, 128, 128)
button_preds, mouse_preds = pipeline(video)

# button_preds: [1, 256, 20] bool; order: `W`, `A`, `S`, `D`, `LShift`, `F`, `LMB`, `RMB`, `Space`, `R`, `E`, `V`, `C`, `Ctrl`, `1`, `2`, `3`, `I`, `Tab`, `Esc`
# mouse_preds: [1, 256, 2] float; (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
```
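The log1p mouse scaling mentioned under Training is presumably a signed transform like the one below; the names are illustrative, and the exact variant should be confirmed against the released `config.yml`:

```python
import torch

def signed_log1p(x: torch.Tensor) -> torch.Tensor:
    # Compresses large mouse deltas while preserving direction
    return torch.sign(x) * torch.log1p(x.abs())

def signed_expm1(x: torch.Tensor) -> torch.Tensor:
    # Exact inverse of signed_log1p, for mapping back to pixel deltas
    return torch.sign(x) * torch.expm1(x.abs())

dx = torch.tensor([-300.0, -5.0, 0.0, 5.0, 300.0])
roundtrip = signed_expm1(signed_log1p(dx))  # recovers dx up to float precision
```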
## Model Files

- `config.yml`: Full training configuration
- `model.pt`: EMA model weights (state_dict, ready for `load_state_dict`)
## License
MIT License