--- license: apache-2.0 base_model: Qwen/Qwen3-VL-8B-Instruct base_model_relation: finetune pipeline_tag: image-text-to-text tags: - inverse-dynamics-model - screen-understanding - screencasts - computer-use - qwen3-vl ---

--- # Inverse Dynamics Model for Action-Annotating Screencasts We present an inverse dynamics model that predicts user input actions from short windows of screen recordings. Given 10 consecutive screenshots, it emits the key presses, mouse clicks, cursor movements and scroll events that are visually implied by the frames. Please refer to the [blog post](https://pdoom.org/crowd_cast_idm.html) for details and experiments. ## Summary - **Base model:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) - **Training data:** [`crowd-cast`](https://pdoom.org/crowd_cast.html) recordings - **Training method:** LoRA on language and vision modules, merged after training - **Input:** 10 screenshots sampled at 5 FPS - **Output:** sparse JSON action list - **Eval:** [`p-doom/idm-eval-set`](https://huggingface.co/datasets/p-doom/idm-eval-set) - **Code:** [`p-doom/inverse-dynamics-model`](https://github.com/p-doom/inverse-dynamics-model) ## IDM-eval results All models are evaluated on [our eval set](https://huggingface.co/datasets/p-doom/idm-eval-set), visibility-filtered to `visible` + `inferable` actions. `MM R²` and `MM cos_mean` include missed MouseMove frames as zero predictions; `MM cov.` is MouseMove recall. | Model | Overall F1 | KeyPress F1 | MouseClick F1 | MouseMove F1 | MouseScroll F1 | MM R² | MM cos_mean | MM cov. | | ---------------- | ---------- | ----------- | ------------- | ------------ | -------------- | --------- | ----------- | ------- | | **Ours (8B)** | **0.787** | 0.791 | 0.598 | **0.857** | **0.447** | 0.708 | 0.643 | **92%** | | Gemini 3.5 Flash | 0.740 | **0.826** | **0.726** | 0.760 | 0.337 | **0.714** | 0.560 | 64% | | GPT 5.5 | 0.709 | 0.821 | 0.714 | 0.669 | 0.392 | 0.586 | 0.455 | 52% | | Kimi K2.6 | 0.540 | 0.711 | 0.444 | 0.381 | 0.326 | 0.420 | 0.177 | 25% | | Gemma 4 31B | 0.430 | 0.381 | 0.581 | 0.500 | 0.237 | 0.077 | 0.228 | 37% | | Qwen3-VL 8B | 0.360 | 0.409 | 0.449 | 0.334 | 0.127 | -6.038 | 0.035 | 28% | Interpretation: the main gap is dense temporal coverage. Off-the-shelf VLMs under-emit MouseMove actions, and the all-GT MouseMove metrics penalize these misses. ## Input Format Provide one chat message with 10 images sampled at 5 FPS. Each image should be preceded by a text label: ```text Frame F00: Frame F01: ... Frame F09: ``` The frame labels are text anchors in the message, not labels rendered into the image pixels. ## Output Format The model emits only a JSON array: ```json [ {"frame": "F02", "type": "MouseMove", "details": "120,45"}, {"frame": "F03", "type": "MouseClick", "details": "Left"}, {"frame": "F05", "type": "KeyPress", "details": "Cmd+S"}, {"frame": "F07", "type": "MouseScroll", "details": "-150"} ] ``` Action types: - `KeyPress`: key name with modifiers, e.g. `Cmd+S`, `Return`, `A` - `MouseClick`: `Left`, `Right`, or `Middle` - `MouseMove`: normalized `dx,dy`, where `1000` is a full screen-width or screen-height traversal - `MouseScroll`: normalized signed scroll magnitude Frame attribution: if an effect first appears between `F_K` and `F_{K+1}`, report the action on `F_K`, the last pre-action frame. ## Related Releases - [`p-doom/AGI-CAST-0.6k`](https://huggingface.co/datasets/p-doom/AGI-CAST-0.6k): source AGI-CAST screencasts - [`p-doom/AGI-CAST-idm-actions`](https://huggingface.co/datasets/p-doom/AGI-CAST-idm-actions): AGI-CAST action annotations generated with this model - [`p-doom/idm-eval-set`](https://huggingface.co/datasets/p-doom/idm-eval-set): manually verified IDM evaluation clips ## Limitations - The model was trained on macOS clips and can confuse OS-specific shortcuts such as `Cmd` vs `Ctrl`. - Labels are inferred from pixels, so actions with no visual evidence can be missed or hallucinated. - Fine-grained timing, cursor movement magnitude and scroll magnitude can be noisy.