Upload README.md with huggingface_hub
README.md
ADDED
|
@@ -0,0 +1,63 @@
| 1 |
+
---
|
| 2 |
+
license: other
|
| 3 |
+
tags:
|
| 4 |
+
- computer-use
|
| 5 |
+
- video
|
| 6 |
+
- action-decoder
|
| 7 |
+
- onevision
|
| 8 |
+
- qwen
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
# OneVision Task-Aware Action Decoder — checkpoints
|
| 12 |
+
|
| 13 |
+
Lightweight task-aware action decoder for computer-use / instructional-video understanding.
|
| 14 |
+
Built on a frozen **OneVision vision encoder** (`lmms-lab-encoder/onevision-encoder-large-lang`)
|
| 15 |
+
and **Qwen/Qwen3-4B** as the text encoder/decoder, **LoRA**-tuned (r=8, alpha=16, dropout=0.05,
|
| 16 |
+
applied to both the vision and text encoders). Trained with `train_onevision_lw_decoder.py` in
|
| 17 |
+
[proteusagi/onevision-compuse](https://github.com/proteusagi/onevision-compuse) under
|
| 18 |
+
`onevision-lightweight-decoder/`.
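
To make the quoted LoRA hyperparameters concrete, here is a minimal sketch of how `r` and `alpha` enter a weight update under the standard LoRA formulation, `W_eff = W + (alpha / r) * (B @ A)`. It is not taken from `train_onevision_lw_decoder.py`: the helper names and toy matrices are hypothetical, and the repository may rely on a library implementation instead.

```python
# Illustrative only: standard LoRA update W_eff = W + (alpha / r) * (B @ A).
# All helper names here are hypothetical, not from the training script.

def lora_scaling(r: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update."""
    return alpha / r


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, r, alpha):
    """Return W + (alpha / r) * (B @ A).

    Shapes: W is (out, in), B is (out, r), A is (r, in).
    """
    scale = lora_scaling(r, alpha)
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hyperparameters quoted in this README: r=8, alpha=16.
SCALE = lora_scaling(r=8, alpha=16)
```

With r=8 and alpha=16 the low-rank update is scaled by a factor of 2 before being added to the base weight.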
|
| 19 |
+
|
| 20 |
+
Given video visible-token embeddings plus a task prompt, the model predicts, per frame:
|
| 21 |
+
- discrete **action / key / modifier** classes,
|
| 22 |
+
- a **click heatmap** (actor query attending over visual tokens),
|
| 23 |
+
- (new_arch only) an **autoregressive per-frame transcript**.
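
The click heatmap described above can be pictured as attention weights from a single query over the visual tokens. The sketch below assumes scaled dot-product scores followed by a softmax; that normalization, the function names, and the toy vectors are assumptions for illustration, and the actual head in `model.py` may differ.

```python
import math

# Illustrative only: one "actor" query scores every visual token, and a
# softmax turns the scores into a heatmap. Names are hypothetical.

def click_heatmap(actor_query, visual_tokens):
    """Softmax over scaled dot products between the query and each token."""
    dim = len(actor_query)
    scores = [
        sum(q * k for q, k in zip(actor_query, token)) / math.sqrt(dim)
        for token in visual_tokens
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def most_likely_token(heatmap):
    """Index of the visual token with the highest click probability."""
    return max(range(len(heatmap)), key=heatmap.__getitem__)


# Toy example: three 2-D visual tokens; the second aligns best with the query.
heat = click_heatmap([1.0, 0.0], [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]])
best = most_likely_token(heat)
```

Mapping the winning token index back to pixel coordinates depends on the encoder's patch layout, which this README does not specify.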
|
| 24 |
+
|
| 25 |
+
## Checkpoints (~8.7 GB each — LoRA adapters + task heads)
|
| 26 |
+
|
| 27 |
+
### `new_arch/best.pt` — "parallel heads" architecture (2026-03-16) — RECOMMENDED
|
| 28 |
+
- Adds an **autoregressive transcript decoder** (per-frame transcript via the Qwen LM head:
|
| 29 |
+
visual features + transcript history of frames 0..t-1 predict frame t) on top of the
|
| 30 |
+
action/key/modifier classification heads and the click-heatmap head.
|
| 31 |
+
- Modularized `encode`/`decode`, **parallel prediction heads**, and an **efficient-inference**
|
| 32 |
+
path (git commits: "parallel heads", "efficient inference", "further speedup").
|
| 33 |
+
- Training performance: **train action accuracy ~64-66%**, train loss ~0.9 (see `accuracy.png`
|
| 34 |
+
and `loss.png` in the repo). The plateau around epochs 85-117 stems from a curriculum/data change; metrics
|
| 35 |
+
recover afterward.
|
| 36 |
+
- Choose this by default: it is the most capable checkpoint (it also generates transcripts) and uses the actively developed architecture.
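
The autoregressive transcript scheme in the first bullet (frame t conditioned on the transcripts of frames 0..t-1) can be sketched as a simple loop. Here `predict_frame` is a hypothetical stand-in for the model's per-frame decoding step, not a function from the repository, and the sketch ignores batching and the efficient-inference path.

```python
from typing import Callable, List, Sequence

# Illustrative only: the control flow of per-frame autoregressive decoding.
# `predict_frame` is a hypothetical callable standing in for the real model.

def decode_transcripts(
    frame_features: Sequence[object],
    predict_frame: Callable[[object, List[str]], str],
) -> List[str]:
    """Decode one transcript per frame, feeding earlier transcripts back in."""
    history: List[str] = []
    for features in frame_features:
        # Frame t sees its own visual features plus transcripts 0..t-1.
        # A copy is passed so the predictor cannot mutate the history.
        text = predict_frame(features, list(history))
        history.append(text)
    return history


# Toy predictor: echoes the frame label and how much history it received.
def toy_predictor(features: object, history: List[str]) -> str:
    return f"{features}:{len(history)}"


transcripts = decode_transcripts(["f0", "f1", "f2"], toy_predictor)
```

The toy predictor only demonstrates that each step receives exactly the transcripts of the preceding frames.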
|
| 37 |
+
|
| 38 |
+
### `onevision_task_action_decoder/best.pt` — original architecture (2026-03-10)
|
| 39 |
+
- The original `OneVisionTaskAwareActionDecoder`: action/key/modifier classification heads plus a
|
| 40 |
+
click-heatmap head over OneVision visual tokens, with frame self-attention and prompt
|
| 41 |
+
cross-attention. **Predates the transcript decoder** (which was added 2026-03-16), so it
|
| 42 |
+
predicts only action/key/modifier classes and click heatmaps.
|
| 43 |
+
- Choose this if you specifically want the lighter, action-only baseline or to reproduce the
|
| 44 |
+
March-10 results.
|
| 45 |
+
|
| 46 |
+
> Also in this repo: `onevision_task_action_decoder/epoch_*.pt` (earlier epoch snapshots of the
|
| 47 |
+
> original run) and `onevision_task_action_decoder_8F/best.pt` (an 8-frame-clip variant).
|
| 48 |
+
|
| 49 |
+
## Which to choose
|
| 50 |
+
- **`new_arch/best.pt`** — full capability (action + transcript), latest architecture. Default pick.
|
| 51 |
+
- **`onevision_task_action_decoder/best.pt`** — original action-only model / March-10 baseline.
|
| 52 |
+
|
| 53 |
+
## Loading
|
| 54 |
+
```python
|
| 55 |
+
import torch
|
| 56 |
+
ckpt = torch.load("new_arch/best.pt", map_location="cpu")
|
| 57 |
+
# State for OneVisionTaskAwareActionDecoder:
|
| 58 |
+
# vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
|
| 59 |
+
# text_encoder_name = "Qwen/Qwen3-4B"
|
| 60 |
+
# LoRA r=8, alpha=16 on both encoders
|
| 61 |
+
```
|
| 62 |
+
See `onevision-lightweight-decoder/src/model.py` in proteusagi/onevision-compuse for the module
|
| 63 |
+
definition and `train_onevision_lw_decoder.py` for the training/eval pipeline.
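
This README does not describe the internal layout of the `.pt` files, so it can help to inspect a loaded checkpoint before wiring it into a model. The helper below is a hypothetical sketch that summarizes the top level of whatever object `torch.load` returns; it assumes nothing about key names and works on any mapping.

```python
from collections.abc import Mapping

# Illustrative only: `summarize_checkpoint` is a hypothetical helper, not
# part of proteusagi/onevision-compuse.

def summarize_checkpoint(ckpt, max_items=10):
    """Return (key, description) pairs for the top level of a checkpoint."""
    if not isinstance(ckpt, Mapping):
        return [("<root>", type(ckpt).__name__)]
    summary = []
    for key in list(ckpt)[:max_items]:
        value = ckpt[key]
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Tensor-like objects expose a `shape` attribute.
            desc = f"{type(value).__name__}{tuple(shape)}"
        elif isinstance(value, Mapping):
            desc = f"{type(value).__name__} with {len(value)} entries"
        else:
            desc = type(value).__name__
        summary.append((str(key), desc))
    return summary


# Toy stand-in for a loaded checkpoint; the real key names are unknown here.
example = summarize_checkpoint({"epoch": 3, "model": {"a": 1, "b": 2}})
```

Calling `summarize_checkpoint(ckpt)` on the object returned by the `torch.load` call above shows whether the file holds a bare state dict or a wrapper with extra entries.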
|