yushg commited on
Commit
de57ee4
·
verified ·
1 Parent(s): fe59378

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +63 -0
README.md ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: other
3
+ tags:
4
+ - computer-use
5
+ - video
6
+ - action-decoder
7
+ - onevision
8
+ - qwen
9
+ ---
10
+
11
+ # OneVision Task-Aware Action Decoder — checkpoints
12
+
13
+ Lightweight task-aware action decoder for computer-use / instructional-video understanding.
14
+ Built on a frozen **OneVision vision encoder** (`lmms-lab-encoder/onevision-encoder-large-lang`)
15
+ and **Qwen/Qwen3-4B** as the text encoder/decoder, **LoRA**-tuned (r=8, alpha=16, dropout=0.05,
16
+ applied to both vision and text). Trained with `train_onevision_lw_decoder.py` in
17
+ [proteusagi/onevision-compuse](https://github.com/proteusagi/onevision-compuse) under
18
+ `onevision-lightweight-decoder/`.
19
+
20
+ Given video visible-token embeddings plus a task prompt, the model predicts, per frame:
21
+ - discrete **action / key / modifier** classes,
22
+ - a **click heatmap** (actor query attending over visual tokens),
23
+ - (new_arch only) an **autoregressive per-frame transcript**.
24
+
25
+ ## Checkpoints (~8.7 GB each — LoRA adapters + task heads)
26
+
27
+ ### `new_arch/best.pt` — "parallel heads" architecture (2026-03-16) — RECOMMENDED
28
+ - Adds an **autoregressive transcript decoder** (per-frame transcript via the Qwen LM head:
29
+ visual features + transcript history of frames 0..t-1 predict frame t) on top of the
30
+ action/key/modifier classification heads and the click-heatmap head.
31
+ - Modularized `encode`/`decode`, **parallel prediction heads**, and an **efficient-inference**
32
+ path (git commits: "parallel heads", "efficient inference", "further speedup").
33
+ - Training performance: **train action accuracy ~64-66%**, train loss ~0.9 (see `accuracy.png`
34
+ and `loss.png` in the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics
35
+ recover afterward.
36
+ - Choose this by default: most capable (also generates transcripts) and the actively developed arch.
37
+
38
+ ### `onevision_task_action_decoder/best.pt` — original architecture (2026-03-10)
39
+ - The original `OneVisionTaskAwareActionDecoder`: action/key/modifier classification heads plus a
40
+ click-heatmap head over OneVision visual tokens, with frame self-attention and prompt
41
+ cross-attention. **Predates the transcript decoder** (which was added 2026-03-16), so it does
42
+ action/key/modifier + click only.
43
+ - Choose this if you specifically want the lighter, action-only baseline or to reproduce the
44
+ March-10 results.
45
+
46
+ > Also in this repo: `onevision_task_action_decoder/epoch_*.pt` (earlier epoch snapshots of the
47
+ > original run) and `onevision_task_action_decoder_8F/best.pt` (an 8-frame-clip variant).
48
+
49
+ ## Which to choose
50
+ - **`new_arch/best.pt`** — full capability (action + transcript), latest architecture. Default pick.
51
+ - **`onevision_task_action_decoder/best.pt`** — original action-only model / March-10 baseline.
52
+
53
+ ## Loading
54
+ ```python
55
+ import torch
56
+ ckpt = torch.load("new_arch/best.pt", map_location="cpu")
57
+ # State for OneVisionTaskAwareActionDecoder:
58
+ # vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
59
+ # text_encoder_name = "Qwen/Qwen3-4B"
60
+ # LoRA r=8, alpha=16 on both encoders
61
+ ```
62
+ See `onevision-lightweight-decoder/src/model.py` in proteusagi/onevision-compuse for the module
63
+ definition and `train_onevision_lw_decoder.py` for the training/eval pipeline.