HusseinLezzaik committed on
Commit
7ffbf30
·
verified ·
1 Parent(s): 3637325

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +92 -0
  2. config.json +11 -0
  3. trajectory_head.pt +3 -0
README.md ADDED
@@ -0,0 +1,92 @@
---
license: mit
tags:
- vision-language-action
- robotics
- trajectory-prediction
- diffusion-transformer
- qwen2-vl
datasets:
- TESS-Computer/quickdraw-circles-delta
---

# Qwen-DiT-Draw (Delta Version)

Vision-Language-Action model for continuous mouse trajectory prediction using **delta (relative) movements**.

## Model Description

This is the DiT action head trained on top of a frozen Qwen2.5-VL-3B backbone. It predicts mouse trajectories as sequences of (dx, dy, state) deltas.

- **Base VLM**: Qwen/Qwen2.5-VL-3B-Instruct (frozen)
- **Action Head**: 6-layer DiT with hidden size 512 (~37M trainable params)
- **Training**: Flow-matching loss on 16-point chunks
- **Coordinate Mode**: Delta (relative movements)

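The flow-matching objective mentioned above can be sketched in a few lines. This is a minimal illustration of a standard rectified-flow formulation, not the repo's actual training code; `head` is a hypothetical stand-in for the DiT action head, assumed to take the noised chunk, a timestep, and VLM conditioning, and to predict a velocity:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(head, cond, actions):
    """Rectified-flow loss sketch for a (B, 16, 3) chunk of (dx, dy, state).

    Interpolates linearly between Gaussian noise and the target chunk,
    then regresses the head's output onto the constant path velocity.
    """
    noise = torch.randn_like(actions)                         # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], device=actions.device)   # one t per sample
    t_ = t.view(-1, 1, 1)                                     # broadcast over chunk
    x_t = (1 - t_) * noise + t_ * actions                     # point on the path
    target_v = actions - noise                                # constant velocity
    pred_v = head(x_t, t, cond)
    return F.mse_loss(pred_v, target_v)
```

In training, `cond` would be the frozen Qwen2.5-VL features for the screenshot and instruction; only the head's parameters receive gradients.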
## Training Details

| Setting | Value |
|---------|-------|
| Dataset | TESS-Computer/quickdraw-circles-delta |
| Samples | 21,207 chunks |
| Epochs | 3 |
| Final Loss | 0.111 |
| Batch Size | 4 |
| Learning Rate | 1e-4 |

## Key Learning: VLA Models Need More Epochs!

This model was trained for only 3 epochs as a proof of concept. Research from OpenVLA, GR00T, and pi0 shows that **VLA models need 20-30+ epochs** to learn good action patterns:

> *"Typical LLM or VLM training runs complete at most one or two epochs... In contrast, we found it important for VLA training to iterate through the training dataset significantly more times."* — OpenVLA paper

The model produces partial arcs instead of complete circles because it hasn't seen enough training iterations.

## Usage

```python
from src.model import Qwen2_5_VL_Draw, TrajectoryConfig
import torch

# Load config
config = TrajectoryConfig(
    chunk_size=16,
    dit_hidden_size=512,
    dit_num_layers=6,
)

# Create model and load weights
model = Qwen2_5_VL_Draw(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    freeze_backbone=True,
)
model.trajectory_head.load_state_dict(
    torch.load("trajectory_head.pt", map_location="cpu")
)
```

## Delta vs Absolute

This model uses **delta coordinates** (GR00T N1.6 style):
- Chunk 0: the first point is the absolute start; the rest are deltas
- Chunk 1+: all points are deltas from the previous chunk's last point

To reconstruct absolute positions:
```python
import numpy as np

# Chunk 0: the first row is the absolute start point, so a plain
# cumulative sum already yields absolute positions.
abs_positions = np.cumsum(deltas, axis=0)

# Chunk 1+: offset the cumulative sum by the previous chunk's last
# absolute point.
abs_positions = np.cumsum(deltas, axis=0) + prev_chunk_last_point
```
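Chaining the two cases gives a full-trajectory reconstruction. A small self-contained sketch (the function name is illustrative, not part of the repo's API):

```python
import numpy as np

def reconstruct_trajectory(chunks):
    """Rebuild absolute (x, y) positions from a list of delta chunks.

    Chunk 0's first row is an absolute start point, so its cumulative
    sum is already absolute; every later chunk is offset by the last
    absolute point of the chunk before it.
    """
    positions = []
    prev_last = np.zeros(chunks[0].shape[1])
    for i, deltas in enumerate(chunks):
        abs_pos = np.cumsum(deltas, axis=0)
        if i > 0:
            abs_pos = abs_pos + prev_last
        positions.append(abs_pos)
        prev_last = abs_pos[-1]
    return np.concatenate(positions, axis=0)
```

For the real model output, each row would carry (dx, dy, state); the state channel passes through unchanged and only the first two columns need the cumulative sum.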

## Links

- [GitHub: qwen-dit-draw](https://github.com/TESS-Computer/qwen-dit-draw)
- [Dataset: quickdraw-circles-delta](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles-delta)
- [Absolute coordinates version](https://huggingface.co/TESS-Computer/qwen-dit-draw)

## License

MIT
config.json ADDED
@@ -0,0 +1,11 @@
{
  "model_id": "Qwen/Qwen2.5-VL-3B-Instruct",
  "chunk_size": 16,
  "dit_hidden_size": 512,
  "dit_num_layers": 6,
  "use_deltas": true,
  "num_epochs": 3,
  "final_loss": 0.11093996120641385,
  "dataset": "TESS-Computer/quickdraw-circles-delta",
  "notes": "Delta (relative movement) model - trained for 3 epochs. VLA literature suggests 20-30+ epochs needed for good results."
}
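To rebuild the head with the same architecture it was trained with, the architecture fields can be pulled out of this file and passed to `TrajectoryConfig` (a sketch; the key filtering is an assumption, matching the fields shown in the README's usage example):

```python
import json

# config.json inlined here so the example is self-contained; in practice
# read it from the downloaded repo with open("config.json").
raw = json.loads("""{
  "model_id": "Qwen/Qwen2.5-VL-3B-Instruct",
  "chunk_size": 16,
  "dit_hidden_size": 512,
  "dit_num_layers": 6,
  "use_deltas": true
}""")

# Keep only the fields TrajectoryConfig takes; the rest (epochs, loss,
# notes) are training metadata.
arch_keys = ("chunk_size", "dit_hidden_size", "dit_num_layers")
config_kwargs = {k: raw[k] for k in arch_keys}
# config = TrajectoryConfig(**config_kwargs)
```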
trajectory_head.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a95d2f417640e4aea6a6796f751251f4d530b00ca1209530a4939ae0ce09699b
size 73580119