HusseinLezzaik committed on
Commit
41157bf
·
verified ·
1 Parent(s): 7a93b76

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-VL-3B-Instruct
+ datasets:
+ - TESS-Computer/quickdraw-circles
+ tags:
+ - trajectory-prediction
+ - diffusion-transformer
+ - vision-language
+ - robotics
+ - drawing
+ pipeline_tag: image-to-image
+ ---
+
+ # Qwen-DiT-Draw
+
+ A Vision-Language Model with a Diffusion Transformer (DiT) head for trajectory prediction. Given an image and an instruction, the model predicts drawing trajectories.
+
+ **Architecture:** Frozen Qwen2.5-VL-3B backbone + trainable DiT action head (36.7M params)
+
+ ## Model Details
+
+ - **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
+ - **Training Data:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles) (21k circle drawings)
+ - **Architecture:** GR00T-style chunked prediction with flow matching
+ - **Trainable Parameters:** 36.7M (DiT head only, VLM frozen)
+ - **Chunk Size:** 16 points per chunk
+ - **Output:** (x, y, state) per point, where state > 0.5 signals stop
+
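As a reading aid for the output format listed above, the following sketch trims a chunk at its first stop signal. The helper name `split_at_stop` is hypothetical and not part of the repository, and keeping the stop point itself in the result is an assumption; only the `state > 0.5` threshold comes from the model card.

```python
from typing import List, Sequence, Tuple

STOP_THRESHOLD = 0.5  # from the model card: state > 0.5 signals stop


def split_at_stop(
    chunk: Sequence[Sequence[float]], threshold: float = STOP_THRESHOLD
) -> Tuple[List[Tuple[float, float]], bool]:
    """Return the (x, y) points up to and including the first stop point.

    `chunk` is a sequence of (x, y, state) rows, e.g. one 16-point chunk.
    The boolean is True when a stop signal was found.
    Including the stop point in the output is an assumption of this sketch.
    """
    points: List[Tuple[float, float]] = []
    for x, y, state in chunk:
        points.append((float(x), float(y)))
        if state > threshold:
            return points, True
    return points, False


# Toy chunk: the second point carries the stop signal.
example_chunk = [(0.1, 0.2, 0.0), (0.5, 0.5, 0.9), (0.9, 0.9, 0.0)]
example_points, example_stopped = split_at_stop(example_chunk)
```

The same helper accepts a NumPy array of shape `(16, 3)`, since it only iterates over rows.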
+ ## Usage
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # You need the model code from: https://github.com/HusseinLezzaik/Qwen-DiT-Draw
+ from src.model import Qwen2_5_VL_Draw, TrajectoryConfig
+
+ # Load model
+ config = TrajectoryConfig(chunk_size=16, dit_hidden_size=512, dit_num_layers=6)
+ model = Qwen2_5_VL_Draw(
+     model_id="Qwen/Qwen2.5-VL-3B-Instruct",
+     config=config,
+     freeze_backbone=True,
+     dtype=torch.bfloat16,
+ )
+
+ # Load trained weights
+ from huggingface_hub import hf_hub_download
+ weights_path = hf_hub_download(repo_id="TESS-Computer/qwen-dit-draw", filename="best_checkpoint/trajectory_head.pt")
+ model.trajectory_head.load_state_dict(torch.load(weights_path, weights_only=True))
+ model = model.to("cuda").eval()
+
+ # Load processor
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
+
+ # Create input
+ image = Image.new("RGB", (512, 512), "white")  # White canvas
+ instruction = "draw a circle"
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
+         {"type": "text", "text": instruction},
+     ],
+ }]
+
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
+ inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
+ inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
+
+ # Predict trajectory chunk
+ with torch.no_grad():
+     chunk = model.predict_chunk(**inputs)
+
+ chunk = chunk[0].float().cpu().numpy()  # (16, 3) - (x, y, state)
+ print(f"Predicted {len(chunk)} points")
+ for i, (x, y, state) in enumerate(chunk):
+     print(f"  Point {i}: ({x:.3f}, {y:.3f}), stop={state > 0.5}")
+ ```
+
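+ ## Multi-Chunk Inference (Full Drawing)
+
+ For complete drawings, use a visual feedback loop: draw each predicted chunk onto the canvas and feed the updated canvas back to the model.
+
+ ```python
+ from PIL import ImageDraw
+
+ # Reuses model, processor, torch, Image and process_vision_info from the Usage example.
+ canvas = Image.new("RGB", (512, 512), "white")
+ all_points = []
+ max_chunks = 10
+ stopped = False
+
+ for chunk_idx in range(max_chunks):
+     # Prepare inputs with current canvas
+     messages = [{
+         "role": "user",
+         "content": [
+             {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
+             {"type": "text", "text": "draw a circle"},
+         ],
+     }]
+     # Process and predict, as in the single-chunk example
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
+     inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
+     inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
+     with torch.no_grad():
+         chunk = model.predict_chunk(**inputs)[0].float().cpu().numpy()
+
+     # Draw on canvas (use BLACK lines to match training!)
+     draw = ImageDraw.Draw(canvas)
+     for i in range(1, len(chunk)):
+         x1, y1 = int(chunk[i-1][0] * 512), int(chunk[i-1][1] * 512)
+         x2, y2 = int(chunk[i][0] * 512), int(chunk[i][1] * 512)
+         draw.line([(x1, y1), (x2, y2)], fill='black', width=2)
+
+         if chunk[i][2] > 0.5:  # Stop signal
+             stopped = True
+             break
+
+     # Keep the points up to and including the last one drawn
+     all_points.extend(chunk[: i + 1].tolist())
+     if stopped:  # Leave the outer loop too, not only the drawing loop
+         break
+ ```
+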
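The control flow of the feedback loop can be checked without a GPU by swapping the model for a stand-in. The sketch below is only an illustration: `run_drawing_loop`, `predict_chunk_fn`, and `toy_predictor` are hypothetical names, and the stand-in receives the points drawn so far instead of a rendered canvas.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, state)


def run_drawing_loop(
    predict_chunk_fn: Callable[[List[Point]], Sequence[Point]],
    max_chunks: int = 10,
    stop_threshold: float = 0.5,
) -> List[Point]:
    """Collect points chunk by chunk until a stop signal or max_chunks.

    predict_chunk_fn is a hypothetical stand-in for "render the canvas,
    run the model, return one chunk"; it receives the points drawn so far.
    """
    drawn: List[Point] = []
    for _ in range(max_chunks):
        chunk = predict_chunk_fn(drawn)
        for point in chunk:
            drawn.append(tuple(point))
            if point[2] > stop_threshold:
                return drawn  # the stop signal ends the whole drawing
    return drawn


def toy_predictor(drawn: List[Point]) -> List[Point]:
    """Emit 4 points per call and raise the stop state on the 10th point."""
    start = len(drawn)
    return [
        (0.1 * (start + k), 0.5, 1.0 if start + k == 9 else 0.0)
        for k in range(4)
    ]


trajectory = run_drawing_loop(toy_predictor)
print(len(trajectory))  # 10
```

Returning from inside the inner loop ends both loops at once, which avoids the separate flag needed when the stop check sits inside a nested `for`.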
+ ## Training
+
+ Trained on Modal (H100) for 2 epochs with a flow matching loss. See the [training code](https://github.com/HusseinLezzaik/Qwen-DiT-Draw).
+
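For readers unfamiliar with flow matching, here is a minimal sketch of one common variant: a straight-line path from noise to data, with a velocity regression target. The model card does not state the exact formulation used in training, so the time convention, the function names, and the Euler sampler below are all assumptions made for illustration, not the repository's implementation.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, target: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def flow_matching_loss(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    target: Vector,
    t: float,
) -> float:
    """Mean squared error between predicted and straight-line velocity."""
    x_t = interpolate(noise, target, t)
    predicted = velocity_fn(x_t, t)
    true_velocity = [x - n for n, x in zip(noise, target)]
    squared = [(p - v) ** 2 for p, v in zip(predicted, true_velocity)]
    return sum(squared) / len(squared)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with a single known target, where the ideal velocity has the
# closed form v(x, t) = (target - x) / (1 - t). A trained network would
# replace this oracle.
toy_target = [0.25, 0.75]


def oracle_velocity(x: Vector, t: float) -> Vector:
    return [(xt - xi) / (1.0 - t) for xi, xt in zip(x, toy_target)]


rng = random.Random(0)
toy_noise = [rng.gauss(0.0, 1.0) for _ in toy_target]
toy_sample = euler_sample(oracle_velocity, toy_noise)
toy_loss = flow_matching_loss(oracle_velocity, toy_noise, toy_target, t=0.3)
```

With the oracle velocity the loss is zero up to rounding and the sampler lands on the target, which makes the toy case a quick sanity check for the two functions.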
+ ## Citation
+
+ ```bibtex
+ @misc{qwen-dit-draw,
+   author = {TESS Computer},
+   title = {Qwen-DiT-Draw: VLM + DiT for Trajectory Prediction},
+   year = {2025},
+   url = {https://huggingface.co/TESS-Computer/qwen-dit-draw}
+ }
+ ```
+
+ ## Links
+
+ - **Code:** [GitHub - Qwen-DiT-Draw](https://github.com/HusseinLezzaik/Qwen-DiT-Draw)
+ - **Dataset:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles)
+ - **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)