+---
+license: apache-2.0
+base_model: Qwen/Qwen2.5-VL-3B-Instruct
+datasets:
+  - TESS-Computer/quickdraw-circles
+tags:
+  - trajectory-prediction
+  - diffusion-transformer
+  - vision-language
+  - robotics
+  - drawing
+pipeline_tag: image-to-image
+---
+# Qwen-DiT-Draw
+A Vision-Language Model (VLM) with a Diffusion Transformer (DiT) head for trajectory prediction. Given an image and an instruction, the model predicts drawing trajectories in chunks of 16 points.
+**Architecture:** Frozen Qwen2.5-VL-3B backbone + trainable DiT action head (36.7M params)
+## Model Details
+- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
+- **Training Data:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles) (21k circle drawings)
+- **Architecture:** GR00T-style chunked prediction with flow matching
+- **Trainable Parameters:** 36.7M (DiT head only, VLM frozen)
+- **Chunk Size:** 16 points per chunk
+- **Output:** (x, y, state) per point, where state > 0.5 indicates a stop signal; the examples below scale x and y by the 512-pixel canvas size
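As a rough illustration of the output format, the sketch below turns one chunk of (x, y, state) rows into pixel points and cuts the chunk at the first stop signal. The helper name `decode_chunk` and its parameters are hypothetical, not part of the model code, and it assumes x and y are normalized to [0, 1] (inferred from the examples on this card, which multiply by 512).

```python
from typing import List, Sequence, Tuple


def decode_chunk(
    chunk: Sequence[Sequence[float]],
    canvas_size: int = 512,
    stop_threshold: float = 0.5,
) -> Tuple[List[Tuple[int, int]], bool]:
    """Convert (x, y, state) rows to pixel points, truncating at the first stop.

    Assumption: x and y are normalized to [0, 1]; this is inferred from the
    card's examples, not confirmed by the model code.
    """
    points: List[Tuple[int, int]] = []
    for x, y, state in chunk:
        points.append((int(x * canvas_size), int(y * canvas_size)))
        if state > stop_threshold:
            # The stop point itself is kept, then the chunk ends.
            return points, True
    return points, False


# Toy chunk: the second point carries the stop signal, so the third is dropped.
pixels, stopped = decode_chunk([(0.25, 0.5, 0.0), (0.5, 0.25, 0.9), (0.75, 0.5, 0.0)])
print(pixels, stopped)  # [(128, 256), (256, 128)] True
```

Returning the stop flag alongside the points lets a caller decide whether to request another chunk.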
+## Usage
+```python
+import torch
+from PIL import Image
+from transformers import AutoProcessor
+from qwen_vl_utils import process_vision_info
+# You need the model code from: https://github.com/HusseinLezzaik/Qwen-DiT-Draw
+from src.model import Qwen2_5_VL_Draw, TrajectoryConfig
+# Load model
+config = TrajectoryConfig(chunk_size=16, dit_hidden_size=512, dit_num_layers=6)
+model = Qwen2_5_VL_Draw(
+    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
+    config=config,
+    freeze_backbone=True,
+    dtype=torch.bfloat16,
+)
+# Load trained weights
+from huggingface_hub import hf_hub_download
+weights_path = hf_hub_download(repo_id="TESS-Computer/qwen-dit-draw", filename="best_checkpoint/trajectory_head.pt")
+model.trajectory_head.load_state_dict(torch.load(weights_path, map_location="cpu", weights_only=True))
+model = model.to("cuda").eval()
+# Load processor
+processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
+# Create input
+image = Image.new("RGB", (512, 512), "white")  # White canvas
+instruction = "draw a circle"
+messages = [{
+    "role": "user",
+    "content": [
+        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
+        {"type": "text", "text": instruction},
+    ],
+}]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
+inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
+inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
+# Predict trajectory chunk
+with torch.no_grad():
+    chunk = model.predict_chunk(**inputs)
+chunk = chunk[0].float().cpu().numpy()  # (16, 3) - (x, y, state)
+print(f"Predicted {len(chunk)} points")
+for i, (x, y, state) in enumerate(chunk):
+    print(f"  Point {i}: ({x:.3f}, {y:.3f}), stop={state > 0.5}")
+```
+## Multi-Chunk Inference (Full Drawing)
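+For complete drawings, use a visual feedback loop: render each predicted chunk onto the canvas and feed the updated canvas back to the model for the next chunk: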
+```python
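+from PIL import ImageDraw
+# Reuses torch, Image, model, processor, and process_vision_info from the Usage example
+canvas = Image.new("RGB", (512, 512), "white")
+all_points = []
+max_chunks = 10
+done = False
+for chunk_idx in range(max_chunks):
+    # Prepare inputs with current canvas
+    messages = [{
+        "role": "user",
+        "content": [
+            {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
+            {"type": "text", "text": "draw a circle"},
+        ],
+    }]
+    # Process and predict (same steps as in the Usage example)
+    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
+    inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
+    inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
+    with torch.no_grad():
+        chunk = model.predict_chunk(**inputs)[0].float().cpu().numpy()  # (16, 3)
+    # Draw on canvas (use BLACK lines to match training!)
+    draw = ImageDraw.Draw(canvas)
+    end = len(chunk)  # number of points kept from this chunk
+    for i in range(1, len(chunk)):
+        x1, y1 = int(chunk[i-1][0] * 512), int(chunk[i-1][1] * 512)
+        x2, y2 = int(chunk[i][0] * 512), int(chunk[i][1] * 512)
+        draw.line([(x1, y1), (x2, y2)], fill='black', width=2)
+        if chunk[i][2] > 0.5:  # Stop signal
+            end = i + 1
+            done = True
+            break
+    all_points.extend(chunk[:end].tolist())  # keep points up to and including the stop
+    if done:  # End the whole drawing, not just the current chunk
+        break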
+```
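The loop's termination logic can be checked without a GPU. The sketch below is a dry run only: `run_drawing_loop` and `stub_circle_predictor` are hypothetical names, and the stub stands in for the model by emitting successive points of a 40-point circle rather than looking at a canvas image.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, state)


def run_drawing_loop(
    predict_chunk: Callable[[List[Point]], List[Point]],
    max_chunks: int = 10,
    stop_threshold: float = 0.5,
) -> List[Point]:
    """Request chunks until a point signals stop or max_chunks is reached."""
    drawn: List[Point] = []
    for _ in range(max_chunks):
        # The real model would be conditioned on the rendered canvas instead.
        chunk = predict_chunk(drawn)
        for point in chunk:
            drawn.append(point)
            if point[2] > stop_threshold:
                return drawn  # ends the whole drawing, not just this chunk
    return drawn


def stub_circle_predictor(
    drawn: List[Point], chunk_size: int = 16, total: int = 40
) -> List[Point]:
    """Hypothetical stand-in for the model: next points of a 40-point circle."""
    start = len(drawn)
    chunk: List[Point] = []
    for i in range(start, start + chunk_size):
        angle = 2 * math.pi * min(i, total - 1) / (total - 1)
        state = 1.0 if i >= total - 1 else 0.0  # stop on the final point
        chunk.append((0.5 + 0.3 * math.cos(angle), 0.5 + 0.3 * math.sin(angle), state))
    return chunk


points = run_drawing_loop(stub_circle_predictor)
print(len(points))  # 40: two full chunks plus 8 points of the third
```

With `max_chunks=2` the same stub yields 32 points and no stop, which is the case where the drawing is cut off by the chunk budget.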
+## Training
+Trained on a Modal H100 GPU for 2 epochs with a flow matching loss. See the [training code](https://github.com/HusseinLezzaik/Qwen-DiT-Draw).
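For readers new to flow matching, here is a minimal pure-Python sketch of the loss and of Euler sampling on a single flattened chunk. It assumes the common linear path between noise and data; the parameterization used in the training code may differ. `velocity_fn` is a hypothetical stand-in for the DiT head, and the real head is also conditioned on VLM features, which are omitted here.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def flow_matching_loss(
    velocity_fn: VelocityFn, target: Vector, noise: Vector, t: float
) -> float:
    """Mean squared error between predicted and target velocity at time t.

    Assumption: linear path x_t = (1 - t) * noise + t * target, whose
    velocity is the constant (target - noise).
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, target)]
    v_target = [a - n for n, a in zip(noise, target)]
    v_pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(v_pred, v_target)) / len(target)


def euler_sample(velocity_fn: VelocityFn, noise: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (sample)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle that already knows the true velocity.
target = [0.2, 0.8, 0.0]   # one (x, y, state) point, flattened
noise = [1.0, -1.0, 0.5]


def oracle(x: Vector, t: float) -> Vector:
    return [a - n for n, a in zip(noise, target)]


print(flow_matching_loss(oracle, target, noise, t=0.3))
print(euler_sample(oracle, noise))
```

With the oracle the loss is zero and sampling lands on the target up to floating-point error; a trained head only approximates this velocity field.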
+## Citation
+```bibtex
+@misc{qwen-dit-draw,
+  author = {TESS Computer},
+  title = {Qwen-DiT-Draw: VLM + DiT for Trajectory Prediction},
+  year = {2025},
+  url = {https://huggingface.co/TESS-Computer/qwen-dit-draw}
+}
+```
+## Links
+- **Code:** [GitHub - Qwen-DiT-Draw](https://github.com/HusseinLezzaik/Qwen-DiT-Draw)
+- **Dataset:** [TESS-Computer/quickdraw-circles](https://huggingface.co/datasets/TESS-Computer/quickdraw-circles)
+- **Base Model:** [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)