---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
datasets:
- google/MapTrace
pipeline_tag: image-text-to-text
tags:
- qwen3.5
- vision-language
- path-planning
- maptrace
- fine-tuned
- partial-fine-tuning
---

![MapTrace Task Illustration](https://artemisp.github.io/maptrace/static/images/s1.png)

# Qwen3.5-0.8B — MapTrace Path Planning (Partial Fine-Tune)

This model is a **Partial Fine-Tune** of [`Qwen/Qwen3.5-0.8B`](https://huggingface.co/Qwen/Qwen3.5-0.8B) specifically trained for **visual path planning** on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.

---

## 🗺️ Task Description

**MapTrace** is a visual navigation benchmark that requires a model to:
1. Understand a map image with a marked **start location (green dot)** and **end location (red dot)**.
2. Trace a realistic, traversable path between the two points.
3. Output the path as a list of **normalized (x, y) coordinates**: `[(x1, y1), (x2, y2), ...]`

### Example Input
```
You are provided an image of a path with a start location denoted in green 
and an end location denoted in red. 
The normalized xy-coordinates of the start location are (0.77, 0.47) and 
of the end location (0.54, 0.54). 
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] 
of the path between the start and end location. 
Ensure that the path follows the traversable locations of the map.
```

### Example Output
```python
[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
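The model's reply is plain text, so it needs to be parsed back into coordinate tuples before use. The sketch below is one way to do this with the standard library, assuming the reply contains a Python-style list of `(x, y)` tuples as in the example output above (`parse_waypoints` is a hypothetical helper, not part of the model's API):

```python
import ast

def parse_waypoints(reply: str) -> list[tuple[float, float]]:
    """Extract the first [...] span from the model's reply and parse it
    into a list of (x, y) float tuples."""
    start = reply.find("[")
    end = reply.rfind("]") + 1
    if start == -1 or end == 0:
        raise ValueError(f"No coordinate list found in: {reply!r}")
    waypoints = ast.literal_eval(reply[start:end])
    return [(float(x), float(y)) for x, y in waypoints]

print(parse_waypoints("[(0.7695, 0.4727), (0.7344, 0.5156), (0.543, 0.543)]"))
# → [(0.7695, 0.4727), (0.7344, 0.5156), (0.543, 0.543)]
```

Using `ast.literal_eval` instead of `eval` keeps parsing safe even if the model emits unexpected text.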

---

## 🏋️ Training Strategy

The model was trained using a **Partial Fine-Tuning** approach to balance quality and computational efficiency:

| Setting | Value |
|---|---|
| **Base Model** | `Qwen/Qwen3.5-0.8B` |
| **Strategy** | Partial Fine-Tuning (Last 6 of 24 layers) |
| **Dataset** | `google/MapTrace` (`maptrace_20k` subset) |
| **Vision Encoder** | ✅ Trained (not frozen) |
| **Effective Batch Size** | 128 (`1 per GPU × 32 grad_accum × 4 GPUs`) |
| **Learning Rate** | `1e-5` with Warmup (100 steps) |
| **Max Sequence Length** | `2048` tokens |
| **Mixed Precision** | `bf16` |
| **Gradient Checkpointing** | ✅ Enabled |
| **Image Resolution Cap** | `1024 × 768` pixels (VRAM safety) |
| **Hardware** | 4× NVIDIA RTX 4090 24GB |

### Why Partial Fine-Tuning?
Instead of full fine-tuning or LoRA, we opted to **unfreeze the last 6 transformer layers and the vision encoder** for gradient updates. This approach:
- Preserves the model's general language understanding from the foundation weights.
- Directly adapts the final "reasoning" layers to the spatial coordinate prediction task.
- Avoids LoRA's low-rank approximation error, since the weights are updated directly.
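The freezing scheme above can be sketched as follows. The attribute names (`model.model.layers`, `model.visual`) are assumptions based on common Qwen-style VLM layouts, not confirmed details of this checkpoint; inspect `model.named_modules()` to find the actual module paths:

```python
def apply_partial_freeze(model, num_trainable_layers: int = 6):
    """Freeze all parameters, then unfreeze the last N decoder layers and
    the vision encoder. Module names are assumptions -- verify them against
    your checkpoint's module tree."""
    # Start fully frozen.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze the last N transformer layers (assumed at model.model.layers).
    for layer in model.model.layers[-num_trainable_layers:]:
        for p in layer.parameters():
            p.requires_grad = True
    # Unfreeze the vision encoder (assumed at model.visual).
    for p in model.visual.parameters():
        p.requires_grad = True
```

With 6 of 24 layers plus the vision tower trainable, optimizer state and gradient memory shrink substantially compared to a full fine-tune, which is what makes 24 GB GPUs workable here.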

### Training Highlights
- Training was resumed across multiple sessions (`checkpoint-50 → checkpoint-100 → final`).
- VRAM stability was achieved by capping image resolution at `max_pixels = 1024 × 768`, preventing OOM spikes from high-resolution map images.
- Loss masking was applied so the model only learns from the **assistant's coordinate response**, not from the user's prompt or image tokens.
- Final **Token Accuracy: ~84.5%**, Final **Loss: ~0.37**.
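The loss-masking highlight above amounts to setting the label of every prompt token to the ignore index, so cross-entropy is computed only on the assistant's coordinate output. A minimal sketch (`mask_prompt_labels` and `response_start` are illustrative names; locating the response boundary depends on the chat template and is left to the data collator):

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before the assistant's
    response so loss is computed only on the coordinate tokens."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

For example, `mask_prompt_labels([1, 2, 3, 4, 5], 3)` yields `[-100, -100, -100, 4, 5]`, leaving only the last two tokens to contribute to the loss.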

---

## 🚀 Inference

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
    image = Image.open(image_path).convert("RGB")
    
    prompt_text = (
        f"You are provided an image of a path with a start location denoted in green "
        f"and an end location denoted in red. \n"
        f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
        f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
        f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
        f"of the path between the start and end location. \n"
        f"Ensure that the path follows the traversable locations of the map."
    )

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = processor(
        text=[text],
        images=[image],
        return_tensors="pt",
        padding=True,
        min_pixels=256 * 28 * 28,
        max_pixels=1024 * 768,
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        )
    
    output = processor.decode(
        generated_ids[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    return output.strip()


# Example usage
result = predict_path(
    image_path="map.png",
    start_xy=(0.7695, 0.4727),
    end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```

> **Important:** The training data contained no `<think>` tokens, so if the default Qwen3.5 chat template injects thinking tags, output quality degrades. If you see `<think>` tags in the generations, disable them when building the prompt (e.g. via `enable_thinking=False` in `apply_chat_template`, if your processor version supports it) or construct the raw prompt string manually as shown in the Example Input above.

---

## 📊 Training Metrics

| Epoch | Step | Loss | Token Accuracy | Learning Rate |
|---|---|---|---|---|
| 0.50 | 71 | 0.5044 | 70.35% | 6.9e-06 |
| ... | ... | ... | ... | ... |
| 0.71 | 100 | 0.4614 | 81.57% | 9.9e-06 |
| 0.85 | 120 | 0.4294 | 82.71% | 9.73e-06 |
| 1.00 | 141 | 0.4147 | 83.29% | 9.39e-06 |
| 1.42 | 200 | 0.3841 | 84.32% | 4.31e-06 |
| 1.50 | 211 | 0.3741 | 84.50% | 3.47e-06 |
| 1.63 | 224 | 0.3669 | 84.82% | 1.95e-06 |

---

## 🗂️ Dataset

The model was trained on [google/MapTrace](https://huggingface.co/datasets/google/MapTrace) (the `maptrace_20k` subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.

---

## ⚠️ Limitations

- **Map-Specific:** The model is trained exclusively for top-down map images; it is not a general path planner.
- **Coordinate Precision:** Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
- **Resolution Cap:** Images larger than `1024×768` will be downscaled to this resolution during inference to match training conditions.
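Because predictions live in normalized [0, 1] space, they must be scaled back to pixel coordinates before overlaying on the original map. A minimal sketch (`to_pixels` is a hypothetical helper for visualization, not part of the model):

```python
def to_pixels(waypoints, width: int, height: int):
    """Map normalized (x, y) waypoints to integer pixel coordinates for a
    given image size. Rounding to the nearest pixel discards sub-pixel
    detail, consistent with the small deviations noted above."""
    return [(round(x * width), round(y * height)) for x, y in waypoints]

# e.g. on a 1024x768 map:
# to_pixels([(0.5, 0.5)], 1024, 768) → [(512, 384)]
```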

---

## 📜 License

This model is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license, consistent with the base model.