TurkishCodeMan's picture
Upload Qwen3.5-0.8B MapTrace Partial-FT model with full weights
1f60f8a verified
---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
datasets:
- google/MapTrace
pipeline_tag: image-text-to-text
tags:
- qwen3.5
- vision-language
- path-planning
- maptrace
- fine-tuned
- partial-fine-tuning
---
![MapTrace Task Illustration](https://artemisp.github.io/maptrace/static/images/s1.png)
# Qwen3.5-0.8B — MapTrace Path Planning (Partial Fine-Tune)
This model is a **Partial Fine-Tune** of [`Qwen/Qwen3.5-0.8B`](https://huggingface.co/Qwen/Qwen3.5-0.8B) specifically trained for **visual path planning** on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.
---
## 🗺️ Task Description
**MapTrace** is a visual navigation benchmark that requires a model to:
1. Understand a map image with a marked **start location (green dot)** and **end location (red dot)**.
2. Trace a realistic, traversable path between the two points.
3. Output the path as a list of **normalized (x, y) coordinates**: `[(x1, y1), (x2, y2), ...]`
### Example Input
```
You are provided an image of a path with a start location denoted in green
and an end location denoted in red.
The normalized xy-coordinates of the start location are (0.77, 0.47) and
of the end location (0.54, 0.54).
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...]
of the path between the start and end location.
Ensure that the path follows the traversable locations of the map.
```
### Example Output
```python
[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
---
## 🏋️ Training Strategy
The model was trained using a **Partial Fine-Tuning** approach to balance quality and computational efficiency:
| Setting | Value |
|---|---|
| **Base Model** | `Qwen/Qwen3.5-0.8B` |
| **Strategy** | Partial Fine-Tuning (Last 6 of 24 layers) |
| **Dataset** | `google/MapTrace (maptrace_20k subset)` |
| **Vision Encoder** | ✅ Trained (not frozen) |
| **Effective Batch Size** | 128 (`1 per GPU × 32 grad_accum × 4 GPUs`) |
| **Learning Rate** | `1e-5` with Warmup (100 steps) |
| **Max Sequence Length** | `2048` tokens |
| **Mixed Precision** | `bf16` |
| **Gradient Checkpointing** | ✅ Enabled |
| **Image Resolution Cap** | `1024 × 768` pixels (VRAM safety) |
| **Hardware** | 4× NVIDIA RTX 4090 24GB |
### Why Partial Fine-Tuning?
Instead of full fine-tuning or LoRA, we opted for **unlocking the last 6 transformer layers and the visual encoder** for gradient updates. This approach:
- Preserves the model's general language understanding from the foundation weights.
- Directly adapts the "reasoning" layers to the spatial coordinate prediction task.
- Avoids LoRA's approximation error — weights are modified directly.
### Training Highlights
- Resumed from `checkpoint-50 → checkpoint-100 → final` across multiple sessions.
- VRAM stability was achieved by setting `max_pixels=1024x768` as a safety cap, preventing OOM spikes from high-resolution map images.
- Loss masking was applied so the model only learns from the **assistant's coordinate response**, not the user's prompt or image tokens.
- Final **Token Accuracy: ~~84.5%**, Final **Loss: ~0.37**.
---
## 🚀 Inference
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
model.eval()
def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
image = Image.open(image_path).convert("RGB")
prompt_text = (
f"You are provided an image of a path with a start location denoted in green "
f"and an end location denoted in red. \n"
f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
f"of the path between the start and end location. \n"
f"Ensure that the path follows the traversable locations of the map."
)
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": prompt_text},
],
}
]
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = processor(
text=[text],
images=[image],
return_tensors="pt",
padding=True,
min_pixels=256 * 28 * 28,
max_pixels=1024 * 768,
).to(model.device)
with torch.no_grad():
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
do_sample=False,
temperature=0.0,
)
output = processor.decode(
generated_ids[0][inputs.input_ids.shape[1]:],
skip_special_tokens=True
)
return output.strip()
# Example usage
result = predict_path(
image_path="map.png",
start_xy=(0.7695, 0.4727),
end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
> **Important:** Use the **raw prompt format** shown above (not `apply_chat_template` with `add_generation_prompt=True`). Since the training data did not include `<think>` tokens, applying the default Qwen3.5 template would inject thinking tags and cause degraded output.
---
## 📊 Training Metrics
| Epoch | Step | Loss | Token Accuracy | Learning Rate |
|---|---|---|---|---|
| 0.50 | 71 | 0.5044 | 70.35% | 6.9e-06 |
...
| 0.71 | 100 | 0.4614 | 81.57% | 9.9e-06 |
| 0.85 | 120 | 0.4294 | 82.71% | 9.73e-06 |
| 1.00 | 141 | 0.4147 | 83.29% | 9.39e-06 |
| 1.42 | 200 | 0.3841 | 84.32% | 4.31e-06 |
| 1.50 | 211 | 0.3741 | 84.50% | 3.47e-06 |
| 1.63 | 224 | 0.3669 | 84.82% | 1.95e-06 |
---
## 🗂️ Dataset
The model was trained on [google/MapTrace](https://huggingface.co/datasets/google/MapTrace) (the `maptrace_20k` subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.
---
## ⚠️ Limitations
- **Map-Specific:** The model is trained exclusively for top-down map images; it is not a general path planner.
- **Coordinate Precision:** Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
- **Resolution Cap:** Images larger than `1024×768` will be downscaled to this resolution during inference to match training conditions.
---
## 📜 License
This model is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license, consistent with the base model.