---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
datasets:
- google/MapTrace
pipeline_tag: image-text-to-text
tags:
- qwen3.5
- vision-language
- path-planning
- maptrace
- fine-tuned
- partial-fine-tuning
---

![MapTrace Task Illustration](https://artemisp.github.io/maptrace/static/images/s1.png)

# Qwen3.5-0.8B — MapTrace Path Planning (Partial Fine-Tune)

This model is a **Partial Fine-Tune** of [`Qwen/Qwen3.5-0.8B`](https://huggingface.co/Qwen/Qwen3.5-0.8B) specifically trained for **visual path planning** on indoor and outdoor maps. Given an image of a traversable map with marked start (green) and end (red) locations, the model predicts a sequence of normalized (x, y) waypoints that connect the two locations while staying on traversable paths.

---

## 🗺️ Task Description

**MapTrace** is a visual navigation benchmark that requires a model to:
1. Understand a map image with a marked **start location (green dot)** and **end location (red dot)**.
2. Trace a realistic, traversable path between the two points.
3. Output the path as a list of **normalized (x, y) coordinates**: `[(x1, y1), (x2, y2), ...]`

### Example Input
```
You are provided an image of a path with a start location denoted in green 
and an end location denoted in red. 
The normalized xy-coordinates of the start location are (0.77, 0.47) and 
of the end location (0.54, 0.54). 
Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] 
of the path between the start and end location. 
Ensure that the path follows the traversable locations of the map.
```

### Example Output
```python
[(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```
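The model's reply is plain text, so it needs to be parsed back into coordinate tuples before use. The sketch below is one way to do this with the standard library, assuming the reply contains a Python-style list of `(x, y)` tuples as in the example output above (`parse_waypoints` is a hypothetical helper, not part of the model's API):

```python
import ast

def parse_waypoints(reply: str) -> list[tuple[float, float]]:
    """Extract the first [...] span from the model's reply and parse it
    into a list of (x, y) float tuples."""
    start = reply.find("[")
    end = reply.rfind("]") + 1
    if start == -1 or end == 0:
        raise ValueError(f"No coordinate list found in: {reply!r}")
    waypoints = ast.literal_eval(reply[start:end])
    return [(float(x), float(y)) for x, y in waypoints]

print(parse_waypoints("[(0.7695, 0.4727), (0.7344, 0.5156), (0.543, 0.543)]"))
# → [(0.7695, 0.4727), (0.7344, 0.5156), (0.543, 0.543)]
```

Using `ast.literal_eval` instead of `eval` keeps parsing safe even if the model emits unexpected text.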

---

## 🏋️ Training Strategy

The model was trained using a **Partial Fine-Tuning** approach to balance quality and computational efficiency:

| Setting | Value |
|---|---|
| **Base Model** | `Qwen/Qwen3.5-0.8B` |
| **Strategy** | Partial Fine-Tuning (Last 6 of 24 layers) |
| **Dataset** | `google/MapTrace` (`maptrace_20k` subset) |
| **Vision Encoder** | ✅ Trained (not frozen) |
| **Effective Batch Size** | 128 (`1 per GPU × 32 grad_accum × 4 GPUs`) |
| **Learning Rate** | `1e-5` with Warmup (100 steps) |
| **Max Sequence Length** | `2048` tokens |
| **Mixed Precision** | `bf16` |
| **Gradient Checkpointing** | ✅ Enabled |
| **Image Resolution Cap** | `1024 × 768` pixels (VRAM safety) |
| **Hardware** | 4× NVIDIA RTX 4090 24GB |

### Why Partial Fine-Tuning?
Instead of full fine-tuning or LoRA, we opted to **unfreeze the last 6 transformer layers and the vision encoder** for gradient updates. This approach:
- Preserves the model's general language understanding from the foundation weights.
- Directly adapts the final "reasoning" layers to the spatial coordinate prediction task.
- Avoids LoRA's low-rank approximation error, since the weights are updated directly.
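The freezing scheme above can be sketched as follows. The attribute names (`model.model.layers`, `model.visual`) are assumptions based on common Qwen-style VLM layouts, not confirmed details of this checkpoint; inspect `model.named_modules()` to find the actual module paths:

```python
def apply_partial_freeze(model, num_trainable_layers: int = 6):
    """Freeze all parameters, then unfreeze the last N decoder layers and
    the vision encoder. Module names are assumptions -- verify them against
    your checkpoint's module tree."""
    # Start fully frozen.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze the last N transformer layers (assumed at model.model.layers).
    for layer in model.model.layers[-num_trainable_layers:]:
        for p in layer.parameters():
            p.requires_grad = True
    # Unfreeze the vision encoder (assumed at model.visual).
    for p in model.visual.parameters():
        p.requires_grad = True
```

With 6 of 24 layers plus the vision tower trainable, optimizer state and gradient memory shrink substantially compared to a full fine-tune, which is what makes 24 GB GPUs workable here.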

### Training Highlights
- Training was resumed across multiple sessions (`checkpoint-50 → checkpoint-100 → final`).
- VRAM stability was achieved by capping image resolution at `max_pixels = 1024 × 768`, preventing OOM spikes from high-resolution map images.
- Loss masking was applied so the model only learns from the **assistant's coordinate response**, not from the user's prompt or image tokens.
- Final **Token Accuracy: ~84.5%**, Final **Loss: ~0.37**.
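The loss-masking highlight above amounts to setting the label of every prompt token to the ignore index, so cross-entropy is computed only on the assistant's coordinate output. A minimal sketch (`mask_prompt_labels` and `response_start` are illustrative names; locating the response boundary depends on the chat template and is left to the data collator):

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_prompt_labels(input_ids: list[int], response_start: int) -> list[int]:
    """Copy input_ids into labels, masking everything before the assistant's
    response so loss is computed only on the coordinate tokens."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

For example, `mask_prompt_labels([1, 2, 3, 4, 5], 3)` yields `[-100, -100, -100, 4, 5]`, leaving only the last two tokens to contribute to the loss.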

---

## 🚀 Inference

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "TurkishCodeMan/Qwen3.5-0.8B-MapTrace-PartialFT"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def predict_path(image_path: str, start_xy: tuple, end_xy: tuple) -> str:
    image = Image.open(image_path).convert("RGB")
    
    prompt_text = (
        f"You are provided an image of a path with a start location denoted in green "
        f"and an end location denoted in red. \n"
        f"The normalized xy-coordinates of the start location are ({start_xy[0]}, {start_xy[1]}) "
        f"and of the end location ({end_xy[0]}, {end_xy[1]}). \n"
        f"Output a list of normalized coordinates in the form of a list [(x1,y1), (x2,y2)...] "
        f"of the path between the start and end location. \n"
        f"Ensure that the path follows the traversable locations of the map."
    )

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = processor(
        text=[text],
        images=[image],
        return_tensors="pt",
        padding=True,
        min_pixels=256 * 28 * 28,
        max_pixels=1024 * 768,
    ).to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        )
    
    output = processor.decode(
        generated_ids[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )
    return output.strip()


# Example usage
result = predict_path(
    image_path="map.png",
    start_xy=(0.7695, 0.4727),
    end_xy=(0.543, 0.543),
)
print(result)
# Expected: [(0.7695, 0.4727), (0.7344, 0.5156), (0.5898, 0.5234), (0.543, 0.543)]
```

> **Important:** The training data contained no `<think>` tokens, so if the default Qwen3.5 chat template injects thinking tags, output quality degrades. If you see `<think>` tags in the generations, disable them when building the prompt (e.g. via `enable_thinking=False` in `apply_chat_template`, if your processor version supports it) or construct the raw prompt string manually as shown in the Example Input above.

---

## 📊 Training Metrics

| Epoch | Step | Loss | Token Accuracy | Learning Rate |
|---|---|---|---|---|
| 0.50 | 71 | 0.5044 | 70.35% | 6.9e-06 |
| ... | ... | ... | ... | ... |
| 0.71 | 100 | 0.4614 | 81.57% | 9.9e-06 |
| 0.85 | 120 | 0.4294 | 82.71% | 9.73e-06 |
| 1.00 | 141 | 0.4147 | 83.29% | 9.39e-06 |
| 1.42 | 200 | 0.3841 | 84.32% | 4.31e-06 |
| 1.50 | 211 | 0.3741 | 84.50% | 3.47e-06 |
| 1.63 | 224 | 0.3669 | 84.82% | 1.95e-06 |

---

## 🗂️ Dataset

The model was trained on [google/MapTrace](https://huggingface.co/datasets/google/MapTrace) (the `maptrace_20k` subset), a dataset containing 20,000 map images paired with path annotations from start to end locations.

---

## ⚠️ Limitations

- **Map-Specific:** The model is trained exclusively for top-down map images; it is not a general path planner.
- **Coordinate Precision:** Predictions are in normalized [0, 1] space and may have small pixel-level deviations.
- **Resolution Cap:** Images larger than `1024×768` will be downscaled to this resolution during inference to match training conditions.
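Because predictions live in normalized [0, 1] space, they must be scaled back to pixel coordinates before overlaying on the original map. A minimal sketch (`to_pixels` is a hypothetical helper for visualization, not part of the model):

```python
def to_pixels(waypoints, width: int, height: int):
    """Map normalized (x, y) waypoints to integer pixel coordinates for a
    given image size. Rounding to the nearest pixel discards sub-pixel
    detail, consistent with the small deviations noted above."""
    return [(round(x * width), round(y * height)) for x, y in waypoints]

# e.g. on a 1024x768 map:
# to_pixels([(0.5, 0.5)], 1024, 768) → [(512, 384)]
```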

---

## 📜 License

This model is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license, consistent with the base model.