---
license: apache-2.0
language:
- en
tags:
- semantic-segmentation
- segformer
- agriculture
- orchard
- apple
- outdoor
library_name: transformers
pipeline_tag: image-segmentation
base_model: nvidia/segformer-b5-finetuned-ade-640-640
---

# Segformer85M — Apple Orchard Semantic Segmentation

SegFormer-B5 (85M parameters) fine-tuned for **8-class semantic segmentation** of outdoor apple orchard scenes captured from a robotic platform.

This repo contains **two checkpoints**:

| File | When to use |
|------|-------------|
| **`Segformer85Mv1.pt`** | Original v1, trained only on the spring oak_0415 dataset. Serves as the baseline. |
| **`Segformer85Mv2.pt`** ⭐ | v1 further fine-tuned on a second dataset (different camera, autumn season). **Use this for general deployment**: tree IoU on the original orchard is unchanged (0.742), and tree recall on the new camera/season hold-out rises from ~0.55 (estimated) to 0.999. |

## Quick Use

```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F

# 1. Download weights — pick v1 OR v2
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")
#                                                                       ^^^^^^^^^^^^^^^^^
#                                                                       use v2 by default

# 2. Init architecture from base + load fine-tuned weights
NAMES = ["tree","ground","person","sky","road","mountain","building","background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i:n for i,n in enumerate(NAMES)},
    label2id={n:i for i,n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

# 3. Inference
img = cv2.imread("your_image.jpg")  # BGR, uint8
assert img is not None, "could not read your_image.jpg"
H, W = img.shape[:2]
H32, W32 = (H//32)*32, (W//32)*32  # round down to multiples of 32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])  # ImageNet mean/std
x = torch.from_numpy(((rgb - mean) / std).transpose(2,0,1)).unsqueeze(0).float().cuda()  # 1 x 3 x H32 x W32

with torch.no_grad():
    logits = model(pixel_values=x).logits  # 1 x 8 x H32/4 x W32/4
    # upsample the logits back to the original image size before taking the argmax
    logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
    pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```

A ready-to-use `predict.py` is included in this repo.
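
To inspect `pred` visually, the class-id mask can be mapped to colours and blended over the input image. The following is a minimal sketch, not the code used in `predict.py`: `PALETTE`, `colorize`, and `overlay` are hypothetical names, and only the "`person` = red" convention comes from this card; the other colours are placeholders.

```python
import numpy as np

# Hypothetical RGB palette indexed by class id (0..7). Only "person = red"
# is taken from this card; the remaining colours are illustrative.
PALETTE = np.array([
    [0, 160, 0],      # 0 tree
    [150, 110, 60],   # 1 ground
    [255, 0, 0],      # 2 person
    [90, 170, 255],   # 3 sky
    [128, 128, 128],  # 4 road
    [120, 80, 160],   # 5 mountain
    [255, 200, 0],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)

def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class ids to an (H, W, 3) uint8 RGB image."""
    return PALETTE[pred]

def overlay(rgb: np.ndarray, pred: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend the colourised mask over an (H, W, 3) uint8 RGB image."""
    base = rgb.astype(np.float32)
    mask = colorize(pred).astype(np.float32)
    return ((1.0 - alpha) * base + alpha * mask).round().astype(np.uint8)
```

With the Quick Use variables, `overlay(cv2.cvtColor(img, cv2.COLOR_BGR2RGB), pred)` returns an RGB image; convert it back to BGR before passing it to `cv2.imwrite`.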

## Classes (id → name)

| ID | Class       | Notes                                                  |
|----|-------------|--------------------------------------------------------|
| 0  | **tree**    | Apple trees (priority class for downstream tasks)      |
| 1  | ground      | Grass / dirt / orchard floor                           |
| 2  | person      | Workers in scene                                       |
| 3  | sky         |                                                        |
| 4  | road        | Path between rows                                      |
| 5  | mountain    | Distant terrain                                        |
| 6  | building    | Sheds, equipment shelters                              |
| 7  | background  | Unknown / unlabeled regions (rarely predicted)         |

## Architecture & Preprocessing

| | |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| `mean` | `[0.485, 0.456, 0.406]` |
| `std` | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |
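
The preprocessing rules in this table can be packaged as two small helpers. This is a sketch under the same conventions as the Quick Use snippet (round each side down to a multiple of 32, scale to [0, 1], apply the ImageNet mean/std); the function names `snap_to_multiple` and `normalize` are hypothetical, and the resize itself is left to `cv2.resize`.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def snap_to_multiple(size: int, multiple: int = 32) -> int:
    """Round `size` down to a multiple of `multiple` (never below one multiple)."""
    return max(multiple, (size // multiple) * multiple)

def normalize(rgb_uint8: np.ndarray) -> np.ndarray:
    """(H, W, 3) uint8 RGB -> (1, 3, H, W) float32, ImageNet-normalized."""
    x = rgb_uint8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD   # broadcasts over the channel axis
    return x.transpose(2, 0, 1)[None]        # HWC -> CHW, then add a batch axis
```

For example, a 1920×1080 frame would be resized to 1920×1056 before normalization, while the 1024×576 training resolution passes through unchanged.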

## Performance

### v1 (Segformer85Mv1.pt) — original training only

Validated on a temporally disjoint hold-out from the same recording (frames after 4500, no leakage from neighboring frames):

| Metric | Value |
|---|---|
| **Tree IoU** | **0.742** |
| **mIoU (7 real classes)** | **0.714** |
| **Pixel accuracy** | **0.834** |

### v2 (Segformer85Mv2.pt) — v1 + Orchard Navigation fine-tune ⭐

Same v1 hold-out: tree IoU is unchanged and mIoU differs by 0.002, so there is no meaningful regression on the original domain:

| Metric | v1 | **v2** |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | **0.742** ✅ |
| mIoU (orig orchard) | 0.714 | 0.712 |

NEW orchard hold-out (different camera, autumn season — Aug+Sep capture):

| Metric | v1 | **v2** |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | **0.999** 🚀 |

**Qualitative comparison**: v1 sometimes misclassifies autumn foliage as `person` (red); v2 cleanly segments it as `tree`. See `samples_v6_vs_v7/` for side-by-side examples.

### v1 per-class metrics (8-class model, no leak; `background` not listed)

| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
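
IoU, precision, recall, and pixel accuracy all derive from a single confusion matrix. The sketch below shows the standard definitions; whether the evaluation behind these tables was computed exactly this way (for instance, whether it skips pixels labelled 255) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes=8, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    keep = gt != ignore_index                      # drop unlabeled pixels
    idx = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_metrics(cm):
    """Return (iou, precision, recall) arrays; entries are NaN for absent classes."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp   # predicted as class c, labelled otherwise
    fn = cm.sum(axis=1) - tp   # labelled class c, predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
    return iou, precision, recall

def pixel_accuracy(cm):
    """Fraction of labelled pixels whose predicted class matches the label."""
    return np.diag(cm).sum() / cm.sum()
```

Under these definitions, a mean IoU over the seven real classes would be `np.nanmean(iou[:7])`, which leaves out `background` (id 7).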

## Training Data

### v1 base
- ~5300 frames from a single `oak_0415_oneRadar_1` recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged + class-aligned (`vines`→`tree`, `moutain`→`mountain` typo fixed)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<=4500` train (5177 samples), frames `>4500` validation (155 samples) — **no neighbor leakage**
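
Consecutive video frames are near-duplicates, so a random split would place almost identical images in both sets; splitting on the frame index avoids that. A minimal sketch of such a split follows. The threshold comes from this card, while the file-naming scheme (`frame_004501.jpg`) and the helper names are hypothetical.

```python
from pathlib import Path

SPLIT_FRAME = 4500  # from this card: frames <= 4500 train, frames > 4500 validation

def frame_index(path: str) -> int:
    """Hypothetical naming scheme: the file stem ends in the frame number."""
    return int(Path(path).stem.rsplit("_", 1)[-1])

def temporal_split(paths):
    """Split paths into (train, val) by frame index instead of at random."""
    train = [p for p in paths if frame_index(p) <= SPLIT_FRAME]
    val = [p for p in paths if frame_index(p) > SPLIT_FRAME]
    return train, val
```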

### v2 fine-tune (NEW)
- **+311 images** from "Orchard Navigation" dataset:
  - 178 frames from a Sep-16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
- Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in new images set to `ignore_index=255` so the model only adapts its tree decisions, leaving other classes untouched
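
The two ideas in this list, tree-only targets and replay of v1 images, can be sketched as follows. This is an illustration rather than the contents of `finetune_v7.py`: it assumes the tree polygons have already been rasterized into a boolean mask, and `tree_only_target` and `build_finetune_set` are hypothetical names.

```python
import random
import numpy as np

TREE_ID = 0         # class id of `tree`
IGNORE_INDEX = 255  # pixels with this label contribute nothing to the loss

def tree_only_target(tree_mask: np.ndarray) -> np.ndarray:
    """Boolean (H, W) tree mask -> uint8 label map: tree = 0, everything else = 255."""
    target = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    target[tree_mask.astype(bool)] = TREE_ID
    return target

def build_finetune_set(v1_items, new_items, n_replay=500, seed=0):
    """Mix a random sample of v1 items (full 8-class masks) with all new items."""
    rng = random.Random(seed)
    v1_items = list(v1_items)
    replay = rng.sample(v1_items, min(n_replay, len(v1_items)))
    mixed = replay + list(new_items)
    rng.shuffle(mixed)
    return mixed
```

Because the non-tree pixels of the new images are ignored, gradients for the other classes come only from the replayed v1 images.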

## Training Recipe

### v1
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
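
For reference, here is a NumPy sketch of class-weighted cross-entropy using the weights from this table. It follows the weighted-mean convention of `torch.nn.CrossEntropyLoss(weight=..., ignore_index=...)` (sum of weighted losses divided by the sum of weights over non-ignored pixels). It is an illustration, not the code in `train_v6_5090.py`, and `weighted_cross_entropy` is a hypothetical name.

```python
import numpy as np

# Class weights in class-id order:
# tree, ground, person, sky, road, mountain, building, background
CLASS_WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])

def weighted_cross_entropy(logits, target, weights=CLASS_WEIGHTS, ignore_index=255):
    """logits: (C, H, W) float; target: (H, W) int. Returns the weighted mean loss."""
    z = logits - logits.max(axis=0, keepdims=True)            # stabilise the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    valid = target != ignore_index                            # skip ignored pixels
    t = target[valid].astype(np.int64)
    picked = log_probs[:, valid][t, np.arange(t.size)]        # log p(true class) per pixel
    w = weights[t]
    return float(-(w * picked).sum() / w.sum())
```

With these weights, an error on a `tree` or `person` pixel counts three times as much as one on a `ground` pixel, and `background` pixels have very little influence.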

### v2 fine-tune (delta from v1)
| Hyperparameter | Value |
|---|---|
| LR | **5e-6** (4× lower than v1's 2e-5, for a conservative fine-tune) |
| Epochs | **8** (best at epoch 3) |
| `ignore_index` | **255** (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |
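
Both runs use a cosine learning-rate schedule and differ only in the peak value. The sketch below assumes no warmup and decay to zero, neither of which this card specifies; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Here `base_lr` would be `2e-5` for v1 and `5e-6` for the v2 fine-tune, with `total_steps` set to the number of optimizer steps in each run.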

## Limitations

This model was trained on a **single Korean apple orchard** (spring 2024) with a **single robot platform**, plus a small fine-tune on a second autumn capture. Expect degradation on:

- ⚠️ Different orchards (different tree species, layouts, training systems)
- ⚠️ Different cameras (different FOV, color profiles, sensors)
- 💀 Different seasons not in training (winter dormant trees)
- 💀 Different lighting (rain, dawn/dusk, night)
- 💀 Aerial / drone perspectives

For deployment in a new context, plan to fine-tune on 100-300 in-domain images.

## Files in This Repo

| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB) ⭐ |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | **v1 vs v2 side-by-side** on new orchard (showcases v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |

## License

Apache 2.0