---
license: apache-2.0
language:
- en
tags:
- semantic-segmentation
- segformer
- agriculture
- orchard
- apple
- outdoor
library_name: transformers
pipeline_tag: image-segmentation
base_model: nvidia/segformer-b5-finetuned-ade-640-640
---
# Segformer85M: Apple Orchard Semantic Segmentation
Segformer-B5 (85M parameters) fine-tuned for **8-class semantic segmentation** of outdoor apple orchard scenes captured from a robotic platform.
This repo contains **two checkpoints**:
| File | When to use |
|------|-------------|
| **`Segformer85Mv1.pt`** | Original v1, trained only on the spring `oak_0415` recording. Use it as the in-domain baseline. |
| **`Segformer85Mv2.pt`** ⭐ | v1 fine-tuned on a second dataset (different camera, autumn season). **Use this for general deployment**: tree IoU on the original orchard is unchanged (0.742), and tree recall on the new orchard rises from ~0.55 (estimated) to 0.999. |
## Quick Use
```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F
# 1. Download weights: set filename to "Segformer85Mv1.pt" for v1 or "Segformer85Mv2.pt" for v2 (default)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")
# 2. Init architecture from base + load fine-tuned weights
NAMES = ["tree","ground","person","sky","road","mountain","building","background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i: n for i, n in enumerate(NAMES)},
    label2id={n: i for i, n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,  # the 8-class head replaces the ADE20K head
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])
# 3. Inference
img = cv2.imread("your_image.jpg")
H, W = img.shape[:2]
H32, W32 = (H//32)*32, (W//32)*32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])
x = torch.from_numpy(((rgb - mean) / std).transpose(2,0,1)).unsqueeze(0).float().cuda()
with torch.no_grad():
    logits = model(pixel_values=x).logits
logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
pred = logits.argmax(1)[0].cpu().numpy() # H x W, values 0..7
```
A ready-to-use `predict.py` is included in this repo.
## Classes (id → name)
| ID | Class | Notes |
|----|-------------|--------------------------------------------------------|
| 0 | **tree** | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (rarely predicted by the model) |
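
To inspect a prediction visually, the class-ID map can be turned into a color image with a lookup table. This is a minimal sketch: the `PALETTE` colors are illustrative assumptions (this card only mentions that `person` is drawn in red), and `colorize` and `class_fractions` are hypothetical helpers, not part of `predict.py`.

```python
import numpy as np

NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]

# Hypothetical RGB palette, one row per class ID (assumed colors, not the repo's).
PALETTE = np.array([
    [34, 139, 34],    # 0 tree
    [139, 90, 43],    # 1 ground
    [255, 0, 0],      # 2 person
    [135, 206, 235],  # 3 sky
    [128, 128, 128],  # 4 road
    [112, 128, 144],  # 5 mountain
    [255, 215, 0],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)

def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an H x W array of class IDs (0..7) to an H x W x 3 RGB image."""
    return PALETTE[pred]

def class_fractions(pred: np.ndarray) -> dict:
    """Fraction of pixels assigned to each class, keyed by class name."""
    counts = np.bincount(pred.ravel(), minlength=len(NAMES))
    return {name: counts[i] / pred.size for i, name in enumerate(NAMES)}

# Tiny stand-in for the `pred` array produced in Quick Use.
demo = np.array([[0, 0, 3], [1, 1, 2]])
rgb = colorize(demo)
fractions = class_fractions(demo)
```

With OpenCV, `cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)` converts the result back to BGR before `cv2.imwrite`.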
## Architecture & Preprocessing
| | |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| `mean` | `[0.485, 0.456, 0.406]` |
| `std` | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |
## Performance
### v1 (Segformer85Mv1.pt): original training only
Validated on a temporally-disjoint hold-out from the same recording (frames 4501+, no leakage):
| Metric | Value |
|---|---|
| **Tree IoU** | **0.742** |
| **mIoU (7 real classes)** | **0.714** |
| **Pixel accuracy** | **0.834** |
### v2 (Segformer85Mv2.pt): v1 + Orchard Navigation fine-tune ⭐
Evaluated on the same v1 hold-out, v2 keeps tree IoU unchanged and loses only 0.002 mIoU on the original domain:
| Metric | v1 | **v2** |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | **0.742** ✅ |
| mIoU (orig orchard) | 0.714 | 0.712 |
New orchard hold-out (different camera, autumn season; August and September captures):
| Metric | v1 | **v2** |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | **0.999** 🚀 |
**Qualitative comparison**: v1 sometimes misclassifies autumn foliage as `person` (shown in red); v2 segments it cleanly as `tree`. See `samples_v6_vs_v7/` for side-by-side examples.
### v1 per-class metrics (8-class model, no leak; `background` not listed)
| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
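
The metrics in these tables follow the usual confusion-matrix definitions: per class, IoU = TP / (TP + FP + FN), precision = TP / (TP + FP), and recall = TP / (TP + FN). The sketch below shows one way to compute them. It is not the repo's evaluation code, and `confusion_matrix` and `per_class_metrics` are hypothetical helper names. Pixels labeled 255 are skipped, matching the `ignore_index` used in the v2 fine-tune.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes=8, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    gt, pred = np.asarray(gt).ravel(), np.asarray(pred).ravel()
    keep = gt != ignore_index  # drop unlabeled pixels
    idx = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_metrics(cm):
    """Return (iou, precision, recall, pixel_accuracy); absent classes give NaN."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp  # predicted as the class but labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled as the class but predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
    pixel_acc = tp.sum() / cm.sum()
    return iou, precision, recall, pixel_acc

# Toy two-class example; the last pixel is unlabeled and is ignored.
gt = np.array([0, 0, 0, 1, 1, 255])
pred = np.array([0, 0, 1, 1, 0, 0])
cm = confusion_matrix(gt, pred, num_classes=2)
iou, precision, recall, pixel_acc = per_class_metrics(cm)
```

For the 8-class model, an mIoU over the 7 real classes would then be `np.nanmean(iou[:7])`, which leaves out `background` (ID 7).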
## Training Data
### v1 base
- ~5300 frames from a single oak_0415_oneRadar_1 recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (`vines` → `tree`; the `moutain` typo corrected to `mountain`)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<=4500` for training (5177 samples), frames `>4500` for validation (155 samples), so **no neighboring frames leak across the split**
### v2 fine-tune (NEW)
- **+311 images** from "Orchard Navigation" dataset:
- 178 frames from a Sep-16 recording (autumn season)
- 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
- Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in new images set to `ignore_index=255` so the model only adapts its tree decisions, leaving other classes untouched
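
The ignore-label step can be sketched as follows, assuming each new image comes with a binary tree mask rasterized from its polygons. `tree_only_label` is a hypothetical helper, not the code in `finetune_v7.py`.

```python
import numpy as np

TREE_ID = 0         # class ID of `tree` in this model
IGNORE_INDEX = 255  # label value the loss skips

def tree_only_label(tree_mask: np.ndarray) -> np.ndarray:
    """Turn a boolean tree mask into a training label map.

    Tree pixels get class 0; every other pixel gets 255 so the loss skips it.
    """
    label = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    label[tree_mask.astype(bool)] = TREE_ID
    return label

mask = np.array([[1, 0], [0, 1]], dtype=bool)
label = tree_only_label(mask)
```

Ignored pixels add nothing to the loss, so the new images only teach the model where trees are; the 500 replayed v1 images with full 8-class masks supply the supervision for every other class.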
## Training Recipe
### v1
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
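
For reference, the weighted cross-entropy with these class weights can be written out in NumPy. The sketch mirrors what PyTorch's `torch.nn.CrossEntropyLoss(weight=..., ignore_index=255)` computes with its default mean reduction, namely a weighted average over the non-ignored pixels. Whether the training script uses exactly this call is an assumption, and `weighted_ce` is a hypothetical name.

```python
import numpy as np

# Class weights from the v1 recipe, indexed by class ID (tree ... background).
CLASS_WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])

def weighted_ce(logits, target, weights=CLASS_WEIGHTS, ignore_index=255):
    """logits: (N, C) array, target: (N,) class IDs. Returns a scalar loss."""
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target)
    keep = target != ignore_index  # skip unlabeled pixels
    logits, target = logits[keep], target[keep]
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]
    w = weights[target]
    # Weighted mean: each pixel counts in proportion to its class weight.
    return float((w * nll).sum() / w.sum())

# Uniform logits: every kept pixel has loss log(8), whatever its weight.
loss = weighted_ce(np.zeros((3, 8)), np.array([0, 7, 255]))
```

With weight 0.1 on `background` and 1.5 on `tree` and `person`, a misclassified tree pixel moves the loss 15 times as much as a misclassified background pixel with the same per-pixel error.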
### v2 fine-tune (delta from v1)
| Hyperparameter | Value |
|---|---|
| LR | **5e-6** (4× lower than v1 for a safe fine-tune) |
| Epochs | **8** (best at epoch 3) |
| `ignore_index` | **255** (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |
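
The learning-rate rows can be made concrete with a small cosine-schedule sketch. It assumes no warmup and decay all the way to zero, neither of which this card states, and it assumes one optimizer step per effective batch of 8. `cosine_lr` and `optimizer_steps` are hypothetical helpers.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def optimizer_steps(num_samples: int, epochs: int, effective_batch: int = 8) -> int:
    """Optimizer steps if each epoch is cut into effective batches (last may be partial)."""
    return math.ceil(num_samples / effective_batch) * epochs

# v1: 5177 training samples, 30 epochs, base LR 2e-5.
v1_steps = optimizer_steps(5177, epochs=30)
v1_mid = cosine_lr(v1_steps // 2, v1_steps, base_lr=2e-5)

# v2 starts from 5e-6; the length of 100 steps is an arbitrary illustration.
v2_start = cosine_lr(0, 100, base_lr=5e-6)
lr_ratio = 2e-5 / 5e-6
```

Under these assumptions, the v1 learning rate is halfway down (1e-5) at the midpoint of training, and the v2 fine-tune starts a factor of 4 below the v1 peak.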
## Limitations
This model was trained on a **single Korean apple orchard** (spring 2024) with a **single robot platform**, plus a small fine-tune on a second autumn capture. Expect degradation on:
- ⚠️ Different orchards (different tree species, layouts, training systems)
- ⚠️ Different cameras (different FOV, color profiles, sensors)
- 💀 Different seasons not in training (winter dormant trees)
- 💀 Different lighting (rain, dawn/dusk, night)
- 💀 Aerial / drone perspectives
For deployment in a new context, plan to fine-tune on 100-300 in-domain images.
## Files in This Repo
| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB) ⭐ |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | **v1 vs v2 side-by-side** on new orchard (showcases v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |
## License
Apache 2.0