	---
	license: apache-2.0
	language:
	- en
	tags:
	- semantic-segmentation
	- segformer
	- agriculture
	- orchard
	- apple
	- outdoor
	library_name: transformers
	pipeline_tag: image-segmentation
	base_model: nvidia/segformer-b5-finetuned-ade-640-640
	---

	# Segformer85M — Apple Orchard Semantic Segmentation

	Segformer-B5 (85M parameters) fine-tuned for 8-class semantic segmentation of outdoor apple orchard scenes captured from a robotic platform.

	This repo contains two checkpoints:

	\| File \| When to use \|
	\|------\|-------------\|
	\| `Segformer85Mv1.pt` \| Original v1, trained only on the spring oak_0415 dataset. Best baseline. \|
	\| `Segformer85Mv2.pt` ⭐ \| v1 further fine-tuned on a second dataset (different camera, autumn season). Use this for general deployment: tree IoU on the original orchard is unchanged (0.742), and tree recall on the new camera and season is far higher (0.999 vs. an estimated ~0.55 for v1). \|

	## Quick Use

	```python
	from huggingface_hub import hf_hub_download
	from transformers import SegformerForSemanticSegmentation
	import torch, cv2, numpy as np
	import torch.nn.functional as F

	# 1. Download weights: set filename to "Segformer85Mv1.pt" or "Segformer85Mv2.pt" (v2 is the recommended default)
	ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")

	# 2. Init architecture from the base model, then load the fine-tuned weights
	NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
	model = SegformerForSemanticSegmentation.from_pretrained(
	    "nvidia/segformer-b5-finetuned-ade-640-640",
	    num_labels=8,
	    id2label={i: n for i, n in enumerate(NAMES)},
	    label2id={n: i for i, n in enumerate(NAMES)},
	    ignore_mismatched_sizes=True,  # the 150-class ADE head is replaced by an 8-class head
	).cuda().eval()
	model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

	# 3. Inference
	img = cv2.imread("your_image.jpg")  # BGR, H x W x 3
	H, W = img.shape[:2]
	H32, W32 = (H // 32) * 32, (W // 32) * 32  # round down to multiples of 32
	rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
	mean = np.array([0.485, 0.456, 0.406])
	std = np.array([0.229, 0.224, 0.225])
	x = torch.from_numpy(((rgb - mean) / std).transpose(2, 0, 1)).unsqueeze(0).float().cuda()

	with torch.no_grad():
	    logits = model(pixel_values=x).logits
	    # Upsample the low-resolution logits back to the original image size
	    logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
	pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
	```

	A ready-to-use `predict.py` is included in this repo.

	## Classes (id → name)

	\| ID \| Class \| Notes \|
	\|----\|-------------\|--------------------------------------------------------\|
	\| 0 \| tree \| Apple trees (priority class for downstream tasks) \|
	\| 1 \| ground \| Grass / dirt / orchard floor \|
	\| 2 \| person \| Workers in scene \|
	\| 3 \| sky \| \|
	\| 4 \| road \| Path between rows \|
	\| 5 \| mountain \| Distant terrain \|
	\| 6 \| building \| Sheds, equipment shelters \|
	\| 7 \| background \| Unknown / unlabeled regions (rarely predicted by the model) \|
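
To inspect a prediction visually, the class IDs above can be mapped to colors. The following is a minimal sketch, not the repo's own visualization code: `PALETTE`, `colorize`, `overlay`, and `class_fractions` are hypothetical names, and only the red color for `person` is taken from this card; all other colors are arbitrary.

```python
import numpy as np

NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]

# Hypothetical RGB palette, indexed by class ID. Only "person = red" comes
# from this card; every other color is an arbitrary choice.
PALETTE = np.array([
    [34, 139, 34],    # 0 tree
    [160, 120, 60],   # 1 ground
    [255, 0, 0],      # 2 person
    [135, 206, 235],  # 3 sky
    [128, 128, 128],  # 4 road
    [120, 80, 160],   # 5 mountain
    [255, 215, 0],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)


def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an H x W array of class IDs (0..7) to an H x W x 3 RGB image."""
    return PALETTE[pred]


def overlay(rgb: np.ndarray, pred: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the colorized mask over an RGB uint8 image of the same size."""
    blended = (1.0 - alpha) * rgb.astype(np.float32) + alpha * colorize(pred).astype(np.float32)
    return blended.round().astype(np.uint8)


def class_fractions(pred: np.ndarray) -> dict:
    """Fraction of pixels assigned to each class name."""
    counts = np.bincount(pred.ravel(), minlength=len(NAMES))
    return {name: float(c) / pred.size for name, c in zip(NAMES, counts)}
```

The result is RGB; when saving with `cv2.imwrite`, convert it to BGR first with `cv2.cvtColor(..., cv2.COLOR_RGB2BGR)`.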

	## Architecture & Preprocessing

	\| \| \|
	\|---\|---\|
	\| Base model \| `nvidia/segformer-b5-finetuned-ade-640-640` \|
	\| Parameters \| ~85M \|
	\| Decoder head \| Reinitialized for 8 classes \|
	\| Input format \| RGB, normalized with ImageNet mean/std \|
	\| `mean` \| `[0.485, 0.456, 0.406]` \|
	\| `std` \| `[0.229, 0.224, 0.225]` \|
	\| Input resolution \| Any H×W where both are multiples of 32 \|
	\| Trained at \| 1024×576 (native 16:9) \|

	## Performance

	### v1 (Segformer85Mv1.pt) — original training only

	Validated on a temporally-disjoint hold-out from the same recording (frames 4501+, no leakage):

	\| Metric \| Value \|
	\|---\|---\|
	\| Tree IoU \| 0.742 \|
	\| mIoU (7 real classes) \| 0.714 \|
	\| Pixel accuracy \| 0.834 \|

	### v2 (Segformer85Mv2.pt) — v1 + Orchard Navigation fine-tune ⭐

	On the same v1 hold-out, v2 shows no meaningful regression on the original domain:

	\| Metric \| v1 \| v2 \|
	\|---\|---\|---\|
	\| Tree IoU (orig orchard, no leak) \| 0.742 \| 0.742 ✅ \|
	\| mIoU (orig orchard) \| 0.714 \| 0.712 \|

	NEW orchard hold-out (different camera, autumn season — Aug+Sep capture):

	\| Metric \| v1 \| v2 \|
	\|---\|---\|---\|
	\| Tree recall on new orchard \| ~0.55 (estimated) \| 0.999 🚀 \|

	Qualitatively, v1 sometimes misclassifies autumn foliage as `person` (rendered in red), while v2 segments it cleanly as `tree`. See `samples_v6_vs_v7/` for side-by-side examples.

	### v1 per-class metrics (8-class model, no leak)

	\| Class \| IoU \| Precision \| Recall \|
	\|---\|---\|---\|---\|
	\| tree \| 0.742 \| 0.79 \| 0.93 \|
	\| ground \| 0.851 \| 0.91 \| 0.93 \|
	\| person \| 0.719 \| 0.82 \| 0.85 \|
	\| sky \| 0.769 \| 0.83 \| 0.91 \|
	\| road \| 0.804 \| 0.86 \| 0.92 \|
	\| mountain \| 0.437 \| 0.62 \| 0.66 \|
	\| building \| 0.711 \| 0.84 \| 0.83 \|
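
For reference, metrics of this kind are commonly derived from a pixel-level confusion matrix. The sketch below shows one such computation; it assumes a single confusion matrix accumulated over the whole validation set and a mean IoU that skips `background` (ID 7). The card does not state how its numbers were aggregated, so this is not necessarily the exact evaluation code, and all function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 8
BACKGROUND_ID = 7
IGNORE_INDEX = 255


def confusion_matrix(gt, pred, num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX):
    """Rows are ground-truth classes, columns are predicted classes."""
    gt, pred = np.asarray(gt).ravel(), np.asarray(pred).ravel()
    keep = gt != ignore_index  # drop unlabeled pixels
    idx = gt[keep].astype(np.int64) * num_classes + pred[keep].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_metrics(cm):
    """Return (iou, precision, recall) arrays; classes with no pixels yield NaN."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp  # predicted as class c, but labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled as class c, but predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
    return iou, precision, recall


def summary(cm):
    """Tree IoU, mean IoU over the non-background classes, and pixel accuracy."""
    iou, _, _ = per_class_metrics(cm)
    real = [c for c in range(cm.shape[0]) if c != BACKGROUND_ID]
    return {
        "tree_iou": float(iou[0]),
        "miou_real": float(np.nanmean(iou[real])),
        "pixel_acc": float(np.diag(cm).sum() / cm.sum()),
    }
```

For multiple images, sum the per-image confusion matrices before calling `summary`, so that large and small images are weighted by pixel count.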

	## Training Data

	### v1 base
	- ~5300 frames from a single oak_0415_oneRadar_1 recording (spring, single camera)
	- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged + class-aligned (`vines`→`tree`, `moutain`→`mountain` typo fixed)
	- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
	- Temporal split: frames `<=4500` train (5177 samples), frames `>4500` validation (155 samples) — no neighbor leakage
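
The temporal split matters because adjacent video frames are near-duplicates, so a random split would put almost identical images in both sets. A minimal sketch of a frame-index split follows; the file naming scheme (`frame_004501.jpg`) and the helper names are assumptions, since the card does not show the real file names.

```python
import re

SPLIT_FRAME = 4500  # frames <= 4500 train, frames > 4500 validation (from this card)


def frame_index(filename: str) -> int:
    """Extract the last run of digits in the file stem, e.g. 'frame_004501.jpg' -> 4501.

    The naming scheme is an assumption; adapt the regex to the real file names.
    """
    stem = filename.rsplit(".", 1)[0]
    matches = re.findall(r"\d+", stem)
    if not matches:
        raise ValueError(f"no frame index in {filename!r}")
    return int(matches[-1])


def temporal_split(filenames, split_frame=SPLIT_FRAME):
    """Split by frame index so train and validation never interleave in time."""
    train, val = [], []
    for name in sorted(filenames, key=frame_index):
        (train if frame_index(name) <= split_frame else val).append(name)
    return train, val
```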

	### v2 fine-tune (NEW)
	- +311 images from the "Orchard Navigation" dataset:
	  - 178 frames from a Sep-16 recording (autumn season)
	  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
	  - Tree-only polygon annotations
	- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
	- Non-tree pixels in new images set to `ignore_index=255` so the model only adapts its tree decisions, leaving other classes untouched
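
The two ideas in this list, masking unlabeled pixels with `ignore_index=255` and replaying v1 images, can be sketched as follows. This is an illustration under assumptions rather than the contents of `finetune_v7.py`: the function names, the boolean-mask input, and the random seed are all hypothetical.

```python
import random

import numpy as np

TREE_ID = 0
IGNORE_INDEX = 255


def tree_only_label(tree_mask: np.ndarray) -> np.ndarray:
    """Turn a boolean tree mask into a label map: tree pixels -> 0, everything else -> 255."""
    label = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    label[tree_mask.astype(bool)] = TREE_ID
    return label


def supervised_fraction(label: np.ndarray) -> float:
    """Fraction of pixels that contribute to the loss (i.e. are not ignored)."""
    return float((label != IGNORE_INDEX).mean())


def build_finetune_set(new_items, v1_items, n_replay=500, seed=0):
    """Mix all new images with a random sample of v1 images to limit forgetting."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    v1_items = list(v1_items)
    replay = rng.sample(v1_items, min(n_replay, len(v1_items)))
    mixed = list(new_items) + replay
    rng.shuffle(mixed)
    return mixed
```

With this labeling, a new image contributes gradient only where a tree polygon was drawn, while the replayed v1 images keep supervising all 8 classes.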

	## Training Recipe

	### v1
	\| Hyperparameter \| Value \|
	\|---\|---\|
	\| Optimizer \| AdamW, weight_decay 0.01 \|
	\| LR \| 2e-5, cosine schedule \|
	\| Epochs \| 30 \|
	\| Batch \| 2 × grad_accum 4 (effective 8) \|
	\| Resolution \| 1024×576 \|
	\| Precision \| bfloat16 \|
	\| Loss \| weighted cross-entropy \|
	\| Class weights \| tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 \|
	\| Hardware \| RTX 5090 (32 GB), ~2.3 hours \|
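
To make the effect of the class weights concrete, here is a small NumPy sketch of weighted cross-entropy over flattened pixels. It assumes the loss is normalized by the sum of the weights of the supervised pixels, which mirrors the default mean reduction of `torch.nn.CrossEntropyLoss` when `weight` and `ignore_index` are set. Whether the training script uses exactly that loss is an assumption, and `weighted_cross_entropy` is a hypothetical name.

```python
import numpy as np

NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
CLASS_WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])  # from the table above
IGNORE_INDEX = 255


def weighted_cross_entropy(logits, labels, weights=CLASS_WEIGHTS, ignore_index=IGNORE_INDEX):
    """logits: (N, C) float array, labels: (N,) int array. Returns the weighted mean loss."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    keep = labels != ignore_index
    if not keep.any():
        return 0.0  # no supervised pixels; this sketch returns 0 by convention
    z = logits[keep]
    y = labels[keep].astype(np.int64)
    # Numerically stable log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y]
    w = weights[y]
    # Each pixel's loss is scaled by its class weight, then normalized by total weight.
    return float((w * nll).sum() / w.sum())
```

With these weights, a misclassified `tree` pixel counts 15 times as much as a misclassified `background` pixel (1.5 vs. 0.1).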

	### v2 fine-tune (delta from v1)
	\| Hyperparameter \| Value \|
	\|---\|---\|
	\| LR \| 5e-6 (lower than v1's 2e-5, for a safe fine-tune) \|
	\| Epochs \| 8 (best at epoch 3) \|
	\| `ignore_index` \| 255 (for unlabeled pixels in new data) \|
	\| Everything else \| Same as v1 \|
	\| Hardware \| RTX 5090, ~13 minutes \|
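
As a rough illustration of the schedule, the sketch below computes a cosine-decayed learning rate and the number of optimizer updates implied by batch 2 × grad_accum 4. It assumes no warmup, a final learning rate of 0, and that the whole v2 mix (311 new + 500 replayed images) is used for training. None of these details are stated in the card, and the function names are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def optimizer_steps(num_samples: int, epochs: int, batch_size: int = 2, grad_accum: int = 4) -> int:
    """Number of optimizer updates, assuming a final partial accumulation window also steps."""
    micro_batches_per_epoch = math.ceil(num_samples / batch_size)
    return epochs * math.ceil(micro_batches_per_epoch / grad_accum)


v1_steps = optimizer_steps(num_samples=5177, epochs=30)      # v1: 5177 training samples
v2_steps = optimizer_steps(num_samples=311 + 500, epochs=8)  # v2: new + replayed images

# Learning rate halfway through each run.
v1_mid_lr = cosine_lr(v1_steps // 2, v1_steps, base_lr=2e-5)
v2_mid_lr = cosine_lr(v2_steps // 2, v2_steps, base_lr=5e-6)
```

Under these assumptions the v2 fine-tune performs far fewer updates than v1, at a lower peak learning rate, which is consistent with the much shorter training time reported above.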

	## Limitations

	This model was trained on a single Korean apple orchard (spring 2024) with a single robot platform, plus a small fine-tune on a second autumn capture. Expect degradation on:

	- ⚠️ Different orchards (different tree species, layouts, training systems)
	- ⚠️ Different cameras (different FOV, color profiles, sensors)
	- 💀 Different seasons not in training (winter dormant trees)
	- 💀 Different lighting (rain, dawn/dusk, night)
	- 💀 Aerial / drone perspectives

	For deployment in a new context, plan to fine-tune on 100-300 in-domain images.

	## Files in This Repo

	\| File \| Purpose \|
	\|---\|---\|
	\| `Segformer85Mv1.pt` \| Original v1 weights (339 MB) \|
	\| `Segformer85Mv2.pt` \| v1 + Orchard Navigation fine-tune (339 MB) ⭐ \|
	\| `predict.py` \| Standalone inference script (defaults to v2) \|
	\| `README.md` \| This file \|
	\| `samples/*.jpg` \| v1 prediction examples (in-domain) \|
	\| `samples_v6_vs_v7/*.jpg` \| v1 vs v2 side-by-side on new orchard (showcases v2 improvement) \|
	\| `train_v6_5090.py` \| v1 training script \|
	\| `finetune_v7.py` \| v2 fine-tune script \|
	\| `history_v6.json` \| v1 per-epoch training history \|
	\| `history_v7.json` \| v2 fine-tune history \|
	\| `v6_OOD_full_res.mp4` \| 1-minute OOD inference video at native resolution \|

	## License

	Apache 2.0