---
license: apache-2.0
language:
- en
tags:
- semantic-segmentation
- segformer
- agriculture
- orchard
- apple
- outdoor
library_name: transformers
pipeline_tag: image-segmentation
base_model: nvidia/segformer-b5-finetuned-ade-640-640
---

# Segformer85M: Apple Orchard Semantic Segmentation

SegFormer-B5 (85M parameters) fine-tuned for **8-class semantic segmentation** of outdoor apple orchard scenes captured from a robotic platform.

This repo contains **two checkpoints**:

| File | When to use |
|------|-------------|
| **`Segformer85Mv1.pt`** | Original v1, trained only on the spring oak_0415 dataset. Best baseline. |
| **`Segformer85Mv2.pt`** ⭐ | v1 fine-tuned on a second dataset (different camera, autumn season). **Use this for general deployment**: tree IoU on the original orchard is unchanged (0.742), and tree recall on the new orchard rises from ~0.55 (estimated) to 0.999. |

## Quick Use

```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F

# 1. Download weights (pick v1 or v2; v2 is the recommended default)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1",
                            filename="Segformer85Mv2.pt")

# 2. Init architecture from the base model, then load the fine-tuned weights
NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i: n for i, n in enumerate(NAMES)},
    label2id={n: i for i, n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,  # the 150-class ADE head is replaced by an 8-class head
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

# 3. Inference
img = cv2.imread("your_image.jpg")
if img is None:  # cv2.imread returns None instead of raising on a bad path
    raise FileNotFoundError("could not read your_image.jpg")
H, W = img.shape[:2]
# Round each side down to a multiple of 32 (at least 32)
H32, W32 = max(32, (H // 32) * 32), max(32, (W // 32) * 32)
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
# ImageNet normalization
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
x = torch.from_numpy(((rgb - mean) / std).transpose(2, 0, 1)).unsqueeze(0).float().cuda()
with torch.no_grad():
    logits = model(pixel_values=x).logits
    # Upsample the logits back to the original image size
    logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```

A ready-to-use `predict.py` is included in this repo.

## Classes (id → name)

| ID | Class | Notes |
|----|-------|-------|
| 0 | **tree** | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (rarely predicted by the model) |

## Architecture & Preprocessing

| Property | Value |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| `mean` | `[0.485, 0.456, 0.406]` |
| `std` | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |

## Performance

### v1 (Segformer85Mv1.pt): original training only

Validated on a temporally disjoint hold-out from the same recording (frames 4501+, no leakage):

| Metric | Value |
|---|---|
| **Tree IoU** | **0.742** |
| **mIoU (7 real classes)** | **0.714** |
| **Pixel accuracy** | **0.834** |

### v2 (Segformer85Mv2.pt): v1 + Orchard Navigation fine-tune ⭐

Same v1 hold-out, showing no meaningful regression on the original domain:

| Metric | v1 | **v2** |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | **0.742** ✅ |
| mIoU (orig orchard) | 0.714 | 0.712 |

New orchard hold-out (different camera, autumn season, Aug + Sep capture):

| Metric | v1 | **v2** |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | **0.999** 🚀 |

**Qualitative comparison**: v1 sometimes misclassifies autumn foliage as `person` (red); v2 cleanly segments it as `tree`. See `samples/` for side-by-side examples.

### v1 per-class IoU (8-class, no leak)

| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |

## Training Data

### v1 base

- ~5300 frames from a single oak_0415_oneRadar_1 recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (`vines`→`tree`; the `moutain` typo corrected to `mountain`)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<=4500` for training (5177 samples), frames `>4500` for validation (155 samples), so **no neighboring frames leak across the split**

### v2 fine-tune (NEW)

- **+311 images** from the "Orchard Navigation" dataset:
  - 178 frames from a Sep-16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
  - Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in the new images are set to `ignore_index=255`, so the new data only supervises the model's tree decisions and contributes no loss for the other classes

## Training Recipe

### v1

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |

### v2 fine-tune (delta from v1)

| Hyperparameter | Value |
|---|---|
| LR | **5e-6** (4× lower than v1, for a safe fine-tune) |
| Epochs | **8** (best at epoch 3) |
| `ignore_index` | **255** (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |

## Limitations

This model was trained on a **single Korean apple orchard** (spring 2024) with a **single robot platform**, plus a small fine-tune on a second autumn capture. Expect degradation on:

- ⚠️ Different orchards (different tree species, layouts, training systems)
- ⚠️ Different cameras (different FOV, color profiles, sensors)
- 💀 Different seasons not in training (winter dormant trees)
- 💀 Different lighting (rain, dawn/dusk, night)
- 💀 Aerial / drone perspectives

For deployment in a new context, plan to fine-tune on 100-300 in-domain images.

## Files in This Repo

| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB) ⭐ |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | **v1 vs v2 side-by-side** on the new orchard (showcases the v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |

## License

Apache 2.0
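
## Example: Tree-Only Labels with `ignore_index`

The v2 fine-tune uses tree-only annotations and marks every other pixel with `ignore_index=255`. The sketch below illustrates that idea in plain NumPy. It is not taken from `finetune_v7.py`: the helper names `tree_only_label` and `masked_cross_entropy` are hypothetical, and the loss is unweighted for brevity, whereas the recipe above uses class-weighted cross-entropy. In PyTorch the same masking is available through the `ignore_index` argument of `torch.nn.functional.cross_entropy`.

```python
import numpy as np

IGNORE_INDEX = 255  # value from the v2 recipe
TREE_ID = 0         # class id of "tree" in this model card


def tree_only_label(tree_mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: tree pixels get TREE_ID, all others IGNORE_INDEX."""
    label = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    label[tree_mask.astype(bool)] = TREE_ID
    return label


def masked_cross_entropy(logits: np.ndarray, label: np.ndarray) -> float:
    """Hypothetical helper: mean cross-entropy over non-ignored pixels.

    logits: (C, H, W) float array; label: (H, W) integer array.
    Pixels labelled IGNORE_INDEX contribute nothing to the loss.
    """
    valid = label != IGNORE_INDEX
    if not valid.any():
        return 0.0
    # Numerically stable log-softmax over the class axis
    z = logits - logits.max(axis=0, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    # Pick the log-probability of the target class at each valid pixel
    ys, xs = np.nonzero(valid)
    picked = log_probs[label[ys, xs], ys, xs]
    return float(-picked.mean())


# Usage: a 2x2 image where only the top-left pixel is annotated as tree
tree_mask = np.array([[True, False], [False, False]])
label = tree_only_label(tree_mask)
loss = masked_cross_entropy(np.zeros((8, 2, 2)), label)
```

With uniform logits over 8 classes, the loss on the single labelled pixel is ln 8 (about 2.079), and changing the logits at the three ignored pixels leaves it unchanged. This is why tree-only images can be mixed with full 8-class v1 images without penalizing the other classes.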