---
license: apache-2.0
language:
- en
tags:
- semantic-segmentation
- segformer
- agriculture
- orchard
- apple
- outdoor
library_name: transformers
pipeline_tag: image-segmentation
base_model: nvidia/segformer-b5-finetuned-ade-640-640
---
# Segformer85M: Apple Orchard Semantic Segmentation

Segformer-B5 (85M parameters) fine-tuned for **8-class semantic segmentation** of outdoor apple orchard scenes captured from a robotic platform.

This repo contains **two checkpoints**:

| File | When to use |
|------|-------------|
| **`Segformer85Mv1.pt`** | Original v1, trained only on the spring `oak_0415` dataset. Best baseline. |
| **`Segformer85Mv2.pt`** | v1 further fine-tuned on a second dataset (different camera, autumn season). **Use this for general deployment**: accuracy on the original orchard is essentially unchanged, while tree recall on the new-orchard hold-out rises from ~0.55 (estimated) to 0.999. |

## Quick Use

```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F

# 1. Download weights: pick v1 OR v2 (v2 is the recommended default)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")

# 2. Init architecture from the base model, then load the fine-tuned weights
NAMES = ["tree","ground","person","sky","road","mountain","building","background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i:n for i,n in enumerate(NAMES)},
    label2id={n:i for i,n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,  # the ADE classifier head does not match 8 classes
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

# 3. Inference
img = cv2.imread("your_image.jpg")  # BGR, uint8
H, W = img.shape[:2]
H32, W32 = (H//32)*32, (W//32)*32  # round down to multiples of 32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])
x = torch.from_numpy(((rgb - mean) / std).transpose(2,0,1)).unsqueeze(0).float().cuda()
with torch.no_grad():
    logits = model(pixel_values=x).logits
    # Upsample the low-resolution logits back to the original image size
    logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```

A ready-to-use `predict.py` is included in this repo.

## Classes (id → name)

| ID | Class | Notes |
|----|-------|-------|
| 0 | **tree** | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (rarely predicted by the model) |
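
To make the ID mapping concrete, here is a minimal sketch of how a predicted mask could be turned into a color image, a binary tree mask, and per-class pixel fractions. The `PALETTE` colors and the helper names `colorize` and `class_fractions` are illustrative assumptions, not part of this repo, and the toy `pred` array stands in for the model output.

```python
import numpy as np

NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]

# Hypothetical display palette (RGB), one row per class ID; not taken from the repo.
PALETTE = np.array([
    [0, 160, 0],      # 0 tree
    [150, 110, 60],   # 1 ground
    [255, 0, 0],      # 2 person
    [90, 170, 255],   # 3 sky
    [128, 128, 128],  # 4 road
    [120, 80, 160],   # 5 mountain
    [255, 200, 0],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)


def colorize(pred):
    """Map an H x W array of class IDs (0..7) to an H x W x 3 RGB image."""
    return PALETTE[pred]


def class_fractions(pred):
    """Return the fraction of pixels assigned to each class name."""
    counts = np.bincount(pred.ravel(), minlength=len(NAMES))
    return {name: counts[i] / pred.size for i, name in enumerate(NAMES)}


pred = np.array([[0, 0, 3], [1, 1, 4]])  # toy stand-in for the model output
rgb = colorize(pred)                     # H x W x 3 image for display
tree_mask = pred == 0                    # boolean mask of the priority class
fractions = class_fractions(pred)
```

The boolean `tree_mask` is usually what a downstream task needs; the color image is only for inspection.
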
## Architecture & Preprocessing

| Property | Value |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| `mean` | `[0.485, 0.456, 0.406]` |
| `std` | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |
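
As a worked example of the multiple-of-32 constraint, the sketch below rounds an arbitrary frame size down to a valid network input size, mirroring the rounding used in Quick Use. The helper name `network_input_size` is hypothetical.

```python
def network_input_size(height, width, multiple=32):
    """Round each dimension down to the nearest multiple of `multiple`."""
    h = (height // multiple) * multiple
    w = (width // multiple) * multiple
    if h == 0 or w == 0:
        raise ValueError(f"image {height}x{width} is smaller than {multiple} pixels on one side")
    return h, w


full_hd = network_input_size(1080, 1920)   # 1080 is not a multiple of 32
train_res = network_input_size(576, 1024)  # training resolution is already valid
print(full_hd, train_res)
```

A 1920×1080 frame is resized to 1920×1056 before inference, while the 1024×576 training resolution passes through unchanged.
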
## Performance

### v1 (Segformer85Mv1.pt): original training only

Validated on a temporally disjoint hold-out from the same recording (frames 4501+, no leakage):

| Metric | Value |
|---|---|
| **Tree IoU** | **0.742** |
| **mIoU (7 real classes)** | **0.714** |
| **Pixel accuracy** | **0.834** |

### v2 (Segformer85Mv2.pt): v1 + Orchard Navigation fine-tune

Same v1 hold-out, showing no meaningful regression on the original domain:

| Metric | v1 | **v2** |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | **0.742** |
| mIoU (orig orchard) | 0.714 | 0.712 |

New orchard hold-out (different camera, autumn season; Aug + Sep capture):

| Metric | v1 | **v2** |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | **0.999** |

**Qualitative results**: v1 sometimes misclassifies autumn foliage as `person` (rendered in red); v2 cleanly segments it as `tree`. See `samples_v6_vs_v7/` for side-by-side examples.

### v1 per-class IoU (8-class, no leak)

| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
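
For readers who want to reproduce numbers of this kind, the sketch below shows one common way to derive per-class IoU, precision, recall, and pixel accuracy from a confusion matrix. It is not the evaluation code from this repo; the function names and the use of 255 as an ignore label at evaluation time are assumptions.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_metrics(cm):
    """Return (iou, precision, recall) arrays with one entry per class."""
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp  # predicted as the class, labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled as the class, predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
    return iou, precision, recall


# Toy 2-class example; the pixel labeled 255 is excluded from all metrics.
gt = np.array([[0, 0, 1], [1, 1, 255]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
cm = confusion_matrix(pred, gt, num_classes=2)
iou, precision, recall = per_class_metrics(cm)
pixel_acc = np.diag(cm).sum() / cm.sum()
miou = iou.mean()
```

For a single class, IoU, precision P, and recall R satisfy IoU = 1 / (1/P + 1/R - 1). The `tree` row (P = 0.79, R = 0.93) gives about 0.746, which agrees with the reported 0.742 within the rounding of P and R.
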
## Training Data

### v1 base

- ~5300 frames from a single `oak_0415_oneRadar_1` recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (`vines`→`tree`; the `moutain` label typo corrected to `mountain`)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<=4500` for training (5177 samples), frames `>4500` for validation (155 samples), so there is **no neighbor leakage** between the two sets
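
The temporal split can be sketched as follows. Splitting on the frame index rather than at random keeps near-identical neighboring frames on the same side of the boundary. The function name and the example frame IDs are hypothetical; only the 4500 threshold comes from the list above.

```python
def temporal_split(frame_ids, last_train_frame=4500):
    """Split frame indices into (train, val) at a fixed point in time."""
    train = [f for f in frame_ids if f <= last_train_frame]
    val = [f for f in frame_ids if f > last_train_frame]
    return train, val


frame_ids = list(range(4490, 4511))  # hypothetical frame indices around the boundary
train, val = temporal_split(frame_ids)
print(len(train), len(val), max(train), min(val))
```
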
### v2 fine-tune (NEW)

- **+311 images** from the "Orchard Navigation" dataset:
  - 178 frames from a Sep-16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
  - Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in the new images are set to `ignore_index=255`, so these images supervise only the tree class and contribute no loss for the other classes
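
A minimal sketch of how a tree-only annotation could be turned into a training label under this scheme: every pixel starts as 255 (ignored by the loss), and only pixels covered by a tree polygon receive class ID 0. The function name and the boolean-mask input are assumptions; the actual conversion lives in the fine-tune script.

```python
import numpy as np

IGNORE_INDEX = 255  # pixels with this label contribute no loss
TREE_ID = 0         # class ID of `tree`


def tree_only_label(tree_mask):
    """tree_mask: H x W boolean array, True where a tree polygon covers the pixel."""
    label = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    label[tree_mask] = TREE_ID
    return label


tree_mask = np.array([[True, True, False],
                      [False, True, False]])  # toy rasterized polygon
label = tree_only_label(tree_mask)
print(label)
```

Because such labels contain no negative examples for `tree`, mixing in the 500 fully labeled v1 images is what keeps the other classes supervised.
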
## Training Recipe

### v1

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
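
To illustrate what the class weights do, here is a NumPy sketch of weighted cross-entropy over flattened pixels. It follows the normalization that PyTorch's `CrossEntropyLoss(weight=..., ignore_index=...)` uses with mean reduction (weighted sum divided by the sum of the weights of the non-ignored targets). It is an illustration, not the training code; the function name is hypothetical, and the `ignore_index` argument only matters for labels that contain 255.

```python
import numpy as np

# Class weights from the table above, indexed by class ID 0..7.
CLASS_WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])
IGNORE_INDEX = 255


def weighted_cross_entropy(logits, target, weights=CLASS_WEIGHTS, ignore_index=IGNORE_INDEX):
    """logits: N x C array, target: N integer class IDs. Returns a scalar loss."""
    valid = target != ignore_index
    logits, target = logits[valid], target[valid]
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]
    w = weights[target]
    return float((w * nll).sum() / w.sum())


logits = np.zeros((3, 8))         # uniform predictions over the 8 classes
target = np.array([0, 1, 255])    # tree, ground, and one ignored pixel
loss = weighted_cross_entropy(logits, target)
```

With these weights, an error on a `tree` or `person` pixel counts three times as much as one on a `ground` pixel, and `background` pixels barely influence the loss.
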
### v2 fine-tune (delta from v1)

| Hyperparameter | Value |
|---|---|
| LR | **5e-6** (4× lower than v1's 2e-5, for a safe fine-tune) |
| Epochs | **8** (best at epoch 3) |
| `ignore_index` | **255** (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |
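
The two recipes differ mainly in peak learning rate. The sketch below compares them under a plain cosine decay to zero with no warmup; both of those details, as well as the step count, are assumptions, since the tables only state "cosine schedule".

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


V1_LR, V2_LR = 2e-5, 5e-6  # peak learning rates from the two tables
total_steps = 1000         # hypothetical number of optimizer steps

for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps, V1_LR), cosine_lr(step, total_steps, V2_LR))
```

At every step the v2 rate is a quarter of the v1 rate, which limits how far the fine-tune can move the weights away from v1.
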
## Limitations

This model was trained on a **single Korean apple orchard** (spring 2024) with a **single robot platform**, plus a small fine-tune on a second autumn capture. Expect degradation on:

- Different orchards (different tree species, layouts, training systems)
- Different cameras (different FOV, color profiles, sensors)
- Different seasons not in training (winter dormant trees)
- Different lighting (rain, dawn/dusk, night)
- Aerial / drone perspectives

For deployment in a new context, plan to fine-tune on 100-300 in-domain images.

## Files in This Repo

| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB) |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | **v1 vs v2 side-by-side** on new orchard (showcases v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |

## License

Apache 2.0