TwinLiteNet8: Real-time orchard segmentation for edge devices
A 0.44 M-parameter semantic-segmentation model adapted from TwinLiteNet for 7-class apple orchard scenes, designed to run >30 FPS on Jetson-class hardware for robotic navigation.
Drop-in lightweight alternative to WEN0256/Segformer85Mv1 for low-compute deployments.
Why "7-class" but 8 logit channels?
The model is trained to recognize 7 real classes (tree, ground, person, sky, road, mountain, building). The 8th label, background, is NOT treated as a real class: pixels that fall outside any labeled object are simply masked out of the loss (ignore_index=255). The 8th logit channel exists only to keep the head shape identical to the original TwinLiteNet; it is never trained and is forced to -inf before argmax at inference, so the model never outputs background.
This matches what you usually want from a robot's perception stack: "tell me what you DO recognize", not "tell me you don't know".
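
The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `predict.py`; the helper name `masked_argmax` and the random test logits are assumptions made for the example.

```python
import numpy as np

UNUSED_CHANNEL = 7  # background channel: present in the head, never trained


def masked_argmax(logits: np.ndarray) -> np.ndarray:
    """Map (B, 8, H, W) logits to (B, H, W) uint8 class ids in 0..6."""
    masked = logits.copy()               # leave the caller's array untouched
    masked[:, UNUSED_CHANNEL] = -np.inf  # channel 7 can never win the argmax
    return masked.argmax(axis=1).astype(np.uint8)


# Synthetic logits in which the untrained channel would otherwise dominate.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 8, 4, 6)).astype(np.float32)
logits[:, UNUSED_CHANNEL] += 100.0

mask = masked_argmax(logits)
print(mask.shape, int(mask.max()))
```

Because the mask is applied to the logits rather than to the output, every pixel still receives one of the 7 real class ids.
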
Performance (no data leakage, temporal split val, fair apples-to-apples)
| Metric | TwinLiteNet8 | Segformer-b5 (85 M) | Δ vs Segformer |
|---|---|---|---|
| Tree IoU | 0.872 | 0.742 | +13 pp |
| Ground IoU | 0.916 | 0.851 | +6.5 pp |
| Person IoU | 0.441 | 0.72 | -28 pp |
| Sky IoU | 0.835 | 0.77 | +6 pp |
| Road IoU | 0.745 | 0.80 | -5 pp |
| Mountain IoU | 0.592 | 0.44 | +15 pp |
| Building IoU | 0.555 | 0.71 | -16 pp |
| mIoU (7 classes) | 0.708 | 0.714 | -0.6 pp |
| Model size | 1.8 MB | 339 MB | 188× smaller |
| Params | 0.437 M | 85 M | 194× fewer |
(Segformer numbers come from WEN0256/Segformer85Mv1. Both models tested on the same 155-frame temporal-split val from the original orchard recording, with the same "background pixels excluded" protocol so the IoUs are directly comparable.)
Headline: TwinLiteNet8 matches Segformer-b5 in overall mIoU (0.708 vs 0.714, a 0.6 pp gap on a 155-frame val set) and beats it on the two classes that matter most for orchard navigation (tree, ground), while being ~190× smaller and an estimated ~3–10× faster on Jetson-class devices. The trade-off is on rare classes (person, building), where the small model's limited capacity shows.
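
To make the "background pixels excluded" protocol concrete, here is a minimal sketch of per-class IoU that drops every pixel labeled 255 before counting. The function name `per_class_iou` and the toy arrays are assumptions for illustration; the actual evaluation script may aggregate differently (for example over a dataset-wide confusion matrix rather than per call).

```python
import numpy as np

IGNORE_INDEX = 255  # label used for background / unlabeled pixels
NUM_CLASSES = 7     # real classes only


def per_class_iou(pred, target, num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX):
    """Return a list of IoUs, NaN for classes absent from both arrays."""
    valid = target != ignore_index           # drop ignored pixels first
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious


# Toy 2x4 example: the last column is background and must not count.
target = np.array([[0, 0, 1, 255], [1, 1, 2, 255]], dtype=np.uint8)
pred = np.array([[0, 1, 1, 3], [1, 1, 2, 0]], dtype=np.uint8)

ious = per_class_iou(pred, target)
miou = float(np.nanmean(ious))  # mean over the classes that actually occur
print(ious[:3], miou)
```

Predictions on ignored pixels (the `3` and `0` in the last column) have no effect on any class's score, which is what makes the two models' numbers comparable.
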
FPS (640×360 input, batch 1)
| Device | TwinLiteNet8 | Segformer-b5 | Speedup |
|---|---|---|---|
| RTX 3080 (PyTorch fp32) | 137 FPS | ~50 | 2.7× |
| RTX 5090 (PyTorch fp32) | ~500 FPS | ~150 | 3.3× |
| Jetson Orin Nano (TRT FP16, est) | ~34–46 FPS | ~2–5 | ~10× |
| Jetson Orin NX (TRT FP16, est) | ~60–80 FPS | ~20 | ~3× |
Target was 10–20 FPS on Orin Nano; the estimated 34–46 FPS is roughly double the top of that range.
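
For readers who want to reproduce numbers like these on their own hardware, a generic timing loop is sketched below. It is not the benchmark used for the table: `measure_fps` and `dummy_infer` are hypothetical names, and the dummy workload stands in for a real forward pass. When timing a CUDA model in PyTorch, the callable should end with `torch.cuda.synchronize()` so that asynchronous kernels are included in the measurement.

```python
import time


def measure_fps(infer, n_warmup=10, n_iters=100):
    """Call `infer()` repeatedly and return iterations per second."""
    for _ in range(n_warmup):          # warm-up: caches, lazy init, autotuning
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


def dummy_infer():
    # Hypothetical stand-in for one batch-1 forward pass at 640x360.
    sum(i * i for i in range(1000))


fps = measure_fps(dummy_infer, n_warmup=2, n_iters=20)
print(f"{fps:.1f} FPS")
```
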
Files
| File | Purpose |
|---|---|
| `twinlite8_best.pt` | PyTorch checkpoint (1.8 MB), epoch 29, best tree IoU 0.872 |
| `twinlite8.onnx` | ONNX export (1.8 MB), 100% argmax parity verified |
| `predict.py` | PyTorch inference (matches Segformer's API) |
| `predict_onnx.py` | ONNX-Runtime inference (CPU/CUDA/TensorRT auto-pick) |
| `export_onnx.py` | Re-export ONNX from any checkpoint |
| `train_8class.py` | Full training script (60 epochs, ~70 min on RTX 3080) |
| `model/` | TwinLiteNet8 architecture (single-branch 8-output head, channel 7 = unused) |
| `JETSON_DEPLOY.md` | Step-by-step Jetson deployment + FPS table |
| `samples_20/` | 20 OOD inference samples (original → prediction overlay) |
| `demo_twinlite_12s.mp4` | 12-s demo video (360 frames @ 30 FPS, original → overlay) |
| `samples/` | 6 in-domain validation samples |
| `training_log.txt` + `history.json` | Per-epoch metrics |
Quick Use (PyTorch)
import sys, cv2, torch
sys.path.insert(0, "<this_dir>")
from predict import load_model, predict, overlay
model = load_model("twinlite8_best.pt", device="cuda")
img = cv2.imread("orchard.jpg")
mask = predict(model, img) # H×W uint8, values 0..6 (never 7)
viz = overlay(img, mask)
cv2.imwrite("out.jpg", viz)
Quick Use (ONNX, no PyTorch)
import onnxruntime as ort, cv2, numpy as np
sess = ort.InferenceSession("twinlite8.onnx", providers=["CUDAExecutionProvider"])
img = cv2.imread("orchard.jpg")
inp = cv2.resize(img, (640, 360))
rgb = cv2.cvtColor(inp, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
x = rgb.transpose(2, 0, 1)[None]
logits = sess.run(None, {"input": x})[0]
logits[:, 7, :, :] = -1e9 # mask the unused background channel
mask = logits.argmax(1)[0].astype(np.uint8) # 360×640 uint8, values 0..6
Classes (id → name)
| ID | Class | Color (BGR) |
|---|---|---|
| 0 | tree (priority) | green |
| 1 | ground | brown |
| 2 | person | red |
| 3 | sky | cyan |
| 4 | road | gray |
| 5 | mountain | purple |
| 6 | building | yellow |
| 7 | (unused, never output) | n/a |
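
The class table maps directly onto a lookup-table colorizer. The sketch below is an illustration only: the exact BGR triples are assumptions chosen to match the color names above (the values used by `overlay` in `predict.py` may differ), and `colorize` and `blend` are hypothetical helper names.

```python
import numpy as np

# One BGR row per class id 0..6; exact values are assumed, not from the repo.
PALETTE_BGR = np.array([
    [0, 200, 0],      # 0 tree: green
    [42, 82, 139],    # 1 ground: brown
    [0, 0, 255],      # 2 person: red
    [255, 255, 0],    # 3 sky: cyan
    [128, 128, 128],  # 4 road: gray
    [128, 0, 128],    # 5 mountain: purple
    [0, 255, 255],    # 6 building: yellow
], dtype=np.uint8)


def colorize(mask: np.ndarray) -> np.ndarray:
    """(H, W) class ids -> (H, W, 3) BGR image via palette lookup."""
    return PALETTE_BGR[mask]


def blend(img_bgr: np.ndarray, mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend the colorized mask over the original BGR image."""
    color = colorize(mask).astype(np.float32)
    out = (1.0 - alpha) * img_bgr.astype(np.float32) + alpha * color
    return out.round().astype(np.uint8)


mask = np.array([[0, 3], [6, 2]], dtype=np.uint8)
img = np.zeros((2, 2, 3), dtype=np.uint8)  # black placeholder image
viz = blend(img, mask, alpha=0.5)
print(viz[0, 0])
```

Since id 7 is never produced, the palette needs only seven rows; indexing with a mask that contained 7 would raise an `IndexError`, which doubles as a cheap sanity check.
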
Architecture
Single-branch 8-output adaptation of TwinLiteNet:
- Encoder: ESPNet (`ESPNet_Encoder`, p = 2, q = 3)
- Decoder: 3 × `UPx2` upsampling blocks
- Head: 8-channel softmax (7 real classes; channel 7 untrained, masked at inference)
- Input: 640×360 BGR → ImageNet-style normalize
- Output: (B, 8, H, W) logits
The original TwinLiteNet has two parallel decoder heads for two binary tasks (drivable area + lane lines). For multi-class semantic segmentation matching the Segformer setup, we kept one decoder branch and changed its final `UPx2` block to output 8 channels. Final param count: 0.437 M.
Training Recipe
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 1e-4 |
| LR | 5e-4, cosine schedule |
| Epochs | 60 |
| Batch | 16 |
| Resolution | 640×360 |
| Loss | weighted cross-entropy with ignore_index=255 |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.0 |
| Background handling | mask pixels remapped 7 → 255 so they never contribute to loss |
| Augmentation | hflip + HSV jitter |
| Hardware | RTX 3080, ~70 minutes total |
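
The loss and background rows of the recipe can be illustrated with a small NumPy re-implementation. This is a sketch for understanding, not the code in `train_8class.py`: the function names are hypothetical, logits are assumed to be flattened to shape (N, 8), and the normalization (dividing by the summed weights of the non-ignored targets) follows the behavior of PyTorch's weighted `CrossEntropyLoss` with mean reduction.

```python
import numpy as np

IGNORE_INDEX = 255
# tree, ground, person, sky, road, mountain, building, background
CLASS_WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.0])


def remap_background(labels: np.ndarray) -> np.ndarray:
    """Replace label 7 (background) with the ignore index."""
    out = labels.copy()
    out[out == 7] = IGNORE_INDEX
    return out


def weighted_ce(logits, labels, weights=CLASS_WEIGHTS, ignore_index=IGNORE_INDEX):
    """Weighted cross-entropy over (N, C) logits and (N,) labels."""
    valid = labels != ignore_index                 # ignored pixels drop out
    logits = logits[valid]
    labels = labels[valid].astype(np.int64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = weights[labels]
    return float((w * nll).sum() / w.sum())        # assumes >= 1 valid pixel


labels = remap_background(np.array([0, 1, 7, 7], dtype=np.uint8))
logits = np.zeros((4, 8))                          # uniform predictions
loss = weighted_ce(logits, labels)
print(labels.tolist(), loss)
```

With uniform logits every valid pixel contributes log 8, so the weighted mean is log 8 regardless of the class weights; the two background pixels contribute nothing.
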
Dataset
Same dataset as WEN0256/Segformer85Mv1 v2:
- ~5300 frames from `oak_0415_oneRadar_1` (spring 2024 Korean apple orchard, single OAK-D camera)
- 311 frames from "Orchard Navigation" (Sep autumn capture + Aug Windows-webcam capture)
- Pseudo-mask labels generated by Segformer v1 to fill SAM-annotated gaps
- Temporal split: frames `≤ 4500` → train, frames `> 4500` → val (155 frames). No neighbor leakage.
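
The temporal split is simple enough to state as code. The sketch below assumes frames are identified by an integer index; `temporal_split` and the sample indices are hypothetical, and only the 4500 threshold comes from the description above.

```python
SPLIT_FRAME = 4500  # frames up to and including this index go to train


def temporal_split(frame_indices, split_frame=SPLIT_FRAME):
    """Split by recording order so neighboring frames never straddle sets."""
    train = [i for i in frame_indices if i <= split_frame]
    val = [i for i in frame_indices if i > split_frame]
    return train, val


train_ids, val_ids = temporal_split([10, 4499, 4500, 4501, 5200])
print(train_ids, val_ids)
```

Unlike a random split, this keeps near-duplicate consecutive frames on the same side of the boundary, which is what "no neighbor leakage" refers to.
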
Limitations (same as parent Segformer model)
- Trained on a single Korean apple orchard, spring + partial autumn
- Different orchards (different tree species/layouts): likely degraded
- Winter (no leaves), night, rain: no training data
- Aerial/drone perspectives: not covered (robot-eye view only)
- For a new deployment, plan to fine-tune on 100–300 in-domain frames (~13 min on a single GPU)
Deployment to Jetson
See JETSON_DEPLOY.md for the full pipeline:
- Export to ONNX (this repo already has `twinlite8.onnx`)
- On Jetson: `trtexec --onnx=twinlite8.onnx --saveEngine=...engine --fp16`
- Run via `predict_onnx.py --provider TensorrtExecutionProvider` or load the `.engine` via TRT API
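
When running through ONNX Runtime instead of a raw TensorRT engine, the execution provider can be chosen by preference order with a fallback. The helper below is a hedged sketch of that idea, not necessarily how `predict_onnx.py` picks its provider; `pick_providers` is a hypothetical name, while the three provider strings are the standard ONNX Runtime identifiers.

```python
PREFERRED = [
    "TensorrtExecutionProvider",  # fastest on Jetson when TensorRT is present
    "CUDAExecutionProvider",
    "CPUExecutionProvider",       # always-available fallback
]


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"no usable execution provider in {available!r}")
    return chosen


# On a real device the list would come from onnxruntime.get_available_providers().
providers = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
print(providers)
```

The resulting list can be passed as `providers=` to `onnxruntime.InferenceSession`, which tries the entries in order.
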
License
Apache 2.0