---
license: apache-2.0
language: [en]
tags: [semantic-segmentation, twinlitenet, agriculture, orchard, real-time, edge-deployment, jetson]
pipeline_tag: image-segmentation
---

# TwinLiteNet8: Real-time orchard segmentation for edge devices

A **0.44 M-parameter** semantic-segmentation model adapted from [TwinLiteNet](https://github.com/chequanghuy/TwinLiteNet) for **7-class apple orchard scenes**, designed to run **>30 FPS on Jetson-class hardware** for robotic navigation. Drop-in lightweight alternative to [WEN0256/Segformer85Mv1](https://huggingface.co/WEN0256/Segformer85Mv1) for low-compute deployments.

## Why "7-class" but 8 logit channels?

The model is trained to recognize **7 real classes** (`tree`, `ground`, `person`, `sky`, `road`, `mountain`, `building`). The 8th label `background` is **NOT** treated as a real class: pixels that fall outside any labeled object are simply masked out of the loss (`ignore_index=255`). The 8th logit channel exists only to keep the architecture identical to the original TwinLiteNet shape; it is never trained and is forced to `-inf` before `argmax` at inference, so the model never outputs `background`.

This matches what you usually want from a robot's perception stack: "tell me what you DO recognize", not "tell me you don't know".

## Performance (no data leakage, temporal split val, fair apples-to-apples)

| Metric | TwinLiteNet8 | Segformer-b5 (85 M) | Δ vs Segformer |
|---|---|---|---|
| Tree IoU | **0.872** | 0.742 | **+13 pp** ⭐ |
| Ground IoU | **0.916** | 0.851 | **+6.5 pp** |
| Person IoU | 0.441 | 0.72 | -28 pp |
| Sky IoU | 0.835 | 0.77 | +6 pp |
| Road IoU | 0.745 | 0.80 | -5 pp |
| Mountain IoU | 0.592 | 0.44 | +15 pp |
| Building IoU | 0.555 | 0.71 | -16 pp |
| **mIoU (7 classes)** | **0.708** | 0.714 | -0.6 pp |
| Model size | **1.8 MB** | 339 MB | **188× smaller** |
| Params | **0.437 M** | 85 M | **194× fewer** |

(Segformer numbers come from `WEN0256/Segformer85Mv1`.
Both models were tested on the same 155-frame temporal-split validation set from the original orchard recording, with the same "background pixels excluded" protocol, so the IoUs are directly comparable.)

**Headline:** TwinLiteNet8 *matches* Segformer-b5 in overall mIoU to within 0.6 pp (0.708 vs 0.714) and *beats it* on the two classes that matter most for orchard navigation (`tree`, `ground`), while being ~200× smaller and an estimated ~3–10× faster on Jetson devices. The trade-off is on rare classes (`person`, `building`), where the small model's limited capacity shows.

### FPS (640×360 input, batch 1)

| Device | TwinLiteNet8 | Segformer-b5 | Speedup |
|---|---|---|---|
| RTX 3080 (PyTorch fp32) | **137 FPS** | ~50 | 2.7× |
| RTX 5090 (PyTorch fp32) | ~500 FPS | ~150 | 3.3× |
| **Jetson Orin Nano (TRT FP16, est)** | **~34–46 FPS** ⭐ | ~2–5 | **~10×** |
| Jetson Orin NX (TRT FP16, est) | ~60–80 FPS | ~20 | ~3× |

The target was **10–20 FPS** on Orin Nano; the estimated TwinLiteNet8 throughput is roughly double that.

## Files

| File | Purpose |
|---|---|
| `twinlite8_best.pt` | PyTorch checkpoint (1.8 MB), epoch 29, best tree IoU 0.872 |
| `twinlite8.onnx` | ONNX export (1.8 MB), 100% argmax parity verified |
| `predict.py` | PyTorch inference (matches Segformer's API) |
| `predict_onnx.py` | ONNX-Runtime inference (CPU/CUDA/TensorRT auto-pick) |
| `export_onnx.py` | Re-export ONNX from any checkpoint |
| `train_8class.py` | Full training script (60 epochs, ~70 min on RTX 3080) |
| `model/` | TwinLiteNet8 architecture (single-branch 8-output head, channel 7 = unused) |
| `JETSON_DEPLOY.md` | Step-by-step Jetson deployment + FPS table |
| `samples_20/` | 20 OOD inference samples (original ‖ prediction overlay) |
| `demo_twinlite_12s.mp4` | 12-s demo video (360 frames @ 30 FPS, original ‖ overlay) |
| `samples/` | 6 in-domain validation samples |
| `training_log.txt` + `history.json` | Per-epoch metrics |

## Quick Use (PyTorch)

```python
import sys, cv2, torch

# "" puts the current directory on the import path, so run this from the repo root
sys.path.insert(0, "")
from predict import load_model, predict, overlay

model = load_model("twinlite8_best.pt", device="cuda")
img = cv2.imread("orchard.jpg")
mask = predict(model, img)   # H×W uint8, values 0..6 (never 7)
viz = overlay(img, mask)
cv2.imwrite("out.jpg", viz)
```

## Quick Use (ONNX, no PyTorch)

```python
import onnxruntime as ort, cv2, numpy as np

# Fall back to CPU when the CUDA provider is unavailable
sess = ort.InferenceSession(
    "twinlite8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img = cv2.imread("orchard.jpg")
inp = cv2.resize(img, (640, 360))   # cv2.resize takes (width, height)
rgb = cv2.cvtColor(inp, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
x = np.ascontiguousarray(rgb.transpose(2, 0, 1)[None])   # 1×3×360×640
logits = sess.run(None, {"input": x})[0]
logits[:, 7, :, :] = -1e9           # mask the unused background channel
mask = logits.argmax(1)[0].astype(np.uint8)   # 360×640 uint8, values 0..6
```

## Classes (id → name)

| ID | Class | Overlay color |
|---|---|---|
| 0 | **tree** (priority) | green |
| 1 | ground | brown |
| 2 | person | red |
| 3 | sky | cyan |
| 4 | road | gray |
| 5 | mountain | purple |
| 6 | building | yellow |
| 7 | (unused, never output) | n/a |

## Architecture

Single-branch 8-output adaptation of [TwinLiteNet](https://github.com/chequanghuy/TwinLiteNet):

- **Encoder**: ESPNet (`ESPNet_Encoder`, p = 2, q = 3)
- **Decoder**: 3 × `UPx2` upsampling blocks
- **Head**: 8-channel softmax (7 real classes; channel 7 untrained, masked at inference)
- **Input**: 640×360 BGR → ImageNet-style normalize
- **Output**: (B, 8, H, W) logits

The original TwinLiteNet has two parallel decoder heads for two binary tasks (drivable area + lane lines). For multi-class semantic segmentation matching the Segformer setup, we kept one decoder branch and changed its final `UPx2` to output 8 channels. Final param count: **0.437 M**.
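
The "masked at inference" step of the head can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code in `predict.py`: the helper name `logits_to_mask` and the random test logits are hypothetical, and only the (B, 8, H, W) layout and the unused channel index 7 are taken from this card.

```python
import numpy as np

UNUSED_CHANNEL = 7  # channel 7 is untrained according to this model card


def logits_to_mask(logits: np.ndarray) -> np.ndarray:
    """Convert (B, 8, H, W) logits to (B, H, W) uint8 class IDs in 0..6.

    Hypothetical helper for illustration; not part of this repository.
    """
    if logits.ndim != 4 or logits.shape[1] != 8:
        raise ValueError(f"expected (B, 8, H, W) logits, got {logits.shape}")
    # Work on a copy so the caller's array is left untouched.
    masked = logits.astype(np.float32, copy=True)
    # With -inf in the unused channel it can never win the argmax.
    masked[:, UNUSED_CHANNEL] = -np.inf
    return masked.argmax(axis=1).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=(1, 8, 360, 640)).astype(np.float32)
    # Make the untrained channel dominate on purpose to show the mask works.
    fake_logits[:, UNUSED_CHANNEL] += 100.0
    mask = logits_to_mask(fake_logits)
    print(mask.shape, mask.dtype, int(mask.max()))
```

Masking with `-inf` on a copy gives the same result as the in-place `-1e9` assignment in the ONNX quick-use snippet, while leaving the raw logits available for later inspection.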
## Training Recipe

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 1e-4 |
| LR | 5e-4, cosine schedule |
| Epochs | 60 |
| Batch | 16 |
| Resolution | 640×360 |
| Loss | weighted cross-entropy with `ignore_index=255` |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, **background 0.0** |
| Background handling | mask pixels remapped 7 → 255 so they never contribute to loss |
| Augmentation | hflip + HSV jitter |
| Hardware | RTX 3080, ~70 minutes total |

## Dataset

Same dataset as [WEN0256/Segformer85Mv1](https://huggingface.co/WEN0256/Segformer85Mv1) v2:

- ~5300 frames from `oak_0415_oneRadar_1` (spring 2024 Korean apple orchard, single OAK-D camera)
- 311 frames from "Orchard Navigation" (Sep autumn capture + Aug Windows-webcam capture)
- Pseudo-mask labels generated by Segformer v1 to fill SAM-annotated gaps
- Temporal split: frames `≤ 4500` → train, frames `> 4500` → val (155 frames). No neighbor leakage.

## Limitations (same as parent Segformer model)

- Trained on a single Korean apple orchard, spring + partial autumn
- ❌ Different orchards (different tree species/layouts): likely degraded
- ❌ Winter (no leaves), night, rain: no training data
- ❌ Aerial/drone perspectives: trained on robot-eye views only
- For a new deployment, plan to fine-tune on 100–300 in-domain frames (~13 min on a single GPU)

## Deployment to Jetson

See `JETSON_DEPLOY.md` for the full pipeline:

1. Export to ONNX (this repo already has `twinlite8.onnx`)
2. On Jetson: `trtexec --onnx=twinlite8.onnx --saveEngine=...engine --fp16`
3. Run via `predict_onnx.py --provider TensorrtExecutionProvider` or load the `.engine` via the TensorRT API

## License

Apache 2.0