---
license: apache-2.0
language: [en]
tags: [semantic-segmentation, twinlitenet, agriculture, orchard, real-time, edge-deployment, jetson]
pipeline_tag: image-segmentation
---

# TwinLiteNet8 — Real-time orchard segmentation for edge devices

A **0.44 M-parameter** semantic-segmentation model adapted from [TwinLiteNet](https://github.com/chequanghuy/TwinLiteNet) for **7-class apple orchard scenes**, designed to run **>30 FPS on Jetson-class hardware** for robotic navigation.

Drop-in lightweight alternative to [WEN0256/Segformer85Mv1](https://huggingface.co/WEN0256/Segformer85Mv1) for low-compute deployments.

## Why "7-class" but 8 logit channels?

The model is trained to recognize **7 real classes** (`tree`, `ground`, `person`, `sky`, `road`, `mountain`, `building`). The 8th label `background` is **NOT** treated as a real class — pixels that fall outside any labeled object are simply masked out of the loss (`ignore_index=255`). The 8th logit channel exists only to keep the architecture identical to the original TwinLiteNet shape; it is never trained and is forced to `-inf` before `argmax` at inference, so the model never outputs `background`.

This matches what you usually want from a robot's perception stack: "tell me what you DO recognize", not "tell me you don't know".
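
The inference-time masking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `predict.py`: the helper name `masked_argmax` and the toy tensor are assumptions made for the example.

```python
import numpy as np

UNUSED_CHANNEL = 7  # the untrained "background" logit channel


def masked_argmax(logits: np.ndarray) -> np.ndarray:
    """Return per-pixel class ids in 0..6 from (B, 8, H, W) logits.

    Hypothetical helper for illustration only.
    """
    masked = logits.astype(np.float32, copy=True)  # leave the caller's array untouched
    masked[:, UNUSED_CHANNEL, :, :] = -np.inf      # channel 7 can never win the argmax
    return masked.argmax(axis=1).astype(np.uint8)


# Toy input: channel 7 has the largest raw logit at every pixel.
logits = np.zeros((1, 8, 2, 3), dtype=np.float32)
logits[:, 7] = 10.0
logits[:, 0] = 1.0  # "tree" is the strongest of the real classes
mask = masked_argmax(logits)
```

Even though the raw logit for channel 7 is the largest, every pixel in `mask` resolves to one of the seven real classes.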

## Performance (temporal-split validation, no data leakage, same protocol for both models)

| Metric | TwinLiteNet8 | Segformer-b5 (85 M) | Δ vs Segformer |
|---|---|---|---|
| Tree IoU | **0.872** | 0.742 | **+13 pp** ⭐ |
| Ground IoU | **0.916** | 0.851 | **+6.5 pp** |
| Person IoU | 0.441 | 0.72 | -28 pp |
| Sky IoU | 0.835 | 0.77 | +6.5 pp |
| Road IoU | 0.745 | 0.80 | -5.5 pp |
| Mountain IoU | 0.592 | 0.44 | +15 pp |
| Building IoU | 0.555 | 0.71 | -16 pp |
| **mIoU (7 classes)** | **0.708** | 0.714 | -0.6 pp |
| Model size | **1.8 MB** | 339 MB | **188× smaller** |
| Params | **0.437 M** | 85 M | **194× fewer** |

(Segformer numbers come from `WEN0256/Segformer85Mv1`. Both models tested on the same 155-frame temporal-split val from the original orchard recording, with the same "background pixels excluded" protocol so the IoUs are directly comparable.)
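
For readers who want to reproduce the "background pixels excluded" protocol, the following is a minimal sketch of per-class IoU that skips ignored pixels. It assumes background is stored as label `255` in the ground truth (as in the training setup); the function name `per_class_iou` and the toy arrays are illustrative and not taken from the evaluation script.

```python
import numpy as np

IGNORE = 255  # assumed label for background pixels excluded from scoring


def per_class_iou(pred, target, num_classes=7):
    """IoU per class, computed only over pixels whose target is not IGNORE."""
    valid = target != IGNORE
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        t = (target == c) & valid
        union = (p | t).sum()
        # A class absent from both prediction and target has no defined IoU.
        ious.append((p & t).sum() / union if union else float("nan"))
    return ious


# Toy example: the last two target pixels are background and are not scored.
target = np.array([[0, 0, 1], [1, 255, 255]])
pred = np.array([[0, 1, 1], [1, 0, 2]])
ious = per_class_iou(pred, target)
miou = float(np.nanmean(ious))  # mean over classes that actually occur
```

Whatever the model predicts on the two ignored pixels has no effect on the score, which is what makes the two columns of the table comparable.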

**Headline:** TwinLiteNet8 *roughly matches* Segformer-b5 in overall mIoU (0.708 vs 0.714, a 0.6 pp gap) and *beats it* on the two classes that matter most for orchard navigation (`tree`, `ground`), while being ~190× smaller and an estimated ~10× faster on Jetson Orin Nano. The trade-off is on `person` and `building`, where the small model's limited capacity shows.

### FPS (640×360 input, batch 1)

| Device | TwinLiteNet8 | Segformer-b5 | Speedup |
|---|---|---|---|
| RTX 3080 (PyTorch fp32) | **137 FPS** | ~50 | 2.7× |
| RTX 5090 (PyTorch fp32) | ~500 FPS | ~150 | 3.3× |
| **Jetson Orin Nano (TRT FP16, est)** | **~34–46 FPS** ⭐ | ~2–5 | **~10×** |
| Jetson Orin NX (TRT FP16, est) | ~60–80 FPS | ~20 | ~3× |

The target was **10–20 FPS** on Orin Nano; the estimated 34–46 FPS is roughly double that.
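
This card does not state how the FPS figures were measured, so the snippet below is only a generic timing sketch that can be used to check throughput on your own device. The function name `measure_fps` and its defaults are assumptions for illustration.

```python
import time


def measure_fps(run_once, warmup=10, iters=100):
    """Average calls per second of a zero-argument callable."""
    for _ in range(warmup):  # warm-up runs absorb one-time initialization costs
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


# Stand-in workload; swap in a real inference call to benchmark the model.
fps = measure_fps(lambda: sum(range(1000)), warmup=2, iters=20)
print(f"{fps:.1f} calls/s")
```

When timing PyTorch on a GPU, call `torch.cuda.synchronize()` inside `run_once`, since CUDA kernels launch asynchronously and would otherwise make the loop look faster than it is.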

## Files

| File | Purpose |
|---|---|
| `twinlite8_best.pt` | PyTorch checkpoint (1.8 MB), epoch 29, best tree IoU 0.872 |
| `twinlite8.onnx` | ONNX export (1.8 MB), 100% argmax parity verified |
| `predict.py` | PyTorch inference (matches Segformer's API) |
| `predict_onnx.py` | ONNX-Runtime inference (CPU/CUDA/TensorRT auto-pick) |
| `export_onnx.py` | Re-export ONNX from any checkpoint |
| `train_8class.py` | Full training script (60 epochs, ~70 min on RTX 3080) |
| `model/` | TwinLiteNet8 architecture (single-branch 8-output head, channel 7 = unused) |
| `JETSON_DEPLOY.md` | Step-by-step Jetson deployment + FPS table |
| `samples_20/` | 20 OOD inference samples (original ‖ prediction overlay) |
| `demo_twinlite_12s.mp4` | 12-s demo video (360 frames @ 30 FPS, original ‖ overlay) |
| `samples/` | 6 in-domain validation samples |
| `training_log.txt` + `history.json` | Per-epoch metrics |

## Quick Use (PyTorch)

```python
import sys, cv2, torch
sys.path.insert(0, "<this_dir>")      # directory that contains predict.py
from predict import load_model, predict, overlay

model = load_model("twinlite8_best.pt", device="cuda")
img = cv2.imread("orchard.jpg")       # BGR image as loaded by OpenCV
mask = predict(model, img)            # H×W uint8, values 0..6 (never 7)
viz = overlay(img, mask)              # prediction overlay for visualization
cv2.imwrite("out.jpg", viz)
```

## Quick Use (ONNX, no PyTorch)

```python
import onnxruntime as ort, cv2, numpy as np

# Fall back to CPU when the CUDA provider is unavailable
sess = ort.InferenceSession(
    "twinlite8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img = cv2.imread("orchard.jpg")          # BGR, H×W×3 uint8
inp = cv2.resize(img, (640, 360))        # cv2.resize takes (width, height)
rgb = cv2.cvtColor(inp, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
x = np.ascontiguousarray(rgb.transpose(2, 0, 1)[None])  # (1, 3, 360, 640)
logits = sess.run(None, {"input": x})[0]                # (1, 8, 360, 640)
logits[:, 7, :, :] = -1e9                # mask the unused background channel
mask = logits.argmax(1)[0].astype(np.uint8)  # 360×640 uint8, values 0..6
```

## Classes (id → name)

| ID | Class | Color (BGR) |
|---|---|---|
| 0 | **tree** (priority) | green |
| 1 | ground | brown |
| 2 | person | red |
| 3 | sky | cyan |
| 4 | road | gray |
| 5 | mountain | purple |
| 6 | building | yellow |
| 7 | (unused — never output) | — |
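
As a small illustration of how the id-to-name mapping above can be used downstream, the sketch below summarizes a predicted mask as per-class pixel fractions. The helper `class_fractions` is hypothetical and not part of this repo.

```python
import numpy as np

# Class names in id order, copied from the table above (id 7 is never output).
CLASS_NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building"]


def class_fractions(mask):
    """Fraction of pixels per class name for an H×W mask with ids 0..6."""
    counts = np.bincount(mask.ravel(), minlength=len(CLASS_NAMES))
    return {name: counts[i] / mask.size for i, name in enumerate(CLASS_NAMES)}


# Toy 2×3 mask: three tree pixels, two ground pixels, one sky pixel.
mask = np.array([[0, 0, 1], [1, 3, 0]], dtype=np.uint8)
fractions = class_fractions(mask)
```

A navigation stack could, for example, monitor `fractions["person"]` as a coarse signal, keeping in mind that `person` is one of the weaker classes for this model.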

## Architecture

Single-branch 8-output adaptation of [TwinLiteNet](https://github.com/chequanghuy/TwinLiteNet):

- **Encoder**: ESPNet (`ESPNet_Encoder`, p = 2, q = 3)
- **Decoder**: 3 × `UPx2` upsampling blocks
- **Head**: 8-channel classification head (7 real classes; channel 7 untrained, masked at inference)
- **Input**: 640×360 BGR → ImageNet-style normalize
- **Output**: (B, 8, H, W) logits

The original TwinLiteNet has two parallel decoder heads for two binary tasks (drivable area + lane lines). For multi-class semantic segmentation matching the Segformer setup, we kept one decoder branch and changed its final `UPx2` to output 8 channels. Final param count: **0.437 M**.

## Training Recipe

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 1e-4 |
| LR | 5e-4, cosine schedule |
| Epochs | 60 |
| Batch | 16 |
| Resolution | 640×360 |
| Loss | weighted cross-entropy with `ignore_index=255` |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, **background 0.0** |
| Background handling | mask pixels remapped 7 → 255 so they never contribute to loss |
| Augmentation | hflip + HSV jitter |
| Hardware | RTX 3080, ~70 minutes total |
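
Two rows of the recipe above lend themselves to a short sketch: the 7 → 255 label remap and the cosine learning-rate schedule. This is an approximation under stated assumptions, not the contents of `train_8class.py`: the helper names are hypothetical, and the schedule is assumed to anneal to zero over 60 epochs with no warm-up.

```python
import math

import numpy as np

BACKGROUND_ID = 7
IGNORE_INDEX = 255

# Per-class loss weights in id order 0..7, copied from the table above.
CLASS_WEIGHTS = [1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.0]


def remap_background(label):
    """Return a copy of the label map with background (7) set to 255."""
    out = label.copy()
    out[out == BACKGROUND_ID] = IGNORE_INDEX
    return out


def cosine_lr(epoch, base_lr=5e-4, total_epochs=60):
    """Cosine-annealed learning rate (assumed: no warm-up, decays to 0)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


label = np.array([[0, 7, 1], [7, 2, 7]], dtype=np.uint8)
remapped = remap_background(label)
```

In PyTorch, the weights and ignore label would typically be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(CLASS_WEIGHTS), ignore_index=255)`, so remapped pixels contribute nothing to the loss or its gradient.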

## Dataset

Same dataset as [WEN0256/Segformer85Mv1](https://huggingface.co/WEN0256/Segformer85Mv1) v2:
- ~5300 frames from `oak_0415_oneRadar_1` (spring 2024 Korean apple orchard, single OAK-D camera)
- 311 frames from "Orchard Navigation" (Sep autumn capture + Aug Windows-webcam capture)
- Pseudo-mask labels generated by Segformer v1 to fill SAM-annotated gaps
- Temporal split: frames `≤ 4500` → train, frames `> 4500` → val (155 frames). No neighbor leakage.
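
The temporal split in the last bullet can be sketched as a simple threshold on the frame index. The function name `temporal_split` and the toy frame ids are illustrative assumptions; only the 4500 cutoff comes from this card.

```python
def temporal_split(frame_ids, cutoff=4500):
    """Split frame indices by time: ids <= cutoff train, ids > cutoff validate."""
    train = [f for f in frame_ids if f <= cutoff]
    val = [f for f in frame_ids if f > cutoff]
    return train, val


# Toy example with a handful of frame ids around the cutoff.
train_ids, val_ids = temporal_split(range(4498, 4504))
print(train_ids, val_ids)
```

Adjacent video frames are nearly identical, so a random split would put near-duplicates of training frames into validation. Splitting on the index keeps every validation frame later in time than every training frame.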

## Limitations (same as parent Segformer model)

- Trained on a single Korean apple orchard, spring + partial autumn
- ❌ Different orchards (different tree species/layouts) — likely degraded
- ❌ Winter (no leaves), night, rain — no training data
- ❌ Aerial/drone perspectives — robot-eye view only
- For a new deployment, plan to fine-tune on 100–300 in-domain frames (~13 min on a single GPU)

## Deployment to Jetson

See `JETSON_DEPLOY.md` for the full pipeline:
1. Export to ONNX (this repo already has `twinlite8.onnx`)
2. On Jetson: `trtexec --onnx=twinlite8.onnx --saveEngine=...engine --fp16`
3. Run via `predict_onnx.py --provider TensorrtExecutionProvider` or load the `.engine` via TRT API
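
For step 3, one way to choose an ONNX Runtime execution provider with graceful fallback is sketched below. The helper `pick_providers` and its preference order are assumptions for illustration; the auto-pick logic in `predict_onnx.py` may differ.

```python
# Preference order: fastest backend first, CPU as the last resort.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(available):
    """Return the preferred providers that are present, in priority order."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]


# Example: a machine with CUDA but without TensorRT.
providers = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
```

With ONNX Runtime installed, the list of available providers comes from `onnxruntime.get_available_providers()`, and the result can be passed as the `providers` argument of `onnxruntime.InferenceSession`.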

## License

Apache 2.0