Road Scene Semantic Segmentation

A semantic segmentation model for autonomous driving scenes, built with EfficientNetB5 + UNet decoder + ASPP module. The model segments road scenes into 5 classes: background, road surface, road markings, road signs, and cars.

Model Architecture

The model combines a pretrained EfficientNetB5 encoder with a custom UNet-style decoder and an ASPP (Atrous Spatial Pyramid Pooling) module at the bottleneck. This design captures both high-level semantics and fine-grained spatial details — particularly useful for thin structures like lane markings.

Input (256×256×3)
       │
EfficientNetB5 Encoder (pretrained on ImageNet)
  ├── skip connections at multiple resolutions
  └── bottleneck feature maps
       │
  ASPP Module
  ├── Conv 1×1
  ├── Dilated Conv rate=6
  ├── Dilated Conv rate=12
  └── Dilated Conv rate=18
       │
UNet Decoder
  ├── Upsample + Skip connection (×4)
  └── Conv layers
       │
Output (256×256×5) — softmax
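
As a rough illustration of the ASPP stage in the diagram above, the four parallel branches can be sketched in Keras as follows. The filter count, the ReLU activations, the final 1×1 fusion convolution, and the 512-channel stand-in input are assumptions made for this sketch; the README does not state how the author's module is configured.

```python
import tensorflow as tf
from tensorflow.keras import layers


def aspp_block(x, filters=256, rates=(6, 12, 18)):
    """ASPP sketch: one 1x1 branch plus one 3x3 dilated branch per rate."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=rate,
                          activation="relu")(x)
        )
    # Concatenate the branches along the channel axis, then fuse them back
    # to `filters` channels with a 1x1 convolution (an assumed design choice).
    merged = layers.Concatenate(axis=-1)(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(merged)


# Stand-in bottleneck: a 256x256 input downsampled 32x gives an 8x8 map.
# The channel count (512) is a placeholder, not the real encoder width.
inputs = tf.keras.Input(shape=(8, 8, 512))
aspp = tf.keras.Model(inputs, aspp_block(inputs))
features = aspp(tf.zeros((1, 8, 8, 512)))
print(features.shape)
```

Each dilated branch sees a wider context at the same spatial resolution, which is why the block keeps the 8×8 grid and only changes the channel count.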

Classes

ID	Class	Description
0	Background	Sky, buildings, trees, sidewalks, and everything else
1	Road surface	Drivable road area and road shoulders
2	Marking	Lane markings (driving and non-driving)
3	Road sign	Traffic signs and signal symbols
4	Car	Cars, SUVs, pickup trucks

Performance

Evaluated on a held-out validation set combining custom data and CamVid:

Class	IoU
Background	0.963
Road surface	0.921
Marking	0.399
Road sign	0.052
Car	0.839
Mean IoU	~0.635
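
For reference, per-class IoU on integer label masks can be computed along the following lines. This is a minimal NumPy sketch; the function name is hypothetical, and the README does not say whether the reported numbers were accumulated over the whole validation set or averaged per image.

```python
import numpy as np


def per_class_iou(y_true, y_pred, num_classes=5):
    """Return one IoU value per class ID (NaN where a class never appears)."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        true_c = y_true == c
        pred_c = y_pred == c
        union = np.logical_or(true_c, pred_c).sum()
        if union > 0:
            ious[c] = np.logical_and(true_c, pred_c).sum() / union
    return ious


# The mean of the five per-class values reported in the table above.
reported = np.array([0.963, 0.921, 0.399, 0.052, 0.839])
print(round(float(reported.mean()), 4))  # 0.6348
```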

Training Data

The model was trained on a combination of:

Custom dataset — ~200 annotated road images with polygon annotations in XML format, covering 5 semantic classes
CamVid — 367 images with pixel-level annotations from the Cambridge-driving Labeled Video Database, remapped to match the 5-class label scheme
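
Remapping a richer label set onto the 5-class scheme can be done with a lookup table. The sketch below assumes CamVid masks have already been converted to integer class IDs; the source IDs in the mapping are hypothetical placeholders, since the README does not give the actual mapping.

```python
import numpy as np

# Hypothetical source IDs mapped to the 5-class scheme. The IDs are
# placeholders; only the CamVid class names in the comments are real.
SOURCE_TO_TARGET = {
    3: 1,   # e.g. "Road"         -> 1 (Road surface)
    7: 2,   # e.g. "LaneMkgsDriv" -> 2 (Marking)
    9: 3,   # e.g. "SignSymbol"   -> 3 (Road sign)
    12: 4,  # e.g. "Car"          -> 4 (Car)
}


def remap_labels(mask, mapping, num_source_classes=32, default=0):
    """Map source class IDs to target IDs; unmapped IDs become `default`."""
    lut = np.full(num_source_classes, default, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[mask]  # fancy indexing applies the table to every pixel


camvid_mask = np.array([[3, 7], [0, 12]])
print(remap_labels(camvid_mask, SOURCE_TO_TARGET))
```

Everything not listed in the mapping falls into class 0 (Background), which matches the "everything else" description in the class table.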

Data augmentation applied during training: horizontal flip, random crop + resize, brightness/contrast/saturation/hue jitter.
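
One detail that matters for segmentation is that geometric augmentations must be applied identically to the image and its mask, while photometric ones touch only the image. The following is a minimal NumPy sketch of that idea, covering only the flip and a brightness shift; the function name and the jitter range are assumptions, not the author's pipeline.

```python
import numpy as np


def augment_pair(image, mask, rng, max_brightness_delta=0.2):
    """image: float HxWx3 in [0, 1]; mask: HxW integer class IDs."""
    # Geometric transform: flip image and mask together so labels stay aligned.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    # Photometric transform: shift brightness on the image only.
    delta = rng.uniform(-max_brightness_delta, max_brightness_delta)
    image = np.clip(image + delta, 0.0, 1.0)
    return image, mask


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3)).astype(np.float32)
mask = rng.integers(0, 5, size=(256, 256))
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```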

Training Details

Setting	Value
Framework	TensorFlow / Keras
Input size	256 × 256
Backbone	EfficientNetB5 (ImageNet pretrained)
Loss	Weighted Focal + IoU Loss
Optimizer	Adam
Stage 1 LR	1e-3 (decoder only, backbone frozen)
Stage 2 LR	1e-4 (full model fine-tune)
Precision	mixed_float16
Batch size	8
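
The "Weighted Focal + IoU Loss" entry is not spelled out in this README. One common formulation, shown here in NumPy purely as a reference for the arithmetic, adds a class-weighted focal term to a soft IoU term. The class weights, the gamma value, and the equal weighting of the two terms are all assumptions.

```python
import numpy as np


def weighted_focal_iou_loss(y_true, probs, class_weights, gamma=2.0, eps=1e-7):
    """y_true: (N,) integer class IDs; probs: (N, C) softmax outputs."""
    num_classes = probs.shape[1]
    onehot = np.eye(num_classes)[y_true]                      # (N, C)
    # Focal term: down-weights pixels the model already classifies well.
    p_t = np.clip((probs * onehot).sum(axis=1), eps, 1.0)     # prob of true class
    focal = -class_weights[y_true] * (1.0 - p_t) ** gamma * np.log(p_t)
    # Soft IoU term: per-class intersection over union on probabilities.
    inter = (probs * onehot).sum(axis=0)
    union = probs.sum(axis=0) + onehot.sum(axis=0) - inter
    soft_iou = (inter + eps) / (union + eps)
    return focal.mean() + (1.0 - soft_iou.mean())


labels = np.array([0, 1, 2, 4])
weights = np.array([0.5, 1.0, 4.0, 8.0, 2.0])  # hypothetical class weights
uniform = np.full((4, 5), 0.2)                 # an uninformative prediction
print(weighted_focal_iou_loss(labels, uniform, weights))
```

Giving rare classes such as markings and road signs larger weights is the usual motivation for the weighted focal term.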

Training was done in two stages following standard transfer learning practice: the decoder is trained first while the backbone is frozen, then the full model is fine-tuned with a lower learning rate to avoid destroying pretrained weights.
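
The two-stage schedule can be sketched in Keras as below. A tiny convolutional stack stands in for the EfficientNetB5 encoder so the example runs quickly, and plain sparse cross-entropy stands in for the combined loss; both substitutions, along with the helper name, are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in for the encoder; the real model uses EfficientNetB5 here.
backbone = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(8, 3, padding="same", activation="relu"),
    ],
    name="backbone",
)
inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = layers.Conv2D(5, 1, activation="softmax", name="head")(backbone(inputs))
model = tf.keras.Model(inputs, outputs)


def compile_stage(model, backbone, train_backbone, learning_rate):
    """Freeze or unfreeze the backbone, then recompile so the change applies."""
    backbone.trainable = train_backbone
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",  # stand-in for the real loss
    )


# Stage 1: backbone frozen, only the head (decoder) is trained at 1e-3.
compile_stage(model, backbone, train_backbone=False, learning_rate=1e-3)
stage1_trainable = len(model.trainable_weights)
# model.fit(train_ds, validation_data=val_ds, epochs=...)

# Stage 2: everything is trainable, fine-tuned at the lower rate 1e-4.
compile_stage(model, backbone, train_backbone=True, learning_rate=1e-4)
stage2_trainable = len(model.trainable_weights)
# model.fit(train_ds, validation_data=val_ds, epochs=...)

print(stage1_trainable, stage2_trainable)
```

Recompiling after changing `trainable` is what makes the new setting take effect for the next `fit` call.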

Usage

import tensorflow as tf
import numpy as np
import cv2
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Load the model for inference only. With compile=False the training loss and
# metrics are not restored, so the custom objects used during training
# (combined_loss, SparseMeanIoU) do not need to be defined here.
model = tf.keras.models.load_model(
    'model_eff_unet_v1.keras',
    compile=False
)

# Prepare image
image = cv2.imread('your_image.jpg')
if image is None:
    raise FileNotFoundError('Could not read your_image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
image = cv2.resize(image, (256, 256))
# Scale to [0, 1]; this must match the preprocessing used during training
image = image.astype(np.float32) / 255.0
input_tensor = np.expand_dims(image, axis=0)  # add batch dimension

# Predict
pred = model.predict(input_tensor)
pred_mask = tf.argmax(pred[0], axis=-1).numpy()  # (256, 256) class IDs

# Visualize
class_cmap = ListedColormap([
    'black',    # Background
    '#804080',  # Road surface
    'white',    # Marking
    'red',      # Road sign
    'navy',     # Car
])

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.title('Input Image')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(pred_mask, cmap=class_cmap, vmin=0, vmax=4)
plt.title('Segmentation Prediction')
plt.axis('off')
plt.show()
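
The predicted mask is 256×256, so overlaying it on the original photo requires resizing it back. Class IDs should be resized with nearest-neighbour sampling, because interpolating between IDs would invent classes that were never predicted. A minimal NumPy sketch follows; the function name is hypothetical.

```python
import numpy as np


def resize_mask_nearest(mask, out_height, out_width):
    """Nearest-neighbour resize of an HxW integer mask; no new IDs are created."""
    rows = np.arange(out_height) * mask.shape[0] // out_height
    cols = np.arange(out_width) * mask.shape[1] // out_width
    return mask[rows[:, None], cols[None, :]]


small = np.array([[0, 1],
                  [2, 3]])
print(resize_mask_nearest(small, 4, 4))
```

With OpenCV already imported, `cv2.resize(mask, (width, height), interpolation=cv2.INTER_NEAREST)` on a `uint8` mask gives a comparable result.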

Limitations

Trained primarily on daytime, clear-weather road scenes. Performance may degrade on night scenes, rain, fog, or unusual camera angles.
Road marking detection (IoU ~0.40) is much weaker than road surface detection (IoU ~0.92) due to class imbalance and the small pixel area of lane markings. Road sign detection is weaker still (IoU ~0.05), so sign predictions should not be relied on.
Input resolution is fixed at 256×256. Very small objects (distant signs, thin markings) may be missed.
Not suitable for safety-critical applications without further validation on a larger and more diverse dataset.

Repository Structure

├── main.ipynb          # Training notebook
├── implement.ipynb     # Inference notebook
├── model_eff_unet_v1.h5
└── model_eff_unet_v1.keras

Citation

If you use CamVid data in your work, please cite the original dataset:

@inproceedings{BrostowSFC:ECCV08,
  author    = {Gabriel J. Brostow and Jamie Shotton and Julien Fauqueur and Roberto Cipolla},
  title     = {Segmentation and Recognition Using Structure from Motion Point Clouds},
  booktitle = {ECCV (1)},
  year      = {2008},
  pages     = {44-57}
}


Dataset

Roads segmentation dataset (Kaggle): https://www.kaggle.com/datasets/trainingdatapro/roads-segmentation-dataset
CamVid (Kaggle): https://www.kaggle.com/datasets/carlolepelaars/camvid

License

MIT License
