---
license: mit
library_name: pytorch
tags:
- autonomous-driving
- end-to-end
- imitation-learning
- self-driving
- udacity
- vision
- transformer
- vit
- attention
datasets:
- maxim-igenbergs/thesis-data
---
# ViT End-to-End Driving Model

A Vision Transformer (ViT) adapted for end-to-end autonomous driving, trained on the Udacity self-driving car simulator for the bachelor's thesis *Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models*.
## Model Description
This model applies the Vision Transformer architecture to the end-to-end driving task. Instead of using convolutional layers, ViT splits the input image into patches and processes them using self-attention mechanisms, allowing the model to capture global dependencies in the visual input.
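The patch-splitting step can be sketched in a few lines of PyTorch. A `Conv2d` whose kernel and stride both equal the patch size is the standard way ViT implementations embed patches; the 768-dimensional embedding width below is an illustrative assumption (ViT-Base), not a value read from these checkpoints.

```python
import torch
import torch.nn as nn

# Patch embedding as in ViT: a Conv2d with kernel = stride = patch size
# turns a 224x224 RGB image into a sequence of (224/16)^2 = 196 patch tokens.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)         # one RGB frame
tokens = patch_embed(image)                 # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768)
print(tokens.shape)                         # torch.Size([1, 196, 768])
```

Each of the 196 tokens then attends to every other token in the encoder, which is what gives the model its global receptive field from the first layer.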
## Architecture

```
Input: RGB image (224 × 224 × 3)
        ↓
Patch embedding (16 × 16 patches)
        ↓
[CLS] token + positional embedding
        ↓
Transformer encoder blocks (×L):
  - Multi-head self-attention
  - Layer normalization
  - MLP (feed-forward)
  - Layer normalization
        ↓
[CLS] token output
        ↓
MLP head
        ↓
Output: [steering, throttle]
```
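The pipeline above can be condensed into a minimal PyTorch sketch. The class name `ViTDriver` and all hyperparameter defaults (embedding width, depth, number of heads) are illustrative assumptions, not the thesis implementation; only the input size, patch size, and two-value regression output come from this card.

```python
import torch
import torch.nn as nn

class ViTDriver(nn.Module):
    """Minimal ViT-style driving model: patch tokens + [CLS] -> encoder -> 2-dim head."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)  # [steering, throttle]

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # prepend [CLS]
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])  # regress controls from the [CLS] token

model = ViTDriver(depth=2)  # shallow config just for a quick smoke test
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 2])
```

Reading the prediction off the `[CLS]` token mirrors how ViT handles classification, swapping the class logits for a two-value control regression.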
## Checkpoints

| Map | Checkpoint |
|---|---|
| GenRoads | `genroads_20251202-152358/` |
| Jungle | `jungle_20251201-132938/` |
### Files per Checkpoint

- `best_model.ckpt`: PyTorch model checkpoint
- `meta.json`: Training configuration and hyperparameters
- `history.csv`: Training/validation metrics per epoch
- `loss_curve.png`: Visualization of training progress
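As a sketch of how `history.csv` might be consumed, e.g. to find the best epoch: the column names `epoch`, `train_loss`, and `val_loss` below are hypothetical placeholders, and the inline sample stands in for the real file, whose actual layout may differ.

```python
import csv
import io

# Stand-in for a checkpoint's history.csv (column names are assumptions).
sample = """epoch,train_loss,val_loss
1,0.120,0.098
2,0.081,0.074
3,0.065,0.079
"""

rows = list(csv.DictReader(io.StringIO(sample)))
best = min(rows, key=lambda r: float(r["val_loss"]))
print(best["epoch"])  # 2
```

For the real file, replace `io.StringIO(sample)` with `open("history.csv")`; `best_model.ckpt` itself is typically loaded with `torch.load`.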
## Citation

```bibtex
@thesis{igenbergs2026dualaxis,
  title={Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models},
  author={Igenbergs, Maxim},
  school={Technical University of Munich},
  year={2026},
  type={Bachelor's Thesis}
}
```