# ViT End-to-End Driving Model
A Vision Transformer (ViT) adapted for end-to-end autonomous driving, trained on data from the Udacity self-driving car simulator for the bachelor's thesis *Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models*.
## Model Description
This model applies the Vision Transformer architecture to the end-to-end driving task. Instead of using convolutional layers, ViT splits the input image into patches and processes them using self-attention mechanisms, allowing the model to capture global dependencies in the visual input.
## Architecture
```
Input: RGB Image (224 × 224 × 3)
        ↓
Patch Embedding (16 × 16 patches)
        ↓
[CLS] Token + Positional Embedding
        ↓
Transformer Encoder Blocks (×L):
  - Multi-Head Self-Attention
  - Layer Normalization
  - MLP (Feed-Forward)
  - Layer Normalization
        ↓
[CLS] Token Output
        ↓
MLP Head
        ↓
Output: [steering, throttle]
```
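The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the thesis configuration: the embedding dimension, depth, and head count below are placeholder values, and the exact normalization ordering inside the encoder blocks may differ from the trained model.

```python
import torch
import torch.nn as nn

class ViTDriver(nn.Module):
    """Illustrative ViT regressor: image in, [steering, throttle] out."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided conv is equivalent to flattening and projecting 16x16 patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # MLP head reads only the [CLS] token and emits two control values.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2))

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed       # prepend [CLS]
        x = self.encoder(x)
        return self.head(x[:, 0])                             # [steering, throttle]

out = ViTDriver()(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 2])
```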
## Checkpoints

| Map | Checkpoint |
|---|---|
| GenRoads | `genroads_20251202-152358/` |
| Jungle | `jungle_20251201-132938/` |
### Files per Checkpoint
- `best_model.ckpt`: PyTorch model checkpoint
- `meta.json`: Training configuration and hyperparameters
- `history.csv`: Training/validation metrics per epoch
- `loss_curve.png`: Visualization of training progress
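A checkpoint directory can be inspected with the standard library alone. The helper below is a hypothetical sketch: the keys inside `meta.json` and the columns of `history.csv` are not documented here, so the function just returns whatever is stored.

```python
import csv
import json
from pathlib import Path

def summarize_run(run_dir):
    """Load training config and per-epoch metrics from a checkpoint directory.

    Assumes the layout described above: meta.json holds hyperparameters,
    history.csv holds one row of metrics per epoch.
    """
    run = Path(run_dir)
    meta = json.loads((run / "meta.json").read_text())
    with (run / "history.csv").open(newline="") as f:
        history = list(csv.DictReader(f))
    return meta, history
```

Loading `best_model.ckpt` itself requires `torch.load` (or the training framework that produced it), since the tensor contents are not plain text.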
## Citation

```bibtex
@thesis{igenbergs2026dualaxis,
  title={Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models},
  author={Igenbergs, Maxim},
  school={Technical University of Munich},
  year={2026},
  type={Bachelor's Thesis}
}
```