---
license: mit
library_name: pytorch
tags:
  - autonomous-driving
  - end-to-end
  - imitation-learning
  - self-driving
  - udacity
  - vision
  - transformer
  - vit
  - attention
datasets:
  - maxim-igenbergs/thesis-data
---

# ViT End-to-End Driving Model

Vision Transformer (ViT) adapted for end-to-end autonomous driving, trained on data collected in the Udacity self-driving car simulator for the bachelor's thesis *Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models*.

## Model Description

This model applies the Vision Transformer architecture to the end-to-end driving task. Instead of using convolutional layers, ViT splits the input image into patches and processes them using self-attention mechanisms, allowing the model to capture global dependencies in the visual input.
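As a concrete illustration of the patch-splitting step: a 224 × 224 image with 16 × 16 patches yields (224/16)² = 196 patch tokens, which ViT implements as a strided convolution. A minimal sketch (the embedding dimension of 768 is ViT-Base's and is an assumption, not taken from this repository):

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: kernel size = stride = patch size, so each
# 16x16 patch is projected independently to a 768-dim embedding vector.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)            # one RGB frame from the simulator
patches = patch_embed(img)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): sequence of patch tokens
```

The token sequence (plus a prepended `[CLS]` token and positional embeddings) is what the transformer encoder blocks then process.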

## Architecture

```
Input: RGB Image (224 × 224 × 3)
    ↓
Patch Embedding (16 × 16 patches)
    ↓
[CLS] Token + Positional Embedding
    ↓
Transformer Encoder Blocks (×L):
  - Layer Normalization
  - Multi-Head Self-Attention
  - Layer Normalization
  - MLP (Feed-Forward)
    ↓
[CLS] Token Output
    ↓
MLP Head
    ↓
Output: [steering, throttle]
```

## Checkpoints

| Map      | Checkpoint                  |
| -------- | --------------------------- |
| GenRoads | `genroads_20251202-152358/` |
| Jungle   | `jungle_20251201-132938/`   |

### Files per Checkpoint

- `best_model.ckpt`: PyTorch model checkpoint
- `meta.json`: Training configuration and hyperparameters
- `history.csv`: Training/validation metrics per epoch
- `loss_curve.png`: Visualization of training progress

## Citation

```bibtex
@thesis{igenbergs2026dualaxis,
  title={Dual-Axis Testing of Visual Robustness and Topological Generalization in Vision-based End-to-End Driving Models},
  author={Igenbergs, Maxim},
  school={Technical University of Munich},
  year={2026},
  type={Bachelor's Thesis}
}
```

## Related