---
license: mit
language:
- en
tags:
- autonomous-driving
- self-driving-car
- robotics
- imitation-learning
- behavioral-cloning
- pilotnet
- esp8266
- esp32-cam
- pytorch
- classification
library_name: pytorch
pipeline_tag: image-classification
model-index:
- name: ActionNet
  results: []
---

# ActionNet — Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning — a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.

Part of the [OpenBot PC Server Project](https://github.com/loki-smip/openbot-pc-server-projuct-).

---

## Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data contains only a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average those few discrete commands toward zero.

**Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
**Output:** probability distribution over 9 driving actions

### The 9 Actions

| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |

Motor values are shown at speed=70 and scale proportionally with the speed setting.
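
The table translates to a small lookup plus scaling. Below is a minimal sketch of that mapping, assuming the proportional scaling described above; the project's real implementation ships as `action_to_command()` in `model.py` and may differ in detail.

```python
# Sketch: map an action index to (left, right) motor values.
# Table values are defined at speed=70 and scaled linearly to the requested speed.
BASE_SPEED = 70

ACTION_TABLE = [
    (0, 0),      # 0: STOP
    (70, 70),    # 1: FORWARD
    (-70, -70),  # 2: BACKWARD
    (-49, 49),   # 3: TURN LEFT (pivot)
    (49, -49),   # 4: TURN RIGHT (pivot)
    (21, 70),    # 5: FORWARD+LEFT (arc)
    (70, 21),    # 6: FORWARD+RIGHT (arc)
    (-21, -70),  # 7: BACKWARD+LEFT (arc)
    (-70, -21),  # 8: BACKWARD+RIGHT (arc)
]

def action_to_command(action: int, speed: int = 70) -> tuple[int, int]:
    """Scale the table entry for `action` to the requested speed."""
    left, right = ACTION_TABLE[action]
    scale = speed / BASE_SPEED
    return int(round(left * scale)), int(round(right * scale))
```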

---

## Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

```
Layer                                 Output Shape        Parameters
──────────────────────────────────────────────────────────────────
Input                                 (B, 3, 66, 200)     —

Conv2d(3→24, 5×5, stride=2)           (B, 24, 31, 98)     1,824
BatchNorm2d(24)                                           48
ELU                                                       —

Conv2d(24→36, 5×5, stride=2)          (B, 36, 14, 47)     21,636
BatchNorm2d(36)                                           72
ELU                                                       —

Conv2d(36→48, 5×5, stride=2)          (B, 48, 5, 22)      43,248
BatchNorm2d(48)                                           96
ELU                                                       —

Conv2d(48→64, 3×3, stride=1)          (B, 64, 3, 20)      27,712
BatchNorm2d(64)                                           128
ELU                                                       —

Conv2d(64→64, 3×3, stride=1)          (B, 64, 1, 18)      36,928
BatchNorm2d(64)                                           128
ELU                                                       —

Dropout2d(0.15)                                           —

Flatten                               (B, 1152)           —
Dropout(0.35)                                             —
Linear(1152→64)                       (B, 64)             73,792
ELU                                                       —
Dropout(0.35)                                             —
Linear(64→9)                          (B, 9)              585

──────────────────────────────────────────────────────────────────
Total trainable parameters: ~206,000
Model file size: ~1–2 MB (.pth)
```
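
The layer table maps directly onto a compact PyTorch module. Below is a sketch reconstructed from that table; the authoritative definition lives in `model.py` and may differ in small details.

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """PilotNet-style backbone with BatchNorm + ELU and a 9-way classification head."""

    def __init__(self, num_actions: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.BatchNorm2d(24), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.BatchNorm2d(36), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.BatchNorm2d(48), nn.ELU(),
            nn.Conv2d(48, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Conv2d(64, 64, 3), nn.BatchNorm2d(64), nn.ELU(),
            nn.Dropout2d(0.15),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # (B, 64, 1, 18) -> (B, 1152)
            nn.Dropout(0.35),
            nn.Linear(64 * 1 * 18, 64),
            nn.ELU(),
            nn.Dropout(0.35),
            nn.Linear(64, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 66, 200) -> logits of shape (B, 9)
        return self.classifier(self.features(x))
```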

### Design Decisions

- **BatchNorm after every conv layer** — stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** — avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** — drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** — the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** — all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** — prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization
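
As a concrete illustration of the initialization choice, here is a sketch using standard PyTorch APIs; the exact helper in the training code may be named and structured differently.

```python
import torch.nn as nn
from model import ActionNet

def init_weights(module: nn.Module) -> None:
    # He/Kaiming initialization in fan-out mode for conv and linear layers,
    # with biases zeroed. PyTorch has no dedicated ELU gain, so the common
    # practice of reusing the ReLU gain is assumed here.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = ActionNet()
model.apply(init_weights)
```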

---

## Preprocessing

The full pipeline from raw camera frame to model input:

```
Raw 800×600 BGR frame from ESP32-CAM
        │
        ▼
Crop top 40% of the image
(removes ceiling, sky, and upper walls)
        │
        ▼
Convert BGR → RGB
        │
        ▼
Resize to 200×66 pixels (width × height)
(using INTER_AREA interpolation)
        │
        ▼
ToTensor → normalize to [0, 1] float32
        │
        ▼
Final shape: [batch, 3, 66, 200]
```

The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.

Why crop the top 40%? Because the camera is mounted on a low car pointing forward. The top portion of every frame shows ceiling, walls, or sky — none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
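
A sketch of that transformation, assuming OpenCV for the image operations (the authoritative version is `crop_and_resize()` in `trainer.py`):

```python
import cv2
import numpy as np
import torch

CROP_TOP = 0.40  # fraction of the frame height removed from the top

def crop_and_resize(frame_bgr: np.ndarray) -> torch.Tensor:
    """Raw BGR camera frame -> (3, 66, 200) float32 tensor scaled to [0, 1]."""
    h = frame_bgr.shape[0]
    cropped = frame_bgr[int(h * CROP_TOP):, :, :]                       # drop ceiling/sky
    rgb = cv2.cvtColor(cropped, cv2.COLOR_BGR2RGB)                      # BGR -> RGB
    resized = cv2.resize(rgb, (200, 66), interpolation=cv2.INTER_AREA)  # (width, height)
    return torch.from_numpy(resized).permute(2, 0, 1).float() / 255.0   # HWC -> CHW
```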

---

## Training Configuration

| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 for L2 regularization |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |
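
A sketch of how these pieces fit together, using the values from the table. `train_dataset` and `train_labels` are placeholders for the recorded driving data, and the real loop in `trainer.py` (including early stopping) may be organized differently.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from model import ActionNet

num_epochs = 100                                   # placeholder; early stopping usually ends sooner
labels = np.asarray(train_labels)                  # one action index (0-8) per recorded frame

# Inverse-frequency sampling so rare actions are drawn as often as common ones
class_counts = np.bincount(labels, minlength=9)
sample_weights = 1.0 / np.maximum(class_counts[labels], 1)
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

model = ActionNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-3)
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)

# OneCycleLR: 10% warmup to the peak LR, then cosine anneal back down
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=num_epochs * len(train_loader),
    pct_start=0.1, anneal_strategy="cos", div_factor=10,
)

for epoch in range(num_epochs):
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
```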

### Data Augmentation

Applied on-the-fly during training:

| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3×3 or 5×5 |
| Random translation | 40% | Shift ±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
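
A sketch of that flip-with-label-mirroring step; the mirror table below is inferred from the action list above, while the project's actual table lives in the training code.

```python
import random
import cv2
import numpy as np

# Action index mapping when a frame is mirrored horizontally:
# LEFT and RIGHT variants swap, everything else stays the same.
MIRROR_ACTION = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}

def maybe_flip(image: np.ndarray, action: int, p: float = 0.5) -> tuple[np.ndarray, int]:
    """Horizontally flip the frame with probability p and mirror the label to match."""
    if random.random() < p:
        return cv2.flip(image, 1), MIRROR_ACTION[action]
    return image, action
```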

---

## Inference

At runtime, the autopilot module:

1. Reads the latest camera frame from the MJPEG stream
2. Runs `crop_and_resize()` → converts to tensor
3. Forward pass through ActionNet → gets 9 logits
4. Applies softmax → picks the action with highest probability
5. Uses a 3-frame majority vote to smooth out flickering predictions
6. Maps the smoothed action to (left, right) motor commands at the configured speed
7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
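
The prediction and majority-vote smoothing in steps 4 and 5 can be sketched as follows, assuming the model loading shown in the quickstart below; the real loop lives in `autopilot.py`.

```python
from collections import Counter, deque

import torch
import torch.nn.functional as F

recent_actions = deque(maxlen=3)  # sliding window for the 3-frame majority vote

def predict_smoothed(model: torch.nn.Module, frame_tensor: torch.Tensor) -> int:
    """Run one forward pass and return the majority action over the last 3 frames."""
    with torch.no_grad():
        probs = F.softmax(model(frame_tensor), dim=1)
    recent_actions.append(int(torch.argmax(probs, dim=1)))
    # Most common action in the window wins; ties go to the action seen first
    return Counter(recent_actions).most_common(1)[0][0]
```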

---

## Hardware Requirements

This model is designed for a specific hardware setup:

| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |

The PC does all the heavy lifting. The microcontrollers are just I/O — one for video, one for motors. Total hardware cost is around $25–30 USD.

---

## How to Use This Model

### Quickstart

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Prepare a 66x200 RGB image as tensor
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```

### Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full [README](https://github.com/YOUR_USERNAME/openbot-pc-server-project) for step-by-step instructions including hardware assembly, firmware upload, and data collection.

---

## Training Your Own Model

1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
2. Flash firmware to both microcontrollers
3. Start the PC server: `python app.py`
4. Drive the car manually while recording data
5. Click "Train" in the dashboard — or trigger training through the API
6. The best checkpoint saves automatically to `trained_models/autopilot.pth`

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.

---

## Limitations

- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

---

## Citation

If you use this project in your work, a mention is appreciated but not required:

```
OpenBot PC Server Project — Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
```

---

## License

MIT License — use it, modify it, ship it.