---
license: mit
language:
- en
tags:
- autonomous-driving
- self-driving-car
- robotics
- imitation-learning
- behavioral-cloning
- pilotnet
- esp8266
- esp32-cam
- pytorch
- classification
library_name: pytorch
pipeline_tag: image-classification
model-index:
- name: ActionNet
  results: []
---

# ActionNet — Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. Trained through imitation learning — a human drives the car while the system records frames and commands, then the model learns to replicate that behavior.

Part of the [OpenBot PC Server Project](https://github.com/loki-smip/openbot-pc-server-projuct-).

---

## Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data contains only a handful of unique command pairs. Cross-entropy classification suits this better than mean-squared-error regression: when similar frames carry conflicting commands, regression learns their average, which pulls the output toward zero.

**Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)
**Output:** probability distribution over 9 driving actions
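
The averaging problem with regression can be shown with a little arithmetic. The sketch below uses made-up recordings of one ambiguous scene (the sample data and the `mse` helper are illustrative assumptions, not part of the project) and shows that the constant prediction minimizing mean squared error is the per-motor mean, which here is a full stop.

```python
# Hypothetical recordings of one ambiguous scene: the driver pivoted left
# half the time and right the other half (motor values at speed=70).
commands = [(-49, 49), (49, -49), (-49, 49), (49, -49)]


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum(
        (prediction[0] - left) ** 2 + (prediction[1] - right) ** 2
        for left, right in targets
    ) / len(targets)


# The constant prediction that minimizes MSE is the per-motor mean.
mean_command = (
    sum(left for left, _ in commands) / len(commands),
    sum(right for _, right in commands) / len(commands),
)

print(mean_command)                 # (0.0, 0.0): neither turn, the car stays put
print(mse(mean_command, commands))  # error of the averaged command
print(mse((-49, 49), commands))     # error of always pivoting left (higher)
```

A classifier trained on the same frames would instead split its probability between the two pivot actions, and the argmax would still select a real driving command.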

### The 9 Actions

| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |

Motor values are shown at speed=70 and scale proportionally with the speed setting.
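
As a rough illustration of that scaling, the table can be read as fixed ratios of the speed setting (100%, 70%, and 30%). The function below is a minimal sketch under that assumption; `sketch_action_to_command` and `ACTION_RATIOS` are hypothetical names, and the project's own `action_to_command` in `model.py` may round or clamp differently.

```python
# Per-action (left, right) motor ratios in percent of the speed setting,
# read off the table above (70 -> 100 %, 49 -> 70 %, 21 -> 30 %).
ACTION_RATIOS = {
    0: (0, 0),        # STOP
    1: (100, 100),    # FORWARD
    2: (-100, -100),  # BACKWARD
    3: (-70, 70),     # TURN LEFT
    4: (70, -70),     # TURN RIGHT
    5: (30, 100),     # FORWARD+LEFT
    6: (100, 30),     # FORWARD+RIGHT
    7: (-30, -100),   # BACKWARD+LEFT
    8: (-100, -30),   # BACKWARD+RIGHT
}


def sketch_action_to_command(action, speed=70):
    """Scale the ratio pair for `action` by `speed` (illustrative only)."""
    left_pct, right_pct = ACTION_RATIOS[action]
    return round(speed * left_pct / 100), round(speed * right_pct / 100)


for index in range(9):
    print(index, sketch_action_to_command(index, speed=70))
```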

---

## Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

```
Layer                          Output Shape      Parameters
─────────────────────────────────────────────────────────────
Input                          (B, 3, 66, 200)   —

Conv2d(3→24, 5×5, stride=2)   (B, 24, 31, 98)   1,824
BatchNorm2d(24)                                   48
ELU                                               —

Conv2d(24→36, 5×5, stride=2)  (B, 36, 14, 47)   21,636
BatchNorm2d(36)                                   72
ELU                                               —

Conv2d(36→48, 5×5, stride=2)  (B, 48, 5, 22)    43,248
BatchNorm2d(48)                                   96
ELU                                               —

Conv2d(48→64, 3×3, stride=1)  (B, 64, 3, 20)    27,712
BatchNorm2d(64)                                   128
ELU                                               —

Conv2d(64→64, 3×3, stride=1)  (B, 64, 1, 18)    36,928
BatchNorm2d(64)                                   128
ELU                                               —

Dropout2d(0.15)                                   —

Flatten                        (B, 1152)          —
Dropout(0.35)                                     —
Linear(1152→64)                (B, 64)            73,792
ELU                                               —
Dropout(0.35)                                     —
Linear(64→9)                   (B, 9)             585

─────────────────────────────────────────────────────────────
Total trainable parameters:    ~206,000
Model file size:               ~1–2 MB (.pth)
```
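
The shapes and parameter counts in the table follow from standard unpadded-convolution arithmetic. The short script below recomputes them from the layer definitions alone; it is a checking aid that assumes no padding and bias terms on every layer, and it does not use or replace the project's `ActionNet` class.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


# (in_channels, out_channels, kernel, stride) for the five conv layers above
CONV_LAYERS = [(3, 24, 5, 2), (24, 36, 5, 2), (36, 48, 5, 2),
               (48, 64, 3, 1), (64, 64, 3, 1)]

height, width = 66, 200
total_params = 0
for c_in, c_out, k, s in CONV_LAYERS:
    height, width = conv_out(height, k, s), conv_out(width, k, s)
    conv_params = c_in * c_out * k * k + c_out  # weights + biases
    bn_params = 2 * c_out                       # BatchNorm scale + shift
    total_params += conv_params + bn_params
    print(f"{c_in:>2}->{c_out}: ({c_out}, {height}, {width})  "
          f"conv={conv_params:,}  bn={bn_params}")

flat_features = CONV_LAYERS[-1][1] * height * width
for n_in, n_out in [(flat_features, 64), (64, 9)]:
    total_params += n_in * n_out + n_out        # weights + biases

print("flattened features:", flat_features)
print("total trainable parameters:", total_params)
```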

### Design Decisions

- **BatchNorm after every conv layer** — stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU** — avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)** — drops entire feature maps instead of individual pixels, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout** — the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization** — all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)** — prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization
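
The label smoothing numbers in the last bullet can be reproduced by hand. This sketch uses the common formulation in which the one-hot target is mixed with a uniform distribution over all classes; `smooth_one_hot` is an illustrative helper, not a function from the project.

```python
def smooth_one_hot(true_class, num_classes=9, smoothing=0.2):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform_share = smoothing / num_classes
    target = [uniform_share] * num_classes
    target[true_class] += 1.0 - smoothing
    return target


stop_target = smooth_one_hot(true_class=0)
print([round(p, 2) for p in stop_target])
```

The true class keeps 0.8 plus its 0.2/9 share of the uniform part, and every other class receives 0.2/9, so the target still sums to 1.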

---

## Preprocessing

The full pipeline from raw camera frame to model input:

```
Raw 800×600 BGR frame from ESP32-CAM
             │
             ▼
    Crop top 40% of the image
    (removes ceiling, sky, and upper walls)
             │
             ▼
    Convert BGR → RGB
             │
             ▼
    Resize to 200×66 pixels
    (using INTER_AREA interpolation)
             │
             ▼
    ToTensor → normalize to [0, 1] float32
             │
             ▼
    Final shape: [batch, 3, 66, 200]
```

The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.

Why crop the top 40%? The camera is mounted on a low car pointing forward, so the top portion of every frame shows ceiling, walls, or sky, none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
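
The geometry of the crop and resize can be worked out without any image library. The sketch below only computes row ranges and scale factors from the sizes stated in this card; `crop_rows` is a hypothetical helper and assumes the crop boundary is truncated to a whole row, which may differ from what `crop_and_resize()` does.

```python
def crop_rows(frame_height, top_fraction=0.4):
    """Return the (start, stop) row range kept after removing the top of the frame."""
    start = int(frame_height * top_fraction)
    return start, frame_height


FRAME_W, FRAME_H = 800, 600  # ESP32-CAM frame size from this card
MODEL_W, MODEL_H = 200, 66   # network input size from this card

row_start, row_stop = crop_rows(FRAME_H)
kept_h = row_stop - row_start

print(f"kept rows {row_start}..{row_stop - 1} ({kept_h} of {FRAME_H})")
print(f"horizontal scale: {MODEL_W / FRAME_W:.3f}")
print(f"vertical scale:   {MODEL_H / kept_h:.3f}")
```

Under these assumptions the two scale factors differ, so the resize does not preserve aspect ratio: the cropped frame is compressed more vertically than horizontally. That is harmless as long as training and inference use the same function.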

---

## Training Configuration

| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay, applied separately from the gradient update) |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |
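
To make the class-balancing row concrete, here is one way inverse-frequency weights can be computed per frame. This is a sketch on a made-up label list; `inverse_frequency_weights` is a hypothetical helper, and in a PyTorch pipeline a list like this would typically be handed to `torch.utils.data.WeightedRandomSampler`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """One sampling weight per frame: 1 / (number of frames with the same label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical recording: mostly FORWARD (1), a few turns (3, 4), one STOP (0).
labels = [1, 1, 1, 1, 1, 1, 3, 3, 4, 0]
weights = inverse_frequency_weights(labels)

# Summing the weights per class shows that every class now carries the same
# total sampling mass, so rare actions are drawn as often as common ones.
mass = Counter()
for label, weight in zip(labels, weights):
    mass[label] += weight
print(dict(mass))
```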

### Data Augmentation

Applied on-the-fly during training:

| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3×3 or 5×5 |
| Random translation | 40% | Shift ±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
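
A mirror table of this kind can be written as a plain dictionary over the 9 action indices. The version below is inferred from the action table in this card, not copied from the project, and `flip_sample` works on a toy list-of-rows image purely for illustration.

```python
# Which action index a label becomes after a horizontal flip.
MIRROR_ACTION = {
    0: 0,  # STOP stays STOP
    1: 1,  # FORWARD stays FORWARD
    2: 2,  # BACKWARD stays BACKWARD
    3: 4,  # TURN LEFT       -> TURN RIGHT
    4: 3,  # TURN RIGHT      -> TURN LEFT
    5: 6,  # FORWARD+LEFT    -> FORWARD+RIGHT
    6: 5,  # FORWARD+RIGHT   -> FORWARD+LEFT
    7: 8,  # BACKWARD+LEFT   -> BACKWARD+RIGHT
    8: 7,  # BACKWARD+RIGHT  -> BACKWARD+LEFT
}


def flip_sample(image_rows, action):
    """Flip an image given as a list of pixel rows and mirror its label."""
    flipped = [list(reversed(row)) for row in image_rows]
    return flipped, MIRROR_ACTION[action]


tiny_image = [["L", "c", "R"], ["l", "c", "r"]]
flipped_image, flipped_action = flip_sample(tiny_image, action=5)
print(flipped_image, flipped_action)
```

Applying the table twice returns the original label, which is a quick sanity check for any mirror mapping.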

---

## Inference

At runtime, the autopilot module:

1. Reads the latest camera frame from the MJPEG stream
2. Runs `crop_and_resize()` → converts to tensor
3. Forward pass through ActionNet → gets 9 logits
4. Applies softmax → picks the action with highest probability
5. Uses a 3-frame majority vote to smooth out flickering predictions
6. Maps the smoothed action to (left, right) motor commands at the configured speed
7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
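
The majority vote in step 5 can be sketched with a fixed-length history of recent predictions. `MajorityVoteSmoother` is a hypothetical class, and the tie-breaking rule (fall back to the newest prediction when no action has a strict majority) is an assumption; `autopilot.py` may resolve ties differently.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Keep the last `window` predictions and return the most common one."""

    def __init__(self, window=3):
        self.history = deque(maxlen=window)

    def update(self, action):
        self.history.append(action)
        winner, votes = Counter(self.history).most_common(1)[0]
        # Assumption: without a strict majority, trust the newest prediction.
        if votes * 2 > len(self.history):
            return winner
        return action


smoother = MajorityVoteSmoother()
raw = [1, 1, 5, 1, 5, 5, 3]
smoothed = [smoother.update(a) for a in raw]
print(smoothed)  # [1, 1, 1, 1, 5, 5, 5]
```

A single stray prediction (the lone `5` at index 2, or the `3` at the end) does not change the command, at the cost of reacting one frame later to a genuine change.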

---

## Hardware Requirements

This model is designed for a specific hardware setup:

| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |

The PC does all the heavy lifting. The microcontrollers are just I/O — one for video, one for motors. Total hardware cost is around $25–30 USD.

---

## How to Use This Model

### Quickstart

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

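# Prepare the input: `your_66x200_rgb_image` is a placeholder for a 66x200 RGB
# frame (PIL image or HxWx3 uint8 array) that has already been cropped and resized.
# ToTensor converts it to a float tensor of shape (3, 66, 200) scaled to [0, 1].
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)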

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```

### Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full [README](https://github.com/YOUR_USERNAME/openbot-pc-server-project) for step-by-step instructions including hardware assembly, firmware upload, and data collection.

---

## Training Your Own Model

1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
2. Flash firmware to both microcontrollers
3. Start the PC server: `python app.py`
4. Drive the car manually while recording data
5. Click "Train" in the dashboard, or start training through the API
6. The best checkpoint saves automatically to `trained_models/autopilot.pth`

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.
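
The "best checkpoint" behavior in step 6, combined with the early-stopping patience listed in the training configuration, can be sketched as a small loop over validation accuracies. `run_with_early_stopping` is a hypothetical function with made-up accuracy values; the actual logic in `trainer.py` may differ, for example in how ties are treated.

```python
def run_with_early_stopping(val_accuracies, patience=30):
    """Return (best_epoch, best_accuracy, stop_epoch) for a list of validation accuracies."""
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            # A real trainer would save the checkpoint here.
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(val_accuracies) - 1


# Made-up accuracies; a short patience keeps the example small.
history = [0.50, 0.62, 0.71, 0.70, 0.69, 0.71, 0.68]
print(run_with_early_stopping(history, patience=3))  # (2, 0.71, 5)
```

In this sketch only a strict improvement resets the counter, so matching the previous best at epoch 5 does not extend training.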

---

## Limitations

- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

---

## Citation

If you use this project in your work, a mention is appreciated but not required:

```
OpenBot PC Server Project — Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
```

---

## License

MIT License — use it, modify it, ship it.