---
license: mit
language:
- en
tags:
- autonomous-driving
- self-driving-car
- robotics
- imitation-learning
- behavioral-cloning
- pilotnet
- esp8266
- esp32-cam
- pytorch
- classification
library_name: pytorch
pipeline_tag: image-classification
model-index:
- name: ActionNet
  results: []
---

# ActionNet: Autonomous RC Car Driving Model

A lightweight classification CNN that drives a small RC car by predicting discrete motor actions from raw camera frames. It is trained through imitation learning: a human drives the car while the system records frames and commands, and the model then learns to replicate that behavior.

Part of the [OpenBot PC Server Project](https://github.com/loki-smip/openbot-pc-server-projuct-).

---

## Model Description

ActionNet classifies a single 66×200 RGB camera image into one of 9 discrete driving actions. It replaces the traditional regression approach (predicting continuous steering angles) because keyboard-driven training data only contains a handful of unique command pairs. Classification with cross-entropy loss handles this much better than mean-squared-error regression, which tends to average everything toward zero.

**Input:** 66×200×3 RGB image (cropped from 800×600, top 40% removed)

**Output:** probability distribution over 9 driving actions

### The 9 Actions

| Index | Action | Left Motor | Right Motor | Description |
|---|---|---|---|---|
| 0 | STOP | 0 | 0 | Both motors off |
| 1 | FORWARD | +70 | +70 | Straight ahead |
| 2 | BACKWARD | -70 | -70 | Straight reverse |
| 3 | TURN LEFT | -49 | +49 | Pivot left (in place) |
| 4 | TURN RIGHT | +49 | -49 | Pivot right (in place) |
| 5 | FORWARD+LEFT | +21 | +70 | Arc forward-left |
| 6 | FORWARD+RIGHT | +70 | +21 | Arc forward-right |
| 7 | BACKWARD+LEFT | -21 | -70 | Arc backward-left |
| 8 | BACKWARD+RIGHT | -70 | -21 | Arc backward-right |

Motor values are shown at speed=70 and scale proportionally with the speed setting.
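
The table can also be read as a lookup of percentages of the speed setting: 100% for straight motion, 70% for pivots, and 30% for the inner wheel of an arc. The sketch below reproduces the table at speed=70 under that assumption. It is not the project's `action_to_command`; the names `ACTION_SCALE` and `command_for` and the rounding choice are hypothetical.

```python
# Hypothetical lookup: (left %, right %) of the speed setting per action index.
ACTION_SCALE = {
    0: (0, 0),        # STOP
    1: (100, 100),    # FORWARD
    2: (-100, -100),  # BACKWARD
    3: (-70, 70),     # TURN LEFT (pivot)
    4: (70, -70),     # TURN RIGHT (pivot)
    5: (30, 100),     # FORWARD+LEFT
    6: (100, 30),     # FORWARD+RIGHT
    7: (-30, -100),   # BACKWARD+LEFT
    8: (-100, -30),   # BACKWARD+RIGHT
}


def command_for(action: int, speed: int = 70) -> tuple[int, int]:
    """Return (left, right) motor values for an action index at a given speed."""
    left_pct, right_pct = ACTION_SCALE[action]
    return round(speed * left_pct / 100), round(speed * right_pct / 100)


print(command_for(3))      # (-49, 49)
print(command_for(5, 70))  # (21, 70)
```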

---

## Architecture

The convolutional backbone is based on NVIDIA's PilotNet (from the "End to End Learning for Self-Driving Cars" paper), modified with batch normalization, ELU activations, and a classification head.

```
Layer                          Output Shape      Parameters
─────────────────────────────────────────────────────────────
Input                          (B, 3, 66, 200)   -

Conv2d(3→24, 5×5, stride=2)    (B, 24, 31, 98)   1,824
BatchNorm2d(24)                                  48
ELU                                              -

Conv2d(24→36, 5×5, stride=2)   (B, 36, 14, 47)   21,636
BatchNorm2d(36)                                  72
ELU                                              -

Conv2d(36→48, 5×5, stride=2)   (B, 48, 5, 22)    43,248
BatchNorm2d(48)                                  96
ELU                                              -

Conv2d(48→64, 3×3, stride=1)   (B, 64, 3, 20)    27,712
BatchNorm2d(64)                                  128
ELU                                              -

Conv2d(64→64, 3×3, stride=1)   (B, 64, 1, 18)    36,928
BatchNorm2d(64)                                  128
ELU                                              -

Dropout2d(0.15)                                  -

Flatten                        (B, 1152)         -
Dropout(0.35)                                    -
Linear(1152→64)                (B, 64)           73,792
ELU                                              -
Dropout(0.35)                                    -
Linear(64→9)                   (B, 9)            585

─────────────────────────────────────────────────────────────
Total trainable parameters:    ~206,000 (sum of the rows above)
Model file size:               ~1–2 MB (.pth)
```
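
The output shapes and parameter counts in the table follow from standard convolution arithmetic. The sketch below recomputes them in plain Python, assuming unpadded convolutions with bias terms and two learnable parameters per BatchNorm channel, which is what the per-layer numbers imply. The helper name `conv_out` is illustrative.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


# (in_channels, out_channels, kernel, stride) for the five conv layers.
CONVS = [(3, 24, 5, 2), (24, 36, 5, 2), (36, 48, 5, 2), (48, 64, 3, 1), (64, 64, 3, 1)]

h, w, total = 66, 200, 0
for c_in, c_out, k, s in CONVS:
    h, w = conv_out(h, k, s), conv_out(w, k, s)
    conv_params = c_in * c_out * k * k + c_out  # weights + biases
    bn_params = 2 * c_out                       # BatchNorm scale + shift
    total += conv_params + bn_params
    print(f"Conv {c_in}->{c_out}: output ({c_out}, {h}, {w}), params {conv_params:,}")

flat = 64 * h * w        # 64 channels * 1 * 18 = 1152 features
total += flat * 64 + 64  # Linear(1152 -> 64)
total += 64 * 9 + 9      # Linear(64 -> 9)
print(f"Flattened features: {flat}, total parameters: {total:,}")
```

Summing the per-layer rows this way gives 206,197 learnable parameters.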

### Design Decisions

- **BatchNorm after every conv layer**: stabilizes training and allows higher learning rates without divergence
- **ELU instead of ReLU**: avoids dead neurons and produces smoother gradients, which matters when the model is small
- **Spatial Dropout2d (15%)**: drops entire feature maps instead of individual activations, forcing the network to spread information across channels
- **Two-layer classification head with 35% dropout**: the bottleneck at 64 units forces compression and fights overfitting on small datasets
- **Kaiming initialization**: all conv and linear layers use He initialization (fan-out mode), which pairs well with ELU activations
- **Label smoothing (0.2)**: prevents the model from becoming overconfident on exact training labels. A STOP frame labeled as [1.0, 0.0, 0.0, ...] becomes [0.82, 0.02, 0.02, ...], which improves generalization
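
The smoothed STOP target quoted in the last bullet can be checked by hand. This minimal sketch mixes a one-hot vector with a uniform distribution over all 9 classes, which is how PyTorch's `CrossEntropyLoss(label_smoothing=...)` defines smoothing. The function name `smooth_one_hot` is illustrative.

```python
def smooth_one_hot(target: int, num_classes: int = 9, smoothing: float = 0.2) -> list[float]:
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing) + uniform if i == target else uniform
        for i in range(num_classes)
    ]


stop_target = smooth_one_hot(0)
print([round(p, 2) for p in stop_target])  # [0.82, 0.02, 0.02, ...]
```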

---

## Preprocessing

The full pipeline from raw camera frame to model input:

```
Raw 800×600 BGR frame from ESP32-CAM
             │
             ▼
    Crop top 40% of the image
    (removes ceiling, sky, and upper walls)
             │
             ▼
    Convert BGR → RGB
             │
             ▼
    Resize to 200×66 pixels
    (using INTER_AREA interpolation)
             │
             ▼
    ToTensor → normalize to [0, 1] float32
             │
             ▼
    Final shape: [batch, 3, 66, 200]
```

The `crop_and_resize()` function in `trainer.py` performs this transformation. The exact same function is called during both training and inference (in `autopilot.py`) to guarantee consistency.

Why crop the top 40%? The camera is mounted low on the car and points forward, so the top portion of every frame shows ceiling, walls, or sky, none of which help the model decide where to steer. Removing it reduces noise and lets the model focus on the ground, obstacles, and path ahead.
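
The crop geometry can be worked out with a few lines of arithmetic. This is a sketch of the numbers only, not the project's `crop_and_resize()`; the helper name `crop_geometry` and its defaults are assumptions taken from the frame size and crop ratio stated above.

```python
def crop_geometry(height: int = 600, width: int = 800, crop_top: float = 0.40):
    """Return the first kept row and the (height, width) of the cropped frame."""
    first_row = int(height * crop_top)
    return first_row, (height - first_row, width)


first_row, (crop_h, crop_w) = crop_geometry()
print(first_row, crop_h, crop_w)  # 240 360 800
```

With OpenCV this would correspond to slicing `frame[first_row:, :]` before calling `cv2.resize(cropped, (200, 66), interpolation=cv2.INTER_AREA)`. Note that `cv2.resize` takes its target size as (width, height), while the model input is described as height 66 by width 200.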

---

## Training Configuration

| Parameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | weight_decay=5e-3 (decoupled weight decay) |
| Learning Rate | 0.001 | Peak rate, with OneCycleLR schedule |
| LR Schedule | OneCycleLR | 10% warmup, cosine anneal, div_factor=10 |
| Loss Function | CrossEntropyLoss | label_smoothing=0.2 |
| Batch Size | 32 | Fits comfortably in CPU memory |
| Gradient Clipping | max_norm=1.0 | Prevents gradient explosions |
| Early Stopping | 30 epochs patience | Monitored by validation accuracy |
| Class Balancing | WeightedRandomSampler | Inverse-frequency weights per class |
| Train/Val Split | 80% / 20% | Random split |
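
Inverse-frequency class balancing can be sketched without any framework. The example below assumes one weight per sample equal to one over the count of its class, so every class carries the same total sampling weight. The function name `sample_weights` and the toy labels are illustrative.

```python
from collections import Counter


def sample_weights(labels: list[int]) -> list[float]:
    """Give each sample a weight inversely proportional to its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [1, 1, 1, 1, 1, 1, 5, 5, 0]  # toy dataset: mostly FORWARD
weights = sample_weights(labels)
print(weights[0], weights[6], weights[8])  # per-sample weight for classes 1, 5, 0
```

A list like `weights` is the kind of input `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` expects.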

### Data Augmentation

Applied on-the-fly during training:

| Augmentation | Probability | Details |
|---|---|---|
| Horizontal flip | 50% | Action labels are mirrored (LEFT↔RIGHT) |
| Random shadow | 50% | Vertical band at random brightness (30–70%) |
| Random brightness | 50% | HSV V-channel scaled 0.6–1.4× |
| Gaussian blur | 30% | Kernel 3×3 or 5×5 |
| Random translation | 40% | Shift ±10% in X and Y |
| Random erasing | 50% | Rectangular cutout on tensor |

The horizontal flip augmentation automatically swaps left/right action labels using a predefined mirror table, so the model never sees contradictory labels.
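
A mirror table for the 9 actions might look like the sketch below. The mapping is inferred from the action list (straight actions map to themselves, each left action swaps with its right counterpart); the names `MIRROR` and `flip_sample` are hypothetical, and a nested list stands in for the image.

```python
# Assumed mirror table: straight actions map to themselves, left/right swap.
MIRROR = {0: 0, 1: 1, 2: 2, 3: 4, 4: 3, 5: 6, 6: 5, 7: 8, 8: 7}


def flip_sample(image_rows: list[list], action: int) -> tuple[list[list], int]:
    """Reverse each pixel row (horizontal flip) and mirror the action label."""
    flipped = [row[::-1] for row in image_rows]
    return flipped, MIRROR[action]


tiny_image = [[1, 2, 3], [4, 5, 6]]  # stand-in for an H x W image
flipped, label = flip_sample(tiny_image, 5)  # 5 = FORWARD+LEFT
print(flipped, label)  # [[3, 2, 1], [6, 5, 4]] 6
```

Applying the table twice returns the original label, which is a quick sanity check for any such mapping.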

---

## Inference

At runtime, the autopilot module:

1. Reads the latest camera frame from the MJPEG stream
2. Runs `crop_and_resize()` → converts to tensor
3. Forward pass through ActionNet → gets 9 logits
4. Applies softmax → picks the action with highest probability
5. Uses a 3-frame majority vote to smooth out flickering predictions
6. Maps the smoothed action to (left, right) motor commands at the configured speed
7. Sends the command to the ESP8266 over WebSocket

The inference loop runs at 10 FPS on a typical laptop CPU. No GPU required.
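
The 3-frame majority vote in step 5 could be implemented along these lines. This is a minimal sketch, not the code in `autopilot.py`: the class name `ActionSmoother` is hypothetical, and breaking ties in favor of the most recent prediction is an assumption.

```python
from collections import Counter, deque


class ActionSmoother:
    """Majority vote over the last `window` predicted action indices."""

    def __init__(self, window: int = 3):
        self.history = deque(maxlen=window)

    def update(self, action: int) -> int:
        self.history.append(action)
        counts = Counter(self.history)
        best = max(counts.values())
        # On a tie, prefer the most recent prediction among the tied actions.
        for recent in reversed(self.history):
            if counts[recent] == best:
                return recent
        return action  # unreachable: history is never empty here


smoother = ActionSmoother()
outputs = [smoother.update(a) for a in [1, 1, 5, 1, 5, 5]]
print(outputs)  # [1, 1, 1, 1, 5, 5]
```

A single stray FORWARD+LEFT prediction is suppressed, while a sustained change takes effect after two agreeing frames.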

---

## Hardware Requirements

This model is designed for a specific hardware setup:

| Component | Role |
|---|---|
| ESP32-CAM (OV2640) | Streams 800×600 MJPEG video over HTTP |
| ESP8266 (NodeMCU) | Receives motor commands over WebSocket, drives L298N |
| L298N Motor Driver | Controls 2 DC gear motors (differential drive) |
| SG90 Servo (optional) | Camera pan |
| PC (any laptop/desktop) | Runs the server, training, and inference |

The PC does all the heavy lifting. The microcontrollers are just I/O: one for video, one for motors. Total hardware cost is around $25–30 USD.

---

## How to Use This Model

### Quickstart

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from model import ActionNet, action_to_command

# Load
device = torch.device("cpu")
model = ActionNet().to(device)
checkpoint = torch.load("trained_models/autopilot.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

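# Prepare the input. `your_66x200_rgb_image` is a placeholder for an RGB frame
# already cropped and resized to 66x200 (H x W x 3 uint8 array or PIL image).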
transform = transforms.ToTensor()
img_tensor = transform(your_66x200_rgb_image).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(img_tensor)
    probs = F.softmax(logits, dim=1)
    action = torch.argmax(probs, dim=1).item()
    confidence = probs[0, action].item()

# Convert to motor command
left, right = action_to_command(action, speed=70)
print(f"Action: {action}, Motors: L={left} R={right}, Confidence: {confidence:.1%}")
```

### Within the Full System

The model is used automatically by the autopilot module. Start the server, record some training data through the dashboard, train from the dashboard, then click "Start Autopilot."

See the full [README](https://github.com/YOUR_USERNAME/openbot-pc-server-project) for step-by-step instructions including hardware assembly, firmware upload, and data collection.

---

## Training Your Own Model

1. Assemble the hardware (ESP8266 + ESP32-CAM + motors)
2. Flash firmware to both microcontrollers
3. Start the PC server: `python app.py`
4. Drive the car manually while recording data
5. Click "Train" in the dashboard, or start training through the API
6. The best checkpoint saves automatically to `trained_models/autopilot.pth`

Training runs on CPU. A dataset of 3,000 frames trains in under 5 minutes on a modern laptop. GPU is supported if available but not required.
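
Saving the best checkpoint and stopping after a fixed patience, as listed in the training configuration, can be tracked with a small helper. This is a sketch under the assumption that "best" means highest validation accuracy; the class name `EarlyStopping` and its return convention are illustrative, not taken from `trainer.py`.

```python
class EarlyStopping:
    """Track the best validation accuracy and signal when patience runs out."""

    def __init__(self, patience: int = 30):
        self.patience = patience
        self.best_acc = float("-inf")
        self.epochs_without_gain = 0

    def step(self, val_acc: float) -> tuple[bool, bool]:
        """Return (is_best, should_stop) for this epoch's validation accuracy."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.epochs_without_gain = 0
            return True, False
        self.epochs_without_gain += 1
        return False, self.epochs_without_gain >= self.patience


stopper = EarlyStopping(patience=2)  # short patience just for the demo
history = [stopper.step(acc) for acc in [0.60, 0.72, 0.70, 0.71]]
print(history)  # [(True, False), (True, False), (False, False), (False, True)]
```

In a training loop, `is_best` would trigger writing the checkpoint and `should_stop` would break out of the epoch loop.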

---

## Limitations

- The model only knows what it has seen. If you train it in one room, it won't generalize to a different room without additional data.
- Keyboard inputs produce jerky, discrete commands. A joystick or gamepad would produce smoother training data.
- The 40% top-crop assumes the camera is mounted pointing roughly forward and slightly down. If your camera angle is very different, adjust the crop ratio in `trainer.py`.
- Performance depends heavily on lighting conditions matching between training and inference.
- The model has no notion of obstacles, goals, or maps. It purely replicates the visual patterns it was trained on.

---

## Citation

If you use this project in your work, a mention is appreciated but not required:

```
OpenBot PC Server Project: Autonomous RC Car with Imitation Learning
https://github.com/YOUR_USERNAME/openbot-pc-server-project
```

---

## License

MIT License: use it, modify it, ship it.