---
license: apache-2.0
tags:
  - image-classification
  - mobilenet-v3
  - video-games
  - real-time
pipeline_tag: image-classification
library_name: pytorch
---

# Game classifier — MobileNetV3-Small (CS2 / Dota 2 / Valorant)

3-class image classifier used as the §1.1 hot-path detector in
[coach-api](https://github.com/bashiryounis/gamed), a real-time AI coaching
service for competitive PC games. Decides which game is on screen so the
service routes the frame to the correct per-game event-extraction pipeline.

## Classes

`cs2`, `dota2`, `valorant` — in that order along the logits axis.

## Architecture

- `torchvision.models.mobilenet_v3_small(weights=None)`
- `classifier[3]` replaced with `nn.Linear(in_features, 3)`
- Input: 224×224 RGB, ImageNet mean/std normalization
- Output: 3 logits (one per class); apply softmax to get class probabilities

## Files

| File | Purpose |
|---|---|
| `game_classifier_mobilenet_v3_small.pt` | Checkpoint. Dict with `state_dict`, `imagenet_mean`, `imagenet_std`. |
| `detector_comparison.json` | Custom-CNN vs MobileNetV3 head-to-head: per-class P/R/F1, latency p50/p95/p99, training time, size. |
| `confusion_matrices.png` | Side-by-side confusion matrices for both candidate detectors on the training split. |

## Loading

```python
import torch
from torch import nn
from torchvision import models

CLASSES = ("cs2", "dota2", "valorant")
ckpt = torch.load("game_classifier_mobilenet_v3_small.pt", map_location="cpu", weights_only=True)

model = models.mobilenet_v3_small(weights=None)
model.classifier[3] = nn.Linear(model.classifier[3].in_features, len(CLASSES))
model.load_state_dict(ckpt["state_dict"])
model.eval()
```

Preprocessing: BGR → RGB → resize to 224×224 → `transforms.ToTensor()` →
`transforms.Normalize(ckpt["imagenet_mean"], ckpt["imagenet_std"])`. See
[`api/src/coach_api/services/detectors/game_classifier.py`](https://github.com/bashiryounis/gamed/blob/main/api/src/coach_api/services/detectors/game_classifier.py)
for the exact production preprocessing.

## Training

- Data: 6450 frames extracted at 0.4 fps from 10 YouTube gameplay videos
  (3 cs2 / 3 dota2 / 4 valorant), using the same `dataset/manifest.csv`
  schema that is published with the dataset repo.
- Split: frame-level `random_split(0.70 / 0.15 / 0.15)`. **Same videos
  appear in train and val** — useful for hyperparameter search, *not* a
  fair generalization estimate. See "Honest eval" below.
- Loss: cross-entropy
- Optimizer / schedule: see the training notebook
  [`notebook/Game_Classifier_…ipynb`](https://github.com/bashiryounis/gamed/tree/main/notebook).
- Random seed: 42
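The leakage noted in the split above comes from assigning individual frames, not whole videos, to each side. One common alternative is to split by video so that every frame of a given video lands on the same side. The sketch below is illustrative only and is not how this model was trained: `split_by_video` is a hypothetical helper, and the `video_id` and `label` keys are assumed column names, since the actual manifest schema is not shown here.

```python
import random
from collections import defaultdict


def split_by_video(rows, holdout_per_class=1, seed=42):
    """Split manifest rows so that no video_id appears on both sides.

    `rows` is an iterable of dicts with the assumed keys "video_id" and
    "label". For each label, `holdout_per_class` videos go to validation.
    """
    rows = list(rows)
    videos_by_label = defaultdict(set)
    for row in rows:
        videos_by_label[row["label"]].add(row["video_id"])

    rng = random.Random(seed)
    val_videos = set()
    for label in sorted(videos_by_label):
        candidates = sorted(videos_by_label[label])
        if len(candidates) <= holdout_per_class:
            raise ValueError(f"not enough videos to hold out for {label!r}")
        val_videos.update(rng.sample(candidates, holdout_per_class))

    train = [r for r in rows if r["video_id"] not in val_videos]
    val = [r for r in rows if r["video_id"] in val_videos]
    return train, val
```

With only 3 to 4 videos per class, holding out one video per class leaves a small validation set, so the resulting numbers would be noisy even though they are free of leakage.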

### Training-split metrics (`detector_comparison.json`)

| Model | Accuracy | Size | Mean latency (CPU) |
|---|---|---|---|
| Custom CNN | 98.76 % | 1.5 MB | 6.3 ms |
| **MobileNetV3-Small (winner)** | **99.28 %** | 5.8 MB | 10.2 ms |

MobileNetV3-Small was selected for its better F1 on the harder `valorant`
class and its headroom against domain shift, despite being larger and slower.

## Honest eval — held-out videos (this is the number to cite)

The training split is video-leaky. To get a fair estimate we extracted
frames from **3 fresh YouTube videos** (one per class) with zero
`video_id` overlap with training:

| Set | Frames | Accuracy | Errors |
|---|---|---|---|
| 150 / class evenly-spaced subsample | 450 | **100 %** | 0 |
| Full extraction | 1118 | **99.91 %** | 1 |

The single error is a Valorant **round-end "WON" overlay** misread as
CS2 with confidence 0.89. Valorant's orange/red post-round Combat
Report panel visually mimics CS2's MVP card. Spectator / round-end
frames are under-represented in training.

Eval tool: [`cli/gamed_classification_eval.py`](https://github.com/bashiryounis/gamed/blob/main/cli/gamed_classification_eval.py).
Eval set: companion dataset
[`ybashir/gamed-game-classification-dataset`](https://huggingface.co/datasets/ybashir/gamed-game-classification-dataset).

## Latency

Reported by `detector_comparison.json` on the training split:

| Percentile | Custom CNN | MobileNetV3-Small |
|---|---|---|
| p50 | 5.96 ms | 9.73 ms |
| p95 | 9.11 ms | 13.67 ms |
| p99 | 10.41 ms | 16.88 ms |

The held-out eval CLI measures **7.83 ms mean** end-to-end (cv2 imread +
preprocessing + forward pass) on CPU.

In production (`coach-api` service, CPU container) the per-frame
`t_classify_us` stamp averages ~21 ms — the difference is service
overhead (frame ingest, payload decode, logging).

All of these figures are well inside the brief's 30 ms p95 budget.
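For readers who want to reproduce this kind of measurement on their own hardware, a minimal timing harness might look like the sketch below. It is an assumption about method, not a description of how `detector_comparison.json` or the eval CLI produced the figures above; `measure_latency_ms` is a hypothetical helper, and `fn` stands for whatever callable wraps preprocessing plus the forward pass.

```python
import statistics
import time


def measure_latency_ms(fn, n_runs=200, warmup=20):
    """Time `fn()` repeatedly and summarize the per-call latency in ms."""
    for _ in range(warmup):  # untimed calls to warm caches and lazy init
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Tail percentiles such as p99 are only meaningful with enough samples; with a few hundred runs they mostly reflect the handful of slowest calls.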

## Intended use

Real-time game detection for a coaching service that needs to route
frames to per-game event extractors. Designed for CPU inference at
frame rates from 1–4 fps (VLM path) up to 30 fps (cheap CV path).

## Limitations

- Trained on YouTube "no-commentary" gameplay videos at 1080p. Frames
  with heavy streamer overlays, ultrawide aspect ratios, or HDR have
  not been evaluated.
- Three classes only. Adding a fourth (e.g. LoL, Apex) requires
  retraining the head.
- Can confidently misclassify round-end / spectator / replay frames (one
  such error in the held-out eval; see "Honest eval" above). Mitigation
  in production: the pipeline gates downstream detectors on confidence
  and frame-class persistence, so a single misread doesn't propagate.
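
The confidence and persistence gating mentioned in the last limitation could be sketched roughly as below. This is a hypothetical illustration, not the production gate: the `GameGate` class, the 0.8 confidence threshold, and the 3-frame window are all assumed values chosen for the example.

```python
from collections import deque


class GameGate:
    """Emit a game label only when recent confident predictions agree."""

    def __init__(self, min_confidence=0.8, persistence=3):
        self.min_confidence = min_confidence
        self.persistence = persistence
        self._recent = deque(maxlen=persistence)
        self.current = None  # last label that passed the gate

    def update(self, label, confidence):
        """Feed one per-frame prediction; return the gated label (or None)."""
        # Low-confidence frames count as "no vote" and break any streak.
        vote = label if confidence >= self.min_confidence else None
        self._recent.append(vote)
        window_full = len(self._recent) == self.persistence
        first = self._recent[0]
        if window_full and first is not None and all(v == first for v in self._recent):
            self.current = first
        return self.current
```

Under these assumed settings, a single `("cs2", 0.89)` frame in the middle of a Valorant session passes the confidence check but not the persistence check, so the gated label stays `valorant`.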

## Citation / source

Repo: <https://github.com/bashiryounis/gamed>